Large-scale Data Analytics with Apache Hadoop Service

Fast turnover of experiments on large datasets made easy

Hadoop is a framework that makes fast data-parallel analysis of very large datasets possible by applying the concept of data locality. The volume of scientific data collected using modern technologies poses a challenge that can only be addressed by large-scale computing. Hadoop provides interfaces that enable very fast experimentation on such data, without researchers having to deal with the specifics of parallelism. That, in combination with the commodity hardware that the service runs on, makes Hadoop one of the most efficient data-parallel computing platforms currently available.

Apache Hadoop

Hadoop is a software framework developed to enable easy, fast and efficient data-parallel batch processing of very large datasets. It consists of a distributed file system called HDFS and a parallel processing interface called MapReduce. A fundamental feature of Hadoop is that the data is processed on the same machines where it is stored, a property called data locality. Because little data has to travel over the network, throughput is extremely high and the need for investments in expensive networking hardware is minimized. Due to its speed and economic efficiency, Hadoop has gained much popularity in both academia and industry; there is no other large-scale processing framework that is as widely adopted as Hadoop.
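
To give an impression of the MapReduce model, the sketch below counts words in a single Python process. It is an illustration only: it does not use Hadoop or any of its APIs, and the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example. On a real cluster the map and reduce steps would run in parallel on the machines that hold the data, and the framework would take care of the grouping step in between.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key (done by the framework on a real cluster)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce one after the other, in one process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters"]))
```

The researcher only writes the map and reduce steps; how the input is split and where each step runs is left to the framework, which is what keeps the specifics of parallelism out of sight.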

The BiG Grid Apache Hadoop Service

After a year of testing on a prototype facility, SURFsara has offered a Hadoop facility to scientists in the Netherlands since mid-2012. The prototype facility has successfully assisted scientists in the domains of Information Retrieval, Natural Language Processing, Computer Science and Bioinformatics in their research. The facility complements the Dutch national computing infrastructure as a system for data-parallel, high-throughput capacity computing. Due to its simplicity, we expect Hadoop to be an important enabler of large-scale computing and e-Science for researchers who are not traditionally involved in these fields.

e-Science and Fast Experimentation

To enable fast experimentation, Hadoop offers a number of interfaces. Its native Application Programming Interfaces make it easy for programmers to create, for example, portals that allow researchers to run a selected algorithm over a selected dataset without ever having to interact with a command-line interface. Other, more generic interfaces are built on top of Hadoop. With Apache Pig, data analytics can be performed using a syntax much like SQL, but over terabytes of data. Apache HBase enables real-time, random access to very large datasets. And the soon-to-be-released Apache HCatalog will provide an abstraction on top of HDFS, enabling easy data curation, sharing and discovery.
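
The SQL-like style of analysis mentioned above usually comes down to grouping records and computing an aggregate per group. The sketch below shows that pattern in plain Python on a handful of records. It is not Pig syntax and uses no Hadoop API; the function name average_by_key and the station and temperature values are invented for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: a (key, measurement) pair, invented for this example.
Record = Tuple[str, float]


def average_by_key(records: Iterable[Record]) -> Dict[str, float]:
    """Group records by key, then compute the average measurement per group."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return {key: mean(values) for key, values in groups.items()}


if __name__ == "__main__":
    sample = [("station-a", 1.0), ("station-a", 3.0), ("station-b", 5.0)]
    print(average_by_key(sample))
```

A higher-level interface lets a researcher state this kind of group-and-aggregate query in a few lines and leaves it to the platform to run it over the full dataset.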

More information

For more information, please contact helpdesk@surfsara.nl. Whether you want to use our Hadoop service or want more information on Hadoop itself, we are happy to hear from you!

Best cases

The Hadoop service is already used in a number of best cases: