SARA starts Apache Hadoop Proof-of-Concept


Apache Hadoop has found its way to Scientific Computing in the Netherlands. This technology is already commonly used in high-profile IT companies as a tool for storage and analysis of extremely large datasets. SARA has initiated a Proof-of-Concept project to evaluate the Hadoop software stack for scientific use. We expect Hadoop to be particularly beneficial for disciplines that work with large-scale datasets, such as Natural Language Processing, Bioinformatics, Social Sciences and Humanities.

A select group of scientists has been invited to participate in the project and use Hadoop to store and analyze their datasets. The kick-off of the Proof-of-Concept Hadoop service will take place on December 7th with a Hackathon: a day-long event where people can experiment with Hadoop, with hands-on support from experienced users.

Apache Hadoop

Apache Hadoop is an open-source software stack that was developed in response to two papers published by Google (The Google File System and MapReduce, in 2003 and 2004 respectively), which are based on Google's experiences in storing and analyzing data. Now, some six years later, Hadoop has been adopted and is being developed by Internet giants like Yahoo!, Facebook, eBay, Twitter, and many more, as the solution for handling Internet-scale datasets. In terms of High Performance Computing, a Hadoop cluster is a highly parallel, high-throughput computing system, enabled through its MapReduce component. In terms of Mass Storage, it is a distributed file system, enabled through its Hadoop Distributed File System (HDFS) component.
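
For readers new to the MapReduce model, the idea can be illustrated with a word count. The sketch below runs in a single Python process and does not use Hadoop or its API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for this illustration. On a real cluster, the framework would run many map and reduce tasks in parallel over data stored in HDFS.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Combine all values collected for one key into a single result."""
    return word, sum(counts)


def word_count(documents):
    """Run the map, shuffle and reduce steps in sequence over a list of strings."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    # Illustrative input only; not one of the datasets mentioned in this article.
    print(word_count(["Hadoop stores data", "Hadoop analyzes data"]))
```

Because each map call looks at one record and each reduce call at one key, the work can in principle be spread over many machines, which is what makes the model attractive for very large datasets.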


To evaluate Hadoop's potential for scientific computing and data storage, SARA is setting up a prototype Hadoop service. This service will be made available to a select group of users, for a limited period, to perform scientific data analysis over large amounts of data. Alongside this user evaluation, SARA will perform a number of experiments on HDFS to determine the extent to which it can be used as a more generic data store.


On December 7, SARA is organizing a day-long hackathon to kick off the Proof-of-Concept Hadoop service and give participants the opportunity to experiment with Hadoop with the support of experienced users. Those who are interested can work with Hadoop on a case of their choice, or simply explore datasets such as Wikipedia, the Enron dataset, White House visitor records, genome data or others.

More Information

For more information, please contact Evert Lammerts at