The Hadoop Ecosystem.

Big Data is more than just the Apache Hadoop kernel; it is a collection of related projects. Some of them are mandatory (you cannot do Big Data without them) and others are optional (like data loaders and flow processing languages).

The mandatory Hadoop projects are HDFS, MapReduce and Common. Common is a collection of components and interfaces for distributed processing and I/O. In addition to the file system (HDFS) and the processing model (MapReduce), Common supplies the components required for distributed computing (autonomous computing entities with their own local memory that communicate through message passing).
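
To make this concrete, here is a minimal sketch of writing and reading a file in HDFS through the Java FileSystem API (the Configuration and FileSystem classes come from Common); the NameNode address and file path are placeholders for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address -- point this at your own NameNode
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/hello.txt");

        // Write a small file; HDFS splits it into blocks and replicates
        // those blocks across DataNodes behind the scenes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello from the Hadoop ecosystem");
        }

        // Read it back.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
    }
}
```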

There are a number of projects in the ecosystem that complement the mandatory projects. Below are some of the higher-level projects:

  • ZooKeeper

A distributed coordination service. HDFS stores its data across worker nodes, the “DataNodes”, and a master “NameNode” knows where each piece of data resides on the file system. The NameNode, even if it is duplicated in a Secondary NameNode, is still a single point of failure, and promoting a Secondary NameNode to be the primary can take hours on really large clusters.

The way this is solved is to have a pair of NameNodes in an active-standby configuration. To make this feasible, both NameNodes must be able to share the edit log (a file that records every change made to the file system so that, if needed, it can be replayed to bring the system back into sync).

To make HDFS Highly Available (HA), each DataNode sends block reports to both NameNodes (block mappings are stored in the NameNode’s memory, not on disk).

In this way the standby NameNode already has the namespace loaded in memory and an up-to-date block mapping, so it can take over quickly.

ZooKeeper’s part in making Hadoop HA is coordination: it detects when the active NameNode fails and elects which NameNode becomes the active one, while the shared edit log itself lives on storage that both NameNodes can reach (typically a Quorum Journal Manager or an NFS filer).
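
To give a flavour of how processes coordinate through ZooKeeper, here is a minimal sketch using the standard ZooKeeper Java client; the ensemble address and znode path are made up for illustration. A process registers an ephemeral znode, and because ephemeral znodes vanish when the owning session dies, everyone else finds out promptly that the holder has failed, which is the building block automatic failover is based on.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkEphemeralDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder ensemble address
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000,
                event -> System.out.println("ZooKeeper event: " + event));

        // An ephemeral znode lives only as long as this client's session.
        // If this process dies, the znode disappears and any watchers are
        // notified, so a standby can step in.
        zk.create("/active-demo", "node-1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        System.out.println("Registered /active-demo; holding the session open...");
        Thread.sleep(10_000);
        zk.close();
    }
}
```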

  • Hive

A Data Warehouse project. Data Warehouse systems deal predominantly with relational data, but more and more unstructured data is being stored in non-relational file systems like HDFS, and organizations with large Data Warehouse operations (and the Business Analysts who run them) want to keep using what they know on this new, richer data. Hive implements relational table abstractions over your data (whether the underlying files are relational or not) and provides an SQL dialect to query them. The dialect is not 100% SQL, but as far as your BAs are concerned it is near enough for them to carry on doing what they know best.
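
As a rough sketch of what this looks like in practice, Hive ships a JDBC driver for HiveServer2, so querying files in HDFS reads just like querying a relational database. The host, credentials and the web_logs table below are invented for illustration:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Placeholder host and database; "web_logs" is a hypothetical table
        // whose underlying data is just files sitting in HDFS.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "analyst", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```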

  • HBase

Hadoop uses HDFS as its file system, but a raw file system is not the only way to store and serve data on a cluster. HBase is an open-source, column-oriented, key-value NoSQL database built on top of HDFS (so it is distributed); it supports MapReduce and gives good response times for point queries, although it does not implement an SQL dialect. Column-oriented databases have advantages over row-oriented databases for certain workloads and can offer additional benefits in data compression because of their architecture. If you can design a Big Data architecture from scratch, look at what type of solution it is going to be (i.e. OLAP, OLTP etc.) and how you are going to be using it.
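
Here is a minimal sketch of the kind of point access HBase is good at, using the standard HBase Java client; the users table and its profile column family are made up for illustration:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePointQueryDemo {
    public static void main(String[] args) throws Exception {
        // Assumes a hypothetical "users" table with a "profile" column family
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key, column family, qualifier, value
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("email"),
                    Bytes.toBytes("someone@example.com"));
            table.put(put);

            // Point query by row key: the access pattern HBase answers quickly
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("email"))));
        }
    }
}
```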

There are also projects in their own right that deal with utility or specialized functions, such as:

  • Flow Processing Languages (Pig)
  • Cross Language Serialization (Avro)
  • Extract Transform Load (Sqoop)
  • Scheduler (Oozie)

This is not a definitive list of projects; there are many more, right down to JDBC Drivers, Database Connection Poolers and Logging Packages. This post gives you a summary of the mandatory ecosystem projects, an introduction to the main optional projects you are likely to come across, and an idea of the utility projects available. Other posts on specific Big Data implementations (like Data Warehousing) will cover their own list of projects at an appropriate level of depth.