HDFS, The Hadoop Distributed File System.

When you Map a problem you break it down into key-value pairs and spread it out across a number of processors: abstract it to its simplest components and give them to lots of workers.
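
To make "break it down into key-value pairs" concrete, here is a minimal sketch of the classic word-count mapper written against Hadoop's Java MapReduce API. The class name and the whitespace splitting are just illustrative choices, not anything from the post.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Each worker gets a slice of the input and turns every word into a
    // (word, 1) pair; the framework then groups the pairs by key for the reducers.
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);  // emit the key-value pair
                }
            }
        }
    }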

HDFS is a file system specifically designed to handle LOTS of key-value pair data. The data is organized specifically to be tolerant of faults (hardware failure like ‘Backhoe Fade’), always available and quickly available. In addition to the architecture of the storage system (key-value pairs), the data itself is duplicated, by default three times. This duplication is done on physically separate hardware (eliminating the effects of backhoe fade), so there are essentially online backups of your live data. Since the data is in three places it is essentially a hot standby, and since all three copies are kept up to date you can read and write to any of them, which makes it quick.
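
As a small illustration of that replication factor, here is a hedged sketch using Hadoop's Java FileSystem API. The cluster address and file path are hypothetical, and on a stock install the default of three copies comes from the dfs.replication setting.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical cluster address; dfs.replication defaults to 3.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/data/example.txt");

            // Write a small file; the NameNode picks three DataNodes per block.
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeUTF("key\tvalue\n");
            }

            // Ask how many copies are being kept for this file.
            FileStatus status = fs.getFileStatus(file);
            System.out.println("Replication factor: " + status.getReplication());

            fs.close();
        }
    }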

HDFS

HDFS relies on things called nodes. There is a master “NameNode” that knows where each piece of data needed for processing resides on the file system, and there are the “DataNodes” that actually hold the data. The NameNode keeps track of the multiple copies of data required to achieve protection against data loss. The NameNode is duplicated in a Secondary NameNode, however this is not a hot standby: in the event that the Secondary NameNode needs to take over from the primary, it has to load its namespace image into memory, replay the edit log, and service enough block reports from the DataNodes to cross the threshold of operability.
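
To see that block map in action, here is another hedged sketch (hypothetical file path again) that asks the NameNode which DataNodes hold the blocks of a file. Only metadata flows through the NameNode; the data itself is read from the DataNodes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationSketch {
        public static void main(String[] args) throws Exception {
            // Picks up the cluster address from the client's Hadoop configuration.
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/example.txt");  // hypothetical file

            // The client asks the NameNode for the block map.
            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

            for (BlockLocation block : blocks) {
                // Each block should list three DataNode hosts, one per replica.
                System.out.println("offset " + block.getOffset()
                        + " length " + block.getLength()
                        + " hosts " + String.join(", ", block.getHosts()));
            }
            fs.close();
        }
    }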

That is it, really. It’s pretty straightforward. MapReduce and HDFS are designed to scale linearly (i.e. there is a direct linear proportion between how much money you spend and the size of the bang you get, like nuclear weapons). This stuff would pretty much be useless unless it could get staggeringly big and still be manageable.

There is some magic behind the scenes, however: keeping track of the state and location of the data and of the processes operating on it. To make all of this stuff work there is an army of specialized distributed processes, all coordinating actions on the data AND coordinating actions between themselves (see the sketch after this list). There are:

  • Distributed Lock Managers
  • Distributed Schedulers
  • Distributed Clients
  • Distributed Coordinators
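
The post doesn’t name a specific coordinator, but in the Hadoop world Apache ZooKeeper is the usual piece of kit behind this sort of distributed locking and coordination. Purely as an illustration (the ensemble address and znode path are hypothetical, and it assumes the /locks parent znode already exists), here is a minimal sketch of a crude “first one to create the znode holds the lock” scheme using the ZooKeeper Java client. Real lock recipes use ephemeral sequential znodes plus watches so that waiting clients don’t all stampede when the lock is released.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class CrudeLockSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical ZooKeeper ensemble address.
            ZooKeeper zk = new ZooKeeper("zk.example.com:2181", 15000, event -> { });

            try {
                // Ephemeral: the znode (and therefore the lock) vanishes
                // automatically if this client's session dies.
                zk.create("/locks/my-job", new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                System.out.println("Lock acquired, doing work...");
            } catch (KeeperException.NodeExistsException e) {
                System.out.println("Somebody else holds the lock.");
            } finally {
                zk.close();
            }
        }
    }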

The detail on how they work is as easily accessible as the early work on computability theory (you can get hold of it). Whether you can understand it or not is a different matter.