Hadoop – HDFS and MapReduce.

Hadoop came out in 2008, and its primary components (HDFS and MapReduce) conceptually look like this.

[Figure: Hadoop, HDFS and MapReduce]

MapReduce splits work across many machines.

Map takes the job, breaks it into many small, reproducible pieces, sends them to different machines, and runs them in parallel.

Reduce takes the results and puts them back together to get a single answer.
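To make the Map and Reduce steps concrete, here is a minimal word-count sketch in Python. Word counting, and the names map_fn, reduce_fn, and run_mapreduce, are my own illustration of the idea, not anything from Hadoop itself; a real cluster would run each Map on a different machine rather than in a loop.

```python
# A toy word-count sketch of the map/reduce flow described above.
from collections import defaultdict
from itertools import chain

def map_fn(document):
    """Map: turn one input piece into a list of (key, value) pairs."""
    return [(word, 1) for word in document.split()]

def reduce_fn(key, values):
    """Reduce: combine all values that share a key into a single value."""
    return (key, sum(values))

def run_mapreduce(documents):
    # On a cluster each document would be mapped on a separate machine;
    # here we just loop to keep the sketch self-contained.
    pairs = chain.from_iterable(map_fn(doc) for doc in documents)

    # Shuffle: group all values by key before reducing.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)

    return dict(reduce_fn(k, v) for k, v in grouped.items())

print(run_mapreduce(["the quick brown fox", "the lazy dog", "the fox"]))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```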

Reduce works recursively: you HAVE to design your Reduce job so that it can consume its own output. Successive Reduce operations over a smaller and smaller set of refined data give you your final answer. The result of the Map computation is always a list of key/value pairs, and for each set of values that share a key, Reduce combines them into a single value. You don't know whether this took 100 steps or one; the end result looks like a single map from keys to values.
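A quick sketch of why Reduce must be able to consume its own output: when the combine step is something like a sum, reducing partial results in stages gives exactly the same answer as reducing everything in one pass. This is again a hypothetical word-count example, reusing the reduce_fn shape from the sketch above.

```python
# Reducing in one step and reducing partial outputs in stages must agree.
def reduce_fn(key, values):
    return (key, sum(values))   # sum is associative, so staging is safe

values = [1, 1, 1, 1, 1, 1]     # six (word, 1) pairs for the same key

# One big reduce over everything.
_, single_pass = reduce_fn("fox", values)

# Two partial reduces (say, on different machines), then a reduce over
# those partial outputs -- Reduce consuming its own output.
_, partial_a = reduce_fn("fox", values[:3])
_, partial_b = reduce_fn("fox", values[3:])
_, staged = reduce_fn("fox", [partial_a, partial_b])

assert single_pass == staged == 6   # one step or many, the answer is identical
```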

Key/value pair data is stored in the Hadoop Distributed File System (HDFS).

The data in HDFS is organized specifically to be fault tolerant (surviving hardware failures like 'backhoe fade'), always available, and quick to access. In addition to the architecture of the storage system (key/value pairs), the data itself is replicated, by default three times. This replication is done on physically separate hardware (eliminating the effects of backhoe fade), so there are essentially online backups of your live data. Since the data lives in three places it is essentially a hot standby, and since all three copies are kept up to date you can read from and write to any of them, which makes it quick.
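Here is a toy sketch of that replication idea, not HDFS's actual block-placement code: keep three copies of every piece of data on physically separate machines, so any surviving copy can serve a read. TinyStore and its node names are made up for illustration; the replication factor of 3 is the HDFS default mentioned above.

```python
# A toy illustration of 3-way replication across separate machines.
import random

REPLICATION_FACTOR = 3   # HDFS's default, as noted above

class TinyStore:
    def __init__(self, nodes):
        self.nodes = {name: {} for name in nodes}   # node name -> {key: value}

    def write(self, key, value):
        # Pick three distinct machines and store a copy on each.
        for node in random.sample(list(self.nodes), REPLICATION_FACTOR):
            self.nodes[node][key] = value

    def read(self, key, failed=()):
        # Any surviving replica can answer the read.
        for name, data in self.nodes.items():
            if name not in failed and key in data:
                return data[key]
        raise KeyError(key)

store = TinyStore(["rack1-a", "rack1-b", "rack2-a", "rack2-b", "rack3-a"])
store.write("fox", 2)
# Even with two machines down, at least one of the three copies survives.
print(store.read("fox", failed={"rack1-a", "rack2-a"}))   # -> 2
```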

What you get is a key/value pair datastore (fodder for MapReduce queries) that is fault tolerant, highly available, and able to handle staggering amounts of data quickly.