Imagine what Big Data can do for you.

Big Data is a term, like database, that can mean different but similar things depending on context:

  • Big Data is archetypically the Hadoop Distributed File System (HDFS) and MapReduce. The open source Apache project Hadoop includes HDFS and an implementation of MapReduce.
  • Big Data is a collection of related open source projects – an ecosystem.
  • Big Data open source components can be plugged in and out, and you can even write your own. When you do this you have a solution in mind, and even if that solution is also called something else, like Data Warehousing, it’s still called Big Data.

If you understand Hadoop (HDFS and the MapReduce runtime environment) you know 90% of Big Data.


Map and Reduce came as a result of study in the 1930s into the solvability of mathematical problems and the amount of processing required to get a result.

This field of study is called computability theory.

In mathematics, recursion theorists study relative computability, degree structures and reducibility notions. In computer science, geeks study sub-recursive hierarchies, formal methods and formal languages. There’s a great deal more overlap between the two than you would expect on first hearing this.

Alan Turing and Alonzo Church showed that if you can describe a mathematical procedure unambiguously, you can execute it algorithmically (you have a Language). Knowing this, it was just a question of the horsepower needed to execute the algorithm. The really cool thing about computability theory was that if you could state your question, it would tell you how much grunt would be needed to answer it, which in the 1930s placed this type of solution firmly in the hands of those with the most money. Douglas Adams was perhaps most eloquent at expressing the amount of gear required to answer really tough questions.

Cheap commodity server hardware solved the horsepower problem, so that now you didn’t need a computer the size of a planet to answer questions. The distributed processing and file systems needed to support it, which were described in the 1960s and saw first fruit in the 1970s (Ethernet – thank you very much), came together with the MapReduce language in 2008 with Hadoop.

MapReduce is a language (like Database, Galvanization and Big Data, the definition of MapReduce has its own little context dependency that separates the geeks from the rest of humanity).

To make a language work in a computer you need an environment for it to run in. This is called a runtime environment, no ambiguity there. The runtime is what allows the language to do what it does: it understands the data objects you are working with and how they are stored and organized, it makes these objects available to the language, and it implements the syntax that allows you to interrogate and manipulate this data. So MapReduce is also a runtime, and this is probably the most accurate definition.

Runtimes implement the behavior of a language, so while we can say MapReduce is a language, the runtime means that you can actually write code against the data system using any language the runtime supports (Java, Python, C++ etc.). This raises the question, ‘If I can use languages I already have, why use another language?’ That’s a good question, and especially relevant in Open Source projects (like Java and Hadoop). Open Source is a meritocracy: projects that add little value have small communities. The MapReduce community is huge because it is a language designed to work with huge distributed processor farms and mind-boggling amounts of multiply redundant data.

It is possible to program against this with Java. If you have the Java chops and intestinal fortitude to handle the amazingly powerful (and huge) set of Java APIs that is part of the Hadoop ecosystem, you also have enough common sense to know when to use Java and when to use MapReduce against your data.

MapReduce runs on top of a distributed file system (HDFS), so it must have a distributed processing model and execution environment. The distributed file system keeps track of what data objects you have and where they are. The distributed processing model keeps track of your executing code (the MapReduce program) and all the other server processes that are needed to coordinate action across a large, potentially geographically dispersed, multiply redundant file system. The execution environment breaks down the Map and Reduce steps into tasks that can then be farmed out and executed in parallel across many cheap commodity servers. Sometimes your answer lies in the Map step so there may be no need to Reduce.
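
To make the "no need to Reduce" case concrete, here is a minimal sketch of a Map-only job. It is not Hadoop code: the input splits, the `map_task` function and the use of threads to stand in for commodity servers are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input, already divided into splits (one split per map task).
SPLITS = [
    ["the cat sat", "on the mat"],
    ["a dog barked", "the cat ran"],
    ["no animals here"],
]

def map_task(split):
    """Map-only task: keep the lines that mention 'cat'. No Reduce is needed."""
    return [line for line in split if "cat" in line.split()]

def run_map_only_job(splits):
    # Each split becomes an independent task; threads stand in for servers.
    with ThreadPoolExecutor(max_workers=3) as pool:
        per_task_output = list(pool.map(map_task, splits))
    # The job output is simply the concatenation of every task's output.
    return [line for output in per_task_output for line in output]

matching_lines = run_map_only_job(SPLITS)
print(matching_lines)
```

Because every task works on its own split and nothing has to be combined afterwards, the job is finished as soon as the last Map task returns.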

So what we have so far is:

  • a way to mathematically describe complex problems and get an idea of the processing requirements (the theory),
  • cheap, freely available hardware with loads of disks and processors (the hardware),
  • the software to run distributed file systems and processors (Hadoop),
  • the language needed to break complex data manipulations into a small number of atomic operations that can be linked (MapReduce), and
  • the runtime (once again MapReduce) that implements that language (MapReduce, Algol, Fortran, Assembler).

And a full page of blurb on MapReduce before you actually find out what it does.

What does MapReduce do?

No article on Big Data is complete without the obligatory description and schematic of MapReduce.

Here is mine.

It is 1930. Alan Turing in England and Alonzo Church in America have just come up with computability theory. The Great Depression is just kicking off in America. John D. Rockefeller, the world’s richest man and a great philanthropist, sees that America is suffering and that it’s the kids who are hit the hardest. He is rich and can afford to pay anyone to do anything and can buy whatever he likes. He is very smart and employs field observers, so he has first-hand intelligence of the business world around him. The Differential Analyzer is a high-speed moron not yet ready for prime time, but the concepts behind solving massive problems are mathematically provable.

He thought that the US Library of Congress was just dusty old books. He knew that man’s real treasure was what lay within these books, in their words of wisdom.

John had an epiphany!

“So – the key is the words and these are the things that I value. I have just MAPPED my problem into KEY VALUE PAIRS – that is totally cool because I just read something about that!”

John knew he could mine the Library of Congress and abstract this knowledge. It would make him not only the richest but also most omniscient man in the world. Wiser than Solomon, richer than … well nobody actually because he was already the richest man in the world.

Heady stuff.

To mine the library however the books would have to be destroyed.

John gathers all the kids in America at the Library and says:

John: “Kids, here’s some really sharp scissors, I want you to run in there and tear every page out of those books. $1.00 for every page you tear out!

Then I want you to cut each page into lines and then each line into words. When you are done, just throw the words on the floor. I trust you kids, America has the finest education system in the world, you all know that ‘The’ is the same as ‘THE’ is the same as ‘tHE’ but not the same as ‘thee’, that the ‘P’ is silent in psittacosis, there’s no ‘F’ in rough and there’s no effin Physics because Science is hard and we pay teachers so poorly.”

Kids: In unison “Gee big shot, a buck a page, that sucks”

John: “The fun ain’t done yet kids”

Kids: In unison “Tell us more John”

John: “Right, when you are tired of ripping out pages and cutting them up, just grab handfuls of those words on the floor. Lots of ‘em are the same, and that’s a waste. I want to REDUCE the waste. Go through the ones you pick up and for any duplicates, count ‘em up and write it on the back of one of them and drop it on the floor again. All the dupes, bring to me – there’s a dollar for each one. Just keep doing that until the dupes are gone”

So, schematically, what John had each of his little helpers do was this:

The kids rushed in and found that there was one book in the Library with 3 pages in it.  We don’t care about lines and after Fred left with $3 in his sweaty little paws there were 15 words lying on the floor.

Kids have small hands and can only grab about 3 words in each hand, so with Sharon’s first grab she got 3 AA’s (and $2), a BB, an XX and a ZZ. Frank got 3 BB’s (and $2), an XX, a CC and a DD. Robert was out of luck – 3 words only and no dupes.

Robert leapt like a leopard on the next stack of six lying on the floor and got a score – a BB netted him $1.

When Debbie left with the last $1, John D. Rockefeller had the information he wanted – the words in the Library of Congress: 4 BB’s, 3 AA’s, 2 XX’s, 2 EE’s and one each of ZZ, CC, DD and FF.

There you have it, a perfectly executed plan against simple and well-defined data and a single property of that data.
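
For the geeks, John’s plan can be sketched in a few lines of Python. This is only an illustration of the idea, not how Hadoop runs it: the order of the 15 words and the function names `map_phase`, `shuffle` and `reduce_phase` are assumptions, and everything happens on one machine instead of a library full of kids.

```python
from collections import defaultdict

# The 15 words left on the floor, in an arbitrary (assumed) order.
WORDS = ["AA", "BB", "AA", "XX", "ZZ", "AA", "BB", "BB", "XX", "CC",
         "DD", "BB", "EE", "FF", "EE"]

def map_phase(words):
    """Map: emit a (key, value) pair of (word, 1) for every word.

    Upper-casing the key is what makes 'The', 'THE' and 'tHE' the same word.
    """
    return [(word.upper(), 1) for word in words]

def shuffle(pairs):
    """Group the values by key, the step that sits between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(shuffle(map_phase(WORDS)))
for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(word, count)
```

The counts that come out match what John ended up with: 4 BB’s, 3 AA’s, 2 XX’s, 2 EE’s and one each of the rest.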

As you will see later, not all data is a key value pair and not all data is in a computer. The most important data is in your head: what is that one piece of information that you need about something to solve a problem? Once you start thinking in these terms there is a proven way of getting this problem solved with computers.

It has been said that to err is human but to really screw up, you need a computer. Likewise, garbage in, garbage out.

Even if you are the richest man in America and can clearly state your vision, there is a translation between thought and action that needs careful consideration. John had his answer and, besides the obvious (a word count), he had the understanding that Information Processing processes Information, like words. Knowledge, and thus the wisdom he sought, required processing of a more cerebral kind.

So if you throw into the mix loads of project managers, business analysts, developers, program managers and business managers, you can see that the possibility of something getting lost in the translation is directly proportional to the outstanding potential Big Data has for answering previously unanswerable questions.

People are gonna end up with answers that they weren’t expecting because they framed their question incorrectly. That is the primary source of human inspiration: experience is something you get when you are looking for something else. Ask John.

Having tried and missed, you have all sorts of new questions; you are practiced at asking the questions and more certain of how to express yourself – sort of like life in general.

John could have asked:

  • “I wonder how many books have pictures of me in them?”
  • “I wonder what percentage of books contain language offensive to my Baptist sensibilities”
  • “I wonder if anyone has infringed my patents or copyrights”
  • “I wonder what they look like”
  • “I wonder where they live”

Anyhow, John destroyed all his data so he was prevented from these perfectly normal human lines of enquiry.

Data today is forever. We design highly available fault tolerant data systems and for every method of getting data stored electronically there is a method to retrieve it and multiple methods for manipulating it. It’s what Information Processing is all about.

If you are wondering how a modern day Rockefeller could answer these types of questions, consider how Facebook knows who your friends are likely to be by analyzing their photographs or how law enforcement can capture an image of you running a red light, recognize the license plate, identify your mug through the windscreen and send you a bill.

All Big Data type solutions.

All implemented and working right now.

MapReduce splits work across many machines.

Map takes the job, breaks it into many small reproducible pieces and sends them to different machines and runs them in parallel.

Reduce takes the results and puts them back together to get a single answer.

Reduce works recursively. You HAVE to design your Reduce job so that it can consume its own output. These successive Reduce operations over a smaller and smaller set of refined data give you your unique answer. The result of the Map computation is always a list of key/value pairs. For each set of values that have the same key, Reduce will combine them into a single value. You don’t know if this was 100 steps or 1; the end result looks like a single Map.
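
Here is a small sketch of what "Reduce can consume its own output" means in practice. The chunks of text and the names `map_words` and `reduce_counts` are made up for illustration. The point is only that the reducer takes a list of (word, count) pairs and returns a list of (word, count) pairs, so it can be applied again and again.

```python
from collections import Counter
from functools import reduce

def map_words(chunk):
    """Map: turn a chunk of text into a list of (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_counts(pairs):
    """Reduce: combine values that share a key.

    Input and output have the same shape (a list of (word, count) pairs),
    so the output of one Reduce can be fed straight into another.
    """
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

# Hypothetical chunks, each of which would go to a different machine.
chunks = ["the cat sat", "on the mat", "the cat"]

# First round: reduce each chunk's map output on its own.
partials = [reduce_counts(map_words(chunk)) for chunk in chunks]

# Later rounds: reduce the reducers' output until one answer remains.
stepwise = reduce(lambda left, right: reduce_counts(left + right), partials)

# Reducing everything in a single step gives the same result.
single_step = reduce_counts([pair for chunk in chunks for pair in map_words(chunk)])

print(stepwise)
print(stepwise == single_step)
```

Whether the answer was reached in several rounds or in one, the final list of pairs is identical, which is why you never need to know how many steps it took.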

In this case, the result is a list of every single word that occurs in the US Library of Congress and a count of how many times it occurs, for example:

  • The, 300,000,000
  • Cat, 12
  • Sat, 4,553,353,167,187,867,655.076591
  • On, 5
  • Mat, 6

This can be used to fuel still other queries, and just as the data is persistent these days, so is any part of the reduce operation that is unaffected by your new question. More of this later, but what it means is that once you have done a complete reduction the first time, it is possible to re-run components of your solution without having to go through all the reduction steps again – nice huh?

Why do you care?

The reason that people use Big Data is that it can get results in days or hours that would previously take months or years to get.

More than this, it can get answers that were previously unavailable.

Knowing how MapReduce works is a great way of appreciating what is actually happening when people say “We are using Big Data to solve this particular problem.”

Although you don’t have to be Rockefeller to get yourself a Big Data solution, you should also know that solving problems with Big Data is still not cheap. You need:

  • Hard core open source expertise
  • Network administration and Linux administration with an emphasis on HA
  • Hardware expertise – Both servers and networks
  • Data architects
  • People to move data
  • People to build MapReduce applications
  • People to performance test and tune
  • People to manage it all

Like owning a car, you can buy it yourself or you can lease it. It’s not that Big Data requires esoteric equipment and storage, it just needs lots of it. There are Cloud based products that mean that you can literally run a Big Data behemoth without owning a lick of gear. The Cloud is designed for this (Distributed Computing) and since you pay for storage, Cloud service providers love Big Data.

However you do it, Big Data might not be cheap, but it’s invaluable.

If you want to know what other people have used it for, look at the Big Data Implementations section below.

HDFS – Distributed file system

When you Map a problem you break it down into key value pairs and spread it out across a number of processors – abstract it to its simplest components and give them to lots of workers.

HDFS is a file system specifically designed to handle LOTS of key value pair data. The data is organized specifically to be tolerant of faults (hardware failure like ‘Backhoe Fade’), always available and quickly available. In addition to the architecture of the storage system (key value pairs), the data itself is duplicated, by default 3 times. This duplication is done on physically separate hardware (to eliminate the effects of backhoe fade), so there are essentially online backups of your live data. Since the data is in 3 places it’s essentially a hot standby, and since all 3 copies are up to date you can read from any of them, which makes it quick.
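
The idea of keeping 3 copies on separate hardware can be sketched as follows. This is a toy round-robin policy with invented node and block names, not the placement logic HDFS actually uses (which is rack-aware and more subtle). It only shows why losing one machine does not lose any data.

```python
REPLICATION = 3  # the default number of copies mentioned above

# Hypothetical cluster of five machines.
DATANODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]

def place_replicas(block_ids, datanodes, replication=REPLICATION):
    """Give every block `replication` copies, each on a different machine.

    Simple round-robin placement, used here purely for illustration.
    """
    if replication > len(datanodes):
        raise ValueError("not enough machines for the requested replication")
    placement = {}
    for index, block_id in enumerate(block_ids):
        placement[block_id] = [datanodes[(index + i) % len(datanodes)]
                               for i in range(replication)]
    return placement

def surviving_blocks(placement, failed_node):
    """Blocks that still have at least one copy after a machine is lost."""
    return [block_id for block_id, nodes in placement.items()
            if any(node != failed_node for node in nodes)]

placement = place_replicas(["blk_1", "blk_2", "blk_3", "blk_4"], DATANODES)
for block_id, nodes in placement.items():
    print(block_id, nodes)

# Even with dn1 dug up by a backhoe, every block is still readable.
print(surviving_blocks(placement, "dn1"))
```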

HDFS and Hadoop

HDFS relies on nodes. There is a Master “NameNode” that knows where on the “DataNodes” each piece of data needed for processing resides. The NameNode keeps track of the multiple copies of data required to achieve protection against data loss. The NameNode is duplicated in a Secondary NameNode; however, this is not a hot standby. In the event that the Secondary NameNode needs to take over from the primary, it has to load its namespace image into memory, replay the edit log and service enough block reports from the DataNodes to cross the threshold of operability.
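
The "threshold of operability" is easier to picture with a toy model. The class below is an assumption-laden sketch, not the real NameNode: the class name, its methods and the 75% threshold used in the example are all invented. It shows a block map being rebuilt from block reports until enough blocks have been located.

```python
class ToyNameNode:
    """Tracks which DataNodes hold each block, built purely from block reports."""

    def __init__(self, expected_blocks, threshold):
        # `threshold` is an illustrative value: the fraction of blocks that
        # must be located before this toy NameNode counts as operable.
        self.expected_blocks = set(expected_blocks)
        self.threshold = threshold
        self.locations = {}  # block id -> set of DataNode names

    def receive_block_report(self, datanode, block_ids):
        """A DataNode reports every block it currently stores."""
        for block_id in block_ids:
            self.locations.setdefault(block_id, set()).add(datanode)

    def reported_fraction(self):
        located = sum(1 for b in self.expected_blocks if b in self.locations)
        return located / len(self.expected_blocks)

    def is_operable(self):
        return self.reported_fraction() >= self.threshold


namenode = ToyNameNode(["blk_1", "blk_2", "blk_3", "blk_4"], threshold=0.75)

namenode.receive_block_report("dn1", ["blk_1", "blk_2"])
print(namenode.reported_fraction(), namenode.is_operable())

namenode.receive_block_report("dn2", ["blk_2", "blk_3"])
print(namenode.reported_fraction(), namenode.is_operable())
```

After the first report only half the blocks are located, so the toy NameNode is not yet operable. The second report takes it to three out of four, which crosses the threshold.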

That is it, really. It’s pretty straightforward. MapReduce and HDFS are designed to scale linearly (i.e. there is a direct linear proportion to how much money you spend and the size of the bang you get – like nuclear weapons). This stuff would pretty much be useless unless it could get staggeringly big and still be manageable.

There is some magic behind the scenes, however: keeping track of the state and location of the data and the processes operating on it. To make all of this stuff work there is an army of specialized distributed processes, all coordinating actions on the data AND coordinating actions between themselves. There are:

  • Distributed Lock Managers
  • Distributed Schedulers
  • Distributed Clients
  • Distributed Coordinators

The detail on how they work is as easily accessible as the early work on computability theory (you can get hold of it). Whether you can understand it or not is a different matter.


Hadoop became a top-level Apache project in 2008, and its primary components (HDFS and MapReduce) conceptually look like this.

Hadoop, MapReduce and HDFS

The Hadoop ecosystem.

Big Data is more than just the Apache Hadoop kernel; it is a collection of related projects. Some of them are mandatory (you cannot do Big Data without them) and others are optional (like data loaders and flow processing languages).

The mandatory Hadoop projects are HDFS, MapReduce and Common.  Common is a collection of components and interfaces for distributed processing and I/O. In addition to the file system itself and the language, Common implements the required components of distributed computing (autonomous computing entities with their own local memory that communicate through message passing).

There are a number of projects in the ecosystem that complement the mandatory projects. Below are some of the higher-level projects:

  • ZooKeeper

A distributed coordination service. HDFS relies on nodes, and there is a Master “NameNode” that knows where on the “DataNodes” each piece of data needed for processing resides. The NameNode, even if it is duplicated in a Secondary NameNode, is still a single point of failure. Promoting a Secondary NameNode to be the primary can take hours on really large clusters.

The way that this is solved is to have a pair of NameNodes in an active-standby configuration. To make this feasible the NameNodes must each be able to share the edit log (a file that keeps track of all the changes that are made so that, if needed, it can be replayed to bring the system back into sync).

To make HDFS Highly Available (HA), each DataNode sends block reports to both NameNodes (block mappings are stored in the NameNode’s memory, not on disk).

In this way, the standby NameNode has its namespace loaded in memory already and is already receiving block reports.

Any clients that attach to the Master must also be able to handle NameNode failover. It’s no use if the system fails over gracefully but the clients attached to that system cannot. To make Hadoop HA, ZooKeeper coordinates the failover: it keeps track of which NameNode is active and lets the standby take over when the active one fails, while the NameNodes reach the edit log through shared storage.
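
The reason a standby that follows the edit log can take over quickly, while a cold start cannot, can be sketched like this. The log format, the operation names and the functions `apply_edit` and `replay` are all hypothetical. The real edit log is a binary format with many more operations.

```python
def apply_edit(namespace, edit):
    """Apply one logged change to an in-memory namespace (path -> block list)."""
    op, path, *args = edit
    if op == "create":
        namespace[path] = []
    elif op == "add_block":
        namespace[path].append(args[0])
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")

def replay(edit_log, namespace=None, start=0):
    """Replay edits from position `start`, returning (namespace, next position)."""
    namespace = {} if namespace is None else namespace
    for edit in edit_log[start:]:
        apply_edit(namespace, edit)
    return namespace, len(edit_log)

# Hypothetical shared edit log written by the active NameNode.
shared_edit_log = [
    ("create", "/books/page1"),
    ("add_block", "/books/page1", "blk_1"),
    ("create", "/tmp/scratch"),
]

# The standby follows the log as it grows...
standby, position = replay(shared_edit_log)
shared_edit_log.append(("delete", "/tmp/scratch"))
shared_edit_log.append(("add_block", "/books/page1", "blk_2"))
# ...so at failover it only has to apply the edits it has not yet seen.
standby, position = replay(shared_edit_log, standby, position)

# A cold start replays the whole log and arrives at the same namespace.
cold_start, _ = replay(shared_edit_log)
print(standby)
print(standby == cold_start)
```

Both routes end in the same namespace. The difference is how much work is left to do at the moment of failure: two edits for the standby, the whole log for the cold start.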


  • Hive

A Data Warehouse project. Data Warehouse systems are predominantly relational data. More and more unstructured data is being stored in non-relational file systems like HDFS, and organizations that have large Data Warehouse operations (Business Analysts) want to continue to use what they know but on this new rich media. The Hive project makes all your data look relational (whether it is or not) and implements an SQL dialect. This last bit (implements an SQL dialect) is being pedantic because it’s not 100% SQL; however, as far as your BAs are concerned it’s near enough for them to continue doing what they know best. Making your data look relational is also a simplification; the pedantic version is that Hive implements relational table abstractions.

  • HBase

Hadoop implements HDFS as a file system, but it does not have to be HDFS. HBase is an open source, column oriented, key value NoSQL database built on top of HDFS (so it is distributed). It supports MapReduce and gives good response times for point queries, although it does not implement an SQL dialect. Column oriented databases have advantages over row oriented databases depending on your workload and can offer additional benefits in data compression because of their architecture. If you can design a Big Data architecture from scratch, you need to look at what type of solution it is going to be (i.e. OLAP, OLTP etc.) and how you are going to be using it.
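
Why column orientation helps with some workloads and with compression can be shown with a generic sketch. This is not HBase’s storage format, and the table contents and function names are invented. It simply pivots a few rows into columns and run-length encodes one of them.

```python
from itertools import groupby

# Hypothetical table of (word, language, count) rows.
ROWS = [
    ("cat", "en", 12),
    ("mat", "en", 6),
    ("on", "en", 5),
    ("chat", "fr", 7),
    ("sur", "fr", 3),
]

def to_columns(rows, names):
    """Pivot row-oriented tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}

def run_length_encode(values):
    """Collapse runs of repeated values into [value, run length] pairs."""
    return [[value, len(list(run))] for value, run in groupby(values)]

columns = to_columns(ROWS, ["word", "language", "count"])

# An aggregate over one column touches only that column's values.
print(sum(columns["count"]))

# Similar values sit next to each other in a column, so they compress well.
print(run_length_encode(columns["language"]))
```

A query that totals one column never has to read the others, and a column full of repeated values shrinks to a handful of pairs. A workload that mostly fetches whole rows would see neither benefit, which is why the type of solution matters.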

There are also projects in their own right that deal with utility or specialized functions such as

  • Flow Processing Languages (Pig)
  • Cross Language Serialization (Avro)
  • Extract Transform Load (Sqoop)
  • Scheduler (Oozie)

This is not a definitive list of projects. There are many more, right down to JDBC Drivers, Database Connection Poolers and Logging Packages. This post is to give you a summary of the mandatory ecosystem projects, an introduction to the main optional projects that you are likely to come across and an idea of the utility projects available. Other posts on specific Big Data implementations (like Data Warehousing) will cover their list of projects at an appropriate level of depth.

Big Data Implementations.

Big Data is conceptually a collection of data and the means to store (a file system) and access it (a runtime).

This means that all of the components are swappable including the very core components that were the archetypal definition of Big Data – the file system (HDFS) and the execution environment (MapReduce). The resulting system will still be Big Data, and if any of this still fails to meet your needs you can write your own and this will still be Big Data (after some smart people look at it first).

It’s like Granddad’s axe. The handle has been replaced 3 times and the head once, but it’s still Granddad’s axe.

Alternatives to MapReduce and HDFS started to get published in 2009 (signs of a healthy ecosystem), and the people that were deciding that there needed to be alternatives were predominantly the people who wrote this stuff in the first place – Google (signs of Google eating their own dog food). There are several excellent articles on these alternatives. What is encouraging is that the initial Google projects (like Pregel) also have open source counterparts (Giraph). What is more encouraging is that there are Google projects, as well as Big Data solutions out there, that don’t have an open source project.

In this case the definition of Big Data for you could be your very own open source project and community (be careful what you wish for.)

Big Data is also a solution. It is a general data processing tool built to address the real world use case of speedy access and analytics of huge datasets.

Examples of these types of solutions are:

  • Facebook uses it to handle 50 Billion photographs
  • President Barack Obama committed $200 million to Federal Initiatives to fight cancer and traffic congestion with Big Data. Mr. Obama knows the value of Big Data – it helped him get re-elected.
  • BigSQL – the poster child for Data Warehousing with Big Data.
  • Decoding the Human genome – what took 10 years to do originally, Big Data does in a week.
  • NASA uses it to observe and simulate the weather
  • The NSA are definitely going to use it for the betterment of the human condition when they fire up their yottabyte capable Big Data machine on information they capture over the internet. 1000000000000000000000000 bytes – it’s a lot bigger than it looks, trust me. Just imagine what those little rascals could get up to with all of that lovely data.
  • Architects use it to enforce a consistent design approach and increase productivity (Nice to hear, I assume it also makes buildings safer as well as profits bigger).
  • MIT hosts the Intel Science and Technology Center for Big Data (If you can’t do, teach. Big Data eats Intel chips like a pig eats truffles).
  • McKinsey bills millions on their MIT graduate Big Data consultants.
  • It made finding the God Particle (Higgs boson) possible.

As you can see, the limit of natty little Big Data implementations is your imagination.

 Imagine What Big Data can do for You.