MapReduce.

MapReduce has a language, and that language has a syntax (like English). Behind MapReduce are concepts that the language implements. You don’t need to understand the language to appreciate the elegance of MapReduce; you just need a good analogy.

Here’s mine:

It is 1930. Alan Turing and Alonzo Church in England have just come up with computability theory. The Great Depression is just kicking off in America. John D. Rockefeller, the world’s richest man and a great philanthropist, sees that America is suffering and that it’s the kids who are hit the hardest. He is rich and can afford to pay anyone to do anything and can buy whatever he likes. He is very smart and employs field observers, so he has first-hand intelligence of the business world around him. The Differential Analyzer was a high-speed moron not yet ready for prime time, but the concepts behind solving massive problems were mathematically provable. He had just read that: Alan Turing said so, and it was right there in the US Library of Congress, where he was sitting wondering how many of these books had pictures of him in them.

He had thought that the US Library of Congress was just dusty old books. Now he knew that man’s real treasure was what was within these books, in their words of wisdom (that’s why he was in there, reading).

John had an epiphany!

“So – the key is the words and these are the things that I value. I have just MAPPED my problem into KEY VALUE PAIRS – that is totally cool because Alan Turing says so!”

John knew he could mine the Library of Congress and abstract this knowledge. It would make him not only the richest but also the most omniscient man in the world. Wiser than Solomon, richer than … well, nobody actually, because he was already the richest man in the world.

Heady stuff.

To mine the library, however, the books would have to be destroyed.

John gathers all the kids in America at the Library and says:

John: “Kids, here are some really sharp scissors. I want you to run in there and tear every page out of those books. $1.00 for every page you tear out!

Then I want you to cut each page into lines and then each line into words. When you are done, just throw the words on the floor. I trust you kids. America has the finest education system in the world: you all know that ‘The’ is the same as ‘THE’ is the same as ‘tHE’ but not the same as ‘thee’, that the ‘P’ is silent in psittacosis, that there’s no ‘F’ in rough, and that there’s no effin Physics because Science is hard and we pay teachers so poorly.”

Kids: In unison “Gee, big shot, a buck a page, that is markedly unphilanthropic”

John: “The fun ain’t done yet kids”

Kids: In unison “Tell us more John”

John: “Right, when you are tired of ripping out pages and cutting them up, just grab handfuls of those words on the floor. Lots of ‘em are the same, and that’s a waste. I want to REDUCE the waste. Go through the ones you pick up and for any duplicates, count ‘em up and write it on the back of one of them and drop it on the floor again. All the dupes, bring to me – there’s a dollar for each one. Just keep doing that until the dupes are gone”

So, schematically, what John had each of his little helpers do was this:

[Figure: MapReduce]
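For anyone who would rather read code than watch kids with scissors, here is a minimal Python sketch of the two jobs John handed out. The names `map_page` and `reduce_pairs` and the sample pages are made up for illustration. This is not the API of Hadoop or any other real MapReduce framework, and it makes no attempt to strip punctuation.

```python
from collections import defaultdict


def map_page(page):
    """Cut one page into lines, then each line into words.

    Every word is thrown on the floor as a (key, value) pair: (word, 1).
    """
    pairs = []
    for line in page.splitlines():
        for word in line.split():
            # 'The', 'THE' and 'tHE' are the same word; 'thee' is not.
            pairs.append((word.lower(), 1))
    return pairs


def reduce_pairs(pairs):
    """Add up the values of all pairs that share the same key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# Hypothetical sample pages, not taken from any real book.
pages = ["The cat\nTHE dog", "tHE bird thee"]

# MAP: every page becomes a pile of (word, 1) pairs on the floor.
floor = [pair for page in pages for pair in map_page(page)]

# REDUCE: duplicates are collapsed into a single count per word.
counts = reduce_pairs(floor)
print(counts)  # {'the': 3, 'cat': 1, 'dog': 1, 'bird': 1, 'thee': 1}
```

The word is the key and the count is the value, which is all John meant by mapping his problem into key value pairs.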

The kids rushed in and found that there was one book in the Library with 3 pages in it. We don’t care about lines, and after Fred left with $3 in his sweaty little paws there were 15 words lying on the floor.

Kids have small hands and can only grab about 3 words in each hand, so with Sharon’s first grab she got 3 AA’s (and $2), a BB, an XX and a ZZ. Frank got 3 BB’s (and $2), an XX, a CC and a DD. Robert was out of luck: 3 words only and no dupes.

Robert leapt like a leopard on the next stack of six lying on the floor and got a score: a BB netted him $1.

When Debbie left with the last $1, John D. Rockefeller had the information he wanted: the words in the Library of Congress. 4 BB’s, 3 AA’s, 2 XX’s, 2 EE’s and one each of ZZ, CC, DD and FF.
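The grab-a-handful routine can be sketched in Python too. This is only a rough simulation under a few assumptions of mine: the 15 words are taken straight from John's final tally, a handful is six slips, and the kids grab at random (with a fixed seed so the run is repeatable). The helper names `reduce_handful` and `reduce_floor` are hypothetical.

```python
import random
from collections import Counter


def reduce_handful(handful):
    """Merge duplicate slips in one handful, adding their counts together."""
    merged = Counter()
    for word, count in handful:
        merged[word] += count
    return list(merged.items())


def reduce_floor(floor, handful_size=6, seed=0):
    """Keep grabbing handfuls until no word appears on more than one slip."""
    rng = random.Random(seed)  # fixed seed: an assumption, for repeatability
    floor = list(floor)
    while len({word for word, _ in floor}) < len(floor):
        rng.shuffle(floor)  # a kid grabs whatever is nearest
        handful, rest = floor[:handful_size], floor[handful_size:]
        # The merged slips are dropped back on the floor for the next kid.
        floor = rest + reduce_handful(handful)
    return dict(floor)


# The 15 words on the floor, each starting out as a (word, 1) slip.
words = ["BB"] * 4 + ["AA"] * 3 + ["XX"] * 2 + ["EE"] * 2 + ["ZZ", "CC", "DD", "FF"]
floor = [(word, 1) for word in words]

tally = reduce_floor(floor)
print(sorted(tally.items()))
# [('AA', 3), ('BB', 4), ('CC', 1), ('DD', 1), ('EE', 2), ('FF', 1), ('XX', 2), ('ZZ', 1)]
```

Because adding counts gives the same total in any order, it does not matter which kid grabs which slips or when. That is what lets John hire as many kids as he likes and still end up with the same tally.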

There you have it, a perfectly executed plan against simple and well-defined data and a single property of that data.