BigSQL Up and Running

Every Big Data solution is a collection of related projects, each suited to a specific function within that solution. BigSQL is such a collection, assembled as a Big Data Data Warehousing solution. In addition to the archetypal Big Data components (the Hadoop Distributed File System and MapReduce), BigSQL includes projects that let you integrate Big Data with Relational Data and query both through SQL.

These components are designed to work together and to be interchangeable, which gives you great flexibility.

Installation is the same as with any open source project: you can build each component from source, or you can lay down an integrated bundle. However you do it, you need to know which components are being installed, how they are configured and what they do.

The first time through, you can work from any of the excellent texts on installing a Data Warehousing solution with Big Data; if you follow a procedure in a book, you will end up with that implementation working in your environment. OpenSCG has taken all the components required in a Big Data Data Warehousing solution and bundled them together as BigSQL. You can download it, run it, and within a few minutes have a complete BigSQL environment up and running.

This post takes you through the installation process so you can understand what components are being installed and where they are being installed.

Whether you build each component from source, have someone do it for you, or use an installer bundle, if you are working with BigSQL you need to understand conceptually what components make up your solution. This is an annotated description of what happens when you run the BigSQL installer. BigSQL is a superb mechanism for:

  • Creating quick test instances on the fly for demos, proofs of concept, benchmarking, etc.
  • Understanding the installation and configuration of a BigSQL solution (and, by extension, any Big Data solution)
  • Gaining a deeper understanding of the core components within BigSQL
  • Taking the open source installer bundles and using them to create your own for Dev, QA, Integration, Regression and Production environments

BigSQL installs equally easily on Mac OS X and Linux. Mac OS X is an excellent development environment; Linux is typically used for production.

First, download your BigSQL bundle here.

Second, fear not! You can’t really make a mistake. If you run into problems or have had enough fun for one day, you can blow the whole thing away and just restart.

Installing and Running BigSQL.

Unzip the tarball you just downloaded and go to the newly created bigsql directory. Mine is bigsql-13.02.7; yours will be named for whatever version you download.
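At a shell prompt the unpack-and-enter step looks something like this (the tarball name below is illustrative – use whatever file you actually downloaded):

# Unpack the bundle and move into the directory it creates
tar -xzf bigsql-13.02.7.tar.gz
cd bigsql-13.02.7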

./bigsql start

Regardless of what other components may be incorporated into the BigSQL bundle, once you have it on your machine it’s just a question of typing

./bigsql start

Now, when you do this there are a number of things that happen behind the scenes, and I’m going to take this opportunity to explain what they are. On my Mac (2.4 GHz Intel Core i5, 8GB RAM), it takes about 90 seconds to unzip the BigSQL tarball (your mileage may vary; I have other stuff going on at the same time).

So, while the tarball is unzipping you can read what will happen when you start BigSQL for the first time:

./bigsql start

  1. BigSQL will check to see if the ports for PostgresHA, ZooKeeper, Hadoop and Hive are available. If they are:
    1. BigSQL will set up the required environment variables for each of these components
  2. BigSQL will then start each of the components, initializing any that do not yet exist, in the following order:
    1. BigSQL will initialize ZooKeeper and start it on its default port
    2. BigSQL will initialize PostgresHA and start it on its default port
  3. BigSQL will initialize Hadoop
  4. Create a Hadoop Distributed File System
  5. Start the Hadoop NameNode on its default port and start the Job Tracker
  6. BigSQL will then initialize Hive
  7. Create the Hive Metastore in PostgresHA
  8. Start the MetaStore
  9. Start the Hive Server and finally
  10. Check the status of the newly installed BigSQL Data Warehouse

BigSQL is now up and running on your machine. By my count, on my machine the process took about 30 seconds, and when it runs it looks like this:

BigSQL initializes ZooKeeper.

ZooKeeper Start

BigSQL initializes Hadoop.

Once Hadoop is initialized BigSQL creates an HDFS file system and starts its NameNode and the Job Tracker:

Hadoop Start

BigSQL initializes Hive.

Once Hive is initialized BigSQL creates the Hive MetaStore in PostgreSQL, starts the MetaStore and starts the Hive Server:

Hive Start

and lastly

BigSQL is ready for use!

BigSQL Status
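If you would rather poke at the moving parts yourself than take the status screen’s word for it, here is a minimal verification sketch. It assumes the stock default ports and that the bundled Hadoop and PostgreSQL client tools are on your PATH:

# List the running Java daemons; the Hadoop, ZooKeeper and Hive processes
# should all appear here (Hive typically shows up as RunJar)
jps

# Confirm that HDFS answers
hadoop fs -ls /

# Confirm that PostgresHA answers on the default PostgreSQL port
psql -p 5432 -d postgres -c "SELECT version();"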

And, just in case you have any questions, the BigSQL mailing list is available here:

BigSQL Architecture

BigSQL.

From data to information, from information to insight.

A state-of-the-art Big Data Warehouse solution that is fast, secure and continuously available. BigSQL will scale from your desktop to the cloud. Run real-time OLAP directly from the world’s most secure RDBMS.

Get started with BigSQL right now.

You can immediately put BigSQL to work on your Relational Data and Big Data. BigSQL is an integrated bundle of:

  • PostgresHA – Highly Available PostgreSQL, the world’s most advanced open source database,
  • Hadoop, the archetypical Big Data solution and
  • Hive, an implementation of relational table abstractions.

BigSQL Architecture.

BigSQL Architecture

These components form the core BigSQL engine, and together they give you a Highly Available Big Data Warehouse solution.

When you add in components like Enterprise Search (Solr), Flow Processing (Pig) and ETL (Sqoop), you have all the components required to analyze real-time data directly from PostgreSQL, including your NoSQL data in Hadoop.

Linear Scalability.

BigSQL leverages the linear scalability of Hadoop and HDFS across low cost commodity hardware and/or the cloud. It can easily scale to petabytes of information.

Platform Ubiquity.

BigSQL will lay down cleanly on 64-bit Linux (production) and 64-bit OS X (development).

24 x 7.

Every part of your Big Data stack should be hardened. The Hive Relational Metastore in BigSQL is PostgresHA, a Highly Available PostgreSQL implementation that can be set up and distributed exactly the same way you would any Big Data implementation. You can have Active/Standby clusters in the same datacenter but in different racks, and you can stream to a remote Disaster Recovery node.

Open Source.

Every component of BigSQL is Open Source. Some components serve double duty.

ZooKeeper is used as the distributed coordinator for HDFS and as the distributed lock manager in Hive. PostgreSQL, through PostgresHA, is the relational metastore in Hive and a Relational Data Warehouse in its own right.

Each software component is free, and it runs on cheap, freely available hardware. If you cobble together enough Raspberry Pis, your entire hardware and software stack could be open source.

Security.

BigSQL is built on PostgreSQL, the world’s most secure RDBMS.

Data Equivalence.

BigSQL gives you equivalent access to your Big Data and Relational Data through psql (the PostgreSQL command line interface) and the Hadoop Foreign Data Wrapper.
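For example, once a Hive table has been mapped into PostgresHA through the foreign data wrapper, Big Data and relational data can be joined in one ordinary SQL statement. A minimal sketch – the table names here are purely illustrative, and the foreign table is assumed to have been created already:

# 'customers' is a local Postgres table; 'hive_page_views' is a hypothetical foreign
# table backed by a Hive table in Hadoop. psql treats them exactly the same way.
psql -d postgres -c "
  SELECT c.name, count(*) AS views
    FROM customers c
    JOIN hive_page_views v ON v.customer_id = c.id
   GROUP BY c.name
   ORDER BY views DESC;"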

Help.

OpenSCG, the people who built BigSQL, are here to help, from package customization through on-site consulting to 24 x 7 database administration.

BigSQL

“From data to information, from information to insight.”

PostgresHA 9.2.4 Released

We are proud to announce that our 100% free and open source Postgres High Availability bundle (PostgresHA = PostgreSQL + pgHA) is now available as a Developer’s Sandbox for OS X and 64-bit Linux.

This proven solution leverages PostgreSQL’s native streaming replication technology and pgBouncer. It efficiently supports multiple read-only slaves, seamless DDL replication, and various failover and/or disaster recovery modes of operation. All of this without any application-level program changes!

Please check it out at http://community.openscg.com/se/postgresha

–Luss


PostgreSQL Clustering with Postgres HA

PostgreSQL Clustering made easy with PostgresHA.

Authored jointly with Scott Mead, Sr. Architect at OpenSCG and architect of PostgresHA.

PostgresHA is Highly Available (HA) PostgreSQL.

High Availability

High availability is not new. Financial trading systems and the telephone network were federally mandated to provide high availability. People didn’t do HA because it was a fun project; they did it because they were told to.

What is new is that EVERYTHING these days is HA: your email and blog posts on Google, your shopping cart on Amazon, your pictures on Facebook. People have gone from doing this because it’s critical for their business to doing it because it’s good for business.

Open Source is also good for business, and because all the components of a Big Data solution are publicly available, some very smart people have gone to the trouble of making this stuff accessible to the likes of you and me. They have made it so that we can understand it and use it. You don’t have to be Ma Bell or the NYSE to enjoy the benefits of always-available computing.

PostgreSQL

The world’s most advanced open source database. PostgreSQL has synchronous replication. In computer theory terms it is called 2-safe replication; in English it means that you have two database instances running simultaneously and your Primary (Master) database is synchronized with a Secondary (Slave) database. Unless both databases crash simultaneously, you won’t lose data. If your databases are on separate machines in separate racks in separate buildings, you can reduce the possibility of simultaneous failure to near zero. Building an HA solution that survives an Extinction Level Event like a meteor strike is beyond the scope of this paper and would not get read anyway.

With synchronous replication, each write waits until confirmation is received from the transaction log of both the Master and the Slave. This increases transaction response time; the minimum added latency is the round trip between the two servers. All two-phase commit actions require waits (both prepare and commit).

However, there is good news: read-only transactions and transaction rollbacks need not wait for replies from standby servers (PostgreSQL – we’re optimized for rollback!). Only top-level commits must wait; sub-transaction commits do not need to wait for standby servers, and long-running data loads or index builds only wait on the very last commit. Good for speed, but if you lose it halfway through, you lose it all – you pays your money, you takes your choice.
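In PostgreSQL terms, turning this on is a handful of settings on the Master plus a named standby. A minimal sketch, assuming a 9.2-era server and an illustrative data directory – pgHA ships its own tuned configuration, so treat this only as a picture of what is happening underneath:

# Append the replication settings to the Master's postgresql.conf
cat >> $PGDATA/postgresql.conf <<'EOF'
wal_level = hot_standby                  # ship enough WAL for a hot standby
max_wal_senders = 3                      # allow streaming connections from standbys
synchronous_standby_names = 'standby1'   # a commit waits for this named standby
EOF

# Per session, durability can be relaxed where waiting is not worth it
psql -c "SET synchronous_commit = local;"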

PostgreSQL uses the same concept of active standby as ZooKeeper – keep a standby hot for failover (Master/Slave, Primary/Secondary, Active/Standby – you say potato, I say potato).

OpenSCG makes PostgreSQL HA with the open source project PostgresHA.

At a high level, PostgresHA monitors both Master and Slave databases. If it spots something that could be a failure it enters a state of heightened awareness and starts yelling about it. Heightened awareness means continuous pinging, more pinging and deeper pinging.

At some point it’s serious enough to warrant failover.

This is where you need to consider, in advance, what you want.

Some things (like a head crash) you want to know about right away and react to as soon as possible; for those, auto failover sounds like a good idea.

It’s possible that there is a transient problem that no amount of intelligent probing is going to pin down, and that it will go away without causing too much trouble. In this case you probably don’t want to fail over automatically; you might want to watch it for a little while and, when you can bear it no more, punch a big red failover button.

Sometimes you might want to crank up the sensitivity of the probes anyway just to see what nominal behavior looks like.

PostgresHA essentially does this for you. The great part is that OpenSCG has a free distribution of PostgresHA here!

Here’s how:

PostgresHA Architecture

Your applications (clients) attach to a connection pooler (pgBouncer). The connection pooler simply holds all the client connections and switches them between a smaller number of database connections. Everyone does it; it saves expensive database connection resources. Another nice property is that it can switch these connections to a different database if needed.
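A minimal sketch of what that pooling layer looks like in pgBouncer’s configuration – the database name, host and pool size below are illustrative:

# Write a minimal pgbouncer.ini
cat > pgbouncer.ini <<'EOF'
[databases]
; clients connect to "bigsql" on the pooler; pgBouncer fans them into real backend connections
bigsql = host=127.0.0.1 port=5432 dbname=postgres

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432              ; pgBouncer's customary port
pool_mode = transaction         ; hand the server connection back after each transaction
default_pool_size = 20          ; many clients share these few backend connections
EOF

# Applications point at the pooler instead of the database
psql -h 127.0.0.1 -p 6432 -d bigsql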

PostgresHA Steady State.

PostgresHA Steady State

With Streaming Replication in PostgreSQL, we can have a Master and Slave running synchronously, and through pgBouncer we can route read-only connections to the slave and write transactions to the Master (nice). To protect against a bad hardware failure, you can put the Master and Slave databases in different racks. For Disaster Recovery, you can use log streaming to replicate to a separate data center.
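You can see this steady state from the database’s own point of view with two quick checks (the host names are illustrative):

# On the Master: one row per connected standby, including its sync state
psql -h master-host -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"

# On the Slave: confirms it is running in recovery (read-only standby) mode
psql -h slave-host -c "SELECT pg_is_in_recovery();"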

PostgresHA handles failure scenarios the following way:

Master cannot communicate with Slave

PostgresHA, Master Slave Disconnect

Slave becomes isolated:

PostgresHA, Slave Isolated

Master Isolated:

Master Isolated
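In each of these scenarios the end game, if it comes to that, is promoting the Slave and re-pointing pgBouncer at it so clients reconnect without any application changes. Under the hood the promotion itself boils down to a one-liner, which pgHA drives for you – shown here only as a hedged sketch, with an illustrative data directory:

# Run on the standby that is being promoted to Master
pg_ctl promote -D /var/lib/pgsql/9.2/data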




Big Data

Imagine what Big Data can do for you.

Big Data is a term, like database, that can mean different but similar things depending on context:

  • Big Data is archetypically the Hadoop Distributed File System (HDFS) and MapReduce. The open source Apache project Hadoop includes HDFS and an implementation of MapReduce.
  • Big Data is a collection of related open source projects – an ecosystem.
  • Big Data open source components can be plugged in and out; you can even write your own. When you do this you have a solution in mind, and whether that solution is also called something else, like Data Warehousing, it’s also called Big Data.

If you understand Hadoop (HDFS and the MapReduce runtime environment) you know 90% of Big Data.

MapReduce.

Map and Reduce came out of study in the 1930s into the solvability of mathematical problems and the amount of processing required to get a result.

This field of study is called computability theory.

In mathematics, recursion theorists study relative computability, degree structures and reducibility notions; in computer science, geeks study sub-recursive hierarchies, formal methods and formal languages; and there is a great deal more overlap between the two than you would expect on first hearing this.

Alan Turing and Alonzo Church showed that if you can describe a mathematical procedure unambiguously, you can execute it algorithmically (you have a Language). Knowing this, it was just a question of the horsepower needed to execute the algorithm. The really cool thing about computability theory was that if you could state your question, it would tell you how much grunt would be needed to answer it, which in the 1930s placed this type of solution firmly in the hands of those with the most money. Douglas Adams was perhaps most eloquent at expressing the amount of gear required to answer really tough questions.

Cheap commodity server hardware solved the horsepower problem, so that you no longer needed a computer the size of a planet to answer questions. That hardware, together with the distributed processing and file systems first described in the 1960s that saw first fruit in the 1970s (Ethernet – thank you very much), came together with the MapReduce language in 2008 with Hadoop.

MapReduce is a language (like Database, Galvanization and Big Data, the definition of MapReduce has its own little context dependency that separates the geeks from the rest of humanity).

To make a language work in a computer you need an environment for it to run in. This is called a runtime environment; no ambiguity there. The runtime is what allows the language to do what it does – it understands the data objects you are working with and how they are stored and organized. It makes these objects available to the language and implements the syntax that allows you to interrogate and manipulate that data. So MapReduce is also a runtime, and this is probably the most accurate definition. Runtimes implement the behavior of a language, so while we can say MapReduce is a language, the runtime means that you can actually write code against the data system using any language the runtime supports – Java, Python, C++ and so on. This begs the question, ‘If I can use languages I already have, why use another language?’ That’s a good question, and especially relevant in open source projects (like Java and Hadoop). Open Source is a meritocracy: projects that add little value have small communities, and the MapReduce community is huge because it is a language designed to work with huge distributed processor farms and mind-boggling amounts of multiply redundant data. It is possible to program against this with Java. If you have the Java chops and the intestinal fortitude to handle the amazingly powerful (and huge) set of Java APIs that is part of the Hadoop ecosystem, you also have enough common sense to know when to use Java and when to use MapReduce against your data.

MapReduce runs on top of a distributed file system (HDFS), so it must have a distributed processing model and execution environment. The distributed file system keeps track of what data objects you have and where they are; the distributed processing model keeps track of your executing code (the MapReduce program) and all the other server processes that are needed to coordinate action across a large and potentially geographically dispersed, multiply redundant file system. The execution environment breaks the Map and Reduce steps down into tasks that can then be farmed out and executed in parallel across many cheap commodity servers. Sometimes your answer lies in the Map step, so there may be no need to Reduce.

So what we have so far is:

  • a way to mathematically describe complex problems and get an idea of the processing requirements (the theory),
  • cheap, freely available hardware with loads of disks and processors (the hardware),
  • the software to run distributed file systems and processors (Hadoop),
  • the language needed to break complex data manipulations into a small number of atomic operations that can be linked (MapReduce), and
  • the runtime (once again MapReduce) that implements that language, just as runtimes do for Algol, Fortran or Assembler.

And a full page of blurb on MapReduce before you actually find out what it does.

What does MapReduce do?

No article on Big Data is complete without the obligatory description and schematic of MapReduce.

Here is mine.

It is 1930. Alan Turing and Alonzo Church in England have just come up with computability theory. The Great Depression is just kicking off in America. John D. Rockefeller, the world’s richest man and great philanthropist, sees that America is suffering and it’s the kids that are hit the hardest. He is rich and can afford to pay anyone to do anything, and can buy whatever he likes. He is very smart and employs field observers, so he has first-hand intelligence of the business world around him. The Differential Analyzer was a high-speed moron not yet ready for prime time, but the concepts behind solving massive problems were mathematically provable – he had just read that; Alan Turing said so, right there in the US Library of Congress, where he was sitting wondering how many of these books had pictures of him in them.

He had thought that the US Library of Congress was just dusty old books. Now he knew that man’s real treasure was what lay within these books, in their words of wisdom (that’s why he was in there – reading).

John had an epiphany!

“So – the key is the words and these are the things that I value. I have just MAPPED my problem into KEY VALUE PAIRS – that is totally cool because I just read something about that!”

John knew he could mine the Library of Congress and abstract this knowledge. It would make him not only the richest but also the most omniscient man in the world. Wiser than Solomon, richer than … well, nobody actually, because he was already the richest man in the world.

Heady stuff.

To mine the library, however, the books would have to be destroyed.

John gathers all the kids in America at the Library and says:

John: “Kids, here’s some really sharp scissors, I want you to run in there and tear every page out of those books. $1.00 for every page you tear out!

Then I want you to cut each page into lines and then each line into words. When you are done, just throw the words on the floor. I trust you kids, America has the finest education system in the world, you all know that ‘The’ is the same as ‘THE’ is the same as ‘tHE’ but not the same as ‘thee’, that the ‘P’ is silent in psittacosis, there’s no ‘F’ in rough and there’s no effin Physics because Science is hard and we pay teachers so poorly”

Kids: In unison “Gee big shot, a buck a page, that sucks”

John: “The fun ain’t done yet kids”

Kids: In unison “Tell us more John”

John: “Right, when you are tired of ripping out pages and cutting them up, just grab handfuls of those words on the floor. Lots of ‘em are the same, and that’s a waste. I want to REDUCE the waste. Go through the ones you pick up and for any duplicates, count ‘em up and write it on the back of one of them and drop it on the floor again. All the dupes, bring to me – there’s a dollar for each one. Just keep doing that until the dupes are gone”

So, schematically, what John had each of his little helpers do was this:

The kids rushed in and found that there was one book in the Library with 3 pages in it.  We don’t care about lines and after Fred left with $3 in his sweaty little paws there were 15 words lying on the floor.

Kids have small hands and can only grab about 3 words in each hand, so with Sharon’s first grab she got 3 AA’s (and $2), a BB, an XX and a ZZ. Frank got 3 BB’s (and $2), an XX, a CC and a DD. Robert was out of luck – 3 words only and no dupes.

Robert leapt like a leopard on the next stack of six lying on the floor and got a score – a BB netted him $1.

When Debbie left with the last $1, John D. Rockefeller had the information he wanted – the words in the Library of Congress: 4 BB’s, 3 AA’s, 2 XX’s, 2 EE’s and one each of ZZ, CC, DD and FF.

There you have it, a perfectly executed plan against simple and well-defined data and a single property of that data.
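The same plan translates almost line for line into the classic command-line word count. This is only a local sketch of the idea – on a real cluster the steps run as MapReduce tasks over HDFS – and book.txt is an illustrative file name:

# MAP: tear the "pages" into individual words, one per line, folding case so 'The' = 'THE' = 'tHE'
tr -s '[:space:]' '\n' < book.txt | tr '[:upper:]' '[:lower:]' > words.txt

# SHUFFLE/SORT: bring duplicate words together
sort words.txt > words.sorted

# REDUCE: count each distinct word and show the most common ones
uniq -c words.sorted | sort -rn | head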

As you will see later, not all data is key-value pairs and not all data is in a computer; the most important data is in your head – what is the one piece of information about something that you need in order to solve a problem? Once you start thinking in these terms, there is a proven way of getting that problem solved with computers.

It has been said that to err is human but to really screw up, you need a computer. Likewise, garbage in, garbage out.

Even if you are the richest man in America and can clearly state your vision, there is a translation between thought and action that needs careful consideration. John had his answer, and besides the obvious (a word count), he had the understanding that Information Processing processes information, like words. Knowledge, and thus the wisdom he sought, required processing of a more cerebral kind.

So if you throw into the mix loads of project managers, business analysts, developers, program managers and business managers you can see the possibility of something getting lost in the translation is directly proportional to the outstanding potential Big Data has for answering previously unanswerable questions.

People are going to end up with answers they weren’t expecting because they framed their question incorrectly – and that is the primary source of human inspiration: experience is something you get when you are looking for something else. Ask John.

Having tried and missed, you have all sorts of new questions; you are practiced at asking the questions and more certain of how to express yourself – sort of like life in general.

John could have asked

  • “I wonder how many books have pictures of me in them?”
  • “I wonder what percentage of books contain language offensive to my Baptist sensibilities”
  • “I wonder if anyone has infringed my patents or copyrights”
  • “I wonder what they look like”
  • “I wonder where they live”

Anyhow, John destroyed all his data so he was prevented from these perfectly normal human lines of enquiry.

Data today is forever. We design highly available fault tolerant data systems and for every method of getting data stored electronically there is a method to retrieve it and multiple methods for manipulating it. It’s what Information Processing is all about.

If you are wondering how a modern day Rockefeller could answer these types of questions, consider how Facebook knows who your friends are likely to be by analyzing their photographs or how law enforcement can capture an image of you running a red light, recognize the license plate, identify your mug through the windscreen and send you a bill.

All Big Data type solutions.

All implemented and working right now.

MapReduce splits work across many machines.

Map takes the job, breaks it into many small reproducible pieces and sends them to different machines and runs them in parallel.

Reduce takes the results and puts them back together to get a single answer.

Reduce works recursively. You HAVE to design your Reduce job so that it can consume its own output. These successive Reduce operations over a smaller and smaller set of refined data give you your unique answer. The result of the Map computation is always a list of key/value pairs, and for each set of values that share the same key, Reduce combines them into a single value. You don’t know whether this took 100 steps or 1; the end result looks like a single Map.

In this case, a list of every single word that occurs in the US Library of Congress and a count of how many times it occurs, i.e.:

  • The, 3,000,000,000
  • Cat, 12
  • Sat, 4,553,353,167,187,867,655.076591
  • On, 5
  • Mat, 6

This can be used to fuel still other queries, and just as the data is persistent these days, so is any part of the Reduce operation that is unaffected by your new question. More on this later, but what it means is that once you have done a complete reduction the first time, it is possible to re-run components of your solution without having to go through all the reduction steps again – nice, huh?
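On a real cluster the word-count pipeline sketched earlier is farmed out by the MapReduce runtime. Hadoop Streaming lets any executable play the mapper or reducer, so the same shell commands can be reused as-is; the jar location and HDFS paths below are illustrative for a Hadoop 1.x layout:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -input   /library/books \
  -output  /library/wordcounts \
  -mapper  "tr -s '[:space:]' '\n'" \
  -reducer "uniq -c"

# The answer lands back in HDFS as ordinary files
hadoop fs -cat /library/wordcounts/part-*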

Why do you care?

The reason that people use Big Data is that it can get results in days or hours that would previously take months or years to get.

More than this, it can get answers that were previously unavailable.

Knowing how MapReduce works is a great way of appreciating what is actually happening when people say “We are using Big Data to solve this particular problem.”

Although you don’t have to be Rockefeller to get yourself a Big Data solution, you should also know that solving problems with Big Data is still not cheap. You need:

  • Hard core open source expertise
  • Network administration and Linux administration with an emphasis on HA
  • Hardware expertise – Both servers and networks
  • Data architects
  • People to move data
  • People to build MapReduce applications
  • People to performance test and tune
  • People to manage it all

Like owning a car, you can buy it yourself or you can lease it. It’s not that Big Data requires esoteric equipment and storage; it just needs lots of it. There are Cloud-based products that mean you can literally run a Big Data behemoth without owning a lick of gear. The Cloud is designed for this (Distributed Computing), and since you pay for storage, Cloud service providers love Big Data.

However you do it, Big Data might not be cheap, but it’s invaluable.

If you want to know what other people have used it for, look at the Big Data Implementations section below.

HDFS – Distributed file system

When you Map a problem you break it down into key value pairs and spread it out across a number of processors – abstract it to its simplest components and give them to lots of workers.

HDFS is a file system specifically designed to handle LOTS of key-value pair data. The data is organized specifically to be tolerant of faults (hardware failures like ‘backhoe fade’), always available and quickly available. In addition to the architecture of the storage system (key-value pairs), the data itself is duplicated, by default three times. This duplication is done on physically separate hardware (eliminating the effects of backhoe fade), so there are essentially online backups of your live data – since the data is in three places it is essentially a hot standby, and since all three copies are kept up to date you can read from and write to these places – this makes it quick.
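A few everyday HDFS commands make the point concrete (the paths are illustrative, and the default replication factor of 3 can be changed per file):

# Copy a local file into HDFS; it is stored as replicated blocks across DataNodes
hadoop fs -put books.txt /library/books.txt

# List it back; the second column of the listing is the replication factor
hadoop fs -ls /library

# Raise the replication factor for a particularly precious file
hadoop fs -setrep 5 /library/books.txt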

HDFS and Hadoop

HDFS relies on nodes: there is a Master “NameNode” that knows where each piece of data needed for processing resides on the file system, and the “DataNodes” that hold it. The NameNode keeps track of the multiple copies of data required to achieve protection against data loss. The NameNode is duplicated in a Secondary NameNode; however, this is not a hot standby. In the event that the Secondary NameNode needs to take over from the primary, it has to load its namespace image into memory, replay the edit log and service enough block reports from the DataNodes to cross the threshold of operability.

That is it, really. It’s pretty straightforward. MapReduce and HDFS are designed to scale linearly (i.e. there is a direct linear proportion between how much money you spend and the size of the bang you get – like nuclear weapons). This stuff would pretty much be useless unless it could get staggeringly big and still be manageable.

There is some magic behind the scenes, however: keeping track of the state and location of the data and of the processes operating on it. To make all of this work there is an army of specialized distributed processes coordinating actions on the data AND coordinating actions between themselves. There are:

  • Distributed Lock Managers
  • Distributed Schedulers
  • Distributed Clients
  • Distributed Coordinators

The detail on how they work is as easily accessible as the early work on computability theory (you can get hold of it). Whether you can understand it or not is a different matter.

Hadoop.

Hadoop came out in 2008, and its primary components (HDFS and MapReduce) conceptually look like this.

Hadoop, MapReduce and HDFS

The Hadoop ecosystem.

Big Data is more than just the Apache Hadoop kernel; it is a collection of related projects. Some of them are mandatory (you cannot do Big Data without them) and others are optional (like data loaders and flow processing languages).

The mandatory Hadoop projects are HDFS, MapReduce and Common.  Common is a collection of components and interfaces for distributed processing and I/O. In addition to the file system itself and the language, Common implements the required components of distributed computing (autonomous computing entities with their own local memory that communicate through message passing).

There are a number of projects in the ecosystem that complement the mandatory projects. Below are some of the higher-level projects:

  • ZooKeeper

A distributed coordination service. HDFS relies on nodes: there is a Master “NameNode” that knows where each piece of data needed for processing resides on the file system, the “DataNodes”. The NameNode, even if it is duplicated in a Secondary NameNode, is still a single point of failure. Promoting a Secondary NameNode to be the primary can take hours on really large clusters.

The way that this is solved is to have a pair of NameNodes in an active-standby configuration. To make this feasible the NameNodes must each be able to share the edit log (a file that keeps track of all the changes that are made so that if needed it can be replayed to bring the system back into sync.)

To make HDFS Highly Available (HA) each DataNode will send block reports to both NameNodes (block mappings are stored in the NameNode’s memory, not on disk.)

In this way, the standby NameNode has its namespace loaded in memory already and is servicing block requests.

Any clients that attach to the Master must also be able to handle NameNode failover; it’s no use if the system fails over gracefully but the clients attached to that system cannot. To make Hadoop HA, ZooKeeper manages the shared storage that the NameNodes need to access the edit logs.

ZooKeeper

  • Hive

A Data Warehouse project. Data Warehouse systems are predominantly relational data. More and more unstructured data is being stored in non-relational file systems like HDFS, and organizations that have large Data Warehouse operations (Business Analysts) want to continue to use what they know, but on this new rich media. The Hive project makes all your data look relational (whether it is or not) and implements an SQL dialect (see the sketch after this list). That last bit, “implements an SQL dialect”, is being pedantic because it’s not 100% SQL; however, as far as your BAs are concerned it’s near enough for them to continue doing what they know best. “Making your data look relational” is a simplification too – the pedantic version is “implements relational table abstractions”.

  • HBase

Hadoop implements HDFS as a file system, but it does not have to be HDFS. HBase is an open source column oriented key value NoSQL database built on top of HDFS (so it is distributed) and supports MapReduce and good response times for point queries although it does not implement an SQL dialect. Column oriented databases have advantages over row oriented databases depending on your workload and can offer additional benefits in data compression because of their architecture. If you can design a Big Data architecture from scratch, you need to look at what type of solution it is going to be (i.e. OLAP, OLTP etc.) and how you are going to be using it.
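Here is what the Hive item above looks like in practice: a file already sitting in HDFS is exposed as a table and queried with Hive’s SQL dialect. A minimal sketch – the table, columns and path are illustrative:

hive -e "
  CREATE EXTERNAL TABLE page_views (viewed_at STRING, user_id INT, url STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/data/page_views';

  SELECT url, count(*) AS hits
    FROM page_views
   GROUP BY url
   ORDER BY hits DESC
   LIMIT 10;"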

There are also projects in their own right that deal with utility or specialized functions, such as:

  • Flow Processing Languages (Pig)
  • Cross Language Serialization (Avro)
  • Extract Transform Load (Sqoop)
  • Scheduler (Oozie)

This is not a definitive list of projects; there are many more, right down to JDBC drivers, database connection poolers and logging packages. This post gives you a summary of the mandatory ecosystem projects, an introduction to the main optional projects that you are likely to come across, and an idea of the utility projects available. Other posts on specific Big Data implementations (like Data Warehousing) will cover their list of projects at an appropriate level of depth.

Big Data Implementations.

Big Data is conceptually a collection of data and the means to store it (a file system) and access it (a runtime).

This means that all of the components are swappable including the very core components that were the archetypal definition of Big Data – the file system (HDFS) and the execution environment (MapReduce). The resulting system will still be Big Data, and if any of this still fails to meet your needs you can write your own and this will still be Big Data (after some smart people look at it first).

It’s like Granddad’s axe. The handle has been replaced 3 times and the head once, but it’s still Granddad’s axe.

Alternatives to MapReduce and HDFS started to get published in 2009 (a sign of a healthy ecosystem), and the people deciding that there needed to be alternatives were predominantly the people who wrote this stuff in the first place – Google (a sign of Google eating their own dog food). There are several excellent articles on these alternatives. What is encouraging is that the initial Google projects (like Pregel) have open source counterparts (Giraph); what is more encouraging is that there are Google projects, as well as Big Data solutions out there, that do not yet have an open source counterpart.

In this case the definition of Big Data for you could be your very own open source project and community (be careful what you wish for.)

Big Data is also a solution. It is a general data processing tool built to address the real world use case of speedy access and analytics of huge datasets.

Examples of these types of solutions are:

  • Facebook uses it to handle 50 Billion photographs
  • President Barack Obama committed $200 million to Federal Initiatives to fight cancer and traffic congestion with Big Data. Mr. Obama knows the value of Big Data – it helped him get re-elected.
  • BigSQL – the poster child for Data Warehousing with Big Data.
  • Decoding the Human genome – what took 10 years to do originally, Big Data does in a week.
  • NASA uses it to observe and simulate the weather
  • The NSA are definitely going to use it for the betterment of the human condition when they fire up their yottabyte capable Big Data machine on information they capture over the internet. 1000000000000000000000000 bytes – it’s a lot bigger than it looks, trust me. Just imagine what those little rascals could get up to with all of that lovely data.
  • Architects use it to enforce a consistent design approach and increase productivity (Nice to hear, I assume it also makes buildings safer as well as profits bigger).
  • MIT hosts the Intel Science and Technology Center for Big Data (If you can’t do, teach. Big Data eats Intel chips like a pig eats truffles).
  • McKinsey bills millions on their MIT graduate Big Data consultants.
  • Made finding the God Particle (Higgs Boson) possible.

As you can see, the limit of natty little Big Data implementations is your imagination.

 Imagine What Big Data can do for You.


Introducing PostgresHA

OpenSCG is proud to publicly announce the PostgresHA project. It is a 100% free and open source version of PostgreSQL that includes a set of proven scripts supporting streaming replication, read-only slaves, push-button or automatic failover for HA, and multi-datacenter asynchronous replication for Disaster Recovery.

Check it out for yourself at http://PostgresHA.org

–Luss