Postgres – Hadoop Data Trends

Every so often I like to look at trending data to get a feel for where technology is heading and to draw some insights. Recently, with the overwhelming hype around BigData and Hadoop, I decided to take a closer look at the relative search traffic for Postgres and Hadoop.

Google Trends is a quick way to compare the relative popularity of search terms over time. For the trends below, I used worldwide data.

Starting in 2011, you can see that Postgres is a more popular search term than Hadoop, though Hadoop is clearly trending up relative to Postgres over the year.

In 2012, the two terms are close, with Hadoop continuing to narrow the gap.

In 2013, Hadoop passes Postgres, but not by much. I personally find this interesting given the hype that exists around Hadoop and BigData in general. Swapping the word Hadoop for BigData shows similar results. Based on the media hype, I would not have predicted this outcome.

I also noted that Postgres searches, like those for almost all relational databases, trended down from 2004 to 2009 and then flattened out around 2010. This was followed by an uptick in searches over the last couple of years. I attribute this to the focus on data and to folks understanding in greater detail the importance of data management in their enterprises, big and small. I believe the BigData challenge is driving people to think about their data more holistically, and folks are realizing that one size does not fit all. At OpenSCG, we anticipate that multiple data store technologies will continue to be leveraged to meet the diverse needs that different types of data present.

One last random observation from this exercise: Oracle searches still outweigh Postgres searches 12 to 1, down from approximately 20 to 1 in 2004. The Postgres community still has plenty of room for growth.
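
To put the two ratios side by side, here is a small arithmetic sketch. It assumes the Oracle-to-Postgres figures can be read as simple volume ratios, which is a simplification because Google Trends reports normalized relative interest rather than raw counts. The function name `postgres_share` is made up for this illustration.

```python
def postgres_share(oracle_to_postgres: float) -> float:
    """Postgres's share of combined Oracle + Postgres searches,
    given an Oracle:Postgres ratio of N to 1 (illustrative only)."""
    return 1.0 / (oracle_to_postgres + 1.0)


share_2004 = postgres_share(20)  # ratio quoted for 2004 (approximate)
share_now = postgres_share(12)   # ratio quoted for the latest data

# How much Postgres gained relative to Oracle between the two readings.
relative_gain = 20 / 12

print(f"2004 share:    {share_2004:.1%}")
print(f"current share: {share_now:.1%}")
print(f"relative gain: {relative_gain:.2f}x")
```

Under that assumption, Postgres search interest grew by roughly two thirds relative to Oracle, yet it is still well under a tenth of the combined total.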

Good Enough?

Building upon my previous blog post, where I declared Postgres was "good enough", I want to explore what "good enough" means. I will use an aircraft design analogy. (You might remember me exploring a Postgres/aircraft parallel before.)

In aircraft design, everything is a tradeoff between weight, cost, reliability, and safety. Let's look at safety — what would be the safest material to use in constructing an airplane? One logical selection would be to make it from the same material as the flight data recorder, i.e. the black box. It is designed to withstand the most serious crashes, so obviously, it would be the safest.

Of course, the problem is that an aircraft made with material of the same strength as a flight data recorder would be so heavy it would never get off the ground. A material must be chosen that is safe enough, and light enough so the plane can fly. Obviously, measuring best in only one dimension leads to impractical problems in other areas.


Postgres Is Good Enough

With the increased attention Postgres is getting, many ask whether Postgres has the same features as Oracle, MS SQL, or DB2. Of course, the answer is that Postgres doesn't have all their features, and never will, but it is also true that those other databases will never have all of Postgres's features. The better question is whether Postgres has all the features you need for a successful deployment, and the answer to that is usually an enthusiastic "yes".

This gets into the larger question of when something is good enough. Good enough sounds negative, like you are giving up something to settle for a partial solution. However, good enough is often the best option, as the negative aspects of an ideal solution (cost, overhead, rigidity) often make good enough best. In response to Oracle's announcement of new hardware, this Slashdot comment captures how great good enough can be:

I used to work in exclusively Sun shops, and I've dealt with Oracle for years. There's little that the hardware and their database can do that can't be replicated by x64 and something like Postgres with some thought behind your architecture. For certain, the features they do have are not cost effective against the hundreds of thousands of dollars you pay for Oracle DB licensing, and the premium you pay for SPARC hardware and support.


Postgres as a Data Platform

With all the excitement surrounding Postgres, I have heard speculation about why Postgres has become such a big success. Many point to our exemplary community structure, while others point to our technical rigor. One item of particular scrutiny has been our extensibility; I was asked about this twice at the recent New York City PGDay.

Postgres (named as post-Ingres) was designed to be extensible, meaning not only can users create server-side functions, but also custom data types, operators, index types, aggregates, and languages. This extensibility is used by our contrib extensions, PostGIS, and many others. Someone said it was like Postgres developed the SQL language, then laid the data types on top. Will Leinweber, in his NYC presentation, described Postgres as a "data platform rather than a database", and his use of hstore, JSON, and PLV8 illustrated the point well.

Extensibility certainly isn't the only reason for Postgres's success, but it is part of a larger set of ideal circumstances that allows Postgres to thrive in an environment where many other relational databases are struggling.


Table Partitioning Needs Improvement

Postgres is very good at adding features built upon existing features. This process was used when table partitioning was added in Postgres 8.1 in 2005. It was built upon three existing features:

Though constraint_exclusion (which controls partition selection) has received some usability improvements since 2005, the basic feature has remained unchanged. Unfortunately, the feature as implemented has several limitations which still need to be addressed:

  • Partitioning requires users to create CHECK constraints to define the contents of each partition.
  • Because of CHECK constraints, there is no central recording of how the partitions are segmented.
  • Also because of CHECK constraints, there is no way to rapidly evaluate the partition contents — each CHECK constraint must be re-evaluated. Due to this expensive operation, partition selection is done while the plan is being created, rather than executed, meaning that only query constants can be used to select partitions — joining to a partitioned column cannot eliminate unnecessary partition scans.
  • Evaluating CHECK constraints also adds serious overhead for systems using thousands of partitions.
  • To route write queries to the proper partition, triggers or rules must be created on the parent table. They must know about all partitions, and anytime a partition is created or removed, the trigger or rule must be updated, unless some type of partition-name wildcard pattern is used in the trigger code.
  • There is no parallel partition scanning, though that is more due to the lack of Postgres parallelism, rather than a partitioning limitation.
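
The limitations listed above can be illustrated with a toy model in Python. This is not how the Postgres planner is implemented; it only mimics the behavior described: every CHECK constraint is tested one by one, pruning works only when a constant is known at plan time, and write routing needs a full list of partitions. The names `Partition`, `plan_scan`, `route_insert`, and the `sales_*` tables are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Partition:
    """A child table whose CHECK constraint is lo <= key < hi (hypothetical)."""
    name: str
    lo: int
    hi: int

    def check(self, key: int) -> bool:
        # Stand-in for evaluating this partition's CHECK constraint.
        return self.lo <= key < self.hi


def plan_scan(partitions: List[Partition], key: Optional[int]) -> List[str]:
    """Return the names of the partitions a plan would have to scan.

    key is a constant known at plan time, or None when the value only
    becomes known during execution (for example, a value from a join).
    """
    if key is None:
        # No constant to test against, so nothing can be excluded.
        return [p.name for p in partitions]
    # Each constraint is tested in turn, so cost grows with partition count.
    return [p.name for p in partitions if p.check(key)]


def route_insert(partitions: List[Partition], key: int) -> str:
    """Mimic a routing trigger: it must know about every partition."""
    for p in partitions:
        if p.check(key):
            return p.name
    raise ValueError(f"no partition accepts key {key}")


# One partition per year, 2010 through 2013.
parts = [Partition(f"sales_{year}", year, year + 1) for year in range(2010, 2014)]

constant_plan = plan_scan(parts, 2012)  # a query constant allows pruning
join_plan = plan_scan(parts, None)      # a join value does not
target = route_insert(parts, 2013)

print(constant_plan, len(join_plan), target)
```

In this model, adding a `sales_2014` partition means the list passed to `route_insert` must change too, which mirrors the need to update the trigger or rule whenever a partition is created or removed.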

There is a wiki related to table partitioning that covers some of this in detail.


The Middleware Porting Challenge

Last month I visited India for EnterpriseDB and talked to many government and corporate groups. We don't have many independent Postgres community members in India, so I was surprised by the serious interest in Postgres and the number of large deployments under consideration.

The issue of porting applications to Postgres was discussed. Normally, when I think of porting to Postgres, I think of company-created applications that just require a proper Postgres interface library and simple SQL query modifications, and our community maintains porting guides for many databases, e.g., Oracle, MySQL, MSSQL, and others. In addition, most open source application languages are rapidly moving to standardize on Postgres, so I thought most of the big hurdles for migration were solved.

However, there is one area that still has significant barriers for Postgres migration, and that is middleware, particularly proprietary middleware. If your application is written to use a middleware API, and that middleware doesn't support Postgres, there isn't much possibility of porting to Postgres unless you change your middleware software, and changing middleware software involves not just modifying queries, but rearchitecting the entire application to use a different middleware API.


The Future of Relational Databases

A few months ago I wrote an article about how NoSQL databases are challenging relational databases, and it has recently been published. My article's goal was to explain that while NoSQL databases have advantages over relational databases, there are many disadvantages that are often overlooked. I also explained how relational databases, particularly Postgres, are adding features to adapt to NoSQL workloads.


Parallelism Roadmap

In December of 2011, I blogged about the increasing need for parallelism in the Postgres backend. (Client applications have always been able to do parallelism with subprocesses, and with threads since 2003.)

Thirteen months later, I have added two parallel code paths (1, 2) to pg_upgrade. I have learned a few things in the process:

  • parallelism can produce dramatic speed improvements (4x vs 4%)
  • adding parallelism isn't difficult to code, even for MS Windows
  • only certain tasks can benefit from parallelism

Using pg_upgrade as an example, parallelism can yield a 10x performance improvement, and it only took me a few weeks to accomplish. However, to get 10x improvement, you have to have multiple large databases, and be using multiple tablespaces — others will see more moderate gains. Fortunately, I have also improved pg_upgrade performance by 5x even without parallelism.
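
Why the gain depends so heavily on having multiple large databases can be seen with a small scheduling model in Python. This is a sketch with made-up job durations, not pg_upgrade's actual scheduling logic: each database is treated as one indivisible job, and jobs are handed, longest first, to whichever worker frees up earliest. The function names `makespan` and `speedup` are chosen for this illustration.

```python
import heapq
from typing import List


def makespan(durations: List[float], workers: int) -> float:
    """Wall-clock time when each job goes to the earliest-free worker."""
    finish_times = [0.0] * workers
    heapq.heapify(finish_times)
    for duration in sorted(durations, reverse=True):
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + duration)
    return max(finish_times)


def speedup(durations: List[float], workers: int) -> float:
    """Serial time divided by parallel wall-clock time."""
    return sum(durations) / makespan(durations, workers)


# Ten equally large databases: work spreads evenly across ten workers.
even = speedup([10.0] * 10, workers=10)

# One dominant database plus nine tiny ones: the big one sets the pace.
skewed = speedup([91.0] + [1.0] * 9, workers=10)

print(f"even workload:   {even:.1f}x")
print(f"skewed workload: {skewed:.2f}x")
```

Both workloads contain the same total amount of work, but only the evenly split one approaches a 10x gain; the skewed one barely improves because the largest job cannot be divided in this model.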
