Today many organizations struggle to keep up with their database
requirements, for example, to:

  • store and analyze high-velocity and rapidly growing data such as logs,
    package tracking events, sensor readings and machine-generated
    streams.
  • ensure 24/7 availability of customer-facing websites, services and
    apps even when a subset of their data centers, servers or data are
    offline.
  • support fast-growing internet-scale businesses by adding relatively
    inexpensive data servers rather than requiring million-dollar
    investments in high-end servers and storage.

Our industry is increasingly producing and exploring open source
systems to address requirements like these. However, many such systems
that aim to offer scalability and availability choose architectures
that impose inherent limitations.

Many of these architectures have a node or a collection of nodes that
are treated as special. Think Master-Slave, NameNode-DataNode and so
forth. While each of these models serves a different set of use cases,
a common attribute across them is that they have a single point of
failure (SPOF). Even when they offer some level of multiplicity to deal
with the SPOF issue, the problems continue: these special nodes can
become bottlenecks for the operations that only they are allowed to
carry out. Capacity planning, backup and recovery, fault tolerance,
disaster recovery and similar areas of operation all
become more complex. Moreover, the non-special nodes are typically
underutilized or entirely passive. Many of these architectures make it
virtually impossible to achieve peta-scale, multi-thousand-node clusters
with linear growth and failure tolerance atop today’s
dynamically-orchestrated infrastructure.

Enter Cassandra: a peer-to-peer, multi-datacenter active-active,
peta-scale, fault-tolerant distributed database system. Nowadays it is
hard not to have heard of this excellent system, as its user base
continues to grow. The key point is that its peer-to-peer
architecture is the basis for its SPOF-free operation, designed with
the understanding that failures are the norm in clustered environments.
Cassandra is also well known for its low latency relative to many
other big data systems. It is in use by over 1500 organizations
including Netflix, eBay, Instagram and CERN. To get an idea of the
scale, Apple's production deployment has long been known in the
Cassandra community to comprise 75,000 nodes storing over 10 PB, but in
September last year at the Cassandra Summit, their deployment was
reported to have exceeded 100,000 nodes.

We are great believers in Cassandra and Spark and are building a hybrid
data platform that brings the benefits of these systems to PostgreSQL.
We also hope that this work will give the PostgreSQL platform itself a
wider reach. Our distribution, Postgres by BigSQL, provides easy
access to these two systems through our FDW extensions, CassandraFDW
and HadoopFDW. The HadoopFDW extension provides access not just to
Hadoop but also to Spark, which uses the same underlying network
protocol and SQL parser.

The combined array of advanced features that these two FDWs support is
impressive: write support (INSERT/UPDATE/DELETE), predicate pushdown,
IMPORT FOREIGN SCHEMA, and JOIN pushdown. We believe that, of all the
externally maintained FDWs, these two extensions represent the
cutting edge of PostgreSQL FDW technology as an implementation of
SQL/MED for big data systems.
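To give a feel for how these features surface to a PostgreSQL user, the
sketch below follows the standard SQL/MED foreign-table workflow. It is
illustrative only: the server name, keyspace, table, columns and option
names (host, username, password) are assumptions for this example, not
syntax taken from the extensions' documentation.

```sql
-- Illustrative sketch of the generic FDW workflow; names and options
-- here are assumptions, not verified CassandraFDW syntax.
CREATE EXTENSION cassandra_fdw;

CREATE SERVER cass_serv FOREIGN DATA WRAPPER cassandra_fdw
    OPTIONS (host '127.0.0.1');                 -- assumed option name

CREATE USER MAPPING FOR public SERVER cass_serv
    OPTIONS (username 'cassandra', password 'cassandra');

-- IMPORT FOREIGN SCHEMA pulls table definitions from a Cassandra
-- keyspace (hypothetical keyspace name) instead of hand-writing them:
IMPORT FOREIGN SCHEMA my_keyspace FROM SERVER cass_serv INTO public;

-- With write support, DML on the foreign table is forwarded to
-- Cassandra, and predicates like this WHERE clause can be pushed down:
INSERT INTO events (id, payload) VALUES (1, 'example');
DELETE FROM events WHERE id = 1;
```

The point of the sketch is that, once the foreign tables exist, big data
systems are queried with ordinary SQL and the pushdown features decide
how much work happens remotely.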

With that context, we will focus on the CassandraFDW in the next blog
post in this series.