Integrating Cassandra, Spark, PostgreSQL and Hadoop as a Hybrid Data Platform – Part 2

In the previous post, we provided business and architectural background
for the Postgres FDWs (foreign data wrappers) that we are developing for
Spark, Hadoop and Cassandra. In particular, we highlighted the key
benefits of bringing Cassandra and PostgreSQL together.

With this post, we will start taking a more technical look at the
Cassandra FDW.

The C* FDW speaks natively with Cassandra on two levels; it:

  • uses the binary CQL protocol instead of the legacy Thrift protocol.
  • directly relies on the DataStax Native C++ driver for Cassandra.

The DataStax C++ driver is performant and feature-rich; various load
balancing and routing options are available and configurable. We are
already making use of some of these features and plan to expose more of
them to our users.

For a Postgres user exploring Cassandra, defaults such as automatic
inclusion of the ALLOW FILTERING clause are useful because they let
familiarity build gradually, especially in small development
environments. Our intent is to support tuning for large environments
but to default to a configuration geared toward existing PostgreSQL
users.
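
To make the ALLOW FILTERING default concrete, here is a minimal Python
sketch of how a wrapper could assemble the CQL text for a foreign-table
scan. It is an illustration only, not the FDW's actual code: the
function build_cql_select and its allow_filtering parameter are
hypothetical names, and the predicates are assumed to arrive already
rendered as CQL fragments.

```python
def build_cql_select(keyspace, table, columns, predicates=(),
                     allow_filtering=True):
    """Build a CQL SELECT string for a hypothetical foreign-table scan.

    allow_filtering defaults to True to mirror the beginner-friendly
    default described above; a tuned deployment would pass False.
    """
    cql = "SELECT {} FROM {}.{}".format(", ".join(columns), keyspace, table)
    if predicates:
        # Predicates are assumed to be pre-rendered CQL fragments.
        cql += " WHERE " + " AND ".join(predicates)
    if allow_filtering:
        # ALLOW FILTERING is the last clause of a CQL SELECT statement.
        cql += " ALLOW FILTERING"
    return cql


query = build_cql_select("ks", "events", ["id", "kind"], ["kind = 'login'"])
strict = build_cql_select("ks", "events", ["id"], allow_filtering=False)
```

With the default, Cassandra accepts a query that filters on a column
that is neither part of the key nor indexed, at the cost of a possibly
expensive scan; with the flag turned off, such a query is rejected,
which is usually what a large production cluster wants.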

At this point, let us consider whether we are introducing a new SPOF by
using PostgreSQL with a Cassandra system. We believe not; a PostgreSQL
node at the edge of a Cassandra cluster – as a transactional or open-SQL
end point – is not at all the same as a central master node critical to
the operation of an entire cluster. We see some trade-offs, but mostly
we see benefits in bringing PostgreSQL to Cassandra in this way, as we
intend to show through this series.

In the next post, we will show you how to get started with the Cassandra
FDW.

Integrating Cassandra, Spark, PostgreSQL and Hadoop as a Hybrid Data Platform

Today many organizations struggle to keep up with their database
requirements, for example, to:

  • store and analyze high-velocity and rapidly-growing data such as logs,
    package tracking events, sensor readings and machine-generated
    data;
  • ensure 24/7 availability of customer-facing websites, services and
    apps even when a subset of their data centers, servers or data are
    unavailable;
  • support fast-growing internet-scale businesses by adding relatively
    inexpensive data servers rather than requiring million-dollar
    investments in high-end servers and storage.

Our industry is increasingly producing and exploring various Open Source
systems to provide solutions for requirements like these. However, many
such systems intending to offer degrees of Scalability and
Availability choose architectures that impose inherent limitations.

Many of these architectures have a node or a collection of nodes that
are treated as special. Think Master-Slave, NameNode-DataNode and so
forth. While each of these models serves a different set of use cases,
a common attribute across them is that they have a SPOF (Single Point
of Failure). Even when they offer some level of multiplicity to deal
with the SPOF issue, the problems continue: these special nodes can
become bottlenecks for the operations that only they are allowed to
carry out. Capacity Planning, Backup and Recovery, Fault
Tolerance, Disaster Recovery and similar areas of operation all
become more complex. Moreover, the non-special nodes are typically
underutilized or entirely passive. Many of these architectures make it
virtually impossible to achieve peta-scale, multi-thousand-node clusters
with linear growth and failure tolerance atop today’s
dynamically-orchestrated infrastructure.

Enter Cassandra – A peer-to-peer, multi-datacenter active-active,
peta-scale, fault-tolerant distributed database system. Nowadays, it is
hard not to have heard of this excellent system as its user-base
continues to grow. The key point is that its peer-to-peer architecture
is the basis for its SPOF-free operation, built on the understanding
that failures are the norm in clustered environments. Cassandra is also
well known for its low latency relative to many other big data systems.
It is in use by over 1500 organizations including Netflix, eBay,
Instagram and CERN. To get an idea of the scale: Apple’s production
deployment has long been known in the Cassandra community to comprise
75,000 nodes storing over 10 PB, and at the Cassandra Summit in
September last year it was reported to have exceeded 100,000 nodes.

We are great believers in Cassandra and Spark and are building a hybrid
data platform bringing the benefits of these systems to PostgreSQL. We
also hope that the benefits of the PostgreSQL platform will have a wider
reach through this. Our distribution, Postgres by BigSQL, provides easy
access to these two systems through our FDW extensions CassandraFDW and
HadoopFDW. The HadoopFDW extension provides access not just to Hadoop
but also to Spark, which uses the same underlying network protocol and
SQL parser.

The combined array of advanced features that these two FDWs support is
impressive: write support (INSERT/UPDATE/DELETE), predicate pushdown,
IMPORT FOREIGN SCHEMA, and JOIN pushdown. We believe that, of all the
externally-maintained FDWs, these two extensions represent the cutting
edge of PostgreSQL FDW technology as an implementation of SQL/MED for
big data systems.
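
Of the features listed above, predicate pushdown is perhaps the easiest
to picture: conditions the remote system can evaluate are sent to it, so
fewer rows travel back, and the remaining conditions are checked on the
PostgreSQL side. The Python sketch below illustrates the idea only. The
split between REMOTE_OPS and LOCAL_OPS is an assumption made for the
example, not the actual rule set of either FDW, and the remote store is
simulated with an in-memory list.

```python
import operator

# Operators this sketch ASSUMES the remote store can evaluate itself.
REMOTE_OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}
# Operators this sketch evaluates only on the local (PostgreSQL) side.
LOCAL_OPS = {"!=": operator.ne}
ALL_OPS = {**REMOTE_OPS, **LOCAL_OPS}


def split_quals(quals):
    """Partition (column, op, value) predicates into remote and local."""
    remote = [q for q in quals if q[1] in REMOTE_OPS]
    local = [q for q in quals if q[1] not in REMOTE_OPS]
    return remote, local


def matches(row, quals):
    """Return True when the row satisfies every predicate in quals."""
    return all(ALL_OPS[op](row[col], val) for col, op, val in quals)


def scan(remote_rows, quals):
    """Simulate a foreign scan; return (result rows, rows shipped)."""
    remote, local = split_quals(quals)
    # Stand-in for the remote system applying the pushed-down predicates.
    shipped = [r for r in remote_rows if matches(r, remote)]
    # Only the shipped rows cross the network; the rest is checked locally.
    return [r for r in shipped if matches(r, local)], len(shipped)


rows = [
    {"id": 1, "kind": "login", "ms": 5},
    {"id": 2, "kind": "login", "ms": 50},
    {"id": 3, "kind": "logout", "ms": 7},
]
result, shipped = scan(rows, [("kind", "=", "login"), ("ms", "!=", 50)])
```

Here the equality test on kind is pushed down, so only two of the three
rows are shipped, and the inequality on ms is then applied locally to
leave a single row. Without pushdown, all three rows would be fetched
and filtered on the PostgreSQL side.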

With that context, we will focus on the CassandraFDW in the next blog
post in this series.