Integrating Cassandra, Spark, PostgreSQL and Hadoop as a Hybrid Data Platform – Part 2

In the previous post, we provided a business and architectural
background for the Postgres FDWs that we are developing for Spark,
Hadoop and Cassandra. In particular, we highlighted the key benefits of
bringing Cassandra and PostgreSQL together.

With this post, we will start taking a more technical look at the
Cassandra FDW.

The C* FDW speaks natively with Cassandra on two levels; it:

  • uses the binary CQL protocol instead of the legacy Thrift protocol.
  • directly relies on the DataStax Native C++ driver for Cassandra.

The DataStax C++ driver is performant and feature-rich; various load
balancing and routing options are available and configurable. We are
already making use of some of these features and plan to provide more of
these to our users.
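From the Postgres side, using the FDW looks much like any other foreign data wrapper. Here is a minimal sketch; the server, keyspace and table names are hypothetical, and the exact OPTIONS keys are assumptions rather than a reference (the next post walks through setup properly):

CREATE EXTENSION cassandra_fdw;

-- Hypothetical cluster coordinates; adjust host and port for your environment.
CREATE SERVER cass_srv FOREIGN DATA WRAPPER cassandra_fdw
    OPTIONS (host '127.0.0.1', port '9042');

CREATE USER MAPPING FOR public SERVER cass_srv
    OPTIONS (username 'cassandra', password 'cassandra');

-- Map a hypothetical Cassandra table example_ks.readings into Postgres.
CREATE FOREIGN TABLE readings (id int, sensor text, temp float)
    SERVER cass_srv
    OPTIONS (schema_name 'example_ks', table_name 'readings', primary_key 'id');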

For a Postgres user exploring Cassandra, defaults such as the automatic
inclusion of the ALLOW FILTERING clause allow gradual familiarization
and are especially useful in small development environments. Our intent
is to support tuning for large environments while defaulting to a
configuration geared toward existing PostgreSQL users.
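For example (a hedged sketch against the hypothetical readings table above): temp is not part of the Cassandra primary key, so a raw CQL query filtering on it would be rejected unless ALLOW FILTERING is appended; with the FDW adding that clause automatically, a plain Postgres query still works:

-- The generated CQL needs ALLOW FILTERING because temp is not a key column;
-- the FDW takes care of appending it.
SELECT sensor, temp
FROM readings
WHERE temp > 30;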

At this point, let us consider whether we are introducing a new SPOF by
using PostgreSQL with a Cassandra system. We believe not; a PostgreSQL
node at the edge of a Cassandra cluster – as a transactional or open-SQL
end point – is not at all the same as a central master node critical to
the operation of an entire cluster. We see some trade-offs but mostly
we see benefits of bringing PostgreSQL to Cassandra in this way as we
intend to elucidate through this series.

In the next post, we will show you how to get started with the Cassandra
FDW.

Integrating Cassandra, Spark, PostgreSQL and Hadoop as a Hybrid Data Platform

Today many organizations struggle to keep up with their database
requirements, for example, to:

  • store and analyze high-velocity and rapidly-growing data such as logs,
    package tracking events, sensor readings and machine-generated
    streams.
  • ensure 24/7 availability of customer-facing websites, services and
    apps even when a subset of their data centers, servers or data are
    offline.
  • support fast-growing internet-scale businesses by adding relatively
    inexpensive data servers rather than requiring million-dollar
    investments in high-end servers and storage.

Our industry is increasingly producing and exploring various Open Source
systems to provide solutions for requirements like these. However, many
such systems that aim to offer scalability and availability choose
architectures that impose inherent limitations.

Many of these architectures have a node or a collection of nodes that
are treated as special. Think Master-Slave, NameNode-DataNode and so
forth. While each of these models serves a different set of use cases,
a common attribute across them is that they have a SPOF (Single Point
of Failure). Even when they offer some level of multiplicity to deal
with the SPOF issue, the problems continue: these special nodes can
become bottlenecks for the operations that only they are allowed to
carry out. Capacity Planning, Backup and Recovery, Fault Tolerance,
Disaster Recovery and similar areas of operation all
become more complex. Moreover, the non-special nodes are typically
underutilized or entirely passive. Many of these architectures make it
virtually impossible to achieve peta-scale, multi-thousand-node clusters
with linear growth and failure tolerance atop today’s
dynamically-orchestrated infrastructure.

Enter Cassandra – a peer-to-peer, multi-datacenter, active-active,
peta-scale, fault-tolerant distributed database system. Nowadays, it is
hard not to have heard of this excellent system as its user base
continues to grow. The key point is that its peer-to-peer architecture
is the basis for its SPOF-free operation, built on the understanding
that failures are the norm in clustered environments. Cassandra is also
well known for lower latency relative to many other big data systems.
It is in use by over 1,500 organizations
including Netflix, eBay, Instagram and CERN. To get an idea of the
scale: Apple’s production deployment has long been known in the
Cassandra community to comprise 75,000 nodes storing over 10 PB, and at
the Cassandra Summit last September their deployment was reported to
have exceeded 100,000 nodes.

We are great believers in Cassandra and Spark and are building a hybrid
data platform bringing the benefits of these systems to PostgreSQL. We
also hope that the benefits of the PostgreSQL platform will have a wider
reach through this. Our distribution, Postgres by BigSQL, provides easy
access to these two systems through our FDW extensions CassandraFDW and
HadoopFDW. The HadoopFDW extension provides access not just to Hadoop
but also to Spark, which uses the same underlying network protocol and
SQL parser.

The combined array of advanced features that these two FDWs support is
impressive: write support (INSERT/UPDATE/DELETE), predicate pushdown,
IMPORT FOREIGN SCHEMA, and JOIN pushdown. We believe that, of all the
externally maintained FDWs, these two extensions represent the cutting
edge of PostgreSQL FDW technology as an implementation of SQL/MED for
big data systems.
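To make that concrete, here is a rough sketch of what two of these features look like in SQL; the server name hadoop_srv, the remote schema weblogs and the table names are hypothetical:

-- Pull foreign table definitions in bulk instead of declaring each one by hand.
IMPORT FOREIGN SCHEMA weblogs
    FROM SERVER hadoop_srv
    INTO public;

-- With predicate and JOIN pushdown, the filter and the join below can be
-- evaluated on the remote side rather than in Postgres.
SELECT c.session_id, s.user_name
FROM clicks c JOIN sessions s ON s.id = c.session_id
WHERE c.day = '2016-06-01';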

With that context, we will focus on the CassandraFDW in the next blog
post in this series.

pgBackRest with PostgreSQL Sandbox on Debian / Ubuntu

pgBackRest is one of the most powerful backup solutions available for PostgreSQL. It has enterprise-level features such as compression, multiple channels (threads) of backup execution, and incremental and differential backups.
The official documentation is Debian-centric in its focus. I wanted to test it out with the PostgreSQL sandbox from the BigSQL project.

Setting up PostgreSQL Sandbox and Installing pgBackRest

The BigSQL Project makes it easy to install and set up PostgreSQL and its associated components across different operating systems. In this document, we are going to look at how to set it up on Ubuntu 14.04. Linux binaries of the sandbox can be downloaded from the BigSQL download page.

The sandbox installation requires only the unpacking of the downloaded file.

tar -xvf bigsql-9.5.3-5-linux64.tar.bz2
cd bigsql/

Using the command line utility (pgc) supplied with the sandbox, it's very easy to initialize and start a PostgreSQL instance.

./pgc init pg95
./pgc start pg95

A PostgreSQL instance should now be up and running.
The same pgc utility can be used to install pgBackRest.

./pgc install backrest

Install Perl Dependencies

An important aspect to keep in mind is that pgBackRest is written in Perl and has many dependencies on different Perl libraries and modules.
An easy way to install all the dependencies in one shot is to instruct the apt-get utility to install one of the leaf components in the dependency chain.

sudo apt-get install libdbd-pg-perl

This command should fetch all the Perl dependencies of pgBackRest.

Setting Up a Backup Repository Directory

Set up a backup repository directory for pgBackRest with the following commands.

sudo mkdir /var/log/pgbackrest
sudo chmod 750 /var/log/pgbackrest

IMPORTANT for this test:

  1. pgbackrest and the postgres server process should run as the same OS user.
  2. The backup repository directory should be owned by the same OS user.

Change the ownership of the repository directory to the user under which the postgres process is running. If the user is “postgres” and the group is “postgres” then:

sudo chown -R postgres:postgres /var/log/pgbackrest

pgBackRest configuration

sudo vi /etc/pgbackrest.conf

Append the following entries to this file.

[demo]
db-path=/home/postgres/bigsql/data/pg95

[global]
repo-path=/var/log/pgbackrest

Note: if the entries already exist, modify them accordingly.

Change the ownership of this configuration file to the OS user that owns the postgres and pgbackrest processes:

sudo chown -R postgres:postgres /etc/pgbackrest.conf
sudo chmod 640 /etc/pgbackrest.conf

Modifying Database Parameters

The archive_command needs to be modified to use pgBackRest. If the pgbackrest executable is not in the PATH, make sure the full path is specified:

alter system set archive_command = '/home/postgres/bigsql/backrest/bin/pgbackrest --stanza=demo archive-push %p';

A few other parameters are also important for the proper working of pgBackRest:

alter system set archive_mode=on;
alter system set listen_addresses = '*';
alter system set max_wal_senders=3;
alter system set wal_level = 'hot_standby';

Modification of all these parameters requires a restart of the PostgreSQL instance.

./pgc restart pg95
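After the restart, the new values can be confirmed from within Postgres:

-- Quick sanity check of the archiving-related settings.
select name, setting
from pg_settings
where name in ('archive_mode', 'archive_command', 'wal_level', 'max_wal_senders');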

In the event that our operating system user doesn't exist as a superuser in our database, we need to create the user and assign superuser privileges:

postgres=# create user vagrant with password 'vagrant';
postgres=# alter user vagrant with superuser;

Backing up the database using pgBackRest

pgBackRest uses the .pgpass file for authentication.
Add a line to .pgpass with the password of the superuser in the following format:

*:*:*:*:vagrant

Once this is done, we are ready to back up the PostgreSQL instance.

backrest/bin/pgbackrest --stanza=demo --db-socket-path=/tmp --log-level-console=info backup
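Whether before or after the first backup, it is worth confirming that WAL archiving through pgBackRest is actually succeeding. The pg_stat_archiver view (available since PostgreSQL 9.4) gives a quick read:

-- archived_count should grow over time and failed_count should stay at zero
-- once archive_command is working.
select archived_count, failed_count, last_archived_wal
from pg_stat_archiver;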

Restoring from backup

Imagine a scenario where the files in your data directory are corrupt or lost and you want to restore them from a backup.
The first step is to bring down the PostgreSQL instance. This should release all file descriptors pointing to the current data directory.

Clean up the data directory:
Before restoring the backup, make sure that the data directory is clean and is stored on a reliable medium. The full path to the new data directory should be the same as the previous one (we can override this default, but for the sake of simplicity let's assume that the location remains the same).

Run the pgBackRest “restore” command to restore the data directory from the latest backup.

backrest/bin/pgbackrest --stanza=demo --db-socket-path=/tmp --log-level-console=info restore

Now we should be able to start up the PostgreSQL instance with the restored data directory.

./pgc start pg95

Our PostgreSQL cluster is now back online from the backup we restored.

MySQL Foreign Data Wrapper : A quick tour

Data centers are no longer dominated by a single DBMS. Many companies have heterogeneous environments and may want their Postgres database to talk to other database systems. Foreign Data Wrappers can be the right solution for many scenarios. The BigSQL Project provides a well-tested, ready-to-use MySQL FDW with Postgres. This makes life easy for DevOps and data center staff.

Here is a quick tour on how to configure Foreign Data Wrappers for MySQL, so that Postgres can query a MySQL table. For this quick guide, I use a CentOS Linux machine. This, or a similar setup, should work fine on all other operating systems.

Setting up a MySQL server for the test

In this demo I'm going to create a table in MySQL which should be available to Postgres through the FDW.
The FDW can talk to any MySQL distribution including Oracle's MySQL, Percona Server or MariaDB. I'm going to use MariaDB, which is more community-friendly.

Install MariaDB Server and Start the service

$ sudo yum install mariadb-server.x86_64
$ sudo systemctl start mariadb

Connect as the root user of MariaDB and create a database

$ mysql -uroot
MariaDB [(none)]> create database postgres;

Connect to the database and create a table

MariaDB [(none)]> use postgres;
MariaDB [postgres]> create table t1m(id int,name varchar(30));

Insert some data in the table:

MariaDB [postgres]> insert into t1m values (1,'abc');
Query OK, 1 row affected (0.04 sec)

MariaDB [postgres]> insert into t1m values (2,'def');
Query OK, 1 row affected (0.00 sec)

MariaDB [postgres]> insert into t1m values (3,'hij');
Query OK, 1 row affected (0.03 sec)

Setting up Postgres Database

Install Postgres

For this test, I’m going to use the Postgres DevOps Sandbox from the BigSQL project.
Download the Sandbox from BigSQL
Since this is a sandbox, you just need to unpack it

$ tar -xvf bigsql-9.5.3-5-linux64.tar.bz2

Install MySQL FDW

Go to the unpacked directory and invoke the bigsql command line tool to install MySQL FDW

$ cd bigsql
$ ./pgc list
Category | Component | Version | Status | Port | Updates
PostgreSQL pg92 9.2.17-5 NotInstalled
PostgreSQL pg93 9.3.13-5 NotInstalled
PostgreSQL pg94 9.4.8-5 NotInstalled
PostgreSQL pg95 9.5.3-5 NotInitialized
Extensions cassandra_fdw3-pg95 3.0.0-1 NotInstalled
Extensions hadoop_fdw2-pg95 2.5.0-1 NotInstalled
Extensions mysql_fdw2-pg95 2.1.2-1 NotInstalled
Extensions oracle_fdw1-pg95 1.4.0-1 NotInstalled
Extensions orafce3-pg95 3.3.0-1 NotInstalled
Extensions pgtsql9-pg95 9.5-1 NotInstalled
Extensions pljava15-pg95 1.5.0-1 NotInstalled
Extensions plv814-pg95 1.4.8-1 NotInstalled
Extensions postgis22-pg95 2.2.2-2 NotInstalled
Extensions slony22-pg95 2.2.5-2 NotInstalled
Extensions tds_fdw1-pg95 1.0.7-1 NotInstalled
Servers bam2 1.5.0 NotInstalled
Servers cassandra30 3.0.6 NotInstalled
Servers hadoop26 2.6.4 NotInstalled
Servers hive2 2.0.1 NotInstalled
Servers pgbouncer17 1.7.2-1 NotInstalled
Servers pgha2 2.1b NotInstalled
Servers pgstudio2 2.0.1-2 NotInstalled
Servers spark16 1.6.1 NotInstalled
Servers tomcat8 8.0.35 NotInstalled
Servers zookeeper34 3.4.8 NotInstalled
Applications backrest 1.02 NotInstalled
Applications birt 4.5.0 NotInstalled
Applications ora2pg 17.4 NotInstalled
Applications pgbadger 8.1 NotInstalled
Frameworks java8 8u92 NotInstalled
$ ./pgc install mysql_fdw2-pg95
['mysql_fdw2-pg95']
Get:1 http://s3.amazonaws.com/pgcentral mysql_fdw2-pg95-2.1.2-1-linux64
Unpacking mysql_fdw2-pg95-2.1.2-1-linux64.tar.bz2

Note: We can use the same command line tool to initialize a new postgres cluster

$ ./pgc init pg95

## Initializing pg95 #######################

Superuser Password [password]:
Confirm Password:
Giving current user permission to data dir

Initializing Postgres DB at:
-D "/home/vagrant/bigsql/data/pg95"

Using PostgreSQL Port 5432

Password securely remembered in the file: /home/vagrant/.pgpass

to load this postgres into your environment, source the env file:
/home/vagrant/bigsql/pg95/pg95.env

Create the extension in the postgres database

create extension mysql_fdw;

Create foreign server

postgres=# CREATE SERVER mysql_svr
FOREIGN DATA WRAPPER mysql_fdw
OPTIONS (host 'localhost', port '3306');
CREATE SERVER
postgres=#

Create foreign table

postgres=# CREATE FOREIGN TABLE mysql_tab (
postgres(# id int,
postgres(# name varchar(30)
postgres(# )
postgres-# SERVER mysql_svr
postgres-# OPTIONS (dbname 'postgres', table_name 't1m');
CREATE FOREIGN TABLE
postgres=#

Create user mapping

postgres=# CREATE USER MAPPING FOR PUBLIC
postgres-# SERVER mysql_svr
postgres-# OPTIONS (username 'root');

(If your user authenticates to MySQL with a password, you have to pass that as well, in the format (username 'username', password 'password'), as shown below.)
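For example, a user mapping with a password might look like this (the credentials are placeholders):

CREATE USER MAPPING FOR PUBLIC
SERVER mysql_svr
OPTIONS (username 'root', password 'secret');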

Now everything is set. You can test it by querying the table.

postgres=# select * from mysql_tab;
id | name
----+-------
1 | abc
2 | def
3 | hij
(3 rows)


Note: The MySQL FDW for Postgres requires the MySQL client libraries. Please make sure that libmysqlclient.so is present in LD_LIBRARY_PATH. If the file name is something different, like "libmysqlclient.so.18.0.0", you may have to create a symlink named "libmysqlclient.so".