Video on FDW by Jim

Video – Behold! The Power of Foreign Data Wrappers!

If you’ve never heard of PostgreSQL’s “foreign data wrappers” (FDWs), you’ve been missing out on a very powerful tool for connecting and integrating heterogeneous databases. By leveraging FDWs, you can put PostgreSQL at the center of an integrated database federation and read and write data from other PostgreSQL servers, other relational databases (Oracle, MySQL), many NoSQL systems (MongoDB, Hadoop), and even Twitter!
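
To make the idea concrete, here is a hedged sketch of what a federated query can look like once the foreign tables are in place (the table names and the FDWs assumed to back them are purely illustrative):

    -- Assume "customers" lives on another PostgreSQL server (via postgres_fdw)
    -- and "clickstream" lives in Hadoop (via a Hadoop FDW). Both appear locally
    -- as foreign tables, so a plain SQL join federates the two systems.
    SELECT c.customer_name,
           count(*) AS page_views
    FROM   customers c
    JOIN   clickstream k ON k.customer_id = c.customer_id
    GROUP  BY c.customer_name
    ORDER  BY page_views DESC
    LIMIT  10;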

Heck, FDWs are so powerful that MongoDB just released its BI connector as a PostgreSQL FDW! (See the excellent write-up by John De Goes of SlamData.)

At this year’s PostgresOpen, Jim Mlodgenski from OpenSCG gave a great talk about federated PostgreSQL using FDWs. Watch this video to learn the basics of what FDWs are, how they work, and how you can easily combine data from PostgreSQL, other DBMSs, and NoSQL databases, all in a single SQL query.

Thanks to Daily Tech Video for highlighting this talk in their post [Video 378] Jim Mlodgenski: PostgreSQL Federation.


BigSQL Adds Cassandra to the Hadoop and PostgreSQL Bundle

The BigSQL developer’s bundle has added a significant new component: Cassandra 2.0.4. All of the bundles and VMs now allow developers to get up and running with the latest Hadoop ecosystem and Cassandra. Join our free Beta program now!

As a significant number of our clients are leveraging Cassandra for highly available, multi-partition solutions, it was a natural fit to add to our bundle.

In addition to Cassandra, the latest beta includes all of the standard BigSQL components and many enhancements:

  • Continued HadoopFDW (Postgres-Hadoop foreign data wrapper) development, fixes, and improvements
  • JDK 7 is now a prerequisite on all platforms
  • Upgraded PostgreSQL from 9.3.1 to 9.3.2
  • Upgraded DataFu from 1.1.1 to 1.2.0
  • Upgraded PostgresqlStudio from 1.0 to 1.1
  • Upgraded Hue from 3.0.0 to 3.5
  • Autostart Oozie & Hue in the VM
  • Upgraded Sqoop from 1.4.4 to 1.99.3 (and autostart it)
  • Included pgTSQL functionality from tpostgresq.org
  • Compacted our quickstart virtual machine from 3 GB to 1.5 GB
  • Configured Eclipse for Hadoop in the VM

At OpenSCG we are focused on helping you get the most value and leverage out of your data environments. It’s an exciting time in data, and we thrive on helping our clients understand the latest changes in data technology. We realize that there are great opportunities in new NoSQL and clustered data solutions, and we take pride in helping you evaluate your data options.

Hive, Postgres, and Hadoop Foreign Data Wrapper Video Tutorial

This demo shows how to run an example in BigSQL that uses Hive, Hadoop, PostgreSQL, and the Hadoop Foreign Data Wrapper to leverage the power of Hadoop from within PostgreSQL. This tutorial uses BigSQL 9.3 Release 2 – Beta2, which includes Hadoop-2.2 and Hive-0.12. You can download the newest release, or the Quick start VM for VMware, at BigSQL.org.

In fewer than four minutes this tutorial will show you how to:

  • Create a Hive table
  • Load a CSV file into a Hive table
  • Set up the Hadoop Foreign Data Wrapper in PostgreSQL
  • Access data in Hadoop from PostgreSQL

[youtube url=”http://www.youtube.com/watch?v=YI006EMo_SY” autohide=”0″ fs=”1″]

PostgreSQL, Hadoop, and Pentaho to Analyze and Visualize Data

This tutorial shows how to use Postgres, Hadoop, Hive, HBase, and Pig to load, refine, and store big data for visualization, all in fewer than five minutes.

You can watch the BigSQL, Postgres + Hadoop, Pentaho Demo here or follow along with the written version below.

[youtube url=”http://www.youtube.com/watch?v=eSEe_33pImA” fs=”1″ rel=”0″]

The following tutorial shows the scripts used to accomplish the demo. It also shows how to complete the additional step of loading the data into HBase, which allows you to update, insert, and delete the data from Postgres.

The data used for this example consists of three files:

  • Products, which includes the category of products and their corresponding URLs
  • Logs, a semi-structured website log that includes data such as a timestamp, user ID, and geocoded IP address for each entry
  • Users, which contains each user’s ID, birth date, and gender

In BigSQL, we will create the Hive table definitions and move all of the data to the Hadoop file system:

   hive > CREATE TABLE users ( SWID STRING, BIRTH_DT STRING, GENDER_CD STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/user/data/users';

   hive > CREATE TABLE products ( url STRING, category STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/user/data/products';

   hive > CREATE TABLE Logs ( col_0 STRING, col_1 STRING, col_2 STRING, col_3 STRING, col_4 STRING, col_5 STRING, col_6 STRING, col_7 STRING, col_8 STRING, col_9 STRING, col_10 STRING, col_11 STRING, col_12 STRING, col_13 STRING, col_14 STRING, col_15 STRING, col_16 STRING, col_17 STRING, col_18 STRING, col_19 STRING, col_20 STRING, col_21 STRING, col_22 STRING, col_23 STRING, col_24 STRING, col_25 STRING, col_26 STRING, col_27 STRING, col_28 STRING, col_29 STRING, col_30 STRING, col_31 STRING, col_32 STRING, col_33 STRING, col_34 STRING, col_35 STRING, col_36 STRING, col_37 STRING, col_38 STRING, col_39 STRING, col_40 STRING, col_41 STRING,        col_42 STRING, col_43 STRING, col_44 STRING, col_45 STRING, col_46 STRING, col_47 STRING,        col_48 STRING, col_49 STRING, col_50 STRING, col_51 STRING, col_52 STRING, col_53 STRING, col_54 STRING, col_55 STRING, col_56 STRING, col_57 STRING, col_58 STRING, col_59 STRING, col_60 STRING, col_61 STRING, col_62 STRING, col_63 STRING, col_64 STRING, col_65 STRING, col_66 STRING, col_67 STRING, col_68 STRING, col_69 STRING, col_70 STRING, col_71 STRING, col_72 STRING, col_73 STRING, col_74 STRING, col_75 STRING, col_76 STRING, col_77 STRING, col_78 STRING, col_79 STRING, col_80 STRING, col_81 STRING, col_82 STRING, col_83 STRING, col_84 STRING, col_85 STRING, col_86 STRING, col_87 STRING, col_88 STRING, col_89 STRING, col_90 STRING, col_91 STRING, col_92 STRING, col_93 STRING, col_94 STRING, col_95 STRING, col_96 STRING, col_97 STRING, col_98 STRING, col_99 STRING, col_100 STRING, col_101 STRING, col_102 STRING, col_103 STRING, col_104 STRING, col_105 STRING, col_106 STRING, col_107 STRING, col_108 STRING, col_109 STRING, col_110 STRING, col_111 STRING, col_112 STRING, col_113 STRING, col_114 STRING, col_115 STRING, col_116 STRING, col_117 STRING, col_118 STRING, col_119 STRING, col_120 STRING, col_121 STRING, col_122 STRING, col_123 STRING, col_124 STRING, col_125 STRING, col_126 STRING, col_127 STRING, col_128 STRING, col_129 STRING, col_130 STRING, col_131 STRING, col_132 STRING, col_133 STRING, col_134 STRING, col_135 STRING, col_136 STRING, col_137 STRING, col_138 STRING, col_139 STRING, col_140 STRING, col_141 STRING, col_142 STRING, col_143 STRING, col_144 STRING, col_145 STRING, col_146 STRING, col_147 STRING, col_148 STRING, col_149 STRING, col_150 STRING, col_151 STRING, col_152 STRING, col_153 STRING, col_154 STRING, col_155 STRING, col_156 STRING, col_157 STRING, col_158 STRING, col_159 STRING, col_160 STRING, col_161 STRING, col_162 STRING, col_163 STRING, col_164 STRING, col_165 STRING, col_166 STRING, col_167 STRING, col_168 STRING, col_169 STRING, col_170 STRING, col_171 STRING, col_172 STRING, col_173 STRING, col_174 STRING, col_175 STRING, col_176 STRING, col_177 STRING, col_178 STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/user/data/logs';

   hive > LOAD DATA LOCAL INPATH 'Downloads/webLogAnalytics/users.tsv' OVERWRITE INTO TABLE users;
   hive > LOAD DATA LOCAL INPATH 'Downloads/webLogAnalytics/products.tsv' OVERWRITE INTO TABLE products;
   hive > LOAD DATA LOCAL INPATH 'Downloads/webLogAnalytics/Omniture.tsv' OVERWRITE INTO TABLE logs;

Now, use the HBase shell to define the webloganalytics table:

   $ hbase shell
   hbase > create 'webloganalytics', 'info','user','location'

Next, we will use the following Pig script to refine the data. You may need to export the following as your HADOOP_CLASSPATH in order for Pig to work correctly with HBase:

export HADOOP_CLASSPATH=$HBASE_HOME/hbase-0.94.11.jar:$HADOOP_HOME/hadoop-core-1.2.1.jar:$ZOOKEEPER_HOME/zookeeper-3.4.5.jar:$HBASE_HOME/lib/protobuf-java-2.4.0a.jar

First, register the Piggybank, HBase, and time JARs, and then define the two time functions we will use.

   grunt >  cd /user/data
   grunt >  REGISTER bigsql-9.3.0-rc2.56/pig-0.11.1-src/contrib/piggybank/java/piggybank.jar 
   grunt >  REGISTER bigsql-9.3.0-rc2.56/pig-0.11.1-src/build/ivy/lib/Pig/joda-time-1.6.jar
   grunt >  REGISTER bigsql-9.3.0-rc2.56/hbase/lib/protobuf-java-2.4.0a.jar
   grunt >  DEFINE CustomFormatToISO org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
   grunt >  DEFINE ISOYearsBetween org.apache.pig.piggybank.evaluation.datetime.diff.ISOYearsBetween();

Next, load the three tables from the Hadoop file system:

   grunt >  users = LOAD '/user/data/users/users.csv' USING PigStorage(',') AS (SWID:chararray,BIRTH_DT:chararray,GENDER_CD:chararray);
   grunt >  products = LOAD '/user/data/products/products.tsv' USING PigStorage('\t') AS (url:chararray,category:chararray);
   grunt >  logs = LOAD '/user/data/logs/Omniture.0.tsv' USING PigStorage('\t') AS (col_0:chararray,col_1:chararray,col_2:chararray,col_3:chararray,col_4:chararray,col_5:chararray,col_6:chararray,col_7:chararray,col_8:chararray,col_9:chararray,col_10:chararray,col_11:chararray,col_12:chararray,col_13:chararray,col_14:chararray,col_15:chararray,col_16:chararray,col_17:chararray,col_18:chararray,col_19:chararray,col_20:chararray,col_21:chararray,col_22:chararray,col_23:chararray,col_24:chararray,col_25:chararray,col_26:chararray,col_27:chararray,col_28:chararray,col_29:chararray,col_30:chararray,col_31:chararray,col_32:chararray,col_33:chararray,col_34:chararray,col_35:chararray,col_36:chararray,col_37:chararray,col_38:chararray,col_39:chararray,col_40:chararray,col_41:chararray,col_42:chararray,col_43:chararray,col_44:chararray,col_45:chararray,col_46:chararray,col_47:chararray,col_48:chararray,col_49:chararray,col_50:chararray,col_51:chararray,col_52:chararray,col_53:chararray,col_54:chararray,col_55:chararray,col_56:chararray,col_57:chararray,col_58:chararray,col_59:chararray,col_60:chararray,col_61:chararray,col_62:chararray,col_63:chararray,col_64:chararray,col_65:chararray,col_66:chararray,col_67:chararray,col_68:chararray,col_69:chararray,col_70:chararray,col_71:chararray,col_72:chararray,col_73:chararray,col_74:chararray,col_75:chararray,col_76:chararray,col_77:chararray,col_78:chararray,col_79:chararray,col_80:chararray,col_81:chararray,col_82:chararray,col_83:chararray,col_84:chararray,col_85:chararray,col_86:chararray,col_87:chararray,col_88:chararray,col_89:chararray,col_90:chararray,col_91:chararray,col_92:chararray,col_93:chararray,col_94:chararray,col_95:chararray,col_96:chararray,col_97:chararray,col_98:chararray,col_99:chararray,col_100:chararray,col_101:chararray,col_102:chararray,col_103:chararray,col_104:chararray,col_105:chararray,col_106:chararray,col_107:chararray,col_108:chararray,col_109:chararray,col_110:chararray,col_111:chararray,col_112:chararray,col_113:chararray,col_114:chararray,col_115:chararray,col_116:chararray,col_117:chararray,col_118:chararray,col_119:chararray,col_120:chararray,col_121:chararray,col_122:chararray,col_123:chararray,col_124:chararray,col_125:chararray,col_126:chararray,col_127:chararray,col_128:chararray,col_129:chararray,col_130:chararray,col_131:chararray,col_132:chararray,col_133:chararray,col_134:chararray,col_135:chararray,col_136:chararray,col_137:chararray,col_138:chararray,col_139:chararray,col_140:chararray,col_141:chararray,col_142:chararray,col_143:chararray,col_144:chararray,col_145:chararray,col_146:chararray,col_147:chararray,col_148:chararray,col_149:chararray,col_150:chararray,col_151:chararray,col_152:chararray,col_153:chararray,col_154:chararray,col_155:chararray,col_156:chararray,col_157:chararray,col_158:chararray,col_159:chararray,col_160:chararray,col_161:chararray,col_162:chararray,col_163:chararray,col_164:chararray,col_165:chararray,col_166:chararray,col_167:chararray,col_168:chararray,col_169:chararray,col_170:chararray,col_171:chararray,col_172:chararray,col_173:chararray,col_174:chararray,col_175:chararray,col_176:chararray,col_177:chararray,col_178:chararray);

Select only the important fields from the logs table, convert the ‘yyyy-MM-dd HH:mm:ss’ format dates to ISO timestamps, and ensure that the country and state values are uppercase strings:

   grunt >  omniture = FOREACH logs GENERATE CustomFormatToISO(col_1, 'yyyy-MM-dd HH:mm:ss') AS logdate, col_7 AS ip, col_12 AS url, col_13 AS swid, col_49 AS city, col_50 AS country, col_52 AS state;
   grunt >  omniture2 = FOREACH omniture GENERATE logdate, url, ip, city, UPPER(state) AS state, UPPER(country) AS country, swid;

Convert the ‘dd-MMM-yy’ format birth dates in the users table to ISO timestamps, and then calculate the age of each user:

   grunt >  users2 = FILTER users BY BIRTH_DT IS NOT NULL;
   grunt >  users3 = FOREACH users2 GENERATE SWID,ISOYearsBetween('2013-01-01T02:00:00.000Z', CustomFormatToISO(BIRTH_DT, 'dd-MMM-yy')) AS AGE,GENDER_CD;
   grunt >  users4 = JOIN users BY (SWID, GENDER_CD) LEFT OUTER, users3 BY (SWID, GENDER_CD);
   grunt >  users5 = FOREACH users4 GENERATE users::SWID AS swid, users3::AGE AS age, users::GENDER_CD AS gender;

Next, join the logs table with the user table by the user id, and then join that table with the products table by the url:

   grunt >  webloganalytics1 = JOIN omniture2 BY swid LEFT OUTER, users5 BY swid;
   grunt >  webloganalytics2 = JOIN webloganalytics1 BY url LEFT OUTER, products BY url;
   grunt >  webloganalytics3 = ORDER webloganalytics2 BY logdate;
   grunt >  webloganalytics4 = FOREACH webloganalytics3 GENERATE logdate, products::url, ip, city,state, country, category, users5::age, users5::gender;

For this tutorial, instead of just saving the data to HDFS, it will be loaded into HBase. To accomplish this, we first need to add a key to the webloganalytics table.

   grunt >  webloganalytics = RANK webloganalytics4;    
   grunt >  STORE webloganalytics INTO 'hbase://webloganalytics' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:logdate info:url user:ip location:city location:state location:country info:category user:age user:gender');

The resulting data, not including the new id, is now stored in the webloganalytics HBase table.

Now, to use the foreign data wrapper’s insert, update, and delete capabilities, you need to create the corresponding Hive table:

   hive > CREATE TABLE webloganalytics ( id INT, logdate STRING, url STRING, ip STRING, city STRING, state STRING, country STRING, category STRING,   age INT, gender STRING ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" =":key,info:logdate,info:url,user:ip, location:city,location:state,location:country,info:category,user:age,user:gender") TBLPROPERTIES("hbase.table.name"="webloganalytics");

Next, use the Hadoop Foreign Data Wrapper to create the corresponding Postgres table:

   postgres=# CREATE EXTENSION hadoop_fdw;

   postgres=# CREATE SERVER hadoop_server FOREIGN DATA WRAPPER hadoop_fdw OPTIONS (address '127.0.0.1', port '10000');

   postgres=# CREATE USER MAPPING FOR PUBLIC SERVER hadoop_server;

   postgres=# CREATE FOREIGN TABLE webloganalytics ( id INT, logdate TIMESTAMP, url TEXT, ip TEXT, city TEXT, state TEXT, country TEXT, category TEXT, age INT, gender TEXT ) SERVER hadoop_server OPTIONS ( table 'webloganalytics', hbase_port '9090', hbase_address 'localhost', hbase_mapping ':key,info:logdate,info:url,user:ip,location:city,location:state,location:country,info:category,user:age,user:gender' );

Now we can show how to select, insert, update, and delete data from Postgres:

postgres=# SELECT * FROM webloganalytics WHERE state = 'CA' LIMIT 5;
  id    |       logdate     |                  url                 |       ip       |     city   | state  | country  |category | age | gender 
 --------+---------------------+-------------------------------------------+------------------+-------------+--------+---------+----------+-----+-------- 
 100002 | 2012-03-13 12:36:42 | http://www.acme.com/SH55126545/VD55177927 | 108.0.201.188   | long beach  | CA    | USA     | clothing |  25 | U
 100005 | 2012-03-13 12:36:43 | http://www.acme.com/SH55126545/VD55177927 | 108.0.201.188   | long beach  | CA    | USA     | clothing |  25 | U
 100010 | 2012-03-13 12:37:07 | http://www.acme.com/SH55126545/VD55179433 | 162.119.238.163 | los angeles | CA    | USA     | shoes    |     | 
 100011 | 2012-03-13 12:37:07 | http://www.acme.com/SH55126545/VD55179433 | 162.119.238.163 | los angeles | CA    | USA     | shoes    |     | 
 100012 | 2012-03-13 12:37:10 | http://www.acme.com/SH55126545/VD55179433 | 108.198.177.186 | los angeles | CA    | USA     | shoes    |  27 | M

postgres=# DELETE FROM webloganalytics WHERE id=100010;
postgres=# SELECT * FROM webloganalytics WHERE state = 'CA' LIMIT 5;
  id    |       logdate     |                  url                 |       ip       |     city   | state  | country  |category | age | gender 
 --------+---------------------+-------------------------------------------+------------------+-------------+--------+---------+----------+-----+-------- 
 100002 | 2012-03-13 12:36:42 | http://www.acme.com/SH55126545/VD55177927 | 108.0.201.188   | long beach  | CA    | USA     | clothing |  25 | U
 100005 | 2012-03-13 12:36:43 | http://www.acme.com/SH55126545/VD55177927 | 108.0.201.188   | long beach  | CA    | USA     | clothing |  25 | U
 100011 | 2012-03-13 12:37:07 | http://www.acme.com/SH55126545/VD55179433 | 162.119.238.163 | los angeles | CA    | USA     | shoes    |     | 
 100012 | 2012-03-13 12:37:10 | http://www.acme.com/SH55126545/VD55179433 | 108.198.177.186 | los angeles | CA    | USA     | shoes    |  27 | M
 100013 | 2012-03-13 12:37:13 | http://www.acme.com/SH55126545/VD55177927 | 108.0.201.188   | long beach  | CA    | USA     | clothing |  25 | U

postgres=# UPDATE webloganalytics SET age=30, gender='F' WHERE id=100011;
postgres=# SELECT * FROM webloganalytics WHERE state = 'CA' LIMIT 5;                                             
  id    |       logdate     |                  url                 |       ip       |     city   | state  | country  |category | age | gender 
 --------+---------------------+-------------------------------------------+------------------+-------------+--------+---------+----------+-----+-------- 
100002  | 2012-03-13 12:36:42 | http://www.acme.com/SH55126545/VD55177927 | 108.0.201.188   | long beach  | CA    | USA     | clothing |  25 | U 
100005  | 2012-03-13 12:36:43 | http://www.acme.com/SH55126545/VD55177927 | 108.0.201.188   | long beach  | CA    | USA     | clothing |  25 | U 
100011  | 2012-03-13 12:37:07 | http://www.acme.com/SH55126545/VD55179433 | 162.119.238.163 | los angeles | CA    | USA     | shoes    |  30 | F 
100012  | 2012-03-13 12:37:10 | http://www.acme.com/SH55126545/VD55179433 | 108.198.177.186 | los angeles | CA    | USA     | shoes    |  27 | M
100013  | 2012-03-13 12:37:13 | http://www.acme.com/SH55126545/VD55177927 | 108.0.201.188   | long beach  | CA    | USA     | clothing |  25 | U

postgres=# INSERT INTO webloganalytics VALUES(100010,'2012-03-13 12:37:07','http://www.acme.com/SH55126545/VD55179433','162.119/238/163','los angeles','CA','USA','shoes',30,'F');
INSERT 0 1
postgres=# SELECT * FROM webloganalytics WHERE state = 'CA' LIMIT 5;
  id    |       logdate     |                  url                 |       ip       |     city   | state  | country  |category | age | gender 
 --------+---------------------+-------------------------------------------+------------------+-------------+--------+---------+----------+-----+-------- 
 100002 | 2012-03-13 12:36:42 | http://www.acme.com/SH55126545/VD55177927 | 108.0.201.188   | long beach  | CA    | USA     | clothing |  25 | U
 100005 | 2012-03-13 12:36:43 | http://www.acme.com/SH55126545/VD55177927 | 108.0.201.188   | long beach  | CA    | USA     | clothing |  25 | U
 100010 | 2012-03-13 12:37:07 | http://www.acme.com/SH55126545/VD55179433 | 162.119/238/163 | los angeles | CA    | USA     | shoes    |  30 | F
 100011 | 2012-03-13 12:37:07 | http://www.acme.com/SH55126545/VD55179433 | 162.119.238.163 | los angeles | CA    | USA     | shoes    |  30 | F
 100012 | 2012-03-13 12:37:10 | http://www.acme.com/SH55126545/VD55179433 | 108.198.177.186 | los angeles | CA    | USA     | shoes    |  27 | M

HadoopFDW

The release of PostgreSQL 9.3 lets us do some really cool things with writable foreign tables. BigSQL has just released a Hadoop Foreign Data Wrapper that can write to HDFS files and HBase tables. The HBase integration allows for full SELECT, INSERT, UPDATE, and DELETE syntax through PostgreSQL, and the HDFS integration allows for SELECT and INSERT.

The HadoopFDW is released under the PostgreSQL license and can be found here.


Hadoop Thrift Tutorial

At OpenSCG we have been using Thrift to build a Hadoop Foreign Data Wrapper. This tool, which is already integrated into BigSQL, allows you to take advantage of the power of Hadoop from within PostgreSQL. The BigSQL tutorial shows how you can easily create a Postgres table that references a Hive or HBase table, run a parallel Hive query, and see your data returned, all in the psql terminal. Behind the scenes, the Hadoop Foreign Data Wrapper uses Thrift tools to build Hive and HBase clients in C++; the user then creates a foreign server from within PostgreSQL that communicates with those clients.

The goal of this tutorial is to give you a very simplified look at building Thrift servers and clients. You will use Thrift to make a simple Java server, a Python client, and a C++ client. The clients will be able to send a structure and commands to the server, and the server will do the calculations and print the results. I made the structure and two functions described below:

  • A UserProfile struct that consists of a user ID, name, blurb, and month, day, and year of birth
  • A repeat function that prints the user’s ID, name, and blurb
  • A daysold function that prints the approximate number of days since the user was born

Setting up Thrift

Make sure that you have all of the appropriate Apache Thrift requirements; this includes the Libevent API, the Boost libraries, and language support for Java, Python, and C++. Get the latest version tarball, or build from source using the instructions on the Apache Thrift download page.

Creating the Java Server

The first step in using Thrift is to create the interface definition file, start.thrift, in the thrift/compiler/cpp/ directory. This type of file is made up of Thrift types and services that the server will implement and the client will call. Thrift lets you write a single definition file, regardless of the language you intend to use, from which it generates all of the source code you need.

start.thrift

#!/usr/local/bin/thrift --gen java

namespace java Test

struct UserProfile {
    1: i32 uid,
    2: string name,
    3: string blurb,
    4: i32 month,
    5: i32 day,
    6: i32 year
}

service UserStorage {
    void repeat(1: UserProfile user)
    void daysold(1: UserProfile user)
}

Generate the Java code by running the following command from the cpp directory. The -r flag specifies that code should also be generated for any included files.

    $ thrift -r --gen java start.thrift

To build the Java server from scratch, add the following files to the generated gen-java directory:

UserStorageserver.java

package Test;

import org.apache.thrift.transport.TServerSocket;
import org.apache.thrift.transport.TServerTransport;
import org.apache.thrift.server.TServer;
import org.apache.thrift.server.TServer.Args;
import org.apache.thrift.server.TSimpleServer;
import Test.UserProfile;
import Test.UserStorage;

public class UserStorageserver {
    public static void StartsimpleServer(UserStorage.Processor processor) {

        try {
            TServerTransport serverTransport = new TServerSocket(9090);
            TServer server = new TSimpleServer(new Args(serverTransport).processor(processor));
            System.out.println("Starting the simple server...");
            server.serve();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        StartsimpleServer(new UserStorage.Processor(new UserStorageHandler()));
    }
}

UserStorageHandler.java

package Test;

import java.util.Calendar;
import Test.UserProfile;
import Test.UserStorage;

public class UserStorageHandler implements Test.UserStorage.Iface {

    @Override
    public void repeat(UserProfile user) {
        System.out.println("Repeat:\n");
        System.out.println("User's ID: " + user.uid + "\nUser's Name: " + user.name + "\nUser's Blurb: " + user.blurb + "\n");
    }

    @Override
    public void daysold(UserProfile user) {
        System.out.println("Days Old:\n");
        int month = Calendar.getInstance().get(Calendar.MONTH);
        int year = Calendar.getInstance().get(Calendar.YEAR);
        int agem;
        int agey;
        if (month > user.month) {
            agem = (month - user.month) * 30;
            agey = (year - user.year) * 360;
        } else {
            agem = month * 30;
            agey = (year - user.year - 1) * 360;
        }
        int age = agem + agey + user.day;
        System.out.println("User is approximately " + age + " days old.\n");
    }
}

The easiest way to build the Java server is in Eclipse. I added the two files created above along with the generated UserStorage.java and UserProfile.java. Once these files are all in the correct packages, you need to add four additional .jar files to your build path (Properties -> Java Build Path -> Add External JARs):

  • libthrift-javadoc.jar
  • libthrift.jar
  • slf4j-api
  • slf4j-simple

The Thrift JARs can be found in the thrift/lib/java/build/ directory. The two slf4j JARs are available from the slf4j download, or you can simply remove the code that depends on them; it will not affect this example. Once everything compiles without errors in Eclipse, you can run UserStorageserver.java as a Java application, or export it as a runnable JAR and run it from the command line with:

    $ java -jar Java_server.jar

Creating the C++ Client

The next step is to make a Thrift client in C++. First, change the namespace from “java” to “cpp” in start.thrift. Next, generate the C++ code:

    $ thrift --gen cpp start.thrift
    $ cd gen-cpp

Compile the code using the g++ command. The “-I/usr/local/include/thrift” argument adds the Thrift headers to the include path, “-Wall” enables all compiler warnings, “-c” compiles the .cpp file without linking, and “-o” names the output file.

    $ g++ -Wall -I/usr/local/include/thrift -c start_constants.cpp -o constants.o
    $ g++ -Wall -I/usr/local/include/thrift -c start_types.cpp -o types.o
    $ g++ -Wall -I/usr/local/include/thrift -c UserStorage.cpp -o UserStorage.o

If you’re trying this example on Ubuntu and get this error:

        /usr/local/include/thrift/protocol/TProtocol.h:646:11: error: 'uint32_t' does not name a type

All you need to do is add the following line before the Thrift includes in the header file that fails to compile:

       #include <stdint.h>

Or if you get this error:

    error: 'class apache::thrift::protocol::TProtocol' has no member named 'writeMessageEnd'
    error: 'class apache::thrift::protocol::TProtocol' has no member named 'writeEnd'

All you need to do is add the following line to the header file that fails to compile:

        #include <arpa/inet.h>

Create the following file in the gen-cpp directory:

UserStorage_Client.cpp

#include "UserStorage.h" // One of the autogenerated files

#include <transport/TSocket.h>
#include <transport/TBufferTransports.h>
#include <protocol/TBinaryProtocol.h>

using namespace ::apache::thrift;
using namespace ::apache::thrift::protocol;
using namespace ::apache::thrift::transport;

using namespace ::Test;

int main(int argc, char **argv) {
    boost::shared_ptr<TSocket> socket(new TSocket("192.168.1.127", 9090));
    boost::shared_ptr<TTransport> transport(new TBufferedTransport(socket));
    boost::shared_ptr<TProtocol> protocol(new TBinaryProtocol(transport));

    UserStorageClient client(protocol);

    transport->open();

    UserProfile user;
        user.uid=1;
        user.name="Cpp";
        user.blurb="Cpp's blurb";
        user.month=9;
        user.day=16;
        user.year=1990;

    printf("Sending User Info to Server\n");
    client.repeat(user);
    client.daysold(user);

    transport->close();
    return 0;
}

Compile and link together all of the .o files. The “-L/usr/local/lib” argument adds that directory to the library search path used while linking.

    $ g++ -Wall -I/usr/local/include/thrift -c UserStorage_Client.cpp -o client.o
    $ g++ -L/usr/local/lib *.o -o UserStorage_Client -lthrift

If you’re working on OSX and you’ve gotten the following error:

    /usr/local/include/thrift/transport/TSocket.h:240: error: expected ',' or '...' before '*' token
    /usr/local/include/thrift/transport/TSocket.h:240: error: ISO C++ forbids declaration of 'sockaddr' with no type
    /usr/local/include/thrift/transport/TSocket.h:293: error: 'sockaddr_in' does not name a type
    /usr/local/include/thrift/transport/TSocket.h:294: error: 'sockaddr_in6' does not name a type

You need to add the following to TSocket.h in /usr/local/include/thrift/transport:

        #include <sys/socket.h>
        #include <arpa/inet.h>

Once the code compiles and links, run the client by:

    $ ./UserStorage_Client

At this point you should get an error indicating that the server is not running:

        Socket::open() connect() <Host: localhost Port: 9090>Connection refused
        Abort trap

If you get the following error:

    error while loading shared libraries: libthrift-0.9.0.so: cannot open shared object file: No such file or directory

It just means that you need to add the directory containing libthrift-0.9.0.so to your library path:

     $ export LD_LIBRARY_PATH='/usr/lib':$LD_LIBRARY_PATH

Creating the Python Client

Creating a Python client is even easier than creating the C++ client. In order to use Python, you must go into the thrift/lib/py directory and run:

    $ sudo python setup.py install

If you do not do this, you will get this error when you try to run the client:

    File "./UserStorage_Client.py", line 3, in <module> from Test import UserStorage
    File "thrift-0.9.0/compiler/cpp/gen-py/Test/UserStorage.py", line 9, in <module> from
    thrift.Thrift import TType, TMessageType, TException, TApplicationException
    ImportError: No module named thrift.Thrift

Next, go to the thrift/compiler/cpp directory, change the namespace from “cpp” to “py”, and generate the Python code:

    $ thrift -r --gen py start.thrift

In the newly created gen-py folder, add this file:

UserStorage_Client.py

import sys

from Test import UserStorage
from Test.ttypes import *

from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol
try:
    # Make socket
    transport = TSocket.TSocket('localhost', 9090)

    # Buffering is critical. Raw sockets are very slow
    transport = TTransport.TBufferedTransport(transport)

    # Wrap in a protocol
    protocol = TBinaryProtocol.TBinaryProtocol(transport)

    # Create a client to use the protocol encoder
    client = UserStorage.Client(protocol)

    # Connect!
    transport.open()

    up = UserProfile(uid=2, name="Python", blurb="Python's blurb", month=1, day=28, year=1978)
    print 'Sending User Info to Server'

    info = client.repeat(up)
    days = client.daysold(up)

    # Close!
    transport.close()

except Thrift.TException, tx:
    print '%s' % (tx.message)

Finally, run the client with the command:

    $ python UserStorage_Client.py

Once again, you should get an error indicating that the server is not running:

    Could not connect to localhost:9090

Test Server and Clients

Start up the Java server (in Eclipse, right-click on UserStorageserver.java and select “Run As Java Application”), or go to the directory where your Java_Server.jar is located and use the command:

    $ java -jar Java_Server.jar

Execute the C++ client by going to the gen-cpp directory in a separate terminal window:

    $ ./UserStorage_Client

Execute the Python client by going to the gen-py directory:

    $ python UserStorage_Client.py

If everything ran correctly, each client’s terminal should say “Sending User Info to Server” and the server’s terminal should show this:

Starting the simple server...
Repeat:
User's ID: 1
User's Name: Cpp
User's Blurb: Cpp's blurb

Days Old:
User is approximately 8146 days old.

Repeat:
User's ID: 2
User's Name: Python
User's Blurb: Python's blurb

Days Old:
User is approximately 12808 days old.

Postgres FDWs are HOT

Over the last couple of years, Postgres Foreign Data Wrappers (FDWs) have matured into something very powerful and complete. You can integrate with other Postgres and non-Postgres database instances natively and seamlessly. Check out this blog post from the good people at Heroku and you’ll start to understand the beauty and simplicity. The postgres_fdw wrapper is built into standard vanilla PostgreSQL 9.3 and lets you easily federate remote Postgres instances.
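
As a rough sketch of what that can look like in practice (the connection details, credentials, and the remote “orders” table below are hypothetical placeholders, not from any particular setup):

    postgres=# CREATE EXTENSION postgres_fdw;

    postgres=# CREATE SERVER remote_pg FOREIGN DATA WRAPPER postgres_fdw
                   OPTIONS (host '10.0.0.5', dbname 'sales', port '5432');

    postgres=# CREATE USER MAPPING FOR PUBLIC SERVER remote_pg
                   OPTIONS (user 'app_user', password 'secret');

    postgres=# CREATE FOREIGN TABLE remote_orders (
                   order_id   INT,
                   total      NUMERIC,
                   created_at TIMESTAMP
               ) SERVER remote_pg OPTIONS (schema_name 'public', table_name 'orders');

    -- The remote table now behaves like a local one, including in joins.
    postgres=# SELECT count(*) FROM remote_orders WHERE created_at > now() - interval '1 day';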

–Luss


Meetup: Leveraging Hadoop in your PostgreSQL Environment

Our very own Jim “Jimbo” Mlodgenski will be hosting a Meetup on Thursday, June 27th at 6 PM to discuss Leveraging Hadoop in your PostgreSQL Environment!

The event will be graciously held at:
Slader
199 Lafayette Street
Suite 3B
New York, NY

To register click here

Details….

There is buzz everywhere about Apache Hadoop and rightfully so. It’s an open-source framework that enables the construction of distributed, data-intensive applications running on clusters of commodity hardware. In short, it’s letting people get real value out of massive amounts of data without spending a fortune on proprietary software or high end hardware. But where does this leave the traditional relational databases like PostgreSQL?

This talk will begin with a discussion of the strengths of PostgreSQL and Hadoop. We will then lead into a high level overview of Hadoop and its community of projects like Hive, Flume and Sqoop. Finally, we will dig down into various use cases detailing how you can leverage Hadoop technologies for your PostgreSQL databases today. The use cases will range from using HDFS for simple database backups to using PostgreSQL and Foreign Data Wrappers to do low latency analytics on your Big Data.


PostgreSQL FDW to Hadoop Hive Tutorial


In our last tutorial, Hadoop Hive Tutorial – Zero to Results in 15 Minutes, we installed Postgres and Hadoop and ran some simple queries on a sample dataset directly against Hive tables. In this tutorial we will create an FDW (Foreign Data Wrapper) in Postgres and access the same data using psql within Postgres. If you have not completed the first tutorial, it is recommended that you at least ensure that you have set up BigSQL, created the Hive tables, and populated the Hive data.

Let’s Roll!


Verify that you have started BigSQL and run the setENV.sh script. (See the first tutorial if you need help with this step!)

The first step is to create the EXTENSION for the FDW. Run the psql interpreter and type the following:

CREATE EXTENSION hadoop_fdw;

Now we will create a SERVER using the EXTENSION we just created:

CREATE SERVER bigsql_server1 FOREIGN DATA WRAPPER hadoop_fdw
OPTIONS (url 'jdbc:hive://localhost:10000/default');

Next we will create a USER MAPPING for the SERVER we just created:

CREATE USER MAPPING FOR user SERVER bigsql_server1
OPTIONS (username 'user', password 'password');

Finally, we will create the FOREIGN TABLE with the mappings from the Hive data store:

CREATE FOREIGN TABLE salesdata (
 s_num       FLOAT,
 s_borough   INT,
 s_neighbor  TEXT,
 s_b_class   TEXT,
 s_c_p       TEXT,
 s_block     TEXT,
 s_lot       TEXT,
 s_easement  TEXT,
 w_c_p_2     TEXT,
 s_address   TEXT,
 s_app_num   TEXT,
 s_zip       TEXT,
 s_res_units TEXT,
 s_com_units TEXT,
 s_tot_units INT,
 s_sq_ft     FLOAT,
 s_g_sq_ft   FLOAT,
 s_yr_built  INT,
 s_tax_c     INT,
 s_b_class2  TEXT,
 s_price     FLOAT,
 s_sales_dt  TEXT)
 SERVER bigsql_server1
 OPTIONS (table 'salesdata');

Now, assuming everything went well, we should be able to run some queries against the Hive table!

SELECT * from salesdata LIMIT 5;

You should see 5 records from the table.

Take a quick look at the structure of the salesdata table:

\d salesdata;

postgres=# \d salesdata;

Foreign table "public.salesdata"
Column      |       Type       | Modifiers | FDW Options
------------+------------------+-----------+-------------
s_num       | double precision |           |
s_borough   | integer          |           |
s_neighbor  | text             |           |
s_b_class   | text             |           |
s_c_p       | text             |           |
s_block     | text             |           |
s_lot       | text             |           |
s_easement  | text             |           |
w_c_p_2     | text             |           |
s_address   | text             |           |
s_app_num   | text             |           |
s_zip       | text             |           |
s_res_units | text             |           |
s_com_units | text             |           |
s_tot_units | integer          |           |
s_sq_ft     | double precision |           |
s_g_sq_ft   | double precision |           |
s_yr_built  | integer          |           |
s_tax_c     | integer          |           |
s_b_class2  | text             |           |
s_price     | double precision |           |
s_sales_dt  | text             |           |
Server: bigsql_server1
FDW Options: ("table" 'salesdata')

Let's run a query to see the sales by neighborhood in Manhattan:

SELECT s_neighbor, sum(s_price) FROM  salesdata GROUP BY s_neighbor;

Results should look similar to:


postgres=# SELECT s_neighbor,SUM(s_price) FROM salesdata GROUP BY s_neighbor;

         s_neighbor         |    sum
 ---------------------------+------------
  FINANCIAL                 | 1402998865
  MANHATTAN-UNKNOWN         |          0
  LITTLE ITALY              |  250821741
  ALPHABET CITY             |  211818189
  MIDTOWN CBD               | 3203149051
  WASHINGTON HEIGHTS UPPER  |  330247015
  KIPS BAY                  |  422209848
  GRAMERCY                  | 1021019603
  HARLEM-UPPER              |  258439080
  GREENWICH VILLAGE-CENTRAL | 1210462635
  JAVITS CENTER             |  148603249
  UPPER WEST SIDE (79-96)   | 1422565321
  UPPER WEST SIDE (96-116)  |  595009679
  MORNINGSIDE HEIGHTS       |   29809357
  FLATIRON                  | 1612799353
  HARLEM-CENTRAL            |  791802512
  LOWER EAST SIDE           |  658796147
  UPPER EAST SIDE (96-110)  |  210717582
  WASHINGTON HEIGHTS LOWER  |  367326009
  EAST VILLAGE              |  468213911
  FASHION                   | 1714396813
  CIVIC CENTER              |  612244158
  MANHATTAN VALLEY          |  220308024
  MIDTOWN EAST              | 1294451495
  GREENWICH VILLAGE-WEST    | 1670496319
  SOHO                      | 2929228590
  SOUTHBRIDGE               |  978509618
  UPPER EAST SIDE (59-79)   | 4325223036
  HARLEM-WEST               |  160041248
  CHELSEA                   | 2250264798
  CLINTON                   |  806269355
  MURRAY HILL               | 1441188759
  UPPER EAST SIDE (79-96)   | 3159172312
  CHINATOWN                 |  288838109
  INWOOD                    |  120332695
  HARLEM-EAST               |  217612503
  TRIBECA                   | 1663415012
  MIDTOWN WEST              | 6728533426
  UPPER WEST SIDE (59-79)   | 3404609800
 (39 rows)

How many properties in each community?

SELECT s_neighbor, count(s_price) FROM salesdata GROUP BY s_neighbor;

postgres=# SELECT s_neighbor, count(*) FROM salesdata GROUP BY s_neighbor;

         s_neighbor         | count
 ---------------------------+-------
  FINANCIAL                 |   491
  MANHATTAN-UNKNOWN         |     3
  LITTLE ITALY              |   101
  ALPHABET CITY             |   128
  MIDTOWN CBD               |   235
  WASHINGTON HEIGHTS UPPER  |   350
  KIPS BAY                  |   304
  GRAMERCY                  |   504
  HARLEM-UPPER              |   161
  GREENWICH VILLAGE-CENTRAL |   699
  JAVITS CENTER             |    38
  UPPER WEST SIDE (79-96)   |  1028
  UPPER WEST SIDE (96-116)  |   461
  MORNINGSIDE HEIGHTS       |    65
  FLATIRON                  |   528
  HARLEM-CENTRAL            |   867
  LOWER EAST SIDE           |   439
  UPPER EAST SIDE (96-110)  |   131
  WASHINGTON HEIGHTS LOWER  |   186
  EAST VILLAGE              |   233
  FASHION                   |   249
  CIVIC CENTER              |   263
  MANHATTAN VALLEY          |   268
  MIDTOWN EAST              |  1196
  GREENWICH VILLAGE-WEST    |   710
  SOHO                      |   432
  SOUTHBRIDGE               |    70
  UPPER EAST SIDE (59-79)   |  2408
  HARLEM-WEST               |    67
  CHELSEA                   |   961
  CLINTON                   |   495
  MURRAY HILL               |   771
  UPPER EAST SIDE (79-96)   |  1929
  CHINATOWN                 |   175
  INWOOD                    |   117
  HARLEM-EAST               |   298
  TRIBECA                   |   694
  MIDTOWN WEST              |  5856
  UPPER WEST SIDE (59-79)   |  1985
 (39 rows)

You will notice that the results match those of the queries run directly in Hive. In our next tutorial we will add some additional data, create some cross-platform joins, and continue to exercise BigSQL.
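
As a taste of the cross-platform joins to come, here is a hedged sketch that joins the salesdata foreign table created above with a hypothetical local Postgres table, neighborhood_targets, which is not part of this tutorial's dataset:

    -- Hypothetical local table used only for illustration.
    CREATE TABLE neighborhood_targets (
        neighborhood TEXT PRIMARY KEY,
        target_sales NUMERIC
    );

    -- Join local Postgres data with the Hive-backed foreign table.
    SELECT t.neighborhood,
           SUM(s.s_price) AS actual_sales,
           t.target_sales
    FROM   salesdata s
    JOIN   neighborhood_targets t ON t.neighborhood = s.s_neighbor
    GROUP  BY t.neighborhood, t.target_sales;
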
For more information on BigSQL, click here.