BigSQL Hadoop + Postgres now Available as a VM

With the continued success of the BigSQL developer’s bundle, we are proud to announce the availability of a new VM image with BigSQL preinstalled.  Primarily created to provide our Windows friends access to the technology bundle, the Image is based on CentOS 6.4 Desktop including many tools to begin exploring today.  Join our free Beta program now!  

We see this as a quick way to open up the platform for Windows clients while we complete our native Windows bundle distribution which is on our roadmap and should be in alpha by late Q1 of 2014.

The beta includes all of the standard BigSQL components plus:

  • Hadoop-2.2
  • Hive-0.12
  • HBase-0.96
At OpenSCG we are focused on integrating existing SQL infrastructure WITH NoSQL extensions to ensure our clients get the most out of their investment dollars.  We understand DATA and thrive on helping our clients get maximum value out of DATA!

Installing and Running BigSQL, the Hadoop and Postgres Bundle

This tutorial will show you how to install and run BigSQL. You can download the bundle from and get started using Hadoop, Hive, Hbase, and PostgreSQL in less than ten minutes. The video below goes over how to:

  • Complete the prerequisites (the commands used can be found here)
  • Start the bundle
  • Open the psql, hive, and hbase shells
  • Use the web administration console

[youtube url=””]

Subscription OpenSCG

Hadoop Tutorials Hive HBase Flume Scoop Thrift and More

Over the last several months the team at OpenSCG has created a number of Hadoop Tutorials covering key  projects in the Hadoop ecosystem including Hive, HBase, Flume, Scoop and Thrift.  These Tutorials are designed to quickly highlight the usage of Hadoop with working examples that can be quickly worked through and modified!  To help you find what you are looking for here is a quick overview of our tutorials complete with links!

As a prerequisite we use BigSQL as our developer bundle.  This can be installed on OSX and Linux in under 10 minutes and includes all the components needed for the tutorial series.

BigSQL comes with its own self contained tutorial that can be found here: BigSQL Tutorial 

A great place to start is the Hadoop Hive Tutorial Zero to Results in under 15 minutes. In addition to providing examples of Hive in action, the tutorial covers the installation of BigSql for first timers.

Interested in Hbase?  Check out our Hadoop HBase Tutorial

BigSQL bundles Postgres into the distribution.  This makes it very easy to learn Scoop and Flume!

Starting with Scoop we have  Hadoop Scoop Tutorial HDFS – Postgres made easy!

And Flume at Using Hadoop to Flume Twitter Data 

And finally we take a look at Thrift here Hadoop Thrift Tutorial

Keep checking back as we plan on continuing to create tutorials to help demystify Hadoop with real world working examples!!


Hadoop Sqoop Tutorial HDFS Hive & Postgres

The Hadoop Sqoop Tutorial is the third in the series providing easy to follow examples of using a Hadoop cluster in conjunction with Postgres.  This tutorial will build on the Hadoop Hive Tutorial which should be considered a pre requisite if you want to run the examples provided. In this tutorial, we will create a Postgres table from public domain data provided by the New York City health department.  The data has been slightly cleaned in the version attached to this blog article to save everyone time :-)   This data will then be moved to HDFS Hive using the Sqoop IMPORT command.   Finally we will run some Hive queries and Postgres queries to validate the transfer!  Assuming all of the prerequisites are in place, this should be a 15 minute exercise.

Sqoop is really not that difficult after getting your environment up and running!  The best way to learn Sqoop is to jump in and start testing out the various options and configuration setting Sqoop makes available.

Let’s Get Started!!

First though let’s just make sure that Sqoop is available and ready to go:

  • Change to the tutorial directory: (if your not already there)

$ cd tutorial

  • check the version of sqoop you are running:

$ sqoop version

Sqoop 1.4.3 git commit id 82fde21d0f6c89059bade4588f3ba6989989dc76 Compiled by hari on Tue Feb 26 11:30:27 PST 2013 

If you do not see a simular response go back to the first tutorial and ensure you have Hadoop properly installed.

Now let’s create a table in Postgres to hold the data we will eventually Import into HDFS using Sqoop:

Here is a zip file containing a subset of the data in CSV format Tutorial Data: Sample Data

  • Unzip the data in the tutorial directory.  

Use the PostgreSQL COPY command to load the data into the Postgres tables.

  • Of course we need to create the tables before using the COPY command.

CREATE  TABLE nycrestdata (
s_camis     TEXT,
s_dba       TEXT,
s_boro      TEXT,
s_building  TEXT,
s_street    TEXT,
s_zipcode   TEXT,
s_phone     TEXT,
s_cuscode   TEXT,
s_insp_date TEXT,
s_action    TEXT,
s_viocode   TEXT,
s_score     TEXT,
s_grade     TEXT,
s_g_date    TEXT,
s_r_date    TEXT

Copy, "paste into psql" and run the above code to create the table.

  • Now use the COPY command to load the text file into the table:

COPY nycrestdata FROM '/home/YOURHOME/bigsql/bigsql-9.3beta2-2/tutorial/restdata.csv' (DELIMITER ','); 

Sqoop prefers data with a Primary Key.

  • Add a PRIMARY KEY to the data:

dont forget to \q out of Postgres.

Lets verify that Sqoop can "see the table" in PostgreSQL:
$ sqoop list-tables --connect jdbc:postgresql://localhost:5432/postgres --username postgres --password password 

13/06/20 17:48:43 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead. 
13/06/20 17:48:43 INFO manager.SqlManager: Using default fetchSize of 1000 nycrestdata   

Great there is our table nycrestdata!   
before moving further lets quickly  create a Sqoop Options file to store the parameters we will be using repetitively. 
Here is the options file contents, create a file called sqoop_parms.txt containing this data and save it to the tutorial directory:

  • Create the Options File

# Options file for Sqoop import

# Connect parameter and value
# Username parameter and value
# Password

Now we will look at the tables in PostgreSQL again using the sqoop_parm.txt file:

sqoop list-tables --options-file /home/YOURHOME/bigsql/bigsql-9.3beta2-2/tutorial/Sqoop_parms.txt

Now let’s import the data from PostgreSQL to HDFS!

sqoop import --options-file /home/YOURHOME/bigsql/bigsql-9.3beta2-2/tutorial/Sqoop_parms.txt --table nycrestdata --hive-import

You should see the Map Reducer jobs executed, assuming success we should now have the tables in both PostgreSQL and Hadoop Hive.

Lets run some queries to ensure that the import worked and the data matches.

Lets take a quick look at the number of restaurants by each score: (Multiple scores per establishment may be in the data!)


  • Run a few  SQL  Test Queries

SELECT s_grade,count(*) FROM nycrestdata group by s_grade;

JFI a grade of Z is a temporary grade until the data is updated!  The C’s you have to look out for!!

SELECT DISTINCT s_dba from nycrestdata where s_grade LIKE '%C%';

Now lets see if the Hive data correlates!

from the hive> prompt

SELECT s_grade,count(*) FROM nycrestdata group by s_grade;

And now for the C’s again!

SELECT DISTINCT s_dba from nycrestdata where s_grade LIKE 'C';

Looks like the data matches as Expected! Our next tutorial will be extended functions of the FDW.

If you have not seen the foreign data wrapper in action check out this tutorial!


Meetup: Leveraging Hadoop in your PostgreSQL Environment

Our very own Jim “Jimbo” Mlodgenski will be hosting a Meetup Thursday June 27th at 6pm to discuss: Leveraging Hadoop in your PostgreSQL Environment!

The event will be graciously held at:
199 Lafayette Street
Suite 3B
New York, NY

To register click here


There is buzz everywhere about Apache Hadoop and rightfully so. It’s an open-source framework that enables the construction of distributed, data-intensive applications running on clusters of commodity hardware. In short, it’s letting people get real value out of massive amounts of data without spending a fortune on proprietary software or high end hardware. But where does this leave the traditional relational databases like PostgreSQL?

This talk will begin with a discussion of the strengths of PostgreSQL and Hadoop. We will then lead into a high level overview of Hadoop and its community of projects like Hive, Flume and Sqoop. Finally, we will dig down into various use cases detailing how you can leverage Hadoop technologies for your PostgreSQL databases today. The use cases will range from using HDFS for simple database backups to using PostgreSQL and Foreign Data Wrappers to do low latency analytics on your Big Data.


PostgreSQL FDW to Hadoop Hive Tutorial

PostgreSQL Foreign Data Wrapper to Hadoop Hive Tutorial 2

In our last tutorial, Hadoop Hive Tutorial – Zero to Results in 15 Minutes, we installed Postgres and Hadoop and ran some simple queries against a sample dataset directly against Hive tables.  In this tutorial we will create a FDW (Foreign Data Wrapper) in Postgres and access the same data using psql within Postgres.  If you have not completed the first tutorial, it is recommended that you at least ensure that you have set up bigsql, created the hive tables and populated the hive data.

Let’s Roll!

Verify that you have started BigSQL and ran the script.  (See the first tutorial if you need help with this step!)

The first step is to create the EXTENSION for the FDW, run the pslq interpreter and type the following:


Now we will create a SERVER using the EXTENSION we just created:

CREATE SERVER bigsql_server1 FOREIGN DATA WRAPPER hadoop_fdw
OPTIONS (url 'jdbc:hive://localhost:10000/<wbr>default');

Next we will create a USER MAPPING for the SERVER we just created:

OPTIONS (username 'user', password 'password');

Finally, we will create the FOREIGN TABLE with the mappings from the Hive data store:

 s_num       FLOAT,
 s_borough   INT,
 s_neighbor  TEXT,
 s_b_class   TEXT,
 s_c_p       TEXT,
 s_block     TEXT,
 s_lot       TEXT,
 s_easement  TEXT,
 w_c_p_2     TEXT,
 s_address   TEXT,
 s_app_num   TEXT,
 s_zip       TEXT,
 s_res_units TEXT,
 s_com_units TEXT,
 s_tot_units INT,
 s_sq_ft     FLOAT,
 s_g_sq_ft   FLOAT,
 s_yr_built  INT,
 s_tax_c     INT,
 s_b_class2  TEXT,
 s_price     FLOAT,
 s_sales_dt  TEXT)
 SERVER bigsql_server1
 OPTIONS (table 'salesdata');

Now Assuming everything went well… We should be able to run some queries against the Hive table!

SELECT * from salesdata LIMIT 5;

You should see 5 records from the table.

  • Take a quick look at the structure of the salesdata table:

\d salesdata;

postgres=# \d salesdata;

Foreign table "public.salesdata"
Column      |       Type       | Modifiers | FDW Options
s_num       | double precision |           |
s_borough   | integer          |           |
s_neighbor  | text             |           |
s_b_class   | text             |           |
s_c_p       | text             |           |
s_block     | text             |           |
s_lot       | text             |           |
s_easement  | text             |           |
w_c_p_2     | text             |           |
s_address   | text             |           |
s_app_num   | text             |           |
s_zip       | text             |           |
s_res_units | text             |           |
s_com_units | text             |           |
s_tot_units | integer          |           |
s_sq_ft     | double precision |           |
s_g_sq_ft   | double precision |           |
s_yr_built  | integer          |           |
s_tax_c     | integer          |           |
s_b_class2  | text             |           |
s_price     | double precision |           |
s_sales_dt  | text             |           |
Server: bigsql_server1
FDW Options: ("table" 'salesdata')
  • Lets run a query to see the sales by neighborhood in Manhattan;

SELECT s_neighbor, sum(s_price) FROM  salesdata GROUP BY s_neighbor;

Results should look simular to:

postgres=# SELECT s_neighbor,SUM(s_price) FROM salesdata GROUP BY s_neighbor;

         s_neighbor         |    sum
  FINANCIAL                 | 1402998865
  MANHATTAN-UNKNOWN         |          0
  LITTLE ITALY              |  250821741
  ALPHABET CITY             |  211818189
  MIDTOWN CBD               | 3203149051
  KIPS BAY                  |  422209848
  GRAMERCY                  | 1021019603
  HARLEM-UPPER              |  258439080
  JAVITS CENTER             |  148603249
  UPPER WEST SIDE (79-96)   | 1422565321
  UPPER WEST SIDE (96-116)  |  595009679
  MORNINGSIDE HEIGHTS       |   29809357
  FLATIRON                  | 1612799353
  HARLEM-CENTRAL            |  791802512
  LOWER EAST SIDE           |  658796147
  UPPER EAST SIDE (96-110)  |  210717582
  EAST VILLAGE              |  468213911
  FASHION                   | 1714396813
  CIVIC CENTER              |  612244158
  MANHATTAN VALLEY          |  220308024
  MIDTOWN EAST              | 1294451495
  GREENWICH VILLAGE-WEST    | 1670496319
  SOHO                      | 2929228590
  SOUTHBRIDGE               |  978509618
  UPPER EAST SIDE (59-79)   | 4325223036
  HARLEM-WEST               |  160041248
  CHELSEA                   | 2250264798
  CLINTON                   |  806269355
  MURRAY HILL               | 1441188759
  UPPER EAST SIDE (79-96)   | 3159172312
  CHINATOWN                 |  288838109
  INWOOD                    |  120332695
  HARLEM-EAST               |  217612503
  TRIBECA                   | 1663415012
  MIDTOWN WEST              | 6728533426
  UPPER WEST SIDE (59-79)   | 3404609800
 (39 rows)

How many properties in each community?

SELECT s_neighbor, count(s_price) FROM salesdata GROUP BY s_neighbor;

postgres=# SELECT s_neighbor, count(*) FROM salesdata GROUP BY s_neighbor;

         s_neighbor         | count
  FINANCIAL                 |   491
  MANHATTAN-UNKNOWN         |     3
  LITTLE ITALY              |   101
  ALPHABET CITY             |   128
  MIDTOWN CBD               |   235
  KIPS BAY                  |   304
  GRAMERCY                  |   504
  HARLEM-UPPER              |   161
  JAVITS CENTER             |    38
  UPPER WEST SIDE (79-96)   |  1028
  UPPER WEST SIDE (96-116)  |   461
  FLATIRON                  |   528
  HARLEM-CENTRAL            |   867
  LOWER EAST SIDE           |   439
  UPPER EAST SIDE (96-110)  |   131
  EAST VILLAGE              |   233
  FASHION                   |   249
  CIVIC CENTER              |   263
  MANHATTAN VALLEY          |   268
  MIDTOWN EAST              |  1196
  SOHO                      |   432
  SOUTHBRIDGE               |    70
  UPPER EAST SIDE (59-79)   |  2408
  HARLEM-WEST               |    67
  CHELSEA                   |   961
  CLINTON                   |   495
  MURRAY HILL               |   771
  UPPER EAST SIDE (79-96)   |  1929
  CHINATOWN                 |   175
  INWOOD                    |   117
  HARLEM-EAST               |   298
  TRIBECA                   |   694
  MIDTOWN WEST              |  5856
  UPPER WEST SIDE (59-79)   |  1985
 (39 rows)

You will notice the results match the results the queries run directly in Hive. In our next tutorial we will add some additional data, create some cross platform joins and continue to exercise BigSQL.
For more information on BigSQL, click here.


Hadoop Hive Tutorial – Zero to Results in 15 Minutes

This Tutorial will walk through using Hadoop’s Hive to access data stored in HDFS.  This is the first tutorial (more to follow) and covers installing Hadoop as part of BigSQL and running some simple queries.

The Platform:  Linux (RHEL, CentOS or Ubuntu) or Mac OSX.  The Hadoop and Postgres Install was from  Takes about 10 – 15 minutes to install the developer bundle and get up and running.  ( dont skip the pre-requisites  ).

The Data: The original source data is from the NYC finance department and contains Rolling Sales files for the last 12 months in New York City.   The data for Manhattan was used in the tutorial.   A slightly cleaned file that can be used for the tutorial is available below.

Let’s get started!

  • If you have not previously installed bigsql,  download the bundle and tar the file.  Make sure to use the exact filename downloaded:

tar -xf bigsql-9.3xxx-linux64.tar.bz2

  • navigate to the home directory of your BigSQL install.  

I have installed BigSQL in the following directory:  /home/cgerth/bigsql/bigsql-9.3beta1-7   simply  cd to this directory

  • start BigSQL

./bigsql start

  • execute the environment script


  • create a tutorial directory, change to that directory

mkdir tutorial

cd tutorial

  • put the following three files in the tutorial directory: ex2data.csv, ex1createHiveTable.sql and ex1loadHiveTables.sql .  The files are zipped here

use the tool of your choice!

  • create the hive table structure.  Take a look at the ex1createHiveTable.sql script which is used to create the Hive Table.

s_num FLOAT,
s_borough INT,
s_neighbor STRING,
s_b_class STRING,
s_c_p STRING,
s_block STRING,
s_lot STRING,
s_easement STRING,
w_c_p_2 STRING,
s_address STRING,
s_app_num STRING,
s_zip STRING,
s_res_units STRING,
s_com_units STRING,
s_tot_units INT,
s_sq_ft FLOAT,
s_g_sq_ft FLOAT,
s_yr_built INT,
s_tax_c INT,
s_b_class2 STRING,
s_price FLOAT,
s_sales_dt STRING


$ hive -v -f ex1createHiveTables.sql

  • load the datafile into the structure created.  The ex1loadHiveTable.sql script can be used for this.  Take a look at the script before running.


hive -v -f ex1loadHiveTables.sql

  • Run a quick query to test the data load.  Run the following command and you should see the 5 records from the table.


SELECT * from salesdata LIMIT 5;

You should see 5 records from the table.

  • Take a quick look at the hive table created.

DESCRIBE salesdata;

col_name    data_type    comment
s_num    float    
s_borough    int    
s_neighbor    string    
s_b_class    string    
s_c_p    string    
s_block    string    
s_lot    string    
s_easement    string    
w_c_p_2    string    
s_address    string    
s_app_num    string    
s_zip    string    
s_res_units    string    
s_com_units    string    
s_tot_units    int    
s_sq_ft    float    
s_g_sq_ft    float    
s_yr_built    int    
s_tax_c    int    
s_b_class2    string    
s_price    float    
s_sales_dt    string    
Time taken: 0.194 seconds
  • Lets run a query to see the sales by neighborhood in Manhattan;

SELECT s_neighbor, sum(s_price) FROM  salesdata GROUP BY s_neighbor;

Results should look simular to:
ALPHABET CITY 2.11818189E8
CHELSEA 2.250264798E9
CHINATOWN 2.88838109E8
CIVIC CENTER 6.12244158E8
CLINTON 8.06269355E8
EAST VILLAGE 4.68213911E8
FASHION 1.714396813E9
FINANCIAL 1.402998865E9
FLATIRON 1.612799353E9
GRAMERCY 1.021019603E9
HARLEM-EAST 2.17612503E8
HARLEM-UPPER 2.5843908E8
HARLEM-WEST 1.60041248E8
INWOOD 1.20332695E8
JAVITS CENTER 1.48603249E8
KIPS BAY 4.22209848E8
LITTLE ITALY 2.50821741E8
LOWER EAST SIDE 6.58796147E8
MIDTOWN CBD 3.203149051E9
MIDTOWN EAST 1.294451495E9
MIDTOWN WEST 6.728533424E9
MURRAY HILL 1.441188759E9
SOHO 2.929228594E9
SOUTHBRIDGE 9.78509618E8
TRIBECA 1.663415012E9
UPPER EAST SIDE (59-79) 4.325223036E9
UPPER EAST SIDE (79-96) 3.159172312E9
UPPER EAST SIDE (96-110) 2.10717582E8
UPPER WEST SIDE (59-79) 3.4046098E9
UPPER WEST SIDE (79-96) 1.422565321E9
UPPER WEST SIDE (96-116) 5.95009679E8
Time taken: 3.656 seconds
  • How many properties in each community?

SELECT s_neighbor, count(s_price) FROM salesdata GROUP BY s_neighbor;

ALPHABET CITY                128
 CHELSEA                      961
 CHINATOWN                    175
 CIVIC CENTER                 263
 CLINTON                      495
 EAST VILLAGE                 233
 FASHION                      249
 FINANCIAL                    491
 FLATIRON                     528
 GRAMERCY                     504
 HARLEM-CENTRAL               867
 HARLEM-EAST                  298
 HARLEM-UPPER                 161
 HARLEM-WEST                  67
 INWOOD                       117
 JAVITS CENTER                38
 KIPS BAY                     304
 LITTLE ITALY                 101
 LOWER EAST SIDE              439
 MANHATTAN VALLEY             268
 MIDTOWN CBD                  235
 MIDTOWN EAST                 1196
 MIDTOWN WEST                 5856
 MURRAY HILL                  771
 SOHO                         432
 SOUTHBRIDGE                  70
 TRIBECA                      694
 UPPER EAST SIDE (59-79)      2408
 UPPER EAST SIDE (79-96)      1929
 UPPER EAST SIDE (96-110)     131
 UPPER WEST SIDE (59-79)      1985
 UPPER WEST SIDE (79-96)      1028
 UPPER WEST SIDE (96-116)     461
 Time taken: 3.79 seconds
  • Interesting that UNKNOW had 3 transactions but no $ volume, lets take a closer look:

SELECT s_price FROM salesdata WHERE s_neighbor LIKE  ‘%UNKNOWN%’

Mapred Local Task Succeeded . Convert the Join into MapJoin
 Time taken: 3.479 seconds

Yes, as expected they are all zero value transactions.

This completes the first starting Hive tutorial.

The next Tutorial in the series is PostgreSQL FDW to Hive  .

Please share feedback and look for more tutorials moving forward.

Click here for more information on BigSql!



SQL vs NoSQL Fireworks not Included


SQL has a lot going for it and that’s not going to change for quite a while.  Let’s face it, real big investments are being made every day on products like Oracle, SQL Server, PostgreSQL, Teradata, Firebird… you get the idea.  WikipediA provides a great comparison of relational database management systems listing over 50 players active in the space today.  They almost all provide ACID properties, Referential Integrity and Transactional support.  Traditional SQL solutions start to run into problems when dealing with extremely large amounts of data, or are faced with unstructured – semi structured data or are asked to run across large clusters of processing units.  There have been a number of solutions developed to handle these scenarios including Hadoop, Cassandra, HPCC, MongoDB… once again you get the idea!

Of course whenever a new technology hits the market the fireworks begin.  Some of the “pure” NoSQL proponents pushed the idea that SQL would shortly hit its end of life.  The arguments continue comparing the virtues of the new technology often taking shoots at good ole SQL.  But, as time marches forward there seems to be more alignment that its not either or but… both.  Realizing SQL has a tremendous following for data query both Hortonworks and Cloudera have recently launched improvements over Hadoop’s Hive with  Stinger and Impala.  Both tools provide “faster” query capability using a SQL like syntax.  An early mover in the space, Hadapt claims to natively integrate SQL with Hadoop.  Even IBM has joined the band and has launched Big SQL with a space ;)  providing SQL like syntax across a Hadoop Cluster. 

Our bet is better tools will continue to evolve, making it easier to glean insights from NoSQL clusters.  Additionally, we look for tighter integration between trusted SQL databases and NoSQL datastores over time.  Our BigSQL distribution and community looks to achieve both these objectives combining the power of Hadoop and PostgreSQL while making it easier to leverage the strengths of BOTH solutions in the real world.


OwnCloud: An Open Source DropBox

I’m very impressed with what this open source community has done. Take a look at for yourself. We are running an internal pilot of this at OpenSCG for the last few months. AFAIK it’s been very stable for us and it’s pretty feature rich.