The Hadoop Sqoop Tutorial is the third in a series providing easy-to-follow examples of using a Hadoop cluster in conjunction with Postgres. This tutorial builds on the Hadoop Hive Tutorial, which should be considered a prerequisite if you want to run the examples provided. In this tutorial, we will create a Postgres table from public domain data provided by the New York City health department. The data has been slightly cleaned in the version attached to this blog article to save everyone time 🙂 This data will then be moved into Hive on HDFS using the Sqoop import command. Finally, we will run some Hive queries and Postgres queries to validate the transfer! Assuming all of the prerequisites are in place, this should be a 15 minute exercise.

Sqoop is really not that difficult once your environment is up and running!  The best way to learn Sqoop is to jump in and start testing the various options and configuration settings Sqoop makes available.

Let’s Get Started!!

First, though, let's make sure that Sqoop is available and ready to go:

  • Change to the tutorial directory (if you're not already there):

$ cd tutorial

  • Check the version of Sqoop you are running:

$ sqoop version

Sqoop 1.4.3 git commit id 82fde21d0f6c89059bade4588f3ba6989989dc76 Compiled by hari on Tue Feb 26 11:30:27 PST 2013 

If you do not see a similar response, go back to the first tutorial and ensure you have Hadoop properly installed.

Now let’s create a table in Postgres to hold the data we will eventually import into HDFS using Sqoop:

Here is a zip file containing a subset of the data in CSV format: Tutorial Data: Sample Data

  • Unzip the data in the tutorial directory.  

Use the PostgreSQL COPY command to load the data into the Postgres table.

  • Of course we need to create the table before using the COPY command.
$ psql

CREATE TABLE nycrestdata (
s_camis     TEXT,
s_dba       TEXT,
s_boro      TEXT,
s_building  TEXT,
s_street    TEXT,
s_zipcode   TEXT,
s_phone     TEXT,
s_cuscode   TEXT,
s_insp_date TEXT,
s_action    TEXT,
s_viocode   TEXT,
s_score     TEXT,
s_grade     TEXT,
s_g_date    TEXT,
s_r_date    TEXT
);

Copy and paste the above code into psql and run it to create the table.

  • Now use the COPY command to load the text file into the table:

COPY nycrestdata FROM '/home/YOURHOME/bigsql/bigsql-9.3beta2-2/tutorial/restdata.csv' (DELIMITER ',');
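
As an optional sanity check, you can confirm that COPY actually loaded rows before moving on (the exact count depends on the sample file you downloaded):

SELECT count(*) FROM nycrestdata;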

Sqoop prefers tables with a primary key; by default it uses the key column to split the import across parallel map tasks.

  • Add a PRIMARY KEY to the data:

ALTER TABLE nycrestdata ADD pk SERIAL PRIMARY KEY;

Don't forget to \q out of Postgres.
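
As an aside, you won't need this here because we just added a primary key, but if you ever import a table that has no primary key, Sqoop must be told how to parallelize the work. A minimal sketch, reusing the connection details from this tutorial: --split-by names the column used to divide rows among the mappers, and -m sets the number of mappers (use -m 1 to skip splitting entirely). The -P flag makes Sqoop prompt for the password instead of putting it on the command line.

$ sqoop import --connect jdbc:postgresql://localhost:5432/postgres --username postgres -P --table nycrestdata --split-by pk --hive-import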

Let's verify that Sqoop can "see the table" in PostgreSQL:
$ sqoop list-tables --connect jdbc:postgresql://localhost:5432/postgres --username postgres --password password 

13/06/20 17:48:43 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead. 
13/06/20 17:48:43 INFO manager.SqlManager: Using default fetchSize of 1000
nycrestdata

Great, there is our table nycrestdata!
Before moving further, let's quickly create a Sqoop options file to store the parameters we will be using repeatedly.
Here are the options file contents; create a file called sqoop_parms.txt containing this data and save it to the tutorial directory:

  • Create the Options File

#
# Options file for Sqoop import
#
# Connect parameter and value
--connect
jdbc:postgresql://localhost:5432/postgres
# Username parameter and value
--username
postgres
#
# Password
--password
password
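
Note that keeping a clear-text password in the options file is only acceptable for a tutorial; the warning Sqoop printed earlier applies here too. A slightly safer sketch (assuming you remove the --password and password lines from the file) is to let Sqoop prompt for the password with -P on the command line:

$ sqoop list-tables --options-file /home/YOURHOME/bigsql/bigsql-9.3beta2-2/tutorial/sqoop_parms.txt -P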

Now we will look at the tables in PostgreSQL again, this time using the sqoop_parms.txt options file:

sqoop list-tables --options-file /home/YOURHOME/bigsql/bigsql-9.3beta2-2/tutorial/sqoop_parms.txt

Now let’s import the data from PostgreSQL to HDFS!

sqoop import --options-file /home/YOURHOME/bigsql/bigsql-9.3beta2-2/tutorial/sqoop_parms.txt --table nycrestdata --hive-import

You should see the MapReduce jobs execute; assuming success, we now have the table in both PostgreSQL and Hadoop Hive.
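
If you want a quick confirmation that the table really landed in Hive before running the comparison queries, these standard Hive statements (run from the hive> prompt) list the tables and describe the imported one:

SHOW TABLES;
DESCRIBE nycrestdata;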

Let's run some queries to ensure that the import worked and the data matches.

Let's take a quick look at the number of restaurants at each grade (there may be multiple inspections per establishment in the data!):

$ psql

  • Run a few SQL test queries

SELECT s_grade, count(*) FROM nycrestdata GROUP BY s_grade;

FYI, a grade of Z is a temporary grade until the data is updated! The C's are the ones you have to look out for!!

SELECT DISTINCT s_dba FROM nycrestdata WHERE s_grade LIKE '%C%';

Now let's see if the Hive data matches!

From the hive> prompt:

SELECT s_grade, count(*) FROM nycrestdata GROUP BY s_grade;

And now for the C’s again!

SELECT DISTINCT s_dba FROM nycrestdata WHERE s_grade LIKE '%C%';
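
As one more cross-check, the same count statement runs unchanged in both psql and the hive> prompt; the total should match the row count you saw in Postgres after the COPY:

SELECT count(*) FROM nycrestdata;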

Looks like the data matches as expected! Our next tutorial will cover extended functionality of the Foreign Data Wrapper (FDW).

If you have not seen the Foreign Data Wrapper in action, check out this tutorial!