The Hadoop Sqoop Tutorial is the third in a series providing easy-to-follow examples of using a Hadoop cluster in conjunction with Postgres. This tutorial builds on the Hadoop Hive Tutorial, which should be considered a prerequisite if you want to run the examples provided. In this tutorial, we will create a Postgres table from public domain data provided by the New York City health department. The data has been slightly cleaned in the version attached to this blog article to save everyone time 🙂 This data will then be moved to HDFS Hive using the Sqoop import command. Finally, we will run some Hive queries and Postgres queries to validate the transfer! Assuming all of the prerequisites are in place, this should be a 15-minute exercise.
Sqoop is really not that difficult once your environment is up and running! The best way to learn Sqoop is to jump in and start testing the various options and configuration settings Sqoop makes available.
Let’s Get Started!!
First, though, let's make sure that Sqoop is available and ready to go:
- Change to the tutorial directory (if you're not already there):
$ cd tutorial
- Check the version of Sqoop you are running:
$ sqoop version
Sqoop 1.4.3 git commit id 82fde21d0f6c89059bade4588f3ba6
If you do not see a similar response, go back to the first tutorial and ensure you have Hadoop properly installed.
Now let’s create a table in Postgres to hold the data we will eventually Import into HDFS using Sqoop:
Here is a zip file containing a subset of the data in CSV format, attached as Tutorial Data: Sample Data
- Unzip the data in the tutorial directory.
Use the PostgreSQL COPY command to load the data into the Postgres tables.
- Of course, we need to create the table before using the COPY command.
$ psql
CREATE TABLE nycrestdata (
    s_camis TEXT,
    s_dba TEXT,
    s_boro TEXT,
    s_building TEXT,
    s_street TEXT,
    s_zipcode TEXT,
    s_phone TEXT,
    s_cuscode TEXT,
    s_insp_date TEXT,
    s_action TEXT,
    s_viocode TEXT,
    s_score TEXT,
    s_grade TEXT,
    s_g_date TEXT,
    s_r_date TEXT
);
Copy and paste the above code into psql to create the table.
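Before loading anything, it can save a round of COPY errors to confirm that the CSV really has the 15 columns the table expects. A minimal sketch (the sample row below is invented for illustration; point awk at your real restdata.csv):

```shell
# Hypothetical one-row sample standing in for restdata.csv
echo "40362098,JOES PIZZA,MANHATTAN,7,CARMINE ST,10014,2123661182,02,2013-02-07,D,10F,12,A,2013-02-07,2013-03-01" > /tmp/restdata_sample.csv
# Count comma-separated fields; this should print 15 to match the table
# definition above. (Assumes no quoted commas inside fields.)
awk -F',' 'NR==1 { print NF }' /tmp/restdata_sample.csv
```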
- Now use the COPY command to load the text file into the table:
COPY nycrestdata FROM '/home/YOURHOME/bigsql/bigsql-9.3beta2-2/tutorial/restdata.csv' (DELIMITER ',');
Sqoop prefers data with a Primary Key.
- Add a PRIMARY KEY to the data:
ALTER TABLE nycrestdata ADD pk SERIAL PRIMARY KEY;
Don't forget to \q out of Postgres.
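Why does Sqoop care about a primary key? By default it uses the key as its split column: it queries the minimum and maximum values and divides that range among its mappers. A rough sketch of the arithmetic, with made-up numbers (real Sqoop runs SELECT MIN(pk), MAX(pk) against the database itself):

```shell
# Illustrative split calculation, similar in spirit to Sqoop's default
# integer splitter. MIN/MAX here are invented stand-ins.
MIN=1; MAX=1000; MAPPERS=4
SIZE=$(( (MAX - MIN + 1) / MAPPERS ))
for i in 0 1 2 3; do
  LO=$(( MIN + i * SIZE ))
  HI=$(( LO + SIZE - 1 ))
  echo "mapper $i: WHERE pk >= $LO AND pk <= $HI"
done
```

If a table has no primary key, Sqoop will ask you to supply a split column with --split-by instead.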
Let's verify that Sqoop can "see the table" in PostgreSQL:
$ sqoop list-tables --connect jdbc:postgresql://localhost:5432/postgres --username postgres --password password
13/06/20 17:48:43 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
13/06/20 17:48:43 INFO manager.SqlManager: Using default fetchSize of 1000
nycrestdata
Great, there is our table nycrestdata!
Before moving further, let's quickly create a Sqoop options file to store the parameters we will be using repeatedly.
Here are the options file contents; create a file called sqoop_parms.txt containing this data and save it to the tutorial directory:
- Create the Options File
#
# Options file for Sqoop import
#
# Connect parameter and value
--connect jdbc:postgresql://localhost:5432/postgres
# Username parameter and value
--username postgres
#
# Password
--password password
Now we will look at the tables in PostgreSQL again, this time using the sqoop_parms.txt file:
$ sqoop list-tables --options-file /home/YOURHOME/bigsql/bigsql-9.3beta2-2/tutorial/sqoop_parms.txt
Now let’s import the data from PostgreSQL to HDFS!
$ sqoop import --options-file /home/YOURHOME/bigsql/bigsql-9.3beta2-2/tutorial/sqoop_parms.txt --table nycrestdata --hive-import
You should see the MapReduce jobs execute; assuming success, we now have the table in both PostgreSQL and Hadoop Hive.
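As an optional spot check, you can also list the files Sqoop wrote for the Hive table in HDFS. The path below assumes Hive's default warehouse directory, which may differ on your install:

```shell
# Default Hive warehouse location is an assumption; adjust for your setup.
hadoop fs -ls /user/hive/warehouse/nycrestdata
```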
Let's run some queries to ensure that the import worked and the data matches.
Let's take a quick look at the number of restaurants by grade (an establishment may appear multiple times in the data!):
- Run a few SQL Test Queries
SELECT s_grade, count(*) FROM nycrestdata GROUP BY s_grade;
FYI: a grade of Z is a temporary grade until the data is updated! It's the C's you have to look out for!!
SELECT DISTINCT s_dba FROM nycrestdata WHERE s_grade LIKE '%C%';
Now let's see if the Hive data correlates!
From the hive> prompt:
SELECT s_grade, count(*) FROM nycrestdata GROUP BY s_grade;
And now for the C’s again!
SELECT DISTINCT s_dba FROM nycrestdata WHERE s_grade LIKE '%C%';
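Eyeballing two result sets works for a handful of grades, but a mechanical check is more convincing. Here is a small sketch of one way to do it, using invented stand-in files (in practice you would redirect each engine's query output to a file first):

```shell
# Invented stand-in results; replace with real query output files.
printf 'A\t120\nB\t45\nC\t12\n' > /tmp/pg_grades.txt
printf 'B\t45\nA\t120\nC\t12\n' > /tmp/hive_grades.txt
# Row order can differ between Postgres and Hive, so sort before comparing.
sort /tmp/pg_grades.txt > /tmp/pg_sorted.txt
sort /tmp/hive_grades.txt > /tmp/hive_sorted.txt
diff /tmp/pg_sorted.txt /tmp/hive_sorted.txt && echo "data matches"
```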
Looks like the data matches as expected! Our next tutorial will cover extended functions of the FDW.
If you have not seen the foreign data wrapper in action, check out this tutorial!