PostgreSQL Foreign Data Wrapper to Hadoop Hive Tutorial 2

In our last tutorial, Hadoop Hive Tutorial – Zero to Results in 15 Minutes, we installed Postgres and Hadoop and ran some simple queries against a sample dataset directly against Hive tables.  In this tutorial we will create a FDW (Foreign Data Wrapper) in Postgres and access the same data using psql within Postgres.  If you have not completed the first tutorial, it is recommended that you at least ensure that you have set up bigsql, created the hive tables and populated the hive data.

Let’s Roll!


Verify that you have started BigSQL and ran the setENV.sh script.  (See the first tutorial if you need help with this step!)

The first step is to create the EXTENSION for the FDW, run the pslq interpreter and type the following:

CREATE EXTENSION hadoop_fdw;

Now we will create a SERVER using the EXTENSION we just created:

CREATE SERVER bigsql_server1 FOREIGN DATA WRAPPER hadoop_fdw
OPTIONS (url 'jdbc:hive://localhost:10000/<wbr>default');

Next we will create a USER MAPPING for the SERVER we just created:

CREATE USER MAPPING FOR user SERVER bigsql_server1
OPTIONS (username 'user', password 'password');

Finally, we will create the FOREIGN TABLE with the mappings from the Hive data store:

CREATE FOREIGN TABLE salesdata (
 s_num       FLOAT,
 s_borough   INT,
 s_neighbor  TEXT,
 s_b_class   TEXT,
 s_c_p       TEXT,
 s_block     TEXT,
 s_lot       TEXT,
 s_easement  TEXT,
 w_c_p_2     TEXT,
 s_address   TEXT,
 s_app_num   TEXT,
 s_zip       TEXT,
 s_res_units TEXT,
 s_com_units TEXT,
 s_tot_units INT,
 s_sq_ft     FLOAT,
 s_g_sq_ft   FLOAT,
 s_yr_built  INT,
 s_tax_c     INT,
 s_b_class2  TEXT,
 s_price     FLOAT,
 s_sales_dt  TEXT)
 SERVER bigsql_server1
 OPTIONS (table 'salesdata');

Now Assuming everything went well… We should be able to run some queries against the Hive table!

SELECT * from salesdata LIMIT 5;

You should see 5 records from the table.

  • Take a quick look at the structure of the salesdata table:

\d salesdata;

postgres=# \d salesdata;</p>

<p>Foreign table &quot;public.salesdata&quot;
Column      |       Type       | Modifiers | FDW Options
------------+------------------+-----------+-------------
s_num       | double precision |           |
s_borough   | integer          |           |
s_neighbor  | text             |           |
s_b_class   | text             |           |
s_c_p       | text             |           |
s_block     | text             |           |
s_lot       | text             |           |
s_easement  | text             |           |
w_c_p_2     | text             |           |
s_address   | text             |           |
s_app_num   | text             |           |
s_zip       | text             |           |
s_res_units | text             |           |
s_com_units | text             |           |
s_tot_units | integer          |           |
s_sq_ft     | double precision |           |
s_g_sq_ft   | double precision |           |
s_yr_built  | integer          |           |
s_tax_c     | integer          |           |
s_b_class2  | text             |           |
s_price     | double precision |           |
s_sales_dt  | text             |           |
Server: bigsql_server1
FDW Options: (&quot;table&quot; 'salesdata')

  • Lets run a query to see the sales by neighborhood in Manhattan;

SELECT s_neighbor, sum(s_price) FROM  salesdata GROUP BY s_neighbor;

Results should look simular to:

</p>

<p>postgres=# SELECT s_neighbor,SUM(s_price) FROM salesdata GROUP BY s_neighbor;</p>

<p>        s_neighbor         |    sum
 ---------------------------+------------
  FINANCIAL                 | 1402998865
  MANHATTAN-UNKNOWN         |          0
  LITTLE ITALY              |  250821741
  ALPHABET CITY             |  211818189
  MIDTOWN CBD               | 3203149051
  WASHINGTON HEIGHTS UPPER  |  330247015
  KIPS BAY                  |  422209848
  GRAMERCY                  | 1021019603
  HARLEM-UPPER              |  258439080
  GREENWICH VILLAGE-CENTRAL | 1210462635
  JAVITS CENTER             |  148603249
  UPPER WEST SIDE (79-96)   | 1422565321
  UPPER WEST SIDE (96-116)  |  595009679
  MORNINGSIDE HEIGHTS       |   29809357
  FLATIRON                  | 1612799353
  HARLEM-CENTRAL            |  791802512
  LOWER EAST SIDE           |  658796147
  UPPER EAST SIDE (96-110)  |  210717582
  WASHINGTON HEIGHTS LOWER  |  367326009
  EAST VILLAGE              |  468213911
  FASHION                   | 1714396813
  CIVIC CENTER              |  612244158
  MANHATTAN VALLEY          |  220308024
  MIDTOWN EAST              | 1294451495
  GREENWICH VILLAGE-WEST    | 1670496319
  SOHO                      | 2929228590
  SOUTHBRIDGE               |  978509618
  UPPER EAST SIDE (59-79)   | 4325223036
  HARLEM-WEST               |  160041248
  CHELSEA                   | 2250264798
  CLINTON                   |  806269355
  MURRAY HILL               | 1441188759
  UPPER EAST SIDE (79-96)   | 3159172312
  CHINATOWN                 |  288838109
  INWOOD                    |  120332695
  HARLEM-EAST               |  217612503
  TRIBECA                   | 1663415012
  MIDTOWN WEST              | 6728533426
  UPPER WEST SIDE (59-79)   | 3404609800
 (39 rows)

How many properties in each community?

SELECT s_neighbor, count(s_price) FROM salesdata GROUP BY s_neighbor;

postgres=# SELECT s_neighbor, count(*) FROM salesdata GROUP BY s_neighbor;</p>

<p>        s_neighbor         | count
 ---------------------------+-------
  FINANCIAL                 |   491
  MANHATTAN-UNKNOWN         |     3
  LITTLE ITALY              |   101
  ALPHABET CITY             |   128
  MIDTOWN CBD               |   235
  WASHINGTON HEIGHTS UPPER  |   350
  KIPS BAY                  |   304
  GRAMERCY                  |   504
  HARLEM-UPPER              |   161
  GREENWICH VILLAGE-CENTRAL |   699
  JAVITS CENTER             |    38
  UPPER WEST SIDE (79-96)   |  1028
  UPPER WEST SIDE (96-116)  |   461
  MORNINGSIDE HEIGHTS       |    65
  FLATIRON                  |   528
  HARLEM-CENTRAL            |   867
  LOWER EAST SIDE           |   439
  UPPER EAST SIDE (96-110)  |   131
  WASHINGTON HEIGHTS LOWER  |   186
  EAST VILLAGE              |   233
  FASHION                   |   249
  CIVIC CENTER              |   263
  MANHATTAN VALLEY          |   268
  MIDTOWN EAST              |  1196
  GREENWICH VILLAGE-WEST    |   710
  SOHO                      |   432
  SOUTHBRIDGE               |    70
  UPPER EAST SIDE (59-79)   |  2408
  HARLEM-WEST               |    67
  CHELSEA                   |   961
  CLINTON                   |   495
  MURRAY HILL               |   771
  UPPER EAST SIDE (79-96)   |  1929
  CHINATOWN                 |   175
  INWOOD                    |   117
  HARLEM-EAST               |   298
  TRIBECA                   |   694
  MIDTOWN WEST              |  5856
  UPPER WEST SIDE (59-79)   |  1985
 (39 rows)

You will notice the results match the results the queries run directly in Hive. In our next tutorial we will add some additional data, create some cross platform joins and continue to exercise BigSQL.
For more information on BigSQL, click here.