PostgreSQL Foreign Data Wrapper to Hadoop Hive Tutorial 2

In our last tutorial, Hadoop Hive Tutorial – Zero to Results in 15 Minutes, we installed Postgres and Hadoop and ran some simple queries against a sample dataset directly against Hive tables.  In this tutorial we will create a FDW (Foreign Data Wrapper) in Postgres and access the same data using psql within Postgres.  If you have not completed the first tutorial, it is recommended that you at least ensure that you have set up bigsql, created the hive tables and populated the hive data.

Let’s Roll!


Verify that you have started BigSQL and ran the setENV.sh script.  (See the first tutorial if you need help with this step!)

The first step is to create the EXTENSION for the FDW, run the pslq interpreter and type the following:

[code language=”sql”]
CREATE EXTENSION hadoop_fdw;
[/code]

Now we will create a SERVER using the EXTENSION we just created:

[code language=”sql”]
CREATE SERVER bigsql_server1 FOREIGN DATA WRAPPER hadoop_fdw
OPTIONS (url ‘jdbc:hive://localhost:10000/<wbr>default’);
[/code]

Next we will create a USER MAPPING for the SERVER we just created:

[code language=”sql”]
CREATE USER MAPPING FOR user SERVER bigsql_server1
OPTIONS (username ‘user’, password ‘password’);
[/code]

Finally, we will create the FOREIGN TABLE with the mappings from the Hive data store:

[code language=”sql”]
CREATE FOREIGN TABLE salesdata (
s_num       FLOAT,
s_borough   INT,
s_neighbor  TEXT,
s_b_class   TEXT,
s_c_p       TEXT,
s_block     TEXT,
s_lot       TEXT,
s_easement  TEXT,
w_c_p_2     TEXT,
s_address   TEXT,
s_app_num   TEXT,
s_zip       TEXT,
s_res_units TEXT,
s_com_units TEXT,
s_tot_units INT,
s_sq_ft     FLOAT,
s_g_sq_ft   FLOAT,
s_yr_built  INT,
s_tax_c     INT,
s_b_class2  TEXT,
s_price     FLOAT,
s_sales_dt  TEXT)
SERVER bigsql_server1
OPTIONS (table ‘salesdata’);
[/code]

Now Assuming everything went well… We should be able to run some queries against the Hive table!

SELECT * from salesdata LIMIT 5;

You should see 5 records from the table.

  • Take a quick look at the structure of the salesdata table:

\d salesdata;

[code]
postgres=# \d salesdata;

Foreign table "public.salesdata"
Column    |       Type       | Modifiers | FDW Options
————+——————+———–+————-
s_num       | double precision |           |
s_borough   | integer          |           |
s_neighbor  | text             |           |
s_b_class   | text             |           |
s_c_p       | text             |           |
s_block     | text             |           |
s_lot       | text             |           |
s_easement  | text             |           |
w_c_p_2     | text             |           |
s_address   | text             |           |
s_app_num   | text             |           |
s_zip       | text             |           |
s_res_units | text             |           |
s_com_units | text             |           |
s_tot_units | integer          |           |
s_sq_ft     | double precision |           |
s_g_sq_ft   | double precision |           |
s_yr_built  | integer          |           |
s_tax_c     | integer          |           |
s_b_class2  | text             |           |
s_price     | double precision |           |
s_sales_dt  | text             |           |
Server: bigsql_server1
FDW Options: ("table" ‘salesdata’)
[/code]

  • Lets run a query to see the sales by neighborhood in Manhattan;

SELECT s_neighbor, sum(s_price) FROM  salesdata GROUP BY s_neighbor;

Results should look simular to:

[code]

postgres=# SELECT s_neighbor,SUM(s_price) FROM salesdata GROUP BY s_neighbor;

        s_neighbor         |    sum
—————————+————
 FINANCIAL                 | 1402998865
 MANHATTAN-UNKNOWN         |          0
 LITTLE ITALY              |  250821741
 ALPHABET CITY             |  211818189
 MIDTOWN CBD               | 3203149051
 WASHINGTON HEIGHTS UPPER  |  330247015
 KIPS BAY                  |  422209848
 GRAMERCY                  | 1021019603
 HARLEM-UPPER              |  258439080
 GREENWICH VILLAGE-CENTRAL | 1210462635
 JAVITS CENTER             |  148603249
 UPPER WEST SIDE (79-96)   | 1422565321
 UPPER WEST SIDE (96-116)  |  595009679
 MORNINGSIDE HEIGHTS       |   29809357
 FLATIRON                  | 1612799353
 HARLEM-CENTRAL            |  791802512
 LOWER EAST SIDE           |  658796147
 UPPER EAST SIDE (96-110)  |  210717582
 WASHINGTON HEIGHTS LOWER  |  367326009
 EAST VILLAGE              |  468213911
 FASHION                   | 1714396813
 CIVIC CENTER              |  612244158
 MANHATTAN VALLEY          |  220308024
 MIDTOWN EAST              | 1294451495
 GREENWICH VILLAGE-WEST    | 1670496319
 SOHO                      | 2929228590
 SOUTHBRIDGE               |  978509618
 UPPER EAST SIDE (59-79)   | 4325223036
 HARLEM-WEST               |  160041248
 CHELSEA                   | 2250264798
 CLINTON                   |  806269355
 MURRAY HILL               | 1441188759
 UPPER EAST SIDE (79-96)   | 3159172312
 CHINATOWN                 |  288838109
 INWOOD                    |  120332695
 HARLEM-EAST               |  217612503
 TRIBECA                   | 1663415012
 MIDTOWN WEST              | 6728533426
 UPPER WEST SIDE (59-79)   | 3404609800
(39 rows)
[/code]

How many properties in each community?

SELECT s_neighbor, count(s_price) FROM salesdata GROUP BY s_neighbor;

[code]
postgres=# SELECT s_neighbor, count(*) FROM salesdata GROUP BY s_neighbor;

        s_neighbor         | count
—————————+——-
 FINANCIAL                 |   491
 MANHATTAN-UNKNOWN         |     3
 LITTLE ITALY              |   101
 ALPHABET CITY             |   128
 MIDTOWN CBD               |   235
 WASHINGTON HEIGHTS UPPER  |   350
 KIPS BAY                  |   304
 GRAMERCY                  |   504
 HARLEM-UPPER              |   161
 GREENWICH VILLAGE-CENTRAL |   699
 JAVITS CENTER             |    38
 UPPER WEST SIDE (79-96)   |  1028
 UPPER WEST SIDE (96-116)  |   461
 MORNINGSIDE HEIGHTS       |    65
 FLATIRON                  |   528
 HARLEM-CENTRAL            |   867
 LOWER EAST SIDE           |   439
 UPPER EAST SIDE (96-110)  |   131
 WASHINGTON HEIGHTS LOWER  |   186
 EAST VILLAGE              |   233
 FASHION                   |   249
 CIVIC CENTER              |   263
 MANHATTAN VALLEY          |   268
 MIDTOWN EAST              |  1196
 GREENWICH VILLAGE-WEST    |   710
 SOHO                      |   432
 SOUTHBRIDGE               |    70
 UPPER EAST SIDE (59-79)   |  2408
 HARLEM-WEST               |    67
 CHELSEA                   |   961
 CLINTON                   |   495
 MURRAY HILL               |   771
 UPPER EAST SIDE (79-96)   |  1929
 CHINATOWN                 |   175
 INWOOD                    |   117
 HARLEM-EAST               |   298
 TRIBECA                   |   694
 MIDTOWN WEST              |  5856
 UPPER WEST SIDE (59-79)   |  1985
(39 rows)
[/code]

You will notice the results match the results the queries run directly in Hive. In our next tutorial we will add some additional data, create some cross platform joins and continue to exercise BigSQL.
For more information on BigSQL, click here.