This Tutorial will walk through using Hadoop’s Hive to access data stored in HDFS.  This is the first tutorial (more to follow) and covers installing Hadoop as part of BigSQL and running some simple queries.

The Platform:  Linux (RHEL, CentOS or Ubuntu) or Mac OSX.  The Hadoop and Postgres Install was from www.BigSQL.org.  Takes about 10 – 15 minutes to install the developer bundle and get up and running.  ( dont skip the pre-requisites  ).

The Data: The original source data is from the NYC finance department and contains Rolling Sales files for the last 12 months in New York City.   The data for Manhattan was used in the tutorial.   A slightly cleaned file that can be used for the tutorial is available below.

Let’s get started!

  • If you have not previously installed bigsql,  download the bundle and tar the file.  Make sure to use the exact filename downloaded:

tar -xf bigsql-9.3xxx-linux64.tar.bz2

  • navigate to the home directory of your BigSQL install.  

I have installed BigSQL in the following directory:  /home/cgerth/bigsql/bigsql-9.3beta1-7   simply  cd to this directory

  • start BigSQL

./bigsql start

  • execute the environment script setENV.sh

. setENV.sh

  • create a tutorial directory, change to that directory

mkdir tutorial

cd tutorial

  • put the following three files in the tutorial directory: ex2data.csv, ex1createHiveTable.sql and ex1loadHiveTables.sql .  The files are zipped here

use the tool of your choice!

  • create the hive table structure.  Take a look at the ex1createHiveTable.sql script which is used to create the Hive Table.

CREATE TABLE IF NOT EXISTS salesdata (
s_num FLOAT,
s_borough INT,
s_neighbor STRING,
s_b_class STRING,
s_c_p STRING,
s_block STRING,
s_lot STRING,
s_easement STRING,
w_c_p_2 STRING,
s_address STRING,
s_app_num STRING,
s_zip STRING,
s_res_units STRING,
s_com_units STRING,
s_tot_units INT,
s_sq_ft FLOAT,
s_g_sq_ft FLOAT,
s_yr_built INT,
s_tax_c INT,
s_b_class2 STRING,
s_price FLOAT,
s_sales_dt STRING

)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘,’
STORED AS TEXTFILE;

$ hive -v -f ex1createHiveTables.sql

  • load the datafile into the structure created.  The ex1loadHiveTable.sql script can be used for this.  Take a look at the script before running.

LOAD DATA LOCAL INPATH ‘ex1data.csv’
OVERWRITE INTO TABLE salesdata;

hive -v -f ex1loadHiveTables.sql

  • Run a quick query to test the data load.  Run the following command and you should see the 5 records from the table.

hive

SELECT * from salesdata LIMIT 5;

You should see 5 records from the table.

  • Take a quick look at the hive table created.

DESCRIBE salesdata;

OK
col_name    data_type    comment
s_num    float    
s_borough    int    
s_neighbor    string    
s_b_class    string    
s_c_p    string    
s_block    string    
s_lot    string    
s_easement    string    
w_c_p_2    string    
s_address    string    
s_app_num    string    
s_zip    string    
s_res_units    string    
s_com_units    string    
s_tot_units    int    
s_sq_ft    float    
s_g_sq_ft    float    
s_yr_built    int    
s_tax_c    int    
s_b_class2    string    
s_price    float    
s_sales_dt    string    
Time taken: 0.194 seconds
  • Lets run a query to see the sales by neighborhood in Manhattan;

SELECT s_neighbor, sum(s_price) FROM  salesdata GROUP BY s_neighbor;

Results should look simular to:
ALPHABET CITY 2.11818189E8
CHELSEA 2.250264798E9
CHINATOWN 2.88838109E8
CIVIC CENTER 6.12244158E8
CLINTON 8.06269355E8
EAST VILLAGE 4.68213911E8
FASHION 1.714396813E9
FINANCIAL 1.402998865E9
FLATIRON 1.612799353E9
GRAMERCY 1.021019603E9
GREENWICH VILLAGE-CENTRAL 1.210462635E9
GREENWICH VILLAGE-WEST 1.670496319E9
HARLEM-CENTRAL 7.91802512E8
HARLEM-EAST 2.17612503E8
HARLEM-UPPER 2.5843908E8
HARLEM-WEST 1.60041248E8
INWOOD 1.20332695E8
JAVITS CENTER 1.48603249E8
KIPS BAY 4.22209848E8
LITTLE ITALY 2.50821741E8
LOWER EAST SIDE 6.58796147E8
MANHATTAN VALLEY 2.20308024E8
MANHATTAN-UNKNOWN 0.0
MIDTOWN CBD 3.203149051E9
MIDTOWN EAST 1.294451495E9
MIDTOWN WEST 6.728533424E9
MORNINGSIDE HEIGHTS 2.9809357E7
MURRAY HILL 1.441188759E9
SOHO 2.929228594E9
SOUTHBRIDGE 9.78509618E8
TRIBECA 1.663415012E9
UPPER EAST SIDE (59-79) 4.325223036E9
UPPER EAST SIDE (79-96) 3.159172312E9
UPPER EAST SIDE (96-110) 2.10717582E8
UPPER WEST SIDE (59-79) 3.4046098E9
UPPER WEST SIDE (79-96) 1.422565321E9
UPPER WEST SIDE (96-116) 5.95009679E8
WASHINGTON HEIGHTS LOWER 3.67326009E8
WASHINGTON HEIGHTS UPPER 3.30247015E8
Time taken: 3.656 seconds
  • How many properties in each community?

SELECT s_neighbor, count(s_price) FROM salesdata GROUP BY s_neighbor;

ALPHABET CITY                128
 CHELSEA                      961
 CHINATOWN                    175
 CIVIC CENTER                 263
 CLINTON                      495
 EAST VILLAGE                 233
 FASHION                      249
 FINANCIAL                    491
 FLATIRON                     528
 GRAMERCY                     504
 GREENWICH VILLAGE-CENTRAL    699
 GREENWICH VILLAGE-WEST       710
 HARLEM-CENTRAL               867
 HARLEM-EAST                  298
 HARLEM-UPPER                 161
 HARLEM-WEST                  67
 INWOOD                       117
 JAVITS CENTER                38
 KIPS BAY                     304
 LITTLE ITALY                 101
 LOWER EAST SIDE              439
 MANHATTAN VALLEY             268
 MANHATTAN-UNKNOWN            3
 MIDTOWN CBD                  235
 MIDTOWN EAST                 1196
 MIDTOWN WEST                 5856
 MORNINGSIDE HEIGHTS          65
 MURRAY HILL                  771
 SOHO                         432
 SOUTHBRIDGE                  70
 TRIBECA                      694
 UPPER EAST SIDE (59-79)      2408
 UPPER EAST SIDE (79-96)      1929
 UPPER EAST SIDE (96-110)     131
 UPPER WEST SIDE (59-79)      1985
 UPPER WEST SIDE (79-96)      1028
 UPPER WEST SIDE (96-116)     461
 WASHINGTON HEIGHTS LOWER     186
 WASHINGTON HEIGHTS UPPER     350
 Time taken: 3.79 seconds
  • Interesting that UNKNOW had 3 transactions but no $ volume, lets take a closer look:

SELECT s_price FROM salesdata WHERE s_neighbor LIKE  ‘%UNKNOWN%’

Mapred Local Task Succeeded . Convert the Join into MapJoin
 OK
 s_price
 0.0
 0.0
 0.0
 Time taken: 3.479 seconds

Yes, as expected they are all zero value transactions.

This completes the first starting Hive tutorial.

The next Tutorial in the series is PostgreSQL FDW to Hive  .

Please share feedback and look for more tutorials moving forward.

Click here for more information on BigSql!