This tutorial will quickly teach you how to use HBase, a column-oriented database that sits on top of Hadoop. It works best when you have large tables and need random, real-time access to your Big Data. Though it does not support SQL, HBase can easily be connected to Hive, providing you with the read/write speed of HBase, the ease of Hive, and the parallel processing of MapReduce.

The BigSQL bundle automatically starts HBase in pseudo-distributed mode, in which a master and a region server both run on your local computer.

The tutorial uses the data file from the Hadoop Hive Tutorial (see that tutorial for all prerequisites). If you have not grabbed the file already, it is located in the zipfile here. Place the file ex1data.csv in the ~/Downloads/Sample_files/ directory.

Note: if you are using the BigSQL distribution (highly recommended) make sure you are using at least version beta 2.28!

The first step is to upload the CSV file into HDFS. Use the hadoop fs command to make the directory and copy ex1data.csv from your Downloads folder.

    $ hadoop fs -mkdir /user/data/salesdata
    $ hadoop fs -copyFromLocal ~/Downloads/Sample_files/ex1data.csv /user/data/salesdata/ex1data.csv
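
Before moving on, it can help to confirm that the file actually landed in HDFS. The sketch below wraps hadoop fs -test -e in a small function. The name check_upload is made up for this example, and it assumes the hadoop command is on your PATH.

```shell
# check_upload is a hypothetical helper, not part of Hadoop.
# "hadoop fs -test -e PATH" exits with status 0 when PATH exists in HDFS.
check_upload() {
    if hadoop fs -test -e "$1"; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}
```

Calling check_upload /user/data/salesdata/ex1data.csv after the copy should report the file as found. If it reports missing, re-check the two hadoop fs commands above before continuing.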

Next, start the hbase shell and create the table “sales_data” with the column families location, units, size, age and pricing.

    $ hbase shell
        hbase > create 'sales_data', 'location', 'units', 'size', 'age', 'pricing'
        hbase > quit

Use the ImportTsv tool to import the CSV file into the HBase table. The column that serves as the row key does not need to be listed by name. In this example, we list HBASE_ROW_KEY instead of explicitly saying s_num.

       $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv '-Dimporttsv.separator=,' \
            -Dimporttsv.columns=HBASE_ROW_KEY,location:s_borough,location:s_neighbor,location:s_b_class,location:s_c_p,location:s_block,location:s_lot,location:s_easement,location:w_c_p_2,location:s_address,location:s_app_num,location:s_zip,units:s_res_units,units:s_com_units,units:s_tot_units,size:s_sq_ft,size:s_g_sq_ft,age:s_yr_built,pricing:s_tax_c,pricing:s_b_class2,pricing:s_price,pricing:s_sales_dt \
            sales_data /user/data/salesdata/ex1data.csv
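
The column list has to reach ImportTsv as a single argument with no spaces or line breaks in it, which makes it awkward to type. One way to keep it readable is to write the list over several lines and strip the whitespace before use. This is only a sketch: IMPORT_COLUMNS and import_sales_data are names invented here, and the column names are copied from the command above.

```shell
# IMPORT_COLUMNS is a variable name chosen for this sketch.
# The list is written across several lines for readability; tr then strips
# spaces and newlines so ImportTsv receives one unbroken argument.
IMPORT_COLUMNS=$(tr -d ' \n' <<'EOF'
HBASE_ROW_KEY,
location:s_borough,location:s_neighbor,location:s_b_class,location:s_c_p,
location:s_block,location:s_lot,location:s_easement,location:w_c_p_2,
location:s_address,location:s_app_num,location:s_zip,
units:s_res_units,units:s_com_units,units:s_tot_units,
size:s_sq_ft,size:s_g_sq_ft,
age:s_yr_built,
pricing:s_tax_c,pricing:s_b_class2,pricing:s_price,pricing:s_sales_dt
EOF
)

# import_sales_data is a hypothetical wrapper around the ImportTsv command.
import_sales_data() {
    hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
        '-Dimporttsv.separator=,' \
        "-Dimporttsv.columns=$IMPORT_COLUMNS" \
        sales_data /user/data/salesdata/ex1data.csv
}
```

Running import_sales_data should then be equivalent to typing the full command by hand. You can echo "$IMPORT_COLUMNS" first to inspect the flattened list.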

Since this file is separated by commas and not tabs, you need to specify '-Dimporttsv.separator=,'.
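
As far as I understand, ImportTsv simply splits each line on the separator character and does not interpret CSV quoting, so it may be worth checking that every line of the file has the same number of fields as the column list (22 here, counting HBASE_ROW_KEY). The helper below is a hypothetical sketch using awk. It assumes that no field contains a comma of its own.

```shell
# field_counts is a hypothetical helper: for each distinct number of
# comma-separated fields, print "<field count> <number of lines>".
field_counts() {
    awk -F',' '{ counts[NF]++ } END { for (n in counts) print n, counts[n] }' "$1" | sort -n
}
```

Running field_counts ~/Downloads/Sample_files/ex1data.csv would ideally print a single line beginning with 22. Any other field count points to lines that may not import cleanly.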

HBase is also very good with bulk uploads. To do one, pass the 'importtsv.bulk.output' option to ImportTsv so that it generates HBase-compatible files instead of writing to the table directly, then use the 'completebulkload' utility to load those files into the HBase table.
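
The two bulk-load steps might look roughly like the sketch below. The output directory and the function names are invented for illustration. LoadIncrementalHFiles is, to my knowledge, the class behind completebulkload, but its package can differ between HBase versions, so treat the class path as an assumption to verify against your install.

```shell
# HFILE_DIR is a hypothetical HDFS path for the generated files; pick a
# directory that does not exist yet.
HFILE_DIR=/user/data/salesdata_hfiles

# Step 1: run ImportTsv with importtsv.bulk.output so it writes files
# instead of inserting into the table. $1 is the same comma-separated
# column list passed to -Dimporttsv.columns in the regular import.
bulk_prepare() {
    hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
        '-Dimporttsv.separator=,' \
        "-Dimporttsv.bulk.output=$HFILE_DIR" \
        "-Dimporttsv.columns=$1" \
        sales_data /user/data/salesdata/ex1data.csv
}

# Step 2: hand the generated files over to the sales_data table.
# The package of LoadIncrementalHFiles is an assumption; check your version.
bulk_complete() {
    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles "$HFILE_DIR" sales_data
}
```

You would call bulk_prepare with the column list first and, once that job finishes, bulk_complete.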

To confirm that the table has been created in HBase, you can use the list command to show all HBase tables.

       $ hbase shell
        hbase > list
            TABLE
            sales_data

To check the data within the table, you can use the scan command. The output has one line per cell, so each row of the table spans several lines.

       hbase > scan 'sales_data'
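
A full scan of a large table can print a lot of output. As a lighter check, the sketch below feeds two commands to the HBase shell on standard input: a scan restricted with LIMIT and a row count. The function name peek_sales_data and the default limit of 5 are choices made for this example.

```shell
# peek_sales_data is a hypothetical helper. It sends commands to the HBase
# shell on standard input: scan the first few rows, then count all rows.
# $1 is the number of rows to scan (defaults to 5).
peek_sales_data() {
    limit=${1:-5}
    hbase shell <<EOF
scan 'sales_data', {LIMIT => $limit}
count 'sales_data'
EOF
}
```

For example, peek_sales_data 3 would scan three rows before counting. The same two commands can also be typed directly at the hbase > prompt.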

To add the table to Hive, create an external table in Hive stored by org.apache.hadoop.hive.hbase.HBaseStorageHandler. You must list the hbase.columns.mapping as shown below. Note that even though s_num is listed in the definition of the table, it is not listed under the serde properties: it corresponds to the HBase row key rather than to a column in a column family.

       $ hive
        hive > CREATE EXTERNAL TABLE IF NOT EXISTS sales_data (
                   s_num FLOAT, s_borough INT, s_neighbor STRING, s_b_class STRING, s_c_p STRING,
                   s_block STRING, s_lot STRING, s_easement STRING, w_c_p_2 STRING, s_address STRING,
                   s_app_num STRING, s_zip STRING, s_res_units STRING, s_com_units STRING,
                   s_tot_units INT, s_sq_ft FLOAT, s_g_sq_ft FLOAT, s_yr_built INT, s_tax_c INT,
                   s_b_class2 STRING, s_price FLOAT, s_sales_dt STRING )
               STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
               WITH SERDEPROPERTIES ("hbase.columns.mapping" = "location:s_borough,location:s_neighbor,location:s_b_class,location:s_c_p,location:s_block,location:s_lot,location:s_easement,location:w_c_p_2,location:s_address,location:s_app_num,location:s_zip,units:s_res_units,units:s_com_units,units:s_tot_units,size:s_sq_ft,size:s_g_sq_ft,age:s_yr_built,pricing:s_tax_c,pricing:s_b_class2,pricing:s_price,pricing:s_sales_dt");

        hive > DESCRIBE sales_data;
        OK
        col_name        data_type       comment
        s_num                   float                   from deserializer   
        s_borough               int                     from deserializer   
        s_neighbor              string                  from deserializer   
        s_b_class               string                  from deserializer   
        s_c_p                   string                  from deserializer   
        s_block                 string                  from deserializer   
        s_lot                   string                  from deserializer   
        s_easement              string                  from deserializer   
        w_c_p_2                 string                  from deserializer       
        s_address               string                  from deserializer   
        s_app_num               string                  from deserializer   
        s_zip                   string                  from deserializer   
        s_res_units             string                  from deserializer   
        s_com_units             string                  from deserializer   
        s_tot_units             int                     from deserializer   
        s_sq_ft                 float                   from deserializer   
        s_g_sq_ft               float                   from deserializer   
        s_yr_built              int                     from deserializer   
        s_tax_c                 int                     from deserializer   
        s_b_class2              string                  from deserializer   
        s_price                 float                   from deserializer   
        s_sales_dt              string                  from deserializer   
        Time taken: 0.27 seconds, Fetched: 22 row(s)
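
Once the external table exists, it can be queried like any other Hive table. As a small illustration, the sketch below runs one aggregate query from the command line. The helper name sales_by_borough and the num_sales alias are made up here, and the query assumes the table definition above.

```shell
# sales_by_borough is a hypothetical helper; "hive -e" runs the quoted
# query string and exits. The query counts the rows for each borough.
sales_by_borough() {
    hive -e "SELECT s_borough, COUNT(*) AS num_sales FROM sales_data GROUP BY s_borough;"
}
```

Hive should turn this into a MapReduce job that reads from the HBase table, so expect it to take noticeably longer than the DESCRIBE above.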

You can also use the HBase console (localhost:60010/master-status) to check the user tables you created, along with their attributes and other metrics.

For more information on BigSQL, visit BigSQL.org.