Slide09

Installing and Running BigSQL, the Hadoop and Postgres Bundle

This tutorial will show you how to install and run BigSQL. You can download the bundle from BigSQL.org and get started using Hadoop, Hive, Hbase, and PostgreSQL in less than ten minutes. The video below goes over how to:

  • Complete the prerequisites (the commands used can be found here)
  • Start the bundle
  • Open the psql, hive, and hbase shells
  • Use the web administration console

[youtube url=”http://www.youtube.com/watch?v=y6hG4QM24QE”]

PostgreSQL Studio

PostgreSQL Studio 1.0 is out

With the growing use of PostgreSQL in the cloud, users are bumping into the challenge of how to interact with their databases. Experienced PostgreSQL users happily ssh into the instance and bring up psql, but there are a ton of new users to PostgreSQL who still aren’t comfortable with psql and want a GUI. The standard bearer for PostgreSQL GUIs is pgAdmin, but that requires people to make a database connection across the internet which they may be reluctant to do if they don’t have a VPN of some sort set up. The right thing is to have a web-based GUI allowing users to connect to their database through a browser. PhpPgAdmin is an option, but many of the new PostgreSQL users I’ve talked to like the newer look and feel of sites like Heroku and are turned off by the UI of phpPgAdmin.

PostgreSQL Studio


A couple years back when I looked at this, I came to the conclusion that the best road forward was to just start from a clean slate. I decided to write things in Java since that’s what I know well and it let’s you create a cross-platform application very easily. I also used the Google Web Toolkit since Google is pretty good at making interactive UIs on the web. There are a number of other UIs out there that have a very good web-based GUI for PostgreSQL, but none of them are open source, which I believe is critical in the success of the project. So… OpenSCG is now releasing PostgreSQL Studio under the PostgreSQL license.To find out more about PostgreSQL Studio, check out www.postgresqlstudio.org
HBase Elephant

Hadoop HBase Tutorial

This tutorial with quickly teach you how to use HBase, a column-oriented tool that sits on top of Hadoop, it works best when you have large tables and are accessing your Big Data randomly and in real-time. Though it does not support SQL, HBase can easily be connected to Hive, providing you with the read/write speed of HBase, the ease of Hive, and the parallel processing of MapReduce.

The BigSQL bundle automatically starts up a pseudo-distributed model of HBase in which a master and region server are both running on your local computer.

The tutorial will use the data file previously used in the Hadoop Hive Tutorial (See this tutorial for all prerequisites).  If you have not grabbed the file already it is located in the zipfile here. Place the file ex1data.csv into the

~/Downloads/Sample_files/ex1data.csv directory.

Note: if you are using the BigSQL distribution (highly recommended) make sure you are using at least version beta 2.28!

The first step is to upload the csv file into HDFS. Use the hadoop fs command to make the directory and copy the ex1data.csv from your Downloads folder.

	$ hadoop fs -mkdir /user/data/salesdata
 	$ hadoop fs -copyFromLocal ~/Downloads/Sample_files/ex1data.csv /user/data/salesdata/ex1data.csv

Next, start the hbase shell and create the table “sales_data” with the column families location, units, size, age and pricing.

	$ hbase shell
     	hbase > create 'sales_data', 'location', 'units', 'size', 'age', 'pricing'
     	hbase > quit

Use the ImportTsv tool to import the csv file into the HBase table. The column that will be the row’s primary key does not need to be listed by name. In this example, we list HBASE_ROW_KEY instead of explicitly saying s_num.

     	$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv '-Dimporttsv.separator=,'
            -Dimporttsv.columns=HBASE_ROW_KEY,location:s_borough,location:s_neighbor,
            location:s_b_class,location:s_c_p,location:s_block,location:s_lot,location:s_easement,
            location:w_c_p_2,location:s_address,location:s_app_num,location:s_zip,units:s_res_units,
            units:s_com_units,units:s_tot_units,size:s_sq_ft,size:s_g_sq_ft,age:s_yr_built,
            pricing:s_tax_c,pricing:s_b_class2,pricing:s_price,pricing:s_sales_dt
            sales_data /user/data/salesdata/ex1data.csv

Since this file was separated by comas and not tabs, you need to specify ‘-Dimporttsv.separator=,’.

HBase is also very good with bulk uploads. In order to do this, use the ‘importtsv.bulk.output’ tool to generate compatible files, then use the ‘completebulkloads’ utility to load those into the HBase tables.

To ensure that the table has been created and loaded into hive, you can use the list command to show all HBase tables.

     	$ hbase shell
     	hbase > list
        	TABLE
        	sales_data

To check the data within the table, you can use the scan command. This will list every cell in the table as one row.

     	hbase > scan 'sales_data'

To add the table to hive, create an external table in hive stored by org.apache.hadoop.hive.hbase.HBaseStorageHandler. You must list the hbase.columns.mapping as shown below. Note that the even though s_num is listed in the definition of the table, it is not listed under the serde properties.

     	$ hive
     	hive > CREATE EXTERNAL TABLE IF NOT EXISTS sales_data ( s_num FLOAT, s_borough INT, s_neighbor STRING, s_b_class STRING, s_c_p STRING, s_block STRING, s_lot STRING, s_easement STRING, w_c_p_2 STRING, s_address STRING, s_app_num STRING, s_zip STRING, s_res_units STRING, s_com_units STRING, s_tot_units INT, s_sq_ft FLOAT, s_g_sq_ft FLOAT, s_yr_built INT, s_tax_c INT, s_b_class2 STRING, s_price FLOAT, s_sales_dt STRING ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = "location:s_borough, location:s_neighbor,location:s_b_class,location:s_c_p,location:s_block, location:s_lot,location:s_easement,location:w_c_p_2,location:s_address, location:s_app_num,location:s_zip,units:s_res_units,units:s_com_units, units:s_tot_units,size:s_sq_ft,size:s_g_sq_ft,age:s_yr_built,pricing:s_tax_c, pricing:s_b_class2,pricing:s_price,pricing:s_sales_dt");

	hive> DESCRIBE sales_data;                                                                                  
		OK
		col_name		data_type		comment
		s_num               	float               	from deserializer   
		s_borough           	int                 	from deserializer   
		s_neighbor          	string              	from deserializer   
		s_b_class           	string              	from deserializer   
		s_c_p               	string              	from deserializer   
		s_block             	string              	from deserializer   
		s_lot               	string              	from deserializer   
		s_easement          	string              	from deserializer   
		w_c_p_2             	string              	from deserializer   	
		s_address           	string              	from deserializer   
		s_app_num           	string              	from deserializer   
		s_zip               	string              	from deserializer   
		s_res_units         	string              	from deserializer   
		s_com_units         	string              	from deserializer   
		s_tot_units         	int                 	from deserializer   
		s_sq_ft             	float               	from deserializer   
		s_g_sq_ft           	float               	from deserializer   
		s_yr_built          	int                 	from deserializer   
		s_tax_c             	int                 	from deserializer   
		s_b_class2          	string              	from deserializer   
		s_price             	float               	from deserializer   
		s_sales_dt          	string              	from deserializer   
		Time taken: 0.27 seconds, Fetched: 22 row(s)

You can also use the HBase Console (localhost:60010/master-status) to check the user tables created and their attributes and other metrics!
For more information on BigSQL visit BigSQL.org