
Stinger Delivers Fast Hive SQL via YARN & TEZ




The latest BigSQL bundle now includes Stinger and Tez to dramatically improve performance and extend the SQL dialect. In the latest distribution of BigSQL Release 2, it is extremely easy to get started exploring Tez and Stinger in your own sandbox environment. Taking advantage of YARN, Stinger is now delivering on its three main goals: speed, SQL enhancements, and scale.

 

SPEED

The Stinger initiative (Hive 0.13 & Tez 0.3) calls for a 100X performance improvement over current Hive (Hive 0.10) batch queries. Key developments helping to achieve these results are:

  • Leveraging ORC files: this columnar format allows for faster reads, writes, and processing of Hive data. The format supports predicate pushdown and compression, and even allows lightweight indexes to be stored within the files (a brief example follows this list).
  • Continued optimizations in the Hive query engine: enhancements such as avoiding unneeded writes, support for hot containers (saving launch and shutdown overhead), in-memory caching, and leveraging wider Metastore statistics have all helped improve latency and throughput.
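
As a minimal, hedged sketch (the table and column names below are illustrative, not part of any tutorial data), storing a Hive table in ORC format is a small change to the DDL:

   hive > CREATE TABLE page_views_orc ( user_id STRING, view_time TIMESTAMP, url STRING ) STORED AS ORC TBLPROPERTIES ("orc.compress"="ZLIB");

   hive > INSERT INTO TABLE page_views_orc SELECT user_id, view_time, url FROM page_views;

Because the ORC writer keeps lightweight min/max statistics for each stripe, later queries that filter on these columns can skip whole stripes via predicate pushdown.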

 

SQL ENHANCEMENTS

Full support for the DECIMAL, TIMESTAMP, DATE, and VARCHAR datatypes, as well as a rich set of semantic improvements.
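
As a small, hedged illustration (the table is hypothetical), a Hive 0.13 DDL can now declare these types directly:

   hive > CREATE TABLE orders ( order_id INT, amount DECIMAL(10,2), ordered_at TIMESTAMP, ship_date DATE, status VARCHAR(20) );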

 

SCALE

Designed for large-scale data environments, Stinger’s goal is to deliver even when you are analyzing petabytes of data.

 

Stinger in Action!

Below is a screenshot from the BigSQL YARN console showing the execution of a Tez job.

At OpenSCG we focus on helping you get the most value and leverage out of your data environments. It’s an exciting time in DATA and we thrive on helping our clients understand the latest changes in data technology. We realize that there are great opportunities in new NoSQL and clustered data solutions and take pride in helping you evaluate your data options.

BigSQL Adds Cassandra to the Hadoop and PostgreSQL Bundle

The BigSQL developer’s bundle has added a significant new component: Cassandra 2.0.4. All of the bundles and VMs now allow developers to get up and running with the latest Hadoop ecosystem and Cassandra. Join our free Beta program now!

As a significant number of our clients are leveraging Cassandra for highly available, multi-partition solutions, it was a natural fit to add to our bundle.
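
As a quick, hedged sanity check (assuming the bundle’s Cassandra is running locally with the default cqlsh settings), you can confirm the version from the CQL shell; it should report 2.0.4:

   $ cqlsh
   cqlsh> SELECT release_version FROM system.local;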

In addition to Cassandra, the latest beta includes all of the standard BigSQL components and many enhancements:

  • Continued HadoopFDW (Postgres-to-Hadoop foreign data wrapper) development, fixes, and improvements
  • JDK 7 is now a prerequisite on all platforms
  • upgrade PostgreSQL from 9.3.1 to 9.3.2
  • upgrade DataFu from 1.1.1 to 1.2.0
  • upgrade PostgresqlStudio from 1.0 to 1.1
  • upgrade Hue from 3.0.0 to 3.5.
  • autostart Oozie & Hue in the vm
  • upgrade Sqoop from 1.4.4 to 1.99.3 (and autostart it)
  • include pgTSQL functionality from tpostgresq.org 
  • compact our quickstart virtual machine from 3 GB to 1.5 GB
  • configure eclipse for hadoop in the vm
At OpenSCG we are focused on helping you get the most value and leverage out of your data environments. It’s an exciting time in DATA and we thrive on helping our clients understand the latest changes in data technology. We realize that there are great opportunities in new NoSQL and clustered data solutions and take pride in helping you evaluate your data options.

BigSQL Hadoop + Postgres now Available as a VM

With the continued success of the BigSQL developer’s bundle, we are proud to announce the availability of a new VM image with BigSQL preinstalled. Primarily created to provide our Windows friends access to the technology bundle, the image is based on CentOS 6.4 Desktop and includes many tools to begin exploring today. Join our free Beta program now!

We see this as a quick way to open up the platform for Windows clients while we complete our native Windows bundle distribution which is on our roadmap and should be in alpha by late Q1 of 2014.

The beta includes all of the standard BigSQL components plus:

  • Hadoop-2.2
  • Hive-0.12
  • HBase-0.96
At OpenSCG we are focused on integrating existing SQL infrastructure WITH NoSQL extensions to ensure our clients get the most out of their investment dollars.  We understand DATA and thrive on helping our clients get maximum value out of DATA!

PostgreSQL, Hadoop, and Pentaho to Analyze and Visualize Data

This tutorial shows how to use Postgres, Hadoop, Hive, HBase, and Pig to load, refine, and store big data for visualization, all in fewer than five minutes.

You can watch the BigSQL, Postgres + Hadoop, Pentaho Demo here or follow along with the written version below.

[youtube url="http://www.youtube.com/watch?v=eSEe_33pImA" fs="1" rel="0"]

The following tutorial shows the scripts used to accomplish the demo. It also shows how to complete the additional step of loading the data into HBase, which allows you to update, insert and delete data in the HDFS from Postgres.

The data used for this example consists of three files:

  • Products, which includes the categories of products and their corresponding URLs
  • Logs, semi-structured website logs that include data like a timestamp, user id, and geocoded IP address for each entry
  • Users, which contains the users’ ids, birthdays, and genders

In BigSQL, we will create the Hive table definitions and move all of the data to the Hadoop file system:

   hive > CREATE TABLE users ( SWID STRING, BIRTH_DT STRING, GENDER_CD STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/user/data/users';

   hive > CREATE TABLE products ( url STRING, category STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/user/data/products';

   hive > CREATE TABLE Logs ( col_0 STRING, col_1 STRING, col_2 STRING, col_3 STRING, col_4 STRING, col_5 STRING, col_6 STRING, col_7 STRING, col_8 STRING, col_9 STRING, col_10 STRING, col_11 STRING, col_12 STRING, col_13 STRING, col_14 STRING, col_15 STRING, col_16 STRING, col_17 STRING, col_18 STRING, col_19 STRING, col_20 STRING, col_21 STRING, col_22 STRING, col_23 STRING, col_24 STRING, col_25 STRING, col_26 STRING, col_27 STRING, col_28 STRING, col_29 STRING, col_30 STRING, col_31 STRING, col_32 STRING, col_33 STRING, col_34 STRING, col_35 STRING, col_36 STRING, col_37 STRING, col_38 STRING, col_39 STRING, col_40 STRING, col_41 STRING,        col_42 STRING, col_43 STRING, col_44 STRING, col_45 STRING, col_46 STRING, col_47 STRING,        col_48 STRING, col_49 STRING, col_50 STRING, col_51 STRING, col_52 STRING, col_53 STRING, col_54 STRING, col_55 STRING, col_56 STRING, col_57 STRING, col_58 STRING, col_59 STRING, col_60 STRING, col_61 STRING, col_62 STRING, col_63 STRING, col_64 STRING, col_65 STRING, col_66 STRING, col_67 STRING, col_68 STRING, col_69 STRING, col_70 STRING, col_71 STRING, col_72 STRING, col_73 STRING, col_74 STRING, col_75 STRING, col_76 STRING, col_77 STRING, col_78 STRING, col_79 STRING, col_80 STRING, col_81 STRING, col_82 STRING, col_83 STRING, col_84 STRING, col_85 STRING, col_86 STRING, col_87 STRING, col_88 STRING, col_89 STRING, col_90 STRING, col_91 STRING, col_92 STRING, col_93 STRING, col_94 STRING, col_95 STRING, col_96 STRING, col_97 STRING, col_98 STRING, col_99 STRING, col_100 STRING, col_101 STRING, col_102 STRING, col_103 STRING, col_104 STRING, col_105 STRING, col_106 STRING, col_107 STRING, col_108 STRING, col_109 STRING, col_110 STRING, col_111 STRING, col_112 STRING, col_113 STRING, col_114 STRING, col_115 STRING, col_116 STRING, col_117 STRING, col_118 STRING, col_119 STRING, col_120 STRING, col_121 STRING, col_122 STRING, col_123 STRING, col_124 STRING, col_125 STRING, col_126 STRING, col_127 STRING, col_128 STRING, col_129 STRING, col_130 STRING, col_131 STRING, col_132 STRING, col_133 STRING, col_134 STRING, col_135 STRING, col_136 STRING, col_137 STRING, col_138 STRING, col_139 STRING, col_140 STRING, col_141 STRING, col_142 STRING, col_143 STRING, col_144 STRING, col_145 STRING, col_146 STRING, col_147 STRING, col_148 STRING, col_149 STRING, col_150 STRING, col_151 STRING, col_152 STRING, col_153 STRING, col_154 STRING, col_155 STRING, col_156 STRING, col_157 STRING, col_158 STRING, col_159 STRING, col_160 STRING, col_161 STRING, col_162 STRING, col_163 STRING, col_164 STRING, col_165 STRING, col_166 STRING, col_167 STRING, col_168 STRING, col_169 STRING, col_170 STRING, col_171 STRING, col_172 STRING, col_173 STRING, col_174 STRING, col_175 STRING, col_176 STRING, col_177 STRING, col_178 STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/user/data/logs';

   hive > LOAD DATA LOCAL INPATH 'Downloads/webLogAnalytics/users.tsv' OVERWRITE INTO TABLE users;
   hive > LOAD DATA LOCAL INPATH 'Downloads/webLogAnalytics/products.tsv' OVERWRITE INTO TABLE products;
   hive > LOAD DATA LOCAL INPATH 'Downloads/webLogAnalytics/Omniture.tsv' OVERWRITE INTO TABLE logs;

Now, use the HBase shell to define the webloganalytics table:

   $ hbase shell
   hbase > create 'webloganalytics', 'info','user','location'

Next, we will use the following Pig script to refine the data. You may need to export the following as your HADOOP_CLASSPATH in order for Pig to work correctly with HBase:

export HADOOP_CLASSPATH=$HBASE_HOME/hbase-0.94.11.jar:$HADOOP_HOME/hadoop-core-1.2.1.jar:$ZOOKEEPER_HOME/zookeeper-3.4.5.jar:$HBASE_HOME/lib/protobuf-java-2.4.0a.jar

First, register the Piggybank, HBase, and time jars, and then define the two time functions we will use.

   grunt >  cd /user/data
   grunt >  REGISTER bigsql-9.3.0-rc2.56/pig-0.11.1-src/contrib/piggybank/java/piggybank.jar 
   grunt >  REGISTER bigsql-9.3.0-rc2.56/pig-0.11.1-src/build/ivy/lib/Pig/joda-time-1.6.jar
   grunt >  REGISTER bigsql-9.3.0-rc2.56/hbase/lib/protobuf-java-2.4.0a.jar
   grunt >  DEFINE CustomFormatToISO org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
   grunt >  DEFINE ISOYearsBetween org.apache.pig.piggybank.evaluation.datetime.diff.ISOYearsBetween();

Next, load the three tables from the Hadoop file system:

   grunt >  users = LOAD '/user/data/users/users.csv' USING PigStorage(',') AS (SWID:chararray,BIRTH_DT:chararray,GENDER_CD:chararray);
   grunt >  products = LOAD '/user/data/products/products.tsv' USING PigStorage('\t') AS (url:chararray,category:chararray);
   grunt >  logs = LOAD '/user/data/logs/Omniture.0.tsv' USING PigStorage('\t') AS (col_0:chararray,col_1:chararray,col_2:chararray,col_3:chararray,col_4:chararray,col_5:chararray,col_6:chararray,col_7:chararray,col_8:chararray,col_9:chararray,col_10:chararray,col_11:chararray,col_12:chararray,col_13:chararray,col_14:chararray,col_15:chararray,col_16:chararray,col_17:chararray,col_18:chararray,col_19:chararray,col_20:chararray,col_21:chararray,col_22:chararray,col_23:chararray,col_24:chararray,col_25:chararray,col_26:chararray,col_27:chararray,col_28:chararray,col_29:chararray,col_30:chararray,col_31:chararray,col_32:chararray,col_33:chararray,col_34:chararray,col_35:chararray,col_36:chararray,col_37:chararray,col_38:chararray,col_39:chararray,col_40:chararray,col_41:chararray,col_42:chararray,col_43:chararray,col_44:chararray,col_45:chararray,col_46:chararray,col_47:chararray,col_48:chararray,col_49:chararray,col_50:chararray,col_51:chararray,col_52:chararray,col_53:chararray,col_54:chararray,col_55:chararray,col_56:chararray,col_57:chararray,col_58:chararray,col_59:chararray,col_60:chararray,col_61:chararray,col_62:chararray,col_63:chararray,col_64:chararray,col_65:chararray,col_66:chararray,col_67:chararray,col_68:chararray,col_69:chararray,col_70:chararray,col_71:chararray,col_72:chararray,col_73:chararray,col_74:chararray,col_75:chararray,col_76:chararray,col_77:chararray,col_78:chararray,col_79:chararray,col_80:chararray,col_81:chararray,col_82:chararray,col_83:chararray,col_84:chararray,col_85:chararray,col_86:chararray,col_87:chararray,col_88:chararray,col_89:chararray,col_90:chararray,col_91:chararray,col_92:chararray,col_93:chararray,col_94:chararray,col_95:chararray,col_96:chararray,col_97:chararray,col_98:chararray,col_99:chararray,col_100:chararray,col_101:chararray,col_102:chararray,col_103:chararray,col_104:chararray,col_105:chararray,col_106:chararray,col_107:chararray,col_108:chararray,col_109:chararray,col_110:chararray,col_111:chararray,col_112:chararray,col_113:chararray,col_114:chararray,col_115:chararray,col_116:chararray,col_117:chararray,col_118:chararray,col_119:chararray,col_120:chararray,col_121:chararray,col_122:chararray,col_123:chararray,col_124:chararray,col_125:chararray,col_126:chararray,col_127:chararray,col_128:chararray,col_129:chararray,col_130:chararray,col_131:chararray,col_132:chararray,col_133:chararray,col_134:chararray,col_135:chararray,col_136:chararray,col_137:chararray,col_138:chararray,col_139:chararray,col_140:chararray,col_141:chararray,col_142:chararray,col_143:chararray,col_144:chararray,col_145:chararray,col_146:chararray,col_147:chararray,col_148:chararray,col_149:chararray,col_150:chararray,col_151:chararray,col_152:chararray,col_153:chararray,col_154:chararray,col_155:chararray,col_156:chararray,col_157:chararray,col_158:chararray,col_159:chararray,col_160:chararray,col_161:chararray,col_162:chararray,col_163:chararray,col_164:chararray,col_165:chararray,col_166:chararray,col_167:chararray,col_168:chararray,col_169:chararray,col_170:chararray,col_171:chararray,col_172:chararray,col_173:chararray,col_174:chararray,col_175:chararray,col_176:chararray,col_177:chararray,col_178:chararray);

Select only the important fields from the logs table, convert the ‘yyyy-MM-dd HH:mm:ss’ format dates to an ISO timestamp, and ensure that all of the countries and states are uppercase strings:

   grunt >  omniture = FOREACH logs GENERATE CustomFormatToISO(col_1, 'yyyy-MM-dd HH:mm:ss') AS logdate, col_7 AS ip, col_12 AS url, col_13 AS swid, col_49 AS city, col_50 AS country, col_52 AS state;
   grunt >  omniture2 = FOREACH omniture GENERATE logdate, url, ip, city, UPPER(state) AS state, UPPER(country) AS country, swid;

Convert the users’ table’s ‘dd-MMM-yy’ format birth dates to ISO timestamps and then calculate the age of each user:

   grunt >  users2 = FILTER users BY BIRTH_DT IS NOT NULL;
   grunt >  users3 = FOREACH users2 GENERATE SWID,ISOYearsBetween('2013-01-01T02:00:00.000Z', CustomFormatToISO(BIRTH_DT, 'dd-MMM-yy')) AS AGE,GENDER_CD;
   grunt >  users4 = JOIN users BY (SWID, GENDER_CD) LEFT OUTER, users3 BY (SWID, GENDER_CD);
   grunt >  users5 = FOREACH users4 GENERATE users::SWID AS swid, users3::AGE AS age, users::GENDER_CD AS gender;

Next, join the logs table with the user table by the user id, and then join that table with the products table by the url:

   grunt >  webloganalytics1 = JOIN omniture2 BY swid LEFT OUTER, users5 BY swid;
   grunt >  webloganalytics2 = JOIN webloganalytics1 BY url LEFT OUTER, products BY url;
   grunt >  webloganalytics3 = ORDER webloganalytics2 BY logdate;
   grunt >  webloganalytics4 = FOREACH webloganalytics3 GENERATE logdate, products::url, ip, city,state, country, category, users5::age, users5::gender;

For this tutorial, instead of just saving the data to HDFS, it will actually be loaded into HBase. In order to accomplish this, we first need to add a key to the webloganalytics table.

   grunt >  webloganalytics = RANK webloganalytics4;    
   grunt >  STORE webloganalytics INTO 'hbase://webloganalytics' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:logdate info:url user:ip location:city location:state location:country info:category user:age user:gender');

The resulting data, not including the new id:

Now, to use the foreign data wrapper’s insert, update, and delete capabilities, you need to create the corresponding Hive table:

   hive > CREATE TABLE webloganalytics ( id INT, logdate STRING, url STRING, ip STRING, city STRING, state STRING, country STRING, category STRING,   age INT, gender STRING ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" =":key,info:logdate,info:url,user:ip, location:city,location:state,location:country,info:category,user:age,user:gender") TBLPROPERTIES("hbase.table.name"="webloganalytics");

Next, use the Hadoop Foreign Data Wrapper to create the corresponding Postgres table:

   postgres=# CREATE EXTENSION hadoop_fdw;

   postgres=# CREATE SERVER hadoop_server FOREIGN DATA WRAPPER hadoop_fdw OPTIONS (address '127.0.0.1', port '10000');

   postgres=# CREATE USER MAPPING FOR PUBLIC SERVER hadoop_server;

   postgres=# CREATE FOREIGN TABLE webloganalytics ( id INT, logdate TIMESTAMP, url TEXT, ip TEXT, city TEXT, state TEXT, country TEXT, category TEXT, age INT, gender TEXT ) SERVER hadoop_server OPTIONS ( table 'webloganalytics' hbase_port '9090', hbase_address 'localhost', hbase_mapping ':key,info:logdate,info:url,user:ip,location:city, location:state,location:country,info:category,user:age,user:gender' );

Now we can show how we are able to select, insert, update, and delete data from Postgres:

postgres=# SELECT * FROM webloganalytics WHERE state = 'CA' LIMIT 5;
  id    |       logdate     |                  url                 |       ip       |     city   | state  | country  |category | age | gender 
 --------+---------------------+-------------------------------------------+------------------+-------------+--------+---------+----------+-----+-------- 
 100002 | 2012-03-13 12:36:42 | http://www.acme.com/SH55126545/VD55177927 | 108.0.201.188   | long beach  | CA    | USA     | clothing |  25 | U
 100005 | 2012-03-13 12:36:43 | http://www.acme.com/SH55126545/VD55177927 | 108.0.201.188   | long beach  | CA    | USA     | clothing |  25 | U
 100010 | 2012-03-13 12:37:07 | http://www.acme.com/SH55126545/VD55179433 | 162.119.238.163 | los angeles | CA    | USA     | shoes    |     | 
 100011 | 2012-03-13 12:37:07 | http://www.acme.com/SH55126545/VD55179433 | 162.119.238.163 | los angeles | CA    | USA     | shoes    |     | 
 100012 | 2012-03-13 12:37:10 | http://www.acme.com/SH55126545/VD55179433 | 108.198.177.186 | los angeles | CA    | USA     | shoes    |  27 | M

postgres=# DELETE FROM webloganalytics WHERE id=100010;
postgres=# SELECT * FROM webloganalytics WHERE state = 'CA' LIMIT 5;
  id    |       logdate     |                  url                 |       ip       |     city   | state  | country  |category | age | gender 
 --------+---------------------+-------------------------------------------+------------------+-------------+--------+---------+----------+-----+-------- 
 100002 | 2012-03-13 12:36:42 | http://www.acme.com/SH55126545/VD55177927 | 108.0.201.188   | long beach  | CA    | USA     | clothing |  25 | U
 100005 | 2012-03-13 12:36:43 | http://www.acme.com/SH55126545/VD55177927 | 108.0.201.188   | long beach  | CA    | USA     | clothing |  25 | U
 100011 | 2012-03-13 12:37:07 | http://www.acme.com/SH55126545/VD55179433 | 162.119.238.163 | los angeles | CA    | USA     | shoes    |     | 
 100012 | 2012-03-13 12:37:10 | http://www.acme.com/SH55126545/VD55179433 | 108.198.177.186 | los angeles | CA    | USA     | shoes    |  27 | M
 100013 | 2012-03-13 12:37:13 | http://www.acme.com/SH55126545/VD55177927 | 108.0.201.188   | long beach  | CA    | USA     | clothing |  25 | U

postgres=# UPDATE webloganalytics SET age=30, gender='F' WHERE id=100011;
postgres=# SELECT * FROM webloganalytics WHERE state = 'CA' LIMIT 5;                                             
  id    |       logdate     |                  url                 |       ip       |     city   | state  | country  |category | age | gender 
 --------+---------------------+-------------------------------------------+------------------+-------------+--------+---------+----------+-----+-------- 
100002  | 2012-03-13 12:36:42 | http://www.acme.com/SH55126545/VD55177927 | 108.0.201.188   | long beach  | CA    | USA     | clothing |  25 | U 
100005  | 2012-03-13 12:36:43 | http://www.acme.com/SH55126545/VD55177927 | 108.0.201.188   | long beach  | CA    | USA     | clothing |  25 | U 
100011  | 2012-03-13 12:37:07 | http://www.acme.com/SH55126545/VD55179433 | 162.119.238.163 | los angeles | CA    | USA     | shoes    |  30 | F 
100012  | 2012-03-13 12:37:10 | http://www.acme.com/SH55126545/VD55179433 | 108.198.177.186 | los angeles | CA    | USA     | shoes    |  27 | M
100013  | 2012-03-13 12:37:13 | http://www.acme.com/SH55126545/VD55177927 | 108.0.201.188   | long beach  | CA    | USA     | clothing |  25 | U

postgres=# INSERT INTO webloganalytics VALUES(100010,'2012-03-13 12:37:07','http://www.acme.com/SH55126545/VD55179433','162.119/238/163','los angeles','CA','USA','shoes',30,'F');
INSERT 0 1
postgres=# SELECT * FROM webloganalytics WHERE state = 'CA' LIMIT 5;
  id    |       logdate     |                  url                 |       ip       |     city   | state  | country  |category | age | gender 
 --------+---------------------+-------------------------------------------+------------------+-------------+--------+---------+----------+-----+-------- 
 100002 | 2012-03-13 12:36:42 | http://www.acme.com/SH55126545/VD55177927 | 108.0.201.188   | long beach  | CA    | USA     | clothing |  25 | U
 100005 | 2012-03-13 12:36:43 | http://www.acme.com/SH55126545/VD55177927 | 108.0.201.188   | long beach  | CA    | USA     | clothing |  25 | U
 100010 | 2012-03-13 12:37:07 | http://www.acme.com/SH55126545/VD55179433 | 162.119/238/163 | los angeles | CA    | USA     | shoes    |  30 | F
 100011 | 2012-03-13 12:37:07 | http://www.acme.com/SH55126545/VD55179433 | 162.119.238.163 | los angeles | CA    | USA     | shoes    |  30 | F
 100012 | 2012-03-13 12:37:10 | http://www.acme.com/SH55126545/VD55179433 | 108.198.177.186 | los angeles | CA    | USA     | shoes    |  27 | M

JasperReports with Hadoop – Tutorial

This tutorial covers step-by-step instructions for running JasperReports against Hadoop using Jaspersoft Studio’s simple, easy-to-use environment, and for publishing the created sample report to a JasperReports Server connected to Jaspersoft Studio.

1. Prerequisites

a) The BigSQL Tutorial (http://www.bigsql.org/se/tutorial.jsp) must be followed through Step 3; a rough sketch of what these steps produce appears after the list below.

Step 1 : Creating example.customer_history in HIVE and loading values.
Step 2 : Creating example.customer_history as foreign table in Postgres
Step 3 : Verifying example.customer_history
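
For orientation only, here is a rough, hedged sketch of what those steps amount to; the column names and types are inferred from the report fields used later in this tutorial, and the authoritative commands live in the BigSQL tutorial itself:

   hive > CREATE DATABASE IF NOT EXISTS example;
   hive > CREATE TABLE example.customer_history ( hist_id INT, h_date TIMESTAMP, h_amount DECIMAL, h_data STRING );

   hive > SELECT COUNT(*) FROM example.customer_history;

   postgres=# SELECT * FROM example.customer_history LIMIT 5;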

2. Things required to be downloaded

a) Download Jaspersoft Studio for your OS version from http://community.jaspersoft.com/project/jaspersoft-studio/releases

  • Select your OS version from Link column

  • After the download completes, install the Studio

b) Download JasperReports Server from http://community.jaspersoft.com/project/jasperreports-server/releases for your OS version

c) Download JDBC file from http://jdbc.postgresql.org/download/postgresql-9.2-1003.jdbc4.jar

3. Simple Data Adaptor configuration

a) Verify if BIGSQL is running

$ ./bigsql status
#################################################
# BIGSQL:- your BIGSQL version
# TIME: current time
# HOST: your hostname
# JAVA: path to your JAVA_HOME
# OS: your OS
#################################################

## Checking critical ports ################
Postgres port 5432 is busy.
ZooKeeper port 2181 is busy.
HDFS port 50070 is busy.
HBase port 60010 is busy.
Hive port 9083 is busy.
Tomcat port 8080 is busy.

b) Run the installed Jaspersoft Studio instance

1) For the Data Adapter, under the Repository tab, right-click on Data Adapters → Create New Data Adapter → Next → Database JDBC Connection → Next

2) Database Location

    • Enter Name “BigSQL”
    • JDBC Driver : org.postgresql.Driver
    • JDBC Url : jdbc:postgresql://localhost:5432/postgres
    • Username : postgres
    • Password : password

3) Driver Classpath

    • Click on Add Button → Select Driver Path → Click Open

c) Click the Finish button to add the Data Adapter

4. Easy Report Creation method

a) Create a new Project

  • File → New → Project → JasperReports Project.

b) Create New Jasper Reports

  • File → New → Jasper Report → Blank A4 → Enter Name: bigsql.jrxml → Finish.

c) Select the created report and open the "Source" view.

  • Select all and delete the selection.

d) Hadoop JasperTest Report

Copy the report XML below and paste it into the ‘Source’ tab of the created report:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Created with Jaspersoft Studio version last-->
<jasperReport xmlns="http://jasperreports.sourceforge.net/jasperreports" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://jasperreports.sourceforge.net/jasperreports http://jasperreports.sourceforge.net/xsd/jasperreport.xsd" name="BigSQL" language="groovy" pageWidth="595" pageHeight="842" columnWidth="535" leftMargin="20" rightMargin="20" topMargin="20" bottomMargin="20" uuid="73c28c9e-7a45-457e-bb9c-d3b4a665aec8">
<property name="com.jaspersoft.studio.data.defaultdataadapter" value="BigSQL"/>
<style name="Title" fontName="Times New Roman" fontSize="50" isBold="true" pdfFontName="Times-Bold"/>
<style name="SubTitle" forecolor="#736343" fontName="Arial" fontSize="18"/>
<style name="Column header" forecolor="#666666" fontName="Arial" fontSize="12" isBold="true"/>
<style name="Detail" fontName="Arial" fontSize="12"/>
<style name="Row" mode="Transparent">
<conditionalStyle>
<conditionExpression><![CDATA[$V{REPORT_COUNT}%2 == 0]]></conditionExpression>
<style backcolor="#E6DAC3"/>
</conditionalStyle>
</style>
<queryString>
<![CDATA[Select * from example.customer_history;]]>
</queryString>
<field name="hist_id" class="java.lang.Integer"/>
<field name="h_date" class="java.sql.Timestamp"/>
<field name="h_amount" class="java.math.BigDecimal"/>
<field name="h_data" class="java.lang.String"/>
<background>
<band splitType="Stretch"/>
</background>
<title>
<band height="136" splitType="Stretch">
<staticText>
<reportElement uuid="3cbe62a2-fc4f-4470-8a41-04ba5a76f4ff" style="Title" x="170" y="40" width="263" height="62"/>
<textElement verticalAlignment="Middle">
<font pdfFontName="Times-Roman"/>
</textElement>
<text><![CDATA[BigSQL Report]]></text>
</staticText>
<staticText>
<reportElement uuid="9c023796-93e4-4fbf-a609-9537018b189c" style="SubTitle" x="350" y="50" width="196" height="22"/>
<textElement>
<font fontName="Times New Roman" pdfFontName="Times-Roman"/>
</textElement>
<text><![CDATA[with Jasper Reports]]></text>
</staticText>
<image>
<reportElement uuid="805e8dcf-c46e-49df-9647-4f14646c972d" x="-1" y="0" width="161" height="110"/>
<imageExpression><![CDATA["/Users/alisaggu/Desktop/logo10.png"]]></imageExpression>
</image>
</band>
</title>
<pageHeader>
<band splitType="Stretch"/>
</pageHeader>
<columnHeader>
<band height="16" splitType="Stretch">
<line>
<reportElement uuid="26e9c0f5-5aea-4bd6-848c-8f9820d2cb6c" positionType="FixRelativeToBottom" x="0" y="15" width="555" height="1"/>
<graphicElement>
<pen lineWidth="0.5" lineColor="#999999"/>
</graphicElement>
</line>
<staticText>
<reportElement uuid="ad2d59e2-f15f-4475-9fc4-8068d571e552" style="Column header" x="0" y="0" width="138" height="15" forecolor="#736343"/>
<textElement/>
<text><![CDATA[Hist Id]]></text>
</staticText>
<staticText>
<reportElement uuid="90d84a3f-e23f-4090-9db8-42fca0ced0ab" style="Column header" x="138" y="0" width="138" height="15" forecolor="#736343"/>
<textElement/>
<text><![CDATA[Date Due]]></text>
</staticText>
<staticText>
<reportElement uuid="50862542-096a-4072-8665-88c1238fa7c5" style="Column header" x="276" y="0" width="138" height="15" forecolor="#736343"/>
<textElement/>
<text><![CDATA[Total Amount]]></text>
</staticText>
<staticText>
<reportElement uuid="8e8735a4-a896-4eef-876a-0472fabf5493" style="Column header" x="414" y="0" width="138" height="15" forecolor="#736343"/>
<textElement/>
<text><![CDATA[Raw Data]]></text>
</staticText>
</band>
</columnHeader>
<detail>
<band height="15" splitType="Stretch">
<frame>
<reportElement uuid="840d241f-46d5-4267-abf2-fbfabe77a1ba" style="Row" mode="Opaque" x="0" y="0" width="555" height="15"/>
<textField isStretchWithOverflow="true">
<reportElement uuid="904e88f2-9f49-494c-9e20-7a0f2a66c46e" style="Detail" x="0" y="0" width="138" height="15"/>
<textElement/>
<textFieldExpression><![CDATA[$F{hist_id}]]></textFieldExpression>
</textField>
<textField isStretchWithOverflow="true">
<reportElement uuid="745e034c-c2b7-406d-97b4-19f6a5a1f2ec" style="Detail" x="138" y="0" width="138" height="15"/>
<textElement/>
<textFieldExpression><![CDATA[$F{h_date}]]></textFieldExpression>
</textField>
<textField isStretchWithOverflow="true">
<reportElement uuid="c174e25f-9c68-4386-8968-1bda99810e3e" style="Detail" x="276" y="0" width="138" height="15"/>
<textElement/>
<textFieldExpression><![CDATA[$F{h_amount}]]></textFieldExpression>
</textField>
<textField isStretchWithOverflow="true">
<reportElement uuid="aa181d78-3fb6-47dd-ae2f-bc1aafba23ae" style="Detail" x="414" y="0" width="138" height="15"/>
<textElement/>
<textFieldExpression><![CDATA[$F{h_data}]]></textFieldExpression>
</textField>
</frame>
</band>
</detail>
<columnFooter>
<band height="45" splitType="Stretch">
<line>
<reportElement uuid="7c17f014-3c79-4265-b557-b38cf17c5f0f" positionType="FixRelativeToBottom" x="0" y="3" width="555" height="1"/>
<graphicElement>
<pen lineWidth="0.5" lineColor="#999999"/>
</graphicElement>
</line>
</band>
</columnFooter>
<pageFooter>
<band height="25" splitType="Stretch">
<frame>
<reportElement uuid="1f9a92db-af7d-429a-ba60-a43bd78a783e" mode="Opaque" x="-21" y="1" width="597" height="24" forecolor="#D0B48E" backcolor="#F2EBDF"/>
<textField evaluationTime="Report">
<reportElement uuid="17fc328e-c16b-477d-a061-a733a9167907" style="Column header" x="533" y="0" width="40" height="20" forecolor="#736343"/>
<textElement verticalAlignment="Middle">
<font size="10" isBold="false"/>
</textElement>
<textFieldExpression><![CDATA[" " + $V{PAGE_NUMBER}]]></textFieldExpression>
</textField>
<textField>
<reportElement uuid="8087e1e6-d232-4090-97c3-72d455154cab" style="Column header" x="453" y="0" width="80" height="20" forecolor="#736343"/>
<textElement textAlignment="Right" verticalAlignment="Middle">
<font size="10" isBold="false"/>
</textElement>
<textFieldExpression><![CDATA["Page "+$V{PAGE_NUMBER}+" of"]]></textFieldExpression>
</textField>
<textField pattern="EEEEE dd MMMMM yyyy">
<reportElement uuid="2a3dc088-bcb4-4488-8bc8-7809bdaf9901" style="Column header" x="22" y="1" width="197" height="20" forecolor="#736343"/>
<textElement verticalAlignment="Middle">
<font size="10" isBold="false"/>
</textElement>
<textFieldExpression><![CDATA[new java.util.Date()]]></textFieldExpression>
</textField>
</frame>
</band>
</pageFooter>
<summary>
<band splitType="Stretch"/>
</summary>
</jasperReport>

5. Preview your created report

Save and click on Preview to view the report.

6. Finally, Publishing the Report

a) Install the downloaded JasperReports Server and launch it.

Log in to the web interface with the admin account:
username : jasperadmin
password : jasperadmin

b) Create a new JasperReports Server connection under repository tab in Jaspersoft Studio

c) A dialog to insert the server data appears. Fill it in as follows:

  • Name: the name of the connection. You can use any name you want. For this example we will leave the default name: JasperReports Server.
  • URL: the address of the server. The default address is already correct if we are using a local server. For this example the correct address is http://localhost:8080/jasperserver/
  • User: the username to access the server. The default for the local server is “jasperadmin”.
  • Password: as with the username, for the local server by default it is “jasperadmin”.

Then click the Test Connection button to test the connection. If everything is working, click Finish.

d) Open the report and click the button with a blue arrow in the upper-right corner of the designer. In the opened window you can browse the server directory structure to choose where to place the report. Select the Reports folder, name the report, and then click Next.

e) Create the data source from a local data source.

  • Select Datasource JDBC, then click Next
  • Enter a name and id, then click Next
  • Click Import from Jaspersoft Studio, select the BigSQL JDBC connection, and click Finish to create the data source

f) On the web interface, logged in with the admin account, click on the Reports folder.

g) Click on the published BigSQL report to view it on the next screen.

HadoopFDW

With the release of PostgreSQL 9.3, we can do some really cool things with writable foreign tables. BigSQL has just released a Hadoop Foreign Data Wrapper that can write to HDFS files and HBase tables. The HBase integration allows for full SELECT, INSERT, UPDATE, and DELETE syntax through PostgreSQL, and the HDFS integration allows for SELECT and INSERT.

The HadoopFDW is released under the PostgreSQL license and can be found here.
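
For a feel of what that enables, here is a minimal, hedged sketch exercising the HBase-backed path, reusing the server and table set up in the Pentaho tutorial above:

   postgres=# CREATE EXTENSION hadoop_fdw;
   postgres=# CREATE SERVER hadoop_server FOREIGN DATA WRAPPER hadoop_fdw OPTIONS (address '127.0.0.1', port '10000');
   postgres=# CREATE USER MAPPING FOR PUBLIC SERVER hadoop_server;
   postgres=# UPDATE webloganalytics SET gender = 'F' WHERE id = 100011;
   postgres=# DELETE FROM webloganalytics WHERE id = 100010;

HDFS-backed foreign tables accept only SELECT and INSERT, so the UPDATE and DELETE above apply to the HBase mapping.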


Tutorial – Run BIRT against Hadoop

This tutorial covers setting up BIRT (Business Intelligence and Reporting Tools) to run against Hadoop. BIRT is a relatively simple yet powerful analytics tool, and this tutorial will help you harness that power against Hadoop.

1. Pre-Requisites

BigSQL Tutorial

The BigSQL Tutorial (http://www.bigsql.org/se/tutorial.jsp) needs to be followed through Step 3:

Step 1 : Creating example.customer_history in HIVE and loading values.
Step 2 : Creating example.customer_history as foreign table in Postgres
Step 3 : Verifying example.customer_history

Eclipse with BIRT

a) Download Eclipse with BIRT framework for your OS version from http://www.eclipse.org/downloads/packages/eclipse-ide-java-and-report-developers/keplerr

  • Select your OS version from Link column

  • Select a nearby mirror for the download location

b) Extract the downloaded .tar file and run Eclipse, setting up the workspace location.

2. Eclipse, Creation of Report

a) Run an instance of Eclipse with BIRT reporting configured.

b) In the menu, click on Window → Open Perspective → Report Design

c) From the menu Click on File → New → Report

  • Enter Parent Folder

  • Enter File Name
  • Click on Finish button

3. Data Source configuration

a) Verify if BIGSQL is running

$ ./bigsql status

#################################################
# BIGSQL:- your BIGSQL version
# TIME: current time
# HOST: your hostname
# JAVA: path to your JAVA_HOME
# OS: your OS
#################################################
## Checking critical ports ################
Postgres port 5432 is busy.
ZooKeeper port 2181 is busy.
HDFS port 50070 is busy.
HBase port 60010 is busy.
Hive port 9083 is busy.
Tomcat port 8080 is busy.

b) Open Window → Show View → Data Explorer

c) Download JDBC file from http://jdbc.postgresql.org/download/postgresql-9.2-1003.jdbc4.jar

d) Right-click on the Data Sources folder and select 'New Data Source'

e) Select ‘JDBC Data Source’ and click Next button.

f) Click on Manage Drivers and add the JDBC driver downloaded in (c).

g) Select/Enter:

  • Driver Class : org.postgresql.Driver
  • Database URL : jdbc:postgresql:postgres
  • User Name : postgres
  • Password : password

h) Click the “Test Connection” button to verify the connection

4. Dataset and Report

a) Right-click on 'Datasets' and select 'New Data Set'

b) Select the JDBC data source, enter the data source name, and click Next.

c) Enter the following in the "Query Text" window:

select hist_id , h_date , h_data from example.customer_history;

d) Drag the created Data Set and drop it in the blank report window to create the reporting table.

5. Report View

In the top menu, click on Run → View Report → In Web Viewer

The Three “V”s of Big Data

Big Data. What’s the big deal, really? Data has been around for a long time, and organizations have been using it to their advantage forever. So what is so different now?
Legitimate questions from anyone who is sick of hearing all the hype around Big Data. Granted, you will come across many instances where the term is used just because it is the fashionable thing to do. At the same time, the drive towards what is called Big Data is very real. The acceleration is due to expansion of data in not one, not two, but three dimensions simultaneously. These three dimensions are famously known as the Three “V”s of Big Data.

Volume 

This is the obvious one. From megabytes to gigabytes to terabytes to petabytes, the volume of data is increasing exponentially. You can understand this quite simply by comparing the size of your typical text file with that of a typical MP3. Take it forward and compare sizes with video files, and then take it a step further to look at the size of what is now the norm for all of us, i.e. HD video. The increase in size is pretty dramatic, and of course that means the volume of data being collected at any place increases.
But that’s not the only thing that drives volume. Sources of data have increased significantly as well. Take, as an example, a typical cell phone we carry around. Not so long ago, the only data it had was voice calls and texts. With the advent of smartphones and the mind boggling array of sensors and other equipment within them, the amount of data they carry is staggering. Combine this with the Internet explosion, social media, digitized businesses, e-commerce and you will begin to understand why it’s not only the size of individual files that are driving data volumes in today’s world.
It is no longer ponds or lakes of data organizations are dealing with, it is vast expanses of oceans. And when you are in an ocean, you have a whole different set of challenges to navigate your way through the hazards of rough seas.

Velocity

This is probably the least understood of the 3 “V”s. There are multiple angles to look at velocity as a dimension in which data is expanding.
First is the sheer rate at which data is coming in. Along with ever-faster connectivity, this correlates directly with the volume dimension described above.
It is not just the rate at which data is pouring in, it is also how quickly it needs to be processed. Batch processing is not good enough; business managers need data in real time. As I mentioned in a previous post (http://openscg.com/2013/07/bi-and-hadoop/), “I will get back to you tomorrow” is no longer an acceptable answer!
Thirdly, data is only useful as long as it is being processed faster than it is entering the system. If the velocity of data processing is less than the velocity of data entering the system, all you would end up with is an ever-growing backlog. And of course, because the demand is for real-time data, back-logged information is not really of any use to anyone.
Lastly, you need to be prepared for not only consistently high velocity, but also acceleration and, at times, bursts of intense activity.

Variety

Think about all the various types of data that you have the ability to gather with today’s technology: audio/video streams, GPS, sensors, documents, SMS/MMS, Flash, images, and so on.
Every organization gathers data and every organization wants to use that data. In order to convert this ‘dark’ unstructured data into information, traditional structured approaches no longer suffice.
Wrapping up, I hope you can see now why Big Data has gained so much momentum lately and why organizations need to take urgent steps to ensure they are not left behind in this race. Proper tools are the key!

Hadoop Sqoop Tutorial HDFS Hive & Postgres

The Hadoop Sqoop Tutorial is the third in the series providing easy-to-follow examples of using a Hadoop cluster in conjunction with Postgres. This tutorial builds on the Hadoop Hive Tutorial, which should be considered a prerequisite if you want to run the examples provided. In this tutorial, we will create a Postgres table from public domain data provided by the New York City health department. The data has been slightly cleaned in the version attached to this blog article to save everyone time :-) This data will then be moved to HDFS Hive using the Sqoop import command. Finally, we will run some Hive queries and Postgres queries to validate the transfer! Assuming all of the prerequisites are in place, this should be a 15-minute exercise.

Sqoop is really not that difficult once your environment is up and running! The best way to learn Sqoop is to jump in and start testing out the various options and configuration settings Sqoop makes available.
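
A hedged starting point, since the bundled Sqoop 1.4.x ships a built-in help tool, is to ask Sqoop itself what each tool accepts:

$ sqoop help
$ sqoop help import
$ sqoop help list-tables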

Let’s Get Started!!

First, though, let’s just make sure that Sqoop is available and ready to go:

  • Change to the tutorial directory (if you’re not already there):

$ cd tutorial

  • Check the version of Sqoop you are running:

$ sqoop version

Sqoop 1.4.3 git commit id 82fde21d0f6c89059bade4588f3ba6989989dc76 Compiled by hari on Tue Feb 26 11:30:27 PST 2013 

If you do not see a similar response, go back to the first tutorial and ensure you have Hadoop properly installed.

Now let’s create a table in Postgres to hold the data we will eventually import into HDFS using Sqoop:

Here is a zip file containing a subset of the data in CSV format (Tutorial Data: Sample Data).

  • Unzip the data in the tutorial directory.  

Use the PostgreSQL COPY command to load the data into the Postgres tables.

  • Of course, we need to create the table before using the COPY command.
$ psql

CREATE  TABLE nycrestdata (
s_camis     TEXT,
s_dba       TEXT,
s_boro      TEXT,
s_building  TEXT,
s_street    TEXT,
s_zipcode   TEXT,
s_phone     TEXT,
s_cuscode   TEXT,
s_insp_date TEXT,
s_action    TEXT,
s_viocode   TEXT,
s_score     TEXT,
s_grade     TEXT,
s_g_date    TEXT,
s_r_date    TEXT
);

Copy, "paste into psql" and run the above code to create the table.

  • Now use the COPY command to load the text file into the table:

COPY nycrestdata FROM '/home/YOURHOME/bigsql/bigsql-9.3beta2-2/tutorial/restdata.csv' (DELIMITER ','); 

Sqoop prefers data with a Primary Key.

  • Add a PRIMARY KEY to the data:
ALTER TABLE nycrestdata ADD pk SERIAL PRIMARY KEY;

Don’t forget to \q out of Postgres.

Let’s verify that Sqoop can "see" the table in PostgreSQL:
$ sqoop list-tables --connect jdbc:postgresql://localhost:5432/postgres --username postgres --password password 

13/06/20 17:48:43 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead. 
13/06/20 17:48:43 INFO manager.SqlManager: Using default fetchSize of 1000
nycrestdata

Great, there is our table, nycrestdata!
Before moving further, let’s quickly create a Sqoop options file to store the parameters we will be using repeatedly.
Here are the options file contents; create a file called sqoop_parms.txt containing this data and save it to the tutorial directory:

  • Create the Options File

#
# Options file for Sqoop import
#

# Connect parameter and value
--connect
jdbc:postgresql://localhost:5432/postgres
# Username parameter and value
--username
postgres
#
# Password
--password
password

Now we will list the tables in PostgreSQL again, this time using the sqoop_parms.txt options file:

sqoop list-tables --options-file /home/YOURHOME/bigsql/bigsql-9.3beta2-2/tutorial/sqoop_parms.txt

Now let’s import the data from PostgreSQL to HDFS!

sqoop import --options-file /home/YOURHOME/bigsql/bigsql-9.3beta2-2/tutorial/sqoop_parms.txt --table nycrestdata --hive-import

You should see the MapReduce jobs execute; assuming success, we now have the table in both PostgreSQL and Hadoop Hive.
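
As a quick, hedged sanity check before the comparison queries below, you can confirm the table landed in Hive from the hive> prompt:

SHOW TABLES;
DESCRIBE nycrestdata;
SELECT COUNT(*) FROM nycrestdata;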

Let’s run some queries to ensure that the import worked and the data matches.

Let’s take a quick look at the number of restaurants by grade (multiple grades per establishment may be in the data!):

$ psql

  • Run a few SQL test queries

SELECT s_grade,count(*) FROM nycrestdata group by s_grade;

JFI, a grade of Z is a temporary grade until the data is updated! It’s the C’s you have to look out for!


SELECT DISTINCT s_dba from nycrestdata where s_grade LIKE '%C%';

Now let’s see if the Hive data correlates!

From the hive> prompt:

SELECT s_grade,count(*) FROM nycrestdata group by s_grade;

And now for the C’s again!

SELECT DISTINCT s_dba from nycrestdata where s_grade LIKE 'C';

Looks like the data matches as expected! Our next tutorial will cover extended functions of the FDW.

If you have not seen the foreign data wrapper in action, check out this tutorial!

BI and Hadoop

The Data Warehousing Institute anticipates that Hadoop technologies will soon become a common and valuable complement to established products and practices for BI (http://tdwi.org/research/2013/04/tdwi-best-practices-report-integrating-hadoop-into-business-intelligence-and-data-warehousing.aspx). Increasingly, data conscious businesses are putting their BI groups under pressure for all sorts of data and analytics. Where, some time back, a weekly report summarizing a few select KPIs was enough, expectations from BI have increased exponentially. The team is required now to not only furnish the required KPIs but also lend a hand in operational troubleshooting, running ad-hoc analytics, integrating with development teams in order to guide them on reporting needs, and of course be always available with data at their fingertips. “I will get back to you tomorrow” is no longer an acceptable answer!

In terms of availability of tools, on the face of it, the demands conflict with each other. While some tools out there will help you store long-term data, they will not let you run ad-hoc queries efficiently. Some other tools will be very good at ad-hoc analytics, but they will bog you down when you need to load huge quantities of data in real time. BI seems to be in a state of conflict all the time.

We think BigSQL is the answer. BigSQL brings the best of two worlds together in a single entity. On one hand, you get Hadoop’s scalability, high-availability, and ultra-fast bulk loads. On the other hand, you get to leverage PostgreSQL’s ANSI compliant SQL analytics and structured data approach. What BigSQL does for you is to help avoid the ‘either/or’ question. You no longer have to choose between SQL and NoSQL. You get both at the same time … you get to harness that immense power.

Go on, take a peek … http://www.bigsql.org/