My PL/pgSQL code is stuck, but where?

In the short time that I have been enhancing the PL profiler (https://bitbucket.org/openscg/plprofiler), I have been asked multiple times if it can be abused as a debugging aid. Not directly. The conversation goes something like this:

Q: When my PL code is stuck somewhere, can I turn on profiling and see where?
A: No.
Q: Will this feature be added?
A: No.

Of course that would be a useful feature. I don't argue that. And it seems that this is precisely how Oracle users find out where their PL/SQL code gets stuck or moves like molasses. However, the reason I am against adding this is that a profiler, or even parts of it, should not be enabled 24x7 on a production database server just so that it can eventually be used to extract some debugging information some day. There are other ways to get that information, and I will show you one of them.

If a program is stuck somewhere, one uses a debugger to get a backtrace. This works with C programs, like the database server, provided that symbol information is available. In that backtrace (and some variables) we also find the PL/pgSQL backtrace. Having symbol information available is also useful in case of a program crash, to find out why it crashed by loading the core dump into the debugger. 

Every line of PL code that is executed goes through the PL/pgSQL executor function exec_stmt(). At that place we find the current PL code's line number in a variable. Likewise, every function execution goes through either plpgsql_exec_function() or plpgsql_exec_trigger(). In those stack frames we find the OID of the function as well as its signature (name and call argument types).
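
For example, with gdb attached to a backend you can pull these values out by hand. A minimal sketch, assuming a server built with symbol information; the frame numbers here are made up and will vary with your build and call depth:

$ gdb -p 13239
(gdb) bt                          # full C-level backtrace of the backend
(gdb) frame 4                     # pick one of the exec_stmt() frames
(gdb) print stmt->lineno          # current PL/pgSQL source line
(gdb) frame 7                     # pick the plpgsql_exec_function() frame
(gdb) print func->fn_oid          # OID of the PL function
(gdb) print func->fn_signature    # function name and argument types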

Doing this by hand, possibly several times for a deeply nested PL/pgSQL program, is tedious and no fun. So here is a little script called plbacktrace (https://github.com/wieck/plbacktrace.git) that does all of that. It is invoked with the PID of the PostgreSQL backend and will output information like:



[postgres@db1 tmp]$ ./plbacktrace.py 13239
fn_oid=105672 lineno=5 func="life_get_cell_pl(life_board,integer,integer)"
fn_oid=105673 lineno=12 func="life_num_neighbors_pl(life_board,integer,integer)"
fn_oid=105675 lineno=11 func="life_tick_pl(life_board)"
fn_oid=105674 lineno=22 func="life_pl(integer,integer,boolean)"
I will explain another time why I have a database server playing Conway's Game of Life written in PL/pgSQL.


Regards, Jan


PostgreSQL Studio 2.0 is out

PostgreSQL Studio is a fully open source web-based UI for PostgreSQL that is ideal for Postgres in the cloud. With the impending release of PostgreSQL 9.5, PostgreSQL Studio 2.0 has just been released, supporting some of the new features in Postgres. The big new feature that requires UI changes is support for Row Level Security. With Row Level Security, policies need to be created on each table for the role accessing the data. These policies can become rather complex, so seeing them tied to the table can greatly help with understanding the security placed on your data.
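
For example, restricting a table to rows owned by the current user takes only two statements with the 9.5 syntax. A minimal sketch; the accounts table, its owner column, and the app_users role are hypothetical:

ALTER TABLE accounts ENABLE ROW LEVEL SECURITY;

CREATE POLICY account_owner ON accounts
    FOR ALL
    TO app_users
    USING (owner = current_user);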

[Screenshot: rls_policy]

In addition to the new features, we also updated the versions of the libraries and stopped support for JDK 6.

You can download PostgreSQL Studio at: http://www.postgresqlstudio.org/

And while I have your attention…
The call for presentations for PGConf US 2016 is now open! Submissions will be allowed through January 31st, 2016. All speakers will be notified by February 15, 2016. Please note that as with our past conferences, there will be no extensions of the submission deadline. For more information and submission details, please visit http://www.pgconf.us/2016/submit/

Synchronous Commit

While I was at PGConf.EU a couple of weeks ago in Madrid, I attended a talk by Grant McAlister discussing Amazon RDS for PostgreSQL. While it was interesting to see how Amazon had made it very simple for developers to get a production PostgreSQL instance quickly, the thing that really caught my eye was the performance benchmark comparing the fsync and synchronous commit parameters.

Frighteningly, it is not that uncommon for people to turn off fsync to get a performance gain out of their PostgreSQL database. While the performance gain is dramatic, it carries the risk that your database could become corrupt. In some cases this may be OK, but those cases are really rather rare. A more common case is a database where it is OK to lose a little data in the event of a crash. This is where synchronous commit comes in. When synchronous commit is off, the server returns success to the client immediately, but waits a short period of time before flushing the data to disk. When the data is ultimately flushed, it is still properly synced to disk, so there is no chance of data corruption. The only risk in the event of a crash is that you may lose some transactions. The default setting for this window (the wal_writer_delay parameter) is 200ms.
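
A nice property of synchronous commit is that, unlike fsync, it can be changed without a restart, per session or even per transaction, so you pay the durability cost only where it matters. A quick sketch (the audit_log table is hypothetical):

-- per session:
SET synchronous_commit = off;

-- per transaction: only this transaction is at risk in a crash
BEGIN;
SET LOCAL synchronous_commit = off;
INSERT INTO audit_log (message) VALUES ('non-critical event');
COMMIT;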

In Grant’s talk, he performed a benchmark showing that turning off synchronous commit gave a bigger performance gain than turning off fsync. His was an insert-only test, so I wanted to try a standard pgbench test. I didn’t come up with the same results, but I still saw a compelling case for leaving fsync on while turning off synchronous commit.

I ran a pgbench test with 4 clients and a scaling factor of 100 on a small EC2 instance running 9.3.5. What I saw was that turning off fsync resulted in a 150% performance gain, while turning off synchronous commit resulted in a 128% gain. Both are dramatic performance gains, but turning off synchronous commit carries a lot less risk.
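
If you want to reproduce the comparison, the setup is simple. A sketch of the commands, assuming a database named bench; run the test once per configuration, toggling fsync or synchronous_commit in postgresql.conf between runs:

$ pgbench -i -s 100 bench      # initialize with scaling factor 100
$ pgbench -c 4 -T 300 bench    # 4 clients, 5 minute run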

 

Speaking of conferences, the call for papers is open for PGConf US 2015. If there is a topic you’d like to present in New York in March, submit it here.


FTP I/O from PostgreSQL

PostgreSQL applications at times need to communicate with external servers using FTP. Oracle has the UTL_TCP package, which is very useful for PL/SQL applications. PostgreSQL doesn’t have an equivalent package, but it does have something really powerful … the ability to write functions in various scripting languages. PostgreSQL has four languages that are part of the core distribution: PL/pgSQL, PL/Tcl, PL/Perl, and PL/Python. The example I give below uses PL/Python to write to an FTP location.

Consider a scenario where you want to run a few PL/pgSQL routines to create daily data files and then place them on an FTP server for your customer. Theoretically, this is possible using external shell scripts, but you could very easily run into synchronization issues between your database’s functions and the scripts in your OS. A very easy way out is to write a small PL/Python function like the one below that gets called once your daily routines are done within the database.

    CREATE OR REPLACE FUNCTION put_file_on_ftp (file_name TEXT)
        RETURNS VOID AS
    $BODY$
        from ftplib import FTP
        ftp = FTP('ftp.testing.com')
        ftp.login('user_name', 'password')
        ftp.cwd('path/to/directory')
        ftp.storbinary('STOR my_file.bin', open(file_name, 'rb'))
        ftp.quit()
    $BODY$
    LANGUAGE plpythonu VOLATILE
    COST 100;

Let me explain the body of the above function line by line.

from ftplib import FTP

This line imports the FTP class from ftplib module. FTP class implements the client side of FTP protocol, which can be used in the subsequent steps of your function.

ftp = FTP('ftp.testing.com')

This returns an instance of the FTP class connected to the host listed as parameter.

ftp.login('user_name', 'password')

The instance returned by the FTP class above is being used here to actually login to the server. In case no user name and password are provided, the function would assume an anonymous login. Note that most FTP commands are allowed only after the user has successfully logged in.

Obviously you shouldn’t hard code your username and password in a function, but I am doing it here just to simplify the example.

ftp.cwd('path/to/directory')

This line switches to the target directory specified as a parameter.

ftp.storbinary('STOR my_file.bin', open(file_name, 'rb'))

This stores the file specified in file_name (including the complete path on the source server) on the target server as my_file.bin. The file is opened in binary mode ('rb') so the transfer is in binary format. The first argument should always be a valid STOR command.
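
ftp.quit()

This closes the FTP session cleanly once the transfer is complete.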

For more commands that you can use for FTP using PL/Python, please refer to the link: https://docs.python.org/2/library/ftplib.html.
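
Once the function is in place, the final step of your daily database routine can ship the file with a single call. The path below is just an example:

SELECT put_file_on_ftp('/tmp/daily_data.csv');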

Hope this was helpful!


Hierarchical Queries using PostgreSQL

One of the major hurdles faced when converting Oracle’s PL/SQL code to any other database is finding the equivalent of START WITH and CONNECT BY PRIOR. This Oracle syntax provides a very powerful way of writing hierarchical queries, something that comes in really handy when dealing with hierarchical data, which is fairly common in any enterprise.

Well, there is good news. There is an equivalent available in PostgreSQL … and it is not too hard!

Basically, in order to write the equivalent query in PostgreSQL, we use the WITH RECURSIVE syntax, essentially referring the query to its own output and hence recursively iterating through the result-set.

Let me explain with the help of a simple example.

Consider a simple table ‘places’ that lists out the name of a place under column ‘place’ and where it is located under the column ‘is_in’.

example=# \d places
   Table "public.places"
 Column | Type | Modifiers
--------+------+-----------
  place | text |
  is_in | text |

Consider the following data in this table:

example=# select * from places;
     place      |     is_in
----------------+----------------
 U.S.A          |
 California     | U.S.A
 Florida        | U.S.A
 Massachusetts  | U.S.A
 San Francisco  | California
 Los Angeles    | California
 San Diego      | California
 Miami          | Florida
 Orlando        | Florida
 Tampa          | Florida
 United Kingdom |
 London         | United Kingdom
 Leeds          | United Kingdom
(13 rows)

As you can see, the data is arranged in a hierarchy represented by the chart below:

[Chart: Hierarchical_chart]

An example result set someone might want to retrieve from such data is ‘give me all the places that are in U.S.A’. In Oracle, the query would look like the following:

 SELECT op.place, op.is_in
   FROM places op
  START WITH op.place = 'U.S.A'
CONNECT BY PRIOR op.place = op.is_in

The PostgreSQL equivalent would be:

WITH RECURSIVE q AS (SELECT place, is_in
                       FROM places
                      WHERE place = 'U.S.A'
                      UNION ALL
                     SELECT m.place, m.is_in
                       FROM places m
                       JOIN q ON q.place = m.is_in)
        SELECT place, is_in FROM q

This gives the following result:

     place     |   is_in
---------------+------------
 U.S.A         |
 California    | U.S.A
 Florida       | U.S.A
 Massachusetts | U.S.A
 San Francisco | California
 Los Angeles   | California
 San Diego     | California
 Miami         | Florida
 Orlando       | Florida
 Tampa         | Florida
(10 rows)

Let’s break this query down to understand what’s going on.

The general structure of a WITH RECURSIVE query is:

a)    The non-recursive term

b)    UNION (which discards duplicate rows) or UNION ALL (which keeps them)

c)    The recursive term, containing a reference to the query's own output
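
The same three parts are easy to see in the canonical counting example from the PostgreSQL documentation, which simply generates the numbers 1 through 5:

WITH RECURSIVE t(n) AS (
    SELECT 1                          -- (a) non-recursive term
    UNION ALL                         -- (b)
    SELECT n + 1 FROM t WHERE n < 5   -- (c) recursive term referencing t
)
SELECT n FROM t;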

Query evaluation is done as follows (taken from PostgreSQL docs):

  1. Evaluate the non-recursive term. For UNION (but not UNION ALL), discard duplicate rows. Include all remaining rows in the result of the recursive query, and also place them in a temporary working table.
  2. So long as the working table is not empty, repeat these steps:
    1. Evaluate the recursive term, substituting the current contents of the working table for the recursive self-reference. For UNION (but not UNION ALL), discard duplicate rows and rows that duplicate any previous result row. Include all remaining rows in the result of the recursive query, and also place them in a temporary intermediate table.
    2. Replace the contents of the working table with the contents of the intermediate table, then empty the intermediate table.
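
For our ‘U.S.A’ query this plays out as follows: the working table starts with the single row for U.S.A; the first pass of the recursive term returns every row whose is_in matches the working table, i.e. California, Florida and Massachusetts; the second pass returns the six cities located in those states; the third pass finds no row whose is_in is one of those cities, the working table becomes empty, and the recursion stops.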

An important thing to ensure while writing hierarchical queries like this is the end state, i.e. making sure that eventually the recursive portion of the query will return no rows. Without this end state, the query will go into an infinite loop. To illustrate, let’s suppose that in the above example we have another row in the ‘places’ table saying that ‘California’ is in ‘San Francisco’.

example=# select * from places;
      place     |     is_in
----------------+----------------
 U.S.A          |
 California     | U.S.A
 Florida        | U.S.A
 Massachusetts  | U.S.A
 San Francisco  | California
 Los Angeles    | California
 San Diego      | California
 Miami          | Florida
 Orlando        | Florida
 Tampa          | Florida
 United Kingdom |
 London         | United Kingdom
 Leeds          | United Kingdom
 California     | San Francisco

As you can see, the data essentially says that ‘San Francisco’ is in ‘California’ and ‘California’ is in ‘San Francisco’. This data will throw our query above into an infinite loop. The simplest way to avoid such a situation is to use UNION instead of UNION ALL to combine the recursive and non-recursive portions of the hierarchical query; UNION prevents the loop here because the cycle keeps producing rows that are exact duplicates of earlier result rows, and those duplicates get discarded. With that change, the results are as follows:

example=# WITH RECURSIVE q AS (SELECT place, is_in
                                 FROM places
                                WHERE place = 'U.S.A'
                                UNION
                               SELECT m.place, m.is_in
                                 FROM places m
                                 JOIN q ON q.place = m.is_in)
                  SELECT place, is_in FROM q;
     place     |     is_in
---------------+---------------
 U.S.A         |
 California    | U.S.A
 Florida       | U.S.A
 Massachusetts | U.S.A
 San Francisco | California
 Los Angeles   | California
 San Diego     | California
 Miami         | Florida
 Orlando       | Florida
 Tampa         | Florida
 California    | San Francisco
(11 rows)

I hope you find this blog helpful!

PostgreSQL Studio 1.2 Released

I am happy to announce that version 1.2 of PostgreSQL Studio, an open source web interface for PostgreSQL, was released on February 23, 2014. This release contains some new functionality, mostly around improving SQL Worksheet, as well as some bug fixes.

The changes in this release are:

Features

  • Added Open and Save buttons to SQL Worksheet
  • Moved the close button on SQL Worksheet to the upper right corner
  • Improved the look and feel of pop-up windows
  • Auto select the first object in a schema
  • Added a limit drop down to SQL Worksheet

Bugs

  • Removed extra new line characters when displaying a function
  • Added a click handler to reselect an already selected row
  • Added a scroll panel for the Data tab for wide tables

The new version is available from http://www.postgresqlstudio.org/download/

tPostgres Features Sybase & SQL Server Compatibility

We are pleased to announce version 9.3.2 of tPostgres. Find out more at http://www.tpostgres.org

The transformation of the tPostgres: Sybase & SQL Server Compatibility project is now complete. We are fully based on the PGTSQL open source PostgreSQL extension and are fully working on the very latest and greatest point release of PG (9.3.2).

Additionally, we have done away with the bulky and awkward 32-bit Linux virtual machine. We now have safe and simple-to-use native command line sandbox installers for Windows, OS X & Linux.

We’ve still got documentation work to do, but there are excellent regression-test Transact-SQL programming examples in the PGTSQL code. Please take a look for yourself and consider contributing to the project at https://bitbucket.org/openscg/pgtsql

–Luss


BigSQL Adds Cassandra to the Hadoop and PostgreSQL Bundle

The BigSQL developers’ bundle has added a significant new component: Cassandra 2.0.4. All of the bundles and VMs now allow developers to get up and running with the latest Hadoop ecosystem and Cassandra. Join our free Beta program now!

As a significant number of our clients are leveraging Cassandra for highly available, multi-partition solutions, it was a natural fit to add to our bundle.

In addition to Cassandra, the latest beta includes all of the standard BigSQL components and many enhancements:

  • Continued HadoopFDW development (Postgres – Hadoop Foreign Data Wrapper) fixes and improvements
  • JDK 7 is now a pre-req on all platforms
  • upgrade PostgreSQL from 9.3.1 to 9.3.2
  • upgrade DataFu from 1.1.1 to 1.2.0
  • upgrade PostgresqlStudio from 1.0 to 1.1
  • upgrade Hue from 3.0.0 to 3.5.
  • autostart Oozie & Hue in the VM
  • upgrade Sqoop from 1.4.4 to 1.99.3 (and autostart it)
  • include pgTSQL functionality from tpostgres.org
  • compact our quickstart virtual machine from 3 GB to 1.5 GB
  • configure Eclipse for Hadoop in the VM

At OpenSCG we are focused on helping you get the most value and leverage out of your data environments. It’s an exciting time in data, and we thrive on helping our clients understand the latest changes in data technology. We realize that there are great opportunities in new NoSQL and clustered data solutions, and we take pride in helping you evaluate your data options.

Sqoop2 in BigSQL Hue

This tutorial will show you how to import PostgreSQL data into the Hadoop File System (HDFS) using the Sqoop 2 application within Hue. Hue, the Hadoop User Experience, allows you to use Oozie, Hive, Pig, Postgres, Sqoop, HBase, ZooKeeper, and Hadoop from a browser. Since release three of the BigSQL Quick-start Virtual Machine comes with Hue already installed, I will be using it for this tutorial.

In the virtual machine, open a terminal window, go to the bigsql-9.3r2-b3 directory and start BigSQL:

$ cd bigsql-9.3r2-b3
$ ./bigsql.sh start

Next, set up the benchmarksql.customer history table:

$ . ./setENV.sh
$ psql

postgres=# CREATE USER benchmarksql WITH SUPERUSER PASSWORD 'password';
postgres=# GRANT ALL PRIVILEGES ON DATABASE postgres TO benchmarksql;
postgres=# \q

$ cd examples/run/
$ ./runSQL.sh props.pg sqlTableCreates
$ ./runLoader.sh props.pg numWarehouses 1
$ ./runSQL.sh props.pg sqlIndexCreates
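
At this point you can sanity-check the load from psql; assuming the tables landed in the benchmarksql schema (which the import step below expects), something like:

$ psql -c "SELECT count(*) FROM benchmarksql.history;"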

Now, open Hue by going to localhost:8000/ in the virtual machine’s web browser. Log in with the username bigsql and the password bigsql.

In the Data Browsers menu, select Sqoop Transfer. Then click to add a new job.

Next, you will need to create a new connection to your PostgreSQL database. Click on +Add a new connection.

Name the connection, make sure generic-jdbc-connector is selected, and define the class as org.postgresql.Driver. For this example we’re connecting to the postgres database as the benchmarksql user. Enter jdbc:postgresql://localhost:5432/postgres as the connection string, benchmarksql as the username, password as the password, and then click Save.

Now you can name the job, select the Import job type, select the connection that you just created, and then click Next.

The next page is where you will fill out the information about where you are importing from. For this tutorial, specify that the schema name is Benchmarksql, that the table name is History, and then click Next.

Next, describe where you want Sqoop to move the data to. Choose HDFS as the storage type, Text File as the output format, no compression, and /user/data/example/hive_customer_history as the output directory. Then, click Save and run.

Once the job starts running, a notification will pop up next to the Job Browser. There you will be able to see the status of the mappers and reducers, along with other information about the job.

Once the job has finished, you will be able to go to the File Browser and see the data in HDFS.
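
You can verify the same thing from the command line as well, assuming the output directory chosen above:

$ hadoop fs -ls /user/data/example/hive_customer_history
$ hadoop fs -cat /user/data/example/hive_customer_history/* | head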