FTP I/O from PostgreSQL

PostgreSQL applications at times need to communicate with external servers using FTP. Oracle has the UTL_TCP package that is very useful for PL/SQL applications. PostgreSQL doesn’t have an equivalent package, but it does have something really powerful … the ability to write functions in various scripting languages. PostgreSQL has four languages that are part of the core distribution: PL/pgSQL, PL/Tcl, PL/Perl, & PL/Python. The example that I give below uses PL/Python to write to an FTP location.

Consider a scenario where you want to run a few PL/pgSQL routines to create daily data files and then place them on an FTP server for your customer. Theoretically, this is possible using external shell scripts, but you could very easily run into synchronization issues between your database’s functions and the scripts in your OS. A very easy way out is to write a small PL/Python function like the one below that gets called once your daily routines are done within the database.

    CREATE OR REPLACE FUNCTION put_file_on_ftp (file_name TEXT)
        from ftplib import FTP
        ftp = FTP('ftp.testing.com')
        ftp.login('user_name', 'password')
        ftp.storbinary('STOR my_file.bin', open(file_name)
    COST 100;

Let me explain the body of the above function line by line.

from ftplib import FTP

This line imports the FTP class from ftplib module. FTP class implements the client side of FTP protocol, which can be used in the subsequent steps of your function.

ftp = FTP('ftp.testing.com')

This returns an instance of the FTP class connected to the host listed as parameter.

ftp.login('user_name', 'password')

The instance returned by the FTP class above is being used here to actually login to the server. In case no user name and password are provided, the function would assume an anonymous login. Note that most FTP commands are allowed only after the user has successfully logged in.

Obviously you shouldn’t hard code your username and password in a function, but I am doing it here just to simplify the example.


This line switches to the target directory specified as a parameter.

ftp.storbinary('STOR my_file.bin', open(file_name)

This function stores the file specified in file_name (including complete path on the source server) on the target server as my_file.bin. The file transfer is in binary format. The first argument should always be a valid STOR command.

For more commands that you can use for FTP using PL/Python, please refer to the link: https://docs.python.org/2/library/ftplib.html.

Hope this was helpful!


Hierarchical Queries using PostgreSQL

One of the major hurdles faced when converting Oracle’s PL/SQL code to any other database is to find the equivalent of START WITH and CONNECT BY PRIOR. This SQL in Oracle provides a very powerful way of writing hierarchical queries, something that comes in really handy when dealing with hierarchical data – this is fairly common in any enterprise.

Well, there is good news. There is an equivalent available in PostgreSQL … and it is not too hard!

Basically, in order to write the equivalent query in PostgreSQL, we use the WITH RECURSIVE syntax, essentially referring the query to its own output and hence recursively iterating through the result-set.

Let me explain with the help of a simple example.

Consider a simple table ‘places’ that lists out the name of a place under column ‘place’ and where it is located under the column ‘is_in’.

example=# \d places
   Table "public.places"
 Column | Type | Modifiers
  place | text |
  is_in | text |

Consider the following data in this table:

example=# select * from places;
     place      |     is_in
 U.S.A          |
 California     | U.S.A
 Florida        | U.S.A
 Massachusetts  | U.S.A
 San Francisco  | California
 Los Angeles    | California
 San Diego      | California
 Miami          | Florida
 Orlando        | Florida
 Tampa          | Florida
 United Kingdom |
 London         | United Kingdom
 Leeds          | United Kingdom
(13 rows)

As you can see, the data is arranged in a hierarchy represented by the chart below:


An example resultset someone would want to retrieve from such data is ‘give me all the places that are in U.S.A.’. In Oracle, the query would like the following:

 SELECT op.places, op.is_in
   FROM places op
  START WITH op.is_in = 'U.S.A'
CONNECT BY PRIOR op.place = op.is_in

The PostgreSQL equivalent would be:

                       FROM places
                      WHERE place = 'U.S.A'
                      UNION ALL
                     SELECT m.place, m.is_in
                       FROM places m
                       JOIN q ON q.place = m.is_in)
        SELECT place, is_in FROM q

This gives the following result:

     place     |   is_in
 U.S.A         |
 California    | U.S.A
 Florida       | U.S.A
 Massachusetts | U.S.A
 San Francisco | California
 Los Angeles   | California
 San Diego     | California
 Miami         | Florida
 Orlando       | Florida
 Tampa         | Florida
(10 rows)

Let’s break this query down to understand what’s going on.

The general structure of a WITH RECURSIVE query is:

a)    The non-recursive term

b)    UNION or UNION ALL, depending on whether or not you want to have duplicates

c)    Recursive term containing a reference to the queries own output

Query evaluation is done as follows (taken from PostgreSQL docs):

  1. Evaluate the non-recursive term. For UNION (but not UNION ALL), discard duplicate rows. Include all remaining rows in the result of the recursive query, and also place them in a temporary working table.
  2. So long as the working table is not empty, repeat these steps:
    1. Evaluate the recursive term, substituting the current contents of the working table for the recursive self-reference. For UNION (but not UNION ALL), discard duplicate rows and rows that duplicate any previous result row. Include all remaining rows in the result of the recursive query, and also place them in a temporary intermediate table.
    2. Replace the contents of the working table with the contents of the intermediate table, then empty the intermediate table.

An important thing to ensure while writing hierarchical queries like this is the end state, i.e. making sure that eventually, the recursive portion of the query will not return any rows. Without this end state, the query will go into an infinite loop. To illustrate, let’s suppose in the above example, we have another rows in the ‘places’ table that says ‘California’ is in ‘San Francisco’.

example=# select * from places;
      place     |     is_in
 U.S.A          |
 California     | U.S.A
 Florida        | U.S.A
 Massachusetts  | U.S.A
 San Francisco  | California
 Los Angeles    | California
 San Diego      | California
 Miami          | Florida
 Orlando        | Florida
 Tampa          | Florida
 United Kingdom |
 London         | United Kingdom
 Leeds          | United Kingdom
 California     | San Francisco

As you can see, the data essentially says that ‘San Francisco’ is in ‘California’ and ‘California’ is in ‘San Francisco’. This data will throw our query above in an infinite loop. The simplest way to avoid such a situation is to use UNION instead of UNION ALL to join the recursive and non-recursive portions of the hierarchical query. With that change, the results are as follows:

example=# WITH RECURSIVE q AS (SELECT place, is_in
                                 FROM places
                                WHERE place = 'U.S.A'
                               SELECT m.place, m.is_in
                                 FROM places m
                                 JOIN q ON q.place = m.is_in)
                  SELECT place, is_in FROM q;
     place     |     is_in
 U.S.A         |
 California    | U.S.A
 Florida       | U.S.A
 Massachusetts | U.S.A
 San Francisco | California
 Los Angeles   | California
 San Diego     | California
 Miami         | Florida
 Orlando       | Florida
 Tampa         | Florida
 California    | San Francisco
(11 rows)

I hope you find this blog helpful!

Eclipse for Perl Developers

I’ve spent a lot of time in Perl using Eclipse lately. Personally, I prefer to use vim for most of my development. Lately though, I’ve been spending a lot of time in some complex java with many different references to many different classes; I’ll be honest, I am enjoying the index and code-completion. I have always heard rumbles about perl in eclipse, and figured that I would check it out. My personal favorite turned out to be: http://www.epic-ide.org/ . It actually has completion and a few other handy features that ( although I never really wanted them ) are nice to haves. Combine that with vrapper ( http://vrapper.sourceforge.net/home/ ) to give me nice vim keyboard functionality in eclipse and eclipse color theme ( http://marketplace.eclipse.org/content/eclipse-color-theme ) to give me light text on a dark background and I’ve got a great dev environment for my two biggest used languages.

Matching Server and Client Constraints

Postgres supports all the SQL-standard constraint types, e.g. UNIQUE, CHECK. These work fine at the server level:

CREATE TABLE ledger (id SERIAL, balance NUMERIC(10,2) CHECK (balance > 0));
ERROR:  new row for relation "ledger" violates check constraint "ledger_balance_check"
DETAIL:  Failing row contains (1, -2.00).

Continue Reading »

Dynamic Columns

SQL has never been good at handling dynamic columns — it can easily process tables with a fixed number of columns, but dynamic column handling has always been a challenge. (Dynamic columns are often used to implement user-defined fields.) One approach has been to define columns for every possible value that might be stored in a row; this leads to sparse tables containing mostly NULL values. Because Postgres stores NULL values in a single bit, there is little overhead to this approach. However, it is hard to call these dynamic columns since every new column requires an ALTER TABLE ... ADD COLUMN command.

Another approach is to use entity/attribute/value layout, where each value is stored in its own row. This obviously increases storage requirements, has performance challenges, and adds complexity to SQL queries.

Fortunately, Postgres has various options to efficiently handle dynamic columns. The simplest solution is arrays which allow multiple values of the same data type to be stored in a single field. This assumes that the values are independent, or that their ordinal position in the array is sufficiently meaningful.

Continue Reading »

Partial Indexes

Having covered expression indexes, I want to cover another index feature — partial indexes. Partial indexes allow you to create an index on part of a table. Why would you want to do that?

  • index only frequently-accessed values, allowing smaller index files
  • avoid indexing common values, which are unlikely to be useful in an index
  • smaller index files allow faster index traversal and improve index caching
  • avoid index insertion overhead for non-indexed values
  • constrain uniqueness to a subset of table rows

People usually don't use expression or partial indexes initially, but there are certainly some use-cases that can only be solved with these index features.

View or Post Comments

A Java Based Command Line PortChecker

The below code snippet is a lovely little piece of Java magic the we use in various installation and configuration scenarios to verify if a port is available or busy

import java.io.*;
import java.net.*;

public class PortChecker {

    public static void main(String[] args) {

	int port=0;

	if (args.length == 1) {
    	  try {
	        port = Integer.parseInt(args[0]);
	    } catch (NumberFormatException e) {
	        System.err.println("INVALID ARGUMENT: must be an integer.");
      	} else {
	    System.err.println("A single integer argument is required");

	try {
	    ServerSocket srv = new ServerSocket(port);
	    srv = null;
	} catch (IOException e) {

Expression Indexes

Having covered virtual columns recently, I wanted to cover one additional type of stored value, expression indexes. Unlike virtual columns, which are computed on each access, expression indexes are stored in index files, allowing rapid access. Let's look at some examples, building on the customer table and fullname function created in my previous blog entry:

CREATE INDEX i_customer_lastname ON customer (lastname);
CREATE INDEX i_customer_concat ON customer ((firstname || ' ' || lastname));
CREATE INDEX i_customer_fullname ON customer (fullname(customer));

Continue Reading »

Setting JAVA_HOME for JDK6 on Linux and OSX

Here is a useful little code snippet I use as part of BigSQL.

if [ "x$JAVA_HOME" == "x" ]; then
  if [ -d "$macosx_j6" ]; then
    export JAVA_HOME="$macosx_j6"
  elif [ -d "$ubuntu_j6" ]; then
    export JAVA_HOME="$ubuntu_j6"
  elif [ -d "$redhat_j6" ]; then
    export JAVA_HOME="$redhat_j6"
    echo "WARNING: Cannot figure out JAVA_HOME."
    exit 3

All Operators Call Functions

Operators are usually easily distinguished from functions because they use punctuation, e.g. -, *, /, rather than words with letters and numbers. There are two popular types of operators, unary and binary. Unary operators take one argument, e.g. -3 (minus), !false (not). Binary operators take two arguments, one on each side of the operator, e.g. 2 + 3 (addition), 'a' || 'b' (concatenation). (C-based languages often also support the ternary operator ?:.)

Perhaps surprisingly, in Postgres comparisons are also operators, e.g. a = 3 (equality), to_tsvector('hello bye') @@ 'hello' (full text search). Comparison operators return true, false, or NULL, and their return values are processed by surrounding keywords or clauses, e.g. WHERE, AND. For example, to process WHERE col1 = 10 + 1 AND col2 > 5, 10 + 1 is processed with +, col1 = 11 is then processed with =, col2 > 5 is processed with >, the results processed by AND, and its return value is passed to the query processor via WHERE. Just as you can type \dfS to see all defined functions in psql, and \dTS to see all defined data types, \doS shows all defined operators.

You might think of operators as distinct from functions, but internally operator behavior is bound directly to functions. For example, factorial contains both operator and function definitions. The factorial operator, !, is actually the only Postgres-supplied left-associative unary operator:

Continue Reading »