With a default configuration of PostgreSQL, the information in the PostgreSQL logs is generally pretty small with some warnings and the occasional typo in psql. However, once you start tweaking the configuration parameters a bit and you start logging more things like query times, the logs get a lot bigger. Having PostgreSQL generate over 10GB of logs per day is not unheard of if you set log_min_duration_statement low enough on a busy server. You can analyze logs of that size with pgBadger pretty easily, but to get the real value, you want to look at the logs of the course of time. Things like how does the query load look compared to the same time last week, last month or last year. If you save the daily reports from pgBadger, you can look at them over time, but that’s a manaul process. Another option is to put your logs into Hadoop and use MapReduce to analyze the information over time.

The Hadoop project has a service that is built for exactly that purpose called Flume. It lets you setup a Flume agent that will listen for the logs and then push them into Hadoop. Setting everything up is pretty simple.

First, you’ll want to setup PostgreSQL to send the logs to syslog. In postgresql.conf, you’ll set:

log_destination = 'syslog'

Then you want to setup your syslog server to forward messages to Flume. Depending on your syslog server, you’ll setup a rule like this in rsyslog.conf

*.* @

Finally, you’ll need to start Flume where it will listen for syslog messages and push the messages into an HFDS sink.

# pglogs.conf: A single-node Flume configuration
# Name the components on this agent
pglogs.sources = sl
pglogs.sinks = k1
pglogs.channels = c1
# Describe/configure the source
pglogs.sources.sl.type = syslogudp
pglogs.sources.sl.port = 5140
pglogs.sources.sl.host =
# Describe the sink
pglogs.sinks.k1.type = hdfs
# Use a channel which buffers events in memory
pglogs.channels.c1.type = memory
pglogs.channels.c1.capacity = 1000
pglogs.channels.c1.transactionCapacity = 100
#Describe the HDFS Sink 
pglogs.sinks.k1.hdfs.path = /user/postgres/logs
pglogs.sinks.k1.hdfs.filePrefix = pg
pglogs.sinks.k1.hdfs.fileSuffix = .log
pglogs.sinks.k1.hdfs.fileType = DataStream
# Bind the source and sink to the channel
pglogs.sources.sl.channels = c1
pglogs.sinks.k1.channel = c1