Latency: That pesky little thing

Recently, I was helping a client improve the performance of their ETL processes. They were loading several million rows and updating several million more, and the whole process was taking more than 7 hours. Jumping in, it was apparent they hadn’t spared any expense on hardware: with 256GB of RAM and a high-end SAN, their server had plenty of horsepower, yet they couldn’t push more than 1,000 rows/second. I took a close look at their config file, and while there were some changes worth making, nothing in it was unreasonable. I went ahead and adjusted the settings most relevant to their workload, like shared_buffers and checkpoint_segments, but it had no effect at all.

Watching the server while the load was running, I was surprised to see it was mostly idle. There was a little disk activity and a bit of CPU usage, but nothing that would indicate a problem. I went back and looked at the server loading the data and saw similar results: a little activity, but nothing that would pinpoint the issue. That’s when I figured I’d ping the PostgreSQL server from the ETL server, and there it was: 1ms of latency. That’s something I would expect between availability zones on Amazon, but not inside an enterprise data center. It turned out they had a firewall protecting their database servers, and the latency it introduced was adding 4-5 hours to their load times. A mere 1ms really adds up when you need to make millions of network round trips.
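
To get a feel for how quickly 1ms adds up, here is a rough back-of-envelope sketch in Python. The round-trip count is purely illustrative (assuming one network round trip per inserted or updated row), not the client's actual numbers:

    # Back-of-envelope estimate of how per-round-trip latency inflates an ETL run.
    # The row count below is hypothetical, chosen only to illustrate the scale.

    def added_latency_hours(round_trips: int, latency_ms: float) -> float:
        """Total time spent just waiting on the network, in hours."""
        return round_trips * latency_ms / 1000 / 3600

    # One INSERT or UPDATE per row means one network round trip per row.
    rows = 15_000_000  # hypothetical: several million inserts plus several million updates
    print(f"{added_latency_hours(rows, 1.0):.1f} hours lost at 1.0 ms per round trip")
    print(f"{added_latency_hours(rows, 0.1):.1f} hours lost at 0.1 ms per round trip")

At 1ms per round trip, 15 million round trips is over 4 hours of pure waiting; drop the latency to 0.1ms and the same chatter costs well under an hour. The other lever, of course, is making the load less chatty in the first place, for example by batching many rows per statement or using COPY, so the number of round trips no longer scales with the number of rows.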