Fast Pandas

I was recently optimizing a forest growth model. The model is implemented in Python and relies to some degree on pandas data frames.

It also relies on some numerical computations using NumPy, including an iterative optimization algorithm.

When the time came to make the model fast enough to scale to our operational data, the iterative optimization was the prime suspect, but I held off on rushing in. Premature optimization is the root of all evil, as someone once said.

After figuring out cProfile and KCachegrind/SnakeViz (material for a future post), it was apparent that the pandas append operations were the bottleneck. Here's where I admit embarrassment: I should have known better, having worked with R for many years, where I was constantly vigilant about copying data too much. Shame on me for letting that slip, and even suggesting to my colleague to append >_<.
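As a rough idea of what that profiling step can look like, here is a minimal sketch using only the standard library. The function `grow_stand` is a made-up stand-in for one model step, not the real forestry code, and the report shape will differ for a real model.

```python
import cProfile
import io
import pstats


def grow_stand(n_steps=1000):
    # Hypothetical stand-in for a model step; not the real forestry code.
    total = 0.0
    for i in range(n_steps):
        total += i ** 0.5
    return total


# Profile a single call and keep its return value.
profiler = cProfile.Profile()
result = profiler.runcall(grow_stand)

# Sort by cumulative time and write the top entries to a string buffer.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(5)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time puts the calls that dominate the run at the top of the report, which is how a hot spot like repeated appends would show up.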

To demonstrate, compare the following two operations, which are in principle similar to what was happening in a few places in the forestry growth model:

Method 1, using append

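import pandas as pd

d = pd.DataFrame(columns=['A'])
for i in range(1000):
    # append returns a new data frame (copying the old one), so reassign it
    # note: DataFrame.append was removed in pandas 2.0
    d = d.append({'A': i}, ignore_index=True)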

1 loop, best of 3: 1.64 s per loop

Method 2, pre-allocating the data frame

d = pd.DataFrame(columns=['A'], index=range(1000))
for i in range(1000):
    d.loc[i,'A'] = i

1 loop, best of 3: 202 ms per loop

The outcomes are equivalent, but the pre-allocation approach is about 8x faster (1.64 s versus 202 ms)! Not bad for such a small change.
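Another common pattern, not timed here, is to collect the rows in a plain Python list and build the data frame once at the end. The sketch below assumes each row is a small dict like in the examples above; it avoids `DataFrame.append`, so it also runs on pandas 2.0 and later.

```python
import pandas as pd

# Collect plain Python dicts first; appending to a list does not copy
# the rows that are already in it.
rows = []
for i in range(1000):
    rows.append({'A': i})

# Build the data frame once, at the end.
d = pd.DataFrame(rows, columns=['A'])
```

The loop never touches a data frame, so no frame is copied on each iteration. Whether this beats pre-allocation for a given workload is something to measure rather than assume.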

I am looking for a larger performance improvement, so this will be just one of many optimization iterations. But it's a good start, and it has an empirical foundation.

I'm interested to hear what other pandas optimizations are out there. Of course there is a pandas page about performance. I'm curious what others have experienced.

Do you use pandas? Have you ever run into performance issues with pandas operations? Tweet at me (see social links on left)!
