When optimizing the performance of a program, it’s essential to test and measure what the bottlenecks are. Programmers are bad at guessing what part of a program will be the slowest. Trying to guess is likely to lead to sacrificing code readability for uncertain gains, or even losses in performance (Code Complete, 2nd Edition page 594).
In my case, I have a starry sky generator that I wanted to improve the performance of. The goal: to allow people to generate bigger images in a reasonable amount of time. So, how can we find out where improvements need to be made?
Enter Py-Spy
Py-Spy is a tool for profiling a Python program. When I was first optimizing the performance of my program, I used PyFlame, but that project is no longer maintained. Py-Spy does everything PyFlame did and more. Another nice bonus is that it isn’t limited to Linux. On top of all that, it’s also installable through pip, so it seems to be a big win!
To install it, just run pip install py-spy.
Py-Spy has a number of commands and options we can use to customize the output. For one, we can attach it to an already-running process (for example, a production web server that’s having issues we want to diagnose) using -p PID. This method will probably require you to run it as root (sudo) so it can access the memory of the other process. The method I will be using is to pass py-spy the command to start the Python program itself, which will look like py-spy [command] -- python stars.py.
Speaking of commands, there are three available: record, top, and dump. Top is an interesting one: it looks like the unix top command, but instead shows data about which functions your program is spending the most time in. Dump just prints out the current call stack, and can only be used by attaching to an already-running process. This is useful if your program is getting hung up somewhere and you want to find out where.
For our purposes, though, the record command is the most useful. It comes with various options.
Record’s Options
$ py-spy record --help
py-spy-record
Records stack trace information to a flamegraph, speedscope or raw file

USAGE:
    py-spy record [OPTIONS] --output <filename> --pid <pid> [python_program]...

OPTIONS:
    -p, --pid <pid>              PID of a running python program to spy on
    -o, --output <filename>      Output filename
    -f, --format <format>        Output file format [default: flamegraph]  [possible values: flamegraph, raw, speedscope]
    -d, --duration <duration>    The number of seconds to sample for [default: unlimited]
    -r, --rate <rate>            The number of samples to collect per second [default: 100]
    -s, --subprocesses           Profile subprocesses of the original process
    -F, --function               Aggregate samples by function name instead of by line number
    -g, --gil                    Only include traces that are holding on to the GIL
    -t, --threads                Show thread ids in the output
    -i, --idle                   Include stack traces for idle threads
    -n, --native                 Collect stack traces from native extensions written in Cython, C or C++
        --nonblocking            Don't pause the python process when collecting samples. Setting this option
                                 will reduce the perfomance impact of sampling, but may lead to inaccurate results
    -h, --help                   Prints help information
    -V, --version                Prints version information

ARGS:
    <python_program>...    commandline of a python program to run
There are a few options that are particularly noteworthy here. --format lets us pick between flamegraph, raw, or speedscope. We’ll be using flamegraph, but speedscope is interesting, too. You can examine a speedscope file using the webapp.
--function will group the output by function, rather than by line number. Both have pros and cons. Grouping by function is helpful to get an easier-to-understand overview, while grouping by line number can help you narrow it down further.
Finally, --rate tells py-spy how many times per second to sample the program. The default is 100, but I’ve found that adjusting this either up or down can help, especially if there are a lot of small, quick functions (or lines) that add up. It doesn’t hurt to play around with this and compare the resulting flamegraphs to see which seem the most useful.
Now, we can generate a flamegraph of the starry sky generator. I’ll be running py-spy record -o profile.svg --function -- python stars.py on this commit, modified to generate one image with the dimensions (2000, 2000).
Reading a Flamegraph
Here it is!
When running it yourself, you’ll get an SVG file that, when opened in your browser, will be bigger, easier to read, and include some nice JavaScript features, like clicking a block to zoom in on it and searching. For the moment, take some time to explore the above SVG: hover over a block to get the full text and percentage.
The graph is read from top to bottom. So, in this case, all of the time was spent in the module stars.py. Underneath that is the call to generate_sky, which also basically takes up all of the time. From there, things get more interesting. A portion of the time is taken up just by generate_sky (the part that doesn’t have any blocks beneath it), most of it is taken up by generate_sky_pixel, and some is used by putpixel.
Note that this isn’t grouped by time, but by function. These functions are called one-by-one, so if it were grouped by time, it would be a tiny block for generate_sky_pixel, then a tiny block for putpixel, and so on several thousand times.
Since it’s grouped by function, we can more easily compare overall how much time is spent in a particular function versus another. At a glance, we can see that much more time is spent generating the pixel vs. putting it into the image.
A lot of time in generate_sky_pixel isn’t taken up by a sub-function, but a fairly significant amount is still used by cast and others.
Let’s get a new graph, but grouped by line number instead of function: py-spy record -o profile.svg -- python stars.py
There’s a lot more information in this graph. For example, it calls attention to the fact that cast is called in two different lines in generate_sky_pixel. The time spent in generate_star_pixel is pretty evenly distributed between lines 87, 88, and 89, which makes sense, because those are the three lines that call planck.
There’s one more piece of information that will be useful: the total time it takes to generate an image. The flamegraph tells us what percentage of time each function/line takes, but it isn’t meant to measure the total run time. I created a performance.py file which uses timeit to generate ten images with the dimensions (900, 900) and return the average number of seconds it took per image. In this case, it took 7.64 seconds. We can definitely do better.
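For reference, the timing harness can be as simple as the sketch below. It assumes stars.py exposes a generate_sky(width, height) function that returns the finished image; the real script’s signature may differ, so treat the names here as placeholders.

# performance.py (minimal sketch)
import timeit

import stars  # assumed to provide generate_sky(width, height)

RUNS = 10
SIZE = (900, 900)

# Time ten full image generations and report the average seconds per image.
total = timeit.timeit(lambda: stars.generate_sky(*SIZE), number=RUNS)
print(f"Average seconds per image: {total / RUNS:.2f}")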
Tuning the Performance
Now that we have the information from these two graphs, as well as the run time, we can get to work making it faster. Looking again at the first flamegraph above, it seems putpixel uses up a total of ~17% of the run time. The docs specifically warn that putpixel is relatively slow, so it seems like this should be a pretty easy win.
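For context, the hot loop originally looked roughly like the sketch below. The per-pixel function is stubbed out, and the real generate_sky_pixel takes different arguments and does the actual star and color math; the point is just the one-putpixel-call-per-pixel structure.

from PIL import Image

def generate_sky_pixel(x, y, width, height):
    # Stand-in for the real per-pixel logic, which returns an (r, g, b) tuple.
    return (x % 256, y % 256, 128)

def generate_sky(width, height):
    # Original approach: one putpixel call for every pixel in the image.
    image = Image.new("RGB", (width, height))
    for y in range(height):
        for x in range(width):
            image.putpixel((x, y), generate_sky_pixel(x, y, width, height))
    return image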
I experimented with several methods, including storing the data as a list of lists of tuples, then converting to a numpy array, then feeding that to Image.fromarray, with a result of 7.1 seconds, about a 7% savings. As you might imagine, this still wasn’t very good.
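That experiment looked something like this (reusing the stubbed generate_sky_pixel from the sketch above; the function name is just for illustration):

import numpy as np
from PIL import Image

def generate_sky_fromarray(width, height):
    # Build a list of rows of (r, g, b) tuples, then hand the whole thing to
    # NumPy and Pillow in one go instead of calling putpixel per pixel.
    rows = [
        [generate_sky_pixel(x, y, width, height) for x in range(width)]
        for y in range(height)
    ]
    array = np.array(rows, dtype=np.uint8)  # shape (height, width, 3)
    return Image.fromarray(array, mode="RGB")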
The natural progression from there seemed to be to skip the lists altogether and start with a numpy array, filled with zeroes initially. For some reason, this was actually slower than putpixel: 7.89 seconds. I’m not a NumPy expert, so I’m not sure why this is. Perhaps mutating is a slow operation for NumPy, or maybe I was just doing it the wrong way. If someone takes a look and wants to let me know, I’d be happy to learn about this.
After that, I tried building up a bytearray, extending it by the three channel values each time a pixel was generated, then converting that to bytes and passing it to Image.frombytes(). Total run time: 6.44 seconds. That’s about a 17% savings over putpixel.
Here’s what our flamegraph looks like now that we’ve settled on a putpixel replacement (and after splitting the bytearray.extend onto its own line, so that it will show up separately):
Line 154 (bytearray.extend(pixels)) now only took up about 6% of the time. Even on a small image of 900 by 900 pixels, this resulted in a savings of over a second per image. For bigger images, this savings is in the range of several seconds.
Everything else in the program is directly related to image generation and the math and random number generation behind that, so assuming all of that is already optimal (spoiler alert: it isn’t; the cast() function was entirely unnecessary), this is about as fast as the program can get.
Conclusion
Flamegraphs and the profilers that generate them are useful tools for understanding the performance of a program. Using them, you can avoid trying to guess where bottlenecks are and potentially doing a lot of work for little gain.
For further reading, I recommend this article about the reasoning behind the creation of flamegraphs and the problem they were trying to solve. If you’re struggling to understand how to read the graph, it may help more than my explanation.
Now, go forth and conquer your performance problems!