When optimizing the performance of a program, it’s essential to test and measure what the bottlenecks are. Programmers are bad at guessing what part of a program will be the slowest. Trying to guess is likely to lead to sacrificing code readability for uncertain gains, or even losses in performance (Code Complete, 2nd Edition page 594).
In my case, I have a starry sky generator that I wanted to improve the performance of. The goal: to allow people to generate bigger images in a reasonable amount of time. So, how can we find out where improvements need to be made?
Enter Py-Spy
Py-Spy is a tool for profiling a Python program. When I was first optimizing the performance of my program, I used PyFlame, but that project is no longer maintained. Py-Spy does everything PyFlame did and more. Another nice bonus is that it isn’t limited to Linux. On top of all that, it’s also installable through pip, so it seems to be a big win!
To install it, just run pip install py-spy.
Py-Spy has a number of commands and options we can use to customize the output. For one, we can attach it to an already-running process (for example, a production web server that’s having issues we want to diagnose) using -p PID. This method will probably require you to run it as root (sudo) so it can access the memory of the other process. The method I will be using is to pass py-spy the command to start the Python program itself, which will look like py-spy [command] -- python stars.py.
Speaking of commands, there are three available: record, top, and dump. Top is an interesting one: it looks like the unix top command, but instead shows data about which functions your program is spending the most time in. Dump just prints out the current call stack, and can only be used by attaching to an already-running process. This is useful if your program is getting hung up somewhere and you want to find out where.
For our purposes, though, the record command is the most useful. It comes with various options.
Record’s Options
$ py-spy record --help
py-spy-record
Records stack trace information to a flamegraph, speedscope or raw file

USAGE:
    py-spy record [OPTIONS] --output <filename> --pid <pid> [python_program]...

OPTIONS:
    -p, --pid <pid>              PID of a running python program to spy on
    -o, --output <filename>      Output filename
    -f, --format <format>        Output file format [default: flamegraph]  [possible values: flamegraph, raw, speedscope]
    -d, --duration <duration>    The number of seconds to sample for [default: unlimited]
    -r, --rate <rate>            The number of samples to collect per second [default: 100]
    -s, --subprocesses           Profile subprocesses of the original process
    -F, --function               Aggregate samples by function name instead of by line number
    -g, --gil                    Only include traces that are holding on to the GIL
    -t, --threads                Show thread ids in the output
    -i, --idle                   Include stack traces for idle threads
    -n, --native                 Collect stack traces from native extensions written in Cython, C or C++
        --nonblocking            Don't pause the python process when collecting samples. Setting this option
                                 will reduce the perfomance impact of sampling, but may lead to inaccurate results
    -h, --help                   Prints help information
    -V, --version                Prints version information

ARGS:
    <python_program>...    commandline of a python program to run
There are a few options that are particularly noteworthy here. --format lets us pick between flamegraph, raw, or speedscope. We’ll be using flamegraph, but speedscope is interesting, too. You can examine a speedscope file using the webapp.
--function will group the output by function, rather than by line number. Both have pros and cons. Grouping by function is helpful to get an easier-to-understand overview, while grouping by line number can help you narrow it down further.
Finally, --rate tells py-spy how many times per second to sample the program. The default is 100, but I’ve found that adjusting this either up or down can help, especially if there are a lot of small, quick functions (or lines) that add up. It doesn’t hurt to play around with this and compare the resulting flamegraphs to see which seem the most useful.
Now, we can generate a flamegraph of the starry sky generator. I’ll be running py-spy record -o profile.svg --function -- python stars.py on this commit, modified to generate one image with the dimensions (2000, 2000).
Reading a Flamegraph
Here it is!
When running it yourself, you’ll get an SVG file that, when opened in your browser, will be bigger, easier to read, and include some nice JavaScript features, like clicking a block to zoom in on it and searching. For the moment, take some time to explore the above SVG: hover over a block to get the full text and percentage.
The graph is read from top to bottom. So, in this case, all of the time was spent in the module stars.py. Underneath that is the call to generate_sky, which also basically takes up all of the time. From there, things get more interesting. A portion of the time is taken up just by generate_sky (the part that doesn’t have any blocks beneath it), most of it is taken up by generate_sky_pixel, and some is used by putpixel.
Note that this isn’t grouped by time, but by function. These functions are called one-by-one, so if it were grouped by time, it would be a tiny block for generate_sky_pixel, then a tiny block for putpixel, and so on several thousand times.
Since it’s grouped by function, we can more easily compare overall how much time is spent in a particular function versus another. At a glance, we can see that much more time is spent generating the pixel vs. putting it into the image.
A lot of time in generate_sky_pixel isn’t taken up by a sub-function, but a fairly significant amount is still used by cast and others.
Let’s get a new graph, but grouped by line number instead of function: py-spy record -o profile.svg -- python stars.py
There’s a lot more information in this graph. For example, it calls attention to the fact that cast is called in two different lines in generate_sky_pixel. The time spent in generate_star_pixel is pretty evenly distributed between lines 87, 88, and 89, which makes sense, because those are the three lines that call planck.
There’s one more piece of information that will be useful: the total time it takes to generate an image. The flamegraph tells us what percentage of time each function/line takes, but it isn’t meant to measure the total run time. I created a performance.py file which uses timeit to generate ten images with the dimensions (900, 900) and return the average number of seconds it took per image. In this case, it took 7.64 seconds. We can definitely do better.
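For reference, the timing harness can be as simple as the sketch below. It assumes stars.py exposes a generate_sky(width, height) function that returns the finished image; the real script’s signature may differ, so treat the names here as placeholders.

# performance.py (minimal sketch)
import timeit

import stars  # assumed to provide generate_sky(width, height)

RUNS = 10
SIZE = (900, 900)

# Time ten full image generations and report the average seconds per image.
total = timeit.timeit(lambda: stars.generate_sky(*SIZE), number=RUNS)
print(f"Average seconds per image: {total / RUNS:.2f}")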
Tuning the Performance
Now that we have the information from these two graphs, as well as the run time, we can get to work making it faster. Looking again at the first flamegraph above, it seems putpixel uses up a total of ~17% of the run time. The docs specifically warn that putpixel is relatively slow, so it seems like this should be a pretty easy win.
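For context, the hot loop originally looked roughly like the sketch below. The per-pixel function is stubbed out, and the real generate_sky_pixel takes different arguments and does the actual star and color math; the point is just the one-putpixel-call-per-pixel structure.

from PIL import Image

def generate_sky_pixel(x, y, width, height):
    # Stand-in for the real per-pixel logic, which returns an (r, g, b) tuple.
    return (x % 256, y % 256, 128)

def generate_sky(width, height):
    # Original approach: one putpixel call for every pixel in the image.
    image = Image.new("RGB", (width, height))
    for y in range(height):
        for x in range(width):
            image.putpixel((x, y), generate_sky_pixel(x, y, width, height))
    return image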
I experimented with several methods, including storing the data as a list of lists of tuples, then converting to a numpy array, then feeding that to Image.fromarray, with a result of 7.1 seconds, about a 7% savings. As you might imagine, this still wasn’t very good.
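That experiment looked something like this (reusing the stubbed generate_sky_pixel from the sketch above; the function name is just for illustration):

import numpy as np
from PIL import Image

def generate_sky_fromarray(width, height):
    # Build a list of rows of (r, g, b) tuples, then hand the whole thing to
    # NumPy and Pillow in one go instead of calling putpixel per pixel.
    rows = [
        [generate_sky_pixel(x, y, width, height) for x in range(width)]
        for y in range(height)
    ]
    array = np.array(rows, dtype=np.uint8)  # shape (height, width, 3)
    return Image.fromarray(array, mode="RGB")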
The natural progression from there seemed to be to skip the lists altogether and start with a numpy array, filled with zeroes initially. For some reason, this was actually slower than putpixel: 7.89 seconds. I’m not a NumPy expert, so I’m not sure why this is. Perhaps mutating is a slow operation for NumPy, or maybe I was just doing it the wrong way. If someone takes a look and wants to let me know, I’d be happy to learn about this.
After that, I tried building up a bytearray, extending it by the three channel values each time a pixel was generated, then converting that to bytes and passing it to Image.frombytes(). Total run time: 6.44 seconds. That’s about a 17% savings over putpixel.
Here’s what our flamegraph looks like now that we’ve settled on a putpixel replacement (and after splitting the bytearray.extend onto its own line, so that it will show up separately):
Line 154 (bytearray.extend(pixels)) now only took up about 6% of the time. Even on a small image of 900 by 900 pixels, this resulted in a savings of over a second per image. For bigger images, this savings is in the range of several seconds.
Everything else in the program is directly related to image generation and the math and random number generation behind that, so assuming all of that is already optimal (spoiler alert: it isn’t; the cast() function was entirely unnecessary), this is about as fast as the program can get.
Conclusion
Flamegraphs and the profilers that generate them are useful tools for understanding the performance of a program. Using them, you can avoid trying to guess where bottlenecks are and potentially doing a lot of work for little gain.
For further reading, I recommend this article about the reasoning behind the creation of flamegraphs and the problem they were trying to solve. If you’re struggling to understand how to read the graph, it may help more than my explanation.
Now, go forth and conquer your performance problems!