I’m working at a public transit company, and we need to print schedules to post at our stops. The previous system was broken: it took three days to generate all the PDFs, and it was hard to modify.
I was tasked with building a new system for printing the schedule of each bus line passing at a given stop. The new setup generates about 30 PDFs per second, a significant improvement over the previous system.
The new system is open source and can be checked out here.
I investigated a few options. The top requirement was definitely being able to modify the templates easily; the second was that it had to be faster than the current system.
react-pdf was the first option I tried, but it required some react-pdf-specific styling that people would have to learn. It was similar enough to HTML/CSS, but when benchmarked against our second option, Puppeteer, it was not significantly faster. We ended up choosing Puppeteer, since it let us use the same stack as our other applications, Next.js, which meant less for the team to learn.
The setup
The setup was simple: we used our existing GTFS processing pipeline (GTFS is a common format for public transit trips) to process the data, our existing API server to serve the processed data, and Next.js to query the API and render the pages, which a Puppeteer Node.js script would then consume to generate PDFs.
More specifically, we import the GTFS feed into Postgres, parse it into a more usable format, and place it in Redis. Our API server, built with Fastify, just queries Redis and serves the data. Since our data changes infrequently, this setup allows us to serve it quickly.
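As a rough illustration, the serving layer can be as small as a single route. This is a hypothetical sketch, not our actual code: the route path, the `stop:<id>` key scheme, and the use of ioredis are all assumptions.

```typescript
// Hypothetical sketch of a Fastify route serving pre-processed
// schedule data straight out of Redis.
import Fastify from "fastify";
import Redis from "ioredis";

const app = Fastify();
const redis = new Redis(); // assumes Redis on localhost:6379

// Assumed key scheme: processed schedules stored as JSON under "stop:<id>"
app.get("/stops/:id/schedule", async (request, reply) => {
  const { id } = request.params as { id: string };
  const cached = await redis.get(`stop:${id}`);
  if (!cached) {
    return reply.code(404).send({ error: "stop not found" });
  }
  // The value is already-serialized JSON, so it goes out untouched.
  reply.header("content-type", "application/json");
  return cached;
});

app.listen({ port: 3001 });
```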
Considerations
There isn’t really a lot of technical detail here; I just want to share some specific performance tips for this setup.
- Make sure you use SSR; do not render the page client side. Puppeteer is your main bottleneck, and it will be even slower if it also has to wait for client-side JavaScript to load and run.
- Use tabs in Puppeteer! This is a big performance win: you reuse the same Chromium instance but still render in parallel, reducing overhead (see the worker sketch after the launcher script below).
- Use waitUntil: 'load' in Puppeteer. With SSR this is faster, as your page is already fully rendered when it reaches the client.
- Use multiple Node.js processes! This seems pointless, since all the time should be spent inside the Chromium instance rather than Node, but that is not the case: Puppeteer I/O actually takes a lot of time inside Node.js when the PDFs are passed back to it. On my machine two processes were enough; with a single process the main Node thread barely waited (~10% wait time), meaning it would struggle to keep up with Puppeteer pumping out PDFs. I made a small script for spawning a given number of processes with a given number of tabs each; the Node.js script then splits the work, doing only its own share, and splits it again between tabs, maximizing speed (a sketch of the worker side follows the launcher script).
```bash
#!/bin/bash
pids=()
processes=${PROCESSES:-1}
tabs=${TABS:-10}

# Spawn one Node worker per process, each handling its own slice of the work
for ((i = 0; i < processes; i++)); do
  TABS=$tabs PROCESS=$i node build/index.js &
  pids+=($!)
done

# Kill all workers when the launcher is interrupted
kill_processes() {
  for pid in "${pids[@]}"; do
    kill "$pid" 2>/dev/null
  done
}
trap kill_processes INT

wait
```
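For completeness, here is a minimal sketch of what the worker entry point might look like under this scheme. The endpoint URLs, the interleaved splitting, and the output paths are all assumptions; only the PROCESS/TABS environment variables match the launcher above (PROCESSES is assumed to be exported by the caller, e.g. `PROCESSES=2 ./run.sh`).

```typescript
// Hypothetical worker (build/index.ts): each process takes every Nth
// stop, then fans its share out across TABS pages in one shared browser.
import puppeteer from "puppeteer";
import { writeFile } from "node:fs/promises";

const PROCESSES = Number(process.env.PROCESSES ?? 1);
const PROCESS = Number(process.env.PROCESS ?? 0);
const TABS = Number(process.env.TABS ?? 10);

async function main() {
  // Hypothetical endpoint listing every stop ID to render.
  const res = await fetch("http://localhost:3001/stops");
  const allStops: string[] = await res.json();

  // Interleaved split: this process only keeps every PROCESSES-th stop.
  const myStops = allStops.filter((_, i) => i % PROCESSES === PROCESS);

  // One Chromium instance per process, shared by all tabs.
  const browser = await puppeteer.launch();

  await Promise.all(
    Array.from({ length: TABS }, async (_, tab) => {
      const page = await browser.newPage();
      for (const stop of myStops.filter((_, i) => i % TABS === tab)) {
        // SSR means the HTML arrives fully rendered, so 'load' is enough.
        await page.goto(`http://localhost:3000/stops/${stop}`, { waitUntil: "load" });
        await writeFile(`out/${stop}.pdf`, await page.pdf({ format: "a4" }));
      }
      await page.close();
    })
  );

  await browser.close();
}

main();
```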
- Restart the Puppeteer tabs after a few PDF generations; they don’t seem to garbage collect automatically if you have multiple Puppeteer processes open. To do this, close the tab and open a new one (a sketch follows below).
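A rough sketch of that recycling loop, as a drop-in replacement for the per-tab loop in the worker sketch above (the threshold of 50 is made up; tune it for your workload):

```typescript
import type { Browser } from "puppeteer";
import { writeFile } from "node:fs/promises";

const PDFS_PER_TAB = 50; // made-up threshold; tune for your workload

// Same rendering as before, but the tab is closed and reopened
// every PDFS_PER_TAB documents so its memory gets released.
async function renderWithRecycling(browser: Browser, stops: string[]) {
  let page = await browser.newPage();
  let rendered = 0;
  for (const stop of stops) {
    await page.goto(`http://localhost:3000/stops/${stop}`, { waitUntil: "load" });
    await writeFile(`out/${stop}.pdf`, await page.pdf({ format: "a4" }));
    if (++rendered % PDFS_PER_TAB === 0) {
      await page.close(); // drop the old tab and its memory
      page = await browser.newPage();
    }
  }
  await page.close();
}
```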
These are the main non-obvious tips I have. If you have more ideas or questions, feel free to ask me on Twitter!