“It does not work, and I do not know why! Now it works and I still do not know why!”
This is a nasty joke circulating in the software development community. You are stuck with some code that does not work and you do not know why. Suddenly, you give it another go and, like magic, it works. It is a good one! If you are a student, this might be the exact point where you should stop and submit the code to your professor in time. If this is your job, though, you are not even close to delivering your work. It is only done when it works and you know why.
I am pointing fingers at no one but myself right now. A few months ago, we took over a new web application project at LOGIC, where we had to fix a few bugs in an existing, obsolete code base. While I am usually not the one writing code in the company, I took this one over personally, since the rest of the team was super busy with other clients and Vaulty, and I also found it interesting.
One of the bugs we had to fix concerned exporting a particular screen of the web application as a PDF, using Puppeteer under the hood. After we instrumented and configured Puppeteer appropriately, we got it working in our environment, so we sent the code to the client for review, and… it just did not work. We then deployed the app on a new instance with the client's cloud provider, and indeed it did not work. We got no error messages or exceptions on either the client or the server. All we got was a random screen exported as the PDF and an innocent-looking client-side console log of a timeout, initiated from a WebSocket message. This is where things started getting really interesting. It was time to start forming assumptions about what might be going wrong and putting them to the test.
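For context, a bare-bones Puppeteer PDF export looks roughly like the sketch below. This is a generic illustration rather than the project's code; the URL and output path are placeholders.

```typescript
import puppeteer from "puppeteer";

// Generic sketch of a Puppeteer-based PDF export; the URL and output path are
// placeholders, not the project's actual values.
async function exportScreenAsPdf(): Promise<void> {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Navigate to the screen to export and wait until network activity settles.
    await page.goto("https://app.example.com/screen-to-export", {
      waitUntil: "networkidle0",
    });
    await page.pdf({ path: "screen.pdf", format: "A4", printBackground: true });
  } finally {
    await browser.close();
  }
}

exportScreenAsPdf().catch((err) => {
  console.error("PDF export failed:", err);
  process.exit(1);
});
```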
The first assumption was that Puppeteer was consuming too many resources. So, we bumped up the instance from 2 GB to 8 GB of RAM, and… it worked! That was an easy one; we did not even need to write a single line of code. We let the client know, they bumped their instance to 8 GB as well, and… it still did not work. They got the same random screen exported as a PDF. I have to admit that this was quite demotivating.
The next assumption was that this was caused by a concurrency issue. Exporting took too much time, which, in conjunction with other HTTP requests, blocked the server threads so that the screen to be exported could not render in time. We increased the concurrency configuration dramatically, with no results.
This is where we abandoned any hopes for low-hanging fruit and started getting our hands dirty. We started examining and diagnosing, line by line, the PDF export script, which was forked and executed as a separate process. Suddenly, we noticed that there was no actual error handling in this script. It caught errors, did nothing, and moved forward. We changed this behavior to log errors and stop execution, and… PDF exporting now failed with the error “Authentication times out”. Alright! That’s progress.
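To illustrate the failure mode, here is a simplified, hypothetical sketch of such a script, not the project's actual code; the URLs, selectors, and step list are made up. The original catch block did nothing, so the final `page.pdf()` call still ran and captured whatever screen happened to be showing; logging the cause and stopping is essentially what our change amounted to.

```typescript
import type { Page } from "puppeteer";

// Hypothetical, simplified sketch of an export script like the one described,
// not the project's actual code. The original version swallowed every error in
// the catch block; the version here logs the cause and stops.
async function exportReport(page: Page): Promise<void> {
  const steps = [
    () => page.goto("https://app.example.com/login"),             // placeholder URL
    () => page.type("#username", process.env.APP_USER ?? ""),     // placeholder selectors
    () => page.type("#password", process.env.APP_PASSWORD ?? ""),
    () => page.click("#login-button"),
    () => page.goto("https://app.example.com/report"),
  ];

  for (const step of steps) {
    try {
      await step();
    } catch (err) {
      // Original behavior: do nothing here, so page.pdf() below still ran and
      // exported whatever screen happened to be visible.
      console.error("Export step failed:", err);
      throw err;
    }
  }

  await page.pdf({ path: "report.pdf", format: "A4" });
}
```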
Then, it was time to find out why authentication failed. First, we checked the networking conditions between the deployed containers. Everything was set up appropriately: Puppeteer could communicate with the web application server, and the web application server could communicate with the database. Next, we went on to diagnose the whole authentication strategy, reading the actual code of the obsolete version of the open-source library used for it, and… there it was! Authentication was based on bcrypt and took 7 to 10 seconds each time, while the authentication request in Puppeteer was timing out at 5 seconds.
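To put those numbers in perspective: bcrypt's cost (rounds) parameter doubles the hashing work with every increment, so a high enough cost on a small instance easily pushes a single authentication past a 5-second budget. Here is a rough timing sketch; the cost values are purely illustrative, not the ones the library actually used.

```typescript
import bcrypt from "bcrypt";

// Rough timing sketch: each additional round doubles the hashing work.
// The cost values below are illustrative, not the project's actual setting.
async function timeBcrypt(cost: number): Promise<void> {
  const start = Date.now();
  const hash = await bcrypt.hash("correct horse battery staple", cost);
  await bcrypt.compare("correct horse battery staple", hash);
  console.log(`cost=${cost}: ${Date.now() - start} ms`);
}

(async () => {
  // On a small cloud instance, costs in the mid-teens can take several seconds
  // per authentication, which blows straight through a 5-second timeout.
  for (const cost of [10, 12, 14, 15]) {
    await timeBcrypt(cost);
  }
})();
```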
At last, we knew everything. The web application was not working for multiple reasons. The PDF export was displaying a random screen as a result of performing actions that assumed the previous steps had succeeded, while they had actually all failed silently. These steps all failed because the initial step, bcrypt authentication, timed out. All we had to do was increase the timeout threshold in the client, and… it just worked. We sent the fix over to the client, they tried it, they confirmed the fix, and we wrapped up the project happily.
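The change itself was of the kind sketched below: a hard-coded client-side timeout raised above the slow step's worst case. This is a generic sketch; the endpoint, the 15-second value, and the use of fetch with an AbortController are assumptions, not the project's actual code.

```typescript
// Generic sketch of the kind of fix: raise a hard-coded client-side timeout
// above the slow step's worst case. Endpoint and values are assumptions.
const AUTH_TIMEOUT_MS = 15_000; // was 5_000, well below bcrypt's observed 7-10 s

async function authenticate(username: string, password: string): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), AUTH_TIMEOUT_MS);
  try {
    return await fetch("https://app.example.com/api/login", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ username, password }),
      signal: controller.signal,
    });
  } finally {
    clearTimeout(timer);
  }
}
```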
Of course, increasing an HTTP request timeout threshold quite often causes more problems than it solves. It was as far as we could intervene, though, and it got the job done.
This is the point where we delivered: when we knew exactly why things had not worked before and why they worked afterward, after digging deeper and deeper until we diagnosed the issue and fixed it. There is no middle ground. It only works if you know why.