The devil is in the details – how to find it

Have you ever been in a position where, after trying to identify a bug, the best outcome you could offer was ‘it does not reproduce’, ‘it works on my machine’, ‘it does not make sense’, or ‘we’ll check once more if it happens again’?

Every developer has, at some point, had to figure out why the application they were working on did not behave as expected. Regardless of the nature of the application, there are a few steps that you should follow.

Where to start?

Make sure you fully understand the issue. If it is a new topic for you, take your time to familiarize yourself with the prerequisites and the underlying conditions that could trigger the behavior. If you are blind to them, your chances of success are already slim. Do not jump headfirst into the code. Most common bugs do come from development mistakes, but if you dive in and cannot find the source, you might be tempted to give up out of haste or frustration. And if the issue was serious enough to open a root cause investigation, it is probably a deeper concern than a simple coding slip.

Collect information

I remember we once had an issue with some bulk inserts failing due to a schema change. It was frustrating because it required a lot of manual work to fix corrupted data. We tried adding extra logging and isolating the operation inside a high-priority transaction. Still, nothing seemed to work or give us any insight into what was happening, and, of course, the issue could not be reproduced locally.

The cause turned out to be our maintenance plan. It backed up the database’s transaction logs and, every hour, disabled and re-enabled CDC (change data capture) on the database. In doing so, it locked the system login table to recreate its user, resulting in a schema change. We only managed to link the events together after we performed another investigation into some deadlocks. Still, we learned an important lesson: try to find as much information as you can about the circumstances of the incident, even if it seems insignificant. Potential information might include steps to reproduce, logs, time and date, values of resources, database entries and flags, activity inside or outside of the application, and, of course, the code.

Always keep your eyes open to any activity during the event, be it a user or another running process; the interleaving of threads is common and difficult to diagnose.

Classify your information

Now that you have a stack of information about everything that took place, you must filter it; some pieces might not be part of your puzzle. Do not try to formulate a hypothesis just yet. While evaluating records, sort your information into five groups: circumstantial, documentary, demonstrative, unreliable, and real (objective) evidence.

Circumstantial: an account of events that might have triggered a behavior, such as the steps to reproduce an issue. This type should not be taken at face value, as it relies on individual experience (a user says they clicked a button and the application crashed; was it the button?). However, do not disregard it! (A user clicked a button that started a process that deadlocked somewhere, and the application crashed.)

Documentary: information that is recorded in some way: logs, traces, activity monitoring. It provides better insight into the flow of events, from which you might deduce other triggered actions.

Demonstrative: evidence that helps support the context of other evidence, such as graphs of CPU or memory usage (you might suspect an out-of-memory exception).

Unreliable: information that can change over time, such as a database entry updated between the event and the time of the investigation. One of the traps you can fall into is basing your theory on unreliable information; it will most likely send your quest on a detour.

Real: an indisputable piece of evidence that points to the source of the problem, such as a line of code or a misconfiguration.
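
As a rough illustration, the categories above can be attached to each piece of evidence so that the pile can be sorted before you start reasoning about it. This is only a sketch: the `Category` and `Evidence` names, the sample entries, and especially the numeric ranking (which category counts as more trustworthy than which) are assumptions made for the example, not a fixed rule.

```python
from dataclasses import dataclass
from enum import IntEnum


class Category(IntEnum):
    """Evidence categories; a lower value means more trustworthy (assumed ranking)."""
    REAL = 1
    DOCUMENTARY = 2
    DEMONSTRATIVE = 3
    CIRCUMSTANTIAL = 4
    UNRELIABLE = 5


@dataclass(frozen=True)
class Evidence:
    description: str
    category: Category


def by_reliability(items):
    """Return evidence sorted from most to least reliable.

    sorted() is stable, so items in the same category keep their input order.
    """
    return sorted(items, key=lambda e: e.category)


# Invented sample data, for illustration only.
evidence = [
    Evidence("User says the app crashed after clicking a button", Category.CIRCUMSTANTIAL),
    Evidence("Database flag read two days after the incident", Category.UNRELIABLE),
    Evidence("Application log shows a deadlock", Category.DOCUMENTARY),
    Evidence("Memory graph climbs steadily before the crash", Category.DEMONSTRATIVE),
]

for item in by_reliability(evidence):
    print(f"{item.category.name:<15} {item.description}")
```

Keeping the category next to the description makes it harder to build a theory on an entry you already marked as unreliable.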

Piece it together

Your information is now grouped by reliability. Work through it from top to bottom, from the most reliable group to the least, note every scenario that might fit, and examine each one in turn.

Envision the events as individual elements and make a note of each one; then, you will gain a better understanding of their dependency when you combine them.

Create a timeline of the incidents that led to a specific behavior. Place the proof that supports an event below it to create a picture of what happened.
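
A timeline like this can be assembled with very little code once each event carries a timestamp. The sketch below is one possible shape, not a prescribed tool: the `Event` class, the `build_timeline` and `render` helpers, and all of the sample events and timestamps are invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Event:
    when: datetime
    source: str          # where the event was observed (log, user report, monitor)
    what: str            # short description of the event
    proof: list = field(default_factory=list)  # evidence that supports the event


def build_timeline(*sources):
    """Merge events from several sources into one list ordered by time."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda e: e.when)


def render(timeline):
    """Format the timeline with each event's supporting proof listed below it."""
    lines = []
    for event in timeline:
        lines.append(f"{event.when:%H:%M:%S} [{event.source}] {event.what}")
        for item in event.proof:
            lines.append(f"    proof: {item}")
    return "\n".join(lines)


# Invented sample events, for illustration only.
app_events = [
    Event(datetime(2024, 1, 1, 14, 0, 5), "app log", "bulk insert failed",
          ["error entry in application log"]),
]
job_events = [
    Event(datetime(2024, 1, 1, 14, 0, 0), "scheduler", "hourly maintenance job started",
          ["job history entry"]),
    Event(datetime(2024, 1, 1, 14, 0, 9), "scheduler", "hourly maintenance job finished"),
]

timeline = build_timeline(app_events, job_events)
print(render(timeline))
```

Merging sources that are normally read separately is the point of the exercise: an event that looks harmless in its own log can stand out once it sits next to the failure.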

If the issue occurs repeatedly, reconstruct the other occurrences as well to see if you can identify a pattern across them.

Try to reproduce it

Once you have all the circumstances of the issue, you can try to reproduce it. Make sure you replicate the environment as accurately as possible. If you are working with a batch of data, do not take the shortcut of investigating only the one element that appears to cause trouble; the issue might lie within the group, not the individual.

Follow the code line by line, even if you have an idea of what is happening. Sometimes default settings might trick you.

If you still do not have a clear vision, do not be afraid to introduce more logging and revisit the investigation after you have additional information.
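
What extra logging looks like depends on the application, but the general idea is to record enough context (time, thread, which item, what failed) to rebuild the sequence of events afterwards. Here is a minimal sketch using Python's standard `logging` module. The logger name, the `insert_row` stand-in, and the rule that a row without an `id` fails are all assumptions made for the example.

```python
import logging

# Timestamp and thread name help when events from several sources are compared later.
LOG_FORMAT = "%(asctime)s %(levelname)s [%(threadName)s] %(name)s: %(message)s"

logger = logging.getLogger("investigation.bulk_insert")  # hypothetical logger name


def insert_row(row):
    """Stand-in for the real operation; fails on rows without an 'id' (assumption)."""
    if "id" not in row:
        raise ValueError("missing id")


def insert_batch(batch_id, rows):
    """Insert rows one by one, logging enough context to reconstruct the run later."""
    logger.info("batch %s started, %d rows", batch_id, len(rows))
    failed = []
    for index, row in enumerate(rows):
        try:
            insert_row(row)
        except Exception:
            # logger.exception records the stack trace along with the message.
            logger.exception("batch %s row %d failed: %r", batch_id, index, row)
            failed.append(index)
    logger.info("batch %s finished, %d failed", batch_id, len(failed))
    return failed


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format=LOG_FORMAT)
    insert_batch("demo-1", [{"id": 1}, {"name": "no id"}, {"id": 3}])
```

Logging the batch identifier and the row index on every line means a failure can later be matched against other activity that happened at the same moment.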

It’s ok to ask for help.

If you have struggled for days to figure it out to no avail, try to reach out to your colleagues; sometimes, a fresh pair of eyes or a more experienced one can help.

Collaboration is key to solving problems!

Anti-patterns for investigations

Geronimo: try to keep your steps and facts organized. Once you find a piece of the puzzle that you believe is the smoking gun, do not put all your bets on it; keep your line of thought and investigate all the way. Otherwise, you might jump to the wrong conclusion.

Mad chicken: do not run around pulling information from everywhere, even if it is unrelated, just because you lack solid proof; your conclusions might go entirely off track.

Peacock: use your logic; do not formulate a theory before you even begin, based only on your experience or hearsay. Otherwise, you might see only the evidence that supports your opinion and disregard other clear leads.

Sloth: even if it takes more time, do not assume anything. Verify everything with your own eyes; otherwise, you might walk right past the culprit.

Passer: own your investigation. Do not hand it to someone else just because you could not figure it out. It is not supposed to be easy; if it were, it would have already been solved.

Denier: do not take for granted that a specific piece of functionality works; every component is a potential suspect, even a third-party one.

But he just knew!

Have you ever met someone who could smell the smoke and tell you where it was coming from? With experience comes instinct; by completing many such investigations, you learn to filter the information better and go straight to the source.

Yes, performing all the steps described can be time-consuming, but as you exercise this thinking pattern, you will become proficient, and it will become second nature. With enough practice, you will start doing all of this in your head. Until then, be thorough and relentless.
