Outstanding testers (that I have had the chance to work with or coach) did not just “report that there was a fire”; they were skilled at investigating and communicating:
- How long has the fire been burning?
- What is the scale of the impact?
- Which areas are affected, and which are not?
- What is the nature of the impact?
- When did it start?
- When did we last check?
- What could have caused it?
- What could we do better next time to help answer the above questions (when the next fire hits)?
For Exploratory Testing, one of the key challenges in testing an unfamiliar (and complex) system is working out where to look for the source of an error, for debugging and root cause analysis purposes.
From my experience of testing multi-technology integrated systems, I have put together a set of generic heuristics that I use to investigate and look for information that helps with debugging and contributes towards articulating the root cause of end-user errors.
1. “Top-down” heuristics
By top-down (in this context) I mean debugging the application stack of the system component where the symptom has cropped up.
The intention here is to ask questions that ascertain whether or not the root cause lies in this vertical slice of the architecture. If it does not, we can move on to the second set of heuristics (i.e. the integration of the current system component with the other components of the solution architecture).
- Symptom repeatability – Are you able to reproduce the error consistently from the UX? Which browser + platform combination is the best bet for reproducing the symptom?
- API traffic for the stack – Which underlying API endpoints are called by the stack’s UX when the error happens? Are those endpoints responding? What do the browser developer tools (or alternate methods) tell you about the request payload and response when the error happens? Invoke the API endpoint directly (with exactly the same request payload) and compare the response with the one received via the UX. Are there any errors logged in the developer tools console? Are those errors related, and how do you know?
- DB transactions within the stack – Which tables is the API supposed to write to? Which fields? Are those tables/fields being correctly populated? Are your DB schema definitions up to date and correct? If a stored procedure is supposed to be called, is it actually called, and how do you know? Do you log API/database errors in the database? If yes, have any errors been logged when the UX error happens? If not, you should advocate persistent logging of errors for debuggability with your Product Owner.
- Last working version of the stack – What was the last working version of the stack, i.e. the last version that did not have this error? Revert the stack to that version; can you still reproduce the error? If not, hold a peer review of the changes made since then. Have you got automated checks that tell you the status of every version between the working and non-working ones? By reviewing those checks, or by manually changing one variable at a time, can you pinpoint the version of the stack in which this error started?
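The API-traffic check above — invoking the endpoint directly with the same payload and comparing with what the UX received — boils down to diffing two responses. A minimal sketch: the field names and payloads below are invented for illustration; in practice one response would be captured in the browser developer tools and the other obtained by calling the endpoint directly.

```python
def diff_responses(ux_response: dict, direct_response: dict) -> dict:
    """Compare the response the UX received with the response from invoking
    the API endpoint directly. Returns the keys whose values differ,
    including keys present on only one side."""
    all_keys = ux_response.keys() | direct_response.keys()
    return {
        key: (ux_response.get(key, "<missing>"), direct_response.get(key, "<missing>"))
        for key in all_keys
        if ux_response.get(key) != direct_response.get(key)
    }

# Hypothetical captures: same endpoint, same request payload.
ux = {"status": "error", "balance": None, "currency": "GBP"}
direct = {"status": "ok", "balance": 120.50, "currency": "GBP"}
print(diff_responses(ux, direct))  # the differing fields tell you where to dig next
```

An empty diff suggests the problem is upstream of the API (rendering, client-side logic); a non-empty one points at something between the UX and the endpoint, such as a gateway or cached response.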
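The DB-transaction questions can likewise be turned into a quick script: query the record under investigation and report the fields the API should have written but did not. The `orders` table and its columns are hypothetical; the sketch uses SQLite, but the same query shape works against any SQL store.

```python
import sqlite3

def unpopulated_fields(conn, table, fields, key_field, key_value):
    """Return the fields the API should have written but that are NULL,
    or all of them if the row was never written at all.
    (String-built SQL is fine for a debugging sketch, not for untrusted input.)"""
    cols = ", ".join(fields)
    row = conn.execute(
        f"SELECT {cols} FROM {table} WHERE {key_field} = ?", (key_value,)
    ).fetchone()
    if row is None:
        return list(fields)  # the API never wrote the row at all
    return [field for field, value in zip(fields, row) if value is None]

# Hypothetical schema: an orders table the checkout API writes to.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL, shipped_at TEXT)")
conn.execute("INSERT INTO orders VALUES (42, 19.99, NULL)")
print(unpopulated_fields(conn, "orders", ["total", "shipped_at"], "id", 42))
```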
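The last-working-version check is essentially a bisection: given a repeatable check (automated or manual), you can pinpoint the first bad version in O(log n) attempts rather than re-testing every build in between. In this sketch `is_bad` is a stand-in for "deploy this version and rerun the reproduction steps"; here it is faked with a hard-coded first bad build.

```python
def first_bad_version(versions, is_bad):
    """Binary-search for the first version in which the error reproduces.
    Assumes versions are ordered oldest-to-newest, at least one is bad,
    and once the error appears it stays present in later versions."""
    lo, hi = 0, len(versions) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid          # error already present: look earlier
        else:
            lo = mid + 1      # still working: look later
    return versions[lo]

# Stand-in for "deploy this version and rerun the repro steps".
builds = ["1.4.0", "1.4.1", "1.4.2", "1.4.3", "1.4.4"]
is_bad = lambda v: builds.index(v) >= 3   # pretend 1.4.3 introduced the error
print(first_bad_version(builds, is_bad))  # -> 1.4.3
```

This is the same idea `git bisect` automates at commit granularity; the version of it sketched here works at whatever granularity your deployable builds give you.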
2. “End-to-end” architecture heuristics
OK, running our top-down (through the stack) debugging checks did not yield success; now we need to inspect the integration points and the other system components that your application interacts with.
- Data flow and events across integration points – Do you have (or can you draw) a solution architecture diagram to confirm which other system components your application stack deals with? When the error happens, can you confirm what data and events your application expects from the system components it is integrated with? Is your application receiving the data it expects? Is the data in the right format? When the error happens, can you confirm which data is being written to which other system components? Is that actually happening, and how do you know? Is there logging evidence to confirm the answers? If not, you should advocate persistent logging of errors for debuggability with your Product Owner.
- Last working version of the architecture – Do you know the last working versions (i.e. those not displaying this error) of all the integration points and system components? Can the whole architecture be rolled back to a working version? Have you got automated checks that tell you the status of all the system components between the working and non-working copies of the architecture? By reviewing those checks, or manually, can you pinpoint in which version of (or by which change to) a system component or integration point this error started?
- Completeness of the architecture – Is the architecture complete, i.e. are all the system components and integration points responding? Is there logging to confirm (or negate) that no system component is missing and no integration point is disabled? If not, have a discussion with your solution architect about how this could be improved to aid debuggability.
- Non-functional/timing activities across the architecture – When the error happens, are there any resource-intensive (CPU, memory, disk I/O) processes running, or being kicked off, in other parts of the architecture? How can you monitor resources across the components and integration points? How do you know whether those resource-intensive processes have completed or are stuck? Where do you look for evidence of failure of those processes/tasks? Are there any timeouts involved, i.e. is any system component waiting on another for a response and not getting it? Is there logging to this effect? If not, you know what to do 😉
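The completeness heuristic — are all the system components present and responding? — lends itself to a scripted sweep over the solution architecture diagram. A minimal sketch: the component names are invented, and the `probe` here is a stand-in for a real call to each component's health endpoint with a short timeout.

```python
def missing_components(components, probe):
    """Sweep every component named in the architecture map and report the
    ones that do not respond, so a missing or disabled piece surfaces
    before you spend hours debugging a single stack."""
    return [name for name in components if not probe(name)]

# Hypothetical solution architecture map.
components = ["web-frontend", "orders-api", "payments-api", "message-queue"]

# Fake probe for the sketch; swap in a real health-endpoint check.
responding = {"web-frontend", "orders-api", "message-queue"}
probe = lambda name: name in responding
print(missing_components(components, probe))  # -> ['payments-api']
```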
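The timeout question — is one component silently waiting on another? — is easiest to answer when every cross-component call runs under an explicit deadline and leaves evidence when that deadline fires. A minimal sketch, assuming a Python caller; `slow_dependency` is a made-up stand-in for a stuck downstream component.

```python
import concurrent.futures
import time

def call_with_deadline(fn, timeout_s, log):
    """Run a cross-component call under an explicit deadline. On timeout,
    record the evidence you would otherwise have to hunt for in the logs."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            log.append(f"timeout after {timeout_s}s waiting on {fn.__name__}")
            return None

def slow_dependency():
    time.sleep(0.5)   # stand-in for a downstream component that is stuck
    return "response"

evidence = []
print(call_with_deadline(slow_dependency, 0.1, evidence))  # -> None
print(evidence)  # the persistent evidence this heuristic asks for
```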