Dishonest Performance Metrics

Dishonesty in metrics occurs when Team metrics have ambiguous goals, are used to achieve ends other than those they were introduced for, are applied arbitrarily, or cannot be traced to business outcomes or personnel growth.

Dishonest metrics arouse frustration, instil fear or, even worse, perpetuate dishonesty in staff (e.g. when people start gaming the metrics to position themselves for professional gain).

Good Leaders work hard to ensure psychological safety and transparency during the introduction, collection and application of metrics.

They usually have (or can facilitate) compelling answers to the following…

  1. Why are we using this metric?
  2. What are we using this metric to inform? (x, y, z…)
  3. What are we using this metric to NOT inform? (x, y, z…)
  4. Who will view & use this data? Who will not?
  5. How will we measure the effectiveness of the metric itself?
  6. How will we measure the ineffectiveness of this metric?
  7. What is the business impact if we don’t have this metric?
  8. How will we look for better metrics? (5, 6 and 7 apply again here)
  9. Can I propose a Team metric? (5, 6 and 7 apply here again)
  10. Which metrics are common across Teams or across the Org structure? Which ones are unique? Why?

Looking for adaptability while hiring Project Leaders

“Agile/Lean” ways of working are fast becoming the norm in enterprises (especially in software), but you still see an emphasis on a “control, predict and plan linearly” style of project management.

Project management mechanics are important, and there is no denying that delivering in sprints is not a guise for not planning at all. However, I have observed many senior managers still biasing towards hiring Project Leaders who exude the highest level of comfort and predictability regarding time, cost and quality.

This mindset undervalues, and hence under-incentivizes, the “adaptability” aspect of project management (irrespective of whether you are delivering in increments or as a waterfall).

Effective Project Management in uncertain operational environments is as much about gathering empirical evidence post-incident, adapting to change and constantly reprioritising scope as it is about predicting the trajectory of a project, mitigating risk & exercising control.

While hiring Project Managers, look during interviews for thought leadership on adapting to uncertainty & change (in addition to depth in facilitation, process mechanics and framing predictability). For example:

“How do you estimate features (of a size & scale) that have never been delivered by your Team before?”

“What will be your Project Delivery strategy for a mandated rewrite of your product’s database schema?”

“As Project Leader, how will you deal with a spate of critical bugs found immediately after a major release?”

“How do you re-plan your project when your dependent peer Team is blocked for weeks?”

“How do you deal with a key integrated sub-system suddenly not being supported by your vendor anymore?”

“A newly hired Lead Engineer on your project turned out to be a brilliant jerk and their peers are leaving your project; how will you resolve this situation?”

Root patterns of organisational silos

The single most influential factor that dictates whether organisations succeed at their goals is the ability of their business units/Teams to work together in alignment towards those goals.

I have learnt that behaviours that bring misalignment, e.g. siloed behaviours (defensiveness, backstabbing, closed attitudes towards new ideas, ambiguity on accountability), are symptoms of something systemic occurring in the Teams, the Business Unit and/or the firm that has allowed that silo to take root, establish itself and sustain.

As an independent consultant, I am often put on assignments to effect cultural change (advocating shared ownership of Quality, introducing new Testing approaches or tools, performing a Test practice assessment), wherein I often encounter the above-mentioned siloed behaviours.

My approach towards chipping away at these silos is to observe patterns of systemic issues and frame them as themes of root causes that need to be addressed to initiate that cultural change.

Here are some of the themes I have encountered so far that have helped me understand the root causes of siloed behaviours in organisations.

1. Silos due to a lack of psychological safety

Manifest as…

My resources will be poached if anyone outside our unit gets a whiff of the initiative

I invariably get attacked if I approach them with new ideas on how they could improve their Team processes

What will it mean for my job’s prominence if we collaborate with that Team?

Will I be punished if I move out of my lane?

2. Silos due to following the path of least resistance

Manifest as…

This is the only way we can get anything done in this company

My last boss said this is OK, she will handle the consequences!

We are only responsible for these areas of the stack; unhappy customers are not my problem

3. Silos due to a lack of coaching

Manifest as…

This is the way we have always run this Team!

I do not know of another way to do it

We are always busy and there is no time to reflect and improve

Categorising siloed behaviours into these themes helps me contextualise, and train my mind to view silos through the lens of the systemic issues, and helps frame solutions as:

What steps do I need to advocate to increase the psychological safety of these Team members?

Do I need to agree & document acceptable ways of working first, before moving ahead with the project?

Is there a people-coaching need here rather than a project-resourcing need? Who needs coaching on which aspects?

…rather than viewing siloed behaviours as lazy choices that Teams or individuals make naturally to avoid accountability.

If you just had 1 Testing question to hire/reject a QA candidate

It is unfair to judge a candidate through just one challenge or exercise, but imagine that you are in a (non-violent & harmonious) Squid Game situation and, as the hiring manager, you were only allowed one Testing challenge to pose to the candidate:

what would that be and why?

Something that is related to the Testing craft, can be applied agnostic of the candidate’s experience level, and can be used as a vehicle to elicit their core testing mindset.

For me, it goes something like this…

  1. I will draw a whiteboard diagram of the product or system under test
  2. I will explain a typical end-to-end use case of the product/system
  3. I will explain the integrations and touch points that the system has with other sub-systems/products

and then I would commence the challenge with an open-ended question:

“What do you think could go wrong with this Product/System?”

Good testers, whom I have had the fortune to hire & work with, usually engage with this exercise along the following lines:

  • They will probe more into the context in which this question is being asked; they will try to understand what “wrong” means here, i.e. are we talking about functionality going wrong? Scalability of the system? End-user experience? Data integrity? Security of the components? Deployment & availability?

  • They will try to understand how, at what stages, and in which roles a human interacts with the system (UI end user, admins, deployment, tech support).

  • They will ask counter-questions: How does data flow through the system? Architecturally, how do the integrations work, and to which spec? Is there a shared understanding of the API specs? Which operations can be performed on the data? Where is it stored? How is it retrieved & displayed?

  • They will inquire about the testability & monitoring of the system and its sub-components: How do I know data has reached from A to B in the system? What does A hear back from B when the transaction finishes? How are errors logged, retrieved, cleared? (A small log-correlation sketch follows this list.)

  • They will frame questions around understanding change to the system: What is our last working version in this context? Which patterns of past failures might be relevant in this context? How do we track changes to the code, config and test environments of the product/system?
  • They will try to establish the modes of failure of the components of the system: how to simulate them? How to deploy and redeploy the system?
  • They will delve into what happens when parts of the system are loaded or soaked, e.g. exposed to sustained user interaction, voluminous bulk-data transactions, or constrained infrastructure availability/scalability
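
As a tiny illustration of the testability probing above (the “has data reached from A to B?” question), here is a rough Python sketch that correlates one transaction id across the logs of two components. The file names, id format and log layout are all made up for the example:

```python
# Hypothetical sketch of "how do I know data has reached from A to B?":
# correlate one transaction id across the logs of two components.
# The file names, id format and log layout are made up for the example.
TXN_ID = "txn-12345"

def lines_mentioning(path: str, needle: str) -> list[str]:
    with open(path) as fh:
        return [line.rstrip() for line in fh if needle in line]

a_events = lines_mentioning("service_a.log", TXN_ID)
b_events = lines_mentioning("service_b.log", TXN_ID)

print(f"{TXN_ID}: {len(a_events)} event(s) in A, {len(b_events)} event(s) in B")
if a_events and not b_events:
    print("A emitted the transaction but B never logged it: suspect the A->B hop")
for line in a_events + b_events:
    print(line)
```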

These are just some of the rudimentary but important aspects of critical thinking that I would expect from promising or established Testers.

Of course, a holistically capable Tester’s skills go way beyond the above points, but this challenge has served me as a handy screener during interviews and usually sets the trajectory for the remainder of the interview.

Should your Team have a dedicated Scrum Master or Agile coach?

Temporarily – yes

Dedicated to your Team full time, as a permanent role – respectfully, no

From my experience as an agile practitioner, I believe there are two reasons for not having a dedicated, full-time Scrum Master or Agile coach on your cross-functional squad/Team:

  1. Coaching needs are inherently impermanent.

For example: a Senior Team member needs coaching to get better at facilitation; a Tester in the squad needs coaching on determining the best Testing approach; the Team needs coaching on how to provide estimates to Business users/Project Leaders; a Senior Tech Lead needs coaching on aligning product roadmaps with other Teams.

Coaching needs like the above have (and must have) a life cycle, roughly wherein:

a) A coaching need is detected

b) The coach facilitates discovery & framing of the root cause, and metrics of success are established

c) and then, experiments/solutions are tried over an agreed time frame to meet the coaching need

d) and then, at the end of the cycle, either you have fulfilled the need or you have surfaced sub-problems/impediments that may not be coaching needs but organisational/systemic problems

(e.g. an external dependency that cannot be resolved; the Team can now self-organise to run effective meetings and does not warrant any further coaching; a line-management escalation is needed; a Team resourcing issue; etc.).

2. Scrum Mastering is not a front to off-load “admin” work

Let’s define “admin” work first; I call it “common work”:

Work resulting from agile rituals that –

a) repeats every release cycle

b) has connotations of not being intellectually rewarding

c) might not align with your core competency/background

e.g. maintaining your JIRA board on a daily basis, running effective playback sessions, facilitating a post-mortem, organising meetings to resolve business-priority conflicts, weeding the mid/long-term backlog (beyond the current release cycle), regularly communicating with the Teams that depend on your work.

Often, common work is seen by Org Leaders as getting in the way of achieving tangible Team outcomes and business value. Hence, they plug the gap by delegating common work to a dedicated role, so that the Team can focus on “real” work.

This is fallacious thinking, because well-executed common work,

firstly, benefits the whole Team by instilling software engineering discipline,

and secondly, and importantly, allows the Team to get better at self-organising, inspecting & adapting to change, and taking ownership of aligning their work to business needs.

Getting someone else to do the thinking (all the time) on the Team’s behalf stymies the Team’s ability to do it for themselves,

and in its essence contradicts the bedrock of agility, i.e. forming self-organising Teams. It puts the Team in the disadvantaged position that, if they don’t have a dedicated person ensuring that the Team’s common work keeps flowing, they will lose their throughput.

Having expressed all this: doing common work definitely requires tangible skills that need to be nurtured and practised. That is exactly where coaching steps in, and it becomes a cyclic coaching need (as described in point 1 above).

Starter pack on Penetration/Security Testing for newbies

As an experienced Tester, I have recently been endeavouring to grow my Penetration & Security Testing skills.

As with any new skill set, the journey can get overwhelming very quickly because of the vast number of concepts and new terminologies, and the lack of dedicated mentorship and research sources.

Based on my learning and explorations over the past few months in the Pen Testing & Cyber Security realm, I am putting together a table of learning goals and resources that I hope will help Testers start out on their journey in Pen Testing.

This is not by any stretch a replacement for real-world project experience or structured certification training like OSCP; rather, it is aimed at full-time Test Professionals who, on the side, are interested in learning about security challenges & Pen Testing for Web, Network and Mobile apps.

Each learning goal/research topic below is followed by the resources I used.

What are some of the most common security weaknesses out there?
OWASP Top 10 – https://owasp.org/www-project-top-ten/

How can you inspect HTTP requests/responses, view source code, manipulate cookies etc. using Chrome DevTools?
https://developers.google.com/web/tools/chrome-devtools
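
As a complement to poking around in DevTools, here is a minimal Python sketch (using the requests library) of the same kind of inspection: fetch a page, dump the response headers and cookies, and flag commonly recommended security headers that are missing. The target URL is a placeholder; only point it at an app you are authorised to test:

```python
# Minimal sketch: fetch a page, dump its response headers and cookies,
# and flag commonly recommended security headers that are absent.
# The URL is a placeholder; only scan apps you are authorised to test.
import requests

TARGET = "http://localhost:3000/"  # e.g. a locally hosted vulnerable app

EXPECTED_HEADERS = [
    "Content-Security-Policy",
    "Strict-Transport-Security",
    "X-Content-Type-Options",
    "X-Frame-Options",
]

resp = requests.get(TARGET, timeout=10)
print(f"Status: {resp.status_code}")

print("\n-- Response headers --")
for name, value in resp.headers.items():
    print(f"{name}: {value}")

print("\n-- Cookies --")
for cookie in resp.cookies:
    # A missing Secure flag on a session cookie is a classic finding
    print(f"{cookie.name}: secure={cookie.secure}")

print("\n-- Missing security headers --")
for header in EXPECTED_HEADERS:
    if header not in resp.headers:
        print(f"MISSING: {header}")
```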
Why is Kali Linux so popular among Pen Testing practitioners? How can you install Kali Linux using VirtualBox?
https://www.kali.org/docs/introduction/what-is-kali-linux/
Set up your own instance of Kali Linux; if you are new to Linux, it is handy to go through this:
https://tryhackme.com/module/linux-fundamentals
Where can you find apps that are deliberately vulnerable?
The common Pen Testing approach for all the tool sets below is: you have a machine + OS (like Kali Linux) as your “attacker” machine, i.e. the machine from which you run the tools to find weaknesses in the “target” machine, or in a machine hosting the vulnerable app.
https://github.com/kaiiyer/awesome-vulnerable
https://pentester.land/cheatsheets/2018/10/12/list-of-Intentionally-vulnerable-android-apps.html
How do you scan a web app for vulnerabilities?
Start with ZAP proxy – https://www.zaproxy.org/getting-started/
Applying ZAP proxy to detect common weaknesses in Web apps:
https://www.zaproxy.org/docs/guides/zapping-the-top-10/
Then explore Nessus:
https://resources.infosecinstitute.com/topic/a-brief-introduction-to-the-nessus-vulnerability-scanner/
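
Going back to ZAP: it also exposes an API that you can drive from Python via the python-owasp-zap-v2.4 package. Below is a rough sketch, assuming a local ZAP instance is already running on port 8080 with an API key set, and a deliberately vulnerable app as the target; treat it as a starting point and lean on the getting-started guide above for the details:

```python
# Rough sketch: drive a locally running ZAP instance from Python.
# Assumes ZAP is listening on localhost:8080 and you know its API key.
# pip install python-owasp-zap-v2.4
import time
from zapv2 import ZAPv2

TARGET = "http://localhost:3000/"  # placeholder: an app you may legally test
zap = ZAPv2(apikey="changeme",     # placeholder API key
            proxies={"http": "http://localhost:8080",
                     "https": "http://localhost:8080"})

# Crawl the target so ZAP discovers its URLs; passive scanning
# happens automatically as the traffic flows through the proxy.
scan_id = zap.spider.scan(TARGET)
while int(zap.spider.status(scan_id)) < 100:
    time.sleep(2)

# Dump whatever ZAP has flagged so far
for alert in zap.core.alerts(baseurl=TARGET):
    print(alert["risk"], "-", alert["alert"], "@", alert["url"])
```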
Why does everyone rave about Burp Suite? What capabilities does it provide for performing scanning and penetration attacks?
Starting with Burp Suite:
https://dev.to/leading-edje/getting-started-with-burp-suite-31hd#articles-list
OWASP Top 10 detection using Burp Suite (this is quite intense, but well worth the learning):
https://portswigger.net/support/using-burp-to-test-for-the-owasp-top-ten
What is Network reconnaissance? What is a beginner-friendly tool for scanning your network to gather information?
Watch this series of excellent Nmap tutorials from the YouTuber HackerSploit:
https://www.youtube.com/watch?v=5MTZdN9TEO4
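
To demystify what a recon tool is doing under the hood, here is a toy TCP connect scan in pure Python (stdlib only); it is the crudest form of what Nmap’s -sT scan does, and orders of magnitude less capable. Only probe hosts you own or are explicitly authorised to test:

```python
# Toy TCP "connect" scan: the crudest form of what Nmap's -sT scan does.
# Only probe hosts you own or are explicitly authorised to test.
import socket

HOST = "127.0.0.1"  # placeholder target
COMMON_PORTS = [21, 22, 23, 25, 53, 80, 110, 143, 443, 3306, 5432, 8080]

for port in COMMON_PORTS:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when the TCP handshake succeeds (port open)
        if sock.connect_ex((HOST, port)) == 0:
            print(f"Port {port}/tcp open")
```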
Are there any tools solely focused on trying to exploit SQL databases?
Yes: SQLMap comes preinstalled on Kali Linux, and you can use it to try to penetrate a vulnerable website:
https://www.kalitutorials.net/2014/03/hacking-website-with-sqlmap-in-kali.html
How do you get started with Android Pen Testing?
Understand the Android architecture and how Android apps are built:
https://medium.com/mobile-penetration-testing/00-prepare-for-penetration-testing-cea4c3de1f05
Use one of the traffic-sniffing tools (e.g. Burp Suite proxy) to intercept traffic from an Android app:
https://medium.com/androgoat/intercept-http-traffic-from-android-app-androgoat-6e3d4d14d352
This is intense again, but going through these tutorials really helped me understand common Android vulnerabilities and how to detect them:
https://medium.com/mobile-penetration-testing/android-penetration-testing-courses-4effa36ac5ed


How do you reverse engineer APK files and study application code for static verification?
Apktool and JADX GUI are two reverse engineering tools that I used:
https://ibotpeaches.github.io/Apktool/
https://ourcodeworld.com/articles/read/387/how-to-decompile-an-apk-or-dex-file-using-jadx-in-windows

Are there any “Security as a Service” type scanners for apps?
I explored and played with three:
MobSF – Python based; you have to install it locally: https://github.com/MobSF/Mobile-Security-Framework-MobSF
Ostorlab – a cloud-based service where you can upload your app and run vulnerability scans on it: https://www.ostorlab.co/
ImmuniWeb – another cloud-based service: https://www.immuniweb.com/

Other tools that I have come across but have not used yet:
Intruder.io
Infection Monkey – simulates breaches & attacks on your network
How do you go deeper into Mobile Application Security?
This book by the OWASP Team is excellent and has great hands-on material:
https://owasp.org/www-project-mobile-security-testing-guide/
What about self-training and hacking practice platforms?
I have primarily used TryHackMe and their paid service, and found it well worth the $10 per month that they charge:
https://tryhackme.com/
There is another one I have come across but not used yet: https://www.hackthebox.eu/

Testing is “easy”

Testing is easy, you just have to…

1. Elicit user needs from missing or incomplete requirements (often as the sole Tester on a Team)
2. Be great at analytical thinking and at detecting your own biases
3. Analyze and understand end-to-end architectural risks
4. Analyze and understand end-to-end business processes
5. Be resilient in the face of “why didn’t you catch it?” probes
6. Be adept at creating effective test data
7. Excel at communicating technical issues to business folks and vice versa
8. Be nerdy enough to analyze lots of PROD data to inform your tests (see the sketch after this list)
9. Be informed enough to know which logs to dive into for which errors
10. Coach peers on effective testing (vs. just breaking the system)
11. Constantly look to automate repetitive tasks and reduce Testing-related waste
12. Facilitate discussions and manage stakeholder expectations on the effort, scope and risks of Testing
13. Report progress on Testing, adapting to the context of the audience, project, company culture and tech stack
14. Determine what (code/environment/user behaviour/test method/dependencies/integration interface/data/cognitive interpretation) changed since last time
15. And know how to quantify that change, to prove that it is faster/slower/better/less usable/non-compliant
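
To make point 8 a bit more concrete, here is a minimal sketch of mining PROD access logs to decide where testing effort should go first, by counting which endpoints real users actually hit. The log path and format are assumptions; adapt the parsing to whatever your stack emits:

```python
# Minimal sketch for point 8: rank endpoints by real PROD traffic so that
# the highest-traffic paths get tested first. The log path and format are
# assumptions (common-log-style lines, e.g. '... "GET /api/orders HTTP/1.1" 200 ...').
import re
from collections import Counter

LOG_FILE = "access.log"  # placeholder path
request_re = re.compile(r'"(GET|POST|PUT|PATCH|DELETE) (\S+)')

hits = Counter()
with open(LOG_FILE) as fh:
    for line in fh:
        match = request_re.search(line)
        if match:
            hits[match.groups()] += 1

print("Top 10 endpoints by PROD traffic (test these first):")
for (method, path), count in hits.most_common(10):
    print(f"{count:>8}  {method} {path}")
```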

Organisational QA/Testing smells


On the lines of “Code smells”, QA/Testing-related smells, in my experience, fall into these 4 broad categories of root cause(s):

  1. Apathy – Disregard for Testing as a function/craft
  2. Hubris – Talent- or position-driven blind spots that lead towards flawed decision making
  3. Ignorance – No one has shown them how to do better, or Team members lack a (certain) Testing mindset
  4. Helplessness – Cognitive exhaustion from pushing back against immature software practices or organisational dysfunction

Compiling a list of verbatim/observational “smells” I have come across in my Testing career so far 🙂, including some that I myself have been guilty of!

Feel free to add yours in the comments below, thank you.

I’m sure some will resonate as stereotypes; hopefully some are new to you (and hence something you might want to watch out for):

  • Customers won’t use it “that” way
  • You are testing too early
  • (corollary) You tested too late
  • Why would you be needed in the design session?
  • Why would you be needed in the code review?
  • Why would you be needed in the requirements gathering session?
  • I challenge you to break it
  • (corollary) No customer has complained so far!
  • Look (pointing to their IDE), works
  • Try now <keyboard clatter>… Try now <keyboard clatter>… Try now <keyboard clatter>
  • Did it just break?
  • When was it last working?
  • I did not change anything
  • How do I know what changed?
  • I can’t tell you which tests you should write, that’s your job!
  • It’s a big change, we just have to do it all in one go
  • How do I see the back-end errors?
  • What does “Unhandled exception, contact your System Administrator” mean?
  • How do I know if all these errors are related?
  • This keeps happening but I just can’t make it happen at will
  • OK, I can’t tell what else is broken
  • This aaaaawwlllways breaks
  • We always must take 3 days to retest everything
  • I am a “manual” Tester
  • (corollary) I am an “automated” Tester
  • Every developer must be experienced in Automated Testing
  • (corollary) Ma..look, no Testers needed!
  • (corollary of corollary) <this approach/tech> will replace Testers
  • No, DON’T try changing this config file
  • (corollary) Why do you need access to the build pipeline?
  • It will be faster in this release
  • Test it while I document the design
  • (corollary) Will document it, only if I have time
  • I will refactor it in one go
  • This was never meant to be in scope
  • Why do we have just 157 test cases for this project?
  • (corollary) We are 100 % PASS
  • (corollary of corollary) We were 100 % but this time we are 87 % PASS
  • (corollary of corollary of corollary) If we go down from 75 % we can’t ship
  • This environment is just for development
  • (corollary) It will be “different” in PROD
  • “Don’t worry about integration yet” (Team 1) …tududududu… (2 weeks from Go-Live) Team 2: “no one told us about these changes”
  • Must be the database box
  • (corollary) Must be permissions
  • (corollary of corollary) Must be a known issue
  • These are my automated tests, they don’t need to be in version control
  • (corollary) Test code isn’t “production code”
  • (corollary of corollary) Our Team only ships application code
  • This is how Agile is
  • (corollary) This is how Agile Testing is
  • (corollary of corollary) This is how <insert prevalent industry term> Testing is
  • Because, Docker
  • (corollary) Because, <some new tech>

List to be continued…..

Reflection :: The toll of Leadership and a year of being self-employed

I have been very fortunate to be in leadership roles for 8 years now, ranging from Team leadership and mentoring to leading a practice (business line) of extremely competent Testers.

It has been among the top 3 fulfilling experiences of my professional & personal life. Seeing individuals succeed with (some of) your assistance, advice and guidance is what made leadership so satisfying to me. Putting others first, always, and shepherding them towards success is why I have kept leadership roles as a sought-after career path. Growing and developing individuals and Teams is a passion that blossomed, almost as a second skin, during these stints in leadership roles.

However, what I did not realise was that I had also started wearing the foggy lens of a “careerist” and exercising questionable judgement during that time. By that, I mean:

a) Attaching self-worth to the extent of my responsibilities. More responsibilities, new strategic projects and bigger Teams to help lead were all a measure of professional “success” for me.

b) Incessant intellectual restlessness until that bar of self-worth was reached, and, after every milestone, finding that the bar had just got higher: a vicious cycle.

c) Achieving outcomes for Team members in the face of corporate dysfunction and resistance (a.k.a. the bread & butter of leadership roles) made me “compromise”: compromise by staying in, and trying to change, organisational behaviours in ecosystems where the org’s values/mindset and mine clearly did not match. But still I had to carry on, because the “Team cannot be let down!” and leadership is, at the end of the day, a “balancing act”.

d) Surprise… surprise… this took focus away from my mental & physical well-being (in spite of getting professional help). It also took focus away from effectively exercising my role as a parent.

This carried on for a dangerously long period of about 2 years, until last July when, only 4 weeks into a “dream” role, I quit, without a job in hand.

My act of quitting was not a Buddha-esque lightning bolt of enlightenment; it occurred because I just could not carry on. I was in hospital twice in a matter of 2 weeks, with dangerous symptoms of cardiac pain. My body and mind had plotted to conjure up the act of giving up.

I had to be ejected from the corporate hairball orbit, without a space suit, let alone a plan. I had close to zero savings, I borrowed money from my sister, and the only foreseeable and enjoyable thing I had was dropping the kids at school & picking them up, as I could now do it. The closest I had to a plan was to reach out to ex-colleagues on LinkedIn and check with my ex-employer to see if I could get my old desk back. I did not get it, which in hindsight is the best thing that could have happened to me, because what happened next, and has been happening since, has been just as fulfilling as the so-called “zenith of professional success” that I had experienced earlier.

Gentle warning: I’m not suggesting that this path be followed at all (who knows whether you will be more or less lucky than I was), but what transpired was that an ex-client whom I had consulted for before had a role for a contract Test Manager. I had nothing to lose, I had the courage to say no, and I had the flexibility to try something new and shun it if I did not like it… well, that was a fragrance I had not experienced before, so I followed the whiff. And it has been a sumptuous feast so far!

Over the past year –

a) I have worked on a time- & mission-critical programme of work affecting the daily lives of New Zealanders

b) I have been exposed to/tested new technologies that I had no prior experience with, e.g. R, big data ETL, Machine Learning models

c) I have achieved things that I yearned for in my leadership roles, e.g. further deepening my tech skills, contributing code to the Team on a daily basis, architecting a cross-functional Team from scratch

d) And I have done that while leveraging my core skills of servant leadership, facilitation and critical thinking

During this very brief journey as a self-employed contractor, it has dawned on me that being a “careerist” had not only definite negative inclinations but consequences too, as I was equating my self-worth with my job title. Being a contractor has given me the gold dust of flexibility, wherein:

— I can choose to say no to organisations and walk away from opportunities when their demonstrated values/ethics don’t align with mine, without worrying about how it would look on my CV

— I can exercise my core skills and develop new ones in parallel

— Above all, I can take care of my family and myself: physically, mentally and spiritually.

Lastly,

Please don’t get me wrong:

I am not fooling myself that this joyride is permanent, or that contracting is somehow objectively better than in-house roles!

Self-employment comes with some hefty challenges around inconsistent financial reward and the risk that poses, e.g. to a young, mortgage-paying family. Creating a sustainable pipeline of work in an emerging but (relatively) small IT community won’t be easy, but all I can say is that I am relishing every minute of this current joyride, with no mental demons to slay. And I would encourage every current/ex-“careerist” to try freelancing/independent contracting at least once in their career, and/or feel free to reach out to me if you want a sounding board.

Stay well, peers, and flourish! 🙂

Heuristics for debugging integration problems

Outstanding Testers (that I have had the chance to work with/coach) did not just “report that there was a fire”; they were skilled at investigating and communicating:

  • How long has the fire been burning?
  • What is the scale of the impact?
  • Which areas are affected, and which are not?
  • What is the nature of the impact?
  • When did it start?
  • When did we last check?
  • What could have caused it?
  • What could we do better next time to help answer the above questions (when the next fire hits)?

In Exploratory Testing, one of the key challenges in testing an unfamiliar (and complex) system is ascertaining where to look for the source of an error, for debugging and root cause analysis purposes.

From my experience in testing multi-technology integrated systems, I have put together a bunch of generic heuristics that I use to investigate and look for information that helps in debugging and contributes towards articulating the root cause of end-user errors.

1. “Top-down” heuristics

By top-down (in this context) I mean debugging the application stack of the system component where the symptom has cropped up.

The intention here is to ask questions that ascertain whether or not the root cause lies in this vertical slice of the architecture. If it does not, we can move on to the second set of heuristics (i.e. the integration of the current system component with other components of the solution architecture).

  • Symptom repeatability – Are you able to repeat the error consistently from the UX? Which browser + platform combination is the best bet for reproducing the symptom?
  • API traffic for the stack – Which underlying API endpoints are called by the stack’s UX when the error happens? Are those endpoints responding? What do the browser developer tools (or alternate methods) tell you about the request payload and response when the error happens? Invoke the API endpoint directly (with exactly the same request payload) and compare the response with the response received via the UX (see the sketch after this list). Are there any errors logged in the developer tools console? Are those errors related, and how do you know?
  • DB transactions within the stack – Which tables is the API supposed to write to? Which fields? Are those tables/fields being correctly populated? Are your DB schema definitions up to date and correct? If a stored procedure is involved, is it actually being called, and how do you know? Do you log API/database errors in the database? If yes, have any errors been logged when the UX error happens? If not, you should advocate persistent logging of errors for debuggability with your Product Owner.
  • Last working version of the stack – What was the last working version of the stack, i.e. one that did not have this error? Revert the stack to that version: can you still reproduce the error? If not, hold a peer review of the changes since then. Have you got automated checks that tell you the status of all the versions between the working and non-working ones? By reviewing those checks, or by manually changing one variable at a time, can you pinpoint the version of the stack in which this error started?
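
As an example of the “API traffic for the stack” heuristic, here is a minimal Python sketch that replays the endpoint call captured from the UX with the same payload, and diffs the direct response against what the browser received. The endpoint, payload and captured response are all placeholders:

```python
# Sketch of the "API traffic" heuristic: replay the call the UX made,
# then diff the direct response against what the browser received.
# The URL, payload and captured response below are all placeholders.
import requests

ENDPOINT = "https://app.example.test/api/orders"        # from DevTools
PAYLOAD = {"customerId": 42, "items": [{"sku": "A1"}]}  # captured request body
CAPTURED_FROM_UX = {"status": "FAILED", "errorCode": "E500"}  # what the UI saw

resp = requests.post(ENDPOINT, json=PAYLOAD, timeout=10)
direct = resp.json()

print(f"Direct call HTTP status: {resp.status_code}")
if direct == CAPTURED_FROM_UX:
    print("Same response either way: suspect the API/stack, not the UX layer.")
else:
    print("Responses differ: suspect the UX layer or request construction.")
    for key in sorted(set(direct) | set(CAPTURED_FROM_UX)):
        if direct.get(key) != CAPTURED_FROM_UX.get(key):
            print(f"  {key}: direct={direct.get(key)!r} ux={CAPTURED_FROM_UX.get(key)!r}")
```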

2. “End-to-end” architecture heuristics

OK, running our top-down (through the stack) debugging checks did not yield success; now we need to inspect the integration points and the other system components that your application interacts with.

  • Data flow and events across integration points – Do you have (or can you draw) a solution architecture diagram to confirm which other system components your application stack deals with? When the error happens, can you confirm what data and events your application is expecting from the system components it is integrated with? Is your application receiving the data it expects? Is the data in the right format? When the error happens, can you confirm which data is being written to which other system components? Is it actually happening, and how do you know? Is there logging evidence to confirm the answers? If not, you should advocate persistent logging of errors for debuggability with your Product Owner.
  • Last working version of the architecture – Do you know the last working versions (i.e. not displaying this error) of all the integration points and system components? Can the whole architecture be rolled back to a working version? Have you got automated checks that tell you the status of all the system components between the working and non-working copies of the architecture? By reviewing those checks, or manually, can you pinpoint the version of, or the change to, a system component/integration point in which this error started? (A bisect sketch follows this list.)
  • Completeness of the architecture – Is the architecture complete, i.e. are all the system components and integration points responding? Is there logging to confirm (or negate) that there is no missing system component or disabled integration point? If not, have a discussion with your solution architect about how this could be improved to aid debuggability.
  • Non-functional/timing activities across the architecture – When the error happens, are there any resource-intensive (CPU, memory, disk I/O) processes running and/or being kicked off in other parts of the architecture? How can you monitor resources across the components and integration points? How do you know whether those resource-intensive processes have completed or are stuck? Where do you refer to for evidence of failure of those processes/tasks? Are there any timeouts involved, i.e. is any system component awaiting a response from another and not getting it? Is there logging to this effect? If not, you know what to do 😉
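
Both “last working version” heuristics boil down to a bisect: binary-search the versions between the last known good and the first known bad, running your automated checks at each step. A generic sketch, where check() stands in for whatever deploy-and-verify step you have:

```python
# Generic bisect over an ordered list of versions: find the first version
# where the automated check starts failing. `versions` and `check` are
# placeholders for your own deploy/rollback + verification steps.
from typing import Callable, List

def first_bad_version(versions: List[str],
                      check: Callable[[str], bool]) -> str:
    """versions[0] is known good, versions[-1] is known bad;
    check(v) returns True while the system still works at version v."""
    lo, hi = 0, len(versions) - 1
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if check(versions[mid]):   # deploy versions[mid], run the checks
            lo = mid               # still good: the fault arrived later
        else:
            hi = mid               # already bad: the fault is here or earlier
    return versions[hi]

# Usage sketch: pretend the regression slipped in at v1.4
if __name__ == "__main__":
    versions = ["v1.0", "v1.1", "v1.2", "v1.3", "v1.4", "v1.5", "v1.6"]
    broken_from = versions.index("v1.4")
    check = lambda v: versions.index(v) < broken_from
    print(first_bad_version(versions, check))  # -> v1.4
```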