Testing problem:
As an output of a data transformation program, I had a large excel sheet (100–200 MB) of error logs to sieve through manually, looking for error codes.
These error codes had to be compared against an expected set of error codes, to ensure that the program was capturing the complete set of errors (which were purposely injected into the source data set).
Scripting opportunity:
I was executing the check “manually”, i.e. filtering the output logs for the “error_code” column and then retrieving the unique error codes to compare against the source list of unique error codes.
This involved a fair amount of duplicated effort on each test run, so I decided to script it in my programming language of choice, i.e. Python.
Scripted comparison approach:
My approach was to build the test iteratively, i.e.
- Script the excel parsing and comparison of error codes
- Then script the running of the data transformation program to output the log files. (This already existed; all I had to do was integrate it with the parsing/comparison script once I had created it. This post covers the parsing and comparison solution.)
Parsing -> I started researching (aka googling) a solution and settled on pyexcel as the module to parse the excel sheets, mainly because it supports multiple excel formats and has excellent documentation.
Comparison -> This led to the second part of the problem, i.e. how to retrieve the unique error codes from the logs and compare them against an expected list.
I landed on using sets for the comparison, as they are extremely handy for dealing with data sets of unique elements and operate seamlessly with lists and dictionaries.
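As a quick illustration of why sets fit this problem (the codes below are placeholders, not the real business codes): the set difference operator reports exactly which expected codes never showed up in the logs, and which logged codes were not expected.

```python
# placeholder error codes, purely to illustrate the set operations
expected = {'J01', 'A01', 'A02'}
parsed = {'J01', 'A02', 'X99'}

missing = expected - parsed      # injected but never logged
unexpected = parsed - expected   # logged but never expected

print(missing)     # {'A01'}
print(unexpected)  # {'X99'}
```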
Equipped with the above tools, I started coding a basic POC, as below.
Solution ->
import pyexcel as pe
from datetime import date

def test_monthly_close_off_checks():
    # this is the expected set of error codes from the business requirements
    expected_error_code_set = {'J01', 'J01.N', 'J02', 'J03', 'A01', 'A02', 'A02.N',
                               'A03', 'A04', 'A05', 'D01', 'D02', 'AV01', 'AV02',
                               'AR01', 'AR02', 'AP01', 'AP02', 'AP03', 'AP04', 'AP05'}

    # get today's date in YYYYMMDD format, as it is appended to the names of
    # the excel sheets that need to be parsed
    today_date = date.today().strftime("%Y%m%d")

    # using pyexcel, parse the first log file into a list of ordered dicts,
    # one per row in the log file
    notifications = pe.get_records(file_name=r"CloseOffChecks_Sunjeet\Notifications_" + today_date + ".xlsx")

    # define an empty set to store the unique error codes parsed from the log files
    parsed_error_codes = set()

    # iterate through the rows and get the value of the error code; it sits under
    # the column "Error Code", which becomes the key in each retrieved dict
    for n in notifications:
        # add the error code to the set; the set ensures uniqueness!
        parsed_error_codes.add(n["Error Code"])

    # same drill as the notifications file above, for the errors file
    errors = pe.get_records(file_name=r"CloseOffChecks_Sunjeet\Errors_" + today_date + ".xlsx")
    for e in errors:
        # append further error codes to the existing set
        parsed_error_codes.add(e["Error Code"])

    print(parsed_error_codes)

    # assert that the parsed set of error codes is the same as the expected set
    assert parsed_error_codes == expected_error_code_set
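One tweak worth considering (hypothetical, not part of the POC above): a bare equality assert only tells you the sets differ, whereas spelling out the two set differences in the failure message names the offending codes directly, which saves a debugging round-trip on a failed run.

```python
def assert_error_codes_match(parsed, expected):
    """Assert set equality, with a failure message naming the offending codes.

    Hypothetical helper, not part of the original POC.
    """
    missing = expected - parsed      # expected but never seen in the logs
    unexpected = parsed - expected   # seen in the logs but never expected
    assert not missing and not unexpected, (
        f"missing codes: {sorted(missing)}, unexpected codes: {sorted(unexpected)}"
    )
```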
Further work ->
- Integration with the data transformation program, to complete an E2E solution that grabs the source data, transforms it, then parses and compares the error codes
- Performance! I am working with fairly chunky log data, so how could I optimise my code?
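One cheap optimisation to explore (a sketch I have not benchmarked, with a hypothetical helper name): since only set membership matters, the row scan can stop as soon as every expected code has been seen at least once, rather than always reading every row.

```python
def collect_codes(rows, expected, column="Error Code"):
    """Accumulate unique codes from an iterable of row dicts,
    stopping early once every expected code has been seen.
    Hypothetical helper, not part of the original POC."""
    seen = set()
    for row in rows:
        seen.add(row[column])
        if seen >= expected:  # superset check: all expected codes found
            break
    return seen

rows = [{"Error Code": "J01"}, {"Error Code": "A01"}, {"Error Code": "J01"}]
print(sorted(collect_codes(rows, {"J01", "A01"})))  # ['A01', 'J01']
```

If I remember the pyexcel docs correctly, it also offers a streaming variant (iget_records) that yields rows one at a time instead of loading the whole sheet into memory; combined with the early exit above, that could cut both memory use and runtime on the chunky log files.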