As the output of a data transformation program, I had a large Excel sheet (100~200 MB) of error logs to sift through manually, looking for error codes.
These error codes had to be compared against an expected set of error codes, to ensure that the program was capturing the complete set of errors (which were purposely injected into the source data set).
I was executing the check "manually", i.e. filtering the output logs on the "error_code" column and then retrieving the unique error codes to compare against the source list of unique error codes.
This involved a fair amount of duplicated effort on each test run, so I decided to script it using my programming language of choice, i.e. Python.
Scripted comparison approach:
My approach was to script the test iteratively, i.e.
- Script the Excel parsing and comparison of error codes
- Then script the running of the data transformation program (to output the log files). This part already existed; all I had to do was integrate it with the parsing/comparison script once I had created it. (This post covers the parsing and comparison solution.)
Parsing -> I started researching (aka googling) a solution and ended up using pyexcel as the module to parse the Excel sheet, mainly because it supports multiple Excel formats and has excellent documentation.
Comparison -> This led to thinking about the second part of the problem, i.e. how to retrieve the unique error codes from the logs and compare them against an expected list.
I landed on using sets for the comparison, as they are extremely handy for working with data sets of unique elements and interoperate seamlessly with lists & dictionaries.
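To illustrate why sets fit the problem, set difference makes the comparison a one-liner (the error codes below are made-up placeholders, not the real injected codes):

```python
# Placeholder error codes, just to demonstrate the set operations.
expected_codes = ["ERR_NULL", "ERR_RANGE", "ERR_DUP", "ERR_DUP"]  # list, with duplicates
logged_codes = ["ERR_NULL", "ERR_RANGE", "ERR_TYPE"]

# set() collapses duplicates, and "-" (difference) shows what one side lacks
missing = set(expected_codes) - set(logged_codes)     # expected but never logged
unexpected = set(logged_codes) - set(expected_codes)  # logged but not expected

print(missing)     # {'ERR_DUP'}
print(unexpected)  # {'ERR_TYPE'}
```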
Equipped with the above tools, I started coding a basic POC.
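A sketch of such a POC could look like the following; note that the file name, the "error_code" column header, and the expected codes are placeholder assumptions about the log format, not the actual values used:

```python
# Hypothetical POC: parse the Excel error log with pyexcel and compare
# the unique error codes found against an expected set.

def unique_error_codes(rows, column="error_code"):
    """Collect the unique error codes from an iterable of row dicts."""
    return {row[column] for row in rows if row.get(column)}

def compare_codes(found, expected):
    """Return (missing, unexpected) code sets relative to the expected set."""
    expected = set(expected)
    return expected - found, found - expected

def run_check(xlsx_path, expected):
    """Parse the Excel log and report the comparison result."""
    import pyexcel  # imported here so the pure helpers above need no dependency

    # iget_records() streams each row as a dict keyed by the header row
    rows = pyexcel.iget_records(file_name=xlsx_path)
    found = unique_error_codes(rows)
    pyexcel.free_resources()  # iget_records keeps the file handle open

    missing, unexpected = compare_codes(found, expected)
    print("missing codes:", sorted(missing))
    print("unexpected codes:", sorted(unexpected))
    return missing, unexpected

# Example invocation (path and expected set are placeholders):
# run_check("error_logs.xlsx", {"E001", "E002", "E003"})
```

Keeping the parsing separate from the set comparison means the comparison logic can be tested on plain lists of dicts without touching an actual Excel file.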
Further work ->
- Integration with the data transformation program to complete an E2E solution that grabs the source data, transforms it, then parses and compares the error codes
- Performance! I am working with fairly chunky log data; how could I optimize my code?