Python 3.x – Using sets to parse log data

Testing problem: 

As a output of a data transformation program, I had a large excel sheet ( 100 ~ 200 MB) of error logs to sieve through manually to look for error codes.

These error codes were supposed to be compared against an expected set of error codes, to ensure that the program was capturing the complete set of errors ( that were purposely injected into the source data set).

Scripting opportunity: 

I was executing the check “manually” i.e. filtering the output logs to look for the “error_code” column and then retrieve unique error codes to be compared against the source list of unique error codes.


This has a fair amount of duplicate effort on each test run, hence I decide to script it using my programming language of choice i.e. Python

Scripted comparison approach:

My approach was to iteratively script the test i.e.

  • Script the excel parsing and comparison of error codes
  • Then script the running of the data transformation program ( to output the log files). (This already existed , all I had to integrate this with the parsing/comparison script once I had created it. This post covers the parsing and comparison solution)

Parsing -> I started researching (aka googling) a solution and ended up the with using pyexcel as the module that I will use to parse the excel sheet, mainly because it supports multiple excel formats and has excellent documentation .

Comparison -> This lead to thinking about the second part of the problem i.e. how to retrieve unique error codes from the logs and compare them against an expected list.

I landed on using sets for comparison, as they are an extremely handy to deal with data-sets formed of unique elements and can operate seamlessly with lists & dictionaries.

Equipped with the above tools, I started coding a basic POC as below

Solution -> 

import pyexcel as pe
from datetime import datetime
from datetime import date
import pytest
def test_monthly_close_off_checks():
# this is the expected set of errors from business requirements
expected_error_code_set = set({'J01','J01.N','J02','J03','A01','A02','A02.N','A03','A04','A05','D01','D02','AV01','AV02','AR01','AR02','AP01','AP02','AP03','AP04','AP05'})
#get today's date in YYYYMMDD format as it is appended at the end of the excel sheet that needs to be parsed
today_date = str("%Y%m%d"))
# using pyexcel object parse the first log file to get a list of ordered dicts for each row in the log file
notifications = pe.get_records(file_name = r"CloseOffChecks_Sunjeet\Notifications_"+today_date+".xlsx")
# define an empty set to store the list of unique error codes parsed from the log files
parsed_error_code = set()
#iterate through the rows and get the value of the error code , it is under the column "Error Code" i.e. would be the key in the retrieved dict
for n in notifications:
#add the error to the set . The set will ensure uniquness !
parsed_error_code.add(n["Error Code"])
print (n)
#same drill as the notifications file above for the error file
errors = pe.get_records(file_name = r"CloseOffChecks_Sunjeet\Errors_"+today_date+".xlsx")
for e in errors:
#append further error codes to the existing set
parsed_error_code.add(e["Error Code"])
print (n)
#assert that the parsed set of error codes is the same as the expected set
assert(parsed_error_code == expected_error_code_set)

view raw

hosted with ❤ by GitHub

Further work ->

  • Integration with the data transformation program to complete an E2E solution that grabs source data, transforms it , parses and compares error codes
  • Performance ! I am working with fairly chunky log data, how could I optimize my code ?





One thought on “Python 3.x – Using sets to parse log data

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s