Using Python Pandas library

Peter Šimečka

The purpose of this blog post is to show the strength of Python Pandas library in combination with EVL Tool Anonymization microservice. EVLTool is a powerful family of services that provides data transformations in several business areas, across many platforms, with huge data volumes.

Python is very efficient at processing tabular data like csv and excel files, google spreadsheets, and SQL tables. Especially Python’s PANDAS library enables quick creation of various table operations suitable even for novices.

EVLTool Anonymization example

EVL Tool Anonymization microservice processes anonymization of input data by applying functions declared in an excel configuration file. It is an especially useful example as the results of anonymization are easy to see. We are going to test some anonymization functions and show the results in several steps:

Read input data csv file and configuration excel file into dataframes
Change one anonymization function and save the new configuration csv file
Call the linux EVL job for running the anonymization
Show the comparison of original and anonymized data for one chosen entity on the screen
Create a new excel file with two sheets: one with the current configuration and another one containing samples of original and anonymized data
Open the excel and let the user view the results

Pandas DataFrame

Python Pandas library uses the DataFrame for storing the internal representation of the input, output and configuration data. It offers several functions for reading/writing various data file types like pd.read_excel and pd.read_csv and plenty of functions for manipulation of the rows and columns in the dataframes. We will be using mainly the positioning functions. Additionally we are going to use also two Python libraries: subprocess, and os for spawning operating system programs.

Code Examples

The whole program and video are available: see details in the last chapter. Here we will briefly describe some Pandas features using a few examples of the code.

Importing Libraries

These libraries will be needed for performing some operations


import pandas as pd       # import Python Pandas Data Analysis library
import subprocess        # import Subprocess module for spawning processes 
import os             # import OS module for operating system interfaces

Input and output files definition

Declaration of directories and filenames, relative to the program location, in this case, it's in the main anonymization project folder.


config_excel_file  = 'configs/sample_source.xlsx'
config_csv_file   = 'configs/sample_source.csv'
source_file  = 'data/source/addresses.csv'
anon_file      = 'data/anon/addresses.csv'
output_excel = 'py_anon_blog_data.xlsx'

Reading files into the DataFrames

pd.read creates a dataframe from the chosen file (type). Parameters can define plenty of options like delimiters, column lists, sheet names, etc.


config = pd.read_excel(config_excel_file, sheet_name='anon', encoding="utf-8")
source = pd.read_csv(source_file, sep=';', usecols = ['addr_id', 'addr_type_code
anon   = pd.read_csv(anon_file, sep=';',   usecols = ['addr_id', 'addr_type_code'

Function: changing anonymization types

source.loc[3].at[‘city’] identifies the value of the anon_type column on 4th row (counting from 0) and if it equals RANDOM it changes the value to MASK_LEFT(4). These two values can be changed to any of the available EVLTool anonymization functions. Then the config DataFrame is exported to sample_source.csv file.


def change_config():
	if config.loc[3].at['anon_type'] == 'RANDOM':
		config.at[3,'anon_type'] = 'MASK_LEFT(4)'
		config.at[3,'description'] = 'Changed to MASK_LEFT(4) by Python'
# export the excel configuration file to csv configuration file
		config.to_csv(config_csv_file, sep = ';', index = False)  
		print('Changed anon function of the city entity from RANDOM to MASK_LEFT(4)')

Function: running EVL job

Construct the EVL Data Anonymization job command and call it via bash


def change_config():
    if config.loc[3].at['anon_type'] == 'RANDOM':
        config.at[3,'anon_type'] = 'MASK_LEFT(4)'
        config.at[3,'description'] = 'Changed to MASK_LEFT(4) by Python'
# export the excel configuration file to csv configuration file
        config.to_csv(config_csv_file, sep = ';', index = False)  
print('Changed anon function of the city entity from RANDOM to MASK_LEFT(4)')

Function: comparing original and anonymized data

Comparing original (source) city and anonymized (anon) city for all rows. source.loc[i].at[‘city’] contains the value of the cell in the row i and column name ‘city’


def compare_cities():
	i = 0
	print('Compare city before and after anonymization')
	for i in range(len(source)):
			print('\t' + source.loc[i].at['city']+'\t\t\t' + anon.loc[i].at['city'])

Function: save DataFrames to the appropriate files


def save_files():
	print('Saving files')
	data   = source.append(anon)    # Append anonymized data below the source data
	excel = pd.ExcelWriter(output_excel, engine='xlsxwriter')
	config.to_excel(excel,'anon', index = False)
# index=True would create additional 1st column containing row numbers
	data.to_excel(excel,'data', index = False)
	excel.close()

Calling the functions


# Calling functions
change_config()	# Function for changing the anonymization function for 
run_evl()			# Function runs EVL Data Anonymization job defined in the config file for 
            		# the file addresses.csv
compare_cities()	# Function for comparing values of the entity city before and after anonymization
save_files()		# Function saves dataframes into excel on 2 sheets

Open the excel file


print('Starting excel. Check the changes in both sheets')
start_excel = 'start excel.exe ' + output_excel
os.system(start_excel)        # Opens excel file containing new config and data

Whole program code


# This program shows some features mainly the Python Pandas module for working with DataFrames, 
# excel and csv files using the EVL Data Anonymization example 
# Steps:
#    reads the excel configuration file and input csv file
#   runs the EVL Data Anonymization program for anonymizing input file according to the configuration
#   in the configuration file
#    creates a new excel with the configuration on 1st sheet and merged original
#   and anonymized data on the second sheet
#   opens excel for checking the result 

# Import necessary Python modules

import pandas as pd       # import Python Pandas Data Analysis library
import subprocess        # import Subprocess module for spawning processes 
import os                 # import OS module for oprating system interfaces

# Change of anonymization for entity = city (4th column, numbering starting with 0) if the current anonymization parameter is RANDOM
def change_config():
    if config.loc[3].at['anon_type'] == 'RANDOM':
        config.at[3,'anon_type'] = 'MASK_LEFT(4)'
        config.at[3,'description'] = 'Changed to MASK_LEFT(4) by Python'
# export the excel configuration file to csv configuration file
        config.to_csv(config_csv_file, sep = ';', index = False)  
        print('Changed anon function of the city entity from RANDOM to MASK_LEFT(4)')

def compare_cities():
    i = 0
    print('Compare city before and after anonymization')
    for i in range(len(source)):
        print('\t' + source.loc[i].at['city']+'\t\t\t' + anon.loc[i].at['city'])

# Functions for running EVL Data Anonymization program
def run_evl():
    evl = "bash -c '. $HOME/.evlrc; evl run/anon/sample_source."    # starting bash, setting environment
    job = 'addresses'                                                # set the anonymized entity and the approproate evl job    
    suffix = ".evl 1>py_blog.log 2>&1'"                                # add the job suffix .evl and move errors and output to the logfile
    evljob = evl + job + suffix
    print('Running EVL job: ',evljob)
    subprocess.call(evljob, shell=True)

# Function for saving the dataframes into excel file
def save_files():
    print('Saving files')
    data   = source.append(anon)    # Append anonymized data below the source data
    excel = pd.ExcelWriter(output_excel, engine='xlsxwriter')
    config.to_excel(excel,'anon', index = False)    # index=True would create additional 1st column containing row numbers
    data.to_excel(excel,'data', index = False)
    excel.close()

# MAIN


# Define input and output files

config_excel_file  = 'configs/sample_source.xlsx'
config_csv_file   = 'configs/sample_source.csv'
source_file  = 'data/source/addresses.csv'
anon_file      = 'data/anon/addresses.csv'
output_excel = 'blog_data.xlsx'

# Read the files into Dataframes
print('Reading files into dataframes')
config = pd.read_excel(config_excel_file, sheet_name='anon', encoding="utf-8")
source = pd.read_csv(source_file, sep=';', usecols = ['addr_id', 'addr_type_code', 'street', 'city', 'country', 'zip', 'valid_from', 'entry_timestamp'])
anon   = pd.read_csv(anon_file, sep=';',   usecols = ['addr_id', 'addr_type_code', 'street', 'city', 'country', 'zip', 'valid_from', 'entry_timestamp'])

# Calling functions
change_config()        # Function for changing the anonymization function for 
run_evl()            # Function runs EVL Data Anonymization job defined in the config file for 
                    # the file addresses.csv
compare_cities()    # Function for comparing values of the entity city before and after anonymization
save_files()        # Function saves dataframes into excel on 2 sheets

print('Starting excel. Check the changes in both sheets')
os.system('start excel.exe blog_data.xlsx')        # Opens excel

Resources

The source code of the Python program anon.py

Video contains

Showing the contents of the original config file and data source
running the Python program
opening an excel file with changed data and config file

For running the code you need to download the Anonymization trial from the EVL Tool Web and store the Python program into the sample_source folder

Environment

Windows 10
Python 3.8.3
WSL (Windows Subsystem for Linux)
Ubuntu 18.0.4
EVLTool Anonymization Trial