
EVL Anon Macros tool

Aug 11, 2020 - Peter Šimečka

EVL Anonymization on Ubuntu for Windows

The Anonymization trial Startguide describes how to easily install and run the anonymization sample in the Ubuntu for Windows environment. Running EVL Anonymization scripts is easy even for users who know just a few shell commands. On the other hand, for shell experts, EVL Anonymization offers many ways to test, automate, and run anonymization jobs and workflows.

EVL Anon Macros tool

The EVL Anon Macros tool serves mainly for testing EVL anonymization features in the Windows environment, for users who want to avoid the command line entirely. It gives them the ability to test anonymization functions from a single Excel file.

Description

We will use the standard trial sample to describe how the tool works; however, it works with any other configuration as well.

Anon sheet

The anon sheet enables changing the anonymization parameters by picking functions from the anon_type list.

Macros sheet

Config file is the name of the .csv config file, and the entity chosen from the list will be used as a parameter when running the anonymization for the data file in the data/source folder.

There are 4 buttons containing Visual Basic macros.

Generate Unique Entities List

Creates the unique list from the entity_name column values in the anon sheet. It’s needed only if you create a new project or add additional entities to the anon sheet.

Save config.csv

Saves the anon sheet into the .csv config file. You need to use this button whenever you make changes in the anon sheet.

Anonymize Entity

Runs evl_anon_macros_run.exe, which calls the appropriate EVL Anonymization job and generates an anonymized file for the chosen entity in the data/anon folder.

Video

The video shows:

  • Opening the EVL_ANON_MACRO.xlsm file and changing the anonymization for the STREET entity from ANON to MASK_RIGHT(4)
  • Switching to the macros sheet and using the three buttons for saving the config file, running anonymization, and generating results
  • Showing the results: changed anonymized data for the STREET entity and the configuration part of the config file for the entity ADDRESSES

See Result

Creates and opens a new Excel file from the source data file, the anonymized data file, and the config file. It consists of two sheets:

Entity_data


Entity_config

Download

Download the zip file and unpack files into your sample_source folder. Open the EVL_ANON_MACRO.xlsm and follow the steps in the video.

Support

The EVL_ANON_MACRO tool is freely available; however, no upgrades, customizations, or troubleshooting are guaranteed. Contact team@evltool.com for assistance.

Environment

Using Python Pandas library

Jul 13, 2020 - Peter Šimečka

The purpose of this blog post is to show the strength of the Python Pandas library in combination with the EVL Tool Anonymization microservice. EVL Tool is a powerful family of services that provide data transformations in several business areas, across many platforms, with huge data volumes.

Python is very efficient at processing tabular data like csv and excel files, Google spreadsheets, and SQL tables. In particular, Python’s Pandas library enables quick creation of various table operations, suitable even for novices.

EVL Tool Anonymization example

The EVL Tool Anonymization microservice anonymizes input data by applying functions declared in an excel configuration file. It is an especially useful example, as the results of anonymization are easy to see. We are going to test some anonymization functions and show the results in several steps:

  • Read input data csv file and configuration excel file into dataframes
  • Change one anonymization function and save the new configuration csv file
  • Call the linux EVL job for running the anonymization
  • Show the comparison of original and anonymized data for one chosen entity on the screen
  • Create a new excel file with two sheets: one with the current configuration and another one containing samples of original and anonymized data
  • Open the excel and let the user view the results

Pandas DataFrame

The Python Pandas library uses the DataFrame for storing the internal representation of the input, output, and configuration data. It offers several functions for reading/writing various data file types, like pd.read_excel and pd.read_csv, and plenty of functions for manipulating the rows and columns of the dataframes. We will be using mainly the positioning functions. Additionally, we are going to use two more Python libraries, subprocess and os, for spawning operating system programs.
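On toy data (the frame below is illustrative only, not the real configuration file), the positioning accessors used throughout this post behave like this:

```python
import pandas as pd

# Toy frame mimicking the anonymization config (illustrative values only)
config = pd.DataFrame({
    'entity_name': ['addr_id', 'street', 'zip', 'city'],
    'anon_type':   ['KEEP', 'ANON', 'KEEP', 'RANDOM'],
})

print(config.loc[3].at['anon_type'])         # RANDOM  (.loc selects row 3, .at reads one cell)
config.at[3, 'anon_type'] = 'MASK_LEFT(4)'   # .at also writes a single cell
print(config.at[3, 'anon_type'])             # MASK_LEFT(4)
```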

Code Examples

The whole program and video are available: see details in the last chapter. Here we will briefly describe some Pandas features using a few examples of the code.

Importing Libraries

These libraries will be needed for performing the operations below.


import pandas as pd       # import Python Pandas Data Analysis library
import subprocess        # import Subprocess module for spawning processes 
import os             # import OS module for operating system interfaces
	

Input and output files definition

Declaration of directories and filenames, relative to the program location; in this case, the main anonymization project folder.


config_excel_file  = 'configs/sample_source.xlsx'
config_csv_file   = 'configs/sample_source.csv'
source_file  = 'data/source/addresses.csv'
anon_file      = 'data/anon/addresses.csv'
output_excel = 'py_anon_blog_data.xlsx'
	

Reading files into the DataFrames

pd.read_excel and pd.read_csv create a dataframe from the chosen file. Parameters define plenty of options like delimiters, column lists, sheet names, etc.


config = pd.read_excel(config_excel_file, sheet_name='anon', encoding="utf-8")
source = pd.read_csv(source_file, sep=';', usecols = ['addr_id', 'addr_type_code', 'street', 'city', 'country', 'zip', 'valid_from', 'entry_timestamp'])
anon   = pd.read_csv(anon_file, sep=';',   usecols = ['addr_id', 'addr_type_code', 'street', 'city', 'country', 'zip', 'valid_from', 'entry_timestamp'])
	

Function: changing anonymization types

config.loc[3].at['anon_type'] identifies the value of the anon_type column in the 4th row (counting from 0). If it equals RANDOM, the function changes it to MASK_LEFT(4). These two values can be changed to any of the available EVL Tool anonymization functions. Then the config DataFrame is exported to the sample_source.csv file.


def change_config():
	if config.loc[3].at['anon_type'] == 'RANDOM':
		config.at[3,'anon_type'] = 'MASK_LEFT(4)'
		config.at[3,'description'] = 'Changed to MASK_LEFT(4) by Python'
# export the excel configuration file to csv configuration file
		config.to_csv(config_csv_file, sep = ';', index = False)  
		print('Changed anon function of the city entity from RANDOM to MASK_LEFT(4)')
	

Function: running EVL job

Construct the EVL anonymization job command and call it via bash.


def run_evl():
    evl = "bash -c '. $HOME/.evlrc; evl run/anon/sample_source."    # starting bash, setting environment
    job = 'addresses'                               # set the anonymized entity and the appropriate evl job
    suffix = ".evl 1>py_blog.log 2>&1'"             # add the job suffix .evl and redirect errors and output to the logfile
    evljob = evl + job + suffix
    print('Running EVL job: ', evljob)
    subprocess.call(evljob, shell=True)
	

Function: comparing original and anonymized data

Comparing the original (source) city and the anonymized (anon) city for all rows. source.loc[i].at['city'] contains the value of the cell in row i, column 'city'.


def compare_cities():
	print('Compare city before and after anonymization')
	for i in range(len(source)):
		print('\t' + source.loc[i].at['city'] + '\t\t\t' + anon.loc[i].at['city'])
	

Function: save DataFrames to the appropriate files


def save_files():
	print('Saving files')
	data  = pd.concat([source, anon])    # Append anonymized data below the source data (DataFrame.append was removed in pandas 2.0)
	excel = pd.ExcelWriter(output_excel, engine='xlsxwriter')
	config.to_excel(excel, 'anon', index=False)
# index=True would create an additional 1st column containing row numbers
	data.to_excel(excel, 'data', index=False)
	excel.close()
	

Calling the functions


# Calling functions
change_config()	# Function for changing the anonymization function for the entity city
run_evl()			# Function runs the EVL Anonymization defined in the config file for
            		# the file addresses.csv
compare_cities()	# Function for comparing values of the entity city before and after anonymization
save_files()		# Function saves dataframes into excel on 2 sheets
	

Open the excel file


print('Starting excel. Check the changes in both sheets')
start_excel = 'start excel.exe ' + output_excel
os.system(start_excel)        # Opens excel file containing new config and data
	

Whole program code


# This program shows some features, mainly of the Python Pandas module, for working
# with DataFrames, excel and csv files, using the EVL anonymization example.
# Steps:
#   reads the excel configuration file and the input csv file
#   runs the EVL Anonymization program for anonymizing the input file according to
#   the configuration file
#   creates a new excel with the configuration on the 1st sheet and merged original
#   and anonymized data on the second sheet
#   opens excel for checking the result

# Import necessary Python modules

import pandas as pd       # import Python Pandas Data Analysis library
import subprocess        # import Subprocess module for spawning processes 
import os                 # import OS module for operating system interfaces

# Change the anonymization for entity = city (4th row, numbering starting with 0) if the current anonymization parameter is RANDOM
def change_config():
    if config.loc[3].at['anon_type'] == 'RANDOM':
        config.at[3,'anon_type'] = 'MASK_LEFT(4)'
        config.at[3,'description'] = 'Changed to MASK_LEFT(4) by Python'
# export the excel configuration file to csv configuration file
        config.to_csv(config_csv_file, sep = ';', index = False)  
        print('Changed anon function of the city entity from RANDOM to MASK_LEFT(4)')

def compare_cities():
    i = 0
    print('Compare city before and after anonymization')
    for i in range(len(source)):
        print('\t' + source.loc[i].at['city']+'\t\t\t' + anon.loc[i].at['city'])

# Functions for running EVL anonymization program
def run_evl():
    evl = "bash -c '. $HOME/.evlrc; evl run/anon/sample_source."    # starting bash, setting environment
    job = 'addresses'                               # set the anonymized entity and the appropriate evl job
    suffix = ".evl 1>py_blog.log 2>&1'"             # add the job suffix .evl and redirect errors and output to the logfile
    evljob = evl + job + suffix
    print('Running EVL job: ',evljob)
    subprocess.call(evljob, shell=True)

# Function for saving the dataframes into excel file
def save_files():
    print('Saving files')
    data   = pd.concat([source, anon])    # Append anonymized data below the source data (DataFrame.append was removed in pandas 2.0)
    excel = pd.ExcelWriter(output_excel, engine='xlsxwriter')
    config.to_excel(excel,'anon', index = False)    # index=True would create additional 1st column containing row numbers
    data.to_excel(excel,'data', index = False)
    excel.close()

# MAIN


# Define input and output files

config_excel_file  = 'configs/sample_source.xlsx'
config_csv_file   = 'configs/sample_source.csv'
source_file  = 'data/source/addresses.csv'
anon_file      = 'data/anon/addresses.csv'
output_excel = 'blog_data.xlsx'

# Read the files into Dataframes
print('Reading files into dataframes')
config = pd.read_excel(config_excel_file, sheet_name='anon', encoding="utf-8")
source = pd.read_csv(source_file, sep=';', usecols = ['addr_id', 'addr_type_code', 'street', 'city', 'country', 'zip', 'valid_from', 'entry_timestamp'])
anon   = pd.read_csv(anon_file, sep=';',   usecols = ['addr_id', 'addr_type_code', 'street', 'city', 'country', 'zip', 'valid_from', 'entry_timestamp'])

# Calling functions
change_config()        # change the anonymization function for the entity city
run_evl()              # run the EVL Anonymization defined in the config file
                       # for the file addresses.csv
compare_cities()       # compare values of the entity city before and after anonymization
save_files()           # save dataframes into excel on 2 sheets

print('Starting excel. Check the changes in both sheets')
os.system('start excel.exe blog_data.xlsx')        # Opens excel

	

Resources

The source code of the Python program anon.py

The video shows:

  • The contents of the original config file and data source
  • Running the Python program
  • Opening an excel file with the changed data and config file

To run the code, you need to download the Anonymization trial from the EVL Tool web and store the Python program in the sample_source folder.

Environment

EVL Tool version 2.3 is out!

Apr 20, 2020 - Jan Štěpnička

Version 2.3 of EVL Tool brings new features:

New features:

  • EVL Manager – various GUI enhancements
  • ‘Read’ and ‘Write’ components support (next to ‘hdfs://’) also ‘sftp://’, ‘gs://’, and ‘s3://’ URIs
  • next to GCC, also Clang compiler can be used
  • DEBUG mode

New utilities:

  • ‘qvd_header’ – get QVD’s header XML and provide various information, in JSON, XML or EVD

New components:

  • ‘Readqvd’ – read Qlik’s QVD file

New commands:

  • Chmod
  • Chown
  • Rmdir

EVL Tool version 2.2 is out!

Oct 28, 2019 - Jan Štěpnička

Version 2.2 of EVL Tool brings new features:

New features:

  • EVL Manager – initial version of monitoring tool with web UI
  • Full Unicode support (ICU)
  • Helpers: ‘csv2evd’, ‘csv2qvd’

New utilities:

  • csv2evd
  • csv2qvd

New components:

  • ‘Readxls’ – read (oldstyle) Excel sheets
  • ‘Readxlsx’ – read Excel sheets
  • ‘Tac’ – write records in reverse order
  • ‘Writeqvd’ – write Qlik’s QVD file

Anonymization Terms and Techniques

Jul 15, 2019 - Jan Štěpnička

Generally, anonymization means the conversion of personal data into anonymized data using various anonymization techniques. The EU GDPR regulation lays down quite clearly the rules relating to the protection of natural persons with regard to the processing of personal data. All companies dealing with personal data of EU citizens have to respect those rules.

Anonymization terms and properties

Pseudonymization – GDPR defines the term pseudonymization: anonymized personal data that cannot be attributed to a specific data subject without the use of additional information. It means there is still a possibility to re-identify the original data from the additional information, which should be kept separately.

Reversibility – whether it is possible to recover the original value, or at least how difficult it is. This differs from pseudonymization, because here it is assumed you do not have the (secret) additional information. In other words: whether it is possible to break the algorithm.

Repeatability – the same value will be anonymized to the same anonymized value again, i.e. a rerun of the anonymization will always produce the same result, and a given value will be replaced by the same anonymized value across different tables. This is usually very important when anonymizing IDs.

Uniqueness – two different original values will be anonymized into two different anonymized values. In other words, such anonymization is a bijective function, i.e. a one-to-one correspondence. This is quite important when anonymizing IDs, for example.

Preserve data type – anonymized values keep their data type. An integer, for example, cannot be anonymized into a string, and an anonymized timestamp must again be a valid timestamp.

Preserve length – anonymized values are of the same length, or keep the maximal length of the given data type. This is very important when anonymizing for testing purposes.

Salt is an arbitrary additional string or value added to the given data, usually before computing a checksum. It is necessary, for example, when creating a hash of some short or otherwise bounded value, e.g. a phone number. Knowing that some hash was made by sha256 from a phone number, one can use sha256 to produce a translation table of all possible values. But when a salt, like the string "M8SKC7WOFP975WUS", is added to the phone number, then without knowing this salt the usual sha256 checksum is not feasible to revert.
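In Python, a salted checksum can be sketched with the standard hashlib module; the salt below is the example string from the text, and hash_phone is a hypothetical helper:

```python
import hashlib

SALT = 'M8SKC7WOFP975WUS'   # the example salt from the text; keep it secret in practice

def hash_phone(phone: str, salt: str = SALT) -> str:
    """Salted sha256 of a phone number; without the salt, a brute-force
    translation table over all possible phone numbers no longer works."""
    return hashlib.sha256((salt + phone).encode('utf-8')).hexdigest()

# Repeatable: the same input always yields the same digest
print(hash_phone('+420123456789'))
# An unsalted hash of the same number is a different value entirely
assert hash_phone('+420123456789') != hashlib.sha256('+420123456789'.encode()).hexdigest()
```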

Choosing the appropriate anonymization approach and techniques highly depends on the purpose and type of processing of the personal data, e.g. running production systems, archiving, or providing data to partners or development teams. The responsibility lies with the data controller (the natural or legal person, public authority, agency ...) who determines the purposes and means of the processing of personal data.

Anonymization Techniques

Scrambling – a permutation of letters. Quite often, though, it is possible to recover the original data. Example:

Peter Sellers ---> Teepr Resells
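A toy scrambler that permutes the letters of each word (the fixed seed is only so the demo is deterministic; the helper name is illustrative):

```python
import random

def scramble(name: str, seed: int = 42) -> str:
    """Permute the letters within each word of a name."""
    rnd = random.Random(seed)
    words = []
    for word in name.split():
        letters = list(word)
        rnd.shuffle(letters)
        words.append(''.join(letters))
    return ' '.join(words)

print(scramble('Peter Sellers'))
```

Note that the output is an anagram of the input, which is exactly why scrambling is often reversible.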

Shuffling – permutes values within a whole column. Example of shuffling IDs:

id  name                            id  name
2   Pierre Richard                  5   Pierre Richard
3   Richard Matthew Stallman   →    2   Richard Matthew Stallman
5   Donald E. Knuth                 3   Donald E. Knuth

In general this technique is not repeatable, but it is a bijection.
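With Pandas, shuffling a single column can be sketched as follows (toy data from the example above; random_state is fixed only for the demo, a real run would use a fresh random state, which is what makes it non-repeatable):

```python
import pandas as pd

df = pd.DataFrame({
    'id':   [2, 3, 5],
    'name': ['Pierre Richard', 'Richard Matthew Stallman', 'Donald E. Knuth'],
})

# Shuffle the id column while leaving names in place: the shuffled column holds
# exactly the original values (a bijection); .to_numpy() drops the index so the
# assignment does not re-align back to the original order
df['id'] = df['id'].sample(frac=1, random_state=1).to_numpy()
print(df)
```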

Randomization – simply replaces the original value by a random one. Example:

1st run: Michael Raynolds ---> 5dtZ4twxx7896avkf78ad+0p
2nd run: Michael Raynolds ---> 6shk8t9we6fgos7rthj98d

It is clear that randomization is neither repeatable nor a bijective function.
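A sketch using Python's standard secrets module; the token format is arbitrary and bears no relation to the original value:

```python
import secrets

def randomize(value: str) -> str:
    """Replace the value by a fresh random token; the input is ignored."""
    return secrets.token_urlsafe(18)

first  = randomize('Michael Raynolds')
second = randomize('Michael Raynolds')
print(first, second)   # two different tokens: not repeatable, not bijective
```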

Encryption – uses a key to encrypt the original value. The key must then be kept secret, or can be deleted immediately, depending on whether we’d like to be able to decrypt the data or not.
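As a sketch of the role of the key, here is a toy repeating-key XOR. This is not a real cipher; production encryption should use a vetted algorithm (e.g. AES via a proper crypto library):

```python
def xor_crypt(data: bytes, key: bytes) -> bytes:
    """Toy repeating-key XOR, only to illustrate the role of the key."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

key = b'secret-key'
cipher = xor_crypt(b'Michael Raynolds', key)
# With the key the operation is reversible; delete the key and it is not
assert xor_crypt(cipher, key) == b'Michael Raynolds'
```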

Masking – allows an important/unique part of the data to be hidden with random characters or other data. For example a credit card number:

9370 4442 9037 4197 ---> **** **** **** 4197

The advantage of masking is the ability to recognize data without exposing the actual identities.
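A hypothetical masking helper (not the actual EVL MASK_* implementation) that keeps the last four digits and preserves separators could look like this:

```python
def mask_card(number: str, visible: int = 4, mask_char: str = '*') -> str:
    """Mask all digits except the last `visible` ones, keeping separators."""
    digits_to_mask = sum(c.isdigit() for c in number) - visible
    out = []
    for c in number:
        if c.isdigit() and digits_to_mask > 0:
            out.append(mask_char)
            digits_to_mask -= 1
        else:
            out.append(c)
    return ''.join(out)

print(mask_card('9370 4442 9037 4197'))   # **** **** **** 4197
```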

Tokenization – replaces sensitive data with non-sensitive substitutes, referred to as tokens, usually stored in a secret mapping table. Tokenization keeps the data type, and usually also the length of the data, so it can be processed by legacy systems that are sensitive to data length and type. Specific data is kept fully or partially visible for processing and analytics, while sensitive information is kept hidden.
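A minimal in-memory sketch of a token vault (hypothetical; a real system would persist the mapping table in a secured store):

```python
import itertools

class Tokenizer:
    """Minimal token vault: repeatable, reversible only via the secret mapping."""
    def __init__(self):
        self._counter = itertools.count(1)
        self._vault = {}    # token -> original value (must be kept secret)
        self._tokens = {}   # original value -> token (for repeatability)

    def tokenize(self, value: str) -> str:
        if value not in self._tokens:
            token = 'TOK%012d' % next(self._counter)
            self._tokens[value] = token
            self._vault[token] = value
        return self._tokens[value]

    def detokenize(self, token: str) -> str:
        return self._vault[token]

t = Tokenizer()
tok = t.tokenize('9370 4442 9037 4197')
print(tok)                                         # TOK000000000001
assert t.tokenize('9370 4442 9037 4197') == tok    # repeatable
assert t.detokenize(tok) == '9370 4442 9037 4197'  # reversible via the vault only
```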

Table with overview of anonymization techniques

technique                                | revertible | uniqueness           | pseudonymization                 | repeatable           | preserves data type | preserves length
-----------------------------------------|------------|----------------------|----------------------------------|----------------------|---------------------|-----------------
scrambling                               | often yes  | not                  | yes                              | depends on algorithm | yes                 | yes
shuffling                                | not        | yes                  | not                              | depends on algorithm | yes                 | yes
randomization                            | not        | not                  | not                              | not                  | yes                 | yes
encryption                               | not *)     | not                  | yes, but that’s the purpose here | yes                  | not                 | not
masking                                  | not        | not                  | not                              | yes                  | mostly yes          | yes
checksum                                 | often yes  | almost yes           | yes                              | yes                  | not                 | not
salted checksum                          | not *)     | almost yes           | not                              | yes                  | not                 | not
tokenization                             | not *)     | depends on algorithm | yes                              | yes                  | yes                 | yes
data type preserving anonymization       | not *)     | almost yes           | not                              | yes                  | yes                 | yes
data type preserving unique anonymization| not *)     | yes                  | not                              | yes                  | yes                 | yes

*) Additional information must be kept secret somewhere, either permanently or temporarily: a token table, an encryption key, or a salt.

EVL Tool version 2.1 is out!

Apr 26, 2019 - Jan Štěpnička

Version 2.1 of EVL Tool brings new features:

New features:

  • EVL Workflow enhanced
  • Microservices – EVL parts bundled for specific purposes
  • Oracle connectivity
  • String manipulation functions added (‘str_pad_left’, ‘str_pad_right’)
  • Docker images available
  • New outputs available: XLSX, QVX

New components:

  • ReadOra
  • RunOra
  • WriteOra
  • ‘Writeqvx’ – Write QlikView’s QVX file
  • ‘Writexlsx’ – Write Excel sheets

New commands:

  • ‘Mail’ – to make e-mailing easier
  • ‘Test’ – to handle the test command also on HDFS

EVL Tool version 2.0 is out!

Nov 05, 2018 - Jan Štěpnička

Version 2.0 of EVL Tool brings a lot of new features:

New features:

  • EVL Job Manager renamed to EVL Workflow
  • Man pages
  • Standalone components (i.e. components can be used from command-line)
  • Handy mapping functions applied for group of columns – ‘in_to_out()’
  • PostgreSQL connectivity
  • String manipulation functions added:
    ‘length’, ‘starts_with’, ‘ends_with’, ‘substr’,
    ‘hex_to_str’, ‘str_to_hex’,
    ‘str_compress’, ‘str_uncompress’,
    ‘str_index’, ‘str_rindex’
  • Shared lookup tables – one lookup can be used in several mappings
  • Sort within a group – sort already grouped data
  • Read/Write Parquet files
  • Incremental unique Run ID

New utilities:

  • evl_increment_run_id

New components:

  • Echo
  • Lookup
  • Readparq
  • ReadPG
  • RunPG
  • Sortgroup
  • Writeparq
  • WritePG

New commands:

  • ‘Fr’ – to handle File Registration
  • Ls
  • Rm
  • ‘Spark’ – to run Spark jobs

EVL Tool version 1.3 is out!

Mar 20, 2018 - Jan Štěpnička

After the successful version 1.2, EVL Tool brings its users new features in the newest version 1.3. What does it bring?

New features:

  • Read/write XML
  • Sample data generator
  • New string manipulation functions

New components:

  • Generate
  • Readxml
  • Writexml

New commands:

  • Cp

Interested? You can get this latest version from GitHub.

EVL Tool goes to Silicon Valley

Feb 14, 2018 - Petr Horčička

What would you say to a three month stay in a foreign country to develop business and gain new experiences? We say yes – thanks to a chance from Czech Accelerator. The acceleration program offers a three month incubator full of mentoring, counseling, workshops, networking events, and other services. This is a chance to kick off one’s business while gaining invaluable experiences.

EVL Tool, as well as other projects, applied for this unique opportunity in Silicon Valley, and after a demanding and critical jury evaluation, our team was chosen to take part in this program. Starting mid-April, we will be developing our potential in California for three months. We believe it will expand our horizons and bring us a lot of useful insights and skills. Our team is really excited and cannot wait to share this experience with you. During our stay, we will regularly inform you about EVL Tool news. Stay with us!

EVL Tool version 1.2 is out!

Jan 29, 2018 - Jan Štěpnička

EVL Tool brings new features in the newest 1.2 version:

New features:

  • Nested fields
  • Partitioning and parallelism
  • Read/write JSON files with full nested fields support
  • Read/write Avro files with full nested fields support

New utilities:

  • evd_remove_comments
  • evd_to_avro_schema

New components:

  • ‘Assign’ – assign the content of the flow into a variable
  • Depart
  • Gather
  • Merge
  • Partition
  • Readjson
  • ‘Run’ – include custom command
  • ‘Watcher’ – makes debugging easier
  • Writejson

Bugfixes: Fixed an issue with the standard C++ library that, very rarely and only on some systems, caused memory/data corruption when copying partially overlapping memory regions.

EVL Tool version 1.1 is out!

Nov 07, 2017 - Jan Štěpnička

New features in version 1.1:

New features:

  • Avro data format support (only flat structures)
  • Produce/consume Kafka data stream
  • Teradata FastExport and FastLoad integration
  • Validation functions within mapping

New components:

  • Head
  • Readavro
  • Readkafka
  • ReadTD
  • Tail
  • Validate
  • Writeavro
  • Writekafka
  • WriteTD

EVL Tool version 1.0 is out!

Aug 01, 2017 - Jan Štěpnička

First official version released! After more than a year of design and development, and after the first industry implementation at T-Mobile CZ.

New features:

  • Lookup tables – lookups loaded into memory and used in mappings.
  • Checksum functions – standard checksum functions for strings: ‘md5’, ‘sha224’, ‘sha256’, ‘sha384’, ‘sha512’.
  • HDFS support
  • Spark code generation – Parquet and Impala integration
  • Job Manager

New components:

  • ‘Aggreg’ – do aggregation for groups of records.
  • ‘Cat’ – concatenate several input flows into a single output one.
  • ‘Comp’ – use a custom component, which is actually another job.
  • ‘Cut’ – omit fields from input by the output data definition.
  • ‘Filter’ – for a simple one- or two-way switch. For more complex cases use ‘Map’.
  • ‘Join’ – join two input flows by the key. Catch left/right or even unmatched records.
  • ‘Map’ – transform input fields and write into output fields.
  • ‘Read’ – read file(s) into the output flow, uncompressing if needed.
  • ‘Sort’ – sort, deduplicate, check sort; simply put, the output is always sorted by the key.
  • ‘Tee’ – replicate one input flow to several output ones.
  • ‘Trash’ – like /dev/null.
  • ‘Write’ – write the flow into a file, compressing if needed.

New commands:

  • Mkdir
  • Mv