4 Anonymization
Before going into detail, here is an overview of the anonymization process.
To initiate, set up and build a project (i.e. a group of data you would like to anonymize), follow these steps.
See evl anon command for details about ‘evl anon’ commands.
- Create a new project:
  evl anon project new <project_dir>
  See Project for details about projects.
- Add a source, i.e. a folder with files to be anonymized or a database with tables to be anonymized:
  evl anon source new <source_name> \
  --guess-from-csv <path_to_folder_with_such_CSVs>
  See Source Settings for details about settings for a source.
- Edit the config (csv) file according to your preferences. (Excel file checks the validity immediately and provides drop-down options.)
- Check the config file for mistakes:
  evl anon check <config_file>
- Generate anonymization jobs and workflow:
  evl anon build <config_file>
  See Build and Run for details about jobs and workflow generation and see Config File for details about a config file.
Then to anonymize (regularly), run anonymization jobs:
evl run/anon/<table_1>.evl
evl run/anon/<file_1>.evl
...
Each job represents one file or table to be anonymized. See Build and Run for details.
Note: Be careful when running anonymization jobs several times, as data in the target are overwritten by default, unless
export EVL_ANON_APPEND=1
is specified in a settings file configs/anon/*.sh or in project.sh.
See Environment variables for details about all possible configuration EVL_ANON_* variables.
When there are many files or tables to anonymize in one batch, you don’t need to run the anonymization jobs one after another; you can run all of them through the generated workflow:
evl run workflow/anon/<source_name>.ewf
See Salt for dealing with a salt.
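The overview steps above can be sketched as two small shell helpers. This is only an illustration: the function names anon_init and anon_build are hypothetical, the config file is assumed to be configs/<source_name>.csv, and absolute paths are assumed for the project and csv folders. Only the evl commands themselves come from this manual.

```shell
#!/bin/sh
# Hypothetical helpers; only the evl commands are taken from this manual.

# Create the project and a source whose config is pre-generated from csv files.
anon_init() {
    project_dir=$1
    source_name=$2
    csv_dir=$3
    evl anon project new "$project_dir" || return 1
    evl anon source new "$source_name" \
        --project "$project_dir" \
        --guess-from-csv "$csv_dir"
}

# After editing the config by hand: check it, then build jobs and workflow.
# Assumes the config is configs/<source_name>.csv inside the project.
anon_build() {
    project_dir=$1
    source_name=$2
    config="configs/$source_name.csv"
    (
        cd "$project_dir" || exit 1
        evl anon check "$config" && evl anon build "$config"
    )
}
```

A possible use would be to call anon_init /data/my_project my_source /data/source_csv, edit configs/my_source.csv, and then call anon_build /data/my_project my_source. The build is skipped when the check fails.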
4.1 evl anon command
(since EVL 1.0)
The ‘evl anon’ command line utility helps to generate, check and build all the configuration files.
evl anon project new <project_dir>
creates new project folder <project_dir> with default folder structure and files inside.
evl anon project sample <project_dir>
creates new project folder <project_dir> with sample data and configs.
evl anon source new <source_name>
creates new source <source_name> in the current project directory (or in <project_dir>).
With the ‘--guess-from-csv’ option, it guesses data types based on source csv files.
evl anon salt regenerate
generates a new salt, or regenerates an existing one, for the given <project_dir>. The path to the salt is
taken from the ‘EVL_ANON_SALT_PATH’ variable in the ‘project.sh’ file. When no
<project_dir> is specified, the current directory is assumed to be the project directory.
evl anon check <config_file>
checks whether <config_file> contains a valid combination of metadata.
evl anon build <config_file>
generates anonymization jobs based on <config_file> and also a Workflow with all these jobs.
Synopsis
evl anon project
( new | sample ) <project_dir>
[-v|--verbose]
evl anon source new
<source_name>
[-p|--project <project_dir>]
[-g|--guess-from-csv <source_dir>]
[-v|--verbose]
evl anon salt regenerate
[-p|--project <project_dir>]
[-v|--verbose]
evl anon check
<config_file>
[-p|--project <project_dir>]
[-v|--verbose]
evl anon build
<config_file>
[-p|--project <project_dir>]
[--parallel [<parallel_threads>]]
[-v|--verbose]
evl anon
( --help | --usage | --version )
Options
-p, --project=<project_dir>
if the current directory is not a project’s one, full or relative path can be specified by
<project_dir>
--parallel[=<parallel_threads>]
generate workflow with jobs parallelized as much as possible. To limit this parallelization,
<parallel_threads> can be specified, which is the maximum number of jobs that can run in parallel.
-g, --guess-from-csv=<source_dir>
pre-generate the config file from the csv files in <source_dir>: entity names (i.e. file names), field names based on headers, data types and null flags
Standard options:
--help
print this help and exit
--usage
print short usage information and exit
-v, --verbose
print to stderr info/debug messages of the component
--version
print version and exit
Environment Variables
The following is the list of all EVL Data Anonymization variables with their default values. These
values can be changed in the user’s ‘~/.evlrc’ file or in the project’s ‘project.sh’.
EVL_ANON_APPEND=0
whether to append to or overwrite target files/tables. Possible values are ‘0’ (overwrite) or ‘1’ (append).
EVL_ANON_EOL=""
which end-of-line to use: Linux (‘\n’), Windows (‘\r\n’) or old Mac (‘\r’). Possible
values are "dos", "mac", or empty for Linux EOL.
EVL_ANON_HEADER=1
number of header lines in a file. Zero means no header.
EVL_ANON_SALT_PATH=".salt"
path to a salt. It is strongly recommended to have this file with 600 permissions.
EVL_ANON_TOKEN_DIR=".token"
token tables (files) directory. It is strongly recommended to keep this folder secret, i.e. with 700
permissions.
EVL_CONFIG_FIELD_SEPARATOR=";"
the default field separator used in config files
EVL_DEFAULT_FIELD_SEPARATOR=";"
the default field separator for CSV files. This character may be any one of the first 128 ASCII
characters.
EVL_DEFAULT_RECORD_SEPARATOR='\n'
the default record separator for CSV files. This character may be any one of the first 128 ASCII
characters. By default a Linux newline is used. To use Windows end of line (i.e. ‘\r\n’), use
the ‘EVL_ANON_EOL’ variable.
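A minimal sketch of how a few of these variables might be overridden in ‘project.sh’. The chosen values are examples only, not recommendations, and the comments paraphrase the descriptions above.

```shell
# Example overrides for project.sh (values are illustrative).
export EVL_ANON_APPEND=1                # append to targets instead of overwriting
export EVL_ANON_EOL="dos"               # use Windows (\r\n) end-of-lines
export EVL_ANON_HEADER=1                # files have one header line
export EVL_DEFAULT_FIELD_SEPARATOR=","  # comma instead of the default ";"
```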
4.2 Project
Consider an anonymization project to be a folder in which we work on the anonymization of some group of data, for example a group of data that belongs together from a business point of view. In most cases there would be only one or a couple of projects.
You can create a new project by hand or by a command:
evl anon project new my_project
It creates a new directory my_project in the current folder with default settings and subfolder structure.
Or you can create a new project with sample data and configuration:
evl anon project sample $HOME/my_sample_project
It creates a new directory my_sample_project in your home folder with a sample project.
The anonymization project directory structure is:
build/
files generated by ‘evl anon build’ command
configs/
configuration csv files and settings sh files
lib/
folder for custom anonymization functions
run/
anonymization jobs generated by ‘evl anon build’ command
workflow/
workflows generated by ‘evl anon build’ command
All files in build, run and workflow directories are completely generated based on configuration file(s) configs/<source_name>.csv.
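As a rough illustration of this layout, the following sketch tests whether a folder contains the subdirectories listed above. The function name is_anon_project is hypothetical, and recognizing a project by these folders alone is an assumption; evl itself may use other criteria.

```shell
# Return 0 if <dir> (default: current directory) has the default project subfolders.
# Hypothetical helper; not part of evl.
is_anon_project() {
    dir=${1:-.}
    for sub in build configs lib run workflow; do
        [ -d "$dir/$sub" ] || { echo "missing: $dir/$sub" >&2; return 1; }
    done
}
```

For example, is_anon_project my_project && echo "looks like a project".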
4.3 Source Settings
Once we have a project directory, we can add a source, which can be a folder with files or a database.
What should be anonymized, and how, is specified in a config file and a setting file.
The config file can be a csv file and the setting file is a shell script with variable definitions.
Each source has one config file and one setting file.
To create new empty config and setting files, run:
evl anon source new my_source
which creates two files in the current project folder:
configs/my_source.csv
configs/anon/my_source.sh
To create pre-generated config and setting files, based on a folder with source csv files, run:
evl anon source new my_source --guess-from-csv=data/source
which goes through all csv files in the data/source folder and fills in the config file with entity names (i.e. file names), field names based on headers, data types and the null flag of each field.
If the current directory is not the project’s one, specify the path to the project by option ‘--project=<project_path>’.
See Config File for detailed information about config files.
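To give an idea of the header-based part of ‘--guess-from-csv’, the sketch below lists one "<entity>;<field>" pair per header field for every csv file in a folder. It is not how evl implements the option: the function name list_csv_fields is hypothetical, the entity name is assumed to be the file name without the .csv extension, a single header line and the default ";" separator are assumed, and data types and null flags are not guessed here.

```shell
# Print "<entity>;<field>" for every header field of every csv file in a folder.
# Hypothetical helper; the second argument is the field separator (default ";").
list_csv_fields() {
    src_dir=$1
    sep=${2:-;}
    for f in "$src_dir"/*.csv; do
        [ -f "$f" ] || continue
        entity=$(basename "$f" .csv)
        # First line only; drop a Windows carriage return; one field per line.
        head -n 1 "$f" | tr -d '\r' | tr "$sep" '\n' |
        while IFS= read -r field || [ -n "$field" ]; do
            printf '%s;%s\n' "$entity" "$field"
        done
    done
}
```

For example, list_csv_fields data/source prints the field list that one would expect to find pre-filled in the config file.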
4.4 Build and Run
For each Entity from the config file, i.e. table or file, an anonymization job with mapping and other metadata needs to be built. It is enough to run the command line utility
evl anon build <config_file>
[-p|--project <project_dir>]
[--parallel [<parallel_threads>]]
[-v|--verbose]
This builds all the files in the build/ project subdirectory. There you can find evd and evm files in appropriate folders. EVD means EVL Data definition file and it defines the structure of the source/target: field names, data types and other attributes. EVM means EVL Mapping file and it defines how each field is mapped. Although both these files are generated, it is sometimes good to check what they look like for debugging purposes.
It also generates files in the run/anon/ subdirectory, one evl file per Entity.
These files can then be run to anonymize the data. For example, for three tables, party_addr, party_cont and party_rel, the anonymization would be fired by these commands:
evl run/anon/party_addr.evl
evl run/anon/party_cont.evl
evl run/anon/party_rel.evl
Once such an evl file exists for an Entity, there is no need to build jobs again. On each run, the job checks whether the config file has changed for the given Entity and, if so, runs the ‘evl anon build’ command automatically.
Note: There is no need to run ‘evl anon build’ every time the config file is updated. It is done automatically once the job is fired.
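When only a few Entities need to be anonymized, the job commands above can be wrapped in a small loop. The following is a sketch under the assumption that it is started from the project directory; the function name run_anon_jobs is hypothetical.

```shell
# Run the anonymization jobs for the given Entities one after another.
# Continues after a failure and returns 1 if any job failed.
# Hypothetical helper; assumes the current directory is the project directory.
run_anon_jobs() {
    failed=0
    for entity in "$@"; do
        evl "run/anon/$entity.evl" || { echo "failed: $entity" >&2; failed=1; }
    done
    return "$failed"
}
```

For example, run_anon_jobs party_addr party_cont party_rel runs the three jobs shown above in sequence.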
The build command also generates a workflow file for the given source in the workflow/anon/ subdirectory. With it you can run the anonymization for all the Entities from that source. For example, having a source defined by configs/some_source.csv, you can run
evl run workflow/anon/some_source.ewf
and it will run all anonymization jobs in one or several parallel threads, depending on the value given by the --parallel option at build time.
If one or more anonymization jobs in a workflow fail, you can restart the whole workflow by:
evl restart workflow/anon/some_source.ewf
or continue from those last failures:
evl continue workflow/anon/some_source.ewf
4.5 Salt
A so-called salt is used in anonymization functions.
By default this salt is stored in the .salt file in the project directory and must have permissions 600.
The path to this file can be configured in the project.sh setting file by the EVL_ANON_SALT_PATH variable.
To generate a new salt, or regenerate an existing one, for given <project_dir> (or current folder),
run the command line utility
evl anon salt regenerate [<project_dir>] [-v|--verbose]
or click on the Regenerate Salt button in the Anonymization view in the EVL Manager graphical user interface.
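Since the salt file must have permissions 600, a quick check before running jobs can be useful. The sketch below is an illustration only: the function name check_salt_perms is hypothetical, and a relative EVL_ANON_SALT_PATH is assumed to be resolved against the project directory.

```shell
# Verify that the salt file exists and is readable/writable by its owner only.
# Hypothetical helper; returns 2 if the file is missing, 1 on wrong permissions.
check_salt_perms() {
    project_dir=${1:-.}
    salt_path=${EVL_ANON_SALT_PATH:-.salt}
    case $salt_path in
        /*) salt=$salt_path ;;                # absolute path: use as is
        *)  salt=$project_dir/$salt_path ;;   # assumption: relative to the project
    esac
    [ -f "$salt" ] || { echo "no salt file at $salt" >&2; return 2; }
    # The first ten characters of "ls -ld" output are the file type and mode bits.
    mode=$(ls -ld "$salt" | cut -c1-10)
    [ "$mode" = "-rw-------" ] || {
        echo "$salt has mode $mode, expected -rw-------" >&2
        return 1
    }
}
```

If the check fails because of the mode, chmod 600 on the salt file restores the required permissions.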