4 Data Generation
Before going into detail, here is an overview of the data generation process.
To set up and build a project (i.e. a group of data you would like to anonymize), follow these steps.
See evl datagen command for details about ‘evl datagen’ commands.
-
Create a new project:
evl datagen project new <project_dir>
See Project for details about projects.
-
Add a source, i.e. a folder with files to be anonymized or a database with tables to be anonymized:
evl datagen source new <source_name> \
--guess-from-csv <path_to_folder_with_such_CSVs>
See Source Settings for details about settings for a source.
-
Edit the generated config (csv) file according to your preferences. (The Excel file checks the validity immediately and provides drop-down options.)
-
Check the config file for mistakes:
evl datagen check <config_file>
-
Generate anonymization jobs and a workflow:
evl datagen build <config_file>
See Build and Run for details about jobs and workflow generation, and see Config File for details about a config file.
Then to anonymize (regularly), run anonymization jobs:
evl run/datagen/<table_1>.evl
evl run/datagen/<file_1>.evl
...
Each job represents one file or table to be anonymized. See Build and Run for details.
Note: Be careful when running anonymization jobs several times, as data in the target are overwritten by default, unless
export EVL_DATAGEN_APPEND=1
is specified in the settings file configs/datagen/*.sh or in project.sh.
See Environment variables for details about all possible configuration EVL_DATAGEN_* variables.
If you have many files or tables to anonymize in one batch, you don’t need to run the anonymization jobs one after another; you can run all of them through the generated workflow:
evl run workflow/datagen/<source_name>.ewf
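The steps above can be sketched as two small shell helpers, one for the setup before the config file is edited by hand and one for everything after. This is only an illustration: the function names are hypothetical, and it assumes that the config file lives at configs/<source_name>.csv relative to the project directory and that the commands are run from inside that directory.

```shell
#!/bin/sh
# Sketch of the overview steps. Function names are hypothetical.
# Assumptions: the config file is configs/<source_name>.csv inside the
# project, and commands are run from inside the project directory.

# Create the project and a source pre-filled from a folder of CSV files.
# $1 = project directory, $2 = source name, $3 = absolute path to CSV folder
datagen_init() {
    project_dir=$1 source_name=$2 csv_dir=$3
    evl datagen project new "$project_dir" || return 1
    (
        cd "$project_dir" || exit 1
        evl datagen source new "$source_name" --guess-from-csv "$csv_dir"
    )
}

# Check the (manually edited) config, build the jobs, run the workflow.
# $1 = project directory, $2 = source name
datagen_build_and_run() {
    project_dir=$1 source_name=$2
    (
        cd "$project_dir" || exit 1
        evl datagen check "configs/$source_name.csv" &&
        evl datagen build "configs/$source_name.csv" &&
        evl run "workflow/datagen/$source_name.ewf"
    )
}
```

A typical use would be to call datagen_init once, edit the config file, and then call datagen_build_and_run whenever the data should be anonymized.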
4.1 evl datagen command
(since EVL 1.0)
To help generate, check and build all the configuration files, there is the ‘evl datagen’ command-line utility.
evl datagen project new <project_dir>
creates new project folder <project_dir> with default folder structure and files inside.
evl datagen project sample <project_dir>
creates new project folder <project_dir> with sample data and configs.
evl datagen source new <source_name>
creates new source <source_name> in current project directory (or in <project_dir>).
With the ‘--guess-from-csv’ option, it guesses data types based on the source csv files.
evl datagen check <config_file>
checks whether <config_file> contains a valid combination of metadata.
evl datagen build <config_file>
generates data-generation jobs based on <config_file> and also a Workflow with all these
jobs.
Synopsis
evl datagen project
( new | sample ) <project_dir>
[-v|--verbose]
evl datagen source new
<source_name>
[-p|--project <project_dir>]
[-g|--guess-from-csv <source_dir>]
[-v|--verbose]
evl datagen check
<config_file>
[-p|--project <project_dir>]
[-v|--verbose]
evl datagen build
<config_file>
[-p|--project <project_dir>]
[--parallel [<parallel_threads>]]
[-v|--verbose]
evl datagen
( --help | --usage | --version )
Options
-p, --project=<project_dir>
if the current directory is not a project directory, the full or relative path to the project can be specified by
<project_dir>
--parallel[=<parallel_threads>]
generate a workflow with jobs parallelized as much as possible. To limit the parallelization,
specify <parallel_threads>, the maximum number of jobs that can run in parallel.
-g, --guess-from-csv=<source_dir>
guess entity names, field names and data types from the source CSV files in <source_dir>
Standard options:
--help
print this help and exit
--usage
print short usage information and exit
-v, --verbose
print to stderr info/debug messages of the component
--version
print version and exit
Environment Variables
The following is the list of all EVL Data Generation variables with their default values. You can change these
values in your ‘~/.evlrc’ file or in the project’s ‘project.sh’.
EVL_DATAGEN_APPEND=0
whether to append to or overwrite target files/tables. Possible values are ‘0’ (overwrite) or ‘1’ (append).
EVL_DATAGEN_EOL=""
which end-of-line to use: Linux (‘\n’), Windows (‘\r\n’) or old Mac (‘\r’). Possible
values are "dos", "mac", or leave empty for Linux EOL.
EVL_DATAGEN_HEADER=1
whether the file has a header and how many lines it spans. Zero means no header.
EVL_CONFIG_EOL=""
whether Linux (‘\n’), Windows (‘\r\n’) or old Mac (‘\r’) end-of-lines are used for
main config CSV file. Possible values are "dos", "mac", or leave empty for Linux EOL.
EVL_CONFIG_FIELD_SEPARATOR=";"
the default field separator used in config files
EVL_DEFAULT_FIELD_SEPARATOR=";"
the default field separator for CSV files. This character may be any of the first 128 ASCII
characters.
EVL_DEFAULT_RECORD_SEPARATOR='\n'
the default record separator for CSV files. This character may be any of the first 128 ASCII
characters. By default a Linux newline is used. To use the Windows end of line (i.e. ‘\r\n’), use the
‘EVL_DATAGEN_EOL’ variable.
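As an illustration, a project.sh (or ~/.evlrc) could override some of these defaults as follows. The variable names and allowed values come from the list above; the particular values chosen here are only an example, not a recommended configuration.

```shell
# Hypothetical excerpt of a project's project.sh (or ~/.evlrc).
# Variable names and allowed values are taken from the list above;
# the chosen values are only an illustration.
export EVL_DATAGEN_APPEND=1             # append to targets instead of overwriting
export EVL_DATAGEN_EOL="dos"            # Windows (\r\n) end-of-lines
export EVL_DATAGEN_HEADER=1             # files have a one-line header
export EVL_DEFAULT_FIELD_SEPARATOR=","  # comma instead of the default ";"
```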
4.2 Project
Consider an anonymization project to be a folder in which we work on the anonymization of some group of data, for example a group of data that belongs together from a business point of view. In most cases there would be only one or a couple of projects.
You can create a new project by hand or by a command:
evl datagen project new my_project
It will create a new directory my_project in the current folder with default settings and subfolder structure.
Or you can create a new project with sample data and configuration:
evl datagen project sample $HOME/my_sample_project
It will create a new directory my_sample_project in your home folder with a sample project.
The anonymization project directory structure is:
build/
files generated by ‘evl datagen build’ command
configs/
configuration csv files and settings sh files
lib/
folder for custom anonymization functions
run/
anonymization jobs generated by ‘evl datagen build’ command
workflow/
workflows generated by ‘evl datagen build’ command
All files in build, run and workflow directories are completely generated based on configuration file(s) configs/<source_name>.csv.
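To see at a glance whether a folder follows this layout, a small helper like the one below can be used. It is a sketch with a hypothetical function name; it only tests for the five subdirectories listed above and assumes nothing about their contents. Note that build/, run/ and workflow/ hold generated files, so they may legitimately be empty or absent before the first build.

```shell
# Report which of the documented project subdirectories exist.
# The function name is hypothetical.
# $1 = project directory (defaults to the current directory)
check_project_layout() {
    project_dir=${1:-.}
    missing=0
    for sub in build configs lib run workflow; do
        if [ -d "$project_dir/$sub" ]; then
            echo "ok       $sub/"
        else
            echo "missing  $sub/"
            missing=$((missing + 1))
        fi
    done
    # Non-zero exit status if any subdirectory is missing.
    [ "$missing" -eq 0 ]
}
```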
4.3 Source Settings
Once we have a project directory, we can add a source, which can be a folder with files or a database.
What should be anonymized, and how, is specified in a config file and a settings file.
The config file can be a csv file, and the settings file is a shell script with variable definitions.
Each source has one config file and one settings file.
To create new, empty config and settings files, run:
evl datagen source new my_source
which creates two files in the current project folder:
configs/my_source.csv
configs/datagen/my_source.sh
To create pre-generated config and settings files based on a folder with source csv files, run:
evl datagen source new my_source --guess-from-csv=data/source
which goes through all csv files in the data/source folder and fills in the config file with entity names (i.e. file names), field names based on headers, data types, and the null flag of each field.
If the current directory is not the project’s one, specify the path to the project by option ‘--project=<project_path>’.
See Config File for detailed information about config files.
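Before running ‘--guess-from-csv’, it can help to preview which entity and field names the source folder is likely to yield. The helper below is a rough sketch, not the tool's actual logic: the function name is hypothetical, and it assumes that the entity name is the file name without the .csv extension, that the first line of each file is the header, and that the separator is a single character (‘;’ by default, matching EVL_DEFAULT_FIELD_SEPARATOR).

```shell
# Print "<entity>: <field>,<field>,..." for every CSV file in a folder.
# The function name is hypothetical; this does not call evl at all.
# $1 = folder with CSV files, $2 = single-character field separator (default ";")
preview_csv_headers() {
    csv_dir=$1
    sep=${2:-";"}
    for file in "$csv_dir"/*.csv; do
        [ -f "$file" ] || continue      # no CSV files: the glob stays unexpanded
        entity=$(basename "$file" .csv)
        # Header line only; drop a trailing \r from Windows end-of-lines,
        # then show the fields comma-separated.
        fields=$(head -n 1 "$file" | tr -d '\r' | tr "$sep" ',')
        echo "$entity: $fields"
    done
}
```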
4.4 Build and Run
For each Entity from the config file, i.e. a table or a file, an anonymization job with mapping and other metadata needs to be built. It is enough to run the command-line utility:
evl datagen build <config_file>
[-p|--project <project_dir>]
[--parallel [<parallel_threads>]]
[-v|--verbose]
This builds all the files in the build/ project subdirectory. There you can find evd and evm files in the appropriate folders. EVD stands for EVL Data definition file; it defines the structure of the source/target: field names, data types and other attributes. EVM stands for EVL Mapping file; it defines how each field is mapped. Although both kinds of files are generated, it is sometimes useful to inspect them for debugging purposes.
It also generates files in the run/datagen/ subdirectory, one evl file per Entity.
These files can then be run to anonymize the data. For example, for three tables party_addr, party_cont and party_rel, the jobs would be started by these commands:
evl run/datagen/party_addr.evl
evl run/datagen/party_cont.evl
evl run/datagen/party_rel.evl
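When the list of jobs is longer, the same commands can be issued in a loop. The following is a minimal sketch with a hypothetical function name; it assumes it is run from the project directory and that evl exits with a non-zero status when a job fails.

```shell
# Run every generated job in run/datagen/ one after another,
# stopping at the first failure. The function name is hypothetical.
# Assumption: evl exits with a non-zero status when a job fails.
run_all_jobs() {
    for job in run/datagen/*.evl; do
        [ -f "$job" ] || continue      # nothing generated yet
        echo "Running $job"
        evl "$job" || { echo "Job failed: $job" >&2; return 1; }
    done
}
```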
Once such an evl file exists for an Entity, there is no need to build the jobs again. On each run, the job checks whether the config file has changed for the given Entity and, if so, runs the ‘evl datagen build’ command automatically.
Note: There is no need to run ‘evl datagen build’ every time the config file is updated. It is done automatically once the job is started.
The build command also generates a workflow file for the given source in the workflow/datagen/ subdirectory. With it you can run the anonymization for all the Entities from that source. For example, for a source defined by configs/some_source.csv, you can run
evl run workflow/datagen/some_source.ewf
and it will run all anonymization jobs in one or several parallel threads, depending on the value given by the --parallel option.
If one or more anonymization jobs in a workflow fail, you can then restart the whole workflow by:
evl restart workflow/datagen/some_source.ewf
or continue from those last failures:
evl continue workflow/datagen/some_source.ewf
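In a scheduled batch, running the workflow and continuing after a failure can be combined in one helper. This is a sketch under assumptions, not a documented feature: the function name is hypothetical, evl is assumed to return a non-zero exit status when the workflow fails, and retrying is assumed to make sense for the failure at hand (for example a temporary outage of the target database).

```shell
# Run the generated workflow; if it fails, continue from the last failures
# a limited number of times. The function name is hypothetical.
# Assumption: evl returns a non-zero exit status when the workflow fails.
# $1 = source name, $2 = maximum number of 'evl continue' attempts (default 2)
run_workflow_with_continue() {
    workflow="workflow/datagen/$1.ewf"
    attempts=${2:-2}
    evl run "$workflow" && return 0
    while [ "$attempts" -gt 0 ]; do
        echo "Workflow failed, continuing from last failures" >&2
        evl continue "$workflow" && return 0
        attempts=$((attempts - 1))
    done
    return 1
}
```

If the failure is caused by the configuration rather than by the environment, fixing the config file and using ‘evl restart’ is likely the better choice.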