Version: 2.8

4 Data Generation

Before we go into detail, let’s provide an overview of the data generation process.

To initialize, set up, and build a project (i.e. a group of data you would like to anonymize), follow these steps. See evl datahub command for details about ‘evl datahub’ commands.

  1. Create a new project

    evl datahub project new <project_dir>

    See Project for details about projects.

  2. Add a source, i.e. a folder with files to be anonymized or a database with tables to be anonymized:

    evl datahub source new <source_name> \
    --guess-from-csv <path_to_folder_with_such_CSVs>

    See Source Settings for details about settings for a source.

  3. Edit the generated config (CSV) file according to your preferences. (The Excel file checks validity immediately and provides drop-down options.)

  4. Check the config file for mistakes

    evl datahub check <config_file>
  5. Generate anonymization jobs and workflow

    evl datahub build <config_file>

    See Build and Run for details about jobs and workflow generation and see Config File for details about a config file.

Then to anonymize (regularly), run anonymization jobs:

evl run/datahub/<table_1>.evl
evl run/datahub/<file_1>.evl
...

Each job represents one file or table to be anonymized. See Build and Run for details.

Note: Be careful when running anonymization jobs several times, as data in the target are overwritten by default, unless export EVL_DATAGEN_APPEND=1 is specified in the settings file configs/datahub/*.sh or in project.sh.
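For example, append mode can be enabled by adding the export line to the source settings file (the file name my_source.sh below is illustrative):

```shell
# Append to existing target data instead of overwriting it.
# This line would go into configs/datahub/my_source.sh (name is illustrative).
export EVL_DATAGEN_APPEND=1
```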

If you have many files or tables to anonymize in one batch, you don’t need to run the anonymization jobs one after another; instead, you can run all jobs at once via the generated workflow:

evl run workflow/datahub/<source_name>.ewf

4.1 evl datahub command

(since EVL 1.0)

To help generate, check, and build all the configuration files, there is the ‘evl datahub’ command line utility.

All three config CSV files are comma (’,’) separated with Linux EOL (’\n’).
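As an illustration of the expected format, a valid config CSV can be produced with plain shell tools as long as commas and Unix line endings are used (the file name and column names below are illustrative, not the real config schema):

```shell
# Write a comma-separated file with Linux EOL ('\n'); printf emits no '\r'.
# The column names shown here are purely illustrative.
printf 'entity,field,data_type,null\n' >  example.fields.csv
printf 'party_addr,street,string,Y\n'  >> example.fields.csv
```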

evl datahub new system <system_name>
create empty config CSV files with the given name in the project folder <project_dir>, in subfolder ‘config’, i.e. <project_dir>/config/<system_name>.datasets.csv and <project_dir>/config/<system_name>.fields.csv; with ‘--sample’, create the configs with sample data

evl datahub new config <config_name>
create an empty config CSV file with the given name in the project folder <project_dir>, in subfolder ‘config’, i.e. <project_dir>/config/<config_name>.jobs.csv; with ‘--sample’, create the config with sample data

evl datahub extract ( datasets | fields )
extract datasets or fields from the source database. Fields are extracted based on the existing datasets config CSV file, and the fields config CSV file is created or updated.

evl datahub generate config <config_name>
generate the config for jobs, i.e. prepare the file <project_dir>/config/<config_name>.jobs.csv

evl datahub build config <config_name>
build all the files for all jobs from the given config

evl datahub list systems
return the list of all systems of the given project, i.e. list the Datasets and Fields CSV config files

evl datahub list configs
return the list of all job configs of the given project, i.e. list the Jobs CSV config files

evl datahub list jobs <config_name>
return the list of all jobs for the particular <config_name>

evl datahub check
check whether all config CSV files are correct and ready for generating files from them

evl datahub export ( datasets | fields | jobs )
export the config CSV files given by prefix, with resolved variables, in the format specified by ‘--output-format’

Options

--dataset-name=<dataset_name_regex>
check, generate or export only for given dataset(s)

--dataset-system=<system_regex>
check, generate or export only for given dataset system(s)

--dataset-version=<version_number>
check, generate or export only for given dataset version

-o, --output=<file>
write output into file <file> instead of standard output

-f, --output-format=(csv|json)
write output in the given file format; by default, write csv

-p, --project=<project_dir>
if the current directory is not the project’s one, a full or relative path can be specified by <project_dir>

--sample
when creating new configs, add sample data into them

--uri
URI to the database or folder, e.g. postgres://my_user@the_server:5432/my_database
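Combining several of these options, an invocation might look like the following sketch (the dataset pattern, file names, and project path are illustrative):

```shell
# Export the fields config for datasets matching 'party.*' as JSON,
# writing to a file instead of stdout (all names here are illustrative).
evl datahub export fields \
    --dataset-name='party.*' \
    --output-format=json \
    --output=party_fields.json \
    --project=$HOME/my_project
```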

Standard options:

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit


4.2 Project

Consider an anonymization project to be a folder where we work on the anonymization of some group of data, for example a group of data that belongs together from a business point of view. In most cases there will be only one or a couple of projects.

You can create a new project by hand or by a command:

evl datahub project new my_project

It will create a new directory my_project in the current folder with default settings and subfolder structure.

Or you can create a new project with sample data and configuration:

evl datahub project sample $HOME/my_sample_project

It will create a new directory my_sample_project in your home folder, containing a sample project.

The anonymization project directory structure is:

build/
files generated by ‘evl datahub build’ command

configs/
configuration csv files and settings sh files

lib/
folder for custom anonymization functions

run/
anonymization jobs generated by ‘evl datahub build’ command

workflow/
workflows generated by ‘evl datahub build’ command

All files in the build, run and workflow directories are generated entirely from the configuration file(s) configs/<source_name>.csv.
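Since a project is just a folder, the layout above can be sketched by hand with ordinary shell commands (the project name is illustrative; ‘evl datahub project new’ also fills in default settings files, which this sketch omits):

```shell
# Recreate the bare project layout by hand; 'evl datahub project new'
# additionally creates default settings files (omitted here).
mkdir -p my_project/build \
         my_project/configs/datahub \
         my_project/lib \
         my_project/run/datahub \
         my_project/workflow/datahub
```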


4.3 Source Settings

Once we have a project directory, we would like to add a source, which could be a folder with files or a database.

What should be anonymized, and how, is specified in config and settings files. The config file is a CSV file and the settings file is a shell script with variable definitions.

Each source has one config file and one settings file.

To create new, empty config and settings files, run:

evl datahub source new my_source

which creates two files in the current project folder:

configs/my_source.csv
configs/datahub/my_source.sh

To create pre-generated config and settings files based on a folder with source CSV files:

evl datahub source new my_source --guess-from-csv=data/source

which goes through all CSV files in the data/source folder and fills in the config file with entity names (i.e. file names), field names based on the headers, data types, and the null flag of each field.
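To illustrate what --guess-from-csv reads, a minimal source file whose header row supplies the field names might look like this (the file name and columns are illustrative):

```shell
# A minimal source file: the entity name would be guessed from the file
# name and the field names from this header row (all names illustrative).
mkdir -p data/source
printf 'street,city,zip\n'             >  data/source/party_addr.csv
printf 'Main St 1,Springfield,12345\n' >> data/source/party_addr.csv
```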

If the current directory is not the project’s one, specify the path to the project by option ‘--project=<project_path>’.

See Config File for detailed information about config files.


4.4 Build and Run

For each Entity from the config file, i.e. each table or file, an anonymization job with mapping and other metadata needs to be built. It is enough to run the command line utility

evl datahub build <config_file>
[-p|--project <project_dir>]
[--parallel [<parallel_threads>]]
[-v|--verbose]

This builds all the files in the build/ project subdirectory. There you can find evd and evm files in the appropriate folders. EVD stands for EVL Data definition file, which defines the structure of the source/target: field names, data types, and other attributes. EVM stands for EVL Mapping file, which defines how each field is mapped. Although both of these files are generated, it is sometimes useful to check what they look like for debugging purposes.

It also generates files in the run/datahub/ subdirectory, where you can find one evl file per Entity. These files can then be run to anonymize the data. For example, for three tables, party_addr, party_cont and party_rel, the jobs would be fired by these commands:

evl run/datahub/party_addr.evl
evl run/datahub/party_cont.evl
evl run/datahub/party_rel.evl

Once such an evl file exists for an Entity, there is no need to build jobs again. On each run, it checks whether the config file has changed for the given Entity and runs the ‘evl datahub build’ command automatically.

Note: There is no need to run ‘evl datahub build’ every time the config file is updated. It is done automatically once a job is fired.

The build command also generates a workflow file for the given source in the workflow/datahub/ subdirectory. With it, you can run the anonymization for all the Entities from that source. For example, having a source defined by configs/some_source.csv, you can run

evl run workflow/datahub/some_source.ewf

and it will run all the anonymization jobs in one or several parallel threads, depending on the value defined by the --parallel option.

If one or more anonymization jobs in a workflow fail, you can restart the whole workflow by:

evl restart workflow/datahub/some_source.ewf

or continue from the last failures:

evl continue workflow/datahub/some_source.ewf