4 Data Generation
Before going into detail, let’s provide an overview of the data generation process.
To initialize, set up, and build a project (i.e. a group of data you would like to anonymize), follow these steps.
See evl datahub command for details about the ‘evl datahub’ commands.
- Create a new project:
  evl datahub project new <project_dir>
  See Project for details about projects.
- Add a source, i.e. a folder with files to be anonymized or a database with tables to be anonymized:
  evl datahub source new <source_name> \
      --guess-from-csv <path_to_folder_with_such_CSVs>
  See Source Settings for details about the settings for a source.
- Edit the config (csv) file according to your preferences. (The Excel file checks validity immediately and provides drop-down options.)
- Check the config file for mistakes:
  evl datahub check <config_file>
- Generate anonymization jobs and a workflow:
  evl datahub build <config_file>
  See Build and Run for details about job and workflow generation, and see Config File for details about the config file.
Then, to anonymize (regularly), run the anonymization jobs:
evl run/datahub/<table_1>.evl
evl run/datahub/<file_1>.evl
...
Each job represents one file or table to be anonymized. See Build and Run for details.
Note: Be careful when running anonymization jobs several times: data in the target are overwritten by default, unless
export EVL_DATAGEN_APPEND=1
is specified in the settings file configs/datahub/*.sh or in project.sh.
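For instance, a settings file that switches a source to append mode might look like this (a sketch only; the file name follows the conventions above and the source name is illustrative):

```sh
# configs/datahub/my_source.sh — settings for the source "my_source" (name is illustrative)
# Append to the target on each run instead of overwriting it:
export EVL_DATAGEN_APPEND=1
```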
When you have many files or tables to anonymize in one batch, you don’t need to run the anonymization jobs one after another; you can run them all at once via the generated workflow:
evl run workflow/datahub/<source_name>.ewf
4.1 evl datahub command
(since EVL 1.0)
To help generate, check, and build all the configuration files, there is the ‘evl datahub’ command-line utility.
All three config CSV files are comma (‘,’) separated with Linux EOL (‘\n’).
evl datahub new system <system_name>
create empty config CSV files with the given name in the project folder <project_dir>, in
subfolder ‘config’, i.e. <project_dir>/config/<system_name>.datasets.csv and
<project_dir>/config/<system_name>.fields.csv; with ‘--sample’, create the configs with sample data
evl datahub new config <config_name>
create an empty config CSV file with the given name in the project folder <project_dir>, in
subfolder ‘config’, i.e. <project_dir>/config/<config_name>.jobs.csv; with ‘--sample’,
create the config with sample data
evl datahub extract ( datasets | fields )
extract datasets or fields from the source database; fields are extracted based on the existing
datasets config CSV file, creating or updating the fields config CSV file
evl datahub generate config <config_name>
generate the config for jobs, i.e. prepare the file <project_dir>/config/<config_name>.jobs.csv
evl datahub build config <config_name>
build all the files for all jobs from the given config
evl datahub list systems
return the list of all systems of the given project, i.e. list the Datasets and Fields CSV config files
evl datahub list configs
return the list of all job configs of the given project, i.e. list the Jobs CSV config files
evl datahub list jobs <config_name>
return the list of all jobs for the given <config_name>
evl datahub check
check whether all config CSV files are correct and ready for generating files from them
evl datahub export ( datasets | fields | jobs )
export the config CSV files given by prefix, with resolved variables, in the format specified by
‘--output-format’
Options
--dataset-name=<dataset_name_regex>
check, generate or export only for given dataset(s)
--dataset-system=<system_regex>
check, generate or export only for given dataset system(s)
--dataset-version=<version_number>
check, generate or export only for given dataset version
-o, --output=<file>
write output into file <file> instead of standard output
-f, --output-format=(csv|json)
write output in given file format, by default write csv
-p, --project=<project_dir>
if the current directory is not the project’s one, a full or relative path can be specified by
<project_dir>
--sample
when creating new configs, add sample data into them
--uri
URI to the database or folder, e.g.
postgres://my_user@the_server:5432/my_database
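Combined with ‘evl datahub source new’, adding a database source might look like this (an illustration only; the source name and connection details are placeholders):

```
evl datahub source new my_db \
    --uri postgres://my_user@the_server:5432/my_database
```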
Standard options:
--help
print this help and exit
--usage
print short usage information and exit
-v, --verbose
print to stderr info/debug messages of the component
--version
print version and exit
4.2 Project
Consider an anonymization project to be a folder where we work on the anonymization of some group of data, for example a group of data from a business point of view. In most cases there will be only one or a couple of projects.
You can create a new project by hand or by a command:
evl datahub project new my_project
It will create a new directory my_project in the current folder, with default settings and a subfolder structure.
Or you can create a new project with sample data and configuration:
evl datahub project sample $HOME/my_sample_project
It will create a new directory my_sample_project in your home folder with a sample project.
The anonymization project directory structure is:
build/
files generated by ‘evl datahub build’ command
configs/
configuration csv files and settings sh files
lib/
folder for custom anonymization functions
run/
anonymization jobs generated by ‘evl datahub build’ command
workflow/
workflows generated by ‘evl datahub build’ command
All files in the build, run, and workflow directories are generated entirely from the configuration file(s) configs/<source_name>.csv.
4.3 Source Settings
Once we have a project directory, we would like to add a source, which can be a folder with files or a database.
What should be anonymized, and how, is specified in config and settings files.
The config file can be a csv file, and the settings file is a shell script with variable definitions.
Each source has one config file and one settings file.
To create new, empty config and settings files, run:
evl datahub source new my_source
which creates two files in the current project folder:
configs/my_source.csv
configs/datahub/my_source.sh
To create pre-generated config and settings files based on a folder with source csv files, run:
evl datahub source new my_source --guess-from-csv=data/source
which goes through all the csv files in the data/source folder and fills in the config file with entity names (i.e. file names), field names based on the headers, data types, and the null flag of each field.
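A generated config file might then begin like this (a sketch only; the actual column set is defined by the tool, and the entity, field, data-type, and null-flag columns shown here merely illustrate the description above):

```csv
entity,field,data_type,null
customers.csv,cust_id,integer,false
customers.csv,email,string,true
```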
If the current directory is not the project’s one, specify the path to the project with the option ‘--project=<project_path>’.
See Config File for detailed information about config files.
4.4 Build and Run
For each Entity from the config file, i.e. each table or file, an anonymization job with mapping and other metadata needs to be built. It is enough to run the command-line utility:
evl datahub build <config_file>
[-p|--project <project_dir>]
[--parallel [<parallel_threads>]]
[-v|--verbose]
That builds all the files in the build/ project subdirectory. There you can find evd and evm files in the appropriate folders. EVD stands for EVL Data definition file; it defines the structure of the source/target: field names, data types, and other attributes. EVM stands for EVL Mapping file; it defines how each field is mapped. Although both of these files are generated, it is sometimes useful to check what they look like for debugging purposes.
It also generates files in the run/datahub/ subdirectory, where you can find one evl file per Entity.
These files can then be run to anonymize the data. For example, for three tables party_addr, party_cont, and party_rel, the anonymization would be started by these commands:
evl run/datahub/party_addr.evl
evl run/datahub/party_cont.evl
evl run/datahub/party_rel.evl
Once such an evl file exists for an Entity, there is no need to build the jobs again. Each run checks whether the config file has changed for the given Entity and runs the ‘evl datahub build’ command automatically.
Note: There is no need to run ‘evl datahub build’ every time the config file is updated. It is done automatically once the job is fired.
The build command also generates a workflow file for the given source in the workflow/datahub/ subdirectory. With it you can run the anonymization for all the Entities from that source. For example, having a source defined by configs/some_source.csv, you can run
evl run workflow/datahub/some_source.ewf
and it will run all the anonymization jobs in one or several parallel threads, depending on the value given by the --parallel option.
If one or more anonymization jobs in a workflow fail, you can restart the whole workflow with:
evl restart workflow/datahub/some_source.ewf
or continue from the last failures:
evl continue workflow/datahub/some_source.ewf