EVL Anonymization Microservice
Why Anonymize Data?
Creating anonymized data sets from production data offers several benefits, including GDPR compliance for personal information and the protection of commercially sensitive data from developers, testers, and other outside contractors.
EVL Anonymization Microservice enables fast, automated, and cost-effective anonymization of data sets. It can be used to pseudonymize and anonymize production data according to GDPR requirements, as well as to protect commercially sensitive data from developers, testers, and other outside contractors.
EVL Microservices are built on top of the core EVL software and retain its flexibility, robustness, high productivity, and ability to read data from various sources, including CSV files, databases (Oracle, Teradata, SQL Server, etc.), and Hadoop streaming data such as Kafka.
EVL Anonymization Key Advantages
- High productivity
- Custom functions can be easily designed and embedded into the solution
- Low implementation and operating costs
- Combination of anonymization techniques: Encryption, Tokenization, Masking, Randomization
EVL Anonymization Functions
|Technique||Data type||EVL function||Example and description|
|Masking||string||str_mask_left(), str_mask_right()||str_mask_left("1234 5678 9012 3456",4,'X') -> "XXXX XXXX XXXX 3456", i.e. mask with "X" from the left, but keep 4 characters on the right|
|Random||any||random(min,max)||returns a random value of the given data type from the given range|
|Randomization||date, timestamp||randomize()||randomize(date("2019-01-01"),5,6,15) returns a random value with year 2019 plus/minus 5 years, January plus/minus 6 months, and the first day of the month plus/minus 15 days|
|Anonymization||string||anonymize() [1]||anonymize("abcd",2,8) -> "s8L7df", i.e. returns a string of length between 2 and 8|
|Anonymization||numbers||anonymize() [1]||anonymize(573,0,1023) -> 850, i.e. returns an integer between 0 and 1023|
|Anonymization||date, timestamp||anonymize() [1]||anonymize(date("2018-05-25"), 1, 6, 15) -> 2019-09-17, i.e. returns the given date plus/minus 1 year, plus/minus 6 months, and plus/minus 15 days|
|Unique anonymization||integral data types||anonymize_uniq() [2]||anonymize_uniq((uint)133) -> 85.189.556, i.e. returns a uint; no input other than 133 can return 85.189.556, so the mapping is unique|
|Encryption||string||encrypt()||encrypt("abcd") -> "99bd … c4u8", i.e. returns the encrypted value based on the algorithm and its length|
|Decryption||string||decrypt()||decrypt("99bd … c4u8") -> "abcd", i.e. returns the decrypted value based on the algorithm and its length|
|Tokenization||string||anonymize_uniq(str, length(str), length(str))||Tokenization is just a specific application of the anonymize_uniq() function|
|Hashing||string||sha256sum()||sha256sum("abcd") -> "fc4b5fd6 … b801d62c"|
|Salted hash||string||sha256sum(str + salt)||i.e. simply append a salt and compute the checksum; to keep the output at a reasonable length, the anonymize() function is the better choice|
[1] For a given value and a given salt, the output is always the same, but two different values may map to the same anonymized value.
[2] For a given value and a given salt, the output is always the same and the mapping is unique, so a bijection is guaranteed. Particularly useful for IDs.
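The difference between the two notes above can be illustrated with a minimal Python sketch. It is not EVL code and does not reproduce EVL's algorithms: the names `mask_left`, `anonymize_int`, `anonymize_uniq_u32`, and the `SALT` value are hypothetical, and the choice of HMAC-SHA256 and a small Feistel network is an assumption made only to show a deterministic salted mapping that can collide next to one that is a bijection on 32-bit unsigned integers.

```python
import hashlib
import hmac

SALT = b"example-salt"  # hypothetical; a real salt would be kept secret


def mask_left(text: str, keep: int, mask_char: str = "X") -> str:
    """Mask all non-space characters except the last `keep` characters."""
    cut = max(len(text) - keep, 0)
    masked = "".join(c if c.isspace() else mask_char for c in text[:cut])
    return masked + text[cut:]


def anonymize_int(value: int, low: int, high: int, salt: bytes = SALT) -> int:
    """Deterministic salted mapping into [low, high]; collisions are possible."""
    digest = hmac.new(salt, str(value).encode(), hashlib.sha256).digest()
    return low + int.from_bytes(digest[:8], "big") % (high - low + 1)


def anonymize_uniq_u32(value: int, salt: bytes = SALT) -> int:
    """Bijective salted mapping on 32-bit unsigned integers (4-round Feistel)."""
    left, right = value >> 16, value & 0xFFFF
    for round_no in range(4):
        msg = bytes([round_no]) + right.to_bytes(2, "big")
        f = int.from_bytes(hmac.new(salt, msg, hashlib.sha256).digest()[:2], "big")
        # Each Feistel round is invertible, so the whole mapping is one-to-one.
        left, right = right, left ^ f
    return (left << 16) | right


print(mask_left("1234 5678 9012 3456", 4))  # XXXX XXXX XXXX 3456
print(anonymize_int(573, 0, 1023))          # same value on every run with this salt
print(anonymize_uniq_u32(133))              # no other 32-bit input yields this value
```

Mapping 2000 inputs into the range 0 to 1023 must produce collisions, which is the behavior described in note [1], while the Feistel construction can never map two different inputs to the same output, as in note [2].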
A bank needed to provide production data to its development team in such a way that the data could not be re-identified while the entity relationships were preserved. The sources were 100+ tables stored in CSV files, SQL Server, Informix, and Oracle. The target for the anonymized development data was an Oracle database. The customer filled in one configuration file containing all data definitions, anonymization types, and parameters pointing to the sources (directories for CSV files and connect strings for databases). The EVL anonymization jobs were generated automatically and run in parallel batches: for example, the anonymization of one file containing 10 million rows took 50 seconds.
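Why deterministic anonymization keeps entity relationships intact can be shown with a small Python sketch. The tables, the column names (`CUSTOMER_ID`, `NAME`, `ACCOUNT`), the salt, and the helper `pseudonymize_id` are all hypothetical and are not taken from the bank project; the sketch only assumes that every job applies the same keyed function with the same salt to a shared key column.

```python
import hashlib
import hmac

SALT = b"example-salt"  # hypothetical secret shared by all jobs


def pseudonymize_id(value: str, salt: bytes = SALT) -> str:
    """Keyed hash of an identifier; the same input and salt give the same token."""
    # Truncation keeps tokens short but means uniqueness is not guaranteed.
    return hmac.new(salt, value.encode(), hashlib.sha256).hexdigest()[:16]


# Two hypothetical source tables that share the key CUSTOMER_ID.
customers = [{"CUSTOMER_ID": "1001", "NAME": "Alice"},
             {"CUSTOMER_ID": "1002", "NAME": "Bob"}]
accounts = [{"ACCOUNT": "A-1", "CUSTOMER_ID": "1001"},
            {"ACCOUNT": "A-2", "CUSTOMER_ID": "1001"},
            {"ACCOUNT": "A-3", "CUSTOMER_ID": "1002"}]

# Each table is anonymized independently, as separate jobs would do.
anon_customers = [{**row, "CUSTOMER_ID": pseudonymize_id(row["CUSTOMER_ID"]), "NAME": "***"}
                  for row in customers]
anon_accounts = [{**row, "CUSTOMER_ID": pseudonymize_id(row["CUSTOMER_ID"])}
                 for row in accounts]


def accounts_per_customer(cust_rows, acc_rows):
    """Count accounts per customer by joining on CUSTOMER_ID."""
    keys = [c["CUSTOMER_ID"] for c in cust_rows]
    return sorted(sum(1 for a in acc_rows if a["CUSTOMER_ID"] == k) for k in keys)


print(accounts_per_customer(customers, accounts))            # [1, 2]
print(accounts_per_customer(anon_customers, anon_accounts))  # [1, 2]
```

The join gives the same counts before and after anonymization because both tables replace a given key with the same token, even though the original identifiers no longer appear.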
EVL Anonymization project
An anonymization project consists of the following steps:
- unzipping the EVL distribution and defining a few variables and paths
- filling in a CSV file that defines the source type (e.g. CSV, Oracle, …), the table or file name, the field names, and the validation functions to be applied
- automatic generation of EVL jobs for each entity
- running the EVL jobs in a batch or individually
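The idea of generating one job per entity from a definition file can be sketched in Python as follows. This is not the EVL generator: the CSV layout, the column names (`src`, `entity`, `ord`, `field`, `anon_type`), the `CUSTOMER` entity, and the `anon.<entity>` naming are assumptions chosen for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical definition file; the layout and column names are assumptions.
DEFINITION_CSV = """src,entity,ord,field,anon_type
FILE,TEST,2,ACCOUNT,ANONYMIZE_UNIQ
FILE,TEST,1,ID,ANONYMIZE_UNIQ
FILE,TEST,7,TEXT,
ORACLE,CUSTOMER,1,ID,ANONYMIZE_UNIQ
"""


def plan_jobs(definition_text: str) -> dict:
    """Group field definitions by entity and return one planned job per entity."""
    fields_by_entity = defaultdict(list)
    for row in csv.DictReader(io.StringIO(definition_text)):
        # An empty anon_type means the field is copied without anonymization.
        fields_by_entity[row["entity"]].append(
            (int(row["ord"]), row["field"], row["anon_type"] or None)
        )
    return {
        f"anon.{entity.lower()}": [
            (name, anon) for _, name, anon in sorted(fields, key=lambda f: f[0])
        ]
        for entity, fields in fields_by_entity.items()
    }


for job_name, fields in plan_jobs(DEFINITION_CSV).items():
    print(job_name, fields)
```

Keeping all definitions in one file and deriving the jobs from it means that a new entity or a changed anonymization type only requires editing the definition, not the jobs themselves.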
Example
The following example shows an anonymization setup for the development and test environment of a banking application.
# Source and target data directories
DATA_SOURCE_DIR="/some/path/source"
DATA_ANON_DIR="/some/path/anon"
# Path to salt
export EVL_ANON_SALT_PATH="/some/path/.salt"
Anonymization definition for file TEST:
|Src||Entity||Ord||Field name||Data type||Null||Anon type||EVL Function||Description|
|FILE||TEST||1||ID||int||No||ANONYMIZE_UNIQ|| ||Unique identifier of the person|
|FILE||TEST||2||ACCOUNT||int||No||ANONYMIZE_UNIQ|| ||Unique account number|
|FILE||TEST||3|| || || || || ||Personal ID (must be Mod 11), custom function is used|
|FILE||TEST||4|| || || ||ANONYMIZE|| ||Start date of the account|
|FILE||TEST||5||SCORE||decimal(15,2)|| ||ANONYMIZE|| ||Score of the account holder|
|FILE||TEST||6||DESC||string|| ||ANONYMIZE|| ||Description of the account|
|FILE||TEST||7||TEXT||string|| || || ||Free text - no anonymization|
# generating EVL jobs from the config file
evl run/generate_jobs.evl
# running the anonymization job for the entity "TEST"
evl run/anon.test.evl
Example data - one record
|Original value||Anonymized value|
|Account has been established on another name then changed||jZy96jPqkiH8GMYhdj9Ti6O8TdPVQKDciDmd8Nyi|
|He prefers blue color||He prefers blue color|
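How a field-level definition drives the result for one record can be sketched in Python. The handlers below are simple stand-ins and not EVL's anonymize() or anonymize_uniq(): the salt, the XOR-based integer mapping, and the ID and ACCOUNT values are hypothetical, while the DESC and TEXT values are borrowed from the sample record above.

```python
import hashlib
import hmac

SALT = b"example-salt"  # hypothetical; normally read from a protected file


def anonymize_text(value: str) -> str:
    """Stand-in for ANONYMIZE on strings: a deterministic keyed token."""
    return hmac.new(SALT, value.encode(), hashlib.sha256).hexdigest()[:12]


def anonymize_uniq_int(value: int) -> int:
    """Stand-in for ANONYMIZE_UNIQ: XOR with a salt-derived 32-bit key.

    XOR with a fixed key is a bijection, but it is far too weak for real use;
    it only illustrates the dispatch below.
    """
    key = int.from_bytes(hashlib.sha256(SALT).digest()[:4], "big")
    return value ^ key


HANDLERS = {"ANONYMIZE": anonymize_text, "ANONYMIZE_UNIQ": anonymize_uniq_int}

# Field -> anonymization type for entity TEST; None means the field passes through.
RULES = {"ID": "ANONYMIZE_UNIQ", "ACCOUNT": "ANONYMIZE_UNIQ",
         "DESC": "ANONYMIZE", "TEXT": None}


def anonymize_record(record: dict, rules: dict = RULES) -> dict:
    """Apply the handler named in `rules` to each field; copy the rest unchanged."""
    out = {}
    for field, value in record.items():
        anon_type = rules.get(field)
        out[field] = HANDLERS[anon_type](value) if anon_type else value
    return out


record = {"ID": 133, "ACCOUNT": 42,
          "DESC": "Account has been established on another name then changed",
          "TEXT": "He prefers blue color"}
result = anonymize_record(record)
print(result["DESC"])  # a token that replaces the description
print(result["TEXT"])  # He prefers blue color
```

In this sketch a field without a rule is copied as-is, which mirrors the TEXT row; a stricter design would reject fields that are missing from the definition instead of passing them through.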
Position in DWH Architecture
EVL Anonymization Detail Architecture