Generally anonymization means conversion of personal data into anonymized data by using various anonymization techniques. The EU GDPR regulation lays down rules relating to the protection of natural persons with regard to the processing of personal data quite clearly. All companies dealing with personal data of EU citizens have to respect those rules.

Anonymization terms and properties

Pseudonymization – GDPR defines the term pseudonymization: the anonymized personal data cannot be attributed to a specific data subject without the use of additional information. It means there is still a possibility to re-identify the original data from the additional data that should be kept separately.

Reversibility – if it is possible to get the original value or at least how difficult it is. This is something different from pseudonymization, because here is supposed you don’t have the (secret) additional information. In other words: if it is possible to break the algorithm.

Repeatability – same value will be anonymized to the same anonymized value again. I.e. rerun of anonymization will always produce the same result. Or some value will be replaced in different tables always to the same anonymized number. This is usually very important to anonymize IDs.

Uniqueness – two different original values will be anonymized into two different anonymized values. In other words such anonymization is bijective function, i.e. one-to-one correspondence. This is quite important to anonymize IDs for example.

Preserve data type – anonymize values keeps the data type. So an integer for example cannot be anonymized to a string or the anonymized timestamp must be again valid timestamp.

Preserve length – anonymized values are of the same length or keeps the maximal length of given data type. This is very important for anonymization for testing purposes.

Salt is an arbitrary additional string or value added to given data, usually before doing checksum. It is necessary for example when creating hash of some short or somehow bounded value, e.g. phone number. By the information that some hash is made by sha256 from a phone number, one can use sha256 to produce translation table with all possible values. But when some salt, like string "M8SKC7WOFP975WUS", is added to phone number, then without knowing this salt, usual checksum sha256 is not possible to revert.

Choosing the appropriate anonymization approach and techniques highly depends on the purpose and type of processing of personal data e.g. running production systems, archiving or providing data to partners or development teams. The responsibility lies on the Data controller (the natural or legal person, public authority, agency ...) who determines the purposes and means of the processing of personal data.

Anonymization Techniques

Scrambling – means permutation of letters. But quite often it is possible to revert the original data. Example:

Peter Sellers ---> Teepr Resells

Shuffling – permutes values within a whole column. Example of shuffling ID:

idname idname
2Pierre Richard 5Pierre Richard
3Richard Matthew Stallman  →   2Richard Matthew Stallman
5Donald E. Knuth 3Donald E. Knuth

In general this technique is not repeatable, but it is bijection.

Randomization – simply replace the original value by any random one. Example:

1st run Michael Raynolds ---> 5dtZ4twxx7896avkf78ad+0p 2nd run Michael Raynolds ---> 6shk8t9we6fgos7rthj98d

It is clear that randomization is not repeatable and also not bijective function.

Encryption – uses a key to encrypt the original value. Then such key must be kept secret or can be deleted immediately, depends on the purpose, if we’d like to be able to decrypt the data or not.

Masking – allows an important/unique part of the data to be hidden with random characters or other data. For example a credit card number:

9370 4442 9037 4197 ---> **** **** **** 4197

The advantage of masking is the ability to identify data without manipulating actual identities.

Tokenization – replaces sensitive data with non-sensitive substitutes, referred to as tokens, and usually stored in some secret mapping table. Tokenization keeps the data type and usually also length of data, so it can be processed by legacy systems that are sensitive to data length and type. That is achieved by keeping specific data fully or partially visible for processing and analytics while sensitive information is kept hidden.

Table with overview of anonymization techniques

technique property
revertible uniqueness pseudo-
nymization
repeatable preserve
data type
preserve
length
scrambling often yes not yes depends on
algorithm
yes yes
shuffling not yes not depends on
algorithm
yes yes
randomization not not not not yes yes
encryption not *) not yes, but that’s
the purpose here
yes not not
masking not not not yes mostly yes yes
checksum often yes almost yes yes yes not not
salted checksum not *) almost yes not yes not not
tokenization not *) depends on
algorithm
yes yes yes yes
data type preserving
anonymization
not *) almost yes not yes yes yes
data type preserving
unique anonymization
not *) yes not yes yes yes

*) Additional information must be kept somewhere secretly. Either permanently or temporarily. Like a token table, an encryption key or a salt.