I'm trying to write a script to anonymize a database that contains personal information for use in a testing environment and bug reproduction. The script generates sql depending on the columns to anonymize. The idea was to first create a pool of e.g. names and to replace the real names with the ones in the pool.
One requirement is that during a run of the anonymization a certain name should always be replaced by the same anonymized name. E.g. if multiple columns contain the name Peter
this should always result in the same pseudonym Frank
. Therefore I used random()
and setseed()
to An (simplified) UPDATE
statement generated by the script looks like this:
UPDATE customer
SET name = (
SELECT val FROM name_pool
WHERE id = FLOOR(random_with_seed(name || 'foo') * (SELECT COUNT(*) FROM name_pool) + 1)
)
The name_pool
looks like that:
|---------------------|------------------|
| id | val |
|---------------------|------------------|
| 1 | Peter |
|---------------------|------------------|
| 2 | Paul |
|---------------------|------------------|
| ... | ... |
|---------------------|------------------|
| 10000 | Mary |
|---------------------|------------------|
The procedure random_with_seed()
does nothing else then setting the given seed before returning a random number. This makes sure that given the same name as a seed it returns the same random number. The 'foo'
in this example is a random secret salt to prevent guessing the original name:
CREATE OR REPLACE FUNCTION random_with_seed(seed text) RETURNS double precision AS
$$
BEGIN
PERFORM setseed(cos(hashtext(seed)));
RETURN random();
END
$$ LANGUAGE plpgsql IMMUTABLE;
Now the actual problem is that this update becomes extremely slow when the pool of names is big (> 10.000 rows). With pools of only a few rows it's very fast even if the table to anonymize has >100.000 rows.
Is there any better way to write the UPDATE
statement or optimize the query or another way to select a random element from the pool based on the existing name?
Aucun commentaire:
Enregistrer un commentaire