Tuesday, January 14, 2020

Updating a large table with synthetic data from a pool table [postgresql]

I'm trying to write a script that anonymizes a database containing personal information, for use in a testing environment and for bug reproduction. The script generates SQL depending on the columns to anonymize. The idea is to first create a pool of e.g. names and then replace the real names with names from that pool.

One requirement is that, within a single anonymization run, a given name must always be replaced by the same anonymized name: if multiple columns contain the name Peter, they should all end up with the same pseudonym, say Frank. To achieve this I used random() and setseed(). A (simplified) UPDATE statement generated by the script looks like this:

UPDATE customer 
SET name = (
    SELECT val FROM name_pool 
    WHERE id = FLOOR(random_with_seed(name || 'foo') * (SELECT COUNT(*) FROM name_pool) + 1)
)
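For a single name, the inner lookup behaves like this (a spot check, using the name_pool table and random_with_seed() function shown below):

```sql
-- Spot check for one name: the salted name seeds the PRNG, and the
-- resulting value in [0, 1) is scaled to a pool id in [1, COUNT(*)].
SELECT val
FROM name_pool
WHERE id = FLOOR(random_with_seed('Peter' || 'foo')
                 * (SELECT COUNT(*) FROM name_pool) + 1);
```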

The name_pool table looks like this:

|   id  |  val  |
|-------|-------|
|   1   | Peter |
|   2   | Paul  |
|  ...  |  ...  |
| 10000 | Mary  |

The function random_with_seed() does nothing other than set the given seed before returning a random number. This ensures that the same name, used as a seed, always produces the same random number. The 'foo' in this example stands in for a secret salt that prevents guessing the original name:

CREATE OR REPLACE FUNCTION random_with_seed(seed text) RETURNS double precision AS
    $$
        BEGIN
            -- cos() maps the integer hash into [-1, 1], the range setseed() accepts
            PERFORM setseed(cos(hashtext(seed)));
            RETURN random();
        END
    $$ LANGUAGE plpgsql IMMUTABLE;
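Repeated calls with the same seed return the same value, which is what makes the replacement deterministic (a quick sanity check, not part of the generated script):

```sql
-- Both calls reseed the PRNG identically, so both return the same number.
SELECT random_with_seed('Peter' || 'foo') = random_with_seed('Peter' || 'foo');
-- should return true
```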

Now the actual problem is that this UPDATE becomes extremely slow when the pool of names is large (> 10,000 rows). With a pool of only a few rows it is very fast, even when the table to anonymize has more than 100,000 rows.

Is there a better way to write the UPDATE statement, a way to optimize the query, or another way to deterministically select a random element from the pool based on the existing name?
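For comparison, the kind of rewrite I have in mind would map the salted hash to a pool id directly and join once, instead of reseeding the PRNG for every row (a rough sketch, untested at scale; it assumes abs() of hashtext's integer output distributes well enough across pool ids):

```sql
-- Sketch: derive the pool id from the salted hash directly,
-- avoiding the setseed()/random() round trip per row.
UPDATE customer c
SET name = p.val
FROM name_pool p
WHERE p.id = abs(hashtext(c.name || 'foo'))
             % (SELECT COUNT(*) FROM name_pool) + 1;
```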
