I have a database with personal data that I want to anonymize with random names etc. Eg., the table person (id int, first_name varchar, last_name varchar, birth_date date)
. I also have a table with random names, eg. first_name (id int, value varchar)
. I know that this boils down to optimizing select * from ... order by rand()
and I've read several StackOverflow questions on it (and http://jan.kneschke.de/projects/mysql/order-by-rand/), but this doesn't answer the question of updating with random data in bulk.
I've tried:
- (Initially)
UPDATE _person me SET birth_date = (SELECT value FROM anonymizer.birth_date ORDER BY RAND() LIMIT 1)
(...same for other columns). As there are multiple tables, anonymizing the whole DB was going for an hour when I interrupted it. - (Currently)
UPDATE _person me join anonymizer.first_name on anonymizer.first_name.id = round(rand() * (select max(id) from anonymizer.first_name)) SET first_name = anonymizer.first_name.value
. This single query took 5 minutes (apart from other columns and tables).
I can prepare my fake data tables in any way (init with sequence numbers etc.). Is there a way to anonymize the DB efficiently?
Aucun commentaire:
Enregistrer un commentaire