Anonymizing data and preserving foreign key relationships
BenDippenaar
Posts: 2
We are using data generator to anonymize some existing data in a database.
Right now, when i use the SQL foreign key data generator, it looks like the data that is generated has to be unique, or I have to specify the amount of repeats from within the interface.
It would be really nice if the data generator could generate data that matches the referential ‘footprint’ of the data it is generating from.
for example, it doesn't seem like i can preserve existing many-to-one relationships in the anonymized database.
Right now, when i use the SQL foreign key data generator, it looks like the data that is generated has to be unique, or I have to specify the amount of repeats from within the interface.
It would be really nice if the data generator could generate data that matches the referential ‘footprint’ of the data it is generating from.
for example, it doesn't seem like i can preserve existing many-to-one relationships in the anonymized database.
Comments
Usually for anonymizing data, you'd work on the basis of "copying" your current database and then generating random data for just the columns you want to remove sensitive information on. We have a document describing the process here
I'd suggest having a look through that and seeing if it helps, if not, please let us know!
Redgate Software
What if your DB had 3 tables in it and in all three tables there was a column lets call it creditcardnumber the columns all need to be anonymized the same way so that the one to many or the one to one relationship can be maintained and you can query the new anonymizing data and get the same row count and reult set as the live data. Can the tool do that?
Thanks,
Scriber
If the 3 tables are actually a master, and then there's a FK on the other two back to that, it should just bring the values in automatically using the FK generator.
Redgate Software
SQL Data Generator doesn't have any kind of cross-column rules, so you are limited in that sense.
If yuo get stuck, then a simple repro may be useful... you can email it to support with F0053299 in the subjectline
Redgate Software
We are trying to do exactly the same thing. We have three tables:
Address, Customer, CustomerAddresses (contains only 2 FKs: AddressId and CustomerId). A Customer can have 0 to many Addresses.
The goal is to generate new address info and customer names but keep the same FKs in the CustomerAddresses table. I am using the "Use existing data source" option and am referencing a copy of the database to generate/import the data from Address and Customer. What I want to do is to import the CustomerAddresses table EXACTLY as it is. Same Id values in each row.
In the "real" database, the data looks like this:
Address/AddressId
1 (and then street1, city, state, zip, etc.)
2
3
...
Customer/CustomerId
1 (and then name, credit limit, etc.)
2
3
CustomerAddresses
CustomerId / AddressId
1 1
1 2
2 3
2 4
3 5
3 6
But when we generate the CustomerAddresses table with "Use existing data source" it forces us to use the Foreign Key Generator for these two columns. We can't change it. How can we just use the existing data and not generate anything?
We cannot de-select the CustomerAddresses table for generation because it chokes when trying to delete all the rows from Customer.
The other option is to drop the FK's before starting up Data Generator; then you can use whatever generation option you like on those columns. You can recreate them afterwards then. This is a little more inconvenient of course, but the only way around it I can think of right now.
Redgate Software
This is a serious limitation in your product. All I really want to do is to "UPDATE" a few columns to wipe out some of the production data columns so that we can have a realistic test database.
As a concrete example for why this is a serious limitation, I put forth the following:
Customer can have 0 or more related Address records (via CustomerAddresses). If the CustomerAddresses CustomerId and/or AddressId FKs are randomly assigned, then the Addresses won't make much sense at all. You'll have Address records from different states or countries. Which is fine if you're trying to test for all available inputs but terrible if your goal is a realistic-yet-cleansed set of data.
There are a number of other examples in our database where many-to-many relationships only make sense if you keep the same keys. As far as I can tell, the only way to do this is to run a post-SQL Generator script that deletes all the FKs in these m2n tables, sets IDENTITY INSERT and then re-imports the existing data.
Why can't we just import all of the CustomerAddresses FKs as-is? Can't we just set the generation order so that Customer and Address are generated before CustomerAddresses?
Redgate Software
I'm getting around this limitation by creating views against the source tables that select only the columns I don't want to "update". I use the Table/View generator to populate those columns in the target and randomizing generators for the columns I want to "update". The only columns for which this doesn't work are those with a Foreign Key constraint and IDENTITY columns.
2 possible enhancements to address this, which would help my situation but possibly not the other posters', are:
1. For columns with FK constraints, default to the "Foreign Key" generator, but allow us to change it to a different one. 2. For IDENTITY columns, default to the "Server Assigned" generator, but allow us to change it to another integer-type generator with an option to SET IDENTITY_INSERT ON.