Anonymizing data and preserving foreign key relationships

BenDippenaar · October 14, 2011 7:27PM

We are using data generator to anonymize some existing data in a database.

Right now, when i use the SQL foreign key data generator, it looks like the data that is generated has to be unique, or I have to specify the amount of repeats from within the interface.

It would be really nice if the data generator could generate data that matches the referential â€˜footprintâ€™ of the data it is generating from.

for example, it doesn't seem like i can preserve existing many-to-one relationships in the anonymized database.

James B · October 17, 2011 7:26AM

Hi, and thanks for your post.

Usually for anonymizing data, you'd work on the basis of "copying" your current database and then generating random data for just the columns you want to remove sensitive information on. We have a document describing the process here

I'd suggest having a look through that and seeing if it helps, if not, please let us know!

Scriber · October 17, 2011 1:24PM

Hey James,

What if your DB had 3 tables in it and in all three tables there was a column lets call it creditcardnumber the columns all need to be anonymized the same way so that the one to many or the one to one relationship can be maintained and you can query the new anonymizing data and get the same row count and reult set as the live data. Can the tool do that?

Thanks,
Scriber

James B · October 17, 2011 1:27PM

For most of the random generators, you get the option of a "seed" value. AFAIK, the same seed should produce the same ordered data.

If the 3 tables are actually a master, and then there's a FK on the other two back to that, it should just bring the values in automatically using the FK generator.

Scriber · October 17, 2011 2:10PM

Thanks James I'll give it a try, and it should work in a one to many relationship? Were table A has one entry for creditcardnumber 123 and table B has say ten entries for the same creditcardnumber 123 when it changes table A to say ABC it will chnage the other ten entries in Table B to ABC so when I query for creditcard number ABC I'll get the proper row count?

James B · October 17, 2011 2:13PM

Hmm, I'm not sure if it's that clever when just generating random records. If your FK relationships are set up on that column then you have the choice in the FK generator for whether to generate all rows at random, make them unique, or to repeat a certain number of times, but I'm not sure if that's going to help.

SQL Data Generator doesn't have any kind of cross-column rules, so you are limited in that sense.

If yuo get stuck, then a simple repro may be useful... you can email it to support with F0053299 in the subjectline

mark__a · November 10, 2011 3:39PM

BenDippenaar wrote:

It would be really nice if the data generator could generate data that matches the referential â€˜footprintâ€™ of the data it is generating from.

for example, it doesn't seem like i can preserve existing many-to-one relationships in the anonymized database.

We are trying to do exactly the same thing. We have three tables:
Address, Customer, CustomerAddresses (contains only 2 FKs: AddressId and CustomerId). A Customer can have 0 to many Addresses.

The goal is to generate new address info and customer names but keep the same FKs in the CustomerAddresses table. I am using the "Use existing data source" option and am referencing a copy of the database to generate/import the data from Address and Customer. What I want to do is to import the CustomerAddresses table EXACTLY as it is. Same Id values in each row.

In the "real" database, the data looks like this:

Address/AddressId
1 (and then street1, city, state, zip, etc.)
2
3
...

Customer/CustomerId
1 (and then name, credit limit, etc.)
2
3

CustomerAddresses
CustomerId / AddressId
1 1
1 2
2 3
2 4
3 5
3 6

But when we generate the CustomerAddresses table with "Use existing data source" it forces us to use the Foreign Key Generator for these two columns. We can't change it. How can we just use the existing data and not generate anything?

We cannot de-select the CustomerAddresses table for generation because it chokes when trying to delete all the rows from Customer.

James B · November 10, 2011 3:48PM

Unfortunately if the foreign keys are there, then it'll always use the FK generator. If you're copying records across for the other two tables does this matter though? Presumably, the primaries in the other two tables are what you want anyway?

The other option is to drop the FK's before starting up Data Generator; then you can use whatever generation option you like on those columns. You can recreate them afterwards then. This is a little more inconvenient of course, but the only way around it I can think of right now.

mark__a · November 10, 2011 5:47PM

james.billings wrote:

Unfortunately if the foreign keys are there, then it'll always use the FK generator. If you're copying records across for the other two tables does this matter though? Presumably, the primaries in the other two tables are what you want anyway?

This is a serious limitation in your product. All I really want to do is to "UPDATE" a few columns to wipe out some of the production data columns so that we can have a realistic test database.

As a concrete example for why this is a serious limitation, I put forth the following:

Customer can have 0 or more related Address records (via CustomerAddresses). If the CustomerAddresses CustomerId and/or AddressId FKs are randomly assigned, then the Addresses won't make much sense at all. You'll have Address records from different states or countries. Which is fine if you're trying to test for all available inputs but terrible if your goal is a realistic-yet-cleansed set of data.

There are a number of other examples in our database where many-to-many relationships only make sense if you keep the same keys. As far as I can tell, the only way to do this is to run a post-SQL Generator script that deletes all the FKs in these m2n tables, sets IDENTITY INSERT and then re-imports the existing data.

Why can't we just import all of the CustomerAddresses FKs as-is? Can't we just set the generation order so that Customer and Address are generated before CustomerAddresses?

James B · November 11, 2011 9:57AM

We do have a feature request to allow simple updates which would obviously help in your situation, but as yet it's not been scheduled for inclusion. The ref. is SDG-850 and I've added a link back to this thread on the job for when it's next evaluated.

anna.p · October 16, 2012 7:21PM

A feature to allow simple updates would be fantastic.

I'm getting around this limitation by creating views against the source tables that select only the columns I don't want to "update". I use the Table/View generator to populate those columns in the target and randomizing generators for the columns I want to "update". The only columns for which this doesn't work are those with a Foreign Key constraint and IDENTITY columns.

2 possible enhancements to address this, which would help my situation but possibly not the other posters', are:

1. For columns with FK constraints, default to the "Foreign Key" generator, but allow us to change it to a different one. 2. For IDENTITY columns, default to the "Server Assigned" generator, but allow us to change it to another integer-type generator with an option to SET IDENTITY_INSERT ON.

Anonymizing data and preserving foreign key relationships

Comments

Product Learning

Community Forums

Events & Friends

Simple Talk