Comparing large amounts of data

JeffSchumacher · June 26, 2008 1:03PM

Using SQL Toolkit, we've come across an issue when comparing large amounts of data and the way in which the data is retrieved from the destination database.

Our problem:
Every hour, our operational database has potentially 100,000 records which are either added or updated in several tables. At the end of each hour, those additions and changes have to be replicated over to our reporting database. Primary Key values MUST be maintained between the source and destination databases.

We've setup a solution using SQL Toolkit to handle the replication of data between the two servers once an hour based on simple criteria:
The LastUpdate (datetime) column in the source tables must be within the last hour to be included in source dataset. New records have their LastUpdate column defaulted with the current timestamp which automatically includes them in the source dataset.

When SQL Toolkit starts the comparison, the source records are filtered properly based on the LastUpdate field. However, when the destination records are being loaded for comparison, SQL Toolkit is requesting all records in each table in the comparison list. Some of these tables have 5 million+ rows. Obviously performance is an issue in this scenario.

My question is, in this setup, what is the optimal way to use SQL Toolkit so that the whole table is not fetched from the destination for the comparison? Which options should we set? We already know that if the LastUpdate column falls within the previous hour, it's been changed and needs to be sent to the destination database. We also know that the primary key value must be the same, so can that be used to determine if the record exists in the destination database without fetching all of the columns?

Brian Donahue · June 26, 2008 1:14PM

Hi Jeff,

If you are using SelectionDelegate to filter out records that you don't want to synchronize, then yes, the Data Compare Engine is going to read and compare the whole table from both databases. One way to prevent this may be to use a WHERE CLAUSE on the individual TableMappings. This will get Data Compare to pull back less data.

For instance, if you run your comparison every day, and you're sure that you have synchronized everything up until the current day, there is no need to pull back all of the data and compare it again. You can pull back only data where the date stamp has been changed since yesterday using the where clause.

Hopefully this helps!

JeffSchumacher · June 26, 2008 2:44PM

Thank you for the quick reply.

We are using a selection delegate on the source database based on the timestamp, and that works perfectly, only returning the records which were modified in the specific time period.

What I'm trying to avoid is having SQL Toolkit request all of the records for the destination table for the comparison process (we're looking at doing this once an hour, and these tables have millions of rows).

My thoughts on how to do this:
To be able to effectively provide a selection delegate on the destination database, I would have a provide a list of primary key's for the records from the source database which met the source delegate criteria. I cannot use the timestamp in the destination selection delegate, because that value changes and would exclude records from the destination comparison set which would need to be updated. By providing a list of primary key's, SQL Toolkit would only retrieve those records and could take the following actions:
1) Compare a row which exists in both the source and destination
2) Mark rows for insert which do not exist in the destination result set

Being that I require the primary key's to be the same between the source and destination, it seems logical that we should be able to request only rows matching the primary keys from the source.

I don't see a way to do this without making an initial query to the source database before the comparison takes place; and the where clause would quickly reach it's limitation of 65,535 constants. Would SQL toolkit break the request up into multiple queries based on this limitation? It doesn't seem like it due to the format of the selection delegate.

What I don't know is how to effectively and efficiently accomplish this using SQL Toolkit. If selecting each table in it's entirety is the only way SQL Toolkit can make this work, then we'll have to accept that for now and realize that this solution won't scale when our data increases in size.

I hope I'm not being confusing, and thank you for your assistance.

Robert C · June 26, 2008 2:52PM

I don't suppose you're able to add an extra column on to the table(s) in question, are you?

If so, you could run a pre-compare step which rather than building up a where clause, fills in the values of a bit column on the table marking them as to be included / excluded, at which point your where clause becomes quite nice and simple :-). You could exclude that column from the comparison, of course.

It feels a bit nasty, and obviously I realise you might not be able to do this, but it would mean the comparison would only pull back the rows you want to include...

Cheers,
Rob

Robert C · June 26, 2008 2:56PM

Or another thought... if you can express your selection condition in TSQL, what about creating an updateable view based on the table, then performing the comparison on that? Again, it might not be perfect, and again, it requires fiddling with the database schema, which isn't ideal I realise.

Rob

JeffSchumacher · June 26, 2008 3:45PM

Robert,

That's an interesting idea. I would have to measure the performance of loading several million rows into memory vs. performing additional updates before the comparison and after the comparison to reset the inclusion bit. It's a bit hacky, but it might work.

Thanks for your help!

Comparing large amounts of data

Comments

Product Learning

Community Forums

Events & Friends

Simple Talk