High level data comparison
rgribble
Posts: 88
I am looking for a way to compare two copies of a database that are supposed to be "in sync" and get a high-level yes/no status for each table: is it identical in both databases or not?
In other words, I want to know whether a table is different, but I don't yet care how it is different. I'm hoping this would make the initial comparison quicker, and I could then choose to compare only the tables that differ.
This is easy to do using SELECT CHECKSUM_AGG( BINARY_CHECKSUM(*) ) FROM ..., but as you probably know, the built-in CHECKSUM routines in SQL Server are a very weak hash and can produce identical checksums for completely different data. The only other way I have found is the HashBytes() function, but it only takes a string as input, so you have to cast and concatenate all the fields in a table together... and that still only gives you a checksum for each row; you then need to checksum all the checksums, e.g. with a custom aggregate function, to get an overall MD5/SHA-1 value for the table.
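For reference, a minimal sketch of both approaches; dbo.MyTable, Col1, and Col2 are hypothetical names, and a real query would need to handle NULLs, delimiters, and non-string types with care:

```sql
-- Fast but weak: built-in aggregate checksum over every row (collision-prone)
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) AS TableChecksum
FROM dbo.MyTable;

-- Stronger per-row hash via HashBytes: each column must be cast to a
-- string and concatenated, with NULLs and a delimiter handled explicitly
SELECT HASHBYTES('SHA1',
           ISNULL(CAST(Col1 AS varchar(20)), '') + '|' +
           ISNULL(Col2, '')) AS RowHash
FROM dbo.MyTable;
```

An overall table hash still requires rolling the per-row hashes up, e.g. with a custom (CLR) aggregate, since SQL Server has no built-in cryptographic checksum aggregate.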
I was wondering whether anything in the engine of SQL Data Compare could assist with this, or whether you guys at RedGate have any particular tips or hints for me.
And going forward, if it would reduce processing/comparison time, it'd be great to have a project option that doesn't track or display each data item afterwards and just tells me which tables are not identical... with a further option to then do a full comparison of those tables.
I wonder whether the internals of Data Compare already do this, e.g. make a first pass over each row to generate a checksum, and only bother actually retrieving/storing all the values if the checksums differ.
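The two-pass idea above could be sketched in plain T-SQL like this (DbA, DbB, and dbo.MyTable are placeholder names, and this assumes both databases are reachable from one connection):

```sql
-- First pass: one cheap aggregate checksum per table in each database.
-- Only tables whose two values differ need the expensive row-by-row compare.
SELECT
    (SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM DbA.dbo.MyTable) AS ChecksumA,
    (SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM DbB.dbo.MyTable) AS ChecksumB;
```

Because of the collision weakness already noted, equal checksums would not guarantee identical tables; the first pass can only safely rule tables *in* for a full comparison, not conclusively rule them out.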
Comments
This uses the built-in SQL Server implementation of CHECKSUM. We have amended the embedded help to say:
"Note that the built in SQL Server CHECKSUM is used. SQL Server CHECKSUM can occasionally fail to return different checksum values when the data sources differ. For more information, refer to your SQL Server documentation."
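As an illustration of that weakness: CHECKSUM follows the collation of its input, so under a typical case-insensitive default collation two different strings can produce the same value:

```sql
-- Under a case-insensitive collation these two distinct strings
-- return the same checksum, so the difference goes undetected
SELECT CHECKSUM('checksum me') AS A, CHECKSUM('CHECKSUM ME') AS B;
```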
I have logged a feature request in our tracking system for the ability to perform a checksum comparison without performing a full comparison.