A Teradata HASHROW Table Difference Screening Test

Have you ever waited too long for a table content difference check to return its result? Have you ever been forced to interrupt and postpone quality checks on large tables because your resource usage was too high the last time?

What if you had a simple screening test at hand that is capable of telling you which of a vast number of tables need further inspection and which are fine and can be left out of in-depth analysis? Even more, what if this test produced practically no false negatives, i.e. short of an extremely unlikely hash collision, no table that needs inspection would ever escape?

Such a test exists, and it requires nothing more than the Teradata HASHROW function!

Imagine the following situation:

A copy of a large table has been created. A little later, you wish to check whether a new kind of quality maintenance job has corrected the table content or not. No rows were added or removed, so a simple row count will not help. Assume there is no easy way around a full table content comparison between the before and after data. Imagine further that you have to perform this check over hundreds of tables.

The direct approach is to compare tables THE_COPY and THE_ORIGINAL against each other with plain SQL.

SELECT * FROM DWH.THE_COPY
MINUS
SELECT * FROM DWH.THE_ORIGINAL
;
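
One caveat is worth noting: MINUS compares sets and only in one direction. If a correction turned a row into an exact duplicate of another existing row, the statement above could still come back empty, so a watertight check also runs the reverse direction:

SELECT * FROM DWH.THE_ORIGINAL
MINUS
SELECT * FROM DWH.THE_COPY
;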

Alternatively, you can calculate the HASHROW() value for every row and compare the resulting hash value lists against each other:

SELECT HASHROW(column1, column2, …, columnN) FROM DWH.THE_COPY
MINUS
SELECT HASHROW(column1, column2, …, columnN) FROM DWH.THE_ORIGINAL
;
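
If even one hash value per row is more spool than you care to spend, the screen can be condensed to a single number per table. The following is a minimal sketch of that variant, not part of the test above: it maps every row hash to an integer with Teradata's HASHBUCKET function and sums the results (the cast to DECIMAL(38,0) prevents integer overflow on large tables). Two tables whose checksums differ definitely differ; equal checksums make equality merely likely, since this filter is even coarser than the full hash list.

-- one scalar checksum for the whole table
SELECT SUM(CAST(HASHBUCKET(HASHROW(column1, column2, …, columnN)) AS DECIMAL(38,0))) AS table_checksum
FROM DWH.THE_COPY
;

Run the same statement against DWH.THE_ORIGINAL and compare the two checksums.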

Whereas the direct approach shows you the deviating rows in full and is therefore more convenient for small tables or smaller row subsets, it can become costly or even prohibitively bulky for large tables. The advantages of the HASHROW() approach are, first, that we operate with only one value per row instead of the combination of all attributes together and, second, that we avoid decompressing and recopying the real table content across the AMPs before making the comparison.
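Since hand-writing the column lists for hundreds of tables is impractical, the HASHROW() statements can also be generated out of the data dictionary. The following is a rough sketch of that idea, not part of the test itself: it assumes the originals live in DWH and the copies under identical names in a hypothetical database DWH_COPY, and it uses XMLAGG (available on recent Teradata releases) to concatenate each table's columns from DBC.ColumnsV:

-- generates one screening statement per table in DWH
SELECT 'SELECT HASHROW(' ||
       TRIM(TRAILING ',' FROM (XMLAGG(TRIM(ColumnName) || ',' ORDER BY ColumnId) (VARCHAR(20000)))) ||
       ') FROM DWH_COPY.' || TRIM(TableName) ||
       ' MINUS SELECT HASHROW(' ||
       TRIM(TRAILING ',' FROM (XMLAGG(TRIM(ColumnName) || ',' ORDER BY ColumnId) (VARCHAR(20000)))) ||
       ') FROM DWH.' || TRIM(TableName) || ';'
FROM DBC.ColumnsV
WHERE DatabaseName = 'DWH'
GROUP BY TableName
;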

We tested our notion with a table and its copy of 120 GB each, with some compression already applied. Measuring the resource consumption of both comparison methods by defining query bands around them, we observed the following differences:

 

METHOD      TotalIOCount   AMPCPUTime   MaxCPUAmpNr   MinAmpIO    MaxAmpIO    MaxIOAmpNr   SpoolUsage
PLAIN SQL   201.755.716    70.641       8             2.238.616   2.247.648   10           1.786.095.099.904
HASHROW     17.184.333     33.249       19            190.144     192.249     13           115.833.006.592

 

As can be seen clearly, the HASHROW() approach used only a fraction of the resources that the plain SQL equivalent consumed. Note that we performed this test under ideal conditions: no other parallel activities, and a technical user privileged enough to cope with such a large spool space. We waited more than two hours for the first version to finish, while we had the result of the second one in half that time.
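
For completeness, this is how a query band can be placed around such a test (the key and value are purely illustrative); the consumption figures can then be pulled from the DBQL views filtered on that band:

SET QUERY_BAND = 'DiffTest=PLAIN_SQL;' FOR SESSION;
-- run the comparison here, then clear the band again
SET QUERY_BAND = NONE FOR SESSION;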

In summary, the HASHROW() approach we presented provides a parsimonious first separator between those tables that are almost certainly unchanged and those that are definitely affected by change. False negatives are possible, as the hashing algorithm might return the same value for two different combinations of values (a problem known as hash collision), but only very rarely so.
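
To put "very rarely" into perspective: HASHROW returns a 4-byte value, so under an idealized assumption of uniform hashing, two given distinct rows receive the same hash with a probability of about 2^-32, i.e. roughly 2.3 × 10^-10.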

 

Written by Paul Timar on May 20, 2014.