The Impact of Skewing on Teradata demonstrated

Anyone working with Teradata sooner or later experiences how skewing impairs query run times.

However, I assume only a few of us know the order of magnitude of the impact we have to expect.

Let me, therefore, gather some key figures that help in getting the big picture. That is why I decided to create a test scenario based on absolute measures.

I am a big fan of CPU usage and disk accesses. In contrast to run times, these numbers always tell the truth. Further, it is easy to get this information from DBC.DBQLOGTBL, provided you are not in the unfavorable situation that your database administrators have restricted access to the DBC tables.

To enlarge the test scenario a bit, I tested both SET and MULTISET tables. The idea behind adding this table-level attribute was the following:

If skewing alone already affects performance, shouldn’t a SET table, which needs duplicate row checks, add even more workload to the poor AMPs that are victims of the skew?
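As a side note, the duplicate row check is not only a performance topic; it also changes insert semantics. A minimal sketch (the table name DEMO_SET is hypothetical and not part of the test setup below): in Teradata session mode, an INSERT … SELECT into a SET table silently discards duplicate rows, while a MULTISET table would keep them.

```sql
-- Hypothetical demo table, not part of the test setup below
CREATE SET TABLE DEMO_SET (A INTEGER, B INTEGER) PRIMARY INDEX (A);

INSERT INTO DEMO_SET VALUES (1, 1);

-- Re-inserting the same row via INSERT ... SELECT: on a SET table the
-- duplicate is silently discarded (in Teradata session mode), so the
-- table still contains a single row afterwards.
INSERT INTO DEMO_SET SELECT A, B FROM DEMO_SET;
```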

My test setup looks like this:

CREATE SET TABLE TABLE_NOT_SKEWED (A INTEGER, B INTEGER) PRIMARY INDEX (A) ;

CREATE MULTISET TABLE TABLE_NOT_SKEWED_MULTISET (A INTEGER, B INTEGER) PRIMARY INDEX (A) ;

CREATE SET TABLE TABLE_SKEWED (A INTEGER, B INTEGER) PRIMARY INDEX (B) ;

CREATE MULTISET TABLE TABLE_SKEWED_MULTISET (A INTEGER, B INTEGER) PRIMARY INDEX (B) ;

As you can see, I created a skewed table and an unskewed table, each of them as a SET and a MULTISET version. Then I added some data to each of these tables. Column A is almost unique; column B has only a few distinct values:

INSERT INTO TABLE_NOT_SKEWED  SELECT RANDOM(1,1000000), RANDOM(1,5) FROM  SYS_CALENDAR.CALENDAR;

INSERT INTO TABLE_NOT_SKEWED_MULTISET  SELECT RANDOM(1,1000000), RANDOM(1,5) FROM  SYS_CALENDAR.CALENDAR;

INSERT INTO TABLE_SKEWED  SELECT RANDOM(1,1000000), RANDOM(1,5) FROM  SYS_CALENDAR.CALENDAR;

INSERT INTO TABLE_SKEWED_MULTISET  SELECT RANDOM(1,1000000), RANDOM(1,5) FROM  SYS_CALENDAR.CALENDAR;
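Before looking at DBQL, it is worth verifying where the rows actually ended up. A quick sketch using Teradata's hash functions (HASHROW, HASHBUCKET, HASHAMP) shows the row count per AMP for the skewed primary index; with only five distinct values in column B, at most five AMPs should receive any rows at all:

```sql
-- Row distribution per AMP for the skewed primary index column B
SELECT HASHAMP(HASHBUCKET(HASHROW(B))) AS TheAmp,
       COUNT(*) AS NumRows
FROM TABLE_SKEWED
GROUP BY 1
ORDER BY NumRows DESC;
```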

Finally, I analysed DBC.DBQLOGTBL:

SELECT QUERYTEXT, MAXAMPCPUTIME, MINAMPCPUTIME, MAXAMPIO, MINAMPIO, TOTALIOCOUNT, AMPCPUTIME
FROM DBC.DBQLOGTBL
WHERE SESSIONID = ; -- my SQL Assistant session

Below you can see the result of this test. Here are my observations:

QueryText                                 MaxCPU  MinCPU   MaxIO  MinIO  TotalIO  AMPCPU
INSERT INTO TABLE_NOT_SKEWED…               0.01       0      41     39    3,619    0.33
INSERT INTO TABLE_NOT_SKEWED_MULTISET…      0.02       0      40     38    3,534    0.44
INSERT INTO TABLE_SKEWED…                   8.59       0  41,536     13  204,651   42.44
INSERT INTO TABLE_SKEWED_MULTISET…          0.03       0      86     11    1,350    0.33

 

SET TABLES:

- The disk accesses for the skewed table are more than 50 times higher (204,651 vs. 3,619 total IOs).
- The busiest AMP has more than 3,000 times the disk accesses of the AMP with the least work (41,536 vs. 13), with a comparable ratio for CPU seconds!

 

MULTISET TABLES:

- Multiset tables need fewer disk accesses than their SET table equivalents, because the costly duplicate row check is omitted.
- Very interesting: the skewed multiset table needs less than 50% of the disk accesses of the unskewed table (1,350 vs. 3,534). Does this mean it is better? I don’t think so.
- The reason is probably that most rows are physically co-located next to each other on a small set of AMPs (disks). I assume that caching of rows in memory and larger block reads can be applied aggressively, reducing disk accesses significantly.
- The bad thing is: only a few AMPs have to manage many rows, which leads to a bottleneck. So don’t be fooled by these numbers: inserting into the table is still much slower, because the work is done by just one AMP (or a few of them).

Conclusion: Use multiset tables whenever possible. The higher the table skew, the worse the impact of the duplicate row check becomes.
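If you want to spot such candidates in your own system, table skew can be estimated directly from the per-AMP space figures in DBC.TableSizeV. A hedged sketch of one common way to express it (0% means a perfectly even distribution, values close to 100% mean almost all data sits on one AMP):

```sql
-- Skew estimate per table: how far the average AMP's CurrentPerm
-- falls below the busiest AMP's CurrentPerm
SELECT DatabaseName,
       TableName,
       CAST((1 - AVG(CurrentPerm) / NULLIF(MAX(CurrentPerm), 0)) * 100
            AS DECIMAL(5,2)) AS SkewFactorPct
FROM DBC.TableSizeV
WHERE DatabaseName = USER
GROUP BY 1, 2
ORDER BY SkewFactorPct DESC;
```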

Written by Roland Wenzlofsky on April 2, 2014.

2 COMMENTS

  1. Hello, it is a very interesting approach, but in my opinion you have to add two or three more things to have a complete test.

     What if I already have data in my table? How does this impact the CPU/IO for a MULTISET or SET table?
     What if I add more columns to the PI of my SET table in order to increase the information in the histogram (fewer groups of values to check)?
     What if I add a USI to my SET table?

     Furthermore, I always have an impact on CPU/IO consumption with a SET table, but if a UPI or USI is defined, the behaviour is the same, since the bottleneck is not the skewed data (I will have that problem in both cases); Teradata will recognise the unique values to insert, and this is controlled only by the PI, the USI, and statistics.

     Finally, wouldn't it be better to say: the higher the number of rows in a skewed table, the worse the impact of the duplicate row check?

     Thank you.
