The Impact of Skewing on Teradata demonstrated
Anyone working with Teradata experiences sooner or later how the Impact of Skewing on Teradata impairs query run times.
However, I assume only a few of us know exactly the order of magnitude of influence we have to expect.
Let me, therefore, find some key figures which are helpful in getting the big picture. That is why I decided to create a test scenario based on absolute measures.
I am a big fan of CPU usage and disk accesses. In contrast, to run times, these numbers always tell the truth. Further, it is easy to get this information from DBC.DBQLOGTBL, if you are not in the unfavorable situation that database administrators restricted DBC table access.
In order to enlarge the test scenario a bit, I tested SET and MULTISET tables. The idea behind adding this table level attribute was the following:
If skewing alone already affected performance, shouldn’t a SET table in need for duplicate row checks add even more workload to the poor AMPs being a victim of the skewing?
My test setup looks like this:
CREATE SET TABLE TABLE_NOT_SKEWED (A INTEGER, B INTEGER) PRIMARY INDEX (A) ;
CREATE MULTISET TABLE TABLE_NOT_SKEWED_MULTISET (A INTEGER, B INTEGER) PRIMARY INDEX (A) ;
CREATE SET TABLE TABLE_SKEWED (A INTEGER, B INTEGER) PRIMARY INDEX (B) ;
CREATE MULTISET TABLE TABLE_SKEWED_MULTISET (A INTEGER, B INTEGER) PRIMARY INDEX (B) ;
As you can see, I created a skewed table and an unskewed table, each of them as SET and MULTISET. I added some data to each of these tables. Column A is almost unique, column B has only few distinct values:
INSERT INTO TABLE_NOT_SKEWED SELECT RANDOM(1,1000000), RANDOM(1,5) FROM SYS_CALENDAR.CALENDAR;
INSERT INTO TABLE_NOT_SKEWED_MULTISET SELECT RANDOM(1,1000000), RANDOM(1,5) FROM SYS_CALENDAR.CALENDAR;
INSERT INTO TABLE_SKEWED SELECT RANDOM(1,1000000), RANDOM(1,5) FROM SYS_CALENDAR.CALENDAR;
INSERT INTO TABLE_SKEWED_MULTISET SELECT RANDOM(1,1000000), RANDOM(1,5) FROM SYS_CALENDAR.CALENDAR;
Finally, I analysed DBC.DBQLOGTBL:
SELECT QUERYTEXT, MAXAMPCPUTIME, MINAMPCPUTIME, MAXAMPIO, MINAMPIO, TOTALIOCOUNT, AMPCPUTIME FROM DBC.DBQLOGTBL WHERE SESSIONID = ; — my SQL Assistant session
Below you can see the result of this test. Here are my observations:
|INSERT INTO TABLE_NOT_SKEWED…||0,01||0||41||39||3.619||0,33|
|INSERT INTO TABLE_NOT_SKEWED_MULTISET…||0,02||0||38||40||3.534||0,44|
|INSERT INTO TABLE_SKEWED…||8,59||0||41.536||13||204.651||42,44|
|INSERT INTO TABLE_SKEWED_MULTISET…||0,03||0||86||11||1.350||0,33|
Conclusion: Use multiset tables when possible. The higher the table skew is, the worse is the impact of the duplicate row check.