Anyone working with Teradata experiences sooner or later how the Impact of Skewing on Teradata impairs query run times.
However, I assume only a few of us know exactly the order of magnitude of influence we have to expect.
Therefore, let me find some key figures to help get the big picture. I decided to create a test scenario based on absolute measures.
I am a big fan of CPU usage and disk access. In contrast, to run times, these numbers always tell the truth. Further, it is easy to get this information from DBC.DBQLOGTBL, if you are not in the unfavorable situation that database administrators restricted DBC table access.
I tested SET and MULTISET tables to enlarge the test scenario a bit. The idea behind adding this table-level attribute was the following:
If skewing alone already affected performance, shouldn’t a SET table need duplicate row checks to add even more workload to the poor AMPs being victims of the skewing?
My test setup looks like this:
CREATE SET TABLE TABLE_NOT_SKEWED (A INTEGER, B INTEGER) PRIMARY INDEX (A) ;
CREATE MULTISET TABLE TABLE_NOT_SKEWED_MULTISET (A INTEGER, B INTEGER) PRIMARY INDEX (A) ;
CREATE SET TABLE TABLE_SKEWED (A INTEGER, B INTEGER) PRIMARY INDEX (B) ;
CREATE MULTISET TABLE TABLE_SKEWED_MULTISET (A INTEGER, B INTEGER) PRIMARY INDEX (B) ;
As you can see, I created a skewed table and an unskewed table, each of them as SET and MULTISET. I added some data to each of these tables. Column A is almost unique. Column B has only a few distinct values:
INSERT INTO TABLE_NOT_SKEWED SELECT RANDOM(1,1000000), RANDOM(1,5) FROM SYS_CALENDAR.CALENDAR;
INSERT INTO TABLE_NOT_SKEWED_MULTISET SELECT RANDOM(1,1000000), RANDOM(1,5) FROM SYS_CALENDAR.CALENDAR;
INSERT INTO TABLE_SKEWED SELECT RANDOM(1,1000000), RANDOM(1,5) FROM SYS_CALENDAR.CALENDAR;
INSERT INTO TABLE_SKEWED_MULTISET SELECT RANDOM(1,1000000), RANDOM(1,5) FROM SYS_CALENDAR.CALENDAR;
Finally, I analyzed DBC.DBQLOGTBL:
SELECT QUERYTEXT, MAXAMPCPUTIME, MINAMPCPUTIME, MAXAMPIO, MINAMPIO, TOTALIOCOUNT, AND AMPCPUTIME FROM DBC.DBQLOGTBL WHERE SESSIONID = ; — my SQL Assistant session
Below you can see the result of this test. Here are my observations:
|INSERT INTO TABLE_NOT_SKEWED…||0,01||0||41||39||3.619||0,33|
|INSERT INTO TABLE_NOT_SKEWED_MULTISET…||0,02||0||38||40||3.534||0,44|
|INSERT INTO TABLE_SKEWED…||8,59||0||41.536||13||204.651||42,44|
|INSERT INTO TABLE_SKEWED_MULTISET…||0,03||0||86||11||1.350||0,33|
Conclusion: Use multiset tables when possible. The higher the table skew is, the worse is the impact of the duplicate row check.