Anyone working with Teradata sooner or later experiences how skewing impairs query run times.
However, I assume only a few of us know the order of magnitude of the impact to expect.
Therefore, let me gather some key figures to help get the big picture. I decided to create a test scenario based on absolute measures.
I am a big fan of CPU usage and disk access as metrics. In contrast to run times, these numbers always tell the truth. Furthermore, it is easy to get this information from DBC.DBQLOGTBL, provided you are not in the unfavorable situation that your database administrators have restricted access to the DBC tables.
To broaden the test scenario a bit, I tested both SET and MULTISET tables. The idea behind adding this table-level attribute was the following: if skewing alone already hurts performance, shouldn't the duplicate row checks required by a SET table add even more workload to the poor AMPs that are victims of the skew?
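To make the duplicate row check tangible, here is a minimal sketch (Teradata session mode assumed; DEMO_SET is a hypothetical scratch table). An INSERT ... SELECT into a SET table silently discards full-row duplicates, but to do so, every incoming row must be compared against all existing rows with the same row hash:

CREATE SET TABLE DEMO_SET (A INTEGER, B INTEGER) PRIMARY INDEX (A);
INSERT INTO DEMO_SET SELECT 1, 1 FROM SYS_CALENDAR.CALENDAR; -- thousands of identical rows arrive on one AMP
SELECT COUNT(*) FROM DEMO_SET; -- expected result: 1; the duplicates were discarded by the check

All the comparison work for those identical rows lands on a single AMP, which is exactly the effect the test below measures at scale.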
My test setup looks like this:
CREATE SET TABLE TABLE_NOT_SKEWED (A INTEGER, B INTEGER) PRIMARY INDEX (A) ;
CREATE MULTISET TABLE TABLE_NOT_SKEWED_MULTISET (A INTEGER, B INTEGER) PRIMARY INDEX (A) ;
CREATE SET TABLE TABLE_SKEWED (A INTEGER, B INTEGER) PRIMARY INDEX (B) ;
CREATE MULTISET TABLE TABLE_SKEWED_MULTISET (A INTEGER, B INTEGER) PRIMARY INDEX (B) ;
As you can see, I created a skewed table and an unskewed table, each of them as SET and MULTISET. I added some data to each of these tables. Column A is almost unique. Column B has only a few distinct values:
INSERT INTO TABLE_NOT_SKEWED SELECT RANDOM(1,1000000), RANDOM(1,5) FROM SYS_CALENDAR.CALENDAR;
INSERT INTO TABLE_NOT_SKEWED_MULTISET SELECT RANDOM(1,1000000), RANDOM(1,5) FROM SYS_CALENDAR.CALENDAR;
INSERT INTO TABLE_SKEWED SELECT RANDOM(1,1000000), RANDOM(1,5) FROM SYS_CALENDAR.CALENDAR;
INSERT INTO TABLE_SKEWED_MULTISET SELECT RANDOM(1,1000000), RANDOM(1,5) FROM SYS_CALENDAR.CALENDAR;
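Before turning to DBQL, the skew itself can be verified. A small sketch using Teradata's hash functions to count rows per AMP for the skewed table:

SELECT HASHAMP(HASHBUCKET(HASHROW(B))) AS Amp_No, COUNT(*) AS Row_Cnt
FROM TABLE_SKEWED
GROUP BY 1
ORDER BY 2 DESC;

With only five distinct values in column B, all rows of TABLE_SKEWED end up on at most five AMPs.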
Finally, I analyzed DBC.DBQLOGTBL:
SELECT QUERYTEXT, MAXAMPCPUTIME, MINAMPCPUTIME, MAXAMPIO, MINAMPIO, TOTALIOCOUNT, AMPCPUTIME FROM DBC.DBQLOGTBL WHERE SESSIONID = <session id>; -- my SQL Assistant session
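As a side note, the same columns allow a rough CPU skew factor to be derived per query. A hedged sketch (HASHAMP() + 1 returns the number of AMPs on the system; a factor near 1 means good parallelism, higher means more skew):

SELECT QUERYTEXT,
       MAXAMPCPUTIME,
       AMPCPUTIME,
       CASE WHEN AMPCPUTIME > 0
            THEN (MAXAMPCPUTIME * (HASHAMP() + 1)) / AMPCPUTIME
            ELSE 0
       END AS CPU_SKEW_FACTOR
FROM DBC.DBQLOGTBL
WHERE SESSIONID = <session id>;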
Below you can see the results of this test, followed by my observations:
QueryText | MaxAmpCPU (s) | MinAmpCPU (s) | MaxAmpIO | MinAmpIO | TotalIO | AMPCPU (s)
INSERT INTO TABLE_NOT_SKEWED… | 0.01 | 0 | 41 | 39 | 3,619 | 0.33
INSERT INTO TABLE_NOT_SKEWED_MULTISET… | 0.02 | 0 | 40 | 38 | 3,534 | 0.44
INSERT INTO TABLE_SKEWED… | 8.59 | 0 | 41,536 | 13 | 204,651 | 42.44
INSERT INTO TABLE_SKEWED_MULTISET… | 0.03 | 0 | 86 | 11 | 1,350 | 0.33
SET tables: The insert into the unskewed SET table was cheap (0.33 CPU seconds, roughly 3,600 I/Os). The insert into the skewed SET table exploded to 42.44 CPU seconds and over 200,000 I/Os, with the busiest AMP alone accounting for 8.59 CPU seconds and 41,536 I/Os. With only five distinct primary index values, the duplicate row check has to compare each incoming row against ever more rows with the same row hash, and all of this work piles up on a handful of AMPs.
MULTISET tables: Skew alone made hardly any difference. The skewed MULTISET insert stayed at 0.33 CPU seconds and 1,350 I/Os, in the same ballpark as the unskewed inserts, because no duplicate row check takes place.
Conclusion: Use MULTISET tables whenever possible. The higher the table skew, the worse the impact of the duplicate row check.
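If an existing SET table has to be converted, keep in mind that on many releases the SET/MULTISET attribute cannot simply be altered in place on a populated table, so the usual route is to recreate it. A minimal sketch (the _NEW name is purely illustrative):

CREATE MULTISET TABLE TABLE_SKEWED_NEW AS TABLE_SKEWED WITH DATA;
DROP TABLE TABLE_SKEWED;
RENAME TABLE TABLE_SKEWED_NEW TO TABLE_SKEWED;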
Hello, this is a very interesting approach, but in my opinion you would have to add two or three more things to make the test complete.
What if there is already data in the table? How does this impact the CPU/I/O for a MULTISET or a SET table?
What if I add more columns to the PI of my SET table in order to increase the number of distinct values (fewer rows per hash value to check)?
What if I add a USI to my SET table?
Furthermore, there is always some CPU/I/O overhead with a SET table, but if a UPI or USI is defined, the behavior is the same for SET and MULTISET, since the bottleneck is then no longer the duplicate row check: the skewed data problem exists in both cases, but Teradata recognizes the unique values to insert, and this is controlled only by the PI, the USI, and statistics.
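A minimal sketch of that USI idea (assuming column A were truly unique; with a USI in place, Teradata enforces uniqueness via the index, and the full duplicate row check can be skipped):

CREATE UNIQUE INDEX (A) ON TABLE_SKEWED;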
Finally, would it be better to say: the higher the number of rows in a skewed table, the worse the impact of the duplicate row check?
Thank you.