Teradata professionals inevitably encounter the detrimental effects of data skew on query performance.
Yet, I believe only a handful of us know the precise extent of impact we should anticipate.
I sought out crucial data points and created a test scenario utilizing precise measurements to gain a comprehensive understanding.
I am an enthusiast of CPU utilization and disk activity, as opposed to run times, as these metrics consistently reflect accurate information. Obtaining this data is straightforward through DBC.DBQLOGTBL, unless restricted by database administrators in adverse scenarios.
I conducted tests on both SET and MULTISET tables to broaden the test scenario. The rationale behind incorporating this attribute at the table level was to:
Considering the impact of skewing on performance, wouldn’t it be necessary to impose duplicate row checks on a SET table, thereby increasing the workload on the AMPs affected by skewing?
The configuration of my test environment is as follows:
CREATE SET TABLE TABLE_NOT_SKEWED (A INTEGER, B INTEGER) PRIMARY INDEX (A) ;
CREATE MULTISET TABLE TABLE_NOT_SKEWED_MULTISET (A INTEGER, B INTEGER) PRIMARY INDEX (A) ;
CREATE SET TABLE TABLE_SKEWED (A INTEGER, B INTEGER) PRIMARY INDEX (B) ;
CREATE MULTISET TABLE TABLE_SKEWED_MULTISET (A INTEGER, B INTEGER) PRIMARY INDEX (B) ;
I generated both a skewed and unskewed table as both SET and MULTISET. I populated them with data, wherein Column A is nearly unique, and Column B contains only a handful of distinct values.
INSERT INTO TABLE_NOT_SKEWED SELECT RANDOM(1,1000000), RANDOM(1,5) FROM SYS_CALENDAR.CALENDAR;
INSERT INTO TABLE_NOT_SKEWED_MULTISET SELECT RANDOM(1,1000000), RANDOM(1,5) FROM SYS_CALENDAR.CALENDAR;
INSERT INTO TABLE_SKEWED SELECT RANDOM(1,1000000), RANDOM(1,5) FROM SYS_CALENDAR.CALENDAR;
INSERT INTO TABLE_SKEWED_MULTISET SELECT RANDOM(1,1000000), RANDOM(1,5) FROM SYS_CALENDAR.CALENDAR;
After careful analysis, I examined DBC.DBQLOGTBL:
SELECT QUERYTEXT, MAXAMPCPUTIME, MINAMPCPUTIME, MAXAMPIO, MINAMPIO, TOTALIOCOUNT, AND AMPCPUTIME FROM DBC.DBQLOGTBL WHERE SESSIONID = ; — my SQL Assistant session
Below you can see the result of this test. Here are my observations:
QueryText | MaxCPU | MinCPU | MaxIO | MinIO | TotalIO | AMPCPU |
INSERT INTO TABLE_NOT_SKEWED… | 0,01 | 0 | 41 | 39 | 3.619 | 0,33 |
INSERT INTO TABLE_NOT_SKEWED_MULTISET… | 0,02 | 0 | 38 | 40 | 3.534 | 0,44 |
INSERT INTO TABLE_SKEWED… | 8,59 | 0 | 41.536 | 13 | 204.651 | 42,44 |
INSERT INTO TABLE_SKEWED_MULTISET… | 0,03 | 0 | 86 | 11 | 1.350 | 0,33 |
SET TABLES
compared to the non-skewed table. This means the database system takes significantly longer to retrieve data from the skewed table than from the non-skewed table.
The most active AMP must perform 4000 times as many disk accesses as the least busy AMP while utilizing similar CPU seconds.
MULTISET TABLES
Multiset tables require fewer disk accesses than their set table counterparts because the expensive duplicate row check is omitted.
The skewed multiset table requires fewer disk accesses than the unskewed table by less than 50%. However, this does not necessarily indicate that it is superior.
Most rows are likely adjacent on a small number of AMPs, which reduces disk access by utilizing memory caching and optimizing block reading methods.
Unfortunately, only a limited number of AMPs are assigned to manage numerous rows, resulting in a bottleneck. Therefore, it would be unwise to be deceived by these figures as inserting into the table remains significantly slower than when undertaken by a single or a select few AMPs.
Use multiset tables when available to minimize table skew and reduce the negative effects on duplicate row checks.
Hello, it is a very interesting approach but in my opinion you have to add two or three more things to have a complete test.
What if I have already data on my table, how this impact the CPU / io for a multiset or set table?
What if I will add more columns on the PI of my Set table in order to increase the information on the histogram(less group of values to check)?
What if I add a USI on my Set table?
Furthermore, always I have an impact on CPU/io consumption with a Set table, but if I have a UPI or USI defined, the behavior is the same since the bottleneck is not the skewed Data – this problem I will have it on both cases – but TD will recognize the unique values to insert, and this is controlled only from PI, USI, and stats.
Ending, should it better to say, the higher amount of rows that I have on a skew table, the worse is the impact of the duplicate row check?
Thank you.