Teradata professionals inevitably encounter the detrimental effects of data skew on query performance.
Yet I suspect only a handful of us know how large that impact actually is.
To find out, I gathered the crucial data points and built a test scenario based on precise measurements.
I prefer CPU utilization and disk I/O over run times, because these metrics consistently reflect what actually happened. They are easy to obtain from DBC.DBQLOGTBL, unless the database administrators have restricted access to it.
I tested both SET and MULTISET tables to broaden the scenario. My reasoning for including this table-level attribute was the following:
Since a SET table must perform a duplicate row check on every insert, shouldn't skew make things even worse by piling that extra work onto the few AMPs that hold most of the rows?
The configuration of my test environment is as follows:
CREATE SET TABLE TABLE_NOT_SKEWED (A INTEGER, B INTEGER) PRIMARY INDEX (A) ;
CREATE MULTISET TABLE TABLE_NOT_SKEWED_MULTISET (A INTEGER, B INTEGER) PRIMARY INDEX (A) ;
CREATE SET TABLE TABLE_SKEWED (A INTEGER, B INTEGER) PRIMARY INDEX (B) ;
CREATE MULTISET TABLE TABLE_SKEWED_MULTISET (A INTEGER, B INTEGER) PRIMARY INDEX (B) ;
I created a skewed and a non-skewed table, each as both SET and MULTISET, and populated them with data in which column A is nearly unique and column B contains only a handful of distinct values.
INSERT INTO TABLE_NOT_SKEWED SELECT RANDOM(1,1000000), RANDOM(1,5) FROM SYS_CALENDAR.CALENDAR;
INSERT INTO TABLE_NOT_SKEWED_MULTISET SELECT RANDOM(1,1000000), RANDOM(1,5) FROM SYS_CALENDAR.CALENDAR;
INSERT INTO TABLE_SKEWED SELECT RANDOM(1,1000000), RANDOM(1,5) FROM SYS_CALENDAR.CALENDAR;
INSERT INTO TABLE_SKEWED_MULTISET SELECT RANDOM(1,1000000), RANDOM(1,5) FROM SYS_CALENDAR.CALENDAR;
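As a sanity check on the setup (my own addition, not part of the original measurements), the per-AMP row distribution can be inspected with Teradata's hashing functions. For the skewed tables, almost all rows should land on a handful of AMPs:

```sql
-- Rows per AMP for the skewed table: the PI column B has only five
-- distinct values, so at most five AMPs can receive any rows at all.
SELECT
    HASHAMP(HASHBUCKET(HASHROW(B))) AS amp_no,
    COUNT(*)                        AS row_count
FROM TABLE_SKEWED
GROUP BY 1
ORDER BY 2 DESC;
```

Running the same query with column A against TABLE_NOT_SKEWED should show nearly even counts across all AMPs.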
After the inserts completed, I examined DBC.DBQLOGTBL:
SELECT QUERYTEXT, MAXAMPCPUTIME, MINAMPCPUTIME, MAXAMPIO, MINAMPIO, TOTALIOCOUNT, AMPCPUTIME FROM DBC.DBQLOGTBL WHERE SESSIONID = ; — my SQL Assistant session
Below are the results of this test, followed by my observations:
| Query | MaxAMPCPUTime | MinAMPCPUTime | MaxAMPIO | MinAMPIO | TotalIOCount | AMPCPUTime |
|---|---|---|---|---|---|---|
| INSERT INTO TABLE_NOT_SKEWED… | 0,01 | 0 | 41 | 39 | 3.619 | 0,33 |
| INSERT INTO TABLE_NOT_SKEWED_MULTISET… | 0,02 | 0 | 38 | 40 | 3.534 | 0,44 |
| INSERT INTO TABLE_SKEWED… | 8,59 | 0 | 41.536 | 13 | 204.651 | 42,44 |
| INSERT INTO TABLE_SKEWED_MULTISET… | 0,03 | 0 | 86 | 11 | 1.350 | 0,33 |
Inserting into the skewed SET table consumed roughly 130 times the total CPU (42,44 vs. 0,33 CPU seconds) compared to the non-skewed table. This means the database system needs far longer to load the skewed table than the non-skewed one.
The busiest AMP had to perform over 3000 times as many disk accesses as the least busy one (41.536 vs. 13 I/Os), and the CPU seconds show a similar imbalance (8,59 vs. 0).
Multiset tables require fewer disk accesses than their set table counterparts because the expensive duplicate row check is omitted.
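The duplicate row check also changes what ends up in the table. As a small side demonstration (using my own demo tables, not part of the test above): with INSERT … SELECT, a SET table silently discards exact duplicate rows, while a MULTISET table keeps them:

```sql
CREATE SET TABLE DEMO_SET (A INTEGER, B INTEGER) PRIMARY INDEX (A);
CREATE MULTISET TABLE DEMO_MULTISET (A INTEGER, B INTEGER) PRIMARY INDEX (A);

-- Insert ten identical rows into each table.
INSERT INTO DEMO_SET      SELECT 1, 1 FROM SYS_CALENDAR.CALENDAR SAMPLE 10;
INSERT INTO DEMO_MULTISET SELECT 1, 1 FROM SYS_CALENDAR.CALENDAR SAMPLE 10;

SELECT COUNT(*) FROM DEMO_SET;       -- 1: duplicates silently discarded
SELECT COUNT(*) FROM DEMO_MULTISET;  -- 10: duplicates kept
```

Note that a single-row INSERT VALUES of a duplicate into a SET table fails with an error instead; only set operations discard duplicates silently.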
The skewed MULTISET table needs less than half the disk accesses of the unskewed tables (1.350 vs. roughly 3.500 I/Os). However, this does not mean it is superior.
Most rows probably end up adjacent on a few AMPs, which reduces disk accesses through memory caching and more efficient block reads.
Unfortunately, only a few AMPs are left to handle most of the rows, which creates a bottleneck. Do not be misled by these figures: the insert is still significantly slower, because almost all the work is done by one or a few AMPs.
Use MULTISET tables where possible: they do not remove the skew itself, but they avoid the duplicate row check that makes skew so expensive on SET tables.
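To find candidates worth fixing, the skew of existing tables can be estimated from the per-AMP space usage in DBC.TableSizeV. A sketch (the 50% threshold is my own arbitrary choice):

```sql
-- Skew per table: 0% = perfectly even distribution across AMPs,
-- values near 100% mean almost all data sits on a single AMP.
-- NULLIFZERO guards against empty tables.
SELECT
    DatabaseName,
    TableName,
    CAST(100 * (1 - AVG(CurrentPerm) / NULLIFZERO(MAX(CurrentPerm)))
         AS DECIMAL(5,2)) AS skew_pct
FROM DBC.TableSizeV
GROUP BY DatabaseName, TableName
HAVING skew_pct > 50
ORDER BY skew_pct DESC;
```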
Hello, this is a very interesting approach, but in my opinion you have to add two or three more things to make the test complete.
What if my table already contains data? How does this impact the CPU/IO for a MULTISET or SET table?
What if I add more columns to the PI of my SET table in order to make it more selective (fewer rows per value to check)?
What if I add a USI to my SET table?
Furthermore, a SET table always has some impact on CPU/IO consumption, but if a UPI or USI is defined, the behavior changes: the bottleneck caused by the skewed data remains in both cases, but Teradata will recognize the unique values to insert, and this is controlled only by the PI, the USI, and statistics.
Finally, would it be better to say: the more rows a skewed table holds, the worse the impact of the duplicate row check?