The Impact of Skewing on Teradata demonstrated

Teradata professionals inevitably encounter the detrimental effects of data skew on query performance.

Yet, I believe only a handful of us know the precise extent of impact we should anticipate.

I sought out crucial data points and created a test scenario utilizing precise measurements to gain a comprehensive understanding.

I am an enthusiast of CPU utilization and disk activity, as opposed to run times, as these metrics consistently reflect accurate information. Obtaining this data is straightforward through DBC.DBQLOGTBL, unless restricted by database administrators in adverse scenarios.

I conducted tests on both SET and MULTISET tables to broaden the test scenario. The rationale behind incorporating this attribute at the table level was to:

Considering the impact of skewing on performance, wouldn’t it be necessary to impose duplicate row checks on a SET table, thereby increasing the workload on the AMPs affected by skewing?

The configuration of my test environment is as follows:

CREATE SET TABLE TABLE_NOT_SKEWED (A INTEGER, B INTEGER) PRIMARY INDEX (A) ;

CREATE MULTISET TABLE TABLE_NOT_SKEWED_MULTISET (A INTEGER, B INTEGER) PRIMARY INDEX (A) ;

CREATE SET TABLE TABLE_SKEWED (A INTEGER, B INTEGER) PRIMARY INDEX (B) ;

CREATE MULTISET TABLE TABLE_SKEWED_MULTISET (A INTEGER, B INTEGER) PRIMARY INDEX (B) ;

I generated both a skewed and unskewed table as both SET and MULTISET. I populated them with data, wherein Column A is nearly unique, and Column B contains only a handful of distinct values.

INSERT INTO TABLE_NOT_SKEWED  SELECT RANDOM(1,1000000), RANDOM(1,5) FROM  SYS_CALENDAR.CALENDAR;

INSERT INTO TABLE_NOT_SKEWED_MULTISET  SELECT RANDOM(1,1000000), RANDOM(1,5) FROM  SYS_CALENDAR.CALENDAR;

INSERT INTO TABLE_SKEWED  SELECT RANDOM(1,1000000), RANDOM(1,5) FROM  SYS_CALENDAR.CALENDAR;

INSERT INTO TABLE_SKEWED_MULTISET  SELECT RANDOM(1,1000000), RANDOM(1,5) FROM  SYS_CALENDAR.CALENDAR;

After careful analysis, I examined DBC.DBQLOGTBL:

SELECT   QUERYTEXT,  MAXAMPCPUTIME,  MINAMPCPUTIME,  MAXAMPIO,  MINAMPIO,  TOTALIOCOUNT, AMPCPUTIME  FROM DBC.DBQLOGTBL WHERE SESSIONID = ; — my SQL Assistant session

Below you can see the result of this test. Here are my observations:

QueryTextMaxCPUMinCPUMaxIOMinIOTotalIOAMPCPU
INSERT INTO TABLE_NOT_SKEWED…0,01041393.6190,33
INSERT INTO TABLE_NOT_SKEWED_MULTISET…0,02038403.5340,44
INSERT INTO TABLE_SKEWED…8,59041.53613204.65142,44
INSERT INTO TABLE_SKEWED_MULTISET…0,03086111.3500,33

SET TABLES

The skewed SET table requires vastly more disk accesses compared to the non-skewed table. This means the database system takes significantly longer to retrieve data from the skewed table than from the non-skewed table.

The most active AMP must perform 4000 times as many disk accesses as the least busy AMP while utilizing similar CPU seconds.

MULTISET TABLES

Multiset tables require fewer disk accesses than their set table counterparts because the expensive duplicate row check is omitted.

The skewed multiset table requires fewer disk accesses than the unskewed table by less than 50%. However, this does not necessarily indicate that it is superior.

Most rows are likely adjacent on a small number of AMPs, which reduces disk access by utilizing memory caching and optimizing block reading methods.

Unfortunately, only a limited number of AMPs are assigned to manage numerous rows, resulting in a bottleneck. Therefore, it would be unwise to be deceived by these figures as inserting into the table remains significantly slower than when undertaken by a single or a select few AMPs.

Use multiset tables when available to minimize table skew and reduce the negative effects on duplicate row checks.

Related Services

⚡ Need Help Optimizing Your Data Platform?

We cut data platform costs by 30–60% without hardware changes. 25+ years of hands-on tuning experience.

Explore Our Services →

📋 Considering a Move From Teradata?

Get a personalized migration roadmap in 2 minutes. We have migrated billions of rows from Teradata to Snowflake, Databricks, and more.

Free Migration Assessment →

📊 Data Platform Migration Survey

Help us map where the industry is heading. Results are public — see what others chose.

1. What is your current data platform?

2. Where are you migrating to (or evaluating)?

Migrating FROM
Migrating TO

Thanks for voting! Share this with your network.

Follow me on LinkedIn for daily insights on data warehousing and platform migrations.

Stay Ahead in Data Warehousing

Get expert insights on Teradata, Snowflake, BigQuery, Databricks, Microsoft Fabric, and modern data architecture — delivered to your inbox.

1 thought on “The Impact of Skewing on Teradata demonstrated”

  1. Hello, it is a very interesting approach but in my opinion you have to add two or three more things to have a complete test.

    What if I have already data on my table, how this impact the CPU / io for a multiset or set table?
    What if I will add more columns on the PI of my Set table in order to increase the information on the histogram(less group of values to check)?
    What if I add a USI on my Set table?

    Furthermore, always I have an impact on CPU/io consumption with a Set table, but if I have a UPI or USI defined, the behavior is the same since the bottleneck is not the skewed Data – this problem I will have it on both cases – but TD will recognize the unique values to insert, and this is controlled only from PI, USI, and stats.

    Ending, should it better to say, the higher amount of rows that I have on a skew table, the worse is the impact of the duplicate row check?

    Thank you.

    Reply

Leave a Comment

DWHPro

Expert network for enterprise data platforms. Senior consultants, project teams built for your challenge — across Teradata, Snowflake, Databricks, and more.

📍Vienna, Austria & Jacksonville, Florida

Quick Links
Services Team Teradata Book Blog Contact Us
Connect
LinkedIn → [email protected]
Newsletter

Join 4,000+ data professionals.
Weekly insights on Teradata, Snowflake & data architecture.