3 Best Ways to detect Teradata Skewing

Skewing is one of the biggest problems on a Teradata system because it has a negative effect on performance.

If a table has a high skew factor, not all AMPs are evenly used by queries on this table. This has a negative effect not only on the single query but on the entire system.

The following happens in detail:

Skew leads to poor parallel CPU efficiency for full table scans and bulk inserts. The AMP holding the most rows becomes a bottleneck: the remaining AMPs must wait for the slowest AMP.

Skew increases the number of I/Os for updates and inserts of skewed values. This is particularly harmful for SET tables in combination with a NUPI, unless the duplicate row checks are avoided by a unique secondary index.
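As an illustration, here is a minimal sketch of this effect (all names below are made up): on a SET table with a NUPI, a unique secondary index can take over the uniqueness check, so the expensive duplicate row check per row hash is no longer needed.

CREATE SET TABLE sales_transactions
(
    TRANSACTION_ID INTEGER NOT NULL,
    CUSTOMER_ID INTEGER NOT NULL,
    SALES_AMOUNT DECIMAL(18,2)
)
PRIMARY INDEX (CUSTOMER_ID)      -- NUPI: many rows can share a customer
,UNIQUE INDEX (TRANSACTION_ID);  -- USI enforces uniqueness via the index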

There are many reasons for skew. In this article, I will show you three ways to detect it.

 

1. Detect Table Skew – The poorly designed Physical Data Model (PDM)

The physical data model should be checked first as a possible cause of skew.

A badly chosen primary index that leads to skew has a negative effect on all queries that use the table. This is where you can gain the most.

I have seen cases where the daily load time of the data warehouse was halved just by changing the primary index of a few tables!
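Note that the primary index of a populated table cannot simply be altered; the usual approach is to recreate the table. A minimal sketch with made-up names:

CREATE TABLE the_table_new AS the_table
WITH DATA
PRIMARY INDEX (a_better_column);  -- the new, more evenly distributing PI

DROP TABLE the_table;
RENAME TABLE the_table_new TO the_table;
-- Afterwards, re-collect statistics and re-create any secondary indexes.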

The primary index should also be checked right at the beginning, as the physical data model is relatively stable and changes only slowly.

The following is a sample query that calculates the skew of a table:

SELECT
    TABLENAME,
    SUM(CURRENTPERM) AS CURRENTPERM,
    CAST((100 - (AVG(CURRENTPERM) / MAX(CURRENTPERM) * 100)) AS DECIMAL(5,2)) AS SKEWFACTOR_PERCENT
FROM DBC.TABLESIZE
WHERE DATABASENAME = 'the_database'
GROUP BY 1
ORDER BY 1;

The above query returns the permanently used disk space and the skew factor for each table.

Perfectly distributed tables have a skew factor of 0%. However, tables with only a few rows (lookup tables) are usually skewed because there are simply not enough rows to distribute them evenly across all AMPs (especially if the number of rows is smaller than the number of AMPs in the system). Exclude these tables from your analysis; a filtered variant of the query follows below.
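For example, a size filter can exclude them directly in the query; the 1 GB threshold below is an arbitrary assumption you should adapt to your system:

SELECT
    TABLENAME,
    SUM(CURRENTPERM) AS CURRENTPERM,
    CAST((100 - (AVG(CURRENTPERM) / MAX(CURRENTPERM) * 100)) AS DECIMAL(5,2)) AS SKEWFACTOR_PERCENT
FROM DBC.TABLESIZE
WHERE DATABASENAME = 'the_database'
GROUP BY 1
HAVING SUM(CURRENTPERM) > 1E9  -- skip tables smaller than about 1 GB
ORDER BY 3 DESC;               -- most skewed tables first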

2. Detect Teradata Skewing by analyzing the Joins

While the first point refers to static skew, dynamic skew can occur during query execution; it is caused by the uneven distribution of spool files. The principle of dynamic skew is simple:

Whenever a join takes place, the rows to be joined must reside on the same AMP. This often makes it necessary to copy a table completely to all AMPs or to choose a new primary index for the spool, which triggers a redistribution of the rows.

Unfortunately, this new primary index could turn out to be a bad choice if it distributes the rows unevenly among the AMPs.

Take a look at this sample query:

SELECT *
FROM CUSTOMER t01
INNER JOIN GENDER_LOOKUP t02
    ON t01.GENDER = t02.GENDER;

In our example, the primary index of table CUSTOMER is the column “customerid” and the primary index of table GENDER_LOOKUP is the column “gender”.

If the optimizer selects the column “gender” as the new primary index for the customer table, the generated spool is strongly skewed, since only 2 AMPs will receive rows (male/female). This query is certainly problematic.
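You can verify this in advance with EXPLAIN. The following is only a sketch; the exact plan wording varies between Teradata releases:

EXPLAIN
SELECT *
FROM CUSTOMER t01
INNER JOIN GENDER_LOOKUP t02
    ON t01.GENDER = t02.GENDER;

-- Watch for a retrieve step on CUSTOMER that is
-- "redistributed by the hash code of (CUSTOMER.GENDER)":
-- the spool then lands on as many AMPs as there are distinct gender values.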

If you have access to a real-time monitoring tool like Viewpoint, PMON, or DBMON, you can easily spot such skewing problems:

Pay attention to execution plan steps that have a high skew factor and take a lot of time.

3. Detect Teradata Skewing by analyzing Column Values

Dynamic skew as described above can be detected by analyzing the execution plan (the join columns) and knowing the contents of the columns.

The next type of skew is more dangerous because it is difficult to detect: the hidden skew.

This is the case when a few values are very common, while the table itself shows no skew.

Hidden skew is not recognized by our check query in point 1.

For hidden skew we need this query:

SELECT the_snaky_column, COUNT(*) FROM the_table GROUP BY 1 ORDER BY 2 DESC;

The above query shows the most common values at the top of the list. If the result set looks similar to this, we have a hidden skew problem:

'A', 1,000,000
'B', 500
'C', 480
'D', 480
...
'ZZZZZ', 1
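A variant that also shows each value's share of the table makes the dominance of the top values easier to judge (again a sketch with placeholder names):

SELECT
    the_snaky_column,
    COUNT(*) AS row_count,
    CAST(CAST(COUNT(*) AS FLOAT) * 100
         / SUM(CAST(COUNT(*) AS FLOAT)) OVER () AS DECIMAL(5,2)) AS PCT_OF_ROWS
FROM the_table
GROUP BY 1
ORDER BY 2 DESC;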

Real-time monitoring with Viewpoint, PMON or DBMON is another great way of detecting skewing problems:

Watch for steps in the execution plan that show an even distribution (no skew) but use only a few gigabytes of spool and take a long time, while the size of the spool used hardly changes.

In my experience, all skew problems can be identified in this way.

It often turns out that NULL is the insidious value.

Before Teradata 14, the only solution was often to split the query into two parts:

The first part of the UNION handles the NULL values, the second all other values (note that both parts must return the same columns):

-- Part 1: handle the rows whose join column is NULL separately;
-- in the join they would all be redistributed to the same AMP (hash of NULL)
SELECT t01.*
FROM the_table1 t01
WHERE t01.the_join_column IS NULL
UNION ALL
-- Part 2: the actual join, restricted to the non-NULL values
SELECT t01.*
FROM the_table1 t01
INNER JOIN the_table2 t02
    ON t01.the_join_column = t02.the_join_column
WHERE t01.the_join_column IS NOT NULL
AND t02.the_join_column IS NOT NULL;

Since Teradata 14, we have the Partial Redistribution / Partial Duplication (PRPD) feature, with which the optimizer can solve such problems on its own.

The optimizer uses the demographic data stored in the statistics (the most common values per histogram) and splits the join logic accordingly, but it is up to you to ensure correct and complete statistics!
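For example (table and column names are placeholders), collecting statistics on the join columns provides exactly these value demographics:

COLLECT STATISTICS COLUMN (the_join_column) ON the_table1;
COLLECT STATISTICS COLUMN (the_join_column) ON the_table2;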

Written by Roland Wenzlofsky, November 1, 2014

4 COMMENTS

  1. I have a question regarding using a table with Identity columns.
    Can we do a FastLoad into a table that has an Identity Column? And if we do, will the ID values remain the same or will they change?
    I ask this because the current table has an Identity Column, and I need to be sure that the Identity values are the same or will not change after the FastLoad run.
    On our system we have a very large IDENTITY column table (DWP_VEDATA_T.MEASR_RESULT_DIAG). The size of this table is greater than 5 TB, and it has more than 61 billion rows. Now we have a requirement from the business to change the DDL of this table (one column in particular). The length of the column is to be increased from 1000 to 10,000.
    One way to go about this would be to create a new table and then do an “Insert – Select” in chunks.
    This has some flaws: it would be time-consuming, ETL batches using this table would be delayed, and over the weekends the insert would have to be stopped due to weekly backups.

    Last year, we did a similar activity, and it took 100+ days; batches had to be held, and after the table was populated, a lot of manual intervention was required to make sure the two tables were actually in sync.

    Question:
    Is there any faster/better way to do this?
    The original table has an IDENTITY column, and one column has a length of 1000.
    The target table also has an IDENTITY column, and the column length needs to be increased to 10,000.

    INSERT / SELECT is time-consuming, and I am guessing it would involve a large Transient Journal for each insert.

  2. It will spread the rows evenly across the AMPs. It is not about the total disk space consumed by the table; the better distribution matters because it prevents one or a few AMPs from running out of permanent space.

  3. Hi,
    If I remove the skewness by choosing a unique PI, will it free up the table's space?
    Suppose the table is 10 GB; after changing the PI to unique, will it shrink to fewer GB?
