
3 Ways To Detect Teradata Skewing

By Roland Wenzlofsky

November 1, 2014


Skewing is one of the biggest problems on a Teradata system because it harms performance.

If a table has a high skew factor, not all AMPs are employed uniformly for queries on this table. This hurts not only the individual query but the entire system.

The following happens in detail:

Skew leads to low CPU parallel efficiency for full table scans and bulk inserts. The AMP holding the most rows becomes the bottleneck; the remaining AMPs must wait for this slowest AMP.

Skew increases the number of I/Os for updates and inserts of biased values. This is particularly harmful for SET tables combined with a NUPI, unless a unique secondary index makes the duplicate row checks unnecessary.
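If you run into this problem, there are two common remedies. Below is a minimal sketch only; all table and column names are placeholders, not objects from this article. You can either declare the table as MULTISET, so no duplicate row check is performed at all, or keep the SET table and add a unique secondary index whose uniqueness check replaces the duplicate row check:

-- Option 1: a MULTISET table never performs the duplicate row check
CREATE MULTISET TABLE the_multiset_table
( the_pi_column    INTEGER
, the_other_column VARCHAR(100)
) PRIMARY INDEX (the_pi_column);

-- Option 2: keep the SET table, but add a unique secondary index;
-- its uniqueness check makes the duplicate row check unnecessary
CREATE UNIQUE INDEX (the_key_column) ON the_set_table;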

There are many reasons for skew. In this article, I will show you three ways to discover it.

1. Detect Table Skew – The poorly designed Physical Data Model (PDM)

The physical data model should be the first thing checked as a possible cause of skew.

A poorly selected primary index leading to skew hurts all queries that use the table. This is where you can win the most.

I have seen cases where the data warehouse’s daily loading time could only be halved by changing a few tables’ primary indexes!

The primary index should be checked right at the start, as the physical data model is relatively stable and changes only slowly afterward.

The following is a sample query that calculates the skew of a table:

SELECT
TABLENAME,
SUM(CURRENTPERM) AS CURRENTPERM,
CAST((100 - (AVG(CURRENTPERM) / MAX(CURRENTPERM) * 100)) AS DECIMAL(5,2)) AS SKEWFACTOR_PERCENT
FROM DBC.TABLESIZE
WHERE DATABASENAME = 'the_database'
GROUP BY 1
ORDER BY 1;

The above query returns the permanently used disk space and the skew factor per table.

Correctly distributed tables have a skew factor of 0%. However, tables with only a few rows (lookup tables) usually show a high skew factor because there are not enough rows to distribute them evenly across all AMPs (especially if the number of rows is smaller than the number of AMPs in the system). Exclude these tables from your analysis.
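To drill down from disk space to the actual row distribution, you can also count the rows per AMP for a single table. The following is a sketch only; the database, table, and primary index column names are placeholders you have to replace:

SELECT
HASHAMP(HASHBUCKET(HASHROW(the_pi_column))) AS AMP_NO,
COUNT(*) AS ROW_COUNT
FROM the_database.the_table
GROUP BY 1
ORDER BY 2 DESC;

An evenly distributed table shows almost identical row counts on all AMPs; a few AMPs with much higher counts confirm the skew reported by DBC.TABLESIZE.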

2. Detect Teradata Skewing by analyzing the Joins

While the first point refers to static skew, dynamic skew can occur while a query executes, caused by the uneven distribution of spool files. The principle of dynamic skew is simple:

Whenever a join takes place, the rows to be joined must be on the same AMP. This often makes it necessary to copy a table completely to all AMPs or to choose a new primary index, which triggers a redistribution of the rows.

Unfortunately, this new primary index can turn out to be a poor choice if it distributes the rows unevenly among the AMPs.

Take a look at this sample query:

SELECT *
FROM CUSTOMER t01
INNER JOIN GENDER_LOOKUP t02
ON t01.GENDER = t02.GENDER;

In our example, the table CUSTOMER’s primary index is the “customerid,” and the primary index of table GENDER_LOOKUP is the column “gender”.

If the optimizer selects the column “gender” as the new primary index for the customer table, the generated spool is strongly skewed, since only 2 AMPs will receive rows (male/female). This query is undoubtedly problematic.

If you have access to any real-time monitoring tool like Viewpoint, PMON, or DBMON, you can easily spot such skewing problems:

Pay attention to execution plan steps that show a high skew factor and take a lot of time.
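If DBQL query logging is enabled on your system, you can also search for skewed queries after the fact. The following is a sketch only, assuming the standard DBC.QryLogV view is available; it compares the CPU time of the busiest AMP with the average CPU time per AMP (HASHAMP()+1 returns the number of AMPs):

SELECT
USERNAME,
QUERYID,
AMPCPUTIME,
MAXAMPCPUTIME,
CAST(MAXAMPCPUTIME / NULLIFZERO(AMPCPUTIME / (HASHAMP()+1)) AS DECIMAL(10,2)) AS CPU_SKEW_RATIO
FROM DBC.QRYLOGV
WHERE CAST(STARTTIME AS DATE) = CURRENT_DATE
AND AMPCPUTIME > 10 -- skip trivial queries
ORDER BY CPU_SKEW_RATIO DESC;

A ratio close to 1 means good parallel efficiency; a high ratio means that one AMP did most of the work.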

3. Detect Teradata Skewing by analyzing Column Values

As described above, dynamic skew can be detected by analyzing the execution plan (the join columns) and knowing the columns’ contents.

The next type of skew is more dangerous because it is difficult to detect: the hidden skew.

Hidden skew means that a column contains a few very common values, even though the table itself shows no skew.

Our check query in point 1 does not recognize hidden skew.

For hidden skew, we need this query:

SELECT the_snaky_column, COUNT(*)
FROM the_table
GROUP BY 1
ORDER BY 2 DESC;

The above query shows you the most common values at the top of the list. Should the result set look similar to this, we have a problem with hidden skew:

'A',1000000
'B',500
'C',480
'D',480
...
'ZZZZZ',1
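To quantify how dominant the most common values really are, you can extend the check to show each value's share of all rows. This is only a sketch, using the same placeholder names as above:

SELECT
the_snaky_column,
COUNT(*) AS VALUE_COUNT,
CAST(100.00 * COUNT(*) / SUM(COUNT(*)) OVER () AS DECIMAL(5,2)) AS PERCENT_OF_ROWS
FROM the_table
GROUP BY 1
QUALIFY ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC) <= 10
ORDER BY 2 DESC;

If a single value accounts for a large share of the rows, every join or aggregation on this column will funnel that share of the work onto a single AMP.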

Real-time monitoring with Viewpoint, PMON, or DBMON is another great way of detecting skewing problems:

Watch out for execution plan steps that show an even distribution (no skew reported) but take a long time while the spool they use stays at only a few gigabytes and hardly grows.

In my experience, all skew problems can be identified in this way.

It often turns out that the NULL value is the insidious one.

Before Teradata 14, the only solution was often to split the query into two parts:

The first branch of the UNION ALL handles the rows with a NULL join value, the second branch all other values:

-- Note: both branches of a UNION ALL must return the same column list.
-- For simplicity, only the columns of the_table1 are selected here; if columns
-- of the_table2 are needed, pad the first branch with matching NULL expressions.
SELECT t01.*
FROM the_table1 t01
WHERE t01.the_join_column IS NULL
UNION ALL
SELECT t01.*
FROM the_table1 t01
INNER JOIN the_table2 t02
ON t01.the_join_column = t02.the_join_column
WHERE t01.the_join_column IS NOT NULL
AND t02.the_join_column IS NOT NULL;

Since Teradata 14, we have the Partial Redistribution, Partial Duplication (PRPD) feature, with which the optimizer can solve such problems on its own.

The optimizer uses the demographic data stored in the statistics histograms (the most frequent, biased values) to split the join logic, but it is up to you to ensure correct and complete statistics!
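At a minimum, that means current column statistics on the join columns of both tables. A sketch, using the placeholder names from the example above:

COLLECT STATISTICS COLUMN (the_join_column) ON the_table1;
COLLECT STATISTICS COLUMN (the_join_column) ON the_table2;

These statistics contain the histogram with the most frequent (biased) values, which is precisely the demographic information PRPD relies on.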


Roland Wenzlofsky

Roland Wenzlofsky is an experienced freelance Teradata Consultant & Performance Trainer. Born in Vienna, Austria's capital, he has been building and tuning some of the largest Teradata data warehouses in the European financial and telecommunication sectors for more than 20 years. He has played all the roles of developer, designer, business analyst, and project manager. Therefore, he knows like no other the problems and obstacles that make many data warehouse projects fail, as well as the tricks and tips that will help you succeed.

  • Bill Atkinson says:

    I have a question regarding using a table with Identity columns.
    Can we do a FastLoad into a table having an Identity column? And if we do this, will the values of the ID remain the same or will they be changed?
    I ask this because the current table has an Identity column and I need to be sure that the Identity values stay the same after the FastLoad run.
    On our system we have a very large IDENTITY column table (DWP_VEDATA_T.MEASR_RESULT_DIAG). The size of this table is greater than 5 TB and it has more than 61 billion rows. Now we have a requirement from the business to change the DDL of this table (one column in particular). The length of the column is to be increased from 1000 to 10,000.
    One way to go about this would be to create a new table and then do an "INSERT ... SELECT" in chunks.
    This has some flaws, as it would be time-consuming, ETL batches using this table would be delayed, and over the weekends the insert would have to be stopped because of the weekly backups.

    Last year, we did a similar activity and it took 100+ days; batches had to be held, and after the table was populated a lot of manual intervention was required to make sure that the two tables were in sync.

    Question:
    Is there any faster/better way to do this?
    The original table has an IDENTITY column and one column with length 1000.
    The target table also has an IDENTITY column, and the column length needs to be increased to 10,000.

    INSERT / SELECT is time-consuming and, I am guessing, would involve a lot of Transient Journal space for each insert.

  • It will spread the rows evenly across the AMPs. It is not about the total disk space consumed by the table; the more even distribution matters because it prevents one or a few AMPs from running out of permanent space.

  • Hi,
    If I remove the skewness by selecting a unique PI, will it free up the space of the table?
    Suppose it is 10 GB; after changing the PI to a unique one, will it be reduced from 10 GB to a lower value?

  • {"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

    You might also like

    >