Skewing is one of the biggest problems on a Teradata system because it has a negative effect on performance.
If a table has a high skew factor, not all AMPs are uniformly employed for queries on this table. This not only has a negative effect on this one query but on the entire system.
The following happens in detail:
Skew leads to poor CPU parallel efficiency for full table scans and bulk inserts. The AMP with the most data sets forms a bottleneck. The remaining AMPs must wait for the slowest AMP.
Skew increases the number of IOs for updates and inserts of biased values. This is particularly negative for SET tables in combination with a NUPI unless the duplicate row checks are prevented by a secondary index.
There are many reasons for skew. With this article, I will show you 3 ways to discover Skew.
1. Detect Table Skew – The poorly designed Physical Data Model (PDM)
At the very beginning, the physical data model should be checked as the cause for skew.
A badly selected primary index leading to skew has a negative effect on all queries that use the table. This is where you can win the most.
I have seen cases where the daily loading time of the data warehouse could only be halved by changing the primary index of a few tables!
The primary index should also be checked right at the beginning, as the physical data model is relatively stable and changes only slowly.
The following is a sample query that calculates the skew of a table:
CAST((100-(AVG(CURRENTPERM)/MAX(CURRENTPERM)*100)) AS DECIMAL(5,2)) AS SKEWFACTOR_PERCENT
WHERE DATABASENAME = ‘the_database'
GROUP BY 1
ORDER BY 1
Above query returns the permanently used disk space and the skew factor per table.
Perfectly distributed tables have a skew factor of 0%. However, tables with only a few rows (lookup tables) are usually distorted because there are simply not enough rows to distribute them evenly across all AMPs (especially if the number of rows < number of AMPs in the system). Exclude these tables from your analysis.
2. Detect Teradata Skewing by analyzing the Joins
While the first point refers to the static skewing, a dynamic skewing can occur when the query is executed, which is caused by the uneven distribution of the spool files. The principle of dynamic skew is simple:
Whenever a join takes place, the rows to be connected must be on the same AMP. This often makes it necessary to copy tables completely to all AMPs or to set a new primary index, which triggers a redistribution of the rows.
Unfortunately, this new primary index could turn out to be a bad choice if it distributes the rows unevenly among the AMPs.
Take a look at this sample query:
SELECT * FROM
ON t01.GENDER = t02.GENDER;
In our example, the primary index of table CUSTOMER is the “customerid” and the primary index of table GENDER_LOOKUP is the column “gender”.
If the optimizer selects the column “gender” as the new primary index for the customer table, the generated spool is strongly skewed, since only 2 AMPs will receive rows (male/female). This query is certainly problematic.
If you have access to any real-time monitoring tool like Viewpoint, PMON or DBMON you can easily spot such skewing problems:
Pay attention to execution planning steps that have a high skew factor and take a lot of time
3. Detect Teradata Skewing by analyzing Column Values
Dynamic skew as described above can be detected by analyzing the execution plan (the join columns) and knowing the contents of the columns.
The next type of skew is more dangerous because it is difficult to detect: the hidden skew.
These are a few values that are very common, but the table itself has no skew.
Hidden skew is not recognized by our check query in point 1.
For hidden skew we need this query:
SELECT the_snaky_column,COUNT(*) FROM the_table GROUP BY 1 ORDER BY 2 DESC;
The above query shows you the most common values at the top of the list. Should the result set look similar to this we have a problem with hidden skew:
Real-time monitoring with Viewpoint, PMON or DBMON is another great way of detecting skewing problems:
Note steps in the execution plan that show an even distribution (no skew), but use only a few gigabytes of spool and take a long time. At the same time, the size of the spool used hardly changes.
In my experience, all skew problems can be identified in this way.
It often turns out that the NULL value is the insidious one.
Before Teradata 14, the only solution was often to split the query into two parts:
The first section of the UNION treated the NULL value, the second all other values:
SELECT * FROM the_table1 t01 WHERE the_join_column IS NULL
SELECT * FROM
t01.the_join_column = t02.the_join_column
t01.the_join_column IS NOT NULL
AND t02.the_join_column IS NOT NULL;
Since Teradata 14 we have the Partial Redistribution / Partial Distribution (PRPD) function, with which the optimizer can solve such problems alone.
The optimizer uses the demographic data stored in the statistics information (most typical values per histogram) and splits the join logic, but it is up to you to ensure correct and complete statistics!