Skewing is one of the biggest problems on a Teradata system because it harms performance.
If a table has a high skew factor, not all AMPs are uniformly employed for queries on this table. This not only hurts this one query but on the entire system.
The following happens in detail:
Skew leads to low CPU parallel efficiency for full table scans and bulk inserts. The AMP with the most data sets forms a bottleneck. The remaining AMPs must wait for the slowest AMP.
Skew increases the number of IOs for updates and inserts of biased values. This is significantly negative for SET tables combined with a NUPI unless a secondary index prevents the duplicate row checks.
There are many reasons for skew. With this article, I will show you three ways to discover Skew.
1. Detect Table Skew – The poorly designed Physical Data Model (PDM)
At the very beginning, the physical data model should be checked as the cause for skew.
A poorly selected primary index leading to skew hurts all queries that use the table. This is where you can win the most.
I have seen cases where the data warehouse’s daily loading time could only be halved by changing a few tables’ primary indexes!
The primary index should also be checked right initially, as the physical data model is relatively stable and changes only slowly.
The following is a sample query that calculates the skew of a table:
CAST((100-(AVG(CURRENTPERM)/MAX(CURRENTPERM)*100)) AS DECIMAL(5,2)) AS SKEWFACTOR_PERCENT
WHERE DATABASENAME = ‘the_database’
GROUP BY 1
ORDER BY 1
The above query returns the permanently used disk space and the skew factor per table.
Correctly distributed tables have a skew factor of 0%. However, tables with only a few rows (lookup tables) are usually distorted because there are not enough rows to distribute them evenly across all AMPs (especially if the number of rows < the number of AMPs in the system). Exclude these tables from your analysis.
2. Detect Teradata Skewing by analyzing the Joins
While the first point refers to the static skewing, a dynamic skewing can occur when the query is executed, caused by the uneven distribution of the spool files. The principle of dynamic skew is simple:
Whenever a join takes place, the rows to be connected must be on the same AMP. This often makes it necessary to copy tables completely to all AMPs or set a new primary index, which triggers a redistribution of the rows.
Unfortunately, this new primary index could turn out to be a wrong choice if it distributes the rows unevenly among the AMPs.
Take a look at this sample query:
SELECT * FROM
ON t01.GENDER = t02.GENDER;
In our example, the table CUSTOMER’s primary index is the “customerid,” and the primary index of table GENDER_LOOKUP is the column “gender”.
If the optimizer selects the column “gender” as the new primary index for the customer table, the generated spool is strongly skewed, since only 2 AMPs will receive rows (male/female). This query is undoubtedly problematic.
If you have access to any real-time monitoring tool like Viewpoint, PMON, or DBMON, you can easily spot such skewing problems:
Pay attention to execution planning steps that have a high skew factor and take a lot of time
3. Detect Teradata Skewing by analyzing Column Values
As described above, dynamic skew can be detected by analyzing the execution plan (the join columns) and knowing the columns’ contents.
The next type of skew is more dangerous because it is difficult to detect: the hidden skew.
These are a few very common values, but the table itself has no skew.
Our check query in point 1 does not recognize hidden skew.
For hidden skew, we need this query:
SELECT the_snaky_column,COUNT(*) FROM the_table GROUP BY 1 ORDER BY 2 DESC;
The above query shows you the most common values at the top of the list. Should the result set look similar to this, we have a problem with hidden skew:
Real-time monitoring with Viewpoint, PMON, or DBMON is another great way of detecting skewing problems:
Note steps in the execution plan that show an even distribution (no skew), but use only a few gigabytes of the spool and take a long time. At the same time, the size of the spool used hardly changes.
In my experience, all skew problems can be identified in this way.
It often turns out that the NULL value is the insidious one.
Before Teradata 14, the only solution was often to split the query into two parts:
The first section of the UNION treated the NULL value, the second all other values:
SELECT * FROM the_table1 t01 WHERE the_join_column IS NULL
SELECT * FROM
t01.the_join_column = t02.the_join_column
t01.the_join_column IS NOT NULL
AND t02.the_join_column IS NOT NULL;
Since Teradata 14, we have the Partial Redistribution / Partial Distribution (PRPD) function, with which the optimizer can solve such problems alone.
The optimizer uses the demographic data stored in the statistics information (most typical values per histogram). It splits the join logic, but it is up to you to ensure correct and complete statistics!