Skewing is one of the worst problems on any Teradata system. A high skew factor means that the parallelism of the system is degraded, leading to:
- Poor CPU parallel efficiency on full table scans and bulk inserts. The AMP holding the most rows becomes the bottleneck, forcing all other AMPs to wait.
- Increased I/O for updates and inserts of skewed values, because of the extra workload on the AMP holding a high number of rows with the same NUPI value.
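To make the parallel-efficiency argument concrete, here is a small Python sketch (hypothetical row counts, not taken from a real system): an all-AMP step finishes only when the busiest AMP finishes.

```python
# Hypothetical per-AMP row counts on a 4-AMP system (illustration only).
rows_per_amp = [1000, 1000, 1000, 7000]  # one AMP holds most of the rows

total_rows = sum(rows_per_amp)

# All AMPs scan in parallel, so the step's elapsed time is governed
# by the AMP with the most rows.
busiest = max(rows_per_amp)

# With perfect distribution, each AMP would hold total / #AMPs rows.
ideal = total_rows / len(rows_per_amp)

parallel_efficiency = ideal / busiest * 100
print(f"Parallel efficiency: {parallel_efficiency:.0f}%")  # → Parallel efficiency: 36%
```

Three of the four AMPs sit idle after scanning their 1,000 rows while the fourth still works through 7,000 — exactly the waiting described above.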
The causes of Teradata skewing hide in many places. In this article, we will show you three ways to discover your skewing problems.
1. Detect Table Skew – The poorly designed Physical Data Model (PDM)
The PDM is one of the most prominent areas to investigate for skewing problems. A poor primary index choice can cause uneven data distribution and hurt query performance.
Checking for table skew is typically the first step in the investigation of skewing problems, as it only has to be done once in a while (whenever the PDM changes or the data demographics change significantly).
Below is an example query that calculates the table skew from the per-AMP space figures in the view DBC.TableSize:
SELECT
    DATABASENAME,
    TABLENAME,
    SUM(CURRENTPERM) AS CURRENTPERM,
    CAST((100-(AVG(CURRENTPERM)/MAX(CURRENTPERM)*100)) AS DECIMAL(5,2)) AS SKEWFACTOR_PERCENT
FROM DBC.TABLESIZE
WHERE DATABASENAME = 'the_database'
GROUP BY 1,2
ORDER BY 1,2;
This query will return the permanently used disk space and the skew factor per table.
Perfectly distributed tables have a skew factor of 0%. Nevertheless, tables with only a few rows (lookup tables) are usually skewed, as there may simply not be enough rows to distribute them evenly across all AMPs (number of rows < number of AMPs in the system). Just exclude them from your analysis.
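The skew factor formula from the query above is easy to verify by hand; here is a plain Python sketch with hypothetical per-AMP space figures:

```python
# Hypothetical CurrentPerm bytes per AMP for one table (4-AMP system).
current_perm = [100_000, 100_000, 100_000, 400_000]

avg_perm = sum(current_perm) / len(current_perm)  # AVG(CURRENTPERM)
max_perm = max(current_perm)                      # MAX(CURRENTPERM)

# Same formula as in the SQL: 0% means perfectly even distribution.
skew_factor = 100 - (avg_perm / max_perm * 100)
print(f"Skew factor: {skew_factor:.2f}%")  # → Skew factor: 56.25%
```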
2. Detect Teradata Skewing by analyzing the Joins
While point 1 covered static skewing, during query execution we may also have to fight dynamic skewing, caused by the uneven distribution of spool files. The principle of dynamic skewing is simple:
Whenever a join takes place, the rows to be joined have to be co-located on the same AMP.
Data relocation means rehashing one or both of the involved tables, in effect giving them (in the spool file) a new primary index.
Unfortunately, this new primary index could turn out to be a bad choice, distributing the spool file rows very unevenly across the AMPs.
Take a look at this example query:
SELECT * FROM
CUSTOMER t01
INNER JOIN
GENDER_LOOKUP t02
ON t01.GENDER = t02.GENDER;
We assume that the Primary Index of table CUSTOMER is something like “customerid” and the primary index of table GENDER_LOOKUP is column “gender.”
If the optimizer decides to rehash table CUSTOMER using the column GENDER as the new primary index, problems are guaranteed: with only two distinct gender values, the entire spool file will end up on just two AMPs. If you are lucky, the query simply runs out of spool space on one of these two AMPs and aborts; if not, it will impact the overall system performance for a very extended period.
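A tiny Python sketch (placeholder hash function and AMP count, not Teradata's actual hashing algorithm) shows why redistributing on a two-valued column can never occupy more than two AMPs:

```python
import zlib
from collections import Counter

NUM_AMPS = 10

def amp_for(value, num_amps=NUM_AMPS):
    # Placeholder for Teradata's row-hash-to-AMP mapping (CRC32 here).
    return zlib.crc32(value.encode()) % num_amps

# 1,000 customer rows redistributed on the two-valued GENDER column.
genders = ['M'] * 600 + ['F'] * 400
spool_rows = Counter(amp_for(g) for g in genders)

# No matter how many AMPs exist, at most two of them receive spool rows.
print(f"AMPs receiving spool rows: {len(spool_rows)} of {NUM_AMPS}")
```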
If you have access to any real-time monitoring tool like Viewpoint, PMON or DBMON you can easily spot such skewing problems:
Watch out for execution plan steps that show a high skew factor and do not finish for an extended period.
3. Detect Teradata Skewing by analyzing Column Values
While the join skew described in point 2 can usually be detected quite easily by analyzing the query and applying some common knowledge about the data content, there is another hidden skewing risk caused by data demographics:
Frequent column values in an evenly distributed table
Such a skew is very sneaky: it will not be detected by our check query from point 1. Luckily, there is an easy way to find it with the query below:
SELECT the_snaky_column, COUNT(*) FROM the_table GROUP BY 1 ORDER BY 2 DESC;
The query above shows the most frequent values at the top of the list; for a huge table, a result set like the one below should alert you:
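The same frequency check can be sketched in plain Python (hypothetical values, with None standing in for SQL NULL):

```python
from collections import Counter

# Hypothetical join-column content: the NULL stand-in dominates heavily.
values = [None] * 5000 + list(range(1, 101)) * 10

# Equivalent of: SELECT col, COUNT(*) FROM t GROUP BY 1 ORDER BY 2 DESC
frequency = Counter(values).most_common(3)
print(frequency[0])  # → (None, 5000)
```

One value carrying thousands of rows while every other value carries ten is exactly the demographic that lands on a single AMP after redistribution.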
Real-time monitoring with Viewpoint, PMON or DBMON is another great way of detecting such skewing problems:
Watch out for steps in the execution plan that show an even distribution (no skew) and only a few gigabytes of used spool which decrease very slowly over time, while the step does not seem to make any progress. In my experience, this is usually related to the high-frequency value skew problem.
Many times, it will turn out that the NULL value is the sneaky one. Before Teradata 14, the solution was to split your query into two parts and UNION them together: the first part of the UNION handled the NULL values, the second all other values:
SELECT t01.* FROM the_table1 t01
WHERE the_join_column IS NULL
UNION ALL
SELECT t01.* FROM
the_table1 t01
INNER JOIN
the_table2 t02 -- placeholder: the second table's name is not shown in the original
ON t01.the_join_column = t02.the_join_column
WHERE t01.the_join_column IS NOT NULL
AND t02.the_join_column IS NOT NULL;
Since Teradata 14, we have the Partial Redistribution / Partial Duplication (PRPD) feature, which the optimizer may use to solve such problems on its own.
The optimizer makes use of the data demographics stored in the statistics (the most frequent values in the histograms) and splits up the join logic accordingly, but it is up to you to ensure correct and complete statistics!
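The idea behind PRPD can be sketched in a few lines of Python. This is a deliberately simplified illustration of splitting rows on known hot values, not Teradata's actual implementation; the hot-value set and sample rows are invented for the example:

```python
# Hot values would come from the statistics histograms (invented here).
hot_values = {'X'}

# (join_value, payload) rows of the large table (invented sample data).
rows = [('X', 1), ('A', 2), ('X', 3), ('B', 4)]

# Rows with a hot join value are kept on a separate path (the small
# table's matching rows get duplicated to their AMPs), while all other
# rows are redistributed by hash as usual.
skewed_part   = [r for r in rows if r[0] in hot_values]
redistributed = [r for r in rows if r[0] not in hot_values]

print(len(skewed_part), len(redistributed))  # → 2 2
```

Splitting the work this way keeps the hot value from funneling the entire spool onto one AMP, which is exactly what the optimizer's statistics-driven plan achieves.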