Introduction
As a performance tuner, 99% of slow queries can be attributed to precisely two causes. The first cause is that large amounts of data are copied between AMPs. This article will look at the second cause: Teradata skewing.
In Teradata, we speak of skew when the rows of a table are not evenly distributed across all AMPs. This leads to the fact that the parallelism of the system is not used optimally. This affects both queries and the changing of data (UPDATE, INSERT, DELETE). Skew leads to the fact that both the CPU and the IOs are not used evenly. AMPs with relatively many rows have more to do, therefore take longer and represent a bottleneck. A query is only done when even the slowest AMP delivers its work.
Another possible adverse effect of skew is that many duplicate row checks occur when inserting rows with the same primary index values into a NUPI table defined as a SET table. Therefore, the affected AMP(s) also need more CPU seconds to determine whether there are duplicate rows.
There are many reasons why skew can occur. In this article, we will discuss them all.
The skew of tables and the wrong data model
A common cause of skew is a poorly modeled physical data model. Correctly selecting primary indexes is one of the critical tasks in performance tuning. Since the physical data model is the basis for all workloads, the most significant performance gains can often be achieved here. The data model as a basis has the advantage that once it is correctly created it changes only very slowly over time.
The following is a sample query that calculates the skew of a table:
SELECT TABLENAME, SUM(CURRENTPERM) CURRENTPERM, CAST((100-(AVG(CURRENTPERM)/MAX(CURRENTPERM)*100)) AS DECIMAL(5,2)) AS SKEWFACTOR_PERCENT FROM DBC.TABLESIZE WHERE DATABASENAME = ‘the_database’ GROUP BY 1 ORDER BY 1 ;
The previous query calculates the used space per table and the skew. Perfectly distributed tables have 0% skew, but this ideal case is rare. The goal should be to keep the skew below 10%. It is essential to omit small tables from the analysis. If a table has fewer rows than there are AMPs in the system, it is logical that not all AMPs rows will be processed. So it is natural that the table has a skew.
Detect Teradata Skew by analyzing the Joins
The skew of tables is also called static skew because it becomes visible just by looking at the table distribution. The dynamic skew is much more challenging to recognize. We refer to skew that occurs during the execution of the execution plan. This affects spool tables. The leading cause is joins and aggregations.
The Teradata architecture requires that the rows of two tables be made available on the same AMP before joining. Furthermore, it may be necessary for specific join methods to get the same primary index (defined by the join columns). Skew occurs if this “new” primary index distributes the rows irregularly. Teradata tries to prevent such problems with statistics by, e.g., splitting the join into several different joins; one join for well-distributed values, another for the skewed value(s). However, all the statistics needed to detect such possibilities are not always available. See the following query:
SELECT * FROM CUSTOMER t01INNER JOIN GENDER_LOOKUP t02ON t01.GENDER = t02.GENDER;
If the optimizer lacks correct estimates, it might distribute the Customer table to the new primary index based on the Gender column (the better plan would be to copy the Gender table to all AMPs) and then perform the join. As a result, all rows will end up on a few AMPs.
The quickest way to detect such a problem is to monitor the query in Viewpoint. A Plan Step that takes a long time and has a high skew will have this problem. It is also possible to detect dynamic skew by reading the execution plan, but the faster method is the one above.
3. Detect Teradata Skewing by analyzing Column Values
Let’s discuss one more remarkable case of skew, which I call the hidden skew. It is the most difficult to detect because it is usually not visible in Viewpoint.
This problem is where a table is generally well distributed, but a few values are clustered. The skew hides in a well-distributed table. Imagine a transaction table, for example. This would be the case if there is one customer with a considerable number of transactions while all other customers are only walk-in customers. The following is a query that helps to find hidden skew:
SELECT the_snaky_column,COUNT(*) FROM the_table GROUP BY 1 ORDER BY 2 DESC;
The most frequent values of the Primary Index are at the top of the Result Set. If the result looks like the following, it is a hidden skew:
‘A’,1000.000‘B’,500’C’,480’D’,’480’…
‘ZZZZZ’,1
We have already touched on it briefly: Real-time monitoring with Viewpoint can be a great help to find plan steps that show skew. If a step is “stuck” for a longer time and the skew factor is high, it is a skew problem. Very short steps with skew are negligible: Viewpoint shows snapshots. At the end of a query, it can be that spool is released, and a snapshot is taken, which shows this situation as skew.
A little trick from practice: NULL values are always stored at the AMP with the highest number. So you can easily detect problems with NULL values by knowing the highest AMP in your system and looking if there is a skew on this AMP (visible in Viewpoint).
Before Teradata Release 14, the only way to prevent skew was often to split the query itself in two. One for the skewed value and one for the well-distributed values. This is no longer necessary, as Teradata offers features that do this internally if necessary).
The following example query uses a UNION to split the handling of NULL values and other values:
SELECT * FROM the_table1 t01 WHERE the_join_column IS NULL UNION ALL SELECT * FROM the_table1 t01 INNER JOIN the_table2 t02 ON t01.the_join_column = t02.the_join_column WHERE t01.the_join_column IS NOT NULL AND t02.the_join_column IS NOT NULL;
Before Teradata Release 14, the only way to prevent skew was often to split the query itself in two. One for the skewed value and one for the well-distributed values. This is no longer necessary, as Teradata offers features that do this internally if necessary).
The feature Teradata uses for joins is Partial Redistribution / Partial Distribution (PRPD), which splits joins into multiple joins. Understanding that these new features require accurate and up-to-date statistics is essential.