Teradata Skewing: Causes, Detection, and Solutions

DWH Pro Admin

April 21, 2023

minutes reading time


Introduction

As a tuner, most slow queries can be traced back to two primary causes. The initial cause is the transfer of substantial amounts of data between AMPs. This piece will focus on the second cause: Teradata skewing.

Teradata refers to skew as the uneven distribution of a table’s rows across the AMPs. This results in suboptimal utilization of the system’s parallelism and affects queries and data modifications (UPDATE, INSERT, DELETE). Skew causes uneven CPU and IO utilization, with AMPs having more rows experiencing longer processing times and becoming a bottleneck. The query completion time depends on the slowest AMP to finish its work.

Skew may result in numerous duplicate row checks when inserting rows with the same primary index values into a NUPI table defined as a SET table. This could also increase CPU usage for the affected AMP(s) in identifying duplicate rows.

Multiple factors can cause skewness, which we will address comprehensively in this article.

Poorly modeled physical data is a frequent cause of skew. A crucial aspect of performance tuning is the appropriate selection of primary indexes. The physical data model is fundamental to all workloads, and significant performance improvements can often be attained by correctly designing it. The data model is advantageous because it changes only gradually over time once it has been correctly created.

Here is a sample query to compute the skewness of a table:

SELECT
TABLENAME,
SUM(CURRENTPERM) CURRENTPERM,
CAST((100-(AVG(CURRENTPERM)/MAX(CURRENTPERM)*100)) AS DECIMAL(5,2)) AS SKEWFACTOR_PERCENT
FROM DBC.TABLESIZE
WHERE DATABASENAME = ‘the_database’
GROUP BY 1
ORDER BY 1
;

The prior computation determines table usage and skewness. Ideally, tables would have 0% skewness, but this is infrequent. The objective should be to maintain a skewness of less than 10%. It is crucial to exclude small tables from the assessment. If a table contains fewer rows than there are AMPs in the system, it is reasonable that not all rows will be processed by each AMP, resulting in table skewness.

Detect Teradata Skew by analyzing the Joins

The skew of tables is commonly known as static skew, which can be identified by simply examining the table distribution. On the other hand, dynamic skew is harder to detect as it occurs during the execution of the execution plan and mostly impacts spool tables. The primary culprits for dynamic skew are joins and aggregations.

To join two tables in the Teradata architecture, they must be available on the same AMP and have the same primary index defined by their join columns. Irregular distribution of rows, or skew, can occur if a new primary index is created during the join process. Teradata employs statistical analysis to prevent skew, such as dividing the join into multiple joins based on the distribution of values. However, not all necessary statistics may be available. Consider the following query:

SELECT * FROM
CUSTOMER t01
INNER JOIN
GENDER_LOOKUP t02
ON
t01.GENDER = t02.GENDER;

Insufficient estimates may lead the optimizer to allocate the Customer table to the fresh primary index relying on the Gender column. However, it would be more appropriate to replicate the Gender table to all AMPs. Subsequently, the join operation will ensue, resulting in a concentration of all the rows onto a limited number of AMPs.

To identify an issue promptly, observing the query through Viewpoint is best. A Plan Step that experiences prolonged execution time and high skewness indicates an issue. Although one can identify dynamic skew by studying the execution plan, monitoring through Viewpoint is a quicker approach.

Detect Teradata Skewing by analyzing Column Values

The hidden skew is a remarkable instance of skew that is difficult to detect as it is typically not visible in Viewpoint.

Hidden skew is a problem wherein a table is mostly evenly distributed, but certain values are clustered together, making the skew difficult to detect. For instance, consider a transaction table where one customer has numerous transactions while all others are walk-in customers. A query can be used to uncover such hidden skew:

SELECT the_snaky_column,COUNT(*) FROM the_table GROUP BY 1 ORDER BY 2 DESC;

A hidden skew is present when the Result Set’s top values correspond to the primary index’s most common values.

‘A’,1000.000
‘B’,500
‘C’,480
‘D’,’480′
…

‘ZZZZZ’,1

Real-time monitoring through Viewpoint can detect plan steps displaying skewness. Skew problems arise when a step remains “stuck” for an extended period with a high skew factor. Short steps exhibiting skewness can be disregarded as Viewpoint provides snapshots. At the query’s completion, releasing a spool triggers a snapshot revealing any skewness.

A tip from experience: NULL values are stored exclusively in the AMP with the highest number. Therefore, identifying issues pertaining to NULL values is simple. By being aware of the highest AMP in your system and checking for any skew on this specific AMP (which can be viewed in Viewpoint), you can easily detect any problems.

Prior to Teradata Release 14, splitting the query into two was the sole method to avoid skew, segregating the skewed value from the well-distributed values. However, with the new features provided by Teradata, this is no longer imperative, as the system can internally handle it.

This query example implements a UNION to differentiate between the processing of NULL values and non-NULL values:

SELECT * FROM the_table1 t01 WHERE the_join_column IS NULL
UNION ALL
SELECT * FROM
the_table1 t01
INNER JOIN
the_table2 t02
ON
t01.the_join_column = t02.the_join_column
WHERE
t01.the_join_column IS NOT NULL
AND t02.the_join_column IS NOT NULL;

Prior to Teradata Release 14, it was common to bifurcate the query to avoid data skew, separating the query for skewed values from the one for well-distributed ones. However, this practice is now obsolete since Teradata incorporates internal features to address skewness.

Teradata employs Partial Redistribution/Partial Distribution (PRPD) as its join feature, which divides joins into several segments. Recognizing that these novel functionalities necessitate precise and current statistics is crucial.

{"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

You might also like

>