Mastering Teradata Performance Optimization: Introduction to Partial Redistribution and Duplication

Roland Wenzlofsky

April 27, 2023

minutes reading time

Introduction to Teradata Partial Redistribution and Partial Duplication

Managing data skew is a critical aspect of optimizing Teradata performance. While table-level skew can typically be prevented by selecting an appropriate Primary Index, spool skew occurs during query execution – a common and often significant problem.

Spool skew is a common issue when joining tables in Teradata. The join process is executed simultaneously by all available AMPs in parallel.

Let us consider the following example of a join:

B … Big table with 100 Million rows

S … Small table with 10.000 rows

Assuming that the two tables are joined on a column that is not the primary index for either table, the optimizer may rehash and redistribute both tables. This approach is ideal if the join column values are evenly distributed across all AMPs.

We assume that in the big table B, the value in the join column is the same for approximately 99% of the rows. This equates to 99 million out of 100 million rows sharing the same value in the join column.

Currently, we are experiencing a significant performance issue due to data skew.

Rehashing the rows assigns 99 million rows to a single AMP because they share the same hash value.

The AMP designated for receiving may experience a Hot AMP situation and deplete its spool space.

Before Teradata Release 13, a performance specialist’s sole responsibility was to address dynamic skewing issues by manipulating the optimizer to adopt an alternate execution plan. This issue could occasionally be resolved by incorporating statistics, while in other cases, the only viable solution was to rewrite the SQL statement fully.

Teradata 14 introduced a refined approach to resolve data skew problems.

The AMP is informed of any skewed values pertaining to the big Table B.

Skewed rows in Table B remain confined to their respective AMPs. Non-skewed rows are redistributed by rehashing the join column(s).

Non-biased rows from small table S are rehashed on their join column(s), while rows that would become skewed during redistribution are duplicated across all AMPs.

Upon completing the partial redistribution and duplication of the tables, each AMP proceeds to generate the initial result by merging the contents of the redistributed spools.

Each AMP produces a secondary output by merging table S’s replicated spool contents with table B’s local spool.

With Teradata 14, PRPD alleviates certain performance issues related to skew. However, the Teradata optimizer requires knowledge of biased values, underscoring the importance of maintaining up-to-date statistics.

Take a brief survey and receive the book for free Get instant access

Share0

Tweet0

Share0

Roland Wenzlofsky

Roland Wenzlofsky is an experienced freelance Teradata Consultant & Performance Trainer. Born in Austria's capital Vienna, he is building and tuning some of the largest Teradata Data Warehouses in the European financial and telecommunication sectors for more than 20 years. He has played all the roles of developer, designer, business analyst, and project manager. Therefore, he knows like no other the problems and obstacles that make many data warehouse projects fail and all the tricks and tips that will help you succeed.

Thanks Roland. Nice article as like always. 🙂

I have seen Random function does better in most cases for skewed joins, Please correct me if am wrong.

Eg:
…
Inner join ….
ON
CASE
WHEN (COALESCE(B.skewed_col=’-100′) THEN RANDOM(-99999999,-1) || ‘skb+^(_~:.-99’
ELSE B.skewedCol
END = S.joincol

how do we know when the optimizer is going for this PRPD ? is there any specific keyword which we can see in explain plan when PRPD takes place?

Mastering Teradata Performance Optimization: Introduction to Partial Redistribution and Duplication

Introduction to Teradata Partial Redistribution and Partial Duplication

Roland Wenzlofsky

Fast multi-file export of Teradata query results using only Teradata SQL Assistant

The Teradata AMP Worker Task

Boost Your Teradata Performance – The Critical Role of NOT NULL Declarations

Optimizing Teradata SQL Queries by Avoiding Full Table Scans and Utilizing Secondary Indexes