Introduction to Teradata Partial Redistribution and Partial Duplication
Managing data skew is a critical aspect of optimizing Teradata performance. While table-level skew can typically be prevented by selecting an appropriate Primary Index, spool skew occurs during query execution – a common and often significant problem.
Spool skew is a common issue when joining tables in Teradata. The join process is executed simultaneously by all available AMPs in parallel.
Let us consider the following example of a join:
B … Big table with 100 Million rows
S … Small table with 10.000 rows
Assuming that the two tables are joined on a column that is not the primary index for either table, the optimizer may rehash and redistribute both tables. This approach is ideal if the join column values are evenly distributed across all AMPs.
We assume that in the big table B, the value in the join column is the same for approximately 99% of the rows. This equates to 99 million out of 100 million rows sharing the same value in the join column.
Currently, we are experiencing a significant performance issue due to data skew.
Rehashing the rows assigns 99 million rows to a single AMP because they share the same hash value.
The AMP designated for receiving may experience a Hot AMP situation and deplete its spool space.
Before Teradata Release 13, a performance specialist’s sole responsibility was to address dynamic skewing issues by manipulating the optimizer to adopt an alternate execution plan. This issue could occasionally be resolved by incorporating statistics, while in other cases, the only viable solution was to rewrite the SQL statement fully.
Teradata 14 introduced a refined approach to resolve data skew problems.
The AMP is informed of any skewed values pertaining to the big Table B.
Skewed rows in Table B remain confined to their respective AMPs. Non-skewed rows are redistributed by rehashing the join column(s).
Non-biased rows from small table S are rehashed on their join column(s), while rows that would become skewed during redistribution are duplicated across all AMPs.
Upon completing the partial redistribution and duplication of the tables, each AMP proceeds to generate the initial result by merging the contents of the redistributed spools.
Each AMP produces a secondary output by merging table S’s replicated spool contents with table B’s local spool.
With Teradata 14, PRPD alleviates certain performance issues related to skew. However, the Teradata optimizer requires knowledge of biased values, underscoring the importance of maintaining up-to-date statistics.
Thanks Roland. Nice article as like always. 🙂
I have seen Random function does better in most cases for skewed joins, Please correct me if am wrong.
Eg:
…
Inner join ….
ON
CASE
WHEN (COALESCE(B.skewed_col=’-100′) THEN RANDOM(-99999999,-1) || ‘skb+^(_~:.-99’
ELSE B.skewedCol
END = S.joincol
how do we know when the optimizer is going for this PRPD ? is there any specific keyword which we can see in explain plan when PRPD takes place?