High-Performance Calculations with Teradata Ordered Analytic Functions
Teradata Ordered Analytic Functions use two different methods to redistribute the data, depending on the data demographics.
Teradata Ordered Analytic Functions
Teradata Ordered Analytic Functions are useful if the result of a row is dependent on the values of previous or subsequent rows.
Here is an example:
The running sum over the last 3 days should be calculated. The following calculation steps are necessary:
- The rows have to be sorted ascending by date
- The summation is carried out by:
- The value of the current row is added to the sum of the previous row
- The value of the row that lies 3 days back is subtracted
All the above steps can be performed with a single SQL:
SELECT SUM(Amount) OVER (PARTITION BY Department_Id ORDER BY Sales_Date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW ) AS RunningSum FROM Sales;
Teradata Ordered Analytic Functions & Performance Considerations
- When using Teradata Analytic Functions, it should be noted that any available compressed Join Index or Hash Index is not used.
- All rows which fall into the same partition are moved to the same AMP. This can lead to the situation that the AMP is running out of spool space.
- Since the rows of a partition are copied to the same AMP, skewing can also occur.
More specifically, Teradata uses one of 2 different methods to distribute the rows (and to reduce skew):
- If the number of different values of the PARTITION BY columns is high in comparison to the number of AMPS, the distribution is carried out according to the PARTITION BY key using the usual hashing algorithm on the partition by columns.
- If the number of different values of the PARTITION BY columns is small compared to the number of AMPS, the distribution is carried out according to ranges of key values defined by the PARTITION by columns plus the ORDER BY columns.
A Simple Trick to reduce Skewing
We can extend the PARTITION BY key or the ORDER BY with additional columns that have as many different values as possible. Of course, the additional column must not change the semantics of the query.
Here is an example, where we added a column to the ORDER BY (which of course reduces skewing only if ORDER BY columns are considered by Teradata when distributing the rows):
Original Query (Heavily Skewed)
SELECT SUM(Amount) OVER (PARTITION BY Customer_Id ORDER BY Due_Date ASC ) AS TotalSum FROM Ranges; MinAMPIO = 102 MaxAMPIO = 237
Optimized Query (Less Skewed)
SELECT SUM(Amount) OVER (PARTITION BY Customer_Id ORDER BY Due_Date ASC, QuiteUniqueColumn ) AS TotalSum FROM Ranges; MinAMPIO = 191 MaxAMPIO = 231
How to determine how the Rows are distributed?
If distribution by ranges is used, can easily be found out: Check the execution plan for the statement “redistributed by value to all AMPs”
Explain SELECT SUM(Amount) OVER (PARTITION BY Customer_Id ORDER BY Due_Date ASC, QuiteUniqueColumn ) AS TotalSum FROM Ranges; 1) First, we lock DWHPRO.Ranges in TD_MAP1 for read on a reserved RowHash to prevent global deadlock. 2) Next, we lock DWHPRO.Ranges in TD_MAP1 for read. 3) We do an all-AMPs STAT FUNCTION step in TD_MAP1 from DWHPRO.Ranges by way of an all-rows scan with no residual conditions into Spool 5 (Last Use), which is assumed to be redistributed by value to all AMPs in TD_Map1. The result rows are put into Spool 1 (group_amps), which is built locally on the AMPs. The size is estimated with low confidence to be 293,656 rows (12,039,896 bytes). The estimated time for this step is 0.57 seconds. 4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.57 seconds.