Teradata uses two different methods in Ordered Analytic Function to prepare the required data geography for processing. This article will explain both methods and their advantages and disadvantages.
Teradata Ordered Analytical Functions
Teradata Ordered Analytic Functions have a variety of uses. The ability to access subsequent and previous rows is essential for many queries.
In our example, we want to calculate the running total of a metric over the previous three days. The following steps are necessary for this:
- The rows have to be sorted ascending by date
- The summation is carried out:
- The value of the current row is added to the sum of the previous row
- The value of the row that lies three days back is subtracted
All the above steps can be performed with a single SQL:
SELECT SUM(Amount) OVER (PARTITION BY Department_Id ORDER BY Sales_Date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW ) AS RunningSum FROM Sales;
Teradata Ordered Analytical Functions & Performance Considerations
- Any available compressed Join Index or Hash Index is not used when using Teradata Analytical Functions.
- All rows which fall into the same partition are moved to the same AMP. Changing data geography can cause “out of spool” situations. By partition, we mean the columns defined in the PARTITION BY part of the Ordered Analytic Function. This is not to be confused with Row Level Partitioning.
- Skewing can occur since the rows of a partition are copied to the same AMP.
Teradata uses one of two different methods to distribute the rows evenly across the AMPs:
- Suppose the number of distinct values of the PARTITION BY columns is high compared to the number of AMPS. In that case, the distribution is carried out according to the PARTITION BY key using the well-known hashing algorithm on the partition by columns.
- Suppose the number of distinct values of the PARTITION BY columns is higher than the number of AMPS. In that case, the distribution is carried out according to the ranges of key values defined by the PARTITION by columns plus the ORDER BY column(s).
A Simple Trick to reduce the Skewing
We can extend the PARTITION BY key or the ORDER BY with additional columns with as many different values as possible. Of course, the extra column must not change the semantics of the query.
Here is an example where we added a column to the ORDER BY clause (which, of course, reduces skewing only if the ORDER BY clause is used for data distribution):
Original Query (Heavily Skewed)
SELECT SUM(Amount) OVER (PARTITION BY Customer_Id ORDER BY Due_Date ASC ) AS TotalSum FROM Ranges; MinAMPIO = 102 MaxAMPIO = 237
Optimized Query (Less Skewed)
SELECT SUM(Amount) OVER (PARTITION BY Customer_Id ORDER BY Due_Date ASC, QuiteUniqueColumn ) AS TotalSum FROM Ranges; MinAMPIO = 191 MaxAMPIO = 231
How to determine how the Rows are distributed?
If distribution by ranges is used, you can find the statement “redistributed by value to all AMPs” in the execution plan:
Explain SELECT SUM(Amount) OVER (PARTITION BY Customer_Id ORDER BY Due_Date ASC, QuiteUniqueColumn ) AS TotalSum FROM Ranges; 1) First, we lock DWHPRO.Ranges in TD_MAP1 for read on a reserved RowHash to prevent global deadlock. 2) Next, we lock DWHPRO.Ranges in TD_MAP1 for read. 3) We do an all-AMPs STAT FUNCTION step in TD_MAP1 from DWHPRO.Ranges by way of an all-rows scan with no residual conditions into Spool 5 (Last Use), which is assumed to be redistributed by value to all AMPs in TD_Map1. The result rows are put into Spool 1 (group_amps), which is built locally on the AMPs. The size is estimated with low confidence to be 293,656 rows (12,039,896 bytes). The estimated time for this step is 0.57 seconds. 4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.57 seconds.