Teradata employs two distinct approaches in Ordered Analytic Function for preparing data geography necessary for processing. This article elucidates both approaches and their respective merits and demerits.
Teradata Ordered Analytical Functions
Teradata Analytic Functions are versatile tools that allow for a range of applications. The capability to retrieve previous and subsequent rows is critical in numerous inquiries.
To calculate the running total of a metric over the past three days, follow these steps:
- The rows have to be sorted ascending by date
- The summation is carried out:
- The value of the current row is added to the sum of the previous row
- The value of the row that lies three days back is subtracted
All of these steps can be executed using a solitary SQL statement.
SELECT SUM(Amount) OVER (PARTITION BY Department_Id ORDER BY Sales_Date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW ) AS RunningSum
FROM Sales;
Teradata Ordered Analytical Functions & Performance Considerations
- Any available compressed Join Index or Hash Index is not used when using Teradata Analytical Functions.
- All rows which fall into the same partition are moved to the same AMP. Changing data geography can cause “out of spool” situations. By partition, we mean the columns defined in the PARTITION BY part of the Ordered Analytic Function. This is not to be confused with Row Level Partitioning.
- Skewing can occur since the rows of a partition are copied to the same AMP.
Teradata evenly distributes rows across AMPs using one of two methods.
- Suppose the number of distinct values of the PARTITION BY columns is higher than that of AMPS. In that case, the distribution is carried out according to the PARTITION BY key using the well-known hashing algorithm on the partition by columns.
- Suppose the number of distinct values of the PARTITION BY columns is higher than that of AMPS. In that case, the distribution is carried out according to the ranges of key values defined by the PARTITION by columns plus the ORDER BY column(s).
A Simple Trick to Reduce the Skewing
We can expand the PARTITION BY or ORDER BY by adding more columns with diverse values. However, the supplementary column should not alter the query’s meaning.
We added a column to the ORDER BY clause, which reduces skewing only if the clause is utilized for data distribution.
Original Query (Heavily Skewed)
SELECT SUM(Amount) OVER (PARTITION BY Customer_Id ORDER BY Due_Date ASC ) AS TotalSum
FROM Ranges;
MinAMPIO = 102
MaxAMPIO = 237
Optimized Query (Less Skewed)
SELECT SUM(Amount) OVER (PARTITION BY Customer_Id ORDER BY Due_Date ASC, QuiteUniqueColumn ) AS TotalSum
FROM Ranges;
MinAMPIO = 191
MaxAMPIO = 231
How to determine how the Rows are distributed?
If range distribution is utilized, the execution plan will display the statement “redistributed by value to all AMPs.”
Explain SELECT SUM(Amount) OVER (PARTITION BY Customer_Id ORDER BY Due_Date ASC, QuiteUniqueColumn ) AS TotalSum
FROM Ranges;
1) First, we lock DWHPRO.Ranges in TD_MAP1 for read on a reserved
RowHash to prevent global deadlock.
2) Next, we lock DWHPRO.Ranges in TD_MAP1 for read.
3) We do an all-AMPs STAT FUNCTION step in TD_MAP1 from DWHPRO.Ranges
by way of an all-rows scan with no residual conditions into Spool
5 (Last Use), which is assumed to be redistributed by value to all
AMPs in TD_Map1. The result rows are put into Spool 1
(group_amps), which is built locally on the AMPs. The size is
estimated with low confidence to be 293,656 rows (12,039,896
bytes). The estimated time for this step is 0.57 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.57 seconds.