Category Archives for "SQL"

Teradata TO_CHAR

The Teradata TO_CHAR Function

The Teradata TO_CHAR function can be used to:

  • Convert numeric expressions to character strings
  • Convert dates, times, interval data types, or timestamps to character strings

The first argument is a numeric expression. It can be any of these data types: BYTEINT, SMALLINT, INTEGER, BIGINT, DECIMAL/NUMERIC, FLOAT/REAL/DOUBLE PRECISION, or NUMBER.

The second argument is a format string and has to be CHAR or VARCHAR (more about format strings can be found in the Teradata documentation). Here is an example:

SELECT TO_CHAR(123456789,'999999999');

Hint: If the query returns only ‘#’ characters, the conversion failed because the number does not fit into the format:

SELECT TO_CHAR(123456789,'99');
--> ###

Here is an example with a DateTime argument. The first argument has to be one of the following data types: DATE, TIME, TIME WITH TIME ZONE, TIMESTAMP, TIMESTAMP WITH TIME ZONE, or an interval data type; the second argument again has to be CHAR or VARCHAR:

SELECT TO_CHAR(DATE'2019-11-09', 'DD-MON-YYYY' );
--> '09-NOV-2019'

Teradata TO_CHAR and Statistic Estimations

Of course, we are all excited when Teradata provides new features with every release that make our lives easier. However, it is important not to forget that the use of functions always carries the risk that the optimizer has no statistics available.

Here is an example that shows exactly such a case. First, we will apply TO_CHAR on a date and take a look at the execution plan:

SELECT Col_1, Col_2 
FROM DBNAME.TableName 
WHERE TO_CHAR(ENT_END_TD, 'mm-dd-yyyy') = '12-31-9999'
ORDER BY 1;

We also collect statistics on exactly this TO_CHAR expression:

COLLECT STATS COLUMN (TO_CHAR(ENT_END_TD, 'mm-dd-yyyy')) AS COL_DATE_UDF 
ON DBNAME.TableName;

The excerpt from the execution plan below illustrates that in step 2 the optimizer has “No Confidence” despite the statistics collected on the expression used in the WHERE condition. The estimated number of rows is 1, which is wrong (the correct result is 2 rows):

Step 2:

Next we do an all-AMPs RETRIEVE step from dbname.tablename in view viewdb.view_nm by way of an all-rows scan with a condition of ("(TD_SYSFNLIB.To_Char(CAST((viewdb.view_nm.ENT_END_TD) AS TIMESTAMP(6) WITH TIME ZONE) AT TIME ZONE INTERVAL '0:00' HOUR TO MINUTE, 'mm-dd-yyyy'))= '12-31-9999'") into Spool 1 (group_amps), which is built locally on the AMPs. Then we do a SORT to order Spool 1 by the sort key in spool field1. The size of Spool 1 is estimated with no confidence to be 1 row.

Let’s rewrite the query, and use a simple CAST:

SELECT Col_1, Col_2 
FROM DBNAME.TableName 
WHERE CAST(CAST(ENT_END_TD AS DATE FORMAT 'MM-DD-YYYY') AS CHAR(10)) = '12-31-9999'
ORDER BY 1;

Again, we collect statistics on exactly this expression:

COLLECT STATS COLUMN (CAST(CAST(ENT_END_TD AS DATE FORMAT 'MM-DD-YYYY') AS CHAR(10))) AS COL_DATE_UDF 
ON DBNAME.TableName;

Here is the corresponding step of the new execution plan:

Next we do an all-AMPs RETRIEVE step from dbname.tablename in view viewdb.view_nm by way of an all-rows scan with a condition of ("(CAST((CAST((viewdb.view_nm.ENT_END_TD) AS DATE FORMAT 'MM-DD-YYYY')) AS CHAR(10)))= '12-31-9999'") into Spool 1 (group_amps), which is built locally on the AMPs. Then we do a SORT to order Spool 1 by the sort key in spool field1. The size of Spool 1 is estimated with high confidence to be 2 rows.

As we can see, the optimizer now rates this step with “high confidence” because it has the statistics at its disposal. The TO_CHAR function in our first query made it impossible to use them.

Conclusion

User-Defined Functions (UDFs), such as the Teradata TO_CHAR function, should only be used if we cannot achieve the same result with “simple” means. Functions usually mean (with some exceptions, such as UPPER) that existing statistics cannot be used!

High-Performance Calculations with Teradata Ordered Analytic Functions

Teradata Ordered Analytic Functions use two different methods to redistribute the data, depending on the data demographics.

Teradata Ordered Analytic Functions

Teradata Ordered Analytic Functions are useful if the result of a row is dependent on the values of previous or subsequent rows.

Here is an example:

The running sum over the last 3 days should be calculated. The following calculation steps are necessary:

  1. The rows have to be sorted ascending by date
  2. The summation is carried out as follows:
    • The value of the current row is added to the sum of the previous row
    • The value of the row that lies 3 days back is subtracted

All the above steps can be performed with a single SQL statement (assuming one row per department and day, the window of the current row plus the 2 preceding rows covers the last 3 days):

SELECT 
   SUM(Amount) OVER (PARTITION BY Department_Id 
                     ORDER BY Sales_Date 
                     ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS RunningSum
FROM Sales;

Teradata Ordered Analytic Functions & Performance Considerations

  • When using Teradata ordered analytic functions, it should be noted that available compressed join indexes and hash indexes are not used.
  • All rows that fall into the same partition are moved to the same AMP. This can cause an AMP to run out of spool space.
  • Since the rows of a partition are copied to the same AMP, skew can also occur.

More specifically, Teradata uses one of 2 different methods to distribute the rows (and to reduce skew):

  • If the number of different values of the PARTITION BY columns is high in comparison to the number of AMPs, the distribution is carried out by the usual hashing algorithm on the PARTITION BY columns.
  • If the number of different values of the PARTITION BY columns is small compared to the number of AMPs, the distribution is carried out by ranges of key values defined by the PARTITION BY columns plus the ORDER BY columns.
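
To gauge which case applies, we can count the rows per AMP for the PARTITION BY columns with the hashing functions. Here is a minimal sketch, assuming the Sales table from above with Department_Id as the PARTITION BY column:

SELECT 
   HASHAMP(HASHBUCKET(HASHROW(Department_Id))) AS AMP_No, -- AMP that receives this hash value
   COUNT(*) AS Row_Count
FROM Sales
GROUP BY 1
ORDER BY 2 DESC;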

A Simple Trick to reduce Skewing

We can extend the PARTITION BY key or the ORDER BY with additional columns that have as many different values as possible. Of course, the additional column must not change the semantics of the query.

Here is an example, where we added a column to the ORDER BY (which of course reduces skewing only if ORDER BY columns are considered by Teradata when distributing the rows):

Original Query (Heavily Skewed)

SELECT SUM(Amount) OVER (PARTITION BY Customer_Id ORDER BY Due_Date ASC) AS TotalSum
FROM Ranges;

MinAMPIO = 102
MaxAMPIO = 237
 

Optimized Query (Less Skewed)

SELECT SUM(Amount) OVER (PARTITION BY Customer_Id ORDER BY Due_Date ASC, QuiteUniqueColumn) AS TotalSum
FROM Ranges;

MinAMPIO = 191
MaxAMPIO = 231

How to determine how the Rows are distributed?

Whether distribution by ranges is used can easily be found out: check the execution plan for the phrase “redistributed by value to all AMPs”:

EXPLAIN SELECT SUM(Amount) OVER (PARTITION BY Customer_Id ORDER BY Due_Date ASC, QuiteUniqueColumn) AS TotalSum
FROM Ranges;

  1) First, we lock DWHPRO.Ranges in TD_MAP1 for read on a reserved
     RowHash to prevent global deadlock.
  2) Next, we lock DWHPRO.Ranges in TD_MAP1 for read.
  3) We do an all-AMPs STAT FUNCTION step in TD_MAP1 from DWHPRO.Ranges
     by way of an all-rows scan with no residual conditions into Spool
     5 (Last Use), which is assumed to be redistributed by value to all
     AMPs in TD_Map1.  The result rows are put into Spool 1
     (group_amps), which is built locally on the AMPs.  The size is
     estimated with low confidence to be 293,656 rows (12,039,896
     bytes).  The estimated time for this step is 0.57 seconds.
  4) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 0.57 seconds.

Teradata Joins – A Simple Guide to avoid Skew

Skewed Teradata Joins – The initial situation

Let’s take the following scenario as an example. One table contains currencies, the other table contains the accounts of our customers including the currency in which the account is managed.
This means that the ISO code of the currency is the foreign key of the account table:

CREATE TABLE Currency
(
   CURRENCY_CD   VARCHAR(20) NOT NULL,
   CURRENCY_NAME VARCHAR(200)
) PRIMARY INDEX (CURRENCY_CD);

CREATE TABLE Customer
(
  CUSTOMER_ID   INTEGER NOT NULL,
  CUSTOMER_NAME VARCHAR(255),
  ...
  CURRENCY_CD   VARCHAR(20)
) PRIMARY INDEX (CUSTOMER_ID);

CUSTOMER TABLE

CUSTOMER_ID  CUSTOMER_NAME   CURRENCY_CD
1            Nina Lowery     EUR
2            Alexia Neal     USD
3            Kyla Chan       NULL
4            Alesha Ferrell  NULL
5            Cara Adams      NULL
6            Abigail Larsen  NULL
7            Amie Massey     NULL

CURRENCY TABLE

CURRENCY_CD  CURRENCY_NAME
EUR          Euro
USD          US Dollar
AUD          Australian Dollar
HUF          Hungarian Forint

A join over the column CURRENCY_CD requires the rows of both tables to be available on the same AMPs. Since CURRENCY_CD is the primary index of the Currency table only, the rows of the Customer table have to be redistributed by the hash of CURRENCY_CD. As you can see immediately from our example table, most customers have no currency assigned (NULL); all these rows end up on the same AMP, and therefore the join will be skewed.

Even though current versions of Teradata apply optimizations to reduce this problem (e.g., by re-hashing the rows of well-distributed values but copying rows with a NULL value to all AMPs), it still persists, especially if the optimizer has no statistics about the biased values or only stale ones.
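
Fresh statistics on the join column are therefore essential so that the optimizer can detect the biased values. A minimal example:

COLLECT STATISTICS COLUMN (CURRENCY_CD) ON Customer;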

A possible solution to this Problem

One possible solution is to replace the NULL value with a value that is as distinct as possible. In our case, for example, the CUSTOMER_ID can be used. We could simply cast it as a character column, add a “special string” in front of it, and write it into the column CURRENCY_CD:

CUSTOMER_ID  CUSTOMER_NAME   CURRENCY_CD
1            Nina Lowery     EUR
2            Alexia Neal     USD
3            Kyla Chan       #NULL#3
4            Alesha Ferrell  #NULL#4
5            Cara Adams      #NULL#5
6            Abigail Larsen  #NULL#6
7            Amie Massey     #NULL#7
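
Here is a minimal sketch of this replacement, assuming the Customer table defined above (the '#NULL#' prefix is just an example value):

UPDATE Customer
SET CURRENCY_CD = '#NULL#' || TRIM(CAST(CUSTOMER_ID AS VARCHAR(13))) -- distinct per-row surrogate
WHERE CURRENCY_CD IS NULL;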

A disadvantage of this method, however, is that NULL values are no longer recognizable as such, so you might need some extra logic in the queries.
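
For example, a query could map the surrogate values back to NULL. A sketch, relying on the '#NULL#' prefix introduced above:

SELECT
   CUSTOMER_ID,
   -- restore NULL semantics for the surrogate values
   CASE WHEN CURRENCY_CD LIKE '#NULL#%' THEN NULL ELSE CURRENCY_CD END AS CURRENCY_CD
FROM Customer;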


Teradata UNION & UNION ALL in a Nutshell

Both statements, the Teradata UNION and the UNION ALL, combine the result sets of two or more SQL statements.

The UNION ALL statement takes all rows from the intermediate results and puts them together into the result set. The UNION statement removes duplicate rows before returning the result set:

SELECT <COLUMN1> FROM <TABLE_1>
UNION
SELECT <COLUMN1> FROM <TABLE_2> ;

SELECT <COLUMN1> FROM <TABLE_1>
UNION ALL
SELECT <COLUMN1> FROM <TABLE_2> ;

All statements combined with UNION have to return the same number of columns, and the data types of the columns across the participating SELECT statements have to match. If they don’t match, the data types of the very first SELECT statement take precedence: the columns of the remaining statements are implicitly cast to the data types of the first statement. Keep this in mind, especially for character columns, as this can cause hidden truncation of text, a problem that is tough to discover.
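
Here is a minimal sketch of this pitfall: the first SELECT fixes the result type to CHAR(3), so the longer literal of the second SELECT is cut down (silently in Teradata session mode; ANSI mode raises a truncation error):

SELECT CAST('EUR' AS CHAR(3)) AS Currency_Cd
UNION ALL
SELECT 'EURO';  -- implicitly cast to CHAR(3), returned as 'EUR'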

But now let’s come back to performance considerations: UNION ALL always performs better than UNION, without exception.

The reason is that UNION removes duplicates from the result set – the AMPs have to sort the resulting rows of all spools to be able to remove duplicates.
Sorting is expensive and makes the difference in performance between UNION and UNION ALL.

Only use UNION if you need to remove row duplicates. If there are no row duplicates in the data, always avoid the sort step by using UNION ALL.

Create a List of Values in Teradata

Method 1, avoiding any helper tables or functions that are only available in newer releases of Teradata:

WITH RECURSIVE NumberRanges (TheNumber, TheString) AS
(
   SELECT 1 AS TheNumber, CAST(1 AS VARCHAR(500)) AS TheString
   FROM
   (
      SELECT * FROM (SELECT NULL AS X) X
   ) DUMMYTABLE
   UNION ALL
   SELECT
      TheNumber + 1 AS TheNumber,
      TheString || ',' || TRIM(TheNumber + 1)
   FROM NumberRanges
   WHERE TheNumber < 10
)
SELECT TheString
FROM NumberRanges
QUALIFY ROW_NUMBER() OVER (ORDER BY TheNumber DESC) = 1;

Method 2, with helper tables and XMLAGG:

SELECT 
     TRIM(TRAILING ',' FROM (XMLAGG(TRIM(rn) || ',' ORDER BY rn) (VARCHAR(255))))
FROM 
     (SELECT ROW_NUMBER() OVER (ORDER BY Sys_Calendar."CALENDAR".day_of_calendar) AS rn 
      FROM Sys_Calendar."CALENDAR" QUALIFY rn <= 10) t;

Both methods deliver the string “1,2,3,4,5,6,7,8,9,10”.

Which one do you think is consuming less Disk IOs and CPU seconds?

The Teradata OVERLAPS Command

With historized tables, it is often necessary to find out whether two time periods overlap. Usually, several comparisons are necessary.
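
Without OVERLAPS, the classic check for two periods (s1, e1) and (s2, e2) looks roughly like this. A sketch matching the boundary behavior shown below, where touching endpoints do not count as an overlap:

SELECT 'Overlapping'
WHERE DATE '2019-01-01' < DATE '2019-12-31'  /* s1 < e2 */
  AND DATE '2019-05-31' < DATE '2019-06-30'; /* s2 < e1 */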

The Teradata OVERLAPS command makes it easier.

Here is an example:

SELECT 'Overlapping' WHERE (DATE '2019-01-01', DATE '2019-06-30') OVERLAPS (DATE '2019-05-31', DATE '2019-12-31');

-> Result: 'Overlapping'

Caution is advised when the end of an interval is equal to the start of the second interval. The OVERLAPS command does not consider this to be an overlap:

SELECT 'Overlapping' WHERE (DATE '2019-01-01', DATE '2019-06-30') OVERLAPS (DATE '2019-06-30', DATE '2019-12-31');

-> No rows returned

With DATE intervals, the periods must overlap by at least 2 days. So the following query recognizes the overlap as expected:

SELECT 'Overlapping' WHERE (DATE '2019-01-01', DATE '2019-06-30') OVERLAPS (DATE '2019-06-29', DATE '2019-12-31');

-> Result: 'Overlapping'

Analogous to DATE intervals, TIME intervals must overlap by at least 2 seconds.

Thus, the following query does not return an overlap:

SELECT 'Overlapping' (TITLE ' ') WHERE (TIME '03:00:00', TIME '05:00:00') OVERLAPS (TIME '05:00:00', TIME '07:00:00');

-> No rows returned

At least 2 seconds of overlap are necessary for the detection:

SELECT 'Overlapping' (TITLE ' ') WHERE (TIME '03:00:00', TIME '05:00:00') OVERLAPS (TIME '04:59:59', TIME '07:00:00');

-> Result: 'Overlapping'

Of course, the OVERLAPS command can also be used with TIMESTAMPs:

SELECT 'Overlapping' (TITLE ' ') WHERE (TIMESTAMP '2019-01-01 03:00:00', TIMESTAMP '2019-06-30 05:00:00') OVERLAPS (TIMESTAMP '2019-06-30 05:00:00', TIMESTAMP '2019-12-31 07:00:00');

-> No rows returned

SELECT 'Overlapping' (TITLE ' ') WHERE (TIMESTAMP '2019-01-01 03:00:00', TIMESTAMP '2019-06-30 05:00:00') OVERLAPS (TIMESTAMP '2019-06-30 04:59:59', TIMESTAMP '2019-12-31 07:00:00');

-> Result: 'Overlapping'

As you can see, the OVERLAPS command provides a simple syntax for detecting overlapping intervals. The only peculiarity to keep in mind is that the intervals must overlap by at least 2 days (DATE) or 2 seconds (TIME).

Finally, there is a special case, which may lead to an unexpected result for you:

SELECT 'Overlapping' (TITLE ' ') WHERE (TIME '05:00:00', TIME '03:00:00') OVERLAPS (TIME '03:00:01', TIME '04:00:00');

What do you think is being delivered back here? Leave a comment with your solution!
