How Do I Select The Appropriate Data Type In Teradata?
Data type conversions are costly and can consume a lot of CPU on large tables. Choosing the wrong data types also hurts the execution plan. This article explains how to select the correct data types to achieve optimal performance.
It is essential to choose the same data type for the same information in different tables. The reason lies in how Teradata distributes and locates rows: via the Teradata hashing algorithm.
This algorithm returns a different hash value depending on the data type. Even if two values look identical to you, they may not be identical for Teradata.
Different hash values mean, for example, that AMP-local joins cannot be performed, or that access via the Primary Index or a USI is not possible and a full table scan is required instead.
Let’s take the literals 3 (numeric) and ‘3’ (character) as examples, and look at the corresponding hash values:
SELECT HASHROW(3); -- 6D-27-DA-A6
SELECT HASHROW('3'); -- 3F-DD-85-0D
The different hash values mean that two rows with these values in the Primary Index end up on different AMPs.

If two such tables are joined, the rows must first be brought to a common AMP (by duplicating or redistributing one or both tables). This is an I/O-intensive process with a significant impact on performance.
Hash-Compatible Data Types
Some data types are hash-compatible, i.e., they generate the same hash value for equivalent values.
For example, all integer data types (INTEGER, BIGINT, BYTEINT, SMALLINT) and DECIMAL without decimal places produce the same hash value. Since a date is stored in Teradata as an integer, it is also compatible with the other integer data types.
NULL, 0, and an empty string produce the same hash value. For NOT CASESPECIFIC character data (the Teradata-mode default), upper and lower case also make no difference: both generate the same hash value.
The character data types CHAR and VARCHAR are also hash-compatible.
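These compatibilities can be verified directly with HASHROW. The following is a sketch to run on any Teradata system; it assumes a session in Teradata mode, where character comparisons default to NOT CASESPECIFIC. (The actual hash values are not shown here, since they depend only on the hashing algorithm, not on any system setting.)

```sql
-- All integer types and a 0-scale DECIMAL hash identically:
SELECT HASHROW(CAST(1 AS BYTEINT))        AS h_byteint
     , HASHROW(CAST(1 AS SMALLINT))       AS h_smallint
     , HASHROW(CAST(1 AS INTEGER))        AS h_integer
     , HASHROW(CAST(1 AS BIGINT))         AS h_bigint
     , HASHROW(CAST(1 AS DECIMAL(10,0)))  AS h_decimal;  -- all five match

-- NULL, 0, and the empty string hash identically:
SELECT HASHROW(NULL), HASHROW(0), HASHROW('');

-- CHAR vs. VARCHAR and upper vs. lower case (NOT CASESPECIFIC) are compatible:
SELECT HASHROW(CAST('a' AS CHAR(10)))
     , HASHROW(CAST('a' AS VARCHAR(10)))
     , HASHROW('A');
```

If all values in one result row are equal, the corresponding data types are hash-compatible on your system.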
Spaces can also influence the hash value. Let's look at the following three queries:
SELECT HASHROW(' a'); -- 95-26-AC-0D
SELECT HASHROW('a');  -- CE-EA-E2-15
SELECT HASHROW('a '); -- CE-EA-E2-15
As you can see, the string with a leading space produces a different hash value than the string without spaces or the string with a trailing space. The trailing space does not matter because Teradata ignores trailing blanks when hashing character values.
The Impact Of A Poor Data Model
Here is another example showing what happens when data types don't match because of a flawed data model.
We create three tables and fill the column PK with sequential numbers:
CREATE TABLE INT_PI  ( PK INTEGER  ) PRIMARY INDEX (PK);
CREATE TABLE INT2_PI ( PK INTEGER  ) PRIMARY INDEX (PK);
CREATE TABLE CHAR_PI ( PK CHAR(20) ) PRIMARY INDEX (PK);
INSERT INTO INT_PI  SELECT ROW_NUMBER() OVER (ORDER BY 1) FROM SYS_CALENDAR.CALENDAR;
INSERT INTO INT2_PI SELECT ROW_NUMBER() OVER (ORDER BY 1) FROM SYS_CALENDAR.CALENDAR;
INSERT INTO CHAR_PI SELECT ROW_NUMBER() OVER (ORDER BY 1) FROM SYS_CALENDAR.CALENDAR;
First, we run a query on the two tables where column “PK” is defined as an integer:
SELECT * FROM INT_PI WHERE PK IN (SELECT PK FROM INT2_PI);
The optimizer generates the following execution plan:
Explain SELECT * FROM INT_PI WHERE PK IN (SELECT PK FROM INT2_PI);
4) We do an all-AMPs JOIN step in TD_MAP1 from DWHPRO.INT_PI by way of an all-rows scan with no residual conditions, which is joined to DWHPRO.INT2_PI by way of an all-rows scan with no residual conditions. DWHPRO.INT_PI and DWHPRO.INT2_PI are joined using an inclusion merge join, with a join condition of ("DWHPRO.INT_PI.PK = DWHPRO.INT2_PI.PK"). The result goes into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with low confidence to be 73,408 rows (1,835,200 bytes). The estimated time for this step is 0.27 seconds.
5) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.27 seconds.
As you can see in the execution plan, a direct (AMP-local) join is performed without moving rows between AMPs. This is possible because both tables have the same Primary Index and the same data type.
In comparison, we now look at the query in which one of the tables has defined the column “PK” as a character:
SELECT * FROM INT_PI WHERE PK IN (SELECT PK FROM CHAR_PI);
Here is the corresponding execution plan:
4) We execute the following steps in parallel.
   1) We do an all-AMPs RETRIEVE step in TD_MAP1 from DWHPRO.INT_PI by way of an all-rows scan with no residual conditions into Spool 3 (all_amps) fanned out into 8 hash join partitions, which is duplicated on all AMPs in TD_Map1. The size of Spool 3 is estimated with low confidence to be 146,816 rows (2,642,688 bytes). The estimated time for this step is 0.12 seconds.
   2) We do an all-AMPs RETRIEVE step in TD_MAP1 from DWHPRO.CHAR_PI by way of an all-rows scan with no residual conditions into Spool 5 (all_amps), which is redistributed by the hash code of (DWHPRO.CHAR_PI.PK (FLOAT, FORMAT '-9.99999999999999E-999')) to all AMPs in TD_Map1. Then we do a SORT to order Spool 5 by the sort key in spool field1 eliminating duplicate rows. The size of Spool 5 is estimated with low confidence to be 73,504 rows (2,131,616 bytes). The estimated time for this step is 0.22 seconds.
5) We do an all-AMPs RETRIEVE step in TD_Map1 from Spool 5 (Last Use) by way of an all-rows scan into Spool 4 (all_amps) fanned out into 8 hash join partitions, which is redistributed by the hash code of (PK (FLOAT, FORMAT '-9.99999999999999E-999')(FLOAT)) to all AMPs in TD_Map1. The size of Spool 4 is estimated with low confidence to be 73,504 rows (2,205,120 bytes).
6) We do an all-AMPs JOIN step in TD_Map1 from Spool 3 (Last Use) by way of an all-rows scan, which is joined to Spool 4 (Last Use) by way of an all-rows scan. Spool 3 and Spool 4 are joined using an inclusion hash join of 8 partitions, with a join condition of ("(PK) = Field_2"). The result goes into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with low confidence to be 73,408 rows (1,835,200 bytes). The estimated time for this step is 0.35 seconds.
7) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of statement 1.
As you can see, for this query one table is duplicated to all AMPs and the other is redistributed. The rows must also be sorted, and the data types converted (note the cast to FLOAT in the plan).
Altogether, this has an enormous negative effect on performance.
What We Should Learn From It
Unsuitable data types are always a sign of flawed data modeling. Too often I see projects where data is simply loaded somehow, and the data model then emerges more or less by accident.
Even if this seems faster and cheaper at the beginning of a project, one thing is certain: the costs at the end of the project are many times higher.
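One practical safeguard is to detect mismatched types up front in the data dictionary. The following sketch assumes access to the DBC.ColumnsV view; the database name 'DWHPRO' is a placeholder. It lists same-named columns defined with different types across tables, a common symptom of join columns that will not be hash-compatible:

```sql
-- Sketch: find columns with the same name but different type definitions
-- across the tables of one database. Same-named columns are only a
-- heuristic for join columns, so review the result manually.
SELECT c1.ColumnName
     , c1.TableName  AS Table1
     , c1.ColumnType AS Type1
     , c2.TableName  AS Table2
     , c2.ColumnType AS Type2
FROM DBC.ColumnsV c1
JOIN DBC.ColumnsV c2
  ON  c1.DatabaseName = c2.DatabaseName
  AND c1.ColumnName   = c2.ColumnName
  AND c1.TableName    < c2.TableName   -- report each pair only once
WHERE c1.DatabaseName = 'DWHPRO'       -- placeholder database name
  AND (   c1.ColumnType   <> c2.ColumnType
       OR c1.ColumnLength <> c2.ColumnLength);
```

Running such a check as part of a deployment review catches type mismatches before they turn into redistribution-heavy execution plans.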
Hi Roland,
Just to pick up on a question that you asked above. Yes, any DECIMAL data type with 0 digits after the decimal point will always generate the same hash as the same value in one of the integer data types.
It is for the reason that you thought, in Teradata DECIMAL columns are stored as ‘scaled integers’. So, a value of 12.3 stored in a DECIMAL(10,2) is stored as 1230. Similarly, the same value stored in a DECIMAL(10,1) would be stored as 123. This is why decimals with 0 decimal places will hash the same as integers.
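This is easy to confirm with HASHROW. A quick sketch (I deliberately show no hash values here; the point is only that the two results in the first query match, as described above):

```sql
-- A value stored in a 0-scale DECIMAL hashes like the same value
-- stored in an integer type:
SELECT HASHROW(CAST(123 AS DECIMAL(10,0))) AS h_dec
     , HASHROW(CAST(123 AS INTEGER))       AS h_int;  -- h_dec = h_int
```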
Cheers,
Dave