Tuning With Teradata Data Types – Choose Them Wisely!

By Roland Wenzlofsky

December 27, 2019

data types, hashing, hashrow

How Do I Select The Appropriate Data Type In Teradata?

Data type conversions are costly and can consume a lot of CPU on large tables.

Choosing the wrong data types also has a negative effect on the execution plan.

In this article, I explain the details and show how to get optimal performance from your queries by choosing the right data types.

It is important to choose the same data type for the same information in different tables. This is because of how Teradata distributes and finds data: Using the Teradata hashing algorithm.

This algorithm returns a different hash value depending on the data type. Even if the values seem the same to you, they may not be the same for Teradata.

Different hash values mean, for example, that AMP-local joins cannot be performed, or that access via the Primary Index or a USI is not possible and a full table scan is required instead.

Let's take the literals 3 (numeric) and '3' (character) as examples and look at the corresponding hash values:

SELECT HASHROW(3); -- 6D-27-DA-A6
SELECT HASHROW('3'); -- 3F-DD-85-0D 

The different hash values mean that two rows that have these values in the Primary Index are on different AMPs.

'3' (character) and 3 (integer) create different row hashes and are therefore located on different AMPs.

If two tables are to be joined, the rows must first be brought to a common AMP (by duplication or redistribution). This is an I/O-intensive process with a significant impact on performance.

Hash-Compatible Data Types

Some data types are hash-compatible, i.e., they generate the same hash value for equivalent values.

For example, all integer data types (INTEGER, BIGINT, BYTEINT, SMALLINT), as well as DECIMAL without decimal places, produce the same hash value. Since a date is stored in Teradata as an integer, it is also compatible with the other integer data types.
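This can be checked directly with HASHROW. Here is a quick sketch; the concrete hash bytes depend on the system, but all columns within each query should return the same value:

```sql
-- All integer-style encodings of the value 1 produce the same row hash:
SELECT HASHROW(CAST(1 AS BYTEINT))       AS h_byteint,
       HASHROW(CAST(1 AS SMALLINT))      AS h_smallint,
       HASHROW(CAST(1 AS INTEGER))       AS h_integer,
       HASHROW(CAST(1 AS BIGINT))        AS h_bigint,
       HASHROW(CAST(1 AS DECIMAL(10,0))) AS h_decimal;

-- A DATE hashes like its internal integer encoding
-- ((year - 1900) * 10000 + month * 100 + day):
SELECT HASHROW(DATE '2019-12-27') AS h_date,
       HASHROW(1191227)           AS h_integer;
```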

The values NULL, 0, and an empty string produce the same hash value. For character data with the default NOT CASESPECIFIC attribute, case makes no difference either: upper case and lower case generate the same hash value.

The character data types CHAR and VARCHAR are also hash-compatible.
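These compatibilities are easy to verify yourself; a small sketch:

```sql
-- NULL, zero, and the empty string all hash identically:
SELECT HASHROW(NULL) AS h_null,
       HASHROW(0)    AS h_zero,
       HASHROW('')   AS h_empty;

-- With the default NOT CASESPECIFIC attribute, case does not matter:
SELECT HASHROW('a') AS h_lower,
       HASHROW('A') AS h_upper;

-- CHAR and VARCHAR are hash-compatible:
SELECT HASHROW(CAST('abc' AS CHAR(10)))    AS h_char,
       HASHROW(CAST('abc' AS VARCHAR(10))) AS h_varchar;
```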

Spaces can also influence the hash value. Consider the following three queries:

SELECT HASHROW(' a'); -- 95-26-AC-0D
SELECT HASHROW('a');  -- CE-EA-E2-15
SELECT HASHROW('a '); -- CE-EA-E2-15

As you can see, the string with a leading space produces a different hash value, while the trailing space does not: trailing blanks are not significant when character values are hashed.

The Impact Of A Poor Data Model

Here is another example that shows the effect when data types don't match because of a poor data model.

We create three tables and fill the column PK with sequential numbers:


CREATE TABLE INT_PI
(
    PK INTEGER
) PRIMARY INDEX (PK);

CREATE TABLE INT2_PI
(
    PK INTEGER
) PRIMARY INDEX (PK);

CREATE TABLE CHAR_PI
(
    PK CHAR(20)
) PRIMARY INDEX (PK);

INSERT INTO INT_PI  SELECT ROW_NUMBER() OVER (ORDER BY 1) FROM SYS_CALENDAR.CALENDAR;
INSERT INTO INT2_PI SELECT ROW_NUMBER() OVER (ORDER BY 1) FROM SYS_CALENDAR.CALENDAR;
INSERT INTO CHAR_PI SELECT ROW_NUMBER() OVER (ORDER BY 1) FROM SYS_CALENDAR.CALENDAR;
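Before joining, the distribution mismatch can be made visible with the standard hash functions. A sketch using literals (HASHAMP returns the AMP a hash bucket maps to, so on most systems the two results will differ):

```sql
-- The same logical key value maps to different AMPs depending on whether
-- it is hashed as a number or as a character string:
SELECT HASHAMP(HASHBUCKET(HASHROW(42)))   AS amp_numeric,
       HASHAMP(HASHBUCKET(HASHROW('42'))) AS amp_character;
```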

First, we run a query on the two tables in which column PK is defined as INTEGER:

SELECT * FROM INT_PI WHERE PK IN (SELECT PK FROM INT2_PI);

The optimizer generates the following execution plan:

Explain SELECT * FROM INT_PI WHERE PK IN (SELECT PK FROM INT2_PI);
 
  4) We do an all-AMPs JOIN step in TD_MAP1 from DWHPRO.INT_PI by way
     of an all-rows scan with no residual conditions, which is joined
     to DWHPRO.INT2_PI by way of an all-rows scan with no residual
     conditions.  DWHPRO.INT_PI and DWHPRO.INT2_PI are joined using an
     inclusion merge join, with a join condition of ("DWHPRO.INT_PI.PK
     = DWHPRO.INT2_PI.PK").  The result goes into Spool 1 (group_amps),
     which is built locally on the AMPs.  The size of Spool 1 is
     estimated with low confidence to be 73,408 rows (1,835,200 bytes).
     The estimated time for this step is 0.27 seconds.
  5) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 0.27 seconds.

As the execution plan shows, a direct (AMP-local) join is done without moving rows between AMPs. This is possible because both tables share the same Primary Index column with the same data type.

For comparison, we now look at the query in which one of the tables defines the column PK as a character type:

SELECT * FROM INT_PI WHERE PK IN (SELECT PK FROM CHAR_PI);

Here is the corresponding execution plan:

  4) We execute the following steps in parallel.
       1) We do an all-AMPs RETRIEVE step in TD_MAP1 from DWHPRO.INT_PI
          by way of an all-rows scan with no residual conditions into
          Spool 3 (all_amps) fanned out into 8 hash join partitions,
          which is duplicated on all AMPs in TD_Map1.  The size of
          Spool 3 is estimated with low confidence to be 146,816 rows (
          2,642,688 bytes).  The estimated time for this step is 0.12
          seconds.
       2) We do an all-AMPs RETRIEVE step in TD_MAP1 from
          DWHPRO.CHAR_PI by way of an all-rows scan with no residual
          conditions into Spool 5 (all_amps), which is redistributed by
          the hash code of (DWHPRO.CHAR_PI.PK (FLOAT, FORMAT
          '-9.99999999999999E-999')) to all AMPs in TD_Map1.  Then we
          do a SORT to order Spool 5 by the sort key in spool field1
          eliminating duplicate rows.  The size of Spool 5 is estimated
          with low confidence to be 73,504 rows (2,131,616 bytes).  The
          estimated time for this step is 0.22 seconds.
  5) We do an all-AMPs RETRIEVE step in TD_Map1 from Spool 5 (Last Use)
     by way of an all-rows scan into Spool 4 (all_amps) fanned out into
     8 hash join partitions, which is redistributed by the hash code of
     (PK (FLOAT, FORMAT '-9.99999999999999E-999')(FLOAT)) to all AMPs
     in TD_Map1.  The size of Spool 4 is estimated with low confidence
     to be 73,504 rows (2,205,120 bytes).
  6) We do an all-AMPs JOIN step in TD_Map1 from Spool 3 (Last Use) by
     way of an all-rows scan, which is joined to Spool 4 (Last Use) by
     way of an all-rows scan.  Spool 3 and Spool 4 are joined using a
     inclusion hash join of 8 partitions, with a join condition of (
     "(PK )= Field_2").  The result goes into Spool 1 (group_amps),
     which is built locally on the AMPs.  The size of Spool 1 is
     estimated with low confidence to be 73,408 rows (1,835,200 bytes).
     The estimated time for this step is 0.35 seconds.
  7) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.

As you can see, for this query both tables are duplicated or redistributed, the rows additionally have to be sorted, and the data types are converted (note the FLOAT conversion in the plan).

Altogether, this has an enormous negative effect on performance.
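The clean solution is to align the data types in the data model. If that is not immediately possible, an explicit CAST on the character side is sometimes used as a stopgap so that redistribution hashes on matching integer values rather than on FLOAT. This is only a sketch: the cast still costs CPU on every row and fails if the column contains non-numeric strings.

```sql
-- Stopgap only: cast the character PK to INTEGER so both sides hash
-- on the same integer representation. Fixing the data model is the
-- proper solution.
SELECT *
FROM INT_PI
WHERE PK IN (SELECT CAST(PK AS INTEGER) FROM CHAR_PI);
```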

What We Should Learn From It

Unsuitable data types are always a sign of incorrect data modeling. More and more often, I see projects in which data is simply loaded once, somehow, and the data model then emerges more or less by accident.

Even if it seems faster and cheaper at the start of a project, one thing is certain: the costs at the end are many times higher.

Roland Wenzlofsky


Roland Wenzlofsky is a graduate computer scientist and Data Warehouse professional who has worked with the Teradata database system for more than 20 years. He is experienced in banking and telecommunications, with a strong focus on performance optimization.

Comments

Dave Wellman says:

Hi Roland,
Just to pick up on a question that you asked above. Yes, any DECIMAL data type with 0 digits after the decimal point will always generate the same hash as the same value in one of the integer data types.

It is for the reason that you thought: in Teradata, DECIMAL columns are stored as 'scaled integers'. So a value of 12.3 stored in a DECIMAL(10,2) is stored as 1230. Similarly, the same value stored in a DECIMAL(10,1) would be stored as 123. This is why decimals with 0 decimal places hash the same as integers.

Cheers,
Dave
