
How Do I Select The Appropriate Data Type In Teradata?

Data type conversions are costly and can consume a lot of CPU on large tables.

The wrong choice of data types also has a negative effect on the execution plan.

In this article, I will explain why this is, and show how to get optimal performance from your queries by choosing the right data types.

It is important to choose the same data type for the same information in different tables. The reason lies in how Teradata distributes and locates data: the Teradata hashing algorithm.

This algorithm returns a different hash value depending on the data type. Even if two values look the same to you, they may not be the same for Teradata.

Different hash values mean, for example, that AMP-local joins cannot be performed, or that access via the Primary Index or a USI is not possible and a full table scan is required instead.

Let’s take the literals 3 (numeric) and ‘3’ (character) as examples, and look at the corresponding hash values:

SELECT HASHROW(3); -- 6D-27-DA-A6
SELECT HASHROW('3'); -- 3F-DD-85-0D 

The different hash values mean that two rows that have these values in the Primary Index are on different AMPs.

‘3’ (character) and 3 (integer) create different row hashes and are therefore located on different AMPs.

If two such tables are to be joined, the rows must first be brought to a common AMP (by duplication or redistribution). This is an I/O-intensive process that has a significant impact on performance.

Hash-Compatible Data Types

There is a certain compatibility between data types, i.e. some of them generate the same hash value.

For example, all integer data types (INTEGER, BIGINT, BYTEINT, SMALLINT), as well as DECIMAL without decimal places, produce the same hash value. Since a date is stored in Teradata as an integer, it is also compatible with the other integer data types.
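As a sketch, this compatibility can be checked directly with HASHROW on any Teradata system; the explicit casts are only there to make the data types visible:

```sql
-- Hash compatibility of the integer-style data types:
-- according to the rules above, all five columns should
-- return the same row hash for the value 1.
SELECT HASHROW(CAST(1 AS BYTEINT))       AS h_byteint,
       HASHROW(CAST(1 AS SMALLINT))      AS h_smallint,
       HASHROW(CAST(1 AS INTEGER))       AS h_integer,
       HASHROW(CAST(1 AS BIGINT))        AS h_bigint,
       HASHROW(CAST(1 AS DECIMAL(10,0))) AS h_decimal;
```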

The values NULL, 0, and an empty string produce the same hash value. For NOT CASESPECIFIC character data (the default in Teradata session mode), upper case and lower case also make no difference: both generate the same hash value.
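A quick sketch of the NULL/zero/empty-string case, again using HASHROW on a Teradata system:

```sql
-- NULL, the integer 0, and the empty string
-- all hash to the same value.
SELECT HASHROW(NULL) AS h_null,
       HASHROW(0)    AS h_zero,
       HASHROW('')   AS h_empty;
```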

The character data types CHAR and VARCHAR are also hash-compatible.
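This can also be sketched with HASHROW; the CHAR value is padded with trailing spaces, but trailing pad characters do not change the hash, so both expressions should return the same value:

```sql
-- CHAR and VARCHAR are hash-compatible for the same string.
SELECT HASHROW(CAST('a' AS CHAR(10)))    AS h_char,
       HASHROW(CAST('a' AS VARCHAR(10))) AS h_varchar;
```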

Spaces can also influence the hash value. Let’s look at the following three queries:

SELECT HASHROW(' a'); -- 95-26-AC-0D
SELECT HASHROW('a');  -- CE-EA-E2-15
SELECT HASHROW('a '); -- CE-EA-E2-15

As you can see, a leading space produces a different hash value, while a trailing space does not change the hash: trailing pad characters are trimmed before hashing.

The Impact Of A Poor Data Model

Here is another example that shows the effect when data types don’t match because of a poor data model.

We create three tables and fill the column PK with sequential numbers:


CREATE TABLE INT_PI
(
    PK INTEGER
) PRIMARY INDEX (PK);

CREATE TABLE INT2_PI
(
    PK INTEGER
) PRIMARY INDEX (PK);

CREATE TABLE CHAR_PI
(
    PK CHAR(20)
) PRIMARY INDEX (PK);

INSERT INTO INT_PI  SELECT ROW_NUMBER() OVER (ORDER BY 1) FROM SYS_CALENDAR.CALENDAR;
INSERT INTO INT2_PI SELECT ROW_NUMBER() OVER (ORDER BY 1) FROM SYS_CALENDAR.CALENDAR;
INSERT INTO CHAR_PI SELECT ROW_NUMBER() OVER (ORDER BY 1) FROM SYS_CALENDAR.CALENDAR;

First, we run a query on the two tables where column “PK” is defined as an integer:

SELECT * FROM INT_PI WHERE PK IN (SELECT PK FROM INT2_PI);

The optimizer generates the following execution plan:

Explain SELECT * FROM INT_PI WHERE PK IN (SELECT PK FROM INT2_PI);
 
  4) We do an all-AMPs JOIN step in TD_MAP1 from DWHPRO.INT_PI by way
     of an all-rows scan with no residual conditions, which is joined
     to DWHPRO.INT2_PI by way of an all-rows scan with no residual
     conditions.  DWHPRO.INT_PI and DWHPRO.INT2_PI are joined using an
     inclusion merge join, with a join condition of ("DWHPRO.INT_PI.PK
     = DWHPRO.INT2_PI.PK").  The result goes into Spool 1 (group_amps),
     which is built locally on the AMPs.  The size of Spool 1 is
     estimated with low confidence to be 73,408 rows (1,835,200 bytes).
     The estimated time for this step is 0.27 seconds.
  5) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 0.27 seconds.

As the execution plan shows, a direct (AMP-local) join is performed without moving rows between AMPs. This is possible because both tables have the same Primary Index column with the same data type.

For comparison, we now look at the query against the table in which the column PK is defined as a character:

SELECT * FROM INT_PI WHERE PK IN (SELECT PK FROM CHAR_PI);

Here is the corresponding execution plan:

  4) We execute the following steps in parallel.
       1) We do an all-AMPs RETRIEVE step in TD_MAP1 from DWHPRO.INT_PI
          by way of an all-rows scan with no residual conditions into
          Spool 3 (all_amps) fanned out into 8 hash join partitions,
          which is duplicated on all AMPs in TD_Map1.  The size of
          Spool 3 is estimated with low confidence to be 146,816 rows (
          2,642,688 bytes).  The estimated time for this step is 0.12
          seconds.
       2) We do an all-AMPs RETRIEVE step in TD_MAP1 from
          DWHPRO.CHAR_PI by way of an all-rows scan with no residual
          conditions into Spool 5 (all_amps), which is redistributed by
          the hash code of (DWHPRO.CHAR_PI.PK (FLOAT, FORMAT
          '-9.99999999999999E-999')) to all AMPs in TD_Map1.  Then we
          do a SORT to order Spool 5 by the sort key in spool field1
          eliminating duplicate rows.  The size of Spool 5 is estimated
          with low confidence to be 73,504 rows (2,131,616 bytes).  The
          estimated time for this step is 0.22 seconds.
  5) We do an all-AMPs RETRIEVE step in TD_Map1 from Spool 5 (Last Use)
     by way of an all-rows scan into Spool 4 (all_amps) fanned out into
     8 hash join partitions, which is redistributed by the hash code of
     (PK (FLOAT, FORMAT '-9.99999999999999E-999')(FLOAT)) to all AMPs
     in TD_Map1.  The size of Spool 4 is estimated with low confidence
     to be 73,504 rows (2,205,120 bytes).
  6) We do an all-AMPs JOIN step in TD_Map1 from Spool 3 (Last Use) by
     way of an all-rows scan, which is joined to Spool 4 (Last Use) by
     way of an all-rows scan.  Spool 3 and Spool 4 are joined using a
     inclusion hash join of 8 partitions, with a join condition of (
     "(PK )= Field_2").  The result goes into Spool 1 (group_amps),
     which is built locally on the AMPs.  The size of Spool 1 is
     estimated with low confidence to be 73,408 rows (1,835,200 bytes).
     The estimated time for this step is 0.35 seconds.
  7) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.

As you can see, for this query both tables are copied or redistributed. The rows also have to be sorted, and the data types converted (note the FLOAT conversion in the plan).

Altogether, this has an enormous negative effect on performance.
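A sketch of the permanent fix, assuming the character column really holds numeric values (the table name CHAR_PI_FIXED is illustrative): rebuild the mistyped table with the correct data type so the Primary Index hashes match again.

```sql
-- Hypothetical repair: rebuild CHAR_PI with an INTEGER Primary Index.
CREATE TABLE CHAR_PI_FIXED
(
    PK INTEGER
) PRIMARY INDEX (PK);

-- CAST trims the trailing pad characters of the CHAR(20) values.
INSERT INTO CHAR_PI_FIXED
SELECT CAST(PK AS INTEGER) FROM CHAR_PI;

-- With matching data types, the join can be AMP-local again:
SELECT * FROM INT_PI WHERE PK IN (SELECT PK FROM CHAR_PI_FIXED);
```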

What We Should Learn From It

Unsuitable data types are usually a sign of flawed data modeling. I see more and more projects in which data is simply loaded once, somehow, and the data model then emerges more or less at random.

Even if this seems faster and cheaper at the beginning of a project, one thing is certain: the costs at the end of the project are many times higher.


Roland Wenzlofsky

Roland Wenzlofsky is an experienced freelance Teradata Consultant & Performance Trainer. Born in Austria's capital Vienna, he has been building and tuning some of the largest Teradata data warehouses in the European financial and telecommunication sectors for more than 20 years. He has played all the roles of developer, designer, business analyst, and project manager. He therefore knows like no other the problems and obstacles that make many data warehouse projects fail, and the tricks and tips that will help you succeed.

  • Dave Wellman says:

    Hi Roland,
    Just to pick up on a question that you asked above. Yes, any DECIMAL data type with 0 digits after the decimal point will always generate the same hash as the same value in one of the integer data types.

    It is for the reason that you thought, in Teradata DECIMAL columns are stored as ‘scaled integers’. So, a value of 12.3 stored in a DECIMAL(10,2) is stored as 1230. Similarly, the same value stored in a DECIMAL(10,1) would be stored as 123. This is why decimals with 0 decimal places will hash the same as integers.

    Cheers,
    Dave
