Teradata TPT – Export & Import Script Examples

The following two Teradata TPT scripts show how to export a table from Teradata and load it, for example, into another Teradata system.

Please note that bulk loading is used in this example, so it should only be applied when more than a few hundred thousand rows are loaded this way. Small numbers of rows should be loaded with transactional load methods (BTEQ, TPump) to conserve resources.

In this example, an export file in Teradata format “unformatted with indicators” is created.

Teradata TPT Export Script

USING CHARACTER SET UTF8
DEFINE JOB EXPORT_DELIMITED_FILE
DESCRIPTION 'Export rows from a Teradata table to an unformatted file'
(
  DEFINE SCHEMA FILE_SCHEMA
  (
    "COL1" CHAR(6)
   ,"COL2" VARCHAR(66)
   ,"COL3" DECIMAL(4,0)
   ,"COL4" INTEGER
  );

  DEFINE OPERATOR SQL_SELECTOR
  TYPE SELECTOR
  SCHEMA FILE_SCHEMA
  ATTRIBUTES
  (
    VARCHAR PrivateLogName = 'selector_log',
    VARCHAR TdpId = 'TD_SERVER',
    VARCHAR UserName = 'dwhpro',
    VARCHAR UserPassword = 'xxxxxx',
    VARCHAR SelectStmt = 'SELECT * FROM DWHPRO.SourceTable;'
  );

  DEFINE OPERATOR FILE_WRITER
  TYPE DATACONNECTOR CONSUMER
  SCHEMA *
  ATTRIBUTES
  (
    VARCHAR PrivateLogName = 'dataconnector_log',
    VARCHAR DirectoryPath = './data/',
    VARCHAR FileName = 'test.txt',
    VARCHAR Format = 'UNFORMATTED',
    VARCHAR IndicatorMode = 'Y',
    VARCHAR OpenMode = 'Write'
  );

  APPLY TO OPERATOR (FILE_WRITER)
  SELECT * FROM OPERATOR (SQL_SELECTOR);
);

Teradata TPT Import Script

USING CHARACTER SET UTF8
DEFINE JOB IMPORT_DELIMITED_FILE
DESCRIPTION 'Import rows to a Teradata target table'
(
  DEFINE OPERATOR Consumer_Table_Detail
  TYPE LOAD
  SCHEMA *
  ATTRIBUTES
  (
    VARCHAR TdpId = 'TD_Server',
    VARCHAR UserName = 'dwhpro',
    VARCHAR UserPassword = 'xxxxxx',
    VARCHAR TargetDatabase = 'DWHPRO',
    VARCHAR TargetTable = 'DWHPRO.TargetTable',
    VARCHAR LogTable = 'target_table_log',
    VARCHAR DateForm = 'AnsiDate'
  );

  DEFINE SCHEMA Define_Target_Schema
  (
    "COL1" CHAR(6)
   ,"COL2" VARCHAR(66)
   ,"COL3" DECIMAL(4,0)
   ,"COL4" INTEGER
  );

  DEFINE OPERATOR Producer_File_Detail
  TYPE DATACONNECTOR PRODUCER
  SCHEMA Define_Target_Schema
  ATTRIBUTES
  (
    VARCHAR DirectoryPath = './data',
    VARCHAR Filename = 'test.txt',
    VARCHAR Format = 'UNFORMATTED',
    VARCHAR OpenMode = 'Read',
    VARCHAR IndicatorMode = 'Y',
    VARCHAR DateForm = 'AnsiDate'
  );

  APPLY ('INSERT INTO DWHPRO.TargetTable
  (
    COL1
   ,COL2
   ,COL3
   ,COL4
  ) VALUES (
    :COL1
   ,:COL2
   ,:COL3
   ,:COL4
  );')
  TO OPERATOR (Consumer_Table_Detail)
  SELECT * FROM OPERATOR (Producer_File_Detail);
);
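
Both scripts can be executed with the TPT command line utility tbuild (the job file names below are illustrative):

tbuild -f export_job.tpt
tbuild -f import_job.tpt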

If you want to get to know the full functionality of Teradata TPT, please read on here:


Teradata Surrogate Keys Guidelines

What are Teradata Surrogate Keys?

Surrogate keys are used in Teradata to map the natural keys of source systems to a unique key.

Usually, one or more natural key columns are mapped to a surrogate key of data type INTEGER. Often a consecutive number is generated.

In Teradata, the ROW_NUMBER() function can be used, or an IDENTITY column.
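
Here is a minimal sketch of the ROW_NUMBER() approach (table and column names are hypothetical): new natural keys are numbered starting above the highest surrogate key already present in the key table.

-- Sketch: assign new surrogate keys with ROW_NUMBER(),
-- continuing above the current maximum key
INSERT INTO KeyTable (NATURAL_KEY, SURROGATE_KEY)
SELECT
   s.NATURAL_KEY
  ,COALESCE(m.MaxKey, 0) + ROW_NUMBER() OVER (ORDER BY s.NATURAL_KEY)
FROM (SELECT DISTINCT NATURAL_KEY FROM StageTable) s        -- newly delivered keys
LEFT JOIN KeyTable k ON s.NATURAL_KEY = k.NATURAL_KEY       -- skip already-mapped keys
CROSS JOIN (SELECT MAX(SURROGATE_KEY) AS MaxKey FROM KeyTable) m
WHERE k.NATURAL_KEY IS NULL;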

The advantage of an IDENTITY column is that a new surrogate key is generated by the system as soon as a row is inserted into the key table.

However, it is important to define the IDENTITY column so that no duplicates are created.

For this, the IDENTITY column must be defined with GENERATED ALWAYS and NO CYCLE.

GENERATED ALWAYS prevents the surrogate key column from being updated; NO CYCLE prevents the reuse of values. If the highest possible number for the integer data type is reached, no new surrogate keys can be generated and an error is reported.

Detailed information about IDENTITY columns is available in the table DBC.IdCol.

Disaster Reload of Key Tables with IDENTITY columns

If a key table that uses IDENTITY columns has to be reloaded, please pay attention to the following to avoid duplicates:

  • The key table needs to be re-created with new START and MINVALUE numbers that exclude the previously generated numbers, so the same number cannot be generated again.
  • The definition of the identity column needs to be changed from “GENERATED ALWAYS” to “GENERATED BY DEFAULT” so that the previously generated keys can be reloaded.
  • New surrogate keys will be generated only when NULL values are passed to the identity column.

Here is an example where the highest value of the IDENTITY column is 200,000:

CREATE MULTISET TABLE TheKeys 
(
   NATURAL_KEY VARCHAR(500) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
   SURROGATE_KEY BIGINT GENERATED ALWAYS AS IDENTITY
   (START WITH 1 INCREMENT BY 1 
    MINVALUE -999999999999999 
    MAXVALUE 999999999999999 
    NO CYCLE)
)
UNIQUE PRIMARY INDEX (NATURAL_KEY);

If we need to reload the key table, the following DDL is required:

CREATE MULTISET TABLE TheKeys 
(
   NATURAL_KEY VARCHAR(500) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
   SURROGATE_KEY BIGINT GENERATED BY DEFAULT AS IDENTITY
   (START WITH 200001 INCREMENT BY 1 
    MINVALUE 200000 
    MAXVALUE 999999999999999 
    NO CYCLE)
)
UNIQUE PRIMARY INDEX (NATURAL_KEY);
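
The reload itself could then look like this sketch (the backup table name is hypothetical): existing keys are supplied explicitly, while NULL triggers the generation of a new key.

-- Reload the previously generated keys unchanged:
INSERT INTO TheKeys (NATURAL_KEY, SURROGATE_KEY)
SELECT NATURAL_KEY, SURROGATE_KEY FROM TheKeys_Backup;

-- A NULL value lets the identity column generate a new key (above 200,000):
INSERT INTO TheKeys (NATURAL_KEY, SURROGATE_KEY)
VALUES ('NEW_NATURAL_KEY', NULL);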

Why should you use Surrogate Keys?

There are several reasons why it is a good idea to replace the natural keys with Teradata surrogate keys.

Different source systems deliver the same natural key for different information.

You may have more than one source system delivering the same value in the natural key for the same target table but with a different meaning.

This fact requires you to use surrogate keys because keeping the natural key from both sources would not allow you to distinguish the objects.

Different source systems deliver a different natural key for the same information.

But it can also be that the same information comes from different source systems with different natural keys. For example, the customers of a bank may be kept in different source systems. If it is necessary to represent these as a single customer in the data warehouse, surrogate keys are perfectly suited for this.

Unfortunately, many Teradata data warehouse implementations decide against surrogate keys and put the effort of integrating the information into the reports.

This approach will ultimately cause more costs than a proper surrogate key design from the beginning (don’t forget, such an approach requires repeating the logic of combining object information in each and every report).

Exchange of Source System leads to the Delivery of new Natural Keys.

If the natural keys of the source system are used and the original system is replaced, you end up in a bad situation.

Let’s compare the effort in both cases.

If we have surrogate keys available, only the mapping table (natural keys -> surrogate keys) has to be adjusted; the existing surrogate keys have to be assigned once to the new natural keys.

If we only have the natural keys available in our data warehouse, this usually means a complete re-design. The old natural keys must be replaced everywhere by the new natural keys!

In general, you never should tightly couple your data model to the source system structures. An invoice table should be called Invoice and not something like SAP-Invoice. Data Warehousing is not the 1:1 storage of operational data.

At the beginning of a project, it always seems more agile to avoid surrogate keys and integrate source system tables 1:1.

However, this only shifts costs to the end of the chain. Changes that have to be made afterward are many times more expensive!

Teradata Surrogate Keys and Performance

Surrogate keys, if properly used, may increase performance because several natural key columns can be replaced with a single integer column.

Often, natural key columns are character columns. For the hashing of the primary index, this makes no difference from a performance point of view. Joins, however, are negatively affected, since Teradata must take character sets, etc. into account there.

Depending on the number of columns and data types of the natural key, this can have a negligible impact on performance or a noticeable effect.


Teradata Join Index

What is the Join Index?

The Teradata Join index stores pre-joined tables, the result of aggregations, or simply the content of one table in a different structure. Join indexes are comparable to materialized views.

Although the name indicates that join indexes are an index structure, they have to be considered more as an added layer of pre-joined, pre-aggregated or permanent tables with their own Primary Index, Secondary Indexes, and row-level partitioning.

What makes the join index comparable to a secondary index is that it provides the optimizer with additional access paths to the data.

Unlike secondary indexes, which are stored as internal subtables, Teradata join indexes are stored as separate physical database objects.

Therefore, a join index can also be stored in any database.

Because join indexes are stored like base tables, their maintenance costs are visible in table DBC.DBQLOGTBL (Teradata does not store information about the maintenance costs of other index types).

The use of a Join Index by the Optimizer is not subject to the decision of the user.

Only the Teradata Optimizer decides when it makes sense from a cost perspective to use the join index and when it is cheaper to use the underlying table structures.

Secondary indexes typically offer exactly one additional access path to the data in the base table (hash lookup, in the case of a NUSI also full index scan).

Join indexes can be modeled in a variety of ways and themselves serve as the basis for secondary indexes or partitioning.

When to Use a Join Index?

  • Join together large tables with a significant number of rows
  • Join large tables selecting only a few columns
  • Queries with complex expressions in their predicates
  • Denormalization of the PDM to increase performance
  • Create alternative row-level partitioning
  • Create alternative Primary Index as an additional access path
  • Movement of time-consuming joins and aggregations into the batch window

A join index provides additional access paths to data and therefore is used for performance tuning.

Higher performance usually is achieved by designing a join index with a different primary index or row-level partitioning.

However, even if the primary index or the partitions of the join index cannot be used for querying, the optimizer will use the join index with a full index scan if the index is considerably smaller than the base table and therefore fewer data blocks have to be read.

The Primary Index of a Join Index

A unique Primary Index can be defined only on a single-table join index, and only if the index is not compressed (this is not related to multivalue compression, but to the compression of repeating groups in the index).

This will give all the advantages of a UPI on a base table, such as ensuring that no duplicates can be inserted or created with an update statement.

Usually, the Primary Index is used for data distribution, and the rows are sorted by rowhash on the AMPs. But a Join Index also allows choosing another sort order by using the “ORDER BY VALUES” clause. Value ordering gives better performance for range queries:

CREATE JOIN INDEX JI1 AS
    SELECT (a, b), (c, d, e)
    FROM TheTable
    LEFT JOIN Customer ON a = c
    ORDER BY VALUES (a)
    PRIMARY INDEX (b);

Value ordering is only allowed for numeric columns of at most 4 bytes, as this is historically the maximum size allowed for the row hashes stored in the array of data pointers in each data block (DATE can be used as it is internally stored as INTEGER).

Join Indexes with Built-In Functions

If the functions DATE, CURRENT_DATE, or CURRENT_TIMESTAMP are used in the partitioning expression of a join index, they are evaluated once when the index is created, not dynamically when the index is used. This also applies to the use of these functions in a WHERE condition of the join index.

However, “ALTER TABLE TO CURRENT” can be used on the index to update the stored value to the current date or timestamp.
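
A sketch of a join index with a moving date condition and its refresh (table and column names are hypothetical):

-- CURRENT_DATE is resolved once at creation time:
CREATE JOIN INDEX JI_Recent_Sales AS
SELECT Sale_Id, Sale_Date, Amount
FROM Sales
WHERE Sale_Date >= CURRENT_DATE - 30
PRIMARY INDEX (Sale_Id);

-- Re-resolve the stored date to the current date:
ALTER TABLE JI_Recent_Sales TO CURRENT;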

Why you should use OUTER JOINS

Since with an INNER JOIN, the rows that cannot be matched are lost, it is better to define Join Indexes with OUTER JOIN. This allows more queries to be covered.

Join Index and Partitioning

  • A Join Index can be Row or Column Partitioned
  • Partitioning adds an additional access path
  • A Join Index can be partitioned on top of the non-partitioned table
  • A non-partitioned Join Index on top of a partitioned base table is not recommended for performance reasons
  • A Row compressed Join Index can’t be row-level partitioned
  • The Partitioning Expression of a join index can be numeric or of a character data type
  • DATE, CURRENT_DATE, or CURRENT_TIMESTAMP functions can be used in the partition expression of a join index

Join Index Coverage

If the join index does not contain all selected columns (“not covering”), it can fetch the missing columns via the ROWID of the base table. To do this, the dummy column “ROWID” usually must be included in the SELECT list (see the exceptions below).

The cost saving is that the relevant rowids are first extracted into a spool file using the join index. In a subsequent step, the rows of the base table belonging to the rowids are queried.

If the join index contains all columns of the SELECT list (“covering”), no access to the base table is required.

Coverage is not possible for SELECT * FROM table_name queries:

Avoid writing SQL queries without specifying a column list, or join index coverage will not be possible!
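
As a sketch of how this could look (table and column names are hypothetical), the dummy column ROWID is simply added to the select list of the join index:

-- A partially covering join index; missing columns can be
-- fetched from the base table via the stored ROWID:
CREATE JOIN INDEX JI_Customer_Name AS
SELECT Customer_Id, Lastname, ROWID
FROM Customer
PRIMARY INDEX (Lastname);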

When will the Optimizer use a non-covering Join Index?

  • One of the columns in the join index definition is the dummy column “ROWID”. If this is the case, the base table ROWID is stored with each index row, and the base table row can be looked up as described above.
  • The UPI columns of the base table are available in the SELECT list of the Join Index. Teradata can hash this column combination in order to identify the AMP containing a row. Again, the rowids are used to access the base table. A message with the rowid is sent to the AMP in order to receive the required row.
  • The column set of a NUPI of the base table is contained in the select list of the join index (condition 1) and one of the two additional conditions is met:

    (2a) The Join Index contains the dummy column “ROWID” or
    (2b) There is a USI on the base table available which matches the column set of the NUPI.

If condition (2a) is met, the base table rows can be accessed via the column “ROWID” by rehashing the NUPI columns and accessing the base table rows via the rowids extracted into a spool file.

Condition (2b) is similar to the UPI case described above. The AMPs owning the requested rows are identified by rehashing the USI columns; everything else is the same as in condition (2a).

Coverage does not guarantee the use of a Join Index. The cost of using it must be less than the cost of accessing the base table with the usage of another index or a full table scan.

What are the Limitations of a Join Index?

  • The Usage of FULL OUTER JOIN is not allowed
  • LEFT or RIGHT JOINS: on the inner side at least one non-nullable column must be selected
  • OUTER JOIN is preferable for Join Index usage likelihood, but not allowed for Aggregate Join Index
  • HAVING and QUALIFY not allowed
  • No Set Operations are allowed (UNION, INTERSECT, MINUS)
  • No Subqueries are allowed
  • No inequality conditions are allowed in the ON clauses of join index definitions. Exception: if they are ANDed to at least one equality join condition
  • Only <, <=, >=, > are allowed as inequality comparison operators
  • TOP n and SAMPLE not allowed
  • During the restoration of a base table or database, the join index is marked as invalid
  • A maximum of 64 columns per table per Join Index is allowed
  • A maximum of 32 Indexes can be defined on a table including join indexes
  • UPI only allowed for single table join index
  • Value ordered NUSI and Partition Expression on the same Join Index columns are not allowed

Multi-Table Join Index

Multi-Table Join Indexes allow you to move resource-intensive joins from the online to the batch window.

Shifting the workload does not reduce the total workload, but shifts it to a point in time that is beneficial for the overall performance of the system.

The syntax for creating a join index is not much different than the syntax of a CREATE TABLE AS statement:

CREATE JOIN INDEX JI_T01_T02 AS
SELECT t01.PK, t01.COL_A, t02.COL_B
FROM
 t01
INNER JOIN
 t02
ON
 t01.PK = t02.PK
PRIMARY INDEX (COL_B);

The statement above shows several important characteristics of a join index:

Join indexes are a good alternative to temporary tables. However, unlike a temporary table, the use of the join index is not guaranteed. Whether the optimizer uses an existing join index also depends on the available statistics.

A join index should only be created if it can be reused by the optimizer in different scenarios. The additional space required must be taken into account accordingly. If the use of the join index cannot be achieved consistently, a temporary table is probably better.

If we want to bridge any possible gap between the indexed columns and the queried columns, we can store the pseudo-column ROWID in the join index, which allows the index to be used even if not all needed columns are contained in it.

Teradata will get the missing columns from the base table (as it has the ROWID available, this is the same as primary index access).

Limitations of the Multi-Table Join Index

The bulk load utilities (FastLoad, MultiLoad) can only be used on a table if no join index is defined on it.

However, a multi-table join index is usually used for pre-joining tables to avoid expensive online joins, for example, when creating ad-hoc reports. Joins are one of the most expensive tasks. Often large amounts of data have to be redistributed.

The big advantage of the Join Index is that the maintenance is completely done by Teradata and optimized methods are used to e.g. keep an Aggregate Join Index up to date without having to re-aggregate the whole index table with every small change.

However, since Fastload and Multiload are not possible, there is a contradiction here:

If the join index has to be removed before loading and then created again, the advantages mentioned above (for example, optimized aggregation) are not available.

Restrictions for outer-joined tables in the join index are another problem:

CREATE JOIN INDEX JI_T01_T02_OUTER AS
SELECT t01.PK, t01.COL_A, t02.COL_B
FROM
 t01
LEFT OUTER JOIN
 t02
ON
 t01.PK = t02.PK
PRIMARY INDEX (COL_B);

Each outer join in a multi-table join index has to meet the following restrictions:

  • All columns from the left table have to be selected
  • At least one selected column from the right table must be defined as NOT NULL

Compressed Join Index

Multi-table join indexes can be compressed by putting parentheses around a group of columns. Compression reduces occupied disk space as each group of compressed values is only stored once (and not in each row).

Frequent combinations of column values are therefore good candidates.

CREATE JOIN INDEX JI_T01_T02_COMP AS
SELECT t01.PK, (t01.COL_A, t02.COL_B)
FROM
 t01
LEFT OUTER JOIN
 t02
ON
 t01.PK = t02.PK
PRIMARY INDEX (COL_B);

Single Table Join Index 

The name is somewhat misleading as no join is involved. Single-table join indexes are created from exactly one base table.

Their main purpose is to have the same table available with a different primary index, partitioning, or to have a smaller table (the join index table) with fewer columns to be spooled.

Let’s assume we have the following table below:

CREATE TABLE t01 (PK INTEGER, COL_A INTEGER) PRIMARY INDEX (PK);

By creating the below join index, we now have physically stored the same table twice, with two different primary indexes:

CREATE JOIN INDEX JI_T01 AS
SELECT t01.PK, t01.COL_A
FROM
 t01
PRIMARY INDEX (COL_A);

The join index with a different primary index gives the Teradata Optimizer an additional data access path.

Aggregate Join Index

Aggregate join indexes are pre-aggregations which are automatically maintained in an optimized way by Teradata (only the required rows are re-aggregated). An Aggregate Join Index can be based on one or multiple tables.

The aggregate join index allows defining a summary table without violating the normalization of the data model.

Below is an example of a single-table aggregate join index:

CREATE JOIN INDEX JI_T01_AGG AS
SELECT t01.PK, SUM(t01.COL_A)
FROM
 t01
GROUP BY 1
PRIMARY INDEX (PK);

Aggregate join indexes are limited in the aggregation functions which can be applied: SUM, COUNT, MIN, or MAX.

An aggregate join index is only used if the following conditions are fulfilled:

  • The grouping clause of the join index must include all columns of the grouping clause of the query
  • WHERE clause columns that belong to tables that are not defined in the aggregate join index must be part of the join index select list.
  • For an aggregate join index with a PPI, an aggregated column cannot be a partitioning column

The Teradata Optimizer will consider rewriting a query to make use of an existing Aggregate Join Index.


It will consider the usage of an existing Aggregate Join Index in the following scenarios:

  • For the coverage of queries containing a subquery (derived table spool)
  • For internally created distinct and grouping steps
  • To partly cover outer join queries
  • For partial/early grouping steps

Sparse Join Index

Adding a WHERE condition to the join index turns a given index into a sparse join index.

Only those rows that fulfill the WHERE condition are stored in the join index. This saves space. However, a sparse join index only makes sense if the WHERE condition is used in the queries; otherwise, the join index cannot be used.

A sparse join index consumes less permanent space as only the rows matching the WHERE conditions are stored.
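
A minimal sketch of a sparse join index (table and column names are hypothetical); only open orders are stored:

CREATE JOIN INDEX JI_Open_Orders AS
SELECT Order_Id, Customer_Id, Amount
FROM Orders
WHERE Order_Status = 'OPEN'   -- only rows matching this condition are stored
PRIMARY INDEX (Customer_Id);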

A further advantage is given because the maintenance of the index is optimized:

If a row is to be inserted, Teradata first checks whether it meets the WHERE condition. Only if it does, Teradata also evaluates the partitioning expression.

Join Index Summary

Some knowledge and preparation are needed to allow the optimizer to use a join index.

A join index makes sense when queries are stable. In systems with many ad-hoc queries, it is usually difficult to find a set of useful join indexes that are also used by the optimizer.

Be warned when experimenting with join indexes in a production environment as they may have unexpected side effects. I will give you one real-life example:

A daily batch load had a SQL step deleting a huge 6+ billion records table:

DELETE FROM the_huge_table;

That’s usually no big deal: the cylinder index marks the data blocks belonging to the table as free, and that’s it.

I overlooked this DELETE step in the load process and added a multi-table join index to the table.

The result was that the subsequent batch load took 20 hours to maintain the join index triggered by the DELETE.

I had to cancel the DELETE and the subsequent ROLLBACK. This is always critical because the table is left in an inconsistent state. But that didn’t matter in this case, because the table was to be emptied completely anyway.


You can use the query below to find all join indexes available in a Teradata system:

SELECT * FROM dbc.Indices WHERE indextype='J';

See also:
The Secondary Index in Teradata – The NUSI compendium

The Teradata SUBSTRING Function

The Teradata SUBSTRING or SUBSTR function is one of the Teradata string functions and is used to extract a substring from a string based on its position.

You can use ANSI syntax or Teradata syntax. The ANSI syntax is recommended to remain compatible with other database systems.

ANSI Syntax:

SUBSTRING(str FROM pos [FOR count])

Parameters:

Mandatory:
str – the string expression from which the substring is taken.
pos – the starting position in str.

Optional:
FOR count – the length of the substring. If FOR count is omitted, the substring starts at position pos and takes all remaining characters of the string.

Teradata Syntax:

SUBSTR(str, pos [, count])

Parameters:

Mandatory:
str – the string expression from which the substring is taken.
pos – the starting position in str.

Optional:
count – the length of the substring. If count is omitted, the substring starts at position pos and takes all remaining characters of the string.
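
Both notations deliver the same result. For example (‘Teradata’ has the character ‘d’ at position 5):

SELECT SUBSTRING('Teradata' FROM 5 FOR 4);  -- 'data'
SELECT SUBSTR('Teradata', 5, 4);            -- 'data'
SELECT SUBSTRING('Teradata' FROM 5);        -- 'data' (no length: rest of the string)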

Allowed Argument Types

  • Character
  • Byte
  • Numeric
  • User-Defined Type (UDT):
    are implicitly cast to one of the following types: CHARACTER, NUMERIC, DATE, BYTE

Data Type of Return Value

  • If str is BLOB then BLOB
  • Any Byte String (except BLOB) returns VARBYTE(n)
  • CLOB returns CLOB(n)
  • Character String or Numeric Value except for CLOB returns VARCHAR(n)

In ANSI mode, the value of n for each DATATYPE(n) is equal to the length of the original string.

In Teradata mode, the value of n depends on the number of characters/bytes in the result string.

Examples for Teradata SUBSTRING

Here is an example for the SUBSTRING function applied to a character string column:

SELECT * FROM Country WHERE SUBSTRING (country_desc FROM 1 FOR 3) = 'TUN';

Here is an example for the Teradata SUBSTRING function applied to a CLOB column:

SELECT SUBSTRING (myCLOB FROM 100 FOR 120000) FROM TheCLOBTable; 

Tuning with Teradata SUBSTRING

Both SUBSTRING and SUBSTR store only the requested characters/bytes in the spool, and not the entire column. Therefore it may make sense to use this function early in the ETL process (e.g. in a volatile table) if only a part of a column is needed.
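
A sketch of this technique (table and column names are hypothetical): only the needed prefix is materialized in the volatile table and subsequently spooled.

CREATE VOLATILE TABLE Customer_Short AS
(
   SELECT Customer_Id
         ,SUBSTRING(Long_Description FROM 1 FOR 20) AS Short_Description
   FROM Customer
) WITH DATA
PRIMARY INDEX (Customer_Id)
ON COMMIT PRESERVE ROWS;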


9 Things to consider when creating a Teradata Table

Teradata Table Design Checklist

The creation of tables is an ever-recurring task, so I thought it was time to offer a checklist.

  • Primary Index or NOPI?
  • Is the Primary Index distributing rows evenly?
  • Is the correct character set used?
  • Are character set and data types matching across tables?
  • Can the Primary Index be unique?
  • Should the table be MULTISET?
  • Should secondary indices be available?
  • Can temporary tables be normalized?
  • Is the correct data block size used?

1. Primary Index or NOPI Table?

The Primary Index is used for various purposes. It usually is optimized for even row distribution and serves as the main access path to the data.

But there are scenarios where it could be beneficial to create a table without a primary index (NOPI).

Staging is one example:  

By defining stage tables without a primary index, the load performance can be increased (for example, fastloads will not need to sort and distribute rows in its final phase of loading).

Whether NOPI tables help to improve the overall load performance depends, of course, on the later steps of the ETL process and the impact of the missing primary index.
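
A minimal sketch of a NOPI stage table (names are hypothetical):

CREATE MULTISET TABLE Stage_Customer
(
   Customer_Id BIGINT,
   Lastname    VARCHAR(500) CHARACTER SET LATIN
) NO PRIMARY INDEX;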

You can find more details about the usage of NOPI tables here: NOPI Tables on Teradata

2. Is the Primary Index skewed?

The goal is to have a Primary Index which distributes the table rows evenly across all AMPs.  Consequently, it is necessary to know your data.

Don’t rely on small amounts of test data as a base for your primary index decision.

Data demographics of production data can change significantly. I experienced this several times; I will give you one real-life example (my client was a European telco):

The main customer table’s primary index was mapped 1:1 from the correspondent source system column, let’s call it “customer_id”.

During the development phase of the project, several files with customer test data were delivered, supporting the chosen primary index. As rows were distributed evenly across all AMPs, load performance was excellent.

The whole physical data model was optimized to support call-center employees to look up the customer value in real-time during their calls (i.e. tactical workload). 

Furthermore, the customer table could be joined very efficiently with other related tables, such as call details, to support reporting.

Everything looked perfect until we entered the production phase. On the first day, the load times for the customer table increased from about 30 minutes to 6 hours with a subsequent “no more disk space” error.

What had happened?

During the test phase, all delivered files were related to “contract” customers. No one ever thought about testing “prepaid” customers. It turned out that all prepaid customers were assigned to a single “customer_id” in the source system…

The result was a massive redesign, a project delay of several months, and a complete loss of reputation.

In rare cases, you would like to “design” a skewed primary index, for example, if you want to force all rows into one data block of one AMP.

Doing this, you can achieve rowhash locking instead of table locks and have other advantages. For details, please read our article All Rows on one AMP by Design.

3. Which CHARACTER SET to use?

While it saves time to leave out such considerations, in the end, it never pays off. Of course, you could define each single character column as UNICODE.

But such an approach comes with two disadvantages: you will waste valuable disk space and decrease the performance of requests related to these tables.

Such a sloppy approach can lead to a disaster. Here one example from real life. It recently happened in a major European bank:

We had to deal with tables, each of them carrying several hundreds of columns. Load times (fast load) were horrible: 6 hours for a ridiculous 50GB of data!

When I was called to the project to analyze these performance issues, it turned out that every single column was defined with character set UNICODE.

Some simple calculations revealed that the data rows were extremely wide: only 1-2 rows fit into each data block.

A redesign of dozens of tables was required. By changing all columns to the character set LATIN, and only keeping a few UNICODE columns, the load times were reduced drastically as more rows could be packed into each data block.
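
A sketch of the resulting pattern (names are hypothetical): LATIN as the default, UNICODE only for the few columns that really need it.

CREATE MULTISET TABLE Customer_Address
(
   Address_Id  BIGINT NOT NULL,
   City        VARCHAR(200) CHARACTER SET LATIN NOT CASESPECIFIC,
   Name_Native VARCHAR(200) CHARACTER SET UNICODE NOT CASESPECIFIC  -- e.g. non-Latin names
) PRIMARY INDEX (Address_Id);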

This article shows you how significant the impact of wrongly chosen character sets can be: Teradata Tuning Success – The Best ever

4. Are Data Types and CHARACTER SETS matching across Tables?

Data type consistency is a crucial point but often overlooked.

Data Type and Character Set Conversions during query executions are expensive operations (CPU time) and prevent AMP local activities (such as row-key based merge joins). Rows from different tables joined on their primary index are only co-located on the same AMP if data types match. The same holds true for character sets.
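
For illustration (hypothetical tables): if both primary index columns share the same data type and character set, matching rows are co-located and the join can be AMP-local.

CREATE TABLE Customer (Customer_Id BIGINT NOT NULL, Lastname VARCHAR(500))
PRIMARY INDEX (Customer_Id);

CREATE TABLE Sales (Sale_Id BIGINT NOT NULL, Customer_Id BIGINT NOT NULL)
PRIMARY INDEX (Customer_Id);

-- Same data type on both PI columns: no conversion, no redistribution needed
SELECT s.Sale_Id, c.Lastname
FROM Sales s
INNER JOIN Customer c ON s.Customer_Id = c.Customer_Id;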

5. Can a Unique Primary Index be defined?

If possible, always define your primary index as unique. Skipping uniqueness just because it is more convenient (“no problem with duplicates”) is a very sloppy way of developing.

A unique primary index may not always be possible, for example, in the case of row-partitioned tables where the partitioning column is not included in the primary index. In this case, you can achieve uniqueness by adding a USI (unique secondary index), as the sketch below shows.
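
A sketch of this situation (names are hypothetical): the partitioning column is not part of the primary index, so uniqueness is enforced with a USI instead.

CREATE MULTISET TABLE Sales
(
   Sale_Id   BIGINT NOT NULL,
   Sale_Date DATE NOT NULL
)
PRIMARY INDEX (Sale_Id)          -- cannot be unique: Sale_Date is not included
PARTITION BY RANGE_N(Sale_Date BETWEEN DATE '2020-01-01'
                     AND DATE '2029-12-31' EACH INTERVAL '1' DAY),
UNIQUE INDEX (Sale_Id);          -- the USI enforces uniqueness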

There are several (often not obvious) disadvantages if we skip uniqueness. Here are some examples:

  • NUPI tables drastically reduce the possibilities of the Teradata Optimizer to use the Nested Join strategy.
  • No duplicate row checks (see point 6) are needed in the case of SET tables, as the UPI guarantees uniqueness.

6. Can MULTISET tables be used?

If you don’t have to make sure that duplicate rows are rejected, always use MULTISET tables.

To understand what duplicate row checks are, you need to know how Teradata distributes the data over the AMPs using hashing.

Each time a row is inserted into a table, the parsing engine determines a hash value and uses the hashmap to determine the AMP that should write the row to its logical disk.

Therefore a message consisting of TableId (to uniquely identify the table), RowHash (the hash value), and the content of the Primary Index Columns is sent to the AMP.

Why is the value of the primary index columns sent?

This is because two different Primary Index values can generate the same row hash. It’s rare, but it happens.

This means that in order to sort a row into the data block at the right place, not only the hash value must be compared, but also the content of the primary index columns.

Since no two identical rows may be inserted into a SET table, each new row to be inserted must be compared against all existing rows with the same row hash (“duplicate row check”).

The number of required comparisons grows quadratically and, with skewed insert statements, finally leads to one or more AMPs being overloaded.

If you cannot define a MULTISET table, then a duplicate row check can be prevented by creating a unique secondary index!
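
A sketch (names are hypothetical) of a SET table where a USI takes over the uniqueness check, so no duplicate row checks are needed:

CREATE SET TABLE Transactions
(
   Trans_Id   BIGINT NOT NULL,
   Booking_Nr BIGINT NOT NULL
)
PRIMARY INDEX (Booking_Nr),   -- NUPI, possibly with many rows per row hash
UNIQUE INDEX (Trans_Id);      -- the USI replaces the duplicate row check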

More details about SET and MULTISET tables you can find here: Teradata Multiset versus Set Tables

7. Should a Secondary Index be defined?

Delete secondary indexes and join indexes from tables before loading them with a large number of rows, even if the loading method does not require this (transactional load methods such as BTEQ, TPUMP).

It is usually better to remove the indexes, load the tables, and then create the indexes.

Here is an example of why this approach is advantageous:

Once I had to solve the following problem:

A user had started an insert statement, which was not finished after 2 hours. Finally, this user became impatient and aborted the query.

This led to a rollback. I was notified the following day when the query was already in “ABORTED” status for another 10 hours.

I analyzed the situation:

  • The inserted primary index values were skewed and all rows ended up on the same AMP (mostly NULL values)
  • The table was a SET table; this was further worsening the situation because of “duplicate row checks” being done
  • There were 4 NUSIs defined on the table

I decided to cancel the rollback because this query had a massive impact on the performance of the Teradata system.

This is always an emergency solution because the table is then in an inconsistent state!

The combination of the SET table, 4 secondary indices, and the enormous skew had led to this catastrophic result.

The huge number of duplicate row checks performed by only one AMP was a major problem. But of course, the maintenance of 4 NUSIs additionally had an enormous negative effect on the performance of the system in this situation.

This problem could have been prevented with the method described above.

8. Can Columns be normalized on the fly with Temporary Tables?

If you are using temporary tables (volatile, global temporary, or physical tables), you should first consider improving all areas mentioned above (data types, character sets, primary index, etc.) to support your workload in the best possible way.

It is also a good practice to normalize columns in temporary tables. This prevents functions from having to be applied to these columns to obtain the required part of the information.

Functions applied to columns usually create a problem in joins as well as in WHERE conditions:

  • They often prevent the use of statistics on these columns.
  • They require a redistribution of the rows in joins.

Make the columns atomic. Splitting helps to improve join performance, as you can avoid join conditions such as “ON TABLE1.PK = SUBSTRING(TABLE2.PK,1,5)”.

9. Does the Data Block Size support your type of Workload?

Starting with Teradata 14 it is possible to create data blocks up to a size of 1MB.

Please note the following:

The larger the data blocks, the more data can be copied from disk to main memory in one read operation. However, more main memory is needed per data block.

In the worst case, Teradata reads a 1MB block of data to return a single row.

For tactical workload on a Teradata system, smaller data blocks are usually recommended, as individual rows are queried.

Larger data blocks are recommended for strategic workloads (which often include a full table scan and aggregations).
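
The data block size can be set per table. Here is a sketch for a strategic, scan-heavy table, assuming a 1MB maximum block size is available (names are hypothetical):

CREATE MULTISET TABLE Sales_History,
   DATABLOCKSIZE = 1048576 BYTES   -- large blocks favor full table scans
(
   Sale_Id BIGINT NOT NULL,
   Amount  DECIMAL(18,2)
)
PRIMARY INDEX (Sale_Id);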

Teradata Architecture – AMP, BYNET and Parsing Engine

Teradata Query Parallelism

If we run a query on a Teradata system, every step runs in parallel, no matter whether it is a join, sorting, or aggregation step.

The great advantage is that this applies to each step of query processing. That’s what makes Teradata unique. The big advantage over other database systems is that this high level of parallelism has been part of the Teradata architecture from the start many years ago (when most of its components were implemented in hardware instead of software).

Over time, new features have been added to further increase parallelism. However, much of this would not have been possible without the architecture underlying Teradata.

Parallel Execution across the AMPs

AMPs are independent processes that perform a wide range of activities independently of each other. Each SQL query is first broken down into subtasks assigned to the AMPs. Each AMP completes its tasks and delivers a partial result. When all AMPs are ready, the overall result is returned.

Parallelism at AMP level is one of the main reasons for a Teradata System offering tremendous performance when used the right way.

AMPs are not specialized but capable of performing any kind of task.

Today, hundreds of AMPs are available on each Teradata system.

The Tasks of a Teradata AMP

  • Reading of Rows
  • Writing of Rows
  • Row Locking
  • Sorting
  • Aggregating
  • Index creation and maintenance
  • Maintaining the transaction log
  • Backup and Recovery
  • Bulk and Transactional Loading


The Components of a Teradata AMP

Each AMP has its own exclusive resources:

  • Logical storage unit
  • Memory
  • CPU


Since Teradata is a shared-nothing architecture, all resources of an AMP are only assigned to this AMP. An extension of the system (by adding hardware) therefore goes hand in hand with the possibility of linear growth in performance.

The Teradata Primary Index – Hash Partitioning

Parallelism is achieved by hash partitioning. The existing data is evenly distributed among the AMPs, with the goal that each AMP performs approximately the same amount of work.

Hash partitioning works great for distributing large amounts of data across the AMPs. The drawback is that individual AMPs can become a bottleneck if the distribution is not even – skewing is the result, and it occurs often.

Skewing is one of the main problems a performance tuner on a Teradata system has to solve.

Hash partitioning is easily achieved by defining one or more columns for each table from which a hash value is calculated. This hash value determines the AMP for each row. The columns used for hash partitioning are the so-called Primary Index:

CREATE TABLE Customer
(
Customer_ID BIGINT NOT NULL,
Lastname VARCHAR(500),
Firstname VARCHAR(500)
) UNIQUE PRIMARY INDEX (Customer_Id);

Teradata Pipelining Steps

Pipelining increases the parallelism of queries even more. Pipelining makes it possible to start a step even if the predecessor step is not yet finished.

As already mentioned, a request is split into several steps. Each step represents a subtask. This can be, for example:

  • Read all rows of a table (simple step)
  • Update a subset of table rows (simple step)
  • Read two tables, redistribute them and join them (complex step)

Steps can have different complexity, and AMPs may also need to interact with each other.

In a (complex) join step, pipelining allows the join activity to start while rows of the participating tables are still being redistributed.

Teradata Parallel Multi-Steps

While pipelining nests steps into each other, i.e. the output of the predecessor step is made available to the successor as early as possible, Teradata also has another level of parallelism available.

Steps that are independent of each other can be executed in parallel.

The Teradata BYNET

Teradata is a shared-nothing architecture and as we know the work is done in parallel by the AMPs.

But since there is a need for AMPs to communicate with each other, a network is needed to exchange both messages and data: The BYNET.

The Tasks of the BYNET

However, BYNET is more than just a normal network. It has functionality that has been specially developed for Teradata:

  • Message Delivery: Guarantees that messages arrive at the target AMP
  • Coordinate multiple AMPs working on the same step
  • Sorting of the final result set when sending to the client
  • Minimizes the number of AMPs needed for a step
  • Congestion control to avoid an overloaded network

Message Passing & the BYNET

To better understand the tasks of BYNET we have to introduce a new kind of virtual process: The Parsing Engine. It is responsible for creating the execution plan of a request. The BYNET is the interface between the Parsing Engine and the AMPs.

Messages can be sent from the Parsing Engine via BYNET to the AMPs, but BYNET is also responsible for the AMP to AMP communication.

BYNET can send messages to all AMPs, a group of AMPS, or even a single AMP.

Sorting the Final Answer Set

Here is something unique to Teradata:

The sorting of the result set is done in parallel; at each level (AMP, Node, BYNET, Parsing Engine) the data is pre-sorted to avoid an expensive sort step at the end:

  • Each AMP locally sorts its data (this is done in parallel)
  • Each Node takes one buffer of data from all its AMPs and sorts it (buffer by buffer by AMP)
  • The BYNET passes one buffer per Node to the Parsing Engine which does the final sort.

These were the essential components of the Teradata Shared Nothing architecture. More details about how data is stored can be found here:

https://www.dwhpro.com/teradata-physical-storage/

https://www.dwhpro.com/teradata-physical-storage-2/
