
Give Me 5 Minutes, I’ll Give You The Truth About Teradata Sessions And Performance!

Teradata Sessions explained in a few Sentences

What is a Teradata session?

A session is a LOGON of a user on a Teradata system, through which one or more sequential transactions can be executed. The session is terminated by a LOGOFF statement.

What are multiple Teradata sessions (e.g. in BTEQ)?

For Teradata, multiple sessions are like a number of users (but with the same user credentials). Each of these sessions belongs to the same Teradata user and can execute one or more sequential transactions between LOGON and LOGOFF just like a single session.

What are multiple Teradata sessions used for?

The use of multiple sessions makes it possible to run multiple tasks in parallel. The aim is to increase performance.

Does the use of Teradata multiple sessions always increase performance?

Not at all. For multiple sessions (e.g., in a BTEQ script) to increase performance, rows must be accessed via their rowhash. Rowhash access only occurs if a UPI, NUPI, or USI is used. All other access paths require a table lock, and using multiple sessions does not improve performance.

When does a BTEQ script typically achieve an increase in performance by using multiple Teradata sessions?

A BTEQ import script can usually increase performance by using multiple sessions. BTEQ export scripts usually do not achieve any performance advantage by using multiple sessions.
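
For illustration, here is a minimal BTEQ import sketch using multiple sessions; the TDPID, credentials, file name, and the Customer table (assumed to have a UPI on CustomerId) are placeholders:

.SET SESSIONS 8
.LOGON tdpid/dwhpro,password

.IMPORT VARTEXT ',' FILE = customer.csv
.REPEAT *
USING (CustomerId VARCHAR(10), LastName VARCHAR(255))
INSERT INTO Customer (CustomerId, LastName)
VALUES (:CustomerId, :LastName);

.LOGOFF
.QUIT

Because each INSERT accesses exactly one rowhash, BTEQ can keep all 8 sessions working in parallel.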

How many Teradata sessions should I use?

This is difficult to answer and is best determined empirically. What is important is an access path via the Rowhash (UPI, NUPI, USI). Too few sessions do not bring out the maximum performance. Too many sessions do not bring any more performance advantage after a certain point but increase the administration effort.

For which access paths do multiple sessions not make sense?

For Full Table Scans and NUSI accesses.

Example 1: Multiple Sessions which do not improve Performance

[Figure: all four transactions (T1, T2, T3, T4) active on all AMPs]

As shown in the figure above, all 4 transactions (T1, T2, T3, T4) are active on all AMPs simultaneously. Even if only one row is required, each transaction involves all AMPs to look it up. None of the transactions accesses rows via a single rowhash; therefore, there is no performance gain.

Example 2: Multiple Sessions which do improve Performance

[Figure: each transaction active on exactly one AMP]

In example 2, each transaction is only active on exactly one AMP. Therefore the 4 AMPs can work on different tasks at the same time.

In summary, the use of multiple sessions is only useful if:

  • Each transaction uses fewer than all AMPs (ideally exactly one)
  • Ideally, there are enough sessions available to keep all AMPs equally busy with transactions.

Teradata Fast Path INSERT/SELECT

Fast Path INSERT/SELECT is designed to load a single table as efficiently as possible.

Requirements for Teradata Fast Path INSERT/SELECT

  • All tables involved have the same primary index.
  • The target table is empty.

How does the Fast Path INSERT/SELECT work?

The idea behind this is to minimize access to mass storage (hard disk, solid state disk).

This is done by creating whole data blocks on each AMP and then writing them to the mass storage with a single IO.

Under the conditions mentioned above (target table is empty, all tables involved have the same primary index), the following is ensured at the same time:

The rows of the source and target tables are located on the same AMP and are sorted in ascending rowhash order within the data blocks.

Therefore, each AMP can assemble the data blocks locally and write them to mass storage with a single IO.

[Figure: Data blocks are assembled in parallel on each AMP and written to the target table]

How does BTEQ handle the Fast Path INSERT/SELECT?

BTEQ uses a special syntax that must be adhered to when several tables are to be loaded into the empty target table at the same time using Fast Path:

The semicolon must be placed at the beginning of the line for each statement from the 2nd one onward (not at the end of the previous line):

INSERT INTO Customer SELECT * FROM Customer1
;INSERT INTO Customer SELECT * FROM Customer2
;INSERT INTO Customer SELECT * FROM Customer3
;INSERT INTO Customer SELECT * FROM Customer4;

Advantages of the Fast Path INSERT/SELECT

  • Only one row is stored in the transient journal for a possible ROLLBACK
  • ROLLBACK is done immediately without delay
  • Rows are packed into data blocks and the whole blocks are written with one IO
  • Since the tables all have the same primary index, the processing is AMP-local and the BYNET is not used.


Teradata SQL Tuning – 6 Golden Rules you must never forget!

1. Complete and up-to-date Statistics

Teradata SQL Tuning begins by providing the optimizer with the statistics it needs. This must always be done as a first step. Statistics influence the execution plan. It is not only important to define all required statistics, but also to keep them as up-to-date as possible.

There is a simple reason why I put statistics first: Often performance problems are solved simply by changing the execution plan.

Which columns require which type of statistics (full statistics, random-AMP sampling, or sample statistics) depends primarily on the data demography.

Random-AMP sampling

Random AMP sampling is usually sufficient for columns with many different values. The table should contain significantly more rows than there are AMPs in the system.
The following applies to random-AMP sampling: The more distinct the column values are, the better the estimate will be.

If the column values are skewed, there is a risk that the random-AMP sample will be taken from an AMP that is not representative of the data. Skewed data leads to overestimation or underestimation of the number of rows.

Good candidates for Random-AMP sampling are unique indices, i.e., UPI and USI.

Random-AMP sampling only takes place on indexed columns.

Therefore, you should always collect statistics on non-indexed columns that are used in WHERE conditions. Whether full or sample statistics are collected depends on how unique the column values are.

Fully collected Statistics

Fully collected statistics are required for skewed column values and columns with a small number of distinct values (NUPI and NUSI).
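
As a sketch (table and column names are hypothetical):

-- Full statistics on a skewed or low-cardinality NUPI column:
COLLECT STATISTICS COLUMN (CustomerStatus) ON Customer;

-- Sample statistics on a fairly unique, non-indexed column:
COLLECT STATISTICS USING SAMPLE COLUMN (LastName) ON Customer;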

Statistics for Row-Partitioned Tables

The Teradata Optimizer has special statistics requirements for PPI tables.

The following statistics should additionally be collected on PPI tables:

Dummy Column “PARTITION”

Statistics on the dummy column “PARTITION” tell the optimizer how many partitions are empty and how many rows each partition contains.

Dummy column “PARTITION” + PI column

These statistics are required if the partitioning columns are not part of the primary index. In this case, the same primary index value can exist in different partitions. Statistics on dummy column “PARTITION” + PI allow the optimizer to estimate the cost of the sliding window and rowkey based merge join and of dynamic partition elimination steps.
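
As a sketch, for a hypothetical Sales table that is partitioned by SalesDate and has the primary index SalesId:

COLLECT STATISTICS COLUMN (PARTITION) ON Sales;
COLLECT STATISTICS COLUMN (PARTITION, SalesId) ON Sales;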

The statement below can be used to determine which statistics the optimizer would additionally like to have for a SQL statement:

DIAGNOSTIC HELPSTATS ON FOR SESSION;

This statement displays a list of suggested statistics at the end of the Execution Plan (given by the EXPLAIN statement) and the Optimizer's opinion of their value (confidence levels).

By gradually adding these statistics, you can test their influence on the execution plan.

Identify Stale Statistics

There are several ways to identify stale statistics. The easiest way is to split the SQL statement and test each partial statement individually. Testing is done merely by comparing the estimated number of rows (as shown in the EXPLAIN output) with the actual number of rows returned by the query.

The above-described approach is particularly suitable if the entire SQL statement does not execute in a reasonable time.

Here's an example:

SELECT t01.* 
FROM <Tablename1> t01
INNER JOIN
<Tablename2> t02
ON t01.PK = t02.PK
WHERE
t01.<Columnname> = 1 AND t02.<Columnname> = 2;

The above query can be divided into two parts for testing:

SELECT * FROM <Tablename1> WHERE <Columnname> = 1;
SELECT * FROM <Tablename2> WHERE <Columnname> = 2;

If you execute both sub-queries, and the number of rows returned differs significantly from the estimate in EXPLAIN, the statistics may be obsolete.

2. Teradata Primary Index Choice

Teradata SQL Tuning is not complete without choosing the best possible Primary Index. Design your queries so that the primary index can be used in the joins.

The join condition must include all columns of the primary index. If even one primary index column is missing from the join condition, a different rowhash results (the order in which the columns are defined in the primary index does not matter).

Still, the join condition can contain additional columns. These are then applied as residual conditions after locating the rows via row hash.

To execute a join between two tables, the rows of both tables must be co-located on the same AMP.

This is true when they have the same primary index. In this case, the optimizer can use a join strategy that requires no redistribution of rows, a so-called "direct join".

If for some reason you cannot change the primary indexes, or if you need a specific primary index for a particular SQL statement, create a volatile table (or a global temporary table) with the same structure and content as the original table but with a different primary index.
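
A minimal sketch, assuming a hypothetical Sales table that we want to co-locate on CustomerId for one join:

CREATE VOLATILE TABLE Sales_ByCustomer AS
(
   SELECT * FROM Sales
) WITH DATA
PRIMARY INDEX (CustomerId)
ON COMMIT PRESERVE ROWS;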

There are three things to consider when selecting the Primary Index: uniform distribution of data, good suitability for join operations, and low volatility:

[Figure: Primary index choice criteria: uniform data distribution, suitability for join operations, low volatility]

This is the only way to ensure that all AMPs start and finish their work simultaneously. This is exactly what parallel data processing is all about.

3. Teradata SQL Tuning with Indexing & Partitioning

Using indexes or partitioning is another way to improve query performance. 

Teradata offers a number of different indexes, all of which have their advantages and disadvantages. Which index should be used when depends mainly on the workload.

The Teradata Secondary Index

Secondary indexes come in two forms: the unique secondary index (USI) and the non-unique secondary index (NUSI). Although the names suggest that they differ only in uniqueness, their functionality is very different.

The Teradata USI

The USI is very similar to the Unique Primary Index:
The index rows of the index subtable are distributed evenly over all AMPs by rowhash and sorted by rowhash.

If the USI contains at least all columns used in a WHERE condition, the AMP that owns the corresponding index row can be determined directly.

The determined index row contains the ROWID of the base table row that is being searched for, and this can then be used to access the desired row.

The USI allows direct access to individual rows and therefore performs very well. It is ideal for tactical workloads where this feature is required.
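
For illustration, a USI is created like this (table and column names are assumptions):

CREATE UNIQUE INDEX (CustomerNumber) ON Customer;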

The Teradata NUSI

The NUSI is not distributed according to a rowhash. NUSI rows are always stored on the same AMP as the corresponding rows of the base table. Therefore, a NUSI access is always an all-AMP operation.

The NUSI index rows are sorted by rowhash or by an integer value. Sorting by an integer value (a date is internally of data type integer and therefore also possible) is one of the advantages of the NUSI, as it makes the NUSI ideal for range scans (e.g., all days of a certain month).
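
A sketch of a value-ordered NUSI on the hypothetical Sales table, supporting date range scans:

CREATE INDEX (SalesDate) ORDER BY VALUES (SalesDate) ON Sales;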

The Teradata Row-Partitioned Table

While secondary indexes in Teradata are stored in a sub-table and therefore require extra space, partitioning is just another way to structure the rows on the mass storage device.

The rows of a partitioned table are not only distributed to the AMPs according to rowhash, but are also assigned to the appropriate partitions and sorted by rowhash within each partition.

The rows of a partition are arranged on the disk in such a way that a full cylinder scan can be performed. 

When you should partition a table, and when the use of a secondary index or join index is more appropriate, depends on the workload. It is also possible to create additional indexes on a partitioned table.

Partitioned tables are often used for strategic queries in which a series of data (for example, the sales for the current year) is aggregated.
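
As a sketch, a typical month-partitioned table (the same hypothetical Sales table as in the statistics section above; names and date range are assumptions):

CREATE TABLE Sales
(
   SalesId INTEGER NOT NULL,
   CustomerId INTEGER NOT NULL,
   SalesDate DATE NOT NULL
) PRIMARY INDEX (SalesId)
PARTITION BY RANGE_N (
   SalesDate BETWEEN DATE '2020-01-01' AND DATE '2025-12-31' EACH INTERVAL '1' MONTH
);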

Disadvantages of Indexing & Partitioning

When we work with indexing techniques, we need to keep an eye on the entire data warehouse architecture and decide whether our solution fits. Indexes can have a negative impact on the ETL process for several reasons.

Loading tools such as Fastload require secondary indexes and join indexes to be removed before loading.

The index sub-tables have to be managed by Teradata. INSERT, DELETE, and UPDATE statements must maintain the index sub-table in addition to the actual table.

If potentially useful indexes are not used by the optimizer and they are not helpful in the entire PDM design, drop them immediately. You're wasting space and resources.

4. Query Rewriting

The performance of a query can often be improved by rewriting the query.

Personally, I prefer to consider an SQL statement as a black box and to limit optimizations to technical methods first.

Here are a few examples:

  • EXISTS instead of IN (see the sketch after this list)
  • Splitting a large SQL statement into smaller parts
  • UNION ALL instead of UNION
  • DISTINCT instead of GROUP BY
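
As an example of the first rewrite, assuming hypothetical Customer and Sales tables:

-- Original query with IN:
SELECT * FROM Customer
WHERE CustomerId IN (SELECT CustomerId FROM Sales);

-- Rewritten with EXISTS:
SELECT * FROM Customer t01
WHERE EXISTS (SELECT 1 FROM Sales t02 WHERE t02.CustomerId = t01.CustomerId);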

Since I do not have to understand the content and business logic of a query, I do not need to contact the author of the query. Purely technical optimizations are usually not that risky.

Only when I reach the limits of purely technical optimization do I begin to deal with business logic. Understanding business logic often opens up additional optimization opportunities.

Rewriting of queries often solves performance problems, even when all other techniques have failed.

5. Teradata SQL Tuning with Real-Time Monitoring

Teradata SQL tuning requires watching the query as it runs. Monitoring a query in Viewpoint at runtime helps to identify the critical steps.

In Viewpoint you should look for the following steps:

  • Steps that take a long time to finish and consume a huge amount of resources.
  • Steps that are skewed, especially if the skew factor reaches 99%, as this means that all work is done by only one AMP or a few AMPs.

Analyzing the bad query

We have to think about the root cause of the bad query step. Here are a few ideas:

  • Does the base table have a skewed primary index (static skew)?
  • Does the redistribution of rows cause skew (dynamic skew)?
  • Are massive amounts of data redistributed because of poor join preparation due to missing or stale statistics?
  • Are there many hash collisions during the execution of the INSERT statement?

Stale and missing statistics typically lead to incorrect decisions in join preparation (copying a table to all AMPs instead of rehashing) and to the use of incorrect join methods (for example, product join instead of merge join).

6. Comparison of Resource Usage

Another important task in Teradata SQL Tuning is measuring resource consumption before and after the optimization of the query.

Query run times are misleading because they can be affected by the simultaneous workload of other queries over which we have no control.

In performance tuning, we prefer to use absolute key figures that are independent of what else is running on our Teradata system.

Below is a query that gives you a detailed insight into the resource usage of each version of the SQL statement being optimized. To execute this query, you must have read access to the table DBC.DBQLOGTBL (or related views):

Do not forget to give each version of the query to be optimized its own query band so that you can uniquely identify it in the table DBC.DBQLOGTBL:

SET QUERY_BAND = 'Version=1;' FOR SESSION;

SELECT
   AMPCPUTIME,
   (FIRSTRESPTIME - STARTTIME DAY(2) TO SECOND(6)) AS RUNTIME,
   SPOOLUSAGE / 1024**3 AS SPOOL_IN_GB,
   CAST(100 - ((AMPCPUTIME / (HASHAMP()+1)) * 100 / NULLIFZERO(MAXAMPCPUTIME)) AS INTEGER) AS CPU_SKEW,
   MAXAMPCPUTIME * (HASHAMP()+1) AS CPU_IMPACT,
   AMPCPUTIME * 1000 / NULLIFZERO(TOTALIOCOUNT) AS LHR
FROM
   DBC.DBQLOGTBL
WHERE
   QUERYBAND = 'Version=1;';

The above query will return the following measures:

  • The total CPU Usage
  • The Spool Space needed
  • The LHR (ratio between CPU and IO usage)
  • The CPU Skew
  • The Skew Impact on the CPU

Our goal is to decrease total CPU usage, consumed spool space, and skewing on the CPU. The LHR is optimally around 1.00.


Teradata Volatile Tables – Expectation vs. Reality

What is a Teradata Volatile Table?

Volatile tables are very similar to derived tables. In contrast to these, their lifetime is not limited to the duration of a query, but they exist until the end of the session in which they were created.

Volatile tables are created with the following syntax:

CREATE VOLATILE TABLE Customer
(
   CustomerId INTEGER NOT NULL,
   LastName VARCHAR(255)
) PRIMARY INDEX (CustomerId) ON COMMIT PRESERVE ROWS;

A comparison between Volatile Tables, Global Temporary Tables, and Derived Tables can be found here:

Comparison of temporary table types

Teradata Volatile Table Features

What kind of space does a Volatile Table use?

Volatile Tables are materialized in the spool space of the user.

Who can access a Volatile Table?

Volatile tables can only be accessed in the session in which they were materialized.

Is there a DBC table where I can see which Volatile Tables I have materialized?

No, but with the command HELP VOLATILE TABLE, all materialized tables are listed.

Where is the DDL of a volatile table stored if not in the Data Dictionary (DBC)?

The table definition is kept in the AMP cache.

How long is a volatile table available?

Volatile tables are automatically removed when a session is terminated or a DROP TABLE statement is executed.

What happens if Teradata is restarted while I have materialized a volatile table?

Unfortunately, the table is lost along with its contents.

Can an ALTER TABLE be executed on a volatile table?

This is not possible.

Can I create a secondary index on a volatile table?

This is only possible when the table is created. Therefore, you cannot first create the table and then execute a CREATE INDEX statement.
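
A sketch, reusing the Customer definition from above; the NUSI on LastName must be defined in the CREATE statement:

CREATE VOLATILE TABLE Customer
(
   CustomerId INTEGER NOT NULL,
   LastName VARCHAR(255)
) PRIMARY INDEX (CustomerId)
INDEX (LastName)
ON COMMIT PRESERVE ROWS;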

Can I create a Join Index or Hash Index on a volatile table?

No, that's not possible.

Who is the owner of a Volatile Table?

A Volatile Table belongs neither to a database nor to a user, but to the session of a user.

Can I create the same volatile table in different sessions?

Yes, but the content is session-specific, of course.

Can I use FALLBACK protection?

Yes, but this is not really useful because volatile tables do not survive a system restart. FALLBACK only makes sense if a volatile table must be created and used while an AMP is down.

Can a volatile table be used for performance tuning?

Volatile tables offer several possibilities for this. For example, no DBC table is involved, so there can be no performance degradation from a competing workload holding, for example, a write lock on one of the dictionary tables.

Can I create statistics on a volatile table?

Statistics are allowed, but when using the statement DIAGNOSTIC HELPSTATS ON FOR SESSION, you have to be aware that they are still proposed in the EXPLAIN plan. This is because they are not available to the optimizer in a DBC table.

Can a Volatile Table be created as a PPI table?

Yes, this is possible.

Teradata Volatile Table Restrictions

  • No referential integrity can be defined
  • No CHECK constraints are allowed
  • DEFAULT values are not allowed
  • No TITLE can be defined for columns
  • No Join Index or Hash Index is allowed
  • Indexes can't be named
  • Permanent Journals can't be used
  • Column Partitioning is not allowed
  • No privileges can be set with a GRANT or REVOKE. That wouldn't make any sense either

Master the Differences between Teradata Primary and Secondary Index

Which of them is required?

None of these indices is mandatory. Nevertheless, in most cases, a primary index should be defined.

How many indices can be defined per type?

A table can have a maximum of one primary index, but up to 32 secondary indexes.

How many columns can the indexes consist of?

Both the primary index and the secondary index can consist of up to 64 columns.

What about uniqueness?

Both the primary index and the secondary index can be unique or non-unique. However, the distribution of rows for a non-unique secondary index (NUSI) is different (it is not distributed according to rowhash).

Can the definition of the index be changed?

You can drop a secondary index at any time and create it again with a different definition.
The column set of a primary index cannot be changed. However, you can change it from NUPI to UPI, for example:

ALTER TABLE Customer MODIFY UNIQUE PRIMARY INDEX (<NUPI columns>);

How about overhead for index maintenance?

The primary index deteriorates performance only if the distribution is skewed. No maintenance is needed.

A secondary index deteriorates performance because it is stored in a sub-table, which must be maintained with every UPDATE and INSERT statement. This means that a secondary index should only be created if it is actually used!

How is a Teradata Index stored?

The primary index does not need its own data structure since it simply distributes the rows according to a hashing algorithm.
The secondary index is stored in a sub-table.

What about Row-level Partitioning?

Row-level partitioning is only available for the primary index.

How are the rows on the AMPs sorted?

For an unpartitioned table, the rows are sorted by the rowhash of the primary index. If the table is partitioned, the rows are first assigned to the correct partitions and then sorted by rowhash within each partition.
For a non-unique secondary index, the rows can be sorted by rowhash or by value (INTEGER data type only). The unique secondary index can only be sorted by rowhash.

Teradata Temporary Tables

Teradata temporary tables come in different types and for different purposes. This article shows you the properties and uses of each one.

1. Derived Tables

  • Are materialized by a select statement within a query
  • Only exist within the time the query is executed
  • Use the spool space of the executing user
  • Disappear at the moment the query is finished

Derived tables must always be named with an alias, otherwise, there will be a syntax error!

Here is an example of a derived table (inside the parentheses):

SELECT * FROM (SELECT MAX(Revenue) FROM Invoices) AS InvoiceRevenue (MaxRevenue);

The table is named InvoiceRevenue and the column from the derived table is called MaxRevenue.

However, there are further possibilities to name the column of the derived table:

SELECT * FROM (SELECT MAX(Revenue) AS MaxRevenue FROM Invoices) AS InvoiceRevenue;

In the above example, we name the column already within the derived table.

Which method you use is a matter of taste, the result of the query is the same.

Derived Tables using the WITH Syntax

Here is another method with which you can name a derived table and its columns in advance and reference them later:

WITH InvoiceRevenue(MaxRevenue) AS  
(SELECT MAX(Revenue) AS MaxRevenue FROM Invoices)
SELECT * FROM  InvoiceRevenue; 

However, you must take into account that only one WITH statement is allowed per query.

The great thing about this syntax is that the table is pre-defined, and then can be easily inserted into the rest of the query.

Here is an example:

WITH InvoiceRevenue(MaxRevenue) AS  
(SELECT MAX(Revenue) AS MaxRevenue FROM Invoices)
SELECT * 
FROM 
  InvoiceRevenue t01 
INNER JOIN 
  Customer t02 
ON 
  t01.CustomerId = t02.CustomerId; 

It is also important to understand that the derived table exists only during the execution of a query, and not for the entire session.

Here is an example of two queries that are executed in one transaction. The second query will cause an error message:

BT;
WITH InvoiceRevenue(MaxRevenue) AS  
(SELECT MAX(Revenue) AS MaxRevenue FROM Invoices)
SELECT * 
FROM 
  InvoiceRevenue t01 
INNER JOIN 
  Customer t02 
ON 
  t01.CustomerId = t02.CustomerId; 

SELECT * FROM  InvoiceRevenue; --> causes an error

ET;

2. Volatile Tables

  • Are created and afterward materialized by adding data with an INSERT/SELECT statement, or
  • Are created and materialized at the same time with a CREATE VOLATILE TABLE TheTable AS (<QUERY>) WITH DATA PRIMARY INDEX (PI) ON COMMIT PRESERVE ROWS statement
  • Use the spool space of the executing user
  • Disappear at the moment the session is logged off
  • The table definition is stored in each AMP's cache

CREATE VOLATILE TABLE Revenue, NO LOG
(
   CustomerId BIGINT NOT NULL,
   RevenueAmount DECIMAL(18,3)
) ON COMMIT PRESERVE ROWS;

NO LOG means that no Transient Journal is used.

This will usually not be necessary, because it is a temporary table and you will not need this functionality of a ROLLBACK.

Note that LOG is the default setting, so NO LOG must be specified explicitly.

ON COMMIT PRESERVE ROWS is not the default setting either and must therefore usually be specified. Otherwise, the contents of the table are deleted immediately at the end of the transaction.

The default setting is ON COMMIT DELETE ROWS!

After the volatile table is created, it can be loaded like any other table:

INSERT INTO Revenue
SELECT CustomerId, SUM(Revenue) FROM Transaction GROUP BY CustomerId;

Why do I need ON COMMIT DELETE ROWS?

Here is a practical example of how to use ON COMMIT DELETE ROWS.

As already mentioned, ON COMMIT DELETE ROWS deletes the data as soon as the corresponding transaction is completed.

We can take advantage of this to, for example, delete a volatile table immediately after it is no longer needed.

BT;
CREATE VOLATILE TABLE Revenue, NO LOG
(
   CustomerId BIGINT NOT NULL,
   RevenueAmount DECIMAL(18,3)
) ON COMMIT DELETE ROWS;

INSERT INTO Revenue SELECT CustomerId, AVG(Revenue) FROM Transaction GROUP BY CustomerId;

SELECT * FROM Revenue; 
--> The volatile table still contains its rows, as the transaction has not ended!

ET; --> Now the volatile table is empty

--> Further Queries...

Additional possibilities to materialize volatile tables

–> Create the table with content:

CREATE VOLATILE TABLE Revenue AS (SELECT SUM(Amount) AS AMT FROM Revenue_All) WITH DATA ON COMMIT PRESERVE ROWS;

–> Create the table empty (it contains only the column AMT and certain table attributes of Revenue_All):

CREATE VOLATILE TABLE Revenue AS (SELECT SUM(Amount) AS AMT FROM Revenue_All) WITH NO DATA ON COMMIT PRESERVE ROWS;

–> Create a 1:1 copy of the table Revenue_All:

CREATE VOLATILE TABLE Revenue AS Revenue_All WITH DATA ON COMMIT PRESERVE ROWS;

How can I copy Statistics to the Volatile Table?

CREATE VOLATILE TABLE Revenue AS Revenue_All WITH DATA AND STATISTICS ON COMMIT PRESERVE ROWS;

How do I find all available volatile tables in my session?

All you have to do is execute this command:

HELP VOLATILE TABLE;

3. Global Temporary Tables

  • Are created with a CREATE TABLE DDL which is stored permanently in DBC tables
  • Are materialized with an INSERT/SELECT statement
  • Use the temp space of the executing user (not the spool space)
  • Disappear at the moment the session is logged off (but the table definition stays stored in the DBC tables!)
  • Each user can materialize his own copy of a Global Temporary Table

Global Temporary Tables are useful if the definition is to be used by several users.

Like a Volatile Table, the Global Temporary Table only exists in the session where it was materialized. However, unlike the Volatile Table, any number of users can materialize this table (as the definition is stored permanently in DBC tables).
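
A minimal sketch, reusing the hypothetical Revenue example from the volatile table section:

CREATE GLOBAL TEMPORARY TABLE Revenue
(
   CustomerId BIGINT NOT NULL,
   RevenueAmount DECIMAL(18,3)
) PRIMARY INDEX (CustomerId)
ON COMMIT PRESERVE ROWS;

-- Each session materializes its own copy:
INSERT INTO Revenue
SELECT CustomerId, SUM(Revenue) FROM Transaction GROUP BY CustomerId;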
