A session is a LOGON of a user on a Teradata system, through which one or more sequential transactions can be executed. The session is terminated by a LOGOFF statement.
To Teradata, multiple sessions look like a number of separate users (all logged on with the same user credentials). Each of these sessions belongs to the same Teradata user and can execute one or more sequential transactions between LOGON and LOGOFF, just like a single session.
The use of multiple sessions makes it possible to run multiple tasks in parallel. The aim is to increase performance.
Not at all. For several sessions (e.g., in a BTEQ script) to increase performance, the rows must be accessed via the rowhash. Rowhash access only occurs if a UPI, NUPI, or USI is used. All other access paths require a table lock, and in that case multiple sessions do not improve performance.
A BTEQ import script can usually increase performance by using multiple sessions. BTEQ export scripts usually do not achieve any performance advantage by using multiple sessions.
This is difficult to answer and is best determined empirically. What is important is an access path via the Rowhash (UPI, NUPI, USI). Too few sessions do not bring out the maximum performance. Too many sessions do not bring any more performance advantage after a certain point but increase the administration effort.
For Full Table Scans and NUSI accesses.
As shown in the graph above, all 4 transactions (T1, T2, T3, T4) are active on all AMPs simultaneously. Even if only a single row is required, each of these transactions involves all AMPs to look that row up. None of the transactions accesses the table via a single rowhash; therefore, there is no performance gain.
In example 2, each transaction is only active on exactly one AMP. Therefore the 4 AMPs can work on different tasks at the same time.
In summary, the use of multiple sessions is only useful if the transactions access their rows via the rowhash (UPI, NUPI, or USI), so that each transaction is active on a single AMP.
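As an illustration, here is a minimal BTEQ import sketch using multiple sessions. It assumes a Customer table with CustomerId as its primary index; the session count, file name, system name, and credentials are placeholders:

.SET SESSIONS 8
.LOGON mysystem/myuser,mypassword

.IMPORT DATA FILE = customer.dat
.QUIET ON
.REPEAT *
USING (CustomerId INTEGER, LastName VARCHAR(255))
INSERT INTO Customer (CustomerId, LastName)
VALUES (:CustomerId, :LastName);

.LOGOFF
.EXIT

Because every INSERT is a single-rowhash access via the primary index, the sessions can work on different AMPs in parallel.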
Fast Path INSERT/SELECT is designed to load a single table as efficiently as possible.
The idea behind this is to minimize access to mass storage (hard disk, solid state disk).
This is done by creating whole data blocks on each AMP and then writing them to the mass storage with a single IO.
Under the conditions mentioned above (target table is empty, all tables involved have the same primary index), the following is ensured at the same time:
The rows of the target and source tables are already sorted on the same AMP and also in ascending order according to the same rowhash within the data blocks.
Therefore, each AMP can prepare the data blocks to be written locally and write them to mass storage with a single IO.
BTEQ uses a special syntax that must be adhered to when several tables are to be loaded into the empty target table at the same time using Fast Path:
From the second INSERT onward, the semicolon must be placed at the beginning of the line (not at the end of the lines!):
INSERT INTO Customer SELECT * FROM Customer1
;INSERT INTO Customer SELECT * FROM Customer2
;INSERT INTO Customer SELECT * FROM Customer3
;INSERT INTO Customer SELECT * FROM Customer4;
Teradata SQL Tuning begins by providing the optimizer with the statistics it needs. This must always be done as a first step. Statistics influence the execution plan. It is not only important to define all required statistics, but also to keep them as up-to-date as possible.
There is a simple reason why I put statistics first: Often performance problems are solved simply by changing the execution plan.
Which columns require which type of statistics (full statistics collecting, random-AMP sampling or sample statistics) depends primarily on the data demography.
Random AMP sampling is usually sufficient for columns with many different values. The table should contain significantly more rows than there are AMPs in the system.
The following applies to random-AMP sampling: The more distinct the column values are, the better the estimate will be.
If the column values are skewed, there is a risk that the random-AMP sample will be taken from an AMP that is not representative of the data. Skewed data leads to overestimation or underestimation of the number of rows.
Good candidates for Random-AMP sampling are unique indices, i.e., UPI and USI.
Random-AMP sampling only takes place on indexed columns.
Therefore, you should always collect statistics on non-indexed columns that are used in WHERE conditions. Whether full or sample statistics should be collected depends on how unique the column values are.
Fully collected statistics are required for skewed column values and columns with a small number of distinct values (NUPI and NUSI).
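For illustration, the two kinds of collected statistics could look like this (table and column names are assumptions for the example):

-- Sample statistics on a fairly unique, non-indexed column
COLLECT STATISTICS USING SAMPLE COLUMN (LastName) ON Customer;

-- Full statistics on a skewed column with few distinct values
COLLECT STATISTICS COLUMN (CustomerStatus) ON Customer;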
The Teradata Optimizer has unique statistic requirements for PPI tables.
The following statistics should additionally be collected on PPI tables:
Dummy Column “PARTITION”
Statistics on dummy column “PARTITION” tell the optimizer how many partitions are empty and how many rows each partition contains.
Dummy column “PARTITION” + PI column
These statistics are required if the partitioning columns are not part of the primary index. In this case, the same primary index value can exist in different partitions. Statistics on dummy column “PARTITION” + PI allow the optimizer to estimate the cost of the sliding window and rowkey based merge join and of dynamic partition elimination steps.
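For example, assuming a PPI table Sales with primary index SalesId, these additional statistics could be collected as follows:

COLLECT STATISTICS COLUMN (PARTITION) ON Sales;
COLLECT STATISTICS COLUMN (PARTITION, SalesId) ON Sales;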
The following statement can be used to determine which statistics the optimizer would additionally like to have for a SQL statement:
DIAGNOSTIC HELPSTATS ON FOR SESSION;
This statement displays a list of suggested statistics at the end of the Execution Plan (given by the EXPLAIN statement) and the Optimizer's opinion of their value (confidence levels).
By gradually adding these statistics, you can test their influence on the execution plan.
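A possible workflow looks like this (the table Customer and the column CustomerStatus are placeholders):

DIAGNOSTIC HELPSTATS ON FOR SESSION;
EXPLAIN SELECT * FROM Customer WHERE CustomerStatus = 1;
-- Collect one of the suggested statistics and compare the new plan
COLLECT STATISTICS COLUMN (CustomerStatus) ON Customer;
EXPLAIN SELECT * FROM Customer WHERE CustomerStatus = 1;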
There are several ways to identify stale statistics. The easiest way is to split the SQL statement and test each partial statement individually. The check is done merely by comparing the estimated number of rows (as shown in the EXPLAIN output) with the actual number of rows returned by each partial query.
The above-described approach is particularly suitable if the entire SQL statement does not execute in a reasonable time.
Here's an example:
SELECT t01.*
FROM <Tablename1> t01
INNER JOIN
<Tablename2> t02
ON t01.PK = t02.PK
WHERE
t01.<Column name> = 1 AND t02.<Column name> = 2;
The above query can be divided into two parts for testing:
SELECT * FROM <Tablename1> WHERE <Column name> = 1;
SELECT * FROM <Tablename2> WHERE <Column name> = 2;
If you execute both sub-queries, and the number of rows returned differs significantly from the estimate in EXPLAIN, the statistics may be obsolete.
Teradata SQL Tuning is not complete without choosing the best possible Primary Index. Design your queries so that the primary index can be used in the joins.
All join columns must be part of the primary index. If even one column of the primary index is missing from the join condition, the result is a different rowhash (the order in which the columns are defined in the primary index does not matter).
Still, the join condition can contain additional columns. These are then applied as residual conditions after locating the rows via row hash.
To execute a join between two tables, the rows of both tables must be co-located on the same AMP.
This is the case when they have the same primary index. The optimizer can then use a join strategy that requires no redistribution of rows, a so-called "direct join".
If for some reason you cannot change the primary indexes, or if you need a specific primary index for a particular SQL statement, create a volatile table (or a true temporary table) with the same structure and content as the original table but with a different primary index.
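A minimal sketch of this technique, assuming a Customer table that has to be joined on CustomerNumber (both names are made up for the example):

CREATE VOLATILE TABLE Customer_Tmp AS
(SELECT * FROM Customer)
WITH DATA
PRIMARY INDEX (CustomerNumber)
ON COMMIT PRESERVE ROWS;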
There are three things to consider when selecting the primary index: uniform distribution of data, good suitability for join operations, and low volatility.
An even distribution is the only way to ensure that all AMPs start and finish their work at the same time. This is exactly what parallel data processing is all about.
Using indexes or partitioning is another way to improve query performance.
Teradata offers a number of different indexes, all of which have their advantages and disadvantages. Which index should be used when depends mainly on the workload.
Secondary indexes come in two forms: the unique secondary index (USI) and the non-unique secondary index (NUSI). Although one could conclude from the names that they differ only in uniqueness, their functionality is very different.
The Teradata USI
The USI is very similar to the Unique Primary Index:
The index rows of the index subtable are distributed evenly over all AMPs by rowhash and sorted by rowhash.
If the WHERE condition contains all columns of the USI, the AMP that owns the corresponding index row can be determined directly.
The index row found there contains the ROWID of the base table row being searched for, which can then be used to access the desired row.
The USI is suitable for direct access to individual rows and is therefore high-performance. It is ideal for tactical workload where this feature is required.
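For illustration, such a USI could be created like this (the column CustomerNumber on table Customer is an assumed example):

CREATE UNIQUE INDEX (CustomerNumber) ON Customer;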
The Teradata NUSI
The NUSI is not distributed according to a rowhash. NUSI rows are always stored on the same AMP as the corresponding rows of the base table. Therefore, a NUSI access is always an all-AMP operation.
The NUSI index rows are sorted by rowhash or by an integer value. Sorting by an integer value (date is internally of data type integer, and therefore also possible) is one of the advantages of the NUSI, as this makes it ideal for range scans (e.g. all days of a certain month).
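As a sketch, a value-ordered NUSI on an assumed Sales table could look like this; sorting by the date column makes it suitable for range scans:

CREATE INDEX (SalesDate) ORDER BY VALUES (SalesDate) ON Sales;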
While secondary indexes in Teradata are stored in a sub-table and therefore require extra space, partitioning is just another way to structure the rows on the mass storage device.
The rows of a partitioned table are not only distributed to the AMPs according to rowhash, but are also assigned to the appropriate partition on each AMP and, within each partition, sorted by rowhash.
The rows of a partition are arranged on the disk in such a way that a full cylinder scan can be performed.
When you should partition a table, and when the use of a secondary index or join index is more appropriate, depends on the workload. It is also possible to create additional indexes on a partitioned table.
Partitioned tables are often used for strategic queries in which a series of data (for example, the sales for the current year) is aggregated.
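A hedged example of such a table definition (all names and the date range are assumptions):

CREATE TABLE Sales
(
   SalesId   BIGINT NOT NULL,
   SalesDate DATE NOT NULL,
   Amount    DECIMAL(18,2)
) PRIMARY INDEX (SalesId)
PARTITION BY RANGE_N(SalesDate BETWEEN DATE '2020-01-01' AND DATE '2029-12-31' EACH INTERVAL '1' MONTH);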
When we work with indexing techniques, we need to keep an eye on the entire data warehouse architecture and decide whether our solution fits. Indexes can have a negative impact on the ETL process for several reasons.
Loading tools such as Fastload require secondary indexes and join indexes to be removed before loading.
The index sub-tables have to be managed by Teradata. Insert, Delete and Update statements require that in addition to the actual table, the index sub-table must be maintained.
If potentially useful indexes are not used by the optimizer and they are not helpful in the entire PDM design, drop them immediately. You're wasting space and resources.
The performance of a query can often be improved by rewriting the query.
Personally, I prefer to consider an SQL statement as a black box and to limit optimizations to technical methods first.
Here are a few examples:
Because I do not have to understand the content and business logic of a query, I do not need to contact its author. Purely technical optimizations are usually not that risky.
Only when I reach the limits of purely technical optimization do I begin to deal with business logic. Understanding business logic often opens up additional optimization opportunities.
Rewriting of queries often solves performance problems, even when all other techniques have failed.
Teradata SQL tuning also requires watching the query run in real time. Monitoring a query in Viewpoint at runtime helps to identify the critical steps.
In Viewpoint you should look for the following steps:
We have to think about the root cause of the bad query step; here are a few ideas:
Stale and missing statistics typically lead to incorrect decisions in join preparation (copying a table to all AMPs instead of rehashing) and to the use of incorrect join methods (for example, product join instead of merge join).
Another important task in Teradata SQL Tuning is measuring resource consumption before and after the optimization of the query.
Query run times are misleading because they can be affected by the simultaneous workload of other queries over which we have no control.
In performance tuning, we prefer to use absolute key figures that are independent of what else is running on our Teradata system.
Below is a query that gives you a detailed insight into how good each step of the SQL statement to be optimized is. To execute this query, you must have read access to table DBC.DBQLOGTBL (or related views):
Do not forget to give each version of the query to be optimized its own query band so that you can uniquely identify it in the table DBC.DBQLOGTBL:
SET QUERY_BAND = 'Version=1;' FOR SESSION;
SELECT
AMPCPUTIME,
(FIRSTRESPTIME - STARTTIME) DAY(2) TO SECOND(6) AS RUNTIME,
SPOOLUSAGE/1024**3 AS SPOOL_IN_GB,
CAST(100-((AMPCPUTIME/(HASHAMP()+1))*100/
NULLIFZERO(MAXAMPCPUTIME)) AS INTEGER) AS CPU_SKEW,
MAXAMPCPUTIME*(HASHAMP()+1) AS CPU_IMPACT,
AMPCPUTIME*1000/NULLIFZERO(TOTALIOCOUNT) AS LHR
FROM
DBC.DBQLOGTBL
WHERE
QUERYBAND = 'Version=1;' ;
The above query will return the following measures:
Our goal is to decrease total CPU usage, consumed spool space, and CPU skew. The LHR is optimally around 1.00.
Volatile tables are very similar to derived tables. In contrast to these, their lifetime is not limited to the duration of a query, but they exist until the end of the session in which they were created.
Volatile tables are created with the following syntax:
CREATE VOLATILE TABLE Customer
(
   CustomerId INTEGER NOT NULL,
   LastName VARCHAR(255)
) PRIMARY INDEX (CustomerId)
ON COMMIT PRESERVE ROWS;
A comparison between Volatile Tables, Global Temporary Tables, and Derived Tables can be found here:
Comparison of temporary table types
Volatile Tables are materialized in the spool space of the user.
Volatile tables can only be accessed in the user's session in which they were materialized.
No, but with the command HELP VOLATILE TABLE, all materialized tables are listed.
The table definition is kept in the AMP cache.
Volatile tables are automatically removed when a session is terminated or a DROP TABLE statement is executed.
Unfortunately, the table is lost along with its contents.
This is not possible.
This is only possible when the table is created. Therefore, you cannot first create the table and then execute a CREATE INDEX statement.
A Volatile Table belongs neither to a database nor to a user, but to the session of a user.
Yes, but the content is session-specific, of course.
Yes, but this is not really useful because volatile tables do not survive a system restart. Fallback protection only makes sense if an AMP is down while the volatile table is being used.
Volatile tables offer several advantages here. For example, no DBC dictionary table is involved, so there can be no performance degradation when a competing workload holds, for example, a write lock on one of these dictionary tables.
Statistics are allowed, but when using the statement DIAGNOSTIC HELPSTATS ON FOR SESSION you have to be aware that they are still proposed in the EXPLAIN plan. This is because they are not available to the optimizer in a DBC table.
Yes, this is possible.
None of these indices is mandatory. Nevertheless, in most cases, a primary index should be defined.
A table can have a maximum of one primary index, but up to 32 secondary indexes.
Both the primary index and the secondary index can consist of up to 64 columns.
Both the primary index and the secondary index can be unique or non-unique. However, the distribution of rows for a non-unique secondary index (NUSI) is different (it is not distributed according to rowhash).
You can drop a secondary index at any time and create it again with a different definition.
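For example (the index columns are assumptions):

DROP INDEX (LastName) ON Customer;
CREATE INDEX (LastName, FirstName) ON Customer;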
The column set of a primary index cannot be changed. However, you can change it from NUPI to UPI, for example:
ALTER TABLE Customer MODIFY UNIQUE PRIMARY INDEX (<NUPI columns>);
The primary index deteriorates performance only if the distribution is skewed. No maintenance is needed.
A secondary index deteriorates performance because it is mapped in a sub table, which must be maintained with every UPDATE and INSERT statement. This means that a secondary index should only be created if it is also used!
The primary index does not need its own data structure since it simply distributes the rows according to a hashing algorithm.
The secondary index is stored in a sub table.
Row-level partitioning is only available for the primary index.
In an unpartitioned table, the rows are sorted by the rowhash of the primary index. If the table is partitioned, the rows are first assigned to the correct partitions and then sorted by rowhash within each partition.
For a non-unique secondary index, the rows can be sorted by the rowhash or by value (INTEGER datatype only). The unique secondary index can only be sorted by rowhash.
Teradata temporary tables come in different types and for different purposes. This article shows you the properties and uses of each one.
Derived tables must always be named with an alias, otherwise, there will be a syntax error!
Here is an example of a derived table (inside the parentheses):
SELECT * FROM (SELECT MAX(Revenue) FROM Invoices) AS InvoiceRevenue (MaxRevenue);
The table is named InvoiceRevenue and the column from the derived table is called MaxRevenue.
However, there are further possibilities to name the column of the derived table:
SELECT * FROM (SELECT MAX(Revenue) AS MaxRevenue FROM Invoices) AS InvoiceRevenue;
In the above example, we name the column already within the derived table.
Which method you use is a matter of taste; the result of the query is the same.
Here is another method by which you can name a derived table and its columns in advance and reference it later:
WITH InvoiceRevenue(MaxRevenue) AS (SELECT MAX(Revenue) AS MaxRevenue FROM Invoices) SELECT * FROM InvoiceRevenue;
However, you must take into account that only one WITH statement is allowed per query.
The great thing about this syntax is that the table is pre-defined, and then can be easily inserted into the rest of the query.
Here is an example:
WITH InvoiceRevenue(MaxRevenue) AS (SELECT MAX(Revenue) AS MaxRevenue FROM Invoices) SELECT * FROM InvoiceRevenue t01 INNER JOIN Customer t02 ON t01.CustomerId = t02.CustomerId;
It is also important to understand that the derived table exists only during the execution of a query, and not for the entire session.
Here is an example of two queries that are executed in one transaction. The second query will cause an error message:
BT;

WITH InvoiceRevenue(MaxRevenue) AS
(SELECT MAX(Revenue) AS MaxRevenue FROM Invoices)
SELECT *
FROM InvoiceRevenue t01
INNER JOIN Customer t02 ON t01.CustomerId = t02.CustomerId;

SELECT * FROM InvoiceRevenue; -- causes an error

ET;
CREATE VOLATILE TABLE Revenue, NO LOG
(
   CustomerId BIGINT NOT NULL,
   RevenueAmount DECIMAL(18,3)
) ON COMMIT PRESERVE ROWS;
NO LOG means that no Transient Journal is used.
This will usually not be necessary, because it is a temporary table and you will normally not need the ROLLBACK functionality.
Note that LOG (not NO LOG) is the default setting, so NO LOG has to be specified explicitly; for volatile tables it is usually the better choice.
ON COMMIT PRESERVE ROWS is not a default setting, and must therefore usually always be specified. Otherwise, the contents of the table are deleted immediately at the end of the transaction.
The default setting is ON COMMIT DELETE ROWS!
After the volatile table is created, it can be loaded like any other table:
INSERT INTO Revenue
SELECT CustomerId, SUM(Revenue)
FROM Transaction
GROUP BY CustomerId;
Here is a practical example of how to use ON COMMIT DELETE ROWS.
As already mentioned, ON COMMIT DELETE ROWS deletes the data as soon as the corresponding transaction is completed.
We can take advantage of this to, for example, delete a volatile table immediately after it is no longer needed.
BT;

CREATE VOLATILE TABLE Revenue, NO LOG
(
   CustomerId BIGINT NOT NULL,
   RevenueAmount DECIMAL(18,3)
) ON COMMIT DELETE ROWS;

INSERT INTO Revenue
SELECT CustomerId, AVG(Revenue)
FROM Transaction
GROUP BY CustomerId;

SELECT * FROM Revenue; -- the volatile table still contains its rows because the transaction has not ended

ET; -- now the volatile table is empty

-- further queries...
-- Create the table with content
CREATE VOLATILE TABLE Revenue AS
(SELECT SUM(Amount) AS AMT FROM Revenue_All)
WITH DATA
ON COMMIT PRESERVE ROWS;

-- Create the table empty
CREATE VOLATILE TABLE Revenue AS
(SELECT SUM(Amount) AS AMT FROM Revenue_All)
WITH NO DATA
ON COMMIT PRESERVE ROWS;
-- This volatile table contains only the column AMT (plus certain table attributes taken over from Revenue_All)
CREATE VOLATILE TABLE Revenue AS
(SELECT SUM(Amount) AS AMT FROM Revenue_All)
WITH DATA
ON COMMIT PRESERVE ROWS;

-- This volatile table is a 1:1 copy of the table Revenue_All
CREATE VOLATILE TABLE Revenue AS Revenue_All
WITH DATA
ON COMMIT PRESERVE ROWS;
-- Copy the table including the statistics of the source table
CREATE VOLATILE TABLE Revenue AS Revenue_All
WITH DATA AND STATISTICS
ON COMMIT PRESERVE ROWS;
All you have to do is execute this command:
HELP VOLATILE TABLE;
Global Temporary Tables are useful if the definition is to be used by several users.
Like a Volatile Table, the Global Temporary Table only exists in the session where it was materialized. However, unlike the Volatile Table, any number of users can materialize this table (as the definition is stored permanently in DBC tables).
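A minimal sketch of a Global Temporary Table definition (all names are assumptions). The definition is stored permanently in the data dictionary, but each session materializes its own instance when it first writes to the table:

CREATE GLOBAL TEMPORARY TABLE Revenue_GTT
(
   CustomerId BIGINT NOT NULL,
   RevenueAmount DECIMAL(18,3)
) PRIMARY INDEX (CustomerId)
ON COMMIT PRESERVE ROWS;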