Volatile tables are very similar to derived tables. In contrast to derived tables, however, their lifetime is not limited to the duration of a single query: they exist until the end of the session in which they were created.
Volatile tables are created with the following syntax:
CREATE VOLATILE TABLE Customer
(
  CustomerId INTEGER NOT NULL,
  LastName VARCHAR(255)
)
PRIMARY INDEX (CustomerId)
ON COMMIT PRESERVE ROWS;
A comparison between Volatile Tables, Global Temporary Tables, and Derived Tables can be found here:
Volatile Tables are materialized in the spool space of the user.
Volatile tables can only be accessed in the session in which they were materialized.
No, but the command HELP VOLATILE TABLE lists all volatile tables materialized in the current session.
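The call is simply:

```sql
HELP VOLATILE TABLE;
```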
The table definition is kept in the AMP cache.
Volatile tables are automatically removed when a session is terminated or a DROP TABLE statement is executed.
Unfortunately, the table is lost along with its contents.
This is not possible.
This is only possible when the table is created. Therefore, you cannot first create the table and then execute a CREATE INDEX statement.
A Volatile Table belongs neither to a database nor to a user, but to the session of a user.
Yes, but the content is session-specific, of course.
Yes, but this is not really useful because volatile tables do not survive a system restart. It only makes sense if a volatile table has to be created while an AMP is down.
Volatile tables offer several advantages here. For example, no DBC dictionary table is involved, and therefore there can be no performance degradation when a competing workload holds, for example, a write lock on one of these dictionary tables.
Statistics are allowed. However, when using the statement DIAGNOSTIC HELPSTATS ON FOR SESSION, be aware that statistics on volatile tables are still proposed in the EXPLAIN plan even after they have been collected. This is because they are not available to the optimizer in a DBC table.
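As a sketch, statistics on the volatile Customer table from the example above could be collected like this (the column choice is illustrative):

```sql
COLLECT STATISTICS COLUMN (CustomerId) ON Customer;
```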
Yes, this is possible.
A hot standby node waits for its mission, which begins when a node in its clique crashes.
The ingenious thing about the design is the following: as soon as the crashed node is online again, it becomes the new hot standby node.
Locking of database objects is a must in an RDBMS: it prevents multiple users from changing and accessing the same data at the same time in ways that would cause data inconsistencies.
In other words, whenever a user accesses or changes data, the affected objects are locked to ensure integrity.
For each access to data, Teradata decides how granular a lock must be and which types of activity are blocked for other users.
The granularity of locking can also be increased during execution (e.g. from row lock to table lock).
If a RowHash is locked, this affects one or more rows, because all rows with the same primary index value have the same RowHash.
A RowHash lock is used in the following cases:
For example, take the following UPDATE statement:
UPDATE Customer SET Gender = 'm';
Without using a WHERE condition, a write lock to the table is required.
If we change the query and use a WHERE condition on the primary index (CustomerId), a rowhash lock is used:
UPDATE Customer SET Gender = 'm' WHERE CustomerId = 1;
If rowhash locking is performed on a row-partitioned table, a partition (or a range of partitions) can be locked at the same time, i.e. the lock can be applied even more specifically.
The RowHash is then locked within one or more partitions. This means that other requests can still access rows of the table that either lie in different partitions or have a different rowhash.
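As an illustrative sketch (the table and its partitioning scheme are assumptions, not taken from the text above): if the WHERE condition specifies both the primary index and the partitioning column, Teradata can restrict the rowhash lock to a single partition:

```sql
-- Hypothetical row-partitioned table
CREATE TABLE SalesTransaction
(
  CustomerId INTEGER NOT NULL,
  TxDate DATE NOT NULL,
  Amount DECIMAL(18,2)
)
PRIMARY INDEX (CustomerId)
PARTITION BY RANGE_N (TxDate BETWEEN DATE '2020-01-01' AND DATE '2029-12-31' EACH INTERVAL '1' MONTH);

-- The rowhash lock is limited to the partition(s) matching the date
UPDATE SalesTransaction
SET Amount = 0
WHERE CustomerId = 1
AND TxDate = DATE '2024-01-15';
```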
Exclusive locks are only applied to databases or tables.
They are the strictest locks: all requests of all other users are queued while an object is exclusively locked. Exclusive locks are rarely used, usually when databases are restructured.
Write locks allow the owner of the lock to change data; all other users can only read with a so-called “dirty read” (via an access lock), because the data they see can be inconsistent. As long as there is a write lock, no new write or read locks are granted.
Whenever a user holds a read lock on an object, no write locks or exclusive locks can be obtained by other users. However, any number of read locks (and access locks) can be held by other users on the same object at the same time.
Access locks allow a user to read possibly inconsistent data. Access locks work even if another user already holds a write lock on the object. Access locks are not granted if there is an exclusive lock on the object.
Teradata automatically takes care of which lock is best suited for a particular situation. But you as a user can influence it. Here is an example:
SELECT * FROM Customer;
The above query normally causes a read lock.
But if the consistency of the data is not important, you can downgrade the lock to an access lock:
LOCKING Customer FOR ACCESS
SELECT * FROM Customer;
The LOCKING FOR ACCESS modifier is particularly interesting because the data can be accessed even when the table is being written.
Analogous to the previous example you can upgrade the lock from a read lock to a write lock:
LOCKING Customer FOR WRITE
SELECT * FROM Customer;
Any user who needs a read lock, a write lock, or an exclusive lock on the Customer table must now wait until your query is finished.
NOWAIT is optional and means that the SQL statement is aborted if the desired lock is not immediately available.
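Combined with the earlier example, a sketch of the NOWAIT option looks like this:

```sql
LOCKING Customer FOR WRITE NOWAIT
SELECT * FROM Customer;
```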
Before we go into Teradata architecture in more detail, we need to talk about how a computer is built. After all, this is also the basis of a Teradata system.
Teradata, as one of the pioneers in data warehousing, was and is a role model for many subsequent database systems in terms of architecture.
Even though Teradata is getting on in years, the developers of the system considered many details from the beginning that still make Teradata competitive today.
If we look at various modern database systems today, such as Redshift from Amazon (or Netezza), for example, we can recognize many things that were used by Teradata for the first time.
Teradata was designed from the beginning for parallelism down to the smallest detail and can therefore still be found among the top RDBMS today.
Data is permanently stored on mass memory and is loaded into the main memory for processing by the CPU.
It is important to understand that accessing the mass storage device is many times slower than accessing the main memory. Accessing the main memory is much slower than accessing data already in one of the CPU caches.
Before data can be processed by the CPU, it must be loaded into the main memory (if it is not already there).
The Teradata architecture can now be easily imagined as a number of individual computers that can communicate with each other:
To split the workload, Teradata uses a hashing algorithm that distributes the rows of each table evenly among the so-called AMPs. (We will talk later in this article about exactly what an AMP is and what its tasks are; for now, it is sufficient to know that the AMPs do the main work.)
An important part of the Teradata architecture is the Parsing Engine (PE).
The Parsing engine receives a request (e.g. an SQL statement) and generates an execution plan for all AMPS that are required to complete the request. Ideally, the plan is structured so that all AMPs start and finish their tasks at the same time. This ensures optimal parallel utilization of the system.
As you can see in the figure above, between the AMPs and the parsing engine is the BYNET, which represents the communication network over which both the data and instructions are exchanged. We talk about the BYNET in detail later in this article.
The Parsing Engine has the following main tasks:
Each Teradata System can use multiple parsing engines.
The number of parsing engines can be increased as needed because each parsing engine can only process a limited number of sessions.
AMPs are the real workers in a Teradata system that execute the instructions they receive from the Parsing Engine (the execution plan).
AMPs are independent units that have their own part of the main memory and mass storage allocated to them.
The allocation is exclusive, i.e. no AMP has access to the resources of another AMP.
These are the main tasks of an AMP:
Parsing engines and AMPs are processes that run on a node. A node is usually a Linux machine equipped with multiple physical CPUs.
Each node can run hundreds of AMPs. Each AMP has its own portion of the main memory and its own portion of mass memory (called virtual disk).
Nodes are connected to a disk array, and each AMP is assigned a part of it as a logical disk. Nowadays, SSDs are used and placement is managed by Teradata Virtual Storage, but the principle is the same.
A Teradata system can consist of a large number of nodes. These, in turn, are connected to each other via BYNET.
However, this is a physical network. Within a node, the BYNET that connects the AMPs with the Parsing Engine and with each other is implemented in software:
Another view on the Teradata Design by TutorialsPoint
If we run a query on a Teradata system, it runs in parallel in every step, no matter if it is a join, sorting, aggregation step, etc.
The great advantage is that this applies to every step of the query process; that is what makes Teradata unique. The big advantage Teradata has over other database systems is that its high level of parallelism was part of the architecture from the start many years ago (when most of its components were implemented in hardware instead of software).
Over time, new features have been added to further increase parallelism. However, much of this would not have been possible without the architecture underlying Teradata.
AMPs are independent processes that perform a wide range of activities independently of each other. Each SQL query is first broken down into subtasks assigned to the AMPs. Each AMP completes its tasks and delivers a partial result. When all AMPs are ready, the overall result is returned.
Parallelism at AMP level is one of the main reasons for a Teradata System offering tremendous performance when used the right way.
AMPs are not specialized but capable of performing any kind of task.
Today, hundreds of AMPs are available on each Teradata system.
Each AMP has its own exclusive resources:
Since Teradata is a shared-nothing architecture, all resources of an AMP are only assigned to this AMP. An extension of the system (by adding hardware) therefore goes hand in hand with the possibility of linear growth in performance.
Parallelism is achieved by hash partitioning. The existing data is evenly distributed among the AMPs, with the goal that each AMP must perform approximately the same amount of work.
Hash partitioning works great for distributing large amounts of data to the AMPs. The drawback is that individual AMPs can become a bottleneck if the distribution is not even: the result is skewing.
Skewing is one of the main problems a performance tuner on a Teradata system has to solve.
Hash partitioning is easily achieved by defining one or more columns for each table from which a hash value is calculated. This hash value determines the AMP for each row. The columns used for hash partitioning are the so-called Primary Index:
CREATE TABLE Customer
(
  Customer_Id BIGINT NOT NULL,
  Lastname VARCHAR(500),
  Firstname VARCHAR(500)
)
UNIQUE PRIMARY INDEX (Customer_Id);
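Whether the chosen primary index distributes the rows evenly can be checked with Teradata's hash functions, a common idiom (shown here as a sketch for the Customer table above):

```sql
SELECT
  HASHAMP(HASHBUCKET(HASHROW(Customer_Id))) AS Amp_No,
  COUNT(*) AS Row_Count
FROM Customer
GROUP BY 1
ORDER BY 2 DESC;
```

A heavily skewed table shows a few AMPs with far higher row counts than the rest.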
Pipelining increases the parallelism of queries even more. Pipelining makes it possible to start a step even if the predecessor step is not yet finished.
As already mentioned, a request is split into several steps. Each step represents a subtask. This can be, for example:
Steps can have different complexity, and AMPs may also need to interact with each other.
In a join step (which is complex), pipelining allows the join activity to start while rows of the participating tables are still being redistributed.
With pipelining, steps are nested into each other, i.e. the output of a predecessor step is made available to the successor as early as possible. In addition, Teradata has another level of parallelism available.
Steps that are independent of each other can be executed in parallel.
Teradata is a shared-nothing architecture and as we know the work is done in parallel by the AMPs.
But since there is a need for AMPs to communicate with each other, a network is needed to exchange both messages and data: The BYNET.
However, BYNET is more than just a normal network. It has functionality that has been specially developed for Teradata:
To better understand the tasks of the BYNET, recall the Parsing Engine: it is responsible for creating the execution plan of a request. The BYNET is the interface between the Parsing Engine and the AMPs.
Messages can be sent from the Parsing Engine via BYNET to the AMPs, but BYNET is also responsible for the AMP to AMP communication.
BYNET can send messages to all AMPs, a group of AMPs, or even a single AMP.
What is unique in Teradata: the sorting of the result set is done in parallel. At each level (AMP, node, BYNET, Parsing Engine), the data is pre-sorted to avoid an expensive sort step at the end.
These were the essential components of the Teradata Shared Nothing architecture. More details about how data is stored can be found here:
Surrogate keys are used in Teradata to map the natural keys of source systems to a unique key.
Usually, one or more natural key columns are mapped to a surrogate key of integer type. Often a consecutive number is generated.
For a surrogate key in Teradata, the ROW_NUMBER() function can be used, or an IDENTITY column.
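A sketch of the ROW_NUMBER() approach, assuming a key table without an IDENTITY column and a hypothetical staging table (all names are illustrative): new natural keys receive numbers above the current maximum surrogate key:

```sql
INSERT INTO KeyTable (Natural_Key, Surrogate_Key)
SELECT
  src.Natural_Key,
  -- Number the new keys, offset by the highest existing surrogate key
  ROW_NUMBER() OVER (ORDER BY src.Natural_Key)
    + COALESCE((SELECT MAX(Surrogate_Key) FROM KeyTable), 0)
FROM Staging src
WHERE src.Natural_Key NOT IN (SELECT Natural_Key FROM KeyTable);
```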
The advantage of an IDENTITY column is that a new surrogate key is generated by the system as soon as a row is inserted into the key table.
However, it is important to define the IDENTITY column so that no duplicates are created.
For this, the IDENTITY column must be defined with GENERATED ALWAYS and NO CYCLE.
GENERATED ALWAYS ensures that the surrogate key is always generated by the system and cannot be overridden or updated; NO CYCLE prevents the reuse of values – if the highest possible number of the data type is reached, no new surrogate keys are generated and an error is reported.
Detailed information about IDENTITY columns is available in the table DBC.IdCol.
If a key table that uses IDENTITY columns has to be reloaded, please pay attention to the following to avoid duplicates:
Here is an example where the highest value of the IDENTITY column is 200,000:
CREATE MULTISET TABLE TheKeys
(
  NATURAL_KEY VARCHAR(500) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
  SURROGATE_KEY BIGINT GENERATED ALWAYS AS IDENTITY
    (START WITH 1 INCREMENT BY 1 MINVALUE -999999999999999 MAXVALUE 999999999999999 NO CYCLE)
)
UNIQUE PRIMARY INDEX (NATURAL_KEY);
If we need to reload the key table the following DDL is required:
CREATE MULTISET TABLE TheKeys
(
  NATURAL_KEY VARCHAR(500) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
  SURROGATE_KEY BIGINT GENERATED BY DEFAULT AS IDENTITY
    (START WITH 200001 INCREMENT BY 1 MINVALUE 200001 MAXVALUE 999999999999999 NO CYCLE)
)
UNIQUE PRIMARY INDEX (NATURAL_KEY);
There are several reasons why it is a good idea to replace the natural keys with Teradata surrogate keys.
You may have more than one source system delivering the same value in the natural key for the same target table but with a different meaning.
This fact requires you to use surrogate keys because keeping the natural key from both sources would not allow you to distinguish the objects.
But it can also be that the same information comes from different source systems with different natural keys. For example, the customers of a bank may be kept in different source systems. If it is necessary to represent these as a single customer in the data warehouse, surrogate keys are perfectly suited for this.
Unfortunately, many Teradata data warehouse implementations decide against surrogate keys and put the effort of integrating the information into the reports.
This approach ultimately causes more costs than a proper surrogate key design from the beginning (don't forget, such an approach requires repeating the logic of combining object information in each and every report).
If the natural keys of the source system are used and the original system is later replaced, you end up in a bad situation.
Let's compare the effort for both cases.
If we have surrogate keys available, only the mapping table (natural keys -> surrogate keys) has to be adjusted: the existing surrogate keys are assigned once to the new natural keys.
If we only have the natural keys available in our data warehouse, this usually means a complete re-design. The old natural keys must be replaced everywhere by the new natural keys!
In general, you should never tightly couple your data model to the source system structures. An invoice table should be called Invoice and not something like SAP-Invoice. Data warehousing is not the 1:1 storage of operational data.
At the beginning of a project, it always seems more agile to avoid surrogate keys and integrate source system tables 1:1.
However, this only shifts costs to the end of the chain. Changes that have to be made afterward are many times more expensive!
Surrogate keys, if properly used, may increase your performance as you can replace several natural key columns by an integer column.
Often, natural key columns are character columns. From a performance point of view, this makes no difference for the hashing of the primary index. Where there is a negative effect on performance, however, is with joins, since Teradata must take character sets, etc. into account here.
Depending on the number of columns and data types of the natural key, this can have a negligible impact on performance or a noticeable effect.