Before we go into Teradata architecture in more detail, we need to talk about how a computer is built. After all, this is also the basis of a Teradata system.
Teradata Architecture – Why Does Everyone Copy It?
As one of the pioneers in data warehousing, Teradata was and is a role model for many subsequent database systems in terms of architecture.
Even though Teradata is getting on in years, its developers considered many details from the very beginning that still keep it competitive today.
If we look at modern database systems such as Amazon Redshift or Netezza, we can recognize many concepts that Teradata was the first to use.
Teradata was designed for parallelism down to the smallest detail from the start and is still among the top RDBMSs for data warehousing.
Data is permanently stored on mass storage devices and loaded into main memory for processing.
It is important to understand that accessing a mass storage device is much slower than accessing main memory. Likewise, accessing main memory is much slower than accessing data that is already in one of the CPU caches.
In other words, the CPU can only process data that has first been loaded into main memory.
The Teradata architecture can be pictured as a number of individual computers that communicate with each other:
Teradata Data Distribution
To split the workload, Teradata uses a hashing algorithm that distributes the rows of each table evenly among the so-called AMPs. We will explain later in this article exactly what an AMP is and what its tasks are; for now, it is sufficient to know that the AMPs do the main work.
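The idea of hash-based row distribution can be illustrated with a minimal sketch. It is not Teradata's actual algorithm: `zlib.crc32` stands in for Teradata's hash function, the modulo stands in for its mapping from hash value to AMP, and `NUM_AMPS`, `amp_for` and the customer ids are hypothetical names chosen for this example.

```python
import zlib
from collections import Counter

NUM_AMPS = 4  # hypothetical system size


def amp_for(key_value: str, num_amps: int = NUM_AMPS) -> int:
    """Map the value of a row's distribution column to an AMP number.

    Assumption: crc32 and the modulo are stand-ins for Teradata's
    own hashing algorithm and its hash-to-AMP mapping.
    """
    row_hash = zlib.crc32(key_value.encode("utf-8"))
    return row_hash % num_amps


# Distribute 10,000 hypothetical customer ids and count rows per AMP.
rows_per_amp = Counter(amp_for(f"customer-{i}") for i in range(10_000))
for amp in sorted(rows_per_amp):
    print(f"AMP {amp}: {rows_per_amp[amp]} rows")
```

Because the AMP is derived only from the hashed value, the same value always lands on the same AMP, and a column with many distinct values tends to spread the rows fairly evenly.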
The Parsing Engine
An important part of the Teradata architecture is the Parsing Engine (PE).
The Parsing Engine receives a request (e.g., an SQL statement) and generates an execution plan for all AMPs that are required to complete the request. Ideally, the plan is structured so that all AMPs start and finish their tasks simultaneously. This ensures optimal parallel utilization of the system.
As you can see in the figure above, the BYNET sits between the AMPs and the Parsing Engine. It is the communication network over which both data and instructions are exchanged. We will discuss the BYNET in detail later in this article.
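Why it matters that all AMPs finish at the same time can be shown with a little arithmetic. The sketch below assumes, as a simplification, that the work of an AMP is proportional to the number of rows it holds; `parallel_efficiency` and the row counts are hypothetical.

```python
def parallel_efficiency(rows_per_amp: list[int]) -> float:
    """Ratio of the average AMP workload to the largest AMP workload.

    1.0 means every AMP finishes at the same time; lower values mean
    the other AMPs sit idle while the busiest one is still working.
    """
    busiest = max(rows_per_amp)
    average = sum(rows_per_amp) / len(rows_per_amp)
    return average / busiest


even = [2_500, 2_500, 2_500, 2_500]    # hypothetical, perfectly even
skewed = [7_000, 1_000, 1_000, 1_000]  # hypothetical, one overloaded AMP

print(parallel_efficiency(even))    # 1.0
print(parallel_efficiency(skewed))  # 2500 / 7000, roughly 0.357
```

In the skewed case the request takes as long as the busiest AMP needs, so roughly two thirds of the available capacity goes unused.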
The Parsing Engine has the following main tasks:
- Logging sessions on and off
- Parsing requests (syntax check, checking authorizations)
- Preparing and optimizing the execution plan
  - The Parsing Engine uses statistics to build an optimized plan.
- Controlling the AMPs through instructions
- Communicating with the client software
- EBCDIC to ASCII conversion in both directions
- Transferring the result of a request to the client tool
Each Teradata System can use multiple parsing engines.
The number of parsing engines can be increased by the system as needed because each parsing engine can only process a limited number of sessions.
Currently, each parsing engine can manage up to 120 sessions. These can be sessions of different users, but they can also be 120 sessions of the same user.
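The session limit translates directly into a minimum number of parsing engines. A minimal sketch, assuming the limit of 120 sessions stated above; the function name `parsing_engines_needed` is hypothetical:

```python
import math

SESSIONS_PER_PE = 120  # limit per parsing engine as stated in the text


def parsing_engines_needed(concurrent_sessions: int) -> int:
    """Smallest number of parsing engines that can hold the sessions."""
    if concurrent_sessions < 0:
        raise ValueError("concurrent_sessions must not be negative")
    return math.ceil(concurrent_sessions / SESSIONS_PER_PE)


print(parsing_engines_needed(120))  # 1
print(parsing_engines_needed(121))  # 2
print(parsing_engines_needed(500))  # 5
```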
The Teradata AMP
AMPs are the real workers in a Teradata system: they execute the instructions (the execution plan) that they receive from the Parsing Engine.
AMPs are independent units that have their own main memory and mass storage allocated to them.
The allocation is exclusive, i.e., no AMP has access to the resources of another AMP.
These are the main tasks of an AMP:
- Storing and retrieving rows
- Sorting rows (for details, read How Teradata sorts the result set)
- Aggregating rows
- Joining tables (see also: The Essential Teradata Join Methods)
- Locking tables and rows
- Output conversion from ASCII to EBCDIC (if the client is a mainframe)
- Managing its assigned space
- Sending rows to the Parsing Engine or other AMPs (via the BYNET)
- Recovery handling
- Filesystem management
Each AMP can perform multiple tasks simultaneously. By default, 80 tasks can be executed in parallel.
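Because every AMP owns its rows exclusively, a task such as aggregation can start on each AMP without looking at any other AMP's data. The following simplified sketch illustrates this principle only and is not Teradata's internal implementation; the table contents and the names `amp_rows` and `aggregate_locally` are hypothetical, and the AMPs are processed one after another here, whereas a real system would run them concurrently.

```python
from collections import Counter

# Hypothetical rows already distributed to three AMPs: (region, amount).
amp_rows = {
    0: [("north", 10), ("south", 5)],
    1: [("north", 7)],
    2: [("south", 3), ("east", 4)],
}


def aggregate_locally(rows):
    """Sum amounts per region using only one AMP's own rows."""
    totals = Counter()
    for region, amount in rows:
        totals[region] += amount
    return totals


# Step 1: each AMP works on its own rows, independently of the others.
partials = [aggregate_locally(rows) for rows in amp_rows.values()]

# Step 2: the partial results are combined into the final answer.
final = sum(partials, Counter())
print(dict(final))  # {'north': 17, 'south': 8, 'east': 4}
```

Only the small partial results have to be exchanged in step 2, not the underlying rows.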
The Teradata Node
Parsing engines and AMPs run on a node. A node is usually a Linux machine equipped with multiple physical CPUs.
Each node can run hundreds of AMPs. Each AMP has its own portion of the main memory and its own portion of mass storage (called a virtual disk).
Nodes are connected to a disk array, and each AMP is assigned a part of it as a logical disk. Nowadays, SSDs are used, and the Teradata Intelligent Memory system handles the management, but the principle is the same.
Massive Parallel Processing
A Teradata system can consist of a large number of nodes. These, in turn, are connected via the BYNET.
Between nodes, however, the BYNET is a physical network, whereas the BYNET within a node, which connects the AMPs with the Parsing Engine and with each other, is implemented in software: