Teradata Architecture Is So Famous, But Why?

By Roland Wenzlofsky

December 3, 2019

AMP, BYNET, Parsing Engine, PE

Before we go into Teradata architecture in more detail, we need to talk about how a computer is built. After all, this is also the basis of a Teradata system.

Teradata Architecture – Why Does Everyone Copy It?

Teradata, as one of the pioneers in data warehousing, was and is a role model for many subsequent database systems in terms of architecture.

Even though Teradata has been around for many years, its developers considered many details from the very beginning that still keep the system competitive today.

If we look at modern database systems such as Amazon Redshift or Netezza, we can recognize many concepts that Teradata used for the first time.

Teradata was designed from the beginning for parallelism down to the smallest detail and is therefore still found among the top RDBMS for data warehousing today.

Single Computer

Data is permanently stored on mass storage devices and loaded into the main memory for processing by the CPU.

It is important to understand that accessing the mass storage device is much slower than accessing the main memory. Further, accessing the main memory is much slower than accessing data already in one of the CPU caches.

Before data can be processed by the CPU, it must be loaded into the main memory.
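
To give a feel for the gap, here is a small Python sketch with rough, commonly cited order-of-magnitude access times (ballpark figures for illustration only, not measurements of any particular system):

# Rough orders of magnitude -- illustrative only, not measured values.
ACCESS_TIME_NS = {
    "CPU cache (L1)": 1,
    "main memory": 100,
    "SSD": 100_000,
    "spinning disk": 10_000_000,
}

for layer, ns in ACCESS_TIME_NS.items():
    factor = ns / ACCESS_TIME_NS["main memory"]
    print(f"{layer:>15}: ~{ns:>12,} ns ({factor:,.2f}x main memory)")

Every step away from the CPU is orders of magnitude slower, which is why a database tries to keep as much work as possible in memory and in the CPU caches.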

The Teradata architecture can be easily imagined as a number of individual computers that can communicate with each other:

Teradata System

Teradata Data Distribution

To split the workload, Teradata uses a hashing algorithm that distributes the rows of each table evenly among the so-called AMPs. We will talk later in this article about exactly what an AMP is and what its tasks are; for now, it is sufficient to know that the AMPs do the main work.

Data Distribution by Hashing
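
The following Python sketch illustrates the principle under simplifying assumptions: it uses a plain hash-modulo mapping and a toy number of AMPs, whereas real Teradata derives a row hash from the primary index value and maps hash buckets to AMPs via a hash map.

import hashlib
from collections import Counter

NUM_AMPS = 8  # toy system; production systems have far more AMPs

def amp_for_row(primary_index_value: str, num_amps: int = NUM_AMPS) -> int:
    # Derive a stable hash from the primary index value and map it to an AMP.
    # Simplification: modulo instead of Teradata's hash-bucket-to-AMP hash map.
    row_hash = int.from_bytes(hashlib.md5(primary_index_value.encode()).digest()[:4], "big")
    return row_hash % num_amps

# With distinct primary index values, rows spread evenly over the AMPs.
distribution = Counter(amp_for_row(f"customer_{i}") for i in range(100_000))
print(distribution)  # roughly 12,500 rows per AMP

The even spread is the whole point: if one AMP owned far more rows than the others, it would finish last and drag down the entire query.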

The Parsing Engine

An important part of the Teradata architecture is the Parsing Engine (PE).

The Parsing Engine receives a request (e.g. an SQL statement) and generates an execution plan for all AMPs that are required to complete the request. Ideally, the plan is structured so that all AMPs start and finish their tasks at the same time. This ensures optimal parallel utilization of the system.

The Parsing Engine controls the AMPs

As you can see in the figure above, between the AMPs and the Parsing Engine sits the BYNET, the communication network over which both data and instructions are exchanged. We will talk about the BYNET in detail later in this article.
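
As a toy model of this interplay, the Python sketch below lets a Parsing Engine broadcast each plan step to all AMPs at once and wait until every AMP has finished before the next step begins. All names (Amp, execute_step, parsing_engine) are illustrative, not Teradata internals.

from concurrent.futures import ThreadPoolExecutor

class Amp:
    def __init__(self, amp_id: int):
        self.amp_id = amp_id

    def execute_step(self, step: str) -> str:
        # In a real system this would scan, join, aggregate, or redistribute rows.
        return f"AMP {self.amp_id}: {step} done"

def parsing_engine(amps: list, plan: list) -> None:
    # The thread pool stands in for the BYNET that carries instructions and data.
    with ThreadPoolExecutor(max_workers=len(amps)) as bynet:
        for step in plan:
            # Broadcast the step to every AMP; consuming all results with list()
            # makes the loop wait until every AMP has finished the current step.
            results = list(bynet.map(lambda amp: amp.execute_step(step), amps))
            print(results)

parsing_engine([Amp(i) for i in range(4)], ["scan orders", "aggregate by region"])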

The Parsing Engine has the following main tasks:

  • Logging sessions on and off
  • Parsing of requests (syntax check, checking authorizations)
  • Preparation and optimization of the execution plan (using statistics to build an optimized plan)
  • Controlling the AMPs by sending them instructions
  • Communication with the client software
  • EBCDIC-to-ASCII conversion in both directions
  • Transfer of the result of a request to the client tool

Each Teradata System can use multiple parsing engines.

Because each Parsing Engine can only handle a limited number of sessions, the number of Parsing Engines can be increased as needed.

Currently, each Parsing Engine can manage up to 120 sessions. These can be sessions of different users, or 120 sessions of the same user.
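
Here is a minimal Python sketch of this limit, assuming a simple least-loaded assignment policy; the policy and all names are illustrative, only the 120-session ceiling comes from the paragraph above.

MAX_SESSIONS_PER_PE = 120  # per-Parsing-Engine limit mentioned above

class ParsingEngine:
    def __init__(self, pe_id: int):
        self.pe_id = pe_id
        self.sessions = []

def logon(parsing_engines, user: str) -> str:
    # Illustrative policy: place the new session on the least-loaded Parsing Engine.
    pe = min(parsing_engines, key=lambda p: len(p.sessions))
    if len(pe.sessions) >= MAX_SESSIONS_PER_PE:
        raise RuntimeError("all Parsing Engines have reached the 120-session limit")
    session_id = f"PE{pe.pe_id}-{user}-{len(pe.sessions) + 1}"
    pe.sessions.append(session_id)
    return session_id

pes = [ParsingEngine(i) for i in range(2)]
print(logon(pes, "dwh_user"))  # e.g. "PE0-dwh_user-1"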

The Teradata AMP

AMPs are the real workers in a Teradata system; they execute the instructions (the execution plan) they receive from the Parsing Engine.

AMPs are independent units that have their own main memory and mass storage allocated to them.

The allocation is exclusive, i.e. no AMP has access to the resources of another AMP.

These are the main tasks of an AMP:

  • Storing and retrieving of rows
  • Sorting of rows (for details read How Teradata sorts the result set)
  • Aggregation of rows
  • Joining of tables (see also: The Essential Teradata Join Methods)
  • Locking of tables and rows
  • Output conversion from ASCII to EBCDIC (if the client is a mainframe)
  • Management of its assigned space
  • Sending of rows to the Parsing Engine or other AMPs (via the BYNET)
  • Accounting
  • Recovery handling
  • Filesystem management

Each AMP can perform multiple tasks simultaneously. By default, there are 80 tasks that can be executed in parallel.

The Teradata Node

Parsing Engines and AMPs are processes that run on a node. A node is usually a Linux machine equipped with multiple physical CPUs.

Each node can run hundreds of AMPs. Each AMP has its own portion of the main memory and its own portion of mass storage (called a virtual disk).

The Teradata Node

Nodes are connected to a disk array, and each AMP is assigned a part of it as a logical disk. Nowadays, SSDs are used and management is done by the Teradata Intelligent Memory system, but the principle is the same.

Node with Disk Array managed by Teradata Intelligent Memory
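
The Python sketch below models this shared-nothing layout under simple assumptions (toy memory and AMP counts, illustrative class names): each AMP owns a private slice of the node's memory and its own logical disk, and never touches another AMP's resources.

from dataclasses import dataclass, field

@dataclass
class Amp:
    amp_id: int
    memory_mb: int                               # private share of the node's main memory
    vdisk: list = field(default_factory=list)    # rows on its own logical disk

@dataclass
class Node:
    node_id: int
    amps: list

def build_node(node_id: int, amps_per_node: int, node_memory_mb: int) -> Node:
    # Split the node's memory evenly; each AMP also gets its own (initially empty) vdisk.
    per_amp = node_memory_mb // amps_per_node
    return Node(node_id, [Amp(i, per_amp) for i in range(amps_per_node)])

node = build_node(node_id=1, amps_per_node=30, node_memory_mb=512_000)
print(len(node.amps), node.amps[0].memory_mb)  # 30 AMPs with roughly 17 GB each (toy numbers)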

Massive Parallel Processing

A Teradata system can consist of a large number of nodes. These, in turn, are connected to each other via BYNET.

However, the BYNET connecting the nodes is a physical network, while the BYNET within a node, which connects the AMPs with the Parsing Engine and with each other, is implemented in software:

Two Nodes connected by the hardware BYNET; within each Node, the BYNET is software
See how Hashing is done on Teradata

Another view on the Teradata Design by TutorialsPoint

Roland Wenzlofsky


Roland Wenzlofsky is a graduate computer scientist and data warehouse professional who has been working with the Teradata database system for more than 20 years. He is experienced in the fields of banking and telecommunications, with a strong focus on performance optimization.
