We are strongly convinced that a deep understanding of a Teradata system’s technical details is necessary to become an expert in performance optimization. Many of us are well-informed about what optimization opportunities exist. Still, to solve the most challenging performance problems, it is sometimes necessary to understand in detail how the data is stored internally.
Traditional, only hard disks have been used to store data. Even though lately, the trend goes to solid-state disks, we will still see the hard drive for quite a while.
Current Teradata systems use a combination of solid-state disks and hard drives to store the data.
Hard drives are divided into cylinders, sectors, and blocks located inside the sectors, and this is where our journey into the innermost of the Teradata data storage system begins.
Each data record is stored in a block that belongs to a specific cylinder on the AMPs disk.
Teradata Physical Storage – How can a table row be accessed?
Each request sent from the Parsing Engine to the AMPs carries the internal TABLEID, which uniquely identifies each table within a Teradata system. Two indexes are used to track down any data record to its location on the disk:
The Teradata Master Index
The Master Index is the main index that stores information about which cylinder contains a table’s rows. To prevent that, all cylinders have to be read to find only one data row (for example, primary index access), the cylinder index holds the lowest ROWID and the highest ROWHASH value per cylinder.
The master index keeps a sub-index with all free cylinders. This sub-index is used to find free cylinders for write operations (inserts, updates).
The master index is locally available on each AMP and always cached in the FSG cache as it is used each time data has to be located.
The Teradata Cylinder Index
As soon as the AMP found the cylinders holding the table records (by querying the master index), the cylinder index is used to find the cylinder sectors holding the data blocks, finally containing the data rows it needs. For each data block of a table, the cylinder index stores the first and last sectors.
As the cylinder index is much bigger than the master index, it may not be kept entirely in the FSG cache.
Like in the master index, the lowest ROWID and the highest ROWHASH are available to restrict access to sectors that hold required records. The cylinder index keeps a sub-index with all free disk sectors.
As soon as the AMP knows the first and last sector of the data block holding the required data record(s), the data block is ready for being moved to the FSG cache where select, update, insert or delete activities can take place.
The Binary Search
Finally, the AMP will locate the requested data record(s) by doing a binary search on the data block(s).
The binary search algorithm finds a row within a sorted pointer array of rowids by comparing in steps the searched rowid value with the rowid value of the central element of the pointer array.
Each pointer is marking the starting position of a table row. If the pointer array’s central element is the rowid we searched for, the row can be accessed via the position the entry points to. Otherwise, if the rowid is less than the central element’s rowid, then the algorithm repeats its action on the elements to the left of the middle element or, if the search key is greater, on the elements to the right. This algorithm runs in a loop until the searched rowid is found, or there are no more elements available for searching, which indicates that the row is not available.
In the second part of the articles series “The ultimate Teradata Physical Storage guide”, we will take a closer look at the structure of data blocks, how records are stored within the blocks, and how Teradata handles different data block size requirements.
Further, we will show you how Teradata does data block housekeeping by adding, shrinking, and splitting blocks on demand to minimize fragmentation and keep the system as fast as possible.