In the second part of this series on Teradata Physical Storage, let me take over from Roland to continue where we stopped in part 1. We will enter the world of data blocks and dive deeper into the sequence of actions and consequences for the lives of rows populating data blocks.
Recall that each AMP holds rows from all tables, but the rows of different tables are never put together in the same data block.
Using an analogy to illustrate this, think of an AMP as one federal state (of some union of countries that make up your system), a table as a tie for rows much like a tribe or ethnic group, only more binding, and the individuals belonging to it as the rows. Then, the data block can be seen as a gated community concept resulting from strict social stratification at the local level. So while a person (row) comes together with others from all kinds of backgrounds to perform tasks (query) in every state (AMP), people from different tribes (tables) never reside in the same quarter (blocks).
What does the interior of such a gated community, i.e., block like?
The first thing we come across is a Table Header – much like a place name sign. An Array of 2-byte Pointers works much like a detailed, personalized road map, that shows where every row beginning can be found. It is always sorted.
The concrete sorting of the rows upon arrival at the block – translate this to the question of who moves in what house next to whom when the community is opened – is determined by the loading mechanism applied or, more generally, the way the row was told to be stored for the table. Fastload and Multiload inserts lead to rows being stored in ascending order by their ROWHASH value while rows arriving via SQLs insert, updates of TPUMP will populate the data block wherever space is free at the time they arrive at the block. The ROWHASH value could be seen as a passport ID in our analogy.
Data blocks come in a standard size of at least 512 bytes each upon creation. Every community is founded on a piece of land of equal minimum size. Maximum tolerance for crowding policy is in place that foresees a sudden increase of community land every time the population grows up to the point where every address dwells. A block with no more (predicted, standard) space is then added another 512 bytes and growth can continue. This enlargement can continue up to a maximum block size of traditionally 255 sectors of 255 bytes each before the block is ultimately split into two blocks.
These standard size thresholds change as new Teradata versions are introduced, much like a government authority can change its definition, policy or budget for housing and community size regulations. In these days of ever-larger data amounts that translate into ever broader table rows, the general tendency is upward, much like the demographic trend of rising population and obesity becoming an issue in developed, well-nourished countries.
Depending on the particular usage of storage, the arrival of a new row can trigger one of the following activities for a block:
If there is enough free space in one end-to-end chunk, the row can be stored smoothly. Having been stored, the Pointer Array and the Cylinder Index are refreshed to reflect the change.
If, on the contrary, there is free space for housing another row as such, but not in one piece anymore, a defragmentation process tidies up storage space usage before the row can be added. This would be the community chief walking from tent to tent, asking the Meyers to move a few meters to the left and the Millers to park their car at the other side of the street, over the entire community until free space is one chunk where the newcomers can settle.
If even the most efficient use of space reveals that there is no more room to settle one more row, a new 512 bytes piece of land is declared new settlement ground for the community, but not without the Cylinder Index, much like a government agency for housing, land use and resettlement affairs, being contacted first to ask for an uninhabited new ground to settle the entire current community and another 512 bytes for future row generations. The Cylinder Index Free List returns such a piece of land and, as a consequence, the block moves to the new location. The land once used by the block is categorized as free again. Furthermore, adjacent free blocks or sectors are merged to one more substantial piece of free land and are then kept in the Cylinder Index Free List as one entry.
If the enlarged block to move is so large that there are not enough adjacent sectors in a cylinder, all existing blocks of a cylinder have to move such that a larger chunk of space is created and the enlarged original block to move finds its space. Imagine having to relocate the bigger-to-be state’s capital. For this to succeed without transgressing the county limits, all towns have to move closer to each other and to the north so that the new larger capital finds a resulting piece of land.
Having reached the point where even the cylinder is about to be consumed up, Teradata can move up to 10 blocks to a neighboring cylinder or a free cylinder if the adjacent cylinder operates under the same scarcity of space.
Image now the opposite situation of a large number of defined blocks, but each of them only sparsely populated with rows. Blocks of tables after gross delete operations can be in such a state. In our analogy, this would be a community after a sudden rise in mortality. Here, the issue is not to enlarge blocks, but rather to merge them together. From Teradata 13.10 on, within the sphere of a cylinder, up to 8 blocks can be merged together automatically. It is as if the government of Norseland told the small communities of Norrmen of chief Erik, those of chief Einar and those of chief Björn to please form one community of Norrmen and free the ancient land of their chiefdom each.
The process of dealing with automatic merging of blocks to economize is called MiniCylpack. It works over the cylinders over one AMP. An all-AMP equivalent that has to be triggered manually is the Packdisk process. With the introduction of Teradata 13.10, AutoCylpack does what Packdisk on a system level whenever it is determined as idle and therefore without manual intervention by database administrators.
We are firmly convinced that a deep understanding of the technical details of a Teradata system is necessary to become an expert in performance optimization. Many of us are well-informed about what optimization opportunities exist, but to solve the toughest performance problems, it is sometimes necessary to understand in detail how the data is stored internally.
Traditional only hard disks have been used to store data. Even though lately the trend goes to solid state disks, we will still see hard drive in use for quite a while.
Current Teradata systems use a combination of solid state disks and hard drive to store the data.
Hard drives are divided into cylinders, sectors, and blocks located inside the sectors, and this is where our journey into the innermost of the Teradata data storage system begins.
Each data record is stored in a block which belongs to a certain cylinder on the AMPs disk.
Each request sent from the Parsing Engine to the AMPs carries the internal TABLEID which uniquely identifies each table within a Teradata system. Two indexes are used to track down any data record to its location on the disk:
The Master Index is the main index which stores information about which cylinder contains the rows of a table. To prevent that all cylinders have to be read to find only one data rows (for example primary index access), the cylinder index holds the lowest ROWID and the highest ROWHASH value per cylinder.
The master index keeps a sub-index with all free cylinders. This sub-index is used to find free cylinders for write operations (inserts, updates).
The master index is locally available on each AMP and always cached in the FSG cache as it is used each time data has to be located.
As soon as the AMP found the cylinders holding the table records (by querying the master index), the cylinder index is used to find the cylinder sectors holding the data blocks, which are finally containing the data rows it needs. For each data block of a table, the cylinder index stores the first and the last sector used.
As the cylinder index is much bigger than the master index, it may not be kept entirely in the FSG cache.
Like in the case of the master index, the lowest ROWID and the highest ROWHASH are available to restrict access to sectors which hold required records. The cylinder index keeps a sub-index with all free disk sectors.
As soon as the AMP knows the first and last sector of the data block holding the required data record(s), the data block is ready for being moved to the FSG cache where select, update, insert or delete activities can take place.
Finally, the AMP will locate the requested data record(s) by doing a binary search on the data block(s).
The binary search algorithm finds a row within a sorted pointer array of rowids by comparing in steps the searched rowid value with the rowid value of the central element of the pointer array.
Each pointer is marking the starting position of a table row. If the central element of the pointer array is the rowid we searched for, the row can be accessed via the position the entry points to. Otherwise, if the rowid is less than the central element’s rowid, then the algorithm repeats its action on the elements to the left of the middle element or, if the search key is greater, on the elements to the right. This algorithm runs in a loop until the searched rowid is found or there are no more elements available for searching which indicates that the rowed is not available.
In the second part of the articles series “The ultimate Teradata Physical Storage guide” we will take a closer look at the structure of data blocks, how records are stored within the blocks and how Teradata handles different requirements regarding data block size.
Further, we will show you how Teradata does data block housekeeping by adding, shrinking and splitting blocks on demand to minimize fragmentation and keep the system as fast as possible.
Certainly most of us have seen the Teradata DATABLOCKSIZE option in the CREATE TABLE statement, but I suppose only a few of us are familiar with the usage of this option. This article should give you a short overview how you can use it to minimize your disk IO’s.
Teradata stores data rows (or column containers) in data blocks and each data block is composed of several sectors on a disk cylinder (for details about how disks are constructed take a look at http://en.wikipedia.org/wiki/Disk_cylinder )
Traditionally, each data block only store rows from the same table. Rows from different tables are never mixed up in one data block. For this reason each data block holds the table id in the block header which uniquely identifies the related table.
Since Teradata 14.10 and with the availibility of columnar storage, values of one or several columns of several rows are packed together into the same data block.
The DATABLOCKSIZE basically determines how many sectors piece together one data block. The default size for one sector is 512 bytes. Each time a sector is full, Teradata adds another 512 Bytes sector to the data block until the maximum block size is reached.
After a data block reached the maximum block size, Teradata will split the block into two blocks.
What does this all mean for us? Basically you can store the same amount of rows within a few big data blocks or many small ones. Smaller data blocks will result in more frequent block splits if rows are inserted.
Which data block size is the best depends on the requirements of your workload.
Tables which are accessed most of the time by full table scans should probably have a big data block size. This way, many rows can be moved efficiently into the AMPs memory at once.
On the other hand, if you have a lot of unique index accesses on your table which means only a few rows are accessed, it will be better to define small data blocks in order to avoid to clutter up the AMPs memory each time you retrieve the data.