How Advanced Is The Teradata Columnar Solution?
A Column Store is basically characterized by 3 properties:
- Improved compression compared to Row Stores.
This simply results from the fact that by packing columns into data blocks, considerably better compression rates can be achieved. Since the values of a column come from a domain, the entropy is much lower.
- Columnar projection
Only the columns that are required are read
- Columnar procession
This is only reserved for real Column Stores. Teradata forms rows from the columns and then operates on them using the traditional row-based engine. It is therefore not a pure column store.
Why Is The Teradata Solution Great Even Though It Does Not Offer Columnar Processing?
Let's assume a typical query. Often a selection is made from a date range where a date restriction exists on the WHERE condition.
But at the same time often only a part of the existing rows of a table is needed.
Teradata offers exactly this: Column partitions and row partitions at the same time!
In Teradata, subsets of rows and columns can be defined as partitions. A pure column store keeps the data per column in a partition. Teradata, on the other hand, allows several columns to be defined as a column partition and simultaneously partitioned by rows.
The Advantages Of Teradata Columnar
Teradata offers two ways to store columnar data. In the COLUMN format Teradata offers many advantages of a pure column store:
- Columns can be considered as arrays with all the advantages that come with that, such as direct access to elements via the index, simultaneous access to multiple array elements via Single Instruction Multiple Data (SIMD) operations, etc.
- Great compression due to low entropy of column values (Dictionary Compression, Run-Length Encoding)
But saving in COLUMN format also has disadvantages, since the desired index must be searched for in the column for a specific row.
Therefore Teradata offers the ROW format as an alternative. Here the ROWID is stored with each column value. This allows Teradata to access the value directly as usual.
ROW Or COLUMN Format?
The following applies:
If many columns are selected, the ROW format is more suitable, the overhead for storing the ROWID per column value is then negligible.
If a few columns are selected, the COLUMN format is typically more suitable (better compression rate, SIMD processing).
Primary Index Or Primary AMP Index Access
A further advantage of the hybrid solution from Teradata is that the data can also be accessed via the Primary Index or Primary AMP Index. The Primary Index allows direct access to rows via hashing, the Primary AMP Index limits the work to AMPs that contain the required data.
Should I Use From Now On Only Column Partitioned Tables?
Not at all. Column partitioning is ideal for reading access, but performance is worse for DML statements than Row-Partitioned Tables.
What Is The Biggest Disadvantage Of The Teradata Solution For Column Partitioning?
The many possibilities to create tables are tempting but require a lot of experience to get the best performance out of them. Some typical questions are:
- Row partitioning, column partitioning or both?
- Column partitioning using ROW or COLUMN format?
- Maybe indexing with USI, NUSI, Join Index is a better choice?
- Column partitioning with NOPI, Primary Index or Primary AMP Index?
So a lot can be optimized in a Teradata system. This makes it insanely powerful but also requires a lot of experience.
If you compare this with a pure column store like Snowflake where the main possibilities for optimization are the execution of an additional cluster (as a virtual warehouse to execute several queries in parallel) or the enlargement of an existing EC2 instance (to speed up the queries), you have to think carefully about which is the more suitable database system.
At the end of the day, one must be clear. Modern cloud databases mean that you exchange all these optimization possibilities (or requirements) for elasticity. In other words, you don't optimize anymore, you simply rent additional computing power.
And in the end, you trade “total cost of ownership” or capital expenditures for operational expenditures.
Careful With The Cloud
But my biggest problem with cloud databases is: Who says that the price models (billing by the minute, etc.) will remain the same forever? Let's take a look at the development of tariffs in the mobile sector and how creative the operators have become I would be cautious.