How Advanced Is The Teradata Columnar Solution?
Three properties characterize a Column Store:
- Improved compression compared to Row Stores.
This results from considerably better compression rates that we can achieve by packing columns into data blocks. Since the values of a column come from a domain, the entropy is much lower. - Columnar projection
Only the columns that are required are read. - Columnar procession
This is only reserved for actual Column Stores. Teradata forms rows from the columns and then operates on them using the traditional row-based engine. It is, therefore, not a pure column store.
Why Is The Teradata Solution Great Even Though It Does Not Offer Columnar Processing?
Let’s assume a typical query. Often a selection is made from a date range where a date restriction exists on the WHERE condition.
But at the same time, often, only a part of the existing rows of a table is needed.
Teradata offers precisely this: Column partitions and row partitions simultaneously!
In Teradata, subsets of rows and columns can be defined as partitions. A pure column store keeps the data per column in a partition. On the other hand, Teradata allows several columns to be defined as a column partition and simultaneously partitioned by rows.
The Advantages Of Teradata Columnar
Teradata offers two ways to store columnar data. In the COLUMN format, Teradata offers many advantages of a pure column store:
- Columns can be considered arrays with all the advantages, such as direct access to elements via the Index, simultaneous access to multiple array elements via Single Instruction Multiple Data (SIMD) operations, etc.
- Significant compression due to low entropy of column values (Dictionary Compression, Run-Length Encoding)
But saving in COLUMN format also has disadvantages since Teradata must search for the desired Index in the column for a specific row.
Therefore Teradata offers the ROW format as an alternative. Here the ROWID is stored with each column value. This allows Teradata to access the value directly as usual.
ROW Or COLUMN Format?
The following applies:
If many columns are selected, the ROW format is more suitable, and the overhead for storing the ROWID per column value is negligible.
If a few columns are selected, the COLUMN format is typically more suitable (better compression rate, SIMD processing).
Primary Index Or Primary AMP Index Access
A further advantage of Teradata’s hybrid solution is that Teradata can also access the data via the Primary Index or Primary AMP Index. The Primary Index allows direct access to rows via hashing. The Primary AMP Index limits the work to AMPs containing the required data.
Should I Use From Now On Only Column Partitioned Tables?
Not at all. Column partitioning is ideal for reading access, but performance is worse for DML statements than Row-Partitioned Tables.
What Is The Biggest Disadvantage Of The Teradata Solution For Column Partitioning?
The many possibilities to create tables are tempting but require a lot of experience to get the best performance out of them. Some typical questions are:
- Row partitioning, column partitioning, or both?
- Column partitioning using ROW or COLUMN format?
- Maybe indexing with USI, NUSI, or Join Index is better?
- Column partitioning with NOPI, Primary Index, or Primary AMP Index?
Final Considerations
So we can optimize a lot in a Teradata system. This makes it insanely powerful but also requires a lot of experience.
Suppose you compare this with a pure column store like Snowflake. The main possibilities for optimization are executing an additional cluster (as a virtual warehouse to run several queries in parallel) or the enlargement of an existing EC2 instance (to speed up the queries). In that case, you have to think carefully about a more suitable database system.
One must be clear. Modern cloud databases mean that you exchange all these optimization possibilities (or requirements) for elasticity. In other words, you don’t optimize anymore. You rent additional computing power.
And in the end, you trade “total cost of ownership” or capital expenditures for operational expenditures.
Be Careful With The Cloud
But my biggest problem with cloud databases is: Who says that the price models (billing by the minute, etc.) will remain the same forever? Let’s look at the development of tariffs in the mobile sector and how creative the operators have become. I would be cautious.