
First the PDM, then Compression

The importance of the PDM

Compression is a popular way of saving I/Os because it allows packing more rows into each data block. Data blocks are the smallest unit transferred between hard drives and main memory, and this transfer is the most expensive operation in any database system.

Compression can be applied automatically by the system (block level compression) or manually by the user (multivalue compression). Another option is third-party compression software, such as the tool offered by Prise Ltd. (check it out at https://www.prisetools.com/products/compress-wizard).

While developers tend to use Multivalue Compression (MVC), other opportunities to cut resource usage are often overlooked or left out for various reasons. Very often the PDM, i.e. the proper choice of data types and character sets, is neglected.

First the PDM, then Compression

Compression depends on the underlying data types. Therefore, before starting any compression activity, we have to make sure that the correct data types and character sets are used in the PDM. Otherwise, any redesign of the PDM will force us to rework compressions.

While the correct choice of data types and compression both help to cut resource usage, the right choice of data types and character sets has some advantages that compression does not have.

The right data types reduce resource usage independently of the data content, while compression brings no space or resource benefit for unique columns.

The choice of correct data types always helps to cut resource usage (by packing more rows into each data block).

The selection of the right data types and character sets is part of physical data modeling. Unfortunately, this step is often skipped because it causes costs early in the implementation process.

Many times, data types are taken over directly from the staging area into the PDM of your core model, and from there they are moved unchanged into the data marts.

While this approach avoids costs at the beginning of the implementation process, we don’t recommend it. The costs are only shifted to a later stage, when it is much harder to change the PDM.

Just because a column is defined as DECIMAL(38,n) in the source system doesn’t mean it will ever hold such huge numbers. Remember, the space usage of a decimal ranges from 1 byte (1-2 digits) to 16 bytes (19-38 digits). The same applies to integer values (BYTEINT -> 1 byte, BIGINT -> 8 bytes).

In my experience, many decimals could be stored in just 4 bytes, but because of laziness, time pressure and cost reasons, developers often define them with the largest precision available (which consumes 8 or even 16 bytes).

A single decimal, if right-sized in the way described above, needs 4 bytes less per table row. Consider the possible space reduction for a table containing billions of rows, and don’t forget that this improvement comes before applying any compression!

Similarly, using character set LATIN instead of UNICODE cuts the space usage of a character column by 50% per character. I often see even simple codes, such as “ACTIVE”, “SUSPENDED”, or “CANCELED”, stored as UNICODE!
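To make this concrete, here is a minimal DDL sketch (table and column names are hypothetical) contrasting right-sized definitions with the oversized ones typically inherited from the staging area:

CREATE TABLE SampleDB.Customer_Account
(
   AccountId    INTEGER NOT NULL,                   -- instead of DECIMAL(38,0): 4 bytes instead of 16
   Balance      DECIMAL(9,2) NOT NULL,              -- instead of DECIMAL(18,2): 4 bytes instead of 8
   StatusCode   VARCHAR(10) CHARACTER SET LATIN,    -- simple codes: 1 byte per character instead of 2
   CustomerName VARCHAR(100) CHARACTER SET UNICODE  -- UNICODE only where international characters are really expected
)
PRIMARY INDEX (AccountId);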

As a Teradata Consultant, you should never skip this important optimization step. Yes, it takes time when we create the PDM, but it pays off in the end.

I think I have given you enough arguments to insist on these activities. Don’t forget about them in your next project!

If you have any questions about all this, please ask in the comments! I’ll be paying close attention and answering as many as I can. Thank you for reading. Whatever this blog has become, I owe it all to you.


Teradata Multivalue Compression, Nullability & the Presence Bytes

Multivalue compression and the indication of nullability are different matters, but they share one common task: storing information about column values in every single row. To keep space requirements low, Teradata stores the information about nullability and multivalue compression together in the so-called presence bytes.

The alert reader will notice that the name “presence byte” suggests it indicates the presence of NULL values. This naming convention was introduced in earlier releases of Teradata, when only the nullability information was stored and multivalue compression did not yet exist.

Each row is always equipped with at least one presence byte. More presence bytes are added if needed. Only 7 bits of the first presence byte are used to store information about nullability or multivalue compression, as the first bit is occupied by the system (and it’s always set to 1).

How does Teradata store Information about NULL Values?

For each nullable column, one bit per row is used to indicate NULL values. The nullability bits are stored in the same presence bytes as the bits used by the multivalue compression feature (the bits are occupied from right to left). Columns defined as NOT NULL, of course, don’t need a nullability indicator bit.

The logic of indicating NULL values is simple:

  • If the presence bit is set to 1, the column contains a NULL value
  • If the presence bit is set to 0, the column contains a non-NULL value

Below is an example table containing eight rows. As TheCol is nullable, an indicator bit is used to mark rows containing NULL values in TheCol (as mentioned earlier, the first bit is always set to 1 because it is used by the system):

[Figure Compress3: presence bits for eight rows of the nullable column TheCol]

We already said that only 7 bits of the first presence byte can be used. The presence byte that always exists can therefore cover seven nullable columns. Each further presence byte can store the NULL value information of 8 more nullable columns.

If we increase the number of nullable columns from 7 to 8, an extra presence byte is added. As a result, each row consumes one more byte of permanent disk space.
Keep the following limits in mind when designing your tables:

Up to 7   nullable columns -> The already existing presence byte is used
Up to 15 nullable columns -> 1 presence byte is added (2 in total)
Up to 23 nullable columns -> 2 presence bytes are added (3 in total)
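Ignoring multivalue compression bits for the moment, this pattern can be expressed as a simple rule of thumb. A minimal sketch (assuming only nullability bits are stored):

-- Presence bytes used for nullability = 1 + CEILING((nullable_columns - 7) / 8)
-- Example with 16 nullable columns (returns 3):
SELECT 1 + CAST(CEILING(CAST(16 - 7 AS FLOAT) / 8) AS INTEGER) AS presence_bytes;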

It’s recommended that columns are defined as NOT NULL when possible, to keep the table size as small as possible. Furthermore, unused bits of a presence byte can be used by the multivalue compression feature.

Teradata Multivalue Compression and the Presence Bytes

Presence bytes are also used in multivalue compression.  For each column, up to 255 different values can be compressed.

The original column values are kept in the table header, and each row only holds the binary number that represents its column value. If we store 255 different values, 8 bits are needed (a full presence byte).

The algorithm will only occupy as many bits as needed to encode the original column value:

Up to 1 compressed value  -> 1 bit
Up to 3 values            -> 2 bits
Up to 7 values            -> 3 bits
Up to 15 values           -> 4 bits
Up to 31 values           -> 5 bits
Up to 63 values           -> 6 bits
Up to 127 values          -> 7 bits
Up to 255 values          -> 8 bits
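In other words, one extra bit pattern (all zeros) is reserved for rows whose value is not in the compress list, so n compressed values need CEILING(LOG2(n + 1)) presence bits, which matches the list above. A quick sketch of the calculation (LOG is the base-10 logarithm in Teradata):

-- Presence bits needed for 100 compressed values (returns 7)
SELECT CAST(CEILING(LOG(100 + 1) / LOG(2)) AS INTEGER) AS bits_needed;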

Let’s do another example. The table below has eight rows. If we compress the 3 column values ‘A’, ‘B’, and ‘C’, we need two presence bits: ‘A’ can be represented by the binary number 01, ‘B’ by 10, and ‘C’ by 11 (the presence bits are occupied from right to left):

[Figure Compress1: presence bits for eight rows with the compressed values ‘A’, ‘B’, and ‘C’]

We extend our example by making TheCol additionally nullable. In this case, one presence bit is needed to show rows containing NULL values in column TheCol. The compression is unchanged, just shifted by one bit to the left, as nullability bits are always stored first:

[Figure Compress2: presence bits with an additional nullability bit for TheCol]

As soon as we compress at least one value of TheCol, NULL values are automatically compressed as well, at no additional cost. The nullability bit of the column covers the NULL values; no encoding and no storage in the table header is required.
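A minimal DDL sketch of this example (table and column names are hypothetical): TheCol is nullable and compresses the three values, and its NULLs are covered by the presence bits at no extra header cost:

CREATE TABLE SampleDB.TheTable
(
   RowId  INTEGER NOT NULL,
   TheCol CHAR(1) CHARACTER SET LATIN
          COMPRESS ('A','B','C')
)
PRIMARY INDEX (RowId);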

By using the same presence bytes for multivalue compression and nullability, space consumption can be reduced. Nevertheless, when designing your tables, always consider the joint cost of compressed columns and nullability indicators, and how one extra compress value or nullable column will influence the size of your table.

Compression in Teradata Columnar

Compression in Teradata Columnar offers two main benefits: a reduction in permanent space usage and a reduction in disk I/Os.

When compressing columns, there is usually a significant advantage over compressing rows: the limited data variance within a column.
Teradata Columnar uses several compression methods that exploit this property.

Teradata Columnar & Run-Length Encoding

This compression method stores each column value exactly once but adds information about the consecutive rows that contain this value. Assuming that a column container for a date column contains the date ‘2015-10-29’ in rows 100-200, run-length encoding would store this information like this:

‘2015-10-29’;100-200

Teradata Columnar & Dictionary Encoding

If the column values do not repeat consecutively, another compression method can be applied: dictionary encoding. Dictionary encoding stores a compressed form of each full value. The dictionary entries have a fixed length, which makes navigation easy. Teradata Columnar keeps one dictionary per container.

For example, the values for the customer segments “Business” and “Private” could be stored as dictionary entries 1 and 2:

1,”Business”,2,”Private”

The column values would be stored as 1,2,1,2,2,2,1,1,2,1,…

As you can see, this reduces disk space usage by mapping larger column values to smaller ones.

Teradata Columnar & Delta Compression

While run-length and dictionary encoding are straightforward, Teradata Columnar has another, more sophisticated compression method: delta compression.

If the column values of a container are in a tight range, only the offset from an average container value will be stored.
Let’s assume the following column values:

10,20,50,100,20,10

Delta compression would store the encoded information like this (notice that the average column value is 35):

-25,-15,+15,+65,-15,-25

You will immediately think, “But what’s the advantage in used space if I replace the original values with the offsets shown above?” The answer is the data type: if our column container holds BIGINT values, the offsets may fit into a SMALLINT, saving 6 bytes per value.

Conclusion

The total effect of the compression methods available in Teradata Columnar is a vast improvement over the usual multivalue compression (MVC) used for row-stored data.

Fortunately, Teradata decides on its own which compression algorithm is used (although this behavior can be changed). The compression method can vary from column to column and even from container to container within a column. Multiple methods can even be combined for a single column.
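As a rough sketch (table and column names are hypothetical), this is how a column-partitioned table could be defined; autocompression is applied by default, and the NO AUTO COMPRESS option is the way to switch it off for a column partition:

CREATE TABLE SampleDB.Sales_CP
(
   SalesDate  DATE,
   CustomerId INTEGER,
   Revenue    DECIMAL(9,2)
)
NO PRIMARY INDEX
PARTITION BY COLUMN;  -- each column becomes its own partition; autocompression picks the method per container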


Teradata Compression with the DWHPRO TITHONIZER

Do you work in an environment where a compression tool is installed that works to your satisfaction?

If not, I assume you are probably not very pleased with having to walk through the databases, table by table, to find out what to compress on which column. Typically, this is time-consuming, fatiguing, and to some extent subject to momentary human judgment.

Every reevaluation incurs these costs again. Therefore, Teradata compression activities often stop once the largest decile of tables has been treated, leaving a lot of potential unused. I can offer you a solution to this problem:

Teradata Compression with the DWHPRO TITHONIZER

Inspired by Tithonos, the figure from Greek legend whose fate it was to shrink ever more, DWHPRO TITHONIZER determines the best possible and feasible compression potential for a table under the given constraints and assumptions.

The DWHPRO TITHONIZER can be freely downloaded by all newsletter subscribers. If you are not already subscribed, or are an existing subscriber who would like to download the DWHPRO TITHONIZER, please enter your e-mail address at the end of this article.

DWHPRO TITHONIZER determines the set of compression candidate columns and the compression candidate values per column.

It then selects the one combination that achieves the highest savings under the given technical constraints and with the current table demographics.

It can be used for compressing a table for the first time or for updating existing compressions.

It does not apply the compression findings directly. Rather, it stores them in a separate table to be implemented at a later stage. This allows you to separate the potentially resource-intensive analysis from the application work. Also, DWHPRO TITHONIZER can be applied to one table in isolation or to all tables of a database. This enables you to perform demo or study runs on representative tables before any mass activity.

The core assumption behind DWHPRO TITHONIZER is that it is possible to determine the one combination of compressions over columns that achieves the highest net benefit under the given circumstances and constraints.

It gathers the following information:

  • Is the table populated with enough rows for compression to have any savings potential at all?
  • Which columns can be candidates for compression for technical reasons?
  • Which columns shall be candidates for compression from a business or policy point of view?
  • Which values, given the candidate columns, are candidates themselves?
  • What is the benefit in saved bytes of compressing one value?
  • What is the cost of compressing this one value?
  • What is the total Presence Byte consumption of a given compression?

For every column, compressed value sets are summarized in levels that represent one Presence Bit consumed each.

Based on this information, a dominant path for compression is determined by ordering the various levels of columns by their “savings per header cost” ratio. Highly frequent level-1 values with little cost in bytes for the table header come first in line; at the other end one will find lengthy and infrequent values with a razor-thin or even no net benefit.

After potentially adjusting for net benefit and local Presence Byte break-even points, the dominant path is translated into a series of compression statements that are stored in the result table TITHONOS. From there, ALTER TABLE statements can be generated to apply the compression.
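For illustration only (table, column, and values are hypothetical; the real statements are generated from the TITHONOS result table), such a generated statement could look like this:

ALTER TABLE SampleDB.TheTable
ADD StatusCode COMPRESS ('ACTIVE','SUSPENDED','CANCELED');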

For a detailed description of DWHPRO TITHONIZER, please refer to the PDF file attached to this article.

I recommend reading it before you start working with the stored procedure.

DWHPRO TITHONIZER was developed and tested for Teradata Versions 13 and 14.

The current version offered for download and use here is the first one. Improvements, refinements, or modifications of functionality or methodology will follow as experience with it is gained and time allows.

Feel free to use or modify DWHPRO TITHONIZER for your training or working purposes.

Before you start finding out what the stored procedure can do for you, let me remind you that the responsibility for the use of the stored procedure and any consequence thereof lies wholly in the hands of whoever uses it. I am not responsible for any damage, misuse, or other negative consequences that emerge from using DWHPRO TITHONIZER.

Once you have gathered some experience, I would be pleased to hear about any results you achieve with it.


Good luck and good results!


Teradata Compression: Should you bother?

The Teradata DBMS has always supported compression in a (growing) number of forms.

Version 5.0 introduced Multi-Value Compression (MVC), version 13.10 introduced Algorithmic (ALC) and Block Level (BLC) compression, and version 14.0 introduced multiple automatic compression options for Columnar tables.

All of these can produce disk space savings and associated performance gains; however, implementing Teradata compression with MVC or ALC takes manual effort, which has a cost associated with it.

It also needs to be remembered that because data demographics change over time, analysis of either of these types of compression needs to be revisited on a regular basis (which may only be once a year).

Yes, there are tools that help with this, but the point of this article is really to ask whether the time you are spending on compression is worthwhile. Is it giving you benefits that are commensurate with the costs? Like many things, this is a question of “cost vs. benefit”.

On one recent customer engagement I produced the following analysis for them:

  • They have about 98000 tables.
  • About 26000 tables have MVC implemented.
  • On their system, only 4000 tables are big enough that a full table scan is estimated to take longer than 1 second – even allowing for a reduced scan rate because of concurrent queries.

I am not recommending that you remove existing MVC compression from small tables. MVC doesn’t impose any real CPU overhead, and removing compression from a lot of tables would be a lot of effort for negligible gain, if any. If you have ALC applied to a small table, however, removing it might give you some savings.

What I would recommend is that, when designing new tables or making changes to existing tables, the compression settings should be carefully considered.

If the table is likely to be small, I would consider not applying any compression to it. This is more a change in process than any particular remedial action that needs to be taken.

Teradata Multivalue Compression – a Quick Approach

Analyzing a huge Teradata table for multivalue compression is, from a workload point of view, a costly operation.

Unfortunately, Teradata (the company) does not offer any tool to automate this task.

Usually, the analysis requires counting the number of distinct values per table column. Counting is resource- and time-intensive for huge tables.

Luckily, with Teradata 14.10 we have a less costly method available:

SHOW STATISTICS VALUES COLUMN <column> ON <table>;

The statement above outputs the most biased column values and many other useful statistics about a column. Usually, the biased values are great candidates for Teradata Multivalue compression.

The advantage of this approach is that no costly table analysis is required (compare this against counting the distinct column values for each column of a huge table); nevertheless, it delivers astoundingly good compression suggestions.
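For comparison, this is roughly what the traditional counting approach looks like for a single column (table and column names are hypothetical); it requires a full scan of the table and has to be repeated for every column:

SELECT TheCol, COUNT(*) AS ValueCount
FROM SampleDB.TheTable
GROUP BY TheCol
ORDER BY ValueCount DESC;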

By additionally using the number of NULL values per column, you should have a great starting point for a quick win regarding space reduction.

Nevertheless, as good as this method is for achieving quick results, it comes with some drawbacks:

  1. It only works for columns with collected and correct statistics.
  2. Unfortunately, the information used by the SHOW STATISTICS statement is taken from a binary object (“FieldStatistics”). There is no easy way to extract this information. As a substitute, we have to take the text output of the statement and parse it, to be able to automate the process of creating compress statements.
