Teradata Compression Considerations

When Teradata compression becomes pressing

Have you ever seen your data warehouse approaching space limits or some of your tables’ Current Perm skyrocketing as you start loading data?

If so, use compression as an addition to your space management effort.

Compression is an addition to a table's column definition whereby you declare the most common values of a column as "given" so that they no longer consume space once for every row in which they occur. Instead, each such value is kept at a "higher order," namely in the table header definition.
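To give an idea of what this looks like, here is a minimal, hypothetical column definition with multi-value compression; the table, columns, and compressed values are made up for illustration:

-- Hypothetical example: the most common codes of two columns are compressed
CREATE MULTISET TABLE P_1111_A.P_ORDER_EXAMPLE
( Order_Id      INTEGER NOT NULL,
  Order_Status  CHAR(2)  COMPRESS ('OK','CL','RJ'),
  Cancel_Flag   CHAR(1)  COMPRESS ('N')
)
PRIMARY INDEX ( Order_Id );

The listed values (and NULLs, by default) are stored only once in the table header; rows holding one of them carry only a small presence bit pattern instead of the full value.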

Before we explore the possibilities, a note on the limitations:

  • One cannot compress all kinds of tables; volatile tables, for example, are excluded
  • One cannot compress primary index or partitioning columns
  • Depending on the Teradata release you are working with, data types such as VARCHAR or TIMESTAMP cannot be compressed
  • You can add compression values only up to the point where the table header section’s space is used up

Now for the more rewarding part: what can and should be done.

Ideally, data demographics or reliable forecasts for a table are already at hand. If not, count the rows per value for the attributes you intend to compress and sort the result by that count in descending order.
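A minimal sketch of such a count, using the hypothetical table and column from the example above:

-- Value demographics for one attribute, most frequent values first
SELECT Order_Status, COUNT(*) AS ROW_CNT
FROM P_1111_A.P_ORDER_EXAMPLE
GROUP BY 1
ORDER BY 2 DESC;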

Compression is a promising option for attributes with relatively few distinct values. It works best for near-constant flags or type-code columns of large tables with tens of millions of records or more. We do not recommend compressing IDs or other columns that hold close to a separate value for every row. Also, we have never seen results worth implementing for columns whose values are distributed evenly over relatively many distinct values.

 

A budget approach for getting it done

To execute the following procedure for column selection, documentation, and recommendation, you do not need a detailed technical understanding of what exactly happens where. Use the approach presented below as a "budget" one when you need to deliver results soon.

Start with all columns of all tables of your database and apply the following filters:

Technical filter 1:

Exclude all columns that are not eligible for compression because of their technical role (primary index, partitioning). This is a hard filter; only a change of the physical data model could alter your decision here.

Technical filter 2:

Exclude all columns that do not have the right data type for compression. This filter is less restrictive as data types can be changed if the original design was more of an educated guess than a prudent decision.

Next, we can apply a global minimum-benefit filter on the tables by eliminating all those below a certain current size, as well as those of modest size for which no further growth is expected. If you do not want to exclude tables right away, or have no "small" ones in your database, you can rank them by non-technical urgency criteria such as usage in reporting.
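To apply the size filter, the current size per table can be read from the DBC space views; the following is only a sketch, and the 10 GB cutoff is an arbitrary placeholder to adapt:

-- Current size per table, largest first (cutoff value is arbitrary)
SELECT DatabaseName, TableName, SUM(CurrentPerm) AS CURRENT_PERM_BYTES
FROM DBC.TableSize
WHERE DatabaseName = 'P_1111_A'
GROUP BY 1,2
HAVING SUM(CurrentPerm) >= 10E9
ORDER BY 3 DESC;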

For all the table columns left, we can now look at the degree of concentration of values in each column. A column with a very high concentration on only a few values is a more promising candidate for compression than a column whose millions of different values are spread evenly. We can apply a standard measure from the field of market concentration for our purpose: the Herfindahl-Index (HI), the sum of the squared value shares of a column. The closer a column's HI is to 1, the more potential there is for compression.
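As a quick, hand-calculated illustration of the measure: a flag column that holds 'N' in 90% of the rows and 'Y' in the remaining 10% has HI = 0.9² + 0.1² = 0.82 and is a strong candidate, whereas a column with 1,000 equally frequent values has HI = 1,000 × 0.001² = 0.001 and is not.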

In order to store the numeric basis of our decision, we keep a record of the HI value in a register table like this:

-- COLUMN DEMOGRAPHICS REGISTER
CREATE MULTISET TABLE P_1111_A.DMnn_COL_HI_INDEX ,NO FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT
     (
      STATUS_DT  DATE FORMAT 'YYYY-MM-DD',
      TableName  CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
      ColumnName CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
      HI_INDEX   DECIMAL(38,18))
PRIMARY INDEX ( TableName, ColumnName );

We insert the HI values by first preparing an insert statement for every relevant table column, which we then execute:

-- HHI INDEX CALCULATION PREPARATION
SELECT TableName, ColumnName,
'INSERT INTO P_1111_A.DMnn_COL_HI_INDEX
 SELECT CURRENT_DATE, ''' || TRIM(TableName) || ''', ''' || TRIM(ColumnName) || ''',
        SUM(CAST((ANZAHL/GES)**2 AS DECIMAL(38,18))) AS HI_INDEX
 FROM
 ( SELECT ' || TRIM(ColumnName) || ', CAST(COUNT(*) AS DECIMAL(38,18)) ANZAHL
   FROM ' || TRIM(DatabaseName) || '.' || TRIM(TableName) || ' GROUP BY 1 ) x
 JOIN
 ( SELECT CAST(COUNT(*) AS DECIMAL(38,18)) AS GES
   FROM ' || TRIM(DatabaseName) || '.' || TRIM(TableName) || ' ) GES
 ON 1=1;' AS HHI
FROM dbc.columns
WHERE DatabaseName = 'P_1111_A'
  AND ColumnName NOT IN ('X1_ID','X2_ID','X3_ID','X4_ID','X5_DT','X6_DT','X7_DT')
  AND SUBSTR(TableName,1,1) IN ('P','T')
  AND ColumnType NOT IN ('CV','TS','BO')
ORDER BY 1,2;

Note that the query has a handful of constraints on the dbc.columns list: they remove purely technical fields, exclude non-production tables, and skip columns with non-compressible data types.
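To illustrate, the statement the generator emits for the hypothetical column Order_Status of the hypothetical table P_ORDER_EXAMPLE would look roughly like this:

-- EXAMPLE OF A GENERATED STATEMENT (hypothetical table and column names)
INSERT INTO P_1111_A.DMnn_COL_HI_INDEX
SELECT CURRENT_DATE, 'P_ORDER_EXAMPLE', 'Order_Status',
       SUM(CAST((ANZAHL/GES)**2 AS DECIMAL(38,18))) AS HI_INDEX
FROM
( SELECT Order_Status, CAST(COUNT(*) AS DECIMAL(38,18)) ANZAHL
  FROM P_1111_A.P_ORDER_EXAMPLE GROUP BY 1 ) x
JOIN
( SELECT CAST(COUNT(*) AS DECIMAL(38,18)) AS GES
  FROM P_1111_A.P_ORDER_EXAMPLE ) GES
ON 1=1;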

Per table, we select those columns whose HI is high enough and prepare an insert statement for their compression value candidates. For documentation purposes, we create a second register table first:

-- COLUMN COMPRESSION CANDIDATES
CREATE MULTISET TABLE P_1111_A.DMnn_COL_COMP_CAND ,NO FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT
     (
      STATUS_DT   DATE FORMAT 'YYYY-MM-DD',
      TableName   CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
      ColumnName  CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
      COL_VALUE   CHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
      VALUE_SHARE DECIMAL(18,2))
PRIMARY INDEX ( TableName, ColumnName, COL_VALUE );

-- FIND ALL COMPRESSION VALUE CANDIDATES
SELECT TableName, ColumnName,
'INSERT INTO P_1111_A.DMnn_COL_COMP_CAND
 SELECT CURRENT_DATE, ''' || TRIM(TableName) || ''', ''' || TRIM(ColumnName) || ''',
        ' || TRIM(ColumnName) || ', CAST((ANZAHL/GES)*100 AS DECIMAL(38,2)) AS VALUE_SHARE
 FROM
 ( SELECT ' || TRIM(ColumnName) || ', CAST(COUNT(*) AS DECIMAL(38,18)) ANZAHL
   FROM ' || TRIM(DatabaseName) || '.' || TRIM(TableName) || ' GROUP BY 1 ) x
 JOIN
 ( SELECT CAST(COUNT(*) AS DECIMAL(38,18)) AS GES,
          CAST(COUNT(DISTINCT ' || TRIM(ColumnName) || ') AS DECIMAL(38,18)) AS DDVAL
   FROM ' || TRIM(DatabaseName) || '.' || TRIM(TableName) || ' ) GES
 ON 1=1
 WHERE CAST((ANZAHL/GES)*100 AS DECIMAL(38,2)) > CAST((1/DDVAL)*100 AS DECIMAL(38,2));' AS COMPRESSION_VALUES
FROM dbc.columns
WHERE DatabaseName = 'P_1111_A'
  AND ColumnName NOT IN ('X1_ID','X2_ID','X3_ID','X4_ID','X5_DT','X6_DT','X7_DT')
  AND SUBSTR(TableName,1,1) IN ('P','T')
  AND ColumnType NOT IN ('CV','TS','BO')
ORDER BY 1,2;
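Again for illustration, the generated statement for the hypothetical Order_Status column would come out roughly as follows:

-- EXAMPLE OF A GENERATED CANDIDATE STATEMENT (hypothetical table and column names)
INSERT INTO P_1111_A.DMnn_COL_COMP_CAND
SELECT CURRENT_DATE, 'P_ORDER_EXAMPLE', 'Order_Status',
       Order_Status, CAST((ANZAHL/GES)*100 AS DECIMAL(38,2)) AS VALUE_SHARE
FROM
( SELECT Order_Status, CAST(COUNT(*) AS DECIMAL(38,18)) ANZAHL
  FROM P_1111_A.P_ORDER_EXAMPLE GROUP BY 1 ) x
JOIN
( SELECT CAST(COUNT(*) AS DECIMAL(38,18)) AS GES,
         CAST(COUNT(DISTINCT Order_Status) AS DECIMAL(38,18)) AS DDVAL
  FROM P_1111_A.P_ORDER_EXAMPLE ) GES
ON 1=1
WHERE CAST((ANZAHL/GES)*100 AS DECIMAL(38,2)) > CAST((1/DDVAL)*100 AS DECIMAL(38,2));

Only values that are over-represented relative to a uniform distribution, that is, whose share exceeds 1/DDVAL, survive the WHERE clause and are recorded as candidates.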

You can use both register tables to put together a compression recommendation.
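As a starting point for such a recommendation, the two registers can be joined and filtered; the following is only a sketch, and the HI cutoff of 0.5 as well as the CURRENT_DATE snapshot are arbitrary choices you would adapt:

-- Sketch: shortlist of compression candidates (HI cutoff and snapshot date are arbitrary)
SELECT c.TableName, c.ColumnName, h.HI_INDEX, c.COL_VALUE, c.VALUE_SHARE
FROM P_1111_A.DMnn_COL_COMP_CAND c
JOIN P_1111_A.DMnn_COL_HI_INDEX h
  ON  h.TableName  = c.TableName
  AND h.ColumnName = c.ColumnName
  AND h.STATUS_DT  = c.STATUS_DT
WHERE c.STATUS_DT = CURRENT_DATE
  AND h.HI_INDEX >= 0.5
ORDER BY c.TableName, h.HI_INDEX DESC, c.VALUE_SHARE DESC;

From such a shortlist, the COMPRESS value lists per column can be assembled for the revised table definitions.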

The register tables allow you to add new layers of demographics at regular intervals, keyed by STATUS_DT, so that demographic shifts can be documented over time.

Before you prepare your final list for deployment, put your compression project in a long-term perspective:

How often will you be able to update your compression list?

How often do you expect it to be necessary?

How volatile is the column value set over time?

Yes-No flags will contain Yes or No as long as they exist, whereas daily account balances or order numbers are transitory.

Teradata cannot answer these questions for you. Rather, the answers should emerge from interaction with business representatives and management, or from your own curiosity if the latter are out of reach.

 

A success story about Teradata compression to inspire you

About a year ago we were walking through dozens of tables with only occasional or no compression at all. It was only a matter of time before the space of this already extensive database would run out. Additional space of the magnitude we needed could not have been provided easily. First inspections of the table contents indicated that there was potential for compression.

The results of introducing compression systematically were very satisfying:

While

  • no changes other than compression were implemented and
  • we were working in a production environment with daily growth,

we achieved

  • over 1 TB of space saved in total, with
  • an average reduction of 25% and as much as 76.2% at the table level.

The result gave us many more months of headroom for daily production!

It paved the way for other necessary activities that led to even more data being inserted.

Paul Timar
 

Paul Timar is a sociologist and economist by training who became a Data Warehouse Specialist. He has more than 7 years of experience working with Oracle and Teradata data warehouses in the fields of social security data and banking. His latest focus is on performance optimization.
