Teradata Compression Considerations

Paul Timar

March 17, 2014


When Teradata compression becomes pressing

Have you ever seen your data warehouse approaching space limits or some of your tables’ Current Perm skyrocketing as you start loading data?

If so, use compression as an addition to your space management effort.

Compression is an addendum to a table's column definition that declares the most common values as “given” so that they no longer consume space once for every occurrence in a row. Instead, each declared value is kept once at a higher level, in the table header.
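As a first impression, a column definition with multi-value compression looks like this; the table and value lists are a hypothetical sketch, not taken from the project described here:

CREATE TABLE P_1111_A.DMnn_EXAMPLE
(
ACCOUNT_ID INTEGER NOT NULL,
STATUS_CD CHAR(1) COMPRESS ('A','C','X'),  -- the most common values, kept once in the table header
COUNTRY_CD CHAR(2) COMPRESS ('AT','DE')    -- NULLs are compressed as well once COMPRESS is specified
)
PRIMARY INDEX ( ACCOUNT_ID );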

Before we explore the possibilities, a note on the limitations:

  • Not all kinds of tables can be compressed; volatile tables, for example, are exempt.
  • One cannot compress index or partitioning columns.
  • Depending on the Teradata release you are working with, data types such as VARCHAR or TIMESTAMP cannot be compressed.
  • You can add compression values only up to the point where the table header section's space is used up.

Now for the more rewarding part: what can and should be done.

Ideally, data demographics or reliable forecasts on a table are already at hand. If not, run a row count per value over the attributes you intend to compress and sort the result by that count in descending order.
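Such a quick check could look as follows; table and column are the same hypothetical example as above:

SELECT STATUS_CD, COUNT(*) AS ANZAHL
FROM P_1111_A.DMnn_EXAMPLE
GROUP BY 1
ORDER BY 2 DESC;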

Compression is a good try for attributes with relatively few distinct values. It works best for near-constant flags or type-code columns of large tables with tens of millions of records or more. We do not recommend compressing IDs or other columns that come close to holding a different value for every row. We have never seen results worth implementing for columns with equal distributions over relatively many distinct values.

 

A budget approach for getting it done

To execute the following procedure for column choice, required documentation, and recommendation, you do not need a detailed technical understanding of precisely what happens. Use the approach presented below as a “budget” one when you need to deliver results soon.

Start with all columns of all tables of your database and apply the following filters:

Technical filter 1:

Exclude all columns that are not eligible for compression because of their technical role (primary index, partition). Excluding is a hard filter at this point. Only a change in the physical data model could alter your decision here.
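One way to identify such columns, sketched here for the primary index via dbc.indices (IndexType 'P' marks a non-partitioned, 'Q' a partitioned primary index); partitioning expressions have to be checked separately:

SELECT TableName, ColumnName
FROM dbc.indices
WHERE DatabaseName = 'P_1111_A'
AND IndexType IN ('P','Q')
ORDER BY 1,2;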

Technical filter 2:

Exclude all columns that do not have the right data type for compression. This filter is less restrictive as data types can be changed if the original design was more of an educated guess than a prudent decision.

Next, we can apply a global minimum-benefit filter on the tables by eliminating all those below a certain current size and with no further growth expected. If you do not want to exclude tables right away, or have no “small” ones in your database, you can rank them by non-technical urgency criteria such as usage in reporting.
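Current sizes for such a filter can be taken from dbc.TableSize, summed over all AMPs; the 10 GB threshold below is only an assumed example:

SELECT DatabaseName, TableName, SUM(CurrentPerm) AS CURRENT_PERM
FROM dbc.TableSize
WHERE DatabaseName = 'P_1111_A'
GROUP BY 1,2
HAVING SUM(CurrentPerm) >= 10e9
ORDER BY 3 DESC;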

For all the table columns left, we can now look at the degree of concentration of values in the column. A column with a very high concentration on only a few values is a more promising candidate for compression than a column whose millions of different values spread evenly. We can apply a standard measure from the field of market concentration for our purpose: the Herfindahl index (HI), the sum of the squared relative frequencies of all distinct values. The closer the HI of a column is to 1, the more potential for compression.
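For example, a status column where 90% of the rows hold 'A', 8% hold 'B', and 2% hold 'C' has an HI of 0.90² + 0.08² + 0.02² = 0.8168 and is a strong candidate, while a column with 1,000 equally frequent values has an HI of only 0.001.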

To store the numeric basis of our decision, we keep a record of the HI value in a register table like this:

-- COLUMN DEMOGRAPHICS REGISTER
CREATE MULTISET TABLE P_1111_A.DMnn_COL_HI_INDEX, NO FALLBACK,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
STATUS_DT DATE FORMAT 'YYYY-MM-DD',
TableName CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
ColumnName CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
HI_INDEX DECIMAL(38,18))
PRIMARY INDEX ( TableName, ColumnName );

We insert the HI values by first preparing the insert statement for all relevant tables that we will then execute:

-- HHI INDEX CALCULATION PREPARATION
SELECT TableName, ColumnName,
'INSERT INTO P_1111_A.DMnn_COL_HI_INDEX
SELECT CURRENT_DATE, ''' || TRIM(TableName) || ''',''' || TRIM(ColumnName) || ''',
SUM(CAST((ANZAHL/GES)**2 AS DECIMAL(38,18))) AS HI_INDEX
FROM
( SELECT ' || TRIM(ColumnName) || ', CAST(COUNT(*) AS DECIMAL(38,18)) ANZAHL FROM ' || TRIM(DatabaseName) || '.' || TRIM(TableName) || ' GROUP BY 1 ) x
JOIN
( SELECT CAST(COUNT(*) AS DECIMAL(38,18)) AS GES FROM ' || TRIM(DatabaseName) || '.' || TRIM(TableName) || ' ) GES
ON 1=1;' AS HHI
FROM dbc.columns
WHERE DatabaseName = 'P_1111_A'
AND ColumnName NOT IN ('X1_ID','X2_ID','X3_ID','X4_ID','X5_DT','X6_DT','X7_DT')
AND SUBSTR(TableName,1,1) IN ('P','T')
AND ColumnType NOT IN ('CV','TS','BO')
ORDER BY 1,2;
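For illustration, for a hypothetical column STATUS_CD in table T_ACCOUNT, the generated statement reads:

INSERT INTO P_1111_A.DMnn_COL_HI_INDEX
SELECT CURRENT_DATE, 'T_ACCOUNT','STATUS_CD',
SUM(CAST((ANZAHL/GES)**2 AS DECIMAL(38,18))) AS HI_INDEX
FROM
( SELECT STATUS_CD, CAST(COUNT(*) AS DECIMAL(38,18)) ANZAHL FROM P_1111_A.T_ACCOUNT GROUP BY 1 ) x
JOIN
( SELECT CAST(COUNT(*) AS DECIMAL(38,18)) AS GES FROM P_1111_A.T_ACCOUNT ) GES
ON 1=1;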

Note that the query applies a handful of constraints on dbc.columns to remove mere technical fields, exclude non-production tables, and skip columns with non-compressible data types.
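To shortlist candidates from the register, a simple threshold query is enough; the cut-off of 0.5 is an assumed starting point, not a fixed rule:

SELECT TableName, ColumnName, HI_INDEX
FROM P_1111_A.DMnn_COL_HI_INDEX
WHERE STATUS_DT = CURRENT_DATE
AND HI_INDEX >= 0.5
ORDER BY HI_INDEX DESC;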

For the columns with a sufficiently high HI, as shortlisted above, we then prepare insert statements of compression value candidates. For documentation purposes, we create a second register table first:

-- COLUMN COMPRESSION CANDIDATES
CREATE MULTISET TABLE P_1111_A.DMnn_COL_COMP_CAND, NO FALLBACK,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
STATUS_DT DATE FORMAT 'YYYY-MM-DD',
TableName CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
ColumnName CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
COL_VALUE CHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
VALUE_SHARE DECIMAL(18,2) )
PRIMARY INDEX ( TableName, ColumnName, COL_VALUE );

-- FIND ALL COMPRESSION VALUE CANDIDATES

SELECT TableName, ColumnName,
'INSERT INTO P_1111_A.DMnn_COL_COMP_CAND
SELECT CURRENT_DATE, ''' || TRIM(TableName) || ''',''' || TRIM(ColumnName) || ''',' || TRIM(ColumnName) || ', CAST((ANZAHL/GES)*100 AS DECIMAL(38,2)) AS VALUE_SHARE
FROM
( SELECT ' || TRIM(ColumnName) || ', CAST(COUNT(*) AS DECIMAL(38,18)) ANZAHL FROM ' || TRIM(DatabaseName) || '.' || TRIM(TableName) || ' GROUP BY 1 ) x
JOIN
( SELECT CAST(COUNT(*) AS DECIMAL(38,18)) AS GES, CAST(COUNT(DISTINCT ' || TRIM(ColumnName) || ') AS DECIMAL(38,18)) AS DDVAL FROM ' || TRIM(DatabaseName) || '.' || TRIM(TableName) || ' ) GES
ON 1=1
WHERE CAST((ANZAHL/GES)*100 AS DECIMAL(38,2)) > CAST((1/DDVAL)*100 AS DECIMAL(38,2));' AS COMPRESSION_VALUES
FROM dbc.columns
WHERE DatabaseName = 'P_1111_A'
AND ColumnName NOT IN ('X1_ID','X2_ID','X3_ID','X4_ID','X5_DT','X6_DT','X7_DT')
AND SUBSTR(TableName,1,1) IN ('P','T')
AND ColumnType NOT IN ('CV','TS','BO')
ORDER BY 1,2;

You can use both register tables to put together a compression recommendation.
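How the recommendation reaches production depends on your release and change process. One conservative sketch, with a hypothetical table and a value list taken from the candidate register, rebuilds the table with compression and swaps it in:

CREATE MULTISET TABLE P_1111_A.T_ACCOUNT_NEW, NO FALLBACK
(
ACCOUNT_ID INTEGER NOT NULL,
STATUS_CD CHAR(1) COMPRESS ('A','C')  -- values taken from DMnn_COL_COMP_CAND
)
PRIMARY INDEX ( ACCOUNT_ID );

INSERT INTO P_1111_A.T_ACCOUNT_NEW SELECT * FROM P_1111_A.T_ACCOUNT;
RENAME TABLE P_1111_A.T_ACCOUNT TO P_1111_A.T_ACCOUNT_OLD;
RENAME TABLE P_1111_A.T_ACCOUNT_NEW TO P_1111_A.T_ACCOUNT;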

Register tables allow for regular reinsertion of new layers of demographics via the STATUS_DT to document demographic shifts.
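A simple way to watch such shifts is to compare the layers per column over time:

SELECT TableName, ColumnName, STATUS_DT, HI_INDEX
FROM P_1111_A.DMnn_COL_HI_INDEX
ORDER BY TableName, ColumnName, STATUS_DT;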

Before you prepare your final list for deployment, put your compression project in a long-term perspective:

How often will you be able to update your compression list?

How often do you expect it to be necessary?

How volatile is the column value set over time?

Yes-No flags will contain Yes or No for as long as they exist, whereas daily account balances or order numbers are transitory.

Teradata cannot take over these considerations for you. Instead, they should emerge from interaction with business representatives and management, or from your own curiosity if those are out of reach.

 

A success story about Teradata compression to inspire you

About a year ago, we were working through dozens of tables with only occasional compression or none at all. It was just a matter of time before this already extensive database ran out of space, and additional space could not easily have been provided in the order of magnitude we needed. First inspections of table contents indicated potential for compression.

The results of introducing compression systematically were very satisfying:

While

  • no changes other than compression were implemented, and
  • we were working in a production environment with daily growth,

over 1 TB of space was saved in total: an average reduction of 25%, with up to 76.2% on a table level.

The result bought us many more months of daily production!

It paved the way for other necessary activities that led to even more data being inserted.

{"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}
