When Teradata compression becomes pressing

Have you noticed your data warehouse nearing capacity or a significant increase in Current Perm for specific tables during data loading?

Implement compression as a supplement to your space management strategy.

Compression is a table column addition that designates frequently occurring values as “given” to avoid taking up space with each instance in the row. The value is stored in a “higher order” within the table header definition.

Prior to delving into the potential opportunities, it is important to acknowledge the constraints:

  • One cannot compress all kinds of tables; volatile ones are exempt.
  • One cannot compress on index or partitioning columns.
  • Data types such as VARCHAR or TIMESTAMP cannot be compressed depending on the Teradata Release you are working with.
  • You can add compression values only when the table header section’s space is used.

Moving on to the more gratifying aspect, what actions are necessary and advisable?

Ideally, reliable data demographics or forecasts should already be available in tabular form. If not, conduct a row count of the attributes you wish to compress and sort them by descending order of row number.

Compression is effective for attributes with few values, particularly for constant flags or type code columns in large tables with millions of records. We advise against compressing IDs or columns with a unique value for each row. Columns with an even distribution of numerous distinct values do not typically yield significant compression results.

 

A budget approach for getting it done

A thorough technical comprehension of the procedure is unnecessary for column selection, acquiring essential documentation, and providing recommendations. Use the subsequent method as a prompt guide to achieving prompt outcomes.

Apply the following filters to all columns in every table of your database:

Technical filter1:

Eliminate columns that cannot be compressed due to their technical function, such as primary index or partition. This filtering process is irreversible at this stage and can only be modified by modifying the physical data model.

Technical filter 2:

Filter out columns with incompatible data types for compression. This criterion is more lenient because data types can be modified if the initial design relied more on speculation than caution.

A global minimum benefit filter can be used to eliminate tables that fall below the current size and have no further growth expected. If small tables are not present in the database or if you prefer not to exclude them immediately, you can rank them based on non-technical criteria such as usage in reporting.

Now, we will assess the remaining table columns to determine the concentration of values in each. A column with a high concentration of only a few values is more suitable for compression than one with numerous values spread uniformly. To measure this concentration, we will use the Herfindahl-Index (HI), a standard market concentration measure. The higher the value of the HI for a column, the more potential there is for compression.

To store the numeric basis of our decision, we keep a record of the HI value in a register table like this:

— COLUMN DEMOGRAPHICS REGISTER

CREATE MULTISET TABLE P_1111_A.DMnn_COL_HI_INDEX ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
STATUS_DT DATE FORMAT 'YYYY-MM-DD',
TableName CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
ColumnName CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
HI_INDEX DECIMAL(38,18))
PRIMARY INDEX ( TableName, ColumnName );

We insert the HI values by first preparing the insert statement for all relevant tables that we will then execute:

— HHI INDEX CALCULATION PREPARATION

sel TABLENAME, columnname,
'INSERT INTO P_1111_A.DMnn_COL_HI_INDEX SELECT CURRENT_DATE, '''|| trim(tablename) ||''','''|| trim(columnName) || ''','|| trim(columnName) || ',  SUM (CAST( (ANZAHL/GES)**2 AS DECIMAL(38,18)) ) AS HI_INDEX
FROM
( SELECT ' ||trim(columnName)|| ', CAST(COUNT(*)  AS DECIMAL(38,18))  ANZAHL FROM ' || trim(databasename) ||'.'|| trim(tablename) ||'  GROUP BY 1 ) x
JOIN
(SELECT CAST(COUNT(*)  AS DECIMAL(38,18)) AS GES FROM ' || trim(databasename) ||'.'|| trim(tablename) || ' ) GES
ON 1=1; '  AS HHI
from dbc.columns
where databasename='P_1111_A'
and columnname NOT IN ('X1_ID','X2_ID','X3_ID','X4_ID','X5_DT','X6_DT','X7_DT')
and substr(TABLENAME,1,1) IN ('P','T')
and ColumnType NOT IN ('CV','TS','BO')
order by 1,2

;

Note that the query has a few constraints on the dbc column list to remove mere technical fields and exclude non-production tables and columns with non-compressible fields.

Per the table, we select those columns with an HI high enough to prepare an insert statement of compression value candidates. For documentation purposes, we create a second register table first:

— COLUMN COMPRESSION CANDIDATES

CREATE MULTISET TABLE P_1111_A.DMnn_COL_COMP_CAND ,
NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
STATUS_DT DATE FORMAT ‘YYYY-MM-DD’,
TableName CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
ColumnName CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
COL_VALUE CHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
VALUE_SHARE   DECIMAL(18,2)  )
PRIMARY INDEX ( TableName, ColumnName, COL_VALUE ); — FIND ALL COMPRESSION VALUE CANDIDATES

sel TABLENAME, columnname,
'INSERT INTO P_1111_A.DMnn_COL_COMP_CAND  SELECT CURRENT_DATE, '''|| trim(tablename) ||''','|| trim(columnName) || ', CAST( (ANZAHL/GES)*100 AS DECIMAL(38,2)) AS VALUE_SHARE
FROM
( SELECT ' ||trim(columnName)|| ', CAST(COUNT(*)  AS DECIMAL(38,18))  ANZAHL FROM ' || trim(databasename) ||'.'|| trim(tablename) ||'  GROUP BY 1 ) x
JOIN
(SELECT  CAST(COUNT(*)  AS DECIMAL(38,18)) AS GES , CAST(COUNT(DISTINCT ' ||trim(columnName)|| ')  AS DECIMAL(38,18)) AS DDVAL FROM ' || trim(databasename) ||'.'|| trim(tablename) || ' ) GES
ON 1=1
WHERE CAST( (ANZAHL/GES)*100 AS DECIMAL(38,2)) > CAST( (1/DDVAL)*100 AS DECIMAL(38,2))
ORDER BY 3 DESC
;'   AS COMPRESSION_VALUES
from dbc.columns
where databasename='P_1111_A'
and columnname NOT IN ('X1_ID','X2_ID','X3_ID','X4_ID','X5_DT','X6_DT','X7_DT')
and substr(TABLENAME,1,1) IN ('P','T')
and ColumnType NOT IN ('CV','TS','BO')
order by 1,2
;

You can use both register tables to put together a compression recommendation.

Register tables facilitate the periodic insertion of updated demographic information using the STATUS_DT field to record demographic changes.

Before finalizing your deployment list, consider the long-term perspective of your compression project.

What is the frequency of your compression list updates?

How frequently do you anticipate it will be required?

What is the level of volatility of the column value set over time?

As long as they exist, Yes-No flags will hold either Yes or No, while daily account balances or order numbers are temporary.

Teradata cannot take over these considerations. Instead, they should emerge from interaction with business representatives and management or your curiosity if the latter is out of reach.

 

A success story about Teradata compression to inspire you

A year ago, we traversed numerous tables with infrequent or no compression. The database’s vast size reached its capacity, and acquiring additional space of the required magnitude was not feasible. Initially, examinations of the tables’ contents revealed that compression was feasible.

The results of introducing compression systematically were very satisfying:

While

  • No other changes than compression were implemented and
  • working in a production environment with daily growth,
  • Over 1 TB of space was saved in total,
  • an average reduction of 25% and 76.2% on a table level.

The result gave us many months more for daily production!

It paved the way for other necessary activities, leading to even more data being inserted.

{"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

You might also like

>