Teradata Data Types – Part 1/2

 

If you work in a well-organized environment where diligence and discipline prevail and great effort is put into designing things right from the beginning, the following post might seem outlandish to you. Otherwise, read on.

In a more “experimental” table design landscape, you will soon come across inconsistencies between column names, Teradata data types and their length, decimal or format settings. Some of them affect table sizes and performance, so do not turn a blind eye to them even if you see yourself as a technician and administrator first. If you do not wish to be reduced to that role, here might be your ladder upstairs.

Evaluate the following aspects of your data warehouse landscape upon deployment and creation, but also when the opportunity for change is there. For every point mentioned, find out to what extent your tables and columns are neat and tidy (The Good), outright irregular but obviously and visibly so (The Bad), or deficient only at a second glance (The Ugly).

Cross-Table Name and Content Matches

Here, we have to differentiate between what I would like to call semantic, lexical and technical mismatches.

The worst case is the semantic mismatch. Attributes containing prices and ratios are named identically across all tables, they look like prices and rates, but while they appear to be used correctly, their content differs. Sometimes even in such a subtle way that you cannot detect it. That is the ugly case.

Lexical mismatches – inconsistent naming of related columns – are a bad thing, but as they are in plain sight, you can detect and correct them with limited effort.  Here is what you can look for:

  • Prefix and suffix conventions

Which existing prefixes are indicative of the table content and not just remnants from input sources or early developer sketches?

  • Abbreviation conventions

Is every date field called “_DT” in the end or are there some “_DATE”s as well?

  • Diverging labelling

Is what is called “_ID” in one place a “_NUM”, or something with no suffix at all, in another place?

  • Name switching along the way

Avoid renaming table columns just to please particular end users. Solve that issue in view layers. Embellishment comes last.

  • Content over Source

Name your tables and columns according to their content and not based on the source the data came from. Expect to be loading from different sources sooner or later.

  • “Blind” and “disguised” attributes

Do not create reserve fields “just in case” to fill them with specific content when necessary. An even worse habit is to reuse abandoned attributes for different content. Will you remember one year from now what the “EXTRAFIELD1” contains and from when on the EXCHANGE_RATE was used for holding commission fees instead and why?

If you are on the safe side, all your columns carry straightforward names and have the same data type and settings across all tables.

Here is an example of where to start your investigation.


-- What data types do FLAGs have?
sel TableName, ColumnName, ColumnType, ColumnLength, DecimalTotalDigits, DecimalFractionalDigits, Nullable
from dbc.columns
where databasename = 'P_1111_A'
and columnname NOT IN ('X1_ID','X2_ID','X3_ID','X4_ID','X5_DT','X6_DT','X7_DT')
and substr(TableName,1,1) IN ('P','T')
and columnname LIKE '%FLAG%'
;

The following result shows that there are flags in the same table with diverging character type and length settings.

TableName        ColumnName  ColumnType  ColumnLength  DecimalTotalDigits  DecimalFractionalDigits  Nullable
P1111_XY_HIST    FF_FLAG     CV          32            ?                   ?                        Y
P1111_XY_HIST    CC_FLAG     CF          1             ?                   ?                        Y

Look at the content to decide whether “FLAG” is just not the right name or CV,32 is a wrong data type setting. Reducing it to CF,1 saves you up to 31+2 bytes of potential space for every row. That alone, repeated over several attributes, can shrink your table by 5 or 10%!
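If you want a broader overview than just the flags, a sketch along the following lines lists every column name that appears with more than one type or length definition across the tables of the database; the database filter is the same placeholder as above:

-- Column names defined with diverging types or lengths across tables
sel ColumnName,
COUNT(DISTINCT ColumnType) AS TypeVariants,
COUNT(DISTINCT ColumnLength) AS LengthVariants
from dbc.columns
where databasename = 'P_1111_A'
and substr(TableName,1,1) IN ('P','T')
group by 1
having COUNT(DISTINCT ColumnType) > 1 OR COUNT(DISTINCT ColumnLength) > 1
;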

Data Type Appropriateness

First, evaluate variability appropriateness, i.e. a column with fixed-length inputs should not be defined as variable-length. The variability option costs you two extra bytes per row.

A wrong decision I saw several times is VARCHAR(1). This is the most costly character flag you can have because it costs you three bytes in total for a single Y/N stored.

Second, does it come in fractions? Do not define significant decimals when fractions of a unit never occur or are always .00 in the source.

Third, aim for unmixed input. In-house grading or rating input should either be all numbers or all characters.
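To hunt down the VARCHAR(1) cases and their slightly longer siblings, a quick dictionary query is enough; the length threshold of 2 is an arbitrary assumption:

-- Very short VARCHARs where the two-byte length overhead outweighs the flexibility
sel TableName, ColumnName, ColumnLength
from dbc.columns
where databasename = 'P_1111_A'
and ColumnType = 'CV'
and ColumnLength <= 2
order by 1,2
;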

Parsimony

While you should allow name and description fields a certain length to cope with unusual entries, these fields should not become grotesquely oversized. Reduce a VARCHAR(2000) to VARCHAR(500) or even less if the largest entry ever, after years of loading, is some 250 characters long.
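A quick way to see how much of a generous VARCHAR is actually used is to measure the longest entry ever loaded; the table and column below are hypothetical stand-ins:

-- Longest value actually stored in a generously sized description column
SELECT MAX(CHARACTER_LENGTH(PRODUCT_DESC)) AS MaxUsedLength
FROM P_1111_A.T1111_ENTITY1;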

The same parsimony should prevail with Decimal fields. It is overdone to store a percentage rate with two digits that can never get out of the 0.00 to 100.00 range as DECIMAL(38,18). Not only is it extra space consumed for no reason, but it is also misleading from a business point of view.

The common motives for extremely long data types are fear of failure and uncertainty about the upcoming content. If you detect these reasons, take them as a warning sign that the project went into implementation too early because the source interface definitions are missing or incomplete. It will all come back to you tenfold in the form of ongoing fixes and reporting user complaints.

Type-Specific Size Threshold Awareness

Be aware of the fact that a narrower data type parameter setting does not linearly translate into less space consumed. DECIMAL(19,6) and DECIMAL(20,6) cost the same, whereas DECIMAL(18,6) is one level down in the bytes reserved for entries. Look up the technical handbook overview to see up to which point you can be generous without extra space cost.
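For DECIMAL, the levels are driven by the total number of digits: up to 2 digits take 1 byte, up to 4 take 2 bytes, up to 9 take 4 bytes, up to 18 take 8 bytes, and anything from 19 to 38 digits takes 16 bytes (verify this against the handbook of your release). A sketch for spotting columns that sit just above the 8-byte boundary:

-- DECIMAL columns above 18 total digits: would 18 digits do the job at half the storage?
sel TableName, ColumnName, DecimalTotalDigits, DecimalFractionalDigits
from dbc.columns
where databasename = 'P_1111_A'
and ColumnType = 'D'
and DecimalTotalDigits > 18
order by DecimalTotalDigits DESC
;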

Character Set Suitability

Whenever you rate character settings, think of your most exotically named friend and how he or she would fit in. Loads can fail for not taking the cross-cultural nature of the customer base into account. The ugly case here is a landscape of “rotten” entries, i.e. misconverted and now unreadable characters that happen without runtime failures.


The new Summary Collect Statistics feature starting with Teradata Release 14

 

Summary statistics are a very useful feature, which is new in Teradata Release 14. Contrary to traditional statistics, which are collected on columns or indexes, summary statistics are table level statistics:

COLLECT SUMMARY STATISTICS ON <TABLE>;

Table level statistics are the information about the number of rows per table, the average data block sizes, and the average row sizes.

The number of rows per table is a critical measure for the Teradata Optimizer to estimate the cost of a full table scan. So, how was this information made available in previous releases of Teradata?

In earlier versions of Teradata, this was achieved by collecting column statistics on the dummy “PARTITION” and the Primary Index columns.
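For reference, the pre-release-14 workaround looks like this, with the placeholders used throughout this post:

COLLECT STATISTICS ON <TABLE> COLUMN PARTITION;
COLLECT STATISTICS ON <TABLE> COLUMN <PRIMARY_INDEX_COLUMN>;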

Summary statistics defined on a table are refreshed each time the statistics on any column of the table are updated. To avoid wasting resources, the best practice is therefore to update all column statistics at once instead of doing it column by column:

COLLECT STATISTICS ON <TABLE>; -- all column statistics are refreshed, summary statistics are only updated once.

Although less intense on resources than traditional column/index statistics, this should still be avoided:

COLLECT STATISTICS ON <TABLE> <COLUMN_1>; -- first collection of summary statistics

COLLECT STATISTICS ON <TABLE> <COLUMN_2>; -- needless 2nd collection of summary statistics

Conclusion: Summary statistics collection is a very useful feature to refresh the number of rows information per table quickly.

Teradata Historisation – Daily snapshot to change history

 

For a variety of reasons, we may need to switch our Teradata historisation from snapshot tables to a start-date/end-date logic table, which I usually call a change history, as only changes trigger a new historisation record.

A principal motivation for such a move away from a daily snapshot table could be space restrictions.

Whatever the reason may be, changing the historisation type is easier than one may expect. Here is a SQL template which handles exactly this task. The columns named RSD and RED are the technical columns defining the validity start date and end date of each record.


SELECT
<KEY_COLUMNS>,
<NON_KEY_COLUMNS>,
RSD,
CASE WHEN
MAX(CHANGED_IND) OVER (PARTITION BY <KEY_COLUMNS> ORDER BY RSD ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) = 2
THEN
MAX(RED_PREV) OVER (PARTITION BY <KEY_COLUMNS> ORDER BY RSD ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING)
ELSE
COALESCE(((MAX(RSD) OVER (PARTITION BY <KEY_COLUMNS> ORDER BY RSD ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING)) - 1), ENDI)
END AS RED
FROM
(
SELECT
<KEY_COLUMNS>, <NON_KEY_COLUMNS>, SNAPSHOT_DATE AS RSD, SNAPSHOT_DATE AS RED,
MAX(RED) OVER (PARTITION BY <KEY_COLUMNS> ORDER BY RSD ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS RED_PREV,

-- CHANGED_IND: 2 = gap in the snapshot sequence, 1 = at least one attribute changed, 0 = no change
CASE
WHEN (RED_PREV + 1) <> RSD THEN 2
WHEN
(MIN(<ATTRIBUTE1>) OVER (PARTITION BY <KEY_COLUMNS> ORDER BY RSD ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) =
MAX(<ATTRIBUTE1>) OVER (PARTITION BY <KEY_COLUMNS> ORDER BY RSD ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) OR
(MIN(<ATTRIBUTE1>) OVER (PARTITION BY <KEY_COLUMNS> ORDER BY RSD ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) IS NULL AND
MIN(<ATTRIBUTE1>) OVER (PARTITION BY <KEY_COLUMNS> ORDER BY RSD ROWS BETWEEN CURRENT ROW AND CURRENT ROW) IS NULL))

AND
-- ... one such block, joined with AND, for every further non-key attribute up to ...

(MIN(<ATTRIBUTEn>) OVER (PARTITION BY <KEY_COLUMNS> ORDER BY RSD ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) =
MAX(<ATTRIBUTEn>) OVER (PARTITION BY <KEY_COLUMNS> ORDER BY RSD ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) OR
(MIN(<ATTRIBUTEn>) OVER (PARTITION BY <KEY_COLUMNS> ORDER BY RSD ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) IS NULL AND
MIN(<ATTRIBUTEn>) OVER (PARTITION BY <KEY_COLUMNS> ORDER BY RSD ROWS BETWEEN CURRENT ROW AND CURRENT ROW) IS NULL))

THEN 0 ELSE 1
END AS CHANGED_IND,

MAX(RED) OVER (PARTITION BY <KEY_COLUMNS> ORDER BY RSD ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) AS RED_NEXT,
MAX(RED) OVER (PARTITION BY <KEY_COLUMNS>) AS ENDI
FROM
<SNAPSHOT_TABLE> i
QUALIFY (CHANGED_IND = 0 AND RED_PREV IS NULL) OR (CHANGED_IND <> 0)
) i;


 

Teradata Compression Considerations

When Teradata compression becomes pressing

Have you ever seen your data warehouse approaching space limits or some of your tables’ Current Perm skyrocketing as you start loading data?

If so, use compression as an addition to your space management effort.

Compression is a table column definition addendum whereby you define the most common values as “given” so that they do not consume space once for every occurrence in a row. The value is kept at a higher level, namely in the table header.
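As a minimal illustration with made-up column names and values, multi-value compression is declared right in the column definition:

-- Hypothetical column definitions with multi-value compression
CREATE MULTISET TABLE P_1111_A.T1111_ENTITY1_NEW
(
L_ID INTEGER NOT NULL,
CC_FLAG CHAR(1) CHARACTER SET LATIN COMPRESS ('Y','N'),
COUNTRY_CD CHAR(2) CHARACTER SET LATIN COMPRESS ('DE','AT','CH')
)
PRIMARY INDEX ( L_ID );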

Before we explore the possibilities, a note on the limitations:

  • One cannot compress all kinds of tables; volatile ones are exempt
  • One cannot compress on index or partitioning columns
  • Depending on the Teradata Release you are working with, data types such as VARCHAR or TIMESTAMP cannot be compressed
  • You can add compression values only up to the point where the table header section’s space is used up

Now for the more rewarding part, what can and should be done.

Ideally, data demographics or reliable forecasts for a table are already at hand. If not, do a value count over the attributes you intend to compress and sort it by the number of rows in descending order.
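A hedged starting point for such a count, with names borrowed from the examples above for illustration only:

-- Value frequencies of a compression candidate column, most frequent first
SELECT CC_FLAG, COUNT(*) AS RowCnt
FROM P_1111_A.P1111_XY_HIST
GROUP BY 1
ORDER BY 2 DESC;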

Compression is a promising try for attributes with relatively few distinct values. It works best for near-constant flags or type code columns of large tables with tens of millions of records and more. We do not recommend compressing IDs or other columns that are close to holding a separate value for every row. Also, we have never seen results worth implementing for columns with equal distributions over relatively many distinct values.

 

A budget approach for getting it done

To execute the following procedure for column choice, required documentation and recommendation, you do not need a detailed technical understanding of what exactly happens where. Use the approach presented below as a “budget” one when you need to deliver results soon.

Start with all columns of all tables of your database and apply the following filters:

Technical filter 1:

Exclude all columns that are not eligible for compression because of their technical role (primary index, partitioning column). This exclusion is a hard filter at this point. Only a change of the physical data model could alter your decision here.

Technical filter 2:

Exclude all columns that do not have the right data type for compression. This filter is less restrictive as data types can be changed if the original design was more of an educated guess than a prudent decision.

Next, we can apply a global minimum benefit filter on the tables by eliminating all those below a certain current size as well as those of a given size with no further growth expected. If you do not want to exclude tables right away or have no “small” ones in your database, you can rank them by non-technical urgency criteria such as usage in reporting.

For all the table columns left, we can now have a look at the degree of concentration of values in the column. A column with a very high concentration on only a few values is a more promising candidate for compression than a column whose millions of different values are spread evenly. We can apply a standard measure from the field of market concentration for our purpose: the Herfindahl index (HI). The closer the value for a column is to 1, the more potential for compression.
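As a quick illustration with made-up numbers: a flag column holding 'N' in 90% of the rows and 'Y' in the remaining 10% has an HI of 0.9² + 0.1² = 0.82 and is a strong candidate, while a column with one million equally frequent distinct values has an HI of one millionth and is hardly worth the effort.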

In order to store the numeric basis of our decision, we keep a record of the HI value in a register table like this:

-- COLUMN DEMOGRAPHICS REGISTER
CREATE MULTISET TABLE P_1111_A.DMnn_COL_HI_INDEX, NO FALLBACK,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
STATUS_DT DATE FORMAT 'YYYY-MM-DD',
TableName CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
ColumnName CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
HI_INDEX DECIMAL(38,18))
PRIMARY INDEX ( TableName, ColumnName );

We insert the HI values by first preparing the insert statement for all relevant tables that we will then execute:

-- HHI INDEX CALCULATION PREPARATION
sel TableName, ColumnName,
'INSERT INTO P_1111_A.DMnn_COL_HI_INDEX
SELECT CURRENT_DATE, ''' || trim(tablename) || ''',''' || trim(columnName) || ''', SUM(CAST( (ANZAHL/GES)**2 AS DECIMAL(38,18)) ) AS HI_INDEX
FROM
( SELECT ' || trim(columnName) || ', CAST(COUNT(*) AS DECIMAL(38,18)) AS ANZAHL FROM ' || trim(databasename) || '.' || trim(tablename) || ' GROUP BY 1 ) x
JOIN
( SELECT CAST(COUNT(*) AS DECIMAL(38,18)) AS GES FROM ' || trim(databasename) || '.' || trim(tablename) || ' ) GES
ON 1=1; ' AS HHI
from dbc.columns
where databasename = 'P_1111_A'
and columnname NOT IN ('X1_ID','X2_ID','X3_ID','X4_ID','X5_DT','X6_DT','X7_DT')
and substr(TableName,1,1) IN ('P','T')
and ColumnType NOT IN ('CV','TS','BO')
order by 1,2
;

Note that the query has a handful of constraints on the dbc column list to remove mere technical fields, exclude non-production tables, and skip columns with non-compressible data types.

Per table, we select those columns with an HI high enough and prepare an insert statement for the compression value candidates. For documentation purposes, we create a second register table first:

-- COLUMN COMPRESSION CANDIDATES
CREATE MULTISET TABLE P_1111_A.DMnn_COL_COMP_CAND, NO FALLBACK,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
STATUS_DT DATE FORMAT 'YYYY-MM-DD',
TableName CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
ColumnName CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
COL_VALUE CHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
VALUE_SHARE DECIMAL(18,2))
PRIMARY INDEX ( TableName, ColumnName, COL_VALUE );

-- FIND ALL COMPRESSION VALUE CANDIDATES
sel TableName, ColumnName,
'INSERT INTO P_1111_A.DMnn_COL_COMP_CAND
SELECT CURRENT_DATE, ''' || trim(tablename) || ''',''' || trim(columnName) || ''',' || trim(columnName) || ', CAST( (ANZAHL/GES)*100 AS DECIMAL(38,2)) AS VALUE_SHARE
FROM
( SELECT ' || trim(columnName) || ', CAST(COUNT(*) AS DECIMAL(38,18)) AS ANZAHL FROM ' || trim(databasename) || '.' || trim(tablename) || ' GROUP BY 1 ) x
JOIN
( SELECT CAST(COUNT(*) AS DECIMAL(38,18)) AS GES, CAST(COUNT(DISTINCT ' || trim(columnName) || ') AS DECIMAL(38,18)) AS DDVAL FROM ' || trim(databasename) || '.' || trim(tablename) || ' ) GES
ON 1=1
WHERE CAST( (ANZAHL/GES)*100 AS DECIMAL(38,2)) > CAST( (1/DDVAL)*100 AS DECIMAL(38,2))
;' AS COMPRESSION_VALUES
from dbc.columns
where databasename = 'P_1111_A'
and columnname NOT IN ('X1_ID','X2_ID','X3_ID','X4_ID','X5_DT','X6_DT','X7_DT')
and substr(TableName,1,1) IN ('P','T')
and ColumnType NOT IN ('CV','TS','BO')
order by 1,2
;

You can use both register tables to put together a compression recommendation.
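For instance, a join of the two registers can produce a first shortlist; the HI threshold of 0.90 is an arbitrary assumption you should adapt, and the join on STATUS_DT assumes both registers were filled on the same day:

-- Shortlist of compression candidates: concentrated columns and their frequent values
SELECT c.TableName, c.ColumnName, h.HI_INDEX, c.COL_VALUE, c.VALUE_SHARE
FROM P_1111_A.DMnn_COL_COMP_CAND c
JOIN P_1111_A.DMnn_COL_HI_INDEX h
ON c.TableName = h.TableName
AND c.ColumnName = h.ColumnName
AND c.STATUS_DT = h.STATUS_DT
WHERE h.HI_INDEX >= 0.90
ORDER BY c.TableName, c.ColumnName, c.VALUE_SHARE DESC;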

Register tables allow for regular reinsertion of new layers of demographics via the STATUS_DT so that demographic shifts can be documented.

Before you prepare your final list for deployment, put your compression project in a long-term perspective:

How often will you be able to update your compression list?

How often do you expect it to be necessary?

How volatile is the column value set over time?

Yes-No flags will contain Yes or No for as long as they exist, whereas daily account balances or order numbers are transitory.

These considerations cannot be taken over by Teradata. Rather, they should emerge from interaction with business representatives and management or from your curiosity if the latter are out of reach.

 

A success story about Teradata compression to inspire you

About a year ago, we were walking through dozens of tables with only occasional or no compression at all. It was just a matter of time before this already extensive database ran out of space. Additional space could not have been provided easily in the order of magnitude we needed. First inspections of table contents indicated that there was potential for compression.

The results of introducing compression systematically were very satisfying:

While

  • no other changes than compression were implemented and
  • working in a production environment with daily growth,
  • over 1 TB space was saved in total with
  • an average reduction of 25% and as much as 76.2% on a table level.

The result gave us many months more for daily production!

It paved the way for other necessary activities that led to even more data inserted.

Collect Statistics in Teradata – Part 4

 Collect Statistics in Teradata – Maintenance

 

  1. Maintenance Milestones before you decide

Now that you have defined, collected and evaluated all statistics necessary and helpful, it is time to design the maintenance steps.

First, what are the speed and variance of your Teradata data warehouse objects? What is the most relevant time unit of measurement in it?

Your world is anywhere between a smooth, gradual growth or decline over time and a mostly unpredictable earthquake-like makeover you are never told in advance.

Second, how often are you granted time windows for statistics maintenance, how predictable and stable are they, and how much are they subject to debate? How much precedence can you give statistics maintenance over other tasks?

The answers to these questions decide how advanced your maintenance task could become.

Keep in mind that none of the following approaches is dictated by the nature of a collect statement itself, but by the environmental constraints you have.

2. Aging and Recollection

As statistics can become outdated in principle, the question is when to refresh them.

The need to give Teradata an update on table statistics results either from healthy development or from structural changes. The former is the typical quantitative evolution of table content over the ordinary course of business. The latter are business-driven table reconstructions such as the addition or dismissal of table attributes or a “regime change” in the data leading to new kinds of values, as well as any physical data model changes such as repartitioning or redefinition of indexes.

Here are recollection approaches we use(d) ourselves or have heard of:

  • Just-so recollection of the entire collect combination set every couple of days or when someone “feels the need”.
  • Complete recollection of the whole collect combination set every n days or weeks.
  • Partial recollection of the whole collect combination set during a calendar time window, continued in the next calendar time window until one round is done.
  • Triggered partial recollection based on the collection date.
  • Triggered partial recollection based on the relative table size change.

We do not recommend defining a fixed time or date schedule for entire recollections, as the mere passage of time is typically not the driving force behind what makes statistics outdated. Rather, a sudden table size change, the insertion or deletion of data layers containing new values for individual fields, or the recreation of a table due to physical data model changes are a call to action.

The triggered relative size approach is a (semi-)automated way that spreads the recollection over time so that smaller time windows – of which there are typically more – can be used. If done automatically, the most even spreading possible can be achieved.

But what are triggering criteria?

It is time to recollect when there is a considerable difference between the estimated figures for query subsets and steps and the real result values.

3. A Collection statement register

In order not to lose what we have considered the last set of collect combinations, we recommend creating a record table that holds this information.

CREATE MULTISET TABLE P_XYZ_A.DM_STATS_REGISTER, NO FALLBACK,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
TableName CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
CollectCombination VARCHAR(250) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL)
PRIMARY INDEX ( TableName, CollectCombination );

The information in this table looks like this:

T1111_ENTITY1  L_ID

T1111_ENTITY1  PARTITION,L_ID

T1111_ENTITY1  L_ID,C_DT

T1111_ENTITY1  PARTITION

T1111_ENTITY1  C_DT

4. A triggered partial recollection specification

Based on the register table content (see above), one could define views that return all statistics considered old or missing as well as collection statement templates for selective recollection.

The tricky part here is to derive the current statistics collection state out of the dbc tables.

Once you have the dbc input at hand, you need to decide when and what to collect or recollect.

For a recollection that happens when it is required and is spread evenly across time, we use the following procedure:

  • Store the table size and the date of that size in a separate register table as part of the daily activity.
  • Look at the dbc statistics information to determine the latest collection date of any one statistic for the table.
  • Compare the size of the table as of that latest collection date to its current size.
  • If that size difference is more than 10 percent, select the table and all its actual statistics for recollection.
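Below is a minimal sketch of such a size register and the trigger query. DBC.TableSizeV is the dictionary view holding CurrentPerm per table and AMP; the register table DM_SIZE_REGISTER, the helper table DM_STATS_LAST_RUN (which the recollection job would maintain with the date of the last run per table) and the 10 percent threshold are assumptions of this sketch, not a fixed recipe.

-- Daily size register: one row per table and day (names are placeholders)
CREATE MULTISET TABLE P_XYZ_A.DM_SIZE_REGISTER, NO FALLBACK
(
STATUS_DT DATE FORMAT 'YYYY-MM-DD' NOT NULL,
TableName CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
CurrentPermBytes DECIMAL(18,0))
PRIMARY INDEX ( TableName, STATUS_DT );

-- Daily load from the dictionary (CurrentPerm is summed over all AMPs)
INSERT INTO P_XYZ_A.DM_SIZE_REGISTER
SELECT CURRENT_DATE, t.TableName, t.PermSum
FROM ( SELECT TableName, SUM(CurrentPerm) AS PermSum
       FROM DBC.TableSizeV
       WHERE DatabaseName = 'P_XYZ_A'
       GROUP BY TableName ) t;

-- Candidate tables: size changed by more than 10 percent since the day of the
-- last recollection (LAST_STATS_DT comes from the assumed DM_STATS_LAST_RUN table)
SELECT n.TableName
FROM P_XYZ_A.DM_SIZE_REGISTER n
JOIN P_XYZ_A.DM_STATS_LAST_RUN r
ON n.TableName = r.TableName
JOIN P_XYZ_A.DM_SIZE_REGISTER o
ON o.TableName = r.TableName
AND o.STATUS_DT = r.LAST_STATS_DT
WHERE n.STATUS_DT = CURRENT_DATE
AND ABS(n.CurrentPermBytes - o.CurrentPermBytes) > o.CurrentPermBytes * 0.10;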

Key features of this recollection specification are:

  • Only a part of the entire set of tables is ready for recollection at any time
  • A table reappears for recollection until it is done. You do not have to keep in mind what you did, as the specification will remind you.
  • Tables are suggested according to their needs, with stable lookups never mentioned and space shooting stars coming up several times in a row.

There is a drawback, too:

If you recollect selectively and spontaneously, the chances are that outdated statistics remain undetected for a while, as the most recent recollection date is the base case.

Collect Statistics in Teradata – Part 3

Collect Statistics in Teradata – sample statistics or full?

Once a table has gotten vast and the time window for regular recollection is about to become smaller or a given one is overconsumed, it is time to reconsider one collection option: could a short sample over the rows be sufficient? Unless defined otherwise in your collect statement, sampling refers to a 1 to 10% draw of the total data over no more than a small number of AMPs, with the exact details depending on the Teradata release.
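For orientation, the difference is just the USING clause in the collect statement. The exact clause order and the available sample-size options vary between releases, so treat the following as a sketch and check the manual of your release:

-- Full statistics, as used throughout this series
COLLECT STATISTICS ON <TABLE> COLUMN <COLUMN_1>;

-- Sampled statistics, leaving the sample size to the system
COLLECT STATISTICS USING SAMPLE ON <TABLE> COLUMN <COLUMN_1>;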

Sampling will become a net advantage for collection combinations at the extremes of possible distributions, i.e. when data are spread more or less evenly across a large set of distinct values or when the collect combination is a constant. For anything in between, the collection yields result numbers that are not accurate enough and thus become a will-o'-the-wisp in the search for the optimal execution plan.

Think of it as a kind of opinion poll problem: asking just the next few persons that come your way gives you the right idea on how the population as a whole thinks when either there is a strong consensus at large or no two people with the same opinion in a community. In all other cases, your sample will misrepresent where the majority stands.

Just as with an opinion poll, consider that it makes no sense to sample statistics when your entire population (i.e. the table) is tiny, so that you might just as well go over the entire data. The rock bottom for non-sampling is where your table has fewer records than there are AMPs on the system.
