Teradata Data Type Considerations – Part 2

Continuing the data type considerations from Part 1, here are more issues that demand attention from time to time.

Happenstance Nullability

Aim for nullability settings that comply with the data model and are unified across all tables. A mandatory column left open to null entries costs you both on the storage side and on the optimizer side: an extra presence bit per column has to be reserved, and a null option has to be considered when the SQL execution plan is built.
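As a minimal sketch (the database, table and column names are invented for illustration), keeping the DDL aligned with the logical model means declaring every mandatory column NOT NULL and leaving only genuinely optional attributes nullable:

    -- Hypothetical example: mandatory columns declared NOT NULL,
    -- only the genuinely optional attribute left nullable.
    CREATE TABLE sales_db.customer_order
    (
        order_id        INTEGER NOT NULL,   -- mandatory in the data model
        customer_id     INTEGER NOT NULL,   -- mandatory in the data model
        cancellation_dt DATE                -- truly optional, may remain NULL
    )
    PRIMARY INDEX (order_id);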

Content Cleanliness

Double-check, whenever you extract parts of a source table attribute into a target table attribute, that you do not create phantom variance and incongruent values. Assume a source attribute ORDER_NUMBER is structured differently depending on where the order originated:

AAA.123.456O or BBB.123.456. Assume further that the positions holding 456 represent a customer type code. The “O” stands for order; the application simply creates it that way. If you merely extract the number using a substring starting from position 5, your result will not be entirely numeric (which you might not even realize as you load into a character field!), and it will contain invalid customer type codes. Surprisingly wrong reporting results for customer type aggregations, despite no load failures and passed quantitative checks, will keep you busy at a later stage.
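As a sketch of a more defensive extraction (the table and column names are assumptions, the customer type code is taken from positions 9 to 11 of the example layout, and REGEXP_SIMILAR requires TD 14 or later), the load can verify that the extracted code really is numeric instead of silently propagating the trailing letter:

    -- Hypothetical cleansing step: isolate the customer type positions and
    -- reject anything that is not purely numeric instead of loading it silently.
    SELECT
        order_number,
        CASE
            WHEN REGEXP_SIMILAR(SUBSTRING(order_number FROM 9 FOR 3), '^[0-9]{3}$') = 1
            THEN SUBSTRING(order_number FROM 9 FOR 3)
            ELSE NULL   -- unusable input is flagged, not propagated
        END AS customer_type_cd
    FROM staging_db.source_order;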

Content Sobriety

This is one of the noblest yet hardest-to-achieve goals of a warehouse, at least as long as we humans are allowed to type values into source system applications. Nevertheless, we need to strive for it, just as we keep tidying up our apartments without expecting a one-time, final orderliness.

In contrast to cleanliness, which refers to the adequate processing of data, I like to call sobriety a state of redundancy-free, simplified attribute content: pruned down to the point where nothing more can be removed without loss of information.

First, values should be either all written out or all coded and abbreviated. A threefold categorization of “A,” “B” and “DEFAULT” could be converted to “A,” “B” and “D” without loss of information, while the size needed drops to a single character byte instead of several. Taking this further, could dozens of full-text entries be converted to official codes so that your table does not take in extra calories?

An anonymized real-world example:

An ENTITY_NAME, let’s say a branch name, is stored like this, thousands of times each:

“ABCD – ABCD Division of ABCD Industries Fantasia”

Why not cut it down to “ABCD” and store the meaning of the ENTITY_NAME in a tiny lookup table?

By this, I could cut the ENTITY_NAME down from VARCHAR(200) to CHAR(4). The savings can be estimated at (average length of the current entries - 4 + 2) bytes per record: the average VARCHAR payload plus its two-byte length indicator, minus the four bytes of the new CHAR(4). The table I derived this example from had 177 million records at the time I wrote these lines.
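A minimal sketch of that refactoring, with invented database and table names: the long descriptive name moves into a lookup table keyed by the four-character code, and reports join back to it only when the full name is needed.

    -- Hypothetical lookup table: the long descriptive name is stored exactly once.
    CREATE TABLE ref_db.entity_lkp
    (
        entity_cd   CHAR(4)      NOT NULL,
        entity_name VARCHAR(200) NOT NULL
    )
    UNIQUE PRIMARY INDEX (entity_cd);

    -- The large table keeps only the CHAR(4) code; the full name is joined in on demand.
    SELECT f.entity_cd, l.entity_name, SUM(f.some_measure)
    FROM   fact_db.branch_fact f
    JOIN   ref_db.entity_lkp   l ON f.entity_cd = l.entity_cd
    GROUP BY 1, 2;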

Second, summarize and prune. Although this involves changing the table content somewhat, it will not alter the meaning of the content, while fewer distinct values emerge. These then have a higher likelihood of being good candidates for compression, and, if relevant, you will need fewer partitions. Address and name fields are always a good place to start.

Derived from what I just saw on a production table, imagine a street field mapped 1 to 1 from the source where we have stored “Main Street,” “Main Str.” as well as “Mainstreet” thousands of times each. Once a standard form of street name representation is defined, the distinct values can probably be cut in half.
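Once a standard spelling is agreed on, the handful of surviving values also becomes a natural candidate for multi-value compression. A sketch, with invented names and compress values (compression of VARCHAR columns requires TD 13.10 or later):

    -- Hypothetical column definition after standardization: the most frequent
    -- street spellings are multi-value compressed and cost no row space at all.
    ALTER TABLE fact_db.customer_address
    ADD street_std VARCHAR(60) CHARACTER SET LATIN
        COMPRESS ('Main Street', 'Station Road', 'Church Lane');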

Even if you are of the opinion that the blanks or abbreviations have to be preserved because they carry a, however subtle, difference in meaning, a hundred different ways of saying “no input” do not have to be. Cut down all kinds of meaningless input to no more than two values: one meaning NULL and one meaning that the original input is unusable, be it because of refusal, pranks or severe typing errors that cannot be reverse-engineered.
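A sketch of that funneling during the load (the value lists, names and marker value are invented): every flavor of “no input” collapses to NULL, and clearly unusable input collapses to a single marker value.

    -- Hypothetical cleanup: many ways of saying "no input" become NULL,
    -- hopeless input becomes one marker value, real input passes through trimmed.
    SELECT
        CASE
            WHEN street_src IS NULL
              OR TRIM(street_src) IN ('', '-', '.', 'n/a', 'N/A', 'none', 'unknown')
                THEN NULL                -- meaning: no input
            WHEN REGEXP_SIMILAR(TRIM(street_src), '^[^A-Za-z0-9]+$') = 1
                THEN '#UNUSABLE#'        -- meaning: input present but beyond repair
            ELSE TRIM(street_src)
        END AS street_std
    FROM staging_db.customer_address_src;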

Propose and plan this, and chances are that your colleagues in the reporting and analytics department will add you as a new friend.

A Success Story to Inspire You

Recently I tested data type resizing across dozens of tables. These tables were already compressed and, for the most part, well partitioned and indexed. Unfortunately, they had been implemented either under significant uncertainty about the actual content or without due diligence in this respect. Pruning character and decimal fields closer to the reality of the table content saved 5 to 25% of the used table space, even for small tables, with extremes of 35% and more.
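A simple way to find such candidates (the object names are assumptions) is to compare the defined length of each character column in the data dictionary with the longest value actually stored:

    -- Hypothetical check before resizing: the defined length, from the dictionary ...
    SELECT ColumnName, ColumnType, ColumnLength
    FROM   DBC.ColumnsV
    WHERE  DatabaseName = 'fact_db'
      AND  TableName    = 'branch_fact'
      AND  ColumnType IN ('CF', 'CV');      -- CHAR and VARCHAR columns

    -- ... versus the longest value actually stored in the column.
    SELECT MAX(CHARACTER_LENGTH(entity_name)) AS actual_max_length
    FROM   fact_db.branch_fact;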

See also:
First the PDM, then Compression – Measures to reduce Resource Usage

Written by Paul Timar on March 28, 2014

2 COMMENTS

  1. Hello,

    These two parts of the data type considerations are really nice.
    It is not only the wasted space that we have in our tables (so more time for backups and so on) but, as you said, it looks like
    we do not know our data.

    One thing that I have to mention here is how our DDL affects our performance and statistics.

    Defining the columns as Unicode for simplicity reasons does not only affect the space of the table (since Unicode wants two
    bytes per character) but also the length that statistics can keep.
    Until TD 13.10 that length was 16, and a lot of wrong statistics have been defined because of this. From TD 14 on, this has been partially
    solved by setting the length to 25, or manually up to 50 (even when collecting stats on multiple columns:
    until TD 13 the acceptable length was the sum of them, limited to 16, but from TD 14 every column in the multicolumn statistic can go
    from 25 up to 50).

    Furthermore, as you mentioned, never define a VARCHAR column with a length much greater than the expected one. TD has an issue here with spool space during transactions. If the columns participate in GROUP BY / ORDER BY / UNION or in MultiLoad / FastLoad operations,
    then TD does not take the actual length of the VARCHAR column but the defined one. This leads TD to consume more CPU / IO
    even for a simple query. Imagine, in your example, having 1 B rows and the
    column defined as VARCHAR(200) while your actual maximum length is 50: you waste 150 bytes * 1 B rows (and if the column is defined as
    Unicode, * 2).

    Thanks.

    • Hello Dimitrios!

      Thank you for your refinement on the length and character set issue!
      You inspire me to investigate and test the relationship between character set choice, length and statistics a little deeper.
      Kind Regards,
      Paul
