Introduction to Teradata Data Types and Data Modeling
Suppose you work in a well-organized environment where diligence and discipline prevail, and significant effort is put into designing things right in the beginning. In that case, the following post might seem outlandish to you. Else, read on.
In a more “experimental” table design landscape, you will soon encounter inconsistencies between column names, Teradata data types, and length decimal or format settings. Some affect table sizes and performance, so do not turn a blind eye to them even if you see yourself as a technician and administrator first. If you do not wish to be reduced to that role, here might be your ladder upstairs.
Evaluate the following aspects of your data warehouse landscape upon deployment and creation and when the opportunity for change is there. For every point mentioned, find out to what extent your tables and columns are neat and tidy (The Good), outright irregular but obviously and visibly so (The Bad), or deficient a second glance (The Ugly).
Cross-Table Name and Content Matches
Here, we have to differentiate between what I would like to call semantic, lexical, and technical mismatches.
The worst case is the semantic mismatch. Attributes containing prices and ratios are named equally across all tables, they look like prices and rates, but while they are used correctly, their content is different. Sometimes even in such a subtle way that you cannot detect it. That is the ugly case.
Lexical mismatches – inconsistent naming of related columns – are bad, but you can detect and correct them with little effort as they are in plain sight. Here is what you can look for:
- Prefix and suffix conventions
Which existing prefixes indicate the table content, not just remnants from input sources or early developer sketches?
- Abbreviation conventions
Is every date field called “_DT” in the end, or are there some “_DATE” s as well?
- Diverging labeling
Is what is called “_ID” in one place a “_NUM” or something with nothing attached to it in another place?
- Name switching along
Avoid renaming table columns to please particular end users. Solve that issue in view layers. Embellishment comes last.
- Content over Source
Name your tables and columns according to their content and not based on the source the data came from. Expect to be loading from different sources sooner or later.
- “Blind” and “disguised” attributes
Do not create reserve fields “just in case” to fill them with specific content when necessary. An even worse habit is to reuse abandoned attributes for different content. Will you remember one year from now what the “EXTRAFIELD1” contains and from when on the EXCHANGE_RATE was used for holding commission fees instead and why?
If you are on the safe side, all your columns carry straightforward names and have the same data type and settings within all tables.
Here is an example of where to start your investigation.
— What datatypes do FLAGS have?
sel TABLENAME, columnname, ColumnType, ColumnLength, DecimalTotaldigits, DecimalfractionalDigits, Nullable
and columnname NOT IN (‘X1_ID’,’X2_ID’,’X3_ID’,’X4_ID’,’X5_DT’,’X6_DT’,’X7_DT’)
and substr(TABLENAME,1,1) IN (‘P’,’T’)
and columnname LIKE ‘%FLAG%’
The following result shows flags of the same table with diverging character type and length settings.
P1111_XY_HIST FF_FLAG CV 32 ? ? Y
P1111_XY_HIST CC_FLAG CF 1 ? ? Y
Look at the content to decide whether “FLAG” is not the proper name or CV; 32 is the wrong data typesetting. Reducing it to CF,1 saves you 31+2 Bytes of potential space for every row. That alone, repeated over several attributes, can shrink your table by 5 or 10%!
Data Type Appropriateness
First, evaluate variability appropriateness. i.e., a column with fixed-length inputs should not be defined as a variable. The variability option costs you two bytes extra.
A wrong decision I saw several times is VARCHAR(1). This is the most costly character flag you can have because it costs you three bytes for a single Y/N stored.
Second, does it come in fractions? Do not define significant decimals when fractions of a unit never occur or are always .00 in the source.
Third, aim for unmixing input. In-house grading or rating input should either be all numbers or all characters.
Since you should allow name and description fields to have a certain length to cope with unusual entries, these fields should not become grotesque. Reduce a VARCHAR(2000) to VARCHAR(500) or even less if the largest entry after years of loading is 250 tokens long.
The same parsimony should prevail with Decimal fields. It is overdone to store a percentage rate with two digits that can never get out of the 0.00 to 100.00 range as DECIMAL(38,18). Not only is it extra space consumed for no reason, but it is also misleading from a business point of view.
The common motives for extremely long data types are fear of failure and uncertainty about the upcoming content. If you detect these reasons, take them as a warning sign that the project went into implementation too early because the source interface definitions are missing or incomplete. It will come back to you tenfold in the form of ongoing fixes and reporting user complaints.
Type-Specific Size Threshold Awareness
Be aware that a narrower data type parameter setting does not linearly translate into less space consumed. DECIMAL(19,6) and DECIMAL(20,6) cost the same, whereas DECIMAL(18,6) is one level down in the bytes reserved for entries. Look up the technical handbook overview to see which point you can be generous without extra space cost.
Character Set Suitability
Whenever you rate character settings, think of your most exotic named friend and how they would fit in. Loads can fail for not considering the customer base’s cross-culture nature. The ugly case here is a landscape of “rotten” entries, i.e., misconverted and now unreadable characters that happen without runtime fails.