Tag Archive: "data type"

The Costs of Decomposable Columns

What are Decomposable Columns?

Decomposable columns are columns that can be split into smaller, independently meaningful units. The presence of decomposable columns means that the physical data model is not fully normalized.

If a decomposable column is not used for data retrieval, it has no negative impact on performance.

Still, if the column is used for data access, it should be decomposed into several columns which can be used independently for indexed access. You will get better performance because fewer full table scans are needed.

Additionally, if the columns created from a decomposable column are NUSI columns, the possibilities for NUSI bitmapping increase.

Here is an example:

-- Select all Private Gold Products being sold by our company

SELECT * FROM Transaction WHERE ProductCode LIKE 'PRIVGOLD%';

-- Select all Private Products being sold by our company

SELECT * FROM Transaction WHERE ProductCode LIKE 'PRIV%';

In both cases, a full table scan is typically required to access the data.

By decomposing the column ProductCode, we give the Optimizer a chance to use existing NUSIs:

CREATE INDEX COLUMN(ProductSegment) ON Transaction;
CREATE INDEX COLUMN(ProductLevel) ON Transaction;

SELECT * FROM Transaction WHERE ProductSegment = 'PRIV' AND ProductLevel = 'Gold';

Often we can’t change the physical data model, and we have to find workarounds to avoid the negative performance impact of decomposable columns. Here is a simple but efficient solution:

SELECT p.ProductDescription,t.ProductCode
FROM
Transaction t
INNER JOIN
Product p
ON
t.ProductCode = p.ProductCode
WHERE
p.ProductCode LIKE '%Gold%'
;

Table Product is just a lookup containing all distinct ProductCodes. The Optimizer will probably do a full table scan on the Product table with a subsequent merge join to the Transaction table. As the Product table is small, the cost of the full table scan is negligible.
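
A minimal sketch of how such a lookup could be built (assuming the distinct codes can simply be derived from the Transaction table; in practice the lookup may carry additional descriptive columns such as ProductDescription):

-- Build the small lookup once, or refresh it as part of the load process
CREATE TABLE Product AS
(
SELECT DISTINCT ProductCode
FROM Transaction
) WITH DATA
PRIMARY INDEX (ProductCode);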

Still, the first choice is to avoid the implementation of decomposable columns whenever possible.

Questions?
If you have any questions about all this, please ask in the comments! I’ll be paying close attention and answering as many as I can. Thank you for reading. Whatever this blog has become, I owe it all to you.


First the PDM, then Compression


The importance of the PDM

Compression is a popular way of saving IOs because it allows more rows to be packed into each data block. Data blocks are the smallest unit transferred between disk and main memory, and this transfer is the most expensive operation in any database system.

Compression can be done automatically by the system (block-level compression) or applied by the user (MVC). Another option is to use compression software, such as the one offered by Prise Ltd. (check it out at https://www.prisetools.com/products/compress-wizard).

While developers tend to use Multivalue Compression (MVC), other opportunities to cut resource usage are often overlooked or left out for various reasons. Very often the PDM, i.e. the proper implementation of data types and character sets, is neglected.

First the PDM, then Compression

Compression depends on the underlying data types. Therefore, before starting any compression activity, we have to make sure that the correct data types and character sets are used in the PDM. Otherwise, any redesign of the PDM will force us to rework compressions.

While the correct choice of data types and compression both help to cut resource usage, the right data type and character set choice even has some advantages that compression doesn't have.

The right data types reduce resource usage independently of the data content, while compression has no space or resource usage benefit for unique columns.

The choice of correct data types always helps to cut resource usage (by allowing more rows to be packed into each data block).

The selection of the right data types and character sets is part of physical data modeling. Unfortunately, this step is often skipped because it causes costs early in the implementation process.

Many times, data types are taken over directly from the stage area into the PDM of your core model, and from there they are moved directly into the data marts.

While this approach avoids costs at the beginning of the implementation process, we don't recommend it. The costs are only deferred to a later stage, when it is much harder to change the PDM.

Just because a column is defined as DECIMAL(38,n) in the source system doesn't mean it will ever hold such huge numbers. Remember, the space usage of a decimal ranges from 1 to 16 bytes, depending on its precision. The same is true for integer types (BYTEINT -> 1 byte, BIGINT -> 8 bytes).

In my experience, many decimals could be stored in just 4 bytes (DECIMAL(9) or smaller), but because of laziness, time pressure, and cost reasons, developers often store them in a much larger decimal than needed, consuming 8 or even 16 bytes.

A single decimal, improved in the way described above, needs at least 4 bytes less per table row. Consider the possible space reduction for a table containing billions of rows, and don't forget that this saving comes before applying any compression!

Similarly, using character set LATIN instead of UNICODE halves the space usage of a character field. Often I even see simple codes, such as "ACTIVE," "SUSPENDED," or "CANCELED" stored as UNICODE characters!
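
As a rough sketch (table and column names are purely illustrative; the byte counts in the comments follow Teradata's documented DECIMAL and character storage sizes), the difference between an oversized and a right-sized definition looks like this:

-- Oversized definitions taken over 1:1 from the source system
CREATE MULTISET TABLE Sales_Oversized
(
SalesId BIGINT NOT NULL,
SalesAmount DECIMAL(38,2),                                      -- 16 bytes per row
StatusCode VARCHAR(20) CHARACTER SET UNICODE NOT CASESPECIFIC   -- 2 bytes per character
) UNIQUE PRIMARY INDEX (SalesId);

-- Right-sized definitions after checking the real data demographics
CREATE MULTISET TABLE Sales_RightSized
(
SalesId BIGINT NOT NULL,
SalesAmount DECIMAL(9,2),                                       -- 4 bytes per row
StatusCode VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC     -- 1 byte per character
) UNIQUE PRIMARY INDEX (SalesId);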

As a Teradata Consultant, you should never leave out this important optimization step. Yes, it takes time when we create the PDM, but it will pay off in the end.

I think I have given you enough arguments to insist on these activities. Don't forget about them in your next project!

If you have any questions about all this, please ask in the comments! I’ll be paying close attention and answering as many as I can. Thank you for reading. Whatever this blog has become, I owe it all to you.


The Ultimate Teradata CREATE TABLE Checklist


The creation of tables is an ever-recurring task, so I thought it was time to offer a checklist.

1. Will you define a Primary Index?

The Primary Index is used for various purposes. It is usually designed for even row distribution and serves as the main access path to your data.

But there are scenarios where it can be beneficial to create a table without a primary index (NOPI).
Staging is one example: by defining stage tables without a primary index, we can speed up load processing (for example, FastLoad does not need to sort and distribute rows in the final load phase). Whether this improves overall load performance depends, of course, on the subsequent steps; you have to consider if and how they are slowed down by the missing primary index.

You can find more details about the usage of NOPI tables here: NOPI Tables on Teradata
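
A minimal sketch of such a NOPI stage table (table and column names are illustrative):

-- Stage table without a primary index: rows are appended without hashing/sorting
CREATE MULTISET TABLE Stage_Customer
(
CustomerId BIGINT,
CustomerName VARCHAR(200) CHARACTER SET LATIN NOT CASESPECIFIC
) NO PRIMARY INDEX;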

2. Will your Primary Index be skewed?

Almost always, your main goal is to have a Primary Index which distributes the table rows evenly across all AMPs. Consequently, it is necessary to know your data. Don't rely on small amounts of test data as the basis for your primary index decision. The data demographics of production data can differ significantly. I have experienced this several times; here is one example of an epic fail from real life:

Once upon a time in a European telecommunication company: the main customer table's primary index was mapped 1:1 from the corresponding source system column, let's call it "customer_id". During the development phase of the project, several files with customer test data were delivered, supporting the chosen primary index. As rows were distributed evenly across all AMPs, load performance was excellent.

The whole physical data model was optimized to support shop staff looking up the customer status during calls (i.e. tactical workload). Furthermore, the customer table could be joined very efficiently with other related tables, such as call details, to support reporting.

Everything looked perfect until we entered the production phase. On the first day, the load time for the customer table increased from about 30 minutes to 6 hours, with a subsequent "no more disk space" error.

What happened? During the test phase, all delivered files were related to “contract” customers. No one ever thought about testing “prepaid” customers. It turned out that all prepaid customers were assigned to a single “customer_id” in the source system…

The result was a massive redesign, a project delay of several months, and a complete loss of reputation.
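
A quick hash demographics check on production-like data would have revealed the problem before go-live; a sketch (table and column names are illustrative):

-- Rows per AMP for the primary index candidate customer_id
SELECT
HASHAMP(HASHBUCKET(HASHROW(customer_id))) AS AMP_Number,
COUNT(*) AS Rows_On_AMP
FROM Customer_Source
GROUP BY 1
ORDER BY 2 DESC;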

In rare cases, you would like to “design” a skewed primary index, for example, if you want to force all rows into one data block of one AMP. This way you could achieve rowhash locking instead of table locks and other advantages. For details, please read our article All Rows on one AMP by Design.

3. Did you choose the correct CHARACTER SET for each Column?

While it saves time to leave out such considerations, in the end it never pays off. Of course, you could define every character column as UNICODE. But such an approach comes with two disadvantages: you waste valuable disk space and decrease the performance of requests against these tables.

If you are unlucky, such a sloppy approach will lead to a disaster. Here is another example from real life. It recently happened at a major European bank:

We had to deal with tables, each of them carrying several hundred columns. Load times (FastLoad) were horrible: 6 hours for a ridiculous 50GB of data!

When I was called into the project to analyze these performance issues, it turned out that every single column was defined with character set UNICODE. Some simple calculations revealed that the data rows were huge: only 1-2 rows fit into each data block.

Put briefly, a redesign of dozens of table DDLs was required. By changing all columns to character set LATIN, and keeping only a few as UNICODE, load times could be reduced drastically, as more rows could be packed into each block.

This article shows you how significant the impact of wrongly chosen character sets can be: Teradata Tuning Success – The Best ever
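
If you need to find the affected columns, the data dictionary can help; a sketch, assuming DBC.ColumnsV exposes the character type as CharType (2 = UNICODE, 1 = LATIN) and using an illustrative database name:

SELECT DatabaseName, TableName, ColumnName
FROM DBC.ColumnsV
WHERE CharType = 2                      -- 2 = UNICODE, 1 = LATIN
AND DatabaseName = 'MyDatabase'
ORDER BY TableName, ColumnName;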

4. Do Data Types and CHARACTER SETS match across all Tables?

Data type consistency is a crucial point but often overlooked. Data type and character set conversions during query execution are expensive operations (CPU time) and may prevent AMP-local activities (such as joins). Rows from different tables joined via their primary index are only co-located on the same AMP if the data types match. The same holds true for character sets.
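
A sketch of what "matching" means in practice (illustrative tables): the join column has the same data type and character set in both tables, so rows with equal values hash to the same AMP.

-- Both tables define CustomerId identically, so a join on it stays AMP-local
CREATE TABLE Customer
(
CustomerId INTEGER NOT NULL,
CustomerName VARCHAR(200) CHARACTER SET LATIN NOT CASESPECIFIC
) UNIQUE PRIMARY INDEX (CustomerId);

CREATE TABLE CallDetail
(
CallId BIGINT NOT NULL,
CustomerId INTEGER NOT NULL   -- same type as in Customer; a DECIMAL(18,0) or UNICODE variant here would break co-location
) PRIMARY INDEX (CustomerId);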

5. Can the Primary Index be UNIQUE?

I recommend always defining your primary index as unique whenever possible. Again, it's a very sloppy way of developing if you skip uniqueness just because it's more convenient ("no problems with duplicates").

Hint: This may not always be possible, for example in the case of row partitioned tables, if the partition column is not included in the primary index.

There are several (often not obvious) disadvantages if we skip uniqueness. Here are some examples:

  • NUPI tables drastically reduce the possibilities of the Teradata Optimizer to use the Nested Join strategy.
  • No duplicate row checks are needed in the case of SET tables, as the UPI guarantees uniqueness.

6. Can the Table be a MULTISET Table?

If you don't have to make sure that duplicate rows are rejected, always use MULTISET tables. They don't need the costly duplicate row check when data is inserted. Furthermore, in the case of a skewed primary index, the performance difference when adding rows can become huge.

You can find more details about SET and MULTISET tables here: Teradata Multiset versus Set Tables
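
A sketch of both points (all names are illustrative): a MULTISET table avoids the duplicate row check entirely, and where a SET table is required, a unique primary index makes the check unnecessary anyway.

CREATE MULTISET TABLE Sales_Fact
(
TransactionId BIGINT NOT NULL,
SalesDate DATE FORMAT 'YYYY-MM-DD'
) PRIMARY INDEX (TransactionId);            -- MULTISET: no duplicate row check on INSERT

CREATE SET TABLE Customer_Master
(
CustomerId BIGINT NOT NULL,
CustomerName VARCHAR(200) CHARACTER SET LATIN
) UNIQUE PRIMARY INDEX (CustomerId);        -- SET table, but the UPI replaces the duplicate row check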

7. When is the best Time to add an Index?

If you have secondary indexes or join indexes defined on your table, it's always a good idea to drop them before inserting large amounts of data into already populated tables. You can recreate the indexes afterward, and it will be much cheaper, especially if something goes wrong, like in the following example:

Recently, I worked on the following problem:

Somebody had executed an insert statement. As it was still not finished after hours, the same person decided to abort the session. As a result, a rollback started.
I was informed the next day, when the rollback had already been running for 12 hours…

I rapidly collected the facts:

  • The inserted primary index values were skewed; all rows went to the same AMP (an enormous amount of NULL values)
  • The table was a SET table; this was further worsening the situation
  • There were 4 NUSIs defined on the table

I decided to cancel the rollback as it was clear that it would run for at least 30 more hours!

Without the 4 NUSIs, the rollback would have been much faster. But maintaining five rows per inserted table row (4 index sub-table rows plus the base table row) turned this into a disaster.
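
A sketch of the drop/load/recreate pattern for a NUSI (index, table, and column names are illustrative):

DROP INDEX Nusi_Status ON Sales;                   -- drop the NUSI before the mass insert

INSERT INTO Sales
SELECT * FROM Sales_Stage;                         -- load without index maintenance (and without NUSI rollback overhead)

CREATE INDEX Nusi_Status (StatusCode) ON Sales;    -- recreate the NUSI afterwards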

8. Temporary Tables: Can Columns be made atomic?

If you are working with any temporary tables (volatile, global temporary or physical), you should always consider improving all areas mentioned above (data types, character sets, primary index, etc.) to support your workload in the best possible way.

Furthermore, it's good practice to split columns which contain different types of information into separate columns. Make them atomic. Splitting will improve join performance, as you can avoid join conditions such as "ON TABLE1.PK = SUBSTRING(TABLE2.PK,1,5)".
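
A sketch of splitting a combined code while populating a volatile table (names, lengths, and positions are illustrative):

CREATE MULTISET VOLATILE TABLE Order_Tmp
(
ProductSegment VARCHAR(5) CHARACTER SET LATIN NOT CASESPECIFIC,
ProductLevel VARCHAR(5) CHARACTER SET LATIN NOT CASESPECIFIC,
OrderAmount DECIMAL(9,2)
) PRIMARY INDEX (ProductSegment, ProductLevel) ON COMMIT PRESERVE ROWS;

INSERT INTO Order_Tmp
SELECT
SUBSTRING(OrderCode FROM 1 FOR 5),     -- atomic part 1 of the combined code
SUBSTRING(OrderCode FROM 6 FOR 5),     -- atomic part 2 of the combined code
OrderAmount
FROM Order_Source;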

9. Does the Data Block Size support your Rows?

Since Teradata 14, we can define data blocks as large as 1 MB. Increasing the data block size can bring a substantial performance improvement if you are dealing with very wide rows or have a lot of strategic queries scanning all rows.
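
A sketch of how the block size can be requested at table level, assuming the MAXIMUM DATABLOCKSIZE table option is available on your release (whether 1 MB blocks are actually usable depends on the system configuration; all names are illustrative):

CREATE MULTISET TABLE Sales_History ,
MAXIMUM DATABLOCKSIZE                  -- use the largest data block size the system allows
(
SalesId BIGINT NOT NULL,
SalesText VARCHAR(20000) CHARACTER SET LATIN
) PRIMARY INDEX (SalesId);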

Feel free to comment and add extra points to the checklist. I will merge them into the article.

Teradata Tuning Success – The Best ever

As Teradata SQL tuners, most of the time we have to deal with skew. In my experience, up to 90% of SQL performance issues are related to skew. But this article is not about skew. Sometimes a slight change is enough to produce miracles.

In this article, I show the possible impact of wrongly chosen character sets on SQL performance, a topic which is often overlooked or ignored.

How it all began

Recently, while checking the QryLog table, my attention was drawn to a DELETE statement. It was consuming a huge amount of CPU seconds and causing a lot of IOs. In fact, each day this one statement caused more IOs than the rest of the workload combined.

Even worse, it looked like the resource consumption was growing rapidly (on average, 50 million rows were added each day), as you can see in the chart below:

[Chart: daily resource consumption of the DELETE statement]

The mentioned DELETE statement executes against a huge table “TheTable”.  “TheTable” contains almost 3 billion rows. A subquery is used to identify the rows to be deleted (please note that table “TheVolatileTable”, being used for filtering, contains exactly 1 row):

DELETE  FROM TheDatabase.TheTable
WHERE (COL1, COL2, COL3, COL4) IN ( SELECT COL1, COL2, COL3, COL4 FROM TheVolatileTable );

Below you can see the table definitions of both involved tables:

CREATE MULTISET TABLE TheDatabase.TheTable (
PK BIGINT NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1 MINVALUE -999999999999999999 MAXVALUE 999999999999999999 NO CYCLE),
Col1 VARCHAR(3) CHARACTER SET LATIN NOT CASESPECIFIC,
Col2 VARCHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC,
Col3 DATE FORMAT 'YYYY-MM-DD',
Col4 CHAR(12) CHARACTER SET LATIN NOT CASESPECIFIC,
RefDat DATE FORMAT 'YYYY-MM-DD'
) PRIMARY INDEX ( PK )
PARTITION BY RANGE_N(RefDat BETWEEN '2010-01-01' AND DATE + INTERVAL '365' DAY EACH INTERVAL '1' DAY, NO RANGE, UNKNOWN);

SELECT CAST(COUNT(*) AS BIGINT) FROM TheDatabase.TheTable;
Count(*)
2.928.780.765 -> almost 3 billion rows!

This is the table used for filtering the rows to be deleted (it is populated with exactly 1 row):

CREATE MULTISET VOLATILE TABLE TheVolatileTable AS
(
SELECT
'CODE' AS Col1
,CASE WHEN Col5 = 'X' THEN 'Y' ELSE 'N' END AS Col2
,CAST(SUBSTR(TheDate,1,4) || '-' || SUBSTR(TheDate,5,2) || '-' || SUBSTR(TheDate,7,2) AS DATE FORMAT 'YYYY-MM-DD') AS Col3
,Col5 AS Col4
FROM DummyTable GROUP BY 1,2,3,4
) WITH DATA ON COMMIT PRESERVE ROWS;

SELECT COUNT(*) FROM  TheVolatileTable;
Count(*) 1 -> exactly 1 row

The Initial Situation

In the original setup, the volatile table is created with this syntax:
CREATE TABLE AS (<SUBQUERY>);

This is a convenient method for creating and populating a table in one step. Still, the author of the query overlooked that it's important to pay attention to data types and character sets. On the affected Teradata system, character columns are created by default with character set UNICODE. Unfortunately, the table "TheTable" defines these character columns as character set LATIN, causing a mismatch.

SHOW TABLE TheVolatileTable

CREATE MULTISET VOLATILE TABLE TheVolatileTable
(
Col1 VARCHAR(3) CHARACTER SET UNICODE NOT CASESPECIFIC,
Col2 VARCHAR(1) CHARACTER SET UNICODE NOT CASESPECIFIC,
Col3 DATE FORMAT 'YYYY-MM-DD',
Col4 CHAR(12) CHARACTER SET UNICODE NOT CASESPECIFIC)
NO PRIMARY INDEX ON COMMIT PRESERVE ROWS;

The resulting execution plan for the bad performing DELETE statement looks like this:

EXPLAIN DELETE   FROM TheDatabase.TheTable
WHERE (COL1, COL2, COL3, COL4) IN
(
SELECT COL1, COL2, COL3, COL4 FROM TheVolatileTable
);

1) First, we lock a distinct TheDatabase.”pseudo table” for write
on a RowHash to prevent global deadlock for
TheDatabase.TheTable.
2) Next, we lock TheDatabase.TheTable for write.
3) We execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step from
TheDatabase.TheTable by way of an all-rows scan
with no residual conditions into Spool 2 (all_amps)
(compressed columns allowed) fanned out into 50 hash join
partitions, which is built locally on the AMPs.  The size of
Spool 2 is estimated with high confidence to be 2,928,780,765
rows (123,008,792,130 bytes).  The estimated time for this
step is 32.77 seconds.

2) We do an all-AMPs RETRIEVE step from
TheVolatileTable by way of an all-rows scan with no
residual conditions into Spool 4 (all_amps), which is
redistributed by the hash code of (
TheVolatileTable.COL4,
TheVolatileTable.COL3,
TheVolatileTable.COL2,
TheVolatileTable.COL1) to all AMPs.  Then
we do a SORT to order Spool 4 by the sort key in spool field1
eliminating duplicate rows.  The size of Spool 4 is estimated
with high confidence to be 2 rows (186 bytes).  The estimated
time for this step is 0.02 seconds.

4) We do an all-AMPs RETRIEVE step from Spool 4 (Last Use) by way of
an all-rows scan into Spool 3 (all_amps) (compressed columns
allowed) fanned out into 50 hash join partitions, which is
duplicated on all AMPs.  The size of Spool 3 is estimated with
high confidence to be 780 rows (72,540 bytes).

5) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of an
all-rows scan, which is joined to Spool 3 (Last Use) by way of an
all-rows scan.  Spool 2 and Spool 3 are joined using a inclusion
hash join of 50 partitions, with a join condition of (
“((TRANSLATE((COL1 )USING LATIN_TO_UNICODE))=
COL1) AND (((TRANSLATE((COL2 )USING
LATIN_TO_UNICODE))= COL2) AND ((COL3 = COL3) AND
((TRANSLATE((COL4 )USING LATIN_TO_UNICODE))= COL4 )))”).
The result goes into Spool 1 (all_amps), which is redistributed by
the rowkey of (TheDatabase.TheTable.ROWID) to all
AMPs.  Then we do a SORT to partition Spool 1 by rowkey and the
sort key in spool field1 eliminating duplicate rows.  The size of
Spool 1 is estimated with low confidence to be 12,253 rows (
220,554 bytes).  The estimated time for this step is 0.50 seconds.

6) We do an all-AMPs MERGE DELETE to
TheDatabase.TheTable from Spool 1 (Last Use) via the
row id.  The size is estimated with low confidence to be 12,253
rows.  The estimated time for this step is 1.87 seconds.

7) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.

-> No rows are returned to the user as the result of statement 1.

There are several issues with this execution plan. First of all, the join columns of the huge table “TheTable” have to be converted from LATIN to UNICODE. This has to be done for each row: 3 billion times!

This is an expensive operation, which is consuming a lot of CPU time.

But that is not all: the huge table is spooled into 50 hash partitions in preparation for the subsequent join. A lot of resources are needed to spool such a huge table containing billions of rows. Unfortunately, this is the preferred option for the Optimizer, because it has to deal with the character set conversion.

Below you can see the amount of IOs and CPU seconds being used:

TotalIOCount: 3.585.081.549,00
AMPCPUTime: 71.510,38
SpoolUsage: 1.397.760,00

The Tuning Approach

The tuning of this query was simple, but very effective. By manually creating the volatile table, I was able to choose character set LATIN for the character columns, matching the character sets of both tables:

CREATE MULTISET VOLATILE TABLE TheVolatileTable
(
Col1 VARCHAR(3) CHARACTER SET LATIN  NOT CASESPECIFIC,
Col2 VARCHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC,
Col3 DATE FORMAT 'YYYY-MM-DD',
Col4 CHAR(12) CHARACTER SET LATIN NOT CASESPECIFIC
) PRIMARY INDEX (Col1, Col2, Col3, Col4) ON COMMIT PRESERVE ROWS;

Below you can see the improved execution plan:

EXPLAIN DELETE  FROM TheDatabase.TheTable
WHERE (COL1, COL2, COL3, COL4) IN
(
SELECT COL1, COL2, COL3, COL4 FROM TheVolatileTable
);

1) First, we lock a distinct TheDatabase.”pseudo table” for write
on a RowHash to prevent global deadlock for
TheDatabase.TheTable.

2) Next, we lock TheDatabase.TheTable for write.

3) We do an all-AMPs RETRIEVE step from TheVolatileTable by
way of an all-rows scan with no residual conditions into Spool 3
(all_amps), which is built locally on the AMPs.  Then we do a SORT
to order Spool 3 by the sort key in spool field1 (
TheVolatileTable.Col1,
TheVolatileTable.Col2,
TheVolatileTable.Col3,
TheVolatileTable.Col4) eliminating duplicate rows.
The size of Spool 3 is estimated with high confidence to be 2 rows
(124 bytes).  The estimated time for this step is 0.02 seconds.

4) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of
an all-rows scan into Spool 2 (all_amps) (compressed columns
allowed), which is duplicated on all AMPs.  The size of Spool 2 is
estimated with high confidence to be 780 rows (48,360 bytes).

5) We do an all-AMPs JOIN step from TheDatabase.TheTable
by way of an all-rows scan with no residual conditions, which is
joined to Spool 2 (Last Use) by way of an all-rows scan.
TheDatabase.TheTable and Spool 2 are joined using a
inclusion dynamic hash join, with a join condition of (
“(TheDatabase.TheTable.Col1 =
Col1) AND ((TheDatabase.TheTable.Col2
= Col2) AND ((TheDatabase.TheTable.Col3 =
Col3) AND (TheDatabase.TheTable.Col4
=Col4 )))”).  The result goes into Spool 1 (all_amps), which
is built locally on the AMPs.  Then we do a SORT to partition
Spool 1 by rowkey and the sort key in spool field1 eliminating
duplicate rows.  The size of Spool 1 is estimated with low
confidence to be 12,253 rows (220,554 bytes).  The estimated time
for this step is 13.06 seconds.

6) We do an all-AMPs MERGE DELETE to
TheDatabase.TheTable from Spool 1 (Last Use) via the
row id.  The size is estimated with low confidence to be 12,253
rows.  The estimated time for this step is 1.87 seconds.

7) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.

-> No rows are returned to the user as the result of statement 1.

In the improved execution plan, the Teradata Optimizer is doing a dynamic hash join. Dynamic hash joins don’t spool to rehash the join columns, but calculate the row hash “on the fly”, row by row. No spooling of the 3 billion rows is required!

This small change reduced resource consumption immensely:

             IO Count           CPU Time    SpoolUsage
Tuned        1.335.246,00       2.916,23    599.040,00
Original     3.585.081.549,00   71.510,38   1.397.760,00
Reduced to   0,04%              4,08%       42,86%

These are impressive numbers: 99,96% less IOs and about 96% less CPU seconds. Even spool usage has been halved!

This tuning success shows how it is possible to achieve huge improvements with very little effort. Always identify your worst workload first and try to improve it. Tuning your worst queries should always be one of your major goals.


Why you should use the Teradata CLOB instead of VARCHAR()


Wide textual information is often stored in columns of data type VARCHAR(), which can hold up to 32,000 characters (UNICODE) or 64,000 characters (LATIN). The effect of storing such large texts is that only a small number of rows fits into each data block; full table scans therefore become more expensive, because more data blocks are required to store the whole set of data.

Luckily, there is another way to store extensive textual information:

The Teradata CLOB

Character large objects (data type CLOB) can be a great alternative for storing comprehensive textual information. The data of a CLOB column is stored in a separate sub-table. Separating it from the base table can give a significant performance advantage for queries which access the table with a full table scan: more rows fit into each data block.

As a consequence, fewer data blocks have to be moved from disk to FSG cache, vastly reducing the number of disk IOs.

Here is an example table; we assume that column "Desc" of each table row contains extensive textual information (as DESC is a reserved word, the column name has to be quoted):

CREATE TABLE Varchar_Table
(
ID INTEGER NOT NULL,
Col1 CHAR(01),
"Desc" VARCHAR(30000)
) PRIMARY INDEX (ID);

In the absence of a secondary index, all data blocks have to be moved from disk to the FSG cache, even if we are only interested in the content of column ID. As we are assuming that all rows contain extensive textual information in column "Desc", this will be expensive from a resource consumption point of view:

SELECT ID FROM Varchar_Table WHERE Col1 = 'X';

By changing the data type from VARCHAR() to CLOB, resource consumption (disk IOs) can be decreased significantly; many more rows are packed into each data block. Therefore, fewer blocks have to be moved from disk into the FSG cache:

CREATE TABLE CLOB_Table
(
ID INTEGER NOT NULL,
Col1 CHAR(01),
"Desc" CHARACTER LARGE OBJECT
) PRIMARY INDEX (ID);

SELECT ID FROM CLOB_Table WHERE Col1 = 'X';  -- fewer blocks have to be moved from disk to FSG cache

No performance advantage is measurable if the CLOB column is part of the select list. In this case, the base table plus the sub-table holding the CLOB data have to be accessed:

SELECT ID, "Desc" FROM CLOB_Table WHERE Col1 = 'X';

 


Teradata Data Type Considerations – Part 2

Continuing the Data Type considerations, here are more issues that demand attention from time to time.

Happenstance Nullability

Aim for nullability settings that comply with the data model and are unified across all tables. A mandatory column that is nevertheless left open for null entries costs you both on the storage side and on the optimizer side: an extra presence bit has to be reserved per column, and a null option has to be considered when building the SQL execution plan.

Content Cleanliness

Double-check whenever you extract parts of one source table attribute into a target table attribute so that you do not create phantom variance and incongruence of values. Assume a source attribute ORDER_NUMBER can be structured differently depending on where the order originated from:

AAA.123.456O or BBB.123.456. Assume further that the positions holding "456" represent a customer type code, and the trailing "O" stands for order; the application creates it that way. If you merely extract the number using a substring starting at position 5, your result will not be entirely numeric (which you might not even realize, as you load it into a character field!), and you end up with invalid customer type codes. Surprisingly wrong reporting results for customer type aggregations, despite no load failures and passed quantitative checks, will keep you busy at a later stage.
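
A sketch of a more defensive extraction, assuming the customer type code sits at positions 9-11 and REGEXP_REPLACE is available (Teradata 14 and later); all table and column names are illustrative:

SELECT
Order_Number,
-- keep digits only, so a trailing order indicator such as 'O' cannot leak into the code
CAST(REGEXP_REPLACE(SUBSTRING(Order_Number FROM 9 FOR 4), '[^0-9]', '') AS SMALLINT) AS Customer_Type_Code
FROM Order_Source;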

Content Sobriety

This is one of the noblest yet hardest-to-achieve goals of a warehouse, at least as long as we humans are allowed to type in values in source system applications. Nevertheless, we need to reach out for it, just as we tidy up our apartments, not in the expectation of a one-time final orderliness.

In contrast to cleanliness, which refers to the adequate processing of data, I like to call sobriety a state of redundancy-free and simplified attribute content down to the point where it cannot be pruned further without loss of information.

First, values should be either all written out or all coded and abbreviated. A threefold categorization of "A," "B," and "DEFAULT" could be converted to "A," "B," and "D" without loss of information, but the size requirement drops to one character byte instead of several. Taking this further, could dozens of full-text entries be converted to official codes so that your table does not take in extra calories?

An anonymized real-world example:

An ENTITY_NAME, let’s say a branch name, is stored like this, thousands of times each:

“ABCD – ABCD Division of ABCD Industries Fantasia”

Why not cut it down to “ABCD” and store the meaning of the ENTITY_NAME in a tiny lookup table?

By this, I could cut down the ENTITY_NAME from VARCHAR(200) to CHAR(4). The savings can be estimated as (average length of the current entries + 2 bytes of VARCHAR overhead - 4) bytes per record. The table I derived this example from had 177 million records at the time I wrote these lines.
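
A sketch of the lookup approach (all names are illustrative):

-- Tiny lookup table storing the full name exactly once per code
CREATE SET TABLE Entity_Lookup
(
Entity_Code CHAR(4) CHARACTER SET LATIN NOT NULL,
Entity_Name VARCHAR(200) CHARACTER SET LATIN
) UNIQUE PRIMARY INDEX (Entity_Code);

-- The 177-million-row table now only carries the 4-character code;
-- the full name is joined in when it is really needed
SELECT f.Entity_Code, l.Entity_Name, f.Amount
FROM Fact_Table f
INNER JOIN Entity_Lookup l
ON f.Entity_Code = l.Entity_Code;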

Second, summarize and prune. Although this involves changing the table content somewhat, it will not alter its meaning, while fewer distinct values emerge. These then have a higher likelihood of being candidates for compression. If relevant, you will also need fewer partitions. Address and name fields are always a good place to start.

Derived from what I just saw on a production table, imagine a street field mapped 1 to 1 from the source where we have stored “Main Street,” “Main Str.” as well as “Mainstreet” thousands of times each. Once a standard form of street name representation is defined, the distinct values can probably be cut in half.

Even if you are of the opinion that the blanks or abbreviations have to be preserved because they carry a subtle difference in meaning, a hundred different ways of saying "no input" don't have to. Cut down all kinds of meaningless input to no more than two values: one meaning NULL and one meaning that the original input is unusable, be it because of refusal, pranks, or severe typing errors that cannot be reverse-engineered.

Propose and plan this and the chances are that your reporting and analytics department colleague will add you as a new friend.

A Success Story to Inspire You

Recently I tested a data type resizing over dozens of tables. These tables were already compressed and mostly partitioned and indexed well. Unfortunately, they were implemented under either significant uncertainty concerning the actual content or a lack of due diligence in this respect. Pruning character and decimal fields closer to the reality of the table content saved 5 to 25% of the used table space, even for small tables, with extremes of 35% and more.

See also:
First the PDM, then Compression – Measures to reduce Resource Usage
