
Anyone working with Teradata sooner or later experiences how skewing impairs query run times.

However, I assume only a few of us know the exact order of magnitude of the impact we have to expect.

To get the big picture, I decided to create a test scenario based on absolute measures and derive some key figures from it.

I am a big fan of CPU usage and disk accesses. In contrast to run times, these numbers always tell the truth. Furthermore, it is easy to get this information from DBC.DBQLOGTBL, unless you are in the unfavorable situation that your database administrators have restricted access to the DBC tables.
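As a side note, the same DBQL columns can be turned into a simple per-query skew indicator. Here is a sketch of how I would do it; the column names are the ones from DBC.DBQLogTbl, and the HASHAMP() + 1 trick (a documented way to obtain the number of AMPs) is an assumption about your system configuration being visible to your user:

```sql
-- Per-query skew factors: how much busier was the busiest AMP
-- compared with the average AMP? A value near 1 means even work,
-- large values mean heavy skew. HASHAMP() + 1 returns the number of AMPs.
SELECT
   QueryText,
   MaxAmpCPUTime * (HASHAMP() + 1) / NULLIFZERO(AMPCPUTime)  AS CPUSkewFactor,
   MaxAmpIO      * (HASHAMP() + 1) / NULLIFZERO(TotalIOCount) AS IOSkewFactor
FROM DBC.DBQLogTbl
WHERE AMPCPUTime > 0;
```

NULLIFZERO() guards against division by zero for queries that consumed no measurable CPU or IO.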

To enlarge the test scenario a bit, I tested both SET and MULTISET tables. The idea behind adding this table-level attribute was the following:

If skewing alone already affects performance, shouldn't a SET table, which needs duplicate row checks, add even more workload to the poor AMPs that are the victims of the skewing?

My test setup looks like this:

CREATE SET TABLE TABLE_NOT_SKEWED (A INTEGER, B INTEGER) PRIMARY INDEX (A) ;

CREATE MULTISET TABLE TABLE_NOT_SKEWED_MULTISET (A INTEGER, B INTEGER) PRIMARY INDEX (A) ;

CREATE SET TABLE TABLE_SKEWED (A INTEGER, B INTEGER) PRIMARY INDEX (B) ;

CREATE MULTISET TABLE TABLE_SKEWED_MULTISET (A INTEGER, B INTEGER) PRIMARY INDEX (B) ;

As you can see, I created a skewed table and an unskewed table, each of them as SET and MULTISET. I then added some data to each of these tables. Column A is almost unique, while column B has only a few distinct values:

INSERT INTO TABLE_NOT_SKEWED  SELECT RANDOM(1,1000000), RANDOM(1,5) FROM  SYS_CALENDAR.CALENDAR;

INSERT INTO TABLE_NOT_SKEWED_MULTISET  SELECT RANDOM(1,1000000), RANDOM(1,5) FROM  SYS_CALENDAR.CALENDAR;

INSERT INTO TABLE_SKEWED  SELECT RANDOM(1,1000000), RANDOM(1,5) FROM  SYS_CALENDAR.CALENDAR;

INSERT INTO TABLE_SKEWED_MULTISET  SELECT RANDOM(1,1000000), RANDOM(1,5) FROM  SYS_CALENDAR.CALENDAR;
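Before running the inserts, it is worth convincing yourself that the chosen primary indexes really produce the intended distribution. A quick, hedged sketch using Teradata's hash functions (HASHROW, HASHBUCKET, HASHAMP map a primary index value to its AMP):

```sql
-- How evenly are the rows of the skewed table spread over the AMPs?
-- With only 5 distinct values in column B, at most 5 AMPs can receive rows.
SELECT
   HASHAMP(HASHBUCKET(HASHROW(B))) AS TheAMP,
   COUNT(*)                        AS RowsOnAMP
FROM TABLE_SKEWED
GROUP BY 1
ORDER BY 2 DESC;
```

Running the same query with column A instead of B on TABLE_NOT_SKEWED should show row counts spread almost evenly across all AMPs.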

Finally, I analysed DBC.DBQLOGTBL:

SELECT
   QueryText,
   MaxAmpCPUTime,
   MinAmpCPUTime,
   MaxAmpIO,
   MinAmpIO,
   TotalIOCount,
   AMPCPUTime
FROM DBC.DBQLOGTBL
WHERE SessionID = <my SQL Assistant session id>;

Below you can see the result of this test. Here are my observations:

| QueryText | MaxCPU | MinCPU | MaxIO | MinIO | TotalIO | AMPCPU |
|---|---|---|---|---|---|---|
| INSERT INTO TABLE_NOT_SKEWED… | 0.01 | 0 | 41 | 39 | 3,619 | 0.33 |
| INSERT INTO TABLE_NOT_SKEWED_MULTISET… | 0.02 | 0 | 38 | 40 | 3,534 | 0.44 |
| INSERT INTO TABLE_SKEWED… | 8.59 | 0 | 41,536 | 13 | 204,651 | 42.44 |
| INSERT INTO TABLE_SKEWED_MULTISET… | 0.03 | 0 | 86 | 11 | 1,350 | 0.33 |

SET TABLES:

The disk accesses for the skewed table are 15 times higher.
The busiest AMP has 4,000 times more disk accesses to do than the AMP with the least work, with a comparable ratio for CPU seconds!

 

MULTISET TABLES:

MULTISET tables need fewer disk accesses than their SET table counterparts. This is because the costly duplicate row check is omitted.
Very interesting: the skewed MULTISET table needs less than 50% of the disk accesses of the unskewed table. Does this mean it is better? I don't think so.
The reason is probably that most rows are physically co-located next to each other on a small set of AMPs (disks). I assume that caching of rows in memory and larger block reads can be applied aggressively, reducing disk accesses significantly.
The bad thing is: only a few AMPs have to manage many rows, which leads to a bottleneck. So don't be fooled by these numbers: inserting into the table is still much slower, because the work is done by just one AMP (or a few of them).

Conclusion: Use MULTISET tables whenever possible. The higher the table skew, the worse the impact of the duplicate row check.
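To spot tables that already suffer from skew in your own system, the permanent space per AMP can be queried from the data dictionary. A sketch, assuming your test tables live in your own user database (adjust the WHERE clause otherwise):

```sql
-- Table skew at rest: ratio of the biggest AMP's permanent space
-- to the average permanent space per AMP. 1.00 means perfectly even;
-- the skewed test tables should show a much larger value.
SELECT
   TableName,
   CAST(MAX(CurrentPerm) AS FLOAT) / NULLIFZERO(AVG(CurrentPerm)) AS SkewFactor
FROM DBC.TableSizeV
WHERE DatabaseName = USER   -- assumption: tables were created in your user database
GROUP BY 1
ORDER BY 2 DESC;
```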


Roland Wenzlofsky

Roland Wenzlofsky is an experienced freelance Teradata Consultant & Performance Trainer. Born in Austria's capital Vienna, he has been building and tuning some of the largest Teradata data warehouses in the European financial and telecommunication sectors for more than 20 years. He has played all the roles of developer, designer, business analyst, and project manager. Therefore, he knows like no other the problems and obstacles that make many data warehouse projects fail, and all the tricks and tips that will help you succeed.

  • Dimitrios says:

    Hello, it is a very interesting approach, but in my opinion you have to add two or three more things to have a complete test.

    What if the table already contains data? How does this impact the CPU/IO of a MULTISET or SET table?
    What if I add more columns to the PI of my SET table in order to increase the information in the histogram (fewer groups of values to check)?
    What if I add a USI to my SET table?

    Furthermore, there is always an impact on CPU/IO consumption with a SET table. But if a UPI or USI is defined, the behavior is the same, since the bottleneck is not the skewed data (that problem exists in both cases): Teradata will recognize the unique values to insert, and this is controlled only by the PI, USI, and statistics.

    Finally, would it be better to say: the higher the number of rows in a skewed table, the worse the impact of the duplicate row check?

    Thank you.

  • Helmut Wenzlofsky says:

    Very interesting, I think you are right.
    Regards, Helmut
