Collect Statistics in Teradata – Part 3

Collect Statistics in Teradata – sample statistics or full?

 Once a table has gotten vast and the time window for regular recollection is about to become smaller or a given one overconsumed, it is time to reconsider one collection option: could a short sample over the row be sufficient? Unless defined otherwise in your collect statement, sampling refers to a 1 to 10% draw of the total data over no more than a small number of AMPs, exact details depending on the Teradata release.

Sampling will become a net advantage for collection combinations at the extremes of possible distributions, i.e. when data are spread more or less evenly across a large set of distinct values or when the collect combination is a constant. For anything in between, the collection result numbers that are not accurate enough and so become a fen fire in search for the optimal execution plan.

Think of it as a kind of opinion poll problem: asking just the next few persons that come your way gives you the right idea on how the population as a whole thinks when either there is a strong consensus at large or no two people with the same opinion in a community. In all other cases, your sample will misrepresent where the majority stands.

Just as with an opinion poll, consider the fact that it makes no sense to sample statistics when your entire population (i.e. the table) is tiny so that you might just as well go over the entire data. The rock bottom for non-sampling is where your table has fewer records than we have AMPs on the system.

Paul Timar

Paul Timar is a graduated Sociologist and Economist who turned Data Warehouse Specialist. He has more than 7 years of experience in working with Oracle and Teradata Data Warehouses in the fields of Social Security Data and Banking. His latest focus is on performance optimization.