Speeding up the LIKE operator without Teradata Indexing
When Teradata Indexing fails
We all have been in the situation where we had to select rows from extremely wide tables containing billions of rows. While this is mild on resource consumption with direct access to the rows (Primary Index, USI), it becomes a resource killer if the rows have to be selected by using a LIKE operator:
SELECT * FROM wide_table WHERE column_B LIKE ‘%A%’;
Without any further measures, above query, most of the times will be resolved with a full table scan (FTS).
While an STJI could be a good choice, it will be very hard to convince the optimizer to replace the FTS on the base table with an FTS on an NUSI sub-table, as statistics (sampled or collected) may not be sufficient to consider index usage (in general, “LIKE ‘value%'” will more often allow the optimizer to do an NUSI traversal than “LIKE ‘%value%'”, this is related to the way statistic histograms are stored on Teradata).
Furthermore, be aware that LIKE operators disallow a binary search within the NUSI data blocks as the row pointer array is sorted by ROWHASH of the indexed columns (or a 4-byte integer value) but the LIKE operator is not hashable. The only kind of access would be a full table scan of the NUSI sub-table which makes only sense if the number of NUSI data blocks to be read is significantly lower than the number of base table data blocks accessed with an FTS.
The Solution without Teradata Indexing
Luckily, there exists an amazing trick to avoid the full table scan on the wide table:
We create a copy of the comprehensive table, containing only a subset of columns, namely the primary key columns and only the ones needed to satisfy the LIKE operator. Primary Key and Primary Index of the table have to match to be able to use this trick!
Instead of directly selecting from the wide table, we will join the wide table with our auxiliary table:
t01.primary_key = t02.primary_key
t02.column_A LIKE ‘%A%’;
In case all needed statistics are available to satisfy above query, the optimizer will create an execution plan much more benign on resource usage than the original one:
A full table scan on the small auxiliary table will be done, only the rows satisfying the LIKE operator are moved into the spool. As the auxiliary table is much smaller, this leads to a huge reduction in IO.
In a second step, a primary index merge join (AMP-local and therefore fast) between the spooled records and the wide table takes place.
The solution described above was successfully implemented in several telecommunication projects, where queries with LIKE operators on huge call detail record (CDR) tables are done on a daily base. The creation of auxiliary tables is usually moved into the nightly batch load window.