Teradata CSV splitting

DWH Pro Admin

March 21, 2021

minutes reading time


Often we are faced with the challenge of splitting the content of a column. Teradata offers several ways to do this, each with its own advantages and disadvantages, which we will look at in detail in this article. First, we will create a table with enough random data to determine the performance and resource consumption of each option:

CREATE TABLE DWHPRO.Split
(
    PK INTEGER,
    ToBeSplit   VARCHAR(200)
) PRIMARY INDEX (PK);
INSERT INTO DWHPRO.Split
SELECT 
	ROW_NUMBER() OVER (ORDER  BY 1) AS PK,
	'Row' || TRIM(PK) || 'Col1,' || 'Row' || TRIM(PK) || 'Col2,' || 'Row' || TRIM(PK) || 'Col3'
FROM 
(
	SELECT * FROM SYS_CALENDAR.CALENDAR
	UNION ALL SELECT * FROM SYS_CALENDAR.CALENDAR
	UNION ALL SELECT * FROM SYS_CALENDAR.CALENDAR
	UNION ALL SELECT * FROM SYS_CALENDAR.CALENDAR
	UNION ALL SELECT * FROM SYS_CALENDAR.CALENDAR
	UNION ALL SELECT * FROM SYS_CALENDAR.CALENDAR
	UNION ALL SELECT * FROM SYS_CALENDAR.CALENDAR
	UNION ALL SELECT * FROM SYS_CALENDAR.CALENDAR
	UNION ALL SELECT * FROM SYS_CALENDAR.CALENDAR
) t01; 
--> 

The Teradata CSVLD Function

With the CSVLD function, we must consider that all defined columns must be present. If not all columns can be created, an error message will be displayed:


SET QUERY_BAND = 'Method=CSVLD;' FOR SESSION;
SELECT * FROM TABLE (CSVLD(Split.ToBeSplit, ',', '')
RETURNS (col1 varchar(200), col2 varchar(200), col3 varchar(200))) as T1;

The Teradata Unpack Function

In contrast to the CSVLD function, the Unpack function also allows you to omit rows that cannot be split into all defined columns. The parameter “IgnoreInvalid” is used for this purpose:


SET QUERY_BAND = 'Method=Unpack;' FOR SESSION;
SELECT 
	PK,
	CAST(Col1 AS VARCHAR(255)),
	CAST(Col2 AS VARCHAR(255)),
	CAST(Col3 AS VARCHAR(255))
FROM Unpack (
  ON Split
  USING
  TargetColumn ('ToBeSplit')
  OutputColumns ('Col1', 'Col2', 'Col3')
  OutputDataTypes ('varchar', 'varchar', 'varchar')
  Delimiter (',') 
  Regex ('(.*)')
  RegexSet (1)
  IgnoreInvalid ('true')
) AS t01 
;

The Teradata strtok_split_to_table Function

Teradata strtok_split_to_table not only splits into columns, but also creates a new row for each column, but marks it with a column number so that it can be identified:


SET QUERY_BAND = 'Method=strtok_split_to_table;' FOR SESSION;
WITH T1 as
( 
 SELECT d.* FROM TABLE (strtok_split_to_table( Split.PK, Split.ToBeSplit,  ',')
RETURNS (PK integer, tokennum integer, Col varchar(100)) ) as d
)
SELECT T1.* FROM t1
;
SELECT QUERYBAND,Spoolusage,TotalIOCount,AMPCPUTIme 
FROM DBC.DBQLOGTBL
WHERE QUERYBAND LIKE '%Method%'

Performance Considerations

Of course, only the Teradata functions CSVLD and Unpack are directly comparable. They create exactly one output row per input row. strtok_split_to_table multiplies the number of rows and therefore needs more spool space.

According to our empirical measurements (we ran each query 3 times), the Unpack method requires about 43% less spool space and 59% fewer I/Os.

QuerybandSpoolUsageTotalIOCountAMPCPUTime
=S> Method=CSVLD;110 559 232332,001,89
=S> Method=CSVLD;110 559 232332,001,89
=S> Method=CSVLD;110 559 232332,001,90
=S> Method=strtok_split_to_table;245 719 040608,004,71
=S> Method=strtok_split_to_table;245 719 040608,004,72
=S> Method=strtok_split_to_table;245 719 040608,004,98
=S> Method=Unpack;63 406 080136,001,90
=S> Method=Unpack;63 406 080136,001,79
=S> Method=Unpack;63 406 080136,001,87
{"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

You might also like

>