Teradata Map Reduce - SQL JOIN and GROUP BY
Map Reduce is one of the Hadoop core functionalities, which most database vendors are adding to their RDBMS. The following example should help you to understand the relation between a traditional RDBMS and Hadoop.
As an example, we use a SQL aggregation statement, which is joining two tables together; Based on this example, we will show how logic can be pushed down to the "Hadoop implementation" of the RDBMS, which is being handled by the Map-Reduce algorithm:
t01.clienttype_id = t02.clienttype_id
GROUP BY Segment;[/highlight]
Above SQL statement is summing up the client balances for each client segment (private, business, etc.).
Let's see how this can be implemented with a Map-Reduce algorithm.
Our example SQL aggregation statement from above cannot be covered by only one Map-Reduce task.
In a first step, the JOIN step is done. The data is sorted by clienttype_id, allowing each reducer to emit the balance for each segment. In other words, the first Map-Reduce task does the join of the Client and ClientType table, but is not doing the grouping:
The second Map-Reduce task sorts the data by segment so that all rows for a particular segment are being emitted to the same Reducer. The reducer does the summarizing needed in the GROUP BY statement.
Above example showed, how RDBMS functionality can be designed in Map Reduce. It helps to understand in a better way, what we are relating to, when analyzing the ability of an RDBMS like Teradata, to push down joining and aggregation functionality to Map Reduce.