Parallel computing is not a new invention; it has been available for more than 30 years. Parallel database architectures followed soon after.
Everything started with the shared memory architecture, which is still widely used to this day in servers:
The shared memory architecture is a multiprocessor system with a common memory, a single operating system, and a shared disk or storage area network for storing the data. Processes are scattered across the CPUs.
One significant disadvantage of the shared memory architecture is that it does not scale well. Therefore, even with modern multi-core processors, you will not find any setup containing more than a few hundred processors.
Current databases make use of two other architectures, namely the shared disk and the shared nothing architecture (Teradata is based on the shared nothing architecture):
Both architectures overcome the problem of limited scaling.
In the shared disk architecture, the processors communicate over a network, but still only a shared disk or storage area network exists for storing the data. In a shared disk architecture, you can therefore find two networks: one for inter-processor communication and one for communication with the shared storage.
The shared nothing architecture uses local disks for each processor. Only the network for processor communication exists.
Databases based on any of these architectures are built to execute queries in parallel, but in the shared nothing architecture, the data is additionally distributed across the disks.
Everybody familiar with Teradata will easily spot the relationship between the shared nothing architecture and the AMPs (processors), the drives, and the BYNET (the processor communication network).
One important goal for any database system based on the shared nothing architecture is the even distribution of data across the available disks. We all know how Teradata does this: by hashing the primary index columns and storing the rows on the related AMPs.
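The hashing idea can be sketched in a few lines of Python. This is a simplification, not Teradata's actual row-hash algorithm; the MD5 hash and the AMP count of 8 are arbitrary stand-ins:

```python
import hashlib

def amp_for_row(primary_index_values, num_amps):
    """Hash the primary index column values to pick an AMP (simplified)."""
    key = "|".join(str(v) for v in primary_index_values)
    row_hash = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return row_hash % num_amps

# Rows with the same primary index value always land on the same AMP;
# distinct values spread roughly evenly across all AMPs.
rows = [(customer_id,) for customer_id in range(10_000)]
distribution = [0] * 8  # pretend we have 8 AMPs
for row in rows:
    distribution[amp_for_row(row, 8)] += 1
print(distribution)  # each AMP receives roughly 10_000 / 8 rows
```

Because a good hash function spreads distinct keys uniformly, the row counts per AMP come out nearly equal, which is exactly the even distribution needed to exploit parallelism.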
A column-store database would instead distribute different columns across the disks. Still, the goal is the same: even data distribution to optimally exploit parallelism.
Parallel databases are nowadays state of the art. But there is one issue which cannot be handled well by parallel databases (based on any of the described architectures): fault tolerance.
All architectures behave acceptably when only a limited number of processors is involved. The statistical probability of a processor failure is small in this case.
The situation looks quite different for a system with thousands of processors: the chance that at least one processor fails approaches 100%.
Parallel databases based on one of the three described architectures are fundamentally not fault tolerant. That is why Teradata, for example, offers Hot Standby Nodes, which do nothing else than replicate the data across a network. I would consider this a very costly approach, but it is a common method (not only for Teradata).
If we compare parallel databases with the Map Reduce framework (please take a look at my related article for further details), we quickly spot the parallel file system (HDFS) and its failure handling as a good basis for fault tolerance.
It is easy to recognize that long-running batch operations fit Map Reduce much better than a parallel database system like Teradata.
I guess it now becomes clearer why all big database vendors are starting to play in the big data arena. Big data needs a different architecture.
Currently, the trend is “SQL over Hadoop” to make the transition as smooth as possible, as SQL is the primary interaction language for most database systems.
We will see what the next years bring. Teradata’s entry is SQL-H; other database vendors are developing their own integrations. None of them really convinces me yet.
Below you can see an illustration of how real-world Map Reduce implementations (for example, Hadoop) are designed:
The input files are located on a distributed file system. In the case of Hadoop and many others, this is HDFS; Google calls its implementation the Google File System (GFS).
Worker processes can take over either mapper tasks or reducer tasks.
The mapper processes access pieces of the data located on HDFS, apply the map function, and write the results to their local disks.
Each time a mapper process has finished a particular reduce key (as you remember, the output of the map phase consists of key/value pairs), it informs the master process that this key is ready.
As you can see from the illustration below, the master process plays an important role. It assigns the finished keys to different reducer processes.
Please note that each worker process can take over either a map or a reduce task; they are not specialized.
When a reducer process is informed by the master process that a particular reduce key is available, it also receives the information from which mapper (and its local disk) the data can be picked up.
Be aware that a reducer task needs to know when it can start to work on a particular reduce key, i.e., when all mapper processes have delivered their data for this key.
Again, the master process plays an important role, as it keeps records and knows whether all mapper tasks have delivered data for a particular key (which is a prerequisite for the reduce task to start its work on this key).
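The map / shuffle / reduce flow described above can be condensed into a toy in-process sketch. This is of course not how Hadoop schedules real workers across machines; word count is just the classic illustrative job:

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: each mapper emits key/value pairs.
    intermediate = defaultdict(list)
    for chunk in inputs:
        for key, value in map_fn(chunk):
            intermediate[key].append(value)  # the "shuffle": group values by key
    # Reduce phase: one reduce call per key, as the master would assign
    # a finished key to a reducer process.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

# Classic word count.
def map_words(line):
    for word in line.split():
        yield word.lower(), 1

def reduce_count(word, counts):
    return sum(counts)

result = run_mapreduce(["the quick fox", "the lazy dog"], map_words, reduce_count)
print(result)  # {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In a real cluster, the `intermediate` grouping is distributed: mappers write to local disks, and reducers pull their keys from there once the master signals that all mappers have delivered.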
One very important advantage of the Map Reduce paradigm and its real-world implementations over traditional database systems is fault tolerance.
Consider a traditional database system like Teradata.
Although many features are built into a Teradata system to increase fault tolerance, it is hard to imagine a Teradata system scaling to hundreds of thousands of processes.
Statistical probability ensures that failure is almost guaranteed in the long run.
Let’s do a simple example:
If the probability of failure for each process is only 1%, the probability that at least one process out of hundreds of thousands fails is practically 100%. Basically, this means failure is guaranteed for the huge systems we may see more and more of in the big data arena.
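The arithmetic behind this claim is simple: with a per-process failure probability p, the chance that at least one of n independent processes fails is 1 - (1 - p)^n. A quick check in Python (the 1% figure is the example's assumption):

```python
def prob_any_failure(p, n):
    """Probability that at least one of n independent processes fails."""
    return 1 - (1 - p) ** n

# Per-process failure probability of 1%:
print(prob_any_failure(0.01, 10))       # ~0.096   -> small system, small risk
print(prob_any_failure(0.01, 1_000))    # ~0.99996 -> practically certain
print(prob_any_failure(0.01, 100_000))  # ~1.0     -> failure is guaranteed
```

The jump from ten processes to a thousand is what separates a conventional parallel database from a big data cluster: the same component reliability yields a completely different system-level failure picture.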
Fault tolerance is a huge topic for Map Reduce implementations.
In principle, the same statistical rules mentioned above apply to Map Reduce. Imagine a Map Reduce job spawning hundreds of thousands of mapper and reducer processes. Such a job can run for hours or even days on huge amounts of data, applying complex calculations like pattern recognition.
There always exists a master process which communicates regularly with all other processes (mappers and reducers) and records the key ranges which have been successfully processed by each mapper or reducer task.
If a mapper fails, the master task just reassigns the complete key range to another mapper which accesses the HDFS and repeats work on this key range.
If a reducer fails, the same logic applies and a new reducer is assigned to the task. However, keys already written to the output do not have to be reprocessed.
As you can see, the master process keeps track of all activities in the Map Reduce job and handles failures gracefully, even for systems with hundreds of thousands of processes.
Furthermore, even in the absence of real failures, the master process detects slow mappers or reducers, takes away the key range they are working on, and assigns it to a new process.
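The master's bookkeeping can be sketched as a simple table of key ranges and their states. The states and reassignment rule below are my simplification, not Hadoop's actual JobTracker protocol:

```python
# Simplified master bookkeeping: key ranges move from "running" back to
# "pending" when their worker fails (or is detected as a straggler),
# while completed ranges are never redone.
class Master:
    def __init__(self, key_ranges):
        self.state = {kr: ("pending", None) for kr in key_ranges}

    def assign(self, key_range, worker):
        self.state[key_range] = ("running", worker)

    def complete(self, key_range):
        self.state[key_range] = ("done", None)

    def worker_failed(self, worker):
        # Reassignable: everything the failed worker was still running.
        for kr, (status, w) in self.state.items():
            if status == "running" and w == worker:
                self.state[kr] = ("pending", None)  # redo from the HDFS input

master = Master(["a-m", "n-z"])
master.assign("a-m", "worker-1")
master.assign("n-z", "worker-2")
master.complete("a-m")
master.worker_failed("worker-2")  # worker-2 dies mid-task
print(master.state)
# {'a-m': ('done', None), 'n-z': ('pending', None)} -- only n-z is redone
```

The key point is that recovery is cheap bookkeeping: only the failed worker's unfinished key ranges return to the pending pool, and any idle worker can pick them up again from the distributed file system.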
One might assume that the master process is a single point of failure. This is indeed the case: if the master process fails, the complete Map Reduce job fails.
But again, statistical probabilities have to be considered:
Even if the chance that any of the thousands of processes fails is almost 100%, the risk that exactly this one particular master process fails is very small. Exactly this property makes Map Reduce implementations very fault tolerant and more suitable for very large-scale tasks than traditional database implementations.
Map Reduce is one of the Hadoop core functionalities, which most database vendors are adding to their RDBMS. The following example should help you understand the relation between a traditional RDBMS and Hadoop.
As an example, we use a SQL aggregation statement which joins two tables. Based on this example, we will show how logic can be pushed down to the “Hadoop implementation” of the RDBMS, where it is handled by the Map-Reduce algorithm:

SELECT t.segment, SUM(c.balance)
FROM Client c
JOIN ClientType t ON c.clienttype_id = t.clienttype_id
GROUP BY t.segment;
The SQL statement above sums up the client balances for each client segment (private, business, etc.).
Let’s see how this can be implemented with a Map-Reduce algorithm.
Our example SQL aggregation statement cannot be covered by a single Map-Reduce task.
In the first step, the join is done. The data is partitioned by clienttype_id so that each reducer can emit the segment and balance for each client. In other words, the first Map-Reduce task performs the join of the Client and ClientType tables, but does not do the grouping yet:
The second Map-Reduce task partitions the data by segment so that all rows for a particular segment are sent to the same reducer. The reducer then does the summation required by the GROUP BY clause.
The example above showed how RDBMS functionality can be designed in Map Reduce. It helps us understand better what we are referring to when analyzing the ability of an RDBMS like Teradata to push down join and aggregation functionality to Map Reduce.
Every once in a while, an innovation sends IT markets into rapture. Although parallel computing is not a big deal these days, the emergence of Hadoop made parallel computing widely available to the public for the first time.
As usual, a lot of hype and marketing is happening, concealing the real value of Hadoop and the related topics.
Significant confusion is generated, and many wrong statements are made. Big data is a buzzword. Together with Hadoop and Map Reduce, so-called experts claim this will be the future, replacing the traditional Data Warehouse. I heavily doubt this prediction.
Let’s take a look at Teradata and Hadoop data warehousing and compare them. From a technical point of view, both are indeed quite similar: parallel systems with a shared nothing architecture.
At a high level, it seems that Hadoop could be the better solution: scalability is unlimited, and massive amounts of data can be collected and processed by leveraging HDFS and Map Reduce.
In a typical Data Warehouse (this is valid for Teradata as well), at a certain point you will run into a bottleneck. While a Teradata system scales well once the data has made it into the database (simply by adding more nodes), the massive amounts of data (or big data, if you prefer the buzzword) first have to be collected, moved to storage devices, and finally loaded into the database. Instead of being limited by network bandwidth, ETL servers, and storage devices at this stage, you can leverage the advantages of the Hadoop framework.
I think this is exactly where Hadoop comes into play, adding value to your Teradata Data Warehouse: collecting and preprocessing the huge amounts of data before they are loaded into the Teradata Data Warehouse.
Don’t forget that HDFS is probably a much more cost-efficient way of storing your data than keeping it online (unaggregated) in the database, archiving it to tapes, or being forced to remove it completely after loading.
One could consider Hadoop combined with Map Reduce a powerful, massively parallel ETL server, preprocessing incoming data and preparing it for further processing.
Don’t forget that while a typical ETL server environment limits you to certain kinds of “SQL-near” data processing, the Hadoop / Map-Reduce approach gives you the possibility to do any kind of processing on the incoming data.
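To illustrate “any kind of processing”: a Hadoop Streaming mapper is just a program that reads raw lines from stdin and writes tab-separated key/value pairs to stdout, so arbitrary parsing logic can run in parallel across the cluster. The log format below is made up for the sketch:

```python
import sys

def preprocess(line):
    """Parse one raw log line into a (key, value) pair -- made-up format."""
    # Example input: "2024-01-01T12:00:00 user=42 action=login"
    parts = dict(field.split("=", 1) for field in line.split()[1:])
    return parts["user"], parts["action"]

def run_as_streaming_mapper():
    # Hadoop Streaming convention: read raw lines from stdin,
    # write key<TAB>value pairs to stdout.
    for raw in sys.stdin:
        raw = raw.strip()
        if raw:
            user, action = preprocess(raw)
            print(f"{user}\t{action}")

print(preprocess("2024-01-01T12:00:00 user=42 action=login"))  # ('42', 'login')
```

Nothing in `preprocess` is restricted to SQL-like operations; it could just as well run a regex, decode a binary record, or score a pattern-recognition model, which is the point of the comparison with an ETL server.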
Once the hype has vanished, I think what I described above will probably be one of the useful methods of integrating Hadoop into your data warehouse life cycle.
All big database vendors are somehow following the Hadoop hype, adding functionality directly into their RDBMS just to be part of the mix. Time will tell whether fusing Hadoop into the RDBMS is a good idea. I am not quite convinced yet.
Currently, many of these attempts to be part of the big data train are not convincing.
Teradata’s implementation (SQL-H) covers only part of a fully equipped Hadoop framework. Parallelism is restricted to parallel pipelines down to HDFS (the Hadoop file system); advanced possibilities like pushing down joins or aggregations are not available, which weighs on performance.
Currently, this will force developers to find other “Hadoopish” solutions. Just keep in mind how ETL tools are misused:
When the ETL server does not perform well for a certain kind of transformation, it is often decided to move the mapping logic for the critical part directly into pre-aggregation steps taking place in the RDBMS. This breaks the data lineage but is considered “state of the art”; yet it is hard to remove such implementations at a later time.
Another blind spot is the specific SQL-H syntax Teradata offers. It does not add much convenience, it is neither ANSI SQL nor Teradata SQL, and many 3rd-party tools will not support it for an extended period.
I am delighted about the possibilities Hadoop &amp; Map Reduce offer; I fear some implementations may end up as shoddy work, replaced sooner or later by a proper implementation. Unfortunately, this will probably cause a lot of redesign on the client side to fully leverage the functionality.
Although big data is the new buzzword, I don’t think most companies are in any big rush to acquire these new technologies, and I wish database vendors would give themselves more time. We are talking about a new technology/framework, and while it has great potential for handling large amounts of data, I still see so many customers struggling even with the basics of data warehousing. I remember the times when “active data warehousing” was “the next big thing”…