The History of Parallel Database Architectures
Parallel computing is not a new invention; it has been around for more than 30 years. Parallel database architectures followed soon after.
Everything started with the shared memory architecture, which is still widely used in servers to this day:
The shared memory architecture is a multiprocessor system with common memory, a single operating system, and a shared disk or storage area network for storing the data. Processes are scattered across the CPUs.
One significant disadvantage of the shared memory architecture is that it does not scale well. Even with modern multi-core processors, you will therefore not find setups with more than a few hundred processors.
Current databases make use of two other architectures, namely the shared disk and the shared-nothing architecture (Teradata is based on the shared-nothing architecture):
Both architectures overcome the problem of limited scaling.
In the shared disk architecture, the processors communicate over a network, but only a shared disk or storage area network exists to store the data. You can therefore find two networks in a shared disk architecture: one for inter-processor communication and one for access to the storage system.
The shared-nothing architecture uses local disks for each processor. Only the network for processor communication exists.
Databases based on any of these architectures are built to execute queries in parallel, but in the shared-nothing architecture, the data is additionally distributed across the disks.
Everybody familiar with Teradata will easily spot the relationship between the shared-nothing architecture and the AMPs (processors), drives, and the BYNET (processor communication network).
A critical goal for any shared-nothing architecture database system is the even distribution of data across the available disks. We all know how Teradata does this: by hashing the primary index columns and storing each row on the related AMP.
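The idea can be sketched in a few lines. This is a minimal illustration of hash-based row distribution in a shared-nothing system, not Teradata's actual row-hash algorithm; the hash function and the number of units (`NUM_AMPS`) are assumptions for demonstration.

```python
# Sketch: assign each row to a processing unit by hashing its primary
# index value. A good hash spreads rows roughly evenly across units.
import hashlib

NUM_AMPS = 4  # hypothetical number of processing units (AMPs)

def amp_for_row(primary_index_value: str) -> int:
    """Map a primary index value to a processing unit."""
    digest = hashlib.md5(primary_index_value.encode()).hexdigest()
    return int(digest, 16) % NUM_AMPS

rows = [f"customer_{i}" for i in range(1000)]
buckets: dict[int, list[str]] = {}
for row in rows:
    buckets.setdefault(amp_for_row(row), []).append(row)

# With 1000 rows and 4 units, each bucket ends up close to 250 rows.
for amp, assigned in sorted(buckets.items()):
    print(amp, len(assigned))
```

The key property is that the unit for a given row is computed, not looked up: any node can locate a row's home AMP from the primary index value alone.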
A column-store database would instead distribute different columns across the disks. Still, the goal is the same: even data distribution to optimally exploit parallelism.
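To contrast the two placement strategies, here is an equally simplified sketch of column placement. The round-robin assignment and the column names are illustrative assumptions; real column stores use more sophisticated placement (and typically also partition rows within columns).

```python
# Sketch: a column store could assign whole columns to disks, e.g.
# round-robin, instead of hashing individual rows across units.
columns = ["customer_id", "name", "city", "revenue", "signup_date", "segment"]
NUM_DISKS = 3  # hypothetical number of disks

placement = {col: i % NUM_DISKS for i, col in enumerate(columns)}
for disk in range(NUM_DISKS):
    assigned = [c for c, d in placement.items() if d == disk]
    print(disk, assigned)
```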
Parallel databases and fault tolerance, or why Hadoop rules
Parallel databases are nowadays state of the art. But there is one issue that none of the described architectures handles well: fault tolerance.
All architectures behave well as long as only a limited number of processors is involved; the statistical probability of a processor failure is negligible in that case.
The situation looks quite different for a system with thousands of processors: the chance that at least one processor fails approaches 100%.
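A back-of-the-envelope calculation makes this concrete. Assuming each processor independently fails with probability p over some period (p = 0.001 is an assumed value, not a measured one), the probability that at least one of n processors fails is 1 - (1 - p)^n:

```python
# Probability that at least one of n processors fails, given an
# assumed per-processor failure probability p over some period.
def prob_any_failure(n: int, p: float = 0.001) -> float:
    return 1 - (1 - p) ** n

for n in (10, 100, 1000, 10000):
    print(n, round(prob_any_failure(n), 4))
```

With these numbers, ten processors give a failure probability of about 1%, but ten thousand processors push it above 99.99%.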
Parallel databases based on one of the three described architectures are fundamentally not fault-tolerant. That is why Teradata, for example, offers Hot Standby Nodes that do nothing but replicate the data across a network. I consider this a very costly approach, but it is a common one (and not only at Teradata).
If we compare parallel databases with the Map-Reduce framework (please see my related article for further details), we quickly see that the parallel file system (HDFS) and its built-in fault handling provide a solid foundation for fault tolerance.
It is easy to recognize that long-running batch operations are a much better fit for Map-Reduce than for a parallel database system like Teradata.
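A toy word count shows why: each map task processes its chunk independently, so a failed task can simply be re-run on that chunk without restarting the whole job. The function names below are illustrative of the map-reduce style, not a real Hadoop API.

```python
# Toy word count in the map-reduce style: independent map tasks per
# chunk, then a reduce step that merges the partial results.
from collections import Counter
from functools import reduce

def map_phase(chunk: str) -> Counter:
    """Count words in one chunk; each call is an independent, restartable task."""
    return Counter(chunk.split())

def reduce_phase(a: Counter, b: Counter) -> Counter:
    """Merge two partial counts."""
    return a + b

chunks = ["the quick brown fox", "the lazy dog", "the fox"]
partials = [map_phase(c) for c in chunks]
totals = reduce(reduce_phase, partials, Counter())
print(totals["the"])  # → 3
```

The fault-tolerance argument lives entirely in `map_phase`: its output depends only on its input chunk, so the framework can re-execute it anywhere after a failure.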
Now it becomes clearer why all big database vendors have started playing in the big data arena. Big data needs a different architecture.
Currently, the trend is “SQL over Hadoop” to make the transition as smooth as possible, as SQL is the primary interaction language for most database systems.
We will see what the coming years bring. Teradata’s answer is SQL-H; other database vendors are developing their own integrations. None of them has really convinced me yet.