Every once in a while, an innovation sends IT markets into rapture. Parallel computing is nothing new these days, but the emergence of Hadoop has made it available to a broad audience for the first time.
As usual, a lot of hype and marketing surrounds it, concealing the real value of Hadoop and related technologies.
Significant confusion is generated, and many wrong statements are made. Big data is a buzzword. So-called experts claim that big data, together with Hadoop and MapReduce, is the future and will replace the traditional data warehouse. I strongly doubt this prediction.
Let’s compare data warehousing on Teradata and on Hadoop. The two are indeed quite similar from a technical point of view: both are parallel systems with a shared-nothing architecture.
At first glance, Hadoop could seem to be the better solution: it scales out almost without limit, and massive amounts of data can be collected and processed using HDFS and MapReduce.
Every typical data warehouse, Teradata included, runs into a bottleneck at some point. A Teradata system scales well once the data is inside the database, simply by adding more nodes. But massive amounts of data (or big data, if you prefer the buzzword) first have to be collected, moved to some storage device, and finally loaded into the database. Instead of being limited by network bandwidth, ETL servers, and storage devices, you can leverage the advantages of the Hadoop framework at this stage.
I think this is exactly where Hadoop comes into play, adding value to your Teradata data warehouse: collecting and preprocessing the huge amount of data before it is loaded into Teradata.
Don’t forget that HDFS is probably a much more cost-efficient way of storing your detailed, non-aggregated data than keeping it online in the database, archiving it on tape, or even being forced to delete it after loading.
One could consider Hadoop combined with MapReduce as a powerful, massively parallel ETL server that preprocesses incoming data and prepares it for further processing.
Keep in mind, too, that a typical ETL server environment limits you to particular, “SQL-near” types of data processing, while the Hadoop / MapReduce approach lets you do any kind of processing on the incoming data.
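To make the "Hadoop as ETL server" idea more concrete, here is a minimal sketch in plain Python of the map, shuffle, and reduce steps applied to raw log lines. It is not Hadoop code and does not use any Hadoop API; the log format, the sample data, and all function names are assumptions made up for illustration. The point is only that the map step can run arbitrary logic (here, regex parsing and filtering of malformed lines) that would be awkward to express in SQL, and that only small, load-ready rows remain at the end.

```python
import re
from collections import defaultdict

# Hypothetical raw input: semi-structured log lines as they might land in HDFS.
RAW_LINES = [
    "2013-05-01T10:00:01 GET /products/42 200",
    "2013-05-01T10:00:05 GET /products/42 200",
    "garbage line without structure",
    "2013-05-01T11:30:00 GET /cart 500",
    "2013-05-02T09:15:00 GET /products/42 200",
]

# Assumed line layout: <date>T<time> <method> <path> <status>
LINE_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2})T\S+ (\w+) (\S+) (\d{3})$")


def map_line(line):
    """Map step: parse one raw line and emit ((day, path), 1) for successful hits."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return  # malformed input is dropped before it reaches the warehouse
    day, _method, path, status = match.groups()
    if status == "200":
        yield (day, path), 1


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as the framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_group(key, values):
    """Reduce step: collapse all values of one key into a single load-ready row."""
    day, path = key
    return (day, path, sum(values))


def preprocess(lines):
    """Run map, shuffle, and reduce in sequence and return sorted result rows."""
    pairs = (pair for line in lines for pair in map_line(line))
    grouped = shuffle(pairs)
    return sorted(reduce_group(key, values) for key, values in grouped.items())


rows = preprocess(RAW_LINES)
for row in rows:
    # Comma-separated rows, e.g. as input for a bulk load into the database.
    print(",".join(str(field) for field in row))
```

In a real cluster the map and reduce functions would run in parallel on many nodes; here they run sequentially, which is enough to show how five raw lines shrink to two rows before anything is loaded into the database.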
Once the hype has vanished, I think what I described above will probably be one of the useful methods of integrating Hadoop into your data warehouse life cycle.
All big database vendors are somehow following the Hadoop hype, adding Hadoop functionality directly to their RDBMS just to be in the mix at the top. Whether fusing Hadoop with the RDBMS is a good idea, time will tell. I am not convinced yet.
Currently, many of these attempts to be part of the big data train are not convincing.
Teradata’s implementation (SQL-H) covers only part of what a fully equipped Hadoop framework offers. Parallelism is restricted to parallel pipelines down to HDFS (the Hadoop Distributed File System), and advanced features such as pushing down joins or aggregations are not available, which weighs on performance.
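Why a missing aggregation pushdown weighs on performance can be shown with a small sketch. The following plain Python snippet is purely illustrative: it does not reproduce how SQL-H or Hadoop work internally, and the partitions, data, and function names are hypothetical. It only compares how many rows have to travel to the database when the aggregation happens after the transfer versus on each node before the transfer.

```python
from collections import Counter

# Hypothetical data: each inner list stands for the rows held by one data node.
PARTITIONS = [
    [("DE", 10), ("DE", 5), ("US", 7)],
    [("US", 3), ("US", 1), ("FR", 8)],
    [("DE", 2), ("FR", 4), ("FR", 6)],
]


def aggregate(rows):
    """Sum the amounts per country code."""
    totals = Counter()
    for country, amount in rows:
        totals[country] += amount
    return totals


def without_pushdown(partitions):
    """Ship every raw row to the database, then aggregate there."""
    transferred = [row for part in partitions for row in part]
    return dict(aggregate(transferred)), len(transferred)


def with_pushdown(partitions):
    """Aggregate on each node first and ship only the partial sums."""
    partials = [aggregate(part) for part in partitions]
    transferred = [item for partial in partials for item in partial.items()]
    return dict(aggregate(transferred)), len(transferred)


totals_a, rows_a = without_pushdown(PARTITIONS)
totals_b, rows_b = with_pushdown(PARTITIONS)
print(totals_a == totals_b, rows_a, rows_b)  # prints: True 9 6
```

Both variants produce the same totals, but the pushed-down variant moves fewer rows. With realistic data volumes and few distinct grouping keys, the difference would be far larger than in this toy example.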
Currently, these limitations will force developers to find other “Hadoopish” solutions. Just keep in mind how ETL tools are misused:
When the ETL server does not perform well for a certain kind of transformation, the mapping logic for the critical part is often moved directly into pre-aggregation steps inside the RDBMS. This breaks the data lineage, yet it is “state of the art” … and it is hard to remove such implementations later.
Another blind spot is the specific SQL-H syntax Teradata offers. It does not add much convenience, it is neither ANSI SQL nor Teradata SQL, and many third-party tools will not support it for an extended period.
I am delighted about the possibilities Hadoop and MapReduce offer, but I fear some implementations may turn out to be shoddy work that is replaced sooner or later by a proper implementation. Unfortunately, this will probably cause a lot of redesign on the client side to fully leverage the functionality.
Although big data is the new buzzword, I don’t think most companies are in a big rush to acquire these new technologies, and I wish database vendors would give themselves more time. We are talking about a new technology and framework, and while it has great potential for handling large amounts of data, I still see many customers struggling with even the basics of data warehousing. I remember the times when “active data warehousing” was “the next big thing”…