Every once in a while, an innovation sends IT markets into rapture. Parallel computing itself is nothing new, but Hadoop’s emergence makes it available to a broad audience for the first time.
As usual, plenty of hype and marketing obscures Hadoop’s actual value and the related topics.
This generates significant confusion, and many wrong statements are made. Big data is a buzzword. So-called experts claim that Hadoop and MapReduce are the future and will replace the traditional data warehouse. I strongly doubt this prediction.
Let’s compare Teradata and Hadoop for data warehousing. From a technical point of view, both are pretty similar: parallel systems with a shared-nothing architecture.
At first glance, Hadoop could look like the better solution: it scales out with practically no fixed limit, and massive amounts of data can be collected and processed by leveraging HDFS and MapReduce.
In a typical data warehouse (and this holds for Teradata), you will run into a bottleneck at a certain point. A Teradata system scales well by adding more nodes once the data is in the database, but massive amounts of data (or big data, if you prefer the buzzword) first have to be collected, moved to some storage devices, and finally loaded into the database. Instead of being limited by network bandwidth, ETL servers, and storage devices, you can leverage the Hadoop framework’s advantages.
I think this is precisely where Hadoop comes into play, adding value to your Teradata data warehouse: collecting and preprocessing the massive amount of data before it is loaded into the Teradata data warehouse.
Don’t forget that HDFS is probably a much more cost-efficient way of storing your data than keeping it online in unaggregated form, having to archive it on tape, or even having to delete it after loading.
One could consider Hadoop combined with MapReduce as a powerful, massively parallel ETL server that preprocesses incoming data and prepares it for further processing.
Don’t forget that while a typical ETL server environment limits you to particular types of SQL-like data processing, the Hadoop/MapReduce approach allows you to do arbitrary processing on the incoming data.
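To make the "parallel ETL server" idea more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps in plain Python. It is not Hadoop code: the log format, the sample lines, and the function names (map_line, shuffle, reduce_group, preprocess) are all assumptions made up for illustration. On a real cluster, the map and reduce functions would run in parallel on many nodes, and only the small aggregated result would be loaded into the data warehouse.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical raw input: semi-structured log lines (made up for illustration).
RAW_LINES = [
    "2013-05-01T10:00:01 user=42 action=view page=/home",
    "2013-05-01T10:00:05 user=42 action=click page=/offer",
    "corrupted line without fields",
    "2013-05-01T11:15:00 user=7 action=view page=/home",
    "2013-05-02T09:30:00 user=42 action=view page=/home",
]


def map_line(line):
    """Map step: parse one raw line and emit ((day, user), 1), or nothing."""
    parts = line.split()
    if len(parts) < 2 or "T" not in parts[0]:
        return []  # drop records that cannot be parsed
    day = parts[0].split("T")[0]
    fields = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
    if "user" not in fields:
        return []
    return [((day, int(fields["user"])), 1)]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce step: collapse one group into a single load-ready row."""
    day, user = key
    return (day, user, sum(values))


def preprocess(lines):
    """Run map, shuffle, and reduce in sequence; return sorted rows."""
    mapped = chain.from_iterable(map_line(line) for line in lines)
    grouped = shuffle(mapped)
    return sorted(reduce_group(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    for row in preprocess(RAW_LINES):
        print(row)  # (day, user, number_of_events)
```

The parsing in map_line is ordinary procedural code, which is the kind of processing that is awkward to express in a SQL-like ETL mapping. The five raw lines shrink to three aggregated rows, and the unparseable line is filtered out before anything reaches the database.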
Once the hype has faded, I think the approach described above is how Hadoop will find its place in your data warehouse life cycle.
The big database vendors are following the Hadoop hype, adding Hadoop functionality directly into their RDBMS to stay among the front-runners. Whether fusing Hadoop with the RDBMS is a good idea, time will tell. I am not entirely convinced yet.
Many of these attempts to jump on the big data bandwagon are not convincing.
Teradata’s implementation (SQL-H) covers just a part of a fully equipped Hadoop framework. Parallelism is restricted to parallel pipelines down to HDFS (the Hadoop Distributed File System); advanced possibilities like pushing down joins or aggregations are not available, which weighs on performance.
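Why missing pushdown hurts can be shown with a small, purely illustrative Python sketch. It does not model SQL-H or any Teradata API; the node layout, the sample rows, and the function names (aggregate, without_pushdown, with_pushdown) are assumptions invented for this example. It only counts how many rows have to leave the Hadoop side in each case.

```python
from collections import Counter

# Hypothetical data: each inner list holds the rows stored on one HDFS node
# as (product, amount) pairs. All values are made up for illustration.
NODES = [
    [("a", 10), ("b", 5), ("a", 3)],
    [("b", 7), ("b", 1), ("c", 2)],
    [("a", 4), ("c", 6), ("c", 8), ("a", 1)],
]


def aggregate(rows):
    """Sum the amounts per product."""
    totals = Counter()
    for product, amount in rows:
        totals[product] += amount
    return totals


def without_pushdown(nodes):
    """Ship every raw row to the database and aggregate there."""
    shipped = [row for node in nodes for row in node]
    return dict(aggregate(shipped)), len(shipped)


def with_pushdown(nodes):
    """Aggregate on each node first and ship only the partial sums."""
    partials = [aggregate(node) for node in nodes]
    shipped = [item for partial in partials for item in partial.items()]
    return dict(aggregate(shipped)), len(shipped)


if __name__ == "__main__":
    print(without_pushdown(NODES))
    print(with_pushdown(NODES))
```

Both variants produce the same totals, but the pushdown variant transfers 6 partial rows instead of 10 raw rows. With real data volumes, the gap between raw rows and partial aggregates is typically far larger, which is why the lack of pushdown matters.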
For now, these limitations will force developers to find other “Hadoopish” solutions. Just keep in mind how ETL tools are misused:
Suppose the ETL server is not performing well for a certain kind of transformation. In that case, it is often decided to move the mapping logic for the critical part directly into some pre-aggregation steps in the RDBMS. This breaks the data lineage, yet it is considered “state of the art”, and such implementations are hard to remove later.
Another weak spot is the proprietary SQL-H syntax Teradata offers. It adds little convenience, is neither ANSI SQL nor regular Teradata SQL, and many third-party tools will not support it for quite some time.
I am delighted by the possibilities Hadoop and MapReduce offer, but I fear some implementations may turn out to be shoddy work that is replaced sooner or later by a proper implementation. Unfortunately, this will probably cause a lot of redesign on the client side to fully leverage the functionality.
Although big data is the new buzzword, I don’t think most companies are in a big rush to acquire these new technologies, and I wish database vendors would give themselves more time. We are talking about a new technology and framework. While it has great potential for handling large amounts of data, I still see many customers struggling even with data warehousing basics. I remember when “active data warehousing” was “the next big thing”…