Hadoop — a consultant's perspective
Please read an excellent post https://medium.com/@acmurthy/hadoop-is-dead-long-live-hadoop-f22069b264ac before reading this.
A bit of context: my Hadoop journey started with the 0.20 series, at a fairly early stage of the project's evolution. As a consultant with a background in both application and data warehouse architectures, it was an easy transition from tool/workflow-based data management to programmable data management. As the title suggests, this post is a solution consultant's perspective on Hadoop.
Parallel programming at programmers' fingertips — sure, high-performance computing (HPC) existed before, but Hadoop and MapReduce effectively put parallel/distributed programming into the hands of programmers for day-to-day use. No more exotic and expensive computing machines: just a bunch of low-end hardware machines, and the magic of distributed computing unfolds. Given that most programming languages/frameworks had no trivial support for taking advantage of multi-core architectures, this was the phenomenal contribution of Hadoop and MapReduce. The innovation continued beyond the MapReduce framework to newer frameworks like Spark and Flink, and it is not going to stop.
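To make that concrete, here is the canonical word count written as a pair of Hadoop Streaming scripts, a minimal sketch in Python with illustrative file names and paths:

```python
#!/usr/bin/env python3
# mapper.py: emit a (word, 1) pair for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py: sum the counts per word. Hadoop delivers keys sorted,
# so a running total per key is all the state we need.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Submitted to a cluster via the hadoop-streaming jar (the scripts passed with -mapper and -reducer, plus -input and -output paths), the framework takes care of splitting, shuffling, sorting, and retries; the two scripts remain plain Python that any programmer can write and test locally with a Unix pipe.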
Schema on read — Hadoop separated compute and storage and effectively removed the need to define a schema up front before storing data, making it possible to hold heterogeneous data formats and hitting the traditional schema-first database approach hard. Another big contribution of Hadoop.
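As a small sketch of what schema on read looks like in practice (using PySpark here; the path and field names are hypothetical), raw JSON files land in storage as-is, and structure is applied only when the data is read, so different jobs can read the same bytes with different schemas:

```python
# Schema on read with PySpark: store raw JSON lines untouched,
# apply structure only at read time. Path and fields are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# This job's view of the raw files; another job may choose a different one.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event", StringType()),
    StructField("ts", LongType()),
])

events = spark.read.schema(schema).json("hdfs:///raw/events/")
events.groupBy("event").count().show()
```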
The open-source ecosystem — the number of Hadoop sub-projects is just incredible: as of today there are around 49 projects in the Apache big-data category, and a majority of them are great projects/frameworks in their own right (Hive, HBase, Spark, Parquet; the list goes on). Add to this the projects that grew out of the experience of using Hadoop (not necessarily at Apache or under the Apache license): data stores, filesystems, file formats, and more. Another big contribution of Hadoop, and great open innovation.
Hadoop as data warehouse replacement!!! — way back, after my first Hadoop presentation, the question asked was: is Hadoop a low-cost replacement for an existing data warehouse solution? My answer was that it had potential, but not today, not immediately; maybe that is not the way to look at it at all. Looking at it today, it certainly changed the way data management and data warehouse solutions are viewed: adoption of heterogeneous data formats and sources, flexible schemas, programmable rather than tool-driven workflows. It shook up traditional databases, data warehouses, MPP systems, appliances, ETL tools, and more, and contributed to a paradigm shift in data management. So, back to the question: can Hadoop be a low-cost replacement for a data warehouse?
Mutable datasets, transactions, role-based access, BI tool reporting — these are arguably core requirements for a data warehouse implementation, and Hadoop either addressed them too late in its evolution or the solutions were not ideal and not on par with traditional RDBMS offerings. Certainly the original philosophy was not to have these attributes; they were bolted on later as the Hadoop project evolved.
SQL for the data, SQL is the KING! — relational databases, NoSQL, document stores, graph stores, key-value stores, streaming events: anything and everything that stores data needs SQL (just count the SQL wrappers on top of all these solutions). There is no way out of it, and that is the big lesson for anyone implementing data management solutions.
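A small illustration of the point, again with PySpark and hypothetical paths and column names: a full SQL interface over plain files sitting in a filesystem, with no database in sight:

```python
# "SQL on everything": register a directory of Parquet files as a view
# and query it with plain SQL. Path and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-files").getOrCreate()

spark.read.parquet("hdfs:///warehouse/orders/").createOrReplaceTempView("orders")
spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""").show()
```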
Competing solutions/projects in the Hadoop ecosystem — the nature of an open-source ecosystem is such that anyone can come up with their own implementation over existing solutions; each one's milieu and customer experience varies, and the result was multiple solutions/projects in the same problem space, which created confusion for customers.
Wrong use cases — there were, of course, wrong use cases identified for Hadoop implementations. A lot of businesses, if not everyone, saw it as a low-cost replacement for databases/data warehouses and analytical data stores/reporting, treating Hadoop implementations as database migrations rather than transformational projects, when the right approach would have been quite different.
Cloud-native and less operational complexity — the low-cost, cloud-native, ready-made, pay-as-you-go SaaS data management solutions emerging as alternatives are probably making Hadoop as a tool redundant, but they of course have the Hadoop philosophy embedded in them: distributed computing, separate storage and compute, multiple data formats, flexible schemas, combined with the traditional RDBMS attributes of ACID, changing datasets, and RBAC.
So,
Hadoop as a philosophy will continue, and it will probably be regarded as the system that brought a paradigm shift to data management.
Distributed computing, storage and compute as separate systems, flexible schemas, and heterogeneous data formats are the foundations of new data management systems.
No matter how many tools are available, when it comes to data, SQL is the Lion in the room.
Disclaimer: All views expressed here are personal views.

