How Much Does My Hadoop Cost Me?

I will not spend your valuable time on the obvious cost elements in a typical enterprise setting. To use Hadoop, you need to buy hardware and licenses. Additionally, there are fees for support provided by enterprise-level vendors such as Cloudera. At least one team of engineers is necessary to keep things running 24/7, and if your company operates globally, you probably need to multiply that cost by three to cover all the time zones and meet data locality requirements. These obvious costs are an integral part of the business, and you either include them in the company’s budget or stop doing big data.

Certain businesses may need to pay extra for special data protection. Such expenses include frequent backups, maintaining historical data for audit purposes, and complying with rigid disaster recovery protocols. They are particularly common in industries like finance and healthcare, where the total cost of ownership can increase significantly. However, it’s a cost the company should expect.

A few years back, I read an IDC report titled The Economic Benefits of Migrating Apache Spark and Hadoop to Amazon EMR. It dates from late 2018, but it provides insights into the cost of on-premise data centers. The report is based on data from several enterprises dealing with over 3.5 PB of data on average, and its figures are staggering. Over a five-year operational span, IT infrastructure costs average more than $30M. IT staff expenses vary by role and sum up to more than $20M. These figures illustrate the order of magnitude involved, sparking thoughts about whether big data is only for the biggest market players. But is it really?

Hidden Costs and Technical Debt

I would now like to focus on the less obvious expenses associated with Hadoop-based big data in the 2020s. The very first thing that comes to my mind is technical debt. This broad term covers various aspects of software development and system architecture. You might wonder why Hadoop, in particular, could cause technical debt. The answer isn’t self-evident: it’s mostly about the upgrade process, which is costly in terms of time, effort, and potential disruptions, and whose complexity depends on the scale of the deployment.

Let me consider the example of the most popular big data processing framework. Many companies have been using older Spark versions for years. Spark 2.4 was released in 2018; the current release is 3.4.x. The need for an upgrade arises if you want to remain supported: commercial Hadoop distributions require upgrades at some point if you want to avoid paying absurd fees for extended support of older versions. Migrating hundreds of workloads to a new platform is a complex task, to say the least.

Maintaining Hadoop, that is, a typical Hadoop cluster, becomes difficult due to its bare-metal deployment nature. All essential software components required to run your big data workloads must be physically installed on the cluster machines. Adopting a newer Spark version means installing it on every machine while ensuring compatibility between multiple installations on the same cluster. Commercial vendors like Cloudera usually offer packaged solutions for this, but this approach delays adoption of the latest Spark version. Even if, as a developer, the latest stable release of Spark is a perfect match for your processes and you could clearly benefit from using it, you need to wait for vendor support for that version and for a cluster upgrade (which is non-trivial). By the time it finally arrives, there is a high chance you have already built what you needed on the old version, and you will never get time to refactor it because other priorities take over. As a result, you have accumulated technical debt. However, there is no need to worry: this issue only arises under certain constraints.

Exploring Alternatives

As mentioned above, the key issue with typical Hadoop deployments is their bare-metal nature. What is the opposite of bare metal? Virtualization, or more intriguingly, containerization (which is a form of virtualization). Interestingly, Hadoop has supported containerized applications on YARN since version 3.x. Nevertheless, adoption of this feature among enterprises has remained rather low. Possible reasons include security implications, Hadoop’s overall complexity, and the availability of reasonable alternatives.

Containerization introduces a pivotal advantage: isolation. You can maintain a whole library of Docker images featuring an array of Spark versions, and a single infrastructure setup can support multiple versions of Spark and other software components at once. This enables developers to use the latest stable release of any software. So, what is the best available container orchestrator? Kubernetes, of course. And Spark’s Kubernetes support has been generally available since Spark 3.1. This is a real game-changer compared to the YARN setting.
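To make the isolation point concrete, here is a minimal PySpark sketch of how a single job could pin its own Spark version via its container image when submitted to Kubernetes. The API server address, image tag, namespace, and service account below are placeholders for illustration, not settings from any particular cluster.

    # Minimal sketch: pinning a Spark version per application via its container image.
    # The API server URL, image tag, namespace, and service account are placeholders.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("my-isolated-job")
        # Point Spark at the Kubernetes API server instead of YARN.
        .master("k8s://https://<k8s-api-server>:6443")
        # Each job ships its own image, so different Spark versions can coexist on one cluster.
        .config("spark.kubernetes.container.image", "my-registry/spark:3.4.1")
        # Namespace and service account are cluster-specific; adjust to your setup.
        .config("spark.kubernetes.namespace", "data-jobs")
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
        .getOrCreate()
    )

    spark.range(1_000_000).summary().show()
    spark.stop()

Because every job ships its own image, two teams can run Spark 2.4 and Spark 3.4 side by side on the same cluster without any coordination about which version is “installed”.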

Kubernetes is a powerful container orchestrator that is well suited to running distributed data processing applications. Running autoscaling Spark applications on Kubernetes is relatively straightforward, and I plan to elaborate on it in my next article.
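As a taste of that, below is a sketch of the dynamic-allocation settings I would start from on Kubernetes. The executor bounds and timeout are illustrative assumptions to tune per workload; the shuffle-tracking flag compensates for the external shuffle service that Kubernetes deployments typically lack.

    # Sketch of executor-autoscaling knobs for Spark on Kubernetes; values are illustrative.
    from pyspark.sql import SparkSession

    autoscaling_conf = {
        # Let Spark request and release executor pods based on pending tasks.
        "spark.dynamicAllocation.enabled": "true",
        # No external shuffle service on Kubernetes, so track shuffle data instead.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": "0",
        "spark.dynamicAllocation.maxExecutors": "50",
        # Release idle executors quickly so the cluster autoscaler can remove nodes.
        "spark.dynamicAllocation.executorIdleTimeout": "120s",
    }

    builder = SparkSession.builder.appName("autoscaling-batch-job")
    for key, value in autoscaling_conf.items():
        builder = builder.config(key, value)
    # builder.getOrCreate() would start the session once the master and container
    # image settings from the previous sketch are added.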

 

Ok, but so what?

So far, I’ve explained the hidden costs arising from Hadoop cluster upgrade complexities and explored how containerization can help you avoid falling into permanent technical debt. Kubernetes is one of the interesting technologies you should consider when dealing with big data, as it offers the potential for substantial savings.

With the cloud, there is no need for huge data centers. It stays simple only if you configure the cloud properly, monitor resource consumption, optimize infrastructure components, and tag them consistently. By following these steps, organizations can get a fair understanding of their computational demands and consequently reduce expenses. I’m not suggesting you should close all your data centers. Still, you should consider the public cloud as a supplementary source of computing power before escalating hardware investments. The concept of hybrid cloud is another topic I would like to write about.
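As an illustration of the visibility that tagging gives you, here is a minimal sketch assuming AWS, the boto3 SDK, and a hypothetical “team” cost-allocation tag (which must be activated in billing beforehand); it pulls one month of spend grouped by that tag so each team’s computational demand shows up as a line item.

    # Minimal sketch, assuming AWS Cost Explorer and a hypothetical "team" cost-allocation tag.
    import boto3

    ce = boto3.client("ce")

    response = ce.get_cost_and_usage(
        TimePeriod={"Start": "2023-01-01", "End": "2023-02-01"},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        # Group spend by the tag so each team's usage is visible as a separate line.
        GroupBy=[{"Type": "TAG", "Key": "team"}],
    )

    for result in response["ResultsByTime"]:
        for group in result["Groups"]:
            tag_value = group["Keys"][0]  # e.g. "team$data-platform"
            amount = group["Metrics"]["UnblendedCost"]["Amount"]
            print(tag_value, amount)

The same idea applies on GCP or Azure with their billing exports; the point is that cost only becomes manageable once it is attributable.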

Kubernetes is a cloud-native technology. “Cloud-native” is a buzzword lacking a clear definition, but I like it very much. It’s worth noting that you can effortlessly set up Kubernetes clusters on all major public cloud platforms, with autoscaling mechanisms that seamlessly add nodes in response to increased computational demand. I’m sure some of you are now thinking that AWS EMR, GCP Dataproc, and Azure HDInsight can also autoscale. Of course, those Hadoop distributions benefit from automated scaling (in the report I mentioned above, you can also check the savings reported by some enterprises that switched to EMR). But as they are typical Hadoop clusters, they have their issues with operating efficiently in the cloud (they’re simply not cloud-native). This is something I can cover in another blog article. Anyway, as evidence that Kubernetes is a good choice, consider that Databricks uses Kubernetes under the hood to spin up its clusters.

Autoscaling can effectively handle the peaks in your computational demand and scale clusters down to near zero when no processing takes place. It’s a perfect fit for typical batch processing scenarios. In many cases, this approach is far more cost-effective than maintaining huge internal data centers.

 

Democratizing Big Data Processing

Cloud, open-source projects, and Kubernetes are the catalysts of an important change. In the not-so-distant past, to do big data you had to be an enterprise equipped with enormous on-premise data centers and an extensive team of engineers. That has totally changed. Nowadays, you can launch a startup and start processing massive amounts of data almost instantly, provided you have funds secured. Wealthier companies can opt for user-friendly commercial big data platforms that hide a lot of complexity from developers; Databricks is a pioneering example here. However, smaller and less wealthy companies have the option of composing powerful platforms on top of Kubernetes using open-source components. This approach is much less expensive, but it requires a skilled team that can implement the solution properly. Big data is truly for everybody these days.

 

Author

  • Krzysztof Bokiej
  • Cloud and Big Data Architect
  • Krzysztof Bokiej is a Technical Leader and IT Expert from Craftware. His experience covers Data Science, Big Data / Hadoop, Business Intelligence and Data Warehousing.

    He has been devoted to supporting the Life Sciences and Healthcare industry for years. Krzysztof is passionate about applying modern infrastructure and advanced Big Data techniques to specialised medical data modalities to improve the efficiency of processing and analysis.
