Cloudera and Hortonworks merger means Hadoop’s influence is declining

October 7, 2018   Big Data

On Wednesday, Cloudera and Hortonworks announced a “merger of equals,” in which Cloudera is acquiring Hortonworks with stock so that Cloudera shareholders end up with 60 percent of the combined company. The deal signals that the Hadoop market can no longer sustain two big competitors. Hadoop has been synonymous with big data for years, but the market, and customer needs, have moved on. Several megatrends are driving this change:

The public cloud tide is rising

The first megatrend is the shift to public cloud. Companies of all sizes are increasing their adoption of AWS, Azure, and Google Cloud services at the expense of on-premises infrastructure and software. Enterprise server revenues reported by IDC and Gartner continue to decline. The top three cloud providers (90 percent of the market) offer their own managed Hadoop/Spark services, such as Amazon’s Elastic MapReduce (EMR). These are fully integrated offerings that have a lower cost of acquisition and are cheaper to scale. If you’re making the shift to cloud, it makes sense to look at alternative Hadoop offerings as part of that move; it’s a natural decision point. Ironically, there has been no Cloud Era for Cloudera.

Crushing storage costs

The second megatrend? Cloud storage economics are crushing Hadoop storage costs. At introduction in 2005, the Hadoop Distributed File System (HDFS) was revolutionary: It took servers with ordinary hard drives and turned them into a distributed storage system capable of parallel IO consumable by Java apps. There was nothing like it, and it was a crucial component that allowed large-scale data sets that didn’t fit onto a single machine to be processed in parallel. But that was 13 years ago. Today, there is a plethora of much cheaper alternatives, primarily object storage services like AWS S3, Azure Blob Storage, and Google Cloud Storage. A terabyte of cloud object storage costs about $20 a month, compared to about $100 a month for HDFS (not including the cost to operate it). That is why Google’s HDFS service, for example, is merely a shim that translates HDFS operations into object storage operations: doing so is 5x cheaper.
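To make the storage arithmetic concrete, here is a minimal Python sketch using only the two per-terabyte prices quoted above. The 500 TB data set size and the function names are illustrative assumptions, not figures from any vendor, and the HDFS price still excludes operating cost.

```python
# Per-terabyte monthly prices as quoted in the article (USD).
# The HDFS figure does not include the cost of operating the cluster.
OBJECT_STORAGE_USD_PER_TB_MONTH = 20
HDFS_USD_PER_TB_MONTH = 100


def monthly_storage_cost(terabytes: float, usd_per_tb_month: float) -> float:
    """Return the monthly storage bill in USD for a given data volume."""
    if terabytes < 0:
        raise ValueError("terabytes must be non-negative")
    return terabytes * usd_per_tb_month


def hdfs_to_object_storage_ratio() -> float:
    """Return how many times more HDFS costs per terabyte than object storage."""
    return HDFS_USD_PER_TB_MONTH / OBJECT_STORAGE_USD_PER_TB_MONTH


if __name__ == "__main__":
    data_tb = 500  # hypothetical data set size, chosen only for illustration
    object_cost = monthly_storage_cost(data_tb, OBJECT_STORAGE_USD_PER_TB_MONTH)
    hdfs_cost = monthly_storage_cost(data_tb, HDFS_USD_PER_TB_MONTH)
    print(f"Object storage: ${object_cost:,.0f} per month")
    print(f"HDFS:           ${hdfs_cost:,.0f} per month")
    print(f"Ratio:          {hdfs_to_object_storage_ratio():.0f}x")
```

Because both prices scale linearly with volume, the ratio stays at 5x regardless of the data set size you plug in.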

Faster, better, and cheaper cloud databases

Hadoop’s problems don’t end there, because it’s not just about direct competition from cloud-vendor Hadoop/Spark services and cheaper storage. The third megatrend is the advent of “serverless” cloud services that completely eliminate the need to run Hadoop or Spark at all. A common use case for Spark is to handle ad-hoc distributed SQL queries for users. Google was first to market with a revolutionary service called BigQuery in 2011 that solves the same problem in a completely different way. It lets you run ad-hoc queries on any amount of data stored in its object storage service (you don’t have to load it into special storage like HDFS). You just pay for the compute time: If you need 1,000 cores for 3.5 seconds to run your query, that’s all you pay for. There is no need to provision servers, install the OS, install software, configure everything, scale the cluster to 1,000 nodes, and feed and care for the cluster as you would with Hadoop/Spark. Google does all that, hence the moniker “serverless.” There are banks running 2,000-node Hadoop/Spark clusters operated and maintained by scores of IT people that can’t match BigQuery’s flexibility, speed, and scale. And they have to pay for all the hardware, software, and people to run and maintain Hadoop.
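The pay-for-compute point can be illustrated with a rough Python sketch. It compares the compute consumed by the query in the example above (1,000 cores for 3.5 seconds) with the capacity of a cluster of the same size kept running for a whole hour. The one-hour window and the function names are assumptions made purely for illustration, and no actual pricing is modeled.

```python
def core_seconds(cores: int, seconds: float) -> float:
    """Return compute consumed, measured in core-seconds."""
    return cores * seconds


def utilization(used_core_seconds: float, cluster_cores: int, window_seconds: float) -> float:
    """Return the fraction of an always-on cluster's capacity that was actually used."""
    capacity = core_seconds(cluster_cores, window_seconds)
    if capacity <= 0:
        raise ValueError("cluster capacity must be positive")
    return used_core_seconds / capacity


if __name__ == "__main__":
    # The query from the article: 1,000 cores for 3.5 seconds.
    query = core_seconds(1_000, 3.5)

    # Hypothetical alternative: a 1,000-core cluster kept running for one hour.
    one_hour = 3_600
    share = utilization(query, 1_000, one_hour)

    print(f"Query used {query:,.0f} core-seconds")
    print(f"That is {share:.3%} of the cluster's capacity for the hour")
```

Under these assumptions a single such query uses a tiny fraction of what a provisioned cluster has to keep available, which is the gap the serverless model avoids paying for.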

BigQuery is just one example. Other cloud database services are similarly massive-scale, highly flexible, globally distributed “pay for what you use” databases. There are start-up Snowflake, Google Cloud Bigtable, AWS Aurora, and Microsoft Azure Cosmos DB. They’re all much easier to use than a Hadoop/Spark install, and you can be up and running in 5 minutes for tens of dollars, with no $500k purchase order and no weeks of installation, configuration, and training required.

Python and R data science running on containers and Kubernetes

The fourth megatrend is containers and Kubernetes. Hadoop/Spark is not just a storage environment but also a compute environment. Again, back in 2005, this was revolutionary: the MapReduce approach of Hadoop provided a framework for parallel computation in Java applications. But the Java-centric nature (Scala-centric for Spark) of Cloudera and Hortonworks infrastructure is at odds with today’s data scientists doing machine learning in Python and R. The need to constantly iterate and improve machine learning models and to have them learn on production data means native deployment of Python and R models is a necessity, not a “nice to have.”

As recently as this week, the big Hadoop vendors’ advice has been “translate Python/R code into Scala/Java,” which sounds like King Hadoop commanding the Python/R machine learning tide to go back out again. Containers and Kubernetes work just as well with Python and R as they do with Java and Scala, and provide a far more flexible and powerful framework for distributed computation. And it’s where software development teams are heading anyway – they’re not looking to distribute new microservice applications on top of Hadoop/Spark. Too complicated and limiting.

A shift in data gravity

The net is that after a good 10 years of Cloudera and Hortonworks being the center of the Big Data universe, the center of gravity has moved elsewhere. The leading cloud companies don’t run large Hadoop/Spark clusters from Cloudera and Hortonworks – they run distributed cloud-scale databases and applications on top of container infrastructure. They do their machine learning in Python, R, and other languages that are not Java. Increasingly, enterprises are shifting to similar approaches because they want to reap the same speed and scale benefits. It’s time for the Hadoop and Spark world to move with the times.

Mathew Lodge is SVP of Products and Marketing at Anaconda. He has over 20 years’ diverse experience in cloud computing and product leadership. Prior to joining Anaconda, he served as Chief Operating Officer at Weaveworks, the container and microservices networking and management startup; and he was previously Vice President in VMware’s Cloud Services group and co-founded what became VMware’s vCloud Air IaaS service.

Big Data – VentureBeat
