Business Intelligence Info

Four factors for comparing the top Hadoop distributions

May 4, 2016   BI News and Info

Although the software components that constitute the Hadoop ecosystem stack are open source technologies, there are numerous benefits to paying a vendor for a subscription to use its commercial Hadoop platform. For example, a subscription provides technical support and training, as well as access to enterprise features not available to the open source community. While the enterprise editions of vendor Hadoop distributions all provide the core components of the Hadoop ecosystem stack, the key differentiators are what these vendors offer beyond the openly accessible functionality.

Recent changes in the market have thinned the ranks of Hadoop vendors. Just this month, for example, Pivotal Software pulled the plug on its own Hadoop distribution and said it would start reselling Hortonworks’ instead. But there’s still a diverse group of suppliers to consider, including independent Hadoop specialists, cloud providers and two of the largest IT vendors.

To help you determine which Hadoop provider is right for your organization, this article compares the top Hadoop distributions on four key characteristics: deployment models, enterprise-class features, security and data protection features, and support services.

Note that while the Hadoop big data management ecosystem is engineered to support scalable data storage and high-performance distributed computing, your actual performance may vary for several reasons, including the software implementation. But many performance issues are dependent on the planned applications themselves. To address this, we’ll further examine how the Hadoop product distributions are targeted to meet the business needs of user organizations.

1. Hadoop deployment models

Most of the Hadoop vendors support a mix of deployment methods, but Hadoop offerings from Microsoft and Amazon Web Services are deployed solely in cloud environments. Microsoft leverages its Azure cloud infrastructure for HDInsight, a managed service based on the Hortonworks Data Platform (HDP), the same Hadoop distribution that Pivotal is now reselling. AWS uses its Amazon Elastic Compute Cloud (EC2) platform and S3 data store to underpin Amazon Elastic MapReduce (EMR), which bundles its Hadoop distribution with various other tools and technologies. In addition, Amazon EMR provides the option of using MapR’s Hadoop distribution instead of the Amazon one.

The cloud deployment model provides a rapid yet low-effort means of provisioning a Hadoop cluster, and both Microsoft and AWS enable users to resize their environments on demand to handle dynamic computing and storage capacity needs. This elasticity is desirable for organizations with computational and storage needs that may vary over time.

While the other major Hadoop vendors — Cloudera, Hortonworks, IBM and MapR — all offer cloud-based deployments, they aren’t limited to that model. They allow users to download distributions that can be deployed on-premises or in private clouds on a variety of servers, including Linux and Windows systems. In addition, Cloudera and MapR also provide sandbox versions that can be run in a virtual environment such as VMware.

The bottom line: Consider whether your organization prefers to manage its big data environment in-house or use a hosted service. In-house management implies oversight and maintenance of the software environment and continuous monitoring of the system, whether that environment is a physical platform on premises or housed using a cloud-based service. The on-premises option may be preferable if you have experienced staff and know the proper system sizing characteristics, or if security concerns warrant managing the system behind a trusted firewall.

The alternative is to use a vendor with a hosted services platform that will help configure, launch, manage and monitor your operations. This may be preferable if you aren’t sure what size system you will need or expect that the system size will grow based on increasing demand. The benefit of working with a cloud or hosted service is that it will provide the necessary elasticity for both storage and processing resources.
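As a rough aid for shortlisting, the deployment options described in this section can be captured in a small lookup table. The sketch below is only a transcription of the article's own statements; the dictionary, the `cloud` and `on_premises` labels, and the function names are illustrative choices, not vendor terminology.

```python
# Deployment options per vendor, transcribed from the section above.
# The labels "cloud" and "on_premises" are illustrative names, not vendor terms.
DEPLOYMENT_MODELS = {
    "Microsoft": {"cloud"},
    "AWS": {"cloud"},
    "Cloudera": {"cloud", "on_premises"},
    "Hortonworks": {"cloud", "on_premises"},
    "IBM": {"cloud", "on_premises"},
    "MapR": {"cloud", "on_premises"},
}


def vendors_supporting(model):
    """Return the vendors, sorted by name, that offer the given deployment model."""
    return sorted(v for v, models in DEPLOYMENT_MODELS.items() if model in models)


def cloud_only_vendors():
    """Return the vendors whose Hadoop offering is deployed solely in the cloud."""
    return sorted(v for v, models in DEPLOYMENT_MODELS.items() if models == {"cloud"})


if __name__ == "__main__":
    print("On-premises capable:", vendors_supporting("on_premises"))
    print("Cloud only:", cloud_only_vendors())
```

A table like this is easy to extend with further columns (for example, sandbox availability) as you collect details from each vendor.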

2. Enterprise-class features of the top Hadoop distributions

There are some notable differences in the development approaches of the three independent Hadoop vendors. Cloudera often augments the Hadoop core with internally developed add-on technologies: for example, its Impala SQL-on-Hadoop query engine; Cloudera Manager administration tools; and Kudu, an alternative data store to the Hadoop Distributed File System (HDFS) for use in real-time analytics applications. Typically, the company now open sources such technologies after doing the initial development work itself. Hortonworks, on the other hand, promotes that it’s “innovating 100% of its software in the Apache Hadoop community, and there are no proprietary extensions.” Add-on technologies that it drives, such as the Ambari provisioning and management software, are launched as open source projects from the outset. In addition, Hortonworks has banded together with IBM and other companies to form the Open Data Platform Initiative (ODPi), an organization devoted to creating a common set of core technical specifications for Hadoop platforms. ODPi members claim that this will improve interoperability and minimize vendor lock-in.

MapR has taken a third path by developing its own file system, MapR-FS, instead of using HDFS, as well as its own NoSQL database, MapR-DB, and other foundational technologies in an effort to support deployments of large clusters with enterprise-class performance needs. MapR also is increasingly focusing on real-time and stream processing applications. In late 2015, the company rebranded its product as the MapR Converged Data Platform, which combines Hadoop and the MapR file system and database with the Apache Spark processing engine and a new event streaming technology called MapR Streams in order to handle both batch and real-time jobs.

From a features standpoint, the enterprise version of the Cloudera CDH distribution provides tools for operational management and reporting and for supporting business continuity. This includes such items as configuration history and rollbacks, rolling updates and service restarts, and automated disaster recovery. MapR’s enterprise offering provides tools to better manage and ensure the resiliency and reliability of data in Hadoop clusters, as well as multi-tenancy and high availability capabilities. Hortonworks provides proactive monitoring and maintenance with its HDP support subscriptions.

IBM, meanwhile, has adopted an analytics-oriented strategy on its BigInsights for Apache Hadoop distribution, in keeping with its broader focus on selling business intelligence and advanced analytics tools. IBM offers different value-add modules with enterprise-grade features as part of BigInsights, including separate Analyst and Data Scientist modules. Its Analyst module provides Big SQL for federated SQL access to Hadoop and other data sources. BigSheets, which is part of the Analyst module, allows users to explore, transform and perform visualizations on large data sets stored in Hadoop, using an intuitive spreadsheet-like interface. The BigInsights Data Scientist Module includes a version of the R language, text analytics and a machine learning library called SystemML that has been contributed to the open source community.

While AWS’ cloud platform is the primary calling card for Amazon EMR, the service also offers tools for monitoring and managing clusters and for enabling application and cluster interoperability.

Amazon EMR collects metrics that are used to track progress and measure the health of a cluster. Cluster health metrics can be accessed through the command line interface, software developer kits or APIs and can be viewed through the EMR management console. Additionally, Amazon’s CloudWatch monitoring service can be used along with its implementation of the Apache Ganglia performance monitoring component to check the cluster and set alarms on events triggered by these metrics.
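To make the idea of setting alarms on cluster metrics concrete, here is a minimal sketch of threshold-based alarm evaluation. It is a simplified model, not the CloudWatch API: the function name, parameters, and sample utilization values are hypothetical, and only the three state names are borrowed from CloudWatch's alarm states.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Evaluate a simple threshold alarm over a series of metric datapoints.

    Returns "ALARM" when each of the most recent `evaluation_periods`
    datapoints exceeds `threshold`, "OK" when at least one of them does not,
    and "INSUFFICIENT_DATA" when too few datapoints have been collected.
    """
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    if all(value > threshold for value in recent):
        return "ALARM"
    return "OK"


if __name__ == "__main__":
    # Hypothetical storage-utilization percentages, one reading per period.
    utilization = [70, 85, 90, 92]
    print(alarm_state(utilization, threshold=80, evaluation_periods=3))
```

Requiring several consecutive breaches before raising the alarm is a common way to avoid reacting to a single short spike.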

The bottom line: Choosing a vendor that provides value-add components as part of its enterprise subscription may mean committing to a long-term relationship, especially if these components are tightly integrated with its standard stack distribution. If you’re concerned about vendor lock-in, consider those vendors that are participating in the ODPi.

3. Security and protection offerings from the Hadoop vendors

Despite the expanding use of open source software for enterprise-class applications, there remain suspicions about its suitability for production use from a security and protection perspective. Several Hadoop vendors have taken steps to alleviate some of this anxiety.

For example, Hortonworks has teamed up with other vendors and customers to launch a Data Governance Initiative for Hadoop, with an initial focus on a new Apache project called Atlas for managing shared metadata, data classification, auditing, and security and policy management for data protection. It’s also working to integrate Atlas with Ranger, an open source security tool for enforcing data access policies. Cloudera provides tools that enable users to manage data security and governance for the CDH platform, supporting an organization’s need to meet compliance and regulatory requirements.

In addition, Hortonworks, Cloudera, MapR and IBM all provide data encryption. Both Hortonworks and Cloudera support encryption of data at rest. MapR provides encryption of data transmitted to, from and within a cluster. IBM offers InfoSphere Guardium, which enforces data privacy and provides encryption and masking of confidential data.

The bottom line: The Hadoop vendors provide different approaches to authentication, role-based access control, security policy management and data encryption. Carefully specify your security and protection requirements and review how each vendor addresses those needs.

4. Support subscriptions for the top Hadoop distributions

The fundamental value proposition for the open source software model is the bundling and simplification of system deployment with support and services. One alternative for deploying Hadoop involves downloading the source code for each component from the open source repository and then building and integrating all the parts together. This takes both skill and effort, and is likely to be an iterative process. Open source vendors have already done the heavy lifting, providing preconfigured distributions and maintaining an up-to-date integrated stack.

What differentiates the vendors to a large degree is their support models. Hortonworks provides several models, ranging from its Jumpstart edition with Web-based support during business hours and one-day response time to its Enterprise edition with 24/7 support and much shorter response times depending on the severity of the issue. Cloudera offers a support subscription with one-hour and 24/7 support options for enterprise license holders. It also offers premium support for organizations with the Flex or Data Hub edition licenses that include a 15-minute response time for critical issues.

All AWS accounts include basic support, which provides 24/7 customer service, access to community forums and documentation, as well as access to the AWS Trusted Advisor application. Developer support includes one-hour response for severe issues — with 12- or 24-hour response times for most issues. Business-level support provides 24/7 email access to cloud support engineers as well as shortened response times based on severity. Enterprise-level support adds less than 15-minute response time for critical issues as well as a dedicated technical account manager, plus additional launch and operation support benefits.

MapR offers a Premium support service that adds Web and email support, custom portal, training, urgent bug fixes, follow-the-sun support and 24/7 phone support for priority issues. The company’s Premium+ Support adds priority queuing of tickets and single point of contact support, and offers options for onsite or remote dedicated support. IBM provides support for organizations that purchase the licensed components — also referred to as their value-add modules — that extend their Open Platform with Apache Hadoop.

The bottom line: If support services are the source of added value from the vendor, the costs for the different support subscriptions should be aligned with customer expectations. Subscriptions providing one-hour or even 15-minute response times on a 24/7 basis with dedicated support staff will cost a lot more than 24-hour response time from a Web-based interface during business hours.

Hadoop has transformed the business intelligence and analytics industry during the past 10 years. But, as we’ve examined, the open source Hadoop framework offers only so much, and companies that need more robust performance and functionality capabilities as well as maintenance and support are turning to commercial Hadoop software distributions. Hopefully, this information will help you make a more informed choice when purchasing a Hadoop distribution. 



SearchBusinessAnalytics: BI, CPM and analytics news, tips and resources

© 2021 Business Intelligence Info