Big data platforms pose structural issues for new users

March 23, 2016   BI News and Info

Taking full advantage of big data platforms, such as Hadoop and Spark, often requires a new education for IT and analytics teams on how to configure systems and partition data to maximize processing speeds.

For example, when Valence Health was working to deploy a Hadoop cluster in early 2015, the healthcare technology and services provider initially focused its internal training efforts on Drill, an open source SQL-on-Hadoop query engine that IT developers would be using to write extract, transform and load (ETL) scripts for processing incoming data. But that turned out to be the wrong approach.

The bigger issue, Valence CTO Dan Blake said, was getting a better understanding of Hadoop’s underlying structure and how to work with it effectively to optimize data processing performance. “We kind of started out with Drill training, but we really needed to do more in-depth training on Hadoop itself first,” he said. “It’s very different from a relational database.”

Blake’s team eventually went back to the basics on Hadoop, and he said it’s now working to wring as much processing speed as it can out of the cluster and Drill, which both went into production use last May.

Chicago-based Valence, which works with hospitals and health systems looking to transition to value-based care methodologies, is using Drill to help pull 3,000 daily data feeds containing 45 different types of healthcare data into the 15-node Hadoop cluster for downstream analysis. That can amount to up to 25 million records on a busy day, Blake said, adding that the cluster — based on MapR Technologies’ Hadoop distribution — can now handle the processing workload in “an hour or two.”
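Drill's appeal for ETL work like Valence's is that it queries raw files in Hadoop directly, with no upfront schema load. A minimal sketch of that pattern, using Drill's file-path syntax (the path, alias and column names here are invented for illustration, not Valence's actual schema):

```sql
-- Hypothetical Drill query against a raw JSON feed landed in the cluster.
-- Drill infers the schema from the file itself; no CREATE TABLE is needed.
SELECT t.member_id,
       t.claim_type,
       CAST(t.service_date AS DATE) AS service_date
FROM dfs.`/landing/feeds/2016-03-23/claims.json` t
WHERE t.claim_type = 'inpatient';
```

That schema-on-read convenience is also why the underlying Hadoop training mattered: how the files are laid out on the cluster, not the SQL, determines how fast a query like this runs.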

Learning curve to climb on using Hadoop

Figuring out how to fully leverage Hadoop was also a big challenge for developers and systems administrators at Progressive Insurance, according to Chris Barendt, an IT architect at the Mayfield Village, Ohio, auto insurer. “It was kind of a steep learning curve on understanding how to run the environment,” he said.


Progressive is using Hive, another open source SQL-on-Hadoop technology, for both ETL and analytics to give its SQL-savvy business analysts and data scientists a familiar programming environment. But echoing Blake, Barendt said that Hadoop is a “completely different” platform to work with than SQL-based relational databases are.

And there is configuration work to do in Hadoop, even though it doesn’t impose the same kind of rigid data formats that relational systems do. The data in a Hadoop cluster may be unstructured or semi-structured in nature, but it has to be properly set up and partitioned to get good query performance, Barendt said. “You still need to do good design — it’s not magic.”

In fact, he added that deploying Hive “was probably the easiest part of using Hadoop” at Progressive, which is running Hortonworks’ distribution of the big data framework.
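Partitioning is the main design lever Barendt alludes to: laying Hive data out by a commonly filtered column lets the engine skip irrelevant files entirely (partition pruning). A hedged sketch, with invented table and column names:

```sql
-- Hypothetical Hive DDL: partitioning by load date means each date's rows
-- land in their own directory under the table's location.
CREATE TABLE claims (
  policy_id STRING,
  claim_amt DECIMAL(10,2)
)
PARTITIONED BY (load_date STRING)
STORED AS ORC;

-- A query filtering on the partition column reads only the matching
-- directory's files, not the whole table.
SELECT policy_id, claim_amt
FROM claims
WHERE load_date = '2016-03-23';
```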

Speed not a given on big data platforms

Sellpoints Inc. faced similar configuration and partitioning hurdles after it began using a cloud-based Spark system from Databricks early last year. “It took us a while to figure out how to make that work,” said Benny Blum, vice president of product and data at the Emeryville, Calif., company, which provides online marketing and advertising services to corporate clients.

Sellpoints uses Spark to process online activity data captured from websites, running ETL routines created with the technology’s Spark SQL module to prepare the data for analysis. At first, Blum said, his team primarily had to focus on getting the Spark system to “a steady state” on processing — a common priority for users implementing big data platforms, especially with emerging technologies like Spark. But steady didn’t necessarily mean speedy, he added.

Last fall, Sellpoints began working to partition its data sets for faster query performance. ETL jobs that previously took 30 to 45 minutes to run with Spark can now be completed in as little as 10 seconds, Blum said. “You really need to take the time to understand the right way to structure your data,” he advised other users, saying that doing so properly “allows you to access your Spark data in a much more efficient way.”
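The kind of restructuring Blum describes can be sketched in Spark SQL itself: rewriting raw data into a table partitioned by a query predicate so downstream jobs touch only the slices they need. The table and column names below are invented, and the CREATE TABLE ... USING syntax is from later Spark releases than Sellpoints would have started on:

```sql
-- Hypothetical Spark SQL ETL step: rewrite raw clickstream data as a
-- Parquet table partitioned by event date.
CREATE TABLE events_by_date
USING parquet
PARTITIONED BY (event_date)
AS
SELECT user_id,
       page_url,
       to_date(event_ts) AS event_date
FROM raw_events;

-- Downstream queries that filter on event_date prune to a single
-- partition instead of scanning the full data set.
SELECT COUNT(*) FROM events_by_date
WHERE event_date = DATE '2016-03-23';
```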

Craig Stedman is executive editor of SearchDataManagement. Email him at cstedman@techtarget.com and follow us on Twitter: @sDataManagement.



SearchBusinessAnalytics: BI, CPM and analytics news, tips and resources
