Enterprise Spark maturity is still a work in progress


Apache Spark has been available as an open source project since 2010. Although adoption has grown rapidly since then, the distributed computing framework may still be finding its footing as a broadly applicable enterprise tool.

For his part, Matei Zaharia, the creator of Spark and a co-founder of Databricks, feels enterprise readiness is no longer an issue. “It’s not just being experimented with, but it’s actually being used,” he said in an interview at the Spark Summit East 2017 conference in Boston. “When we look, there are over 1,000 companies of all sizes using it.”

Spark is being used at large tech companies like Microsoft, Facebook, Google and Apple, as well as many financial services companies, like Capital One, Goldman Sachs and Bloomberg. A number of smaller companies are also using Spark. Zaharia said whether or not a company uses Spark doesn’t come down to whether they feel it is a mature technology, but rather whether they are doing anything with big data.

“If you’re doing big data, it’s very likely you are using Spark,” Zaharia said.

Technology seen as fairly mature

Aaron Colcord, director of engineering at banking software and mobile payment company FIS Global, based in Jacksonville, Fla., also said he views Spark as a fairly mature technology, though not without caveats.

“I think the platform is at the tipping point for enterprise use,” he said in an interview. “The platform is still really techy, but I see a lot of innovation around that to improve.”

Colcord said a lot of enterprises are looking for tools that offer simple graphical user interfaces, something Spark doesn’t really have today. He also doubted that some enterprises, particularly in his company’s industry of financial services, are willing to go with true open source tools. Concerns about stability and security, he said, make any true open source tool a questionable investment for them.

FIS Global got around some of these concerns by using Databricks’ cloud enterprise Spark platform, which offers additional tools on top of the basic open source Spark to address some of these hang-ups. This allows the company to take advantage of Spark’s best features, including fast data processing and advanced machine learning capabilities. FIS works with banks and other financial institutions to provide the technical back end for their applications. As part of the service, FIS offers reports and intelligence on how users interact with the app. Colcord and his team are using algorithms from Spark’s MLlib to develop these insights.

‘Techy’ side of Spark problematic

Still, new Spark users are likely to run into some issues related to the techy nature of the platform, which can limit its enterprise readiness for less experienced users. One common issue that several users at the Spark Summit said they encounter is running out of memory. One of the reasons Spark offers fast compute times is that it caches data in memory. Users need to account for this when they write queries and data transforms, or their jobs could fail due to a lack of memory.

“If you’re not familiar, pay attention when you go with Spark,” said Ruslan Vaulin, senior data scientist at cybersecurity company Sqrrl Data Inc., based in Cambridge, Mass., in a presentation at the summit. “These types of errors are happening all the time, and you really have to understand Spark’s architecture.”

This is particularly true when putting something into production on Spark. Vaulin said jobs may run fine when they’re just using small test data sets, but problems with performance or memory usage often become apparent only when they’re operating at full scale.
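To make the memory point concrete, here is a rough back-of-the-envelope sketch in plain Python, not something shown at the summit. It assumes the defaults documented for Spark's unified memory manager in recent releases: about 300 MB of each executor's heap is reserved, `spark.memory.fraction` is 0.6 and `spark.memory.storageFraction` is 0.5. The function names are hypothetical, and the in-memory size of the data set is assumed to be known, although in practice it is often larger than the size on disk.

```python
# Rough estimate only: real Spark memory use also depends on serialization,
# partitioning, shuffles and off-heap settings.

RESERVED_MB = 300  # heap that Spark sets aside before sizing its memory pool


def unified_memory_mb(executor_heap_mb, memory_fraction=0.6):
    """Heap shared by execution and storage on one executor (assumed defaults)."""
    return max(executor_heap_mb - RESERVED_MB, 0) * memory_fraction


def fits_in_cache(dataset_mb, executor_heap_mb, num_executors,
                  memory_fraction=0.6, storage_fraction=0.5):
    """Check whether a cached data set fits in the storage share of the cluster.

    Hypothetical helper: it counts only the storage region that is protected
    from eviction, which makes it a conservative estimate.
    """
    per_executor = unified_memory_mb(executor_heap_mb, memory_fraction)
    cluster_storage_mb = per_executor * storage_fraction * num_executors
    return dataset_mb <= cluster_storage_mb


if __name__ == "__main__":
    # A small test data set versus a full-scale one on ten 4 GB executors.
    for size_mb in (500, 50_000):
        print(size_mb, "MB fits in cache:", fits_in_cache(size_mb, 4096, 10))
```

Under these assumptions, a job that caches a small test data set passes the check comfortably while the full-scale data set does not, which is the pattern Vaulin described.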

Comparisons highlight complexity

The complexity of running Spark becomes clear when it is considered next to competing technologies. Nirmal Sharma, principal software engineer at Walmart Labs, compared the knowledge users need to run enterprise Spark applications to what’s needed with Apache Hive. He said that, when properly tuned, Spark delivers much faster query performance than Hive. The difference is that Hive will complete jobs whether it’s tuned properly or not, whereas a poorly written job in Spark will fail. What you gain in performance with Spark has to be paid for with expertise.

“If you’re new, the first thing you’re going to run into is the memory issue,” Sharma said in a presentation at the summit. “You have to have some basic knowledge to tune.”

Editorial assistant Trea Lavery contributed to the reporting of this article.



SearchBusinessAnalytics: BI, CPM and analytics news, tips and resources