• Home
  • About Us
  • Contact Us
  • Privacy Policy
  • Special Offers
Business Intelligence Info
  • Business Intelligence
    • BI News and Info
    • Big Data
    • Mobile and Cloud
    • Self-Service BI
  • CRM
    • CRM News and Info
    • InfusionSoft
    • Microsoft Dynamics CRM
    • NetSuite
    • OnContact
    • Salesforce
    • Workbooks
  • Data Mining
    • Pentaho
    • Sisense
    • Tableau
    • TIBCO Spotfire
  • Data Warehousing
    • DWH News and Info
    • IBM DB2
    • Microsoft SQL Server
    • Oracle
    • Teradata
  • Predictive Analytics
    • FICO
    • KNIME
    • Mathematica
    • Matlab
    • Minitab
    • RapidMiner
    • Revolution
    • SAP
    • SAS/SPSS
  • Humor

Expert Interview (Part 1): Reynold Xin, Databricks Chief Architect and Founder, on Driving Forces Behind Major Changes to Spark 2.x

May 30, 2017   Big Data

The major version release of Spark has been getting a lot of attention in the Big Data community. At the last Strata + Hadoop World in San Jose, Syncsort’s Big Data Product Manager, Paige Roberts sat down with Reynold Xin (@rxin) of Databricks to get the details on the driving factors behind Spark 2.x and its newest features.

Reynold Xin is the Chief Architect for Spark core at Databricks and one of Spark’s founding fathers. He had just finished giving a presentation on the full history of Spark, from taking inspiration from mainframe databases to the cutting edge features of Spark 2.x.

Paige Roberts: First, let’s go ahead and have you introduce yourself.

Reynold Xin: Okay, sounds good. My name is Reynold Xin. I’m one of the co-founders of Databricks. I’m also chief architect and have been leading the development of Spark with the company for the past couple of years.

Before Databricks, I was working on Spark as part of my graduate school studies at UC Berkeley. So, I’ve been working on Spark for a while.

Roberts: How long ago was that?

Xin: I started around 2011, so more than five years. I’ve been behind most of the major efforts of the past couple of years.

blog banner accessing integrating app data Expert Interview (Part 1): Reynold Xin, Databricks Chief Architect and Founder, on Driving Forces Behind Major Changes to Spark 2.x

Roberts: We just came from your talk where you started the Spark history with IMS databases on mainframes, and you ended up at Spark 2.0. That was quite a journey.

Xin: Yes, absolutely.

I don’t want to make you re-do your whole talk, so let’s focus on the main changes between Spark 1.6 and 2.0.

It’s a big version bump. It’s the first major release of Spark other than the initial Spark 1.0. We focused primarily on three aspects. One is performance optimization.

We started rolling out the DataFrame API in Spark 1.3, I think, which lays down the foundation. Because it’s higher level API, now we have more room to do performance optimizations. If the user gives us a specific function, and we have to run it, there’s not much we can do.

With Spark 2.0 we were able to, in many cases, improve performance anywhere from 2x to around 100x. Through a couple projects, mostly project Tungsten.

Right. I did a blog post on Tungsten last year. Very cool project.

Blog 3 23 Expert Interview (Part 1): Reynold Xin, Databricks Chief Architect and Founder, on Driving Forces Behind Major Changes to Spark 2.x

So, that’s the first focus area, performance. The second one is Structured Streaming.

We have heard from a lot of our customers that new requirements are surfacing in building real-time continuous applications. These applications have to make decisions nonstop on usually a live stream of data. And often these types of applications also have to combine with batch applications.

We started thinking about how we can actually build a new streaming engine and APIs that are suitable for these kinds of applications. Our end result was actually pretty simple. We just took the DataFrame API – there’s no change to it – added a couple small extensions, and then the users could express that streaming computational logic exactly how they would express the batch part. So, that is the second part, Structured Streaming.

Nice! Batch and streaming together, without having to learn a new API.

Last but not least, a lot more work is being done in SQL.

So, Spark 2.0 has become the most SQL 2003 standard-compliant open source Big Data query engine. We added window functions, sub-queries, and 2.0 can run every single one of the 99 TCP-DS queries, which is a standard benchmark, without modifying the queries. As far as I know, none of the other engines can do that.

This makes it much easier for business analysts to run their existing work, and port their existing business applications over to Spark SQL.

So Databricks and the Spark community have put a lot of emphasis on usability and performance. That makes a lot of sense.

In part 2 of this interview, Reynold Xin will give us some good information on the differences between stream and Structured Streaming, how to integrate Structured Streaming with Apache Kafka, and some hints about the future of Spark.

Download Syncsort’s latest white paper, “Accessing and Integrating Mainframe Application Data with Hadoop and Spark,” to learn about the architecture and technical capabilities that make Syncsort DMX-h the best solution for accessing the most complex application data from mainframes and integrating that data using Hadoop.

 

Let’s block ads! (Why?)

Syncsort blog

Architect, Behind, Changes, Chief, Databricks, Driving, Expert, Forces, founder, Interview, Major, Part, Reynold, Spark
  • Recent Posts

    • Search SQL Server error log files
    • We were upgraded to the Unified Interface for Dynamics 365. Now What?
    • Recreating Art – the unexpected way
    • Upcoming Webinar: Using Dynamics 365 to Empower your Marketing and Sales Teams with Digital Automation
    • Center for Applied Data Ethics suggests treating AI like a bureaucracy
  • Categories

  • Archives

    • January 2021
    • December 2020
    • November 2020
    • October 2020
    • September 2020
    • August 2020
    • July 2020
    • June 2020
    • May 2020
    • April 2020
    • March 2020
    • February 2020
    • January 2020
    • December 2019
    • November 2019
    • October 2019
    • September 2019
    • August 2019
    • July 2019
    • June 2019
    • May 2019
    • April 2019
    • March 2019
    • February 2019
    • January 2019
    • December 2018
    • November 2018
    • October 2018
    • September 2018
    • August 2018
    • July 2018
    • June 2018
    • May 2018
    • April 2018
    • March 2018
    • February 2018
    • January 2018
    • December 2017
    • November 2017
    • October 2017
    • September 2017
    • August 2017
    • July 2017
    • June 2017
    • May 2017
    • April 2017
    • March 2017
    • February 2017
    • January 2017
    • December 2016
    • November 2016
    • October 2016
    • September 2016
    • August 2016
    • July 2016
    • June 2016
    • May 2016
    • April 2016
    • March 2016
    • February 2016
    • January 2016
    • December 2015
    • November 2015
    • October 2015
    • September 2015
    • August 2015
    • July 2015
    • June 2015
    • May 2015
    • April 2015
    • March 2015
    • February 2015
    • January 2015
    • December 2014
    • November 2014
© 2021 Business Intelligence Info
Power BI Training | G Com Solutions Limited