Spark with Tungsten Burns Brighter

March 26, 2016 | Big Data

This blog was originally posted on the “Big Data Page by Paige” Blog.

Tungsten is a new thing in the Spark world. As we all know, Spark is taking over the big data landscape. As always happens in the big data space, what Spark could do a year ago is radically different from what Spark can do today. It busted the big data sort benchmark last year, and is just getting better as it goes. A project called Tungsten represents a huge leap forward for Spark, particularly in the area of performance. That much was clear, but if you’re like me, you can’t help but wonder what Tungsten is, how it works, and why it improves Spark performance so much.

Spark has gotten better and better over time at optimizing workloads and avoiding I/O bottlenecks. It even moved to a BitTorrent-style protocol to speed up network transmission rates. Now the problem areas are all CPU-intensive operations: shuffle, serialization, and hashing. I/O bandwidth isn't generally where Spark jobs slow down anymore; CPU efficiency and memory restrictions are the choke points.
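As a trivial illustration of the point above, here is a plain-Python sketch (not Spark code) of why serialization is CPU work rather than I/O work: every record has to be encoded to bytes and decoded again before any network transfer even begins.

```python
import pickle

# Serializing data for a shuffle is pure CPU work: every record must be
# encoded to bytes and decoded again, before any network I/O happens.
records = [{"key": i % 10, "value": i * 1.5} for i in range(1_000)]
blob = pickle.dumps(records)   # CPU: encode objects to bytes
restored = pickle.loads(blob)  # CPU: decode bytes back to objects

assert restored == records     # lossless round trip, but not free
```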

Tungsten is a shiny metal that gets used a lot in light bulbs. I’m assuming that’s why the Spark folks had the bright idea to name this new project Tungsten. The project improves the efficiency of memory and CPU usage for Spark applications. I had some crazy hope that this new project might start taking advantage of chip cache, and after some research, I was delighted to find out I was right. (Buffing fingernails on shirt and smugly saying, “I told you so.”)

As I said in my post from a year ago “In-Memory Analytics Databases are So Last Century,” in-chip data processing is the wave of the future. Tungsten is surfing that wave like a champ.

In the gaming software industry, using GPU chip cache for data processing is a necessary fact of life: gaming systems and video cards lean on it to do intense video data crunching. Ordinary systems have CPUs with high-speed cache memory available too; it's just that almost nobody has been building software to take advantage of it properly. Outside of the gaming industry, Actian and Sisense were the only companies or projects I knew of that were exploiting in-chip data processing before now, because of the tricky requirement of vectorizing the data so it fits the cache. The Tungsten project is tackling that challenge of storing and processing data as vectors admirably. The Tungsten version of sort that uses chip cache memory is three times as fast as the RAM-based in-memory sort. They haven't gotten the new functionality into all the Spark base algorithms yet, but it's coming.
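To make the layout idea concrete, here's a small illustrative sketch (plain Python, not Spark or Tungsten internals) contrasting a heap of individual objects with one contiguous buffer of raw values, which is the kind of flat, vectorized layout that streams through CPU cache lines efficiently:

```python
import array

# Row-of-objects layout: each value is a separate boxed object scattered
# around the heap (analogous to plain JVM objects on the Spark heap).
rows = [{"amount": float(i)} for i in range(1_000)]
total_rows = sum(r["amount"] for r in rows)

# Flat/vectorized layout: one contiguous buffer of raw doubles, the kind
# of layout that fits neatly into CPU cache lines (analogous in spirit
# to Tungsten's packed binary format).
col = array.array("d", (float(i) for i in range(1_000)))
total_col = sum(col)

assert total_rows == total_col  # same answer, very different memory layout
```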

As an aside, thanks to a helpful comment on my old In-Memory vs In-Chip post, I even discovered there's a Spark project called UCores to take advantage of chip cache in GPUs and other less common types of chips. Go Spark. Someone is finally taking advantage of modern hardware strengths that should have been exploited ages ago, but only gaming companies seemed to know existed. (Stepping off of soapbox.)

Back to Tungsten. This project also dynamically optimizes Spark operations to use memory far more efficiently than the JVM does on its own. Tungsten ditches a lot of JVM-required overhead and takes over some of the memory management itself. I would worry about memory leaks, but with the ridiculous number of people working on Spark, around 1,000 contributors now, those will probably get plugged as fast as people can find them.
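A rough Python analogy for this memory-management idea (an assumed illustration, not the actual Tungsten binary format) is packing fixed-width records into one flat byte buffer instead of allocating one garbage-collected object per record:

```python
import struct

# Pack fixed-width records (id: int64, value: float64) into one flat
# bytearray, mimicking the idea of laying rows out in raw memory pages
# instead of as individual garbage-collected objects.
RECORD = struct.Struct("<qd")  # 16 bytes per record, no per-object overhead
buf = bytearray(RECORD.size * 3)

for i, (rid, val) in enumerate([(1, 2.5), (2, 7.5), (3, 10.0)]):
    RECORD.pack_into(buf, i * RECORD.size, rid, val)

# Reading a record back is offset arithmetic, not object traversal.
rid, val = RECORD.unpack_from(buf, 1 * RECORD.size)
assert (rid, val) == (2, 7.5)
```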


The other thing Tungsten does is create code dynamically at runtime with an optimizer. Most code in the world gets written beforehand and is only interpreted at runtime. That works, but it can leave the job done less efficiently than if the procedure were turned around a little. Dynamic runtime code generation can sound like a mind-blowing idea, but it really works.

Essentially, the idea is for a developer to define what he or she wants done, then let an optimizing engine generate the ideal code to do that job based on runtime conditions like available memory, CPU cycles, data configuration, etc. That provides a big performance boost because the code is perfectly fitted to the need and to available resources that a developer couldn't know about ahead of time. A good optimizer gives you better, more performant jobs the same way a good query optimizer gives you faster queries in a database. Optimized code generation at runtime is also one of those things that I've been preaching from the rooftops for a few years now.
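Here's a toy sketch of that code-generation idea in plain Python (the function and predicate names are my own, purely for illustration): given a description of a filter, generate and compile a specialized function at runtime instead of interpreting the predicate row by row.

```python
# Generate a specialized filter function at runtime from a predicate
# description -- a toy version of runtime code generation, not Spark's
# actual codegen machinery.
def make_filter(column, threshold):
    src = f"def _flt(row):\n    return row[{column!r}] > {threshold}"
    namespace = {}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["_flt"]

flt = make_filter("amount", 100.0)
data = [{"amount": 50.0}, {"amount": 150.0}, {"amount": 250.0}]
assert [r["amount"] for r in data if flt(r)] == [150.0, 250.0]
```

The generated function is ordinary compiled bytecode specialized to one column and one threshold, so the per-row work is a single comparison with no generic dispatch.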

So, to sum it all up, Tungsten is a project to manage memory and processing for Spark to make it perform even better than it already does. And it uses strategies like chip-cache utilization and optimized code creation at runtime that I’ve been telling you were awesome for ages.

So, that brings me to two clear predictions.  First, my head is going to swell a bit from feeling all smug and brilliant, and second, Spark is going to continue to get more and more capable and performant and dominate the big data market for the next few years.

Oh, and Tungsten is shiny.


Syncsort blog
