How Do You Move Data Preparation Work from MapReduce to Spark without Re-Coding?

May 25, 2016   Big Data

So, is this a situation you recognize? Your team creates ETL and data preparation jobs for the Hadoop cluster, puts a ton of work into them, tunes them, tests them, and gets them into production. But Hadoop tech changes faster than Texas weather. Now, your boss is griping that the jobs are taking too long, but they don’t want to spring for any more nodes. Oh, and “Shouldn’t we be using this new Spark thing? It’s what all the cool kids are doing and it’s sooo much faster. We need to keep up with the competition, do this in real-time.”

You probably want to pound your head on your desk because not only do you have to hire someone with the skills to build jobs on yet another framework and rebuild all of your team’s previous work, but you just know that in a year or two, about the time everything is working again, some hot new Hadoop ecosystem framework will be the next cool thing, and you’ll have to do it all over again.

Doing the same work over and over again is so very not cool. There’s got to be a better way. Well, there is, and my company invented it. And now I’m allowed to talk about it.

I promised a while back to talk about some of the cool technical things that excited me about Syncsort, but I had to hold off for a while until some of them were public. Well, as of today, the cat is officially out of the bag. The announcement of the new capabilities added to version 9 of Syncsort DMX and DMX-h went out, and I already did an official Syncsort blog post with a Wizard of Oz theme, Syncsort V 9 Big Data Integration – Streaming and Kafka and Spark, Oh My! (That was a fun post to write.) So, duty done. Now, I can geek out a bit about my favorite bit of super-cool tech that my new company invented.

Intelligent eXecution (IX) is what Syncsort calls this super-cool thing, but it’s really two different things under the covers.

First, Intelligent eXecution does for Syncsort what Tungsten does for Spark in some ways, or what a good query optimizer does for a database. You design jobs in the Syncsort graphical user interface. You don’t specify HOW you want the jobs to run, just WHAT you want them to do. It’s a lot like writing a SQL query without specifying how that query will execute, or defining a logical DAG for Spark without specifying the physical execution of that DAG.
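The SQL analogy is easy to see with any database: you declare what you want, and the engine’s optimizer picks the physical plan. A minimal illustration using Python’s built-in sqlite3 (this is just the general declarative-versus-physical idea, not Syncsort’s tooling):

```python
import sqlite3

# In-memory database: we declare WHAT we want; SQLite decides HOW.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_region ON sales (region)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 75.0)])

# The declarative query: no access path, join order, or sort algorithm specified.
query = "SELECT region, SUM(amount) FROM sales GROUP BY region"
print(dict(conn.execute(query).fetchall()))  # {'east': 175.0, 'west': 250.0}

# Ask the optimizer which physical plan it chose for us.
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)
```

Whether SQLite scans the table or walks the index is its decision, made from what it knows about the data, which is exactly the separation IX gives Syncsort jobs.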

Syncsort specializes in sorting. It’s what they do better than anyone else, on any platform. Half the world’s mainframes use Syncsort software for sorting. When Syncsort moved into the ETL business, they realized that the slow choke points of most ETL processes were in the sort-related data prep functionality: joins and aggregations were where everything bogged down. They’ve done some good business just replacing draggy-slow ETL processes and speeding up that particular sort-related job or task. Since Syncsort has hundreds of sort algorithms, the smart way to accelerate sort wasn’t making some poor schmuck guess at design time which algorithm would be best in every situation; it was building an engine that derives the ideal sort algorithm at runtime based on the task at hand, the data configuration, the available resources, etc. You could call that a sort optimizer. That’s kind of the heart of what almost all Syncsort products do, but that’s just a tiny piece of what IX does.
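As a rough sketch of that runtime-selection idea (all names here are hypothetical; Syncsort’s actual engine weighs far more factors and has far more algorithms to choose from), a sort optimizer boils down to deferring the choice of algorithm until the data and resources are known:

```python
import heapq

def choose_sort_strategy(record_count, avg_record_bytes, memory_budget_bytes):
    """Pick a sort strategy at runtime from the data and resources at hand.
    A two-option stand-in for an engine with hundreds of algorithms."""
    data_bytes = record_count * avg_record_bytes
    if data_bytes <= memory_budget_bytes:
        return "in-memory"      # everything fits: one fast in-memory sort
    return "external-merge"     # spill sorted runs, then k-way merge them

def external_merge_sort(records, run_size):
    """Sort fixed-size runs, then merge them -- the classic out-of-core approach."""
    runs = [sorted(records[i:i + run_size])
            for i in range(0, len(records), run_size)]
    return list(heapq.merge(*runs))

# The engine, not the job designer, decides which path to take.
print(choose_sort_strategy(1_000, 100, 1 << 20))        # in-memory
print(choose_sort_strategy(10_000_000, 200, 1 << 20))   # external-merge
print(external_merge_sort([5, 3, 9, 1, 7, 2], run_size=2))  # [1, 2, 3, 5, 7, 9]
```

The job designer never names an algorithm; the decision happens when the actual sizes are known.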

As Syncsort moved into the Hadoop data preparation arena, the obvious choke point to fix was distributed sort and shuffle. It’s the slowest part of nearly every Hadoop job. Mainframes have been doing distributed sorts for decades, and Syncsort has been doing it better than anyone for decades. All the Syncsort engineers had to do was figure out a way to plug the Syncsort engine into the MapReduce framework. Since early MapReduce 1.x wasn’t designed to allow sort to be plugged in, or even bypassed when not needed, they dove in and contributed a bunch of code to give MapReduce that capability. MapReduce 2.x now has pluggable and bypassable sort. You’re welcome.
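The hook that came out of that contributed work is visible in MapReduce 2.x configuration: the map-side sort and the reduce-side shuffle are now pluggable classes. Roughly, in mapred-site.xml (the property names are the standard ones from mapred-default.xml; the class values below are illustrative placeholders, not actual Syncsort class names):

```xml
<!-- Swap in an alternative sort/shuffle implementation.
     Property names are standard MapReduce 2.x; the values here
     are hypothetical placeholders for a vendor's plugin classes. -->
<property>
  <name>mapreduce.job.map.output.collector.class</name>
  <value>com.example.FastMapOutputCollector</value>
</property>
<property>
  <name>mapreduce.job.reduce.shuffle.consumer.plugin.class</name>
  <value>com.example.FastShufflePlugin</value>
</property>
```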

Now, if you design in Syncsort, you can execute locally with the Syncsort engine, or you can execute on a Hadoop cluster in the MapReduce framework with Syncsort speeding up all the sorts, shuffles, joins and aggregations. Intelligent eXecution handles things like load balancing, minimizing I/O impact, and taking best advantage of available CPU cycles, so that jobs run as efficiently as possible without you having to do a bunch of performance tuning.

That’s one very cool thing that Intelligent eXecution is: an automatic distributed ETL job execution optimizer.

Intelligent eXecution is also a layer that abstracts the design from the execution.

What that means is, you can design a job in DMX-h, and execute it with Syncsort’s own engine right on your laptop, or on an edge node, or a server. No Hadoop of any flavor involved.

Then, you can change a setting on the job you built, point at a Hadoop cluster with Syncsort installed on it, choose MapReduce as the execution framework, and execute there. No mappers or reducers defined at any time. No need to tune and tweak, adjusting big sides and little sides, etc. Then, when your boss gives you crap about switching to Spark, you change that setting, point at a Spark cluster, or a Hadoop cluster with Spark added, and execute again.
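A toy sketch of that design/execution split (entirely hypothetical names, not Syncsort’s API): the job is just a data structure describing transformations, and the execution framework is a setting resolved at run time:

```python
# Hypothetical sketch: the job definition is pure data (the WHAT),
# and the execution framework is a swappable setting (the HOW).
from dataclasses import dataclass, field

@dataclass
class Job:
    steps: list = field(default_factory=list)
    framework: str = "local"

    def filter(self, pred):
        self.steps.append(("filter", pred)); return self

    def map(self, fn):
        self.steps.append(("map", fn)); return self

def _run(steps, data):
    # A naive in-process interpreter of the step list.
    for kind, fn in steps:
        data = [x for x in data if fn(x)] if kind == "filter" else [fn(x) for x in data]
    return data

ENGINES = {
    # Each engine interprets the same step list; a real product would
    # instead translate the steps into MapReduce jobs or a Spark DAG.
    "local":     _run,
    "mapreduce": _run,  # stand-in
    "spark":     _run,  # stand-in
}

def execute(job, data):
    return ENGINES[job.framework](job.steps, data)

job = Job().filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
print(execute(job, [1, 2, 3, 4]))   # runs "locally" -> [20, 40]
job.framework = "spark"             # the only change needed to migrate
print(execute(job, [1, 2, 3, 4]))   # same result, different engine
```

Because the job never mentions mappers, reducers, or RDDs, switching engines touches one setting, not the job logic.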

Boom. You just migrated your code from MapReduce to Spark without re-building anything.

SiliconANGLE caught on to this capability really well, and explains it probably better than I can.

The really exciting thing about this, though, isn’t just that in Syncsort DMX-h version 9.0, Intelligent eXecution now supports Spark. A couple years down the road when Flink or Heron or Storm or whatever is the next cool, fast, best framework, Syncsort plans to add that to IX. You’ll be able to grab the latest version of DMX-h, change a few settings on your jobs, and migrate again, no problem. Zero re-development work or cost, even if you end up deploying on a framework that wasn’t invented when you designed your job. That’s the idea our marketing folks call “future-proofing.” I’d call it shelter from the storm.

And no matter what framework you execute in, your jobs will all run with better performance than they would in that framework by itself, because the sorts, aggregations and joins will still be optimized by Syncsort, the expert in sorting.

Conclusion: Intelligent eXecution is a cool tech that makes Syncsort’s future look warm and bright.



Syncsort blog
