Tag Archives: Expert

Expert Interview (Pt 3): EMA’s John Myers on How Fast, Easy Data Integration Can Break the 80/20 Rule

At this year’s Strata Data Conference in New York City, Syncsort’s Paige Roberts sat down with John Myers (@johnlmyers44) of Enterprise Management Associates to discuss what he sees in the evolving Big Data landscape. In this final blog in the three-part interview, we’ll discuss the 80/20 rule of data science, which points out that most data scientists spend 80% of their time getting data ready for analysis, rather than doing what they do best.

In case you missed the earlier parts of our interview… In the first part of the discussion, Myers pointed out a shift away from technology and toward business value and some advantages of in-memory processing for machine learning. In part two, we talked about how to deal with cultural pushback against machine learning applications and how to get machines and people working together to take advantage of the strengths of each.

Roberts: One of the things with machine learning that you’ve heard a hundred times is the 80/20 rule, where 80% of a data scientist’s job is data prep: getting the data where they need it, in the format they need it. That’s where we help. So, I’m going to ask a totally self-centered question.

Myers: Okay.

According to the 80/20 rule of data science, four days of each business week are spent gathering data, while only one is spent running algorithmic models. But what if data scientists already had the data they needed?

So you have an idea of what we’re up to at Syncsort. What’s exciting to you? What do you think is cool in this area?

Well, you’re right. Most of what a data scientist has to do is get the right data together so they can apply it to their model, or manipulate the data that they have. Now, if I don’t know where it is, I must go traipsing around looking for it. Being able to discover it, to have it at my fingertips, to move it around and things of that nature …

Find it, join it.

Exactly. Back to the concept of what data scientists like to do: do they like to manipulate data? No.

Not particularly.

They like to run models, and they like to compare them, and they like to do that.

Build algorithms and play with them.

Exactly! That’s the real value they provide. If we could have systems take on some of that burden and say, “Hey, maybe we’ve got a dataset.” If you can discover what’s in the systems, then you can bring it to someone and say, “That’s a great one…”

And push a couple of buttons and boom, it’s where you need it.

Exactly. Now, like you said, 80% – that’s four days out of the week. Right?

Yeah.

That leaves me one day to manipulate.

To do what I actually like doing.

Right, and flip that over. I now have one day to pull data, and I have four days to play with it.

How good are our machine learning algorithms going to be now?

Right. If I have one day out of my week, I’m going to get one answer, per se. Right?

That’s true.

If I have four days, I may have four answers, or I may have eight answers. And now I’m looking at the best answer, not just an answer. If we can flip that over – I don’t think we can ever get rid of it, because…

You’ve got to have the data. You can’t do anything with data until you first have it. And have it in the form that you need it.

Exactly. I call it spindle, fold and mutilate, and it’s not necessarily that way. But to go through that process, it’s gonna take…

You must get the data. You have to push it together. You have to change its form. You have to take out the stuff you don’t want.

Exactly! And then when you have that, and if you can flip that over, go from 80-20 to 20-80, then you have more time in your day.

And all your smartest people aren’t spending all their time playing around getting data in the right form.

Right. And I’m sure they’re like me. The more time I spend with my fingers in the data, the more insights I find. If I’ve only got two hours out of eight, or I’ve only got one day out of four, I will find a limited number of insights. But the more time I have, …

The more you’ll find.

… the more insights I will pull together, the more things I’ll do. And I think that’s like our data scientists. They’re special, expensive people and we want to help them be the best possible people that they can be.
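
The prep steps Roberts lists above – get the data, push it together, change its form, take out the stuff you don’t want – map directly onto everyday data wrangling code. As a minimal sketch in pandas (the file and column names here are hypothetical stand-ins for whatever sources a team actually has):

    import pandas as pd

    # 1. Get the data: two hypothetical source files.
    customers = pd.read_csv("customers.csv")        # id, region, segment
    transactions = pd.read_csv("transactions.csv")  # customer_id, amount, ts

    # 2. Push it together: join on the shared key.
    df = transactions.merge(customers, left_on="customer_id", right_on="id")

    # 3. Change its form: parse timestamps and derive a monthly bucket.
    df["ts"] = pd.to_datetime(df["ts"])
    df["month"] = df["ts"].dt.to_period("M")

    # 4. Take out the stuff you don't want: drop bad rows and unused columns.
    df = df.dropna(subset=["amount"])
    model_input = df[["customer_id", "region", "segment", "month", "amount"]]

Every hour shaved off steps one through four is an hour handed back to modeling.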

Well, being an ETL person rather than a data scientist, that’s where I live. But my job is essentially to create something for them that makes it easier. To make sure that when you go to do your machine learning algorithm, you’re using all the data because there isn’t some feed coming in from Kafka that you can’t get to.

Right.

Or there isn’t some data sitting on the mainframe that’s like, “Hey, look, I have 20 years of customer data sitting there, but I can’t get to it because it’s on the mainframe in some obscure format that nobody’s heard of in the last 20 years.”

Well, back to something that we talked about a little earlier, before the start of the recording: SELECT *, that’s sometimes not the greatest answer.

Yeah. [Laughter]

I only want these four columns.

Right. But I don’t know which four columns until after I get all that data and look at it.

Exactly! The data scientist who has access to, like you said, ALL the data is going to pick the right four or five columns. If they’ve only got four or five columns…

Then they’re just taking what they’ve got and making do with it.

Exactly! So, I think we’ve got some great opportunities. I think it’s continuing to grow, and I’m really looking forward to what we can learn at the show.
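
To make the select-star exchange concrete, here is the same two-phase pattern as a hypothetical pandas sketch: explore a sample of everything first, then, once you know which columns earned their keep, read only those.

    import pandas as pd

    # Exploration: pull a sample of ALL the columns to find the useful ones
    # (the "I don't know which four columns until I look" phase).
    sample = pd.read_csv("wide_table.csv", nrows=10_000)
    print(sample.columns)

    # Production: read only the columns that matter, skipping the rest.
    df = pd.read_csv("wide_table.csv",
                     usecols=["cust_id", "region", "month", "amount"])

The same point holds in SQL: SELECT * is fine for discovery, but a repeatable pipeline should name its columns.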

In our eBook, Mainframe Challenge: Unlocking the Value of Legacy Data, we review ways to help you tackle the obstacles of data integration to unlock the value of your mainframe data.

Expert Interview (Part 2): Alation’s CEO Sangani Discusses Big Data Management Trends and Best Practices

In Part 1 of this two-part interview, Satyen Sangani (@satyx), CEO and co-founder of Alation, spoke about data cataloging. In today’s Part 2, he provides his thoughts on trends and best practices in Big Data management.

What are some of the more outdated or inefficient processes involved with accessing relevant data today? What is slowing businesses down?

It can take a long time to extract data from the data lake and get the right data to the right person exactly when they need it.

Businesses are moving to self-service analytics solutions where it isn’t necessary to have the involvement of the IT department to access and work with data. However, self-service tools often fail at helping users understand how to appropriately use the data. Specifically, they don’t always know which data sets to use, which definitions to use or which metrics are correct.

What should companies be doing today to prepare for how they’ll use data in the future? What should their long-term strategies look like?

Ultimately, you want to get data, business context and technical context in front of your employees as quickly as possible. The days where you could take months to prepare a report are over.

Given this, companies need to spend time thinking about a) how they can get data to their employees as fast as possible, and b) how to train their workforces to find, understand and use that data to get insights fast.

What’s one piece of advice you find yourself repeating to your clients over and over? Something you wish more companies were doing to get more out of their data?

Data governance has traditionally implied a top-down, command-and-control approach. Such an approach generally works when compliance is the primary goal, but when the goal is to get data consumers to use data more often, it’s important to take an iterative and agile approach to data governance.

It’s less about prescribing rules than reacting to users by gently correcting and improving their behavior.

What trends or innovations in Big Data management are you following today? Why do they excite you?

Self-service is, of course, a big one. We also like distributed computation engines like Presto and Spark. The notion that we can disconnect compute from storage is finally becoming a reality.

AI and Machine Learning need to be embedded into every layer of the stack. There’s too much manual work in data and that manual work comes at the cost of speed.

To learn how to put your legacy data to work for you, and plan and launch successful data lake projects with tips and tricks from industry experts, download the TDWI Checklist Report: Building a Data Lake with Legacy Data.

Expert Interview (Part 1): Satyen Sangani of Alation on Data Cataloging

Satyen Sangani (@satyx) is the CEO and co-founder of Alation. In founding Alation, he aspired to help people become more data literate.

Before Alation, Sangani spent nearly a decade at Oracle, where he ran the Financial Services Warehousing and Performance Management business. Prior to Oracle, he was an associate with the private investment firm Texas Pacific Group, and an analyst with Morgan Stanley & Co. He holds a Master’s in Economics from the University of Oxford and a Bachelor’s from Columbia College.

We recently asked Sangani for his insight on data cataloging and how businesses can better manage their data. Here’s what he shared:

Tell us about the mission at Alation. How are you hoping to change the ways businesses approach managing data?

At Alation, we’re looking to fundamentally change the way data consumers, creators and stewards find, understand and trust data. Our product, a data catalog, provides a single source of reference for data in the enterprise.

What is collaborative data cataloging? How does it work?

No matter where your data resides – in a data lake, a data warehouse or a business intelligence system – Alation regularly and automatically indexes your data, and gathers knowledge about the data and how it is being used.

Like Google, Alation uses machine learning to continually improve its understanding of that data. Clients use Alation to work better together, to leverage data with greater confidence, to improve productivity and to index all their data knowledge. Anyone who works with data, from IT to a less-than-technical business user, can collaborate and comment on data using annotations and threaded discussions, much as you’d find in a consumer catalog such as Yelp.

What are the benefits to businesses of using a collaborative data catalog? How do they make data management more efficient and effective?

Alation replaces tribal knowledge with a complete repository for all data assets and data knowledge in your organization, including capabilities such as:

  • Business glossary
  • Data dictionary
  • Wiki articles

Alation profiles data and monitors usage to ensure that users have clear insight into data accuracy. This includes providing insights through:

  • Usage reports
  • Data profiles
  • Interactive lineage

Alation provides deep insight into how users are creating and sharing knowledge from raw data. This includes surfacing details such as:

  • Top users
  • Column-level popularity
  • Shared joins & filters

What are the common challenges you’re observing businesses facing with data management today? What is at the root of these challenges?

The problem isn’t really about the volume of data that businesses must deal with these days. It’s really about how to find, understand and trust that data to fundamentally understand what is going on with your business in terms that everyone in the organization can agree upon. That’s what data cataloging can provide.

Tune in tomorrow for Part 2 of this interview, where Satyen Sangani will discuss Big Data management trends and best practices.

Discover the new ways data is being moved, manipulated, and cleansed – download Syncsort’s eBook The New Rules for Your Data Landscape.

Expert Interview (Part 1): Splunk’s Andi Mann on IT Service Intelligence, ITOA and AIOps

For over 30 years across five continents, Andi Mann (@AndiMann) has built success with Fortune 500 corporations, vendors and governments, and as a leading research analyst and consultant. He currently serves as Splunk’s Chief Technology Advocate. He is an accomplished digital business executive with extensive global expertise as a strategist, technologist, innovator, and communicator. In the first of this two-part interview, he shares his thoughts on IT Service Intelligence (ITSI) and its role in IT Operational Analytics (ITOA) and Artificial Intelligence Operations (AIOps).

What is ITSI and how does it fit with ITOA and/or AIOps?

According to Gartner, IT Operational Analytics (ITOA) is a market for solutions that bring advanced analytical techniques to IT operations management use cases and data. ITOA solutions collect, store, analyze, and visualize IT operations data from other applications and IT operations management (ITOM) tools, enabling IT Ops teams to perform faster root cause analysis, triage, and problem resolution.

As the market has become more sophisticated, Gartner has redefined ITOA as “AIOps” – initially Algorithmic IT Ops, now morphing into Artificial Intelligence Ops – reflecting the increasing use of machine learning, predictive analytics, and artificial intelligence in these solutions.

Splunk IT Service Intelligence (ITSI) is a next-generation monitoring and analytics solution in the ITOA/AIOps space, built on top of Splunk Enterprise or Splunk Cloud. ITSI uses machine learning and event analytics to simplify operations, prioritize problem resolution, and align IT with the business.

Using metrics and performance indicators that are aligned with strategic goals and objectives, ITSI goes beyond reactive and ad hoc troubleshooting to proactively organize and correlate relevant metrics and events according to the business service they support. With ITSI, IT Ops can better understand and even predict KPI trends, identify and triage systemic issues, and speed up investigations and diagnosis.

This allows maturing IT organizations to quickly yet deeply understand the impact that service degradation has not only on the components in their service stack, but also on service levels and business capabilities – think more “web store” than “web server.”

We are seeing a lot of investment by organizations in leveraging the value of Big Data. What do you see as the major drivers for this?

I see three main drivers for this new focus on Big Data.

Firstly, the increasing volume of data is creating a maintenance nightmare, but also an analytics dream. This new data – from online applications, mobile devices, cloud systems, social services, partner integrations, connected devices, and more – is full of insights, but cannot be managed with traditional tools. Big Data is often the only way to understand a modern business service at scale.

Secondly, speed and agility are emerging as market differentiators. Slow, old-school techniques like data warehousing, Extract-Transform-Load (ETL) operations, batch data processing, and scheduled reporting are not fast enough. New-style Big Data tools, by contrast, ingest data in real time, use machine learning and predictive analytics to generate meaning, instantly display sophisticated and customizable visualizations, and produce actionable insights from Big Data as it is produced.

Thirdly, there is an increasing focus on data-driven decisions to drive innovation. From junior IT admins to senior business execs, innovation requires that all stakeholders make accurate decisions in real time. Big Data allows everyone to try new ideas, determine what works and what doesn’t, and then iterate quickly to course-correct from failures or double down on successes, quickly adjusting to new information and meeting the changing demands of the market.

In Part 2, Andi Mann discusses the reasons mainframe and distributed IT are sharing data, and the use cases where organizations are building more effective digital capabilities with mainframe back ends.

Download Syncsort’s latest eBook, Ironstream in the Real-World for ITOA, ITSI, SIEM, to explore real world use cases where new technologies can provide answers to the questions challenging organizations.

Expert Interview (Part 2): Andi Mann Compares ITSI and Business Service Management (BSM)

In Part 1 of this two-part expert interview with Andi Mann (@AndiMann), Splunk’s Chief Technology Advocate, he talks about IT Service Intelligence (ITSI) and how it fits with ITOA and AIOps, and the main drivers for the big investments organizations are making in Big Data. In today’s Part 2, he compares ITSI and Business Service Management, and discusses the reasons mainframe and distributed IT are sharing data, and the use cases where organizations are building more effective digital capabilities with mainframe back ends.

ITSI sounds a lot like what we have called Business Service Management (BSM) for the last decade. What is different about it?

Today, it is clear that the promise of Business Service Management was never fulfilled. It was too ambitious for its time, too complex in its make-up, and suffered from deficient underlying technologies – not least its database-driven approach.

With Business Service Management, it was too hard to create and update service definitions, and the tools were too rigid in how service data was collected and managed. BSM relied on too few data sources for actionable business insight, and was typically restricted to too few (mostly tech-centric) users to have broad business impact.

By contrast, Splunk IT Service Intelligence (ITSI) takes an analytics-driven approach, with machine learning, open integrations, and real-time processes as part of a modern, business-centric solution. Unlike legacy BSM tools, ITSI integrates data sources from across the organization (and beyond), providing highly customizable visualizations, even in rapidly changing environments. ITSI is flexible and secure enough to provide real-time insights for any user, on demand and on the fly. BSM tools pale in comparison.

One of the shortcomings of BSM, or whatever we chose to call it, has been the weak integration of IT metrics from distributed platforms with those from mainframe systems – how does ITSI help to address this?

With so much mission-critical data coming from mainframe platforms, it is amazing how many tools and solutions ignore it, or at best publish some kind of loosely coupled connector and call it a day. That is not nearly enough for such a valuable source of intelligence. Instead, you need to treat the mainframe as a first-class citizen in the service environment, alongside cloud, *nix, mobile, and other systems.

ITSI integrates tightly with solutions from trusted mainframe partners like Syncsort to ingest data from mainframe platforms, and combine them with distributed systems, to provide cohesive insights into the activity, status, and performance of cross-enterprise services.

This means more than just scraping application outputs; it also means safely and securely integrating data from syslog, RMF, SMF, IMS/CICS, MICS, and other sources on z/OS and zLinux partitions. Tightly integrating mainframe data in this way is the only way to provide total visibility of enterprise-wide services.

We are seeing the walls between mainframe and distributed IT coming down as organizations are more open to sharing information – what are some of the primary use cases driving this?

People are starting to understand that, despite some challenges, mainframe applications and data are too critical to leave in their own silo.

As organizations work through new “digital” projects, they eventually realize that the mainframe is the locus of so much mission-critical information. Teams focused on mobile engagement, customer experience, sentiment analysis, web interfaces, application modernization, digital transformation, and even innovation are realizing that you cannot just “ringfence” the mainframe.

You need mainframe data for a truly cohesive application, so they are building more effective experiences by integrating new “digital” capabilities with mainframe back ends.

Download Syncsort’s eBook, IT Service Intelligence: What Professionals Need to Know, to see how an IT Service Intelligence approach extends ITSSM to provide end-to-end visibility and insight into the operational health of critical IT and business services which span distributed systems, mainframe, and even mobile devices.

Expert Interview (Part 2): Decideo’s Nieuwbourg Talks Trends in Big Data, IoT & Data Visualization

In Part 1 of our two-part conversation, Philippe Nieuwbourg (@nieuwbourg) talked about the founding of Decideo, interesting developments in data, what tools and skills you need, and mistakes to avoid to get the most out of data. In today’s Part 2, he addresses what businesses should be doing with data today and down the road, and trends in Big Data, IoT and data visualization.

In general, what should businesses be doing from a technology standpoint today to prepare for the ways they’ll use data down the road?

First: collect and store data. Without fresh and accurate data in your data lake, you can’t analyze anything. It’s the first step of a data culture. Of course, privacy and regulation are key to collecting the right data you will be able to analyze.

From a technology standpoint, it means creating what we call a “data lake” – a place where data will be stored, ready to be used by the analytics applications we already know, or for future needs. And of course, I repeat it: verify and protect data from their sources to the data lake, to be sure we will be able to use it to generate value.

What are the most promising new IoT applications you’re observing in business today?

IoT is fantastic. Humans can generate data, but they have limits that sensors don’t have. Objects and sensors can generate thousands of data points every minute. It’s an inexhaustible data source.

Predictive maintenance, consumer behavior analytics, video recognition for security applications … there’s no industry that will avoid the IoT wave. If you don’t know yet how IoT will transform your industry, focus on it! It will; you are just late!

Related: The IoT-Big Data Convergence

Why is effective data visualization so critical to an organization’s ability to understand their data? What should organizations look for in high-quality data visualization tools?

Imagine you have analyzed terabytes of data. You found something – a behavior, a trend, a pattern… how will you bring this to your management? Will you, like a research scientist, produce a black-and-white text presentation with complex equations? Certainly not the best way to convince your general manager. Just one image is better than a report or a long text explanation. But how do you choose the right graphical visualization? How do you create the wow effect in your boss’s eyes? That’s where data visualization software comes in.

Do you know that Excel can generate only two types of graphics? There are plenty of highly understandable graphics that your management will love, but that you can’t create in Excel. Just remember how Charles Joseph Minard created the famous graphic of Napoleon’s march to Moscow in 1869. The data visualization you choose doesn’t depend on your data; it depends on your message. And you need advanced software to do it.

Related: Visualization for Big Data

How has the way companies are able to visualize their data evolved in recent years? What developments interest you in the world of data visualization right now?

We’ve moved from data visualization to data storytelling.

Like in Hans Rosling’s famous TED presentations, you must tell a story. People never remember the statistics you put in your PowerPoint; they will remember the story you told them. And to transform your boring slides into a stunning story, you will need to apply storytelling techniques to data analysis. Animated data, storytelling techniques – you have the keys, the same ones used in Hollywood movies, or to write the scenario for the next season of House of Cards. By the way, did you know that Netflix is the first data-driven movie and series producer?

What trends or innovations in the field of Big Data, data visualization and IoT are you following right now? Why do they excite you?

I continue to follow the data storytelling field. Today’s software is a first generation; it’s missing a lot of functionality. There’s currently no real data storytelling software on the market. I hope to discover one very soon.

And I’m really interested in deep learning, especially applied to IoT video feeds and photo analysis. If we can automatically “understand” a situation, we can automatically suggest an action. It can be a recommendation or a human intervention, but if we help people understand data more quickly, we will take a step toward the “augmented intelligence” I was talking about before.

What’s one piece of advice you find yourself repeating to organizations over and over related to Big Data? One takeaway you think every organization should hear?

It’s not a technology project! Big Data is not a technology issue, it’s all about business. Stop buying technology before having understood and measured your data value. The result of Big Data analysis can be very small. Imagine a model that every morning gives you the list of the 10 prospects you must call today because they are ready to sign. Perhaps it’s Big Data analysis, but the result is a list of 10 names … small data. Don’t focus on technology, focus on business.

And don’t tell me Big Data is expensive. It’s never too costly, because you will always prototype, often with open source software or with tools you already have, and anticipate your ROI before investing in high-end infrastructure. Big Data will never be a cost; it will always be a source of revenue, if you agree to move step by step and focus on business challenges. And I wish the best of luck to all our readers. The data-driven economy is a fantastic opportunity for the current and next generations to generate value, and do things better.

To learn how to put your legacy data to work for you, and plan and launch successful data lake projects with tips and tricks from industry experts, download Syncsort’s Building a Data Lake Checklist Report!

Expert Interview (Part 1): Philippe Nieuwbourg of Decideo on the Evolution of Data Management

Philippe Nieuwbourg (@nieuwbourg) is an independent analyst, author and lecturer focusing his work on information technology to improve a data-driven economy. In today’s part 1 of this two-part interview, Nieuwbourg discusses the founding of Decideo and the evolution of data management.

Can you tell us about your professional background? How did you become interested in Data Science?

I have been working in “data” since I created Decideo, more than 25 years ago. After studying accounting, I worked in a couple of companies, first installing accounting software in medium and large-sized organizations, then at a professional IT magazine as editor-in-chief.

After those experiences, I founded Decideo, as the first French-speaking professional community about “data.” It has been called reporting, business intelligence, data warehouse, business analytics, big data, artificial intelligence … words change like marketing waves, but it’s always about data!

What have been the most interesting developments in the field since you started your career? What has made the biggest impact on the way we use data today?

For 60 to 70 years, information technology was only about numbers. Created during the Second World War, the first computers were for decades only able to manipulate numbers and character strings encoded as ASCII. The shift came with mobile phones, social networks and microprocessor power.

Since then, we can collect, store, manipulate and analyze what is called “unstructured data” coming from photos, audio and video files. That’s a huge step. It means that what we say, hear and view can be “understood” by computers and analyzed, a little like our brains do it.

With the progress of artificial intelligence and deep learning, it’s clear that we will soon have an augmented intelligence available at our fingertips. Unstructured data analysis will have a huge impact on businesses in the coming years.

What do businesses need to know about selecting the right tools to manage their data? How should they approach finding the tools that will work best for their needs?

Don’t focus on the data visualization part! It’s sexy, amazing and entertaining software to choose, but it’s the cherry on the cake. Keep it for the end of your project. Focus first on the data integration tools. They’re much more important. You could have the best-in-class data visualization tool, but if your data isn’t accurate or comprehensive, you won’t find any value in it.

During the typical day of a data scientist, only a small part of their time goes to the noble tasks: machine learning, data storytelling, graphical analysis. More than 50 percent of their time is spent manipulating datasets and fixing data quality problems. List your data sources (for today and tomorrow), and think about data quality, metadata management, GDPR, regulation, privacy and security. Once you have fixed all this, enjoy a little time choosing the best graphical tool. It’s the easy part of the job.

What are the most common mistakes you observe businesses making when searching for and using different data management tools? What should they be doing differently?

The most common mistake is to focus on technology. Believe me, you can do a lot of things without having a Hadoop cluster in your data center! I’ve met a lot of companies, like a bank I remember in Canada, which bought a Hadoop distribution without knowing why – only because its main competitor did it first. And three years later… nothing… Why? Because it was a technology purchase made without any connection to business needs.

What type of skills and training would you like to see more businesses focus on when it comes to data management? What training should they invest in to prepare for the future?

I think that “data analysis” skills are not an option anymore. Especially for business people. Do you really want to rely on an IT department that just focuses on technology and doesn’t really understand your business needs? I don’t. All business people should be trained to acquire basic skills on data management and data analysis. If data is the new oil of the economy, all business people should be able to generate value from it.

I’m not trying to say that IT people are not doing their job. They have the most important one: focusing on infrastructure, compliance, cybersecurity … but they have to let business people take care of data analysis.

In universities and business schools, all students should be prepared to manipulate datasets.

Tomorrow, in Part 2, Philippe talks about what organizations should be doing with data today and down the road, and trends in Big Data, IoT and data visualization.

Download Syncsort’s latest eBook to discover the new rules that are redesigning the relationship between business and IT.

Expert Interview (Part 4): Databricks’ Jules Damji on the Advantages of Moving Big Data Processing to the Cloud

We’ve been enjoying highlights of the long conversation between Syncsort’s Paige Roberts and Jules Damji (@2twitme), the Spark Community Evangelist for Databricks. In today’s final installment of the four-part series, we talk more specifically about Cloud platforms, the advantages of doing Big Data processing and data analysis in the Cloud, and some specifics about the Databricks cloud.

Paige Roberts: You talked about the Databricks Cloud a bit. You guys have got a nice Spark Cloud capability that you provide. Is it like a SaaS (Software as a Service), is it a PaaS (Platform as a Service), what is it exactly?

Jules Damji: Well, it depends on how you think about it and how you would use it. But primarily, SaaS. Say you are a data scientist, and your company wants to create data science models and provide them as a service. In that respect, we are both SaaS and PaaS.

First, you’ll have to do some data ingestion and exploration. For you to run with this idea in Apache Spark, you’ll need clusters. You must go to a data center, hire a hosting company, start getting machines, install software. You must do all these things, find the right libraries, etc., and create your own infrastructure.

Roberts: Hire some admins.

Damji: Yeah, you’ll need some admins. You don’t have all this money if you’re a startup company. And your core competency is data analysis and data engineering. All you care about is ingesting and exploring your data. You don’t want to worry about management. Apache Spark is great, but you don’t want to install it. You want the latest, and greatest, with all the fixes.

So, we provide Spark, along with a platform built around it with additional software artifacts, as a service, running in the cloud.

You just want to use it.

You just want to use it. What do you do? You either do your own on-premise cluster, or you come to Databricks, right?

You get an end-to-end, fully managed Cloud service with Apache Spark. You don’t need to worry about creating a cluster, you don’t need to know about managing it, you don’t need to know about tuning it. You don’t have to worry about monitoring it. You don’t have to worry about the reliability, SLAs and all that; it’s all taken care of.

You get a high level of SLAs from Amazon on EC2, and you get really beefed-up machines. AWS has created all this competency around infrastructure that gives even large corporations confidence in them. S3 storage is very reliable. They promise durability of 99.999999999 percent – eleven nines.

That’s solid. And the Cloud saves a big chunk of time and money.

And then you can build things on top of that, right? You build your data science application, for example. The utility part, the stable infrastructure, is taken care of, the software is already installed for you.

A good analogy that I normally like to use is this: when Edison created electric power, his partner said, “What if we offered this as an electricity grid?”

Cloud computing is heading that way, as a grid. I just come in, and I plug in my plug, and I’m guaranteed to get 110 volts if I am in North America. Or, I’m guaranteed to get 220 volts if I’m in Europe, right? When I plug in, I know I’m going to get that consistent service, and it will be reliable.

Take that analogy and say, “What if I’m able to do that with the Cloud?” I just go with one of the mega Cloud providers, and I’m guaranteed to get what I need. I’ll be able to scale, I’ll be able to get beefy machines, I’ll get the compute power. I’ll get the software components I need. I’ll get the storage power. I’ll get the reliability, I’ll get the bandwidth, I’ll get the throughput.

I get the security.

I get the security. All you must worry about is writing things on top of that.

Edison’s partner said, “The people who are going to make money are not the power utility companies that are gonna provide the power; that’s going to be a commodity. The people who are gonna make money are the people who are gonna build appliances on top of that.” Refrigerators, lamps, toasters, TVs – all those appliance manufacturers are making money. And they depend on the grid being already there.

Is Databricks the utility company in this analogy?

Databricks is not the utility company. We are providing a service on top of that on which you can use the Cloud grid. So, I’m providing you a refrigerator you can store things in, or…

A range, and you can create dinner on top of the range. Okay. I get it. You create data science applications on Databricks, which is the appliance, and the Cloud company is the grid.

Right. The providers are going to be Amazon with AWS, Google, IBM, Microsoft with Azure, and now, lately, Oracle. They are the cloud utility companies.

Does Databricks support all of those?

Right now, we are only on Amazon. But the whole idea behind the data source API is the ability for you to get data from the myriad other places where you have it stored.

So, you can get data today from HBase. You can get it from MongoDB. You can get it from Redis. You can get it from all of these different storage areas. These data source APIs allow us to work with all those ecosystems.
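
As a rough illustration of that pattern (not Databricks-specific code; the endpoint and table names are made up, and each format needs its connector on the classpath), the data source API gives you one read interface across very different backends:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("datasource-demo").getOrCreate()

    # One read pattern, many backends: swap the format string and options
    # to reach JDBC databases, Cassandra, MongoDB, Redis, and so on.
    customers = (spark.read
                 .format("jdbc")
                 .option("url", "jdbc:postgresql://dbhost:5432/sales")
                 .option("dbtable", "customers")
                 .load())

    customers.show(5)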

Right. Well, thank you so much for talking to me, it’s been really cool.

Any time. I’m an evangelist. I’ve got to keep the lines of communication open.

If you missed any part of this four-part interview series with Damji, it’s not too late to catch up!

For more talk about the future of Big Data, including more on Spark and the Cloud, read our eBook, Bringing Big Data to Life: What the Experts Say.

Expert Interview (Part 3): Databricks’ Damji Discusses Security, Cloud and Notebooks

Syncsort’s Paige Roberts recently caught up with Jules Damji (@2twitme), the Spark Community Evangelist at Databricks, and they enjoyed a long conversation. In Part 3 of this four-part interview series, we’ll look more at the importance of security to Spark users, the overwhelming move of a lot of Big Data processing to the Cloud, and what the Databricks Platform brings to the table.

In case you missed it: in Part 1, we looked at the Apache Spark community, and in the second post, we covered how the Spark and Hadoop ecosystems are merging, which supports AI development.

Paige Roberts: So, we’ve talked a lot about the new single API for Spark, a single API for Datasets and DataFrames. I can build my application once; I can run it in streaming, I can run it in batch. It doesn’t even matter anymore. I can execute it on this engine now, and maybe next year, I can execute it on another engine, and I won’t have to rewrite it every time. You won’t have to rebuild if it uses the same API. That’s very similar to a Syncsort message: we’ve been calling it Intelligent Execution, or Design Once, Deploy Anywhere.

Someone asked at Reynold Xin’s talk, “What do you do when you go from RDD to DataFrames?” The answer was, “Well, you have to re-write.”

[Both laugh]

Damji: Yeah. We can’t quite do it that far back.
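
The “design once” idea is easy to see in a small sketch. This is not Syncsort’s or Databricks’ code, just a minimal PySpark illustration with hypothetical paths and schema: the same transformation runs in batch or streaming depending only on how its input DataFrame was created.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("design-once").getOrCreate()
    schema = StructType([StructField("region", StringType()),
                         StructField("amount", DoubleType())])

    # One transformation, written once against the DataFrame API.
    def summarize(df):
        return df.filter(df.amount > 0).groupBy("region").sum("amount")

    # Batch: read stored files, write a result set.
    history = spark.read.schema(schema).json("/data/history/")
    summarize(history).write.mode("overwrite").parquet("/data/summary/")

    # Streaming: same logic, only the read/write plumbing differs.
    incoming = spark.readStream.schema(schema).json("/data/incoming/")
    (summarize(incoming).writeStream
     .outputMode("complete").format("console").start())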

Roberts: Still, that’s a very exciting and appealing model for a lot of folks, designing jobs once and having them execute wherever without re-designing. One of the things I see that Spark has as a distinct advantage over everybody else is just the level of the APIs. They are so much easier to use, they are so much more robust. Even more so with version 2.x. That seems to broaden your community, and make it easier for the community to add to the Spark ecosystem.

Damji: It does make a huge difference in community support and participation.

So, one thing we haven’t touched on much is about the Databricks business model. How does it work?

That’s a good question. Hardly anyone has effectively cracked the code on how to monetize open source technology alone. Probably one of the few companies that a lot of newer companies model themselves on is Red Hat.

Red Hat had a model of saying, “We are going to take Linux, which is open source, and we are gonna add proprietary and coveted enterprise features on it to make it available and suitable for an enterprise. Then we are going to charge for a subscription and provide support and services with it since Linux is our core competency. We have the brilliant hackers who can write your kind of device drivers and that sort of thing.”

We know it better than anyone else.

Exactly. We know it better than anyone else, so one added value is that core competency. Another is enterprise-grade security, which you won’t usually get in open source out of the box or from downloading from the repo. Kafka is going the same way with Confluent, right?

So, I think that’s the trend. Whoever provides the best experience for Apache Spark on their particular platform is going to win. Databricks provides the best Apache Spark platform, along with a Unified Analytics Platform that brings people, processes and infrastructure (or platforms) together. We provide the unified workspace with notebooks, which data engineers and data scientists can collaborate on; we provide the best IO access for all your storage. We provide enterprise-grade security for both data at rest and data in motion. And we provide a fine-grained pool of serverless clusters.

As more and more data is going into the Cloud, people are more and more worried about sensitive data, and how do you protect that? So, security comes as part of this augmented offering.

They are! A lot of our customers are banks and insurance companies, and they’re really concerned with information security.

Financial institutions are a good example, and we have customers in that vertical. Financial institutions are warming up to the fact that Cloud is the future, and a good alternative. We have the same vision. So, we provide this unified analytics platform powered by Apache Spark, with other Databricks-specific pieces around it. It gives you this comprehensive platform, which separates compute from storage, because we don’t tell you what storage to use.

Related: Expert Interview: Livy and Spot are Apache Spark and Cyber Security Projects, Not the Names of Sean Anderson’s Dogs

Store it however you want.

Right. You can store it however you want. We’ll give you the ability to bring the data in quickly, process it fast and write it back quickly. All these different aspects of Databricks bring tremendous value to our customers: security, fast IO access, core competency in Apache Spark, and the integrated workspace of notebooks.

The data scientist and ETL engineers and business analysts can work collaboratively through the Databricks notebook platform. You bring the data in, you explore the data, you do your ETL, you write notebooks, you create pipelines. So, that’s the added features for our customers that come on top of open source. But underneath it is powered by Apache Spark.

Finally, you also get the ability to productionize your jobs using our job scheduler, and the ability to manage your entire infrastructure without worrying about it.

And as long as you keep making Apache Spark better and better, and the community keeps jumping in and loving it, then you guys have got a good future.

Yes! If you try our Community Edition, you’ll actually see those benefits. If you start using our Professional Edition, you begin to see more. Every time we create a new release, we release it for our customers as well as the community. They get that instantaneously.

That’s about as fast as it gets.

Don’t miss the final post of this four-part conversation with Jules Damji (Monday, August 14th), which features more about Spark and Databricks, and the advantages of Cloud data analysis.

Big Data is constantly evolving – are you playing by the new rules? Download our eBook The New Rules for Your Data Landscape today!

Expert Interview (Part 2) with Databricks’ Damji: Spark + Hadoop = Artificial Intelligence = Science Fiction Becoming Reality

At this year’s Strata + Hadoop World, Syncsort’s Paige Roberts caught up with Jules Damji (@2twitme), the Spark Community Evangelist for Databricks and had a long conversation. In this second post of our four-part interview, they discuss the trend of the Spark and Hadoop technologies and communities merging over time, and how that’s creating a science fiction novel kind of world, where artificial intelligence is becoming commonplace.

Paige Roberts: One thing I’ve noticed over the last few years is that, to a certain extent, the Spark and Hadoop communities seem to be merging. We just had a Hadoop-focused conference, and yet half the sessions were about Spark. Why do you think that is?

Jules Damji: Apache Spark is such an integral part of Big Data because it allows people to deal with and process large scale data in a very quick manner. It allows people to run different workloads on a single, unified engine. That’s one of the main attractions.

If you look at the history of Big Data, you had all these different systems and you had to stitch them together to do your end-to-end job pipeline. It was difficult. You had to learn five different systems.

Another reason people are rallying around Apache Spark is that it works very well with the Hadoop ecosystem. You can store your data in HDFS or S3 or whatever. The API works well with the storage level. It works well with the applications. Apache Spark talks to BI tools, to Sqoop, to all these third-party data ingestion tools. And it can be deployed in different environments as well. You can have it running on YARN, on its own cluster, or on Mesos.

These dimensions of Apache Spark’s flexibility make it an integral part of Hadoop or Big Data in general. Today there’s not a single conversation that’s happening in the world where Big Data and Apache Spark are not mentioned in the same sentence.

Roberts: Right. I see that, too.

Damji: We are in the Big Data era. We have seen data coming in fast and we need this real-time end-to-end solution. If I get data, I should be able to make a decision fast. And I should be able to consult either my machine learning model in split second time or I should be able to interact with my stored data. One of the things that Apache Spark provides through Structured Streaming is the ability to write a continuous application.

Today, you heard Reynold Xin speak about the ability to write fault-tolerant applications that give you the ability to interact with streaming data and query it as if you were querying your old, stationary data. It gives you the ability to do ad hoc analysis on the fly. Before, it took you a long time to do this after you finished getting the data. Now, you can do it instantly. That’s one thing.
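
A minimal Structured Streaming sketch of that “query the stream like stored data” idea (the schema and paths are hypothetical): the in-memory sink keeps the latest aggregate queryable by name, so ad hoc SQL runs against live results.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("adhoc-stream").getOrCreate()
    schema = StructType([StructField("user", StringType()),
                         StructField("ts", TimestampType())])

    # Treat a directory of arriving JSON files as an unbounded table.
    events = spark.readStream.schema(schema).json("/data/events/")

    # Continuously maintain per-user counts in an in-memory table.
    (events.groupBy("user").count()
     .writeStream.outputMode("complete")
     .format("memory").queryName("user_counts")
     .start())

    # Ad hoc analysis on the fly, as if querying stationary data.
    spark.sql("SELECT * FROM user_counts ORDER BY count DESC").show()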

The other thing I see is that Artificial Intelligence (AI) has come to the fore, and Spark is going to play a big role in democratizing Big Data and AI.

Yeah, you’re seeing artificial intelligence now. On things like self-driving cars and such.

Yes, exactly: self-driving cars, image and voice recognition, recommendation engines, and so much more. At the center of that is the ability to do advanced analytics quickly. The ability to employ popular frameworks like TensorFlow with Apache Spark, to do machine learning at scale using Apache Spark’s library, to build deep neural networks, and to do computational analysis quickly. That enters us into this new era of Artificial Intelligence. We now have some of these AI systems, which used to be science fiction. Now, they are taking realistic form.

The science fiction novels that I read as a kid are now old hat. Yeah, we did that last year.

You will see Apache Spark playing more and more of an integral role in this Big Data and Artificial Intelligence era, what I call the Zeitgeist of Big Data. At the core is the ability to process a lot of data fast, to manage large clusters seamlessly, to transform data at immense speed, and to process myriad kinds of data, such as text, video, unstructured, and structured data. It can all be done through the same processing engine, such as Spark.

Streaming, batch, …

We’re streaming, we’re doing batch. Before, all these different systems had different formats of data, and different engines.

Yeah, different engines, and different APIs…

Right. But now you have a unified API. You have workloads that run on the same engine, so that makes things a little easier. It’s the stepping stone to this powerful digital revolution. No previous industrial revolution had so many fast-moving technology trends and innovations as this digital one. In just a few years, I mean, look at what we are going through with Apache Spark.

Yeah, it’s amazing!

And 10 years from now you might have something else that’s different from Spark, but for the next five, we will see Apache Spark growing. We’ll see more and more intelligent applications built on top of the machine learning techniques that Apache Spark facilitates and catalyzes. And we’ll see huge performance improvements.

Like Project Tungsten?

Tungsten is the second generation of that: it’s how you can get 10X to 40X the performance. The need is there; the need is not new. What’s new is that you have data coming in at enormous velocity, so you need the capacity to process it instantly. In order to do that, you need very performant distributed systems. And I think you and I are both living in the heart of this data Zeitgeist.

This revolution has been a lot like being in the center of a tornado. Everything around you changes so quickly. So, how do you like the Strata conference?

Oh, this has been wonderful. Like you said, this is a Big Data and Hadoop conference, and to see how many Apache Spark talks there were was amazing. It’s a testament that Spark is an integral part of Big Data.

It certainly is, yes.

It is. Spark Summit is growing fast, too. Big Data and Apache Spark have become a very symbiotic relationship. It’s very complementary. You can’t really talk about Big Data and not talk about Apache Spark.

Be sure to read part one of this conversation, on the importance of the Apache Spark community, and don’t miss the next part! We’ll talk about security, and the big move of Big Data processing to the Cloud.

For more talk about the future of Big Data, including more on Spark and Hadoop, read our eBook, Bringing Big Data to Life: What the Experts Say.
