
Tag Archives: Interview

NVMe vs. SATA, Part 2: Industry insider interview

October 19, 2019   Big Data

This article is part of the Technology Insight series, made possible with funding from Intel.

In our previous discussion of these competing protocols for solid state drives (SSDs), there was a big unanswered question: The NVMe specification first arrived eight years ago. Where’s all the adoption?

Sometimes it takes an industry insider to shed new light on tough subjects. We turned to Eric Pike, formerly of Fusion-io and now director of Western Digital’s enterprise devices group. The company looks to bolster its hard disk lineup (which increasingly focuses on high-capacity data center needs) with NVMe SSDs. Pike sees the technology as a perfect fit for the storage behemoth’s next generation. Our forty-minute discussion covered lots of landscape, but we’ve condensed it to focus on core issues for business.

VB: Where are we at now with NVMe adoption in the SMB and enterprise markets?

EP: It depends on whose numbers you look at, but for cloud providers, you have somewhere around 80% to 85% on a unit basis as well as a petabyte storage basis. But if you flip it around to SMBs and mainstream enterprises, you’re between 12% and 15%. And that includes direct-attached devices and appliances like EMC and NetApp.

VB: That seems…really low.

EP: If you’ve ever read the book Crossing the Chasm, that actually represents the front-end distribution you see before you cross the chasm. But why haven’t we crossed it? Fusion-io kind of jump-started putting storage on the PCIe bus, but we had a proprietary protocol, and everybody was waiting for a standard. Once the standard arrived, people thought, “This is the death of SATA and SAS. All of this performance with PCIe, now coupled with a standard interface — that’s it.” Fast forward, though, and here we are at 12% to 15% overall.

Chart: Inside Track Research Note, NVMe – The State of Play, Freeform Dynamics Ltd, 2019 (via WD blog)

VB: Who are the early adopters in that little wedge?

EP: The ones who can take advantage of it, right? They’re running high-frequency transactional systems; large-scale, high-performance databases; highly virtualized environments where the throughput of an NVMe device is very valuable. Even on Western Digital’s site and around the web, you’ll see copious examples of where NVMe applications make sense. That said, 85% of the application environment is still satisfied with “good enough.”

VB: Why isn’t NVMe’s performance good enough to get us over the chasm?

EP: The very first comment [the author] makes in that book is that you have to take consideration of the whole product, which means it’s more than just a performance statement. And it’s not always about price, because there are pockets of the market where the price delta could be argued to be negligible. But more broadly, there’s a premium at the system level. Before you even put an NVMe device inside of a box, most of the ODM/OEMs — the guys building the boxes — have to add costs. They add higher-end solutions around NVMe. We’re trying to understand why that is. What are the drivers behind that? It’s probably too early to share anything, but we have some theories we’re looking through.

VB: So, system builders pack higher-end components around higher-end storage to get higher-end prices? There must be more to it than that.

EP: Oh, sure. When a customer buys a server, high availability, HA, is important. It’s one of the things that differentiates enterprise customers from client or even workstation customers. So, they tend to put drives in boxes using RAID configurations. However, the availability of hardware RAID for large-scale installations is still in its early stages on NVMe. You can get RAID controllers that support 12, 20, or 24 SATA devices, but most state-of-the-art NVMe RAID controllers support four devices. If a customer wants to put in a RAID with six or eight NVMe drives, they either have to create two RAID sets with two RAID controllers or use software RAID. Now, we know from the early days of NVMe and PCIe technology, you lower latency by direct-connecting to a CPU for software RAID, but those early instantiations were notoriously — I’ll use somebody else’s term — performance-challenged. You end up using a lot of CPU cycles. It’s getting better, though, and we are seeing improvements in that area.

VB: But not better enough?

EP: We still have remnants of barriers. The issue hasn’t yet been addressed to the point where a customer could have the equivalent of a SATA installation experience with an NVMe installation.

VB: So, what are the best roles today for NVMe with small/medium business and enterprise customers?

EP: Think of high-end traders and high-end analytics work. End users in those scenarios love NVMe. In terms of business value, the difference of a couple of hundred dollars in price just starts to sound like unnecessary noise.

VB: And tomorrow?

EP: The short answer is eventually everyone. Because NVMe has value with just about any workload in high-performance environments. And even without the raw performance benefits, you still have TCO benefits over the long run.

VB: Such as?

EP: You’ve probably seen reports about data center resource utilization numbers that are in the teens, right? Part of that is because they design their environment for peak workloads. There’s a lot of overprovisioning, and that also applies to storage performance. To hit a certain performance level, you need X sort of storage. But with hyperconverged infrastructure, you have a lot more control over scaling individual resources. You don’t have to overprovision everything all together. With hyperconvergence, NVMe is going to give you a lot more scaling efficiency.

VB: What specifically do you mean by scaling efficiency?

EP: It’s tied to access density, meaning how quickly I can access all the data on a drive. The higher the access density, the more efficient my use of that storage is. Think about the RAID rebuild time on a 16TB SATA drive. I don’t even know how long it is. Hours? Even days? NVMe over PCIe gen 4 is a fraction of the time, and gen 5 is going to halve that again in the near future. NVMe lets you scale storage capacity with scalable access density so you can access these large amounts of data.
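
To put rough numbers on that access-density point, here is a back-of-the-envelope rebuild-time estimate in Python. The 450 MB/s SATA figure and the roughly 6.5x NVMe multiplier come from later in this interview; treating the rebuild as a single sequential pass is an assumption, and real rebuilds run slower because the array keeps serving I/O.

```python
# Back-of-the-envelope rebuild-time estimate for a 16 TB drive.
# Throughput figures are the ballpark numbers quoted in this interview;
# a single sequential pass is assumed, so real rebuilds take longer.

def rebuild_hours(capacity_tb: float, throughput_mb_s: float) -> float:
    """Best-case time to read/write the whole drive once, in hours."""
    capacity_mb = capacity_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return capacity_mb / throughput_mb_s / 3600

print(f"SATA @ 450 MB/s:               {rebuild_hours(16, 450):.1f} h")   # ~9.9 h
print(f"NVMe @ ~6.5x SATA (2925 MB/s): {rebuild_hours(16, 2925):.1f} h")  # ~1.5 h
```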

VB: For businesses, is there more to the TCO discussion than scaling efficiency?

EP: At a large scale, power is definitely a TCO factor. Recall throughput. A SATA device does around 450 MB/s, right? An NVMe device capable of saturating the bus is going to give you six to seven times the performance. Now, that SATA device will run on 7W or 8W, while NVMe is about 25W. So, we’re talking about, say, 3.5X on the power delta but 6.5X on the performance delta. NVMe is more power efficient. There are also value NVMe devices down in the 11W to 14W range. You still get about four times the performance of a SATA device for less than two times the power. So again, if you’re talking about one drive in one system, you may not notice this, but if you have a fairly large-scale installation and TCO matters, these things start adding up.
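
As a quick sanity check on the performance-per-watt claim above, the round figures quoted in the answer work out as follows (these are the interview’s ballpark numbers, not measurements):

```python
# Performance per watt from the round numbers quoted above (not measurements).

def mb_per_second_per_watt(throughput_mb_s: float, watts: float) -> float:
    return throughput_mb_s / watts

figures = {
    "SATA (450 MB/s @ ~7.5 W)":          mb_per_second_per_watt(450, 7.5),
    "NVMe, full speed (~6.5x @ 25 W)":   mb_per_second_per_watt(450 * 6.5, 25),
    "NVMe, value class (~4x @ ~12.5 W)": mb_per_second_per_watt(450 * 4, 12.5),
}
for name, value in figures.items():
    print(f"{name:35s} {value:6.1f} MB/s per watt")
```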


Big Data – VentureBeat


Expert Interview (Part 1): Kenny Scott on the Challenges of Data Management

January 5, 2019   Big Data

Paige Roberts

January 2, 2019

At the Collibra Data Citizens event in May of this year, Paige Roberts had a chance to speak with Kenny Scott, a Data Management Consultant. In part one of this two-part series, Roberts and Scott speak about some of the challenges that come with being a data management consultant.

Roberts: Let’s start off with an introduction of yourself for our readers.

Scott: My name is Kenny Scott. I started in banking 29 years ago, in the Trustee department providing a regulatory service to large Investment companies. After a few years I moved to London, returning to Edinburgh three years later to take up a role in a newly formed Business Systems Team, working with business objects, database creation, and forming the bridge between the business and Technology teams. There was a considerable amount of diverse data, from shareholder registers to complex market derivatives data at a time when Data Governance was not as prolific a subject as it is now.

Looking back with the experience I have now, there were a few practices that would definitely not be allowed in today’s data environment.

Roberts: Wild, wild west.

Scott: I spent some time in Luxembourg, bringing data into a complex monitoring tool. I was working with a lot of good people, but it was quite a maverick environment. Around this time I started working on data quality. An opportunity came up for a business intelligence role. I didn’t get it, as another candidate had better experience, but they liked my approach and attitude and said, “How would you like to be a Metadata Manager? No one knows metadata, but you can find out what it is and deliver it.”

You can figure it out. [Laughter]

Within six months, I had a handle on Metadata Management and started to deploy it across the organization. We implemented hotkey functionality, which brought up the definitions and linked them to business processes. When we started, there were 25,000 business terms. That was too many, so we removed them and went back to basics. By the time I left, the organization was at 600 approved business terms, as opposed to 25,000 terms of little value.

After a year, I was given a data quality team to look after, because the data quality manager left. That’s when I started using Syncsort’s Trillium Data Quality software.

When was this?

Four and a half, five years ago. Ever since then I’ve been ensconced in Trillium software. The metadata and Trillium software were working together. They’re very complementary, as I was talking about in the presentation.

After a reorganization I found myself looking for new challenges and took the opportunity to go down the contracting route for a few years.

What problems did you have when you arrived? What was the driver for them to hire you?

There was a Data Foundation program in place. The company was finding it difficult to attract anyone with Data Quality experience, especially with exposure to Trillium software. They saw my CV and thought, “Well, somebody knows Trillium,” because that’s the tool they had bought.

They were using Trillium Data Quality software and had been trained so they could produce some business insights, but they were giving bare figures to the senior business managers: 80% and less, 60% and less, 40% and less. The business doesn’t care about that; they want to know the problem. They want it written back to them in English.

One of the challenges was that previous Quality Assessments could not be replicated due to a lack of documentation and process. They had done a huge customer analysis and it gave some great insight and narrative, but nobody could tell us how they had cut the data, how they had sourced the data, how it had been sliced, or what rules they had used to get to that focus.

The key is that everything we do is documented with a standard operating process. Those are used every time through the process. If people find a better way of doing something, we tweak it and make it better, but we’ve written them in such a way that anyone can come in off the street and, as soon as they have access to the data, they can run the processes.

What has really been your major stumbling block in implementation?

Access to data. Getting the data out of systems when nobody wants to show anything because it could pull down the system or hamper performance, even in non-production environments. That’s changing now because they see the value of what we’re doing and there’s no tax on the server.

You had to explain to them that Trillium software would not mess with their performance before they would let you pull the data?

Yeah, those are the problems on the data governance side which wasn’t my task. What we were doing with data quality started to give validation to what we were doing in data governance.

When you say what you’re doing with the quality, you’re talking about the discovery aspect of it? You’re discovering the problems, and you could say, “Look, here they are.”

Yes, absolutely. What we do is focus on the customer data funnel, where we bring in people from the business. For example, it was suggested that address line two must be kept blank. And it’s actually in the manual to leave that blank. Who leaves a critical field like address line two blank?

Why do you have address line two if you’re never going to put data in it?

Exactly. [Laughter] Because you’ve got an inconsistency in your data mastering platform, you’ve got to make your algorithms more complex to harvest the data. The market sector we were working in was agriculture and farms in the UK. Normally you’ve got a house, a street, and a street type. Here you’ve effectively got a house name, a farm name, an area name, and a town. There are no numbers in those four strings, and when you’re matching names and addresses in projects, you look for numbers in the patterns.

And if you don’t have those patterns, you can’t find them.

They’re also doing things like putting name information, Mister and Missus, in the address fields. Mr. Scott could be the first line of the address. You’ve got names and addresses all over the place. That’s why we need consistency.

That’s a challenge.

That is the biggest challenge. Actually, with all of the data, I reckon we could get a 96-97% exact match to a postal address if we just structured it correctly, which is a pretty good place to be.

Make sure to check part two where Roberts and Scott speak about the final stages of a data consultant project and where Scott’s next move will take him.

Check out our eBook on 4 ways to measure data quality.


Syncsort Blog


Expert Interview (Part 2): Kenny Scott on the Final Stages of a Data Management Project

January 3, 2019   Big Data

Paige Roberts

January 3, 2019

At the Collibra Data Citizens event in May of this year, Paige Roberts had a chance to speak with Kenny Scott, a data management consultant. In part one of this two-part series, Roberts and Scott spoke about some of the challenges that come with being a data management consultant. Part two focuses on the final stages of a data consultant project and what’s next for Kenny Scott.

Roberts: Going back to where you are now, what stage are you at in the implementation now? Where along the journey are you?

Scott: We started off as a group on the 1st of January this year, after about 18 months as a program. We’ve engaged several parts of the business. We’re recruiting good people to come in. The next stage is to get them in, get them trained, and do the handover.

Roberts: So you’ve actually pretty much got them going. It’s just a matter of knowledge transfer and making sure that they’ve got people in-house who can continue.

Scott: What I’d like to do is transition into more of a data governance role, and help with ideas and then contribute to the data strategy. IT has the data architecture space and they’ve got the access to Collibra. They do the data models and the governance part of it that we don’t need to do. They handle the data quality and the data governance for our data pool as opposed to data governance as a whole.

Right. Collibra is focused on that whole top level, and Syncsort’s Trillium software is very focused on the quality aspect.

All right. Well, you’re talking about winding this down and you’re doing the hand-off. What do you think you’re going to do next?

At the moment I’d like to get back into a bigger organization. I’d like to get into a car manufacturer, or the utility industry, or aviation, or even maybe the medical field. I want to start looking at different datasets. Somewhere that’s big enough to be invested in the tools, somewhere with innovation, you know?

Yeah.

I’m all for business, and I’m looking for the business value, the business process, the business drivers, and a strong business unit.

The whole purpose of the technology is to solve a business problem. If you don’t focus on the business problem first, you’re kind of missing the point.

I’ve seen this happening for years. You want the technology to work for you. It’s the business’s problem, and it’s the business telling us we’ve got to use these tools. What they’re saying is, “This is our business requirement. This is the tool we’ve identified. We want you to bring it in and implement it so we can build the processes around it.”

That’s good. You really have to talk to each other. It’s too easy for the technology and the business half to get separated until they’re not even communicating.

It was interesting to hear some of those talks out there, and even some of the questions that came up, about technology driving governance. I don’t see that. I see technology having a process requirement to make things efficient, by knowing where the catalog is or where the assets are. That’s the driver of governance. That is putting something in place that is going to help them deliver the toolsets.

They’re the ones that hurt when it’s not done right. I heard somebody talk about a data quality campaign that had been done at their business, and it saved something like 600,000 pounds. They were wasting all of that money on the marketing campaigns that were never going anywhere.

And that’s fair. To get to where we are at the moment, we’ve started to put a monetary value on all of that returned mail. I get figures back for what returned mail costs and what the email campaigns cost, so we can see what we don’t get back from it.

So you can take that information up to the CEO, and say, “This is why it matters.”

And we also have a permanent head of data coming in the next month, so we can make the transfer and keep the show on the road.

That transfer of knowledge, it’s like there’s a certain degree of knowing what they’re supposed to be doing, and how it’s supposed to work, and you need to move that.

Exactly.

Well, thank you for taking the time to speak with me and I really enjoyed your presentation today.

Thank you very much for your support.

Check out our eBook on 4 ways to measure data quality.


Syncsort Blog


Expert Interview (Part 3): Jeff Bean and Apache Flink’s Take on Streaming and Batch Processing

December 2, 2018   Big Data

Paige Roberts

November 29, 2018

At the recent Strata Data conference in NYC, Paige Roberts of Syncsort sat down for a conversation with Jeff Bean, the Technical Evangelist at data Artisans.

In the first part of this three-part blog, Roberts and Bean discussed data Artisans, Apache Flink, and how Flink handles state in stream processing. Part two focuses on the adoption of Flink, why people tend to choose Flink, and available training and learning resources.

In the final installment, Roberts and Bean speak about Flink’s unique take on streaming and batch processing, and how Flink compares to other stream processing frameworks.

Roberts: So, aside from the ability to describe state, what else about Flink makes it especially cool?

Bean: It’s the real-time stream processing. A lot of vendors will say that they offer real-time stream processing, but when you look at what they actually offer, it’s some derivative side project, maybe a set of libraries that does stream processing, or a couple of extra functions that you can call. Flink is designed from the ground up for stream processing, and it treats batch processing as a special case rather than the other way around. I think that is more interesting for applications that want to handle both analytic data and real-time data with streams. You don’t need two different sets of applications for that, since Flink treats them both the same: it sees batch as a special case of streaming rather than the other way around.

I can picture how, if you have a batch process, you sort of chop it into tinier and tinier batches until you’re down to one event, and now you have streaming. But if I’m starting from a streaming point of view, how do I get to batch? What kind of a special case is batch?

Batch is basically streaming with bounds. You point your processor at a fixed data set, and it will process it one record at a time as if it were a stream. Off it goes and it’s done. With Flink, it’s really designed for streams as input. You can point it at a file or a table and say, “That’s a stream.” I found that when you consider batch processing as a special case of streaming, rather than the other way around, it all comes together more easily.
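
To make “batch is a bounded stream” concrete, here is a small conceptual sketch in plain Python; it is not the Flink API, just an illustration that the same record-at-a-time processor can consume either a finite source or an endless one.

```python
# Conceptual sketch (plain Python, not the Flink API): one record-at-a-time
# processor handles both a bounded source (like a file or table) and an
# unbounded one (a live feed). "Batch" is just the case where the source ends.
import itertools
import random
from typing import Iterable, Iterator

def process(records: Iterable[str]) -> Iterator[str]:
    """Identical per-record transformation for batch and streaming."""
    for record in records:
        yield record.upper()

def bounded_source() -> Iterator[str]:
    # Stands in for a file or table: finite, so processing terminates.
    yield from ["alpha", "beta", "gamma"]

def unbounded_source() -> Iterator[str]:
    # Stands in for a live feed: never ends, so processing runs continuously.
    while True:
        yield random.choice(["tick", "tock"])

print(list(process(bounded_source())))                         # the "batch" run completes
print(list(itertools.islice(process(unbounded_source()), 5)))  # the streaming run, sampled
```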

And if you’ve pointed it at a table, after it pulls all the data off, then new transactions coming in actually become a stream?

Yep, exactly. We’re trained to think about data as if it were static, fixed objects like tables and files, but in fact all data is generated as a stream.

You didn’t get a million records all at once. You got them one at a time. 


So, I remember one time asking one of the Spark experts about true streaming handling in Spark Streaming, and they said, “Well, yeah, it does true streaming.” And I said “I thought it did microbatch.” They said, “Well, everybody does microbatch. It’s just that our microbatches have gone down to one message at a time.” What’s your opinion on that?

When you’re working with Spark Streaming, in order to get optimal performance, you have to tune your microbatch size, or your microbatch interval. In Flink you choose the time characteristic instead, and in the event time characteristic, events are microbatched until the watermark advances. It’s a similar issue but it’s closer to the business problem. There is microbatching, but it happens at the framework level, and the OS level, at the level of the network buffer. Which is where it should be, really.
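
As an illustration of the event-time idea described above, here is a simplified sketch in Python; it is not Flink’s implementation, and the window size and lateness values are made up. Records carry their own timestamps, they are grouped into fixed windows, and a window is emitted only once the watermark (the largest event time seen, minus an allowed lateness) passes the end of that window.

```python
# Simplified event-time windowing sketch (not Flink's implementation).
# A window is emitted only when the watermark has passed its end; events
# arriving after that are treated as late.
from collections import defaultdict

WINDOW = 10   # window size in event-time units (made-up value)
LATENESS = 2  # how far the watermark trails the newest event (made-up value)

events = [(1, "a"), (4, "b"), (12, "c"), (9, "late"), (15, "d"), (25, "e")]

windows = defaultdict(list)
emitted = set()
max_event_time = float("-inf")

for ts, value in events:
    max_event_time = max(max_event_time, ts)
    watermark = max_event_time - LATENESS
    w = ts // WINDOW
    if w in emitted:
        print(f"dropped late event {value!r} (ts={ts})")
        continue
    windows[w].append(value)
    # Emit every window whose end the watermark has already passed.
    for idx in sorted(windows):
        end = (idx + 1) * WINDOW
        if idx not in emitted and watermark >= end:
            print(f"window [{idx * WINDOW}, {end}) -> {windows[idx]}")
            emitted.add(idx)
```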

Okay. That makes sense. 

It’s more intuitive, it’s more expressive. As a developer, I find it much easier to learn.

Are there any non-programming interfaces? I know there’s a lot of ways you can write Spark jobs that have nothing to do with Java or Scala or Python. You can build a KNIME workflow and execute in Spark. Syncsort DMX-h can build a data integration job and execute it on Spark. There are notebooks and such. Is there anything like that for Flink?

Not so much, at least on the commercial side. Zeppelin supports Flink, though. I would love to see more of that, and I kind of see that as part of my charter to help build.

So, before we wrap up, is there anything you’d like to let the readers know about before we end?

I mentioned it a little earlier but make sure to check out training.data-artisans.com for some great courses.

Alright. Thanks for taking the time to do this. Good talking to you. Good luck with the new job!

Thank you.


Syncsort Blog


Expert Interview (Part 2): Paco Nathan on the Current State of Agile and Deep Learning

November 7, 2018   Big Data

Paige Roberts

November 6, 2018

At the recent Strata Data conference in NYC, Paige Roberts of Syncsort had a moment to sit and speak with Paco Nathan of Derwen, Inc. In part one of the interview, Roberts and Nathan discussed the origins, current state, and future trends of artificial intelligence and neural networks.

In the second part, Roberts and Nathan go into the current state of Agile and deep learning.

Roberts: Changing the subject a little, one of the other things you talked about, which struck me pretty strongly, is that basically the father of Agile says, don’t do Agile anymore. [Laughter]

Nathan: [Laughter] Right!

Roberts: Can you talk about that a little bit?

Nathan: Yeah, I was referencing a recent paper this year, actually just a few months ago, by Ron Jeffries, who created Extreme Programming. Pair Programming came out of that. Scrum came out of that. A lot of the things we recognize as Agile came from that. He was one of the signatories of the Agile Manifesto 20 years ago. Recently he came out saying that the definitions of Agile he’s seen floating around in industry don’t have anything to do with the intention they were trying to strike at. He wrote down, “20 years later, here’s my advice for what you really need to do with your team. Let’s get away from the names, and let’s just really focus on how to make teams better.”

Roberts:  Wow. Okay. What’s the paper that he did?

Nathan:  It’s called “Developers Should Abandon Agile.”

That’s pretty interesting. I think for tons of software companies right now, that’s the Bible. You have to do Agile to survive.

If you saw the talk by David Talby that was a really good one too. It was called, “Ways That Your Machine Learning Model Can Crash and What You Can Do About It.” He’s done a lot of work, especially in healthcare, with machine learning and he just had case study after case study of what goes wrong. The point there was, the real work is not developing the machine learning model. The real work is once you put it into production, what you have to do to make sure that it’s right, and that’s ongoing.

Yeah. That’s always true.

I heard David’s talk in London five minutes before my talk, and I made a slide to represent some of the things he talked about because it fit in with what I was saying. I showed it and then there were arguments out in the hallway afterwards, because the Agile people were like, “How dare you say that!” It’s really salient because if I’m developing a mobile web app, and I have a team that I’m engineering director of, I’m going to bring in my best architects and team leads early in the process. They’re going to go define the high level definitions and define the interfaces. As the project progresses more into fleshing out different parts of the API and getting into more maintenance mode, I don’t have to have my more senior people involved.

Right.

With machine learning, it is the exact opposite. If I’ve got a dataset, and I want to train a model, that’s a homework exercise for somebody who’s just beginning in data science. I can do that off the shelf. But once you get deployed and start seeing edge cases and the issues that have to do with ethics and security, that’s not a homework exercise. Unless you’re in context, and actually running in production, you’re not going to know in advance what those issues are.

Yeah, but a lot of the conversation now is about the fact that most of your datasets are in some way biased, and there’s a lot of ethics involved in launching a machine learning model. I just saw an article online where they’re making ethics in machine learning a first-year course for people they’re training for ML and AI (Carnegie Mellon, University of Edinburgh, Stanford). I guess it actually speaks a little bit to what you said about putting your experts at the end, during production. To a certain extent, it seems to me like you also want to have the experts at the beginning, looking at the data before it even starts the process.

Definitely. Deloitte, McKinsey, Accenture, all of them, when we do executive briefings, they all want it set at the beginning. Before we even talk about introducing machine learning into your company, you need to get your ducks in a row as far as breaking down the data silos, and getting your workflow for cleaning your data in place, and a culture that’s based around using data engineering and data science appropriately. You need to do all of those things before you can even start on machine learning. There’s a lot of foundation that needs to be done correctly.


I said something about the high percentage of machine learning projects that never make it into production on Twitter, and got a response from John Warlander, a Data Engineer at Blocket in Sweden. He said, “I sometimes wonder how many of those ‘not in production’ big data projects happen in companies that don’t even have their ‘small data’ in order. That’s often where most of the low-hanging fruit is.” I’ll put that in my blog post about the Strata event themes and industry trends. We’re talking about a lot of those important themes, so I’ll probably put a lot of quotes from you in it.

David Talby had a great quote, “Really, if you want to talk about AI in a product, what you’re talking about is what you’re going to do once you’re deployed and the products being used by customers. How do you keep improving, because if you’re not doing that you’re not doing AI.”

Well if you’re not doing that, you’re certainly not having that feedback loop. You’ve lost that. When looking at the improvement in accuracy over random chance of any model, there’s always that curve that says this is more and more accurate and then it becomes less and less accurate over time if you don’t constantly retrain your models. One of the themes for Syncsort, as a data engineering kind of company, is making sure that the data that you’re feeding in there is itself constantly refreshed and improved. You said something in your talk that stuck with me. The value in ML and AI right now isn’t as much in iterating through models, or getting the best model, it’s feeding your models the best datasets.

I mean, if you want a good data point on that, a lot of these companies, even ones who are leaders in AI, will share their code with you. They’re not going to share their data. That was kind of the punchline of the situation with CrowdFlower, or Figure Eight. Google bought into self-driving cars, and they realized they could replace a lot of one-off machine learning processes with deep learning, but to do that, they needed really good labelled datasets. Other manufacturers saw their success and wanted to do self-driving cars, too. They hired the talent, and the first thing they found out is that if they want to do deep learning, they don’t have enough data, or enough good, labelled data. So, they go to Figure Eight and ask, “Hey, can you label our datasets?”

Lukas Biewald, the founder of Figure Eight, was talking in San Francisco a couple of years ago, saying, “Yeah, for about $2–3 million per sensor, we’d be happy to work with you on that.” And he had customers lined up, GM and all the others, because …

Because it’s worth it.

Yeah and if they don’t have it, they’re out of the self-driving car business. It may be a high price but it will likely include years of data.

People focus so much on the models. I have to have the most sophisticated algorithm, …

No. That’s not it.


The only reason that AI didn’t take off back in the ’80s or the ’90s, when you and I were first studying it, was because we didn’t have enough data. We couldn’t crunch it. We couldn’t ingest that amount of data and do anything with it affordably.

There needed to be millions of cat pictures on the internet before we could really do deep learning.

Before we could create something that could identify a cat picture. That’s just the nature of the game.

That was the paper that launched it all. And then the open source for using GPUs to accelerate it.

That’s really taking off more now in spaces other than video games. Walking the Strata floor, there are a lot more vendors out there taking advantage of GPUs.

There’s nothing really sacred about the architecture of a GPU with respect to machine learning. It just happens to be faster than a general-purpose CPU at doing linear algebra. But now we’re seeing more ASICs that can do more advanced linear algebra, at enough scale that you don’t have to go across the network. That’s the game. We’ll probably see a lot more custom hardware. Basically we’re in this weird sort of temporal chaos regime where hardware is moving faster than software and software is moving faster than process.

Hardware ALWAYS moves faster than software. Most software is just now finally, in the last few years, catching up to things like using vectors to take advantage of regular CPU chip cache.  

And now we’re putting TensorFlow compositions in GPUs.

Exactly. And we’re creating compute hardware that’s specific to task. Software always lags behind the hardware and then business processes have to develop after that.

Yeah, you have to log some time doing the job before you can really figure out the process. I think your company is in a really good space right now. You’ve gotta get the data right. And it’s not just a one-off. You’ve got to keep getting the data right across your company. Now, and forevermore.

Yeah, tracking and reproducing data changes in production is a big challenge for our customers. If you made 25 changes to the data to make it useful for model training, you then have to make those exact same 25 changes in production so that the model sees data in the format it’s expecting. I’m doing a series of short webinars on tackling the challenges of engineering production machine learning data pipelines, including one on tracking data lineage and reproducing data changes in production environments. So is there anything else going on at the moment that you’d like to let us know about?
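
One common way to keep training-time and production-time data changes identical is to bundle every transformation into a single fitted pipeline artifact and reload that same artifact when serving. The sketch below uses scikit-learn and joblib purely as an illustrative choice; no specific tool is named in the interview.

```python
# Sketch: bundle every data change into one fitted pipeline, persist it, and
# reload the identical object in production so the model sees data transformed
# exactly as it was during training. (scikit-learn/joblib are an illustrative
# choice here, not something named in the interview.)
import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, np.nan]])
y_train = np.array([0, 0, 1, 1])

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # data change #1
    ("scale", StandardScaler()),                 # data change #2
    ("model", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

joblib.dump(pipeline, "model_pipeline.joblib")   # ship to production as one artifact

# In production: reload and score raw records; the same transformations run.
served = joblib.load("model_pipeline.joblib")
print(served.predict(np.array([[2.0, np.nan]])))
```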

I have a little company called Derwen.ai. If you check there, we’ve got a lot of articles. It’s my consulting firm and we do a lot of work with the conferences. We get to see a real bird’s eye view, and we hear from all kinds of people. We’re like Switzerland. We get to hear what a lot of people are working on, even if they’re not ready to go public with it. I hear the pain points people are dealing with, and help out the start-ups. It’s kind of like a distributed product management role.

Cool. All right, well, thanks for talking to me. I really enjoyed your presentation.

Thank you very kindly, so good to see you.

Check out our eBook on the Rise of Artificial Intelligence for IT Operations.


Syncsort Blog


Expert Interview (Part 3): James Kobielus on the Future of Blockchain, AI, Machine Learning, and GDPR

October 5, 2018   Big Data

Paige Roberts

October 5, 2018

Since Syncsort recently joined the Hyperledger community, we have a clear interest in raising awareness of the Blockchain technology. There’s a lot of hype out there, but not a lot of clear, understandable facts about this revolutionary data management technology. Toward that end, Syncsort’s Integrate Product Marketing Manager, Paige Roberts, had a long conversation with Wikibon Lead Analyst Jim Kobielus.

In the first part of the conversation, we discussed the basic definition of what the Blockchain is, and cut through some of the hype surrounding it. In the second part, we dove into the real value of the technology and some of the practical use cases that are its sweet spots. In this final part, we’ll talk about the future of Blockchain, how it intersects with artificial intelligence and machine learning, how it deals with privacy restrictions from regulations like GDPR, and how to get data back out once you’ve put it in.

Roberts: Where does Blockchain go from here? What do you see as the future of Blockchain?

Kobielus: It will continue to mature. In terms of startups, they’ll come and go, and they’ll start to differentiate. Some will survive to be acquired by the big guys, who will continue to evolve their own portfolios, while integrating those into a wide range of vertical and horizontal applications.

Nobody’s going to make any money off of Blockchain itself. It’s open source. The money will be made off of cloud services, especially cloud services that incorporate Blockchain as one of the core data platforms.

Believe it or not, you can do GDPR on Blockchain, but here’s the thing: the GDPR community is working out exactly what you can do to delete data records consistently on the Blockchain. Essentially, you can encrypt the data and then delete the key.

Right. If you can’t decrypt it, you can’t ever read it.

Yeah. Inaccessible forevermore, in theory. That’s one possibility for harmonizing Blockchain architecture with GDPR and other mandates that require the right to be forgotten. The regulators also have to figure out what is kosher there. I think there will be some reconciliation needed between the techies pushing Blockchain and the regulators trying to enforce the various privacy mandates.
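
The “encrypt the data, then delete the key” approach mentioned above is sometimes called crypto-shredding. Here is a minimal sketch using Python’s cryptography library as an illustrative stand-in; nothing Hyperledger-specific is shown, only the point that once the off-chain key is destroyed, the ciphertext left on the immutable ledger is unreadable.

```python
# Minimal crypto-shredding sketch: the immutable ledger stores only ciphertext,
# and "forgetting" a data subject means destroying the key held off-chain.
# (The cryptography library is an illustrative choice, not part of Hyperledger.)
from cryptography.fernet import Fernet, InvalidToken

key = Fernet.generate_key()  # held off-chain, e.g. one key per data subject
ciphertext = Fernet(key).encrypt(b"name=Jane Doe; email=jane@example.com")

# `ciphertext` is what would be written to the append-only ledger.
print(Fernet(key).decrypt(ciphertext))  # readable while the key exists

key = None  # "right to be forgotten": destroy the key, the ledger stays intact
try:
    Fernet(Fernet.generate_key()).decrypt(ciphertext)  # any other key fails
except InvalidToken:
    print("ciphertext is now permanently unreadable without the original key")
```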

Just as important in terms of where it’s going, Blockchain platform as a service, PaaS, will become an ever more important component of data providers’ overall solutions. Year by year, you’ll see the Microsofts, IBMs, and Oracles of the world evolve Blockchain-based cloud services into fairly formidable environments.

There are performance issues in terms of the speed of updates with Blockchain now, but I also know that there is widespread R&D to overcome those. VMware just announced they’re working on a faster consensus protocol, so that different nodes on the Blockchain can come to consensus rapidly, allowing more rapid updates to the chain. Lots of parties are looking for better ways to do that. So it might become more usable for transactional applications in the future.

Blockchain deployment templates are going to become the way most enterprise customers power this technology. AWS and Microsoft already offer these templates for rapid creation and deployment of a Blockchain for financial or supply chain or whatever. We’re going to see more of those templates as the core way in which people buy, in a very business friendly abstraction. There will be a lot of Blockchain-based applications for specific needs. We’ll see a lot of innovation in terms of how to present this technology and how to deliver it so that you don’t have to understand what a consensus protocol is or really give a crap about what’s going on in the Blockchain itself. It should be abstracted from the average customer.

More in terms of going forward, you’ll see what I call “Blockchain domain accelerators.” There are Blockchain consultants everywhere now. There are national Blockchain startup accelerators. There are industry-specific Blockchain startup accelerators. There are Blockchain accelerators in terms of innovation of cryptocurrency and Internet of Things. We’re going to see more of these domain accelerator industry initiatives come to fruition using Blockchain as their foundation. They’ll analyze and make standards of how to deploy, secure and manage this technology specific to industry and use case requirements. That definitely is the future.

As I mentioned before, it will become a bigger piece of the AI future, because of Blockchain-based distributed marketplaces for training data. Training data for building and verifying machine learning models for things like sentiment analysis has real value. There aren’t many startups in the world that have massive training datasets already. To build the best AI, you’ll need to go find the best training datasets for what you’re working on.


I talked about that a little with Paco Nathan at Strata, how labelled, valid, useful training datasets were incredibly valuable now, and AI companies recognize that. They will share their code with you, but not their data, not for free.

I really think you’ll see a lot more AI training dataset marketplaces with Blockchain as the backing technology. It’s going to become a big piece of the AI picture.

Blockchain security is another big thing going forward. The weak link in the Blockchain is protecting your private keys, which provide you with secure access to the cryptocurrencies running on the chain. What we’re going to see is more emphasis on security capabilities that are edge-to-edge, in terms of securing Blockchains from the weakest link, which is the end user managing their keys. I think you’ll start to see a lot of Blockchain security vendors that help you manage your private keys, and also smart contracts. Smart contracts on the Blockchain have some security vulnerabilities in their own right. We’ll see a lot of new approaches to making these tamper-proof. There are already a lot of problems with fraud.

I think I’ve covered most of the big things I see coming. That is the really major stuff.

One more thing, I’m curious about since Blockchain is still fairly new to me. There’s a lot of conversation about how you store data on the Blockchain, and a lot of research into things like securing it, and speeding up update speed, but storing data is only half the story with data management. Once you’ve put all this data in, you have to then get it out. If I’ve got a Blockchain, it has all this information I need, how do I go find and retrieve information from it? Do I use SQL?

There’s a query language in the core Blockchain code base.

So, it has its own specific query language, and people will have to learn a whole other way to retrieve data?

Basically, the core of Hyperledger has got a query language built in. It’s called Hyperledger Explorer. Hyperledger, in itself, is an ecosystem of projects, just like Hadoop is and was, and it will evolve. It’ll be adopted at various rates; some projects will be adopted widely, and some very little, in production Blockchain deployments.

There are some parallels with early Hadoop. Among the early things Hadoop had under its broad scope was an initial query language that didn’t take off; they updated it and improved it with HiveQL. Same thing with Spark: it started out with a query language, Shark, and switched to another one, Spark SQL.

We have to look at the entire ecosystem. Over time, some pieces may be replaced by proprietary vendor offerings, or different open source code that does these things better. It’s part of the maturation process. Five years from now, I’d like to see what the core Blockchain Hyperledger stack is. It may be significantly different. It may change as stuff gets proved out in practice.

Yeah, Hadoop changed a lot over the last decade.

Hadoop has become itself just part of a larger stack with things like Tensorflow, R, Kafka for streaming. Innovation continues to deepen the stack. The NoSQL movement, graph databases, the whole data management menagerie continues to grow. We’ll see how the core protocol of Blockchain evolves too. It’s a work in progress, like everything else.

I’ve written a bunch of articles on this. It’s changing all the time.

I’ll be sure to include some links in the blog post, so folks can learn more. I really thank you for taking the time to speak with me. It was really informative.

No problem. I enjoyed it.

Jim is Wikibon’s Lead Analyst for Data Science, Deep Learning, and Application Development. Previously, Jim was IBM’s data science evangelist. He managed IBM’s thought leadership, social and influencer marketing programs targeted at developers of big data analytics, machine learning, and cognitive computing applications. Prior to his 5-year stint at IBM, Jim was an analyst at Forrester Research, Current Analysis, and the Burton Group. He is also a prolific blogger, a popular speaker, and a familiar face from his many appearances as an expert on theCUBE and at industry events.

Also, make sure to download our white paper on Why Data Quality Is Essential for AI and Machine Learning Success.


Syncsort Blog


Expert Interview (Part 2): James Kobielus on Blockchain’s Sweet Spot in Practical Business Use Cases

October 2, 2018   Big Data

Paige Roberts

October 2, 2018

Since Syncsort recently joined the Hyperledger community, we have a clear interest in raising awareness of the Blockchain technology. There’s a lot of hype out there, but not a lot of clear, understandable facts about this revolutionary data management technology. Toward that end, Syncsort’s Integrate Product Marketing Manager, Paige Roberts, had a long conversation with Wikibon Lead Analyst Jim Kobielus.

In the first part of the conversation, we discussed the basic definition of what the Blockchain is, and cut through some of the hype surrounding it. In this second part, we dove into the real value of Blockchain technology and some of the practical use cases that are its real sweet spots.

Roberts: The hype cycle tends to make all kinds of wild claims. It will do everything but wash your socks. Which claims for Blockchain do you feel have some validity?

Kobielus: First of all, since it doesn’t support CRUD, it’s not made for general purpose database transactions. It’s made for highly specialized environments where you need to have a persistent immutable record like logging. Logging of security related events, logging system events for later analysis correlation, etc. Or, where you have an immutable record of assets, video, music and so forth, in a marketplace where these are intellectual property that need protection against tampering. If you have a tamper-proof distributed record, which is what Blockchain is, it’s perfect for maintaining vast repositories of intellectual properties for downstream monetization. Or, for tracking supply chains.

A distributed transaction record that can’t be repudiated, that can’t be tampered with, that stands up in legal situations is absolutely valuable. So, Blockchain makes a lot of sense in those kinds of applications. In addition to lacking the ability to delete and edit the data, Blockchain is slow. It’s not an online transactional database. Updates to the chain can take minutes or hours depending on how the chain is set up, and how extensive the changes are, so you can’t have a high concurrency of transactions. It’s just not set up for fast query performance. It’s very slow.

Also, the world is moving towards harmonization around privacy protection, consistent with what the European Union has done with the General Data Protection Regulation (GDPR) and the recent California privacy regulation that is similar to GDPR. GDPR requires that any personally identifiable information (PII) must be capable of being forgotten, meaning people have the right to request deletion of their personal data, or to edit it if it’s wrong. In Blockchain, you can’t delete, and you can’t edit, a record that’s been written. There’s a vast range of enterprise applications that have personally identifiable information. The bulk of your business (sales, marketing, customer service, HR, etc.) has tons of PII data.

So, Blockchain is not suitable for those core transaction processing applications. Any application that demands high performance queries will not be on the Blockchain. It’s not suitable for highly scalable real-time transactions of any sort, whether or not they involve PII data.

The way I see it, Paige, is there’s a range of fit-for-purpose data platforms in the data management space. There are relational databases, all the NoSQL databases, HDFS, graph databases, key-value stores, real-time in-memory databases, and so on. Each of those is suited to particular architectures and use cases, but not to others. Blockchain is fundamentally a database, and it’s got its uses. It’s not going to dominate all data computing like a monoculture, no matter what John McAfee says. That’s not going to happen. It’s already limited technologically and by regulation. It’s a niche data platform that’s finding its sweet spot in various places.


You mentioned a couple of good use cases like supply chain management. I’ve heard of uses like tracking diamonds from the mine to the jewelry store to be certain of their origins, that they’re not blood diamonds. All of the examples I had heard of in the past were based on the concept of Blockchain as a transactional ledger or even a sensor log. For example, you keep sensors on your food from the farm to the market to make sure that it never went above a certain temperature for a certain amount of time, that sort of thing. One of the use cases you mentioned was actually news to me, that you could store other sorts of data like application code, so you could do code change management with it. What other use cases do you see coming?

Actually, there are a few pieces that I published recently on vertical applications beyond supply chain management. Blockchain startups are trying to grab a piece of the video streaming market. Essentially these services, a lot of which are still in an alpha or beta pre-release phase, use Blockchain in several capacities. One, for distributed video storage. Two, for distributed video distribution via a peer-to-peer protocol.

Three, distributed video monetization, using a Blockchain-based cryptocurrency that’s specific to each environment to help the video publishers monetize their offerings. Then there’s Blockchain for distributed video transactions and contracts, and Blockchain for distributed video governance.

So are you talking about having something like Netflix bucks?

More and more Blockchain applications aren’t one hundred percent on the Blockchain. They handle things like PII off the chain, for instance, and put that in a relational database. Most architecture is using fit-for-purpose data platforms for specific functions in a broader application. That is really where Blockchain is coming into its own.

Another specialized Blockchain use case is artificial intelligence, one of my core areas. I’ve been reading for a while now about the AI community experimenting with using Blockchain as an AI compute brokering backbone; there’s a company called Cortex. You can read my article on that. They use Blockchain as a decentralized AI training data exchange. They have data that has the core ground truths a lot of AI applications need to be trained on.


So you’re saying they basically create really solid, excellent training datasets, doing all the data engineering to make sure these are good training datasets for AI ground truths, and then use Blockchain to exchange them to other AI developers?

It’s a Blockchain for people who built and sourced their training data to store it in a ledger so that others can tap into that data from an authoritative repository.

Right. Okay. That makes sense. Seems like a valuable commodity to the AI community.

Several small companies are doing this. They’re converging training data into an exchange or marketplace for downstream distribution to data scientists, or whoever will pay for the training data. Blockchain is used as an AI middleware bus, an AI audit log, an AI data lake.

What I’m getting at, Paige, is that there are lots of industry-specific implementations of Blockchain. Industries everywhere are using this, some in production, but many of them are still piloting and experimenting with Blockchain in a variety of contexts including e-commerce, AI, video distribution, in ways that are really fascinating.

These are the same kinds of dynamics that we saw in the early days of Hadoop and NoSQL and other technologies. Each technology market grows by vendors finding a sweet spot, an application that their approach is best suited to.

We see a lot of hybrid data management approaches in companies that use two or more strategies in a common architecture.

One thing that’s missing from all that stuff is real-time streaming, continuous computing applications. Blockchain is very much static data, it’s almost the epitome of static data. You won’t see too many real-time applications for Blockchain alone, but that’s okay. Blockchain is good for the things that it’s good for.

Blockchain will find its niche given time?

Yes.

Be sure not to miss Part 3 where we’ll talk about the future of Blockchain, how it intersects with artificial intelligence and machine learning, how Blockchain deals with privacy restrictions from regulations like GDPR, and how to get data back out of the Blockchain once you’ve put it in.

Jim is Wikibon’s Lead Analyst for Data Science, Deep Learning, and Application Development. Previously, Jim was IBM’s data science evangelist. He managed IBM’s thought leadership, social and influencer marketing programs targeted at developers of big data analytics, machine learning, and cognitive computing applications. Prior to his 5-year stint at IBM, Jim was an analyst at Forrester Research, Current Analysis, and the Burton Group. He is also a prolific blogger, a popular speaker, and a familiar face from his many appearances as an expert on theCUBE and at industry events.

Make sure to download our white paper on Why Data Quality Is Essential for AI and Machine Learning Success.


Syncsort Blog


Expert Interview (Part 1): James Kobielus on Separating Blockchain Hype from Reality

October 2, 2018   Big Data

Paige Roberts

October 1, 2018

Since Syncsort recently joined the Hyperledger community, we have a clear interest in raising awareness of the Blockchain technology. There’s a lot of hype out there, but not a lot of clear, understandable facts about this revolutionary data management technology. Toward that end, Syncsort’s Integrate Product Marketing Manager, Paige Roberts, had a long conversation with Wikibon Lead Analyst Jim Kobielus.

In this first part of that conversation, we discussed the basic definition of what the Blockchain is, and cut through some of the hype surrounding it.

Roberts: Tell us a little about yourself.

Kobielus: I’m James Kobielus. I’m the lead analyst at Wikibon. I’m a veteran analyst covering data analytics, artificial intelligence and cloud data computing, and one of my research focus areas is Blockchain. In fact, I plan to write and publish a Wikibon research document on its maturation in the enterprise some time in the next few months.

Ah, good timing for the interview then. Let’s start with the basic definition. What exactly is the Blockchain?

Blockchain was defined initially by the legendary inventor of Bitcoin, Satoshi Nakamoto, which is not really his name, just a pseudonym. Blockchain is not a currency; rather, it is a distributed, trusted ledger. It's essentially a database, but the architecture is distributed and can be stored on dozens, hundreds, or thousands of separate computers that remain in synchronization with each other. The ledger of data is stored in a secure fashion where everybody can read the Blockchain, and nobody can repudiate an update they made, because there's a trust mechanism built in. And the Blockchain cannot be changed. It's immutable. Once you write something to a Blockchain, it cannot be deleted and it cannot be edited, so it's a very specialized type of distributed database. In other words, traditional databases enable you to do what we often call CRUD operations: create, read, update, and delete data. Blockchain only allows you to create data and update the chain by appending to it. You can also read it, but you can't delete it. So, it's specialized for a variety of applications that don't require full CRUD semantics.
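
To make the append-only idea concrete, here is a minimal, hypothetical sketch of a hash-chained ledger in Python. It is not any production Blockchain implementation (there is no networking, consensus, or signing), and the Block and Ledger names are invented for illustration; it only shows why records can be appended and read but not quietly edited or deleted.

```python
import hashlib
import json
import time

class Block:
    """One ledger entry: a payload plus the hash of the previous block."""
    def __init__(self, data, prev_hash):
        self.timestamp = time.time()
        self.data = data
        self.prev_hash = prev_hash
        self.hash = self.compute_hash()

    def compute_hash(self):
        payload = json.dumps(
            {"ts": self.timestamp, "data": self.data, "prev": self.prev_hash},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()

class Ledger:
    """Append-only chain: you can create (append) and read, never update or delete."""
    def __init__(self):
        self.chain = [Block("genesis", prev_hash="0" * 64)]

    def append(self, data):
        self.chain.append(Block(data, prev_hash=self.chain[-1].hash))

    def is_valid(self):
        # Editing any earlier block changes its hash and breaks every later link.
        return all(
            b.prev_hash == prev.hash and b.hash == b.compute_hash()
            for prev, b in zip(self.chain, self.chain[1:])
        )

ledger = Ledger()
ledger.append({"from": "alice", "to": "bob", "amount": 10})
print(ledger.is_valid())               # True
ledger.chain[1].data["amount"] = 999   # attempt to edit history...
print(ledger.is_valid())               # False: the tampering is detectable
```

A real Blockchain layers distributed consensus and cryptographic identity on top of this chaining; the sketch covers only the immutability Kobielus describes.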

Okay, that makes good sense. CRUD operations are pretty familiar to anybody in the database space. How is the Blockchain different from other databases?

So, first of all, it’s not different in any radical sense from a number of approaches that have been around for a while now. There are plenty of distributed databases in the world from various vendors that use a variety of approaches to split the data into separate tables, or volumes, with varying degrees of synchronization across different servers. There are approaches in traditional relational databases such as sharding that enable the datasets to be distributed across many nodes.

What makes Blockchain different is that it is primarily for logging data for a secure, trusted record of transactions. Nobody can deny that they posted something because there is a complete audit trail in the updates that were made to the chain. There’s a distributed trust mechanism built into it that you don’t necessarily see in other data platforms or distributed data environments as an embedded capability.

Blockchain also is not limited in the types of data it can store, the way, say, a relational database is limited to storing structured data in tables. It can store pretty much any type of data within the blocks themselves. The term "block" actually has a real meaning in the Blockchain architecture. The data blocks can store textual data, video objects, application code, or whatever you have. So, it's quite versatile. It is a database that can store unstructured and multi-structured data, in addition to structured data.

Blockchain is open source. There are a lot of open source databases, of course. It was originally incorporated into Bitcoin and it’s still the foundation for Bitcoin, and for most cryptocurrencies, but Blockchain has evolved independently of the currencies. Using Blockchain doesn’t necessarily imply that it’s supporting a cryptocurrency application. It could be potentially supporting many kinds of applications.

There’s a core open source distribution, and there are various forks to that distribution, such as for the Hyperledger foundation. Hyperledger is an industry group that manages core Blockchain open source code. There is also the Ethereum Project managing other forks.


Syncsort also sees Blockchain as important to our customers going forward. We recently joined Hyperledger so that we can help contribute to it, like we did in the early days of Hadoop and Spark. Blockchain has a lot of hype around it, though. One of the biggest things we’re trying to do is see what is hype and what is reality. Why do you think Blockchain has been riding so high on the hype train?

Hype serves an important purpose, which is to raise people's awareness and understanding of particular things. Usually in a marketing context, if you want to sell a product, you have to make people aware of it and what it can do. Every technology has a hype cycle, so what you have to do if you're a buyer is get down to exactly what the product does and what differentiates it from other approaches. How mature is this technology? Is it a stable code base? Are there standards? How widely is it adopted? How tested is it? Is there an ecosystem around it?

Blockchain has actually been around for about 10 years. Over that time, it’s grown in a lot of ways, one of which is in its tie to cryptocurrencies and the media around that. It has raised awareness of the Blockchain with a lot of business people, technical people and even consumers.

Supply chain management is one of the dominant use cases or patterns where I’ve seen Blockchain deployed. But the general understanding isn’t there yet. I wrote an article last May for SiliconANGLE called, “Blockchain isn’t ready for enterprise primetime. Here’s what will get it there.” The hype is well in advance of people’s understanding of what Blockchain is all about. That’s a fact.

It’ll take a couple of years for a general understanding across Blockchain and the technologies related to it to really get to a point where people are as familiar with Blockchain as they are now with something like mobile computing. So, the awareness will take a while. Also, it will take a while for the startup community to catch up. There are a LOT of startups, but none of them have really taken off yet. I could list some names, but they’re all unfamiliar to most people, even technical people.


Ten years ago when Hadoop got started, there were a bunch of startups, and a few rose above the rest, and built substantial businesses based on Hadoop: Cloudera, Hortonworks, MapR and a few others. There is no equivalent, familiar brand, yet, that’s focused on Blockchain as a platform vendor. For this space to mature, for us at Wikibon to consider it mature, there needs to be a few of these startups that rise above the pack and survive. An enterprise IT professional needs to know that these companies will be around in a few years.

Also, many of the big, established IT vendors have already stepped in with their own Blockchain products and services. IBM certainly has. AWS launched its own platform over the past year or so. So have Microsoft, Oracle, and VMware.

What I’m getting at is all of these established IT vendors are starting to test the waters in terms of the Blockchain market with tech solutions and cloud services. None of them has had runaway success in terms of Blockchain platform, in terms of adoption. None have become the de facto standard either. We haven’t even gotten to the point where the M & A in this space has picked up.

The hype is very much in advance of the actual maturation and shakeout of the Blockchain space.

Yeah. There’s no Cloudera, or Hortonworks that’s come forward for yet.

Not yet, no.

Be sure to check out Part 2 of this conversation where we deep dive into the real practical value of the Blockchain and some of the business use cases where it shines.

Jim is Wikibon’s Lead Analyst for Data Science, Deep Learning, and Application Development. Previously, Jim was IBM’s data science evangelist. He managed IBM’s thought leadership, social and influencer marketing programs targeted at developers of big data analytics, machine learning, and cognitive computing applications. Prior to his 5-year stint at IBM, Jim was an analyst at Forrester Research, Current Analysis, and the Burton Group. He is also a prolific blogger, a popular speaker, and a familiar face from his many appearances as an expert on theCUBE and at industry events.

Also, make sure to download our white paper on Why Data Quality Is Essential for AI and Machine Learning Success.

Let’s block ads! (Why?)

Syncsort Blog

Read More

Expert Interview (Part 3): Dr. Sourav Dey on Data Quality and Entity Resolution

August 3, 2018   Big Data

Paige Roberts

August 2, 2018

At the recent DataWorks Summit in San Jose, Paige Roberts, Senior Product Marketing Manager at Syncsort, had a moment to speak with Dr. Sourav Dey, Managing Director at Manifold.

In the first part of our three-part interview, Roberts spoke with Dey about his presentation, which focused on applying machine learning and data science to real-world problems. Dey gave two examples of matching business needs to what the available data could predict.

In part two, Dey discussed augmented intelligence, the power of machine learning and human experts working together to outperform either one alone.

In this final installment Roberts and Dey speak about the importance of data quality and entity resolution in machine learning applications.

Roberts: In your talk, you gave an example where you tried two different machine learning algorithms on a data set and didn't get good results either time. Rather than trying yet another, more complicated algorithm, you concluded that the data wasn't of good enough quality to make that prediction. What quality aspects of the data affect your ability to use it for what you're trying to accomplish?

Dey: That’s a deep question. There are a lot of things.

Let’s dive deeper then.

So, at the highest level, there’s the quantity of data. You can’t do very good machine learning with only a handful of examples. Ideally you need thousands of examples. Machine learning is not magic. It’s about finding patterns in historical data. The more data, the more patterns it can find.

People are sometimes disappointed by the fact that if they're looking for something rare, they may not have very many examples of it. In those situations, machine learning often doesn't work as well as desired. This is often the case when trying to predict failures. If you have good, dependable equipment, failures are often very rare, occurring in only a small fraction of the examples.

There are techniques, like sample rebalancing, that can address certain issues with rare events, but fundamentally, more examples will lead to better performance of the ML algorithm.
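
As a rough, hedged illustration of that rebalancing idea (not Manifold's actual approach), scikit-learn can upweight a rare failure class instead of resampling by hand; the synthetic data below stands in for real equipment data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for equipment data: roughly 2% of examples are failures.
X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0)

# class_weight="balanced" upweights the rare failure class in the loss,
# one common alternative to oversampling or undersampling the training set.
model = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                               random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

Even so, as Dey notes, no amount of reweighting substitutes for simply having more real failure examples.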

What are other issues to be aware of?

Another aspect, of course, is whether the data is labeled well. Tendu talked about this, too, in her talk on anti-money laundering. Lineage issues are a problem. Things like: oh, actually, the product was changed here, but I never noted it. That means that all of these features have changed. This comes up a lot, particularly with web and mobile-based products where the product is constantly changing. Often such changes mean that a model can't be trained on data from before the change, because that data is no longer a good proxy for the future. Labeling is one of the biggest issues. I gave you the example from oil and gas where they thought they had good labeling, but they didn't.

How about missing data?

Missing data is surprisingly not that big of an issue. In the oil and gas sensor data, readings could drop off for a while because of poor internet connectivity. For small dropouts, we could fill the gaps using simple interpolation techniques. For larger dropouts, we would just throw out the data. That's much easier to deal with than labeling issues.
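
A minimal sketch of that kind of gap handling, assuming pandas time-series data (the readings, frequency, and two-point threshold are all made up for the example): short dropouts get interpolated, and whatever is still missing afterward is discarded.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings at one-minute intervals, with dropouts as NaN.
idx = pd.date_range("2018-06-01", periods=10, freq="min")
readings = pd.Series(
    [1.0, 1.1, np.nan, 1.3, np.nan, np.nan, np.nan, np.nan, 2.0, 2.1],
    index=idx)

# Fill at most two consecutive missing points by time-based interpolation;
# longer gaps stay (mostly) NaN. A stricter pipeline would drop an entire gap
# once it exceeds the threshold rather than partially filling its start.
filled = readings.interpolate(method="time", limit=2)
cleaned = filled.dropna()
print(cleaned)
```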

Can you talk a bit about entity resolution and joining data sources?

Yes, this is another problem we often face. The issue is joining data sources, particularly with bigger clients. They'll have three silos, seven silos, ten silos; really big companies sometimes even have 50 or 100 silos of data that have never been joined, but they cover the same user base.


The data are all about the same people.

Right, and even within a single data source, it needs to be de-duplicated. It's the same records. I'll give a concrete example. We worked with a company that is an expert search firm. Their business is to help companies find specific people with certain skills, e.g., a semiconductor expert who understands 10 nanometer technology. Given a request, they want to find a relevant expert as fast as possible.

Clean, thick data drives business value for them by giving their search a large surface area to hit against. They can then service more requests, faster.  Their problem was that they had several different data silos and they never joined them. They only searched against one. They knew that they were missing out on a lot of potential matches and leaving money on the table. They hired Manifold to help them solve this problem.

How do we join these seven silos, and then figure out if the seven different versions of this person are actually the same person? Or two different people, or five different people.

This problem is called entity resolution. What's interesting is that you can use machine learning to do entity resolution. We've done it a couple of times now. There are some pretty interesting natural language processing techniques you can use, but all of them require a human in the loop to bootstrap the system. The human labels pairs, e.g., these records are the same, these records are not the same. These labels are fed back to the algorithm, and then it generates more examples. This general process is called active learning. It keeps feeding back the ones it's not sure about to get labeled. With a few thousand labeled examples, it can start doing pretty well for both the de-duplication and the joining.
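
A hedged sketch of that human-in-the-loop matching, with everything (field names, the similarity function, the tiny record set) invented for illustration rather than taken from Manifold's pipeline: a classifier scores candidate pairs on fuzzy-similarity features, and the pair it is least sure about is routed to a human for labeling.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def similarity(a, b):
    """Crude shared-token Jaccard similarity, standing in for real fuzzy
    matching such as Jaro-Winkler or TF-IDF cosine."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def pair_features(rec_a, rec_b):
    # One similarity feature per field; real systems use many more.
    return [similarity(rec_a["name"], rec_b["name"]),
            similarity(rec_a["employer"], rec_b["employer"])]

# Candidate pairs drawn from two hypothetical silos.
pairs = [
    ({"name": "Jane Q. Doe", "employer": "Acme Semiconductors"},
     {"name": "Jane Doe",    "employer": "Acme Semiconductors Inc"}),
    ({"name": "Jane Q. Doe", "employer": "Acme Semiconductors"},
     {"name": "John Smith",  "employer": "Globex"}),
    ({"name": "J. Smith",    "employer": "Globex Corp"},
     {"name": "John Smith",  "employer": "Globex"}),
]
X = np.array([pair_features(a, b) for a, b in pairs])

# Bootstrap: a human labels a couple of pairs (1 = same entity, 0 = different).
labeled_idx, labels = [0, 1], [1, 0]
model = LogisticRegression().fit(X[labeled_idx], labels)

# Active learning step: ask the human about the pair the model is least sure
# of, add that label, retrain, and repeat.
proba = model.predict_proba(X)[:, 1]
print("Next pair to label:", int(np.argmin(np.abs(proba - 0.5))))
```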

The compute becomes pretty challenging when you have large data sets. Tendu mentioned it in her talk on anti-money laundering: you have to compare everything to everything, and do it with these fuzzy matching algorithms. That's a challenge.

That’s a challenge, yeah. One of the tricks is to use a blocking algorithm which is crude classifier. Then, after the blocking, you have a much smaller set to do the machine learning base comparison on. That being said, even the blocking has to be run on N times M records where N and M are millions of records.

Where if you have seven silos and there’s a million records each and a hundred attributes per record, it’s a million times a million seven times …

It’s blows up quickly. That’s where you have to be smart about parallelizing and I think that’s where the Syncsort type of solution can be really powerful. It is an embarrassingly parallel problem. You just have to write the software appropriately so that can be done well.

Yeah, our Trillium data quality software is really good at parallel entity resolution at scale.

I like to work on clean data, and you guys are good at getting the data to the right state. That’s a very natural fit.

It is! You need clean data to work, and we make data clean. Well, thank you for the interview; this has been fun!

Thank you!

Check out our white paper on Why Data Quality Is Essential for AI and Machine Learning Success.

Let’s block ads! (Why?)

Syncsort Blog

Read More

Expert Interview (Part 2): Dr. Sourav Dey on Augmented Intelligence and Model Explainability

July 31, 2018   Big Data

Paige Roberts

July 31, 2018

At the recent DataWorks Summit in San Jose, Paige Roberts, Senior Product Marketing Manager at Syncsort, had a moment to speak with Dr. Sourav Dey, Managing Director at Manifold. In the first part of our three-part interview Roberts spoke to Dr. Dey about his presentation which focused on applying machine learning to real world requirements. Dr. Dey gave two examples of matching business needs to what the available data could predict.

Here in part two, Dr. Dey discusses augmented intelligence, the power of machine learning and human experts working together to outperform either one alone. In particular, AI as triage is a powerful application of this principle, and model explainability is the key to making it more useful.

Roberts: One of the big themes I’m seeing here, what the keynote talked about this morning, is that the best chess machine can beat the best human chess player, but both can be beaten by a mediocre chess player with a really good chess program working together. One of the things you talked about was that kind of cooperation between people and machines speeding up triage, and how that works.

Dey: Yeah, so this is what many people call augmented intelligence. I would say almost 50% or more of the projects that we do at Manifold fall into the business pattern that I call "AI as triage." The predictions that the AI makes help to triage a lot of information that a single human can't process. Then, the AI presents it in a way that a human can make a decision on. That's a theme that I've seen over and over again. Both of the examples I gave before fit that, for instance.

In the baby registry example, our client was collecting all of these signals that no single human can understand, all the web clicks, mobile clicks, marketing data, etc. The AI is triaging that and distilling it down so that a marketing person or the product person can make decisions on it.

In the oil and gas company example, it's the same. The machines are generating fine-tick data from 54 sensors at thousands of locations across the country; no person (or even a team of people) can look at all of that all the time.

Nobody can make sense of that.

Yeah, but the AI can crush it down, and present it to humans in an actionable way. That can really speed up that triage process. So that’s the goal there.

I was impressed by one example you mentioned. You have these decision trees making a decision that something would fail, and that was kind of useful. But the person still had to figure out from scratch why it would fail, and how to repair it. Whereas, if the AI explained … how was that done?

The TreeSHAP algorithm, yeah. It explains how a decision tree came to a particular decision. It’s relatively recent that people are doing some good research into this. Essentially, there is the model that’s making the prediction. Then, you can make another model of that model that explains the original model. It tells you why it made that prediction.

That WHY can be key.

There have been a few competing techniques out there. All of them had some issues, but this group at the University of Washington, inspired by game theory from economics, came up with a consistent explanation based on the Shapley value. What's nice is that they developed a fast version of it that can be used with tree-based models, called TreeSHAP. It's fantastic. We use it all the time now for explanations of why the model is making a particular individual prediction. For instance: today, you predicted a .91 probability of failure. Why? You could also use it at the aggregate level, for something like: on the whole, across thousands of machines over five years, what was the importance of this feature in making the prediction?
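
TreeSHAP is available in the open source shap package; a minimal sketch, assuming shap and scikit-learn are installed and using synthetic data in place of the real failure dataset, looks roughly like this:

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the failure-prediction data.
X, y = make_classification(n_samples=2_000, n_features=10, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# TreeSHAP: fast Shapley-value explanations specialized for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Individual explanation: which features pushed this one prediction up or down?
print(shap_values[0])

# Aggregate view: mean |SHAP value| per feature approximates global importance.
print(np.abs(shap_values).mean(axis=0))
```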


And then the person going to repair that equipment knows WHY it was predicted to fail, and therefore has a pretty good idea of what they have to fix.

Well, at least they have a much better idea. The maintenance engineers have a web app they can then use to dig deeper into the historical time series. In addition, they can VPN into the physical machine. All in all, the explainable model allows them to do triage much faster and, in turn, do the repair more quickly.

Model explainability is incredibly useful for a lot of things. I know Syncsort has been doing a lot of work around GDPR, and I talked to a data scientist in Germany, Katharine Jarmul, about this. For example, if a person wants a loan, and you've got a machine learning model that says no, you can't have that loan, you have to be able to explain why.

Totally, yeah. There are laws about that for important civil rights reasons.

For what you’re doing, the reasons are less legal and more practical. If I’m going to use this prediction in order to take an action, such as a repair, it helps a lot if I know how the prediction was reached.

I can give another example of that. We did work for a digital therapeutics company. They make an app, along with wearables, that helps people get their diabetes under control. We were making predictions of whether or not the patient's blood sugar was going to go below a certain level in 24 weeks. There's a human in the loop, a human coach that you get as part of this program. They didn't know what to do with the raw prediction probability. When we put in an explainable algorithm that let them know why that number was high or low, they could have much better phone calls with the patients.

Because they knew WHY the blood sugar was likely to dip.

They could say things like, hey, I see that you’re not doing this food planning very much, or you haven’t logged into the app in a while. You used to log in seven times a week. What’s going on? They have the knowledge ready to have a high bandwidth interaction with the patient.

So, I think there’s a lot there.

The more I learn about model explainability, the more I see where it’s hugely useful.

There are a lot of folks doing cool things with deep learning. It’s far harder to explain, but there’s work being done on that. Hopefully, in the next few years, there will be better techniques to explain those more complex models as well.

Tune in for the final part of this interview where Roberts and Dey speak about the effect of data quality as well as Entity Resolution in conjunction with machine learning.

Check out our white paper on Why Data Quality Is Essential for AI and Machine Learning Success.

Let’s block ads! (Why?)

Syncsort Blog

Read More