Tag Archives: Expert
Expert Interview (Part 1): Kenny Scott on the Challenges of Data Management

At the Collibra Data Citizens event in May of this year, Paige Roberts had a chance to speak with Kenny Scott, a data management consultant. In part one of this two-part series, Roberts and Scott speak about some of the challenges that come along with being a data management consultant.
Roberts: Let’s start off with an introduction of yourself for our readers.
Scott: My name is Kenny Scott. I started in banking 29 years ago, in the Trustee department providing a regulatory service to large Investment companies. After a few years I moved to London, returning to Edinburgh three years later to take up a role in a newly formed Business Systems Team, working with business objects, database creation, and forming the bridge between the business and Technology teams. There was a considerable amount of diverse data, from shareholder registers to complex market derivatives data at a time when Data Governance was not as prolific a subject as it is now.
Looking back with the experience I have now, there were a few practices that would definitely not be allowed in today’s data environment.
Roberts: Wild, wild west.
Scott: I spent some time in Luxembourg, bringing data into a complex monitoring tool. I was working with a lot of good people, but it was quite a maverick environment. Around this time I started working on data quality. An opportunity came up for a business intelligence role. I didn’t get it, as another candidate had better experience, but they liked my approach and attitude and said, “How would you like to be a Metadata Manager? No one knows metadata, but you can find out what it is and deliver it.”
You can figure it out. [Laughter]
Within six months, I had a handle on Metadata Management and started to deploy it across the organization. We implemented hotkey functionality, which brought up the definitions and linked them to business processes. When we started, there were 25,000 business terms. That was too many, so we removed them and went back to basics. By the time I left, the organization had 600 approved business terms, as opposed to 25,000 terms of little value.
After a year, I was given a data quality team to look after, because the data quality manager left. That’s when I started using Syncsort’s Trillium Data Quality software.
When was this?
Four and a half, five years ago. Ever since then I’ve been ensconced in Trillium software. The metadata and Trillium software were working together. They’re very complementary, as I was talking about in the presentation.
After a reorganization I found myself looking for new challenges and took the opportunity to go the contracting route for a few years.
What problems did you have when you arrived? What was the driver for them to hire you?
There was a Data Foundation program in place. The company was finding it difficult to attract anyone with Data Quality experience, especially with exposure to Trillium software. They saw my CV and thought, “Well, somebody knows Trillium,” because that’s the tool they had bought.
They were using Trillium Data Quality software, and they got training so they could produce some business insights, but they were giving figures to senior business managers: 80% and less, 60% and less, 40% and less. The business doesn’t care about figures; they want to know the problem. They want it written back to them in English.
One of the challenges was that previous Quality Assessments could not be replicated due to a lack of documentation and process. They had done a huge customer analysis that gave some great insight and narrative, but nobody could tell us how they had cut the data, how they had sourced it, how it had been sliced, or what rules they had used to get to that focus.
The key is that everything we do is documented with a standard operating process. Those are used every time through the process. If people find a better way of doing something, we tweak it and make it better, but we’ve written them in such a way that anyone can come in off the street and, as soon as they have access to the data, run the processes.
What has really been your major stumbling block in implementation?
Access to data. Getting the data out of systems when nobody wants to show anything, because it could pull down the system or hamper performance, even in non-production environments. That’s changing now because they see the value of what we’re doing and that there’s no tax on the server.
You had to explain to them that Trillium software would not mess with their performance before they would let you pull the data?
Yeah. Those were the problems on the data governance side, which wasn’t my task. What we were doing with data quality started to give validation to what we were doing in data governance.
When you say what you’re doing with the quality, you’re talking about the discovery aspect of it? You’re discovering the problems, and you could say, “Look, here they are.”
Yes, absolutely. What we do is focus on the customer data funnel, where we get people from in the business. For example, it was suggested that address line two must be kept blank. It’s actually in the manual to leave that blank. Who leaves a critical field like address line two blank?
Why do you have address line two if you’re never going to put data in it?
Exactly. [Laughter] Because you’ve got an inconsistency in your data mastering platform, you’ve got to make your algorithms more complex to harvest the data. The market sector we were working in was Agriculture and farms in the UK. Normally you’ve got a house number, a street, and a street type. Here you’ve effectively got a house name, a farm name, an area name, and a town. There are no numbers in those four strings, and when you’re using name and address data for projects, you look for numbers in the patterns.
And if you don’t have those patterns, you can’t find them.
They’re also doing things like putting name information, Mister and Missus, in the address fields. “Mr. Scott” could be the first line of the address. You’ve got names and addresses all over the place. That’s why we need consistency.
That’s a challenge.
That is the biggest challenge. Actually, with all of the data, I reckon we could get a 96-97% exact match to a postal address if we just structured it correctly, which is a pretty good place to be.
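To make the pattern problem concrete, here is a minimal Python sketch – purely illustrative, and not part of any Trillium workflow – that flags the issues Scott describes: address fields with no digits to match on, a blank address line two, and name titles leaking into address lines. The sample records and field names are made up.

```python
import re

# Hypothetical sample records: UK agricultural addresses with no house numbers,
# and name fragments (e.g. "Mr Scott") leaking into address lines.
records = [
    {"addr1": "Hillside Farm", "addr2": "", "addr3": "Glen Road End", "town": "Peebles"},
    {"addr1": "Mr Scott", "addr2": "The Old Dairy", "addr3": "", "town": "Kelso"},
    {"addr1": "14 Station Road", "addr2": "", "addr3": "", "town": "Melrose"},
]

TITLE_PATTERN = re.compile(r"\b(Mr|Mrs|Ms|Miss|Dr)\b\.?", re.IGNORECASE)

for rec in records:
    fields = [rec["addr1"], rec["addr2"], rec["addr3"], rec["town"]]
    has_number = any(re.search(r"\d", f) for f in fields)        # number-based match patterns need a digit
    blank_addr2 = rec["addr2"].strip() == ""                      # the "leave address line two blank" habit
    name_in_addr = any(TITLE_PATTERN.search(f) for f in fields)   # titles leaking into address lines
    print(rec["addr1"], "| no digits:", not has_number,
          "| addr2 blank:", blank_addr2, "| name in address:", name_in_addr)
```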
Make sure to check part two where Roberts and Scott speak about the final stages of a data consultant project and where Scott’s next move will take him.
Check out our eBook on 4 ways to measure data quality.
Expert Interview (Part 2): Kenny Scott on the Final Stages of a Data Management Project

At the Collibra Data Citizens event in May of this year, Paige Roberts had a chance to speak with Kenny Scott, a data management consultant. In part one of this two-part series, Roberts and Scott spoke about some of the challenges that come along with being a data management consultant. Part two focuses on the final stages of a data consultant project and what’s next for Kenny Scott.
Roberts: Going back to where you are now, what stage are you at in the implementation now? Where along the journey are you?
Scott: We started off as a group on the 1st of January this year, after about 18 months as a program. We’ve engaged several parts of the business. We’re recruiting those good people to come in. The next stage is to get them in, get them trained, and do the handover.
Roberts: So you’ve actually pretty much got them going. It’s just a matter of knowledge transfer and making sure that they’ve got people in-house who can continue.
Scott: What I’d like to do is transition into more of a data governance role, help with ideas, and then contribute to the data strategy. IT has the data architecture space and they’ve got access to Collibra. They do the data models and the governance part of it that we don’t need to do. They handle the data quality and the data governance for our data pool, as opposed to data governance as a whole.
Right. Collibra is focused on that whole top level, and Syncsort’s Trillium software is very focused on the quality aspect.
All right. Well, you’re talking about winding this down and you’re doing the hand-off. What do you think you’re going to do next?
At the moment I’d like to get back into a bigger organization. I’d like to get into a car manufacturer, or the utility industry, or aviation, or even maybe the medical field. I want to start looking at different datasets. Somewhere that’s big enough to be invested in the tools, somewhere with innovation, you know?
Yeah.
I’m all for business, and I’m looking for the business value, the business process, the business drivers, and a strong business unit.
The whole purpose of the technology is to solve a business problem. If you don’t focus on the business problem first, you’re kind of missing the point.
I’ve seen this happening for years. You want the technology to work for you. The business tells us that we’ve got to use these tools. What they’re saying is, “This is our business requirement. This is the tool we’ve identified. We want you to bring it in and implement it so we can build the processes around it.”
That’s good. You really have to talk to each other. It’s too easy for the technology half and the business half to get separated until they’re not even communicating.
It was interesting to hear some of the talks out there today, and even some of the questions that came up, about technology driving governance. I don’t see that. I see technology having a process requirement to make things efficient by knowing where the catalog is or where the assets are. That’s the driver of governance: putting something in that is going to help them deliver the toolsets.
They’re the ones that hurt when it’s not done right. I heard somebody talk about a data quality campaign that had been done at their business, and it saved something like 600,000 pounds. They were wasting all of that money on marketing campaigns that were never going anywhere.
And that’s fair. That’s where I wanted to get to. At the moment, we’ve started to put a monetary value on all of that returned mail. I get figures back for what returned mail costs and what the email campaigns cost, so we can see what we don’t get back from this.
So you can take that information up to the CEO, and say, “This is why it matters.”
And we also have a permanent head of data coming in the next month, so we can make the transfer and keep the show on the road.
That transfer of knowledge, it’s like there’s a certain degree of knowing what they’re supposed to be doing, and how it’s supposed to work, and you need to move that.
Exactly.
Well, thank you for taking the time to speak with me and I really enjoyed your presentation today.
Thank you very much for your support.
Check out our eBook on 4 ways to measure data quality.
2018 Best of Expert Interviews – Experts in the World of Data

We finish out the Best of 2018 series with a round-up of our best Expert Interviews from this year. Take a look at the top 10 posts!
Splunk’s Chief Technology Advocate, Andi Mann and Syncsort’s Chief Product Officer, David Hodgson discuss the digital transformation taking place in IT and how machine learning and AI are helping IT leaders create a more business-centric view of their world.
Gregory Piatetsky-Shapiro of KDnuggets (@KDnuggets) discusses how today’s advances in deep learning are cause for both excitement and concern, and notes some concerns about artificial intelligence as it continues to advance.
Nicola Askham, The Data Governance Coach, speaks about the advantage of data governance as an overall strategy, rather than tactically addressing each new regulatory requirement, as well as what goes into planning and starting a data governance initiative.
Dr. Sourav Dey, Managing Director at Manifold, speaks to Paige Roberts on applying machine learning and data science to real world problems. They also discussed augmented intelligence, the power of machine learning and human experts working together, and the importance of data quality and entity resolution in machine learning applications.
In this four-part expert interview, Katharine Jarmul, founder of KJamistan data science consultancy, and author of Data Wrangling with Python from O’Reilly, speaks about the importance of being able to explain your machine learning models. She also discussed the challenges of creating an inclusive company culture and how bias doesn’t only exist in machine learning data sets.
Wikibon’s Lead Analyst, Jim Kobielus, sat down for a talk on Blockchain in this three-part interview. He discussed the basics of what the Blockchain is, the real value of the technology and some of the practical use cases that are its sweet spots, and the future of Blockchain.
Tony Baer, Principal Analyst at Ovum, recently spoke to us about trends in Big Data, the future of Hadoop, and GDPR. He goes into trends in Hadoop and Cloud data as well as what’s next for Hadoop and cloud and whether we’re prepared for GDPR.
At the Strata Data Conference in New York City, Paige Roberts had a chance to sit down with Tobi Bosede, a Sr Machine Learning Engineer. In this three-part post they spoke about what goes into being a Machine Learning Engineer, predicting trade volumes and the correlation between volume and volatility, and Bosede’s perspective of being a “double minority” in the tech world.
At the Cloudera Sessions event in Munich, Germany, we had a chance to sit down with Mike Olson, Chief Strategy Officer of Cloudera. Olson dove into what’s new at Cloudera, how machine learning is evolving, and the adoption of the Cloud in organizations. Also discussed was Gartner’s latest hype cycle and Olson’s views on women in tech and the difference between Cloudera Altus and Director.
To finish the list off we have Joey Echeverria, an architect at Splunk, and author of the O’Reilly book, Hadoop Security. In this three-part series Echeverria discussed some common Hadoop security methods, different methods of fine-grained security when dealing with Hadoop, the latest developments with Splunk, and the differences between Apache Spark and Flink.
Make sure to download our eBook, “The New Rules for Your Data Landscape,” and take a look at the rules that are transforming the relationship between business and IT.
Expert Interview (Part 3): Jeff Bean and Apache Flink’s Take on Streaming and Batch Processing

At the recent Strata Data conference in NYC, Paige Roberts of Syncsort sat down for a conversation with Jeff Bean, the Technical Evangelist at data Artisans.
In the first part of this three-part blog, Roberts and Bean discussed data Artisans, Apache Flink, and how Flink handles state in stream processing. Part two focuses on the adoption of Flink, why people tend to choose Flink, and available training and learning resources.
In the final installment, Roberts and Bean speak about Flink’s unique take on streaming and batch processing, and how Flink compares to other stream processing frameworks.
Roberts: So, aside from the ability to describe state, what else about Flink makes it especially cool?
Bean: It’s the real-time stream processing. A lot of vendors will say that they offer real time stream processing. When you look at what they actually offer, it’s some derivative side project, maybe a set of libraries that does stream processing, or a couple of extra functions that you can call. Flink is designed from the ground up for stream processing, and it treats batch processing as a special case rather than the other way around. I think that is more interesting for applications that want to handle both analytic data, and real-time data with streams. You don’t need to have two different sets of applications for that since Flink treats them both as the same. It sees batch as a special case of streaming rather than the other way around.
I can picture how if you have a batch process that you sort of chop it into tinier and tinier batches until you’re down to one event, and now you have streaming, but if I’m starting from a streaming point of view, how do I get to batch? What kind of a special case is batch?
Batch is basically streaming with bounds. You point your processor at a fixed data set, and it will process it one record at a time as if it were a stream. Off it goes and it’s done. Flink is really designed for streams as input. You can point it at a file or a table and say, “That’s a stream.” I found that when you consider batch processing as a special case of streaming, rather than the other way around, it all comes together more easily.
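To make Bean’s point concrete, here is a small, framework-free Python sketch of the idea – deliberately not Flink’s API – showing that the same record-at-a-time processing logic runs over an unbounded source or over a file, where the file is just a stream that happens to end.

```python
from typing import Iterable, Iterator

def file_source(path: str) -> Iterator[str]:
    """A bounded source: a file read one record at a time, like a stream that ends."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

def socket_source(conn) -> Iterator[str]:
    """An unbounded source: records keep arriving until the connection closes."""
    while True:
        line = conn.readline()
        if not line:
            break
        yield line.rstrip("\n")

def process(records: Iterable[str]) -> Iterator[str]:
    """The same record-at-a-time logic works on either source."""
    for record in records:
        yield record.upper()

# Batch is just the bounded case: point the processor at a fixed data set.
# for result in process(file_source("transactions.csv")):
#     print(result)
```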
And if you’ve pointed it at a table, after it pulls all the data off, then new transactions coming in actually become a stream?
Yep, exactly. We’re trained to think about data as if it were static, fixed objects like tables and files, but in fact all data is generated as a stream.
You didn’t get a million records all at once. You got them one at a time.
So, I remember one time asking one of the Spark experts about true streaming handling in Spark Streaming, and they said, “Well, yeah, it does true streaming.” And I said “I thought it did microbatch.” They said, “Well, everybody does microbatch. It’s just that our microbatches have gone down to one message at a time.” What’s your opinion on that?
When you’re working with Spark Streaming, in order to get optimal performance, you have to tune your microbatch size, or your microbatch interval. In Flink you choose the time characteristic instead, and in the event time characteristic, events are microbatched until the watermark advances. It’s a similar issue but it’s closer to the business problem. There is microbatching, but it happens at the framework level, and the OS level, at the level of the network buffer. Which is where it should be, really.
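As a rough illustration of the event-time characteristic Bean mentions, here is a plain-Python sketch – again, not Flink’s API – that buffers events into event-time windows and emits a window only once the watermark has passed its end. The window size and lateness values are arbitrary.

```python
from collections import defaultdict

WINDOW_SIZE = 60   # one-minute event-time windows (seconds)
MAX_LATENESS = 10  # watermark lags the highest event time seen by 10 seconds

def window_start(ts: int) -> int:
    return ts - (ts % WINDOW_SIZE)

events = [  # (event_time_seconds, value) -- may arrive out of order
    (5, "a"), (42, "b"), (61, "c"), (58, "d"), (130, "e"),
]

buffers = defaultdict(list)
watermark = float("-inf")

for event_time, value in events:
    buffers[window_start(event_time)].append(value)
    watermark = max(watermark, event_time - MAX_LATENESS)
    # Emit every window whose end the watermark has already passed.
    for start in sorted(list(buffers)):
        if start + WINDOW_SIZE <= watermark:
            print(f"window [{start}, {start + WINDOW_SIZE}): {buffers.pop(start)}")

# Flush any windows still buffered when the (bounded) input ends.
for start in sorted(buffers):
    print(f"window [{start}, {start + WINDOW_SIZE}): {buffers[start]}")
```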
Okay. That makes sense.
It’s more intuitive, it’s more expressive. As a developer, I find it much easier to learn.
Are there any non-programming interfaces? I know there’s a lot of ways you can write Spark jobs that have nothing to do with Java or Scala or Python. You can build a KNIME workflow and execute in Spark. Syncsort DMX-h can build a data integration job and execute it on Spark. There are notebooks and such. Is there anything like that for Flink?
Not so much, at least on the commercial side. Zeppelin supports Flink, though. I would love to see more of that, and I kind of see that as part of my charter to help build.
So, before we wrap up, is there anything you’d like to let the readers know about before we end?
I mentioned it a little earlier but make sure to check out training.data-artisans.com for some great courses.
Alright. Thanks for taking the time do this. Good talking to you. Good luck with the new job!
Thank you.
Expert Interview (Part 2): Paco Nathan on the Current State of Agile and Deep Learning

At the recent Strata Data conference in NYC, Paige Roberts of Syncsort had a moment to sit and speak with Paco Nathan of Derwen, Inc. In part one of the interview, Roberts and Nathan discuss the origins, current state, and future trends of artificial intelligence and neural networks.
In the second part, Roberts and Nathan go into the current state of Agile and deep learning.
Roberts: Changing the subject a little, one of the other things you talked about, which kind of struck me pretty strongly, is that basically the father of Agile says, don’t do Agile anymore. [Laughter]
Nathan: [Laughter] Right!
Roberts: Can you talk about that a little bit?
Nathan: Yeah, I was referencing a recent paper this year, actually just a few months ago, by Ron Jeffries, one of the creators of Extreme Programming. Pair Programming came out of that. Scrum came out of that. A lot of the things we recognize as Agile came from that. He was one of the signatories of the Agile Manifesto 20 years ago. Recently he came out saying that the definitions of Agile he’s seen floating around in industry don’t have anything to do with the intention they were trying to strike at. He wrote down, “20 years later, here’s my advice for what you really need to do with your team. Let’s get away from the names, and let’s just really focus on how to make teams better.”
Roberts: Wow. Okay. What’s the paper that he did?
Nathan: It’s called “Developers Should Abandon Agile.”
That’s pretty interesting. I think there are tons of software companies right now that for them, that’s the Bible. You have to do Agile to survive.
If you saw the talk by David Talby that was a really good one too. It was called, “Ways That Your Machine Learning Model Can Crash and What You Can Do About It.” He’s done a lot of work, especially in healthcare, with machine learning and he just had case study after case study of what goes wrong. The point there was, the real work is not developing the machine learning model. The real work is once you put it into production, what you have to do to make sure that it’s right, and that’s ongoing.
Yeah. That’s always true.
I heard David’s talk in London five minutes before my talk, and I made a slide to represent some of the things he talked about because it fit in with what I was saying. I showed it and then there were arguments out in the hallway afterwards, because the Agile people were like, “How dare you say that!” It’s really salient because if I’m developing a mobile web app, and I have a team that I’m engineering director of, I’m going to bring in my best architects and team leads early in the process. They’re going to go define the high level definitions and define the interfaces. As the project progresses more into fleshing out different parts of the API and getting into more maintenance mode, I don’t have to have my more senior people involved.
Right.
With machine learning, it is the exact opposite. If I’ve got a dataset, and I want to train a model, that’s a homework exercise for somebody who’s just beginning in data science. I can do that off the shelf. But once you get deployed and start seeing edge cases and the issues that have to do with ethics and security, that’s not a homework exercise. Unless you’re in context, and actually running in production, you’re not going to know in advance what those issues are.
Yeah, but a lot of the conversation now is about the fact that most of your datasets are in some way biased, and there’s a lot of ethics involved in launching a machine learning model. I just saw an article online where they’re making ethics in machine learning a first-year course for people they’re training for ML and AI (Carnegie Mellon, University of Edinburgh, Stanford). I guess it actually speaks a little bit to what you said about putting your experts at the end, during production. To a certain extent, it seems to me like you also want to have the experts at the beginning, looking at the data before it even starts the process.
Definitely. Deloitte, McKinsey, Accenture, all of them, when we do executive briefings, they all want it set at the beginning. Before we even talk about introducing machine learning into your company, you need to get your ducks in a row as far as breaking down the data silos, and getting your workflow for cleaning your data in place, and a culture that’s based around using data engineering and data science appropriately. You need to do all of those things before you can even start on machine learning. There’s a lot of foundation that needs to be done correctly.
I said something about the high percentage of machine learning projects that never make it into production on Twitter, and got a response from John Warlander, a Data Engineer at Blocket in Sweden. He said, “I sometimes wonder how many of those ‘not in production’ big data projects happen in companies that don’t even have their ‘small data’ in order. That’s often where most of the low-hanging fruit is.” I’ll put that in my blog post about the Strata event themes and industry trends. We’re talking about a lot of those important themes, so I’ll probably put a lot of quotes from you in it.
David Talby had a great quote, “Really, if you want to talk about AI in a product, what you’re talking about is what you’re going to do once you’re deployed and the products being used by customers. How do you keep improving, because if you’re not doing that you’re not doing AI.”
Well, if you’re not doing that, you’re certainly not getting that feedback loop. You’ve lost it. When you look at the improvement in accuracy over random chance for any model, there’s always that curve that gets more and more accurate, and then becomes less and less accurate over time if you don’t constantly retrain your models. One of the themes for Syncsort, as a data engineering kind of company, is making sure that the data you’re feeding in there is itself constantly refreshed and improved. You said something in your talk that stuck with me: the value in ML and AI right now isn’t as much in iterating through models, or getting the best model, it’s in feeding your models the best datasets.
I mean, if you want a good data point on that, a lot of these companies, even ones who are leaders in AI, will share their code with you. They’re not going to share their data. That was kind of the punchline of the situation with CrowdFlower, or Figure Eight. Google bought into self-driving cars, and they realized they could replace a lot of one-off machine learning processes with deep learning, but to do that, they needed really good labelled datasets. Other manufacturers saw their success and wanted to do self-driving cars, too. They hired the talent, and the first thing they found out is that if they want to do deep learning, they don’t have enough data, or enough good, labelled data. So, they go to Figure Eight and ask, “Hey, can you label our datasets?”
Lukas Biewald, the founder of Figure Eight, was talking in San Francisco a couple of years ago, saying, “Yeah, for about $2-3 million per sensor, we’d be happy to work with you on that.” And he had customers lined up, GM and all the others, because …
Because it’s worth it.
Yeah and if they don’t have it, they’re out of the self-driving car business. It may be a high price but it will likely include years of data.
People focus so much on the models. I have to have the most sophisticated algorithm, …
No. That’s not it.
The only reason that AI didn’t take off back in the ’80s or ’90s, when you and I were first studying it, was because we didn’t have enough data. We couldn’t crunch it. We couldn’t ingest that amount of data and do anything with it, affordably.
There needed to be millions of cat pictures on the internet before we could really do deep learning.
Before we could create something that could identify a cat picture. That’s just the nature of the game.
That was the paper that launched it all. And then the open source work for using GPUs to accelerate it.
That’s really taking off more now in spaces other than video games. Walking the Strata floor, there are a lot more vendors out there taking advantage of GPUs.
There’s nothing really sacred about the architecture of a GPU with respect to machine learning. It just happens to be faster than a general-purpose CPU at doing linear algebra. But now we’re seeing more ASICs that can do more advanced linear algebra, at enough scale that you don’t have to go across the network. That’s the game. We’ll probably see a lot more custom hardware. Basically we’re in this weird sort of temporal chaos regime where hardware is moving faster than software and software is moving faster than process.
Hardware ALWAYS moves faster than software. Most software is just now finally, in the last few years, catching up to things like using vectors to take advantage of regular CPU chip cache.
And now we’re putting TensorFlow computations on GPUs.
Exactly. And we’re creating compute hardware that’s specific to task. Software always lags behind the hardware and then business processes have to develop after that.
Yeah, you have to log some time doing the job before you can really figure out the process. I think your company is in a really good space right now. You’ve gotta get the data right. And it’s not just a one-off. You’ve got to keep getting the data right across your company. Now, and forevermore.
Yeah, tracking and reproducing data changes in production is a big challenge for our customers. If you made 25 changes to the data to make it useful for model training, you then have to make those exact same 25 changes in production so that the model sees data in the format it’s expecting. I’m doing a series of short webinars on tackling the challenges of engineering production machine learning data pipelines, including one on tracking data lineage and reproducing data changes in production environments. So is there anything else going on at the moment that you’d like to let us know about?
I have a little company called Derwen.ai. If you check there, we’ve got a lot of articles. It’s my consulting firm and we do a lot of work with the conferences. We get to see a real bird’s eye view, and we hear from all kinds of people. We’re like Switzerland. We get to hear what a lot of people are working on, even if they’re not ready to go public with it. I hear the pain points people are dealing with, and help out the start-ups. It’s kind of like a distributed product management role.
Cool. All right, well, thanks for talking to me. I really enjoyed your presentation.
Thank you very kindly, so good to see you.
Check out our eBook on the Rise of Artificial Intelligence for IT Operations.
A Day in the Life of a Dynamics 365 Data Expert

At PowerObjects, we have found that one of the most essential roles on any Microsoft Dynamics 365 team is devoted specifically to the data needs of the project – the data experts. Any enterprise is only as good as the data that it has available to support itself.
Therefore, critical to the success of any D365 project are each of the following:
- Diligent stewardship (definition, migration, and integration) of the existing legacy data.
- Design and implementation of efficient and effective collection.
- Integration with existing information.
- Representation of all new data “gathered” as part of the day-to-day operations of an organization.
We believe that the people who excel in the role of “data expert” share these qualities:
- Solid organization and communication skills.
- A decided thoroughness of their work.
- Adaptability to new data and new data sources.
- Adaptability to seamlessly move between related toolsets, technologies, and technological environments or platforms.
- Ability to comprehend the varied functional needs and methods of organizations and the people who make up those organizations.
Of course, these people also demonstrate PowerObjects’ Core Values every day.
The work performed by a data expert varies from day to day, depending on the specific phase of the project.
The data expert enters a normal solution build during PowerObjects’ planning phase, beginning with the efforts to transition from the Sales to the Delivery team – designed to explain and refine the tasks to be performed and to introduce, define the roles of, and empower the members of the combined client team.
Clients are encouraged to identify key indicators and functionality during the planning phase. These indicators will quickly identify the entities, fields, and relationships important to the customer and define the solution developed. Specific subject matter experts (SMEs) on the client side will quickly be identified or make themselves known during early discussions.
This phase focuses on setting the expectations of the client; translating the functional requirements of the client into the technical requirements and design of the solution; and identifying the new/modified/retained business processes involved. Together, this planning will define and refine the scope, cost, and timeline of the D365 project at hand. The planning phase aids the client team in understanding their own requirements, as the structure of D365 encourages process improvement. However, D365 should not be considered or presented as “the process.”
Each data migration/integration effort requires great care, thoroughness, and detailed planning. The work of the data expert normally begins in earnest during this phase – with the identification of unique identifiers of all source data, identification and definition of all source data elements to be migrated, and the mapping of the source data to be migrated and/or integrated into specific entities and fields within the D365 solution.
Simply put, mapping the data is determining the destination of the data currently stored in the current legacy system(s) into the corresponding D365 data structures, while confirming the format of the source and destination data. Most of the time, we use a series of spreadsheets to build out and refine this mapping work. The more specific these documents, the easier it will be to execute the migration/integration work.
Note that not all legacy data will be migrated. Historic data stored in a customer’s legacy system is often found to contain duplicate, inconsistent, incomplete, or outdated information. The data expert may identify risks associated with certain “dirty” data while working with a client’s data set. They will offer client team members assistance in data quality and data cleansing methods and best practices; but data cleansing work is normally the domain of the client. History has proven that it is better to resolve data quality/cleanliness issues PRIOR to migrating data – ensuring only the most useful and cleanest data will be moved into the D365 instance.
Furthermore, data in the source systems may not be aligned exactly with the destination D365 entity receiving the data. This requires that the entire “column” of data be manipulated to match what D365 is expecting to receive. These individual steps are sometimes described as “transformation formulas.” For example, the existing name field may need to be cut up or “parsed” into the D365 firstname and lastname fields, or all telephone numbers must be presented in a certain format (e.g., “###.###.####”).
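As a hedged illustration of the kind of transformation formulas described above, here is a short Python sketch covering the two examples given: splitting a single legacy name field into firstname and lastname, and normalizing phone numbers to the ###.###.#### format. Real migrations use tools such as SSIS or KingswaySoft for this work; the helper functions here are hypothetical.

```python
import re

def split_name(full_name: str) -> tuple[str, str]:
    """Parse a single legacy name field into D365-style firstname/lastname.
    Handles 'Last, First' and 'First [Middle] Last'; real mappings need more rules."""
    full_name = full_name.strip()
    if "," in full_name:
        last, first = [part.strip() for part in full_name.split(",", 1)]
        return first, last
    parts = full_name.split()
    return " ".join(parts[:-1]), parts[-1]

def format_phone(raw: str) -> str:
    """Normalize a 10-digit phone number to ###.###.####; pass through anything else."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:
        return f"{digits[0:3]}.{digits[3:6]}.{digits[6:10]}"
    return raw

print(split_name("Scott, Kenny"))      # ('Kenny', 'Scott')
print(split_name("Paige Roberts"))     # ('Paige', 'Roberts')
print(format_phone("(612) 555-0147"))  # 612.555.0147
```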
Powerful data integration tools, such as SQL Server Integration Services (SSIS), KingswaySoft, Scribe Insight, and the Microsoft Azure products and tools, are available to the data expert to do the migration/integration work. While data migration and data integration are two different processes, the methods and skill sets used are very similar.
Data Migration is the process of moving data from one system to another. Considerations include:
- Format of the data transferred.
- Planned, one-time (or infrequent) transfer of data.
Data Integration is the process of building and maintaining the synchronization (transfer) of data between systems. Considerations include:
- Format of the data transferred.
- Recurring transfer of data.
- Frequency of the transfer.
- What will trigger the transfer.
Additional factors that will affect the level of effort and duration of a data migration/integration task are:
- Number and size of the source data legacy system(s).
- Volume of data (the number of tables, rows, columns to be processed).
- Types of data.
- Use of option sets or “picklists.”
- Number and complexity of the transformational formulas required to put the data into a format acceptable to the destination entity field.
Specific care must be paid to the steps required to match, dedupe, and integrate distinct source data available, as well as the coordinated migration of data into the developmental (or Sandbox) instance.
The integration mapping will often follow that of the migration mapping, which can be used as the starting point of integration mapping efforts. The mapped-to destinations will often coincide, but the source of data and the processes to get to that destination will vary. Therefore, integration and migration processes should be conceived and developed as separate functionality.
The most satisfying part of the data expert’s work is seeing the populated data entities joined with the work of the application developers – representing the visualization of the client’s expectations on screen. Once data is merged with the forms, reports, and dashboards of the application, and then viewed by the customer through the eyes of the D365 toolset, the connection between the functional and the technical requirements can be seen. This is also the opportunity to fine-tune the components of the solution. Fine-tuning allows our team to deliver the solution originally envisioned, prepare the solution and the client for user acceptance testing, and ultimately deploy the data migration and integration components as integral parts of the overall D365 solution to the client’s production environment.
We hope this gives you a view into what being a data expert is like! We’re always looking for great talent at PowerObjects – check out open roles and apply on our website here.
Happy Dynamics 365’ing!
Expert Interview (Part 3): James Kobielus on the Future of Blockchain, AI, Machine Learning, and GDPR

Since Syncsort recently joined the Hyperledger community, we have a clear interest in raising awareness of the Blockchain technology. There’s a lot of hype out there, but not a lot of clear, understandable facts about this revolutionary data management technology. Toward that end, Syncsort’s Integrate Product Marketing Manager, Paige Roberts, had a long conversation with Wikibon Lead Analyst Jim Kobielus.
In the first part of the conversation, we discussed the basic definition of what the Blockchain is, and cut through some of the hype surrounding it. In the second part, we dove into the real value of the technology and some of the practical use cases that are its sweet spots. In this final part, we’ll talk about the future of Blockchain, how it intersects with artificial intelligence and machine learning, how it deals with privacy restrictions from regulations like GDPR, and how to get data back out once you’ve put it in.
Roberts: Where does Blockchain go from here? What do you see as the future of Blockchain?
Kobielus: It will continue to mature. In terms of startups, they’ll come and go, and they’ll start to differentiate. Some will survive to be acquired by the big guys, who will continue to evolve their own portfolios, while integrating those into a wide range of vertical and horizontal applications.
Nobody’s going to make any money off of Blockchain itself. It’s open source. The money will be made off of cloud services, especially cloud services that incorporate Blockchain as one of the core data platforms.
Believe it or not, you can do GDPR on Blockchain, but here’s the thing: the GDPR community is still working out exactly how you can delete data records consistently on the Blockchain. Essentially, you can encrypt the data and then delete the key.
Right. If you can’t decrypt it, you can’t ever read it.
Yeah. Inaccessible forevermore, in theory. That’s one possibility for harmonizing Blockchain architecture with GDPR and other mandates that require the right to be forgotten. The regulators also have to figure out what is kosher there. I think there will be some reconciliation needed between the techies pushing Blockchain and the regulators trying to enforce the various privacy mandates.
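As a rough sketch of the encrypt-then-delete-the-key idea (sometimes called crypto-shredding), here is a minimal Python example using the cryptography library. It illustrates the principle only; it is not a statement of how any particular Blockchain platform implements GDPR erasure.

```python
from cryptography.fernet import Fernet, InvalidToken

# One key per data subject; the ciphertext is what would live on the immutable ledger.
key = Fernet.generate_key()
ciphertext = Fernet(key).encrypt(b'{"name": "Jane Doe", "email": "jane@example.com"}')

# Normal operation: anyone holding the key can still read the on-chain record.
print(Fernet(key).decrypt(ciphertext))

# "Right to be forgotten": destroy the key. The ciphertext stays on the chain,
# but without the key it is unreadable from then on.
key = None
try:
    Fernet(Fernet.generate_key()).decrypt(ciphertext)  # a different key cannot decrypt it
except InvalidToken:
    print("record is effectively erased: key destroyed, ciphertext unreadable")
```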
Just as important in terms of where it’s going, Blockchain platforms as a service (PaaS) will become ever more important components of data providers’ overall solutions. Year by year, you’ll see the Microsofts, IBMs, and Oracles of the world evolve Blockchain-based Cloud services into fairly formidable environments.
There are performance issues with Blockchain now, in terms of speed of updates, but I also know that there is widespread R&D to overcome them. VMWare just announced they’re working on a faster consensus protocol, so that different nodes on the Blockchain can come to consensus rapidly, allowing more rapid updates to the chain. Lots of parties are looking for better ways to do that. So it might become more usable for transactional applications in the future.
Blockchain deployment templates are going to become the way most enterprise customers adopt this technology. AWS and Microsoft already offer these templates for rapid creation and deployment of a Blockchain for financial or supply chain applications or whatever. We’re going to see more of those templates as the core way in which people buy, in a very business-friendly abstraction. There will be a lot of Blockchain-based applications for specific needs. We’ll see a lot of innovation in terms of how to present this technology and how to deliver it so that you don’t have to understand what a consensus protocol is or really give a crap about what’s going on in the Blockchain itself. It should be abstracted from the average customer.
More in terms of going forward, you’ll see what I call “Blockchain domain accelerators.” There are Blockchain consultants everywhere now. There are national Blockchain startup accelerators. There are industry-specific Blockchain startup accelerators. There are Blockchain accelerators in terms of innovation of cryptocurrency and Internet of Things. We’re going to see more of these domain accelerator industry initiatives come to fruition using Blockchain as their foundation. They’ll analyze and make standards of how to deploy, secure and manage this technology specific to industry and use case requirements. That definitely is the future.
As I mentioned before, it will become a bigger piece of the AI future, because of Blockchain-based distributed marketplaces for training data. Training data for building and verifying machine learning models for things like sentiment analysis has real value. There aren’t many startups in the world that already have massive training datasets. To build the best AI, you’ll need to go find the best training datasets for what you’re working on.
I talked about that a little with Paco Nathan at Strata, how labelled, valid, useful training datasets were incredibly valuable now, and AI companies recognize that. They will share their code with you, but not their data, not for free.
I really think you’ll see a lot more AI training dataset marketplaces with Blockchain as the backing technology. It’s going to become a big piece of the AI picture.
Blockchain security is another big thing going forward. The weak link in Blockchain is protecting your private keys, which provide you with secure access to your cryptocurrencies running on the chain. What we’re going to see is more emphasis on security capabilities that are edge-to-edge, securing Blockchains from the weakest link, which is the end user managing their keys. I think you’ll start to see a lot of Blockchain security vendors that help you manage your private keys, and also smart contracts. Smart contracts on the Blockchain have some security vulnerabilities in their own right. We’ll see a lot of new approaches to making these tamper-proof. There are already a lot of problems with fraud.
I think I’ve covered most of the big things I see coming. That is the really major stuff.
One more thing I’m curious about, since Blockchain is still fairly new to me. There’s a lot of conversation about how you store data on the Blockchain, and a lot of research into things like securing it and improving update speed, but storing data is only half the story in data management. Once you’ve put all this data in, you have to get it back out. If I’ve got a Blockchain with all this information I need, how do I find and retrieve information from it? Do I use SQL?
There’s a query language in the core Blockchain code base.
So, it has its own specific query language, and people will have to learn a whole other way to retrieve data?
Basically, the core of Hyperledger has a query language built in. It’s called Hyperledger Explorer. Hyperledger, in itself, is an ecosystem of projects, just like Hadoop is and was, that will evolve. It’ll be adopted at various rates; some projects will be adopted widely, and some very little, in production Blockchain deployments.
There are some parallels with early Hadoop. Among the early things Hadoop had under its broad scope was an initial query language that didn’t take off; they updated and improved it with HiveQL. Same thing with Spark. It started out with a query language, Shark, and switched to another one, Spark SQL.
We have to look at the entire ecosystem. Over time, some pieces may be replaced by proprietary vendor offerings, or different open source code that does these things better. It’s part of the maturation process. Five years from now, I’d like to see what the core Blockchain Hyperledger stack is. It may be significantly different. It may change as stuff gets proved out in practice.
Yeah, Hadoop changed a lot over the last decade.
Hadoop has become itself just part of a larger stack with things like Tensorflow, R, Kafka for streaming. Innovation continues to deepen the stack. The NoSQL movement, graph databases, the whole data management menagerie continues to grow. We’ll see how the core protocol of Blockchain evolves too. It’s a work in progress, like everything else.
I’ve written a bunch of articles on this. It’s changing all the time.
I’ll be sure to include some links in the blog post, so folks can learn more. I really thank you for taking the time to speak with me. It was really informative.
No problem. I enjoyed it.
Jim is Wikibon’s Lead Analyst for Data Science, Deep Learning, and Application Development. Previously, Jim was IBM’s data science evangelist. He managed IBM’s thought leadership, social and influencer marketing programs targeted at developers of big data analytics, machine learning, and cognitive computing applications. Prior to his 5-year stint at IBM, Jim was an analyst at Forrester Research, Current Analysis, and the Burton Group. He is also a prolific blogger, a popular speaker, and a familiar face from his many appearances as an expert on theCUBE and at industry events.
Also, make sure to download our white paper on Why Data Quality Is Essential for AI and Machine Learning Success.
Expert Interview (Part 2): James Kobielus on Blockchain’s Sweet Spot in Practical Business Use Cases

Since Syncsort recently joined the Hyperledger community, we have a clear interest in raising awareness of the Blockchain technology. There’s a lot of hype out there, but not a lot of clear, understandable facts about this revolutionary data management technology. Toward that end, Syncsort’s Integrate Product Marketing Manager, Paige Roberts, had a long conversation with Wikibon Lead Analyst Jim Kobielus.
In the first part of the conversation, we discussed the basic definition of what the Blockchain is, and cut through some of the hype surrounding it. In this second part, we dove into the real value of Blockchain technology and some of the practical use cases that are its real sweet spots.
Roberts: The hype cycle tends to make all kinds of wild claims. It will do everything but wash your socks. Which claims for Blockchain do you feel have some validity?
Kobielus: First of all, since it doesn’t support CRUD, it’s not made for general-purpose database transactions. It’s made for highly specialized environments where you need a persistent, immutable record, like logging: logging of security-related events, logging system events for later analysis and correlation, etc. Or where you have an immutable record of assets, video, music, and so forth, in a marketplace where these are intellectual property that needs protection against tampering. If you have a tamper-proof distributed record, which is what Blockchain is, it’s perfect for maintaining vast repositories of intellectual property for downstream monetization. Or for tracking supply chains.
A distributed transaction record that can’t be repudiated, that can’t be tampered with, that stands up in legal situations is absolutely valuable. So, Blockchain makes a lot of sense in those kinds of applications. In addition to lacking the ability to delete and edit the data, Blockchain is slow. It’s not an online transactional database. Updates to the chain can take minutes or hours depending on how the chain is set up, and how extensive the changes are, so you can’t have a high concurrency of transactions. It’s just not set up for fast query performance. It’s very slow.
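To illustrate the tamper-evidence property Kobielus describes – without implying anything about Hyperledger’s actual implementation or consensus – here is a minimal Python sketch of a hash-chained, append-only log, where changing any earlier record breaks verification of the chain.

```python
import hashlib
import json

def block_hash(record: dict, prev_hash: str) -> str:
    payload = json.dumps(record, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode()).hexdigest()

# Append-only log: each entry commits to the hash of the previous one.
chain = []
prev = "0" * 64
for record in [{"event": "login", "user": "alice"},
               {"event": "transfer", "amount": 100},
               {"event": "logout", "user": "alice"}]:
    prev = block_hash(record, prev)
    chain.append({"record": record, "hash": prev})

def verify(chain) -> bool:
    prev = "0" * 64
    for entry in chain:
        if block_hash(entry["record"], prev) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True

print(verify(chain))              # True
chain[1]["record"]["amount"] = 1  # tamper with an earlier entry...
print(verify(chain))              # False: the chain no longer verifies
```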
Also, the world is moving towards harmonization around privacy protection, consistent with what the European Union has done with the General Data Protection Regulation (GDPR) and the recent California privacy regulation that is similar to GDPR. GDPR requires that any personally identifiable information (PII) must be capable of being forgotten, meaning people have the right to request deletion of their personal data, or to edit it if it’s wrong. In Blockchain, you can’t delete, and you can’t edit a record that’s been written. There’s a vast range of enterprise applications that have personally identifiable information. The bulk of your business – sales, marketing, customer service, HR, etc. – has tons of PII data.
So, Blockchain is not suitable for those core transaction processing applications. Any application that demands high performance queries will not be on the Blockchain. It’s not suitable for highly scalable real-time transactions of any sort, whether or not they involve PII data.
The way I see it, Paige, is there’s a range of fit for purpose data platforms in the data management space. There’s relational databases, all the NoSQL databases, there’s HDFS, there’s graph databases, key-value stores, real-time in-memory databases, and so on. Each of those is suited to particular architectures and use cases, but not to others. Blockchain is fundamentally a database, and it’s got its uses. It’s not going to dominate all data computing like a monoculture, no matter what John McAfee says. That’s not going to happen. It’s already limited technologically, and with regulatory limitations. It’s a niche data platform that’s finding its sweet spot in various places.
You mentioned a couple of good use cases like supply chain management. I’ve heard of uses like tracking diamonds from the mine to the jewelry store to be certain of their origins, that they’re not blood diamonds. All of the examples I had heard of in the past were based on the concept of Blockchain as a transactional ledger or even a sensor log. For example, you keep sensors on your food from the farm to the market to make sure that it never went above a certain temperature for a certain amount of time, that sort of thing. One of the use cases you mentioned was actually news to me, that you could store other sorts of data like application code, so you could do code change management with it. What other use cases do you see coming?
Actually, there are a few pieces that I published recently on vertical, application-focused uses like supply chain management. Blockchain startups are trying to grab a piece of the video streaming market. Essentially these services, a lot of which are still in alpha or beta pre-release phase, use Blockchain in several capacities. One, for distributed video storage. Two, for distributed video distribution through a peer-to-peer protocol.
Distributed video monetization using a Blockchain-based cryptocurrency that’s specific to each environment to help the video publishers monetize their offering. Blockchain for distributed video transactions, and for contracts. Blockchain for distributed video governance.
So are you talking about having something like Netflix bucks?
More and more Blockchain applications aren’t one hundred percent on the Blockchain. They handle things like PII off the chain, for instance, and put that in a relational database. Most architecture is using fit-for-purpose data platforms for specific functions in a broader application. That is really where Blockchain is coming into its own.
Another specialized Blockchain use case is artificial intelligence, one of my core areas. I’ve been reading for a while now about the AI community experimenting with using Blockchain as an AI compute brokering backbone; there’s a company called Cortex. You can read my article on that. They use Blockchain as a decentralized AI training data exchange. They have data that has the core ground truths a lot of AI applications need to be trained on.
So you’re saying they basically create really solid, excellent training datasets, doing all the data engineering to make sure these are good training datasets for AI ground truths, and then use Blockchain to exchange them to other AI developers?
It’s a Blockchain for people who built and sourced their training data to store it in a ledger so that others can tap into that data from an authoritative repository.
Right. Okay. That makes sense. Seems like a valuable commodity to the AI community.
Several small companies are doing this. They’re converging training data into an exchange or marketplace for downstream distribution to data scientists, or whoever will pay for the training data. Blockchain is used as an AI middleware bus, an AI audit log, an AI data lake.
What I’m getting at, Paige, is that there are lots of industry-specific implementations of Blockchain. Industries everywhere are using this, some in production, but many of them are still piloting and experimenting with Blockchain in a variety of contexts including e-commerce, AI, video distribution, in ways that are really fascinating.
These are the same kinds of dynamics that we saw in the early days of Hadoop and NoSQL and other technologies. Each technology market grows by vendors finding a sweet spot, an application that their approach is best suited to.
We see a lot of hybrid data management approaches in companies that use two or more strategies in a common architecture.
One thing that’s missing from all that stuff is real-time streaming, continuous computing applications. Blockchain is very much static data, it’s almost the epitome of static data. You won’t see too many real-time applications for Blockchain alone, but that’s okay. Blockchain is good for the things that it’s good for.
Blockchain will find its niche given time?
Yes.
Be sure not to miss Part 3 where we’ll talk about the future of Blockchain, how it intersects with artificial intelligence and machine learning, how Blockchain deals with privacy restrictions from regulations like GDPR, and how to get data back out of the Blockchain once you’ve put it in.
Jim is Wikibon’s Lead Analyst for Data Science, Deep Learning, and Application Development. Previously, Jim was IBM’s data science evangelist. He managed IBM’s thought leadership, social and influencer marketing programs targeted at developers of big data analytics, machine learning, and cognitive computing applications. Prior to his 5-year stint at IBM, Jim was an analyst at Forrester Research, Current Analysis, and the Burton Group. He is also a prolific blogger, a popular speaker, and a familiar face from his many appearances as an expert on theCUBE and at industry events.
Make sure to download our white paper on Why Data Quality Is Essential for AI and Machine Learning Success.
Expert Interview (Part 3): Dr. Sourav Dey on Data Quality and Entity Resolution

At the recent DataWorks Summit in San Jose, Paige Roberts, Senior Product Marketing Manager at Syncsort, had a moment to speak with Dr. Sourav Dey, Managing Director at Manifold.
In the first part of our three-part interview, Roberts spoke to Dey about his presentation, which focused on applying machine learning and data science to real-world problems. Dey gave two examples of matching business needs to what the available data could predict.
In part two, Dey discussed augmented intelligence, the power of machine learning and human experts working together to outperform either one alone.
In this final installment Roberts and Dey speak about the importance of data quality and entity resolution in machine learning applications.
Roberts: In your talk, you gave an example where you tried two different machine learning algorithms on a data set and didn’t get good results either time. Rather than trying yet another, more complicated algorithm, you concluded that the data wasn’t of good enough quality to make that prediction. What quality aspects of the data affect your ability to use it for what you’re trying to accomplish?
Dey: That’s a deep question. There are a lot of things.
Let’s dive deeper then.
So, at the highest level, there’s the quantity of data. You can’t do very good machine learning with only a handful of examples. Ideally you need thousands of examples. Machine learning is not magic. It’s about finding patterns in historical data. The more data, the more patterns it can find.
People are sometimes disappointed by the fact that if we’re looking for something rare, they may not have very many examples of it. In those situations, machine learning often doesn’t work as well as desired. This is often the case when trying to predict failures. If you have good dependable equipment, failures are often very rare – occurring only in a small fraction of the examples.
There are techniques, like sample rebalancing that can address certain issues with rare events, but fundamentally more examples will lead to better performance of the ML algorithm.
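As one hedged example of the rebalancing techniques Dey mentions, here is a small scikit-learn sketch using class weighting on a synthetic, heavily imbalanced dataset. Which technique is appropriate depends on the problem; this only shows the general idea.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic "failure prediction" data: the positive class is only ~2% of examples.
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.98, 0.02],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

# Recall on the rare class is usually what rebalancing buys you.
print("recall, no rebalancing:", recall_score(y_test, plain.predict(X_test)))
print("recall, class_weight:  ", recall_score(y_test, weighted.predict(X_test)))
```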
What are other issues to be aware of?
Another aspect, of course, is whether the data is labeled well. Tendu talked about this, too, in her talk on anti-money laundering. Lineage issues are a problem. Things like, oh, actually, the product was changed here, but I never noted it. That means that all of these features have changed. This comes up a lot, particularly with web and mobile-based products where the product is constantly changing. Often such changes mean that a model can’t be trained on data from before the change because it is no longer a good proxy for the future. Labeling is one of the biggest issues. I gave you the example from oil and gas where they thought they had good labeling, but they didn’t.
How about missing data?
Missing data is surprisingly not that big of an issue. In the oil and gas sensor data, it could drop off for a while because of poor internet connectivity. For small dropouts, we could interpolate using simple interpolation techniques. For larger dropouts we would just throw out the data. That’s much easier to deal with than labelling issues.
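Here is a rough pandas sketch of the approach Dey describes: interpolate short sensor dropouts and discard readings inside longer gaps. The gap threshold and the sample data are made up for illustration; they are not Manifold’s actual rules.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings at one-minute intervals, with a short dropout
# (2 points) and a long dropout (6 points).
idx = pd.date_range("2018-06-01 00:00", periods=15, freq="min")
values = [1.0, 1.1, np.nan, np.nan, 1.4, 1.5, 1.6,
          np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 2.3, 2.4]
series = pd.Series(values, index=idx)

MAX_GAP = 3  # interpolate dropouts of up to 3 points; discard longer ones

# Label each consecutive run of missing values and measure its length.
missing = series.isna()
run_id = (missing != missing.shift()).cumsum()
gap_len = missing.groupby(run_id).transform("sum")
long_gaps = missing & (gap_len > MAX_GAP)

filled = series.interpolate(method="linear")  # fill everything first...
filled[long_gaps] = np.nan                    # ...then restore the long gaps and drop them
print(filled.dropna())
```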
Can you talk a bit about entity resolution and joining data sources?
Yes, this is another problem we often face. The issue is joining data sources, particularly with bigger clients. They’ll have three silos, seven silos, ten silos – really big companies sometimes even have 50 or 100 silos of data – that have never been joined, but they’re all of the same user base.
The data are all about the same people.
Right, and even within a single data source, it needs to be de-duplicated. There are duplicates of the same records. I’ll give a concrete example. We worked with a company that is an expert search firm. Their business is to help companies find specific people with certain skills, e.g., a semiconductor expert who understands 10 nanometer technology. Given a request, they want to find a relevant expert as fast as possible.
Clean, thick data drives business value for them by giving their search a large surface area to hit against. They can then service more requests, faster. Their problem was that they had several different data silos and they never joined them. They only searched against one. They knew that they were missing out on a lot of potential matches and leaving money on the table. They hired Manifold to help them solve this problem.
How do we join these seven silos, and then figure out if the seven different versions of this person are actually the same person? Or two different people, or five different people.
This problem is called entity resolution. What’s interesting, is that you can use machine learning to do entity resolution. We’ve done it a couple of times now. There are some pretty interesting natural language processing techniques you can use, but all of them require a human in the loop to bootstrap the system. The human labels pairs, e.g. these records are the same, these records are not the same. These labels are fed back to the algorithm, and then it generates more examples. This general process is called active learning. It keeps feeding back the ones it’s not sure about to get labelled. With a few thousand labeled examples, it can start doing pretty well for both the de-duplication and the joining.
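As a toy illustration of the pairwise-comparison step, here is a Python sketch that scores candidate record pairs with simple string similarity from the standard library. In a real system, as Dey describes, the scoring comes from a model trained on human-labeled pairs via active learning; the records, weights, and threshold below are invented.

```python
from difflib import SequenceMatcher
from itertools import combinations

people = [
    {"id": 1, "name": "Kenneth Scott", "email": "k.scott@example.com"},
    {"id": 2, "name": "Kenny Scott",   "email": "kscott@example.com"},
    {"id": 3, "name": "Paige Roberts", "email": "proberts@example.com"},
]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_score(p, q) -> float:
    # Crude hand-weighted score; a real system learns these weights from labeled pairs.
    return 0.6 * similarity(p["name"], q["name"]) + 0.4 * similarity(p["email"], q["email"])

THRESHOLD = 0.75
for p, q in combinations(people, 2):
    score = pair_score(p, q)
    verdict = "likely same entity" if score >= THRESHOLD else "different"
    print(f"{p['id']} vs {q['id']}: {score:.2f} -> {verdict}")
```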
The compute becomes pretty challenging when you have large data sets. Tendu mentioned it in her talk on anti-money laundering: you have to compare everything to everything, and do it with these fuzzy matching algorithms. That’s a challenge.
That’s a challenge, yeah. One of the tricks is to use a blocking algorithm, which is a crude classifier. Then, after the blocking, you have a much smaller set to do the machine learning-based comparison on. That being said, even the blocking has to be run on N times M records, where N and M are millions of records.
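And here is a small sketch of the blocking trick: group records by a cheap key so that the expensive fuzzy comparisons only run within each block. The choice of key (first letter of surname plus postcode prefix) and the sample records are illustrative only.

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "surname": "Scott",   "postcode": "EH1 2AB"},
    {"id": 2, "surname": "Scot",    "postcode": "EH1 2AB"},
    {"id": 3, "surname": "Roberts", "postcode": "TX1 9ZZ"},
    {"id": 4, "surname": "Scott",   "postcode": "LU2 7XY"},
]

def blocking_key(rec) -> str:
    # Cheap, crude key: first letter of surname + outward part of the postcode.
    return rec["surname"][:1].upper() + "|" + rec["postcode"].split()[0]

blocks = defaultdict(list)
for rec in records:
    blocks[blocking_key(rec)].append(rec)

candidate_pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)  # expensive fuzzy matching only runs on these
]
total_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(candidate_pairs)} candidate pair(s) instead of {total_pairs}")
print(candidate_pairs)
```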
Where, if you have seven silos and there’s a million records each and a hundred attributes per record, it’s a million times a million, seven times over …
It blows up quickly. That’s where you have to be smart about parallelizing, and I think that’s where the Syncsort type of solution can be really powerful. It is an embarrassingly parallel problem. You just have to write the software appropriately so that it can be done well.
Yeah, our Trillium data quality software is really good at parallel entity resolution at scale.
I like to work on clean data, and you guys are good at getting the data to the right state. That’s a very natural fit.
It is! You need clean data to work, and we make data clean. Well thank you for the interview, this has been fun!
Thank you!
Check out our white paper on Why Data Quality Is Essential for AI and Machine Learning Success.