Expert Interview (Part 3): Dr. Sourav Dey on Data Quality and Entity Resolution


Paige Roberts

August 2, 2018

At the recent DataWorks Summit in San Jose, Paige Roberts, Senior Product Marketing Manager at Syncsort, had a moment to speak with Dr. Sourav Dey, Managing Director at Manifold.

In the first part of our three-part interview, Roberts spoke with Dey about his presentation, which focused on applying machine learning and data science to real-world problems. Dey gave two examples of matching business needs to what the available data could predict.

In part two, Dey discussed augmented intelligence, the power of machine learning and human experts working together to outperform either one alone.

In this final installment, Roberts and Dey discuss the importance of data quality and entity resolution in machine learning applications.

Roberts: In your talk, you gave an example where you tried two different machine learning algorithms on a data set, and didn’t get good results either time. Rather than trying yet another, more complicated algorithm, you concluded that the data wasn’t of good enough quality to make that prediction. What quality aspects of the data affect your ability to use it for what you’re trying to accomplish?

Dey: That’s a deep question. There are a lot of things.

Let’s dive deeper then.

So, at the highest level, there’s the quantity of data. You can’t do very good machine learning with only a handful of examples. Ideally you need thousands of examples. Machine learning is not magic. It’s about finding patterns in historical data. The more data, the more patterns it can find.

People are sometimes disappointed by the fact that if we’re looking for something rare, they may not have very many examples of it. In those situations, machine learning often doesn’t work as well as desired.  This is often the case when trying to predict failures.  If you have good dependable equipment, failures are often very rare – occurring only in a small fraction of the examples.

There are techniques, like sample rebalancing, that can address certain issues with rare events, but fundamentally, more examples will lead to better performance of the ML algorithm.
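To make that concrete, here is a minimal sketch of sample rebalancing for a rare-event problem, using scikit-learn. The data, feature meanings, and failure rate are all invented for illustration, not taken from the interview.

```python
# A minimal sketch of sample rebalancing for a rare-event problem.
# The data here is synthetic and purely illustrative.
import numpy as np
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))                       # e.g. sensor features
y = (X[:, 0] + 0.5 * rng.normal(size=10_000) > 2.5)    # rare "failure" label
y = y.astype(int)                                      # roughly 1% positives

# Oversample the minority class so the model sees failures far more often.
X_maj, X_min = X[y == 0], X[y == 1]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))

# class_weight="balanced" is a common alternative to explicit resampling.
clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
```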

What are other issues to be aware of?

Another aspect, of course, is whether the data is labeled well. Tendu talked about this, too, in her talk on anti-money laundering. Lineage issues are a problem. Things like: actually, the product was changed here, but no one ever noted it, so all of these features have changed. This comes up a lot, particularly with web and mobile-based products where the product is constantly changing. Often such changes mean that a model can’t be trained on data from before the change, because that data is no longer a good proxy for the future. Labeling is one of the biggest issues. I gave you the oil and gas example, where they thought they had good labeling, but they didn’t.

How about missing data?

Missing data is surprisingly not that big of an issue. In the oil and gas sensor data, readings could drop out for a while because of poor internet connectivity. For small dropouts, we could fill the gaps using simple interpolation techniques. For larger dropouts, we would just throw out the data. That’s much easier to deal with than labeling issues.
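For illustration, a minimal sketch of that approach in pandas, assuming time-indexed sensor readings; the series values and gap sizes are invented.

```python
# A minimal sketch of interpolating small sensor dropouts and
# discarding large ones. The series below is invented for illustration.
import numpy as np
import pandas as pd

readings = pd.Series(
    [1.0, 1.1, np.nan, np.nan, 1.4, 1.5] + [np.nan] * 50 + [2.0],
    index=pd.date_range("2018-01-01", periods=57, freq="min"),
)

# Small dropouts: fill by simple linear interpolation over time,
# but only across gaps of up to a few samples (limit=3 here).
filled = readings.interpolate(method="time", limit=3)

# Large dropouts: whatever is still missing just gets thrown out.
clean = filled.dropna()
```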

Can you talk a bit about entity resolution and joining data sources?

Yes, this is another problem we often face. The issue is joining data sources, particularly with bigger clients. They’ll have three silos, seven silos, ten silos; really big companies sometimes even have 50 or 100 silos of data that have never been joined, even though they cover the same user base.


The data are all about the same people.

Right, and even within a single data source, the data needs to be de-duplicated; the same records appear more than once. I’ll give a concrete example. We worked with a company that is an expert search firm. Their business is to help companies find specific people with certain skills, e.g. a semiconductor expert who understands 10 nanometer technology. Given a request, they want to find a relevant expert as fast as possible.

Clean, thick data drives business value for them by giving their search a large surface area to hit against. They can then service more requests, faster. Their problem was that they had several different data silos that they had never joined; they only searched against one. They knew they were missing out on a lot of potential matches and leaving money on the table. They hired Manifold to help them solve this problem.

How do we join these seven silos, and then figure out whether the seven different versions of a person are actually the same person? Or two different people, or five different people?

This problem is called entity resolution. What’s interesting is that you can use machine learning to do entity resolution. We’ve done it a couple of times now. There are some pretty interesting natural language processing techniques you can use, but all of them require a human in the loop to bootstrap the system. The human labels pairs, e.g. these two records are the same, these two are not. Those labels are fed back to the algorithm, which then generates more examples. This general process is called active learning: the system keeps feeding back the pairs it’s not sure about to get labeled. With a few thousand labeled examples, it can start doing pretty well at both the de-duplication and the joining.
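To make the loop concrete, here is a minimal sketch of active learning for pairwise entity resolution. The fuzzy features, field names, and simulated labels are invented stand-ins for the richer NLP techniques Dey describes.

```python
# A minimal sketch of human-in-the-loop active learning for entity
# resolution. Field names and simulated labels are illustrative only.
from difflib import SequenceMatcher
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(a, b):
    """Fuzzy similarity of a few fields forms the pair's feature vector."""
    sim = lambda x, y: SequenceMatcher(None, x, y).ratio()
    return [sim(a["name"], b["name"]), sim(a["email"], b["email"])]

def most_uncertain(clf, X_pool, k=10):
    """Pairs whose match probability is closest to 0.5 go to the human."""
    p = clf.predict_proba(X_pool)[:, 1]
    return np.argsort(np.abs(p - 0.5))[:k]

# The arrays below hold pair_features(...)-style similarity scores.
# Bootstrap: a human labels a small seed set of pairs (1 = same entity).
X_labeled = np.array([[0.95, 0.90], [0.20, 0.10], [0.90, 0.05], [0.30, 0.85]])
y_labeled = np.array([1, 0, 1, 0])
X_pool = np.random.rand(1000, 2)          # unlabeled candidate pairs

for _ in range(5):
    clf = LogisticRegression().fit(X_labeled, y_labeled)
    ask = most_uncertain(clf, X_pool)
    # In production, these uncertain pairs go to a human labeler;
    # here we fake the answers so the sketch runs end to end.
    new_y = (X_pool[ask].mean(axis=1) > 0.5).astype(int)
    X_labeled = np.vstack([X_labeled, X_pool[ask]])
    y_labeled = np.concatenate([y_labeled, new_y])
    X_pool = np.delete(X_pool, ask, axis=0)
```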

The compute becomes pretty challenging when you have large data sets. Tendu mentioned it in her talk on anti-money laundering: you have to compare everything to everything, and do it with these fuzzy matching algorithms. That’s a challenge.

That’s a challenge, yeah. One of the tricks is to use a blocking algorithm, which is a crude classifier. Then, after the blocking, you have a much smaller set to run the machine learning-based comparison on. That being said, even the blocking has to be run on N times M records, where N and M are in the millions.
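A minimal sketch of blocking, with an invented key (name prefix plus zip code) standing in for whatever crude classifier a real system would use:

```python
# A minimal sketch of blocking before pairwise matching.
# The blocking key and record fields are invented for illustration.
from collections import defaultdict
from itertools import combinations

def block_key(rec):
    # Crude rule: records can only match if they share a name prefix and zip.
    return rec["name"][:3].lower() + "|" + rec["zip"]

def candidate_pairs(records):
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    # Only pairs within the same block go on to the expensive ML
    # comparison, shrinking the N-times-M blow-up discussed above.
    for group in blocks.values():
        yield from combinations(group, 2)

records = [
    {"name": "Sourav Dey", "zip": "94103"},
    {"name": "Sourav Day", "zip": "94103"},    # likely the same person
    {"name": "Paige Roberts", "zip": "02139"},
]
print(list(candidate_pairs(records)))
```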

And if you have seven silos with a million records each and a hundred attributes per record, it’s a million times a million, seven times over …

It blows up quickly. That’s where you have to be smart about parallelizing, and I think that’s where the Syncsort type of solution can be really powerful. It is an embarrassingly parallel problem. You just have to write the software appropriately so that it can be done well.
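Since each block can be scored independently of every other block, the work parallelizes trivially. A minimal sketch with Python’s multiprocessing, where the scoring function is a toy stand-in for the real fuzzy/ML comparison:

```python
# A minimal sketch of the embarrassingly parallel structure: each
# block's candidate pairs are scored independently of all the others.
from multiprocessing import Pool

def score_block(pairs):
    """Toy stand-in for the fuzzy/ML comparison run on one block."""
    return [(a, b, float(a == b)) for a, b in pairs]

if __name__ == "__main__":
    blocks = [
        [("Sourav Dey", "Sourav Day")],
        [("Paige Roberts", "Paige Roberts")],
    ]
    with Pool() as pool:
        scored = pool.map(score_block, blocks)
    print(scored)
```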

Yeah, our Trillium data quality software is really good at parallel entity resolution at scale.

I like to work on clean data, and you guys are good at getting the data to the right state. That’s a very natural fit.

It is! You need clean data to work with, and we make data clean. Well, thank you for the interview, this has been fun!

Thank you!

Check out our white paper on Why Data Quality Is Essential for AI and Machine Learning Success.
