
Unsupervised Learning Part 2: The AML Connection

Putting innovation into production is a big theme at FICO, as we commercialize analytic breakthroughs from FICO Labs into the products our customers rely on, worldwide. Recently, this has included applying advanced unsupervised learning to money laundering, one of the many domains in which FICO technology fights financial crime.

In my first blog about unsupervised learning, I took a deep dive into this machine learning technique, which draws inferences in the absence of outcomes. In a nutshell: “Good unsupervised learning requires more care, judgement and experience than supervised, because there is no clear, mathematically representable goal for the computer to blindly optimize to without understanding the underlying domain.”

Springboarding from that blog, today’s post covers three categories of unsupervised learning that FICO has investigated, refined and put into our anti-money laundering (AML) solutions.

The State of the Art in Unsupervised Analytics

Category 1: Finding distance-based outliers relative to training points

This category of unsupervised learning quantifies “outlierness” under the principle that if a query point is close to many training points it is ordinary, but the farther it lies from them, the higher it should score (denoting greater outlierness). These are the most well-known and intuitive approaches, and what a data scientist would typically reach for if given a pop quiz on finding outliers in multi-dimensional data. Here are two classic techniques:

  • Near-neighbor statistics, for example, the average distance from the query point to its M closest neighbors, with higher values corresponding to greater outlierness. Arbitrariness in the metric is a difficult problem, and run-time scoring requires potentially expensive searches through saved data.
  • Clustering, a technique that, in training, estimates a finite number of cluster centers, thus characterizing the dataset with a smaller set of prototypical points. Scores are determined by the query point’s distance to its closest cluster center. Clustering suffers from an arbitrary free parameter that defines the number of clusters.
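As an illustrative sketch (on synthetic 2-D data, not any real transaction dataset), the two classic techniques above can be expressed in a few lines of Python; the neighbor count and the single cluster center are hypothetical choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(0, 1, size=(500, 2))   # baseline ("normal") points

def knn_outlier_score(query, train, m=10):
    """Average Euclidean distance from `query` to its m nearest training
    points. Higher score = more outlier-like. Note that the whole training
    set must be kept and searched at scoring time."""
    d = np.sort(np.linalg.norm(train - query, axis=1))
    return d[:m].mean()

def cluster_outlier_score(query, centers):
    """Distance from `query` to its closest cluster center; in practice
    `centers` would come from a clustering algorithm such as k-means
    run during training."""
    return np.linalg.norm(centers - query, axis=1).min()

centers = np.array([[0.0, 0.0]])          # stand-in for learned centers
inlier, outlier = np.zeros(2), np.array([6.0, 6.0])
assert knn_outlier_score(outlier, train) > knn_outlier_score(inlier, train)
assert cluster_outlier_score(outlier, centers) > cluster_outlier_score(inlier, centers)
```

Both scores depend entirely on the chosen distance metric, which is exactly the weakness discussed next.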

These techniques hinge on the choice of metric, and excess feature cross-correlation can be a big problem. A Mahalanobis distance may help somewhat with correlation, but performs poorly with the frequently encountered non-Gaussian distributions and categorical features. Defining a proper metric thus becomes a matter of art as much as science: cross-correlation and improper variable scaling can emphasize some outliers while leaving the method less sensitive to others.
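To see why correlation matters, here is a small sketch on synthetic, strongly correlated data (the two test points are hypothetical): the Mahalanobis distance flags a point that violates the correlation structure far more strongly than one that merely sits far along the correlation axis, a distinction plain Euclidean distance cannot make.

```python
import numpy as np

rng = np.random.default_rng(1)
# Strongly correlated 2-D data: the second feature nearly duplicates the first.
x = rng.normal(0, 1, 1000)
data = np.column_stack([x, 0.95 * x + rng.normal(0, 0.1, 1000)])

mean = data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))

def mahalanobis(q):
    """Distance rescaled by the data's covariance structure."""
    d = q - mean
    return float(np.sqrt(d @ cov_inv @ d))

# Two points at roughly the same Euclidean distance from the mean:
on_trend  = np.array([2.0, 1.9])    # lies along the correlation axis
off_trend = np.array([1.9, -2.0])   # violates the correlation
assert mahalanobis(off_trend) > mahalanobis(on_trend)
```

Even this fix inherits the Gaussian assumption, which is why it struggles with skewed distributions and categorical features.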

Category 2: Machine learning of underlying data

Unsupervised learning methods that adapt to underlying data present a more sophisticated approach with explicit machine learning. They are less commonly used and understood than the Category 1 methods, but have some support in published scholarly and scientific literature. Based on ML concepts, these methods are more adaptive to the underlying data even when they have complex distributions and correlations. A prime example is the one-class support vector machine (OCSVM), which learns a boundary enclosing the bulk of the training data and flags points outside it. The OCSVM has notable drawbacks, however:

  • An explicit dot-product/metric is necessary in the kernel function and has major effects on results
  • A significant amount of training data must be stored for scoring
  • The raw score has no quantitative interpretation and often has a very ugly empirical distribution

Furthermore, the training procedure for the OCSVM, like most support vector models, does not scale sufficiently to the sizes of data sets now commonly encountered.
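These drawbacks are easy to observe with scikit-learn’s OneClassSVM, shown here purely as a generic illustration on synthetic data (not FICO’s implementation); the kernel, gamma and nu values are arbitrary free choices, which is itself part of the point:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
train = rng.normal(0, 1, size=(300, 2))

# kernel, gamma and nu are arbitrary free parameters; results depend on them.
ocsvm = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(train)

# Scoring requires the stored support vectors -- a slice of the training data.
print(len(ocsvm.support_vectors_), "training points retained for scoring")

# Lower decision values = more outlier-like; the raw score has no
# probabilistic interpretation.
scores = ocsvm.decision_function(np.array([[0.0, 0.0], [5.0, 5.0]]))
assert scores[0] > scores[1]   # far-away point scores as more outlying
```

Note also that fitting time grows super-linearly in the number of training points, which is the scaling problem noted above.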

  • Neural-network autoencoders train multi-layer neural networks to predict their inputs, with a hidden layer “bottleneck” of smaller dimensionality than the original space. Outliers are those points that are less predictable than inliers. Autoencoders adapt to nonlinearity well, and explicitly reduce redundancy in inputs by mapping to a lower-dimensional manifold. In addition, being a neural network, training is easily scalable to large data set sizes, scoring is fast and score distributions are smooth.

    Neural network practitioners can readily use this technique to adapt to combined continuous and categorical features. There are still some arbitrary choices in the distance metric, but the arbitrary element has less direct impact than in previously described methods in which the metric intimately controls the geometry.
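A minimal autoencoder sketch in plain NumPy, using synthetic data that lies near a 1-D line in 2-D (a production model would be deeper and trained with a proper framework), shows reconstruction error acting as the outlier score:

```python
import numpy as np

rng = np.random.default_rng(3)
# Baseline data lives near a 1-D manifold embedded in 2-D.
t = rng.normal(0, 1, (400, 1))
data = np.hstack([t, 0.8 * t]) + rng.normal(0, 0.05, (400, 2))

n_in, n_hid = 2, 1                      # 1-unit bottleneck < input dimension
W1 = rng.normal(0, 0.1, (n_in, n_hid))
W2 = rng.normal(0, 0.1, (n_hid, n_in))

lr = 0.1
for _ in range(3000):                   # plain full-batch gradient descent
    h = np.tanh(data @ W1)              # encode through the bottleneck
    out = h @ W2                        # decode back to input space
    err = out - data
    gW2 = h.T @ err / len(data)         # gradient of mean squared error
    gW1 = data.T @ ((err @ W2.T) * (1 - h ** 2)) / len(data)
    W1 -= lr * gW1
    W2 -= lr * gW2

def reconstruction_error(q):
    """Outlier score: points off the learned manifold reconstruct poorly."""
    return float(np.sum((np.tanh(q @ W1) @ W2 - q) ** 2))

on_manifold  = np.array([1.0, 0.8])     # consistent with the baseline pattern
off_manifold = np.array([1.0, -0.8])    # same magnitude, wrong relationship
assert reconstruction_error(off_manifold) > reconstruction_error(on_manifold)
```

The two test points have similar norms; only the point that breaks the learned feature relationship reconstructs badly, which is the redundancy-reduction behavior described above.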

Category 3: Probabilistic and topological detection

Probabilistic and topological methods showcase FICO’s latest machine learning innovations; our data science team is unaware of any other technique that offers all of the associated advantages. FICO has developed and implemented these in our labs from scratch, mindful of our experience with the previously discussed varieties of unsupervised learning.

  • Probabilistic neural-network outlier detection obtains a true calibrated probability estimate by starting from an analytically compact probability model. Bayesian principles are used to correct and enhance this first stage with a multi-layer neural network, to best represent the observed training data. With probabilistic neural-network outlier detection, training is scalable to large data sizes; scoring is rapid and provides smooth, calibrated probability estimates; and, through the neural network, the model learns complex correlations. Furthermore, the flexibility of the starting probability model has let FICO’s data science team overcome the vexing boundaries and constraints of methods with explicit distances, such as the one-class SVM.
  • “Topological” outlier detection finds outliers that show patterns and relationships rarely or never seen in the baseline data. Some new data points are outliers simply by being exaggerations or mutations of normal data: professional athletes, for instance, are outliers in being faster and stronger than 99.9% of us. But other outliers, which we call “topological” or “structural”, are far more exotic. Compared to professional athletes, imagine an otherwise normal person who happens to have an entirely different configuration of internal organs. That individual would be a topological outlier!
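FICO’s probabilistic neural-network detector is proprietary, but the core idea of a calibrated probability score, as opposed to a raw distance, can be sketched with a much simpler stand-in (explicitly not FICO’s method): fit a plain Gaussian to baseline data and convert the squared Mahalanobis distance into a tail probability. For 2-D Gaussian data that squared distance is chi-squared with 2 degrees of freedom, whose survival function has the closed form exp(-r²/2):

```python
import math
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(0, 1, size=(1000, 2))   # hypothetical baseline behavior

mean = data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))

def outlier_probability(q):
    """P(a baseline point is at least this unusual) under a fitted Gaussian.
    Small values denote strong outliers; unlike a raw distance or OCSVM
    score, this output is a directly interpretable probability."""
    d = q - mean
    r2 = float(d @ cov_inv @ d)           # squared Mahalanobis distance
    return math.exp(-r2 / 2)              # chi-squared(2) survival function

assert outlier_probability(np.array([4.0, 4.0])) < 0.01   # clear outlier
assert outlier_probability(mean) > 0.9                    # typical point
```

A real system would replace the single Gaussian with a far more flexible model (FICO corrects it with a neural network), but the calibrated-probability output is what makes scores comparable and thresholdable across portfolios.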


Advanced Analytics for AML

FICO’s advanced analytics for anti-money laundering incorporates our most sophisticated machine learning-based outlier detection technologies. One key feature of our product is a quantification, from transaction and other data, of the degree of unusual and risky behavior that a few customers exhibit relative to the bulk of low-risk customers. Our fundamental innovations in unsupervised modeling and outlier scoring have improved the sophistication, palatability and success of our efforts to tackle one of the world’s most elusive and disturbing crime channels: money laundering and associated crimes against humanity.

Follow me on Twitter @ScottZoldi.



Expert Interview (Part 2): Cloudera’s Mike Olson on the Gartner Hype Cycle

At the Cloudera Sessions event in Munich, Germany, Paige Roberts of Syncsort sat down with Mike Olson, Chief Strategy Officer of Cloudera. In the first part of the interview, Mike Olson went into what’s new at Cloudera, how machine learning is evolving, and the adoption of the Cloud in organizations.

In this part he talks about Gartner’s latest hype cycle, and where he sees things going.

Paige:   Did you see Gartner’s latest hype cycle? They say that Hadoop will be obsolete before it reaches the plateau of productivity.

Mike:    Yes, and I’ll say that Gartner’s conclusions on Big Data just don’t match ours. We’ve got lots of serious customers doing really mission critical production workloads on our platform. I’m not sure who they’re talking to that’s leading to these conclusions. I will say that if you view the Big Data landscape as really just Hadoop, there’s all kinds of reasons to be skeptical, right?


Especially if you just look at it as MapReduce and HDFS.

That’s right, and it’s perfectly fair to say those are awful alternatives to traditional relational databases. In fact, it’s legit to say there’s going to be a place for Oracle, SAP HANA, Teradata, Microsoft SQL Server, and DB2 Parallel Edition for the long term. Scale-out platforms are never going to be good at online transaction processing.

Distributed transactions have been hard forever and nothing about Hadoop makes it easier, but using tools like Impala to do high-performance analytic queries gives companies an alternative for certain parts of their traditional relational workloads on the scale-out platform. We’re not just bullish, we’ve been quite successful in delivering those capabilities to the enterprise.

The Gartner hype cycle, if you look at the terminology, has the peak of inflated expectations, then the trough of disillusionment, and then the plateau of productivity. Maybe Gartner’s current downbeat outlook is because right now we’re in the trough, and it’s the plateau of productivity, broadly across the industry, that we have to get to. We’ve said publicly that we’ve got more than a thousand customers, more than 600 in the Global 8,000, running this platform in production for a bunch of very demanding workloads.

I have to wonder if they’re looking at the Hadoop of 10 years ago, as opposed to now. It used to be you had just MapReduce and HDFS which was really limited, but now it’s 25 different projects including Spark, and all these other capabilities, and that’s a completely different kind of distribution.

Frankly I think that if you look at Hadoop as just Hadoop, then there’s a bunch of stuff it doesn’t do. But the ecosystem has evolved way beyond that.

Yeah, it’s growing all the time. Actually, I do a “What is Hadoop” presentation, and it starts with a slide that gives the basics of Hadoop 1.x: “Here’s this cool thing, and let me explain it to you.” Then it shows the ecosystem progressing slide after slide: it grew, and it grew, and it grew and grew some more.

HDFS and MapReduce are always going to be part of our platform. They’re really important. But, you can now spin a cluster infrastructure running on Microsoft Azure using ADLS object store and Spark running on top of that and there’s no HDFS or MapReduce anywhere near that thing.

It is no longer the be-all and end-all of Hadoop.

It’s a much more expansive and capable ecosystem than it used to be.

Tune in for the final installment of this interview, where Mike Olson shares his view on women in tech and explains the difference between Cloudera Altus and Director.

Make sure to check out our latest eBook, 6 Key Questions About Getting More from Your Mainframe with Big Data Technologies, and learn what you need to ask yourself to get around the challenges, and reap the promised benefits of your next Big Data project.


Syncsort + Trillium Software Blog

Expert Interview (Part 1): Mike Olson on Cloudera, Machine Learning, and the Adoption of the Cloud

At the Cloudera Sessions event in Munich, Germany, Paige Roberts of Syncsort sat down with Mike Olson, Chief Strategy Officer of Cloudera. In this first of a three-part interview, Mike Olson goes into what’s new at Cloudera, how machine learning is evolving, and the adoption of the Cloud in organizations.

Paige:   So, first off, go ahead and introduce yourself.

Mike:    I’m Mike Olson from Cloudera. I’m a co-founder and am chief strategy officer of the company and I’m excited to speak with you, Paige.

I’m excited to speak with you too. So, what is going on at Cloudera right now that’s really exciting?

Well, I spent some time in the sessions here in Munich today talking about what we’re seeing in the adoption of machine learning and some of these advanced analytic techniques. That’s really exciting, like the use cases that are getting built, using these new analytic techniques…it’s pretty awesome. I mean in healthcare, diagnosing disease better than ever before, delivering better treatments. Even intervening in real time when patients need special care and we can detect that because they’re Internet connected, they’re wearing connected devices. So, lots of cool use cases.


That is pretty cool.

That’s driven by some investments that we’ve made over the last couple of years. So, we bought a company in San Francisco called Sense.IO. That technology and that team basically turned into the Cloudera Data Science Workbench. I was really excited by that.

I just saw a presentation earlier by somebody who said they were using it.

We think it makes developing those apps that much easier. About a month ago we concluded our acquisition of a really interesting Brooklyn-based machine learning research firm, Fast Forward Labs.

Hillary Mason! Yeah.

She’s been awesome for a long time, and now she’s running the research function at Cloudera, continuing to track the sort of cutting edge, what’s going to happen in ML (machine learning) and AI (artificial intelligence), applied to real enterprise workloads. So, we know more, we’ve got a much better informed opinion about that stuff now. I’m really excited about what that means for us.

That’s cool! So, is there anything out on the landscape that you see coming that’s got you worried, or got you excited, or got you wondering?

If I were to highlight just a couple of things, I wouldn’t say worried, but respectfully attentive: large enterprises were definitely afraid of the Cloud before.

Yes, they were.

And now they’re clearly beginning to embrace the cloud. They’re trying to decide how to integrate their business practices into these new security regimes that the Cloud provides, and I absolutely believe the Cloud is at least as secure as your own data center, but you need to be sure that you’re using it properly, right?


But the shift from a traditional on-premises IT mindset to a cloudy one is confusing and disruptive to a lot of large enterprises, and we’re spending time with our clients, helping them think about …

Getting over the hump.  

Yeah, that’s right. And I don’t know that I would say I’m worried about it. I think it’s a big opportunity. People can do stuff in the Cloud because it’s easy to spin up a bunch of infrastructure and do some work and then spin it back down and, you know you can never do that on-premises, right?


No, you can’t. You can’t commit to having a thousand-node cluster that you only need for two days. [Laughs]

No, that’s right. Who’s going to call their hardware vendor and say, I need three racks for a week, right? But you can do that on Amazon, on Azure, or on Google. So, helping them over that stumbling block is taking some time from us.

Yeah, I can understand that.

I told you all these reasons that I’m really excited about machine learning. If I were to highlight a modest concern I’ve got, it’s that ML is pretty hype-y, and maybe we’re contributing to that a little bit. We’re very bullish about it. I will say, we’ve got hundreds of customers actually doing it in production. This is real stuff. But you hear these terms like artificial intelligence and cognitive computing… and honestly, what we’re doing is training models on large amounts of historical data to recognize anomalous behavior in new data. It’s way more pragmatic and practical than words like cognitive computing make it sound. So, I worry that we’ll, as an industry, overpromise and then disappoint. These computers aren’t thinking, right?

Yeah, people are thinking SkyNet, and they’re getting Siri or Alexa. [Chuckle]

Exactly. By the way, Siri and Alexa are totally awesome in what they do. But if you really break it down, it’s speaker-independent voice recognition plus some good integrated search technology. That really isn’t The Matrix.

[Laughs] No. We’re a little ways from bots taking over the world.

Indeed, indeed.

Make sure to check out part 2 of this interview, where Mike Olson goes into the Gartner hype cycle and what’s on the horizon.

Also, learn about 5 key Big Data trends in the coming year by checking out our report, 2018 Big Data Trends: Liberate, Integrate & Trust



Insight Matter: Disrupt The Workforce With Digital Transformation (Part 1)

In the tech world in 2017, several trends emerged as signals amid the noise, signifying much larger changes to come.

As we noted in last year’s More Than Noise list, things are changing—and the changes are occurring in ways that don’t necessarily fit into the prevailing narrative.

While many of 2017’s signals have a dark tint to them, perhaps reflecting the times we live in, we have sought out some rays of light to illuminate the way forward. The following signals differ considerably, but understanding them can help guide businesses in the right direction for 2018 and beyond.


When a team of psychologists, linguists, and software engineers created Woebot, an AI chatbot that helps people learn cognitive behavioral therapy techniques for managing mental health issues like anxiety and depression, they did something unusual, at least when it comes to chatbots: they submitted it for peer review.

Stanford University researchers recruited a sample group of 70 college-age participants on social media to take part in a randomized control study of Woebot. The researchers found that their creation was useful for improving anxiety and depression symptoms. A study of the user interaction with the bot was submitted for peer review and published in the Journal of Medical Internet Research Mental Health in June 2017.

While Woebot may not revolutionize the field of psychology, it could change the way we view AI development. Well-known figures such as Elon Musk and Bill Gates have expressed concerns that artificial intelligence is essentially ungovernable. Peer review, such as with the Stanford study, is one way to approach this challenge and figure out how to properly evaluate and find a place for these software programs.

The healthcare community could be onto something. We’ve already seen instances where AI chatbots have spun out of control, such as when internet trolls trained Microsoft’s Tay to become a hate-spewing misanthrope. Bots are only as good as their design; making sure they stay on message and don’t act in unexpected ways is crucial.

This is especially true in healthcare. When chatbots are offering therapeutic services, they must be properly designed, vetted, and tested to maintain patient safety.

It may be prudent to apply the same level of caution to a business setting. By treating chatbots as if they’re akin to medicine or drugs, we have a model for thorough vetting that, while not perfect, is generally effective and time tested.

It may seem like overkill to think of chatbots that manage pizza orders or help resolve parking tickets as potential health threats. But it’s already clear that AI can have unintended side effects that could extend far beyond Tay’s loathsome behavior.

For example, in July, Facebook shut down an experiment where it challenged two AIs to negotiate with each other over a trade. When the experiment began, the two chatbots quickly went rogue, developing linguistic shortcuts to reduce negotiating time and leaving their creators unable to understand what they were saying.

The implications are chilling. Do we want AIs interacting in a secret language because designers didn’t fully understand what they were designing?

In this context, the healthcare community’s conservative approach doesn’t seem so farfetched. Woebot could ultimately become an example of the kind of oversight that’s needed for all AIs.

Meanwhile, it’s clear that chatbots have great potential in healthcare—not just for treating mental health issues but for helping patients understand symptoms, build treatment regimens, and more. They could also help unclog barriers to healthcare, which is plagued worldwide by high prices, long wait times, and other challenges. While they are not a substitute for actual humans, chatbots can be used by anyone with a computer or smartphone, 24 hours a day, seven days a week, regardless of financial status.

Finding the right governance for AI development won’t happen overnight. But peer review, extensive internal quality analysis, and other processes will go a long way to ensuring bots function as expected. Otherwise, companies and their customers could pay a big price.


Elon Musk is an expert at dominating the news cycle with his sci-fi premonitions about space travel and high-speed hyperloops. However, he captured media attention in Australia in April 2017 for something much more down to earth: how to deal with blackouts and power outages.

In 2016, a massive blackout hit the state of South Australia following a storm. Although power was restored quickly in Adelaide, the capital, people in the wide stretches of arid desert that surround it spent days waiting for the power to return. That hit South Australia’s wine and livestock industries especially hard.

South Australia’s electrical grid currently gets more than half of its energy from wind and solar, with coal and gas plants acting as backups for when the sun hides or the wind doesn’t blow, according to ABC News Australia. But this network is vulnerable to sudden loss of generation—which is exactly what happened in the storm that caused the 2016 blackout, when tornadoes ripped through some key transmission lines. Getting the system back on stable footing has been an issue ever since.

Displaying his usual talent for showmanship, Musk stepped in and promised to build the world’s largest battery to store backup energy for the network—and he pledged to complete it within 100 days of signing the contract or the battery would be free. Pen met paper with South Australia and French utility Neoen in September. As of press time in November, construction was underway.

For South Australia, the Tesla deal offers an easy and secure way to store renewable energy. Once completed, Tesla’s 129 MWh battery will be the world’s most powerful battery system, 60% more powerful than the next largest, according to Gizmodo. The battery, which is stationed at a wind farm, will cover temporary drops in wind power and kick in to help conventional gas and coal plants balance generation with demand across the network. South Australian citizens and politicians largely support the project, which Tesla claims will be able to power 30,000 homes.

Until Musk made his bold promise, batteries did not figure much in renewable energy networks, mostly because they just aren’t that good. They have limited charges, are difficult to build, and are difficult to manage. Utilities also worry about relying on the same lithium-ion battery technology as cellphone makers like Samsung, whose Galaxy Note 7 had to be recalled in 2016 after some defective batteries burst into flames, according to CNET.

However, when made right, the batteries are safe. It’s just that they’ve traditionally been too expensive for large-scale uses such as renewable power storage. But battery innovations such as Tesla’s could radically change how we power the economy. According to a study that appeared this year in Nature, the continued drop in the cost of battery storage has made renewable energy price-competitive with traditional fossil fuels.

This is a massive shift. Or, as David Roberts of news site Vox puts it, “Batteries are soon going to disrupt power markets at all scales.” Furthermore, if the cost of batteries continues to drop, supply chains could experience radical energy cost savings. This could disrupt energy utilities, manufacturing, transportation, and construction, to name just a few, and create many opportunities while changing established business models. (For more on how renewable energy will affect business, read the feature “Tick Tock” in this issue.)

Battery research and development has become big business. Thanks to electric cars and powerful smartphones, there has been incredible pressure to make more powerful batteries that last longer between charges.

The proof of this is in the R&D funding pudding. A Brookings Institution report notes that both the Chinese and U.S. governments offer generous subsidies for lithium-ion battery advancement. Automakers such as Daimler and BMW have established divisions marketing residential and commercial energy storage products. Boeing, Airbus, Rolls-Royce, and General Electric are all experimenting with various electric propulsion systems for aircraft—which means that hybrid airplanes are also a possibility.

Meanwhile, governments around the world are accelerating battery research investment by banning internal combustion vehicles. Britain, France, India, and Norway are seeking to go all electric as early as 2025 and by 2040 at the latest.

In the meantime, expect huge investment and new battery innovation from interested parties across industries that all share a stake in the outcome. This past September, for example, Volkswagen announced a €50 billion research investment in batteries to help bring 300 electric vehicle models to market by 2030.


At first, it sounds like a narrative device from a science fiction novel or a particularly bad urban legend.

Powerful cameras in several Chinese cities capture photographs of jaywalkers as they cross the street and, several minutes later, display their photograph, name, and home address on a large screen posted at the intersection. Several days later, a summons appears in the offender’s mailbox demanding payment of a fine or fulfillment of community service.

As Orwellian as it seems, this technology is very real for residents of Jinan and several other Chinese cities. According to a Xinhua interview with Li Yong of the Jinan traffic police, “Since the new technology has been adopted, the cases of jaywalking have been reduced from 200 to 20 each day at the major intersection of Jingshi and Shungeng roads.”

The sophisticated cameras and facial recognition systems already used in China—and their near–real-time public shaming—are an example of how machine learning, mobile phone surveillance, and internet activity tracking are being used to censor and control populations. Most worryingly, the prospect of real-time surveillance makes running surveillance states such as the former East Germany and current North Korea much more financially efficient.

According to a 2015 discussion paper by the Institute for the Study of Labor, a German research center, by the 1980s almost 0.5% of the East German population was directly employed by the Stasi, the country’s state security service and secret police—1 for every 166 citizens. An additional 1.1% of the population (1 for every 66 citizens) were working as unofficial informers, which represented a massive economic drain. Automated, real-time, algorithm-driven monitoring could potentially drive the cost of controlling the population down substantially in police states—and elsewhere.

We could see a radical new era of censorship that is much more manipulative than anything that has come before. Previously, dissidents were identified when investigators manually combed through photos, read writings, or listened in on phone calls. Real-time algorithmic monitoring means that acts of perceived defiance can be identified and deleted in the moment and their perpetrators marked for swift judgment before they can make an impression on others.

Businesses need to be aware of the wider trend toward real-time, automated censorship and how it might be used in both commercial and governmental settings. These tools can easily be used in countries with unstable political dynamics and could become a real concern for businesses that operate across borders. Businesses must learn to educate and protect employees when technology can censor and punish in real time.

Indeed, the technologies used for this kind of repression could be easily adapted from those that have already been developed for businesses. For instance, both Facebook and Google use near–real-time facial identification algorithms that automatically identify people in images uploaded by users—which helps the companies build out their social graphs and target users with profitable advertisements. Automated algorithms also flag Facebook posts that potentially violate the company’s terms of service.

China is already using these technologies to control its own people in ways that are largely hidden to outsiders.

According to a report by the University of Toronto’s Citizen Lab, the popular Chinese social network WeChat operates under a policy its authors call “One App, Two Systems.” Users with Chinese phone numbers are subjected to dynamic keyword censorship that changes depending on current events and whether a user is in a private chat or in a group. Depending on the political winds, users are blocked from accessing a range of websites that report critically on China through WeChat’s internal browser. Non-Chinese users, however, are not subject to any of these restrictions.

The censorship is also designed to be invisible. Messages are blocked without any user notification, and China has intermittently blocked WhatsApp and other foreign social networks. As a result, Chinese users are steered toward national social networks, which are more compliant with government pressure.

China’s policies play into a larger global trend: the nationalization of the internet. China, Russia, the European Union, and the United States have all adopted different approaches to censorship, user privacy, and surveillance. Although there are social networks such as WeChat or Russia’s VKontakte that are popular in primarily one country, nationalizing the internet challenges users of multinational services such as Facebook and YouTube. These different approaches, which impact everything from data safe harbor laws to legal consequences for posting inflammatory material, have implications for businesses working in multiple countries, as well.

For instance, Twitter is legally obligated to hide Nazi and neo-fascist imagery and some tweets in Germany and France—but not elsewhere. YouTube was officially banned in Turkey for two years because of videos a Turkish court deemed “insulting to the memory of Mustafa Kemal Atatürk,” father of modern Turkey. In Russia, Google must keep Russian users’ personal data on servers located inside Russia to comply with government policy.

While China is a pioneer in the field of instant censorship, tech companies in the United States are matching China’s progress, which could potentially have a chilling effect on democracy. In 2016, Apple applied for a patent on technology that censors audio streams in real time—automating the previously manual process of censoring curse words in streaming audio.


In March, after U.S. President Donald Trump told Fox News, “I think maybe I wouldn’t be [president] if it wasn’t for Twitter,” Twitter founder Evan “Ev” Williams did something highly unusual for the creator of a massive social network.

He apologized.

Speaking with David Streitfeld of The New York Times, Williams said, “It’s a very bad thing, Twitter’s role in that. If it’s true that he wouldn’t be president if it weren’t for Twitter, then yeah, I’m sorry.”

Entrepreneurs tend to be very proud of their innovations. Williams, however, offers a far more ambivalent response to his creation’s success. Much of the 2016 presidential election’s rancor was fueled by Twitter, and the instant gratification of Twitter attracts trolls, bullies, and bigots just as easily as it attracts politicians, celebrities, comedians, and sports fans.

Services such as Twitter, Facebook, YouTube, and Instagram are designed through a mix of look and feel, algorithmic wizardry, and psychological techniques to hang on to users for as long as possible—which helps the services sell more advertisements and make more money. Toxic political discourse and online harassment are unintended side effects of the economic-driven urge to keep users engaged no matter what.

Keeping users’ eyeballs on their screens requires endless hours of multivariate testing, user research, and algorithm refinement. For instance, Casey Newton of tech publication The Verge notes that Google Brain, Google’s AI division, plays a key part in generating YouTube’s video recommendations.

Jim McFadden, the technical lead for YouTube recommendations, explained the difference to Newton: “Before, if I watch this video from a comedian, our recommendations were pretty good at saying, here’s another one just like it. But the Google Brain model figures out other comedians who are similar but not exactly the same—even more adjacent relationships. It’s able to see patterns that are less obvious.”

A never-ending flow of content that is interesting without being repetitive is harder to resist. With users glued to online services, addiction and other behavioral problems occur to an unhealthy degree. According to a 2016 poll by nonprofit research company Common Sense Media, 50% of American teenagers believe they are addicted to their smartphones.

This pattern is extending into the workplace. Seventy-five percent of companies told research company Harris Poll in 2016 that they lose two or more hours of productivity a day because employees are distracted. The number one reason? Cellphones and texting, according to 55% of the companies surveyed. Another 41% pointed to the internet.

Tristan Harris, a former design ethicist at Google, argues that many product designers for online services try to exploit psychological vulnerabilities in a bid to keep users engaged for longer periods. Harris refers to an iPhone as “a slot machine in my pocket” and argues that user interface (UI) and user experience (UX) designers need to adopt something akin to a Hippocratic Oath to stop exploiting users’ psychological vulnerabilities.

In fact, there is an entire school of study devoted to “dark UX”: small design tweaks made to increase profits. These can be as innocuous as a “Buy Now” button in a visually pleasing color, or as controversial as Facebook’s 2012 experiment, in which the company tweaked its algorithm to show a randomly selected group of almost 700,000 users (who had not given their permission) newsfeeds skewed more positive for some users and more negative for others, to gauge the impact on their respective emotional states, according to an article in Wired.

As computers, smartphones, and televisions come ever closer to convergence, these issues matter increasingly to businesses. Some of the universal side effects of addiction are lost productivity at work and poor health. Businesses should offer training and help for employees who can’t stop checking their smartphones.

Mindfulness-centered mobile apps such as Headspace, Calm, and Forest offer one way to break the habit. Users can also choose to break internet addiction by going for a walk, turning their computers off, or using tools like StayFocusd or Freedom to block addictive websites or apps.

Most importantly, companies in the business of creating tech products need to design software and hardware that discourages addictive behavior. This means avoiding bad designs that emphasize engagement metrics over human health. A world of advertising preroll showing up on smart refrigerator touchscreens at 2 a.m. benefits no one.

According to a 2014 study in Cyberpsychology, Behavior and Social Networking, approximately 6% of the world’s population suffers from internet addiction to one degree or another. As more users in emerging economies gain access to cheap data, smartphones, and laptops, that percentage will only increase. For businesses, getting a head start on stopping internet addiction will make employees happier and more productive. D!

About the Authors

Maurizio Cattaneo is Director, Delivery Execution, Energy, and Natural Resources, at SAP.

David Delaney is Global Vice President and Chief Medical Officer, SAP Health.

Volker Hildebrand is Global Vice President for SAP Hybris solutions.

Neal Ungerleider is a Los Angeles-based technology journalist and consultant.

Read more thought provoking articles in the latest issue of the Digitalist Magazine, Executive Quarterly.



Digitalist Magazine

SQL Server Machine Learning Services – Part 5: Generating multiple plots in Python

The series so far:

  1. SQL Server Machine Learning Services – Part 1: Python Basics
  2. SQL Server Machine Learning Services – Part 2: Python Data Frames
  3. SQL Server Machine Learning Services – Part 3: Plotting Data with Python
  4. SQL Server Machine Learning Services – Part 4: Finding Data Anomalies with Python
  5. SQL Server Machine Learning Services – Part 5: Generating multiple plots in Python

SQL Server Machine Learning Services (MLS) offers a wide range of options for working with the Python language within the context of a SQL Server database. This series has covered a number of those options, with the goal of familiarizing you with some of Python’s many capabilities. One of the most important of these capabilities, along with analyzing data, is being able to visualize data. You were introduced to many of these concepts in the third article, which focused on how to generate individual charts. This article takes that discussion a step further by demonstrating how to generate multiple charts that together provide a complete picture of the underlying data.

The examples in this article are based on data from the Titanic dataset, available as a .csv file from the site https://vincentarelbundock.github.io/Rdatasets/datasets.html. There you can find an assortment of sample datasets, available in both .csv and .doc formats. This is a great resource to keep in mind as you’re trying out various MLS features, whether using the R language or Python.

The Titanic dataset includes a passenger list from the Titanic’s infamous voyage. For each passenger, the data includes the individual’s name, age, gender, service class and whether the passenger survived. The following figure shows a sample of the data.

[Figure 1]

A value of 1 in the Survived column indicates that the passenger survived the voyage. A value of 0 indicates that the passenger did not. The SexCode column is merely a binary duplicate of the Sex column. A value of 0 indicates that the passenger was male, and a value of 1 indicates that the passenger was female.

I created the examples for this article in SQL Server Management Studio (SSMS). Each example retrieves data from a locally saved file named titanic.csv, which contains the Titanic dataset. In most cases, if you’re running a Python script within the context of a SQL Server database, you’ll likely want to use data from that database, either instead of or in addition to .csv data. The principles covered in this article apply regardless of where the data originates, however, and using a .csv file helps to keep things simple while demonstrating several important concepts, including how to retrieve data from such a file.

Retrieving data from a .csv file

Python, with the help of the pandas module, makes it easy to retrieve data from a .csv file and save the data to a DataFrame object. The pandas module contains a variety of tools for creating and working with data frames, including the read_csv function, which does exactly what the name suggests: reads data from a .csv file.

The Python script in the following example uses the function first to retrieve data from the titanic.csv file and then to update the resulting dataset in preparation for grouping the data for the visualizations:

The sp_execute_external_script stored procedure, which is used to run Python and R scripts, should be well familiar to you by now. If you have any questions about how the procedure works, refer to the first article in this series. In this article, we focus exclusively on the Python script.

The first statement in the script imports the pandas module and assigns the pd alias:

You can then use the pd alias to call the read_csv function, passing in the full path to the titanic.csv file as the first parameter value:

On my system, the titanic.csv file is saved to the c:\datafiles\ folder. Be sure to update the path as necessary if you plan to try out this example or those that follow.

The second function parameter is index_col, which tells the function to use the specified column as the index for the outputted data frame. The value 0 refers to the first column, which is unnamed in the Titanic dataset. Without this parameter, the data frame’s index would be auto-generated.
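A minimal sketch of these opening steps might look like the following. It substitutes a small in-memory sample for c:\datafiles\titanic.csv so it can run anywhere; the sample rows are made up.

```python
import io
import pandas as pd

# Small stand-in for titanic.csv; with the real file you would call
# pd.read_csv('c:\\datafiles\\titanic.csv', index_col=0) instead.
csv_data = io.StringIO(
    ",Name,PClass,Age,Sex,Survived,SexCode\n"
    "1,Elisabeth Allen,1st,29.0,female,1,1\n"
    "2,Hudson Allison,1st,30.0,male,0,0\n"
    "3,Helen Allison,1st,2.0,female,0,1\n"
    "4,William Murphy,3rd,,male,0,0\n")

# index_col=0 tells read_csv to use the first (unnamed) column as the
# data frame's index rather than auto-generating one.
titanic = pd.read_csv(csv_data, index_col=0)

print(titanic.index[0])   # 1, not an auto-generated 0
```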

When the read_csv function runs, it retrieves the data from the file and saves it to a DataFrame object, which, in this case, is assigned to the titanic variable. You can then use this variable to modify the data frame or access its data. For example, the original dataset includes a number of passengers whose ages were unknown, listed as NA in the .csv file. You might choose to remove those rows from the dataset, which you can do by using the notnull function:

To use the notnull function, you merely append it to the column name. Notice, however, that you must first call the dataset and then, within brackets, reference the dataset again, along with the column name in its own set of brackets.
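As a sketch, with a small hypothetical frame standing in for the full dataset, the statement looks like this:

```python
import pandas as pd

# Hypothetical rows; the None age mirrors the NA values in the .csv file.
titanic = pd.DataFrame(
    {'Age': [29.0, None, 2.0], 'Sex': ['female', 'male', 'female']},
    index=[1, 2, 3])

# Call the dataset, then within brackets reference it again, along with
# the column name in its own set of brackets.
titanic = titanic[titanic['Age'].notnull()]

print(len(titanic))   # 2 rows remain
```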

Perhaps the bigger issue here is whether it’s even necessary to remove these rows. It turns out that over 500 rows have unknown ages, a significant portion of the dataset. (The original dataset includes over 1,300 rows.) The examples later in this article will use the groupby function to group the data and find the average ages in each group, along with the number of passengers. When aggregating grouped data, the groupby function automatically excludes null values, so in this case, the aggregated results will be the same whether or not you include the preceding statement.

I’ve included the statement here primarily to demonstrate how to remove rows that contain null values. (The subsequent examples also include this statement so they’re consistent as you build on them from one example to the next.) One advantage to removing the rows with null values is that the resulting dataset is smaller, which can be beneficial when working with large datasets. Keep in mind, however, that modifying datasets can sometimes impact the analytics or visualizations based on those datasets. Always be wary of how you manipulate data.

The next statement in the example replaces the numeric values in the Survived column with the values deceased and survived:

To change values in this way, use the map function to specify the old and new values. You must enclose the assignments in curly braces, separating each set with a comma. For the assignments themselves, you separate the old value from the new value with a colon.
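A sketch of the statement, again using a tiny hypothetical frame:

```python
import pandas as pd

titanic = pd.DataFrame({'Survived': [1, 0, 0]}, index=[1, 2, 3])

# Old value on the left of each colon, new value on the right.
titanic['Survived'] = titanic['Survived'].map({0: 'deceased',
                                               1: 'survived'})

print(titanic['Survived'].tolist())   # ['survived', 'deceased', 'deceased']
```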

The next part of the Python script includes a for statement block that removes rows containing Age values that are not whole numbers:

The Titanic dataset uses fractions for infant ages, and you might decide not to include these rows in your data. As with the notnull function, I’ve included this logic primarily to demonstrate a useful concept, in this case, how you can loop through the rows in a dataset to take an action on each row.

The initial for statement specifies the i and row variables, which represent the current index value and dataset row in each of the loop’s iterations. The statement also calls the iterrows function on the titanic data frame for iterating through the rows.

For each iteration, the for statement block runs an if statement, which evaluates whether the Age value for the current row is an integer, using the is_integer function. If the function evaluates to False, the value is not an integer, in which case the drop function is used to remove the row identified by the current i value. When using the drop function in this way, be sure to include the inplace parameter, set to True, to ensure that the data frame is updated in place, rather than copied.
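Put together, the loop described above might be sketched as follows (the ages are made up, with 0.92 playing the part of an infant's fractional age):

```python
import pandas as pd

titanic = pd.DataFrame({'Age': [29.0, 0.92, 2.0]}, index=[1, 2, 3])

# i is the current index value, row the current row of data.
for i, row in titanic.iterrows():
    # is_integer returns False for fractional ages such as 0.92.
    if not row['Age'].is_integer():
        # inplace=True updates the data frame rather than copying it.
        titanic.drop(i, inplace=True)

print(len(titanic))   # 2
```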

Once again, you should determine whether you do, in fact, want to remove these rows. In this case, the aggregated passenger totals will be impacted by their removal because there will be eight fewer rows, so be sure to tread carefully when making these sorts of changes.

The final statement in the Python script uses the print function to return the titanic data to the SSMS console:

The following figure shows a portion of the results. Note that, when returning a dataset that is considered wide, Python wraps the columns, as indicated by the backslash to the right of the column names. As a result, only the index and the first four columns are included in this particular view.

[Figure 2]

If you scroll down the results, you’ll find the remaining columns, similar to those shown in the following figure.

[Figure 3]

Notice that the index value for the first row is 1. This is the first value from the first column from the original dataset. If the index for the titanic data frame had been auto-generated, the first value would be 0.

Grouping and aggregating the data

Once you get the titanic data frame where you want it, you can use the groupby function to group and aggregate the data. The following example uses the function to group the data first by the PClass column, then the Sex column, and finally the Survived column:

You call the groupby function by tagging it onto the titanic data frame, specifying the three columns on which to base the grouping (in the order that grouping should occur):

After specifying the grouped columns and as_index parameter, you add the agg function, which allows you to find the mean and count of the Age column values. (The groupby function is covered extensively in the first three articles of this series, so refer to them for more details about the function’s use.)

The groupby function results are assigned to the df variable. From there, you can use the variable to call the Age column in order to round the column’s values to two decimal places:

As you can see, you need only call the round function and pass in 2 as the parameter argument. You can then assign names to the columns in the df data frame:
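Taken together, the grouping steps might be sketched like this. The sample ages are made up, and the rounding is done after the columns are renamed, which gives the same result:

```python
import pandas as pd

# Small stand-in for the cleaned titanic data frame.
titanic = pd.DataFrame({
    'PClass':   ['1st', '1st', '1st', '1st', '2nd', '2nd'],
    'Sex':      ['female', 'female', 'male', 'male', 'female', 'male'],
    'Survived': ['survived', 'survived', 'deceased', 'survived',
                 'deceased', 'deceased'],
    'Age':      [29.0, 36.0, 47.0, 35.0, 24.0, 51.0]})

# Group by class, then gender, then status, aggregating the Age column
# to get both the mean age and the passenger count for each group.
df = titanic.groupby(['PClass', 'Sex', 'Survived'], as_index=False).agg(
    {'Age': ['mean', 'count']})

# Flatten the two-level column index into simple names, then round the
# average ages to two decimal places.
df.columns = ['class', 'gender', 'status', 'age', 'count']
df['age'] = df['age'].round(2)

print(df)
```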

That’s all you need to do to prepare the data for generating visualizations. The last statement in the example again uses the print function to return the data from the df data frame to the SSMS console. The following figure shows the results.

[Figure 4]

You now have a dataset that is grouped first by class, then by gender, and finally by the survival status. If the rows with fractional Age values had not been removed, the passenger counts would be slightly different.

Generating a bar chart

From the df data frame you can create multiple charts that provide different insights into the data, but before we dive into those, let’s start with a single chart. The following example creates a bar chart specific to the first-class passengers, showing the average age of the males and females who survived and who did not:

The Python script starts with several new import statements related to the matplotlib module, along with the use function, which sets PDF as the backend:

The first three statements should be familiar to you. If they’re not, refer back to the third article. The fourth statement is new. It imports the patches package that is included with the matplotlib module. The patches package contains tools for refining the various charts. In this case, you’ll be using it to customize the chart’s legend.

The next step is to create a data subset based on the df data frame:

The df1 data frame filters out all rows except those specific to first-class passengers. It also filters out all but the gender, status, and age columns. Filtering out the data in advance helps to simplify the code used to create the actual plot.

The next step is to define the property values for the legend. For this, you use the Patch function that is part of the patches package:

The goal is to provide a legend that is consistent with the colors of the chart’s bars, which will be color-coded to reflect whether a group represents passengers that survived or passengers that did not. You’ll use the dc and sv variables when you define your chart’s properties.

Once you have all the pieces in place, you can create your bar chart, using the plot.bar function available to the df1 data frame:

You’ve seen most of these components before. The chart uses the gender column for the X-axis and the age column for the Y-axis. The alpha parameter sets the transparency to 80%. What’s different here is the color parameter. The value uses the map function, called from the status column, to set the bar’s color to navy if the value is deceased and to dark green if the value is survived.

From here, you can configure the chart’s other properties, using the ax1 variable to reference the chart object:

Most of this should be familiar to you as well. The only new statement is the third one, which uses the set_visible function to set the visibility of the X-axis label to False. This prevents the label from being displayed.

The next step is to configure the legend. For this, you call the legend function, passing in the dc and sv variables as values to the handles parameter:

Notice that the loc parameter is set to best. This setting lets Python determine the best location for the legend, based on how the bars are rendered.

The final step is to use the savefig function to save the chart to a .pdf file:
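Assembled into one script, the chart definition might look like the sketch below. The grouped values are made up, and the title, Y-axis label, and output file name are assumptions; the navy and dark green colors, the PDF backend, and loc='best' follow the description above.

```python
import matplotlib
matplotlib.use('PDF')                     # render charts to PDF files
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import pandas as pd

# Grouped results in the shape produced earlier (sample values made up).
df = pd.DataFrame({
    'class':  ['1st'] * 4 + ['2nd'] * 4 + ['3rd'] * 4,
    'gender': ['female', 'female', 'male', 'male'] * 3,
    'status': ['deceased', 'survived'] * 6,
    'age':    [35.2, 37.1, 44.9, 36.5, 31.8, 26.4,
               38.9, 23.1, 27.0, 22.6, 27.9, 22.2],
    'count':  [9, 100, 58, 59, 13, 95, 135, 25, 110, 106, 417, 58]})

# Subset: first-class rows and only the columns the chart needs.
df1 = df[df['class'] == '1st'][['gender', 'status', 'age']]

# Legend patches that match the bar colors.
dc = patches.Patch(color='navy', label='deceased')
sv = patches.Patch(color='darkgreen', label='survived')

# Gender on the X-axis, age on the Y-axis, 80% opacity, and a per-bar
# color derived from the status column via map.
ax1 = df1.plot.bar(x='gender', y='age', alpha=0.8,
                   color=df1['status'].map(
                       {'deceased': 'navy',
                        'survived': 'darkgreen'}).tolist())

ax1.set_title('1st class')
ax1.set_ylabel('Average age')
ax1.xaxis.label.set_visible(False)        # hide the X-axis label
ax1.legend(handles=[dc, sv], loc='best')

plt.savefig('titanic_firstclass.pdf')
```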

When you run the Python script, it should generate a chart similar to the one shown in the following figure.

[Figure 5]

The chart includes a bar for each first-class group, sized according to the average age of that group. In addition, each bar is color-coded to match the group’s status, with a legend that provides a key for how the colors are used.

Generating multiple bar charts

The bar chart above provides a straightforward, easy-to-understand visualization of the underlying data. Restricting the results to the first-class group makes it clear how the data is distributed. We could have incorporated all the data from the df dataset into the chart, but that might have made the results less clear, although in this case, one chart might have been okay, given that the df dataset includes only 12 rows.

However, another approach that can be extremely effective when visualizing data is to generate multiple related charts, as shown in the following example:

Much of the code in the Python script is the same as the preceding example, with a few notable exceptions. The first is that the script now defines three subsets of data, one for each service class:

The next addition to the script is a statement that creates the structure necessary for rendering multiple charts:

The statement uses the subplots function that is part of the matplotlib.pyplot package to define a structure that includes three charts, positioned in one row with three columns. The nrows parameter determines the number of rows, and the ncols parameter determines the number of columns, resulting in a total of three charts. This will cause the function to generate three AxesSubplot objects, which are saved to the ax1, ax2, and ax3 variables. The function also generates a Figure object, which is saved to the fig variable, although you don’t need to do anything specific with that variable going forward.

The third parameter in the subplots function is sharey, which is set to True. This will cause the Y-axis labels associated with the first chart to be shared across the row with all charts, rather than displaying the same labels for each chart. This is followed with the figsize parameter, which sets this figure’s width and height in inches.

The next step is to define each of three charts. The first definition is similar to the preceding example, with one important difference in the plot.bar function:

The function now includes the ax parameter, which instructs Python to assign this plot to the ax1 AxesSubplot object. This ensures that the plot definition is incorporated into the single fig object.

The remaining chart definitions take the same approach, except that ax1 is swapped out for ax2 and ax3, respectively, and the df1 data frame is swapped out for df2 and df3. The two remaining chart definitions also do not define a Y-axis label. The following figure shows the resulting charts.
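A sketch of the three-chart version follows. It uses a small loop in place of three repeated chart definitions, and the sample data, titles, and file name are made up:

```python
import matplotlib
matplotlib.use('PDF')
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import pandas as pd

# Grouped results (sample values made up).
df = pd.DataFrame({
    'class':  ['1st'] * 4 + ['2nd'] * 4 + ['3rd'] * 4,
    'gender': ['female', 'female', 'male', 'male'] * 3,
    'status': ['deceased', 'survived'] * 6,
    'age':    [35.2, 37.1, 44.9, 36.5, 31.8, 26.4,
               38.9, 23.1, 27.0, 22.6, 27.9, 22.2]})

# One subset per class of service.
df1 = df[df['class'] == '1st'][['gender', 'status', 'age']]
df2 = df[df['class'] == '2nd'][['gender', 'status', 'age']]
df3 = df[df['class'] == '3rd'][['gender', 'status', 'age']]

dc = patches.Patch(color='navy', label='deceased')
sv = patches.Patch(color='darkgreen', label='survived')

# One row, three columns, with the Y-axis shared across the row.
fig, (ax1, ax2, ax3) = plt.subplots(nrows=1, ncols=3, sharey=True,
                                    figsize=(12, 4))

for ax, data, title in ((ax1, df1, '1st class'),
                        (ax2, df2, '2nd class'),
                        (ax3, df3, '3rd class')):
    # The ax parameter assigns each plot to its AxesSubplot object.
    data.plot.bar(x='gender', y='age', alpha=0.8, ax=ax,
                  color=data['status'].map(
                      {'deceased': 'navy',
                       'survived': 'darkgreen'}).tolist())
    ax.set_title(title)
    ax.xaxis.label.set_visible(False)
    ax.legend(handles=[dc, sv], loc='best')

ax1.set_ylabel('Average age')    # only the first chart gets a Y label
plt.savefig('titanic_by_class.pdf')
```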

[Figure 6]

Notice that only the first chart includes the Y-axis labels and that each chart is specific to the class of service. In addition, the legend is positioned differently within the charts, according to how the bars are rendered.

Breaking data into smaller charts can make the groups easier to grasp and to compare. You can do the same thing with the passenger counts, as shown in the following example:

This example is very similar to the preceding one, except that the data subsets have been updated to return the count column, rather than the age column:

You must also update the plot.bar function calls that define the subplots to reflect that the count column should be used for the Y-axis:

Also, be sure to update the Y-axis labels:

Other than these changes, the rest of the script is the same. Your charts should now look like those shown in the following figure:

[Figure 7]

Although the charts are similar to those in the preceding example, the data is much different. However, you can still easily find meaning in that data.

Generating multiple types of bar charts

You can also mix things up, providing charts that reflect both the average ages and passenger counts within one figure. You need only do some rearranging and make some additions, as shown in the following example:

This time around, the script includes six subsets of data, three specific to age and three specific to passenger count:

The idea here is to include each class of service in its own row, giving us a figure with three rows and two columns. To achieve this, you must update the subplots function call to reflect the new structure:

In addition to the updated nrows and ncols parameter values, the statement also includes six subplot variables, rather than three, with the variables grouped together by row. In addition, the sharey parameter has been replaced with the sharex parameter and the figure size increased.

The next step is to define the six subplots, following the same structure as before, but providing the correct data subsets, subplot variables, column names and labels. The result is a single .pdf file with six charts, as shown in the following figure.
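The six-chart figure can be sketched along the same lines, again with made-up values and assumed titles:

```python
import matplotlib
matplotlib.use('PDF')
import matplotlib.pyplot as plt
import pandas as pd

# Grouped results (sample values made up).
df = pd.DataFrame({
    'class':  ['1st'] * 4 + ['2nd'] * 4 + ['3rd'] * 4,
    'gender': ['female', 'female', 'male', 'male'] * 3,
    'status': ['deceased', 'survived'] * 6,
    'age':    [35.2, 37.1, 44.9, 36.5, 31.8, 26.4,
               38.9, 23.1, 27.0, 22.6, 27.9, 22.2],
    'count':  [9, 100, 58, 59, 13, 95, 135, 25, 110, 106, 417, 58]})

colors = {'deceased': 'navy', 'survived': 'darkgreen'}

# Three rows (one per class), two columns (age, count); the X-axis is
# shared down each column, and the figure is sized for six charts.
fig, axes = plt.subplots(nrows=3, ncols=2, sharex=True, figsize=(8, 10))

for (ax_age, ax_count), cls in zip(axes, ('1st', '2nd', '3rd')):
    data = df[df['class'] == cls]
    bar_colors = data['status'].map(colors).tolist()
    data.plot.bar(x='gender', y='age', ax=ax_age, alpha=0.8,
                  color=bar_colors, legend=False)
    data.plot.bar(x='gender', y='count', ax=ax_count, alpha=0.8,
                  color=bar_colors, legend=False)
    ax_age.set_ylabel(cls + ' class')

axes[0][0].set_title('Average age')
axes[0][1].set_title('Passenger count')
plt.savefig('titanic_grid.pdf')
```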

[Figure 8]

The challenge with this approach is that each chart scales differently. Mixing charts in this way can make it difficult to achieve matching scales. On the plus side, you’re able to visualize a great number of details within a collected set of charts, while still making it possible to quickly compare and understand the data.

The approach taken in this example is not the only way to create these charts. For instance, you might be able to set up a looping structure or use variables for repeated parameter values to simplify the subplot creation, but this approach allows you to see the logic behind each step in the chart-creation process. Like any other aspect of Python, there are often several ways to achieve the same results.

Python’s expansive world

As datasets become larger and more complex, the need is greater than ever for tools that can present the data to various stakeholders in ways that are both meaningful and easy to understand. SQL Server’s support for Python and the R language, along with the analytic and visualization capabilities they bring, could prove an invaluable addition to your arsenal for working with SQL Server data.

When it comes to visualizations in particular, Python has a great deal to offer, and what we’ve covered so far barely scratches the surface. Of course, we could say the same thing about most aspects of Python. It is a surprisingly extensive language that is both intuitive and flexible, and can be useful in a wide range of circumstances. Being able to work with Python within the context of a SQL Server database makes it all the better.


SQL – Simple Talk

Data Management and Integration Using Data Entities – Part 3


In part 3 of our blog series on Dynamics 365 Finance and Operations data management and integration, we continue to explore various integration types using data entities. Dynamics 365 for Finance and Operations provides standard out-of-the-box data entities across the modules that can be used as is or extended. However, you can also build custom entities to address your specific modification needs; these custom data entities need to be created within your development model and extensions.

Asynchronous integrations are used in business scenarios where very large amounts of data need to be imported or exported via files, or in the case of recurring periodic jobs, without compromising the performance of the system. Data entities support asynchronous integration through the Data Management Framework (DMF), which enables asynchronous, high-performing data insertion and extraction and is used for:

  • Interactive file-based import/export (Using DMF)
  • Recurring integrations (file, queue, and so on) (Using DMF / OData)

To access the Dynamics 365 Data Management Framework:

  • Navigate to your Dynamics 365 URL → System administration → Data Management


The Data Management Framework components for performing the Asynchronous Integration are specified below:

The Data Import/Export Framework is responsible for uploading files from shared folders, transforming the data to populate staging tables, and validating and mapping the data to the destination tables. This type of integration is particularly useful for bulk data uploads and is typically used for one-time full loads.

For One-time Full Load – Using the Data Import / Export Framework:

[Diagram: Data Import/Export Framework architecture]

The above diagram depicts the overall architecture of the framework. The Data Import/Export Framework creates a staging table for each entity in the Microsoft Dynamics 365 for Finance and Operations database where the target table resides. Data that is being migrated is first moved to the staging table, where business users and decision makers can verify the data and perform any cleanup or conversion that is required. After validation and approval, the data can be moved to the target table or exported.

The data flow goes through three phases:

  • Source – These are inbound data files or messages in the queue. Data formats include CSV, XML, and tab-delimited.
  • Staging – Staging tables are generated to provide intermediary storage; this enables the framework to do high-volume file parsing, transformation, and some validations.
  • Target – This is the actual data entity where data will be imported into target table.


OData integration uses secure REST application programming interfaces (APIs) and an authorization mechanism to receive data from, and send data back to, the integrating system; the data is consumed by the data entity and DIXF. It supports both single records and batches of records. This type of integration is useful for incremental and recurring uploads, as it enables the transfer of document files between Dynamics 365 and any third-party application or service, and it can be reused and automated at a specified interval.

For Incremental / Recurring Integration – Using OData Framework:

[Diagram: recurring integration using the OData framework]

Follow these steps to configure the asynchronous one-time and recurring integration jobs in Dynamics 365:

1. Create data project:

  • On the main dashboard, click the Data management tile to open the data management workspace.
  • Click the Import or Export tile to create a new data project.
  • Enter a valid job name, data source, and entity name.
  • Upload a data file for one or more entities. Make sure that each entity is added, and that no errors occur.
  • Click Save.


2. Create recurring data job

  • On the Data project page, click Create recurring data job.
  • Enter a valid name and a description for the recurring data job.
  • On the Set up authorization policy tab, enter the application ID that was generated for your application, and mark it as enabled.


  • Expand Advanced options, and specify either File or Data package.
  • Specify File to indicate that your external integration will push individual files for processing via this recurring data job.
  • Specify Data package to indicate that you can push only data package files for processing. A data package is a new format for submitting multiple data files as a single unit that can be used in integration jobs.
  • Click Set processing recurrence, and set a valid recurrence for your data job.
  • Click Set monitoring recurrence, and provide a monitoring recurrence.
  • Click OK, and then click Yes in the confirmation dialog box.

There you have it! Stay tuned as we continue to cover other data entity management and integration scenarios around business intelligence, Common Data Services, and application lifecycle management.

Subscribe to our blog for more guides to Dynamics 365 for Finance and Operations!

Happy Dynamics 365’ing!


PowerObjects- Bringing Focus to Dynamics CRM

Expert Interview (Part 4): Katharine Jarmul on Anonymization and Introducing Randomness to Test Data Sets

At the recent Cloudera Sessions event in Munich, Germany, Paige Roberts, Syncsort’s Big Data Product Marketing Manager, had a chat with Katharine Jarmul, founder of KJamistan data science consultancy, and author of Data Wrangling with Python from O’Reilly. She had just given an excellent presentation on the implications of GDPR for the European data science community. For this final installment, we’ll discuss some of the work Katharine is doing in the areas of anonymization so that data can be repurposed without violating privacy, and creating artificial data sets that have the kind of random noise that makes real data sets so problematic.

In the first part of the interview, we talked about the importance of being able to explain your machine learning models – not just to comply with regulations like GDPR, but also to make the models more useful.

In part 2, Katharine Jarmul went beyond the basic requirements of GDPR again, to discuss some of the important ethical drivers behind studying the data fed to machine learning models. Biased data sets can make a huge impact in a world increasingly driven by machine learning.

In part 3, we talked about being a woman in a highly technical field, the challenges of creating an inclusive company culture, and how bias doesn’t only exist in machine learning data sets.

Roberts: We’ve hit on several subjects here. What else are you working on?

Katharine Jarmul: I’ve been doing a lot more research on things like fuzzing data, test data, and how that relates to anonymization. I’ll be doing a series on that, but there are also some other cool libraries and things I can point to about that. As data scientists, we spend so much time cleaning our data, but how do we mess up our data? Not only to test our own workflow, and determine if it’s working properly, but also to perhaps do things like release it to third parties, or to other people, and have it be anonymized.

Yeah, there’s the example of the Netflix prize, the guy de-anonymized that data. And Netflix was like, “Oh, oops.”

Oops [laughing].

That was supposed to be anonymous data. We thought it was anonymous data.

Yeah, I’m also on a big kick to find out how we can create synthetic data that really looks like our data that has…

You can test with it.


I worked doing healthcare data integration for a long time. We were doing EDI to COBOL which is a big jump in translation. All the pipelines we built were tested with fake data sets. I talked to the guys in charge of the team, told them that the minute we put real data through this system, it’s going to crash and burn. I don’t care how many of these EDI transactions you build with Marcus Welby, MD, and Barney and Betty Rubble, it’s not going to break the system like real data. Real data is always messier than we expect.

Yeah, and I think that if we find ways to be able to test with some of that noise, maybe we can even choose exactly what types of noise, or what types of randomness we want to pursue, then we can make sure that our validation is working properly. And if we don’t have validation, we should probably set that up [laughs].
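The idea of choosing exactly which types of noise to inject can be sketched as a small data fuzzer: pick the noise categories, set a rate, and seed the random source so failures reproduce. This is an illustrative sketch, not a library API; the noise categories and names are invented for the example:

```python
import random

def fuzz_record(record: dict, rng: random.Random, rate: float = 0.3) -> dict:
    """Return a copy of `record` with controlled mess injected: missing
    values, stray whitespace, and type drift - the kinds of noise real
    data tends to contain."""
    noisy = dict(record)
    for key in list(noisy):
        if rng.random() >= rate:
            continue  # leave this field clean
        kind = rng.choice(["null", "whitespace", "stringify"])
        if kind == "null":
            noisy[key] = None                   # simulate a missing value
        elif kind == "whitespace":
            noisy[key] = f"  {noisy[key]} \n"   # simulate sloppy entry
        else:
            noisy[key] = str(noisy[key])        # simulate type drift
    return noisy

rng = random.Random(42)  # seeded, so the "randomness" is reproducible
clean = {"name": "Ada", "age": 36, "city": "London"}
print(fuzz_record(clean, rng))
```

Running the fuzzed output through your pipeline is a quick way to find out whether your validation exists and works before real data does it for you.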

We probably need some of that, yeah. Maybe. Just a thought. [laughing]

What could go wrong, right? [laughing]

Well, thank you for talking to me.

Thanks so much, Paige. That was fun.

Yeah. I always enjoy these interviews. I always learn something new.

Read our new eBook, Keep Your Data Lake Pristine with Big Data Quality Tools, for a look at how the proper software can help align people, process, and technology to ensure trusted, high-quality Big Data.

More on GDPR:

If you want to learn more about GDPR compliance and how Syncsort can help, be sure to view the webcast recording of Michael Urbonas, Director of Data Quality Product Marketing, on Data Quality-Driven GDPR: Compliance with Confidence.

On a related subject, be sure to read the post by Keith Kohl, Syncsort’s VP of Product Management: Data Quality and GDPR Are Top of Mind at Collibra Data Citizens ’17


Syncsort + Trillium Software Blog

The Top 20 CRM Blogs of 2017: Countdown, Part 2

Where does the discipline of CRM begin? We have a good idea where the software fits, but where does its impact end? With a sale? With a customer saying good things about your company to other customers? With a repeat purchase? And does CRM contribute to these events alone, or is there a web of other activities that help drive these relationships — and do we ever consider these things to be CRM?

The world is becoming a much more complicated place for practitioners of CRM, expressly because of considerations like these. CRM itself is well understood; getting the most value from it is not understood nearly as well.

The top 10 CRM bloggers of 2017 didn’t spend a lot of time talking about the nuts and bolts of CRM. They talked about the concepts, assumptions, errors, omissions and expectations around CRM. They attacked the “common knowledge.” They tried to get to the basics of what customers really want.

The ground rules again: blogs must not be from a vendor, and they must have seven or more posts in a year. Here are the Top 10 of 2017:

10. Bob Thompson, Customer Think

With a hard pivot toward customer experience and loyalty, Bob Thompson has shifted his area of specialization in Customer Think over the years as the industry has matured and specialized in its view of CRM.

Bob spent much of 2017 hammering on the idea that humans were critical to delivering the experiences customers wanted, pushing back on a technology tide that had people excited about bots, AI, IVR and other innovations.

While integration has been proceeding more smoothly with these technologies than with technologies of previous generations, it still takes a human touch to deliver the best experiences. It’s not “either-or,” it’s both, according to Bob.

Toward that end, he spent the first half of 2017 writing about tools and business practices for building better customer engagement, but anchored that discussion in how they helped customers — and how engaged, empowered and empathetic employees were key to making any of them work most effectively.

His blogs stopped in August — here’s hoping that Bob comes back to the blog this year and keeps advocating for a customer experience future that uses technology to keep a human face on customer relationships.

Posts in 2017: 15

Favorite post:
Here’s Proof from Forrester that CX Drives Revenue. And 3 Cautions That It May Not

9. CRM Switch

Continuing a strong run is CRM Switch, from a CRM consultancy that recognizes that a blog exists to start conversations, not to close deals.

The content — usually from Steve Chipman, but with important contributions from Daryn Reif as well — addresses all aspects of sales relationship thinking, with some more technology-focused items sprinkled in to ensure that the “how” is covered as well as the “why.”

Sometimes, the reporting can get a little lazy, as in “Small Business CRM Vendor Roundup,” which rounds up exactly five vendors, but those posts are the exception, not the rule.

More typical is “CRM Selection for Your Business: Seven Proven Steps,” which offers a detailed, comprehensive set of advice that anyone planning to buy and deploy CRM should take to heart.

Born of years of practical experience, CRM Switch’s blog is a helpful guide for any company pondering a jump to an automated CRM solution.

Total posts in 2017: 21

Favorite post:
CRM Lead: How do I Disqualify Thee? Let Me Count the Ways

8: Effective CRM – Mike Boysen

Y’know the old saw about people not wanting to buy a quarter-inch drill, they want to buy a quarter-inch hole? Mike Boysen does. Nearly all of last year’s Effective CRM posts went right at that concept: People want outcomes and they’re not that interested in how they get them, so companies need to engage customers about what they really want.

It’s an elemental concept in making a company “customer-centric,” yet a lot of businesses still don’t get it. Mike digs into how you realize what jobs need to be done, how you understand the moments of truth in customer relationships better with jobs theory, and how you can keep a clear focus on jobs that need doing vs. the other elements of a customer relationship that can distract and divert you.

Mike talks about this in blunt terms — I especially liked his quote, “There are no soft-landings for founders who think they are just failing fast. There is only failure.”

Mike addresses some tough issues about CRM itself: “Vendors have given us a one-size-fits-all option where we can feel that we’re differentiating ourselves with the same tools as our competitors. Let’s face it, the vendors out there are doing no better at finding growth  —  profitable growth —  than the rest of us.”

If you think CRM needs some tough love — and to get focused on what it should have been focused on all along — Mike’s the guy for you.

Total posts in 2017: 8

Favorite post:
You Need to Know this New, Pioneering Approach to CRM

7. Forrester Blog – Kate Leggett, John Bruno

Forrester collects all of its analyst blogs into one enormous mega-blog, but if your focus is primarily on CRM and the CRM-like technologies that serve sales, do a search and isolate the blogs from Kate Leggett and John Bruno.

Kate covers the more traditional CRM space and customer service, while John examines sales and marketing technologies. Together, they create a set of posts that are concise and correlate strongly to their current research, with a few “bigger picture” posts that explore broader topics, especially the current pressing issues like AI and digital transformation.

Last year, the blogs’ coverage seemed to pull back a little. At analyst firms, there’s a constant pressure between feeding the blog and keeping some information back for the customers, and the 2017 posts felt a bit like the pendulum had swung away from the blog.

That said, there was still a lot of value in what Kate and John wrote in 2017, and Kate was especially effective in connecting the dots between the technology and the need for engaged employees to use that technology to achieve customer engagement. That’s advice that companies get constantly, but coming from an authoritative voice like Kate’s can make it stick.

Posts in 2017: 16

Favorite post:
Intelligence Makes Customer Service Operations Smarter, More Strategic

6. Destination CRM Blog

Destination CRM is a classic “reporter’s notebook”-style blog, and having been a reporter, I find it very entertaining. Today’s journalists are on the job constantly, and that usually means coming across more interesting ideas and stories than you can fit into your many regularly scheduled articles.

Thus, Oren Smilansky and San Del Rowe provide a home for items about research studies, standalone Q&As, and interesting (if not front-page) company news, ranging in tone from analysis of hard data to the whimsical (as in the post above about the perils of being a customer service agent).

The posts are short, the pace is regular, and the writers follow the practice of including links to their sources — something I wish more bloggers would do.

Don’t let the “Department of the Obvious” headlines (“Customer-Initiated Phone Calls are Valuable to Marketers, Study Says,” “Companies Need to Address Customers in their Native Tongue”) put you off. The writing is good even when the headlines are meh.

At the blog’s best, the writers report on some new findings, and then riff off those results based on their own reporting experience, showing that journalists have some CRM expertise to offer, too.

Posts in 2017: 59

Favorite post:
Customer Cursing Habits, Broken Down by Region and Industry

5. Think Customers: the 1-to-1 Media Blog

Late last year, Think Customers: the 1-to-1 Media Blog announced that it was going to cease publishing regularly, as its ad-supported model was phased out.

Although the frequency of posts dropped, the guest posts from notable experts dried up, and the staff of writers dwindled to two — veteran Judith Aquino and newcomer Dylan Haviland — the quality remained.

The blog featured some good interviews with genuine thought leaders like Charlene Li, along with other posts that read much more like news stories than like opinion pieces.

A typical approach was to use something discussed at a conference or some recently-released research as a springboard, then add to it with the opinions of analysts, experts and practitioners.

The bloggers’ voices may not always be front and center, but the posts themselves have an air of authority and a completeness of ideas that set them apart.

The blog’s focus on customer experience permits lots of latitude in what’s discussed: concepts like employee engagement in retail, the role of AI in contact centers, and the importance of trust are front and center.

The blog’s takes on these topics are never the same twice, an accomplishment that owes a lot to the hard work the two writers put into the blog.

Posts in 2017: 11

Favorite post:
Emotion Powers Technology Adoption

4. ThinkJar! The Blog – Esteban Kolsky

Always an iconoclast, Esteban Kolsky spent a lot of time in 2017 shutting down the hype about artificial intelligence — and then explaining how it could be really useful. If that sounds like two ideas running headlong into each other, you have an idea of Esteban’s usual take on any subject.

In ThinkJar! The Blog, he tears ideas down and then rebuilds them in an Esteban-esque image, infusing the discussion with new points of view and better ways of thinking about the concepts.

As for AI, Esteban pointed out that the notion that AI will be smarter than humans is nonsensical, because “computers would have to dumb down their behavior and operations to work like us.”

Even if they did manage to replicate us, we humans have the ability to adapt our behaviors, something that AI can’t do, enabling us to find meaning and practical utility regardless of what AI does — a bit of a lesson to people who think that all sales and marketing activities can be supplanted by sufficiently smart machines.

Esteban also maintains his role as analyst — witness his incisive, ruthless but ultimately hopeful examination of the Jive-Lithium merger, chock-full of his not-so-humble advice. Smart and snarky, Esteban is the inventor of the concept of self-deprecating arrogance, and his blog is as fun to read as it is important.

Posts in 2017: 16

Favorite post:
Knowledge Summary: the Next Decade in Digital Transformation

3. CRM Search

So, if you’re a medium-sized company looking for CRM advice, you could call in a high-priced consultant, engage with one of the large analyst firms, or find multiple other methods by which you could expend a lot of money in search of wisdom.

Before you start writing checks, however, you should check out the blog at CRM Search, written by the widely-admired Chuck Schaeffer.

His posts are as detailed and thorough as many of the analyst’s reports you’d pay big money for, and they come from a genuine place of expertise.

Don’t expect a bunch of quick takes — it’s not uncommon for a post to go on for 1,100 words, and then jump to the next page for more. Replete with charts, graphics and plenty of linked citations, these are not pieces jotted off the top of Chuck’s head during airplane flights — they’re extremely thoughtful and well-planned posts.

Whether he’s reviewing the latest edition of Microsoft Dynamics 365, or defining and explaining the ramifications of cognitive computing, it’s Chuck’s deep dives into some heavy-duty subjects that make his blog essential.

The topics can seem a bit all over the place, and they are; it seems that Chuck writes about the things that most interest him in the moment. That ensures the posts are thorough, complete and energetic even when they examine deep, technical topics.

Posts in 2017: 8

Favorite post:
How to Design Your 360-Degree Customer View

2. Beagle Research Blog – Denis Pombriant

What were you concerned about in 2017? So concerned that you sat down and wrote about it? If you said the ASC 606 Accounting Rule, Richard Branson, AI and CRM, Oracle OpenWorld and Salesforce’s DreamForce, cryptocurrency, how Elon Musk is a Luddite and the best way to assemble a sales team based on the data, you must be Denis Pombriant.

Who else has such an eclectic view of the influences on customer relationships, sales and marketing, and digital transformation? No one who’s currently writing a blog!

Author of the Beagle Research Blog and regular contributor to CRM Buyer, Denis has the ability to stitch these various stories together in a way that’s unmatched. While they may seem far afield from the topic of customer relationship management at times, they’re really not — Denis has for years avoided the trap of thinking that all there was to CRM was CRM software and vendors.

Everything in the economic system that affects the customer needs to be considered, whether it’s the coming impact of blockchain, the value of configure price quote (CPQ) tools to the buying experience, or how the availability of micropayment tools will change the equation for selling.

Denis does this in an exceptionally literate style and folds in plenty of metaphors and analogies to keep things from becoming stale or staid. On top of that, his analyses of major industry events go beyond insightful. Several journalists I know say they check the blog to make sense of the events they’ve just attended.

Posts in 2017: 49

Favorite post:
Getting Loyalty Right

1. Social CRM: The Conversation – Paul Greenberg

I know the busy, busy Paul Greenberg would like to slow down. Don’t tell his brain that, though.

In 2017, he worked very hard to complete a new book (which will be out this summer), and you could see very clearly how Paul’s intent thinking about his latest long-form work impacted his shorter-form writing in Social CRM: The Conversation. Ideas were sharper, metaphors were clearer, and Paul’s writing was even more energetic (if that’s possible).

It seems the more Paul works and the more he thinks, the more interesting things spill out into his writing. Last year, his investigations into the discipline of CRM focused much more on using the data than the process of collecting data, which has become an established practice and thus is less interesting.

“Doing CRM” is no longer about getting people to record the data; it’s focused on using the data to become the company you should be. Witness our favorite post of the year: Paul talks about how a company renowned for its abysmal treatment of customers was forced by a business downturn to engage with customers and seemingly was shocked by how well that tactic worked.

Paul’s point in this piece is not just that engaged customer relationships are good for business, but that businesses need to pursue them because they and the people they hire desire — no, need — to pursue them.

A corporate initiative to be more engaged because it will help sales is nice — but it can’t hold a candle to engagement that’s driven by culture and the genuine desire of employees to be engaged.

Paul also used guest posts to buy time for finishing his book, and he’s able to call in heavy hitters like Sameer Patel, David Raab and Brent Leary to fill in. But it’s Paul’s own unique voice that allowed his blog to reclaim the top of this list. His is one of the few blogs that can advise you of the things you should be doing differently and leave you genuinely excited about trying them.

Posts in 2017: 14

Favorite post:
A Company Like Me: Beyond Customer-Centric to Customer-Engaged

Chris Bucholtz has been an ECT News Network columnist since 2009. His focus is on CRM, sales and marketing software, and the interface between people and technology. A noted speaker and author, Chris has covered the CRM space for 10 years.
Email Chris.


CRM Buyer

Expert Interview (Part 3): Katharine Jarmul on Women in Tech and the Impact of Biased Data in Both Human & Machine Learning Models

At the recent Cloudera Sessions event in Munich, Germany, Paige Roberts, Syncsort’s Big Data Product Marketing Manager, had a chat with Katharine Jarmul, founder of KJamistan data science consultancy, and author of Data Wrangling with Python from O’Reilly. She had just given an excellent presentation on the implications of GDPR for the European data science community. Part 3 dives into the position of being one of the women in tech, the challenges of creating an inclusive company culture, and how bias doesn’t only exist in machine learning data sets.

In the first part of the interview, we talked about the importance of being able to explain your machine learning models – not just to comply with regulations like GDPR, but also to make the models more useful.

In part 2, Katharine Jarmul went beyond the basic requirements of GDPR, to discuss some of the important ethical drivers behind studying the data fed to machine learning models. A biased data set can make a huge impact in a world increasingly driven by machine learning.

Paige Roberts: I know, I’m probably a little obsessive about it, but one of the things I do is look around at every event, and calculate the percentage of women to men. And I must say, the percentage at this event is a little low on women.

Katharine Jarmul: Yeah.

So, do you find yourself in that situation a lot? Do you get that, “I’m the only woman in the room” feeling?

I would say that one of the biggest problems I see in terms of women in technology is not that there aren’t a lot of amazing women interested in tech; it’s that it’s difficult for a lot of really talented women in tech to get recognized and promoted.


It feels like women have to be twice as good, to be recognized as half as good.

Yeah. And I think we’re finding out now, there’s a lot of other minority groups as well, who find it difficult, such as women of color. Maybe you have to work four times as hard. We see this exponential thing, and when you’re at an event where it’s mainly executives, or people that have worked their way up for a while, then you just tend to see fewer women, and that’s really sad. I don’t see it as a pipeline problem. I know a lot of people talk about it as a pipeline problem, and yeah, okay, we could have a better pipeline.

Yeah, we need a few more women graduating, but that’s not the problem. The problem is they don’t get as far as they should once they graduate.

Exactly, and maybe eventually they leave because they are tired of not being promoted, having somebody else promoted over them, not getting the cool projects so they can shine.

And some of it is just cultural in tech companies. You get that exclusionary feeling. I had a conversation recently, somebody I was talking to… Oh, I was talking to Tobi Bosede. She’s a woman of color, and she’s a machine learning engineer who did a presentation at Strata. She said something along the lines of, the guys I work with say, “Let’s go play basketball after work.” And everybody on the team does. She’s thinking, “I don’t even like basketball. I don’t really want to go play basketball with the guys after work, but I still feel left out.”

Yeah, I get that. It’s difficult to make a good team culture that’s inclusive. I think you must really work for it. I know some great team leads who are doing things that help, but I think especially if say, you’re a white guy that didn’t grow up with a lot of diversity in your family or your neighborhood, it might be more difficult for you to learn how to create that culture. You must work for it. It’s not just going to happen.

It’s almost like a biased data set in your life. You don’t recognize bias in yourself, until you stop and think about it. It doesn’t just jump out and make itself known.

Of course.


I did an interview with Neha Narkhede, she’s the CTO at Confluent, and she was talking about hiring bias. Even as a woman of color herself, when hiring, she catches herself doing it, and must stop and think, and deliberately avoid bias. It’s in your own head. You think, I should know better.

Yeah, yeah. And I think these unconscious biases are things that we have, as humans. We all have some affinity bias, right? So, if somebody is like me, I’m going to automatically think that they’re clearer. They think like me, so I can more easily see their point. That’s fine but, one of the things that helps teams grow is having arguments, …

Having different points of view, and accepting that, “Okay, this guy thinks completely different from me, but maybe he’s got a point.”

I find myself doing the thing where I think, “Why did they disagree with me? How could they?”

They’re wrong, obviously. [laughing]

[laughing] Especially when I notice that I’m doing it like that, I say, “Okay, I need to sit down and think through this. Is there perhaps a cardinal truth here? Or something that bothers me because it doesn’t necessarily fit into my world view? And should I, perhaps, poke at that a little bit, and figure it out?”

Stop and think, introspect.

Yeah [laughs].

That’s a good word. I like that.

We have our own mental models, and we need to question the bias in them, too.

Be sure to check out part 4 of this interview where we’ll discuss some of the work Ms. Jarmul is doing in the areas of anonymization so that data can be repurposed without violating privacy, and creating artificial data sets that have the kind of random noise that makes real data sets so problematic.

For a look at 5 key Big Data trends in the coming year, check out our report, 2018 Big Data Trends: Liberate, Integrate & Trust

Related Posts:

Neha Narkhede, CTO of Confluent, Shares Her Insights on Women in Big Data

Yolanda Davis, Sr Software Engineer at Hortonworks, on Women in Technology

Katharine Jarmul on If Ethics is Not None

Katharine Jarmul on PyData Amsterdam Keynote on Ethical Machine Learning


Syncsort + Trillium Software Blog

Expert Interview (Part 2): Katharine Jarmul on the Ethical Drivers for Data Set Introspection Beyond GDPR Compliance

At the recent Cloudera Sessions event in Munich, Germany, Paige Roberts of Syncsort had a chat with Katharine Jarmul, founder of KJamistan data science consultancy, and author of Data Wrangling with Python from O’Reilly. She had just given an excellent presentation on the implications of GDPR for the European data science community.

In the first part of the interview, we talked about the importance of being able to explain your machine learning models – not just to comply with regulations like GDPR, but also to make the models more useful.

In this part, Katharine Jarmul will go beyond the basic requirements of GDPR again, to discuss some of the important ethical drivers behind studying the data fed to machine learning models. A biased data set can make a huge impact in a world increasingly driven by machine learning.


In part 3, we’ll talk about being a woman in a highly technical field, the challenges of creating an inclusive company culture, and how bias doesn’t only exist in machine learning data sets.

In the final installment, we’ll discuss some of the work Ms. Jarmul is doing in the areas of anonymization so that data can be repurposed without violating privacy, and creating artificial data sets that have the kind of random noise that makes real data sets challenging.

Paige Roberts: Okay, another interesting thing you were talking about in your presentation was the ethics involved in this area. If you’ve got that black box, you don’t know where your data came from, or maybe you didn’t really study it enough. You didn’t sit down and, how did you put it? Introspect. You didn’t really think about where that data came from, and how it can affect people’s lives.

Katharine Jarmul: Yeah, there’s been a lot of research coming out about this. Particularly when we have a sampling problem. For example, let’s say we have a bunch of customers, and only 5% are aliens. I will use that term, just as if they were Martians. We have these aliens that are using our product, and because we have the sampling bias, any statistical measurement we take of this 5% is not really going to make any sense, right? So, we need to recognize that our algorithm is probably not going to treat these folks fairly. Let’s think about how to combat that problem. There’s a lot of great mathematical ways to do so. There’s also ways that you can decide to choose a different sampling error, or choose to treat groups separately in your classification. There are a lot of ways to fight this, but first you must recognize that it’s a problem.
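The arithmetic behind that sampling problem is easy to sketch. With invented numbers, a model that fails badly on a 5% minority group can still report a reassuring overall metric, which is why per-group measurement matters:

```python
# Toy illustration (all numbers invented): a classifier that is right
# 96% of the time on the majority group and only 40% of the time on
# the 5% "alien" minority still posts a headline accuracy that looks fine.
majority_n, majority_acc = 9500, 0.96
minority_n, minority_acc = 500, 0.40

overall = (majority_n * majority_acc + minority_n * minority_acc) / (
    majority_n + minority_n
)
print(f"overall accuracy:  {overall:.3f}")       # prints 0.932 - looks healthy
print(f"minority accuracy: {minority_acc:.3f}")  # the failure the average hides
```

Scoring each group separately, as suggested above, surfaces the 0.40 directly instead of letting the 95% majority wash it out.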

If you don’t recognize that there MIGHT be a problem, you don’t even look for it, so you never realize it’s there.

Exactly, and I think that’s key to a lot of these things that are coming out that are really embarrassing for some companies. It’s not that they’re bad companies, and it’s not that they’re using horrible algorithms. It’s that if you don’t think about all the potential ramifications for all the potential groups, it’s really easy to get to a point where you must say, “Oops, we didn’t know that,” and issue a big public apology.

Something like, I have made decisions on who gets a loan, or who gets a scholarship, or who gets promoted, or who gets hired, and it’s all based on biased data. I didn’t stop and think, “Oh, my dataset might be biased.” And now, my machine learning algorithm is propagating it. There were a lot of talks about that at Strata. Hillary Mason did a good one on that.

Oh excellent. Her work at Fast Forward Labs on interpretability is some of the best in terms of pushing the limit for how we apply interpretability, and therefore this accountability that comes with that, to “black box” models.

Because if you don’t know how your model works, you can’t tell when it’s biased.

Exactly. And, if you spend absolutely ALL of your time on “How can I get the precision higher?” and “How can I get the recall higher?”, and you spend none of your time on “Oh wait, what might happen if I give the model this data?”, data the model might not have seen before, from perhaps a person with a different color of skin, or a person with a different income level, or whatever it is: “How might the model react?” If you’re not thinking about those things at all, then they’ll really sneak up on you [laughs].

Be sure to check out part 3 of this interview, where we’ll discuss the challenges of women in tech, and the biases that exist, not just in our data sets, but also in our culture and our own minds.

If you want to learn more about GDPR compliance and how Syncsort can help, be sure to view the webcast recording of Michael Urbonas, Syncsort’s Director of Data Quality Product Marketing, on Data Quality-Driven GDPR: Compliance with Confidence.

Related posts:

Katharine Jarmul on If Ethics is Not None

Katharine Jarmul on PyData Amsterdam Keynote on Ethical Machine Learning

Keith Kohl, Syncsort’s VP of Product Management on Data Quality and GDPR Are Top of Mind at Collibra Data Citizens ’17


Syncsort + Trillium Software Blog