Avoiding Obvious Insights with Analyze With Insight Miner

March 6, 2019   Sisense

Analyze with Insight Miner is a technology developed by Sisense that uses machine learning to identify statistically significant insights in dashboards. We apply a wide range of analytical and statistical tests to detect interesting insights and run common exploratory analyses, such as cohort analysis, decision trees, and bivariate analysis.

As the lead of the team that built this technology, I’ll use this post to introduce the high-level flow of data through Analyze with Insight Miner – starting from reading the data, then cleaning it, and finally generating insights. Keep in mind throughout this section that we needed a data flow robust and generic enough to handle many types of datasets from different domains and with different distributions. Following that, I will discuss how we detect insights that are irrelevant or too obvious.

Analyze With Insight Miner

Analyze with Insight Miner generates insights based on tabular datasets. Imagine you have a big table with lots of columns and want to explore the impact of all the columns on one specific column. This one column is the target variable, i.e. the column we are interested in exploring. All other columns are referred to as the explaining variables: the variables by which we want to explain the target.

Using Analyze with Insight Miner, some typical insights a user might see are:

“When the age of a customer in the US is greater than 60, the likelihood of that customer churning is 5%, compared to a 30% likelihood in the entire population.”

“We detect a decreasing trend in the likelihood to churn over time in Italy.”

Looking at the examples above, the target variable is the churn column (a binary column describing whether or not a customer churned) and the explaining variables are the age of the customers and their countries.
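To make the setup concrete, here is a minimal sketch in pandas of splitting such a table into the target and the explaining variables (the file name and column names are hypothetical):

    import pandas as pd

    # Hypothetical customer table with a binary churn column.
    df = pd.read_csv("customers.csv")

    target = df["IsChurn"]                     # the column we want to explain
    explaining = df.drop(columns=["IsChurn"])  # e.g. Age, Country, usage stats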

Learn more about how Analyze with Insight Miner can unearth hidden gold from your data here.

But first things first…

Before Analyze with Insight Miner starts looking for insights, it needs to preprocess the dataset. It starts by understanding the type of each variable in the dataset (is it numeric, categorical, a date, and so on). It then applies some preprocessing steps. For instance, many statistical tests are not robust to outliers, so it needs to detect and remove them.
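The post doesn’t specify which outlier test Analyze with Insight Miner uses; as one common approach, here is a sketch that trims a numeric column using Tukey’s IQR fences:

    import pandas as pd

    def remove_outliers_iqr(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
        # Keep only rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR].
        q1, q3 = df[column].quantile([0.25, 0.75])
        iqr = q3 - q1
        return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]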

On top of this, if there are missing values in the dataset, Analyze with Insight Miner needs to decide what to do with them. In some cases, it imputes them. A common approach for a numeric column, for example, is to replace the missing values with the mean of all other values in the same column. Missing values in a categorical column can be imputed by replacing them with the most common value of the column.
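A minimal sketch of that imputation logic, assuming a pandas DataFrame (mean for numeric columns, most common value for the rest):

    import pandas as pd

    def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()
        for col in out.columns:
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].mean())  # numeric: mean
            elif out[col].notna().any():
                out[col] = out[col].fillna(out[col].mode().iloc[0])  # categorical: mode
        return out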

After applying this sequence of cleansing and preprocessing steps, it can get to the interesting part: generating insights.

In this step, Analyze with Insight Miner conducts a set of tests to discover patterns in the data, or subpopulations that behave differently from the general population. For example, it creates combinations of subgroups (such as customers under 35 from the US) and checks whether their target variable distribution is significantly different from that of everyone else (for example, whether the likelihood of this group to churn is significantly lower than all the others).
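The exact battery of tests is not public; purely as an illustration of the idea, a two-proportion z-test on an invented customer table might look like this:

    import numpy as np
    import pandas as pd
    from statsmodels.stats.proportion import proportions_ztest

    # Invented data standing in for a real customer table.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "Age": rng.integers(18, 80, 1000),
        "Country": rng.choice(["US", "IT", "DE"], 1000),
        "IsChurn": rng.integers(0, 2, 1000),
    })

    # Is the churn rate of "US customers under 35" significantly
    # different from that of everyone else?
    mask = (df["Age"] < 35) & (df["Country"] == "US")
    count = np.array([df.loc[mask, "IsChurn"].sum(), df.loc[~mask, "IsChurn"].sum()])
    nobs = np.array([mask.sum(), (~mask).sum()])
    stat, p_value = proportions_ztest(count, nobs)
    print(f"subgroup churn {count[0]/nobs[0]:.1%} vs rest {count[1]/nobs[1]:.1%}, p={p_value:.3f}")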

To form these groups, Analyze with Insight Miner uses, among other techniques, decision trees. A decision tree is a popular machine learning algorithm used for both classification and regression. One of its strengths is interpretability: its logic can be read directly from the tree, in contrast to many other machine learning algorithms that are black boxes. Analyze with Insight Miner trains a decision tree on the target variable using the explaining variables. Below is an example of a decision tree.

[Figure: an example decision tree]

We can extract several insights from this decision tree. For example, females above 35 have a 23% likelihood of churning.
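As a rough sketch of this step, here is a tiny decision tree trained with scikit-learn, with its rules printed out (the data and column names are invented):

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Invented explaining variables and churn labels.
    X = pd.DataFrame({
        "Age":      [25, 42, 61, 38, 55, 29, 47, 33],
        "IsFemale": [1, 0, 1, 1, 0, 0, 1, 0],
    })
    y = [0, 1, 1, 0, 1, 0, 1, 0]  # IsChurn

    # A shallow tree keeps the extracted rules human-readable.
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=list(X.columns)))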

Now that we understand the basics of how Analyze with Insight Miner works, I would like to present one of the techniques it uses to avoid generating insights that are obvious and uninteresting to the end user.

What Are Obvious Insights?

One of our assumptions in developing Analyze with Insight Miner was that end users don’t necessarily know how to construct the dataset best suited for exploratory analysis.

Continuing from the previous example, let’s assume that we are interested in uncovering insights related to the churn patterns of our customers. In this case, our target variable can be a binary column indicating whether a customer churned or renewed their account. The explaining variables can be columns related to the customers’ demographics, past usage, and so on.

Below is a small example of such a dataset. The last column (“IsRenewal”) indicates whether the customer renewed their account and is the exact inverse of “IsChurn”. In this case, we’ll return a very strong insight: “when customers renew their account, they don’t churn.” Unfortunately, for obvious reasons, this is not interesting at all.

[Figure: a sample dataset in which “IsRenewal” is the exact inverse of “IsChurn”]

This example is easy to detect since there is a one-to-one mapping between the target variable and the “IsRenewal” explaining variable. But what if the relationship between the variables is not so straightforward? How can we detect those relationships automatically and avoid obvious insights?

How to Avoid Generating Obvious Insights

One approach to avoid generating obvious insights is to apply what we refer to as “opposite feature selection.”

Before we dive into the details, let’s first define feature selection. Feature selection is a well-known process in machine learning that selects a subset of features (columns) to use in model construction.

Eventually, after applying feature selection, you are left with the subset of features that are the most informative or predictive, and you drop the redundant or irrelevant ones. There are a few reasons to use feature selection techniques prior to model construction, among them simplifying models, avoiding the curse of dimensionality, and reducing overfitting.

While there are many feature selection techniques, we’ll focus on one based on mutual information. Mutual information is a measure from information theory that quantifies the mutual dependence of two random variables. Intuitively, mutual information measures how much knowing one of these variables reduces uncertainty about the other.
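For two discrete random variables X and Y, the standard definition is:

    I(X; Y) = Σ over x, y of  p(x, y) · log( p(x, y) / ( p(x) · p(y) ) )

which is zero exactly when X and Y are independent.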

In our example, the mutual information between the target variable “IsChurn” and the “IsRenewal” column will be very high, since knowing the value of one of these columns completely removes the uncertainty about the other. To apply feature selection, we can rank the explaining variables by their mutual information with the target variable and select the top K variables with the highest mutual information.

Our problem is a bit different from feature selection, where we want to choose the variables with the highest dependency on the target. Here, we want to avoid obvious insights caused by variables that are too dependent on the target. So, it stands to reason that we can do the opposite of feature selection and remove the variables with very high mutual information, right?

Not exactly. To do so, we would need to define a threshold, for example discarding all variables whose mutual information with the target is higher than 0.95. The problem is that mutual information is not a normalized measure: it can take any non-negative real value, so no single threshold works across datasets. Instead, we can use the normalized version of mutual information, which yields values between 0 and 1. Once the values are normalized, we can set a threshold and remove the columns whose normalized mutual information with the target exceeds it.
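A minimal sketch of this “opposite feature selection” using scikit-learn’s normalized mutual information (toy data; the 0.95 threshold and the column names are only illustrative):

    import numpy as np
    import pandas as pd
    from sklearn.metrics import normalized_mutual_info_score

    # Toy data: "IsRenewal" is the exact inverse of the target "IsChurn".
    rng = np.random.default_rng(0)
    churn = rng.integers(0, 2, 500)
    df = pd.DataFrame({
        "IsChurn": churn,
        "IsRenewal": 1 - churn,                    # perfectly dependent
        "Country": rng.choice(["US", "IT"], 500),  # unrelated to churn
    })

    THRESHOLD = 0.95
    keep = [
        col for col in df.columns
        if col != "IsChurn"
        and normalized_mutual_info_score(df["IsChurn"], df[col]) <= THRESHOLD
    ]
    print(keep)  # prints ['Country']: "IsRenewal" is dropped as too obvious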

User Feedback

Another technique for selecting the most relevant and interesting insights is to learn from user feedback. This can be accomplished in several ways. We can explicitly ask users for feedback by presenting “Like” and “Dislike” buttons next to the insights they receive.

We can also gather feedback implicitly from usage. Insights that were shared with other users can be treated as interesting, and we can detect whether a new widget was constructed based on a presented insight. By feeding these signals into machine learning models, we can learn the patterns of interesting insights and predict in advance how interesting a user will find a specific insight.
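The post doesn’t commit to a concrete model; as a hypothetical minimal sketch, each past insight could be encoded by its implicit signals and a simple classifier trained to score new insights:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Invented training data: one row per past insight, described by
    # [was_shared, widget_created, seconds_viewed]; label = user clicked "Like".
    X = np.array([
        [1, 1, 40.0],
        [0, 0,  3.0],
        [1, 0, 25.0],
        [0, 1, 30.0],
        [0, 0,  5.0],
        [1, 1, 35.0],
    ])
    y = np.array([1, 0, 1, 1, 0, 1])

    model = LogisticRegression().fit(X, y)
    # Predicted probability that a new insight will interest the user.
    print(model.predict_proba(np.array([[1, 0, 20.0]]))[0, 1])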

More to Come

Outsourced data science often doesn’t live up to customer expectations, because delivering insights requires a strong understanding of the domain data. With this basic overview of how Analyze with Insight Miner works, though, it’s easy to see how it can help data engineers, analysts, and developers scale the delivery of hidden insights beyond predefined dashboards to their end users.

I’m excited to share this feature with all of you. We are planning to add more cool new features in the future and to integrate Analyze with Insight Miner with other features in Sisense, so stay tuned!

Learn more about Analyze with Insight Miner here.

Tags: data science | Machine Learning
