Tag Archives: Analysis

Exploratory and Confirmatory Analysis: What’s the Difference?


How does a detective solve a case? She pulls together all the evidence she has, all the data that’s available to her, and she looks for clues and patterns.

At the same time, she takes a good hard look at individual pieces of evidence. What supports her hypothesis? What bucks the trend? Which factors work against her narrative? What questions does she still need to answer… and what does she need to do next in order to answer them?

Then, adding to the mix her wealth of experience and ingrained intuition, she builds a picture of what really took place – and perhaps even predicts what might happen next.

But that’s not the end of the story. We don’t simply take the detective’s word for it that she’s solved the crime. We take her findings to a court and make her prove it.

In a nutshell, that’s the difference between Exploratory and Confirmatory Analysis.

Data analysis is a broad church, and managing this process successfully involves several rounds of testing, experimenting, hypothesizing, checking, and interrogating both your data and approach.

Putting your case together, and then ripping apart what you think you’re certain about to challenge your own assumptions, are both crucial to Business Intelligence.

Before you can do either of these things, however, you have to be sure that you can tell them apart.

What is Exploratory Data Analysis?

Exploratory data analysis (EDA) is the first part of your data analysis process. There are several important things to do at this stage, but it boils down to this: figuring out what to make of the data, establishing the questions you want to ask and how you’re going to frame them, and coming up with the best way to present and manipulate the data you have to draw out those important insights.

That’s what it is, but how does it work?

As the name suggests, you’re exploring – looking for clues. You’re teasing out trends and patterns, as well as deviations from the model, outliers, and unexpected results, using quantitative and visual methods. What you find out now will help you decide the questions to ask, the research areas to explore and, generally, the next steps to take.

Exploratory Data Analysis involves things like: establishing the data’s underlying structure, identifying mistakes and missing data, establishing the key variables, spotting anomalies, checking assumptions and testing hypotheses in relation to a specific model, estimating parameters, establishing confidence intervals and margins of error, and figuring out a “parsimonious model” – i.e. one that you can use to explain the data with the fewest possible predictor variables.
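As a quick illustration of that first pass, here is a minimal sketch in Python with pandas; the tiny inline data set and column names are made up purely for illustration and are not taken from any real source.

import numpy as np
import pandas as pd

# A few made-up customer records for illustration
df = pd.DataFrame({
    "region": ["north", "south", "south", "east", None, "west"],
    "monthly_spend": [120.0, 95.5, 101.2, np.nan, 98.7, 880.0],
    "tenure_months": [12, 3, 7, 24, 5, 4],
})

print(df.dtypes)        # underlying structure: column types
print(df.isna().sum())  # missing data per column
print(df.describe())    # key variables: summary statistics

# Flag numeric outliers with a simple 1.5 * IQR rule
numeric = df.select_dtypes("number")
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
outliers = (numeric.lt(q1 - 1.5 * iqr) | numeric.gt(q3 + 1.5 * iqr)).sum()
print(outliers)         # count of outlying values per column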

In this way, your Exploratory Data Analysis is your detective work. To make it stick, though, you need Confirmatory Data Analysis.

What is Confirmatory Data Analysis?

Confirmatory Data Analysis is the part where you evaluate your evidence using traditional statistical tools such as significance, inference, and confidence.

At this point, you’re really challenging your assumptions. A big part of confirmatory data analysis is quantifying things like the extent any deviation from the model you’ve built could have happened by chance, and at what point you need to start questioning your model.

Confirmatory Data Analysis involves things like: testing hypotheses, producing estimates with a specified level of precision, regression analysis, and variance analysis.
In this way, your confirmatory data analysis is where you put your findings and arguments to trial.

Uses of Confirmatory and Exploratory Data Analysis

In reality, exploratory and confirmatory data analysis aren’t performed one after another, but continually intertwine to help you create the best possible model for analysis.

Let’s take an example of how this might look in practice.

Imagine that in recent months, you’d seen a surge in the number of users canceling their product subscription. You want to find out why this is, so that you can tackle the underlying cause and reverse the trend.

This would begin as exploratory data analysis. You’d take all of the data you have on the defectors, as well as on happy customers of your product, and start to sift through looking for clues. After plenty of time spent manipulating the data and looking at it from different angles, you notice that the vast majority of people that defected had signed up during the same month.

On closer investigation, you find out that during the month in question, your marketing team was shifting to a new customer management system and as a result, introductory documentation that you usually send to new customers wasn’t always going through. This would have helped to troubleshoot many teething problems that new users face.

Now you have a hypothesis: people are defecting because they didn’t get the welcome pack (and the easy solution is to make sure they always get a welcome pack!).

But first, you need to be sure that you were right about this cause. Based on your Exploratory Data Analysis, you now build a new predictive model that allows you to compare defection rates between those that received the welcome pack and those that did not. This is rooted in Confirmatory Data Analysis.

The results show a broad correlation between the two. Bingo! You have your answer.
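In practice, that confirmatory check would usually take the form of a formal significance test rather than an eyeball comparison. The sketch below, in Python with statsmodels and entirely made-up counts, tests whether customers who missed the welcome pack churned at a significantly higher rate.

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts, ordered as [no welcome pack, welcome pack]
churned = np.array([180, 35])    # customers who cancelled in each group
signups = np.array([900, 1100])  # total sign-ups in each group

# H0: both groups churn at the same rate
# H1: the group that missed the welcome pack churns at a higher rate
z_stat, p_value = proportions_ztest(churned, signups, alternative="larger")
print("churn rates:", churned / signups)
print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}")

A small p-value would support the welcome-pack hypothesis; a large one would send you back to the exploratory stage.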

Exploratory Data Analysis and Big Data

Getting a feel for the data is one thing, but what about when you’re dealing with enormous data pools?

After all, there are already so many ways to approach Exploratory Data Analysis: transforming the data through nonlinear operators, projecting it into a different subspace and examining the resulting distribution, or slicing and dicing it along different combinations of dimensions. Add sprawling amounts of data into the mix and suddenly the whole “playing detective” element feels a lot more daunting.

The important thing is to ensure that you have the right tech stack in place to cope with this, and to make sure you have access to the data you need in real time.

Two of the best statistical programming packages available for conducting Exploratory Data Analysis are R and S-Plus; R is particularly powerful and easily integrated with many BI platforms. That’s the first thing to consider.

The next step is ensuring that your BI platform has a comprehensive set of data connectors that – crucially – allow data to flow in both directions. This means you can keep importing Exploratory Data Analysis results and models from, for example, R to visualize and interrogate them – and also send data back from your BI solution to automatically update your model and results as new information flows into R.

In this way, you not only strengthen your Exploratory Data Analysis, you incorporate Confirmatory Data Analysis, too – covering all your bases of collecting, presenting and testing your evidence to help reach a genuinely insightful conclusion.

Your honor, we rest our case.

Ready to learn how to incorporate R for deeper statistical learning? You can watch our webinar with renowned R expert Jared Lander to learn how R can be used to solve real-life business problems.


Blog – Sisense

Qubole raises $25 million for data analysis service


Qubole, which provides software to automate and simplify data analytics, announced today that it has raised $25 million in a round co-led by Singtel Innov8 and Harmony Partners. Existing investors Charles River Ventures (CRV), Lightspeed Venture Partners, Norwest Venture Partners, and Institutional Venture Partners (IVP) also joined.

Founded in 2011, the Santa Clara, California-based startup provides the infrastructure to process and analyze data more easily.

It’s possible for companies to store large amounts of information in public clouds without building their own datacenters. But they still need to process and analyze the data, which is where Qubole comes in.

“Many companies struggle with creating data lakes,” CEO Ashish Thusoo noted. His solution is providing a cloud-based infrastructure to break the raw data down without having to break it into silos.

The chief executive is well-versed in the matter, as he led a team of engineers at Facebook that focused on data infrastructure. “It gave us a front row seat of how modern enterprises should be using data,” he said.

Qubole claims to be processing nearly an exabyte of data in the cloud per month for more than 200 enterprises, which include Autodesk, Lyft, Samsung, and Under Armour. In the case of Lyft, the ride-sharing company uses Qubole to process and analyze its data for route optimization, matching drivers with customers faster.

Qubole offers a platform as a service (PaaS) that currently runs on Amazon Web Services (AWS), Microsoft Azure, and Oracle Cloud. “Google is something we’re looking at,” said Thusoo.

He said the biggest competitors in the sector include AWS, Cloudera, and Databricks, which recently closed a $140 million round of funding.

To date, Qubole has raised a total of $75 million. It plans on using the new money to further develop its product, increase sales and marketing efforts, and expand in the Asia Pacific (APAC) region.

“There is a significant opportunity for big data in the Asia Pacific region,” said Punit Chiniwalla, senior director at Singtel Innov8, in a statement.

Qubole currently employs 240 people across its offices in California, India, and Singapore.



Big Data – VentureBeat

Improve Sales, Marketing, SEO and more with Sales Call Analysis


This transcript has been edited for length. To get the full measure, listen to the podcast.

Michelle Huff: What are people trying to learn from analyzing all these sales conversations?

Amit Bendov: Sales is a pretty complex craft. There’s a lot of things that you need to do right. Anything from being a good listener, to asking the right questions at the right time, to making sure that you uncover customer problems, that you create a differentiated value proposition in their minds, to making sure you have concrete action items. So, there’s quite a lot to learn. And then there’s also product and industry specific knowledge. For example, if you hire a new salesperson at Act-On, first they need to be good salespeople; second, they need to understand marketing automation, they need to understand the industry, they need to understand competitive differentiation.

So, all those things are little skills, and Gong identifies how you’re doing on them and how you compare against some of the best reps in your company. And then it starts coaching, either by providing you feedback or a manager feedback, and improving your reps so you can become better at all these little things: becoming a better listener, asking better questions, and becoming a better closer.

Although we sell primarily to the sales team, the product marketing and marketing teams are also big fans, because you can see if there are new messages that are being rolled out, and are customers responding to the new messages, are we telling the story right. What are customers asking about, both from a marketing and sales perspective, but also from a product, which features they like, which features they don’t like, which features they like about our competitors. So, it’s a great insight tool for marketeers and product guys.

Michelle: What do you think we could learn from unsuccessful sales calls?

Amit: I’m a believer we could learn more from successful calls. There are fewer ways to succeed than ways to fail. So, learning from what works is usually more powerful. But we all have our failures. And nobody’s perfect. Even some of the better calls have lots of areas to improve. One of the first things that people notice is the listen to talk ratio. And one of the things Gong measures is how much time you’re speaking on a call versus how much time the prospect is speaking. The optimal ratio, if you’re curious, is 46 percent. So, the salesperson is speaking for 46 percent of the time of the call and the prospect’s filling in the rest.

A lot of the sales reps – especially, the new hires – tend to speak as much as 80 percent of the time. Maybe it’s because they’re insecure, maybe they feel they need to push more. And that’s almost never a good idea.

Michelle: That’s such a good feedback loop. Forty-six percent, that’s a good ratio, right? It’s not that you’re not saying anything at all, but you’re letting them do the majority of the talking. That’s interesting.

Amit: It is like a Fitbit for sales calls. Once people see that feedback, it’s pretty easy to cure: ‘Oh my God, I spoke for 85 percent.’ And then they start setting personal goals to bring it down. And because they get feedback on every call, every time, it’s pretty easy to fix this problem.

Michelle: What are certain keywords when it comes to customer timelines that you train people to look for?

Amit: One of the things we’ve analyzed, and we ran these on a very large number of calls, is what’s a reliable response to the question regarding the time of the project. If a salesperson would ask a customer, when would you like to be live? And there’s a range of options. We found the word “probably,” as in probably mid-February, is a pretty good indicator that they’re serious about it. We don’t know exactly why. But I mean we can only guess that maybe they’ve taken their response more seriously. Versus “like we need this yesterday,” or “we have to have it tomorrow,” which are not really thoughtful answers. Or obviously, “well, maybe sometime next year,” which is very loose. So, the word probably is actually a pretty good indicator that the deal, if it happens, it doesn’t mean that it will close, but if it will, it will probably happen on that timeframe.

Michelle: That is super insightful. Because it’s almost counterintuitive. You’d think that if you hear the word probably, they’re not very firm.

Amit: Here’s another interesting one. A lot of the sales managers and coaches are obsessed with filler words, things like you know, like, basically. And sometimes they would drive the salespeople nuts with trying to bring down their filler word portions. And what we’ve found, we analyzed a large number of calls, and tried to see if there’s an impact on close rates, in calls where there are a lot of filler words and calls where there are not a lot of filler words. And we found absolutely zero correlation between the words and success. So, my advice to our listener, just don’t worry about it. Just say what you like. It doesn’t make a big difference. Or at least there is no proof that it makes a difference.

And my theory is that it’s more annoying when you listen to it in a recording versus in a live conversation. Because in a live conversation, both you and the customer are focused on the conversation and trying to understand what’s going on, and you don’t pay attention to those filler words. But when you listen to a recording, they’re much more prominent.

Michelle: I think at the end of the day, if there’s a connection, people are buying from people. I feel like those are all things where it’s good feedback, where you can just improve on how you up your game across the board on all your conversations.

Amit: Absolutely. You shine a light on this huge void that is the sales conversation, what’s happening in those conversations, so people clearly see the data, versus just relying on opinions, or self-perception, or subjective impressions of what is actually happening.

Michelle: We talked a lot about the sales use case. What are some of the other areas we can use Gong? How about customer success?

Amit: Almost all of our customers now use it for customer success as well. Again, here’s where you want to know what the customers are thinking. Are we taking good care of them? Are we saying the right things? Which customers are unhappy with our service, and what do we need to improve? So that again shines a light on how customers feel. Because without that, all you have is really some usage metrics and KPIs and surveys that are important, but don’t tell the complete story. You could have customers that use the product a lot and are not very thrilled with it. Or customers that don’t use it as much as you think they should, but are very excited. So Gong is definitely used for customer success by almost all of our customers.

Here is another interesting application. I know a lot of our audience are marketers. A lot of what we do in marketing has to do with messages, starting with how we describe the product. You can learn a lot about that from what your customers say, not what your salespeople are saying. If you listen to real customer calls, I mean existing users, and hear how they describe the product, that is probably very good language for you to use. If you listen to enough calls, you can get a theme that will help you explain your product better.

You might have heard that I’ve used ‘shine a light on your sales conversation.’ I didn’t make this up. We’ve interviewed 20 VPs of sales that use the product. And we looked at what are the common themes they use when they describe the product. And this came from them. So, it’s a great messaging research tool.

The other application you could use Gong for is SEO and SEM. Usually one of the first things people will say when they join an introductory call is, we’re looking for “X.” That “X” might not be what’s on your website, or how you describe your product. If you listen to a lot of calls and use their own words, those are the words you want to bid on or optimize for. Because that will drive a lot of traffic. And this is what people really search for and not the words you would normally use in your website. We do that, too.

Michelle: In our conversation, you’re bringing up your research and the insights you’re able to gain. Having these research reports seems to be a key part of your marketing and brand awareness. How do you leverage these research reports? What are you doing? And how is it helping?

Amit: We identified that as the key strategic marketing capability that we’re going to be counting on. It’s something that people have not seen before, so it’s like the first pictures from the Hubble telescope: OK, so here’s what the universe looks like. Or the first pictures of the Titanic: this is what it really looks like and that’s where it lies. So, we’re doing the same thing for sales conversations. It’s something that people are very passionate about. There are some 10,000 books on Amazon on how to sell. But nobody really has the facts. So, we identified this as an opportunity.

And we’re trying to do things that are interesting and useful. We try to keep it short. We take a small chunk, we investigate it, we publish the results, and try to make sure that whatever we do, people have at least some takeaway. So, it’s both interesting and immediately applicable to what they do.

That does generate activity on social media. We get hundreds of likes per post. We get press from it. Some of it got published on Business Insider, and Forbes, and it drives a lot of traffic. Plus, it’s in line with our message, shining the light on your sales conversation. We do get a lot of people coming into sales calls, ‘Hey, I read this blog on LinkedIn, this is very fascinating about how much I should be talking, what kind of questions I want to apply to my own team, which is what we sell.’

Michelle: I enjoyed the conversation. Thank you so much for taking the time to speak with us.

Amit: My pleasure, Michelle. I had a lot of fun. Thank you.


Act-On Blog

Spotfire Tips & Tricks: Hierarchical Cluster Analysis

Hierarchical cluster analysis (HCA) is a widely used method of data analysis that seeks to identify clusters, often without prior information about the data structure or the number of clusters. Strategies for hierarchical clustering generally fall into two types: agglomerative and divisive. Agglomerative is a bottom-up approach where each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. Divisive is a top-down approach where all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

Hierarchical cluster analysis in Spotfire

The algorithm used for hierarchical clustering in TIBCO Spotfire is a hierarchical agglomerative method. For row clustering, the cluster analysis begins with each row placed in a separate cluster. Then the distance between all possible combinations of two rows is calculated using a selected distance measure. The two most similar clusters are then grouped together and form a new cluster. In subsequent steps, the distance between the new cluster and all remaining clusters is recalculated using a selected clustering method. The number of clusters is thereby reduced by one in each iteration step. Eventually, all rows are grouped into one large cluster. The order of the rows in a dendrogram is defined by the selected ordering weight. The cluster analysis works the same way for column clustering.

Distance measures: The following measures can be used to calculate the distance or similarity between rows or columns

  • Correlation
  • Cosine correlation
  • Tanimoto coefficient
  • Euclidean distance
  • City block distance
  • Squared Euclidean distance
  • Half squared Euclidean distance

Clustering methods: The following clustering methods are available in Spotfire

  • UPGMA
  • WPGMA
  • Single linkage
  • Complete linkage
  • Ward’s method

Spotfire also provides options to normalize data and perform empty value replacement before performing clustering.

[Screenshot: the Hierarchical Clustering tool in Spotfire]

To demonstrate clustering with the hierarchical clustering tool, the Iris data set was used.

Select Tools > Hierarchical Clustering

Select the Data Table, and then click Select Columns

The Sepal Length, Sepal Width, Petal Length, and Petal Width columns were selected

[Screenshot: selecting the columns to cluster]

Next, in order to have row dendrograms, the Cluster Rows check box was selected

Click the Settings button to open the Edit Clustering Settings dialog and select a Clustering method and Distance measure. In this case default options were selected.

The hierarchical clustering calculation is performed, and a heat map visualization with the specified dendrograms is created in just a few clicks. A cluster column is also added to the data table and made available in the filters panel. The bar chart uses the cluster ID column to display species. The pruning line was set to 3 clusters, and it is observed that Setosa was predicted correctly as a single cluster, but some rows in Virginica and Versicolor were not placed in the right cluster; these are known issues.

[Screenshot: hierarchical clustering of the Iris data set, with dendrograms, heat map, and cluster bar chart]
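If you want to reproduce a comparable result outside Spotfire, here is a minimal sketch in Python with SciPy and scikit-learn. It assumes the Spotfire defaults correspond to UPGMA (average linkage) over Euclidean distances and prunes the dendrogram to 3 clusters; the exact cluster assignments may differ slightly from the screenshot above.

from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data  # sepal length/width, petal length/width

# Pairwise Euclidean distances, then bottom-up (agglomerative) merging
Z = linkage(pdist(X, metric="euclidean"), method="average")

# Cut the dendrogram so that exactly 3 clusters remain (the pruning line)
cluster_ids = fcluster(Z, t=3, criterion="maxclust")

# Compare cluster assignments against the known species labels
for species_index, species_name in enumerate(iris.target_names):
    assigned = sorted(set(cluster_ids[iris.target == species_index]))
    print(species_name, "->", assigned)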

Try this for yourself with TIBCO Spotfire! Test out a Spotfire free trial today. Check out other Tips and Tricks posts to help you #DoMoreWithSpotfire!


The TIBCO Blog

Getting Started with Azure Analysis Services Webinar Sept 21

The Azure Analysis Services team would like to help you get started with this exciting new offering that makes scaling Business Intelligence solutions easier than ever.

In this webinar, you will meet members of the Analysis Services team (such as Josh Caplan and Kay Unkroth), who will show you the easiest way to get started with Analysis Services. If you aren’t familiar with it, Azure Analysis Services provides enterprise-grade data modeling in the cloud. This enables you to mash up and combine data from multiple sources, define metrics, and secure your data in a single, trusted semantic data model. The data model provides an easier and faster way for your users to browse massive amounts of data with client applications like Power BI, Excel, Reporting Services, and third-party and custom apps.

When:

September 21st 10AM PST

Subscribe to watch:

https://www.youtube.com/watch?v=MmCaggAnvhM



Microsoft Power BI Blog | Microsoft Power BI

Using Azure Analysis Services on Top of Azure Data Lake Storage

The latest release of SSDT Tabular adds support for Azure Data Lake Store (ADLS) to the modern Get Data experience (see the following screenshot). Now you can augment your big data analytics workloads in Azure Data Lake with Azure Analysis Services and provide rich interactive analysis for selected data subsets at the speed of thought!

[Screenshot: the ADLS connector available in the modern Get Data experience]

If you are unfamiliar with Azure Data Lake, check out the various articles at the Azure Data Lake product information site. Also read the article “Get started with Azure Data Lake Analytics using Azure portal.”

Following these instructions, I provisioned a Data Lake Analytics account called tpcds for this article and a new Data Lake Store called tpcdsadls. I also added one of my existing Azure Blob Storage accounts, which contains a 1 TB TPC-DS data set that I had already created and used in the series “Building an Azure Analysis Services Model on Top of Azure Blob Storage.” The idea is to move this data set into Azure Data Lake as a highly scalable and sophisticated analytics backend, from which to serve a variety of Azure Analysis Services models.

For starters, Azure Data Lake can process raw data and put it into targeted output files so that Azure Analysis Services can import the data with less overhead. For example, you can remove any unnecessary columns at the source, which eliminates about 60 GB of unnecessary data from my 1 TB TPC-DS data set and therefore benefits processing performance, as discussed in “Building an Azure Analysis Services Model on Top of Azure Blob Storage – Part 3.”

Moreover, with relatively little effort and a few small changes to a U-SQL script, you can provide multiple targeted data sets to your users, such as a small data set for modelling purposes plus one or more production data sets with the most relevant data. In this way, a data modeler can work efficiently in SSDT Tabular against the small data set prior to deployment, and after production deployment, business users can get the relevant information they need from your Azure Analysis Services models in Microsoft Power BI, Microsoft Office Excel, and Microsoft SQL Server Reporting Services. And if a data scientist still needs more than what’s readily available in your models, you can use Azure Data Lake Analytics (ADLA) to run further U-SQL batch jobs directly against all the terabytes or petabytes of source data you may have. Of course, you can also take advantage of Azure HDInsight as a highly reliable, distributed and parallel programming framework for analyzing big data. The following diagram illustrates a possible combination of technologies on top of Azure Data Lake Store.

[Diagram: analyzing big data with a combination of technologies on top of Azure Data Lake Store]

Azure Data Lake Analytics (ADLA) can process massive volumes of data extremely quickly. Take a look at the following screenshot, which shows a Data Lake job processing approximately 2.8 billion rows of TPC-DS store sales data (~500 GB) in under 7 minutes!

[Screenshot: a Data Lake job processing store sales in under 7 minutes]

The screen in the background uses source files in Azure Data Lake Storage and the screen in the foreground uses source files in Azure Blob Storage connected to Azure Data Lake. The performance is comparable, so I decided to leave my 1 TB TPC-DS data set in Azure Blob Storage, but if you want to ensure absolute best performance or would like to consolidate your data in one storage location, consider moving all your raw data files into ADLS. It’s straightforward to copy data from Azure Blob Storage to ADLS by using the AdlCopy tool, for example.

With the raw source data in a Data Lake-accessible location, the next step is to define the U-SQL scripts to extract the relevant information and write it along with column names to a series of output files. The following listing shows a general U-SQL pattern that can be used for processing the raw TPC-DS data and putting it into comma-separated values (csv) files with a header row.

@raw_parsed = EXTRACT child_id int,
                      <column 1> string,
                      <column 2> string,
                      ...
                      empty string
FROM "<source folder>/{*}_{child_id}_100.dat"
USING Extractors.Text(delimiter: '|');

@filtered_results = SELECT <columns to keep>
FROM @raw_parsed
<optional clause to restrict the rows>;

OUTPUT @filtered_results
TO "<output folder>/<output file>.csv"
USING Outputters.Csv(outputHeader:true);

The next listing shows a concrete example based on the small income_band table. Note how the query extracts a portion of the file name into a virtual child_id column in addition to the actual columns from the source files. This child_id column comes in handy later when generating multiple output csv files for the large TPC-DS tables. Also, the row restriction (the ORDER BY … FETCH 100 ROWS clause) is not strictly needed in this example because the income_band table only has 20 rows, but it’s included to illustrate how to restrict the amount of data per table to a maximum of 100 rows to create a small modelling data set.

@raw_parsed = EXTRACT child_id int,
                      b_income_band_sk string,
                      b_lower_bound string,
                      b_upper_bound string,
                      empty string
FROM "wasb://income-band@aasuseast2/{*}_{child_id}_100.dat"
USING Extractors.Text(delimiter: '|');

@filtered_results = SELECT b_income_band_sk,
                           b_lower_bound,
                           b_upper_bound
FROM @raw_parsed
ORDER BY child_id ASC
FETCH 100 ROWS;

You can find complete sets of U-SQL scripts to generate output files for different scenarios (modelling, single csv file per table, multiple csv files for large tables, and large tables filtered by last available year) at the GitHub repository for Analysis Services.

For instance, for generating the modelling data set, there are 25 U-SQL scripts to generate a separate csv file for each TPC-DS table. You can run each U-SQL script manually in the Microsoft Azure portal, yet it is more convenient to use a small Microsoft PowerShell script for this purpose. Of course, you can also use Azure Data Factory, which among other things enables you to run U-SQL scripts on a scheduled basis. For this article, however, the following Microsoft PowerShell script suffices.

$script_folder = "<Path to U-SQL Scripts>"
$adla_account = "<ADLA Account Name>"
Login-AzureRmAccount -SubscriptionName "<Windows Azure Subscription Name>"

Get-ChildItem $script_folder -Filter *.usql |
Foreach-Object {
    $job = Submit-AdlJob -Name $_.Name -AccountName $adla_account -ScriptPath $_.FullName -DegreeOfParallelism 100
    Wait-AdlJob -Account $adla_account -JobId $job.JobId
}

Write-Host "Finished processing U-SQL jobs!";

It does not take long for Azure Data Lake to process the requests. You can use the Data Explorer feature in the Azure Portal to double-check that the desired csv files have been generated successfully, as the following screenshot illustrates.

[Screenshot: the output csv files for modelling shown in Data Explorer]

With the modelling data set in place, you can finally switch over to SSDT and create a new Analysis Services Tabular model at the 1400 compatibility level. Make sure you have the latest version of the Microsoft Analysis Services Projects package installed so that you can pick Azure Data Lake Store from the list of available connectors. You will be prompted for the Azure Data Lake Store URL and you must sign in using an organizational account. Currently, the Azure Data Lake Store connector only supports interactive logons, which is an issue for processing the model in an automated way in Azure Analysis Services, as discussed later in this article. For now, let’s focus on the modelling aspects.

The Azure Data Lake Store connector does not automatically establish an association between the folders or files in the store and the tables in the Tabular model. In other words, you must create each table individually and select the corresponding csv file in Query Editor. This is a minor inconvenience. It also implies that each table expression specifies the folder path to the desired csv file individually. If you are using a small data set from a modelling folder to create the Tabular model, you would need to modify every table expression during production deployment to point to the desired production data set in another folder. Fortunately, there is a way to centralize the folder navigation by using a shared expression so that only a single expression requires an update on production deployment. The following diagram depicts this design.

[Diagram: centralizing folder navigation by using a shared expression]

To implement this design in a Tabular model, use the following steps:

  1. Start Visual Studio and check under Tools -> Extensions and Updates that you have the latest version of Microsoft Analysis Services Projects installed.
  2. Create a new Tabular project at the 1400 compatibility level.
  3. Open the Model menu and click on Import From Data Source.
  4. Pick the Azure Data Lake Store connector, provide the storage account URL, and sign in by using an Organizational Account. Click Connect and then OK to create the data source object in the Tabular model.
  5. Because you chose Import From Data Source, SSDT displays Query Editor automatically. In the Content column, click on the Table link next to the desired folder name (such as modelling) to navigate to the desired root folder where the csv files reside.
  6. Right-click the Table object in the right Queries pane, and click Create Function. In the No Parameters Found dialog box, click Create.
  7. In the Create Function dialog box, type GetCsvFileList, and then click OK.
  8. Make sure the GetCsvFileList function is selected, and then on the View menu, click Advanced Editor.
  9. In the Edit Function dialog box informing you that updates from the Table object will no longer propagate to the GetCsvFileList function if you continue, click OK.
  10. In Advanced Editor, note how the GetCsvFileList function navigates to the modelling folder, enter a whitespace character at the end of the last line to modify the expression, and then click Done.
  11. In the right Queries pane, select the Table object, and then in the left Applied Steps pane, delete the Navigation step, so that Source is the only remaining step.
  12. Make sure the Formula Bar is displayed (View menu -> Formula Bar), and then redefine the Source step as = GetCsvFileList() and press Enter. Verify that the list of csv files is displayed in Query Editor, as in the following screenshot.
    [Screenshot: invoking GetCsvFileList in Query Editor]
  13. For each table you want to import:
    1. Right-click the existing Table object and click Duplicate.
    2. In the Content column, click on the Binary link next to the desired file name (such as call_center) and verify that Query Editor parses the columns and detects the data types correctly.
    3. Rename the table according to the csv file you selected (such as call_center).
    4. Right-click the renamed table object (such as call_center) in the Queries pane and click Create New Table.
    5. Verify that the renamed table object (such as call_center) is no longer displayed in italic, which indicates that the query will now be imported as a table into the Tabular model.
  14. After you created all desired tables by using the sequence above, delete the original Table object by right-clicking on it and selecting Delete.
  15. In Query Editor, click Import to add the GetCsvFileList expression and the tables to your Tabular model.

During the import, SSDT Tabular pulls in the small modelling data set. And prior to production deployment, it is now a simple matter of updating the shared expression by right-clicking on the Expressions node in Tabular Model Explorer and selecting Edit Expressions, and then changing the folder name in Advanced Editor. The below screenshot highlights the folder name in the GetCsvFileList expression. And if each table can find its corresponding csv file in the new folder location, deployment and processing can succeed.

[Screenshot: changing the csv folder name in the GetCsvFileList expression]

Another option is to deploy the model with the Do Not Process deployment option and use a small TOM application in Azure Functions to process the model on a scheduled basis. Of course, you can also use SSMS to connect to your Azure Analysis Services server and send a processing command, but it might be inconvenient to keep SSDT or SSMS connected for the duration of the processing cycle. Processing against the full 1 TB data set with a single csv file per table took about 15 hours to complete. Processing with four csv files/partitions for the seven large tables and maxActiveConnections on the data source set to 46 concurrent connections took roughly 6 hours. This is remarkably faster in comparison to using general BLOB storage, as in the Building an Azure Analysis Services Model on Top of Azure Blob Storage article, and suggests that there is potential for performance improvements in the Azure BLOB storage connector.

[Screenshot: processing the model]

Even the processing performance against Azure Data Lake could possibly be further increased, as the processor utilization on an S9 Azure Analysis Services server suggests (see the following screenshot). For the first 30 minutes, processor utilization is close to the maximum and then it decreases as the AS engine finishes more and more partitions and tables. Perhaps with an even higher degree of parallelism, such as with eight or twelve partitions for each large table, Azure AS could keep processor utilization near the maximum for longer and finish the processing work sooner. But processing optimizations through elaborate table partitioning schemes are beyond the scope of this article. The processing performance achieved with four partitions on each large table suffices to conclude that Azure Data Lake is a very suitable big-data backend for Azure Analysis Services.

[Screenshot: QPU utilization on an S9 Azure Analysis Services server during processing]

There is currently only one important caveat: The Azure Data Lake Store connector only supports interactive logons. When you define the Azure Data Lake Store data source, SSDT prompts you to log on to Azure Data Lake. The connector performs the logon and then stores the obtained authentication token in the model. However, this token only has a limited lifetime. Chances are fair that processing succeeds after the initial deployment, but when you come back the next day and want to process again, you get an error that “The credentials provided for the DataLake source are invalid.” See the screenshot below. Either you deploy the model again in SSDT or you right-click the data source in SSMS and select Refresh Credentials to log on to Data Lake again and submit fresh tokens to the model.

[Screenshot: the credentials error and the Refresh Credentials option]

A subsequent article is going to cover how to handle authentication tokens programmatically, so stay tuned for more on connecting to Azure Data Lake and other big data sources on the Analysis Services team blog. And as always, please deploy the latest monthly release of SSDT Tabular and send us your feedback and suggestions by using SSASPrev at Microsoft.com or any other available communication channels such as UserVoice or MSDN forums.


Analysis Services Team Blog

Deploying Analysis Services and Reporting Services Project Types in Visual Studio 2017

(Co-authored by Mike Mallit)

SQL Server Data Tools (SSDT) adds four different project types to Visual Studio 2017 to create SQL Server Database, Analysis Services, Reporting Services, and Integration Services solutions. The Database Project type is directly included with Visual Studio. The Analysis Services and Reporting Services project types are available as separate Visual Studio Extension (VSIX) packages. The Integration Services project type, on the other hand, is only available through the full SSDT installer due to dependencies on COM components, VSTA, and the SSIS runtime, which cannot be packed into a VSIX file. The full SSDT for Visual Studio 2017 installer is available as a first preview at https://docs.microsoft.com/en-us/sql/ssdt/download-sql-server-data-tools-ssdt.

This blog article covers the VSIX packages for the Analysis Services and Reporting Services project types, specifically the deployment and update of these extension packages as well as troubleshooting best practices.

In Visual Studio 2017, the Analysis Services and Reporting Services project types are always deployed through the VSIX packages, even if you deploy these project types by using the full SSDT installer. The SSDT installer simply downloads the VSIX packages, which ensures that you are deploying the latest released versions. But you can also deploy the VSIX packages individually; you can find them both in the Visual Studio Marketplace.

The SSDT Installer is the right choice if you don’t want to add the Analysis Services and Reporting Services project types to an existing instance of Visual Studio 2017 on your workstation. The SSDT Installer installs a separate instance of SSDT for Visual Studio 2017 to host the Analysis Services and Reporting Services project types.

On the other hand, if you want to deploy the VSIX packages in an existing Visual Studio instance, it is perhaps easiest to display the Extensions and Updates dialog box in Visual Studio by clicking on Extensions and Updates on the Tools menu, then expanding Online in the left pane and selecting the Visual Studio Marketplace node. Then search for “Analysis Services” or “Reporting Services” and then click the Download button next to the desired project type, as the following screenshot illustrates. After downloading the desired VSIX package, Visual Studio schedules the installation to begin when all Visual Studio instances are closed.

[Screenshot: downloading the project type VSIX packages from Visual Studio Marketplace]

The actual VSIX installation is very straightforward. The only input requirement is to accept the licensing terms by clicking on the Modify button in the VSIX Installer dialog box.

Of course, you can also use the full SSDT Installer to add the Analysis Services and Reporting Services project types to an existing Visual Studio 2017 instance. These project types support all available editions of Visual Studio 2017. The SSDT installer requires Visual Studio 2017 version 15.3 or later. Earlier versions are not supported, so make sure you apply the latest updates to your Visual Studio 2017 instances.

One of the key advantages of VSIX packages is that Visual Studio automatically informs you when updates are available. So, it’s less burdensome to stay on the latest updates. This is especially important considering that updates for the Analysis Services and Reporting Services project types are released monthly. Whether you chose the SSDT Installer or the VSIX deployment method, you get the same update notifications because both methods deploy the same VSIX packages.

You can also check for updates at any time by using the Extensions and Updates dialog box in Visual Studio. In the left pane, expand Updates, and then select Visual Studio Marketplace to list any available updates that have not yet been deployed.

Although VSIX deployments are very straightforward, there are situations that may require troubleshooting, such as when a deployment completes unsuccessfully or when a project type fails to load. When troubleshooting the deployment of the Analysis Services and Reporting Services project types, keep the following software dependencies in mind:

  • The Analysis Services and Reporting Services project types require Visual Studio 2017 version 15.3 or later. Among other things, this is because of Microsoft OLE DB Provider for Analysis Services (MSOLAP). To load MSOLAP, the project types require support for Registration-Free COM, which necessitates Visual Studio 2017 version 15.3 at a minimum.
  • The VSIX packages for Analysis Services and Reporting Services depend on a shared VSIX, called Microsoft.DataTools.Shared.vsix, which is a hidden package that doesn’t get installed separately. It is installed when you select the Microsoft.DataTools.AnalysisServices.vsix or the Microsoft.DataTools.ReportingServices.vsix. Most importantly the shared VSIX contains data providers for Analysis Services (MSOLAP, ADOMD.NET, and AMO), which both project types rely on.

If you are encountering deployment or update issues, use the following procedure to try to resolve the issue:

  1. Check whether you have installed any previous versions of the VSIX packages. If no previous versions exist, skip steps 2 and 3. If previous versions are present, continue with step 2.
  2. Uninstall any previous instances of the project types and verify that the shared VSIX is also uninstalled:
    1. Start the Visual Studio Installer application and click Modify on the Visual Studio instance you are using.
    2. Click on Individual Components at the top, and then scroll to the bottom of the list of installed components.
    3. Under Uncategorized, clear the checkboxes for Microsoft Analysis Services Projects, Microsoft Reporting Services Projects, and Microsoft BI Shared Components for Visual Studio. Make sure you remove all three VSIX packages, and then click the Modify button.

[Screenshot: clearing the VSIX component checkboxes in the Visual Studio Installer]

Note: Occasionally, an orphaned Microsoft BI Shared Components for Visual Studio package causes deployment issues. If an entry exists, uninstall it.

  3. Check the following folder paths to make sure these folders do not exist. If any of them exist, delete them.
    1. C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\Common7\IDE\CommonExtensions\Microsoft\BIShared
    2. C:\Program Files (x86)\Microsoft Visual Studio\SSDT\Enterprise\Common7\IDE\CommonExtensions\Microsoft\SSAS
    3. C:\Program Files (x86)\Microsoft Visual Studio\SSDT\Enterprise\Common7\IDE\CommonExtensions\Microsoft\SSRS
    4. C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\Common7\IDE\PublicAssemblies\Microsoft BI
  4. Install the VSIX packages for the Analysis Services and Reporting Services project types from Visual Studio Marketplace and verify that the issue was resolved. If you are still not successful, continue with step 5.
  5. Repair the Visual Studio instance to fix any shell-related issues that may prevent the deployment or update of the VSIX packages as follows:
    1. Start the Visual Studio Installer application, click More Options on the instance, and then choose Repair.
    2. Alternatively, use the command line with Visual Studio closed. Run the following command: "%programfiles(x86)%\Microsoft Visual Studio\Installer\resources\app\layout\InstallCleanup.exe" -full

Note: InstallCleanup.exe is a utility to delete cache and instance data for Visual Studio 2017. It works across instances and deletes existing corrupt, partial and full installations. For more information, see Troubleshooting Visual Studio 2017 installation and upgrade failures.

  6. Repeat the entire procedure again to uninstall packages, delete folder paths, and then re-install the project types.

In short, version mismatches between the shared VSIX and the project type VSIX packages can cause deployment or update issues as well as a damaged Visual Studio instance. Uninstalling the VSIX packages and deleting any extension folders that may have been left behind takes care of the former and repairing or cleaning the Visual Studio instance takes care of the latter root cause.

Another known cause of issues relates to the presence of older SSAS and SSRS VSIX packages when installing the preview release of the SSDT Installer. The newer Microsoft BI Shared Components for Visual Studio VSIX package included in the SSDT Installer is incompatible with the SSAS and SSRS VSIX packages, and so you must uninstall the existing SSAS and SSRS VSIX packages prior to running the SSDT Installer. As soon as the SSAS and SSRS VSIX packages version 17.3 are released to Visual Studio Marketplace, then upgrading the packages prior to running SSDT Installer also helps to avoid the version mismatch issues.

And that’s it for a quick overview of the VSIX deployment, update, and troubleshooting for the Analysis Services and Reporting Services project types in Visual Studio 2017. And as always, please send us your feedback and suggestions by using ProBIToolsFeedback at Microsoft.com. Or use any other available communication channels such as UserVoice or MSDN forums.


Analysis Services Team Blog

Online Analysis Services Course: Developing a Multidimensional Model

Check out the excellent, new online course by Peter Myers and Chris Randall for Microsoft Learning Experiences (LeX). Learn how to develop multidimensional data models with SQL Server 2016 Analysis Services. The complete course is available on edX at no cost to audit, or you can highlight your new knowledge and skills with a Verified Certificate for a small charge. Enrollment is available at edX.


Analysis Services Team Blog

Model Comparison and Merging for Analysis Services

Relational-database schema comparison and merging is a well-established market. Leading products include SSDT Schema Compare and Redgate SQL Compare, which is partially integrated into Visual Studio. These tools are used by organizations seeking to adopt a DevOps culture to automate build-and-deployment processes and increase the reliability and repeatability of mission critical systems.

Comparison and merging of BI models also introduces opportunities to bridge the gap between self-service and IT-owned “corporate BI”. This helps organizations seeking to adopt a “bi-modal BI” strategy to mitigate the risk of competing IT-owned and business-owned models offering redundant solutions with conflicting definitions.

Such functionality is available for Analysis Services tabular models. Please see the Model Comparison and Merging for Analysis Services whitepaper for detailed usage scenarios, instructions and workflows.

This is made possible with BISM Normalizer, which we are pleased to announce now resides on the Analysis Services Git repo. BISM Normalizer is a popular open-source tool that works with Azure Analysis Services and SQL Server Analysis Services. All tabular model objects and compatibility levels, including the new 1400 compatibility level, are supported. As a Visual Studio extension, it is tightly integrated with source control systems, build and deployment processes, and model management workflows.

[Screenshot: schema comparison in BISM Normalizer]

Thanks to Javier Guillen (Blue Granite), Chris Webb (Crossjoin Consulting), Marco Russo (SQLBI), Chris Woolderink (Tabular) and Bill Anton (Opifex Solutions) for their contributions to the whitepaper.


Analysis Services Team Blog

What’s new in SQL Server 2017 RC1 for Analysis Services

The RC1 public preview of SQL Server 2017 is available here! It includes Dynamic Management View improvements for tabular models with compatibility level 1200 and 1400.

DMVs are useful in numerous scenarios including the following.

  • Exposing information about server operations and health.
  • Documentation of tabular models.
  • Numerous client tools use DMVs for a variety of reasons. For example, BISM Normalizer uses them to perform impact analysis for incremental metadata deployment and merging.

RC1 rounds off the DMV improvements introduced in CTP 2.0 and CTP 2.1.

DISCOVER_CALC_DEPENDENCY now works with 1200 and 1400 models. 1400 models show dependencies between M partitions, M expressions and structured data sources.

Further enhancements in RC1 include the following for 1200 (where applicable) and 1400 models.

  • Named dependencies result from DAX or M expressions that explicitly reference other objects. RC1 introduces named dependencies for DAX in addition to DAX data dependencies. Previous versions of this DMV returned only data dependencies. In many cases a dependency is both named and data. RC1 returns the superset.
  • In addition to dependencies between M partitions, M expressions and structured data sources, dependencies between provider data sources and non-M partitions (these are the traditional partition and data source types for tabular models) are returned in RC1.
  • The following new schema restrictions have been introduced to allow focused querying of the DMV. The table below shows the intersection of the schema restrictions with the type of objects covered.
    • KIND with values of ‘DATA_DEPENDENCY’ or ‘NAMED_DEPENDENCY’.
    • OBJECT_CATEGORY with values of ‘DATA_ACCESS’ or ‘ANALYSIS’.
  Dependency type                             KIND              OBJECT_CATEGORY
  Mashup                                      NAMED_DEPENDENCY  DATA_ACCESS
  Provider data source & non-M partitions    DATA_DEPENDENCY   DATA_ACCESS
  DAX named dependencies                      NAMED_DEPENDENCY  ANALYSIS
  Other data dependencies                     DATA_DEPENDENCY   ANALYSIS
  • Mashup dependencies are dependencies between M partitions, M expressions and structured data sources. They are named, M-expression based, and only apply to 1400 models.
  • Provider data source & non-M partitions are dependencies between traditional partitions and provider data sources. They are based on properties in tabular metadata rather than expression based, so are not considered “named”. They are available for 1200 and 1400 models.
  • DAX named dependencies are explicit named references in DAX expressions. They are available for 1200 and 1400 models.
  • Other data dependencies are data dependencies for DAX expressions and other types of data dependencies such as hierarchies and relationships. To avoid potential performance issues, data dependencies from DAX measures are only returned when using a QUERY schema restriction. They are available for 1100, 1103, 1200 and 1400 models.

1100 and 1103 models only return other data dependencies, and they ignore the new schema restrictions.

DAX data dependencies

DAX data dependencies and DAX named dependencies are not necessarily the same thing. For example, a calculated table called ShipDate with a DAX formula of “=DimDate” clearly has a named dependency (and data dependency) on the DimDate table. It also has data dependencies on the columns within DimDate, but these are not considered named dependencies.

Example: [KIND]=’NAMED_DEPENDENCY’

The following query returns the output shown below. All DAX and M expression named references in the model are included. These can originate from calculated tables/columns, measures, M partitions, row-level security filters, detail rows expressions, etc.

SELECT * FROM SYSTEMRESTRICTSCHEMA
    ($SYSTEM.DISCOVER_CALC_DEPENDENCY, [KIND] = 'NAMED_DEPENDENCY')

[Screenshot: DISCOVER_CALC_DEPENDENCY output with the NAMED_DEPENDENCY restriction]

Example: [KIND]=’DATA_DEPENDENCY’

The following query returns the output shown below. Some data dependencies happen to also be named dependencies, in which case they are returned by this query and the one above with a NAMED_DEPENDENCY schema restriction.

SELECT * FROM SYSTEMRESTRICTSCHEMA
    ($SYSTEM.DISCOVER_CALC_DEPENDENCY, [KIND] = 'DATA_DEPENDENCY')

[Screenshot: DISCOVER_CALC_DEPENDENCY output with the DATA_DEPENDENCY restriction]

Example: [OBJECT_CATEGORY]=’DATA_ACCESS’

The following query returns the output shown below. Partitions, M expressions and data source dependencies are included.

SELECT * FROM SYSTEMRESTRICTSCHEMA
    ($SYSTEM.DISCOVER_CALC_DEPENDENCY, [OBJECT_CATEGORY] = 'DATA_ACCESS')

[Screenshot: DISCOVER_CALC_DEPENDENCY output with the DATA_ACCESS restriction]

Example: [OBJECT_CATEGORY]=’ANALYSIS’

The following query returns the output shown below. The results of this query are mutually exclusive with the results above with a DATA_ACCESS schema restriction.

SELECT * FROM SYSTEMRESTRICTSCHEMA
    ($SYSTEM.DISCOVER_CALC_DEPENDENCY, [OBJECT_CATEGORY] = 'ANALYSIS')

[Screenshot: DISCOVER_CALC_DEPENDENCY output with the ANALYSIS restriction]

RC1 also provides improvements for the MDSCHEMA_MEASUREGROUP_DIMENSIONS DMV, which is used by various client tools to show measure dimensionality. For example, the Explore feature in Excel Pivot Tables allows the user to cross-drill to dimensions related to the selected measures.

RC1 corrects the cardinality columns, which were previously showing incorrect values.

SELECT * FROM $System.MDSCHEMA_MEASUREGROUP_DIMENSIONS;

[Screenshot: MDSCHEMA_MEASUREGROUP_DIMENSIONS output]

Download now!

To get started, download SQL Server 2017 RC1. The latest release of the Analysis Services VSIX for SSDT is available here. VSIX deployment for Visual Studio 2017 is discussed in this blog post.

Be sure to keep an eye on this blog to stay up to date on Analysis Services!


Analysis Services Team Blog