Tag Archives: lake

Querying Data in Azure Data Lake Store with Power BI

Let’s say you have data in Azure Data Lake Store (ADLS) that you want to report directly from in Power BI. You might be doing this for early exploratory data efforts, or you might have some curated data which has been prepared in the data lake for analysis tools such as Power BI. 

The screen shots & functionality discussed for ADLS and Power BI are from early May 2018.

In the example I’m using, my Data Lake Store service is called “bankingadls.” The file structure I’m working with looks like this:

[Screenshot: ADLS file structure]

We’re going to be pulling the data files from the RawData subfolders into Power BI.

What do you need to be able to retrieve ADLS data into Power BI?

  • Read and execute permissions to the data stored in ADLS. I talked about permissions in this blog post.
  • The source file(s) need a format with a reliable structure that Power BI can understand.
  • If you’re combining multiple files (such as all 3 shown in the screenshot above), they need to have the same structure/file layout. Power BI is pretty smart about this, so if you’re missing a column in one file that’s present in the other files, it can cope with something simple like that from what I’ve seen. The rule of thumb when designing a data lake is to use the same data structure within each folder, so hopefully varying data formats aren’t a problem; otherwise you’ll have to cleanse & standardize the data before it can be analyzed.

What can you query from ADLS?

You can connect to the data stored in Azure Data Lake Store. What you *cannot* connect to currently is the data stored in the Catalog tables/views/stored procedures within Azure Data Lake Analytics (hopefully connectivity to the ADLA Catalog objects from tools other than U-SQL is available soon–you can vote for Power BI connectivity to the Catalog on this UserVoice suggestion).

You’re not sending a U-SQL query here. Rather, we’re sending a web API request to an endpoint.

With an ADLS data source, you have to import the data into Power BI Desktop. There is no option for DirectQuery.

Should you use https:// or adl:// to connect to ADLS?

In the portal you may notice that you have two ways to connect:

[Screenshot: ADLS URL and URI shown in the Azure portal]

https:// is a WebHDFS-compatible endpoint which allows integration of the ADL Store with open source projects and other languages.

adl:// is the “AzureDataLakeFilesystem” which is also secure, and provides additional performance enhancements on top of WebHDFS. 

I typically use the adl:// endpoint in case it does give me a performance boost. From what I can tell in Fiddler, it looks like both methods send the same GET request from Power BI which looks like this:

[Screenshot: GET request sent from Power BI, captured in Fiddler]
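
For reference, you can reproduce roughly the same kind of call yourself. The sketch below (PowerShell, with a hypothetical file path and a placeholder OAuth token) issues a WebHDFS-style OPEN request against the account’s https endpoint; the exact query string Power BI generates may differ.

#Rough sketch of the WebHDFS-style call (assumed account name, hypothetical file path)
$accessToken = '<OAuth2 bearer token for resource https://datalake.azure.net/>'
$accountName = 'bankingadls'
$filePath = '/RawData/2017/01/sample.json'   #hypothetical path

Invoke-RestMethod `
    -Method Get `
    -Uri "https://$accountName.azuredatalakestore.net/webhdfs/v1$($filePath)?op=OPEN&read=true" `
    -Headers @{ Authorization = "Bearer $accessToken" }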

Connecting to one file in ADLS from Power BI

Connecting to one individual file in ADLS is pretty straightforward. In the Get Data area within Power BI Desktop, we’ll choose the Azure Data Lake Store connector:

[Screenshot: Azure Data Lake Store connector in Get Data]

When prompted for the URL, you can put in either the URI or the URL – either will work. For one file, you’ll use the full path to that file:

[Screenshot: URL for a single file in ADLS]

Go ahead and respond to the security dialog box when prompted (if you’re accessing this ADL Store for the first time). Next let’s choose Edit to go into the Query Editor:

[Screenshot: ADLS connection dialog]

Here’s where it might look funny to you at first. What it’s actually showing you here is the metadata. To get to the data, click on the hyperlink called Binary:

[Screenshot: file metadata with the Binary hyperlink]

And now we’ve got a preview of the actual file data. Power BI auto-created a few steps to tidy up the data:

[Screenshot: preview of the file data with auto-created steps]

At this point, you can Close & Apply. Carry on with fine-tuning the data model properties and creating visuals. 

Connecting to a folder containing multiple files from Power BI

Connecting to a folder is helpful when you have numerous files you want to consolidate. In a data lake, it’s common to have data partitioned (into subfolders) by date or other logical ways to store the data incrementally. 

There are a couple of differences when retrieving data from the folder level instead. The first difference is we specify the folder path in the URL:

[Screenshot: URL for a folder in ADLS]

The next difference is how the metadata is initially presented. The metadata for this “table” is actually the subfolder – for us, that’s the 2017 subfolder under RawData. If we click on the Table hyperlink, we are taken to the metadata for the monthly folders:

[Screenshot: folder metadata in the Query Editor]

At this point, we don’t want to navigate down any more because we want files that are present in each of those monthly folders. What we want to do is click on the “Expand” button:

[Screenshot: Expand button in the Query Editor]

Your steps above might differ a bit depending on your folder structure. 

Now we see that it has detected the presence of the individual files across the multiple folders. Now it’s time to click on the “Combine Files” button:

[Screenshot: Combine Files button]

The Query Editor uses the first file to determine what the column structure of the data will be:

[Screenshot: Combine Files dialog using the first file for column structure]

When combining files, be sure to choose “Skip files with errors.” I had some null rows along with the binary file list which Power BI saw as file errors so it failed when trying to complete a full data refresh. The error was “Failed to save modifications to the server. The parameter is expected to be of type Text.Type or Binary.Type.” I believe what this meant was the sample query & parameter it used in the Query Editor to figure out the metadata was choking on those null values. I’m not certain where the null rows came from, but choosing “Skip files with errors” solved the issue (or you could filter out null values in the Content.Content column shown above).

At this point, you can Close & Apply. Carry on with fine-tuning the data model properties and creating visuals. 

Refreshing the data in the Power BI Service

In order to schedule a data refresh, we’ll want to publish the PBIX to the Power BI Service. (Reminder: if you plan to share this with others, lean towards using an app workspace. Sharing out of your personal workspace should be used only in a limited way.)

A gateway is not needed in the Power BI Service since it’s pulling from an Azure service.

The data source credentials for the refresh use the OAuth2 protocol which is common for web requests:

[Screenshot: data source credentials using OAuth2 in the Power BI Service]

Now for the bad news. The OAuth2 protocol relies on access tokens, and those tokens expire at regular intervals. (I’m currently testing this with a new refresh schedule created yesterday; I’ll update this post as soon as I have verified what the current expiration interval is–thankfully it is longer than it used to be.) Once the refresh fails due to token expiration, you’ll have to edit credentials in the dataset properties. Usually our way around this issue, like when Azure Data Factory needs to access ADLS, is to use an Azure application (service principal) for authentication, but that’s not currently supported either. 

Depending on your quantity of files and/or size of files in the data lake, the data refresh may take a bit of time. One https GET request is sent by Power BI per file to retrieve data. 

Like This Content?

If you are integrating data between Azure services, you might be interested in an all-day session Meagan Longoria and I are presenting at PASS Summit in November. It’s called “Designing Modern Data and Analytics Solutions in Azure.” Check out info here: http://www.pass.org/summit/2018/Sessions/Details.aspx?sid=78885 

You Might Also Like…

Why the Default Summarization Property in Power BI is So Important

Checklist for Finalizing a Data Model in Power BI Desktop

Data Refresh Issues in the Power BI Service Due to Invalid Dates


Blog – SQL Chick

Three Steps to Make Your Data Lake Pay Off

[Image: data lakes]

“33% of enterprises will take their data lakes off life support (in 2018). Without a clear connection to change-the-business outcomes, many early adopters will pull the funding plug on their data lakes to see if they pay for themselves or die.”  – Forbes.com

There’s untapped potential in your data lake — if you can tap into it, it doesn’t need to die! FICO has the experience and the AI tools to wade through your unstructured and structured data, emerge with actionable insights, and then implement them to improve your business.

How exactly do we do that?  We’ve created a universal how-to guide on this topic, which will be presented at FICO World 2018, April 16-19 in Miami Beach.

Beyond Big Data to Big Insights and Big Business Value

On Tuesday, April 17, my colleague Peter Ould and I will divulge the details of FICO’s three-step process for getting business value from big data quickly. In short, the steps are:

  1. Exploit the Data (Hint: This step is critical in guiding the subsequent steps and maximizing value)
  2. Listen to the Data to Find the Complex Patterns (Hint: Most people skip this step but they shouldn’t)
  3. Execute Learnings Effectively and Efficiently (Hint: Machine Learning really helps here)

Our presentation highlights each of the three steps, discussing the key techniques and methodologies involved as well as the resulting business value achieved by some of FICO’s most innovative clients.

As the pace of data generation and data diversity continue to increase, making better, faster data driven decisions will differentiate you from your competitors. Join us in Miami Beach to learn about FICO’s proven process for tapping into the potential of your Big Data.


FICO

Granting Permissions in Azure Data Lake

[Diagram: the two layers of ADLS permissions]

There are two separate aspects to permissions in Azure Data Lake Store (ADLS):

(1) RBAC permissions to the ADLS account itself, for the purpose of managing the resource.
RBAC = role-based access control. RBAC roles are the familiar Azure roles such as reader, contributor, or owner. Granting a role on the service allows someone to view or manage the configuration and settings for that particular Azure service (ADLS in this case). See Part 2 for info about setting up RBAC.

(2) ACL permissions to the data stored in ADLS, for the purpose of managing the data.
ACL = access control list. The ACLs grant read/write/execute permissions on the data itself. Granting permissions here allows someone to create, read, and/or modify files and folders (i.e., the actual data) stored in ADLS. If you come from the Unix or Linux world, the POSIX-style ACLs will be a familiar concept. See Part 3 about setting up ACLs.
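
To make the distinction concrete, here is a minimal PowerShell sketch (placeholder names, assuming the AzureRM modules) showing that the two layers are managed with entirely different cmdlets: an RBAC role assignment on the ADLS resource versus an ACL entry on the data.

#(1) RBAC: a role on the ADLS resource itself
$groupId = Get-AzureRmADGroup -SearchString 'YourAADGroupName'
New-AzureRmRoleAssignment `
    -ObjectId $groupId.Id `
    -RoleDefinitionName 'Reader' `
    -ResourceGroupName 'YourResourceGroupName' `
    -ResourceName 'YourADLSAccountName' `
    -ResourceType 'Microsoft.DataLakeStore/accounts'

#(2) ACL: read/write/execute on the data stored in ADLS
Set-AzureRmDataLakeStoreItemAclEntry `
    -AccountName 'YourADLSAccountName' `
    -Path '/' `
    -AceType Group `
    -Id $groupId.Id `
    -Permissions ReadExecute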

Finding More Information

There is quite a bit more to know about ADLS security than what is covered in this series, so be sure to also dive into the official documentation links:

Security in Azure Data Lake Store

Access Control in Azure Data Lake Store

Securing Data in Azure Data Lake Store

Best Practices for Using Azure Data Lake Store

Want to Know More?

My next all-day workshop on Architecting a Data Lake is in Raleigh, NC on April 13, 2018


Blog – SQL Chick

Assigning Resource Management Permissions for Azure Data Lake Store (Part 2)

This is part 2 in a short series on Azure Data Lake permissions. 

Part 1 – Granting Permissions in Azure Data Lake
Part 2 – Assigning Resource Management Permissions for Azure Data Lake Store {you are here}
Part 3 – Assigning Data Permissions for Azure Data Lake Store
Part 4 – Using a Service Principal for Azure Data Lake Store
Part 5 – Assigning Permissions for Azure Data Lake Analytics

In this section, we’re covering the “service permissions” for the purpose of managing Azure Data Lake Store (ADLS). Granting a role on the resource allows someone to view or manage the configuration and settings for that particular Azure service (i.e., although we’re talking about ADLS, this post is applicable to Azure services in general). RBAC, or role-based access control, includes the familiar built-in Azure roles such as reader, contributor, or owner (you can create custom roles as well).

Tips for Assigning Roles for the ADLS Service

Permissions for the service and for the data stored in ADLS are always set in two separate processes, with one exception: when you define an owner for the ADLS service in Azure, that owner is automatically granted ‘superuser’ (full) access to manage the ADLS resource in Azure *AND* full access to the data. Any RBAC role other than owner needs the data access specifically assigned via ACLs. This is a good thing because not all system administrators need to see the data, and not all data access users/groups/service principals need access to the service itself. This type of separation is true for certain other services too, such as Azure SQL Database.

Try to use groups whenever you can to grant access, rather than individual accounts. This is a consistent best practice for managing security across many types of systems.

If you are using resource groups in Azure the way they are intended to be used, you may be able to define service permissions at the resource group level rather than at the individual resource level (although the example shown here is setting RBAC for ADLS specifically). Managing permissions at the resource group level reduces maintenance, assuming your resource group isn’t too broadly defined.
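
As a sketch of that idea (placeholder names, assuming the AzureRM module), a role assigned at the resource group level looks like this; compare it with the resource-scoped assignment in the script further down.

#Assign the role once at the resource group scope instead of per resource
$groupId = Get-AzureRmADGroup -SearchString 'YourAADGroupName'
New-AzureRmRoleAssignment `
    -ObjectId $groupId.Id `
    -RoleDefinitionName 'Contributor' `
    -ResourceGroupName 'YourResourceGroupName'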

Typically, automated processes which do need access to the data (discussed in Part 3) don’t need any access to the ADLS resource itself. However, if any access to the Azure portal or to managing the ADLS service (such as through ARM or PowerShell) is needed, then the appropriate RBAC entry is necessary.

In Part 4 we’ll talk a bit about using service principals. I’ve found that frequently a service principal needs data access (ACLs), but not any RBAC access to the service.

The RBAC functionality is consistent across Azure services. When roles are updated for an Azure resource, it is recorded in the Activity Log:

[Screenshot: Activity Log entry for an RBAC change]

Defining RBAC Permissions in the Azure Portal

Setting up permissions can be done in the portal in the Access control (IAM) pane. (By the way, the IAM acronym stands for Identity and Access Management.)

[Screenshot: Access control (IAM) pane in the Azure portal]

Defining RBAC Permissions via PowerShell Script

The technique shown above in the portal is convenient for quick changes, for learning, or for “one-off” scenarios. However, in an enterprise solution, and for production environments, it’s a better practice to handle permissions via a script so you can do things such as:

  • Promote changes through different environments
  • Pass off scripts to an administrator to run in production
  • Include permission settings in source control

In the following PowerShell script, we are assigning contributor permissions to an AAD group:

[Screenshot: PowerShell script assigning RBAC permissions]

Here’s a copy/paste friendly script from the above screenshot:

#-----------------------------------------

#Input Area
$subscriptionName = 'YourSubscriptionName'
$resourceGroupName = 'YourResourceGroupName'
$resourceName = 'YourResourceName'
$groupName = 'YourAADGroupName'
$userRole = 'Contributor'

#-----------------------------------------

#Manual login into Azure
Login-AzureRmAccount -SubscriptionName $subscriptionName

#-----------------------------------------

$resourceId = Get-AzureRmResource `
    -ResourceGroupName $resourceGroupName `
    -ResourceName $resourceName
$groupId = Get-AzureRmADGroup `
    -SearchString $groupName

New-AzureRmRoleAssignment `
    -ObjectId $groupId.Id `
    -RoleDefinitionName $userRole `
    -Scope $resourceId.ResourceId

The online examples for the New-AzureRmRoleAssignment cmdlet enumerate the IDs or GUIDs, which makes things clear for learning but isn’t ideal for operationalized scripts. Therefore, the purpose of $resourceId and $groupId above is to do the work of looking up the GUIDs so you don’t have to do that manually.
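
If you want to double-check the result, a quick optional verification along these lines lists what has been granted at that scope; this is just a sketch reusing the $resourceId variable from above.

#Optional: confirm the role assignment landed at the expected scope
Get-AzureRmRoleAssignment -Scope $resourceId.ResourceId |
    Select-Object DisplayName, RoleDefinitionName, Scope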

Personally, I like using PowerShell instead of ARM (Azure Resource Manager) templates for certain things, such as permissions, but you do have additional options beyond what I’ve discussed here based on what you’re most comfortable with.

Finding More Information

Get Started with Role-Based Access Control in Azure

Want to Know More?

My next all-day workshop on Architecting a Data Lake is in Raleigh, NC on April 13, 2018


Blog – SQL Chick

Assigning Data Permissions for Azure Data Lake Store (Part 3)

This is part 3 in a short series on Azure Data Lake permissions. 

Part 1 – Granting Permissions in Azure Data Lake
Part 2 – Assigning Resource Management Permissions for Azure Data Lake Store
Part 3 – Assigning Data Permissions for Azure Data Lake Store {you are here}
Part 4 – Using a Service Principal for Azure Data Lake Store
Part 5 – Assigning Permissions for Azure Data Lake Analytics

In this section, we’re covering the “data permissions” for Azure Data Lake Store (ADLS). The ACL (access control list) grants permissions to create, read, and/or modify files and folders stored in the ADLS service. Uploading and downloading data falls in this category of ACLs. If you come from the Unix or Linux world, the POSIX-style ACLs will be a familiar concept. 

There are two types of ACLs: Access ACLs and Default ACLs.

An Access ACL is the read/write/execute permissions specified for a folder or file. Every single folder or file has its security explicitly defined — so that means the ADLS security model is not an ‘inheritance’ model. That is an important concept to remember.

A Default ACL is like a ‘template’ setting at a folder level (the concept of a default doesn’t apply at the file level). Any new child item placed in that folder will automatically obtain that default security setting. The default ACLs are absolutely critical, given that data permissions aren’t an inheritance model. You want to avoid a situation where a user has permission to read a folder, but is unable to see any of the files within the folder — that situation will happen if a new file gets added to a folder which has an access ACL set at the folder level, but not a default ACL to apply to new child objects.

Tips for Assigning Data Permissions for ADLS

Organize your data lake folders and files so that folder-level security is one of the main considerations, and so that security is easier to manage.

Access to raw data is typically highly limited. This is partially due to its limited usability, and partially so that data can be ingested as quickly as possible. Because every single file in ADLS has its own security properties specified, this is also one of several reasons why a large number of very tiny files in ADLS is generally discouraged.

Typically in a data lake, the majority of users need only read+execute rights to consume the data. However, you may also have an area like a ‘manual drop zone’ or an ‘analytics sandbox’ where select users have write permissions to create, modify & delete folders and files. Generally speaking, write permissions in a data lake are minimal.

Be selective with granting permissions at the root level (“/”). It does minimize maintenance if you define an access ACL + default ACL at the root level, but only if you feel confident that is adequate security.

Try to use Azure Active Directory (AAD) groups whenever you can to grant access, rather than individual user accounts. This is a consistent best practice for managing security across many types of systems. This will reduce maintenance, and reduce the risk of inaccurate or out of date user permissions.

Currently the maximum # of ACLs that can be assigned to a file or folder is 32. This is another big reason to use AAD groups for managing access, rather than individual users.

Try to assign access at the folder level whenever you can. Although ADLS doesn’t have a true inheritance model, you can set a ‘default’ entry which will handle new child items. 

Be aware that changing a default ACL on a folder doesn’t automatically propagate to change the default ACL on any existing child folders. So, managing changes to existing data needs to be done carefully. Although it can seem like the default ACLs act like an inheritance model in some respects, it definitely is not.

Grant access to an AAD application (aka service principal identity or SPI) for automated operations, such as data loads. For service principals, you often can assign just the data permissions (the ACL) and not any permissions to the ADLS service (the RBAC). Check Part 4 for more discussion about service principals.

You almost always want to assign read + execute permissions together. The execute permissions allow a user to traverse the folder structure to where a file resides, which is needed in conjunction with the read (or write) permissions for the actual file.

The portal interface makes it easy to apply permissions to existing child-level folders and files. It’s a little harder to specify via scripting methods, since your script will need to be explicitly set up to handle recursive operations. Therefore, try to assign relevant permissions as early as possible in your design/development/operationalization phase. 

When you set permissions on existing data, it can take a little while if you are asking it to recursively traverse the folders and files to set permissions for every object. This is another reason to try to set permissions at the AAD group level, rather than via individual users. 

The PowerShell cmdlets to manage ADLS changed in January 2018. See this post: Breaking changes to Azure Data Lake Store cmdlets

Defining ADLS Data Permissions in the Azure Portal

In my example, I want to assign read + execute permissions for the StandardizedData folder, but not for the RawData folder. In the portal, I open Data Explorer, navigate to the applicable folder which sets the “scope” for the permissions, then the Access button:

[Screenshot: Data Explorer Access pane for the selected folder]

Click the Add button to select a user or a group. Notice the permissions are read/write/execute. You can have the new permission entry add to all existing child folders & files (which you typically want to do). The last radio button is really important – this lets you set it as both an access entry *and* the default entry.

[Screenshot: Assign permissions dialog with access and default entry options]

Defining ADLS Data Permissions via PowerShell Script

The technique shown above in the portal is convenient for quick changes, for learning, or for “one-off” scenarios. However, in an enterprise solution, or a production environment, it’s a better practice to handle permissions via a script so you can do things such as:

  • Promote changes through different environments
  • Pass off scripts to an administrator to run in production
  • Include permission settings in source control

In the following script, we are assigning read+execute permissions to the DataReaderGroup:

  • Step 1 defines the access ACL.
  • Step 2 defines the default ACL. (Thanks to Saveen Reddy from the ADL team who very kindly clued me into needing to set the default in a separate step. I was stumped on that one for a bit.)
  • Step 3 applies the folder-level access to the child objects. Note that it is only going one level deep in the folder structure. Therefore, you’ll want to construct the script to be ‘smarter’ about recursion if you have a large number of folders which already exist, or just pass in an array list of the folders it should apply to – see Shannon Lowder’s blog for an example of this technique. If you have any files, the default ACL in the foreach loop will fail (but it works fine for folders).
[Screenshot: PowerShell script assigning ADLS data permissions]

Here’s the copy/paste friendly script for the above screenshot:

#-----------------------------------------

#Input Area
$subscriptionName = 'YourSubscriptionName'
$resourceGroupName = 'YourResourceGroupName'
$resourceName = 'YourResourceName'
$adlsPath = '/Folder/Subfolder'
$groupName = 'YourAADGroupName'
$permissionType = 'ReadExecute'

#-----------------------------------------

#Manual login into Azure
#Login-AzureRmAccount -SubscriptionName $subscriptionName

#-----------------------------------------

#Step 1: Set the access permissions at the folder level
$groupId = Get-AzureRmADGroup -SearchString $groupName
Set-AdlStoreItemAclEntry `
    -AccountName $resourceName `
    -Path $adlsPath `
    -AceType Group `
    -Permissions $permissionType `
    -Id $groupId.Id

#Step 2: Set the default at the folder level
Set-AdlStoreItemAclEntry `
    -AccountName $resourceName `
    -Path $adlsPath `
    -AceType Group `
    -Permissions $permissionType `
    -Id $groupId.Id `
    -Default

#Step 3: Set existing child objects to be the same as the folder level
$childObjects = Get-AzureRmDataLakeStoreChildItem `
    -AccountName $resourceName `
    -Path $adlsPath
$arrayOfObjectNames = @($childObjects.Name)
foreach ($objectName in $arrayOfObjectNames)
    {
     Write-Host "Setting ACL for $adlsPath/$objectName"
     #Set the access
     Set-AdlStoreItemAclEntry `
        -AccountName $resourceName `
        -Path "$adlsPath/$objectName" `
        -AceType Group `
        -Permissions $permissionType `
        -Id $groupId.Id
     #Set the default
     Set-AdlStoreItemAclEntry `
        -AccountName $resourceName `
        -Path "$adlsPath/$objectName" `
        -AceType Group `
        -Permissions $permissionType `
        -Id $groupId.Id `
        -Default
    }
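
If your folder tree is more than one level deep, one option is to replace Step 3 with a recursive helper. The following is only a sketch (not from the original scripts, with a hypothetical function name) that walks child items with Get-AzureRmDataLakeStoreChildItem, applies the access ACL to everything, and applies the default ACL to folders only so the loop doesn’t fail on files.

#Sketch of a recursive alternative to Step 3 (hypothetical helper function)
function Set-AdlAclRecursive ($accountName, $path, $aadGroupId, $permissions)
{
    $children = Get-AzureRmDataLakeStoreChildItem -AccountName $accountName -Path $path
    foreach ($child in $children)
    {
        $childPath = "$path/$($child.Name)"
        Write-Host "Setting ACL for $childPath"

        #Access ACL applies to both files and folders
        Set-AzureRmDataLakeStoreItemAclEntry -AccountName $accountName -Path $childPath `
            -AceType Group -Permissions $permissions -Id $aadGroupId

        if ($child.Type -eq 'DIRECTORY')
        {
            #Default ACL only applies to folders; then recurse into the folder
            Set-AzureRmDataLakeStoreItemAclEntry -AccountName $accountName -Path $childPath `
                -AceType Group -Permissions $permissions -Id $aadGroupId -Default
            Set-AdlAclRecursive $accountName $childPath $aadGroupId $permissions
        }
    }
}

#Example call, reusing the variables from the script above
Set-AdlAclRecursive $resourceName $adlsPath $groupId.Id $permissionType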

Finding More Information

PowerShell Cmdlets for Azure Data Lake Store

Breaking Changes to Data Lake Store Cmdlets

Access Control in Azure Data Lake Store <–Definitely take time to read this

Secure Data in Azure Data Lake Store

Want to Know More?

My next all-day workshop on Architecting a Data Lake is in Raleigh, NC on April 13, 2018


Blog – SQL Chick

New eBook! Keep Your Data Lake Pristine with Big Data Quality Tools

The term Big Data doesn’t seem quite “big enough” anymore to properly describe the vast over-abundance of data available to organizations today. Business leaders have repeatedly expressed little confidence in the reliability of their Data Lake, especially as the volume and variety of Big Data sources continue to grow.

The very purpose of that data is to enable new levels of business insight and clarity. No one sets out to create a data swamp that provides nothing but confusion and distrust.

Our new eBook, Keep Your Data Lake Pristine with Big Data Quality Tools, takes a look at how the proper software can help align people, process, and technology to ensure trusted, high-quality Big Data.

Download the eBook to see how data quality can yield new insights for business success.


Syncsort + Trillium Software Blog

Zones in a Data Lake

As we are approaching the end of 2017, many people have resolutions or goals for the new year. How about a goal to get organized…in your data lake?

The most important aspect of organizing a data lake is optimal data retrieval.

It all starts with the zones of your data lake, as shown in the following diagram:

[Diagram: data lake zones]

Hopefully the above diagram is a helpful starting place when planning a data lake structure. I have used all of the above zones in projects (with the exception of a transient zone which I haven’t had a requirement for).

You Might Also Like…

Data Lake Use Cases and Planning Considerations  <–More tips on organizing the data lake in this post


Blog – SQL Chick

Two Ways to Approach Federated Queries with U-SQL and Azure Data Lake Analytics

Did you know there are two ways to do federated queries with Azure Data Lake Analytics (ADLA)? By federated queries, I mean a query that combines (federates) data from multiple sources — in this case, from within Azure Data Lake and another data store. Federated queries are one aspect of data virtualization which helps us to access data without requiring the physical movement of data or data integration:

[Diagram: federated queries combining Azure Data Lake with another data store]

The two methods for federated queries with U-SQL and ADLA are:

  1. Schema-less (aka “lazy metadata”)
  2. With a pre-defined schema, via an external table

You might be familiar with external tables in SQL Server, Azure SQL Data Warehouse, or APS. In those platforms, external tables work with PolyBase for purposes of querying data where it lives elsewhere, often for the purpose of loading it into the relational database. That same premise exists in Azure Data Lake Analytics as well. However, in the data lake there are two approaches – an external table is still a good idea most of the time, but it isn’t absolutely required.

Option 1: Schema-Less

Following are the components of making schema-less federated queries work in ADLA:

[Diagram: components of a schema-less federated query in ADLA]

Pros of the schema-less option:

  • Access the data quickly for exploration without requiring an external table to be defined in the ADLA Catalog
  • More closely aligned to a schema-on-read paradigm because of its flexibility 
  • Query flexibility: can retrieve a subset of columns without having to define all the columns

Cons of the schema-less option:

  • Additional “burden” on the data analyst doing the ad hoc querying with U-SQL to always perform the schema-on-read within the query
  • Repeating the same schema-on-read syntax in numerous U-SQL queries, rather than reusing the definition via an external table — so if the source system table or view changes, it could involve altering numerous U-SQL scripts.
  • Requires a rowset in the U-SQL schema-on-read queries – i.e., cannot do a direct join so this approach involves slightly longer, more complex syntax

Option 2: With a Pre-Defined Schema in an External Table

The following introduces an external table to the picture in order to enforce a schema:

[Diagram: federated query with a pre-defined schema via an external table]

Pros of using an external table:

  • Most efficient on the data analyst doing the ad hoc querying with U-SQL
  • Easier, shorter syntax on the query side because columns and data types have already been predefined in the ADLA Catalog, so a direct join to an external table can be used in the query without having to define a rowset
  • Only one external table to change if a modification does occur to the underlying SQL table

Cons of using an external table:

  • Schema must remain consistent – a downstream U-SQL query will error if a new column is added to the remote source and the external table has not been kept in sync
  • All remote columns must be defined in the external table (not necessarily a big con – but definitely important to know)

In summary, the schema-less approach is most appropriate for initial data exploration because of the freedom and flexibility. An external table is better suited for ongoing, routine queries in which the SQL side is stable and unchanging. Solutions which have been operationalized and promoted to production will typically warrant an external table. 
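
Both options share the same plumbing in the ADLA Catalog: a credential for the remote Azure SQL Database, which the U-SQL data source (and optionally the external table) is then built on. As a rough sketch with placeholder names, assuming the AzureRM.DataLakeAnalytics module, the credential piece looks like this:

#Store the remote SQL Database login as a catalog credential in the ADLA database (placeholder names)
New-AzureRmDataLakeAnalyticsCatalogCredential `
    -AccountName 'YourADLAAccountName' `
    -DatabaseName 'YourADLADatabaseName' `
    -CredentialName 'RemoteSqlDbCredential' `
    -Credential (Get-Credential) `
    -DatabaseHost 'yourserver.database.windows.net' `
    -Port 1433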

Want to Know More?

During my all-day workshop, we set up each piece step by step including the service principal, credential, data source, external table, and so forth so you can see the whole thing in action. The next workshop is in Washington DC on December 8th. For more details and how to register check here: Presenting a New Training Class on Architecting a Data Lake  

You Might Also Like…

Querying Multi-Structured JSON Files with U-SQL

Running U-SQL on a Schedule with Azure Data Factory to Populate Azure Data Lake

Handling Row Headers in U-SQL

Data Lake Use Cases and Planning Considerations


Blog – SQL Chick

Running U-SQL on a Schedule with Azure Data Factory to Populate Azure Data Lake

This post is a continuation of the blog where I discussed using U-SQL to standardize JSON input files which vary in format from file to file, into a consistent standardized CSV format that’s easier to work with downstream. Now let’s talk about how to make this happen on a schedule with Azure Data Factory (ADF).

This was all done with Version 1 of ADF. I have not tested this yet with the ADF V2 Preview which was just released.

Prerequisites

  1. Steps 1-4 from my previous post, which includes registering the custom JSON assemblies, creating a database in the Azure Data Lake Catalog, and uploading our raw file so it’s ready to use.
  2. An Azure Data Factory service provisioned and ready to use (this post reflects ADF V1), along with some basic knowledge about ADF since I’m not going into ADF details in this post.

Summary of Steps

  1. Create a procedure in the ADL catalog
  2. Test the procedure
  3. Create a service principal (aka AAD App)  [one time setup]
  4. Assign permissions to service principal  [one time setup]
  5. Obtain IDs [one time setup]
  6. Create ADF components
  7. Verify success of ADF job

Step 1: Create a Procedure in the ADLA Catalog Which Will Be Executed By ADF

This ADLA procedure will be executed by Azure Data Factory. Alternatively, you could also reference a U-SQL script in Azure Storage if you prefer storing a script file there (at the time of this writing, we cannot yet store a script file in ADLS). Either way, U-SQL scripts are typically just too long to practically embed in the Azure Data Factory pipeline activity. In addition to what was discussed in the first part of this solution, we want this stored procedure to:

  • Reference ‘external’ variables which will be populated by the ADF time slices (in our case, the time slice is daily)
  • Apply the ADF time slices to the ‘where’ predicate
  • Use variables to create a “smart” output file path & name which allows the standardized output partitioning to match the raw data partitioning by year/month/day

Run the following U-SQL (Azure Data Lake Analytics) job:

CREATE PROCEDURE BankingADLDB.dbo.uspCreateStandardizedDataset(@DateSliceStart DateTime, @DateSliceEnd DateTime)
AS
BEGIN

REFERENCE ASSEMBLY BankingADLDB.[Newtonsoft.Json];
REFERENCE ASSEMBLY BankingADLDB.[Microsoft.Analytics.Samples.Formats]; 

USING Microsoft.Analytics.Samples.Formats.Json;

//These external parameters will be populated by ADF based on the time slice being executed.
DECLARE EXTERNAL @DateSliceStart DateTime = System.DateTime.Parse("2017/03/14");
DECLARE EXTERNAL @DateSliceEnd DateTime = System.DateTime.Parse("2017/03/14");

//These are intermediary variables which inherit the time element from the ADF time slice.
DECLARE @YearNbr int = @DateSliceStart.Year;
DECLARE @MonthNbr int = @DateSliceStart.Month;
DECLARE @DayNbr int = @DateSliceStart.Day;

//These are used to align the Year/Month/Day partitioning of the input & output.
//This technique also allows U-SQL to dynamically generate different output file path & name.
DECLARE @YearString string = @YearNbr.ToString();
DECLARE @MonthString string = @MonthNbr.ToString().PadLeft(2, '0');
DECLARE @DayString string = @DayNbr.ToString().PadLeft(2, '0');

DECLARE @InputPath string = "/ATMMachineData/RawData/" + @YearString + "/" + @MonthString + "/{filename}.json";

DECLARE @OutputFile string = "/ATMMachineData/StandardizedData/" + @YearString + "/" + @MonthString + "/" + @YearString + @MonthString + @DayString + ".csv";

@RawData = 
EXTRACT 
 [AID] string
,[Timestamp] DateTime
,[Data] string
,date DateTime //virtual column
,filename string //virtual column
FROM @InputPath
USING new JsonExtractor();

@CreateJSONTuple = 
SELECT 
 [AID] AS AssignedID
,[Timestamp] AS TimestampUtc
,JsonFunctions.JsonTuple([Data]) AS EventData 
FROM @RawData
WHERE [Timestamp] >= @DateSliceStart
AND [Timestamp] < @DateSliceEnd;

@Dataset =
SELECT
AssignedID
,TimestampUtc
,EventData["Val"] ?? "0" AS DataValue
,EventData["PrevVal"] ?? "0" AS PreviousDataValue
,EventData["Descr"] ?? "N/A" AS Description
FROM @CreateJSONTuple;

OUTPUT @Dataset
TO @OutputFile
USING Outputters.Csv(outputHeader:true,quoting:false);

END;

Step 2: Test the ADLA Procedure Works

Before we invoke it with ADF, let’s double check our new procedure is working ok. Run the following U-SQL job in ADLA to call the procedure & use 3/14/2017 as the variable values (which matches the timestamp of our original source file):

BankingADLDB.dbo.uspCreateStandardizedDataset(System.DateTime.Parse("2017/03/14"), System.DateTime.Parse("2017/03/15"));

Verify the output is created via Data Explorer. Note the procedure will create the folder structure as well as the file based on the @OutputFile parameter value. After you have confirmed that it worked, go ahead & delete the output file ATMMachineData/StandardizedData/2017/03/20170314.csv so we can be certain later when it’s been generated with ADF.

[Screenshot: procedure output shown in Data Explorer]

Step 3: Create a Service Principal For Use by ADF When it Executes U-SQL

You can authenticate using your own credentials in ADF, but they’ll expire pretty quick — so although that technique is fast and easy for testing, personal credentials won’t work for ongoing scheduling. Therefore, we’ll set this up using a service principal so you get started on the right foot. This is easiest in PowerShell (though you can also do this in the Azure portal if you prefer, in the Azure Active Directory menu > App Registrations page).

#Input Area
$subscriptionName = ''
$aadSvcPrinAppDisplayName = 'ADF ReadWrite Access To ADL - Svc Prin - Dev'
$aadSvcPrinAppHomePage = 'http://ADFReadWriteAccessToADLDev'
$aadSvcPrinAppIdentifierUri = 'https://url.com/ADFReadWriteAccessToADLDev'
$aadSvcPrinAppPassword = ''

#-----------------------------------------

#Manual login into Azure
Login-AzureRmAccount -SubscriptionName $subscriptionName

#-----------------------------------------

#Create Service Principal (App Registration):
$aadSvcPrinApplicationDev = New-AzureRmADApplication `
 -DisplayName $aadSvcPrinAppDisplayName `
 -HomePage $aadSvcPrinAppHomePage `
 -IdentifierUris $aadSvcPrinAppIdentifierUri `
 -Password $aadSvcPrinAppPassword

New-AzureRmADServicePrincipal -ApplicationId $aadSvcPrinApplicationDev.ApplicationId

In AAD, it should look like this:

[Screenshot: app registration in Azure Active Directory]

I put “Dev” in the suffix of mine because I typically create separate service principals for each environment (Dev, Test, Prod). It’s also frequently a good idea to create separate registrations for Read/Write vs. just Read permissions.

Step 4: Assign Permissions to the Service Principal So It Can Read and Write Via the ADF Job

For this step we’ll use the portal instead of PowerShell. You can do this piece in PowerShell as well if you prefer using the Set-AzureRmDataLakeStoreItemAclEntry cmdlet – you’ll also need to make sure the Azure Data Lake provider is registered. To keep this fast & easy, let’s just use the portal.
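
For completeness, a PowerShell sketch of the same idea is below (placeholder account name; it only touches the root folder, whereas the portal steps below also push the entry down to existing child items). It reuses the $aadSvcPrinAppDisplayName variable from the earlier script.

#Make sure the Data Lake Store resource provider is registered for the subscription
Register-AzureRmResourceProvider -ProviderNamespace 'Microsoft.DataLakeStore'

#Grant the service principal read/write/execute on the root, as an access ACL and a default ACL
$servicePrincipal = Get-AzureRmADServicePrincipal -SearchString $aadSvcPrinAppDisplayName
Set-AzureRmDataLakeStoreItemAclEntry -AccountName 'YourADLSAccountName' -Path '/' `
    -AceType User -Id $servicePrincipal.Id -Permissions All
Set-AzureRmDataLakeStoreItemAclEntry -AccountName 'YourADLSAccountName' -Path '/' `
    -AceType User -Id $servicePrincipal.Id -Permissions All -Default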

Security for Azure Data Lake Store

The first piece is referred to as ACLs – access control lists. Go to Data Explorer in ADLS. Make sure you’re on the root folder and select Access (or if you want to define permissions at a sub-folder level only, it’s ok to start from that level). Choose Add.

[Screenshot: Data Explorer Access pane]

Select User or Group: choose the ADF service principal we just created. 

Select Permissions: This account needs to read, write, and execute. Note the radio button selections as well so that existing and new child objects will be assigned this permission.

[Screenshot: permission selections for the service principal]

As soon as you hit ok, notice the message at the top of the page. Make sure not to close the blade while it’s assigning the permissions to the child objects:

[Screenshot: message while permissions are being assigned to child objects]

When you see that it’s finished (with the green check mark), then it’s ok to close the blade:

[Screenshot: permission assignment completed]

Security for Azure Data Lake Analytics

The second piece of security needed for our service principal is done over in Azure Data Lake Analytics, so that it’s allowed to run U-SQL:

[Screenshot: adding the service principal in Azure Data Lake Analytics]

Note that the equivalent IAM (Identity & Access Mgmt) permissions for our service principal don’t need to be assigned over in ADLS – just ADLA. Normally that step would be needed for a regular user though. There’s actually a *lot* more to know about security with Azure Data Lake that I’m not going into here. 
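
If you prefer to script this part too, a hedged sketch of the equivalent role assignment on the ADLA account is shown below (placeholder names; ‘Data Lake Analytics Developer’ is a built-in role aimed at submitting jobs, though Contributor also works).

#Give the service principal a role on the ADLA account so ADF can submit U-SQL jobs with it
New-AzureRmRoleAssignment `
    -ServicePrincipalName $aadSvcPrinApplicationDev.ApplicationId `
    -RoleDefinitionName 'Data Lake Analytics Developer' `
    -ResourceGroupName 'YourResourceGroupName' `
    -ResourceName 'YourADLAAccountName' `
    -ResourceType 'Microsoft.DataLakeAnalytics/accounts'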

Step 5: Obtain IDs Needed for Azure Data Factory

AAD Application ID

Go find the Application ID for your new service principal and copy it so you have it:

[Screenshot: Application ID of the service principal]

Tenant ID (aka Directory ID)

Also find your Tenant ID that’s associated with Azure Active Directory:

[Screenshot: Tenant ID in Azure Active Directory]

Subscription ID

And, lastly, find the Subscription ID where you’ve provisioned ADL and ADF:

[Screenshot: Subscription ID]

Step 6: Create Azure Data Factory Components

The following ADF scripts include two linked services, two datasets, and one pipeline. 

In both linked services you will need to replace several things (as well as the account name and resource group name). Also, be sure NOT to hit the authorize button if you’re creating the linked services directly in the portal interface (it’s actually a much better idea to use Visual Studio because all of these files can be source-controlled and you can use configuration files to direct deployments to Dev, Test, and Prod which have different values for IDs, keys, etc). You may also want to change the linked services names – mine is called lsBankingADLA (or S) to coincide with what my actual services are called — but without the Dev, Test, Prod suffix that they have for real (because we need to propagate the linked services without changing the names).

Linked Service for Azure Data Lake Analytics

{
    "name": "lsBankingADLA",
    "properties": {
        "type": "AzureDataLakeAnalytics",
        "typeProperties": {
            "accountName": "",
            "dataLakeAnalyticsUri": "azuredatalakeanalytics.net",
            "servicePrincipalId": "",
            "servicePrincipalKey": "",
            "tenant": "",
            "subscriptionId": "",
            "resourceGroupName": ""
        }
    }
}
 

Linked Service for Azure Data Lake Store

{
    "name": "lsBankingADLS",
    "properties": {
        "type": "AzureDataLakeStore",
        "typeProperties": {
            "dataLakeStoreUri": "https://.azuredatalakestore.net/webhdfs/v1",
            "servicePrincipalId": "",
            "servicePrincipalKey": "",
            "tenant": "",
            "subscriptionId": "",
            "resourceGroupName": ""
        }
    }
}

Dataset for the Raw JSON Data 

{
    "name": "dsBankingADLSRawData",
    "properties": {
        "published": false,
        "type": "AzureDataLakeStore",
        "linkedServiceName": "lsBankingADLS",
        "typeProperties": {
            "fileName": "{year}/{month}/{day}.json",
            "folderPath": "ATMMachineData/RawData/",
            "format": {
                "type": "JsonFormat"
            },
            "partitionedBy": [
                {
                    "name": "year",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "yyyy"
                    }
                },
                {
                    "name": "month",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "MM"
                    }
                },
                {
                    "name": "day",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "dd"
                    }
                }
            ]
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        },
        "external": true,
        "policy": {}
    }
}

Dataset for the Standardized CSV Data

{
    "name": "dsBankingADLSStandardizedData",
    "properties": {
        "published": false,
        "type": "AzureDataLakeStore",
        "linkedServiceName": "lsBankingADLS",
        "typeProperties": {
            "fileName": "SpecifiedInTheUSQLProcedure.csv",
            "folderPath": "ATMMachineData/StandardizedData/{year}/{month}",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ",",
                "nullValue": "N/A",
                "firstRowAsHeader": true
            },
            "partitionedBy": [
                {
                    "name": "year",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "yyyy"
                    }
                },
                {
                    "name": "month",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "MM"
                    }
                }
            ]
        },
        "availability": {
            "frequency": "Day",
            "interval": 1,
            "anchorDateTime": "2017-03-14T00:00:00Z"
        }
    }
}

Pipeline with U-SQL Activity to Run the Procedure in ADLA 

{
    "name": "plStandardizeBankingData",
    "properties": {
        "description": "Standardize JSON data into CSV, with friendly column names & consistent output for all event types. Creates one output (standardized) file per day.",
        "activities": [
            {
                "type": "DataLakeAnalyticsU-SQL",
                "typeProperties": {
                    "script": "BankingADLDB.dbo.uspCreateStandardizedDataset(System.DateTime.Parse(@DateSliceStart), System.DateTime.Parse(@DateSliceEnd));",
                    "degreeOfParallelism": 30,
                    "priority": 100,
                    "parameters": {
                        "DateSliceStart": "$$Text.Format('{0:yyyy-MM-ddTHH:mm:ssZ}', SliceStart)",
                        "DateSliceEnd": "$$Text.Format('{0:yyyy-MM-ddTHH:mm:ssZ}', SliceEnd)"
                    }
                },
                "inputs": [
                    {
                        "name": "dsBankingADLSRawData"
                    }
                ],
                "outputs": [
                    {
                        "name": "dsBankingADLSStandardizedData"
                    }
                ],
                "policy": {
                    "timeout": "06:00:00",
                    "concurrency": 10,
                    "executionPriorityOrder": "NewestFirst"
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1,
                    "anchorDateTime": "2017-03-14T00:00:00Z"
                },
                "name": "acStandardizeBankingData",
                "linkedServiceName": "lsBankingADLA"
            }
        ],
        "start": "2017-03-14T00:00:00Z",
        "end": "2017-03-15T00:00:00Z",
        "isPaused": false,
        "pipelineMode": "Scheduled"
    }
}

A few comments about the pipeline:

[Screenshot: annotated ADF pipeline]

Once all 5 components are deployed, they should look like this:

[Screenshot: the five deployed ADF objects]

Step 7: Verify Success of ADF Job

We can verify the ADF job succeeded by looking at the Monitor & Manage App (note you’ll have to set the start time back to March 2017 for it to actually show any activity windows):

[Screenshot: ADF Monitor & Manage App showing the activity window]

We can also see in the ADLA Job Management area that an ADF job was executed (it very kindly prefixes jobs run by Data Factory with ADF):

[Screenshot: ADLA Job Management showing the ADF-prefixed job]
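
You can also pull the same list from PowerShell; a small sketch (placeholder account name) that filters on the ADF prefix:

#List recent ADLA jobs submitted by Data Factory (job names are prefixed with ADF)
Get-AzureRmDataLakeAnalyticsJob -Account 'YourADLAAccountName' |
    Where-Object { $_.Name -like 'ADF*' } |
    Select-Object Name, Result, StartTime, EndTime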


Blog – SQL Chick

Using Azure Analysis Services on Top of Azure Data Lake Storage

The latest release of SSDT Tabular adds support for Azure Data Lake Store (ADLS) to the modern Get Data experience (see the following screenshot). Now you can augment your big data analytics workloads in Azure Data Lake with Azure Analysis Services and provide rich interactive analysis for selected data subsets at the speed of thought!

[Screenshot: ADLS connector available in the modern Get Data experience]

If you are unfamiliar with Azure Data Lake, check out the various articles at the Azure Data Lake product information site. Also read the article “Get started with Azure Data Lake Analytics using Azure portal.”

Following these instructions, I provisioned a Data Lake Analytics account called tpcds for this article and a new Data Lake Store called tpcdsadls. I also added one of my existing Azure Blob Storage accounts, which contains a 1 TB TPC-DS data set, which I already created and used in the series “Building an Azure Analysis Services Model on Top of Azure Blob Storage.” The idea is to move this data set into Azure Data Lake as a highly scalable and sophisticated analytics backend, from which to serve a variety of Azure Analysis Services models.

For starters, Azure Data Lake can process raw data and put it into targeted output files so that Azure Analysis Services can import the data with less overhead. For example, you can remove any unnecessary columns at the source, which eliminates about 60 GB of unnecessary data from my 1 TB TPC-DS data set and therefore benefits processing performance, as discussed in “Building an Azure Analysis Services Model on Top of Azure Blob Storage–Part 3”.

Moreover, with relatively little effort and a few small changes to a U-SQL script, you can provide multiple targeted data sets to your users, such as a small data set for modelling purposes plus one or more production data sets with the most relevant data. In this way, a data modeler can work efficiently in SSDT Tabular against the small data set prior to deployment, and after production deployment, business users can get the relevant information they need from your Azure Analysis Services models in Microsoft Power BI, Microsoft Office Excel, and Microsoft SQL Server Reporting Services. And if a data scientist still needs more than what’s readily available in your models, you can use Azure Data Lake Analytics (ADLA) to run further U-SQL batch jobs directly against all the terabytes or petabytes of source data you may have. Of course, you can also take advantage of Azure HDInsight as a highly reliable, distributed and parallel programming framework for analyzing big data. The following diagram illustrates a possible combination of technologies on top of Azure Data Lake Store.

[Diagram: analyzing big data with Azure Analysis Services and other services on top of Azure Data Lake Store]

Azure Data Lake Analytics (ADLA) can process massive volumes of data extremely quickly. Take a look at the following screenshot, which shows a Data Lake job processing approximately 2.8 billion rows of TPC-DS store sales data (~500 GB) in under 7 minutes!

[Screenshot: Data Lake job processing the store sales data]

The screen in the background uses source files in Azure Data Lake Storage and the screen in the foreground uses source files in Azure Blob Storage connected to Azure Data Lake. The performance is comparable, so I decided to leave my 1 TB TPC-DS data set in Azure Blob Storage, but if you want to ensure absolute best performance or would like to consolidate your data in one storage location, consider moving all your raw data files into ADLS. It’s straightforward to copy data from Azure Blob Storage to ADLS by using the AdlCopy tool, for example.
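
For reference, an AdlCopy invocation is a one-liner; this is just a sketch with made-up account, container, and folder names (AdlCopy.exe must be installed locally).

#Copy a blob container into an ADLS folder with AdlCopy (assumed names and paths)
& AdlCopy.exe /source "https://yourstorageaccount.blob.core.windows.net/yourcontainer/" `
              /dest "swebhdfs://tpcdsadls.azuredatalakestore.net/tpcds/raw/" `
              /sourcekey "<storage account key>"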

With the raw source data in a Data Lake-accessible location, the next step is to define the U-SQL scripts to extract the relevant information and write it along with column names to a series of output files. The following listing shows a general U-SQL pattern that can be used for processing the raw TPC-DS data and putting it into comma-separated values (csv) files with a header row.

@raw_parsed = EXTRACT child_id int,
                      <list of columns>,
                      empty string
FROM "<source folder>/{*}_{child_id}_100.dat"
USING Extractors.Text(delimiter: '|');

@filtered_results = SELECT <list of columns>
FROM @raw_parsed
<optional filter / ORDER BY ... FETCH clause>;

OUTPUT @filtered_results
TO "<target folder>/<output file>.csv"
USING Outputters.Csv(outputHeader:true);

The next listing shows a concrete example based on the small income_band table. Note how the query extracts a portion of the file name into a virtual child_id column in addition to the actual columns from the source files. This child_id column comes in handy later when generating multiple output csv files for the large TPC-DS tables. Also, the ORDER BY and FETCH clauses are not strictly needed in this example because the income_band table only has 20 rows, but they are included to illustrate how to restrict the amount of data per table to a maximum of 100 rows to create a small modelling data set.

@raw_parsed = EXTRACT child_id int,
                      b_income_band_sk string,
                      b_lower_bound string,
                      b_upper_bound string,
                      empty string
FROM "wasb://income-band@aasuseast2/{*}_{child_id}_100.dat"
USING Extractors.Text(delimiter: '|');

@filtered_results = SELECT b_income_band_sk,
                           b_lower_bound,
                           b_upper_bound
FROM @raw_parsed
ORDER BY child_id ASC
FETCH 100 ROWS;

You can find complete sets of U-SQL scripts to generate output files for different scenarios (modelling, single csv file per table, multiple csv files for large tables, and large tables filtered by last available year) at the GitHub repository for Analysis Services.

For instance, for generating the modelling data set, there are 25 U-SQL scripts to generate a separate csv file for each TPC-DS table. You can run each U-SQL script manually in the Microsoft Azure portal, yet it is more convenient to use a small Microsoft PowerShell script for this purpose. Of course, you can also use Azure Data Factory, which among other things enables you to run U-SQL scripts on a scheduled basis. For this article, however, the following Microsoft PowerShell script suffices.

$script_folder = "<Path to U-SQL Scripts>"
$adla_account = "<ADLA Account Name>"
Login-AzureRmAccount -SubscriptionName "<Windows Azure Subscription Name>"

Get-ChildItem $script_folder -Filter *.usql |
Foreach-Object {
    $job = Submit-AdlJob -Name $_.Name -AccountName $adla_account -ScriptPath $_.FullName -DegreeOfParallelism 100
    Wait-AdlJob -Account $adla_account -JobId $job.JobId
}

Write-Host "Finished processing U-SQL jobs!";

It does not take long for Azure Data Lake to process the requests. You can use the Data Explorer feature in the Azure Portal to double-check that the desired csv files have been generated successfully, as the following screenshot illustrates.

[Screenshot: output csv files for modelling shown in Data Explorer]

With the modelling data set in place, you can finally switch over to SSDT and create a new Analysis Services Tabular model at the 1400 compatibility level. Make sure you have the latest version of the Microsoft Analysis Services Projects package installed so that you can pick Azure Data Lake Store from the list of available connectors. You will be prompted for the Azure Data Lake Store URL and you must sign in using an organizational account. Currently, the Azure Data Lake Store connector only supports interactive logons, which is an issue for processing the model in an automated way in Azure Analysis Services, as discussed later in this article. For now, let’s focus on the modelling aspects.

The Azure Data Lake Store connector does not automatically establish an association between the folders or files in the store and the tables in the Tabular model. In other words, you must create each table individually and select the corresponding csv file in Query Editor. This is a minor inconvenience. It also implies that each table expression specifies the folder path to the desired csv file individually. If you are using a small data set from a modelling folder to create the Tabular model, you would need to modify every table expression during production deployment to point to the desired production data set in another folder. Fortunately, there is a way to centralize the folder navigation by using a shared expression so that only a single expression requires an update on production deployment. The following diagram depicts this design.

[Diagram: folder navigation centralized in a shared expression]

To implement this design in a Tabular model, use the following steps:

  1. Start Visual Studio and check under Tools -> Extensions and Updates that you have the latest version of Microsoft Analysis Services Projects installed.
  2. Create a new Tabular project at the 1400 compatibility level.
  3. Open the Model menu and click on Import From Data Source.
  4. Pick the Azure Data Lake Store connector, provide the storage account URL, and sign in by using an Organizational Account. Click Connect and then OK to create the data source object in the Tabular model.
  5. Because you chose Import From Data Source, SSDT displays Query Editor automatically. In the Content column, click on the Table link next to the desired folder name (such as modelling) to navigate to the desired root folder where the csv files reside.
  6. Right-click the Table object in the right Queries pane, and click Create Function. In the No Parameters Found dialog box, click Create.
  7. In the Create Function dialog box, type GetCsvFileList, and then click OK.
  8. Make sure the GetCsvFileList function is selected, and then on the View menu, click Advanced Editor.
  9. In the Edit Function dialog box informing you that updates from the Table object will no longer propagate to the GetCsvFileList function if you continue, click OK.
  10. In Advanced Editor, note how the GetCsvFileList function navigates to the modelling folder, enter a whitespace character at the end of the last line to modify the expression, and then click Done.
  11. In the right Queries pane, select the Table object, and then in the left Applied Steps pane, delete the Navigation step, so that Source is the only remaining step.
  12. Make sure the Formula Bar is displayed (View menu -> Formula Bar), and then redefine the Source step as = GetCsvFileList() and press Enter. Verify that the list of csv files is displayed in Query Editor, as in the following screenshot.
    [Screenshot: invoking GetCsvFileList in Query Editor]
  13. For each table you want to import:
    1. Right-click the existing Table object and click Duplicate.
    2. In the Content column, click on the Binary link next to the desired file name (such as call_center) and verify that Query Editor parses the columns and detects the data types correctly.
    3. Rename the table according to the csv file you selected (such as call_center).
    4. Right-click the renamed table object (such as call_center) in the Queries pane and click Create New Table.
    5. Verify that the renamed table object (such as call_center) is no longer displayed in italic, which indicates that the query will now be imported as a table into the Tabular model.
  14. After you have created all desired tables by using the sequence above, delete the original Table object by right-clicking on it and selecting Delete.
  15. In Query Editor, click Import to add the GetCsvFileList expression and the tables to your Tabular model.

During the import, SSDT Tabular pulls in the small modelling data set. And prior to production deployment, it is now a simple matter of updating the shared expression by right-clicking on the Expressions node in Tabular Model Explorer and selecting Edit Expressions, and then changing the folder name in Advanced Editor. The below screenshot highlights the folder name in the GetCsvFileList expression. And if each table can find its corresponding csv file in the new folder location, deployment and processing can succeed.

[Screenshot: changing the csv folder name in the shared expression]

Another option is to deploy the model with the Do Not Process deployment option and use a small TOM application in Azure Functions to process the model on a scheduled basis. Of course, you can also use SSMS to connect to your Azure Analysis Services server and send a processing command, but it might be inconvenient to keep SSDT or SSMS connected for the duration of the processing cycle. Processing against the full 1 TB data set with a single csv file per table took about 15 hours to complete. Processing with four csv files/partitions for the seven large tables and maxActiveConnections on the data source set to 46 concurrent connections took roughly 6 hours. This is remarkably faster in comparison to using general BLOB storage, as in the Building an Azure Analysis Services Model on Top of Azure Blob Storage article, and suggests that there is potential for performance improvements in the Azure BLOB storage connector.
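
If a full TOM application feels like overkill, one lighter-weight option (a sketch, assuming the SqlServer PowerShell module and placeholder server/database names) is to trigger processing with Invoke-ProcessASDatabase from a scheduled job:

#Process the deployed Azure Analysis Services model on demand (placeholder names)
Import-Module SqlServer

Invoke-ProcessASDatabase `
    -Server 'asazure://<region>.asazure.windows.net/<yourASserver>' `
    -DatabaseName 'YourTabularModel' `
    -RefreshType 'Full'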

[Screenshot: processing durations for the model]

Even the processing performance against Azure Data Lake could possibly be further increased, as the processor utilization on an S9 Azure Analysis Server suggests (see the following screenshot). For the first 30 minutes, processor utilization is close to the maximum and then it decreases as the AS engine finishes more and more partitions and tables. Perhaps with an even higher degree of parallelism, such as with eight or twelve partitions for each large table, Azure AS could keep processor utilization near the maximum for longer and finish the processing work sooner. But processing optimizations through elaborate table partitioning schemes are beyond the scope of this article. The processing performance achieved with four partitions on each large table suffices to conclude that Azure Data Lake is a very suitable big-data backend for Azure Analysis Services.

[Screenshot: QPU utilization on an S9 Azure Analysis Services server]

There is currently only one important caveat: The Azure Data Lake Store connector only supports interactive logons. When you define the Azure Data Lake Store data source, SSDT prompts you to log on to Azure Data Lake. The connector performs the logon and then stores the obtained authentication token in the model. However, this token only has a limited lifetime. Chances are fair that processing succeeds after the initial deployment, but when you come back the next day and want to process again, you get an error that “The credentials provided for the DataLake source are invalid.“ See the screenshot below. Either you deploy the model again in SSDT or you right-click the data source in SSMS and select Refresh Credentials to log on to Data Lake again and submit fresh tokens to the model.

[Screenshot: Refresh Credentials error for the Data Lake data source]

A subsequent article is going to cover how to handle authentication tokens programmatically, so stay tuned for more on connecting to Azure Data Lake and other big data sources on the Analysis Services team blog. And as always, please deploy the latest monthly release of SSDT Tabular and send us your feedback and suggestions by using SSASPrev at Microsoft.com or any other available communication channels such as UserVoice or MSDN forums.


Analysis Services Team Blog