Tag Archives: data

Deciding Whether to Use Azure SQL Data Warehouse

From time to time I publish on the BlueGranite team blog. My newest post presents a decision tree for deciding whether or not Azure SQL Data Warehouse is a good fit.

In Azure you have several technology choices for where to implement a data warehouse. Since Azure SQL DW is an MPP (massively parallel processing) platform, it’s only appropriate in certain circumstances. Hopefully the decision tree can help educate people on the best use cases and situations for Azure SQL DW, and help them avoid a wrong technology choice that leads to performance issues down the road.

Please head on over to the BlueGranite site to check out the post.


Blog – SQL Chick

Data Quality Best Practices

You know why data quality is important. But do you understand what data quality looks like in practice? If not, this post is for you. Keep reading for a primer on data quality best practices.


Following data quality best practices will help you keep consistent, error-free data that meets its intended goals.

Ensuring data quality means making sure that your data sets are fit to serve the goals you intend to meet with them.

Data that is inconsistent, contains errors, is incomplete, or is difficult to translate into the format you need is low-quality data.

If you lack data quality, you may as well have no data at all. Without data quality, your data can’t reliably deliver insight into your business.


Top 5 Data Quality Best Practices

This is why you should adhere to the following five best practices for maximizing data quality:

1. Establish Metrics

In order to track data quality and assess your organization’s ability to improve the quality of its data over time, you need clear metrics for measuring data quality.

Data quality metrics can include information like the number of incomplete or redundant entries in a database, or the amount of data you have that cannot be analyzed due to formatting incompatibilities.

The exact data metrics you use may vary. What’s essential is to have firm metrics of some kind in place for assessing data quality.
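Metrics like these can often be computed directly with SQL. The sketch below is purely illustrative and assumes a hypothetical dbo.Customers table with Email and Phone columns; it counts incomplete rows (missing contact details) and redundant rows (duplicate email addresses).

-- Illustrative only: assumes a hypothetical dbo.Customers table.

-- Incomplete entries: rows missing key contact details
SELECT COUNT(*) AS IncompleteRows
FROM   dbo.Customers
WHERE  Email IS NULL OR Phone IS NULL;

-- Redundant entries: extra rows that share an email address with another row
SELECT ISNULL(SUM(Occurrences - 1), 0) AS RedundantRows
FROM (
    SELECT Email, COUNT(*) AS Occurrences
    FROM   dbo.Customers
    WHERE  Email IS NOT NULL
    GROUP BY Email
    HAVING COUNT(*) > 1
) AS Dupes;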

2. Perform Data Quality Post-Mortems

From time to time, something will go wrong due to poor data quality. You may be unable to import data into Hadoop because of formatting problems, for example. Or you may deliver marketing materials to the totally wrong people due to data quality errors (yes, that happens – see this data quality failure example).

You should make it a recurring practice to perform a post-mortem after such a problem occurs. Don’t just deal with the consequences and hope they don’t recur, because they will if you fail to take steps to understand and address the underlying cause of the issue.

3. Educate Your Organization

Today, it’s hard to find an employee who doesn’t have a role to play in data management – whether he or she realizes it or not.

True, not everyone is a data scientist. But almost everyone works with data in one way or another…

  • Administrative assistants enter manual data in an appointment book.
  • IT personnel make decisions about which machine data logs to keep, and where to store them.
  • Marketers design websites that automatically collect data about customers.

All of these people – and, indeed, everyone in your organization – should be educated in the basics of data quality. They should understand the importance of avoiding data errors, inconsistencies, and incompleteness.

4. Establish Consistent Procedures

Speaking of consistency, making your data input, storage, extraction and analytics processes as consistent as possible is key to ensuring that your data itself also remains consistent.

Consistent procedures are based on clearly documented steps that everyone follows. Creating and enforcing procedural rules for handling data will do much to help avoid common data quality problems.

5. Perform Data Quality Assurance Audits

You don’t want to wait until you need your data to find out it has quality problems. Instead, you should perform routine and recurring audits.

An audit doesn’t have to involve manual work. You can make data audits a routine, ongoing process by taking advantage of automated data quality solutions, like those from Syncsort.

Download the 2017 Gartner Magic Quadrant for Data Quality Tools report to learn how a leading data quality solution can help you achieve your long-term strategic objectives.


Syncsort + Trillium Software Blog

What You Need To Know About GDPR’s New Data Protection Officer Role


This transcript has been edited for length. To get the full measure, listen to the podcast.

Michelle Huff: Can you remind us what is GDPR? And why do we need to know and prepare for it?

David Fowler: The GDPR, or the General Data Protection Regulation, is the upcoming rewrite of the data protection laws within the European Union. It’s the largest piece of legislation in the last 20 years. If you think about our market where we were 20 years ago and where we are today, the digital channel has certainly transformed itself in relation to user data, data rights, individual access, etc. It’s a very large undertaking in terms of being rolled out within the European Union. And it’s a very heavy piece of legislation.

Michelle: We’re all preparing for GDPR; it’s top of mind for me here in marketing. But it’s a special use case, since marketers often leverage marketing automation to send out emails, and they need to make sure their practices and tools comply. So, can you update us on how Act-On is faring in its preparations?

David: The law doesn’t go into effect until May of 2018. There’s a lot of things that we, as a company, have to be looking at, and you as a customer have to be looking at, in terms of your responsibilities under GDPR.

And a lot depends on which side of that fence you fall on, whether you’re the data controller or the data processor. And in our case, in Act-On, we are both. We serve as a processor of our customers’ data, and also as a controller of your data should we be marketing to you. We have a double whammy, so to speak, as it relates to our obligations.

About a year ago we undertook a third-party assessment of our preparedness for GDPR through a company called TRUSTe, the organization that certifies our website and our privacy principles. From that document we then put together a working plan in terms of what we need to start looking at to get our ship in order for the GDPR.

We are probably between second and third base right now as it relates to where we are as an organization. But it’s really very complex, because how things are interpreted by our customers could be completely different from how we look at things ourselves. I’m very confident that come May of next year, when GDPR kicks in on May 25th, we’ll be good to go.

Michelle: It seems so far away, yet so close. One of GDPR’s requirements is that every company have this role of the data protection officer. Can you tell us more about this role, and how should people be thinking about its responsibilities?

David: The data protection officer is a new concept within the European Union, meaning you must have on staff an employee that does nothing but be responsible for your ongoing obligations under the GDPR.

Even though they’re on your payroll, they report back to the data protection authority within the country where you reside. You can see right out of the gate that there could be some interesting dynamics in that reporting hierarchy. But the point is, they are an extension of the data protection authority within your organization.

They have a responsibility to ensure that you as an organization are meeting your GDPR preparedness and your obligations under the law, and to oversee your ongoing preparations for anything that may pop up. Essentially, they are an extension of the data protection authority within the country that you’re in, but reporting to you at the C level.

Michelle: Is that only for those companies that have EU headquarters? Or is it for those who are doing business in the EU? What companies are really impacted and really starting to have to hire for this role?

David: That’s a good question. The hottest job right now in the European Union is the data protection officer; if you do a Google search, you’ll get numerous hits. But the point is that the GDPR applies to any company that has European citizens within its database. So regardless of where you are on the planet, if you are marketing to European citizens, and you are a certain size or in a certain vertical, then you’re required to have a DPO on staff. In the case of Act-On, because we have a very large chunk of our customers in Europe, and we exceed 250 employees, we will be required to nominate a data protection officer before May of next year.

Michelle: That’s interesting. Can that person reside anywhere in the world? Or do they have to reside in the EU?

David: It depends. It’s funny, because I was just at a GDPR meeting last week in Brussels, and that example came up in one of the sessions I sat in on, where you had a multinational company operating in multiple countries within the European Union. It was a very complex answer to a very complex question. The net-net is you should have one. Where that person actually sits, physically sits, I think is one of those things that will be determined based on how it’s managed once the law is rolled out. As most of these things do, they end up going back to legal teams for an opinion.

Michelle: Why is it needed?

David: Essentially, it’s sort of the ombudsperson, in terms of best practice adoption at the end of the day. Because out of the 99 articles in the GDPR and the 173 recitals, that’s a lot of heavy lifting as it relates to how you operate your digital business or your marketing business regardless of the law itself. I think it’s more of sort of a stopgap solution to ensure that you or anybody who’s required to follow GDPR is following it. And it’s one of those roles I think that will be matured over time. It sounds great in concept, but the reality is when it kicks off, we’ll see how that works.

So being able to manage through GDPR and certainly manage the requirements of GDPR is a massive undertaking. And it’s not necessarily just a business role. This role is really intended for a highly technically oriented person. So, think about a chief privacy officer. This person is going to be more technically inclined, as well as business-process inclined. They’re going to be able to talk the technical side as well as the business side at the same time.

Michelle: How can someone think about this on the brighter side, making lemonade out of lemons? How do you think companies and marketers could really think about this DPO positively instead of thinking of it as a burden, or just a requirement that we have to adhere to?

David: I think it’s one of those things where you leverage the experience of the individual. The GDPR is a massive piece of legislation. And there’s nobody out there that can say, I fully understand. And this person’s going to be able to help facilitate the understanding of the GDPR within your organization. I see that as a benefit, not as a negative. Because you might find ways to do things better and more efficiently. And you might be able to find more sources of revenue, being able to leverage the experience of that individual in your future product marketing developments or product developments.

If we’re having the same conversation five years from now, when we have three or four years under our belt of this particular individual functioning within the company, I think you’ll find it’s going to be far more fruitful than painful.

But there will be some initial bumps in the road, like there always are when things go live. But I think ultimately leveraging that knowledge base is certainly an advantage. And I see that person definitely joined at the hip with the marketing organization, because everything begins when a customer is engaged as it relates to revenue and engagement. If I were a marketer, the CMO of an organization, I’d be joined at the hip with that person, because I think that’s got nothing but benefit for the overall function of that group within the organization.

Michelle: Where can people learn more? Or are there things outside of Act-On that people should really go to and learn more?

David: It depends where you are physically. Some countries are far ahead of others in terms of communicating GDPR notification. We have our GDPR hub, which is up and running now on our website. We’ll be posting a lot more information there as we get closer to the day. But every country is responsible for rolling this out. You’re going to see after the holidays a massive promotional push from the European Union in terms of getting ready for GDPR.

And feel free to reach out to us at gdprinquiry@act-on.net. That email will come directly to me. And I’m more than happy to steer you in the right direction.

Michelle: Thanks, David. Always insightful. It’s great being able to just talk through it.

David: Thank you. Good luck, everyone.


Act-On Blog

Announcing Data & BI Summit in Dublin, Ireland 24-26 April 2018

Working with the Power BI User Group and Dynamic Communities we are very proud to announce Data & BI Summit will be held in Dublin, Ireland 24-26 April 2018.


Data & BI Summit

Join Business Analysts, Data Professionals & Power BI Users at the inaugural Data & BI Summit, located in Dublin, Ireland 24-26 April 2018 at the Convention Centre Dublin.

What to expect with the Data & BI Summit:

  • Have access to exceptional content: Learn how to bring your company through the digital transformation by gaining new understandings of your data and deepening your knowledge of the Microsoft Business Intelligence tools. Products will include Power BI, PowerApps, Flow, SQL Server, Excel, Azure, D365, and more! Sessions will be available for all users, whether you’re just exploring these technologies or looking for advanced information that can take your skills to the next level.
  • Get your questions answered: Network with the Microsoft Power BI team and dig in onsite to find immediate answers from industry experts, Data MVPs, and User Group Leaders, while taking advantage of the opportunity to engage in interactive sessions, workshops and labs.
  • Network with your peers: Enjoy countless opportunities to create lasting relationships by connecting and networking with user group peers, partners and Microsoft team members. 
  • Stretch your skillset: Advance your career by learning the latest updates and how they can help you and your business. Interested in sharing your experiences and sharpening your presenting skills? Click here to submit a session and join the list of volunteers bringing the community together in Dublin.

Why to attend:

“I’m excited to be part of the upcoming Data & BI Summit event. By attending you will have opportunity to hear from some of the best speakers in this field, network with peers, learn about new features, and share best practices.  It’s a can’t miss for anyone using or interested in the Microsoft Business Intelligence tools!”

- Reza Rad, PUG Board Member, Microsoft MVP, PUG Leader


Early Bird Pricing

Join your peers and other experts in Dublin, Ireland, 24-26 April 2018 by registering with Early Bird pricing, a savings of €400.

Register Now



Microsoft Power BI Blog | Microsoft Power BI

Now Available: 2018 Big Data Trends Survey Report

The results from Syncsort’s 2018 Big Data Trends Survey are in, and it’s clear that Big Data will be stronger than ever! Although the technologies might change (e.g., Hadoop is giving way to Spark), initiatives involving the processing of massive data volumes for greater insights are not going away.

In this report, 2018 Big Data Trends: Liberate, Integrate & Trust, we review what every business needs to know in the upcoming year about Big Data including the five top trends for 2018 and Big Data technology’s business benefits. It also lays out the most common challenges of data lake implementation, with a spotlight on data quality’s impact on the data lake.


Download the 2018 report now!


Syncsort + Trillium Software Blog

Uniting Data Quality and Data Integration

Recently Syncsort has added several companies to its portfolio. An astute observer may notice that the product offerings from these companies are complementary. We are not planning to make any changes to these products. Our customers will continue to use these products in their production systems and we will continue to focus on providing excellent support.

On the other hand, a tighter integration of some of these products will be beneficial to many customers. Case in point: joining key data integration and data quality functionality. That’s what we did in creating Trillium Quality for Big Data.

About our Data Quality Technology

Syncsort’s Trillium data quality software has been the leader in data quality for many years. In fact, Gartner named Syncsort a leader in its Magic Quadrant for Data Quality Tools 2017 for the twelfth consecutive year – every year since Gartner started publishing it.

Why? Because what Gartner and customers consider important is solid, stable core data quality functionality such as parsing, standardization, and cleansing, along with additional key capabilities like profiling, interactive visualization, matching, multi-domain support, and business-driven workflow.


Trillium Quality allows users to cleanse and de-dupe customer records. Why is this so important? Data Scientists spend a significant amount of their time cleansing and de-duping before doing any analytics or applying machine learning algorithms on the data. Also, if you want to get a Customer 360 view, you need to de-dupe the customer records.
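Trillium’s own matching engine is proprietary, but the basic idea of de-duplication can be sketched in plain SQL. The example below is illustrative only; it assumes a hypothetical dbo.Customers table with CustomerId, FullName, and Email columns and simply keeps one survivor row per normalized name-and-email combination.

-- Illustrative sketch only; not Trillium's actual matching logic.
-- Assumes a hypothetical dbo.Customers table with CustomerId, FullName, Email.
WITH Ranked AS (
    SELECT CustomerId,
           FullName,
           Email,
           ROW_NUMBER() OVER (
               PARTITION BY LOWER(LTRIM(RTRIM(Email))), LOWER(LTRIM(RTRIM(FullName)))
               ORDER BY CustomerId
           ) AS rn
    FROM dbo.Customers
)
SELECT CustomerId, FullName, Email
FROM   Ranked
WHERE  rn = 1;   -- keep one survivor per duplicate group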

The cleansing and matching processes are set up as a Data Quality project in the Trillium Control Center UI. Users can then run the project on a stand-alone server. Nowadays, processing millions of customer transactions on a daily basis is very common. In some cases, the project may run for several hours and may not meet the end users’ SLAs.

About our Data Integration Technology

Syncsort’s DMX-h is a high-performance ETL tool that allows users to develop their ETL data flow once and run it on a standalone server or on a Big Data platform like Hadoop. They can even use different distributed computing paradigms, like MapReduce or Spark, without making any changes to the data flow. The DMX-h Intelligent Execution (IX) engine dynamically comes up with an optimal execution plan based on the computing paradigm of choice, on premises or in the cloud.

Typically, users develop an ETL data flow with the DMX-h UI as a DMX-h job. They verify that the job is set up correctly by running it on a small subset of data on a standalone server. Once it is debugged for correctness, they deploy it to run on distributed platforms on premise or in the cloud. Running the job on a cluster of tens or hundreds of compute nodes brings tremendous horizontal scaling.

Putting it all Together with Trillium Quality for Big Data

During the last several months, Trillium and DMX-h engineers collaborated to build a solution that puts Trillium Quality on steroids by allowing it to run under DMX-h IX. The fruit of their hard work is Trillium Quality for Big Data.

With Trillium Quality for Big Data, users design the Quality project in the Trillium Control Center; there is no change to that. They then choose to deploy the project to the Hadoop edge node. On the edge node, they run a shell script that automagically converts the deployed Trillium project to a DMX-h job and runs it using the MapReduce or Spark framework, at the user’s discretion. They still have the option to run the project standalone on the edge node. The “Design Once, Run Anywhere” mantra of DMX-h is now available for Trillium Quality projects. By deferring the choice of platform until run time, users can concentrate on the business logic in the project at design time.

The horizontal scaling of Trillium Quality projects is phenomenal. During our in-house testing, we noticed that projects that took hours to run stand-alone were finishing in minutes when run in the cluster. Already, a large European bank is testing this integrated solution for their transactional data.

Stay tuned for more exciting integrated product offerings from Syncsort that leverage our data integration and data quality technology in the near future!

Learn about why Syncsort is a leader in Gartner’s Magic Quadrant for Data Quality Tools report


Syncsort + Trillium Software Blog

What IBM looks for in a data scientist


Job seekers sometimes ask how IBM defines “data scientist.” It’s an important question since more and more would-be data scientists are fighting for attention in an increasingly lucrative labor market.

The first step is to distinguish between what we see as true data scientists and other professionals working in adjacent roles (for instance, data engineers, business analysts, and AI application developers). To make that distinction, let’s first define what we mean by data science.

At its core, data science is applying the scientific method to solve business problems.

You can further expand on the definition by understanding that we solve those business problems using artificial intelligence to create predictions and prescriptions and to optimize processes.

The definition demonstrates that to achieve the true potential of data science, we need data scientists with very particular experiences and skills — specifically, we need people with the experiences and skills required to run and complete data science projects:

1. Training as a scientist, with an MS or PhD
2. Expertise in machine learning and statistics, with an emphasis on decision optimization
3. Expertise in R, Python, or Scala
4. Ability to transform and manage large data sets
5. Proven ability to apply the skills above to real-world business problems
6. Ability to evaluate model performance and tune it accordingly

Let’s look at those qualifications in the context of our definition of data science.

1. Training as a scientist, with a Master of Science or doctorate

This is less about the degree itself and more about what you learn when you get an advanced degree. In short, you learn the scientific method, which starts with the ability to take a complex yet abstract problem and break it down into a set of testable hypotheses. This continues with how well you design experiments to test your hypotheses, and how you analyze the results to see whether the hypotheses are confirmed or contradicted. A determined person can learn these skills outside of academia or via the right mix of online training and practice — so there’s some flexibility around having the actual degree — but direct experience applying the scientific method is a must.

Another advantage of an advanced degree is the rigor of the peer review process and the publishing requirements that degree programs impart. To get published, candidates have to present their work in a way that allows others to review and reproduce it. You must also provide evidence that the results are valid and the methods are sound. Doing so requires a deep understanding of the difference between probabilistic and deterministic factors, as well as the value and curse of correlation. It’s possible to get an abstract sense of those values, but there’s no substitute for the negative and positive reinforcement from mentors, or the rejection or acceptance by journals and reviewers.

2. Expertise in machine learning and statistics, with an emphasis on decision optimization

Applying the scientific method to business problems lets us make better decisions by predicting what will happen next. Those predictions are the product of artificial intelligence and more specifically machine learning. For a true data scientist, the core technical skillsets of machine learning and statistics are simply non-negotiable.

In addition, decision optimization (aka operations research) is a fast-growing aspect of data science. Indeed, the goal of data science is to help make better decisions by probabilistically estimating what’s likely to occur in the future. Carefully applying decision optimization lets data scientists prescribe or determine the next best action for the best business outcome.

3. Expertise in R, Python, or Scala

Being a data scientist doesn’t require you to be as good at programming as professional developers, but the ability to create and run code that supports the data science process is mandatory — and that includes the ability to use statistical and machine learning packages in one of the popular data science languages.

Python, R, and Scala are the fastest-growing languages for data science, along with Julia, another upcoming language in the space, though Julia isn’t yet fully mature. Like Python, R, and Scala, the core of Julia is open source. But it’s important to note that the reason to use these languages isn’t that they’re free, but for the innovation and the freedom to take them where you want to go.

4. Ability to transform and manage large data sets

The fourth skill is sometimes called big data. Here, the ability to use distributed data processing frameworks like Apache Spark is key. The true data scientist will know how to pull data sets together from multiple sources and multiple data types with the help of his or her data science team. The data itself might be a combination of structured, semi-structured, and unstructured data living on multiple clouds.

The data management process consists of finding and collecting the data, exploring the data, transforming the data, identifying features (data elements important in the prediction), engineering the features, and making the data accessible to the model for training. A priority for any data scientist will be streamlining this process, which can easily eat up 80 percent of their time.

5. Proven ability to apply the skills above to real-world business problems

Fifth on the list is a soft skill set. It’s the ability to communicate with non-data scientists in order to make sure that data science teams have the data resources they need and that they’re applying data science to the right business problems. Mastering this skill also means ensuring that the results of data science projects — for instance, predictions about the probable evolution of the business — are fully understood and actionable by business people. This requires good storytelling skills, and in particular, the ability to map mathematical concepts to common sense.

6. Ability to evaluate model performance and tune it accordingly

To some, this sixth skillset is an aspect of the second skillset: expertise in machine learning in general. We wanted to call it out separately because, all too often, it’s what distinguishes a good data scientist from a dangerous one. Data scientists who lack this skill can easily believe that they’ve created and deployed effective models when in fact their models are badly over-fit to the available training data.

Be a true data scientist

If you want to be a true data scientist — as opposed to an aspiring data scientist or a data scientist in title only — we encourage you to master each of these six competencies. A data scientist is fundamentally different from a business analyst or data analyst, who often serve as product owners on data science teams, with the important role of providing subject matter expertise to the data scientists themselves.

That’s not to say business analysts, data analysts, and others can’t transition to become true data scientists — but understand that it takes time, commitment, mentoring, and applying yourself again and again to real and difficult problems.

Seth Dobrin is vice president and chief data officer at IBM Analytics.

Jean-François Puget is an IBM distinguished engineer in machine learning and optimization.


Big Data – VentureBeat

TIBCO Named a Leader in The Forrester Wave™: Enterprise Data Virtualization, Q4 2017


Off to a roaring start after its November 1, 2017 acquisition, TIBCO Data Virtualization has been named a Leader in The Forrester Wave™: Enterprise Data Virtualization, Q4 2017 by Forrester Research.

Forrester believes that enterprise data virtualization has become critical for every organization in overcoming growing data challenges, by delivering faster access to connected data and supporting self-service and agile data-access capabilities for EA pros to drive new business initiatives.

According to the report, 56% of global technology decision makers claim they have already implemented, are implementing, or are expanding or upgrading their implementations of DV technology in 2017, up from 45% in 2016.

The analyst firm had the following to say about the TIBCO acquisition:

“We see this as a big opportunity for TIBCO to expand its data management and integration capabilities by integrating its existing in-memory, graph, analytics, data quality, and master data management with Cisco’s DV solution. This will help TIBCO deliver an end-to-end data and analytics platform to support new business use cases that require orchestration of silos in real time with self-service and automation capabilities.” —The Forrester Wave™: Enterprise Data Virtualization, Q4 2017

Learn more about TIBCO Data Virtualization and download a complimentary copy of the report.


The TIBCO Blog

Convert ArrayPlot Data to Matrix


I might have made a mistake here. Suppose I have spent a long time doing a calculation for an array plot as follows:

ArrayPlot[
 ParallelTable[
  abc[15, 1, K, n], {K, 1, 400}, {n, 1, 400}]]

Where abc is just some function that takes a while to evaluate for large K and n. Now, once the computation finished, all of the information is there, visually. But suppose I want to keep all of the data (actual numerical values) in some kind of external file. Is there any way to do this after the fact? Or is the only way to get those numbers to go back, re-compute every value again in a table, and then export that table to a .hdf file?


Recent Questions – Mathematica Stack Exchange

Summarizing Data Using the GROUPING SETS Operator

Maybe you have felt overwhelmed when you’re analyzing a dataset because of its size. The best way to handle this situation is by summarizing the data to get a quick review.

In T-SQL, you summarize data by using the GROUP BY clause within an aggregate query. This clause creates groupings which are defined by a set of expressions. One row per unique combination of the expressions in the GROUP BY clause is returned, and aggregate functions such as COUNT or SUM may be used on any columns in the query. However, if you want to group the data by multiple combinations of group by expressions, you may take one of two approaches. The first approach is to create one grouped query per combination of expressions and merge the results using the UNION ALL operator. The other approach is to use the GROUPING SETS operator along with the GROUP BY clause and define each grouping set within a single query.

In this article I’ll demonstrate how to achieve the same results using each method.

Prepare the data set

All queries in this article will run in the AdventureWorks2012 database. If you wish to follow along with this article, download it from here.

Case Study: Data Analyst at Adventure Works

Imagine you’re working as a data analyst at the bike manufacturer Adventure Works, and you’re interested in the company’s income over the last few years. This means you need to group the company’s income per year and run the following query:

Query 1. Income by year
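The article’s actual listing isn’t reproduced in this archive; a minimal sketch of such a query against AdventureWorks2012 might look like the following, assuming Sales.SalesOrderHeader as the source table with OrderDate as the order date and SubTotal as the income measure (the article later refers to summing SubTotal).

SELECT YEAR(OrderDate) AS OrderYear,
       SUM(SubTotal)   AS Income
FROM   Sales.SalesOrderHeader
GROUP BY YEAR(OrderDate)
ORDER BY OrderYear;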

Query 1 returns the following result set:

OrderYear   Income
---------   ----------
2005        11331809
2006        30674773.2
2007        42011037.2
2008        25828762.1

Table 1. Company’s income per year.

According to Table 1, the company registered income between 2005 and 2008. Assuming that the currency is US dollars, in 2005 its income was around eleven million dollars, in 2006 it was around thirty million dollars, and so on. This kind of information would be useful for supporting a business decision such as opening a new branch elsewhere.

However, if you still want more details about the company’s income, you must perform a new grouping by adding a column or expression to the GROUP BY clause. Add the order month to the previous set of group by expressions. By doing this, the query will return the company’s income per year and month. Review the GROUP BY clause in the following query.

Query 2. Company’s income per year and month.
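Again, the listing itself isn’t shown here; under the same assumptions as the Query 1 sketch, it might look like this:

SELECT YEAR(OrderDate)  AS OrderYear,
       MONTH(OrderDate) AS OrderMonth,
       SUM(SubTotal)    AS Income
FROM   Sales.SalesOrderHeader
GROUP BY YEAR(OrderDate), MONTH(OrderDate)
ORDER BY OrderYear, OrderMonth;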

The following table contains the result set of Query 2:

OrderYear   OrderMonth   Income
---------   ----------   ----------
2005        7            962716.742
2005        8            2044600
2005        9            1639840.11
2005        10           1358050.47
2005        11           2868129.2
2005        12           2458472.43
2006        1            1309863.25
2006        2            2451605.62
2006        3            2099415.62
2006        4            1546592.23
2006        5            2942672.91
2006        6            1678567.42
2006        7            2894054.68
2006        8            4147192.18
2006        9            3235826.19
2006        10           2217544.45
2006        11           3388911.41
2006        12           2762527.22
2007        1            1756407.01
2007        2            2873936.93
2007        3            2049529.87
2007        4            2371677.7
2007        5            3443525.25
2007        6            2542671.93
2007        7            3554092.32
2007        8            5068341.51
2007        9            5059473.22
2007        10           3364506.26
2007        11           4683867.05
2007        12           5243008.13
2008        1            3009197.42
2008        2            4167855.43
2008        3            4221323.43
2008        4            3820583.49
2008        5            5194121.52
2008        6            5364840.18
2008        7            50840.63

Table 2. Company’s income per year and month.

This result set is more detailed than the former. In July 2005, their income was around nine hundred sixty thousand dollars. In August 2005, it was around two million dollars, and so on. The more expressions or columns added to the GROUP BY clause, the more detailed the results will be.

If you observe the structure of the two queries, you will see they’re grouped by a single set of grouping expressions. The former is grouped by order year, and the latter is grouped by order year and month.

Suppose the business manager at Adventure Works wants to visualize both results within a single result set. To accomplish this, you may merge the previous queries – Query 1 and Query 2 – by using the UNION ALL operator. First, modify Query 1 by adding a dummy column so it will have the same number of columns as Query 2. All queries merged by the UNION operator must have the same number of columns. This dummy column will return NULL in the OrderMonth column, identifying the OrderYear total rows of this query. The UNION ALL query looks like this:

Query 3. Company’s income per year and per year and month.
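A sketch of this UNION ALL approach, using the same assumed table and columns as before, could look like the following; note the NULL dummy column in the first grouped query.

SELECT YEAR(OrderDate) AS OrderYear,
       NULL            AS OrderMonth,   -- dummy column acting as a placeholder
       SUM(SubTotal)   AS Income
FROM   Sales.SalesOrderHeader
GROUP BY YEAR(OrderDate)
UNION ALL
SELECT YEAR(OrderDate),
       MONTH(OrderDate),
       SUM(SubTotal)
FROM   Sales.SalesOrderHeader
GROUP BY YEAR(OrderDate), MONTH(OrderDate)
ORDER BY OrderYear, OrderMonth;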

Figure 1 shows the result set produced by Query 3. Review the comments in the figure which identify the grouping sets.


Figure 1. Company’s income per year and per year and month. Notice the comments added to the figure.

This information doesn’t look new, because you already know that in 2005 the company’s income was around eleven million dollars, in July 2005 it was around nine hundred sixty thousand dollars, and so on. What’s new is that the two grouping results – the year grouping and the year-and-month grouping – are merged into a single result set.

Maybe you’ve figured out how the NULL values appeared in the result set. Remember you used the NULL as a dummy column to identify the results from the order year grouping. Look carefully at Figure 2 which details the placeholders in the first grouped query.


Figure 2. Pointing out the placeholders.

When there’s more than one group by expression list involved in the query, a NULL is used as a placeholder to identify one of the groupings in the results. Looking at Figure 2 again, a row that has NULL in the OrderMonth column means the row belongs to the order year grouping. When the row has a value in both the OrderYear and OrderMonth columns, it means the row belongs to the order year and month grouping. This situation happens when one of the grouped queries doesn’t have the same number of columns grouped. In this example, the first grouping is by order year and the second grouping is by order year and month.

Although you obtained the desired result, Query 3 would be even larger if you added another grouping set, such as order day. As a data analyst, you decided to search the internet to find a way to achieve the same results but with less work. You find that by using the GROUPING SETS operator you should get the same result set, but with less coding! This really motivates you, and you write the following query using GROUPING SETS:

Query 4. Getting the same result set produced by the Query #3 but using the GROUPING SETS clause.
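The GROUPING SETS version might look like this (same assumptions as the earlier sketches); each grouping set is listed once, and the UNION ALL disappears:

SELECT YEAR(OrderDate)  AS OrderYear,
       MONTH(OrderDate) AS OrderMonth,
       SUM(SubTotal)    AS Income
FROM   Sales.SalesOrderHeader
GROUP BY GROUPING SETS
(
    (YEAR(OrderDate)),                    -- per year
    (YEAR(OrderDate), MONTH(OrderDate))   -- per year and month
)
ORDER BY OrderYear, OrderMonth;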

The result set produced by Query 4 is the same as that displayed in Figure 1. Figure 3 shows the results, but the new technique requires less code. The GROUPING SETS operator is used along with the GROUP BY clause, and allows you to write multi-grouped queries just by specifying the grouping sets separated by commas. However, you need to be careful when specifying the grouping sets. For example, if a grouping contains two columns, say column A and column B, both columns need to be contained within parentheses: (column A, column B). If the parentheses are omitted, the GROUPING SETS clause will define them as separate groupings, and the query will not return the desired results.


Figure 3. Company’s income per year and per year and month using the GROUPING SETS clause.

By the way, if you want to perform the aggregation over the entire result set without grouping, but still use the GROUPING SETS operator, just add an empty pair of parentheses as a grouping set. Look at Query 5, which calculates the company’s income per year, per year and month, and overall:

Query 5. Company’s income per year, per year and month, and overall.
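Under the same assumptions, the only change from the Query 4 sketch is the extra empty grouping set:

SELECT YEAR(OrderDate)  AS OrderYear,
       MONTH(OrderDate) AS OrderMonth,
       SUM(SubTotal)    AS Income
FROM   Sales.SalesOrderHeader
GROUP BY GROUPING SETS
(
    (YEAR(OrderDate)),                    -- per year
    (YEAR(OrderDate), MONTH(OrderDate)),  -- per year and month
    ()                                    -- grand total over the whole table
)
ORDER BY OrderYear, OrderMonth;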

Notice the placeholders for the third grouping shown in Figure 4. The query calculated the grand total of income just by specifying an empty pair of parentheses as the third grouping set; the third grouping set is the sum of SubTotal over the table itself.


Figure 4. Company’s income per year, per year and month, and overall.

By the way, you may have asked yourself, “What would happen if NULL is part of the data and isn’t used as a placeholder?” or “How can I tell when NULL is used as a placeholder rather than as a value?” In this example, I ensured that the grouped columns aren’t nullable, so the NULLs are used only as placeholders. If the grouped columns are nullable, you will need to use the GROUPING or GROUPING_ID function to identify whether a NULL came from the GROUPING SETS operator – it can also come from other grouping operators like ROLLUP and CUBE – or is part of the data. Both functions – GROUPING and GROUPING_ID – will be covered in another article.
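As a brief preview, the GROUPING function returns 1 when the NULL in a column is a placeholder produced by the grouping operator, and 0 when the value comes from the data itself. A small sketch, using the same assumed table and columns as above:

SELECT YEAR(OrderDate)            AS OrderYear,
       MONTH(OrderDate)           AS OrderMonth,
       GROUPING(MONTH(OrderDate)) AS MonthIsPlaceholder,  -- 1 = placeholder NULL
       SUM(SubTotal)              AS Income
FROM   Sales.SalesOrderHeader
GROUP BY GROUPING SETS
(
    (YEAR(OrderDate)),
    (YEAR(OrderDate), MONTH(OrderDate))
);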

Conclusion

In this article, you learned how to write an aggregate query with more than one grouping expression list by using the GROUPING SETS operator. Unlike with other operators such as ROLLUP and CUBE, you must specify each grouping set explicitly. These grouping operators are very important for summarizing data and producing grand totals and subtotals. If you want more information about these operators, please read this article.

I suggest that you practice what you’ve learned in this article; this topic is very important for anyone working with SQL Server data. Data volumes are increasing very quickly, and it’s vital to summarize data to gain better knowledge about the business.


SQL – Simple Talk