
Optimizing Queries with Attribute Clustering

Attribute clustering is a feature that’s new to Oracle Database 12.1.0.2 (Enterprise Edition). It was designed to work with other features, such as compression, storage indexes, and especially with a new feature called zone maps, but since attribute clustering has some useful benefits of its own I’ll cover them here and make zone maps the subject of a later post.

So what is attribute clustering? It is simply a table property – just like compression – but it defines how rows should be ordered and clustered together in close physical proximity, based on one or more column values. For example, in a sales transaction table you could choose to cluster together rows that share common customer ID values. Why would you consider doing this? Perhaps your system frequently queries sales data relating to particular customers, and perhaps there is a requirement for extremely fast and consistent response times (a call center CRM application would be a good example). Your physical database design will probably incorporate an index on the customer ID column already, but you can gain further benefit if sales table rows are physically near to one another when they share common customer IDs. The diagram below represents an index being scanned to find a particular customer, followed by some reads that fetch the corresponding sales table rows:

[Diagram: an index scan locates a particular customer's index entries, followed by reads of the corresponding sales table rows]

With attribute clustering, the matching sales table rows are near to one another, so it is likely that fewer database blocks will need to be read from storage (or the database cache) than if the rows were scattered throughout the sales table. The reason for this is that database blocks will usually contain multiple rows, so it is beneficial if each block we read happens to contain multiple matching rows. Technically, attribute clustering improves the index clustering factor, an important metric with regard to the efficiency of scanning an index and fetching the corresponding table rows.
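As an aside, you can observe the clustering factor for yourself in the data dictionary. The query below (which assumes the SALES_AC_CIX index created in the example later in this post) reports it alongside the row and leaf block counts; broadly, the closer the clustering factor is to the number of table blocks rather than to the number of rows, the better:

-- Report the clustering factor for the example index.
-- A value near the number of table blocks indicates well-clustered data;
-- a value near the number of rows indicates scattered data.
SELECT index_name,
       clustering_factor,
       num_rows,
       leaf_blocks
FROM   user_indexes
WHERE  index_name = 'SALES_AC_CIX';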

Many DBAs have used a similar trick in the past by ordering rows as they are loaded into the database (using an explicit SQL “ORDER BY” clause). Attribute clustering has the advantage of being transparent and a property of the table itself; clustering behavior is inherited from the table definition and is implemented automatically. Just like compression, attribute clustering is a directive that transparently kicks in for certain operations, namely direct path insertion or data movement. This is especially useful because row clustering will occur during table and partition movement as well as during data load.
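For example, once clustering has been added to a table (as we will do below), a conventional insert is unaffected, but a direct path load orders the incoming rows transparently. A minimal sketch, assuming a hypothetical staging table called SALES_STAGE that does not appear elsewhere in this post:

-- A conventional INSERT ... SELECT does not trigger attribute clustering.
-- Adding the APPEND hint makes it a direct path insert, so the new rows
-- are ordered by the clustering column(s) as they are written.
INSERT /*+ APPEND */ INTO sales_ac
SELECT sale_id, customer_id
FROM   sales_stage;

COMMIT;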

It’s pretty common for database systems to order rows on data load or data movement to improve table compression ratios. Attribute clustering can be used instead to achieve a similar result.

The name “attribute clustering” might remind you of another database feature called Oracle Table Clusters, but be careful not to confuse the two. Oracle Table Clusters store rows from one or more tables in close proximity inside a specialized storage structure. Attribute clustering simply orders rows within a table (or its partitions and subpartitions); related rows will be physically close to one another, but they will not be held inside any new type of storage structure that’s specific to attribute clustering.
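To make the contrast concrete, here is a sketch of a traditional table cluster (the cluster and table names are invented for illustration); note the dedicated cluster segment and cluster index, neither of which exists for an attribute clustered table:

-- A classic Oracle table cluster: a dedicated storage structure, with a
-- cluster index, inside which the clustered table is created.
CREATE CLUSTER customer_cluster (customer_id NUMBER(10)) SIZE 512;

CREATE INDEX customer_cluster_idx ON CLUSTER customer_cluster;

CREATE TABLE sales_in_cluster (
  sale_id     NUMBER(10),
  customer_id NUMBER(10)
)
CLUSTER customer_cluster (customer_id);

-- Attribute clustering, used later in this post via
-- ALTER TABLE ... ADD CLUSTERING BY LINEAR ORDER, leaves the table as an
-- ordinary heap table; no separate storage structure is created.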

Although attribute clustering is especially useful in conjunction with zone maps, it can be used on its own as a transparent, declarative way to cluster or order table rows to improve:

  • Index range scan performance.
  • Table compression ratios (including for Oracle Database In-Memory).
  • Smart scan filtering on Exadata and the In-Memory column store through more efficient storage indexes.

Here’s an example of using attribute clustering to speed up a query. We’ll compare before and after; so start by creating a table that is not attribute clustered:

CREATE TABLE sales_ac (sale_id NUMBER(10), customer_id NUMBER(10));

INSERT INTO sales_ac
SELECT ROWNUM, MOD(ROWNUM,1000)
FROM   dual
CONNECT BY LEVEL <= 100000;

EXEC dbms_stats.gather_table_stats(ownname=>NULL, tabname=>'SALES_AC');

CREATE INDEX sales_ac_cix ON sales_ac (customer_id);

SET AUTOTRACE ON STATISTICS

Our table is not very large, so I’ve included a hint in the test query to encourage the optimizer to use the index: 

SELECT /*+ INDEX(sales_ac sales_ac_cix) */ COUNT(DISTINCT sale_id)
FROM  sales_ac
WHERE customer_id = 50;

Run the query a few times, and see what the reported value for “consistent gets” settles at. I get 96, but since the value depends on some storage defaults, don’t be surprised if your value is different:
[Screenshot: autotrace statistics before attribute clustering]
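If you prefer not to use autotrace, a similar figure can be read from the session statistics; the simple query below is one alternative (the exact value will still depend on your storage defaults):

-- The current session's "consistent gets" counter. Note the value, run the
-- test query, then re-run this query and compare the difference.
SELECT sn.name, ms.value
FROM   v$mystat ms
       JOIN v$statname sn ON sn.statistic# = ms.statistic#
WHERE  sn.name = 'consistent gets';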

Attribute clustering is a property of the table, so when it is added, existing rows are not re-ordered. The following command is very fast because it just makes a change to the data dictionary:

ALTER TABLE sales_ac 
ADD CLUSTERING BY LINEAR ORDER (customer_id) 
WITHOUT MATERIALIZED ZONEMAP;
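At this point the clustering directive is metadata only; if you want to confirm that it has been recorded, the definition is exposed through data dictionary views such as USER_CLUSTERING_TABLES and USER_CLUSTERING_KEYS:

-- Show the clustering definition we have just added (no rows have been
-- physically reordered yet).
SELECT * FROM user_clustering_tables WHERE table_name = 'SALES_AC';

SELECT * FROM user_clustering_keys WHERE table_name = 'SALES_AC';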

Now we can physically cluster the existing table data by moving the table:

ALTER TABLE sales_ac MOVE;

Moving tables and partitions is much cleaner and simpler than the manual “ORDER BY” method, where we would have to create a new table, add indexes, drop the old table and then rename the new one. The simpler MOVE approach is particularly relevant in real-world scenarios, where it would be more usual to move a table one partition or sub-partition at a time, potentially using online operations.
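For instance, on a partitioned sales table (the table and partition names here are hypothetical), a single statement can re-cluster one partition's data while keeping the indexes usable and allowing concurrent DML:

-- Move (and therefore re-cluster) one partition at a time.
-- UPDATE INDEXES keeps the indexes usable; ONLINE permits concurrent DML
-- (online partition move requires Oracle Database 12c Release 1 or later).
ALTER TABLE sales
  MOVE PARTITION sales_2014_q4
  UPDATE INDEXES
  ONLINE;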

Rebuild the index:

ALTER INDEX sales_ac_cix REBUILD;

Use the same test query:

SELECT /*+ INDEX(sales_ac sales_ac_cix) */ COUNT(DISTINCT sale_id)
FROM   sales_ac
WHERE  customer_id = 50;

Again, run it a number of times to let the reported value for “consistent gets” settle. In my case, I now read 3 database blocks instead of 96: a considerable improvement!

[Screenshot: autotrace statistics after attribute clustering]

Part 4 of DBAs guide to managing sandboxes – Observe

This is the next part in my on-going series of posts on the topic of how to successfully manage sandboxes within an Oracle data warehouse environment. In Part 1 I provided an overview of sandboxing (key characteristics, deployment models) and introduced the concept of a lifecycle called BOX’D (Build, Observe, X-Charge and Drop). In Part 2 I briefly explored the key differences between data marts and sandboxes. Part 3 explored the Build-phase of our lifecycle.

Now, in this post I am going to focus on the Observe-phase. At this stage in the lifecycle we are concerned with managing our sandboxes. Most modern data warehouse environments will be running hundreds of data discovery projects, so it is vital that the DBA can monitor and control the resources that each sandbox consumes, by establishing rules that govern the resources available both to projects in general and to each specific project.

In most cases, DBAs will set up a sandbox with dedicated resources. However, this approach does not make efficient use of resources, because unused capacity cannot be shared with other projects. The key advantage of Oracle Multitenant is its unique approach to resource management. The only realistic way to support thousands of sandboxes, which in today’s analytics-driven environments is entirely possible if not inevitable, is to allocate one chunk of memory and one set of background processes per container database and let all of the pluggable sandboxes share them. This provides much greater utilisation of existing IT resources and greater scalability as multiple pluggable sandboxes are consolidated into the multitenant container database.

[Diagram: multiple pluggable sandboxes consolidated into a single multitenant container database]

Using multitenant we can now expand and reduce our resources as required to match our workloads. In the example below we are running an Oracle RAC environment with two nodes in the cluster. You can see that only certain PDBs are open on certain nodes of the cluster; this is achieved by opening the corresponding services on those nodes as appropriate. In this way we are partitioning the SGA across the various nodes of the RAC cluster, which gives us the scalability we need to manage a large number of sandboxes. At this stage we have a lot of project teams running large, sophisticated workloads, which is causing the system to run close to capacity, as represented by the little resource meters.

[Diagram: two-node RAC cluster with PDBs open on specific nodes; the resource meters show the system running close to capacity]
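As a rough sketch of what this looks like in practice (the PDB and instance names below are invented for illustration, and in a managed RAC environment you would more typically drive this through services), a PDB can be opened on just the instances where you want its workload to run:

-- From the root container, open the sandbox PDB on the first RAC
-- instance only, leaving it closed on the other node.
ALTER PLUGGABLE DATABASE sales_sbx OPEN INSTANCES = ('RAC1');

-- Later, close it on that instance (for example, before rebalancing).
ALTER PLUGGABLE DATABASE sales_sbx CLOSE INSTANCES = ('RAC1');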

It would be great if our DBA could add some additional processing power to this environment to handle this increased workload. With 12c we can simply drop another node into the cluster, which allows us to spread the processing of the various sandbox workloads out across the expanded cluster.

[Diagram: a third node added to the RAC cluster, spreading the sandbox workloads across the expanded cluster]

Now our little resource meters are showing that the load on the system is a lot more comfortable. This shows that the new multitenant feature integrates really well with RAC. It’s a symbiotic relationship whereby Multitenant makes RAC better and RAC makes Multitenant better.

Now that we can add resources to the cluster, how do we actually manage resources across each of our sandboxes? As a DBA, I am sure you are familiar with the features in Resource Manager that allow you to control system resources: CPU, sessions, parallel execution servers and Exadata I/O. If you need a quick refresher on Resource Manager, check out the presentation by Dan Norris, “Overview of Oracle Resource Manager on Exadata”, and the chapter on resource management in the 12c DBA guide.

With 12c, Resource Manager is now multitenant-aware. Using Resource Manager we can configure policies to control how system resources are shared across the sandboxes/projects. Policies control how resources are utilised across PDBs, creating hard limits that can enforce a “get what you pay for” model, which is an important point when we move forward to the next phase of the lifecycle: X-Charge. Within Resource Manager we have adopted an “industry standard” approach to controlling resources based on two notions:

  1. a number of shares is allocated to each PDB
  2. a maximum utilization limit may be applied to each PDB

To help DBAs quickly deploy PDBs with a pre-defined set of shares and utilisation limits, there is a “Default” configuration that continues to work even as PDBs are added or removed. How would this work in practice? Using a simple example, this is how we could specify a resource plan for the allocation of CPU between three PDBs:

  PDB                 Shares   Maximum utilisation
  Data warehouse      2        (not specified)
  Sales sandbox       1        (not specified)
  Marketing sandbox   1        (not specified)

As you can see, there are four shares in total: two for the data warehouse and one each for our two sandboxes. This means that our data warehouse is guaranteed 50% of the CPU whatever else is going on in the other sandboxes (PDBs). Similarly, each of our sandbox projects is guaranteed at least 25%. However, in this case we did not specify settings for maximum utilisation, so our marketing sandbox could use 100% of the CPU if both the data warehouse and the sales sandbox were idle.
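A minimal sketch of how such a plan could be defined with the DBMS_RESOURCE_MANAGER package is shown below; the plan and PDB names are invented for illustration, and the share values mirror the example above:

BEGIN
  dbms_resource_manager.create_pending_area();

  -- A CDB-level plan that divides CPU between the PDBs by shares.
  dbms_resource_manager.create_cdb_plan(
    plan    => 'sandbox_cdb_plan',
    comment => 'CPU shares for the data warehouse and sandbox PDBs');

  dbms_resource_manager.create_cdb_plan_directive(
    plan               => 'sandbox_cdb_plan',
    pluggable_database => 'dwpdb',
    shares             => 2);   -- guaranteed at least 50% of CPU

  dbms_resource_manager.create_cdb_plan_directive(
    plan               => 'sandbox_cdb_plan',
    pluggable_database => 'sales_sbx',
    shares             => 1);   -- guaranteed at least 25% of CPU

  dbms_resource_manager.create_cdb_plan_directive(
    plan               => 'sandbox_cdb_plan',
    pluggable_database => 'marketing_sbx',
    shares             => 1);   -- guaranteed at least 25% of CPU

  dbms_resource_manager.validate_pending_area();
  dbms_resource_manager.submit_pending_area();
END;
/

-- Activate the plan in the root container.
ALTER SYSTEM SET resource_manager_plan = 'sandbox_cdb_plan';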

By using the “Default” profile we can simplify the whole process of adding and removing sandboxes/PDBs. As sandboxes are added and removed, the system resources are automatically rebalanced across all the plugged-in sandboxes/PDBs using the default profile’s settings, as shown below.

[Diagram: resources rebalanced across the plugged-in sandboxes/PDBs as sandboxes are added and removed]
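The default directive itself can be adjusted. Continuing the hypothetical plan from the previous sketch, the block below changes the default shares and adds a maximum utilisation limit for any newly plugged-in PDB that has no explicit directive of its own:

BEGIN
  dbms_resource_manager.create_pending_area();

  -- New PDBs without their own directive pick up these default settings.
  dbms_resource_manager.update_cdb_default_directive(
    plan                  => 'sandbox_cdb_plan',
    new_shares            => 1,
    new_utilization_limit => 50);

  dbms_resource_manager.validate_pending_area();
  dbms_resource_manager.submit_pending_area();
END;
/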

Summary

In this latest post on sandboxing I have examined the “Observe” phase of our BOX’D sandbox lifecycle. With the new multitenant-aware Resource Manager we can configure policies to control how system resources are shared across sandboxes. Using Resource Manager it is possible to configure a policy so that the first tenant on a large, powerful server gets a realistic share of the resources that will eventually be shared as other tenants are plugged in.

In the next post I will explore the next phase of our sandbox lifecycle, X-charge, which will cover the metering and chargeback services for pluggable sandboxes. 
