Tag Archives: indexes

Big Data SQL Quick Start. Storage Indexes – Part10.

Today I’m going to explain very powerful Big Data SQL feature – Storage Indexes. Before all, I want to note, that name “Storage Index” could mislead you. In fact, it’s dynamic structure that automatically built over your data after you scan it. There is no any specific command or something that user have run. There is nothing that you have to maintain (like Btree index), rebuild. You just run your workload and after a while, you may note better performance. 

Storage Indexes is not something completely new for Big Data SQL. Oracle Exadata also has this feature and with Big Data SQL we just re-use it.

How it works. 

The  main idea is that we could create some metadata over the unit of the scan (block or multiple blocks ). For example, we scan HDFS blocks with one given query, which has some predicate in where clause (like WHERE id=123). If this block doesn’t return any rows we build statistics for this column, like the minimum, the maximum for given block.

3 Big Data SQL Quick Start. Storage Indexes   Part10.

Next scan could use these statistics for skipping the scan.

4 Big Data SQL Quick Start. Storage Indexes   Part10.

It’s very powerful feature for unique columns.

Fine-granule unit for Storage Index in the case of Hadoop is HDFS block. As you may know, HDFS block has pretty big size (in Big Data Appliance default is 256MB) and if you will be able to skip full scan of it will bring to you significant performance benefits.

Query initially scan granule and if this scan doesn’t return any row, storage index is built (if you find at least one row in the block, Storage Index will not be created over this concrete block).

In HDFS data usually stored in 3 copies. For maximize performance and get benefits from Storage Index as quick as possible, Big Data SQL (since 3.1 version) uses the deterministic order of the hosts.

If you scan table once and create Storage Indexes over the first replica, the second scan will be performed over the same copy and will use Storage Index right with the second scan. 

In conclusion, I want to show you couple bad and good examples for Storage Index. I have table, with one pretty unique column:

SQL> SELECT num_distinct FROM user_tab_col_statistics

WHERE table_name = ‘STORE_SALES_CSV’

AND COLUMN_NAME = ‘SS_TICKET_NUMBER';

———— 

849805312

SQL> 

the table is quite big:

SQL> select count(1) from STORE_SALES_CSV

——————-

6 385 178 703 

SQL>

which means that in average each value appears in given dataset in average 7-8 times, which is quite selective (this is 900.1 GB dataset).

For show Storage Indexes in action, I run the query that uses predicate which returns 2 rows.

SQL> select count(1) from STORE_SALES_CSV where SS_TICKET_NUMBER=187378862;

the first scan consumes a lot of IO and CPU and finishes in 10.6 minutes. The second and next scans finished extremely fast in 3 seconds (because of Storage Index I definitely know that there is no data that matches with my predicate in my table). For checking number of Storage Index efficiency I query the session statistic view:

SQL> SELECT n.name, CASE NAME

          WHEN ‘cell XT granule predicate offload retries’ THEN

   VALUE

  WHEN ‘cell XT granules requested for predicate offload’ THEN

           VALUE

          ELSE

           round(VALUE / 1024 / 1024 / 1024,2)

       END Val,

       CASE NAME

          WHEN ‘cell XT granule predicate offload retries’ THEN

           ‘Granules’

          WHEN ‘cell XT granules requested for predicate offload’ THEN

           ‘Granules’

          ELSE

           ‘GBytes’

       END Metric

  FROM v$ mystat   s,

       v$ statname n

 WHERE s.statistic# = n.statistic#

   AND n.name IN (‘cell XT granule IO bytes saved by storage index’,

                  ‘cell XT granule bytes requested for predicate offload’)

 ORDER BY Metric;

————————————————————————————- 

cell XT granule IO bytes saved by storage index         899.86  GBytes

cell XT granule bytes requested for predicate offload   900.11 GBytes

Based on this statistics we could conclude that only (cell XT granule bytes requested for predicate offload – cell XT granule IO bytes saved by storage index) = 256 Mbytes were read, which is one HDFS block.

First scan.

I don’t  recommend you to disable Storage Index in your real production environment, but it has one side effect. First scan in the case if Storage Indexes are enabled takes longer than without Storage Indexes. In my previous example, first scan took 10.6 minutes, but all next were finished in seconds:

1 Big Data SQL Quick Start. Storage Indexes   Part10.

If I disable Storage Index first, second and next scan will take the same time – about 5.1 minutes:

2 Big Data SQL Quick Start. Storage Indexes   Part10.

I could summarise all above in the table:

 Elapsed time with Storage Indexes  Elapsed time without Storage Indexes
 First scan of the table 10.3 minutes 5.1 minutes
 Second and next scan of the table 3 seconds 5.1 minutes

Query by the unselective predicate.

Taking into account that in our previous example, Storage Index brought performance degradation for the first query it’s interesting to check the behaviour of the query which uses unselective predicates.

The same table has column SS_QUANTITY which has only 100 unique values:

SQL> SELECT num_distinct FROM user_tab_col_statistics

WHERE table_name = ‘STORE_SALES_CSV’

AND COLUMN_NAME = ‘SS_QUANTITY';

———— 

100

SQL> 

that are between 0 and 100:

SQL> SELECT min(SS_QUANTITY, max(SS_QUANTITY) FROM STORE_SALES_CSV

—-   —— 

0       100

SQL>  

With enabled Storage Indexex I ran 3 times query like:

SQL> select count(1) from STORE_SALES_CSV where SS_QUANTITY=82;

and all 3 times it was done in 5.3 minutes.

But when I tried to query with unexisting predicates (like negative):

SQL> select count(1) from STORE_SALES_CSV where SS_QUANTITY-82;

Behaviour was like in the previous example. First scan tool 10.5 minutes, second and next 3 seconds.

In the end of the test I disable Storage Index and  run query multiple times. In the end, I got 5.3 minutes.

Let me conclude results into the table:

 Elapsed time with Storage Index  Elapsed time without Storage Index
 The first run, return many rows  5.3 minutes  5.3 minutes
 Second and next run, return many rows  5.3 minutes  5.3 minutes
 The first run, return 0 rows  10.5 minutes  5.3 minutes
 Second and next run, return 0 rows  3 seconds  5.3 minutes

Based on this experiments we may see:

1) Storage Index built only when block doesn’t return any row

2) If block return at least one row SI will not be built

3) It means, that first scan will not have performance degradation, but the second scan will not have Index for accelerate performance

Sort by

In the last example, we saw that query which uses unselective predicate is the bad candidate for Storage Indexes unless you sort out your source data.

What does it mean? I  create a new dataset from original one, like CTAS with hive sort by statement:

hive> create table csv.store_sales_quantity_sort stored as textfile as select * from csv.store_sales sort by SS_QUANTITY;

After this I repeat my query again two times:

SQL> select count(1) from STORE_SALES_CSV where SS_QUANTITY= -82;

5 Big Data SQL Quick Start. Storage Indexes   Part10.

as you can see the second one is way faster. And now Storage Indexes take place. I can prove this with SI statistics:

 cell XT granule IO bytes saved by storage index         601.72 GBytes

 cell XT granule bytes requested for predicate offload   876.47 GBytes

Please feel free to ask any questions in the comments!

Let’s block ads! (Why?)

The Data Warehouse Insider

Dynamics CRM indexes (missing, redundant, and OOB)

It is a fact that Dynamics CRM is an OLTP application, and that it’s performance relies heavily on how well the back end database (SQL server) is performing.  A major component in the performance matrix is indexes.  They have to be well defragmented, you should have the needed indexes to optimize query performance (missing indexes), and you should not have idle indexes that have more writes than reads (redundant indexes).

Adding missing indexes

The Missing Indexes report/query has very favourable effect on performance although it has limitations. (Limitations of the Missing Indexes Feature) But it recommends very effective indexes based on the actual usage and queries that SQL server receives from your end users,

Removing redundant indexes (editing / removing the OOB indexes is not supported!)

Be careful not to create too many indexes, which can affect the insert and update performance. Balance indexing needs according to business requirements, indexes that have a lot of writes and no reads (or very few) are not efficient and are considered a burden to the system (redundant indexes).  So it is a good idea when working with missing indexes to give it some time (depending on your operations cycle maybe a month) and off set them against a redundant indexes query.  However, please remember removing or editing OOB indexes is NOT supported.

Ref: Optimizing and Maintaining Database Indexes

OOB indexes

The famous question is, how do I know the OOB indexes from newly added indexes, The MS content team has recently added new content to shed more light on OOB indexes, but I’d say your best guard is to fully document your indexes and make sure you have proper naming conventions.

The indexes that are created in a Microsoft Dynamics CRM organization database are designed to provide fast retrieval of commonly requested data from tables and views stored on a SQL Server. Here are a few characteristics for the indexes that are created in a Microsoft Dynamics CRM organization database.

  • Depending on the version and update applied, a Microsoft Dynamics CRM organization database without any customizations or installed solutions (out-of-box database) has between 1,000 and 1,600 total indexes.
  • Later versions of Dynamics CRM have more features and, subsequently, more database objects such as tables and indexes.
  • At least five new indexes are created whenever you create a new entity or reference a new column in a quick find.
  • Installing a solution increases the number of total indexes.

How to get a list of all indexes stored in an organization database?

To get a list of all indexes, run the following sample SQL query against the organization database.

SELECT s.name +‘.’+t.name AS ‘table_name’,i.name,i.index_id

FROM sys.schemas s JOIN sys.tables t ON s.schema_id=t.schema_id

JOIN sys.indexes i ON t.object_id=i.object_id LEFT OUTER JOIN sys.objects o

ON o.parent_object_id=t.object_id AND i.name=o.name

WHERE i.name is not null

For an approximation of the out-of-box indexes in an organization database that corresponds to a specific version of Microsoft Dynamics CRM, select from the links below to an Excel worksheet that contains a list of indexes.

For more information about SQL Server indexes, see SQL Server Index Design Guide, Microsoft Dynamics CRM Server installed configuration components

This entry passed through the Full-Text RSS service – if this is your content and you’re reading it on someone else’s site, please read the FAQ at fivefilters.org/content-only/faq.php#publishers.

Dynamics CRM in the Field