Big Data SQL Quick Start. Storage Indexes – Part10.

Today I’m going to explain very powerful Big Data SQL feature – Storage Indexes. Before all, I want to note, that name “Storage Index” could mislead you. In fact, it’s dynamic structure that automatically built over your data after you scan it. There is no any specific command or something that user have run. There is nothing that you have to maintain (like Btree index), rebuild. You just run your workload and after a while, you may note better performance. 

Storage Indexes is not something completely new for Big Data SQL. Oracle Exadata also has this feature and with Big Data SQL we just re-use it.

How it works. 

The  main idea is that we could create some metadata over the unit of the scan (block or multiple blocks ). For example, we scan HDFS blocks with one given query, which has some predicate in where clause (like WHERE id=123). If this block doesn’t return any rows we build statistics for this column, like the minimum, the maximum for given block.

3 Big Data SQL Quick Start. Storage Indexes   Part10.

Next scan could use these statistics for skipping the scan.

4 Big Data SQL Quick Start. Storage Indexes   Part10.

It’s very powerful feature for unique columns.

Fine-granule unit for Storage Index in the case of Hadoop is HDFS block. As you may know, HDFS block has pretty big size (in Big Data Appliance default is 256MB) and if you will be able to skip full scan of it will bring to you significant performance benefits.

Query initially scan granule and if this scan doesn’t return any row, storage index is built (if you find at least one row in the block, Storage Index will not be created over this concrete block).

In HDFS data usually stored in 3 copies. For maximize performance and get benefits from Storage Index as quick as possible, Big Data SQL (since 3.1 version) uses the deterministic order of the hosts.

If you scan table once and create Storage Indexes over the first replica, the second scan will be performed over the same copy and will use Storage Index right with the second scan. 

In conclusion, I want to show you couple bad and good examples for Storage Index. I have table, with one pretty unique column:

SQL> SELECT num_distinct FROM user_tab_col_statistics

WHERE table_name = ‘STORE_SALES_CSV’

AND COLUMN_NAME = ‘SS_TICKET_NUMBER';

———— 

849805312

SQL> 

the table is quite big:

SQL> select count(1) from STORE_SALES_CSV

——————-

6 385 178 703 

SQL>

which means that in average each value appears in given dataset in average 7-8 times, which is quite selective (this is 900.1 GB dataset).

For show Storage Indexes in action, I run the query that uses predicate which returns 2 rows.

SQL> select count(1) from STORE_SALES_CSV where SS_TICKET_NUMBER=187378862;

the first scan consumes a lot of IO and CPU and finishes in 10.6 minutes. The second and next scans finished extremely fast in 3 seconds (because of Storage Index I definitely know that there is no data that matches with my predicate in my table). For checking number of Storage Index efficiency I query the session statistic view:

SQL> SELECT n.name, CASE NAME

          WHEN ‘cell XT granule predicate offload retries’ THEN

   VALUE

  WHEN ‘cell XT granules requested for predicate offload’ THEN

           VALUE

          ELSE

           round(VALUE / 1024 / 1024 / 1024,2)

       END Val,

       CASE NAME

          WHEN ‘cell XT granule predicate offload retries’ THEN

           ‘Granules’

          WHEN ‘cell XT granules requested for predicate offload’ THEN

           ‘Granules’

          ELSE

           ‘GBytes’

       END Metric

  FROM v$ mystat   s,

       v$ statname n

 WHERE s.statistic# = n.statistic#

   AND n.name IN (‘cell XT granule IO bytes saved by storage index’,

                  ‘cell XT granule bytes requested for predicate offload’)

 ORDER BY Metric;

————————————————————————————- 

cell XT granule IO bytes saved by storage index         899.86  GBytes

cell XT granule bytes requested for predicate offload   900.11 GBytes

Based on this statistics we could conclude that only (cell XT granule bytes requested for predicate offload – cell XT granule IO bytes saved by storage index) = 256 Mbytes were read, which is one HDFS block.

First scan.

I don’t  recommend you to disable Storage Index in your real production environment, but it has one side effect. First scan in the case if Storage Indexes are enabled takes longer than without Storage Indexes. In my previous example, first scan took 10.6 minutes, but all next were finished in seconds:

1 Big Data SQL Quick Start. Storage Indexes   Part10.

If I disable Storage Index first, second and next scan will take the same time – about 5.1 minutes:

2 Big Data SQL Quick Start. Storage Indexes   Part10.

I could summarise all above in the table:

 Elapsed time with Storage Indexes  Elapsed time without Storage Indexes
 First scan of the table 10.3 minutes 5.1 minutes
 Second and next scan of the table 3 seconds 5.1 minutes

Query by the unselective predicate.

Taking into account that in our previous example, Storage Index brought performance degradation for the first query it’s interesting to check the behaviour of the query which uses unselective predicates.

The same table has column SS_QUANTITY which has only 100 unique values:

SQL> SELECT num_distinct FROM user_tab_col_statistics

WHERE table_name = ‘STORE_SALES_CSV’

AND COLUMN_NAME = ‘SS_QUANTITY';

———— 

100

SQL> 

that are between 0 and 100:

SQL> SELECT min(SS_QUANTITY, max(SS_QUANTITY) FROM STORE_SALES_CSV

—-   —— 

0       100

SQL>  

With enabled Storage Indexex I ran 3 times query like:

SQL> select count(1) from STORE_SALES_CSV where SS_QUANTITY=82;

and all 3 times it was done in 5.3 minutes.

But when I tried to query with unexisting predicates (like negative):

SQL> select count(1) from STORE_SALES_CSV where SS_QUANTITY-82;

Behaviour was like in the previous example. First scan tool 10.5 minutes, second and next 3 seconds.

In the end of the test I disable Storage Index and  run query multiple times. In the end, I got 5.3 minutes.

Let me conclude results into the table:

 Elapsed time with Storage Index  Elapsed time without Storage Index
 The first run, return many rows  5.3 minutes  5.3 minutes
 Second and next run, return many rows  5.3 minutes  5.3 minutes
 The first run, return 0 rows  10.5 minutes  5.3 minutes
 Second and next run, return 0 rows  3 seconds  5.3 minutes

Based on this experiments we may see:

1) Storage Index built only when block doesn’t return any row

2) If block return at least one row SI will not be built

3) It means, that first scan will not have performance degradation, but the second scan will not have Index for accelerate performance

Sort by

In the last example, we saw that query which uses unselective predicate is the bad candidate for Storage Indexes unless you sort out your source data.

What does it mean? I  create a new dataset from original one, like CTAS with hive sort by statement:

hive> create table csv.store_sales_quantity_sort stored as textfile as select * from csv.store_sales sort by SS_QUANTITY;

After this I repeat my query again two times:

SQL> select count(1) from STORE_SALES_CSV where SS_QUANTITY= -82;

5 Big Data SQL Quick Start. Storage Indexes   Part10.

as you can see the second one is way faster. And now Storage Indexes take place. I can prove this with SI statistics:

 cell XT granule IO bytes saved by storage index         601.72 GBytes

 cell XT granule bytes requested for predicate offload   876.47 GBytes

Please feel free to ask any questions in the comments!

Let’s block ads! (Why?)

The Data Warehouse Insider