Big Data SQL Quick Start. Semi-structured data. Part 12.

In my previous blog post, I talked about the advantages and disadvantages of Schema on Read and Schema on Write. The conclusion was that HDFS can be quite suitable for keeping data in its original format. Very often customers have data in a semi-structured format like XML or JSON. In this post, I will show how to work with it.

Use case for storing semi-structured data.

One of the most common use cases for storing semi-structured data in HDFS is the desire to keep all of the original data while moving only part of it into the relational database. This may be because part of the data is needed on a daily basis, while other parts are accessed very rarely (but may still be required for some deep analytics). For example, suppose we have XML in the following format:

<XML>
  <NAME> … </NAME>
  <AGE> … </AGE>
  <INCOME> … </INCOME>
  <COLOR_OF_EYES> … </COLOR_OF_EYES>
  <PLACE_OF_BIRTH> … </PLACE_OF_BIRTH>
</XML>

and on a daily basis we need only the person's name and age in the relational database, like this:

Name     Age
-------  -------
....     ....

Other fields will be accessed very rarely. In this case, HDFS seems like a good place to store the data in its original format, and Big Data SQL seems like a good tool for accessing it from the database. Let me show you a couple of examples of how you can do this.

Big Data SQL and XML.

To start querying XML data with Big Data SQL, you first have to define Hive metadata over it (using a suitable SerDe or StorageHandler). After this, you define an external table in the Oracle Database that links to the Hive table, and you are ready to run your queries. The overall flow is: XML files in HDFS -> Hive external table -> Oracle external table -> SQL queries.


Now, let me give you an example of the data together with the DDLs.

As example data I took some machine-generated data (it could be from a smart meter or similar):

------------------------------------------------------------------------------------------

<row><CUSTOMER_KEY>8170837</CUSTOMER_KEY><End_Datetime>4/04/2013 12:29</End_Datetime><General_Supply_KWH>0.197</General_Supply_KWH><Off_Peak_KWH>0</Off_Peak_KWH><Gross_Generation_KWH>0</Gross_Generation_KWH><Net_Generation_KWH>0</Net_Generation_KWH></row>

<row><CUSTOMER_KEY>8170837</CUSTOMER_KEY><End_Datetime>4/04/2013 12:59</End_Datetime><General_Supply_KWH>0.296</General_Supply_KWH><Off_Peak_KWH>0</Off_Peak_KWH><Gross_Generation_KWH>0</Gross_Generation_KWH><Net_Generation_KWH>0</Net_Generation_KWH></row>

<row><CUSTOMER_KEY>8170837</CUSTOMER_KEY><End_Datetime>4/04/2013 13:29</End_Datetime><General_Supply_KWH>0.24</General_Supply_KWH><Off_Peak_KWH>0</Off_Peak_KWH><Gross_Generation_KWH>0</Gross_Generation_KWH><Net_Generation_KWH>0</Net_Generation_KWH></row>

<row><CUSTOMER_KEY>8170837</CUSTOMER_KEY><End_Datetime>4/04/2013 13:59</End_Datetime><General_Supply_KWH>0.253</General_Supply_KWH><Off_Peak_KWH>0</Off_Peak_KWH><Gross_Generation_KWH>0</Gross_Generation_KWH><Net_Generation_KWH>0</Net_Generation_KWH></row>

<row><CUSTOMER_KEY>8170837</CUSTOMER_KEY><End_Datetime>4/04/2013 14:29</End_Datetime><General_Supply_KWH>0.24</General_Supply_KWH><Off_Peak_KWH>0</Off_Peak_KWH><Gross_Generation_KWH>0</Gross_Generation_KWH><Net_Generation_KWH>0</Net_Generation_KWH></row> 

I put this data on HDFS:

------------------------------------------------------------------------------------------

[Linux] $ hadoop fs -put source.xml hdfs://cluster-ns/user/hive/warehouse/xmldata/

------------------------------------------------------------------------------------------

As the second step, I have to define the Hive metadata:

------------------------------------------------------------------------------------------

hive> CREATE EXTERNAL TABLE meter_counts(
  customer_key string,
  end_datetime string,
  general_supply_kwh float,
  off_peak_kwh int,
  gross_generation_kwh int,
  net_generation_kwh int)
ROW FORMAT SERDE
  'oracle.hadoop.xquery.hive.OXMLSerDe'
STORED AS INPUTFORMAT
  'oracle.hadoop.xquery.hive.OXMLInputFormat'
OUTPUTFORMAT
  'oracle.hadoop.xquery.hive.OXMLOutputFormat'
LOCATION
  'hdfs://cluster-ns/user/hive/warehouse/xmldata/'
TBLPROPERTIES (
  'oxh-column.CUSTOMER_KEY'='./CUSTOMER_KEY',
  'oxh-column.End_Datetime'='./End_Datetime',
  'oxh-column.General_Supply_KWH'='./General_Supply_KWH',
  'oxh-column.Gross_Generation_KWH'='./Gross_Generation_KWH',
  'oxh-column.Net_Generation_KWH'='./Net_Generation_KWH',
  'oxh-column.Off_Peak_KWH'='./Off_Peak_KWH',
  'oxh-elements'='row');

------------------------------------------------------------------------------------------

As the next step, we have to define an external table in the Oracle Database, which links to the Hive table:

------------------------------------------------------------------------------------------

SQL> CREATE TABLE OXH_EXAMPLE 

   ( CUSTOMER_KEY VARCHAR2(4000), 

      END_DATETIME VARCHAR2(4000), 

      GENERAL_SUPPLY_KWH BINARY_FLOAT, 

      OFF_PEAK_KWH NUMBER, 

      GROSS_GENERATION_KWH NUMBER, 

      NET_GENERATION_KWH NUMBER

   ) 

   ORGANIZATION EXTERNAL 

    ( TYPE ORACLE_HIVE

      DEFAULT DIRECTORY DEFAULT_DIR

      ACCESS PARAMETERS

      (com.oracle.bigdata.tablename=default.meter_counts)    

    )

   REJECT LIMIT UNLIMITED 

  PARALLEL ; 

------------------------------------------------------------------------------------------

Here we are. Now we are ready to query the XML data from the Oracle Database:

------------------------------------------------------------------------------------------

SQL> SELECT * FROM oxh_example WHERE ROWNUM <= 3;

…… 

8170837 4/04/2013 12:29 0.196999997 0 0 0

8170837 4/04/2013 12:59 0.296000004 0 0 0

8170837 4/04/2013 13:29 0.239999995 0 0 0

------------------------------------------------------------------------------------------

Great, we have exposed the XML data as a structured table in the database.
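Coming back to the use case from the beginning of the post: this is also the point where you could move just the frequently needed columns into a regular relational table and keep everything else in HDFS. Here is a minimal sketch of such a daily load (the target table meter_daily and the chosen columns are only an illustration, not part of the original example):

------------------------------------------------------------------------------------------

SQL> CREATE TABLE meter_daily AS          -- hypothetical target table inside the database
     SELECT customer_key, end_datetime    -- only the columns needed on a daily basis
       FROM oxh_example;                  -- everything else stays in HDFS

------------------------------------------------------------------------------------------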


Another great thing about Big Data SQL is that parsing and part of the processing are pushed down to the Hadoop side.

For example, if we run a query like:

------------------------------------------------------------------------------------------

SQL> SELECT COUNT(1) FROM oxh_example WHERE customer_key='8170837';

------------------------------------------------------------------------------------------

it will be pushed down to the Hadoop nodes and will not utilize the database. In OEM we can see only the "cell external table smart scan" wait event.

In Cloudera Manager we can see that the three Hadoop nodes are utilized while, at the same point in time, the database node is idle.


The session statistics also show that only 8 KB out of more than 100 GB were returned back to the database (everything else was filtered out on the cell side):

------------------------------------------------------------------------------------------

SQL> SELECT n.name, value
       FROM v$mystat s, v$statname n
      WHERE s.statistic# = n.statistic#
        AND n.name LIKE '%XT%';
…
cell interconnect bytes returned by XT smart scan        8192 Bytes
cell XT granule bytes requested for predicate offload    115035953517 Bytes

------------------------------------------------------------------------------------------

Big Data SQL and JSON.

All right, now we know how to work with XML data in Big Data SQL. But there is another popular semi-structured data format: JSON. Here the Oracle Database has prepared a pleasant surprise. Since version 12c, we have a very convenient and flexible API for working with JSON inside the database as well as outside of it (through external tables). Let me show this.
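For instance, the standard 12c functions such as JSON_VALUE and JSON_EXISTS work over any column that stores JSON text. A minimal sketch (the table docs and its column doc are hypothetical; the field names are taken from the sample data below):

------------------------------------------------------------------------------------------

SQL> SELECT JSON_VALUE(doc, '$.wr_return_amt')       -- extract a scalar field from the JSON text
       FROM docs                                      -- hypothetical table with a JSON column
      WHERE JSON_EXISTS(doc, '$.wr_order_number');    -- filter on the presence of a field

------------------------------------------------------------------------------------------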

Example of the input data:

{wr_returned_date_sk:37890,wr_returned_time_sk:8001,wr_item_sk:107856,wr_refunded_customer_sk:5528377,wr_refunded_cdemo_sk:172813,wr_refunded_hdemo_sk:3391,wr_refunded_addr_sk:2919542,wr_returning_customer_sk:5528377,wr_returning_cdemo_sk:172813,wr_returning_hdemo_sk:3391,wr_returning_addr_sk:2919542,wr_web_page_sk:1165,wr_reason_sk:489,wr_order_number:338223251,wr_return_quantity:4,wr_return_amt:157.88,wr_return_tax:11.05,wr_return_amt_inc_tax:168.93,wr_fee:11.67,wr_return_ship_cost:335.88,wr_refunded_cash:63.15,wr_reversed_charge:87.15,wr_account_credit:7.58,wr_net_loss:357.98}

{wr_returned_date_sk:37650,wr_returned_time_sk:63404,wr_item_sk:1229906,wr_refunded_customer_sk:5528377,wr_refunded_cdemo_sk:172813,wr_refunded_hdemo_sk:3391,wr_refunded_addr_sk:2919542,wr_returning_customer_sk:5528377,wr_returning_cdemo_sk:172813,wr_returning_hdemo_sk:3391,wr_returning_addr_sk:2919542,wr_web_page_sk:1052,wr_reason_sk:118,wr_order_number:338223251,wr_return_quantity:19,wr_return_amt:3804.37,wr_return_tax:266.31,wr_return_amt_inc_tax:4070.68,wr_fee:47.27,wr_return_ship_cost:3921.98,wr_refunded_cash:1521.75,wr_reversed_charge:2100.01,wr_account_credit:182.61,wr_net_loss:4454.6}

Put it into HDFS from the Linux filesystem and create a Hive table with a single string column over it. After this, create an Oracle external table, also with a single column:

------------------------------------------------------------------------------------------

[Linux] $  hadoop fs -put source.json hdfs://cluster-ns/user/hive/warehouse/jsondata/

------------------------------------------------------------------------------------------

hive> CREATE TABLE json_string(
  json_str string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://cluster-ns/user/hive/warehouse/jsondata/';

------------------------------------------------------------------------------------------

SQL> CREATE TABLE WEB_RETURNS_JSON_STRING

   ( JSON_STR VARCHAR2(4000)

   ) 

   ORGANIZATION EXTERNAL 

    ( TYPE ORACLE_HIVE

      DEFAULT DIRECTORY "DEFAULT_DIR"

      ACCESS PARAMETERS

      (com.oracle.bigdata.tablename=json.json_string)

    )

   REJECT LIMIT UNLIMITED 

  PARALLEL ;

------------------------------------------------------------------------------------------

OK, at first glance this seems senseless. Why do we need a table with a single column?


But in Oracle 12c you have very rich JSON capabilities, which are automatically available to Big Data SQL (you do remember that Big Data SQL is Oracle SQL, don't you?). If you are not familiar with them, I advise you to check out this blog post (thank you, Gerald).

It's extremely easy to parse your JSON with Oracle SQL: just put a dot after the column name and write the name of the field.

------------------------------------------------------------------------------------------

SELECT j.json_str.wr_returned_date_sk,

       j.json_str.wr_returned_time_sk

  FROM web_returns_json_string j

 WHERE j.json_str.wr_returned_time_sk = 8645

   AND ROWNUM <= 5;

… 

38195 8645

38301 8645

37820 8645

38985 8645

37976 8645

------------------------------------------------------------------------------------------

If we check the session statistics, we can see that we filter out a lot of data on the Hadoop side:

------------------------------------------------------------------------------------------

SQL> SELECT n.name, value
       FROM v$mystat s, v$statname n
      WHERE s.statistic# = n.statistic#
        AND n.name LIKE '%XT%';

… 

cell interconnect bytes returned by XT smart scan      507904 Bytes

cell XT granule bytes requested for predicate offload  16922334453 Bytes

------------------------------------------------------------------------------------------

Note: parsing and filtering happen on the Hadoop side!

Big Data SQL and JSON. Restrictions and workaround.

So, everything works well as long as your JSON strings are not longer than 4000 characters and you are able to define the table with a VARCHAR2(4000) column. But what am I supposed to do if I have JSON strings longer than 4000 characters? I can define the column as a CLOB, but (!) in that case all parsing and filtering will happen on the database side.

Test Case:

1) Using the table definition from the previous example (VARCHAR2), I ran the query:

------------------------------------------------------------------------------------------

SQL> SELECT COUNT(1)

  FROM web_returns_json_string j

 WHERE j.json_str.wr_returned_time_sk = 8645;

------------------------------------------------------------------------------------------

It finished in 75 seconds. OEM shows that most of the wait events were "cell external table smart scan", which means that we offload the scans to the storage (Hadoop) side.


2) After this I defined an external table over the same Hive table, but declared the column as a CLOB:

------------------------------------------------------------------------------------------

SQL> CREATE TABLE WEB_RETURNS_JSON_STRING

   ( JSON_STR CLOB

   ) 

   ORGANIZATION EXTERNAL 

    ( TYPE ORACLE_HIVE

      DEFAULT DIRECTORY "DEFAULT_DIR"

      ACCESS PARAMETERS

      (com.oracle.bigdata.tablename=json.json_string)

    )

   REJECT LIMIT UNLIMITED 

  PARALLEL ; 

------------------------------------------------------------------------------------------

And ran the same query:

------------------------------------------------------------------------------------------

SQL> SELECT COUNT(1)

  FROM web_returns_json_string j

 WHERE j.json_str.wr_returned_time_sk = 8645;

------------------------------------------------------------------------------------------

It finished in 90 minutes (!!!), roughly 70 times slower. OEM shows that most of the time was spent on the database CPUs, which means that we could not offload the processing because it is a CLOB column.


Cloudera Manager also shows the difference between the two queries: the first one utilizes the cell side (three Hadoop nodes), while the second one utilizes only the database.


Well, now we understand the problem (low performance when the JSON documents are longer than 4000 characters), but how do we work around it?

It's easy: as in the XML example, define the structure in the Hive metastore and map the Hive table to an Oracle table column by column.

------------------------------------------------------------------------------------------

hive> CREATE EXTERNAL TABLE j1_openx(
  wr_returned_date_sk bigint,
  wr_returned_time_sk bigint,
  wr_item_sk bigint,
  wr_refunded_customer_sk bigint,
  wr_refunded_cdemo_sk bigint,
  wr_refunded_hdemo_sk bigint,
  wr_refunded_addr_sk bigint,
  wr_returning_customer_sk bigint,
  wr_returning_cdemo_sk bigint,
  wr_returning_hdemo_sk bigint,
  wr_returning_addr_sk bigint,
  wr_web_page_sk bigint,
  wr_reason_sk bigint,
  wr_order_number bigint,
  wr_return_quantity int,
  wr_return_amt double,
  wr_return_tax double,
  wr_return_amt_inc_tax double,
  wr_fee double,
  wr_return_ship_cost double,
  wr_refunded_cash double,
  wr_reversed_charge double,
  wr_account_credit double,
  wr_net_loss double)
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://cluster-ns/user/hive/warehouse/jsondata/';

------------------------------------------------------------------------------------------

There are plenty of different SerDes for JSON, but from a performance perspective I personally would recommend org.openx.data.jsonserde.JsonSerDe.
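Keep in mind that the OpenX SerDe is a third-party library and is not shipped with Hive, so its jar has to be registered before the DDL above will work; how the jar is distributed across the cluster and made visible to Big Data SQL depends on your installation. A minimal sketch for a single Hive session (the jar path is a placeholder):

------------------------------------------------------------------------------------------

hive> ADD JAR /path/to/json-serde-with-dependencies.jar;   -- placeholder path to the OpenX JsonSerDe jar

------------------------------------------------------------------------------------------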

After this we just need to define the Oracle external table over it:

------------------------------------------------------------------------------------------

SQL> CREATE TABLE WEB_RETURNS_JSON_SD_OPENX 

   ( WR_RETURNED_DATE_SK NUMBER(10,0), 

WR_RETURNED_TIME_SK NUMBER(10,0), 

WR_ITEM_SK NUMBER(10,0), 

WR_REFUNDED_CUSTOMER_SK NUMBER(10,0), 

WR_REFUNDED_CDEMO_SK NUMBER(10,0), 

WR_REFUNDED_HDEMO_SK NUMBER(10,0), 

WR_REFUNDED_ADDR_SK NUMBER(10,0), 

WR_RETURNING_CUSTOMER_SK NUMBER(10,0), 

WR_RETURNING_CDEMO_SK NUMBER(10,0), 

WR_RETURNING_HDEMO_SK NUMBER(10,0), 

WR_RETURNING_ADDR_SK NUMBER(10,0), 

WR_WEB_PAGE_SK NUMBER(10,0), 

WR_REASON_SK NUMBER(10,0), 

WR_ORDER_NUMBER NUMBER(10,0), 

WR_RETURN_QUANTITY NUMBER(10,0), 

WR_RETURN_AMT NUMBER, 

WR_RETURN_TAX NUMBER, 

WR_RETURN_AMT_INC_TAX NUMBER, 

WR_FEE NUMBER, 

WR_RETURN_SHIP_COST NUMBER, 

WR_REFUNDED_CASH NUMBER, 

WR_REVERSED_CHARGE NUMBER, 

WR_ACCOUNT_CREDIT NUMBER, 

WR_NET_LOSS NUMBER

   ) 

   ORGANIZATION EXTERNAL 

    ( TYPE ORACLE_HIVE

      DEFAULT DIRECTORY DEFAULT_DIR

      ACCESS PARAMETERS

      ( com.oracle.bigdata.cluster:bds30

        com.oracle.bigdata.tablename:json.j1_openx))

   REJECT LIMIT UNLIMITED 

  PARALLEL ;

------------------------------------------------------------------------------------------

SQL> SELECT COUNT(1) FROM WEB_RETURNS_JSON_SD_OPENX j WHERE j.wr_returned_time_sk = 8645; 

------------------------------------------------------------------------------------------

Now the query took 141 seconds, with offloading to the Hadoop side. That is about two times slower than the VARCHAR2 variant, but almost 40 times faster than the CLOB variant.
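To summarize the three variants on the same data set:

Variant                                        Elapsed time    Offload to Hadoop
--------------------------------------------   -------------   -----------------
Single VARCHAR2(4000) column, dot notation      75 seconds      yes
Single CLOB column, dot notation                90 minutes      no
OpenX JsonSerDe, column-per-field mapping       141 seconds     yes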

Conclusion

1) HDFS is well suited for storing data in its original format.

2) Big Data SQL offers wide capabilities for working with semi-structured data.

3) For the JSON format it has a convenient API out of the box.
