Big Data SQL Quick Start. Oracle Text Integration – Part 18

Today, we’ll focus on the
integration of Oracle Text
,
the Full-Text Indexing capabilities from the Oracle database with documents
residing on HDFS
.

Oracle Text has been available since years and has
evolved to address today’s needs regarding indexing:

  • 150+
    document formats supported (PDF, DOCX, XLSX, PPTX, XML, JSON…)
  • Dozens of
    languages
  • Files can be stored inside the
    database (SecureFiles), or outside on File System or accessible through URL
  • Advanced search functions:
    Approximate hit, Wild-Card, Fuzzy, Stemming, Boolean operators, Entity
    extraction…
  • Inherits the database capabilities:
    security, high availability, performance…
  • One API: SQL that allows joining
    with other types of data (clients, products…) residing inside or outside of the
    database (i.e. NoSQL and/or HDFS using Big Data SQL)
  • Machine Learning: Classification and
    Segmentation (supervised or not)

As numerous requests have
been made to allow indexing for external documents on HDFS, this is the time to
demonstrate one way to do it.

For the demonstration, I’ll use the Oracle Big Data Lite VM which contains all the required components.

Oracle
Text and WebHDFS Integration

The “magic” will reside in
the right integration of WebHDFS with Oracle Text
URL_DATASTORE
capability. As WebHDFS is already configured on the Big Data Lite VM, the only
point to take care of is the need to provide the right document URL to Oracle
Text.

Let’s start by copying some documents into HDFS:

[oracle@bigdatalite ~]$   cd /usr/share/doc/search-1.0.0+cdh5.8.0+0/examples/test-documents
[oracle@bigdatalite test-documents]$   ls -l *.doc* *.xls* *.pdf *.ppt*
-rw-r--r--. 1 root root   4355 Jul 12  2016 NullHeader.docx
-rw-r--r--. 1 root root  13824 Jul 12  2016 testEXCEL.xls
-rw-r--r--. 1 root root   9453 Jul 12  2016 testEXCEL.xlsx
-rw-r--r--. 1 root root  34824 Jul 12  2016 testPDF.pdf
-rw-r--r--. 1 root root 164352 Jul 12  2016 testPPT_various.ppt
-rw-r--r--. 1 root root  56659 Jul 12  2016 testPPT_various.pptx
-rw-r--r--. 1 root root  35328 Jul 12  2016 testWORD_various.doc
[oracle@bigdatalite test-documents]$   hdfs dfs -mkdir /user/oracle/documents
[oracle@bigdatalite test-documents]$   hdfs dfs -put NullHeader.docx /user/oracle/documents/
[oracle@bigdatalite test-documents]$   hdfs dfs -put testEXCEL.xls /user/oracle/documents/
[oracle@bigdatalite test-documents]$   hdfs dfs -put testEXCEL.xlsx /user/oracle/documents/
[oracle@bigdatalite test-documents]$   hdfs dfs -put testPDF.pdf /user/oracle/documents/
[oracle@bigdatalite test-documents]$   hdfs dfs -put testPPT_various.ppt /user/oracle/documents/
[oracle@bigdatalite test-documents]$   hdfs dfs -put testPPT_various.pptx /user/oracle/documents/
[oracle@bigdatalite test-documents]$   hdfs dfs -put testWORD_various.doc /user/oracle/documents/
[oracle@bigdatalite test-documents]$   hdfs dfs -ls /user/oracle/documents
Found 7 items
-rw-r--r--   1 oracle oracle       4355 2016-12-16 11:44 /user/oracle/documents/NullHeader.docx
-rw-r--r--   1 oracle oracle      13824 2016-12-16 11:44 /user/oracle/documents/testEXCEL.xls
-rw-r--r--   1 oracle oracle       9453 2016-12-16 11:44 /user/oracle/documents/testEXCEL.xlsx
-rw-r--r--   1 oracle oracle      34824 2016-12-16 11:44 /user/oracle/documents/testPDF.pdf
-rw-r--r--   1 oracle oracle     164352 2016-12-16 11:44 /user/oracle/documents/testPPT_various.ppt
-rw-r--r--   1 oracle oracle      56659 2016-12-16 11:44 /user/oracle/documents/testPPT_various.pptx
-rw-r--r--   1 oracle oracle      35328 2016-12-16 11:44 /user/oracle/documents/testWORD_various.doc
[oracle@bigdatalite test-documents]$   

Now, the WebHDFS URL to access the NullHeader.docx document would be:

http://bigdatalite.localdomain:50070/webhdfs/v1/user/oracle/documents/NullHeader.docx?user.name=oracle&op=OPEN

2 doc link Big Data SQL Quick Start. Oracle Text Integration – Part 18

Using the aforementioned URL, WebHDFS provides us
the content of the document. However, you can remark that the combo box says
“from http://bigdatalite.localdomain:50075” which means some redirection is being done behind the curtain! This
point has to be managed properly or nothing will be indexed.

On the Oracle database side, we have to prepare
some preferences to index the documents properly according to the type of
search we want to perform:

The command hereunder needs to be run once by SYS to give the rights to create a Full-Text Index to all users in the ORCL pluggable database:

[oracle@bigdatalite test-documents]$   sql sys/welcome1@//localhost:1521/cdb as sysdba

SQLcl: Release 4.2.0 Production on Fri Feb 10 15:33:53 2017

Copyright (c) 1982, 2017, Oracle.  All rights reserved.


	New version: 4.1.0 available to download

log4j:WARN No appenders could be found for logger (org.apache.http.client.protocol.RequestAddCookies).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options


SQL> alter session set container = ORCL;

Session altered.

SQL> exec ctxsys.ctx_adm.set_parameter('FILE_ACCESS_ROLE', 'PUBLIC');

PL/SQL procedure successfully completed.

SQL> SELECT par_value FROM ctxsys.ctx_parameters WHERE par_name = 'FILE_ACCESS_ROLE';

PAR_VALUE                                                                                                                                                                                            
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
PUBLIC                                                                                                                                                                                               

SQL> 

Now we’ll create the user IDX with the appropriate rights. The CTXAPP role will allow IDX to create the Full-Text Index.

SQL> create user idx identified by idx
  2  default tablespace users
  3  temporary tablespace temp
  4  quota unlimited on users;

User IDX created.

SQL> 
SQL> grant create session, create table, ctxapp, create sequence, create procedure to idx;

Grant succeeded.

SQL> 
SQL> grant execute on UTL_HTTP to idx;

Grant succeeded.

SQL> begin
  2  DBMS_NETWORK_ACL_ADMIN.append_host_ace (
  3      host       => 'bigdatalite.localdomain', 
  4      lower_port => 50000,
  5      upper_port => 50100,
  6      ace        => xs$  ace_type(privilege_list => xs$  name_list('http'),
  7                                principal_name => 'idx',
  8                                principal_type => xs_acl.ptype_db)); 
  9  end;
 10  
 11  /

PL/SQL procedure successfully completed.

SQL> col host format a30
SQL> col acl format a30
SQL> col acl_owner format a30
SQL> SELECT HOST,
  2         LOWER_PORT,
  3         UPPER_PORT,
  4         ACL,
  5         ACLID,
  6         ACL_OWNER
  7  FROM   dba_host_acls
  8  ORDER BY host;

HOST                           LOWER_PORT UPPER_PORT ACL                            ACLID            ACL_OWNER                     
------------------------------ ---------- ---------- ------------------------------ ---------------- ------------------------------
*                                                    NETWORK_ACL_DD7C57F0D3BE0871E0 0000000080002710 SYS                           
                                                     4325AAE80A17A8                                                                

bigdatalite.localdomain             50000      50100 NETWORK_ACL_44A2890856044C91E0 0000000080002760 SYS                           
                                                     530100007F5A1A                                                                

localhost                                            /sys/acls/oracle-sysman-ocm-Re 0000000080002738 SYS                           
                                                     solve-Access.xml                                                              


SQL>

The grants to use the PL/SQL package UTL_HTTP will be useful to manage the URL Follow Redirection that WebHDFS is
doing.

Remark that the DBMS_NETWORK_ACL_ADMIN package is used for security purpose to allow accessing the URL
resources from PL/SQL code. 
Also the view DBA_NETWORK_ACLS can be queried to check for the URL access rights.

Full Text Index Creation

Now it is time to focus on the index part. We need
to connect with user IDX and execute the following steps:

  • Create a table that will store the
    URLs of the documents in HDFS to index
  • Create preferences to create the
    index
  • Create the index
  • Insert
    the proper URLs managing WebHDFS redirections
SQL> connect idx/idx@ORCL
Connected.
SQL> create table hdfs_docs (
  2   id number GENERATED ALWAYS AS IDENTITY not null,
  3   doc_url varchar2(4000) not null
  4  );

Table HDFS_DOCS created.

SQL> begin
  2  -- must be run by SYS: ctxsys.ctx_adm.set_parameter('FILE_ACCESS_ROLE','PUBLIC');
  3  -- reset
  4   -- ctx_ddl.drop_preference('hdfs_url_pref');
  5   -- ctx_ddl.drop_preference('hdfs_stem_fuzzy_pref');
  6  
  7   ctx_ddl.create_preference('hdfs_url_pref','URL_DATASTORE');
  8   
  9   -- ctx_ddl.set_attribute('hdfs_url_pref','HTTP_PROXY','www-proxy.us.example.com');
 10   -- ctx_ddl.set_attribute('hdfs_url_pref','NO_PROXY','us.example.com');
 11   ctx_ddl.set_attribute('hdfs_url_pref','Timeout','300')  ;
 12   
 13   ctx_ddl.create_preference('hdfs_stem_fuzzy_pref', 'BASIC_WORDLIST');
 14   ctx_ddl.set_attribute('hdfs_stem_fuzzy_pref','FUZZY_MATCH', 'ENGLISH');
 15   ctx_ddl.set_attribute('hdfs_stem_fuzzy_pref','FUZZY_SCORE', '70');
 16   ctx_ddl.set_attribute('hdfs_stem_fuzzy_pref','FUZZY_NUMRESULTS', '200');
 17   ctx_ddl.set_attribute('hdfs_stem_fuzzy_pref','SUBSTRING_INDEX', 'TRUE');
 18   ctx_ddl.set_attribute('hdfs_stem_fuzzy_pref','PREFIX_INDEX', 'TRUE');
 19   ctx_ddl.set_attribute('hdfs_stem_fuzzy_pref','PREFIX_MIN_LENGTH', 3);
 20   ctx_ddl.set_attribute('hdfs_stem_fuzzy_pref','PREFIX_MAX_LENGTH', 4);
 21  
 22   ctx_ddl.set_attribute('hdfs_stem_fuzzy_pref','STEMMER','ENGLISH');
 23  
 24  end;
 25  
 26  /

PL/SQL procedure successfully completed.

SQL> SQL> create index hdfs_full_text_index on hdfs_docs( doc_url )
  2   indextype is ctxsys.context
  3   parameters ('datastore hdfs_url_pref Wordlist hdfs_stem_fuzzy_pref filter ctxsys.auto_filter sync (on commit)')
  4   ; 

Index HDFS_FULL_TEXT_INDEX created.

SQL>

We then insert the simple URLs surrounded by a call to a PL/SQL function to store the right URL to let the indexing process work accordingly:

SQL> create or replace function get_real_url( url in varchar2 ) return varchar2 is
  2  l_http_request   UTL_HTTP.req;
  3  l_http_response  UTL_HTTP.resp;
  4  
  5  value VARCHAR2(1024);
  6  
  7  begin
  8  value := url;
  9  l_http_request  := UTL_HTTP.begin_request(url);
 10  UTL_HTTP.SET_HEADER(l_http_request, 'User-Agent', 'Mozilla/4.0');
 11  UTL_HTTP.SET_FOLLOW_REDIRECT(l_http_request, 0);
 12  l_http_response := UTL_HTTP.get_response(l_http_request);
 13  IF(l_http_response.STATUS_CODE in (UTL_HTTP.HTTP_MOVED_PERMANENTLY,UTL_HTTP.HTTP_TEMPORARY_REDIRECT, UTL_HTTP.HTTP_FOUND))
 14  THEN
 15    UTL_HTTP.GET_HEADER_BY_NAME(l_http_response, 'Location', value);
 16  END IF;  
 17  UTL_HTTP.end_response(l_http_response);
 18    
 19  return value;
 20  
 21  end;
 22  
 23  /

Function GET_REAL_URL compiled

SQL> set define off
SQL> insert into hdfs_docs (doc_url) values (get_real_url('http://bigdatalite.localdomain:50070/webhdfs/v1/user/oracle/documents/NullHeader.docx?user.name=oracle&op=OPEN'));

1 row inserted.

SQL> insert into hdfs_docs (doc_url) values (get_real_url('http://bigdatalite.localdomain:50070/webhdfs/v1/user/oracle/documents/testEXCEL.xls?user.name=oracle&op=OPEN'));

1 row inserted.

SQL> insert into hdfs_docs (doc_url) values (get_real_url('http://bigdatalite.localdomain:50070/webhdfs/v1/user/oracle/documents/testEXCEL.xlsx?user.name=oracle&op=OPEN'));

1 row inserted.

SQL> insert into hdfs_docs (doc_url) values (get_real_url('http://bigdatalite.localdomain:50070/webhdfs/v1/user/oracle/documents/testPDF.pdf?user.name=oracle&op=OPEN'));

1 row inserted.

SQL> insert into hdfs_docs (doc_url) values (get_real_url('http://bigdatalite.localdomain:50070/webhdfs/v1/user/oracle/documents/testPPT_various.ppt?user.name=oracle&op=OPEN'));

1 row inserted.

SQL> insert into hdfs_docs (doc_url) values (get_real_url('http://bigdatalite.localdomain:50070/webhdfs/v1/user/oracle/documents/testPPT_various.pptx?user.name=oracle&op=OPEN'));

1 row inserted.

SQL> insert into hdfs_docs (doc_url) values (get_real_url('http://bigdatalite.localdomain:50070/webhdfs/v1/user/oracle/documents/testWORD_various.doc?user.name=oracle&op=OPEN'));

1 row inserted.

SQL> set timing on
SQL> commit;

Commit complete.

Elapsed: 00:00:01.083
SQL> 

Notice it took one second to index these documents.

Search examples

We are now ready to proceed with some full text queries:

Related to the document excerpt 7 find signalled Big Data SQL Quick Start. Oracle Text Integration – Part 18

We have the following output:

SQL> select score(1), doc_url 
  2  from hdfs_docs 
  3  where contains(doc_url, 'signaled', 1) > 0 
  4  order by score(1) desc;

no rows selected

Elapsed: 00:00:00.639

SQL> select score(1), doc_url 
  2  from hdfs_docs 
  3  where contains(doc_url, 'fuzzy(signaled, 70, 200, w)', 1) > 0 
  4  order by score(1) desc;

  SCORE(1) DOC_URL                                                                                                                                                                                              
---------- ---------------------------------------------------------------------------------------------------------------------------------------
        47 http://bigdatalite.localdomain:50075/webhdfs/v1/user/oracle/documents/NullHeader.docx?op=OPEN&user.name=oracle&namenoderpcaddress=bigdatalite.localdomain:8020&offset=0                              


Elapsed: 00:00:00.619
SQL> 

Only the second query using Fuzzy function is able to detect approximate syntax. Did you saw the word signalled was misspelled in the query?

We can even search using the Soundex function which performs a phonetic search:

For instance, the query hereunder will find the document having [parts of] words which sounds like “econo”.

SQL> select score(1), doc_url 
  2  from hdfs_docs 
  3  where contains(doc_url, '!econo', 1) > 0 
  4  order by score(1) desc;

  SCORE(1) DOC_URL          
---------- ---------------------------------------------------------------------------------------------------------------------------------------
        11 http://bigdatalite.localdomain:50075/webhdfs/v1/user/oracle/documents/NullHeader.docx?op=OPEN&user.name=oracle&namenoderpcaddress=bigdatalite.localdomain:8020&offset=0                              


Elapsed: 00:00:00.448
SQL> 

c highlight soundex Big Data SQL Quick Start. Oracle Text Integration – Part 18

Finally, we can have a look at boolean operators such as:

SQL> select score(1), doc_url 
  2  from hdfs_docs 
  3  where contains(doc_url, 'Yemen or Gothic', 1) > 0 
  4  order by score(1) desc;

  SCORE(1) DOC_URL
---------- ---------------------------------------------------------------------------------------------------------------------------------------
        44 http://bigdatalite.localdomain:50075/webhdfs/v1/user/oracle/documents/NullHeader.docx?op=OPEN&user.name=oracle&namenoderpcaddress=bigdatalite.localdomain:8020&offset=0                              

         4 http://bigdatalite.localdomain:50075/webhdfs/v1/user/oracle/documents/testPPT_various.pptx?op=OPEN&user.name=oracle&namenoderpcaddress=bigdatalite.localdomain:8020&offset=0                         

         4 http://bigdatalite.localdomain:50075/webhdfs/v1/user/oracle/documents/testWORD_various.doc?op=OPEN&user.name=oracle&namenoderpcaddress=bigdatalite.localdomain:8020&offset=0                         

         4 http://bigdatalite.localdomain:50075/webhdfs/v1/user/oracle/documents/testPPT_various.ppt?op=OPEN&user.name=oracle&namenoderpcaddress=bigdatalite.localdomain:8020&offset=0                          


Elapsed: 00:00:00.113

SQL>

We can see that the score for the first document (44) is higher because “Yemen” is mentioned 9 times instead of once per document for “Gothic”.

You can now enrich your existing queries targeting data in HDFS, NoSQL or RDBMS using Big Data SQL and now adding Full Text search capabilities.

HTTP Redirect in PL/SQL

As stated previously, to get the final (right) URL to retrieve from WebHDFS the content of a document as done by any browser, we need to find out the Location for redirection. The following PL/SQL function will give us this information (now that we have the appropriate ACLs):

SQL> set define off
SQL> set serveroutput on size 1000000
SQL> create or replace function get_real_url( url in varchar2 ) return varchar2 is
  2  l_http_request   UTL_HTTP.req;
  3  l_http_response  UTL_HTTP.resp;
  4  
  5  value VARCHAR2(1024);
  6  
  7  begin
  8  value := url;
  9  l_http_request  := UTL_HTTP.begin_request(url);
 10  UTL_HTTP.SET_HEADER(l_http_request, 'User-Agent', 'Mozilla/4.0');
 11  UTL_HTTP.SET_FOLLOW_REDIRECT(l_http_request, 0);
 12  l_http_response := UTL_HTTP.get_response(l_http_request);
 13  IF(l_http_response.STATUS_CODE in (UTL_HTTP.HTTP_MOVED_PERMANENTLY,UTL_HTTP.HTTP_TEMPORARY_REDIRECT, UTL_HTTP.HTTP_FOUND))
 14  THEN
 15    UTL_HTTP.GET_HEADER_BY_NAME(l_http_response, 'Location', value);
 16  END IF;  
 17  UTL_HTTP.end_response(l_http_response);
 18    
 19  return value;
 20  
 21  end;
 22  
 23  /

Function GET_REAL_URL compiled

SQL> begin
  2  dbms_output.put_line(get_real_url('http://bigdatalite.localdomain:50070/webhdfs/v1/user/oracle/documents/NullHeader.docx?user.name=oracle&op=OPEN'));
  3  end;
  4  
  5  /

http://bigdatalite.localdomain:50075/webhdfs/v1/user/oracle/documents/NullHeader.docx?op=OPEN&user.name=oracle&namenoderpcaddress=bigdatalite.localdomain:8020&offset=0



PL/SQL procedure successfully completed.

SQL> 

As you can see the redirected URL is now:

http://bigdatalite.localdomain:50075/webhdfs/v1/user/oracle/documents/NullHeader.docx?op=OPEN&user.name=oracle&namenoderpcaddress=bigdatalite.localdomain:8020&offset=0

And Oracle Text can now locate the content and index it!

Let’s block ads! (Why?)

The Data Warehouse Insider