Big Data SQL Quick Start. Add SerDe in Big Data SQL classpath. – Part 17.

Today I'm going to write about how to add a custom SerDe in Big Data SQL. The SerDe is one of the most powerful features of Hadoop, and of Big Data SQL in particular: it allows you to read any type of data as structured data; you just need to explain how to parse it.

Let's imagine that we have a JSON file:

—————————————————————————————————————————————————–

{"wr_returned_date_sk":38352,"wr_returned_time_sk":46728,"wr_item_sk":561506,"wr_refunded_customer_sk":1131210}
{"wr_returned_date_sk":38380,"wr_returned_time_sk":78937,"wr_item_sk":10003,"wr_refunded_customer_sk":1131211}

—————————————————————————————————————————————————–
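Assuming the file already sits at the HDFS location used in the DDL below, you can peek at the raw data first:

—————————————————————————————————————————————————–

[Linux]# hadoop fs -cat hdfs://scaj43-ns/user/hive/warehouse/json_string/* | head -2

—————————————————————————————————————————————————–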

We want to process this file with a custom SerDe, for example org.openx.data.jsonserde.JsonSerDe. Following the SerDe's documentation, I try to create an external table:

—————————————————————————————————————————————————–

hive> CREATE EXTERNAL TABLE json_openx(
       wr_returned_date_sk bigint,
       wr_returned_time_sk bigint,
       wr_item_sk bigint,
       wr_refunded_customer_sk bigint)
     ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
     LOCATION 'hdfs://scaj43-ns/user/hive/warehouse/json_string';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.openx.data.jsonserde.JsonSerDe

—————————————————————————————————————————————————–

In the end I got an error, which tells me that Hive doesn't have this JAR on its classpath. Fair enough, I have to add it. I can add the file for the current session with the ADD JAR command and then create the table again:

—————————————————————————————————————————————————–

hive> add jar hdfs://scaj43-ns/tmp/json/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar;
converting to local hdfs://scaj43-ns/tmp/json/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar
Added [/tmp/f0317b31-2df6-4a24-ab8d-66136f9c26e6_resources/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [hdfs://scaj43-ns/tmp/json/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar]
hive> CREATE EXTERNAL TABLE json_openx(
            wr_returned_date_sk bigint,
            wr_returned_time_sk bigint,
            wr_item_sk bigint,
            wr_refunded_customer_sk bigint)
          ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
          LOCATION 'hdfs://scaj43-ns/user/hive/warehouse/json_string';
OK
Time taken: 0.047 seconds
hive> select * from json_openx limit 1;
OK
38352   46728   561506  1131210

—————————————————————————————————————————————————–

Now everything seems fine, but ADD JAR is session-scoped: if I log off and log in again, the query will no longer work.

—————————————————————————————————————————————————–

hive> select * from json_openx limit 1;
FAILED: RuntimeException MetaException(message:java.lang.ClassNotFoundException Class org.openx.data.jsonserde.JsonSerDe not found)

—————————————————————————————————————————————————–
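By the way, you can confirm that the fresh session has lost the JAR with Hive's LIST JARS command, which shows resources added via ADD JAR; in a new session it comes back empty:

—————————————————————————————————————————————————–

hive> list jars;
hive>

—————————————————————————————————————————————————–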

Well, I have to add this JAR to the Hive configuration permanently, and the most convenient way to do this is Cloudera Manager.

Before that, I have to copy it to each machine in the cluster:

—————————————————————————————————————————————————–

[Linux]# dcli -C "mkdir /home/oracle/serde/"
[Linux]# dcli -C -f /root/json/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar -d /home/oracle/serde/
[Linux]# dcli -C "ls /home/oracle/serde"
192.168.42.92: json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar
192.168.42.93: json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar
192.168.42.94: json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar

—————————————————————————————————————————————————–
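Optionally, verify that every node received an identical copy (a quick hedged check):

—————————————————————————————————————————————————–

[Linux]# dcli -C "md5sum /home/oracle/serde/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar"

—————————————————————————————————————————————————–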

After the files are propagated to every node, click on the Hive service in Cloudera Manager, go to its Configuration tab, type "hive-env" into the search bar, and add the path to the JAR there.
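The original screenshots aren't reproduced here; as a sketch, the hive-env snippet typically ends up looking something like this (AUX_CLASSPATH as the variable name is my assumption, based on common Cloudera Manager setups):

—————————————————————————————————————————————————–

# hive-env.sh snippet (a sketch, not the exact screenshot contents):
AUX_CLASSPATH=${AUX_CLASSPATH}:/home/oracle/serde/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar

—————————————————————————————————————————————————–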

After this, restart the Hive service and deploy the client configuration.

Check that Hive works:

—————————————————————————————————————————————————–

hive> select * from json_openx limit 1;
OK
38352   46728   561506  1131210

—————————————————————————————————————————————————–

Now we are ready to create the external table in Oracle (I'm using the DBMS_HADOOP PL/SQL package to generate the DDL):

—————————————————————————————————————————————————–

SQL> DECLARE
   DDLout VARCHAR2(4000);
BEGIN
   dbms_hadoop.create_extddl_for_hive(CLUSTER_ID      => 'scaj43',
                                      DB_NAME         => 'default',
                                      HIVE_TABLE_NAME => 'json_openx',
                                      HIVE_PARTITION  => FALSE,
                                      TABLE_NAME      => 'json_openx',
                                      PERFORM_DDL     => FALSE,
                                      TEXT_OF_DDL     => DDLout);
   dbms_output.put_line(DDLout);
END;
/

SQL> CREATE TABLE BDS.json_openx
(wr_returned_date_sk NUMBER,
wr_returned_time_sk NUMBER,
wr_item_sk NUMBER,
wr_refunded_customer_sk NUMBER)
ORGANIZATION EXTERNAL
 (TYPE ORACLE_HIVE
 DEFAULT DIRECTORY DEFAULT_DIR
 ACCESS PARAMETERS (
com.oracle.bigdata.cluster=scaj43
com.oracle.bigdata.tablename=default.json_openx))
PARALLEL 2
REJECT LIMIT UNLIMITED;
SQL> SELECT * FROM BDS.json_openx;
SELECT * FROM BDS.json_openx
*
ERROR at line 1:
ORA-29913: error in executing ODCIEXTTABLEOPEN callout
ORA-29400: data cartridge error
KUP-11504: error from external driver: oracle.hadoop.sql.JXADException: error
parsing "com.oracle.bigdata.colmap" field name "wr_returned_date_sk" not found

—————————————————————————————————————————————————–

Well, we got an error, because this JAR doesn't exist in the Big Data SQL classpath yet. Let's add it. On the cell side, in the bigdata.properties file (in my environment it's in /opt/oracle/bigdatasql/bdcell-12.1/bigdata.properties), add this JAR to the java.classpath.hadoop variable:

—————————————————————————————————————————————————–

java.classpath.hadoop=/opt/oracle/bigdatasql/bdcell-hadoopconf/*:/opt/cloudera/parcels/CDH/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop/*:/opt/cloudera/parcels/CDH/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop-hdfs/*:/opt/cloudera/parcels/CDH/lib/hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop-yarn/*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/*:/home/oracle/serde/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar

—————————————————————————————————————————————————–
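Since every cell needs the same edit, one way to apply it everywhere is the same dcli approach as above (a sketch; it edits the file in place, so back it up first):

—————————————————————————————————————————————————–

# Back up the config, then append the SerDe JAR to java.classpath.hadoop on every cell.
[Linux]# dcli -C "cp /opt/oracle/bigdatasql/bdcell-12.1/bigdata.properties /opt/oracle/bigdatasql/bdcell-12.1/bigdata.properties.bak"
[Linux]# dcli -C "sed -i '/^java.classpath.hadoop=/ s|$|:/home/oracle/serde/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar|' /opt/oracle/bigdatasql/bdcell-12.1/bigdata.properties"

—————————————————————————————————————————————————–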

On the database side, add the same path to the same java.classpath.hadoop variable in the $ORACLE_HOME/bigdatasql/bigdata_config/bigdata.properties file.
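A quick check that the database-side file now includes the JAR:

—————————————————————————————————————————————————–

[Linux]# grep '^java.classpath.hadoop' $ORACLE_HOME/bigdatasql/bigdata_config/bigdata.properties

—————————————————————————————————————————————————–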

After this, you need to restart Big Data SQL on the cell side.

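The screenshot isn't reproduced here; on a BDA-style install where the bdscli utility is available, the restart can be scripted across all cells roughly like this (a sketch, verify the command for your version; otherwise restart the Big Data SQL service from Cloudera Manager):

—————————————————————————————————————————————————–

# Sketch: restart the Big Data SQL server process on every cell.
[Linux]# dcli -C 'bdscli -e "alter bdsql restart"'

—————————————————————————————————————————————————–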

and restart the extproc on the database side:

—————————————————————————————————————————————————–

[Linux]# $GRID_HOME/bin/crsctl stop resource bds_DBINSTANCE_HADOOPCLUSTER
[Linux]# $GRID_HOME/bin/crsctl start resource bds_DBINSTANCE_HADOOPCLUSTER

—————————————————————————————————————————————————–
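You can confirm the resource came back online (same placeholder resource name as above):

—————————————————————————————————————————————————–

[Linux]# $GRID_HOME/bin/crsctl status resource bds_DBINSTANCE_HADOOPCLUSTER

—————————————————————————————————————————————————–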

And finally, check that everything works properly:

—————————————————————————————————————————————————–

SQL> SELECT * FROM BDS.json_openx;
              38352               46728     561506                 1131210
              38380               78937      10003                 1131211

—————————————————————————————————————————————————–

