Big Data SQL Quick Start. Custom SerDe – Part 20


Many thanks to Bilal Ibdah, who is the actual author of this content; I’m just publishing it on the Big Data SQL blog.

A modernized data warehouse is a data warehouse augmented with insights and data from a Big Data environment, typically Hadoop. Rather than moving and pushing the Hadoop data into a database, companies now tend to expose this data through a unified layer that allows access to all data storage platforms: Hadoop, Oracle Database and NoSQL, to be more specific.

The problem arises when the data we want to expose is stored in its native format and at the lowest possible granularity, for example packet data, which can be in a binary format (PCAP). A typical use of packet data is in the telecommunications industry, where this data is generated by a packet core and can contain raw data records, known in the telecom industry as XDRs.

Here is an example of the traditional architecture: source data is loaded into a mediation layer, the resulting text (CSV) files are parsed by an ETL engine, and the data is then loaded into the database.


An alternative architecture is to load the data directly into HDFS (which is part of your logical data warehouse) and parse it on the fly while the SQL query runs.


In this blog we’re going to use Oracle Big Data SQL to expose and access raw data stored in PCAP format living in Hadoop.

The first step is to store the PCAP files in HDFS using the “copyFromLocal” command.
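
For example, here is a minimal sketch assuming the capture file is named capture.pcap (an illustrative name) and the target directory is /user/oracle/pcap/, the HDFS location used by the Hive table later in this post:

# create the target directory in HDFS and copy the capture file into it
hadoop fs -mkdir -p /user/oracle/pcap
hadoop fs -copyFromLocal capture.pcap /user/oracle/pcap/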


This is what the PCAP file looks like in HDFS: a “hadoop fs -ls /user/oracle/pcap” will show the uploaded file.


In order to expose this file using Big Data SQL, we need to register it in the Hive metastore. Once it is registered, Big Data SQL can access the metadata, create an external table, and run pure Oracle SQL queries on the file. Registering the file, however, requires a custom SerDe to unlock its content; more details here.

Start by downloading the PCAP project from GitHub here. The project contains two components:

  • The hadoop-pcap-lib, which can be used in MapReduce jobs and,
  • The hadoop-pcap-serde, which can be used to query PCAPs in HIVE

For this blog, we will only use the serde component.

If the serde project hasn’t been compiled, compile it in an IDE or in a cmd window using the command “mvn package -e -X”


Copy the output jar named “hadoop-pcap-serde-1.1-SNAPSHOT-jar-with-dependencies.jar”, found in the target folder, to each node in your Hadoop cluster:
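
One way to do this, as a sketch assuming passwordless SSH and illustrative hostnames (bda-node01 … bda-node03); the staging directory /home/oracle/pcapserde/ is the one used later in this post:

# hostnames below are examples only; adjust the list to your cluster
for node in bda-node01 bda-node02 bda-node03; do
  ssh ${node} 'mkdir -p /home/oracle/pcapserde'
  scp target/hadoop-pcap-serde-1.1-SNAPSHOT-jar-with-dependencies.jar ${node}:/home/oracle/pcapserde/
done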


Then add the pcap serde to the HIVE environment variables through Cloudera Manager:
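
One common way is the Hive service environment safety valve (hive-env.sh); a sketch, assuming the jar was copied to /home/oracle/pcapserde/ on every node:

# Hive Service Environment Advanced Configuration Snippet (Safety Valve) in Cloudera Manager
HIVE_AUX_JARS_PATH=/home/oracle/pcapserde/hadoop-pcap-serde-1.1-SNAPSHOT-jar-with-dependencies.jar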


Then save the changes and restart HIVE (you might also need to redeploy the configuration and restart the stale services).

Now let’s create a HIVE table and test the serde; copy the DDL below:

DROP TABLE pcap;
ADD JAR hadoop-pcap-serde-0.1-jar-with-dependencies.jar;
SET net.ripe.hadoop.pcap.io.reader.class=net.ripe.hadoop.pcap.DnsPcapReader;
CREATE EXTERNAL TABLE pcap (ts bigint,
                             ts_usec string,
                             protocol string,
                             src string,
                             src_port int,
                             dst string,
                             dst_port int,
                             len int,
                             ttl int,
                             dns_queryid int,
                             dns_flags string,
                             dns_opcode string,
                             dns_rcode string,
                             dns_question string,
                             dns_answer array<string>,
                             dns_authority array<string>,
                             dns_additional array<string>)
ROW FORMAT SERDE 'net.ripe.hadoop.pcap.serde.PcapDeserializer'
STORED AS INPUTFORMAT 'net.ripe.hadoop.pcap.io.PcapInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs:///user/oracle/pcap/';

Now it’s time to test the serde in HIVE; let’s run the query below:

select * from pcap limit 5;


The query ran successfully. Next we will create an Oracle external table that points to the PCAP file using Big Data SQL. For this purpose we need to add the PCAP serde jar to the Big Data SQL environment (this must be done on each node in your Hadoop cluster). Create a directory such as “/home/oracle/pcapserde/” on each server in the Oracle Big Data Appliance, copy the serde jar to each node in your Big Data Appliance, then browse to /opt/oracle/bigdatasql/bdcell-12.1.

Add the pcap jar file to the environment variables list in the configuration file “bigdata.properties”.
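
A sketch of what that entry could look like, assuming your bigdata.properties exposes a java.classpath.user property (check the file shipped with your Big Data SQL version) and the jar lives in /home/oracle/pcapserde/:

# /opt/oracle/bigdatasql/bdcell-12.1/bigdata.properties (excerpt)
java.classpath.user=/home/oracle/pcapserde/hadoop-pcap-serde-1.1-SNAPSHOT-jar-with-dependencies.jar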


The classpath also needs to be updated in the bigdata.properties file on the database nodes.

The remaining steps are:

  • Copy jar to db side
  • Add jar to class path
  • Create db external table and run query
  • Restart “bdsql” service in Cloudera Manager

After this we are good to define an external table in Oracle RDBMS and query it!
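
A minimal sketch of what that could look like, using the ORACLE_HIVE access driver and assuming the Hive table above is registered as default.pcap; the table name pcap_hdfs and the directory DEFAULT_DIR are illustrative, and the Hive array columns are mapped to VARCHAR2 strings:

-- External table over the Hive "pcap" table; column names mirror the Hive DDL above
CREATE TABLE pcap_hdfs (
  ts             NUMBER,
  ts_usec        VARCHAR2(100),
  protocol       VARCHAR2(50),
  src            VARCHAR2(100),
  src_port       NUMBER,
  dst            VARCHAR2(100),
  dst_port       NUMBER,
  len            NUMBER,
  ttl            NUMBER,
  dns_queryid    NUMBER,
  dns_flags      VARCHAR2(100),
  dns_opcode     VARCHAR2(100),
  dns_rcode      VARCHAR2(100),
  dns_question   VARCHAR2(4000),
  dns_answer     VARCHAR2(4000),
  dns_authority  VARCHAR2(4000),
  dns_additional VARCHAR2(4000)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS (
    com.oracle.bigdata.tablename=default.pcap
  )
)
REJECT LIMIT UNLIMITED;

-- Plain Oracle SQL against the binary PCAP data, parsed on the fly by the serde
SELECT protocol, src, dst, COUNT(*)
FROM   pcap_hdfs
GROUP  BY protocol, src, dst;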


Just in case, I will highlight that in the last query we query (read: parse and query) binary data on the fly.
