Category Archives: Data Warehousing

Big Data SQL Quick Start. Correlate real-time data with historical benchmarks – Part 24

In Big Data SQL 3.2 we introduced a new capability – Kafka as a data source. I've already posted some details about how it works, with some simple examples, over here. But now I want to talk about why you would want to run queries over Kafka. Here is Oracle's concept picture of the data warehouse:

You have a stream of real-time data, a data lake where you land raw information, and cleaned enterprise data. This is just a concept, which could be implemented in many different ways; one of them is depicted here:

Kafka is the hub for streaming events, where you accumulate data from multiple real-time producers and provide this data to many consumers (real-time processing, such as Spark Streaming, or batch loads into the next data warehouse tier, such as Hadoop).

In this architecture, Kafka contains the stream data and is able to answer the question "what is going on right now", whereas in the database you store operational data and in Hadoop historical data, and those two sources are able to answer the question "how it used to be". Big Data SQL allows you to run SQL over all three sources and correlate real-time events with historical ones.
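To make this concrete, here is a minimal sketch of such a three-source query. All table and column names (movie_sales_kafka, sales_hist_hourly, countries, and their columns) are hypothetical placeholders for a Kafka external table, a historical benchmark table, and a dimension table; the actual DDL depends on your Big Data SQL setup.

-- Hypothetical names: movie_sales_kafka is an external table over the Kafka topic,
-- sales_hist_hourly holds historical revenue (in the database or HDFS),
-- countries is a dimension table in the RDBMS.
SELECT c.country_name,
       rt.revenue_last_hour,
       b.benchmark_revenue
FROM  (SELECT country_id, SUM(price) AS revenue_last_hour
       FROM   movie_sales_kafka                       -- "what is going on right now"
       WHERE  sale_time > SYSTIMESTAMP - INTERVAL '1' HOUR
       GROUP  BY country_id) rt
JOIN  (SELECT country_id, AVG(hourly_revenue) AS benchmark_revenue
       FROM   sales_hist_hourly                       -- "how it used to be"
       WHERE  day_of_week = TO_CHAR(SYSDATE, 'DY')
       AND    hour_of_day = TO_NUMBER(TO_CHAR(SYSDATE, 'HH24'))
       GROUP  BY country_id) b
      ON b.country_id = rt.country_id
JOIN  countries c
      ON c.country_id = rt.country_id;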

Example of using Big Data SQL over Kafka and other sources.

So, above I've explained why you might need to query Kafka with Big Data SQL; now let me give a concrete example.

Input for demo example:

- We have a company called MoviePlex, which sells video content all around the world.

- There are two streaming datasets: network data, which contains information about network errors, the condition of routing devices, and so on; and the second data source, which is the movie sales facts.

- Both are streamed in real time into Kafka.

- Also, we have historical network data, which we store in HDFS (because of the cost of this data), historical sales data (which we store in the database), and multiple dimension tables, stored in the RDBMS as well.

Based on this, we have a business case: monitor the revenue flow, correlate current traffic with the historical benchmark (which depends on the day of the week and the hour of the day), and try to find the reason in case of failures (network errors, for example).

Using Oracle Data Visualization Desktop, we've created a dashboard which shows how real-time traffic correlates with the statistical benchmark and also shows the number of network errors by country:

The blue line is a historical benchmark.

Over time we see that some errors appear in some countries (left dashboard), but current revenue is more or less the same as it used to be.

After a while revenue starts going down.

This trend keeps going.

There are a lot of network errors in France. Let's drill down into the itemized traffic:

Indeed, we can see that overall revenue goes down because of France, and the cause is network errors.

Conclusion:

1) Kafka stores real-time data and answers the question "what is going on right now"

2) The database and Hadoop store historical data and answer the question "how it used to be"

3) Big Data SQL can query data from Kafka, Hadoop and the database within a single query (joining the datasets)

4) This allows us to correlate historical benchmarks with real-time data through a SQL interface and use it with any SQL-compatible BI tool


Oracle Blogs | Oracle The Data Warehouse Insider Blog

Review of Big Data Warehousing at OpenWorld 2017 – Now Available


Did you miss OpenWorld 2017? Then my latest book is definitely something you will want to download! If you went to OpenWorld, this book is also for you because it covers all the most important big data warehousing messages and sessions from the five days of OpenWorld.

Following on from OpenWorld 2017 I have put together a comprehensive review of all the big data warehousing content from OpenWorld 2017. This includes all the key sessions and announcements from this year’s Oracle OpenWorld conference. This review guide contains the following information:

Chapter 1 Welcome – an overview of the contents.  

Chapter 2 Let’s Go Autonomous - containing all you need to know about Oracle’s new, fully-managed Autonomous Data Warehouse Cloud. This was the biggest announcement at OpenWorld so this chapter contains videos, presentations and podcasts to get you up to speed on this completely new data warehouse cloud service.

Chapter 3 Keynotes – Relive OpenWorld 2017 by watching the most important highlights from this year’s OpenWorld conference with our on demand video service which covers all the major keynote sessions.

Chapter 4 Key Presenters – a list of the most important speakers by product area such as database, cloud, analytics, developer and big data. Each biography includes all relevant social media sites and pages.

Chapter 5 Key Sessions – a list of all the most important sessions with links to download the related presentations, organized by category.

Chapter 6 Staying Connected – Details of all the links you need to keep up to date on Oracle’s strategy and products for Data Warehousing and Big Data.  This covers all our websites, blogs and social media pages.

This review is available in three formats:

1) For highly evolved users, i.e. Apple users, who understand the power of Apple’s iBook format, your multi-media enabled iBook version is available here.

2) For Windows users who are forced to endure a 19th-Century style technological experience, your PDF version is available here.

3) For Linux users, Oracle DBAs and other IT dinosaurs, all of whom are allergic to all graphical user interfaces, the basic version of this comprehensive review is available here.

I hope you enjoy this review and look forward to seeing you next year at OpenWorld 2018, October 28 to November 1. If you’d like to be notified when registration opens for next year’s Oracle OpenWorld then register your email address here.
 


Oracle Blogs | Oracle The Data Warehouse Insider Blog

New Release: BDA 4.10 is now Generally Available

As of today, BDA version 4.10 is Generally Available. As always, please refer to If You Struggle With Keeping your BDAs up to date, Then Read This to learn about the innovative release process we use for BDA software.

This new release includes a number of features and updates:

  • Support for Migration From Oracle Linux 5 to Oracle Linux 6 - Clusters on Oracle Linux 5 must first be upgraded to v4.10.0 on Oracle Linux 5 and can then be migrated to Oracle Linux 6. This process must be done one server at a time. HDFS data and Cloudera Manager roles are retained. Please review the documentation for the entire process carefully before starting.

    • BDA v4.10 is the last release built for Oracle Linux 5 and no further upgrades for Oracle Linux 5 will be released.
  • Updates to NoSQL DB, Big Data Connectors, Big Data Spatial & Graph
    • Oracle NoSQL Database 4.5.12
    • Oracle Big Data Connectors 4.10.0
    • Oracle Big Data Spatial & Graph 2.4.0
  • Support for Oracle Big Data Appliance X7 systems – Oracle Big Data Appliance X7 is based on the X7-2L server. The major enhancements in Big Data Appliance X7-2 hardware are:

    • CPU update: 2 x 24-core Intel Xeon processors
    • Updated disk drives: 12 x 10 TB 7,200 RPM SAS drives
    • 2 x M.2 150 GB SATA SSD drives (replacing the internal USB drive)
    • Vail Disk Controller (HBA)
    • Cisco 93108TC-EX-1G Ethernet switch (replacing the Catalyst 4948E).
  • Spark 2 Deployed by Default – Spark 2 is now deployed by default on new clusters and also during upgrade of clusters where it is not already installed.
  • Oracle Linux 7 can be Installed on Edge Nodes – Oracle Linux 7 is now supported for installation on Oracle Big Data Appliance edge nodes running on X7–2L, X6–2L or X5–2L servers. Support for Oracle Linux 7 in this release is limited to edge nodes.
  • Support for Cloudera Data Science Workbench – Support for Oracle Linux 7 on edge nodes provides a way for customers to host Cloudera Data Science Workbench (CDSW) on Oracle Big Data Appliance. CDSW is a web application that enables access from a browser to R, Python, and Scala on a secured cluster. Oracle Big Data Appliance does not include licensing or official support for CDSW. Contact Cloudera for licensing requirements.
     
  • Scripts for Download & Configuration of Apache Zeppelin, Jupyter Notebook, and RStudio –  This release includes scripts to assist in download and configuration of these commonly used tools. The scripts are provided as a convenience to users. Oracle Big Data Appliance does not include official support for the installation and use of Apache Zeppelin, Jupyter Notebook, or RStudio.
     
  • Improved Configuration of Oracle’s R Distribution and ORAAH – For these tools, much of the environment configuration that was previously done by the customer is now automated.
  • Node Migration Optimization – Node migration time has been improved by eliminating some steps.
  • Support for Extending Secure NoSQL DB clusters

This release is based on Cloudera Enterprise (CDH 5.12.1 & Cloudera Manager 5.12.1) as well as Oracle NoSQL Database (4.5.12).

  • Cloudera 5 Enterprise includes CDH (Core Hadoop), Cloudera Manager, Apache Spark, Apache HBase, Impala, Cloudera Search and Cloudera Navigator
  • The BDA continues to support all security options for CDH Hadoop clusters: Kerberos authentication (MIT or Microsoft Active Directory), Sentry authorization, HTTPS/network encryption, transparent HDFS disk encryption, and secure configuration for Impala, HBase, Cloudera Search and all Hadoop services, configured out-of-the-box.
  • Parcels for Kafka 2.2, Spark 2.2, Kudu 1.4 and Key Trustee Server 5.12 are included in the BDA Software Bundle


Oracle Blogs | Oracle The Data Warehouse Insider Blog

OpenWorld 2017: Must-See Sessions for Day 3 – Tuesday

Day 3, Tuesday, is here and this is my definitive list of Must-See sessions for today. Today we are focused on the new features in Oracle Database 18c – multitenant, in-memory, Oracle Text, machine learning, Big Data SQL, and more. These sessions are what Oracle OpenWorld is all about: the chance to learn about the latest technology from the real technical experts.

MONDAY’s MUST-SEE GUIDE

Don’t worry if you are not able to join us in San Francisco for this year’s conference because I will be providing a comprehensive review after the conference closes on Thursday.

The review will include links to download the presentations for each of my Must-See sessions and links to any hands-on lab content as well.

Have a great conference.

If you are here in San Francisco then enjoy the conference – it’s going to be an awesome conference this year.

Don’t forget to make use of our Big DW #oow17 smartphone app which you can access by pointing your phone at this QR code:


Oracle Blogs | Oracle The Data Warehouse Insider Blog

OpenWorld 2017 – Must-See Sessions for Day 1


It all starts today –  OpenWorld 2017. Each day I will provide you with a list of must-see sessions and hands-on labs. This is going to be one of the most exciting OpenWorlds ever!

Today is Day 1, so here is my definitive list of Must-See sessions for the opening day. The list is packed full of really excellent speakers such as Franck Pachot, Ami Aharonovich, Galo Balda and Rich Niemiec. These sessions are what Oracle OpenWorld is all about: the chance to learn from the real technical experts.

Of course you need to end your first day in Moscone North Hall D for Larry Ellison’s welcome keynote – it’s going to be a  great one!
 

SUNDAY’S MUST-SEE GUIDE

Don’t worry if you are not able to join us in San Francisco for this year’s conference because I will be providing a comprehensive review after the conference closes on Thursday.

The review will include links to download the presentations for each of my Must-See sessions and links to any hands-on lab content as well. Have a great conference.

If you are here in San Francisco then enjoy the conference – it’s going to be an awesome conference this year.

Don’t forget to make use of our Big DW #oow17 smartphone app which you can access by pointing your phone at this QR code:
 



Oracle Blogs | Oracle The Data Warehouse Insider Blog

UPDATED: Big Data Warehousing Must See Guide for Oracle OpenWorld 2017


*** UPDATED *** Must-See Guide now available as PDF and via Apple iBooks Store

This updated version now contains details of all the most important hands-on labs AND a day-by-day calendar. This means that our comprehensive guide now covers absolutely everything you need to know about this year’s Oracle OpenWorld conference. Now, when you arrive at Moscone Conference Center you are ready to get the absolute most out of this amazing conference.

The updated, and still completely free, big data warehousing Must-See Guide for OpenWorld 2017 is now available for download from the Apple iBooks Store – click here – and in PDF format – click here.

Just so you know…this guide contains the following information:

Chapter 1 – Introduction to the must-see guide.

Chapter 2 – A guide to the key highlights from last year’s conference so you can relive the experience or see what you missed. Catch the most important highlights from last year’s OpenWorld conference with our on-demand video service, which covers all the major keynote sessions. Sit back and enjoy the highlights. The second section explains why you need to attend this year’s conference and how to justify it to your company.

Chapter 3 – Full list of Oracle Product Management and Development presenters who will be at this year’s OpenWorld. Links to all their social media sites are included alongside each profile. Read on to find out about the key people who can help you and your teams build the FUTURE using Oracle’s Data Warehouse and Big Data technologies.

Chapter 4 – List of the “must-see” sessions and hands-on labs at this year’s OpenWorld by category. It includes all the sessions and hands-on labs by the Oracle Product Management and Development teams along with key customer sessions. Read on for the list of the best, most innovative sessions at Oracle OpenWorld 2017.

Chapter 5 – Day-by-day “must-see” guide. It includes all the sessions and hands-on labs by the Oracle Product Management and Development teams along with key customer sessions. Read on for the list of the best, most innovative sessions at Oracle OpenWorld 2017.

Chapter 6 – Details of all the links you need to keep up to date on Oracle’s strategy and products for Data Warehousing and Big Data. This covers all our websites, blogs and social media pages.

Chapter 7 – Details of our exclusive web application for smartphones and tablets, which provides you with a complete guide to everything related to data warehousing and big data at OpenWorld 2017.

Chapter 8 – Information to help you find your way around the area surrounding the Moscone Conference Center; this section includes some helpful maps.

Let me know if you have any comments. Enjoy and see you in San Francisco.


Oracle Blogs | Oracle The Data Warehouse Insider Blog


Big Data SQL Quick Start. Complex Data Types – Part 21

Many thanks to Dario Vega, who is the actual author of this content. I’m just publishing it on this blog.

A common, and potentially mistaken, approach that people take to the integration of NoSQL, Hive and ultimately Big Data SQL is to look at it only from an RDBMS perspective and not from an integration point of view. People generally think about all the features and data types they’re already familiar with from their experience using one of these products, rather than realizing that the actual data is stored in Hive (or NoSQL) rather than in the RDBMS, or without understanding that the data will be queried from the RDBMS.

When using Big Data SQL with complex types, we tend to think in terms of JSON/SQL without taking care of the differences between how Oracle Database and Hive use complex types. Why? Because the complex types are mapped to VARCHAR2 in JSON format, so we read the data in JSON style instead of the style of the original system.

The best example of this: from a JSON perspective (JSON ECMA-404), the Map type does not exist.

Programming languages vary widely on whether they support objects, and if so, what characteristics and constraints the objects offer. The models of object systems can be wildly divergent and are continuing to evolve. JSON instead provides a simple notation for expressing collections of name/value pairs. Most programming languages will have some feature for representing such collections, which can go by names like record, struct, dict, map, hash, or object.

The following built-in collection functions are supported in Hive (a quick sketch of their use follows the list):

  • int size(Map) – returns the number of elements in the map type.

  • array map_keys(Map) – returns an unordered array containing the keys of the input map.

  • array map_values(Map) – returns an unordered array containing the values of the input map.
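
As a quick illustration, here is a minimal HiveQL sketch of these functions, assuming the rmvtable_hive_parquet table used later in this post, where phoneinfo is a map<string,string> (the SSN value is taken from the examples below):

-- Hive: built-in map functions applied to the phoneinfo column
SELECT size(phoneinfo)       AS phone_count,    -- e.g. 3
       map_keys(phoneinfo)   AS phone_types,    -- e.g. ["work","cell","home"]
       map_values(phoneinfo) AS phone_numbers   -- e.g. ["617-656-9208","408-656-2016","213-879-2134"]
FROM   rmvtable_hive_parquet
WHERE  ssn = 576228946;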

Are they supported in the RDBMS? The answer is no, but it may be yes if you use APEX, PL/SQL or Java programs.

In the same way, there is also a difference between Impala and Hive.

Lateral views. In CDH 5.5 / Impala 2.3 and higher, Impala supports queries on complex types (STRUCT, ARRAY, or MAP), using join notation rather than the EXPLODE() keyword. See Complex Types (CDH 5.5 or higher only) for details about Impala support for complex types.

The Impala complex type support produces result sets with all scalar values, and the scalar components of complex types can be used with all SQL clauses, such as GROUP BY, ORDER BY, all kinds of joins, subqueries, and inline views. The ability to process complex type data entirely in SQL reduces the need to write application-specific code in Java or other programming languages to deconstruct the underlying data structures.


Best practices. We would advise taking a conservative approach.

This is because the mappings between the NoSQL data model, the Hive data model, and the Oracle RDBMS data model are not 1-to-1.
For example, the NoSQL data model is quite rich, and there are many things one can do with nested classes in NoSQL that have no counterpart in either Hive or Oracle Database (or both). As a result, integration of the three technologies had to take a ‘least-common-denominator’ approach, employing mechanisms common to all three.

But let me show a sample

Impala code

`phoneinfo` map<string,string>

impala> SELECT
          ZIPCODE
         ,LASTNAME
         ,FIRSTNAME
         ,SSN
         ,GENDER
         ,PHONEINFO.*
        FROM rmvtable_hive_parquet, rmvtable_hive_parquet.PHONEINFO AS PHONEINFO
        WHERE zipcode = '02610' AND lastname = 'ACEVEDO' AND firstname = 'TAMMY' AND ssn = 576228946;

+---------+----------+-----------+-----------+--------+------+--------------+
| zipcode | lastname | firstname | ssn       | gender | key  | value        |
+---------+----------+-----------+-----------+--------+------+--------------+
| 02610   | ACEVEDO  | TAMMY     | 576228946 | female | work | 617-656-9208 |
| 02610   | ACEVEDO  | TAMMY     | 576228946 | female | cell | 408-656-2016 |
| 02610   | ACEVEDO  | TAMMY     | 576228946 | female | home | 213-879-2134 |
+---------+----------+-----------+-----------+--------+------+--------------+

Oracle code:

`phoneinfo` IS JSON

SQL> SELECT /*+ MONITOR */
      a.json_column.zipcode
     ,a.json_column.lastname
     ,a.json_column.firstname
     ,a.json_column.ssn
     ,a.json_column.gender
     ,a.json_column.phoneinfo
     FROM pmt_rmvtable_hive_json_api a
     WHERE a.json_column.zipcode = '02610' AND a.json_column.lastname = 'ACEVEDO'
       AND a.json_column.firstname = 'TAMMY' AND a.json_column.ssn = 576228946;
 
 
ZIPCODE : 02610 
LASTNAME : ACEVEDO
FIRSTNAME : TAMMY
SSN : 576228946
GENDER : female
PHONEINFO :{"work":"617-656-9208","cell":"408-656-2016","home":"213-879-2134"}

QUESTION: how do we transform this JSON PHONEINFO into two “arrays” (keys and values), i.e. the Map behavior we expect?

Unfortunately, the NESTED PATH clause of the JSON_TABLE operator is only available for JSON arrays (a hypothetical sketch follows the next example). On the other hand, when using JSON we can access each field as a column.

SQL> SELECT /*+ MONITOR */
      ZIPCODE
     ,LASTNAME
     ,FIRSTNAME
     ,SSN
     ,GENDER
     ,LICENSE
     ,a.PHONEINFO.work
     ,a.PHONEINFO.home
     ,a.PHONEINFO.cell
     FROM pmt_rmvtable_hive_orc a
     WHERE zipcode = '02610' AND lastname = 'ACEVEDO' AND firstname = 'TAMMY' AND ssn = 576228946;

ZIPCODE  LASTNAME  FIRSTNAME  SSN        GENDER  LICENSE             WORK          HOME          CELL
-------  --------  ---------  ---------  ------  ------------------  ------------  ------------  ------------
02610    ACEVEDO   TAMMY      576228946  female  533933353734363933  617-656-9208  213-879-2134  408-656-2016
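
For contrast, here is a minimal, hypothetical sketch of the NESTED PATH clause mentioned above. It only applies if the phone data were remodeled as a JSON array of objects (for example "phones":[{"type":"work","number":"617-656-9208"}, ...]) rather than a map; the array name and object fields are assumptions for illustration.

-- Hypothetical reshaping: phones as a JSON array of {"type": ..., "number": ...} objects
SELECT jt.zipcode, jt.phone_type, jt.phone_number
FROM   pmt_rmvtable_hive_json_api t,
       JSON_TABLE(t.json_column, '$'
         COLUMNS (
           zipcode VARCHAR2(5) PATH '$.zipcode',
           NESTED PATH '$.phones[*]'
             COLUMNS (
               phone_type   VARCHAR2(10) PATH '$.type',
               phone_number VARCHAR2(20) PATH '$.number'
             )
         )) jt
WHERE  jt.phone_type = 'work';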

And what about using map columns in the WHERE clause, looking for a specific phone number?

Impala code

`phoneinfo` map<string,string>

SELECT
   ZIPCODE
  ,LASTNAME
  ,FIRSTNAME
  ,SSN
  ,GENDER
  ,PHONEINFO.*
FROM rmvtable_hive_parquet, rmvtable_hive_parquet.PHONEINFO AS PHONEINFO
WHERE PHONEINFO.key = 'work' AND PHONEINFO.value = '617-656-9208';

+---------+------------+-----------+-----------+--------+------+--------------+
| zipcode | lastname   | firstname | ssn       | gender | key  | value        |
+---------+------------+-----------+-----------+--------+------+--------------+
| 89878   | ANDREWS    | JEREMY    | 848834686 | male   | work | 617-656-9208 |
| 00183   | GRIFFIN    | JUSTIN    | 976396720 | male   | work | 617-656-9208 |
| 02979   | MORGAN     | BONNIE    | 904775071 | female | work | 617-656-9208 |
| 14462   | MCLAUGHLIN | BRIAN     | 253990562 | male   | work | 617-656-9208 |
| 83193   | BUSH       | JANICE    | 843046328 | female | work | 617-656-9208 |
| 57300   | PAUL       | JASON     | 655837757 | male   | work | 617-656-9208 |
| 92762   | NOLAN      | LINDA     | 270271902 | female | work | 617-656-9208 |
| 14057   | GIBSON     | GREGORY   | 345334831 | male   | work | 617-656-9208 |
| 04336   | SAUNDERS   | MATTHEW   | 180588967 | male   | work | 617-656-9208 |
...
| 23993   | VEGA       | JEREMY    | 123967808 | male   | work | 617-656-9208 |
+---------+------------+-----------+-----------+--------+------+--------------+

Fetched 852 row(s) in 99.80s

But let me continue by showing the same query on Oracle (querying on the work phone).

Oracle code

`phoneinfo` IS JSON

SELECT /*+ MONITOR */
   ZIPCODE
  ,LASTNAME
  ,FIRSTNAME
  ,SSN
  ,GENDER
  ,PHONEINFO
FROM pmt_rmvtable_hive_parquet a
WHERE JSON_QUERY("A"."PHONEINFO" FORMAT JSON, '$.work'
      RETURNING VARCHAR2(4000) ASIS WITHOUT ARRAY WRAPPER NULL ON ERROR) = '617-656-9208';

35330  SIMS        DOUGLAS  295204437  male    {"work":"617-656-9208","cell":"901-656-9237","home":"303-804-7540"}
43466  KIM         GLORIA   358875034  female  {"work":"617-656-9208","cell":"978-804-8373","home":"415-234-2176"}
67056  REEVES      PAUL     538254872  male    {"work":"617-656-9208","cell":"603-234-2730","home":"617-804-1330"}
07492  GLOVER      ALBERT   919913658  male    {"work":"617-656-9208","cell":"901-656-2562","home":"303-804-9784"}
20815  ERICKSON    REBECCA  912769190  female  {"work":"617-656-9208","cell":"978-656-0517","home":"978-541-0065"}
48250  KNOWLES     NANCY    325157978  female  {"work":"617-656-9208","cell":"901-351-7476","home":"213-234-8287"}
48250  VELEZ       RUSSELL  408064553  male    {"work":"617-656-9208","cell":"978-227-2172","home":"901-630-7787"}
43595  HALL        BRANDON  658275487  male    {"work":"617-656-9208","cell":"901-351-6168","home":"213-227-4413"}
77100  STEPHENSON  ALBERT   865468261  male    {"work":"617-656-9208","cell":"408-227-4167","home":"408-879-1270"}

852 rows selected.

Elapsed: 00:05:29.56

In this case, we can also use the dot notation: A.PHONEINFO.work = '617-656-9208'.

Note: to get familiar with the database JSON API, you may refer to the following blog series: https://blogs.oracle.com/jsondb


Oracle Blogs | Oracle The Data Warehouse Insider Blog

The first really hidden gem in Oracle Database 12c Release 2: runtime modification of external table parameters

We missed documenting some functionality!

With the next milestone for Oracle Database 12c Release 2 just taking place – the availability on premises for Linux x86-64, Solaris SPARC64, and Solaris x86-64, in addition to the Oracle Cloud – I managed to use this as an excuse to play around with it for a bit .. and found that we somehow missed documenting new functionality. Bummer. But still better than the other way around ;-)

We missed documenting the capability to override some parameters of an external table at runtime.

So I decided to quickly blog about this, not only to fill the gap in the documentation (a doc bug is filed already) but also to ruthlessly hijack the momentum and start highlighting new functionality (there are more blog posts to come, specifically around my pet peeve Partitioning, but that's for later).

So what does it mean to override some parameters of an external table at runtime?

It simply means that you can use one external table definition stub as a proxy for external data access of different files, with different reject limits, at different points in time – without the need to issue DDL to modify the external table definition.

The usage is pretty simple and straightforward, so let me quickly demonstrate this with a not-so-business-relevant sample table. The prerequisite SQL for this one to run is at the end of this blog and might make its way onto GitHub as well; I have not managed that yet and just wanted to get this blog post out.

Here is my rather trivial external table definition. It has worked for me since version 9, so why not use it with 12.2 as well.

CREATE TABLE et1 (col1 NUMBER, col2 NUMBER, col3 NUMBER)
ORGANIZATION EXTERNAL
(TYPE ORACLE_LOADER
  DEFAULT DIRECTORY d1
  ACCESS PARAMETERS
  ( RECORDS DELIMITED BY NEWLINE CHARACTERSET US7ASCII
    NOBADFILE
    NOLOGFILE
    FIELDS TERMINATED BY ","
   )
  LOCATION ('file1.txt')
)
   REJECT LIMIT UNLIMITED
;

Pretty straightforward vanilla external table. Let’s now see how many rows this external table returns (the simple “data generation” is at the end of this blog):

SELECT count(*) FROM et1;

  COUNT(*)
----------
        99
So far, so good. And now the new functionality: we will now access the exact same external table but tell the database to do a runtime modification of the file (location) we are accessing:

SELECT count(*) FROM et1
EXTERNAL MODIFY
(LOCATION ('file2.txt'));

  COUNT(*)
----------
         9

As you can see, the row count changes without me having done any change to the external table definition, like an ALTER TABLE. You will also see that nothing has changed in the external table definition:

SQL> SELECT table_name, location FROM user_external_locations WHERE table_name='ET1';

TABLE_NAME                     LOCATION
------------------------------ ------------------------------
ET1                            file1.txt
And there's one more thing. You might have asked yourself right now, right this moment ... why do I have to specify a location for the initial external table creation at all? The answer is simple: you do not have to do this anymore.

Here is my external table without a location:

CREATE TABLE et2 (col1 NUMBER, col2 NUMBER, col3 NUMBER)
ORGANIZATION EXTERNAL
(TYPE ORACLE_LOADER
  DEFAULT DIRECTORY d1
  ACCESS PARAMETERS
  ( RECORDS DELIMITED BY NEWLINE CHARACTERSET US7ASCII
    NOBADFILE
    NOLOGFILE
    FIELDS TERMINATED BY ","
   )
)
   REJECT LIMIT UNLIMITED
;

When I now select from it, guess what: you won’t get any rows back. The location is NULL.

SQL> SELECT * FROM et2;

no rows selected
Personally, I am not sure I would go for this approach rather than having at least one dummy file in the location with a control record with some pre-defined values, to be able to tell whether there are really no records or whether there is a programmatic mistake when you plan to always override the location. But as with many things in life, that's a choice. As you saw, you don't have to.

Using this stub table in the same way as before gives me access to my data.

SELECT count(*) FROM et2
EXTERNAL MODIFY
(LOCATION ('file2.txt'));

  COUNT(*)
----------
         9

You get the idea. Pretty cool stuff. 

Aaah, and to complete the short functional introduction: the following clauses can be overridden: DEFAULT DIRECTORY, LOCATION, ACCESS PARAMETERS (BADFILE, LOGFILE, DISCARDFILE) and REJECT LIMIT.
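
For example, here is a minimal sketch (not from the original post) that combines several of these overrides in one query, using the same et1 table and d1 directory as above; the reject limit value is arbitrary:

SELECT count(*)
FROM   et1
EXTERNAL MODIFY
  (DEFAULT DIRECTORY d1
   LOCATION ('file2.txt')
   REJECT LIMIT 10);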

That's about it for now for the online modification capabilities for external tables. I am sure I have forgotten some little details here and there, but there are always so many things to talk (write) about that you will never catch them all. And hopefully the documentation will cover it all sooner rather than later.

Stay tuned for now. There are more blog posts about 12.2 to come. And please, if you have any comments about this specific one or suggestions for future ones, let me know. You can always reach me at hermann.baer@oracle.com.

Cheers, over and out.

And here's the simplest "data generation" I used for the examples above to get "something" into my files. Have fun playing.

rem my directory
rem
create or replace directory d1 as '/tmp';

rem create some dummy data in /tmp
rem
set line 300 pagesize 5000
spool /tmp/file1.txt
select rownum ||','|| 1 ||','|| 1 ||','  from dual connect by level < 100;
spool off

spool /tmp/file2.txt
select rownum ||','|| 22 ||','|| 22 ||','  from dual connect by level < 10;
spool off


The Data Warehouse Insider

Data loading into HDFS – Part3. Streaming data loading

In my previous blogs, I have already talked about data loading into HDFS. In the first blog, I covered data loading from generic servers to HDFS. The second blog was devoted to offloading data from Oracle RDBMS. Here I want to explain how to load streaming data into Hadoop. First of all, I want to note that I will not cover Oracle GoldenGate for Big Data here, simply because it deserves a dedicated blog post. Today I'm going to talk about Flume and Kafka.

What is Kafka? 

Kafka is a distributed service bus. OK, but what is a service bus? Let's imagine that you have a few data systems, and each one needs data from the others. You could link them directly, like this:

[Figure: point-to-point links between data systems]

but that becomes very hard to manage. Instead, you could have one centralized system that accumulates data from all sources and is a single point of contact for all systems, like this:

[Figure: systems connected through a central service bus]

What is Flume? 

“Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.” – this definition from the documentation explains pretty well what Flume is. Flume was historically developed for loading data into HDFS. But why couldn't I just use the Hadoop client?

Challenge 1. Small files.

Hadoop was designed for storing large files, and despite a lot of optimizations around the NameNode in the last few years, it's still recommended to store only big files. If your source produces a lot of small files, Flume can collect them and flush the collection in batch mode, as a single big file. I always use the analogy of a glass and drops: you can collect one million drops in one glass, and after this you have one glass of water instead of one million drops.

Challenge 2. Lots of data sources

Let's imagine that I have an application (or even two, on two different servers) that produces files which I want to load into HDFS.

[Figure: a couple of application servers writing files to HDFS]

Life is good; if the files are large enough, this is not going to be a problem.

But now let's imagine that I have 1,000 application servers, and each one wants to write data into HDFS. Even if the files are large, this workload will collapse your Hadoop cluster. If you don't believe it, just try it (but not on a production cluster!). So we have to have something between HDFS and our data sources.

[Figure: many application servers writing directly to HDFS]

Now it is time for Flume. You can build a two-tier architecture: the first tier collects data from the different sources, and the second tier aggregates it and loads it into HDFS.

[Figure: two-tier Flume architecture in front of HDFS]

In my example I depict 1,000 sources handled by 100 Flume servers on the first tier, which load data onto the second tier; the second tier connects directly to HDFS, and in my example that is only two connections, which is affordable. Here you can find more details; I just want to add that the general practice is to use one aggregation agent for 4-16 client agents.

I also want to note that it's good practice to use an Avro sink when you move data from one tier to the next. Here is an example of the Flume config file:

agent.sinks = avroSink
agent.sinks.avroSink.type = avro
agent.sinks.avroSink.channel = memory
agent.sinks.avroSink.hostname = avrosrchost.example.com
agent.sinks.avroSink.port = 4353
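
For completeness, here is a minimal sketch of what the matching aggregation-tier agent could look like: an Avro source receiving events from the first-tier Avro sinks, buffered in a memory channel, and an HDFS sink writing them out. The agent name, channel name and HDFS path are hypothetical.

collector.sources = avroSrc
collector.channels = memChannel
collector.sinks = hdfsSink

# Avro source: receives events sent by the first-tier Avro sinks
collector.sources.avroSrc.type = avro
collector.sources.avroSrc.bind = 0.0.0.0
collector.sources.avroSrc.port = 4353
collector.sources.avroSrc.channels = memChannel

# In-memory buffer between source and sink
collector.channels.memChannel.type = memory

# HDFS sink: flushes the aggregated stream into HDFS
collector.sinks.hdfsSink.type = hdfs
collector.sinks.hdfsSink.channel = memChannel
collector.sinks.hdfsSink.hdfs.path = /data/incoming/%Y-%m-%d
collector.sinks.hdfsSink.hdfs.useLocalTimeStamp = true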

Kafka Architecture.

You can find deep technical presentations about Kafka here and here (actually, I took a few screenshots from there), and a very interesting technical video here. In my article, I will just recap the key terms and concepts.

[Figure: Kafka architecture – producers, brokers, consumers]

Producer – a process that writes data into the Kafka cluster. It could be part of an application, or edge nodes could play this role.

Consumer – a process that reads data from the Kafka cluster.

Broker – a member of the Kafka cluster. The set of brokers forms the Kafka cluster.

Flume Architecture.

You can find a lot of useful information about Flume in this book; here I'll just highlight the key concepts.

[Figure: Flume agent architecture]

Flume has three major components:

1) Source – where I get the data from.

2) Channel – where I buffer it. It could be memory or disk, for example.

3) Sink – where I load my data. For example, it could be another tier of Flume agents, HDFS or HBase.

[Figure: Flume source, channel and sink]

Between source and channel, there are two minor components: Interceptor and Selector.

With an interceptor you can do simple processing; with a selector you can choose a channel depending on the message header. A small configuration sketch of both follows.
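
Here is a minimal configuration sketch of both; the agent, source, channel and header names are hypothetical:

# Interceptor: add a timestamp header to every event (simple processing)
agent.sources.src1.interceptors = ts
agent.sources.src1.interceptors.ts.type = timestamp

# Selector: route events to a channel depending on the value of the "datatype" header
agent.sources.src1.channels = salesChannel networkChannel
agent.sources.src1.selector.type = multiplexing
agent.sources.src1.selector.header = datatype
agent.sources.src1.selector.mapping.sales = salesChannel
agent.sources.src1.selector.mapping.network = networkChannel
agent.sources.src1.selector.default = networkChannel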

Flume and Kafka similarities and differences.

It's a frequent question: "What is the difference between Flume and Kafka?" The answer could be very long, but let me briefly explain the key points.

1) Pull and Push.

Flume accumulates data until some condition is met (number of events, size of the buffer, or a timeout) and then pushes it out (to disk, for example).

Kafka accumulates data until a client initiates a read, so the client pulls data whenever it wants.

2)  Data processing

Flume can do simple transformations via interceptors.

Kafka doesn't do any data processing; it just stores the data.

3) Clustering

Flume is usually deployed as a set of independent single instances.

Kafka is a cluster, which means it has benefits such as high availability and scalability out of the box, without extra effort.

4) Message size

Flume doesn't have any obvious restrictions on the size of a message.

Kafka was designed for messages of a few KB.

5) Coding vs Configuring

Flume is usually a configuration-driven tool (users usually don't write code; instead, they use its configuration capabilities).

With Kafka, you have to write code to load and unload the data.

Flafka.

Many customers are thinking about choosing the right technology, either Flume or Kafka, for handling their streaming data. Stop choosing: use both. It's quite a common use case, and it is named Flafka. You can find a good explanation and nice pictures here (actually, I borrowed a few screenshots from there).

First of all, Flafka is not a dedicated project. It's just a bunch of Java classes for integrating Flume and Kafka.

[Figure: Flafka – Flume and Kafka integration]

Now Kafka can be either a source for Flume, via the Flume config:

flume1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource

or a channel, via the following directive:

flume1.channels.kafka-channel-1.type = org.apache.flume.channel.kafka.KafkaChannel 

Use case 1. Kafka as a source or channel

If you have Kafka as your enterprise service bus (see my example above), you may want to load data from the service bus into HDFS. You could do this by writing a Java program, but if you don't like that, you may use Kafka as a Flume source.

[Figure: Kafka as a Flume source feeding HDFS]

In this case, Kafka can also be useful for smoothing peak load, and Flume provides flexible routing.

You could also use Kafka as a Flume channel for high availability purposes (it's distributed by design).

Use case 2. Kafka as a sink.

If you use Kafka as an enterprise service bus, you may want to load data into it. The native way for Kafka is a Java program, but if you feel it will be way more convenient with Flume (just using a few config files), you have this option. The only thing you need is to configure Kafka as a sink; a sketch follows the figure below.

[Figure: Flume loading data into Kafka via a Kafka sink]
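
A minimal sketch of such a sink configuration (agent, channel, broker host and topic names are hypothetical; the property names follow the Flume Kafka sink shipped with CDH-era Flume):

agent.sinks = kafkaSink
agent.sinks.kafkaSink.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.kafkaSink.channel = memoryChannel
agent.sinks.kafkaSink.brokerList = kafkabroker1.example.com:9092
agent.sinks.kafkaSink.topic = flume_ingest
agent.sinks.kafkaSink.batchSize = 100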

Use case 3. Flume as the tool to enrich data.

As I already said, Kafka doesn't do any data processing; it just stores data without any transformation. You can use Flume as a way to add some extra information to your Kafka messages. To do this, you define Kafka as a source, implement an interceptor that adds some information to the message, and write it back to Kafka in a different topic.

[Figure: Flume enriching Kafka messages and writing them to a different topic]

Conclusion.

There are two major tools for loading streaming data: Flume and Kafka. There is no single right answer as to which to use, because each tool has its own advantages and disadvantages. Generally, that is why Flafka was created – it's just a combination of those two tools.


The Data Warehouse Insider