Big Data SQL Quick Start. Correlate real-time data with historical benchmarks – Part 24

In Big Data SQL 3.2 we introduced a new capability: Kafka as a data source. I've posted some details about how it works, with a few simple examples, over here. Now I want to talk about why you would want to run queries over Kafka. Here is Oracle's concept picture of the data warehouse:

You have a stream (real-time data), a data lake where you land raw information, and cleaned enterprise data. This is just a concept, which could be implemented in many different ways. One of them is depicted here:

Kafka is the hub for streaming events: it accumulates data from multiple real-time producers and provides this data to many consumers. A consumer could be a real-time processing engine, such as Spark Streaming, or a batch load into the next data warehouse tier, such as Hadoop.

In this architecture, Kafka holds the stream data and can answer the question “what is going on right now”. The database stores operational data and Hadoop stores historical data, and together these two sources can answer the question “how it used to be”. Big Data SQL allows you to run SQL over all three sources and correlate real-time events with historical ones.

Example of using Big Data SQL over Kafka and other sources.

Above I explained why you may need to query Kafka with Big Data SQL. Now let me give a concrete example.

Input for demo example:

- We have a company called MoviePlex, which sells video content all around the world.

- There are two streaming datasets. The first is network data, which contains information about network errors, the condition of routing devices and so on. The second is the record of movie sales.

- Both datasets are streamed into Kafka in real time.

- We also have historical network data, which we store in HDFS (because of the cost of storing this data), historical sales data, which we store in the database, and multiple dimension tables, which are stored in the RDBMS as well.

Based on this, we have a business case: monitor the revenue flow, correlate current traffic with the historical benchmark (which depends on the day of the week and the hour of the day), and try to find the reason in case of failures (network errors, for example).
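To make the benchmark idea a bit more tangible, here is a minimal sketch in Python. It is not the actual demo code: an in-memory SQLite database stands in for Big Data SQL, and the table names `sales_history` and `sales_stream`, their columns, and all the numbers are hypothetical. In a real setup, `sales_stream` would be an external table over a Kafka topic and `sales_history` a regular database table. The sketch averages historical revenue per day of week and hour of day, then joins that benchmark to the revenue seen in the stream for the same slot.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the Big Data SQL endpoint.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table: hourly revenue totals from past weeks (database tier).
    CREATE TABLE sales_history (day_of_week INTEGER, hour_of_day INTEGER, revenue REAL);
    -- Hypothetical table: individual sales arriving now (Kafka tier).
    CREATE TABLE sales_stream  (day_of_week INTEGER, hour_of_day INTEGER, revenue REAL);
""")

# Made-up sample data: three past Mondays at 10:00, two at 11:00.
conn.executemany(
    "INSERT INTO sales_history VALUES (?, ?, ?)",
    [(1, 10, 100), (1, 10, 120), (1, 10, 110), (1, 11, 200), (1, 11, 220)],
)
# Made-up sales seen in the stream for the current Monday.
conn.executemany(
    "INSERT INTO sales_stream VALUES (?, ?, ?)",
    [(1, 10, 30), (1, 10, 25), (1, 11, 100), (1, 11, 110)],
)

query = """
    SELECT s.day_of_week,
           s.hour_of_day,
           SUM(s.revenue) AS current_revenue,
           b.benchmark
    FROM sales_stream s
    JOIN (
        -- Benchmark: average historical revenue per weekday/hour slot.
        SELECT day_of_week, hour_of_day, AVG(revenue) AS benchmark
        FROM sales_history
        GROUP BY day_of_week, hour_of_day
    ) b
      ON b.day_of_week = s.day_of_week
     AND b.hour_of_day = s.hour_of_day
    GROUP BY s.day_of_week, s.hour_of_day, b.benchmark
    ORDER BY s.day_of_week, s.hour_of_day
"""
rows = conn.execute(query).fetchall()

for dow, hour, current, benchmark in rows:
    # A ratio well below 1.0 means revenue is lagging behind the benchmark.
    print(f"day={dow} hour={hour} current={current:.0f} "
          f"benchmark={benchmark:.0f} ratio={current / benchmark:.2f}")
```

The benchmark is aggregated in a subquery before the join, so each stream row matches exactly one benchmark row and the current revenue is not double counted.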

Using Oracle Data Visualization Desktop, we created a dashboard that shows how real-time traffic correlates with the historical statistics, and also shows the number of network errors by country:

The blue line is a historical benchmark.

Over time we see that some errors appear in some countries (left dashboard), but current revenue is more or less the same as it used to be.

After a while revenue starts going down.

This trend keeps going.

There are a lot of network errors in France. Let's drill down into the itemized traffic:

Indeed, we found that overall revenue is going down because of France, and the cause is network errors.
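The drill-down itself can be sketched the same way. Again, this is only an illustration under assumptions: SQLite stands in for Big Data SQL, and the tables `country_benchmark`, `sales_stream` and `network_stream`, along with every value in them, are hypothetical. The query lines up the benchmark, the current revenue and the network error count for each country, and sorts by the revenue shortfall.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the Big Data SQL endpoint.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: benchmark from history, plus two Kafka-backed streams.
    CREATE TABLE country_benchmark (country TEXT, benchmark REAL);
    CREATE TABLE sales_stream      (country TEXT, revenue REAL);
    CREATE TABLE network_stream    (country TEXT, error_count INTEGER);
""")

# Made-up sample data for two countries.
conn.executemany("INSERT INTO country_benchmark VALUES (?, ?)",
                 [("France", 90), ("Germany", 100)])
conn.executemany("INSERT INTO sales_stream VALUES (?, ?)",
                 [("France", 20), ("France", 10), ("Germany", 50), ("Germany", 45)])
conn.executemany("INSERT INTO network_stream VALUES (?, ?)",
                 [("France", 40), ("France", 35), ("Germany", 2)])

drill_query = """
    SELECT b.country,
           b.benchmark,
           COALESCE(s.revenue, 0) AS current_revenue,
           COALESCE(e.errors, 0)  AS network_errors
    FROM country_benchmark b
    -- Each stream is aggregated on its own before joining, to avoid row fan-out.
    LEFT JOIN (SELECT country, SUM(revenue) AS revenue
               FROM sales_stream GROUP BY country) s ON s.country = b.country
    LEFT JOIN (SELECT country, SUM(error_count) AS errors
               FROM network_stream GROUP BY country) e ON e.country = b.country
    -- Largest revenue shortfall against the benchmark first.
    ORDER BY b.benchmark - COALESCE(s.revenue, 0) DESC
"""
drill_rows = conn.execute(drill_query).fetchall()

for country, benchmark, current, errors in drill_rows:
    print(f"{country}: benchmark={benchmark:.0f} current={current:.0f} errors={errors}")
```

With this made-up data, the country with the largest shortfall is also the one with the most network errors, which is the kind of pattern the dashboard makes visible.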


1) Kafka stores real-time data and answers the question “what is going on right now”

2) The database and Hadoop store historical data and answer the question “how it used to be”

3) Big Data SQL can query data from Kafka, Hadoop and the database within a single query (joining the datasets)

4) This allows us to correlate historical benchmarks with real-time data through a SQL interface, and to use this with any SQL-compatible BI tool


Oracle Blogs | Oracle The Data Warehouse Insider Blog