Big Data SQL Quick Start. Machine Learning and Big Data SQL – Part 19

It’s very frequent case when somebody talks about Big Data he or she also wants to know how to apply Machine Learning algorithms over this data sets. Oracle Big Data SQL provides the easy and seamless way to do this. Secret of this in Oracle Advanced Analytics (OAA) option, which has been existing for many years. This is the set of the existing algorithms, together with SQL Developer Data Minner interface, allow easily in drag and drop style create advanced models. OAA works over the Oracle DB tables. Big Data SQL allows us to get access to Hadoop data through the External table. Easy!

It’s better to see once rather than hear multiple times.

let me give an example. Let’s imagine we have some store and for sure we have customers who make purchases there. Roughly customers could be divided into 3 categories: who spend a little money, who spend a lot of money and average customer.

1 Big Data SQL Quick Start. Machine Learning and Big Data SQL – Part 19

Now, having statistics of the sales and some personal information about customers, we want to understand what is the profile of Big Spender (a person who spend a lot of money in our store).

I simplified my example and I only have two tables:

- Fact of the sales

- Dimension table with customer’s info

Big Fact stores data into JSON format on the HDFS. Dimension tables stored into Oracle RDBMS. 

2 Big Data SQL Quick Start. Machine Learning and Big Data SQL – Part 19

As you may remember from my previous post, Big Data SQL allows us to represent semi-structure (or even structure) data as a table in Oracle Database.

13 Big Data SQL Quick Start. Machine Learning and Big Data SQL – Part 19

Now we have two tables, which is related by customer key (primary key for the dimension table, foreign key for the fact table).

7 Big Data SQL Quick Start. Machine Learning and Big Data SQL – Part 19

After this I build the model in SQL Developer:

4 Big Data SQL Quick Start. Machine Learning and Big Data SQL – Part 19

Data Minner allows us to write SQL inside the model, let’s have a look:

5 Big Data SQL Quick Start. Machine Learning and Big Data SQL – Part 19

This query defines first 5% of customers who spend more money as “big spender”. Bottom 5% as “low spender”. Others are average customers.

I materialize this aggregate and join it with customers table. After this run Machine learning algorithms (they already created I just use it).

Decision Trees.

Let’s have a look at the results:

6 Big Data SQL Quick Start. Machine Learning and Big Data SQL – Part 19

First is decision tree:

8 Big Data SQL Quick Start. Machine Learning and Big Data SQL – Part 19

it’s hard to see here something. Let’s zoom out two nodes which show us Big Spenders (our target): 

12 Big Data SQL Quick Start. Machine Learning and Big Data SQL – Part 19

From this Node we could conclude, that there is pretty high probability, that Big Spender profile is:

- Female

- With high annual_income (more than 84 500)

- Who lives in either California or Maryland or New Jersey

- Her marital status is Divorced or Married or Separated or Widowed

The second node shows us:

10 Big Data SQL Quick Start. Machine Learning and Big Data SQL – Part 19

that, there is also pretty high probability, that Big Spender profile is:

- Female

- With high annual_income (more than 84 500)

- Who is younger than 43.5

- Her marital status is Single

Naive Bayes

I also used Naive Bayes classifier in my example:

11 Big Data SQL Quick Start. Machine Learning and Big Data SQL – Part 19

it shows, that most probably big Spender is:

- Female

- Younger than 34.5

- Single

- With high (more than 118 712 annual income)

Conclusion.

The couple of Oracle Advances Analytics and Big Data SQL allows us to:

- Use HDFS as a cheap storage

- Use HDFS as schema-on read storage, which allows us to define the data schema during the read (parse semi-structure data)

- Use Oracle Advanced analytics Drag and Drop intuitive interface for building classification models over Big Data (data stored into Hadoop).

All code and Advanced Analytics models, which have been used in my example available here

Let’s block ads! (Why?)

The Data Warehouse Insider