Big Data SQL Quick Start. Big Data SQL and YARN on the same cluster. – Part14.

Today I’m going to explain how to run Big Data SQL and YARN as tenants of the same cluster. I think it’s quite a common scenario: you may want to store historical data and query it with Big Data SQL, and you may also want to run ETL jobs on the same cluster. If so, resource management becomes one of the main requirements. In other words, you have to guarantee a certain level of performance regardless of the other jobs. For example:

1) You may need to finish your ETL as fast as possible. In this case MapReduce, which runs on YARN, has the higher priority.

2) You build critical reports with Big Data SQL. In this case Big Data SQL has to have a higher priority than YARN.

Life without a resource manager.

Let’s start from the beginning. I have MapReduce (YARN) jobs and Big Data SQL queries running on the same cluster. This works perfectly fine until you hit your CPU or IO limit. Let me give you an example. I picked a small data set to query (my goal was to stay below the CPU limit):

1) I ran the MapReduce job (a Hive query) and it finished in 165 seconds.

2) I ran the Big Data SQL query and it finished in 30 seconds.

3) I ran Big Data SQL together with Hive: BDS finished in 31 seconds, Hive in 170 seconds. Almost the same results!

But as soon as you run a query that reaches the CPU limit, your engines (Big Data SQL and YARN) start to share the CPU between the two processes. A resource manager will not increase your CPU capacity, but it will help you define how resources are shared between those two processes.

How to enable Resource Sharing between YARN and Big Data SQL.

Cloudera has a very powerful mechanism for sharing resources: “Static Service Pool”. Under the hood, it uses Linux cgroups. It defines the proportion of CPU and IO resources between processes. The easiest way to enable it is to use Cloudera Manager:

1) Go to “Cluster -> Static Service Pool”.

2) Go to the configuration.

3) Enable Cgroup Management and use Cgroup CPU Shares and Cgroup IO Weight.

It’s interesting that Linux CPU shares may vary between 2 and 262144, while IO weight varies between 100 and 1000. I recommend changing these two settings in sync (in other words, keep both values between 100 and 1000). After restarting the corresponding processes, resource management will be enabled.
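To make the idea of “proportion” concrete, here is a minimal Python sketch of how relative CPU shares translate into an expected split when both services are busy at the same time. It assumes the usual cgroup behaviour, where shares are relative weights that only matter under contention. The function names and the example values are mine, for illustration only. The two ranges are the ones quoted above.

```python
# Ranges quoted in the text above.
CPU_SHARES_RANGE = (2, 262144)
IO_WEIGHT_RANGE = (100, 1000)


def common_range(first, second):
    """Return the overlap of two (low, high) ranges."""
    return max(first[0], second[0]), min(first[1], second[1])


def contended_fractions(shares):
    """Expected CPU fraction per service when every service is busy.

    Assumption: shares act as relative weights, so a service with
    800 shares competing against one with 200 gets about 80%.
    """
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


# Values that are valid for both CPU shares and IO weight.
shared_low, shared_high = common_range(CPU_SHARES_RANGE, IO_WEIGHT_RANGE)

# Hypothetical setting: favour Big Data SQL over YARN.
fractions = contended_fractions({"bds": 800, "yarn": 200})
print(shared_low, shared_high)
print(fractions)
```

Keep in mind that this is only the nominal split. If one service is idle, the other is free to use the whole CPU.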

Trust, but verify.

That’s all theory, and every theory has to be proven by concrete examples. I played a bit with Static Service Pools, with Big Data SQL and a Hive query (read: YARN) as tenants of the same cluster. For benchmarking I picked the simplest query, one which uses neither Storage Indexes nor Predicate Push Down and which returns exactly 0 rows.

SQL> SELECT * FROM store_sales_csv WHERE MOD(ss_ticket_number,10)=20;

In the case of Hive, the query is very similar:

hive> SELECT * FROM csv.store_sales WHERE PMOD(ss_ticket_number,10)=20;
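Why does this filter return exactly 0 rows? A remainder of a division by 10 is always smaller than 10 in magnitude, so it can never be equal to 20. The small Python sketch below illustrates this. The two helper functions are my own stand-ins that mimic what I assume the remainder semantics to be (sign of the dividend for MOD, non-negative for PMOD); they are not the engines’ implementations, and the sample range of ticket numbers is made up.

```python
def oracle_style_mod(n, m):
    """Remainder that takes the sign of the dividend n (assumed MOD behaviour)."""
    remainder = abs(n) % abs(m)
    return -remainder if n < 0 else remainder


def hive_style_pmod(n, m):
    """Non-negative remainder for a positive divisor m (assumed PMOD behaviour)."""
    return n % m  # Python's % is already non-negative when m is positive


# Hypothetical ticket numbers, including negatives, to cover both sign cases.
sample_ticket_numbers = range(-1000, 1001)

oracle_matches = [n for n in sample_ticket_numbers if oracle_style_mod(n, 10) == 20]
hive_matches = [n for n in sample_ticket_numbers if hive_style_pmod(n, 10) == 20]

print(len(oracle_matches), len(hive_matches))  # prints "0 0"
```

So the filter has to be evaluated for every row, and nothing comes back to the client, which keeps the test focused on scan and CPU cost.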

First of all, I ran the Big Data SQL query and the Hive query sequentially and captured the CPU and IO profile of each:

[Figures: CPU and IO profiles for Big Data SQL and Hive]

The Hive query finished in 890.628 seconds, BDS in 391 seconds.

I started my test by running these queries without any resource management, simply running the two statements simultaneously:

Big Data SQL (BDS) took 731 seconds

Hive finished in 1434.75 seconds

After this, I enabled cgroup resource management (via Static Service Pools) and ran the Hive and Big Data SQL queries simultaneously.

In my tests I only played with the CPU shares, which indirectly control IO as well. I summarized the results in the table below:

CPU.shares configuration (BDS/Hive)   Big Data SQL, elapsed time (s)   Hive, elapsed time (s)
Stand alone                           391                              890.628
No control                            731                              1434.75
2/262144                              1231                             1022.083
100/1000                              1217                             1184.115
200/800                               1166                             1244.993
500/500                               749                              1269.115
800/200                               513                              1277.694
1000/100                              465                              1288.094
262144/2                              407                              1284.804

This table shows:

1) Static Service Pools work.

2) It’s a coarse control. In other words, you can’t expect exact proportions from it.
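To see how coarse it is, you can turn the elapsed times into slowdown factors relative to the stand-alone runs. Here is a minimal Python sketch. The numbers are copied from the table above; the variable names are mine and only serve the illustration.

```python
# Stand-alone elapsed times in seconds, from the table.
STANDALONE_BDS = 391.0
STANDALONE_HIVE = 890.628

# (configuration, BDS seconds, Hive seconds), from the table.
RESULTS = [
    ("No control", 731, 1434.75),
    ("2/262144", 1231, 1022.083),
    ("100/1000", 1217, 1184.115),
    ("200/800", 1166, 1244.993),
    ("500/500", 749, 1269.115),
    ("800/200", 513, 1277.694),
    ("1000/100", 465, 1288.094),
    ("262144/2", 407, 1284.804),
]

# Slowdown factor = concurrent elapsed time / stand-alone elapsed time.
slowdowns = {
    config: (bds / STANDALONE_BDS, hive / STANDALONE_HIVE)
    for config, bds, hive in RESULTS
}

for config, (bds_factor, hive_factor) in slowdowns.items():
    print(f"{config:>10}  BDS x{bds_factor:.2f}  Hive x{hive_factor:.2f}")
```

Across the cgroup settings, BDS ranges from about 1.04 to about 3.15 times its stand-alone time, while Hive stays between about 1.15 and 1.45. So the shares clearly move the results, but not in the exact proportions that you configure.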


The Data Warehouse Insider