
Better Machine Learning Models with Multi-Objective Feature Selection: Part 1

The Basics of Feature Selection

Feature selection can greatly improve your machine learning models. In this blog series, I’ll outline all you need to know about feature selection. In Part 1 below I discuss why feature selection is important, and why it’s in fact a very hard problem to solve. I’ll detail some of the different approaches which are used to solve feature selection today.

Why should we care about Feature Selection?

There is a consensus that feature engineering often has a bigger impact on the quality of a model than the model type or its parameters, and feature selection is a key part of feature engineering. At the same time, kernel functions and hidden layers perform implicit feature space transformations, so is feature selection still relevant in the age of support vector machines (SVMs) and deep learning? Yes, it absolutely is.

First, we can fool even the most complex model types. If we provide enough noise to overshadow the true patterns, the model will have a hard time finding them. In those cases it starts to use the noise patterns of the unnecessary features instead, and that means it does not perform well. It might even perform worse, overfitting to those noise patterns and failing on new data points. The more dimensions the data has, the easier this becomes, and no model type is better than the others in this regard: decision trees can fall into this trap just as multi-layer neural networks can. Removing noisy features can help the model focus on the relevant patterns.

But there are other advantages of feature selection. If we reduce the number of features, models are generally trained much faster, and the resulting model is often simpler and easier to understand. We should always try to make the work easier for the model: focus on the features which carry the signal over those that are noise, and we will have a more robust model.

Why is this a hard problem?

Let’s begin with an example. Let’s say we have a data set with 10 attributes (features, variables, columns) and one label (target, class). The label column is the one we want to predict. We’ve trained a model on this data and determined that the accuracy of the model built on all 10 attributes is 62%. Can we identify a subset of those 10 attributes where a trained model would be more accurate?

We can depict any subset of the 10 attributes as a bit vector, i.e. as a vector of 10 binary digits. A 0 means that the specific attribute is not used, and a 1 means that the attribute is used in this subset. If we want to indicate that we use all 10 attributes, we would use the vector (1 1 1 1 1 1 1 1 1 1). Feature selection is the search for the bit vector that produces the optimal accuracy. One possible approach would be to try out all the possible combinations. Let’s start with using only a single attribute, i.e. the bit vector (1 0 0 0 0 0 0 0 0 0).

As we can see, when we use only the first attribute we come up with an accuracy of 68%. That’s already better than our accuracy with all attributes, 62%. But can we improve this even more? Let’s try using only the second attribute.

Still better than using all 10 attributes, but not as good as only using the first.

We could continue to go through all possible subsets of size 1. But why should we stop there? We can also try out subsets of 2 attributes now.

Using the first two attributes immediately looks promising with 70% accuracy. We can collect the accuracies of all these subsets until we have tried every possible combination.

We call this a brute force approach.

How many combinations did we try for 10 attributes? We have two options for each attribute: we can decide either to use it or not. And we can make this decision for all 10 attributes, which results in 2 x 2 x 2 x … = 2^10 = 1,024 different combinations. One of those combinations does not make any sense though, namely the one which does not use any features at all. So, this means that we only need to try 2^10 – 1 = 1,023 subsets. Even for a small data set, we can see there are a lot of attribute subsets. It is also helpful to keep in mind that we need to perform a model validation for every single one of those combinations. If we use a 10-fold cross-validation, we need to train 10,230 models. That is still doable for fast model types on fast machines.
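For readers who like to see the idea in code, here is a minimal brute-force sketch in Python using scikit-learn. It is only an illustration, not the implementation behind this post: the feature matrix X, the labels y and the choice of a decision tree are placeholders. It walks through every non-empty bit vector and cross-validates a model on the corresponding attribute subset.

from itertools import product
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def brute_force_selection(X, y, model=None, cv=10):
    # X: numpy array with one column per attribute, y: label vector
    model = model or DecisionTreeClassifier(random_state=0)
    n = X.shape[1]
    best_vector, best_score = None, -np.inf
    # every bit vector of length n except (0 0 ... 0)
    for bits in product([0, 1], repeat=n):
        if not any(bits):
            continue
        cols = [i for i, b in enumerate(bits) if b == 1]
        score = cross_val_score(model, X[:, cols], y, cv=cv).mean()
        if score > best_score:
            best_vector, best_score = bits, score
    return best_vector, best_score

With 10 attributes and 10-fold cross-validation this trains exactly the 10,230 models mentioned above, which is why the approach collapses as soon as the attribute count grows.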

But what about more realistic data sets? If we have 100 instead of only 10 attributes in our data set, we already have 2^100 – 1 combinations, bringing the number of combinations to 1,267,650,600,228,229,401,496,703,205,375. Even the largest computers cannot work through this any longer.

Heuristics to the Rescue!

Going through all possible attribute subsets is not a feasible approach then. Instead, we should try to focus only on the combinations which are more likely to lead to more accurate models, pruning the search space and ignoring feature sets which are unlikely to produce good models. Of course, there is then no longer any guarantee that we will find the optimal solution: if we ignore complete areas of the solution space, we may skip the optimal solution as well. But these heuristics are much faster than the brute force approach, and we often end up with a good, and sometimes even the optimal, solution in a much shorter time. There are two widely used heuristics for feature selection in machine learning: forward selection and backward elimination.

Forward Selection

The heuristic behind forward selection is very simple. We first try out all subsets with only one attribute and keep the best solution. But instead of trying all possible subsets with two features next, we only try specific 2-subsets: those which contain the best attribute from the previous round. If we do not improve, we stop and deliver the best result from before, i.e. the single attribute. If we have improved the accuracy, we continue: we keep the best attributes found so far and try to add one more. We repeat this until we no longer improve.

What does this mean for the runtime for our example with 10 attributes from above? We start with the 10 subsets of only one attribute which is 10 model evaluations. We then keep the best performing attribute and try the 9 possible combinations with the other attributes. This is another 9 model evaluations then. We stop if there is no improvement or keep the best 2-subset if we get a better accuracy. We now try the 8 possible 3-subsets and so on. So, instead of going brute force through all 1,023 possible subsets, we only go through 10 + 9 + … + 1 = 55 subsets. And we often will stop much earlier as soon as there is no further improvement.  We see below that this is often the case. This is an impressive reduction in runtime. And the difference becomes even more obvious for a case with 100 attributes. Here we will only try at most 5,050 combinations instead of the 1,267,650,600,228,229,401,496,703,205,375 possible ones.
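The same greedy loop can be sketched in a few lines of Python. As before, this is only an illustration under the assumption that X, y and a scikit-learn style estimator exist; it is not the implementation of any particular tool.

from sklearn.model_selection import cross_val_score

def forward_selection(X, y, model, cv=10):
    remaining = list(range(X.shape[1]))
    selected, best_score = [], float("-inf")
    while remaining:
        # evaluate adding each remaining attribute to the current subset
        scored = [(cross_val_score(model, X[:, selected + [f]], y, cv=cv).mean(), f)
                  for f in remaining]
        score, best_f = max(scored)
        if score <= best_score:
            break  # no further improvement: keep the previous subset
        selected.append(best_f)
        remaining.remove(best_f)
        best_score = score
    return selected, best_score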

Backward Elimination

Things are similar with backward elimination; we just reverse the direction. We begin with the subset consisting of all attributes. Then we try to leave out one single attribute at a time. If we improve, we keep going, always leaving out the attribute whose removal led to the biggest improvement in accuracy. We then go through all possible combinations of leaving out one more attribute, in addition to the ones we have already left out, and we continue doing this until we no longer improve. Again, for 10 attributes this means that we will evaluate at most 1 + 10 + 9 + 8 + … + 2 = 55 combinations.
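Both heuristics are common enough that scikit-learn ships a generic implementation. Here is a hedged sketch on a public toy dataset; recent scikit-learn versions provide SequentialFeatureSelector, and the dataset, estimator and target subset size below are arbitrary choices for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# direction="backward" starts from the full attribute set and greedily removes
# one attribute at a time; direction="forward" mirrors forward selection instead
selector = SequentialFeatureSelector(model, n_features_to_select=10,
                                     direction="backward", cv=10)
selector.fit(X, y)
print(selector.get_support())  # boolean mask, i.e. the bit vector of kept attributes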

Are we done? It looks like we found some heuristics which work much faster than the brute force approach, and in certain cases these approaches will deliver a very good attribute subset. The problem is that in most cases they unfortunately will not. For most data sets, the model accuracies form a so-called multi-modal fitness landscape: besides one global optimum there are several local optima. Both methods start somewhere on this fitness landscape and move from there. Picture a starting point somewhere on the landscape; from there, we continue to add (or remove) attributes as long as the fitness improves. Both methods always climb up the nearest hill in the multi-modal fitness landscape, and if this hill is a local optimum they get stuck there, since no further climbing is possible. Hence, those algorithms do not even bother looking for higher hills; they take whatever they can easily get, which is exactly why we call them “greedy” algorithms. And when they stop improving, there is only a very small likelihood that they made it to the top of the highest hill. It is much more likely that they missed the global optimum we are looking for, which means that the delivered feature subset is often a sub-optimal result.

Slow vs. Bad.  Anything better out there?

This is not good then, is it? We have one technique, the brute force approach, which would deliver the optimal result but is computationally infeasible; as we have seen, we cannot use it at all on realistic data sets. And we have two heuristics, forward selection and backward elimination, which deliver results much more quickly but run into the first local optimum they find, which means they most likely will not deliver the optimal result.

Don’t give up though – in our next post we will discuss another heuristic which is still feasible even for larger data sets. And it often delivers much better results than forward selection and backward elimination. This heuristic is making use of evolutionary algorithms.


RapidMiner

Why Data Quality Should be Part of Your Disaster Recovery Plan

When you think of disaster recovery, data quality is likely not the first thing that comes to mind. But data quality should factor prominently into your disaster recovery plan. Here’s why.

Disaster recovery is the discipline of preparing for unexpected events that can severely disrupt your IT infrastructure and services, and the business processes that depend on them.

The disasters that necessitate disaster recovery can take many forms. They could be natural disasters, like a major storm that wipes out a data center. They could be security events, wherein hackers hold your data for ransom or bring your services down using DDoS attacks. They could be an attack by a disgruntled employee who deliberately wipes out a crucial database.


What all types of disasters have in common is that it’s virtually impossible to know when they’ll occur, or exactly what form they’ll take.

Forming a Disaster Recovery Plan

That’s why it’s essential to have a disaster recovery plan in place. Your plan should:

  • Identify all data sources that need to be backed up so that they can be recovered in the event of a disaster.
  • Specify a method or methods for backing up the data.
  • Identify how frequently backups should occur.
  • Determine whether on-site data backups are sufficient for your needs, or if you should back up data to a remote site (in case your local infrastructure is destroyed during a disaster).
  • Specify who is responsible for performing backups, who will verify that backups were completed successfully and who will restore data after a disaster.

If you need help building and implementing a disaster recovery plan, you can find entire companies dedicated to the purpose. With the right planning and skills, however, there is no reason you cannot maintain an effective disaster recovery plan yourself. Regardless of whether you outsource disaster recovery or not, the most important thing is simply to have a plan in place. (See also: 5 Tips for Developing a Disaster Recovery Plan)


Data Quality and Disaster Recovery

Now that we’ve covered the basics of disaster recovery, let’s discuss where data quality fits in.

Put simply, data quality matters in this context because whenever you are backing up or restoring data, you need to ensure data quality. Since data backups and restores are at the center of disaster recovery, data quality should be factored into every phase of your disaster recovery plan.

After all, when you’re copying data from one location to another to perform backups, data quality errors are easy to introduce for a variety of reasons. You might have formatting issues copying files from one type of operating system to another because of different encoding standards. Data could become corrupted in transit. Backups could be incomplete because you run out of space on the backup destination. The list could go on.

It’s even easier to make data quality mistakes when you’re recovering data after a disaster. Even the most prepared organization will be working under stress when it’s struggling to recover data after a disaster. The personnel performing data recoveries may not be familiar with all the data sources and formats they are restoring. In the interest of getting things up and running again quickly – a noble goal when business viability is at stake – they may take shortcuts that leave data missing, corrupted or inconsistent.

All of the above are reasons why data quality tools should be used to verify the integrity of backed-up data, as well as data that is recovered after a disaster. It’s not enough to check the quality of your original data sources, then assume that your backups and the data recovered based on those backups will also be accurate. It might not be, for all the reasons outlined above and many more.
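Dedicated data quality tools handle this at scale, but even a minimal, hand-rolled integrity check illustrates the idea. The Python sketch below uses only the standard library, and the file paths are hypothetical: it hashes every source file and compares it with its backup copy, flagging anything missing or corrupted in transit.

import hashlib
from pathlib import Path

def checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_backup(source_dir, backup_dir):
    """Compare every file under source_dir with its copy under backup_dir."""
    problems = []
    for src in Path(source_dir).rglob("*"):
        if not src.is_file():
            continue
        copy = Path(backup_dir) / src.relative_to(source_dir)
        if not copy.exists():
            problems.append((src, "missing from backup"))
        elif checksum(src) != checksum(copy):
            problems.append((src, "checksum mismatch"))
    return problems

# hypothetical locations - adjust to your own environment
for path, issue in verify_backup("/data/crm", "/mnt/backup/crm"):
    print(f"{path}: {issue}")

A checksum manifest like this catches transport-level corruption; the logical checks (row counts, schema and encoding validation) are where a full data quality suite earns its keep.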

The last thing your business needs after it has suffered through and recovered from a disaster is lasting problems with its data. To prevent a disaster from having a lasting effect on your business, you must ensure that the data you’ve recovered is as reliable as your original data.

Syncsort’s data quality software and disaster recovery solutions can help you build your disaster recovery plan. Learn why Syncsort is a leader for the 12th consecutive year in Gartner’s Magic Quadrant for Data Quality Tools report.


Syncsort + Trillium Software Blog

Big Data SQL Quick Start. Correlate real-time data with historical benchmarks – Part 24

In Big Data SQL 3.2 we introduced a new capability – Kafka as a data source. I’ve posted some details about how it works, with some simple examples, over here. But now I want to talk about why you would want to run queries over Kafka. Here is Oracle’s concept picture of a data warehouse:

You have a stream of real-time data, a data lake where you land raw information, and cleaned enterprise data. This is just a concept, which could be implemented in many different ways; one of them is depicted here:

Kafka is the hub for streaming events, where you accumulate data from multiple real-time producers and provide this data to many consumers (this could be real-time processing, such as Spark Streaming, or you could load the data in batch mode into the next data warehouse tier, such as Hadoop).

In this architecture, Kafka contains the stream data and is able to answer the question “what is going on right now”, whereas in the database you store operational data and in Hadoop historical data, and those two sources are able to answer the question “how it used to be”. Big Data SQL allows you to run SQL over all three sources and correlate real-time events with historical ones.

Example of using Big Data SQL over Kafka and other sources.

So, above I’ve explained why you may want to query Kafka with Big Data SQL; now let me give a concrete example.

Input for demo example:

- We have a company called MoviePlex, which sells video content all around the world.

- There are two streaming datasets: network data, which contains information about network errors, the condition of routing devices and so on; and the facts of the movie sales.

- Both are streamed into Kafka in real time.

- We also have historical network data, which we store in HDFS (because of the cost of this data), historical sales data (which we store in the database) and multiple dimension tables, stored in the RDBMS as well.

Based on this we have a business case: monitor the revenue flow, correlate current traffic with the historical benchmark (depending on the day of the week and hour of the day) and try to find the reason in case of failures (network errors, for example).

Using Oracle Data Visualization Desktop, we’ve created a dashboard which shows how real-time traffic correlates with the statistical benchmark and also shows the number of network errors by country:

The blue line is a historical benchmark.

Over time we see that some errors appear in some countries (left dashboard), but current revenue is more or less the same as it used to be.

After a while revenue starts going down.

This trend keeps going.

A lot of network errors in France. Let’s drill down into itemized traffic:

Indeed, we can see that overall revenue goes down because of France, and the cause is network errors.

Conclusion:

1) Kafka stores real-time data and answers the question “what is going on right now”

2) The database and Hadoop store historical data and answer the question “how it used to be”

3) Big Data SQL can query the data from Kafka, Hadoop and the database within a single query (joining the datasets)

4) This allows us to correlate historical benchmarks with real-time data through a SQL interface and use it with any SQL-compatible BI tool


Oracle Blogs | Oracle The Data Warehouse Insider Blog

Big Data SQL Quick Start. Big Data SQL over Kafka – Part 23

Big Data SQL 3.2 brings a few interesting features. Among them, one of the most interesting is the ability to read Kafka. Before drilling down into the details, I’d like to explain in a nutshell what Kafka is.

What is Kafka?

You can find the full scope of information about Kafka here, but in a nutshell, it’s a distributed, fault-tolerant messaging system. It allows you to connect many systems in an organized fashion: instead of connecting each system peer to peer, you land all your messages company-wide on one system and consume them from there. Kafka is a kind of data hub, where you land the messages and serve them afterwards.

More technical details.

I’d like to introduce a few key Kafka terms.

1) Kafka Broker. This is the Kafka service, which you run on each server and which handles all read and write requests.

2) Kafka Producer. The process which writes data to Kafka.

3) Kafka Consumer. The process which reads data from Kafka.

4) Message. The name describes itself; I just want to add that messages have a key and a value. In contrast to keys in NoSQL databases, Kafka’s key is not indexed. It has application purposes (you may put some application logic in the key) and administrative purposes (each message with the same key goes to the same partition).

5) Topic. Messages are organized into topics. Database folks would compare a topic to a table.

6) Partition. It’s a good practice to divide a topic into partitions for performance and maintenance purposes. Messages with the same key go to the same partition. If the key is absent, messages are distributed in round-robin fashion.

7) Offset. The offset is the position of each message in the topic. The offset is indexed, which allows you to quickly access a particular message.

When do you delete data?

One of the basic Kafka concepts is that of retention – Kafka does not keep data forever, nor does it wait for all consumers to read a message before deleting it. Instead, the Kafka administrator configures a retention period for each topic – either the amount of time for which to store messages before deleting them, or how much data to store before older messages are purged. Two parameters control this: log.retention.ms and log.retention.bytes.

The latter is the amount of data to retain in the log for each topic partition. This limit is per partition: multiply it by the number of partitions to get the total data retained for the topic.

How to query Kafka data with Big Data SQL?

To query the Kafka data you need to create a Hive table first. Let me show an end-to-end example. I have a JSON file:

$   cat web_clicks.json
{ click_date: "38041", click_time: "67786", date: "2004-02-26", am_pm: "PM", shift: "second", sub_shift: "evening", item_sk: "396439", web_page: "646"}
{ click_date: "38041", click_time: "41831", date: "2004-02-26", am_pm: "AM", shift: "first", sub_shift: "morning", item_sk: "90714", web_page: "804"}
{ click_date: "38041", click_time: "60334", date: "2004-02-26", am_pm: "PM", shift: "second", sub_shift: "afternoon", item_sk: "151944", web_page: "867"}
{ click_date: "38041", click_time: "53225", date: "2004-02-26", am_pm: "PM", shift: "first", sub_shift: "afternoon", item_sk: "175796", web_page: "563"}
{ click_date: "38041", click_time: "47515", date: "2004-02-26", am_pm: "PM", shift: "first", sub_shift: "afternoon", item_sk: "186943", web_page: "777"}
{ click_date: "38041", click_time: "73633", date: "2004-02-26", am_pm: "PM", shift: "second", sub_shift: "evening", item_sk: "118004", web_page: "647"}
{ click_date: "38041", click_time: "43133", date: "2004-02-26", am_pm: "AM", shift: "first", sub_shift: "morning", item_sk: "148210", web_page: "930"}
{ click_date: "38041", click_time: "80675", date: "2004-02-26", am_pm: "PM", shift: "second", sub_shift: "evening", item_sk: "380306", web_page: "484"}
{ click_date: "38041", click_time: "21847", date: "2004-02-26", am_pm: "AM", shift: "third", sub_shift: "morning", item_sk: "55425", web_page: "95"}
{ click_date: "38041", click_time: "35131", date: "2004-02-26", am_pm: "AM", shift: "first", sub_shift: "morning", item_sk: "185071", web_page: "118"}

and I’m going to load it into Kafka with the standard Kafka tool “kafka-console-producer”:

$   cat web_clicks.json|kafka-console-producer --broker-list bds2:9092,bds3:9092,bds4:9092,bds5:9092,bds6:9092 --topic json_clickstream

To check that the messages have appeared in the topic, you can use the following command:

$   kafka-console-consumer --zookeeper bds1:2181,bds2:2181,bds3:2181 --topic json_clickstream --from-beginning

After loading this file into the Kafka topic, I create a table in Hive.

Make sure that you have oracle-kafka.jar and kafka-clients*.jar in your hive.aux.jars.path. After this you can run the following DDL in Hive:

hive> CREATE EXTERNAL TABLE json_web_clicks_kafka
row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe'
stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler'
tblproperties(
'oracle.kafka.table.key.type'='long',
'oracle.kafka.table.value.type'='string',
'oracle.kafka.bootstrap.servers'='bds2:9092,bds3:9092,bds4:9092,bds5:9092,bds6:9092',
'oracle.kafka.table.topics'='json_clickstream'
);
hive> describe json_web_clicks_kafka;
hive> select * from json_web_clicks_kafka limit 1;

As soon as the Hive table has been created, I create an ORACLE_HIVE external table in Oracle:

SQL> CREATE TABLE json_web_clicks_kafka (
topic varchar2(50),
partitionid integer,
VALUE  varchar2(4000),
offset integer,
timestamp timestamp, 
timestamptype integer
)
ORGANIZATION EXTERNAL
(TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR
   ACCESS PARAMETERS
      (
       com.oracle.bigdata.cluster=CLUSTER
       com.oracle.bigdata.tablename=default.json_web_clicks_kafka
      )
) 
PARALLEL 
REJECT LIMIT UNLIMITED;

Here you also have to keep in mind that you need to add oracle-kafka.jar and kafka-clients*.jar to your bigdata.properties file on both the database and the Hadoop side. I have a dedicated blog post about how to do this here.

Now we are ready to query:

SQL> SELECT * FROM json_web_clicks_kafka
WHERE 
ROWNUM<3;

json_clickstream	209	{ click_date: "38041", click_time: "43213"..."}	0	26-JUL-17 05.55.51.762000 PM	1
json_clickstream	209	{ click_date: "38041", click_time: "74669"... }	1	26-JUL-17 05.55.51.762000 PM	1

Oracle 12c provides powerful capabilities for working with JSON, such as the dot notation API. It allows us to easily query the JSON data as a structure:

SELECT t.value.click_date,
       t.value.click_time
  FROM json_web_clicks_kafka t
 WHERE ROWNUM < 3;

38041	40629
38041	48699

Working with AVRO messages.

In many cases, customers use Avro as a flexible, self-describing format for exchanging messages through Kafka. We certainly support it, and in a very easy and flexible way.

I have a topic which contains Avro messages, and I define a Hive table over it:

CREATE EXTERNAL TABLE web_sales_kafka
row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe'
stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler'
tblproperties(
'oracle.kafka.table.key.type'='long',
'oracle.kafka.table.value.type'='avro',
'oracle.kafka.table.value.schema'='{"type":"record","name":"avro_table","namespace":"default","fields":
[{"name":"ws_sold_date_sk","type":["null","long"],"default":null},
{"name":"ws_sold_time_sk","type":["null","long"],"default":null},
{"name":"ws_ship_date_sk","type":["null","long"],"default":null},
{"name":"ws_item_sk","type":["null","long"],"default":null},
{"name":"ws_bill_customer_sk","type":["null","long"],"default":null},
{"name":"ws_bill_cdemo_sk","type":["null","long"],"default":null},
{"name":"ws_bill_hdemo_sk","type":["null","long"],"default":null},
{"name":"ws_bill_addr_sk","type":["null","long"],"default":null},
{"name":"ws_ship_customer_sk","type":["null","long"],"default":null}
]}',
'oracle.kafka.bootstrap.servers'='bds2:9092',
'oracle.kafka.table.topics'='web_sales_avro'
);
describe web_sales_kafka;
select * from web_sales_kafka limit 1;

Here I define 'oracle.kafka.table.value.type'='avro' and I also have to specify 'oracle.kafka.table.value.schema'. After this, the messages have a structure.

In a similar way I define a table in Oracle RDBMS:

SQL> CREATE TABLE WEB_SALES_KAFKA_AVRO
   (  "WS_SOLD_DATE_SK" NUMBER, 
  "WS_SOLD_TIME_SK" NUMBER, 
  "WS_SHIP_DATE_SK" NUMBER, 
  "WS_ITEM_SK" NUMBER, 
  "WS_BILL_CUSTOMER_SK" NUMBER, 
  "WS_BILL_CDEMO_SK" NUMBER, 
  "WS_BILL_HDEMO_SK" NUMBER, 
  "WS_BILL_ADDR_SK" NUMBER, 
  "WS_SHIP_CUSTOMER_SK" NUMBER
  topic varchar2(50),
  partitionid integer,
  KEY NUMBER,
  offset integer,
  timestamp timestamp, 
  timestamptype INTEGER
   ) 
   ORGANIZATION EXTERNAL 
    ( TYPE ORACLE_HIVE
      DEFAULT DIRECTORY "DEFAULT_DIR"
      ACCESS PARAMETERS
      ( com.oracle.bigdata.tablename: web_sales_kafka
          )     
    )
   REJECT LIMIT UNLIMITED ;

And we are good to query the data!

Performance considerations.

1) Number of Partitions.

This is the most important thing to keep in mind. There is a nice article about how to choose the right number of partitions. For Big Data SQL purposes I’d recommend using a number of partitions slightly higher than the number of CPU cores in your Big Data SQL cluster.

2) Query fewer columns

Use the column pruning feature; in other words, list only the necessary columns in your SELECT and WHERE clauses. Here is an example.

I’ve created a PL/SQL function which does nothing but return its input. Since PL/SQL cannot be offloaded to the cell side, all the data will be moved to the database side:

SQL> create or replace function fnull(input number) return number is
Result number;
begin
Result:=input;
return(Result);
end fnull;

After this I ran a query which requires only one column and checked how much data was returned to the database side:

SQL> SELECT MIN(fnull(WS_SOLD_DATE_SK))
     FROM WEB_SALES_KAFKA_AVRO;

“cell interconnect bytes returned by XT smart scan” 5741.81MB

Then I repeated the same test case with 10 columns:

SQL> SELECT MIN(fnull(WS_SOLD_DATE_SK)),
       MIN(fnull(WS_SOLD_TIME_SK)),
       MIN(fnull(WS_SHIP_DATE_SK)),
       MIN(fnull(WS_ITEM_SK)),
       MIN(fnull(WS_BILL_CUSTOMER_SK)),
       MIN(fnull(WS_BILL_CDEMO_SK)),
       MIN(fnull(WS_BILL_HDEMO_SK)),
       MIN(fnull(WS_BILL_ADDR_SK)),
       MIN(fnull(WS_SHIP_CUSTOMER_SK)),
       MIN(fnull(WS_SHIP_CDEMO_SK))
  FROM WEB_SALES_KAFKA_AVRO;

“cell interconnect bytes returned by XT smart scan” 32193.98 MB

Hopefully this test case clearly shows that you should query only the columns you actually need.

3) Indexes

There are no indexes other than the offset. The fact that you have a key column should not mislead you – it is not indexed. Only the offset allows quick random access.

4) Warm up your data

If you want to read the same data quickly many times, you have to warm it up by running “select *”-type queries.

Kafka relies on the Linux filesystem cache, so to read the same dataset quickly on subsequent runs, you have to read it once first.

Here is the example

- I clean up the Linux filesystem cache

dcli -C "sync; echo 3 > /proc/sys/vm/drop_caches"

- I run the first query:

SELECT COUNT(1) FROM WEB_RETURNs_JSON_KAFKA t

it took 278 seconds.

- Second and third time took 92 seconds only.

5) Use a bigger replication factor

Here is an example. I have two tables: one is created over a Kafka topic with replication factor = 1, the second over a Kafka topic with replication factor = 3.

SELECT COUNT(1) FROM JSON_KAFKA_RF1 t

this query took 278 seconds for the first run and 92 seconds for the next runs

SELECT COUNT(1) FROM JSON_KAFKA_RF3 t

This query took 279 seconds for the first run, but 34 seconds for the next runs.

6) Compression considerations

Kafka supports different types of compression. If you store the data in JSON or XML format, the compression ratio can be significant. Here is an example of the numbers you might see:

Data format and compression type | Size of the data, GB
JSON on HDFS, uncompressed | 273.1
JSON in Kafka, uncompressed | 286.191
JSON in Kafka, Snappy | 180.706
JSON in Kafka, GZIP | 52.2649
AVRO in Kafka, uncompressed | 252.975
AVRO in Kafka, Snappy | 158.117
AVRO in Kafka, GZIP | 54.49

Compression may save some space on the disks, but taking into account that Kafka is primarily used as a temporary store (for one week or one month, say), I’m not sure it makes much sense. Also, you will pay a performance penalty when querying this data (and burn more CPU).

I’ve run a query like:

SQL> select count(1) from ...

and got the following results:

Type of compression | Elapsed time, sec
uncompressed | 76
snappy | 80
gzip | 92

So, uncompressed is the leader; GZIP and Snappy are slower (not dramatically, but slower). Taking this into account, as well as the fact that Kafka is a temporary store, I wouldn’t recommend using compression without an exceptional need.

7) Parallelize your processing.

If for some reason you are using a small number of partitions, you can use the Hive metadata parameter “oracle.kafka.partition.chunk.size” to increase parallelism. This parameter defines the size of the input split. So, if you set this parameter to 1 MB and your topic has 4 MB in total, it will be processed with 4 parallel threads.

Here is the test case:

- Drop Kafka topic

$   kafka-topics --delete --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --topic store_sales

- Create again with only one partition

$   kafka-topics --create --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --replication-factor 3 --partitions 1 --topic store_sales

- Check it

$   kafka-topics --describe --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --topic store_sales
...
Topic:store_sales       PartitionCount:1        ReplicationFactor:3     Configs:
      Topic: store_sales      Partition: 0    Leader: 79      Replicas: 79,76,77      Isr: 79,76,77
...

- Check the size of input file:

$   du -h store_sales.dat
19G     store_sales.dat

- Load data to the Kafka topic

$   cat store_sales.dat|kafka-console-producer --broker-list cfclbv3870.us2.oraclecloud.com:9092,cfclbv3871.us2.oraclecloud.com:9092,cfclbv3872.us2.oraclecloud.com:9092,cfclbv3873.us2.oraclecloud.com:9092,cfclbv3874.us2.oraclecloud.com:9092 --topic store_sales  --request-timeout-ms 30000  --batch-size 1000000

- Create Hive External table

hive> CREATE EXTERNAL TABLE store_sales_kafka
row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe'
stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler'
tblproperties(
'oracle.kafka.table.key.type'='long',
'oracle.kafka.table.value.type'='string',
'oracle.kafka.bootstrap.servers'='cfclbv3870:9092,cfclbv3871:9092,cfclbv3872:9092,cfclbv3873:9092,cfclbv3874:9092',
'oracle.kafka.table.topics'='store_sales'
);

- Create Oracle external table

SQL> CREATE TABLE STORE_SALES_KAFKA
   (	TOPIC VARCHAR2(50), 
      PARTITIONID NUMBER, 
      VALUE VARCHAR2(4000), 
      OFFSET NUMBER, 
      TIMESTAMP TIMESTAMP, 
      TIMESTAMPTYPE NUMBER
   ) 
   ORGANIZATION EXTERNAL 
    ( TYPE ORACLE_HIVE
      DEFAULT DIRECTORY DEFAULT_DIR
      ACCESS PARAMETERS
      ( com.oracle.bigdata.tablename=default.store_sales_kafka
    )     
    )
   REJECT LIMIT UNLIMITED 
  PARALLEL ;

- Run test query

SQL> SELECT COUNT(1) FROM store_sales_kafka;

it took 142 seconds

- Re-create Hive external table with ‘oracle.kafka.partition.chunk.size’ parameter equal 1MB

hive> CREATE EXTERNAL TABLE store_sales_kafka
row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe'
stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler'
tblproperties(
'oracle.kafka.chop.partition'='true',
'oracle.kafka.partition.chunk.size'='1048576',
'oracle.kafka.table.key.type'='long',
'oracle.kafka.table.value.type'='string',
'oracle.kafka.bootstrap.servers'='cfclbv3870:9092,cfclbv3871:9092,cfclbv3872:9092,cfclbv3873:9092,cfclbv3874:9092',
'oracle.kafka.table.topics'='store_sales'
);

- Run query again:

SQL> SELECT COUNT(1) FROM store_sales_kafka;

Now it took only 7 seconds

A 1 MB split is quite low; for big topics we recommend using 256 MB.

8) Querying small topics.

Sometimes you need to query really small topics (a few hundred messages, for example) very frequently. In this case, it makes sense to create the topic with fewer partitions.

Here is the test case example:

- Create topic with 1000 partitions

$   kafka-topics --create --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --replication-factor 3 --partitions 1000 --topic small_topic

- Load only one message there

$   echo "test"|kafka-console-producer --broker-list cfclbv3870.us2.oraclecloud.com:9092,cfclbv3871.us2.oraclecloud.com:9092,cfclbv3872.us2.oraclecloud.com:9092,cfclbv3873.us2.oraclecloud.com:9092,cfclbv3874.us2.oraclecloud.com:9092 --topic small_topic

- Create hive external table

hive> CREATE EXTERNAL TABLE small_topic_kafka
row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe'
stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler'
tblproperties(
'oracle.kafka.table.key.type'='long',
'oracle.kafka.table.value.type'='string',
'oracle.kafka.bootstrap.servers'='cfclbv3870:9092,cfclbv3871:9092,cfclbv3872:9092,cfclbv3873:9092,cfclbv3874:9092',
'oracle.kafka.table.topics'='small_topic'
);

- Create Oracle external table

SQL> CREATE TABLE small_topic_kafka (
topic varchar2(50),
partitionid integer,
VALUE varchar2(4000),
offset integer,
timestamp timestamp,
timestamptype integer
)
ORGANIZATION EXTERNAL
(TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR
ACCESS PARAMETERS
(
com.oracle.bigdata.tablename=default.small_topic_kafka
)
)
PARALLEL
REJECT LIMIT UNLIMITED;

- Query all rows from it

SQL> SELECT * FROM small_topic_kafka

it took 6 seconds

- Create topic with only one partition and put only one message there and run same SQL query over it

$   kafka-topics --create --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --replication-factor 3 --partitions 1 --topic small_topic
  
$   echo "test"|kafka-console-producer --broker-list cfclbv3870.us2.oraclecloud.com:9092,cfclbv3871.us2.oraclecloud.com:9092,cfclbv3872.us2.oraclecloud.com:9092,cfclbv3873.us2.oraclecloud.com:9092,cfclbv3874.us2.oraclecloud.com:9092 --topic small_topic
SQL> SELECT * FROM small_topic_kafka

now it takes only 0.5 second

9) Type of data in Kafka messages.

You have a few options for storing data in Kafka messages, and you will certainly want pushdown processing. Big Data SQL supports pushdown operations only for JSON. This means that everything you can expose through JSON will be pushed down to the cell side and processed there.

Example

- The query which could be pushed down to the cell side (JSON):

SQL> SELECT COUNT(1) FROM WEB_RETURN_JSON_KAFKA t
WHERE
t.VALUE.after.WR_ORDER_NUMBER=233183247;

- The query which could not be pushed down to the cell side (XML):

SQL> SELECT COUNT(1)
  FROM WEB_RETURNS_XML_KAFKA t
 WHERE 
XMLTYPE(t.value).EXTRACT('/operation/col[@name="WR_ORDER_NUMBER"]/after/text()')
 .getNumberVal() = 233183247;

If the amount of data is not significant, you can use Big Data SQL to process it directly. If we are talking about big data volumes, you could process it once and convert it into a different file format on HDFS with a Hive query:

hive> select xpath_int(value,'/operation/col[@name="WR_ORDER_NUMBER"]/after/text()') from WEB_RETURNS_XML_KAFKA limit 1 ;

10) JSON vs AVRO format in the Kafka topics

Continuing from the previous point, you may be wondering which semi-structured format to use. The answer is easy – use whatever your data source produces; there is no significant performance difference between Avro and JSON. For example, a query like:

SQL> SELECT COUNT(1) FROM WEB_RETURNS_avro_kafka t
WHERE
t.WR_ORDER_NUMBER=233183247;

will complete in 112 seconds in the case of JSON and in 105 seconds in the case of Avro.

The JSON topic takes 286.33 GB and the Avro topic 202.568 GB. There is some difference, but not enough to justify converting the original format.

How do you bring data from OLTP databases into Kafka? Use GoldenGate!

Oracle GoldenGate is a well-known product for capturing commit logs on the database side and bringing the changes into a target system. The good news is that Kafka can play the role of the target system. I’ll skip the detailed explanation of this feature, because it’s already explained in great detail here.


Oracle Blogs | Oracle The Data Warehouse Insider Blog

Data Management and Integration Using Data Entities – Part 2


In part two of our Dynamics 365 for Finance and Operations: Data Management and Integration series, we will cover detailed information on data management and integration using OData Services.

This type of integration is real-time in nature and is mainly used in scenarios where the business requirements revolve around Office integration and third-party mobile app integration. OData stands for Open Data Protocol, an industry-standard Representational State Transfer (REST) based protocol for performing CRUD operations (Create, Read, Update, Delete) and integrating with Dynamics 365 for Finance and Operations.

OData uses REST application programming interfaces (APIs) and the OAuth 2.0 authorization mechanism to exchange data between integrating systems and Finance and Operations.


With OData Services for Finance and Operations, you can seamlessly integrate with all types of web technologies, such as HTTP and JavaScript Object Notation (JSON) and it lets developers interact with data in a standard yet powerful manner using RESTful web services.

OData endpoint

Data Entities that are marked Yes for the IsPublic property are exposed as an OData endpoint. When the IsPublic property for an updatable view is set to TRUE, that view is exposed as a top-level OData entity. Developers can consume this OData endpoint in their external application such as a .Net application for integration scenarios.
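As a rough illustration of what consuming such an endpoint can look like outside .NET, here is a hedged Python sketch using the msal and requests libraries. The tenant, application ID, secret, environment URL and the “Customers” entity name are all placeholders; replace them with your own values from the registration steps below.

import msal
import requests

TENANT = "yourtenant.onmicrosoft.com"                        # placeholder
CLIENT_ID = "<application id registered in Azure AD>"        # placeholder
CLIENT_SECRET = "<client secret>"                            # placeholder
RESOURCE = "https://yourenvironment.operations.dynamics.com" # placeholder F&O URL

# Acquire an OAuth 2.0 token for the Finance and Operations resource.
app = msal.ConfidentialClientApplication(
    CLIENT_ID,
    authority=f"https://login.microsoftonline.com/{TENANT}",
    client_credential=CLIENT_SECRET,
)
token = app.acquire_token_for_client(scopes=[f"{RESOURCE}/.default"])

# Read a few records from a public data entity exposed under /data.
response = requests.get(
    f"{RESOURCE}/data/Customers?$top=5",
    headers={
        "Authorization": f"Bearer {token['access_token']}",
        "Accept": "application/json",
    },
)
response.raise_for_status()
for record in response.json().get("value", []):
    print(record)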


Integrating Client Application with OData:

The OData integration REST API uses the same OAuth 2.0 authentication model as the other service endpoints. Before the integrating client application can consume this endpoint, developers must create and register an application ID in Microsoft Azure Active Directory (AAD) and give it the appropriate permissions to Finance and Operations, as per the steps below:

Go to Azure Portal > Azure Active Directory > App Registrations


Click New Application Registration and select “Web app/API” for application type. Enter your Dynamics 365 URL for sign on.


Click Create and make sure to note the application ID. Click on the app and go to the “Required Permissions” page.



Click add and select “Microsoft Dynamics AX.” Go to the select permissions tab and select all the permissions available.


Once the application is created and registered, the activities below are performed in Dynamics 365 for Finance and Operations. A data project using the Data Management Framework can be used to create import/export jobs for loading data into data entities or extracting data from them.

Click the Dynamics 365 URL > Go to System Administration > Data Management > Click on Import


In our next part of this series, we’ll look at Asynchronous Integrations. Stay tuned and subscribe to our blog to receive the latest posts in your inbox!

Happy Dynamics 365’ing!


PowerObjects- Bringing Focus to Dynamics CRM

“The Space Migration Part Is What I’m Working On Right Now”

Timothy Leary had numerous odd experiences behind prison walls. There was the time he dropped acid with Massachusetts inmates, the one in which he shared a Folsom cell block with Charles Manson and let us never forget that he was lectured in the pen by friend Marshall McLuhan. Such was the life of an LSD salesman.

One of the few trips Leary never got to take, except posthumously, was a trek to outer space. In 1976, during his “comeback tour” after stays in 29 jails and a retirement of sorts, Leary dreamed of leaving it all behind—way behind. The opening of John Riley’s People article “Timothy Leary Is Free, Demonstrably in Love and Making Extraterrestrial Plans“:

High in New Mexico’s Sangre de Cristo Mountains, in a wood-heated A-frame beside a rushing stream, the retired guru speaks:

“After six years of silence, we have three new ideas which we think are fairly good. One is space migration. Another is intelligence increase. The third is life extension. We use the acronym SMI2LE to bring them together.”

The sage is Timothy Leary, high priest of the 1960s LSD movement, who is just four weeks out of the 29th jail he has inhabited since his first arrest in Laredo, Texas, 11 years ago. That charge was possession of less than half an ounce of marijuana that his then-wife, Rosemarie, had handed to his daughter. In recent months, when Leary was appearing before federal grand juries investigating the Weather Underground, he was moved from prison to prison for his own safety. Now paroled at age 56, he will soon start a term of probation whose length will be set by a federal judge.

Leary fled a federal work camp in California in 1970, an escape planned by Rosemarie and the Weather Underground. The Learys went first to Africa, then to Switzerland, where their marriage collapsed. Leary met and was captivated by a then 26-year-old jet-setter, Joanna Harcourt-Smith, whom he married in 1972. Three weeks later they traveled to Afghanistan, where U.S. authorities captured them both and flew them back to Los Angeles.

“Joanna visited me regularly,” Leary says. “She published several of my books and lobbied and schemed to get me free.” He looks at her adoringly, and she turns from the breakfast dishes in the sink to kiss him. Joanna tells how she collared Betty Ford on a street in San Diego and pleaded with her for Tim’s freedom. “I’m doing for my husband what you’re doing for yours. You’re helping yours get elected President, and I’m helping mine get out of prison.”

“One of the plans that she was continually hatching to break me out,” says Leary, “was for her to descend onto the Vacaville prison grounds in a silver helicopter blaring Pink Floyd music, wearing nothing but a machine gun. We called it Plan No. 346.”

“You know,” he continues, after Joanna has left to drive to a village 10 miles away for groceries and cigarettes, “in 1970 the U.S. government directly and bluntly shut me up. It was the greatest thing that could have happened, because I had run out of ideas.” His face, its prison pallor turned to brown by the mountain sun, breaks into a grin. A woodpecker hammers at the chimney of their Franklin stove. “Does that every morning,” says Leary. “We’ve named him the tinpecker.

“Well, SMI2LE, as I said, is a good idea. The acronym is woven into Joanna’s belts and purses. The space migration part is what I’m working on right now. Los Alamos [the atomic laboratory] is not far away and I have lots of questions about laser fusion. And this valley is an ideal temporary planetary base of operations for getting away from earth.”

Leary not only wants to live on a space station between the earth and the moon, he wants to take some of the planet with him. “How far can we see from here?” he asks. “Half a mile? According to a professor at Princeton, such an area could be compressed to a degree that I figure could be fit within a NASA spacecraft.”•


Afflictor.com

Data Management & Integration Using Data Entities – Part 1


Today we will review data management and integration using Data Entities in the Dynamics 365 for Finance and Operations application.

Modern enterprise organizations have interconnected devices with systems and processes that collaborate with each other and users across business functions. This helps provide an exceptional user experience and supplies the data and information to ensure the correct business decisions are being made while aiding productivity – that is where integration plays a vital role.

Dynamics 365 for Finance and Operations offers both real-time (synchronous) and batch-driven (asynchronous) data management and integration options, depending on user requirements. Most commonly, Data Entities are used for data management and integration. A data entity provides a conceptual abstraction and encapsulation of the underlying table schemas to represent key data concepts and functionality.

These Data Entities are grouped into five types that are based on their functions and the type of data that they provide.

Data Entity Type | Features | Example
Parameter Based | Prerequisite setup data specific to an industry or business function, and deployment parameters for modules. Consists of tables that contain only one record, where the columns are the values for settings. | AP / AR / GL / Inventory setups, etc.
Reference Based | Setup reference data, specific to an industry or business process, in small quantities to enable transactions. | Units, dimensions, tax groups / codes, etc.
Master | Large amounts of data required for business operations that are critical for transactions. | Customers, vendors, etc.
Documents | Business operational data with complex structures, including header and line items. | Sales orders, purchase orders, journals, etc.
Transaction | Operational transactional data, such as posted transactions, which are considered records. | Pending invoices

How to build a data entity in Dynamics 365 for Finance & Operations

Option 1: building an entity by using a wizard

The easiest way to build an entity is to start the wizard and add a new item of type Data Entity to the project. When building an entity, start with a root data source. You can then add additional data sources, either manually by adding new data sources or by selecting a surrogate foreign key field in the root data source to automatically expand the required data sources. When the wizard completes, it creates the following:

  • Data entity
  • Staging table (optional, if data management was enabled)

Here is a step-by-step illustration:

[Screenshots: creating the data entity with the wizard]

Option 2: Building an entity from a table

Developers can quickly create an entity from a table, and then customize the properties, data sources, and fields later.

Right-click the table, and then select Addins > Create data entity.



In previous versions of Microsoft Dynamics AX, there were multiple ways to manage data, such as Microsoft Excel add-ins, AIF, and DIXF. The concept of data entities merges these different approaches into one.

Here is a summary of the Data Entity Management Integration Scenarios:

Type of Integration | Where Used
Synchronous integration using OData | Office integration and third-party mobile app integration
Asynchronous integration | Interactive file-based import/export and recurring integrations (file, queue, and so on)
Business intelligence | Aggregate data and standardized key performance indicators (KPIs)
Application lifecycle management | Data migration using packages (exporting and importing data packages)
Dynamics 365 data integration using Common Data Services | Integration and flow of data between various Microsoft Dynamics 365 products

Stay tuned for part 2 of this series, where we will look into each of the data management and integration options in detail. Don’t miss out, be sure to subscribe to our blog!

Happy Dynamics 365’ing!


PowerObjects- Bringing Focus to Dynamics CRM

Power Query (M)agic Part 3: Divide & Conquer

Welcome back P3 Nation! By now we’re in full swing with this Power Query (M)agic series, three posts and counting! You’re now knowledgeable enough about Power Query (if you read Part One and Part Two, anyway) that I want to throw some cool techniques at you involving M code. The time has come to take off the metaphorical training wheels, dive into the deep end, leave the nest, take a leap of faith (boy does this guy love idioms…), because frankly you are READY FOR IT. “I know it, you know it, everyone knows it.”

“Take the blue pill, the story ends…take the red pill, you stay in Wonderland, and I show you how deep the (PQ) rabbit hole goes.” If you’re still reading, then you took the red pill and chose…wisely. Today I’ll be showing you how to add custom M code using the Advanced Editor. The Advanced Editor is where I LIVE and BREATHE when using Power Query, so welcome to my universe.

Specifically, I’ll be demonstrating two important ways to visually organize your query by renaming your applied steps and adding visual dividers between them. This technique is especially useful when you have a lot of applied steps, and want to add visual cues between related applied steps.

Since you’ve read the previous post in this series about folders, you’ve seen that visual aids can be VERY HELPFUL when building queries! These techniques are by no means necessary, but they can be a tremendous help in organizing your model. Anything that makes it easier to step back into a model months after development is a win in my book.

RENAMING STEPS > TAKING NOTES!

So I’ll be the first to tell you that, despite my best efforts, I HATE taking notes. It takes time and energy I choose not to give, especially when I need to store them separately in OneNote or a Word document. A beautiful feature of Power Query is the ability to rename any applied step to whatever you desire, essentially allowing you to write a single line of notes for each step! Corner cutters of the world unite!

This saves so much time, and it’s done easily enough by right-clicking on ANY step, selecting Rename, then changing the original title to whatever friendly name you want. My two cents: rename it to something that will make sense to anyone coming back to this model months (or years!) down the road.

Simply right-click on the step you want to change, then select Rename from the dropdown list:

[Screenshot: renaming an applied step]

But wait, there’s more…you can also add notes to each individual step! You can do this in the SAME dropdown menu as above. What’s even cooler is that those notes show up as a tooltip when you hover over the step. So in essence you can create high-level summary notes by renaming the step. Then for steps that are more complicated (think Rube Goldberg complicated), you can add additional tooltip notes. Badass, that’s what that is. Honestly I didn’t know this existed for the LONGEST TIME! I never would have guessed additional notes would be buried in the properties menu.

Just right-click on the step you want to add notes to, select Properties, then add your notes:

[Screenshot: the applied step Properties dialog]

Hovering over your step now results in tooltips displaying:

[Screenshot: the tooltip shown when hovering over the step]

The Way I Used To Do It (Old & Busted)

A comment on this post enlightened me to a better way to do this! So thanks to Milan Grahovac for the new hotness! I’ve decided to keep the old method in this post, because it does give a good intro to the Advanced Editor. Plus I’m sure some people will appreciate two ways to do something. With that said, let’s walk through the old method first.

Dividers can help you look like a Power Query Ninja if anyone ever digs into your models. Have any of you EVER looked at a workbook (or code) that was organized or had added comments, and NOT thought it looked professional? Take a minute to ponder that…you’ll realize that it almost ALWAYS left a better impression than one that didn’t.

I’m sure many of you are wondering “What the heck does this technique look like?!” Let’s see an example! I create a calendar table in many of the models I build. Building it requires a lot of steps, many of which are related and can easily be grouped together. Below is a before-and-after comparison of the steps from that query, before and after adding dividers between applied steps.

Giant hand man (let’s call him Steve from now on) is back to pass judgement on my organizational skills:

[Screenshot: applied steps before and after adding dividers]

A keen observer (sarcasm much?) will notice the extra lines I’ve added in the right image, visually grouping sections of my applied steps. Adding these is easily accomplished by adding a few lines in the Advanced Editor window. It’s really easy to add them, but it can be tricky to remove them later.

Before I demo the technique, let me explain some important Power Query fundamentals. These concepts will help you understand how the applied steps work, and how they connect to each other. It’ll be important to know this if you ever want to remove these dividers later, or add/move steps to your query.

Fun Fact! Whenever you add a new transformation (new step) to your list, by default it references the previous step. In the image below the highlighted step (Added Calendar Year) references the previous step (Added Calendar Month) in the body of the M code. Every step will have a reference similar to the one you see here (make a note, as we’ll circle back to this later).

Example of how Applied Steps relate to each other:

[Screenshot: an applied step referencing the previous step in the M code]

Interested in Learning How to Do this Kind of Thing?

Our New Online Power Query Course just launched this year. You’ll learn Power Query from the best in the business, two Excel MVPs Ken Puls and Miguel Escobar. We’ve included 7 Hours of Self-Paced Video over 31 Courses, with lots of example files, and a Digital Badge powered by Acclaim. CLICK HERE to claim your spot!

[Image: Power Query training]

 

Let’s start by walking through how to add these dividers into our Applied Steps. First order of business, you need to know where the Advanced Editor Window is located! It’s found under the View ribbon (same place you turn on the formula bar to see the M Code).

Opening the Advanced Editor window:

[Screenshot: Advanced Editor window location]

Advanced Editor Window – Not as scary as it looks, promise:

[Screenshot: Advanced Editor, unedited]

As you gaze into the depths of code within the Advanced Editor window, you can see that each step is nicely given its own line. This is the typical layout for any unmodified query in Power Query. It makes it relatively easy to see where one applied step ends and the next begins. Next I'm going to insert a dividing line into the code, to provide some degree of separation. I'll go ahead and insert it right before I start making all my custom columns (e.g. before #"Added Calendar Month").

Advanced Editor window AFTER adding in the code for my dividing line:

[Screenshot: Advanced Editor with the divider step added]
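
Since a screenshot only goes so far, here's a stripped-down sketch of what that section of the query looks like with the divider in place. The Source line and the column formulas are illustrative placeholders, not my exact calendar query; the step names are the ones from my example:

let
    // Placeholder source and type change - your real query will differ
    Source = Excel.CurrentWorkbook(){[Name = "Dates"]}[Content],
    #"Changed to DATETYPE" = Table.TransformColumnTypes(Source, {{"Date", type date}}),
    // The divider step: just a name and a dummy output value of 1
    #"===START Custom Columns===" = 1,
    // Added Calendar Month still references #"Changed to DATETYPE", NOT the divider
    #"Added Calendar Month" = Table.AddColumn(#"Changed to DATETYPE", "Calendar Month", each Date.Month([Date])),
    #"Added Calendar Year" = Table.AddColumn(#"Added Calendar Month", "Calendar Year", each Date.Year([Date]))
in
    #"Added Calendar Year"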

Let's break down this line of code to understand why I added what I did. Each applied step, at its core, is essentially two parts. The first part is the applied step name, and the second part is the data output. The name in this instance is #"===START Custom Columns===". NOTE that you'll ONLY see the applied step name in the Advanced Editor window; the Formula Bar only shows the data output.

For any applied step name, you technically don't NEED the #"…" wrapper before and after the name unless it contains spaces. However, as a best practice, I always use it out of habit and for consistency. And the equals signs in my step name aren't anything special or required; they're just a way to fill out the row and create a distinctive look compared to other steps.

The second part of any applied step is the data output. This is where the transformation goes whenever you create a new applied step. Some data (either a table output or single value) is REQUIRED here for any applied step. Hence the reason I’ve created “dummy” data in mine as a single numerical value. It doesn’t do anything other than make sure this artificial step works.

Applied Step results after adding my dividing line (Note the 1 value output when selected):

[Screenshot: Adding a dividing step]

The Applied Step Added Calendar Month still references the SAME STEP before the addition of the divider:

[Screenshot: Previous step connection]

The divider step is essentially hidden between the other two, there to provide visual separation and nothing else. I’m a HUGE FAN of this technique because it’s a great way to slowly introduce users to the logic and components that make up each applied step. Now if you want to delete a divider step, you can easily just select the delete (X) icon next to the step to get rid of it, just like any other step.

Power Query can get finicky if you start trying to add and/or move steps around after creating your divider steps. If you do this, there are a few ways that a new or existing step could accidentally get referenced. If that happens it will temporarily break the query unless you redirect the step back to the correct one.

Error message after moving a step in front of my custom step:

[Screenshot: Moving applied step error]

Moving the Added Calendar Year applied step in front of my custom step caused it to break…but why? Well, whenever a step is reordered, it will always default to using the previous step in the list. So by reordering this step right after my custom one, it assumes it should reference that as its data source…which we DO NOT WANT.

Circling back to the M code basics you learned earlier, you can FIX THIS by reassigning that source back to the correct step. You’ll want to change the step name in the Formula Bar to point it back to the Changed to DATETYPE applied step. Once you’ve done that the divider step will go back to ninja mode, hidden amongst the other steps.

Query working correctly after updating the step (data) source for Added Calendar Year to Changed to DATETYPE:

[Screenshot: Moving applied step, error fixed]
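
In M terms, the fix is just pointing the moved step's first argument back at the right step name (again, the column formula here is only illustrative):

// Broken: the reordered step picked up the divider as its source
// #"Added Calendar Year" = Table.AddColumn(#"===START Custom Columns===", "Calendar Year", each Date.Year([Date]))
// Fixed: point it back at the real data step
#"Added Calendar Year" = Table.AddColumn(#"Changed to DATETYPE", "Calendar Year", each Date.Year([Date]))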

Alright folks, now let me show you how to do this WITHOUT the Advanced Editor! Again, as Milan Grahovac graciously mentioned in the comments, there's an easy way to add these divider notes without touching M code. It's the same concept, but without the risk of accidentally breaking a step if you delete the custom line of code. That same dropdown menu we go to when renaming a step? Well, there's another option in there we can use. There's a button called Insert Step After that essentially references the previous step, but without doing any actual transformations. What this means is no performance hit, and a beautifully integrated step you can rename as a divider!

Same process as renaming. Right click, and select Insert Step After:

[Screenshot: Insert Step After]

Boom, new custom step:

[Screenshot: New integrated step]
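
Under the hood, the step that Insert Step After generates is nothing more than a reference to the previous step, something roughly like this in M (default step name shown; yours may differ):

// The new step just returns the previous step unchanged - rename it to taste
Custom1 = #"Added Calendar Month"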

The beauty of this is twofold: no Advanced Editor needed, and it's less prone to break. You can observe how this custom step simply references the previous step, without any transformation applied. So this won't slow down your refresh AT ALL. All it does is give us a line in our Applied Steps to create a divider or section. Just rename it like you would any other step. Now, the old method is still a great gateway drug to the world of M coding. Next thing you know you'll be inserting functions, folding queries, and appending tables, oh my! If this managed to whet your appetite for the possibilities with M, then mission accomplished for me. Stay tuned for the next post in this series! Until then folks!


PowerPivotPro

DAX Reanimator: Moving Averages Controlled by Slicer (Part 2, including GFITW)

Today, I’ll show you how to bring the awesome to moving averages with DAX text formulas for slicers and titles…  and yes, I’ll throw in a Greatest Formula In the World (GFITW) version, too. In Part One, I used What-if-parameters in Power BI to control the moving average period with a slicer. And the original post from 2013 is Moving Averages Controlled by a Slicer, which brought the awesome to Moving Averages, Sums, Etc.

Last time, I promised to go beyond what can be done right out of the box using What-if-parameters. I’m going to help you take your reports from ordinary to awesome by using features like:

  • Small sets of values to choose from:  -12, -6, -5, -4, -3, -2, 2, 3, 4, 5, 6, 12.  By itself, the GENERATESERIES() function can only make sets that follow a fixed pattern: +1, +.05, +2, etc.
  • Moving average slicers with text labels, but what-if-parameters only generate numbers.
  • Dynamically generated title for the chart.

But wait, there’s more! I also added some bonus items:

  • Viewing multiple parameters at the same time. AND…
  • The Greatest Formula in the World (GFITW) version of the moving average formulas. For those who use specialized calendars for manufacturing or retail, like a 4-4-5 calendar.

[Screenshot: New Table button]
The first thing I noticed about GENERATESERIES() is that it’s a table function. It returns a table instead of just one value. That means we can go straight to the New Table button on the Modeling tab. This button lets us create a new table in the data model from a table formula.

The New Table button is handy for quickly adding a simple calendar table to a model. For example, the calendar created with the formula below is trimmed to the range of the Order Date column in the Sales table. Note that the calendar starts on the first day of the month, because otherwise it could throw off the time intelligence functions.

Calendar =
CALENDAR (
    DATE ( YEAR ( MIN ( Sales[OrderDate] ) ), MONTH ( MIN ( Sales[OrderDate] ) ), 1 ),
    MAX ( Sales[OrderDate] )
)

And New Table is also a great place to look at table functions used in formulas, for troubleshooting, or for learning DAX.

Here is a formula to generate a table made up of non-regular intervals. Tables are held in variables at the top of the formula (Yes! Tables in variables!). I use UNION to mash up the different series into one MA Length table. Matt Allington blogged this approach when GENERATESERIES was introduced.

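The formula itself didn't make it onto this page, so here's a reconstruction of the idea rather than the exact original. The series boundaries are my assumption based on the value list above; the key ingredients are GENERATESERIES for the regular ranges, table variables, and UNION to glue them together:

MA Length =
VAR Back_Series = GENERATESERIES ( -6, -2, 1 )    // -6 through -2
VAR Fwd_Series  = GENERATESERIES ( 2, 6, 1 )      // 2 through 6
VAR Extras      = { -12, 0, 12 }                  // the outliers, plus 0 for the Current Month
RETURN
    SELECTCOLUMNS (
        UNION ( Back_Series, Fwd_Series, Extras ),
        "MA Length", [Value]
    )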

The result is a table that's not entirely continuous, but the order doesn't matter because the numbers will be sorted automatically.

Moving Average Label, sorted by length value

[Screenshot: Moving Average Label table]

Next, we add the text for the table. The text for the rows comes from a calculated column we added to the Moving Average Length table by clicking the New Column button on the Modeling tab and entering this formula:


Months to Include =
VAR MonthText =
    FORMAT ( ABS ( 'MA Length'[MA Length] ), "General Number" )
VAR S_Text =
    IF ( ABS ( 'MA Length'[MA Length] ) = 1, "", "s" )
RETURN
    SWITCH (
        TRUE (),
        'MA Length'[MA Length] > 0, MonthText & " Month" & S_Text & " Forward",
        'MA Length'[MA Length] < 0, MonthText & " Month" & S_Text & " Back",
        'MA Length'[MA Length] = 0, "Current Month"
    )

This applies our diabolical old friend SWITCH ( TRUE () ), and also includes a test for a value of 1 so singular values don't get a plural "s". When creating a text equivalent for a value, be sure to set the Sort By Column in the Modeling tab so that it doesn't come out alphabetical.

This makes a text label for the top of the report. I put it onto the report using a Card visualization. Make sure to set the Sort By Column so the calculated column sorts by the MA Length value.


Chart Title =
IF (
    COUNTROWS ( 'Moving Average Length' ) > 1,
    "Monthly Sales vs. 3 Months Back Moving Average ",
    IF (
        COUNTROWS ( 'Moving Average Length' ) = 1
            && [MA Length Value] = 0,
        "Monthly Sales",
        "Monthly Sales vs. "
            & FIRSTNONBLANK (
                'Moving Average Length'[Months to Include],
                [Variable Moving Average]
            )
            & " Moving Average"
    )
)

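The Chart Title formula above leans on an [MA Length Value] measure that harvests the slicer selection from the disconnected table. That measure isn't reproduced here, but a minimal version could look something like this (my assumption, treating 0/Current Month as the fallback when no single value is selected):

MA Length Value =
SELECTEDVALUE ( 'MA Length'[MA Length], 0 )    // 0 = Current Month when no single value is selected
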
Text formulas are also great for static titles:

Chart Title2 = "Actual Sales vs. Multiple Moving Average Periods"

Why? Because they have more formatting options and you can also make them match dynamic titles.

[Screenshot]

Because we used disconnected tables, we can compare multiple months back and forward at the same time by putting the 'MA Length'[Months to Include] column on the legend of the line chart. The slicer is still to the left of the chart above, so we don't have to see all the ranges at the same time. In the first version, we compared different measures: Total Sales and a dynamically set moving average period. In this version, however, we're comparing different values of the disconnected table. This is why I have 0/Current Month in my series: so that it can be compared with multiple moving averages. Our SWITCH() statement just substitutes Total Sales for the moving average calculation when the Current Month is active.

This version of the measures produces the same results, but it's the one to use when an organization has a custom calendar, like a retail 4-4-5 calendar. The formula depends on having a column in the Calendar table that is an index of all months, where Period 13 of one fiscal year is one less than Period 1 of the following year. Year + Month won't work because it has gaps. My year month index is [Year] * 12 + [Month], which only works if your calendar always has exactly 12 months. The Power Pivot and Power BI book gives a great plain-English description of GFITW formulas: "Clear Filters, then Re-Filter."
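
As a calculated column on the Calendar table, that index looks something like this (assuming numeric [Year] and [Month] columns already exist):

YearMonth Index = 'Calendar'[Year] * 12 + 'Calendar'[Month]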

GFITW Variable Moving Sum =
VAR End_Month =
    IF (
        [MA Length Value] > 0,
        MIN ( 'Calendar'[YearMonth Index] ) + [MA Length Value] - 1,
        MIN ( 'Calendar'[YearMonth Index] )
    )
VAR Start_Month =
    IF (
        [MA Length Value] < 0,
        MIN ( 'Calendar'[YearMonth Index] ) + [MA Length Value] + 1,
        MIN ( 'Calendar'[YearMonth Index] )
    )
RETURN
    CALCULATE (
        [Sales Amount],
        FILTER (
            ALL ( 'Calendar' ),
            'Calendar'[YearMonth Index] >= Start_Month
                && 'Calendar'[YearMonth Index] <= End_Month
        )
    )

I redefine the variables in both formulas instead of using measures, because it was the easiest way to avoid a filter context transition (formula breaking because of the measures being evaluated within the FILTER statement).  And let me tell you, if you really want to appreciate the convenience of time intelligence functions, try writing some GFITW versions of things!

GFITW Variable Moving Avg =
VAR End_Month =
    IF (
        [MA Length Value] > 0,
        MIN ( 'Calendar'[YearMonth Index] ) + [MA Length Value] - 1,
        MIN ( 'Calendar'[YearMonth Index] )
    )
VAR Start_Month =
    IF (
        [MA Length Value] < 0,
        MIN ( 'Calendar'[YearMonth Index] ) + [MA Length Value] + 1,
        MIN ( 'Calendar'[YearMonth Index] )
    )
VAR NumberOfMonths =
    CALCULATE (
        DISTINCTCOUNT ( 'Calendar'[YearMonth] ),
        FILTER (
            ALL ( 'Calendar' ),
            'Calendar'[YearMonth Index] >= Start_Month
                && 'Calendar'[YearMonth Index] <= End_Month
        )
    )
RETURN
    IF (
        [MA Length Value] = 0,
        [Sales Amount],
        [GFITW Variable Moving Sum] / NumberOfMonths
    )

Filters section under Visualizations: Page Level and Report Level

[Screenshot: Filters section]
And as a finishing touch, I add some page filters to the report in Power BI. On the first page, I set MA Length <>0, so that the Current Month won’t display in the slicer.  And, on both moving average charts, I set visual level filters to show only completed months.


PowerPivotPro

Understanding Dynamics 365 for Finance and Operations – Part 2


In today’s topic, we will be discussing page options and attachments in the navigation of the Finance and Operations modules.

Let's start by logging in to Dynamics 365 for Finance and Operations. The Page Options are available on almost every page. We'll demonstrate navigating to the Page Options. For this example, we'll open All Vendors.

Path: Modules > Accounts Payable > Vendors > All Vendors

  1. Select All Vendors.
  2. Then, on the action pane, select the Options tab.
  3. In the screenshot below, you will see the different actions you can select under Page Options. Actions will differ from page to page. In this example, there are four. Please note that an asterisk indicates we will address that item in more detail in a future post.
    • Security Diagnostics: administrators can assign roles and security access to users.
    • Advanced Filter/Sort: this functionality is available in all forms and allows more advanced filtering and sorting. *
    • Record Info: this is a function that will show details of your selection. There are several different options you can select from in the form. *
    • Change View: This is to select a different view of the page you have selected. There are two different views to choose from. *
      • Details View
      • Grid View

[Screenshot: Page Options]

To attach documents to any record in Dynamics 365 for Finance and Operations, select the paper clip on the form – this will open the attachment form. This functionality assists users in keeping information with the correct records.

[Screenshots: attaching documents to a record]

When you attach documents to the records, there are several options:

  • Name the attachment – select what type of attachment
  • Detailed notes
  • Created by and Date fields
  • Restrictions for Internal or External

Stay tuned for Part 3 in this navigation series where we'll be exploring the features of Help. In the meantime, keep checking our blog for more helpful tips and tricks!

Happy Dynamics 365’ing!


PowerObjects- Bringing Focus to Dynamics CRM