Category Archives: Pentaho

Farewell Pentaho


Dear Kettle friends,

12 years ago I joined a wonderful team of people at Pentaho who thought they could make a real change in the world of business analytics. At that point I had recently open sourced my own data integration tool (then still called ‘ETL’), named Kettle, and so I joined in the role of Chief Architect of Data Integration. The title sounded great and the job included everything from writing articles (and a book), massive amounts of coding, testing, software releases, giving support, doing training, workshops, … In other words, life was simply doing everything I possibly and impossibly could to make our software succeed when deployed by our users. With Kettle now being one of the most popular data integration tools on the planet I think it’s safe to say that this goal has been reached and that it’s time for me to move on.

I don’t just want to announce my exit from Pentaho/Hitachi Vantara. I would also like to thank all the people involved in making our success happen. First and foremost I want to express my gratitude to the founders (Richard, Doug, James, Marc, …) for even including a crazy Belgian like myself on the team but I also want to extend my warmest thanks to everyone who I got to become friends with at Pentaho for the always positive and constructive attitude. Without exaggeration I can say it’s been a lot of fun.

I would also explicitly like to thank the whole community of users of Kettle (now called Pentaho Data Integration). Without your invaluable support in the form of new plugins, bug reports, documentation, forum posts, talks, … we could never have pulled off what we did in the past 12 years! I hope we will continue to meet at one of the many popular community events.

Finally I want to thank everyone at Hitachi and Hitachi Vantara for being such a positive and welcoming group of people. I know that Kettle is used all over Hitachi and I’m quite confident this piece of software will not let you down any time soon.

Now I’m going to go skiing for a week and when I get back it’s time to hunt for a new job. I can’t wait to see what impossible problems need solving out there…



Matt Casters on Data Integration

Announcing Pentaho 8.0 – Coming in November to a theater near you!

Pentaho 8!


The first of a new Era

Wow – time flies… Another Pentaho World this week, and another blog post announcing another release. This time… the best release ever! ;)
This is our first Pentaho product announcement since we became Hitachi Vantara – and you’ll see that some synergies are already appearing. And as I said before, again and again… the Community Edition is still around! We’re not kidding – we’re here to rule the world and we know it’s through an open source core strategy that we’ll get there :)

Pentaho 8.0 In a nutshell

Ok, let’s get on with this cause there’s a lot of people at the bar calling me to have a drink. And I know my priorities! 
  • Platform and Scalability
    • Worker Nodes
    • New theme
  • Data Integration
    • Streaming support!
    • Run configurations for Jobs
    • Filters in Data Explorer
    • New Open / Save experience
  • Big Data
    • Improvements on AEL
    • Big Data File Formats – Avro and Parquet
    • Big Data Security – Support for Knox
    • VFS improvements for Hadoop Clusters
  • Others
    • Ops Mart for Oracle, MySQL, SQL Server
    • Platform password security improvements
    • PDI mavenization
    • Documentation changes on
    • Feature Removals:
      • Analyzer on MongoDB
      • Mobile Plug-in (Deprecated in 7.1)
Is it done? Can I go now? No?…. damn, ok, now on to further details…

Platform and Scalability

Worker Nodes (EE)

This is big. I never liked the way we handled scalability in PDI. Having the ETL designer responsible for manually defining the slave server in advance, having to control the flow of each execution, praying for things not to go down… nah! Also, why ETL only? What about all the other components of the stack?
So a couple of years ago, after getting info from a bunch of people I submitted a design document with a proposal for this:
[Image: Worker Nodes design document]
This was way before I knew the term “worker nodes” was actually not original… but hey, they’re nodes, they do work, and I’m bad with names, so there’s that… :p
It took time to get to this point, not because we didn’t think this was important, but because of the underlying order of execution; We couldn’t do this without merging the servers, without changing the way we handle the repository, without having AEL (the Adaptive Execution Layer). Now we got to it!
Fortunately, we have an engineering team that can execute things properly! They took my original design, took a look at it, laughed at me, threw me out of the room and came up with the proper way of doing things. Here’s the high-level description:
[Image: Worker Nodes high-level description]
This is where I mentioned that we are already leveraging Hitachi Vantara resources. We are using Lumada Foundry for worker nodes. Foundry is a platform for rapid development of service-based applications delivering the management of containers, communications, security, and monitoring toward creating enterprise products/applications, leveraging technology like docker, mesos, marathon, etc. More on this later, as it’s something we’ll be talking a lot more about…
Here are some of the features:
  • Deploy consistently in physical, virtual and cloud environments
  • Scale and load balance services, helping to deal with peaks and limited time windows by allocating the resources that are needed
  • Hybrid deployments can be used to distribute load: even when the on-premise resources are not sufficient, scaling out into the cloud can provide more resources
So, how does this work in practice? Once you have a Pentaho Server installed, you can configure it to connect to the cluster of Pentaho Worker nodes. From that point on – things will work! No need to configure access to repositories, accesses, funky stuff. You only need to say “Execute at scale” and if the worker nodes are there, it’s where things will be executed. Obviously, the “things will work” will have to obey the normal rules of clustered execution, for instance, don’t expect a random node on the cluster to magically find out your file:///c:/my computer/personal files/my mom’s excel file.xls…. :/
So what scenarios will this benefit the most? A lot! Now your server will not be bogged down executing a bunch of jobs and transformations as they will be handed out for execution in one of the nodes.
This does require some degree of control, because there may be cases where you don’t want remote execution (for instance, a transformation to feed a dashboard). This is where Run Configurations come into play. Also important to note that even though the biggest benefits of this will be ETL work, this concept is for any kind of execution.
This is a major part of the work we’re doing with the Hitachi Vantara team; by leveraging Foundry we’ll be able to make huge improvements in areas we’ve been wanting to tackle for a while but were never able to properly address on our own: better monitoring, improved lifecycle management and active-active HA, among others. In 8.0 we leapfrogged in this worker nodes story, and we expect much more going forward!

New Theme – Ruby (EE/CE)

One of the things you’ll notice is that we have a new theme that reflects the Hitachi Vantara colors. The new theme is the default on new installations (not for upgrades) and the others are still available
[Screenshot: the new Ruby theme]

Data Integration

Streaming Support: Kafka (EE/CE)

In Pentaho 8.0 we’re introducing proper streaming support in PDI! In case you’re thinking “hum… but don’t we already have a bunch of steps for streaming datasources? JMS, MQTT, etc?” you’re not wrong. But the problem is that PDI is a micro batching engine, and these streaming protocols introduce issues that can’t be solved with the current approach. Just think about it – a streaming datasource requires an always running transformation, and in PDI execution all steps run in different threads while the data pipeline is being processed; There are cases, when something goes wrong, where we don’t have the ability to do proper error processing. It’s simply not as simple as a database query or any other call where we get a finite and well known amount of data.
So we took a different approach – somewhat similar to sub-transformations but not quite… First of all, you’ll see a new section in PDI:
[Screenshot: the new Streaming section in PDI]
Kafka is the one that was prioritized as being the most important for now, but this will actually be something that will be extended for other streaming sources.
The secret here is on the Kafka Consumer step:

[Screenshot: Kafka Consumer step dialog]
The highlighted tabs should be generic for pretty much all the steps, and the Batch tab is what controls the flow. Instead of having an always running transformation at the top level, we break the input data into chunks – either by number of records or by duration – and a second transformation takes that input and the field structure and does a normal execution. The abort step was also improved to give you more control over the flow of this execution. This has been a long standing request from the community – we can now specify whether we want to abort with or without an error, giving us an extra way to control the flow of our ETL.
Here’s an example of this thing put together:
[Diagram: streaming transformation example]
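To make the batching idea concrete, here is a minimal Python sketch of chunking an endless record stream by record count or by duration. It is only an illustration of the concept, not PDI's implementation: `micro_batches`, `process_batch` and their parameters are hypothetical names, and `process_batch` merely stands in for the second transformation that handles each chunk.

```python
import time
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(
    records: Iterable[T],
    max_records: int,
    max_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[List[T]]:
    """Group a (possibly endless) stream into bounded batches.

    A batch is emitted once it holds max_records records, or once
    max_seconds have passed since its first record, whichever comes first.
    Simplification: elapsed time is only checked when a record arrives;
    a real consumer would also flush on a timer while the source is idle.
    """
    batch: List[T] = []
    started = 0.0
    for record in records:
        if not batch:
            started = clock()  # the time window opens with the first record
        batch.append(record)
        if len(batch) >= max_records or clock() - started >= max_seconds:
            yield batch
            batch = []
    if batch:
        yield batch  # flush whatever is left when the source ends


def process_batch(batch: List[int]) -> int:
    """Hypothetical stand-in for the child transformation: a finite, normal run."""
    return sum(batch)


# Each chunk is processed as an ordinary, finite execution.
for chunk in micro_batches(range(7), max_records=3, max_seconds=60.0):
    print(chunk, process_batch(chunk))
```

The point of the design is that the per-chunk work always sees a finite, well known amount of data, so normal error handling applies to each chunk even though the source never ends.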
Now, even more interesting than that is that this also works in AEL (our Adaptive Execution Layer, introduced in Pentaho 7.1), so when you run this on a cluster you’ll get Spark-native Kafka support being executed at scale, which is really nice…
Like I mentioned before, moving forward you’ll see more developments here, namely:
  • More streaming steps, and currently MQTT seems the best candidate for the short term
  • (and my favorite) Developer’s documentation with a concrete example so that it’s easy for anyone on the community to develop (and hopefully submit) their own implementations without having to worry about the 90% of the stuff that’s common to all of them

New Open / Save experience (EE/CE)

In Pentaho 7.0 we merged the servers (no more that nonsense of having a distinct “BA Server” and a “DI Server”) and introduced the unified Pentaho Server with a new and great looking experience to connect to it:
[Screenshot: Pentaho Server connection dialog]
but then I clicked on Open file from repository and felt sick… That thing was absolutely horrible and painfully slow. We were finally able to do something about that! Now the experience is … well… slightly better (as in, I don’t feel like throwing up anymore!):
[Screenshot: the new Open dialog]
A bit better, no? :) It also comes with search capabilities and all the kind of stuff that you’ve been expecting from a dialog like this for the past 10 years! Same for the save experience.
This is another small but IMO always important step in unifying the user experience and work towards a product that gets progressively more pleasant to use. It’s a never-ending journey but that’s not an excuse not to take it.

Filters in Data Explorer (EE)

Now that I was able to open my transformation, I can show some of the improvements that we did on our Data Explorer experience in PDI. We now support the first set of filters and actions! This one is easy to show but extremely powerful to use.
Here are the filters – depending on the data type you’ll have a few options, like excluding nulls, equals, greater/less than and a few others. As mentioned, others will come with time.
[Screenshot: Data Explorer filters]
Also, while previous versions only allowed for drill down, we can now do more operations on the visualizations.
[Screenshot: Data Explorer actions]

Run configuration: Leveraging worker nodes and execute on server (EE/CE)

Now that we are connected to the repository, opened our transformation with a really nice experience and took advantage of these data exploration improvements to make sure our logic is spot on, we are ready to execute it on the server.
Now this is where the run configuration part comes in. I have my transformation, defined it, played with it, verified that it really works as expected on my box. And now, I want to make sure it also runs well on the server. What used to be a very convoluted process is now much simpler.
What I do is define a new Run Configuration, as described in 7.1 for AEL, but with a little twist: I don’t want it to use the Spark engine; I want it to use the Pentaho engine but on the server, not the one local to Spoon:
[Screenshot: Run Configuration definition]
Now, what happens when I execute this selecting the Pentaho Server run configuration?
[Screenshot: run dialog with the Pentaho Server run configuration selected]
Yep, that!! \o/
[Screenshot: execution on the Pentaho Server]
This screenshot shows PDI triggering the execution and my Pentaho Server console logging its execution.
And if I had worker nodes configured, what I would see would be my Pentaho Server automatically dispatching the execution of my transformation to an available worker node! 
This doesn’t apply only to immediate execution; we can now specify the run configuration on the job entry as well, allowing full control of the flow of our more complex ETL.
[Screenshot: run configuration on a job entry]

Big Data

Improvements on AEL (EE/CE apart from the security bits)

As expected, a lot of work was done on AEL. The biggest ones:
  • Communicates with Pentaho client tools over WebSocket; does NOT require Zookeeper
  • Uses distro-specific Spark library
  • Enhanced Kerberos impersonation on client-side
This brings a bunch of benefits:
  • Reduced number of steps to setup 
  • Enable fail-over, load-balancing
  • Robust error and status reporting 
  • Customization of Spark jobs (i.e. memory, settings)
  • Client to AEL connection can be secured
  • Kerberos impersonation from client tool 
And not to mention performance improvements… One benchmark I saw that I found particularly impressive is that AEL is practically on par with native Spark execution! And this is impressive! Kudos to the team, just spectacular work!

Big Data File Formats – Avro and Parquet (EE/CE)

Big data platforms introduced various data formats to improve performance, compression and interoperability, and we added full support for these very popular big data formats: Avro and Parquet. Orc will come next.
When you run in AEL, these will also be natively interpreted by the engine, which adds a lot to the value of this.
[Screenshot: Big Data file format steps]
The old steps will still be available on the marketplace but we don’t recommend using them.

Big Data Security – Support for Knox

Knox provides perimeter security so that the enterprise can confidently extend Hadoop access to more users while maintaining compliance with enterprise security policies; it is used in some Hortonworks deployments. It is now supported in the Hadoop cluster definition if you enable the property KETTLE_HADOOP_CLUSTER_GATEWAY_CONNECTION on the file.
[Screenshot: Knox settings in the Hadoop cluster definition]

VFS improvements for Hadoop Clusters (EE/CE)

In order to simplify the overall lifecycle of jobs and transformations we made the Hadoop clusters available through VFS, in the format hc://hadoop_cluster/
[Screenshot: named Hadoop clusters available through VFS]
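One way to picture why a named-cluster URL helps with lifecycle: the job only mentions the cluster by name, and each environment decides what that name points to. The Python sketch below illustrates that indirection only. The registries, host names and the `resolve` helper are all hypothetical and are not how PDI stores or resolves cluster definitions.

```python
from urllib.parse import urlsplit

# Hypothetical per-environment registries; in PDI the mapping lives in the
# Hadoop cluster definition, not in code like this.
DEV_CLUSTERS = {"hadoop_cluster": "hdfs://dev-namenode:8020"}
PROD_CLUSTERS = {"hadoop_cluster": "hdfs://prod-namenode:8020"}


def resolve(url: str, clusters: dict) -> str:
    """Turn hc://<cluster name>/<path> into a concrete location."""
    parts = urlsplit(url)
    if parts.scheme != "hc":
        raise ValueError(f"not a named-cluster URL: {url!r}")
    try:
        base = clusters[parts.netloc]  # the cluster name sits where a host would
    except KeyError:
        raise ValueError(f"unknown cluster: {parts.netloc!r}") from None
    return base + (parts.path or "/")


# The same path in a job resolves differently per environment.
job_path = "hc://hadoop_cluster/user/etl/input.csv"
print(resolve(job_path, DEV_CLUSTERS))
print(resolve(job_path, PROD_CLUSTERS))
```

Because the transformation never hard-codes a host or port, promoting it from one environment to another would not require editing its file paths.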


There are some other generic improvements worth noting

Ops Marts extended support (EE)

Ops Mart now supports Oracle, MySQL and SQL Server. I can’t really believe I’m still writing about this thing :(

PDI Mavenization (CE)

Now, this is actually nice! PDI is now fully mavenized. Go to, do a mvn package and you’re done!!!


Pentaho 8 will be available to download mid-November.

Learn more about Pentaho 8.0 and a webinar here:
Also, you can get a glimpse of PentahoWorld this week watching it live at:

Last but not least: see you in a few weeks at the Pentaho Community meeting in Mainz!

That’s it – I’m going to the bar!


Pedro Alves on Business Intelligence

Pentaho 8 is now available!


I recently wrote about everything you needed to know about Pentaho 8. And now it’s available! Go get your Enterprise Edition or trial version from the usual places.

For CE, you can find it on the new community home!




Pedro Alves on Business Intelligence

A new collaboration space


With the move to Hitachi Vantara we’re not letting the community go away – quite the contrary. And one of the first things we’re doing is trying to give the community a new home, in here:

We’re trying to gather people from the forums, user groups, whatever, and give them a better and more modern collaboration space. This space will remain open, also because the content is extremely valuable, so the ultimate decision is yours.

Your mission, should you choose/decide to accept it, is to register and try this new home. Counting on your help to make it a better space

See you in




Pedro Alves on Business Intelligence

Pentaho Business Analytics Blog

Today, our parent company Hitachi, a global leader across industries, infrastructure and technology, announced the formation of Hitachi Vantara , a company whose aim is to help organizations thrive in today’s uncertain and turbulent times and prepare for the future. This new company unifies the mission and operations of Pentaho,…


Pentaho Business Analytics Blog

Pentaho Community Meeting 2017: exciting use cases & final Call for Papers

Enjoyed your vacations? Good – now let’s get back in business!

The Pentaho Community Meeting 2017 in Mainz, taking place from November 10-12, is approaching and more than 140 participants interested in BI and Big Data are already on board.

Many great speakers from all over the world will present their Pentaho use cases, including data management and analysis at CERN, evaluation of environmental data at the Technical University of Liberec and administration of health information in Mozambique. And of course Matt Casters, Pedro Alves and Jens Bleuel will introduce the latest features in Pentaho.

The 10th jubilee edition features many highlights:

·      Hackathon and technical presentations on FRI, Nov 10 
·      Conference day on SAT, Nov 11                    
·      Dinner on SAT, Nov 11                          
·      Get-together and drinks on SAT, Nov 11  
·      Social event on SUN, Nov 12

See here the complete agenda with all presentations of the business and technical track on the conference day. Food and drinks will be provided. A highlight is the CERN use case (you can read a blog post on it here).

And don’t forget: you can participate in the Call for Papers till September 30th! Send your Pentaho project to Jens Bleuel via the contact form.

 Some of the speakers: 

·      Pedro Alves – Aka… me! All about Pentaho 8.0, which is a different way to say “hum, just put some random title, I’ll figure out something later”
·      Dan Keeley – Data Pipelines – Running PDI on AWS Lambda
·      Francesco Corti – Pentaho 8 Reporting for Java Developers
·      Pedro Vale – Machine Learning in PDI – What’s new in the Marketplace?
·      Caio Moreno de Souza – Working with Automated Machine Learning (AutoML) and Pentaho
·      Nelson Sousa – 10 WTF moments in Pentaho Data Integration
If you haven’t done so, Register Here

We are looking forward to seeing you in Mainz, which can be reached in only 20 minutes by train from Frankfurt airport or main train station!
In the meantime, follow all updates on Twitter.

-pedro, with all the content from this post shamelessly stolen from Ruth and Carolin, the spectacular organizers from IT-Novum


Pedro Alves on Business Intelligence

Hello Hitachi Vantara!


Ok, I admit it – I am one of those people who actually like change and view it as an opportunity. Four years ago, I announced here that Webdetails joined Pentaho. For the ones who don’t know, Webdetails was the Portuguese-based consulting company that then turned into Pentaho Portugal (and expanded from 20 people at the time to 60+), completely integrated into the Pentaho structure.

Two years ago, we announced that Pentaho was acquired by HDS, becoming a Hitachi Group Company.

We have a new change today – and since I’m lazy (and in Vegas, for the Hitachi Next event, and would rather be at our party at the Mandalay Bay Beach than in my room writing this blog post!), I’ll simply steal the same structure I used two years ago (when Pentaho was acquired) and get straight to the point! :p

Big news

 An extremely big transformation has been taking place and materialized itself today, September 19, 2017. A new company is born. Meet: Hitachi Vantara

You may be asking yourselves: Can it possibly be a coincidence that the new company is launched on the exact same day I turn 40? Well, actually yes, a complete coincidence… :/

This new company unifies the mission and operations of Pentaho, Hitachi Data Systems and Hitachi Insight Group into a single business. More info in the Pentaho blog: Hitachi Vantara – Here’s what it means

What does this mean?

It has always been our goal to provide an offering that would allow customers to build their high value, data driven solutions. We were, I think, successful at doing that! And now we (Hitachi Vantara) want to take it to the next level, thus this transformation is needed: We’re aiming higher – we want not only to be the best at (big) data orchestration and analytics, we want to do so in this new IoT / social innovation ecosystem, aiming to be the biggest player in the market.

And this transformation will allow us to do that!

What will change?

So that it’s clear, Pentaho, as a product will continue to exist. Pentaho, as a company, is now Hitachi Vantara.

And for Pentaho as a product, this gives us conditions we’ve never had before to improve the product, focusing on what we need to do best (big data orchestration and analytics) and leveraging other groups in the company in areas that, even though they weren’t our core focus, people expect us to have.
We’ll also improve the overall portfolio interoperability. While so far we’ve always tried to be completely agnostic, now we’ll keep saying that but add a small detail: we have to work better with our own stuff – because we can make it happen!

Community implications

This one is very easy!!! I’ll just copy paste my previous answer – because it didn’t change:

Throughout all the talks, our relationship and involvement with the community has always been one of the strong points of Pentaho, and seen with much interest.
The relationship between the community and a commercial company exists because it’s mutually beneficial. In Pentaho’s case, the community gets access to software it otherwise couldn’t, and Pentaho gets access to an insane amount of resources that contribute to the project. Don’t believe me? Check the Pentaho Marketplace for the large number of submissions, Jira for all the bug reports and improvement suggestions we get out of all the real world tests, and discussions on the forums or on the several available email lists.
Is anyone, in his or her right mind, willing to let all this go? Nah.
Plus, not having a community would render my job obsolete, and no one wants that, right? (don’t answer, please!)

The difference? We wanna do this bigger, better and faster!


And things are already moving in that direction. We are moving the Pentaho Community page to the Hitachi Vantara community site with some really cool interactive and social features. You can visit our new home here. I look forward to engaging with all of you on this new site.

Will Hitachi Vantara shut down its Pentaho CE edition / its open source model?

I will, once again, repeat the previous answer:

Just in case the previous answer wasn’t clear enough, lemme spell it out with all the words: There are no plans of changing our opensource strategy or stop providing a CE edition to our community!
Can that change in the future? Oh, absolutely yes! Just like it could have changed in the past. And when could it change? When it stops making sense; when it stops being mutually beneficial. And on that day, I’ll be the first one to suggest a change to our model.

And speaking of which – don’t forget to register to PCM17! It’s going to be the best ever!


Pedro Alves on Business Intelligence

Pentaho Maven repository changed to

From a recent (at the time of writing, obviously!) issue in the mondrian project we noticed we failed to notify an important change:

This morning the pentaho maven repository seems to be down.

Each download request during maven build fails with 503 error:
[WARNING] Could not transfer metadata XXX/maven-metadata.xml from/to pentaho-releases ( Failed to transfer file: Return code is: 503 , ReasonPhrase:Service Temporarily Unavailable.

The reason for this is that the maven url is now .

Here’s a link to a complete ~/.m2/settings.xml config file:



Pedro Alves on Business Intelligence

PCM17 – Pentaho Community Meeting: November 10-12, Mainz

PCM17 – 10th Edition


One of my favourite blog posts of the year – Announcing PCM17. And this year, for the 10th edition, we’re going back to the beginning – Mainz in Germany.


Location address: Kupferbergterrasse, Kupferbergterrasse 17-19, 55116 Mainz. Close to Frankfurt, Germany



We’re maintaining the schedule of the previous years: a meet-up on Friday for drinks preceded by a hackathon; a meet-up on Saturday for drinks preceded by a bunch of presentations of really cool stuff; a meet-up on Sunday for drinks preceded by city sightseeing! You get the idea.

All the information….

Here:! IT-Novum is doing spectacular work organizing this event, and you’ll find all the information needed, from instructions on how to get there to suggestions for hotels to stay at.

Registration and Call for Presentations

Please go to the #PCM17 website to register and also to send us a presentation proposal!




Pedro Alves on Business Intelligence

A consulting POV: Stop thinking about Data Warehouses!

What I am writing in here is the materialization of a line of thought that started bothering me a couple of years ago. While I implemented project after project, built ETLs, optimized reports, designed dashboards, I couldn’t help thinking that something didn’t quite make sense, but couldn’t quite see what. When I tried to explain it to someone, I just got blank stares…
Eventually things started to make more sense to me (which is far from saying they actually make sense, as I’m fully aware my brain is, hum, let’s just say a little bit messed up!) and I ended up realizing that I’d been looking at the challenges from the wrong perspective. And while this may seem a very small change in mindset (especially if I fail in passing the message, which may very well happen), the implications are huge: not only did it change our methodology on how to implement projects in our services teams, it’s also guiding Pentaho’s product development and vision.

A few years ago, in a blog post far, far away…

A couple of years ago I wrote a blog post called “Kimball is getting old”. It focused on one fundamental point: technology was evolving to a point where just looking at the concept of an enterprise data warehouse (EDW) seemed restrictive. After all, the end users care only about information; they couldn’t care less about what gets the numbers in front of them. So I proposed that we should apply a very critical eye to our problem, and maybe, sometimes, Kimball’s DW, with its star schemas, snowflakes and all that jazz wasn’t the best option and we should choose something else…

But I wasn’t completely right…

I’m still (more than ever?) a huge proponent of the top down approach: focus on usability, focus on the needs of the user, provide him a great experience. All the rest follows. All of that is still spot on.
But I made 2 big mistakes:
1.    I confused data modelling with data warehouse
2.    I kept seeing data sources conceptually as the unified, monolithic source of every insight

Data Modelling – the semantics behind the data

Kimball was a bloody genius! My mistake here was actually due to the fact that he is way smarter than everyone else. Why do I say this? Because he didn’t come up with one, but with two groundbreaking ideas…
First, he realized that the value of data, business-wise, comes when we stop considering it as just zeros and ones and start treating it as business concepts. That’s what Data Modelling does: by adding semantics to raw data, it immediately gives it meaning that makes sense to a wide audience of people. And this is the part that I erroneously dismissed. This is still spot on! All his concepts of dimensions, hierarchies, levels and attributes are relevant first and foremost because that’s how people think.
And then, he immediately went prescriptive and told us how we could map those concepts to database tables and answer the business questions with relational database technology with concepts like star schemas, snowflake, different types of slowly changing dimensions, aggregation techniques, etc.
He did such a good job that he basically shaped how we worked. How many of us were involved in projects where we were told to build data warehouses to give all possible answers when we didn’t even know the questions? I’m betting a lot; I certainly did that. We were taught to provide answers without focusing on understanding the questions.
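As a reminder of what that mapping from business concepts to tables looks like, here is a tiny star schema sketched in plain Python. Everything in it (table names, keys, the sample rows) is made up for illustration. The point is only that the fact rows carry keys and a measure, while the dimensions carry the hierarchy levels people actually reason with.

```python
from collections import defaultdict

# Dimension tables: one row per key; the date dimension carries a
# year > quarter > month hierarchy. All values are invented sample data.
DATE_DIM = {
    20170115: {"year": 2017, "quarter": "Q1", "month": "2017-01"},
    20170220: {"year": 2017, "quarter": "Q1", "month": "2017-02"},
    20170405: {"year": 2017, "quarter": "Q2", "month": "2017-04"},
}
PRODUCT_DIM = {
    1: {"category": "Books", "name": "ETL Handbook"},
    2: {"category": "Games", "name": "Chess Set"},
}

# Fact table: foreign keys into the dimensions plus a numeric measure.
SALES_FACT = [
    {"date_key": 20170115, "product_key": 1, "amount": 100.0},
    {"date_key": 20170220, "product_key": 2, "amount": 50.0},
    {"date_key": 20170405, "product_key": 1, "amount": 70.0},
    {"date_key": 20170405, "product_key": 2, "amount": 30.0},
]


def total_by(date_level: str, product_attr: str) -> dict:
    """Roll the measure up to the requested dimension levels."""
    totals = defaultdict(float)
    for row in SALES_FACT:
        date_value = DATE_DIM[row["date_key"]][date_level]
        product_value = PRODUCT_DIM[row["product_key"]][product_attr]
        totals[(date_value, product_value)] += row["amount"]
    return dict(totals)


# The same facts answer questions at different levels of the hierarchy.
print(total_by("quarter", "category"))
print(total_by("year", "category"))
```

Note that the business question ("sales per quarter and category") is phrased entirely in dimension terms; the storage layout underneath is an implementation choice.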

Project complexity is growing exponentially

Classically, a project implementation was simply around reporting on the past. We can’t do that anymore; If we want our project to succeed, it can’t just report on the past: It also has to describe the present and predict the future.
There’s also the explosion in the amount of data available.
IoT brought us an entirely new set of devices that are generating data we can collect.
Social media and behavior analysis brought us closer to our users and customers.
In order to be impactful (regardless of how “impact” is defined), a BI project has to trigger operational actions: schedule maintenances, trigger alerts, prevent failures. So, bring on all those data scientists with their predictive and machine learning algorithms…
On top of that, in the past, we might have been successful at convincing our users that it’s perfectly reasonable to expect a couple of hours for that monthly sales report that processed a couple of gigabytes of data. We all know that’s changed; if they can search the entire internet in less than a second, why would they waste minutes for a “small” report?? And let’s face it, they’re right…
The consequence? It’s getting much more complex to define, architect, implement, manage and support a project that needs more data, more people, more tools.
Am I making all of this sound like a bad thing? On the contrary! This is a great problem to have! In the past, BI systems were confined to delivering analytics. We’re now given the chance to have a much bigger impact in the world! Figuring this out is actually the only way forward for companies like Pentaho: We either succeed and grow, or we become irrelevant. And I certainly don’t want to become irrelevant!

IT’s version of Heisenberg’s Uncertainty Principle: Improving both speed and scalability??

So how do we do this?
My degree is actually in Physics (don’t pity me; it took me a while, but I eventually moved away from that), and even though I’m a really crappy physicist, I do know some of the basics…
One of the best-known results in physics is Heisenberg’s Uncertainty Principle: you cannot know both the momentum (loosely, the speed) and the position of a (sub-)atomic particle with full precision. You can have precise knowledge of one, but only to the detriment of the other.
I’m very aware this analogy is a little bit silly (to say the least), but it’s at least vivid enough in my mind to make me realize that we can’t expect IT to solve both the speed and the scalability issue, at least not to the point where we have a one-size-fits-all approach.
There have been spectacular improvements in distributed computing technologies, but all of them have their pros and cons; the days when a single database was good for all use cases are long gone.
So what do we do for a project where we effectively need to process a bunch of data and, at the same time, it has to be blazing fast? What technology do we choose?

Thinking “data sources” slightly differently

When we think about data sources, there are 2 traps most of us fall into:
1.    We think of them as a monolithic entity (e.g. Sales, Human Resources, etc.) that holds all the information relevant to a topic
2.    We think of them from a technology perspective
Let me try to explain this through an example. Imagine the following customer requirement, shown here as a dashboard, though it could very well be any other delivery format (yeah, because a dashboard, a report, a chart, whatever, is just the way we choose to deliver the information):
[Image: dashboard mock-up of the customer requirement]

Pretty common, huh?

The classical approach

When thinking about this (common) scenario from the classical implementation perspective, the first instinct would be to start designing a data warehouse (it doesn’t even need to be an EDW per se; it could be Hadoop, a NoSQL source, etc.). We would build our ETL process (with PDI or whatever) from the source systems, and there would always be a modelling stage so we could get to a Sales data source that could answer all kinds of questions.
After that is done, we’d be able to write the necessary queries to generate the numbers our fictitious customer wants.
And after a while, we would implement a solution architecture similar to this diagram, which I’m sure looks very similar to everything we’ve all been doing in consulting:
[Image: solution architecture diagram of the classical approach]

Our customer gets the numbers he wants; he’s happy and successful. So successful that he expands, does a bunch of acquisitions, and gets so much data that our system starts to become slow. The sales “table” never stops growing. It’s a pain to do anything with it… Part of our dashboard takes a while to render… we’re able to optimize part of it, but other areas become slow.
In order to optimize performance and allow the system to scale, we consider changing the technology: from relational databases to column-store databases, to NoSQL data stores, all the way to Hadoop, in a permanent effort to keep things scaling and fast…

The business’ approach

Let’s take a step back. Looking at our requirements, the main KPI the customer wants to know is:
How much did I sell yesterday and how is that compared to budget?
It’s one number he’s interested in.
Look at the other elements: He wants the top reps for the month. He wants a chart for the MTD sales. How many data points is that? 30 tops? I’m being simplistic on purpose, but the thing is that it is extremely stupid to force ourselves to always go through all the data when the vast majority of the questions aren’t a big data challenge in the first place. Answering them may need big data processing and orchestration, but certainly not at runtime.
So here’s how I’d address this challenge:
[Image: diagram of the proposed approach]

I would focus on the business question. I would not build a single Sales data source. Instead, I’d define the following Business Data Sources (sorry, I’m not very good at naming stuff…), and I’d force myself to define them so that each of them contains (or outputs) a small set of data (a few million records at most):
·      ActualVsBudgetThisMonth
·      CustomerSatByDayAndStore
·      SalesByStore
·      SalesRepsPerformance
Then I’d implement these however I needed! Materialized, unmaterialized, database or Hadoop, whatever worked. But through this exercise we define a clear separation between where all the data is and the most common questions we need to answer in a very fast way.
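To make the idea concrete, here is a minimal sketch in Python of what building such small, question-specific data sources could look like. Everything in it is an assumption for illustration: the raw row layout, the budget figure and the builder function names are invented, and in a real project the raw data would sit in whatever big store you have, with these builders running as batch jobs rather than at dashboard runtime.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw rows: (day, store, sales rep, amount). Made-up values.
RAW_SALES = [
    (date(2017, 11, 1), "Lisbon", "Ana", 100.0),
    (date(2017, 11, 1), "Porto", "Rui", 80.0),
    (date(2017, 11, 2), "Lisbon", "Ana", 150.0),
    (date(2017, 11, 2), "Lisbon", "Rui", 40.0),
]
MONTHLY_BUDGET = 500.0  # made-up budget figure

def build_sales_by_store(rows):
    """Total sales per store: one small record per store."""
    totals = defaultdict(float)
    for _day, store, _rep, amount in rows:
        totals[store] += amount
    return dict(totals)

def build_actual_vs_budget_this_month(rows, year, month, budget):
    """A single record comparing month-to-date sales with the budget."""
    actual = sum(a for d, _s, _r, a in rows if (d.year, d.month) == (year, month))
    return {"actual": actual, "budget": budget, "variance": actual - budget}

def build_sales_reps_performance(rows):
    """Sales reps ranked by total sales, best first."""
    totals = defaultdict(float)
    for _day, _store, rep, amount in rows:
        totals[rep] += amount
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Built ahead of time; the dashboard only ever reads these tiny results.
BUSINESS_DATA_SOURCES = {
    "SalesByStore": build_sales_by_store(RAW_SALES),
    "ActualVsBudgetThisMonth": build_actual_vs_budget_this_month(
        RAW_SALES, 2017, 11, MONTHLY_BUDGET
    ),
    "SalesRepsPerformance": build_sales_reps_performance(RAW_SALES),
}
print(BUSINESS_DATA_SOURCES)
```

The heavy lifting happens once, in the builders; what the dashboard queries is a handful of records per question.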
Does something like this give us the liberty to answer all the questions? Absolutely not! But, at least for me, it doesn’t make a lot of sense to optimize a solution to give answers when I don’t even know what the questions are. And the big data store is still there somewhere for the data scientists to play with.
Like I said, while the differences may seem very subtle at first, here are some advantages I found of thinking through solution architecture this way:
·      Faster to implement – since each business data source’s signature is much smaller and well identified, it’s much easier to fill in the blanks
·      Easier to validate – since the datasources are smaller, they are easier to validate with the business stakeholders as we lock them down and move to other business data sources
·      Technology agnostic – note that at no point did I mention technology choices. Think of these data sources as an API
·      Easier to optimize – since we split one big data source into multiple smaller ones, they become easier to maintain, support and optimize
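The "think of these data sources as an API" point can be sketched like this in Python. The interface and class names below (BusinessDataSource, MaterializedSource, OnDemandSource, render_widget) are hypothetical, invented only to illustrate the idea: the consumer depends on a small contract, and the technology behind it can be swapped without the consumer noticing.

```python
from typing import Callable, Dict, List, Protocol

Row = Dict[str, object]

class BusinessDataSource(Protocol):
    """The contract the dashboard depends on: a name and a small result set."""
    name: str

    def fetch(self) -> List[Row]: ...

class MaterializedSource:
    """Backed by rows computed ahead of time (e.g. by a nightly batch job)."""
    def __init__(self, name: str, rows: List[Row]) -> None:
        self.name = name
        self._rows = list(rows)

    def fetch(self) -> List[Row]:
        return list(self._rows)

class OnDemandSource:
    """Backed by a function that computes the rows when asked."""
    def __init__(self, name: str, compute: Callable[[], List[Row]]) -> None:
        self.name = name
        self._compute = compute

    def fetch(self) -> List[Row]:
        return self._compute()

def render_widget(source: BusinessDataSource) -> str:
    """The consumer only knows the contract, not the technology behind it."""
    rows = source.fetch()
    return f"{source.name}: {len(rows)} rows"

sample_rows: List[Row] = [{"store": "Lisbon", "sales": 290.0}]
materialized = MaterializedSource("SalesByStore", sample_rows)
on_demand = OnDemandSource("SalesByStore", lambda: list(sample_rows))

print(render_widget(materialized))
print(render_widget(on_demand))
```

Both implementations render identically, which is the point: we can start unmaterialized and move to a materialized or entirely different backend later without touching the dashboard.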

Concluding thoughts

Give it a try. This will seem odd at first, but it forces us to think differently. We spend so much time worrying about the technology that, more often than not, we forget what we’re here to do in the first place…



Pedro Alves on Business Intelligence