Category Archives: Pentaho

A consulting POV: Stop thinking about Data Warehouses!

What I am writing here is the materialization of a line of thought that started bothering me a couple of years ago. While I implemented project after project, built ETLs, optimized reports and designed dashboards, I couldn’t help thinking that something didn’t quite make sense, but I couldn’t quite see what. When I tried to explain it to someone, I just got blank stares…
Eventually things started to make more sense to me (which is far from saying they actually make sense, as I’m fully aware my brain is, hum, let’s just say a little bit messed up!) and I ended up realizing that I’d been looking at the challenges from the wrong perspective. And while this may seem a very small change in mindset (especially if I fail in passing the message, which may very well happen), the implications are huge: not only did it change the methodology our services teams use to implement projects, it’s also guiding Pentaho’s product development and vision.

A few years ago, in a blog post far, far away…

A couple of years ago I wrote a blog post called ”Kimball is getting old”. It focused on one fundamental point: technology was evolving to a point where just looking at the concept of an enterprise data warehouse (EDW) seemed restrictive. After all, end users care only about information; they couldn’t care less about what gets the numbers in front of them. So I proposed that we apply a very critical eye to our problem, and that maybe, sometimes, Kimball’s DW, with its star schemas, snowflakes and all that jazz, wasn’t the best option and we should choose something else…

But I wasn’t completely right…

I’m still (more than ever?) a huge proponent of the top-down approach: focus on usability, focus on the needs of the users, provide them a great experience. All the rest follows. All of that is still spot on.
But I made 2 big mistakes:
1.    I confused data modelling with data warehousing
2.    I kept seeing data sources, conceptually, as a unified, monolithic source of every insight

Data Modelling – the semantics behind the data

Kimball was a bloody genius! My mistake here was really due to the fact that he is way smarter than everyone else. Why do I say this? Because he didn’t come up with one, but with two groundbreaking ideas…
First, he realized that the value of data, business-wise, comes when we stop considering it as just zeros and ones and start treating it as business concepts. That’s what data modelling does: by adding semantics to raw data, it immediately gives it meaning that makes sense to a wide audience. And this is the part that I erroneously dismissed. This is still spot on! All his concepts of dimensions, hierarchies, levels and attributes are relevant first and foremost because that’s how people think.
And then, he immediately went prescriptive and told us how we could map those concepts to database tables and answer the business questions with relational database technology with concepts like star schemas, snowflake, different types of slowly changing dimensions, aggregation techniques, etc.
He did such a good job that he basically shaped how we work. How many of us were involved in projects where we were asked to build data warehouses to give all possible answers when we didn’t even know the questions? I’m betting a lot; I certainly did that. We were taught to provide answers without focusing on understanding the questions.

Project’s complexity is growing exponentially

Classically, a project implementation was simply about reporting on the past. We can’t do that anymore; if we want our project to succeed, it can’t just report on the past: it also has to describe the present and predict the future.
There’s also the explosion on the amount of data available.
IoT brought us an entire new set of devices that are generating data we can collect.
Social media and behavior analysis brought us closer to our users and customers.
In order to be impactful (regardless of how “impact” is defined), a BI project has to trigger operational actions: schedule maintenances, trigger alerts, prevent failures. So, bring on all those data scientists with their predictive and machine learning algorithms…
On top of that, in the past we might have been successful at convincing our users that it was perfectly reasonable to wait a couple of hours for that monthly sales report that processed a couple of gigabytes of data. We all know that’s changed; if they can search the entire internet in less than a second, why would they wait minutes for a “small” report? And let’s face it, they’re right…
The consequence? It’s getting much more complex to define, architect, implement, manage and support a project that needs more data, more people, more tools.
Am I making all of this sound like a bad thing? On the contrary! This is a great problem to have! In the past, BI systems were confined to delivering analytics. We’re now given the chance to have a much bigger impact in the world! Figuring this out is actually the only way forward for companies like Pentaho: We either succeed and grow, or we become irrelevant. And I certainly don’t want to become irrelevant!

IT’s version of the Heisenberg’s Uncertainty Principle: Improving both speed and scalability??

So how do we do this?
My degree is actually in Physics (don’t pity me, took me a while but I eventually moved away from that), and even though I’m a really crappy one, I do know some of the basics…
One of the most well-known principles in physics is Heisenberg’s Uncertainty Principle: you cannot know both the speed and the location of a (sub-)atomic particle with full precision. But you can have precise knowledge of one at the expense of the other.
I’m very aware this analogy is a little bit silly (to say the least), but it’s at least vivid enough in my mind to make me realize that in IT we can’t expect to solve both the speed and the scalability issue – at least not to a point where we have a one-size-fits-all approach.
There have been spectacular improvements in distributed computing technologies – but all of them have their pros and cons; the days when one database was good for all use cases are long gone.
So what do we do for a project where we effectively need to process a bunch of data and at the same time be blazing fast? What technology do we choose?

Thinking “data sources” slightly differently

When we think about data sources, there are 2 traps most of us fall into:
1.    We think of them as monolithic entities (e.g. Sales, Human Resources, etc.) that hold all the information relevant to a topic
2.    We think of them from a technology perspective
Let me try to explain this through an example. Imagine the following customer requirement, here in the format of a dashboard, but it could very well be any other delivery format (yeah, because a dashboard, a report, a chart, whatever, is just the way we choose to deliver the information):
[Dashboard mockup]

Pretty common, huh?

The classical approach

When thinking about this (common) scenario from the classical implementation perspective, the first instinct would be to start designing a data warehouse (it doesn’t even need to be an EDW per se; it could be Hadoop, a NoSQL source, etc.). We would build our ETL process (with PDI or whatever) from the source systems, and there would always be a modelling stage so we could get to a Sales data source that could answer all kinds of questions.
After that is done, we’d be able to write the necessary queries to generate the numbers our fictitious customer wants.
And after a while, we would end up with a solution architecture diagram similar to this, which I’m sure looks very similar to everything we’ve all been doing in consulting:
[Classical solution architecture diagram]

Our customer gets the numbers he wants; he’s happy and successful. So successful that he expands, does a bunch of acquisitions, and gets so much data that our system starts to become slow. The sales “table” never stops growing. It’s a pain to do anything with it… Part of our dashboard takes a while to render… We’re able to optimize part of it, but other areas become slow.
In order to optimize performance and allow the system to scale, we consider changing the technology: from relational databases to vertical column-store databases, to NoSQL data stores, all the way through Hadoop, in a permanent effort to keep things scaling and fast…

The business’ approach

Let’s take a step back. Looking at our requirements, the main KPI the customer wants to know is:
How much did I sell yesterday and how is that compared to budget?
It’s one number he’s interested in.
Look at the other elements: he wants the top reps for the month. He wants a chart for the MTD sales. How many data points is that? 30, tops? I’m being simplistic on purpose, but the thing is that it is extremely stupid to force ourselves to always go through all the data when the vast majority of the questions aren’t a big data challenge in the first place. They may need big data processing and orchestration, but certainly not at runtime.
So here’s how I’d address this challenge
[Proposed solution architecture diagram]

I would focus on the business questions. I would not do a single Sales datasource. Instead, I’d define the following Business Data Sources (sorry, I’m not very good at naming stuff…), and I’d force myself to define them in a way where each of them contains (or outputs) a small set of data (up to a few million rows at most):
·      ActualVsBudgetThisMonth
·      CustomerSatByDayAndStore
·      SalesByStore
·      SalesRepsPerformance
Then I’d implement these however I needed! Materialized, unmaterialized, database or Hadoop, whatever worked. But through this exercise we define a clear separation between where all the data is and the most common questions we need to answer in a very fast way.
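To make the contrast concrete, here’s a minimal Python sketch of two of these business datasources. The field names, numbers and aggregation logic are all illustrative assumptions, not project code; the point is only that however much raw data goes in, each datasource outputs a tiny, well-defined result.

```python
# Each "business datasource" answers exactly one business question and
# returns a small result, regardless of how many raw rows go in.
from collections import defaultdict

def actual_vs_budget_this_month(sales, budget):
    # ActualVsBudgetThisMonth: a single number pair, tiny by construction
    actual = sum(s["amount"] for s in sales)
    return {"actual": actual, "budget": budget, "ratio": actual / budget}

def sales_by_store(sales):
    # SalesByStore: at most one row per store, still a small datasource
    totals = defaultdict(float)
    for s in sales:
        totals[s["store"]] += s["amount"]
    return dict(totals)

# usage: imagine millions of raw rows in, a handful of rows out
raw = [{"store": "A", "amount": 100.0},
       {"store": "B", "amount": 50.0},
       {"store": "A", "amount": 25.0}]
kpi = actual_vs_budget_this_month(raw, budget=200.0)
per_store = sales_by_store(raw)
```

Whether each function is backed by a materialized table, a Hadoop job or a live query is an implementation detail hidden behind its small, stable signature.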
Does something like this give us the liberty to answer all possible questions? Absolutely not! But at least for me it doesn’t make a lot of sense to optimize a solution to give answers when I don’t even know what the questions are. And the big data store is still there somewhere for the data scientists to play with.
Like I said, while the differences may seem very subtle at first, here are some advantages I found of thinking through solution architecture this way:
·      Faster to implement – since our business datasources’ signatures are much smaller and well identified, it’s much easier to fill in the blanks
·      Easier to validate – since the datasources are smaller, they are easier to validate with the business stakeholders as we lock them down and move to other business data sources
·      Technology agnostic – note that at no point did I mention technology choices. Think of these datasources as an API
·      Easier to optimize – since we split a big datasource into multiple smaller ones, they become easier to maintain, support and optimize

Concluding thoughts

Give it a try – this will seem odd at first, but it forces us to think differently. We spend so much time worrying about the technology that more often than not we forget what we’re here to do in the first place…



Pedro Alves on Business Intelligence

Pentaho 7.1 is available!

Pentaho 7.1 is out


Remember when I said at the time of the previous release that Pentaho 7.0 was the best release ever? Well, that was true until today! But not any more, as 7.1 is even better!  :p

Why do I say that? It’s a big step forward in the direction we’ve been aiming for – consolidating and simplifying our stack, not passing the complexity on to the end user.

These are the main features in the release:

  • Visual Data Experience
    • Data Exploration (PDI)
      • Drill Down
      • New Viz’s: Geo map, sunburst, Heat Grid
      • Tab Persistency
      • Several other improvements including performance
    • Viz API 3.0 (Beta)
      • Viz API 3.0, with documentation
      • Rollout of consistent visualizations between Analyzer, PDI and Ctools
  • Enterprise Platform
    • VCS-friendly features
      • File / repository abstraction
      • PDI files properly indented
      • Repository performance improvements
    • Reintroducing Ops Mart
    • New default theme on User Console
    • Pentaho Mobile deprecation
  • Big Data Innovation
    • AEL – Adaptive Execution Layer (via Spark)
    • Hadoop Security
      • Kerberos Impersonation (for Hortonworks)
      • Ranger support
    • Microsoft Azure HD Insights shim

I’m getting tired just from listing all this stuff… Now into a bit more detail; I’ll jump back and forth between these different topics, ordering by the ones that… well, that I like the most :p

Adaptive Execution with Spark

This is huge; We’ve decoupled the execution engine from PDI so we can plug in other engines. Now we have 2: 
  • Pentaho – the classic pentaho engine
  • Spark – you’ve guessed it…
What’s the goal of this? Making sure we treat our ETL development with a pay-as-you-go approach: first we worry about the logic, then we select the engine that makes the most sense.

AEL execution of Spark
On other tools (and even on our own tools; that’s why I don’t like our approach to Pentaho MapReduce), you need to think from the start about the engine and technology you’re going to use. But this makes little sense.

Scale as you go

Pentaho’s message is one of future-proofing the IT architecture, leveraging the best of what the different technologies have to offer without imposing a certain configuration or persona as the starting point. The market is moving towards a demand for BA/DI to come together in a single platform. Pentaho has an advantage here, as we have seen with our customers that BI and DI are better together, and that’s what sets us apart from the competition. Gartner predicts that BI and discovery tool vendors will partner to accomplish this, while larger, proprietary vendors will attempt to build these platforms themselves. Against that competition, Pentaho has a unique and early lead in delivering this platform.

A good example is the story we can tell about governed blending. We don’t need to impose on customers any pre-determined configuration; we can start with the simple use of data services and unmaterialized data sets. If it’s fast enough, we’re done. If not, we can materialize the data into a database or even an enterprise data warehouse. If it’s fast enough, we’re done. If not, we can resort to other technologies – NoSQL, Lucene-based engines, etc. If it’s fast enough, we’re done. If everything else fails, we can set up an SDR blueprint, which is the ultimate scalability solution. And throughout this entire journey we never let go of the governed blending message.
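That escalation path can be sketched as a simple tier walk. The tier names, the measurement callback and the latency threshold below are illustrative assumptions, not a Pentaho API; the sketch only shows the "stop at the first tier that is fast enough" logic.

```python
# "Scale as you go": walk the tiers in order of increasing complexity and
# stop at the first one whose measured latency meets the target.
TIERS = [
    "unmaterialized data service",
    "materialized database / EDW",
    "NoSQL or Lucene-based engine",
    "SDR blueprint",
]

def pick_tier(measure, target_s):
    """measure(tier) returns the observed query latency in seconds."""
    for tier in TIERS:
        if measure(tier) <= target_s:
            return tier
    # everything else failed: the ultimate scalability solution
    return TIERS[-1]

# usage with made-up latency measurements
latencies = {"unmaterialized data service": 12.0,
             "materialized database / EDW": 1.5,
             "NoSQL or Lucene-based engine": 0.4,
             "SDR blueprint": 0.2}
chosen = pick_tier(lambda t: latencies[t], target_s=2.0)
```

The user only pays the complexity cost of a tier when the cheaper ones have demonstrably failed the latency target.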

This is an insanely powerful and differentiated message: we allow our customers to start simple, and only go down the more complex routes when needed. When going down such a path, a user knows, accepts and sees the value in the extra complexity needed to address scalability.

Adaptive Execution Layer

The strategy described for the “Logical Data Warehouse” is exactly the one we need for the execution environment. A lot of times customers get hung up on a certain technology without even understanding if they actually need it. Countless times we’ve seen customers asking for Spark without a use case that justifies it. We have to challenge that.

We need to move towards a scenario where the customer doesn’t have to think about technology first. We’ll offer one single approach and ways to scale as needed. If a data integration job works on a single Pentaho Server, why bother with other stacks? If it’s not enough, then making the jump to something like MapReduce or Spark has to be a linear move.

The following diagram shows the Adaptive Execution Layer approach just described

AEL conceptual diagram

Implementation in 7.1 – Spark

For 7.1 we chose Spark as the first engine to implement for AEL. It has seen a lot of adoption, and the fact that it’s not restricted to a map reduce paradigm makes it a good candidate to separate business logic and execution.

How to make it work? This high definition conceptual diagram should help me explain it:

An architectural diagram so beautiful it should almost be roughly correct

We start by generating a PDI Driver for Spark from our own PDI instance. This is a very important starting point because with this methodology we ensure that any plugins we may have developed / installed will work when we run the transformation – we couldn’t let go of Pentaho’s extensibility capabilities.

That driver will be installed on an edge node of the cluster, and that’s what will be responsible for executing the transformation. Note that by using Spark we’re leveraging all its characteristics: we don’t even need a cluster, as we can select whether we want to use Spark standalone or YARN mode, even though I suspect the majority of users will be on YARN mode, leveraging the clustering capabilities.

Runtime flow

One of the main capabilities of AEL is that we don’t need to think about adapting the business logic to the engine; we develop the transformation first and then select where we want to execute it. This is how it will work from within Spoon:

Creating and selecting a Spark run configuration

We created the concept of a Run Configuration. Once we select a run configuration set up to use Spark as the engine, PDI will send the transformation to the edge node and the driver will then execute it.

All transformation steps in PDI will run in AEL-Spark! This was the thinking from the start. To understand how this works, there are 2 fundamental concepts:

  • Some steps are safe to run in parallel, while others are not parallelizable or are not recommended to run on clustered engines such as Spark. Steps that take one row as input and one row as output (calculator, filter, select values, etc.) are all parallelizable; steps that require access to other rows or depend on the position and order of the row set still run on Spark, but have to run on the edge node, which implies a collect of the RDDs (Spark’s datasets) from the nodes. It is what it is. And how do we know that? We simply tell PDI which steps are safe to run in parallel, and which are not
  • Some steps can leverage Spark’s native APIs for performance and optimization. When that’s the case, we can pass to PDI a native implementation of the step, greatly increasing the scalability at possible bottleneck points. Examples of these steps are the Hadoop file inputs, HBase lookups, and many more
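To illustrate the first concept, here’s a small Python sketch (not Pentaho’s actual implementation; the step names, the `parallel_safe` flag and the partition layout are assumptions) of an engine that runs row-wise steps per partition and collects everything to a single node for order-dependent ones:

```python
# Sketch of partition-parallel vs collect-then-run step execution.
from typing import Callable, List

class Step:
    def __init__(self, name: str, fn: Callable[[List[dict]], List[dict]],
                 parallel_safe: bool):
        self.name = name
        self.fn = fn                      # operates on a list of rows
        self.parallel_safe = parallel_safe  # declared per step, as in AEL

def run_pipeline(partitions: List[List[dict]],
                 steps: List[Step]) -> List[List[dict]]:
    for step in steps:
        if step.parallel_safe:
            # row-wise steps (filter, calculator, select values) run
            # independently on every partition
            partitions = [step.fn(part) for part in partitions]
        else:
            # order/position-dependent steps force a "collect" of all
            # partitions onto one node before running
            collected = [row for part in partitions for row in part]
            partitions = [step.fn(collected)]
    return partitions

# usage: a filter stays parallel, a sort forces the collect
parts = [[{"v": 1}, {"v": 4}], [{"v": 3}, {"v": 2}]]
steps = [
    Step("filter v>1", lambda rows: [r for r in rows if r["v"] > 1],
         parallel_safe=True),
    Step("sort by v", lambda rows: sorted(rows, key=lambda r: r["v"]),
         parallel_safe=False),
]
result = run_pipeline(parts, steps)
```

The design point mirrors the bullet above: the engine never refuses a step, it just pays a collect when the step’s declared semantics demand it.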

Feedback please!

Even though running on secured clusters (and leveraging impersonation) is an EE-only capability, AEL is also available in CE. The reason for that is that we want help from the community in testing, hardening, nativizing more steps and even writing more engines for AEL. So go and kick the tires of this thing! (And I’ll surely do a blog post on this alone.)

Visual Data Experience (PDI) Improvements

This is one of my favorite projects. You may be wondering what the real value of having this improved data experience in PDI is, and why this is all that exciting… Let me tell you why: this is the first materialization of something that we hope becomes the way to handle data in Pentaho regardless of where we are. So this thing that we’re building in PDI will eventually make its way to the server… I’d like to throw away all the technicalities that we expose in our server (Analyzer for OLAP, PIR for metadata, PRD for dashboards…) in favor of a single content-driven approach and usability experience. This is surely starting to sound confusing, so I better stop here :p

In the 7.1 release, Pentaho provides new Data Explorer capabilities to further support the following key use cases more completely:

  • Data Inspection: During the process of cleansing, preparing, and onboarding data, organizations often need to validate the quality and consistency of data across sources. Data Explorer enables easier identification of these issues, informing how PDI transformations can be adjusted to deliver clean data. 
  • BI Prototyping: As customers deliver analytics-ready data to business analysts, Data Explorer reduces the iterations between business and IT. Specifically, it enables the validation of metadata models that are required for using Pentaho BA. Models can be created in PDI and tested in Data Explorer, ensuring data sources are analytics-ready when published to BA.

And how? By adding these improvements:

New visualization: Heatgrid

This chart can display 2 measures (metrics) and 2 attributes (categories) at once. Attributes are displayed on the axes and measures are represented by the size and color of the points on the grid. It is most useful for comparing metrics at the “intersection” of 2 dimensions, as seen in the comparisons of quantity and price across combinations of different territories and years below (did I just define what a heat grid is?! No wonder it’s taking me hours to write this post!):
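Conceptually, the data behind a heat grid is just a two-attribute aggregation with two measures per cell. This Python sketch (field names and the sum/average choices are made-up assumptions, not Analyzer’s logic) shows that shape, with one measure driving point size and the other, averaged, driving color:

```python
# Build the cell data for a heat grid: two attributes index the grid,
# two measures feed size and color.
from collections import defaultdict

def heat_grid(rows, attr_x, attr_y, size_measure, color_measure):
    cells = defaultdict(lambda: {"size": 0.0, "color": 0.0, "n": 0})
    for r in rows:
        cell = cells[(r[attr_x], r[attr_y])]
        cell["size"] += r[size_measure]    # e.g. total quantity -> point size
        cell["color"] += r[color_measure]  # summed here, averaged below -> color
        cell["n"] += 1
    return {key: {"size": c["size"], "color": c["color"] / c["n"]}
            for key, c in cells.items()}

# usage: territory x year grid, quantity as size, average price as color
rows = [{"territory": "EMEA", "year": 2016, "quantity": 10, "price": 4.0},
        {"territory": "EMEA", "year": 2016, "quantity": 5, "price": 6.0},
        {"territory": "APAC", "year": 2017, "quantity": 8, "price": 3.0}]
grid = heat_grid(rows, "territory", "year", "quantity", "price")
```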

Look at all those squares!

New visualization: Sunburst

A pie chart on steroids that can show hierarchies. Less useless than a normal pie chart!

Circles are also pretty!

New visualization: Geo Maps

The geo map uses the same auto-geocoding as Analyzer, with out of box ability to plot latitude and longitude pairs, all countries, all country subdivisions (state/province), major cities in select countries, as well as United States counties and postal codes.

Geo Map visualization

Drill down capabilities

When using dimensions in Data Explorer charts or pivot tables, users can now expand hierarchies in order to see the next level of data. This is done by double-clicking a level in the visualization (for instance, double-click a ‘country’ bar in a bar chart to drill down to ‘city’ data).

Drill down in the visualizations…

This can be done through the visualizations or through the labels / axes. Once again, look at this as the beginning of a coherent way to handle data exploration!

… or from where it makes more sense

And this is only the first of a new set of actions we’ll introduce here…

Analysis persistency

In 7.0 these capabilities were a one-time inspection only. Now we’ve taken it a step further – they get persisted with the transformations. You can now use them to validate the data, get insights right on the spot, and make sure everything is lined up to show to the business users.

Analysis persistency indicator

Viz Api 3.0

Every old-timer knows how much disparity we’ve had throughout the stack in terms of offering consistent visualizations. This is not an easy challenge to solve – the reason they are different is that different parts of our stack were created at completely different times and places, so a lot of different technologies were used. An immediate consequence is that we can’t just add a new viz and expect it to be available in several places of the stack.

We’ve been working on a visualization layer, codenamed VizAPI (for a while, actually, but now we’ve reached a point where we can make it available in beta form), that brings this much-needed consistency and consolidation.

Viz API compatible containers

In order to make this effort worthwhile, we needed the following solve order:

  1. Define the VizAPI structure
  2. Implement the VizAPI in several parts of the product
  3. Document and allow users to extend it

And… we did it. We re-implemented all the visualizations in this new VizAPI structure and adapted 3 containers – Analyzer, Ctools and DET (Data Exploration) in PDI – and as a consequence, the look and feel of the visualizations is the same.

Analyzer visualizations are now much better looking _and_ usable

One important note though – migrating users will still default to the “old” VizAPI (yeah, we called it the same as well, isn’t that smart :/ ) so as not to risk interfering with existing installations. In order to test an existing project with the new visualizations, you need to change the VizAPI version number. New installs will default to the new ones.

In order to allow people to include their own visualizations and promote more contributions to Pentaho (I’d love to start seeing more contributions to the marketplace with new and shiny viz’s), we need to really make it easy for people to know how to create them.

And I think we did that! Even though this will require its own blog post, just take a look at the documentation the team prepared for this.

Instructions for how to add new visualizations

You’ll see this documentation has beta written on it. The reason is simple – we decided to put it out there, collect feedback from the community and implement any changes / fine-tunes / etc. before the 8.0 timeframe, when we’ll lock this down, guaranteeing long-term support for new visualizations.

MS HD Insights

HD Insights (HDI) is a hosted Hadoop cluster that is part of Microsoft’s Azure cloud offering. HDI is based on the Hortonworks Data Platform (HDP). One of the major differences between the standard HDP release and HDI’s offering is the storage layer: HDI connects to local cluster storage via HDFS or to Azure Blob Storage (ABS) via the WASB protocol.

We now have a shim that allows us to leverage this cloud offering, something we’ve been seeing get more and more interest in the marketplace.

Hortonworks security support

This is a continuation of the previous release, available on the Enterprise Edition (EE)
Added support for Hadoop user impersonation
Earlier releases of PDI introduced enterprise security for Cloudera, specifically, Kerberos Impersonation for authentication and integration with Apache Sentry for authorization. 

This release of PDI extends these enterprise-level security features to Hortonworks’s Hadoop distribution as well. Kerberos Impersonation is now supported on Hortonworks’s HDP. For authorization, PDI integrates with Apache Ranger, an alternative OSS component included in the HDP security platform.

Data Processing-Enhanced Spark Submit and SparkSQL JDBC

Earlier PDI and BA/Reporting releases broadened access to Spark for querying and preparing data through a dedicated transformation step, Spark Submit, and Spark SQL JDBC.

This release extends these existing features to support additional vendors so that they can be used more widely. Apart from additional vendors, these features have now been certified with a more up-to-date version of Spark, 2.0.

Additional big data infrastructure vendors supported for these functionalities, apart from Cloudera and Hortonworks:
  1. Amazon EMR
  2. MapR
  3. Azure HD Insights

VCS Improvements

Repository agnostic transformations and jobs

Currently, in some specific step interfaces (the sub-transformation one being the most impactful), the ETL dev has to choose, upfront, whether he’s using a file on the file system or in the repository. This prevents us from abstracting the environment where we’re working, so checking things out from git/svn and just importing them is a no-go.

Here’s an example of a step that used this:

The classic way to reference dependent objects
In general, we need to abstract the linkage to other artifacts (sub-jobs and sub-transformations), independent of the repository or file system used.

The linkage needs to work in all environments whether it is a repository (Pentaho, Database, File) or File Based system (kjb and ktr).

The linkage needs to work independently of the execution system: on the Pentaho Server, on a Carte server (with a repository or file-based system), in MapReduce, and in future execution systems as part of the Adaptive Execution Layer (AEL).

So we turned this into something much simpler:

The current approach to define dependencies
We just define where the transformation lives. This may seem a “what, just this??” moment, but now we can work locally or remotely, check into a repository, and even automate promotion and control the lifecycle between different installation environments. I’m absolutely sure that existing users will value this a lot (as we can deprecate the stupid file-based repository).

KTR / KJB XML format

We did something very simple (in concept), but very useful. While we absolutely don’t recommend playing around with the job and transformation files (they are plain old XML files), we guaranteed that they are properly indented. Why? Because when you use a version control system (git / svn, don’t care which as long as you USE one!), you can easily identify what changed from version to version.
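A quick way to see why indentation matters for diffs: a transformation serialized as one long line shows up in a VCS as a single changed line, while indented XML diffs element by element. This standard-library sketch pretty-prints such a blob; the `<transformation>` snippet is a made-up, minimal stand-in, not a real KTR.

```python
# Pretty-print a flattened XML blob so a VCS can diff it per element.
import xml.dom.minidom

flat = ("<transformation><info><name>load_sales</name></info>"
        "<step><name>Filter rows</name><type>FilterRows</type></step>"
        "</transformation>")

# One line in, one element per line out: future edits now diff cleanly.
pretty = xml.dom.minidom.parseString(flat).toprettyxml(indent="  ")
print(pretty)
```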

Repository performance improvements

We want you to use the Pentaho Repository. And until now, performance while browsing that repository from Spoon was crap (there’s no other way to say it!). We addressed that – it’s now about 100x faster to browse and open files from the repository.

Operations Mart Updates

Also known as the Ops Mart, available in EE. It used to work. Then it stopped working. Now it’s working again. Yay :/

I’ll skip this one. I hate it. We’re working on a different way to handle monitoring on our product, and at scale

Other Data Integration Improvements

Apart from all the big new features above, there are some smaller data integration enhancements added to the product to make building data pipelines with Pentaho easier.

Metadata Injection Enhancement

Metadata Injection enables creating generalized ETL transformations whose behavior can be changed at run-time, significantly improving data integration developer agility and productivity.
In this release, a new constant option has been added to Metadata Injection, which helps make steps more dynamic.
This functionality was also extended to the Analytic Query and Dimension Lookup/Update steps, making them highly dynamic. This will improve the Data Warehouse & Customer 360 blueprints and similar analytic data pipelines.

Lineage Collection Enhancement

Customers can now configure the location of the lineage output, with the ability to write to a VFS location. This will help customers maintain lineage in clustered / transient-node environments, such as Pentaho MapReduce. Lineage information helps with customers’ data compliance and security needs.

XML Input Step Enhancement

The XML Input Stream (StAX) step has been updated to receive XML from a previous step. This makes it easier to develop XML processing in a data pipeline when you are working with XML data.

New Mobile approach (and the deprecation of Pentaho Mobile)

We used to have a mobile-specific plugin, introduced in a previous Pentaho release, that enabled touch gestures to work with Analyzer.

But while it sounded good, in fact it didn’t work as we’d expected. The fact that we had to develop and maintain a completely separate access path to information caused that mobile plugin to become very outdated.

To complement that, the maturity of the browsers on mobile devices and the increased power of tablets make it possible for Pentaho reports and analytic views to be accessed directly without any specialized mobile interface. Thus, we are deprecating the Pentaho mobile plug-in and investing in the responsive capabilities of the interface.

Does that sound bad? Actually it's not – just use your tablet to access your Pentaho EE server; it looks great!

Pentaho User Console Updates

Sapphire theme in PUC

Starting in Pentaho 7.1, Onyx is deprecated and removed from the list of available themes in PUC. In addition, the new "Sapphire" theme introduced in 7.0 becomes PUC's default selected theme as of 7.1, with Crystal available as the alternative.

Moreover, a refreshed login screen has been implemented in Pentaho 7.1, based on the new Sapphire theme introduced in Pentaho 7.0. This was already in 7.0 CE and is now the default for EE as well.


As usual, you can get EE from here and CE from here

This is a spectacular release! I should be celebrating! But instead, it's 8pm, I'm stuck in the office writing this blog post, and already very, very stressed because all my 8.0 work is already piling up in my inbox…

I’m out, have fun!



Pedro Alves on Business Intelligence

PentahoDay 2017 – Brazil, Curitiba, May 11 and 12



After a pause to rest in 2016, the biggest Pentaho event organized by the community is back: two days, May 11 and 12, with dozens of presentations, use cases, and even hands-on mini-labs, in Curitiba, Brazil.

Pentaho Day speakers

400 or more attendees are expected at this huge event. It's really amazing, so if you're anywhere near South America, be there!


Register here


Building Pentaho Platform from source and debugging it

After all, if it's open source, that means we can compile it, right?

I’m sure you’ve guessed by now this is not an original image from me even though I’ve been told I’m very good at drawing stuff – and I always believe my daughter!

Sure – but sometimes it's not as easy as it seems. However, we're doing a huge consolidation effort to streamline all our build processes. Historically, each project, especially the older ones (Kettle, Mondrian, PRD, CTools), used its own build method, depending on the author's personal stance (and boy, there are some heavy opinions around here…).

Personally, I come from the CCLJTMHIWAPMIS school of thought (for the ones not familiar with it, the acronym means Couldn't Care Less Just Tell Me How It Works And Please Make It Simple, especially popular among lazy Portuguese people).

And we’re now doing this, slowly and surely, to all projects, as you can see from browsing through Pentaho’s Github.

So let's take a look at an example – building Pentaho Platform from source. Please note that we'll try to make sure each project contains the correct instructions. Also, this won't work for all versions, as we don't backport these changes; in the case of Pentaho Platform, this works for master and will appear in 7.1. Other projects will have their own timelines.

Compiling Pentaho Platform

1. Clone it from source

Ok, so step one, clone it from source:

$ git clone

(or use git:// if you already have a user)

2. Set up your m2 config right

Before compiling, you need to set some stuff in your Maven settings file. In your home directory, under the .m2 folder, place this settings file. If you already have an m2 settings file, that means you're probably familiar with Maven in the first place and will know how to merge the two. Don't ask me, I have no clue.

If you’re wondering why we need a specific settings file… I wonder too, but since my laziness is bigger than my curiosity (CCLJTMHIWAPMIS, remember?) I think I zoned out when they were explaining it to me and now I forgot.

3. Build it

This one is easy!

$ mvn clean install

or the equivalent without the tests:

$ mvn clean package -Dmaven.test.skip=true

(note: -Dmaven.test.skip=true skips both compiling and running the tests; -DskipTests would compile them but not run them)

If all goes well, you should see something like:

[INFO] — maven-site-plugin:3.4:attach-descriptor (attach-site-descriptor) @ pentaho-server-ce —
[INFO] — maven-assembly-plugin:3.0.0:single (assembly_package) @ pentaho-server-ce —
[INFO] Building zip: /Users/pedro/tex/pentaho/pentaho-platform-master/assemblies/pentaho-server/target/
[INFO] ————————————————————————
[INFO] Reactor Summary:
[INFO] Pentaho BI Platform Community Edition ………….. SUCCESS [  4.461 s]
[INFO] pentaho-platform-api …………………………. SUCCESS [ 10.149 s]
[INFO] pentaho-platform-core ………………………… SUCCESS [ 19.819 s]
[INFO] pentaho-platform-repository …………………… SUCCESS [  2.210 s]
[INFO] pentaho-platform-scheduler ……………………. SUCCESS [  0.172 s]
[INFO] pentaho-platform-build-utils ………………….. SUCCESS [  1.695 s]
[INFO] pentaho-platform-extensions …………………… SUCCESS [01:22 min]
[INFO] pentaho-user-console …………………………. SUCCESS [ 19.596 s]
[INFO] Platform assemblies ………………………….. SUCCESS [  0.059 s]
[INFO] pentaho-user-console-package ………………….. SUCCESS [ 16.399 s]
[INFO] pentaho-samples ……………………………… SUCCESS [  1.159 s]
[INFO] pentaho-plugin-samples ……………………….. SUCCESS [ 11.129 s]
[INFO] pentaho-war …………………………………. SUCCESS [ 45.434 s]
[INFO] pentaho-style ……………………………….. SUCCESS [  0.742 s]
[INFO] pentaho-data ………………………………… SUCCESS [  0.211 s]
[INFO] pentaho-solutions ……………………………. SUCCESS [31:31 min]
[INFO] pentaho-server-manual-ce ……………………… SUCCESS [01:15 min]
[INFO] pentaho-server-ce ……………………………. SUCCESS [01:51 min]
[INFO] ————————————————————————
[INFO] ————————————————————————
[INFO] Total time: 38:36 min
[INFO] Finished at: 2017-03-31T15:36:43+01:00
[INFO] Final Memory: 102M/1084M
[INFO] ————————————————————————

There you go! In the end you should see a dist file like assemblies/pentaho-server/target/pentaho-server-ce–. Unzip it, run it, done.

Debugging / inspecting the code

The next thing you'd probably want is to be able to inspect and debug the code. This is actually pretty simple, and common to all Java projects. It goes something like this:

1. Open the project in a Java IDE

Since we use Maven, this is pretty straightforward – simply navigate to the folder and open the project as a Maven project.

In theory, any Java IDE would do, but I had some issues with NetBeans, given that it uses an outdated version of Maven, and ended up switching to IntelliJ IDEA.
I actually took this screenshot of IntelliJ myself, so no need to give credits to anyone

2. Define a remote run configuration

Now you need to define a remote debug configuration. It works pretty much the same in all IDEs. Make sure it points to the Java Debug Wire Protocol (JDWP) port you'll be using in the application you're attaching to.
Setting up a debug configuration

3. Make sure you start your application with JDWP enabled

This sounds complex, but really isn’t. Just make sure your java command includes the following options:
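The option line itself didn't survive in this copy of the post; the standard JDWP agent flag would presumably look like this (the port number 5005 and the use of CATALINA_OPTS are my assumptions – adjust for your setup):

```shell
# Debug-mode JVM flag: the JVM listens on a TCP socket and the IDE attaches.
# suspend=n lets the server start without waiting for a debugger to connect.
JDWP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005"

# For a Tomcat-hosted Pentaho Server, this would typically be passed via
# CATALINA_OPTS before running the startup script (an assumption about your setup):
export CATALINA_OPTS="$JDWP_OPTS"
```

The port you put in `address=` must match the one you set in the IDE's remote debug configuration.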


For the Pentaho platform it's even easier, as you can simply run

4. Once the server / application is running, simply attach to it

And from this point on, any breakpoints should be intercepted

Inspecting and debugging the code

Submitting your fixes

Now that you know how to compile and debug the code, you're a contributor in the works! Let's imagine you add some new functionality or fix a bug, and you want to send it back to us (you do, right???). Here are the steps you need – they may seem extensive but it's really pretty much the normal stuff:
  1. Create a jira
  2. Clone the repository
  3. Implement the improvement / fixes in your repository
  4. Make sure to include a unit test for it
  5. Separate formatting-only commits from actual commits. If your commit reformats a Java class, that reformat needs to be its own commit with [CHECKSTYLE] as the commit comment. Your main changes, including your test case, should be in a single commit.
  6. Get the code formatting style template for your IDE and update the year in the copyright header
  7. Issue a pull request against the project with [JIRA-ID] as the start of the commit comment
  8. For visibility, add that PR to the jira you created, email me, tweet, whatever it takes. I won't promise it will be fast, but I promise we'll look!
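The commit part of the steps above can be sketched on the command line – the JIRA id, file names, and messages are placeholders, and the tiny throwaway repo is fabricated here only so the sketch is self-contained (in real life you'd work inside your clone):

```shell
# Illustrative only - fabricate a tiny repo so the sketch runs end to end
git init -q contrib-demo && cd contrib-demo
git config user.email "you@example.com" && git config user.name "You"
echo "base" > Foo.java && git add Foo.java && git commit -qm "base"

git checkout -qb PDI-12345-my-fix          # hypothetical JIRA id

# Formatting-only changes go in their own [CHECKSTYLE] commit:
echo "reformatted" > Foo.java
git commit -qam "[CHECKSTYLE] Reformat Foo.java"

# Main change (code + unit test) in a single commit, JIRA id first:
echo "fix plus test" > Foo.java
git commit -qam "[PDI-12345] Short description of the fix"

git log --oneline   # then: git push origin PDI-12345-my-fix and open the PR
```

The `[JIRA-ID]` prefix on the main commit is what ties the pull request back to the jira you created in step 1.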

Hope this is useful!



From 0 to a full blown Pentaho 7 spectacular dashboard in 60m

CBF2 is awesome? Hell yeah!

I've recently been blogging about CBF2 and talking about how great it is. But I admit that, just by looking at the blog post, some people may not take it seriously, assuming it's too complex. It's not.

What you’ll get – in less than one hour

Today I did a demo on a topic that I'm extremely passionate about: horology. With the help of Miguel Leite, one of our UX wizards here, we did a one-day push to build this project (buyer beware, this was a looooong day…).

The result? Absolutely spectacular, completely worth the effort: 


And all this fueled by the amazingly powerful data services + annotations:


You can get this done in even less than one hour; most of my time went into downloading stuff, and I kept getting distracted and forgetting to go back to what I was doing. I'm absolutely sure you can do it in much less!

So, let’s go!

Pre requisites

Here’s what you need:
  • Any operating system, and a machine with at least 8 GB of RAM
  • Docker configured with at least 4 GB (get it from here)
  • Git (or any UI for git)
  • Not being afraid to launch a terminal window…

C’mon, it’s not asking much, is it?

Getting it all working in just 6 steps

1. Create a directory for pentaho and CBF

Create a directory called pentaho, open a terminal there and clone CBF2

$ git clone

You should now have the directory structure described in the CBF2 blog post.

2. Download Pentaho 7.0

Under the software directory, create another folder (I like to use the version / build number as its name) and put Pentaho there, CE or EE:
  • Get CE from
  • Get EE from the Pentaho support portal (customers only). If you download patches, they will be automatically applied; in this case you also need to put your license files under the cbf2/licenses/ folder.

3. Get the horlogery-demo project

Clone the horlogery-demo project under the cbf2/projects directory:

$ git clone

4. Do the CBF2 magic

Under the cbf2/ folder you have the magic script, built by pink unicorns. Go to that dir and…
  1. Execute cbf2 and press [A] to add a new image and select the server you downloaded. If you’re using EE you’ll need to accept the license agreement. A new image should be available
  2. Execute cbf2 and press [C] to create a new project. Select the horlogery-demo project and the image created previously.
  3. There’s no 3

5. Start using it!

If everything went as expected, you should be seeing something like this:

pedro@orion:~/tex/pentaho/cbf2/projects/horlogery-demo (master *) $ cbf2

Core Images available:


[0] baserver-ce-

[1] baserver-ee-

Core containers available:


Project images available:


[2] pdu-horlogery-demo-baserver-ce-

[3] pdu-horlogery-demo-baserver-ee-

Project containers available:


> Select an entry number, [A] to add new image or [C] to create new project:

Select the project you want, press [L]  to launch it and it will soon be available for you to start exploring!

(Note that depending on the operating system, the docker IP may differ; I can't help there)


6. Next steps? 

From this point on it’s you writing your own project and success story! And I’m going to get some sleep, since I had nearly none last night!! :p

Have fun!



[Marketplace Spotlight] BTable 3.x

Marketplace spotlight time! This time for an amazing contribution by our Italian friends from

Massimo Bonometto just blogged about the new BTable release, that I shamelessly report here:

Hats off, Massimo!


Repost from Massimo’s blog post

In January 2017, a new BTable version was released to the Pentaho community.
As always, it is available from the Pentaho Marketplace.

Note about BTable version numbering: Pentaho 7.0 uses a newer version of the Spring framework. This is why we are forced to maintain 2 different versions of BTable. BTable 3.0 works with Pentaho 5.x and 6.x, while BTable 3.6 is the one for Pentaho 7.x.

What’s New?

In the following I’m going to give a brief description of the most important features introduced with this new version. 

Styling And Alarms
We introduced the concept of BTable Templates. A template is a JSON file with a .bttemplate suffix, usually living inside the Pentaho Repository, whose structure is composed of 3 sections:
  • alarmRules: defines the alarm logic for each measure;
  • inlineCss: contains CSS statements added dynamically to one single BTable;
  • externalCss: similar to the previous one, but references an external CSS file.
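To make the structure concrete, here's a sketch of what such a .bttemplate file could look like – the three top-level section names are the ones listed above, but every field inside them is an illustrative assumption, not BTable's actual schema:

```json
{
  "alarmRules": {
    "Sales": [
      { "condition": "value < 1000",  "cssClass": "alarm-red"   },
      { "condition": "value >= 1000", "cssClass": "alarm-green" }
    ]
  },
  "inlineCss": ".alarm-red { color: red; } .alarm-green { color: green; }",
  "externalCss": "/public/BTableCustom/alarms.css"
}
```

Since alarm styling is CSS-based (see below), the alarm rules only need to map measure conditions to CSS classes, and the two CSS sections supply the styling.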

Alarm styling is based on CSS and gives developers the opportunity to create very nice results. 

The Template is a BTable property and can be set inside CDE or changed in BTable Analyzer; that is, developers can create, for example, many templates with different alarm logics, and users can dynamically change templates in order to evaluate their effect.

It is possible to set the default template for all BTables, and a default template for each Mondrian cube. Just create a new folder named /public/BTableCustom and add:

  • Default_Mondrian Catalog_Mondrian Cube.bttemplate (for example Default_SteelWheels_SteelWheelsSales): used as the default for BTables on a specific Mondrian cube;
  • Default.bttemplate: used as the default when a specific template for the cube is not found.

Show Table Option 
I'm sure that most of you love to spend time adding filters to CDE dashboards. Well, I really hate it!!! (In particular when a customer asks to add one more filter after I've finished the dashboard.)
This is why I had the idea to use a BTable just for filter selection. I find it a really neat trick.

In the BTable With Templates example I show you how you can add a BTable just for filter selection and then synchronize 2 other BTables.
The same can easily be done with other components based on MDX queries.

Using BTable Filter Panel From External Applications 
Sometimes, in your custom application, you need to work with dimension member selections (for example, for profiling purposes). You can do it by working directly on the database, but I found it very useful to create a way to do it through the BTable Filters Panel. Basically, you can invoke BTable passing an endpoint as a parameter; when the user saves the filter selections, the endpoint is invoked.
If you are curious about this, use the comments on this post and I will do my best to explain it in detail in another post.

Filter On Dimension Members 

When the user selects one dimension inside the Filter Panel, the dimension members shown are filtered based on the filter selections made for other dimensions. This is the default behaviour, but it can optionally be changed by users.

Show Toolbar Option 
It is now possible to show a toolbar with the most common actions on top of a BTable. The toolbar is active by default when you start from BTable Analyzer, and inactive by default in a CDE dashboard.
Users can toggle the toolbar visibility.

Since its first version, BTable has had the Reset command to reload the initial state. We have now also added a Back button to the toolbar, which moves the BTable back to previous states.

Show Zeros Option
In the OLAP/MDX world it is common to deal with the NON EMPTY option, but it happens frequently that measure fields inside fact tables contain zero values.
This option, active by default, removes rows and columns where all values are null or zero.


We made some improvements in order to speed up BTable rendering. In my tests I'm able to list more than 300,000 rows in a reasonable amount of time.

New posts with further details will follow.



Doing GeoLocation in PDI – Pentaho Data Integration (Kettle)


Geo Location

Geo location is something we often need in ETL work. And while we had a step that worked in PDI 5.x and earlier releases, we just noticed it’s not currently working.

Until this morning, that is :p

I just forked Matt’s initial project and applied the relevant changes to make it compatible with Pentaho 6+

The basics

Well, easy to understand… We have an IP address, we want to know where it comes from!
Geolocation transformation – Let me see if it finds out where I am…
Once I execute this, I get the following result:
Yep, this is where I am…
I am indeed in Porto Salvo, Portugal, so this is right. Can’t get any easier than this!

Making it work

So, how do you make this work? First, you have to get the plugin from the PDI Marketplace. Just go ahead and install it.
PDI Marketplace – Get your goodies from here
After installing it and restarting PDI, you'll see the GeoIP Lookup step in the Lookup folder. Configuring it is straightforward: point it to the stream field containing the IP address, point it to the IP database files, and specify which fields you want back:
Configuring the step

Getting the IP Database files

You need to get the files from MaxMind, and from my experience these guys do a great job here. They have some great commercial offerings but also a GeoLite database for country and city location. You can get them from here
Getting the GeoIP data files
And you should be done! This even works great in a MapReduce job.


New WEKA releases: 3.6.15, 3.8.1 and 3.9.1


The Weka team is on fire. New releases available for download from the Weka homepage:

Weka 3.8.1 – stable version. 

It is available as ZIP, with Win32 installer, Win32 installer incl. JRE 1.8.0_112, Win64 installer, Win64 installer incl. 64 bit JRE 1.8.0_112 and Mac OS X application with Oracle 64 bit JRE 1.8.0_112.

Weka 3.9.1 – development version

It is available as ZIP, with Win32 installer, Win32 installer incl. JRE 1.8.0_112, Win64 installer, Win64 installer incl. 64 bit JRE 1.8.0_112 and Mac OS X application with Oracle 64 bit JRE 1.8.0_112.

Weka 3.6.15 – stable book 3rd edition version

It is available as ZIP, with Win32 installer, Win32 installer incl. JRE 1.8.0_112, Win64 installer, Win64 installer incl. 64 bit JRE 1.8.0_112 and Mac OS X application with Oracle 64 bit JRE 1.8.0_112.

    Stable 3.8 receives bug fixes and new features that do not include breaking API changes and maintain serialized model compatibility. 3.9 (development) receives bug fixes and new features that might include breaking API changes and/or render models serialized using earlier versions incompatible.

    NOTE: 3.6.15 is the final release of stable-3-6.

    Weka homepage:

    Pentaho data mining community documentation:

    Packages for Weka>=3.7.2 can be browsed online at:

What’s new in 3.8.1/3.9.1?

    Some highlights

    In core weka:

  • Package manager now handles redirects generated by SourceForge
  • Package manager now employs a new class loading mechanism that attempts to avoid third-party library clashes by isolating the third-party libraries in each package
  • new RelationNameModifier, SendToPerspective, WriteWekaLog, Job, StorePropertiesInEnvironment, SetPropertiesFromEnvironment, WriteDataToResult and GetDataFromResult steps in Knowledge Flow
  • RandomForest now has an option for computing the mean impurity decrease variable importance scores
  • JRip now prunes redundant numeric attribute-value tests from rules
  • Knowledge Flow now offers an additional executor service that uses a single worker thread; steps can, if necessary, declare programmatically that they should run in the single-threaded executor.
  • GUIs with result lists now support multi-entry delete
  • GUIs now support copying/pasting of array configurations to/from the clipboard

    In packages:

  • Multi-class FLDA in the discriminantAnalysis package
  • New implementations in the ensemblesOfNestedDichotomies package
  • distributedWekaBase now includes the latest version of Ted Dunning’s t-digest quantile estimator, bringing a factor of 4 speedup over the old implementation
  • New streamingUnivariateStats package
  • RPlugin package updated to support the latest version of MLR
  • New wekaDeepLearning4j package – provides a MLP classifier built using the DL4J library. Can work with either CPU-based or GPU-based native libraries
  • New logarithmicErrorMetrics package
  • New RankCorrelation package, courtesy of Quan Sun. Provides rank correlation metrics, Kendall tau and Spearman rho, for evaluating regression schemes
  • New AffectiveTweets package, courtesy of Felipe Bravom. Provides text filters for sentiment analysis of tweets
  • New AnalogicalModeling package, courtesy of Nathan Glenn. Provides an exemplar-based approach to modeling
  • New MultiObjectiveEvolutionaryFuzzyClassifier package, courtesy of Carlos Martinez Cortes. Provides a fuzzy rule-based classifier
  • New MultiObjectiveEvolutionarySearch package, courtesy of Carlos Martinez Cortes. Provides a search method that uses the ENORA multi-objective evolutionary algorithm

    As usual, for a complete list of changes refer to the changelogs.


Announcing Pentaho 7.0 (available mid-November)


I’ll go straight to it – This is the most spectacular release ever!
This previous sentence would be even more meaningful if I hadn't been deeply involved in this release, and by "deeply involved" I actually mean that sometimes I was able to sneak into the development rooms, and a few times speak to a few of the devs before the heads of engineering kicked me out of the room… but still, the janitor sometimes patted me on the back when he saw me crying in a corner and said that someone must listen to me, so I'm taking his word for it…
Anyway, here's the announcement – it will be available for download in mid-November!

The Year of the Product

At the beginning of the year, our CEO, Quentin Gallivan, gave us a challenge: "Make this the year of the product!". In CEO language, this basically means I'm gonna be fired if we don't make good progress on the journey to improve usability and ease of use! That's motivation in my book!
So here's the main announcement of Pentaho 7.0, which will be made available for download in mid-November. These are the main release highlights:
Figure 1: 7.0 Release Highlights
I’m going through this in a somewhat random order.

Admin Simplification

The Pentaho Server

This has been a long-term goal internally, and we've been testing it in CE since 6.1. The BA Server / DI Server distinction is no more (actually, I don't make it a secret that I think it should never have been created, but that's just my sweet person talking…).
We now have one single artifact: the Pentaho Server, with full combined BA/DI capabilities. It's important to note that this doesn't change the deployment topology strategy – there will be plenty of cases, especially in larger organizations, where it will make sense to have multiple servers, some dedicated to the more interactive, BA-style operations and others optimized for heavy-duty data integration work.

A simplified architecture

It's a fact that our product is architecturally complex; not because we want it to be – it's a consequence of us being the only vendor with a platform that works all the way through the data pipeline, from data integration to business analytics.
Figure 2: The data pipeline
We’re still faithful to the original founders’ vision: Offer a unified platform throughout all these stages, and we’ve been tremendously successful at that. But we believe it’s possible to combine this vision with an improved – and much simplified – user experience. And it’s why we’re doing this.
Some of you have been around long enough to recognize this image:
Figure 3: Oh my god, my eyes!!!
We’re moving to a much simpler (conceptual) approach:
Figure 4: Pentaho Architecture
This means that going forward, we want to focus our platform on two main cornerstones: PDI and the Pentaho Server. And we’re working on making the two interact as seamlessly as possible.
Please note that this doesn't mean we're not counting on other areas (Mondrian, PRD, CTools, I'm looking at you) – on the contrary. They'll keep being a fundamental part of our platform, but they will take more of a backstage role, keeping all the wheels turning, instead of taking a front seat.

Connecting PDI to the Pentaho Server

One of the first materializations of this concept was the work done on connecting from the PDI (spoon) to the Pentaho Server. It’s now a much more streamlined experience:
Figure 5: Pentaho Repository Connection
Once defined, we’ll be able to get a new login experience:
Figure 6: Logging in to the Pentaho Server
Once done, there will be an indication of where we're connected to, plus a few simpler ways to handle those connections:
Figure 7: Identifying the current connection
And remember when I mentioned the simplified architecture? Now both the Data Integration user and the Business user have access to the same view:
Figure 8: Different views over the same ecosystem
A lot of optimizations were done here to allow a smoother experience:
  • Repository performance optimizations (and we still want to improve the browsing / open / save experience)
  • Versioning is turned off by default
  • That somewhat annoying commit message every time we save is now also turned off by default
  • Every connection dialog now connects to port 8080 and the pentaho/ webapp, instead of 9080 and pentaho-di, which has now been somewhat discontinued (even though for migration purposes we still hand out this artifact)


It's fundamental to note that existing installations with the BA / DI configuration won't turn into some kind of legacy scenario; this configuration is still supported and, much to the contrary, is still the recommended topology. This is about capabilities, not about installation.
In 7.0, for migration purposes, we’ll still have the baserver / diserver artifacts for upgrades only.

Analytics Anywhere

A completely new approach

Ok, so this is absolutely huge! You’re certainly familiar with the classic data pipeline that describes most of the market positioning / product placement:
Figure 9: Data pipeline
In this scenario we identify three different funnels: Engineering, Data Preparation and Analytics. But we started thinking about this and got to the somewhat obvious conclusion that this doesn’t actually make a lot of sense. The truth is that the need for Analytics happens anywhere in the data pipeline.
By being one of the few products that work in all these 3 areas, we're in a unique position to completely break this model and deliver analytics anywhere in the data pipeline:
Figure 10: Analytics Anywhere in the data pipeline
And 7.0 is the first step in a journey that aims to break these boundaries while working towards a consolidated UX; and the first materialization is bringing analytics to PDI…

An EE feature

This is huge. Really huge! And let me say from the beginning that this feature is EE only. Why? Because according to our CE/EE framework this is where it falls: it's not an engine-level functionality, and while it doesn't prevent any work from being done, it drastically accelerates the time to results.
And just a word on this – even though I'm the Community guy, and one of the biggest advocates of the advantages of having a great CE release, I'm also a huge proponent that a good, well-thought-out balance has to exist between the CE and EE versions. This balance is never easy to achieve – we know we can't be 100% open source, and we know we'll absolutely lose this battle if we're completely closed source. The sweet spot is somewhere in the middle.

Entry point

Starting from 7.0, you'll see a new flyover in PDI with 2 buttons:
  • Run and inspect data
  • Inspect data
Figure 11: Analytics entry point
The difference between the two is subtle but will grow in importance over time: the first option always runs the transformation and gets the set of data to inspect, while the second option gets data from the cache if it's available; if not, it acts like the first.

A new Data Inspection experience

If we click any of those options, we should land in a completely new Data Inspection experience:
Figure 12: A new Data Inspection experience
The first thing you’ll see here is obviously the most immediate kind of information you’ll expect to see: A table that shows the data that’s flowing on the transformation stream. However, there’s a lot more that you can do from this point on, and even without moving away from this initial visualization you can select which columns to see and sort the available data.
It's important to note that this may not be (and most likely won't be) the entire data set. This is about data inspection and spot-checking; it looks at the stream of data that passes through PDI and uses a limited amount of data. This limit is still to be determined, but should be in the range of thousands of rows. This (configurable) number will go up over time, never compromising usability and speed of analysis.
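The mechanics of that cap are easy to picture. This is not Pentaho's implementation – just a minimal sketch of the idea of spot-checking a bounded sample from an otherwise unbounded row stream (the function name and the limit are made up):

```python
import itertools

def inspect_sample(stream, limit=10_000):
    """Take at most `limit` rows from a (possibly endless) row stream,
    the way the Data Inspection experience caps what it pulls from a
    running transformation."""
    return list(itertools.islice(stream, limit))

# A fake endless stream of PDI-style rows:
rows = ({"id": i, "value": i * 2} for i in itertools.count())

sample = inspect_sample(rows, limit=1000)
print(len(sample))   # -> 1000
print(sample[0])     # -> {'id': 0, 'value': 0}
```

The point is that the inspection cost depends only on the cap, not on how much data the transformation would ultimately produce.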

Other ways to visualize the data

So we can see a table. Not exactly exciting so far, even though it's much more legible and useful than the good-ol' preview window. But this is just one of the possible ways to look at the data:
Figure 13: Visualization selector
So as you see we can have several ways to look at the data:
  • Table
  • Pivot
  • Bar and stacked bar charts
  • Column and stacked column charts
  • Line and area charts
  • Pie and doughnut chart
  • Scatter and bubble charts
One thing you'll notice is that you're not restricted to working with a single visualization; it's possible to create different tabs so you can do other kinds of analysis:
Figure 14: Working with different visualizations simultaneously
Here’s an example of getting this information with a different visualization:
Figure 15: Stacked bar chart

Chart tweaks and improvements

The previous screenshot showed a bar chart. And you have no idea how much work went into these visualizations… You’re surely thinking “it’s a stupid bar chart, I’ve seen hundreds just like this one”. Well, let me tell you – you’re wrong. This is not just a bar chart – this is an astonishing bar chart, with a lot of attention given to the details.
Let me go through some areas where the team did a great work:

A new color palette

Figure 16: A completely useless chart just to prove a point
From the start we had a goal: this experience had to be pleasant for the user. It had to be pretty, and a great color balance is absolutely fundamental to that. However, it’s really not an easy task to come up with a generic color palette for the visualizations that stays pleasant to the eyes even with a lot of categories.
But I think that objective has been achieved. If you look at the previous, utterly stupid pie chart with tons of categories, you’ll have to agree that even with lots and lots of colors the overall color balance is still very easy on the eyes – a great balance between beauty and legibility.

Screen real estate optimization

How many times have you seen a chart with so many bars that they seemed thinner than a pixel? Or a single bar that made a dashboard look like your garage door?
Well, not here…
Figure 17: Not that many bars on screen
We always try to leverage as much screen real estate as we possibly can while guarding against the edge cases; in the case of bar charts, bars have a maximum width so they don’t become stupidly large.
But the opposite is also true: we defined a minimum width for the visual elements on screen. When that minimum is reached, instead of sacrificing legibility by letting bars shrink to a sliver, we simply stop at that size and let the chart overflow on its categorical axis.
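The sizing rule described above can be sketched in a few lines. This is an illustrative sketch, not Pentaho’s or CCC’s actual code, and the pixel thresholds are invented for the example:

```python
# Hypothetical bounds for the example; the real values are CCC's concern.
MIN_BAR_PX = 6    # below this, bars become illegible
MAX_BAR_PX = 80   # above this, a bar starts to look like a garage door

def layout_bars(plot_width: float, n_bars: int,
                min_px: float = MIN_BAR_PX, max_px: float = MAX_BAR_PX):
    """Return (bar_width, needs_scroll) for n_bars laid out in plot_width px."""
    if n_bars == 0:
        return 0.0, False
    natural = plot_width / n_bars
    if natural > max_px:       # few bars: cap the width instead of stretching
        return max_px, False
    if natural < min_px:       # too many bars: keep the minimum and overflow,
        return min_px, True    # which surfaces as a scroll bar on the axis
    return natural, False
```

With a scroll flag like this, the chart keeps every bar readable and simply pushes the excess categories off-screen, which is exactly the behavior shown in the next screenshot.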
Figure 18: Many more bars, but still readable!
This screenshot shows exactly that: more decades than fit on screen result in the scroll bar you see on the right.

Axis label legibility

A bit related to the previous item is how we treat the axis labels. We try to show them the “best” way possible… If the axis labels fit on screen, we put them in their natural position, horizontally:
Figure 19: Horizontally placed axis labels
But if we see there’s not enough room and they would overlap, we automatically slant them; and if they still don’t fit on screen, we don’t let them overflow past a certain point (I don’t recall the exact rule, but it’s something like never going over 20% of the chart height/width).
In those cases, a tooltip lets you see the full label.
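As a rough sketch of that heuristic (the numbers and structure are assumptions for illustration, not the actual implementation): keep labels horizontal when they fit their slot, slant them when they would overlap, and truncate them once they would consume more than a fraction of the chart:

```python
def place_label(label_px: float, slot_px: float, chart_height_px: float,
                max_fraction: float = 0.20):
    """Decide how to render one categorical axis label.

    Returns (orientation, truncated); a truncated label would get a
    tooltip carrying the full text.
    """
    if label_px <= slot_px:
        return "horizontal", False
    # Slanted labels consume vertical space, so cap them at a fraction
    # (the post mentions something like 20%) of the chart height.
    budget = chart_height_px * max_fraction
    return "slanted", label_px > budget
```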
Figure 20: Lots of wide labeled categories
This is obviously very hard to guarantee under every condition, but so far I think it’s a huge improvement.

Chart legends

This is one of those things that are so obvious that we ask ourselves why we didn’t do it from the start… So, legends are good. They provide information… So yeah, we have them, as shown on this stacked chart of votes by type (I’m parsing an IMDB ratings file I grabbed from the internet):
Figure 21: Cool and useful looking legends, I salute you
However, suppose that instead of breaking down by decade, I want a breakdown by year. That’s a lot of legend entries, right?
Figure 22: Hum, formerly cool and useful looking legends, where did you go??
Nope, they’re gone. The rationale here is simple: with a lot of series the legends become completely useless, and they even risk stealing precious screen real estate – how many times have we seen legends taking more space than the chart itself?
So we applied an extremely advanced algorithm here. Heavy math, guys: a predictive univariate model based on whether the font is monospaced, the size of the strings, the number of elements, the width of the chart, the number of lines the legend would use and… Nah, I’m kidding, we didn’t bother – we just hide the damn legend if it has more than 20 elements. Simple and effective! :p
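The “extremely advanced algorithm” above, written out as code (the threshold of 20 comes from the text; everything else is just the joke made literal):

```python
# Hide the legend once the series count passes the threshold.
MAX_LEGEND_ITEMS = 20

def legend_visible(n_series: int, max_items: int = MAX_LEGEND_ITEMS) -> bool:
    """True while a legend still carries information; False when it
    would just eat screen real estate."""
    return n_series <= max_items
```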

Tooltips

You probably noticed them by now, but simply put, the tooltips look great and give you the information you need
Figure 23: A great looking tooltip

And a few other minor but very important things…

There were other minor (?) interventions that really work well with all the items I mentioned previously: a correct choice of font family, size and color; a balanced chart configuration for gridlines; the placement and orientation of the axis titles; and more. Everything works together to provide a combined result that I personally think is nothing short of amazing, and one that makes me extremely proud of the team.
For this go-round we consciously decided not to give the user the ability to customize chart properties. We believe that in a lot of places there’s an incorrect mix of data-related properties with visual properties – sometimes these shouldn’t even be set by the same person. In this context it’s all about the data, so we opted to work hard on a great set of defaults that make reading the data as easy as possible; at a later stage (dashboards, I’m thinking of you) we’ll work on allowing visual-specific properties to be set. I think it was the right decision.

The underlying technology

Even though we’re not making it public for now, we developed a new version of what we internally call the VizAPI, and that’s what’s currently providing the visualizations for this interface (by the way, we internally code-name this interface DET, don’t ask me why…). And it’s obviously pluggable, so when we get the chance to make the documentation available, anyone will be able to provide extra visualizations to use alongside the others.
And the visualization implementation itself? I’m sure you won’t be surprised to learn it’s the Ctools’ CCC charting engine. We also want to make all the described behavior the default behavior of CCC, which would obviously benefit all the Ctools users out there.
We didn’t have time to do it yet, but very soon we’re going to apply this new VizAPI to Analyzer as well, so the visualizations and their behaviors will be consistent between this new analysis interface and Analyzer.

Stream vs Model: Modelling at the source

You probably noticed that I always used Votes in the previous screenshots, and there’s a reason for it: while votes are a cumulative concept, it doesn’t make any sense at all to show a sum of ranks. But up to this point we have no information that allows us to know the business meaning of these fields; all we know is whether they’re strings, numbers, binaries, dates, etc… In order to get insights from fields like rank, we need to capture what the fields mean from a business perspective.
And how do we get this information? Classically, it is appended in a separate stage of the process. We’re used to calling it the modelling stage, and it’s usually done after the data integration stage is complete. In our stack, we do this by writing Mondrian schemas (if we want to use OLAP) or Pentaho Metadata models for interactive reporting.
But this is incredibly stupid! From the moment we get a field called rank, we already know it should be treated as an average. As soon as we see a date field, most likely it will feed a date dimension. If we get country, state and city fields, they will most likely be attributes of a territory dimension. It makes no sense at all to wait till the end of this data preparation stage and resort to a different tool to append information we had from the start.
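The kind of inference argued for above can be sketched as a simple name-and-type heuristic. This is purely illustrative – field names, rules and the returned structure are assumptions, not Pentaho’s actual implementation:

```python
# Fields that hint at a territory dimension, per the examples in the text.
GEO_FIELDS = {"country", "state", "city"}

def infer_semantics(name: str, primary_type: str) -> dict:
    """Guess a business meaning from a field's name and primary type."""
    n = name.lower()
    if n == "rank":
        # A sum of ranks is meaningless; average is the sensible default.
        return {"role": "measure", "aggregation": "average"}
    if primary_type == "date":
        return {"role": "dimension", "dimension": "Date"}
    if n in GEO_FIELDS:
        return {"role": "attribute", "dimension": "Territory"}
    if primary_type == "number":
        return {"role": "measure", "aggregation": "sum"}
    return {"role": "attribute", "dimension": name}
```

The point is not the specific rules – it’s that this information is available while the data integration work happens, not only at some later modelling stage.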
In this new way of analyzing data as part of the data integration process, we started from the following assumption: there are two different lenses we can apply to a data set:
  • Stream view: the two-dimensional representation of the physical data we’re working with; a view over the fields and their primary types
  • Model view: the semantic meaning of those fields – dimensions, attributes, measures; basically the real business meaning of the stream underneath
In the example I’ve been using, these are the two views:
Figure 24: Stream view and Model view
As mentioned before, these are two views over the same domain. If we’re interested in the physical stream, the one on the left will be used; if we’re looking from a business perspective, it’s the model view that carries the added information. Our current thinking is that only the model view will be available to end users (once we get this data exploration experience there).

Annotating the stream

The first time you switch to the model view (and you’ll notice that some visualizations only make sense for a specific view, as is the case of the table and the pivot view for the stream and model respectively), you’ll probably notice that some of the information is not as you want it: Rank is defined as a cumulative measure, and decade and year are not in the same dimension, just to name two specific examples.
How do you correct this information? Through the special Annotate Stream step. This is where you add the extra business information we’ll use to render the correct model view. Here’s an example:
Figure 25: Annotating the stream
The concepts should be familiar, as they’re based on the dimensional modelling concepts that have been around for 30+ years. Why? Because most of those concepts are not technical – on the contrary, IMO the biggest advantage of the core data warehouse concepts is the way raw data is turned into business-meaningful terminology. The technologies that turn one into the other may evolve, but the main concepts stay exactly the same: measures, dimensions, attributes, properties, etc.
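For the IMDB example above, the annotations might capture something like the following. The structure is a hedged sketch for illustration – the real Annotate Stream step stores this differently – but it shows the two corrections mentioned earlier: rank averaged instead of summed, and decade/year grouped as levels of a single Date dimension:

```python
# Hypothetical flat annotations for the IMDB ratings stream.
annotations = [
    {"field": "votes",  "annotate": "measure", "aggregation": "sum"},
    {"field": "rank",   "annotate": "measure", "aggregation": "average"},
    {"field": "decade", "annotate": "dimension-level",
     "dimension": "Date", "ordinal": 0},
    {"field": "year",   "annotate": "dimension-level",
     "dimension": "Date", "ordinal": 1},
]

def model_view(annotations):
    """Fold flat annotations into a measures/dimensions model view."""
    model = {"measures": {}, "dimensions": {}}
    for a in annotations:
        if a["annotate"] == "measure":
            model["measures"][a["field"]] = a["aggregation"]
        else:
            levels = model["dimensions"].setdefault(a["dimension"], [])
            levels.append((a["ordinal"], a["field"]))
    for levels in model["dimensions"].values():
        levels.sort()  # order levels from coarsest to finest
    return model
```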
By adding this information, we’re able to get the correct model from this step on, and to see the correct model information and expected output in our visualizations:
Figure 26: A visualization using the correct model information
As I’m sure you realized by now, underneath we’re generating a Mondrian model. To be more accurate, we’re generating what we call a DSW model, which contains more than the Mondrian schema. It’s important for us not to lock this down to a specific technology or implementation to allow for future optimizations.

The pivot view

One special visualization is the pivot view, for OLAP-style analysis:
Figure 27: The pivot view
The result? An experience you may be very familiar with:
Figure 28: Exploring the data in a pivot table format
One of the key items in Pentaho is embeddability, and we have a lot of OEM customers. Here we have a classic case of “eat your own dog food”: we’re leveraging a highly stylized Analyzer visualization and taking advantage of its capability to be embedded in an external application. In this case, we are the external application, but it was a great validation that we’re actually capable of doing what we say we do ;)
You’ll notice that we disabled all the options that are available out of the box in Analyzer, like filters, drilldowns, tops, ranks, etc… In future versions we’ll progressively add these fundamental operations to this exploration experience, but we’ll have to do it in a way that works across all visualizations, not only the pivot view, and we simply didn’t have time to do everything we wanted.

The number of rows

Figure 29: Data set size of the inspection
I’ve been talking about data inspection. I mentioned this before, but I want to reinforce it so people don’t have the wrong expectations: we are not exploring the full dataset, at least not in every case. PDI can process tons of data, and it would be physically impossible to generically analyze any non-optimized dataset size, at least in a fast and practical manner.
This is about data inspection and spot-checking: it looks at the stream of data passing through PDI and uses a limited amount of it. This limit, shown here as 1000 rows but likely to be larger, is still to be determined, but should be in the range of thousands of rows. This (configurable) number will go up over time, without ever compromising usability and speed of analysis.
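Conceptually, the spot-checking behavior is just a bounded read over an otherwise unbounded row stream – a sketch of the idea, with the 1000-row default taken from the screenshot:

```python
from itertools import islice

def inspect(stream, limit: int = 1000):
    """Materialize at most `limit` rows from an arbitrary row stream.

    The stream itself can be of any size; inspection never pulls more
    than the (configurable) limit into memory.
    """
    return list(islice(stream, limit))
```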

AgileBI, Instaview… Deja-vu?

Some of the older Pentaho users may be asking the following question:
But… isn’t this very similar to Agile BI and / or Instaview?
My best answer to that is: while they’re indeed similar in concept, the different approach to the implementation makes this one extremely useful, whereas the others were, in my (very critical) opinion, completely useless.
What the others did wrong was forcing the user to go out of their way to use them. AgileBI, for instance, only worked on specific steps, where data was materialized in a table. Then it took a huge amount of time to prepare the infrastructure, you always had to provide modelling information, and eventually you’d be greeted with an Analyzer frame running inside a Pentaho instance running inside an application server embedded inside Spoon… You could only do slice-and-dice operations, and when you were done you lost everything; there was nowhere to go.
Instaview (discontinued and actually removed from the product a while back) had a slightly different approach. While it worked at any step, it always ran the full transformation and moved the data to an embedded MonetDB database, and only after that would you go into Analyzer – which, once again, was running inside a thing that was running inside a thing that was running inside a thing… jeez, it always felt to me like the architectural version of a turducken (can you tell that I really, really, really hated those plugins?).
This new approach was built on what we learned from the others:
  • It doesn’t force you to exit your usual workflow; on the contrary, it complements it;
  • We made it extremely fast – there’s just a small overhead over the time it takes to actually run the transformation to get the data to be inspected;
  • It’s completely unmaterialized – no persistence involved;
  • It leverages data services, so it takes advantage of push-down optimizations when needed;
  • It gives you several ways to look at the data while you build your transformations, so it’s not restricted to a pivot table;
  • It blends the concepts of stream and model in a unified view of the dataset;
  • A single click publishes a datasource to the Pentaho Server.

Looking into the future

I really believe this will have a huge impact for PDI users. On its own it’s a fantastic tool that we’ll keep improving over time, and it will be a real differentiator in the market. But we want more than that.
We’re on a journey to build one single interface for users to engage with information, regardless of where they are. We want to move away from a tool based approach to a data-centric approach, which will drastically improve the overall user experience.

Share Analytics in PDI

This one gets its own section because it’s one of the most useful features: when we’re connected to a Pentaho Server, we can immediately publish the dataset and make it available to the users of that server.
Figure 30: Publish datasource
This feature requires a connection to a server because in most cases it will immediately create an unmaterialized connection on the Pentaho Server through data services, which means the transformation will be executed on demand on the server. Special care has to be taken to make sure all the resources are available and working correctly on that server. For performance reasons, the cache will be enabled by default.
Figure 31: Publish datasource dialog
From this point on you’ll be able to name the datasource, and it will be created on the Pentaho Server you chose. One important feature, actually inherited from the SDR blueprint (SDR stands for Streamlined Data Refinery), is that the system is smart enough to create a direct JDBC connection to the database if we’re publishing from a table output step or equivalent.
As soon as we publish the dataset from within PDI, all limits are removed: business users will be able to analyze all the data from within the User Console, completely unmaterialized. This requires some care regarding dataset size: data services work extremely well for datasets up to a few million rows, but beyond that we may need to do some optimizations.
If more performance is needed, the integration developer can, at any time, materialize the data in a database; if he publishes again, the unmaterialized connection will immediately be replaced by a “materialized” connection to the database.
This is a key message and strategic direction for the platform: keep things simple, and go complex as needed. If the system behaves well with an unmaterialized model, we leave it that way; if not, we explore other solutions, knowing there’s always a price involved (database maintenance, data lifecycle management, etc).
And what’s the final result? An exploration in Analyzer that mimics exactly what we saw in PDI (since, in my case, I had a small dataset):
Figure 32: The published result

Reporting Enhancements

And now, for something completely different. This release is not only made of new and shiny stuff. As in every release, tons of issues were addressed, and this time we also revamped the reporting bits to add an extremely important feature: progress reporting and partial renders.
From 7.0 on, if you run a report (PRD or PIR based), you’ll see this cool-looking progress indicator. Even better, if you see it’s taking a long time to render because it’s destroying half of the Amazon forest, you have the option to send it to background execution, and it will be saved wherever you want.
Figure 33: Progress indicator! I know it’s 2016, but better late than never!
A second, insanely useful improvement is that we start giving you the pages as soon as they’re available, without making you wait for everything:
Figure 34: Handing out stuff as soon as it’s ready

Spark

In 7.0 we increased our support for Spark in two main areas: orchestration abilities and SQL on Spark.

Expanded Spark Orchestration

Figure 35: Spark orchestration improvements
This allows IT and developers to visually coordinate and schedule Spark applications – leveraging libraries for streaming, machine learning, structured data query and other purposes – as part of broader pipelines; it also supports applications written in Python. Having this visual environment makes it easier to manage the wide variety of programming languages and application types involved.
Concretely, we expanded the existing Spark Submit step, allowing you to submit existing applications that use libraries including Spark Streaming, Spark MLlib, Spark ML and Spark SQL.
This is supported for Cloudera and Hortonworks.

SQL on Spark

Figure 36: SQL on Spark capabilities added to PDI
PDI can now connect to data with SQL on Spark, making it easier for data analysts to query structured Spark data and integrate it with other data for preparation and analytics. This is done through an HQL query in the relevant PDI steps.
We leverage the different Hadoop implementations: Cloudera uses the Hive on Spark JDBC driver, while Hortonworks uses the Spark SQL JDBC driver.
From this point on… business as usual!

Metadata Injection

Figure 37: Metadata injection concept
You don’t know what metadata injection is? You should. It’s absolutely useful when you have disparate datasources and rules and want to change them dynamically at runtime, avoiding having to build and maintain a huge number of transformations and jobs. Define a template, pass metadata at runtime, and you’re good! Not the easiest thing to do, but that’s the price you pay for this insanely powerful approach.
We did tons of improvements to this story in 6.1, and we kept going by enabling more than 30 new steps for metadata injection:
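The template-plus-metadata idea can be illustrated with a language-agnostic sketch. Step and property names here are invented for the example – real metadata injection configures actual PDI step properties – but the shape is the same: one generic template, many per-source metadata sets resolved at runtime:

```python
# A "template" with holes (None) to be filled at runtime.
template = {
    "read_csv": {"filename": None, "delimiter": None},
    "select":   {"fields": None},
}

def inject(template, metadata):
    """Return a concrete configuration from a template plus metadata."""
    resolved = {}
    for step, props in template.items():
        step_meta = metadata.get(step, {})
        resolved[step] = {
            key: step_meta.get(key, default)  # fill each hole if provided
            for key, default in props.items()
        }
    return resolved

# Per-source metadata could come from a config file or a database row:
orders_meta = {
    "read_csv": {"filename": "orders.csv", "delimiter": ";"},
    "select":   {"fields": ["order_id", "amount"]},
}
```

One template plus N small metadata sets replaces N hand-built transformations, which is exactly the maintenance win described above.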
Figure 38: Added steps enabled with Metadata Injection in 7.0

Hadoop Security

Figure 39: Added support for Hadoop user impersonation
What a huge and amazing effort the team did here.
We are extending our Kerberos integration to cover impersonation of multiple PDI users (whereas before it was focused on authentication of a single user). The updated PDI Kerberos enhancements allow multiple authenticated PDI users to access Kerberos-enabled Cloudera Hadoop clusters as multiple Hadoop users, promoting more secure big data integration. This also enables the Hadoop cluster to perform user-level tracking and resource management. This granular auditing of user activity is essential in enterprise-grade PDI implementations with Hadoop.
While Kerberos is focused on authentication – providing a secure ‘log in’ to the cluster – Cloudera Sentry is a framework for user/role authorization to specific resources and data within Hadoop, helping enforce business security policies. In other words, users only have access to the data they have been provisioned to access by IT. 7.0 enables the integration of PDI with Cloudera Sentry in order to enforce enterprise data authorization rules. Sentry enables unified user- and role-based access controls to data, including specific Hive or HBase tables and other data in HDFS, down to the column level of granularity.

If you made it to the end of this long and thorough blog on 7.0… I’m impressed. That probably means you want some more information? If so, check out these links:

-          Pentaho 7.0 webpage with additional information and resources

-          Register for the 7.0 webinar on November 9, 2016 where you get to see a live demo of all of this!

As a final comment, I reiterate what I said in the beginning: I consider this the most spectacular release this product has ever seen – the only release better than this one will be the next one.

Pedro Alves on Business Intelligence

Pentaho Community Meetup 2016 recap


Dear friends,

I just came back from PCM16, the 9th annual edition of our European Pentaho Community Meetup. We had close to 200 registrations for this event, of which about 150 showed up, making this the biggest one so far. Even though veterans of the conference like myself really appreciate the warmth of previous locations like Barcelona and Cascais, I have to admit we got a great venue in Antwerp this year, with 2 large rooms, great catering and top-notch audiovisual support in a nice part of the city center. (Free high-speed Antwerp city WiFi, yeah!)

Content-wise everything was more than OK, with back-to-back presentations on a large variety of subjects and, I’m happy to say, lots of Kettle-related stuff as well.

For an in depth recap of the content you can see here for the technical track and here for the other sessions.

Personally I was touched by the incredibly positive response from the audience after my presentation on the state of the PDI unit testing project. However, the big bomb was dropped when Hiromu Hota from Hitachi Research America started to present a new “WebSpoon” project. You could almost hear everyone think: “Oh no, not another attempt at making a new Kettle web interface”. However, 2 minutes into the presentation everyone in the audience started to realize that this was the real, original Spoon, with all the functionality it has, ported 1:1 to a web browser on your laptop, thin device, phone or tablet. Applause spontaneously erupted, Twitter exploded, and people couldn’t stop talking about it until I left the PCM crowd a day later. Now I’m obviously very happy we managed to keep the WebSpoon project a secret for the past few months, and it’s impossible to thank Hota-san enough for traveling all the way to present at our event this weekend.

My heartfelt thanks also go out to Bart Maertens and the whole crew for making PCM16 a wonderful experience and an unforgettable event!

See you all at PCM17!



Matt Casters on Data Integration