
Pentaho 8.1 is available


The team has once again over-delivered on a dot release! Below are what I think are the many highlights of Pentaho 8.1, as well as a long list of additional updates.
If you don’t have time to read to the end of my very long blog, just save some time and download it now. Go get your Enterprise Edition or trial version from the usual places.

For CE, you can find it on the community home!


One of the biggest themes of the release: increased support for Cloud. A lot of vendors are fighting to become the best provider, and what we do is make sure Pentaho users can watch all that comfortably sitting in their chairs, having a glass of wine, and really not caring about the outcome. As in a lot of areas, we want to be agnostic – which is not to say that we won’t leverage the best of each – and really focus on logic and execution.
It’s hard to do this as a one-time effort, so we’ve been adding support as needed (and by “as needed” I really mean based on the prioritization given by the market and our customers). A big focus of this release was Google and AWS:

Google Storage (EE)

Google Cloud Storage is a RESTful unified storage service for storing and accessing data on Google’s infrastructure. PDI now imports and exports data to/from Cloud Storage through a new VFS driver (gs://). You can use it in the several steps that support VFS, and even browse its contents.
These are the roles required on Google Storage for this to work:
     Storage Admin
     Storage Object Admin
     Storage Object Creator
     Storage Object Viewer
In terms of authentication, you’ll need the GOOGLE_APPLICATION_CREDENTIALS environment variable defined, pointing to your Google service account JSON key file. From this point on, just treat it as a normal VFS source.
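As a minimal sketch (the key path and bucket names are hypothetical), the setup amounts to pointing the standard Google credentials variable at a service-account JSON key and then addressing objects with the gs:// scheme:

```python
import os

# Hypothetical path: point the standard Google credentials variable at a
# service-account JSON key file before starting Spoon or the server.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/opt/pentaho/keys/gcs-service-account.json"

# A Cloud Storage object is then addressed like any other VFS source,
# using the gs:// scheme:
source_uri = "gs://my-bucket/landing/sales_2018.csv"
```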


 Google BigQuery – JDBC Support  (EE/CE)

BigQuery is Google’s serverless, highly scalable, low cost enterprise data warehouse. Fancy name for a database, and that’s how we treat it.
In order to connect to it, we first need the appropriate drivers. The steps are pretty simple:
1.      Download the Simba JDBC driver for Google BigQuery
2.      Copy the google*.* files from the Simba driver to the /pentaho/design-tools/data-integration/libs folder
Host Name will default to the standard BigQuery endpoint, but your mileage may vary.
Unlike the previous item, authentication doesn’t use the GOOGLE_APPLICATION_CREDENTIALS environment variable as Google VFS does. Authentication here is done at the JDBC driver level, through a driver option, OAuthPvtKeyPath, set in the Database Connection Options, which needs to point to the Google Storage certificate in P12 key format.
The following Google BigQuery roles are required:
1.      BigQuery Data Viewer
2.      BigQuery User
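As a sketch of what such a connection ends up looking like (project id, service-account email, and key path are hypothetical, and OAuthType=0 as the service-account mode is an assumption about the Simba driver – check your driver documentation), the JDBC URL carries the OAuthPvtKeyPath option alongside the rest:

```python
# Hypothetical connection settings; OAuthPvtKeyPath points at the P12 key
# described above.
options = {
    "ProjectId": "my-gcp-project",
    "OAuthType": "0",  # assumed service-account auth mode in the Simba driver
    "OAuthServiceAcctEmail": "etl@my-gcp-project.iam.gserviceaccount.com",
    "OAuthPvtKeyPath": "/opt/pentaho/keys/bigquery-key.p12",
}
jdbc_url = ("jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;"
            + ";".join(f"{k}={v}" for k, v in options.items()))
```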

Google BigQuery – Bulk Loader  (EE)

While you can use a regular Table Output to insert data into BigQuery, that’s going to be slow as hell (who said hell was slow? This expression makes no sense at all!). So we’ve added a step for that: Google BigQuery Loader.
This step leverages Google’s loading abilities, and the load is processed on Google’s side, not in PDI. So the data, which has to be in Avro, JSON, or CSV format, must previously be copied to Google Storage. From that point on it’s pretty straightforward. Authentication is done via the GOOGLE_APPLICATION_CREDENTIALS environment variable, pointing to the Google JSON key file.
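A small sketch of the staging constraint (bucket and file names hypothetical, and the helper is illustrative, not a Pentaho API): the loader only picks up files already staged in Cloud Storage, and only in the three supported formats:

```python
# The Google BigQuery Loader reads its input from Google Storage; only
# Avro, JSON, or CSV files can be staged there for loading.
STAGEABLE_FORMATS = {"avro", "json", "csv"}

def staging_uri(bucket: str, filename: str) -> str:
    """Return the gs:// URI for a staged file, refusing unsupported formats."""
    ext = filename.rsplit(".", 1)[-1].lower()
    if ext not in STAGEABLE_FORMATS:
        raise ValueError(f"cannot stage .{ext} files for the BigQuery Loader")
    return f"gs://{bucket}/{filename}"
```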
Google Drive  (EE/CE)
While Google Storage will probably be seen more frequently in production scenarios, we also added support for Google Drive, a file storage and synchronization service that allows users to store files on Google’s servers, synchronize files across devices, and share files.
This is also done through a VFS driver, but given that authentication is per user, a few steps need to be followed to leverage this support:
     Copy your Google client_secret.json file into the credentials directory (the Google Drive option will not appear as a Location until you copy the client_secret.json file there and restart):
o   Spoon: data-integration/plugins/pentaho-googledrive-vfs/credentials directory, then restart Spoon
o   Pentaho Server: pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-googledrive-vfs/credentials directory, then restart the server
     Select Google Drive as your Location. You are prompted to login to your Google account.
     Once you have logged in, the Google Drive permission screen displays.
     Click Allow to access your Google Drive Resources.
     A new file called StoredCredential will be added to the same place where you put the client_secret.json file. This file needs to be copied to the Pentaho Server credential location so that the same authentication is used there.
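The credential dance above can be summarized in a small sketch (the file names are the ones mentioned; the functions themselves are illustrative, not part of any Pentaho API):

```python
def drive_location_available(credentials_dir: set) -> bool:
    """Spoon only offers Google Drive as a Location once the secret is in place."""
    return "client_secret.json" in credentials_dir

def server_ready(credentials_dir: set) -> bool:
    """The Pentaho Server needs both the secret and the stored OAuth credential."""
    return {"client_secret.json", "StoredCredential"} <= credentials_dir
```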

Analytics over BigQuery  (EE/CE, depending on the tool used)

This JDBC connectivity to Google BigQuery, as defined previously for Spoon, can also be used throughout all the other Business Analytics browser and client tools – Analyzer, CTools, PIR, PRD, modeling tools, etc. Some care has to be taken here, though, as BigQuery’s pricing is related to 2 factors:
     Data stored
     Data queried
While the first one is relatively straightforward, the second one is harder to control, as you’re charged according to total data processed in columns selected. For instance, a ‘select *’ query should be avoided if only specific columns are needed. To be absolutely clear, this has nothing to do with Pentaho, these are Google BigQuery pricing rules.
So ultimately, a bit like we need to do on all databases / data warehouses, we need to be smart and work around the constraints (usually speed and volume, in this case price as well) to best leverage what these technologies have to offer. Some examples are given here:
     By default, there is BigQuery caching and cached queries are free. For instance, if you run a report in Analyzer, clear the Mondrian cache, and then reload the report, you will not be charged (thanks to the BigQuery caching)
     Analyzer: Turn off auto-refresh, i.e., design your report layout first, including calculations and filtering, without querying the database automatically after each change
     Analyzer: Drag in filters before levels to reduce data queried (i.e. filter on state = California BEFORE dragging city, year, sales, etc. onto canvas)
     Pre-aggregate data in BigQuery tables so they are smaller in size where possible (to avoid queries across all raw data)
     GBQ administrators can set query volume limits by user, project, etc. (quotas)
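To make the “pay for the columns you scan” point concrete, here is an illustrative cost model (not Google’s actual accounting; column sizes and row counts are hypothetical) showing why a narrow select is cheaper than `select *`:

```python
def bytes_scanned(column_sizes: dict, selected: list, rows: int) -> int:
    """BigQuery-style accounting sketch: bytes processed grow with the
    per-row size of every column the query touches, times the row count."""
    return rows * sum(column_sizes[c] for c in selected)

# Hypothetical table layout: per-row byte sizes of each column.
table = {"state": 2, "city": 20, "year": 8, "sales": 8, "comments": 200}

full_scan = bytes_scanned(table, list(table), rows=1_000_000)           # select *
narrow_scan = bytes_scanned(table, ["state", "sales"], rows=1_000_000)  # two columns
```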

AWS S3 Security Improvements (IAM) (EE/CE)

PDI is now able to get IAM security keys from the following places (in this order):
1.      Environment Variables
2.      Machine’s home directory
3.      EC2 instance profile
This added flexibility helps accommodate different AWS security scenarios, such as integration with S3 data via federated SSO from a local workstation, by providing secure PDI read/write access to S3 without requiring the user to provide hardcoded credentials.
The IAM user secret key and access key can be stored in one place so they can be leveraged by PDI without repeated hardcoding in Spoon. The environment variables that point to them are AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
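The three-tier lookup order can be sketched like this (the function and tuple shapes are illustrative, not PDI internals; the variable names are the standard AWS ones):

```python
def resolve_aws_credentials(env, home_file_keys, instance_profile_keys):
    """Resolve IAM keys in the order described above (a toy model)."""
    # 1. Environment variables win when both are present.
    if "AWS_ACCESS_KEY_ID" in env and "AWS_SECRET_ACCESS_KEY" in env:
        return ("environment", env["AWS_ACCESS_KEY_ID"], env["AWS_SECRET_ACCESS_KEY"])
    # 2. Otherwise fall back to the credentials file in the machine's home directory.
    if home_file_keys:
        return ("home_directory",) + tuple(home_file_keys)
    # 3. Finally, ask the EC2 instance profile (metadata service).
    if instance_profile_keys:
        return ("ec2_instance_profile",) + tuple(instance_profile_keys)
    raise RuntimeError("no AWS credentials found")
```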


Big Data / Adaptive Execution Layer (AEL) Improvements


Bigger and Better (EE/CE)

AEL provides spectacular scale out capabilities (or is it scale up? I can’t cope with these terminologies…) by seamlessly allowing a very big transformation to leverage a clustered processing engine.
Currently we have support for Spark through the AEL layer, and throughout the latest releases we’ve been improving it in 3 distinct areas:
     Performance and resource optimizations
o   Added Spark context reuse that, under certain circumstances, can speed up startup performance in the range of 5x faster, proving especially useful under development conditions
o   Spark History Server integration, providing centralized administration, auditing, and performance reviews of the transformations executed in Spark
o   Ability to pass customized Spark properties down to the cluster, allowing finer-grained control of the execution process
     Increased support for native steps (e.g., leveraging the Spark-specific Group By instead of the PDI engine one)
     Added support for more cloud vendors – and we just did that for EMR 5.9 and MapR 5.2
This is the current support matrix for Cloud Vendors:


Sub Transformation support (EE/CE)

This one is big, as it was the result of a big and important refactor of the Kettle engine. AEL now supports executing sub-transformations through the Transformation Executor step, a long-standing request since the times of good old PMR (Pentaho MapReduce).

Big Data formats: Added support for Orc (EE/CE)

Not directly related to AEL, but in most of the use cases where we want AEL execution, we’ll need the input data in a big-data-specific format. In previous releases we added support for Parquet and Avro, and we have now added support for ORC (Optimized Row Columnar), a format favored by Hortonworks.
Like the others, ORC will be handled natively when transformations are executed in AEL.

Worker Nodes (EE)


Jumping from scale-out to scale-up (or the opposite, like I mentioned, I never know), we continue to do lots of improvements on the Worker Nodes project. This is an extremely strategic project for us as we integrate with the larger Hitachi Vantara portfolio.
Worker nodes allow you to execute Pentaho work items, such as PDI jobs and transformations, with parallel processing and dynamic scalability with load balancing in a clustered environment. It operates easily and securely across an elastic architecture, which uses additional machine resources as they are required for processing, operating on premise or in the cloud.
It uses the Hitachi Vantara Foundry project, which leverages popular technologies under the hood such as Docker (container platform), Chronos (scheduler), and Mesos/Marathon (container orchestration).
For 8.1 there are several other improvements:
     Improvements in monitoring, with accurate propagation of work item status
     Performance improvements by optimizing the startup times for executing the work items
     Customizations are now externalized from the Docker build process
     Job clean up functionality



Streaming

In Pentaho 8.0 we introduced a new paradigm to handle streaming datasources. The fact that a streaming transformation runs permanently required a different approach: the new streaming steps define the windowing mode and point to a sub-transformation that is then executed in a micro-batch fashion.
That works not only for ETL within the kettle engine but also in AEL, enabling spark transformations to feed from Kafka sources.
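The idea can be sketched like this (a toy model, not the Kettle engine): the parent keeps reading the stream indefinitely, while each window is handed to an ordinary, finite sub-transformation run:

```python
def micro_batches(stream, batch_size):
    """Group an endless record stream into finite windows for the sub-transformation."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush a final partial window if the stream ever ends
        yield batch

def run_sub_transformation(batch):
    # Stand-in for executing the pointed-to sub-transformation on one window.
    return [record.upper() for record in batch]
```

With a batch size of 2, five records yield windows of 2, 2, and 1, each processed as a normal finite run.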

New Streaming Datasources: MQTT, and JMS (Active MQ / IBM MQ) (EE/CE)

Leveraging the new streaming approach, there are 2 new steps available – well, one new and one (two, actually) refreshed.
The new one is MQTT – Message Queuing Telemetry Transport – an ISO-standard publish-subscribe messaging protocol that works on top of TCP/IP. It is designed for connections with remote locations where a “small code footprint” is required or the network bandwidth is limited. Alternative IoT-centric protocols include AMQP, STOMP, XMPP, DDS, OPC UA, and WAMP.


There are 2 new steps – MQTT Input and MQTT Output – that connect with the broker for consuming messages and publishing back the results.
Other than this new, IoT-centered streaming source, there are 2 more new steps, JMS Input and JMS Output. These steps replace the old JMS Consumer/Producer and IBM WebSphere MQ steps, supporting, in the new mode, the following message queue platforms:
     Apache ActiveMQ
     IBM MQ
Safe Stop (EE/CE)
This new paradigm for handling streaming sources introduced a challenge that we never had to face before. Usually, when we triggered jobs and transformations, they had a well-defined start and end; our stop functionality was used when we basically wanted to kill a running process because something was not going well.
However, in these streaming use cases, a transformation may never finish. So stopping a transformation the way we’ve always done it – by stopping all steps at the same time – could have unwanted results.
So we implemented a different approach: a new safe-stop option, implemented in Spoon, Carte, and the Abort step, that instead of killing all the step threads, stops the input steps and lets the other steps gracefully finish their processing, so no records currently being processed are lost.
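The difference between the two stop modes can be sketched with a toy pipeline (illustrative only): a hard stop discards whatever sits between steps, while a safe stop halts the input and lets downstream drain:

```python
from collections import deque

def hard_stop(in_flight: deque, processed: list) -> list:
    """Kill all step threads at once: in-flight records are simply lost."""
    in_flight.clear()
    return processed

def safe_stop(in_flight: deque, processed: list) -> list:
    """Stop only the input step, then let downstream steps drain the pipeline."""
    while in_flight:
        processed.append(in_flight.popleft())
    return processed
```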


This is especially useful in real-time scenarios (for example reading from a message bus). It’s one of those things that when we look back seems pretty dumb that it wasn’t there from the start. It actually makes a lot of sense, so we went ahead and made this the default behavior.

Streaming results (EE/CE)

When we launched streaming in Pentaho 8.0 we focused on the processing piece: we could launch the sub-transformation, but we could not get results back. Now we have the ability to define which step of the sub-transformation sends results back to the rest of the flow.


Why is this important? Because of what comes next…
Streaming Dataservices (EE/CE)
There’s a new option to run a data service in streaming mode. This will allow consumers (in this case, CTools dashboards) to get streaming data from the dataservice.


Once defined, we can test these options within the test dataservices page and see the results as they come.


This screen exposes the functionality as it would be called from a client. It’s important to know that the windows we define here are not the same as the ones we defined for the micro-batching service. The window properties are the following:
     Window Size – The number of rows that a window will have (row based), or the time frame that we want to capture new rows to a window (time based).
     Every – Number of rows (row based), or milliseconds (time based) that should elapse before creating a new window.
     Limit – Maximum number of milliseconds (row based) or rows (time based) which will be used to wait for a new window to be generated.
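For the row-based case, Window Size and Every can be sketched as a hopping window (a toy model over an in-memory list; real windows are built as rows arrive):

```python
def row_windows(rows, size, every):
    """Row-based windows: each window holds `size` rows, and a new window
    starts after `every` rows, so windows may overlap or skip rows."""
    return [rows[start:start + size]
            for start in range(0, max(len(rows) - size + 1, 1), every)]
```

With six rows, a size of 4 and Every of 2 produces overlapping windows; a size of 2 and Every of 3 skips rows between windows.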

CTools and Streaming Visualizations (EE/CE)

We took a holistic approach to this feature. We want to make sure we can have a real time / streaming dashboard leveraging what was set up before. And this is where the CTools come in. There’s a new datasource in CDE available to connect to streaming dataservices:


Then the configuration of the component selects the kind of query we want – time- or row-based – along with window size, frequency, and limit. This gives us good control for a lot of use cases.


This will allow us to then connect to a component the usual way. While this will probably be more relevant for components like tables and charts, ultimately all of them will work.
It is possible to achieve a level of multi-tenancy by passing a user name parameter from the PUC session (via CDE) to the transformation as a data service push-down parameter. This enables restricting the data viewed on a user-by-user basis.
One important note is that the CTools streaming visualizations do not yet operate on a ‘push’ paradigm – that is on the current roadmap. In 8.1, the visualizations poll the streaming data service at a constant interval, which has a lower refresh limit of 1 second. But then again… if you’re doing a dashboard of this type and need a refresh faster than 1 second, you’re definitely doing something wrong…

Time Series Visualizations (EE/CE)

One of the biggest use cases for streaming, from a visualization perspective, is time series. We improved CCC support for time-series line charts, so data trends over time are now shown without needing workarounds.
This applies not only to CTools but also to Analyzer.


Data Exploration Tool Updates (EE)

We’re keeping on our path of improving our Data Exploration Tool. It’s no secret that we want to make it feature complete so that it can become the standard data analysis tool for the entire portfolio.
This time we worked on adding filters to the Stream view.
We’ll keep improving this. Next in the queue, hopefully, will be filters on the model view and date filters!

Additional Updates

As usual, there were several additional updates that did not make it into my highlights above. So for the sake of your time – and to avoid creating a 100-page blog – here are even more updates in Pentaho 8.1.
Additional updates:
     Salesforce connector API update (API version 41)
     Splunk connection updated to version 7
     Mongo version updated to 3.6.3 driver (supporting 3.4 and 3.6)
     Cassandra version updated to support version 3.1 and Datastax 5.1
     PDI repository browser performance updates, including lazy loading
     Improvements on the Text and Hadoop file outputs, including limit and control file handling
     Improved logging by removing auto-refresh from the kettle logging servlet
     Admin can empty trash folder of other users on PUC
     Clear button in PDI step search in spoon
     Override JDBC driver class and URL for a connection
     Suppressed the Pentaho ‘session expired’ pop-up on SSO scenarios, redirecting to the proper login page
     Included the possibility to schedule generation of reports with a timestamp to avoid overwriting content
In summary (and wearing my marketing hat) with Pentaho 8.1 you can:
      Deploy in hybrid and multi-cloud environments with comprehensive support for Google Cloud Platform, Microsoft Azure and AWS for both data integration and analytics
      Connect, process and visualize streaming data, from MQTT, JMS, and IBM MQ message queues and gain insights from time series visualizations
      Get better platform performance and increase user productivity with improved logging, additional lineage information, and faster repository access


Pedro Alves on Business Intelligence

Farewell Pentaho


Dear Kettle friends,

12 years ago I joined a wonderful team of people at Pentaho who thought they could make a real change in the world of business analytics. At that point I had recently open-sourced my own data integration tool (then still called ‘ETL’) called Kettle, and so I joined in the role of Chief Architect of Data Integration. The title sounded great, and the job included everything from writing articles (and a book), massive amounts of coding, testing, software releases, giving support, doing training, workshops… In other words, life was simply doing everything I possibly and impossibly could to make our software succeed when deployed by our users. With Kettle now being one of the most popular data integration tools on the planet, I think it’s safe to say that this goal has been reached and that it’s time for me to move on.

I don’t just want to announce my exit from Pentaho/Hitachi Vantara. I would also like to thank all the people involved in making our success happen. First and foremost I want to express my gratitude to the founders (Richard, Doug, James, Marc, …) for even including a crazy Belgian like myself on the team but I also want to extend my warmest thanks to everyone who I got to become friends with at Pentaho for the always positive and constructive attitude. Without exaggeration I can say it’s been a lot of fun.

I would also explicitly like to thank the whole community of users of Kettle (now called Pentaho Data Integration). Without your invaluable support in the form of new plugins, bug reports, documentation, forum posts, talks, … we could never have pulled off what we did in the past 12 years! I hope we will continue to meet at one of the many popular community events.

Finally I want to thank everyone at Hitachi and Hitachi Vantara for being such a positive and welcoming group of people. I know that Kettle is used all over Hitachi and I’m quite confident this piece of software will not let you down any time soon.

Now I’m going to go skiing for a week and when I get back it’s time to hunt for a new job. I can’t wait to see what impossible problems need solving out there…



Matt Casters on Data Integration

Announcing Pentaho 8.0 – Coming in November to a theater near you!

Pentaho 8!


The first of a new Era

Wow – time flies… Another Pentaho World this week, and another blog post announcing another release. This time… the best release ever!
This is our first Pentaho product announcement since we became Hitachi Vantara – and you’ll see that some synergies are already appearing. And as I said before, again and again… the Community Edition is still around! We’re not kidding – we’re here to rule the world and we know it’s through an open-source core strategy that we’ll get there.

Pentaho 8.0 In a nutshell

Ok, let’s get on with this, ’cause there’s a lot of people at the bar calling me to have a drink. And I know my priorities!
  • Platform and Scalability
    • Worker Nodes
    • New theme
  • Data Integration
    • Streaming support!
    • Run configurations for Jobs
    • Filters in Data Explorer
    • New Open / Save experience
  • Big Data
    • Improvements on AEL
    • Big Data File Formats – Avro and Parquet
    • Big Data Security – Support for Knox
    • VFS improvements for Hadoop Clusters
  • Others
    • Ops Mart for Oracle, MySQL, SQL Server
    • Platform password security improvements
    • PDI mavenization
    • Documentation changes on
    • Feature Removals:
      • Analyzer on MongoDB
      • Mobile Plug-in (Deprecated in 7.1)
Is it done? Can I go now? No?…. damn, ok, now on to further details…

Platform and Scalability

Worker Nodes (EE)

This is big. I never liked the way we handled scalability in PDI. Having the ETL designer responsible for manually defining the slave server in advance, having to control the flow of each execution, praying for things not to go down… nah! Also, why ETL only? What about all the other components of the stack?
So a couple of years ago, after getting info from a bunch of people I submitted a design document with a proposal for this:
This was way before I knew the term “worker nodes” was actually not original… but hey, they’re nodes, they do work, and I’m bad with names, so there’s that… :p
It took time to get to this point, not because we didn’t think this was important, but because of the underlying order of execution: we couldn’t do this without merging the servers, without changing the way we handle the repository, without having AEL (the Adaptive Execution Layer). Now we got to it!
Fortunately, we have an engineering team that can execute things properly! They took my original design, took a look at it, laughed at me, threw me out of the room and came up with the proper way of doing things. Here’s the high-level description:
This is where I mentioned that we are already leveraging Hitachi Vantara resources. We are using Lumada Foundry for worker nodes. Foundry is a platform for rapid development of service-based applications, delivering management of containers, communications, security, and monitoring for creating enterprise products/applications, leveraging technologies like Docker, Mesos, and Marathon. More on this later, as it’s something we’ll be talking a lot more about…
Here are some of the features:
  • Deploy consistently in physical, virtual and cloud environments
  • Scale and load-balance services, helping to deal with peaks and limited time windows by allocating the resources that are needed
  • Hybrid deployments can be used to distribute load: when the on-premise resources are not sufficient, scaling out into the Cloud is possible to provide more resources
So, how does this work in practice? Once you have a Pentaho Server installed, you can configure it to connect to the cluster of Pentaho Worker nodes. From that point on – things will work! No need to configure access to repositories, accesses, funky stuff. You only need to say “Execute at scale” and if the worker nodes are there, it’s where things will be executed. Obviously, the “things will work” will have to obey the normal rules of clustered execution, for instance, don’t expect a random node on the cluster to magically find out your file:///c:/my computer/personal files/my mom’s excel file.xls…. :/
So what scenarios will this benefit the most? A lot! Now your server will not be bogged down executing a bunch of jobs and transformations as they will be handed out for execution in one of the nodes.
This does require some degree of control, because there may be cases where you don’t want remote execution (for instance, a transformation to feed a dashboard). This is where Run Configurations come into play. Also important to note that even though the biggest benefits of this will be ETL work, this concept is for any kind of execution.
This is a major part of the work we’re doing with the Hitachi Vantara team. By leveraging Foundry we’ll be able to make huge improvements in areas we’ve wanted to tackle for a while but were never able to properly address on our own: better monitoring, improved lifecycle management, and active-active HA, among others. In 8.0 we leapfrogged ahead on this worker nodes story, and we expect much more going forward!

New Theme – Ruby (EE/CE)

One of the things you’ll notice is that we have a new theme that reflects the Hitachi Vantara colors. The new theme is the default on new installations (not for upgrades), and the others are still available.

Data Integration

Streaming Support: Kafka (EE/CE)

In Pentaho 8.0 we’re introducing proper streaming support in PDI! In case you’re thinking “hmm… but don’t we already have a bunch of steps for streaming datasources? JMS, MQTT, etc.?”, you’re not wrong. But the problem is that PDI is a micro-batching engine, and these streaming protocols introduce issues that can’t be solved with the current approach. Just think about it – a streaming datasource requires an always-running transformation, and in PDI execution all steps run in different threads while the data pipeline is being processed; there are cases, when something goes wrong, where we don’t have the ability to do proper error processing. It’s simply not as simple as a database query or any other call where we get a finite and well-known amount of data.
So we took a different approach – somewhat similar to sub-transformations but not quite… First of all, you’ll see a new section in PDI:
Kafka is the one that was prioritized as being the most important for now, but this will actually be something that will be extended for other streaming sources.
The secret here is on the Kafka Consumer step:

The highlighted tabs should be generic for pretty much all of these steps, and the Batch tab is what controls the flow. So instead of having an always-running transformation at the top level, we break the input data into chunks – either by number of records or by duration – and the second transformation takes that input, with the field structure, and does a normal execution. Here, the Abort step was also improved to give you more control over the flow of this execution. This is actually something that’s been a long-standing request from the community – we can now specify whether we want to abort with or without error, giving extra ability to control the flow of our ETL.
Here’s an example of this thing put together:
Now, even more interesting than that: this also works in AEL (our Adaptive Execution Layer, introduced in Pentaho 7.1), so when you run this on a cluster you’ll get Spark-native Kafka support being executed at scale, which is really nice…
Like I mentioned before, moving forward you’ll see more developments here, namely:
  • More streaming steps, and currently MQTT seems the best candidate for the short term
  • (and my favorite) Developer’s documentation with a concrete example so that it’s easy for anyone on the community to develop (and hopefully submit) their own implementations without having to worry about the 90% of the stuff that’s common to all of them
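The Batch control described above – cut a window when either a record count is reached or a time span elapses, whichever comes first – can be sketched like this (timestamps are passed in explicitly to keep the toy deterministic):

```python
def batch_by_count_or_duration(timed_records, max_rows, max_millis):
    """Cut a new batch when max_rows is reached or max_millis has elapsed,
    whichever comes first; flush any trailing partial batch at end of input."""
    batches, current, window_start = [], [], None
    for timestamp_ms, record in timed_records:
        if window_start is None:
            window_start = timestamp_ms
        current.append(record)
        if len(current) >= max_rows or timestamp_ms - window_start >= max_millis:
            batches.append(current)
            current, window_start = [], None
    if current:
        batches.append(current)
    return batches
```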

New Open / Save experience (EE/CE)

In Pentaho 7.0 we merged the servers (no more that nonsense of having a distinct “BA Server” and a “DI Server”) and introduced the unified Pentaho Server with a new and great looking experience to connect to it:
but then I clicked on Open file from repository and felt sick… That thing was absolutely horrible and painfully slow. We were finally able to do something about that! Now the experience is … well… slightly better (as in, I don’t feel like throwing up anymore!):
A bit better, no? Also with search capabilities and all the kind of stuff that you’ve been expecting from a dialog like this for the past 10 years! Same for the save experience.
This is another small but, IMO, always important step in unifying the user experience and working towards a product that gets progressively more pleasant to use. It’s a never-ending journey, but that’s not an excuse not to take it.

Filters in Data Explorer (EE)

Now that I was able to open my transformation, I can show some of the improvements that we did on our Data Explorer experience in PDI. We now support the first set of filters and actions! This one is easy to show but extremely powerful to use.
Here are the filters – depending on the data type you’ll have a few options, like excluding nulls, equals, greater/less than, and a few others. As mentioned, others will come with time.
Also, while previous version only allowed for drill down, we can now do more operations on the visualizations.

Run configuration: Leveraging worker nodes and execute on server (EE/CE)

Now that we are connected to the repository, have opened our transformation with a really nice experience, and have taken advantage of these data exploration improvements to make sure our logic is spot on, we are ready to execute it on the server.
This is where the run configuration part comes in. I have my transformation, defined it, played with it, verified that it really works as expected on my box. And now I want to make sure it also runs well on the server. What was before a very convoluted process is now much simplified.
What I do is define a new Run Configuration, as described in 7.1 for AEL, but with a little twist: I don’t want it to use the Spark engine; I want it to use the Pentaho engine, but on the server, not the one local to Spoon:
Now, what happens when I execute this selecting the Pentaho Server run configuration?
[Screenshot: the run dialog with the Pentaho Server run configuration selected]
Yep, that!! \o/
[Screenshot: PDI triggering the execution while the Pentaho Server console logs it]
This screenshot shows PDI triggering the execution and my Pentaho Server console logging it.
And if I had worker nodes configured, I would see my Pentaho Server automatically dispatching the execution of my transformation to an available worker node!
This doesn't apply only to immediate execution; we can now specify the run configuration on the job entry as well, allowing full control of the flow of our more complex ETL.
[Screenshot: selecting a run configuration on the job entry]
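For the curious: under the hood this kind of remote execution goes through the server's Carte-style REST API. Here's a minimal sketch of building such a request URL – the `executeTrans` endpoint and its `rep`/`trans`/`level` parameters are my assumptions based on the Carte API, and the server address, repository name and transformation path below are made up:

```python
from urllib.parse import urlencode

def build_execute_trans_url(base_url, repo, trans_path, log_level="Basic"):
    """Build a Carte-style executeTrans URL for a repository transformation."""
    params = urlencode({"rep": repo, "trans": trans_path, "level": log_level})
    return f"{base_url.rstrip('/')}/kettle/executeTrans/?{params}"

# Hypothetical server, repository and transformation path:
url = build_execute_trans_url("http://localhost:8080/pentaho",
                              "myrepo", "/home/admin/my_transformation")
print(url)
```

You'd then issue that request with your server credentials; the point is simply that "execute on server" is a plain HTTP call the UI now makes for you.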

Big Data

Improvements on AEL (EE/CE apart from the security bits)

As expected, a lot of work was done on AEL. The biggest ones:
  • Communicates with Pentaho client tools over WebSocket; does NOT require Zookeeper
  • Uses distro-specific Spark library
  • Enhanced Kerberos impersonation on client-side
This brings a bunch of benefits:
  • Reduced number of steps to set up
  • Enables fail-over and load-balancing
  • Robust error and status reporting
  • Customization of Spark jobs (e.g. memory settings)
  • Client-to-AEL connection can be secured
  • Kerberos impersonation from the client tool
And not to mention the performance improvements… One benchmark I saw that I found particularly impressive: AEL is practically on par with native Spark execution! Kudos to the team – just spectacular work!

Big Data File Formats – Avro and Parquet (EE/CE)

Big data platforms introduced various data formats to improve performance, compression and interoperability, and we added full support for two very popular ones: Avro and Parquet. ORC will come next.
When you run in AEL, these are also natively interpreted by the engine, which adds a lot to the value of this.
[Screenshot: the new Avro and Parquet input/output steps]
The old steps will still be available in the Marketplace, but we don't recommend using them.
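To give an idea of what these formats carry, an Avro schema is just a JSON document describing the record layout. A minimal sketch – the record and field names here are made up:

```python
import json

# A made-up Avro schema: a record with three required fields and one
# nullable field (expressed as a union with "null", plus a default).
schema = {
    "type": "record",
    "name": "SalesRow",
    "fields": [
        {"name": "order_id", "type": "long"},
        {"name": "customer", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "region", "type": ["null", "string"], "default": None},
    ],
}

print(json.dumps(schema, indent=2))
```

Because the schema travels with the data, steps (and AEL's native readers) can pick up field names and types without you re-declaring them.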

Big Data Security – Support for Knox

Knox provides perimeter security so that the enterprise can confidently extend Hadoop access to more of those new users while maintaining compliance with enterprise security policies; it is used in some Hortonworks deployments. It is now supported in the Hadoop cluster definition once you enable the KETTLE_HADOOP_CLUSTER_GATEWAY_CONNECTION property.
[Screenshot: Knox gateway settings in the Hadoop cluster definition]
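That property is a regular Kettle variable; a sketch of enabling it, assuming it lives in kettle.properties like other KETTLE_* variables (both the file location and the Y value are my assumptions, following the usual Kettle conventions):

```properties
# ~/.kettle/kettle.properties (assumed location)
# Y value assumed, following the usual Kettle Y/N boolean convention
KETTLE_HADOOP_CLUSTER_GATEWAY_CONNECTION=Y
```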

VFS improvements for Hadoop Clusters (EE/CE)

In order to simplify the overall lifecycle of jobs and transformations, we made Hadoop clusters available through VFS, in the format hc://hadoop_cluster/
[Screenshot: browsing a named Hadoop cluster through the hc:// VFS scheme]
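So a file on a named cluster – say one called my_cluster, a made-up name – can be referenced from any VFS-aware step with a path like:

```
hc://my_cluster/user/pentaho/input/sales.csv
```

The cluster's connection details stay in the named cluster definition, so the path itself carries no host names or ports.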


There are some other generic improvements worth noting

Ops Marts extended support (EE)

Ops Mart now supports Oracle, MySQL and SQL Server. I can't really believe I'm still writing about this thing :(

PDI Mavenization (CE)

Now, this is actually nice! PDI is now fully mavenized. Check out the source, run mvn package, and you're done!!!
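A sketch of what that looks like – the repository URL is my assumption (the usual pentaho-kettle GitHub repo), so verify the repo and branch before building:

```shell
# Clone the PDI sources (repository URL is an assumption) and build
git clone https://github.com/pentaho/pentaho-kettle.git
cd pentaho-kettle
mvn clean package
```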


Pentaho 8 will be available to download mid-November.

Learn more about Pentaho 8.0 and a webinar here:
Also, you can get a glimpse of PentahoWorld this week watching it live at:

Last but not least: see you in a few weeks at the Pentaho Community Meeting in Mainz!

That’s it – I’m going to the bar!


Pedro Alves on Business Intelligence

Pentaho 8 is now available!


I recently wrote about everything you needed to know about Pentaho 8. And now it's available! Go get your Enterprise Edition or trial version from the usual places.

For CE, you can find it on the new community home!




A new collaboration space


With the move to Hitachi Vantara we're not letting the community go away – quite the contrary. One of the first things we're doing is giving the community a new home, in here:

We're trying to gather people from the forums, user groups, whatever, and give them a better and more modern collaboration space. The old space will stay open, also because its content is extremely valuable, so the ultimate decision is yours.

Your mission, should you choose to accept it, is to register and try this new home. Counting on your help to make it a better space!

See you in




Pentaho Business Analytics Blog

Today, our parent company Hitachi, a global leader across industries, infrastructure and technology, announced the formation of Hitachi Vantara , a company whose aim is to help organizations thrive in today’s uncertain and turbulent times and prepare for the future. This new company unifies the mission and operations of Pentaho,…


Pentaho Business Analytics Blog

Pentaho Community Meeting 2017: exciting use cases & final Call for Papers

Enjoyed your vacations? Good – now let's get back to business!

The Pentaho Community Meeting 2017 in Mainz, taking place from November 10-12, is approaching and more than 140 participants interested in BI and Big Data are already on board.

Many great speakers from all over the world will present their Pentaho use cases, including data management and analysis at CERN, evaluation of environmental data at the Technical University of Liberec and administration of health information in Mozambique. And of course Matt Casters, Pedro Alves and Jens Bleuel will introduce the latest features in Pentaho.

The 10th jubilee edition features many highlights:

·      Hackathon and technical presentations on FRI, Nov 10 
·      Conference day on SAT, Nov 11                    
·      Dinner on SAT, Nov 11                          
·      Get-together and drinks on SAT, Nov 11  
·      Social event on SUN, Nov 12

See here the complete agenda with all presentations of the business and technical tracks on the conference day. Food and drinks will be provided. A highlight: the CERN use case (you can read a blog post on it here).

And don't forget: you can participate in the Call for Papers until September 30th! Send your Pentaho project to Jens Bleuel via the contact form.

 Some of the speakers: 

·      Pedro Alves – Aka… me! All about Pentaho 8.0, which is a different way to say “hum, just put some random title, I’ll figure out something later”
·      Dan Keeley – Data Pipelines – Running PDI on AWS Lambda
·      Francesco Corti – Pentaho 8 Reporting for Java Developers
·      Pedro Vale – Machine Learning in PDI – What’s new in the Marketplace?
·      Caio Moreno de Souza – Working with Automated Machine Learning (AutoML) and Pentaho
·      Nelson Sousa – 10 WTF moments in Pentaho Data Integration
If you haven’t done so, Register Here

We are looking forward to seeing you in Mainz, which can be reached in only 20 minutes by train from Frankfurt airport or main train station!
In the meantime, follow all updates on Twitter.

-pedro, with all the content from this post shamelessly stolen from Ruth and Carolin, the spectacular organizers from IT-Novum


Hello Hitachi Vantara!


Ok, I admit it – I am one of those people who actually likes change and views it as an opportunity. Four years ago, I announced here that Webdetails joined Pentaho. For those who don't know, Webdetails was the Portuguese consulting company that then turned into Pentaho Portugal (and expanded from 20 people at the time to 60+), completely integrated into the Pentaho structure.

Two years ago, we announced that Pentaho was acquired by HDS, becoming a Hitachi Group Company.

We have a new change today – and since I’m lazy (and in Vegas, for the Hitachi Next event, and would rather be at our party at the Mandalay Bay Beach than in my room writing this blog post!), I’ll simply steal the same structure I used two years ago (when Pentaho was acquired) and get straight to the point! :p

Big news

An extremely big transformation has been taking place and materialized today, September 19, 2017. A new company is born. Meet Hitachi Vantara.

You may be asking yourselves: Can it possibly be a coincidence that the new company is launched on the exact same day I turn 40? Well, actually yes, a complete coincidence… :/

This new company unifies the mission and operations of Pentaho, Hitachi Data Systems and Hitachi Insight Group into a single business. More info in the Pentaho blog: Hitachi Vantara – Here’s what it means

What does this mean?

It has always been our goal to provide an offering that allows customers to build high-value, data-driven solutions. We were, I think, successful at doing that! And now we (Hitachi Vantara) want to take it to the next level, which is why this transformation is needed: we're aiming higher – we want not only to be the best at (big) data orchestration and analytics, but to do so in this new IoT / social innovation ecosystem, aiming to be the biggest player in the market.

And this transformation will allow us to do that!

What will change?

So that it’s clear, Pentaho, as a product will continue to exist. Pentaho, as a company, is now Hitachi Vantara.

And for Pentaho as a product, this gives us conditions we've never had to improve the product, focusing on what we need to do best (big data orchestration and analytics) and leveraging other groups in the company in areas that, even though they weren't our core focus, people expect us to have.
We'll also improve the overall portfolio interoperability. While so far we've always tried to be completely agnostic, we'll keep saying that but add a small detail: we have to work better with our own stuff – because we can make it happen!

Community implications

This one is very easy!!! I’ll just copy paste my previous answer – because it didn’t change:

Throughout all the talks, our relationship and involvement with the community has always been one of the strong points of Pentaho, and seen with much interest.
The relationship between the community and a commercial company exists because it’s mutually beneficial. In Pentaho’s case, the community gets access to software it otherwise couldn’t, and Pentaho gets access to an insane amount of resources that contribute to the project. Don’t believe me? Check the Pentaho Marketplace for the large number of submissions, Jira for all the bug reports and improvement suggestions we get out of all the real world tests, and discussions on the forums or on the several available email lists.
Is anyone, in his or her right mind, willing to let all this go? Nah.
Plus, not having a community would render my job obsolete, and no one wants that, right? (don’t answer, please!)

The difference? We wanna do this bigger, better and faster!


And things are already moving in that direction. We are moving the Pentaho Community page to the Hitachi Vantara community site, with some really cool interactive and social features. You can visit our new home here. I look forward to engaging with all of you on this new site.

Will Hitachi Vantara shut down its Pentaho CE edition / its open source model?

I will, once again, repeat the previous answer:

Just in case the previous answer wasn't clear enough, lemme spell it out with all the words: there are no plans to change our open source strategy or to stop providing a CE edition to our community!
Can that change in the future? Oh, absolutely yes! Just like it could have changed in the past. And when could it change? When it stops making sense; when it stops being mutually beneficial. And on that day, I’ll be the first one to suggest a change to our model.

And speaking of which – don’t forget to register to PCM17! It’s going to be the best ever!


Pentaho Maven repository changed to

From a recent (at the time of writing, obviously!) issue in the Mondrian project, we noticed we had failed to announce an important change:

This morning the pentaho maven repository seems to be down.

Each download request during maven build fails with 503 error:
[WARNING] Could not transfer metadata XXX/maven-metadata.xml from/to pentaho-releases ( Failed to transfer file: Return code is: 503 , ReasonPhrase:Service Temporarily Unavailable.

The reason for this is that the maven url is now .

Here’s a link to a complete ~/.m2/settings.xml config file:
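As a sketch of the relevant settings.xml structure – the repository URL is deliberately left as a placeholder here, so grab the real one from the linked file:

```xml
<settings>
  <profiles>
    <profile>
      <id>pentaho</id>
      <repositories>
        <repository>
          <id>pentaho-releases</id>
          <!-- placeholder: substitute the current Pentaho Maven repository URL -->
          <url>NEW_PENTAHO_MAVEN_REPO_URL</url>
        </repository>
      </repositories>
    </profile>
  </profiles>
  <activeProfiles>
    <activeProfile>pentaho</activeProfile>
  </activeProfiles>
</settings>
```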



PCM17 – Pentaho Community Meeting: November 10-12, Mainz

PCM17 – 10th Edition


One of my favourite blog posts of the year – Announcing PCM17. And this year, for the 10th edition, we’re going back to the beginning – Mainz in Germany.


Location address: Kupferbergterrasse, Kupferbergterrasse 17-19, 55116 Mainz. Close to Frankfurt, Germany



We're maintaining the schedule of previous years: a meet-up on Friday for drinks, preceded by a hackathon; a meet-up on Saturday for drinks, preceded by a bunch of presentations of really cool stuff; a meet-up on Sunday for drinks, preceded by city sightseeing! You get the idea.

All the information….

Here! IT-Novum is doing spectacular work organizing this event, and you'll find all the information you need, from instructions on how to get there to suggestions for hotels to stay at.

Registration and Call for Presentations

Please go to the #PCM17 website to register and also to send us a presentation proposal!


