Pentaho 9.0 is available
Without further ado: Get Enterprise Edition here, and get Community Edition here
PDI Multi Cluster Hadoop Integration
Capability
Pentaho connects to Cloudera Distribution for Hadoop (CDH), Hortonworks Data Platform (HDP), and Amazon Elastic MapReduce (EMR). Pentaho also supports many related services such as HDFS, HBase, Oozie, ZooKeeper, and Spark.
Before this release, the Pentaho Server and the PDI design-time environment (Spoon) could work with only one Hadoop cluster at a time; working against multiple Hadoop clusters required multiple transformations, instances, and pipelines. With the 9.0 release, major architectural changes make it easy to configure, connect to, and manage multiple Hadoop clusters.
· Users can access and process data from multiple Hadoop clusters, across different distributions and versions, all from a single transformation and a single instance of Pentaho, as shown in the example after this list.
· Within Spoon, users can now set up multiple distinct cluster configurations, for example three, each referencing its own specific cluster, without having to restart Spoon. A new configuration UI makes it easy to configure your Hadoop drivers for managing the different clusters.
· Improved cluster configuration experience and secure connections with the new UI
· Supports the following distributions: Hortonworks HDP v3.0 and 3.1; Cloudera CDH v6.1 and 6.2; Amazon EMR v5.21 and 5.24.
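As an illustration of the multi-cluster capability, a single transformation could read from one named cluster and write to another by referencing the cluster names in the file paths. The cluster names below (cdh_prod, emr_archive) are hypothetical, and the exact URL scheme for named clusters may differ in your installation (see Connecting to a Hadoop cluster with the PDI client):
Hadoop File Input: hc://cdh_prod/data/incoming/sales.csv
Hadoop File Output: hc://emr_archive/data/archive/sales.csv
Each named cluster carries its own driver and site configuration, so each step resolves against the correct distribution at run time.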
Use Cases and Benefits
· Enables hybrid big data processing (on premises or in the cloud), all within a single pipeline
· Simplifies Pentaho’s integration with Hadoop clusters, including an enhanced cluster configuration experience
Key Considerations
· The Adaptive Execution Layer (AEL) for Spark is not validated to execute pipelines that connect to multiple Hadoop clusters.
· Pentaho MapReduce is not validated to execute pipelines that connect to multiple Hadoop clusters.
Additional Resources
See Adding a new driver for how to add a driver. See Connecting to a Hadoop cluster with the PDI client for how to create a named connection.
Follow the suggestions in the Big Data issues troubleshooting sections to help resolve common issues when working with Big Data and Hadoop, especially Legacy mode activated when named cluster configuration cannot be located.
PDI AEL-Spark Enhancements
Capability
The Pentaho Adaptive Execution Layer (AEL) provides flexible and transparent data processing with Spark in addition to the native Kettle engine. The goal of AEL is to let you develop complex pipelines visually and then execute them on Kettle or Spark based on data volume and SLA requirements; AEL allows PDI users to designate Spark, rather than Kettle, as the execution engine for their transformations.
The v9.0.0 release includes the following performance and flexibility enhancements to AEL-Spark:
· Step-level, Spark-specific performance tuning options (illustrated after this list)
· Enhanced logging configuration and more detailed information written to the PDI logs
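As a hedged illustration of what step-level tuning exposes, the underlying knobs are standard Spark properties; the exact option names shown in the step dialogs may differ, so consult the Spark tuning documentation listed under Additional Resources. Typical settings you might adjust for a heavy join or aggregation step include:
spark.executor.memory=8g
spark.executor.cores=4
spark.sql.shuffle.partitions=400
Caching or persisting a step’s output is also a common tuning choice when the same dataset feeds multiple downstream steps.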
Use Cases and Benefits
· Eliminates the black-box feel by giving better visibility into Spark execution
· Gives advanced Spark users tools to improve performance
Key Considerations
Users must be aware of the following additional items related to AEL v9.0.0:
· Spark v2.2 is not supported.
· Native HBase steps are only available for CDH and HDP distributions.
· Spark 2.4 is the highest Spark version currently supported.
Additional Resources
See the following documentation for more details: About Spark Tuning in PDI, Setup Spark Tuning, and Configuring Application Tuning Parameters for Spark.
Virtual File System (VFS) Enhancements
Capability
The changes to the VFS are in two main areas:
· Amazon S3 and Snowflake Staging were added as VFS providers for named VFS Connections. For the S3 protocol, 9.0 supports S3A and session tokens.
· The Pentaho VFS (pvfs) scheme was introduced, which can reference defined VFS Connections and their protocols.
The general format of a Pentaho VFS URL is:
pvfs://VFS_Connection/path (where the path includes a namespace, bucket, or similar)
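For example, with two hypothetical named connections, my-s3-connection and my-gcs-connection, paths could look like:
pvfs://my-s3-connection/my-bucket/landing/orders.csv
pvfs://my-gcs-connection/my-bucket/landing/orders.csv
Because steps reference only the connection name, repointing a connection at a different account or provider does not require editing the paths used in your jobs and transformations.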
Use Cases and Benefits
· Pentaho VFS gives you an abstraction of the protocol. If you change your provider in the future, all your jobs and transformations keep working seamlessly once the VFS Connection is updated. Today you reference S3; tomorrow you may want to reference another provider, for example HCP or Google Cloud. With Pentaho VFS, the maintenance burden in these cases is much lower.
· VFS Connections also enable you to use different accounts and servers (including namespaces, buckets, or similar) within one PDI transformation. For example, you can process data from S3 using different buckets and accounts within a single transformation.
· Combining named VFS connections with the new file browsing experience provides a convenient way to access remote locations and extends the reach of PDI. The new file browser also lets you manage files across those remote locations. For example, a user can copy files from Google Cloud into an S3 bucket using the browser’s copy and paste capabilities, and can then reference those files through their named connections in supported steps and job entries.
A user can manage all files, whether local or remote, in a central location. For example, there is no need to log in to the Amazon S3 Management Console to create folders or to rename, delete, move, or copy files. Even copying between the local filesystem and S3 is possible, and you can upload and download files from within Spoon.
The new file browser also offers capabilities such as search, which allows a user to find filenames that match a specified search string, and it remembers a user’s most recently accessed jobs and transformations for easy reference.
Key Considerations
As of PDI 9.0, the following protocols are supported: Amazon S3, Snowflake Staging (read only), HCP, and Google Cloud Storage.
The following steps and job entries have been updated to use the new file open/save dialog in 9.0: Avro input, Avro output, Bulk load into MSSQL, Bulk load into MySQL, Bulk load from MySQL, CSV File Input, De-serialize from file, Fixed File Input, Get data from XML, Get file names, Get files rows count, Get subfolder names, Google Analytics, GZip CSV input, Job (job entry), JSON Input, JSON Output, ORC input, ORC output, Parquet Input, Parquet output, Text file output, Transformation (job entry).
The File / Open dialog still uses the old browsing dialog. The new VFS browser for opening jobs and transformations can be reached through the File / Open URL menu entry.
Additional Resources
See Virtual File System connections, Apache Supported File Systems and Open a transformation for more information.
COBOL copybook steps
Capability
PDI now has two transformation steps that can be used to read mainframe records from a file and transform them into PDI rows.
· Copybook input: This step reads mainframe binary data files that were originally created using a copybook definition file and outputs the converted data to the PDI stream for use in transformations (a minimal copybook sketch follows this list).
· Read metadata from Copybook: This step reads the metadata of a copybook definition file to use with ETL Metadata Injection in PDI.
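For readers unfamiliar with copybooks, here is a minimal, hypothetical copybook definition of the kind the Copybook input step consumes; the field names and picture clauses are illustrative only:
01  CUSTOMER-RECORD.
    05  CUST-ID        PIC 9(6).
    05  CUST-NAME      PIC X(30).
    05  CUST-BALANCE   PIC S9(7)V99 COMP-3.
Copybook input uses such a definition to interpret each fixed-length binary record and emits one PDI row per record; Read metadata from Copybook exposes the same field names and types so they can be injected via ETL Metadata Injection.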
Use Cases and Benefits
Pentaho Data Integration supports simplified integration with fixed-length records in mainframe binary data files, so more users can ingest, integrate, and blend mainframe data as part of their data integration pipelines. This capability is critical if your business relies on massive customer and transactional datasets generated on mainframes that you want to search and query to create reports.
Key Considerations
These steps work with fixed-length COBOL records only. Variable record types, such as VB, VBS, and OCCURS DEPENDING ON, are not supported.
Additional Resources
For more information about using copybook steps in PDI, see Copybook steps in PDI.
Additional Enhancements
New Pentaho Server Upgrade Installer
The Pentaho Server Upgrade Installer is an easy-to-use graphical user interface that automatically applies the new release version to your archive installation of the Pentaho Server. Using this simplified upgrade process, you can upgrade archive installations of Pentaho Server versions 7.1 and later directly to version 9.0.
See Upgrade the Pentaho Server for instructions.
Snowflake Bulk Loader improvement
In PDI 9.0, the Snowflake Bulk Loader adds support for previewing a table. When connected to Snowflake, select a table from the drop-down menu on the Output tab; the preview window is populated with the columns and data types associated with that table, so you can match the expected column layout and data types against your data file.
For more information, see the Snowflake Bulk Loader job entry documentation.
Redshift IAM security support and Bulk load improvements
With this release, you have more Redshift database connection authentication choices:
· Standard credentials (default): user name and password
· IAM credentials
· A profile located on the local drive in an AWS credentials file (see the example after this list)
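For the profile option, the standard AWS credentials file (typically ~/.aws/credentials) is used; the profile name and key values below are placeholders:
[redshift_profile]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
The Redshift database connection then references this profile instead of embedding credentials directly.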
Bulk load into Amazon Redshift enhancements: the Bulk load into Amazon Redshift PDI entry gains a new Options tab and a Columns option on the Output tab. Use the settings on the Options tab to indicate whether all existing data in the database table should be removed before bulk loading. Use the Columns option to preview the column names and associated data types of your selected database table.
See Bulk load into Amazon Redshift for more information.
Improvements in AMQP and UX changes in Kinesis
The AMQP Consumer step now supports binary messages, allowing you, for example, to process Avro-formatted data.
Within the Kinesis Consumer step, users can now change the output field names and types.
See the documentation of the AMQP Consumer and Kinesis Consumer steps for more details.
Metadata Injection (MDI) Improvements
In PDI 9.0.0, we continue to enable more steps to support metadata injection (MDI):
· Split Field to Rows
· Delete
· String operations
In the Excel Writer step, the previously missing MDI support for the “Start writing at cell” option has been added, so this option can now be injected as well.
Additionally, the metadata injection example is now available in the samples folder:
/samples/transformations/metadata-injection-example
See ETL metadata injection for more details.
Excel Writer: Performance improvement
The performance of the Excel Writer has been drastically improved when using templates. A sample test file with 40,000 rows needed about 90 seconds before 9.0 and now processes in about 5 seconds.
For further details, please see PDI-18422.
JMS Consumer changes
In PDI 9.0, we added the following fields to the JMS Consumer step: MessageID, JMS timestamp, and JMS Redelivered.
These additions enable restartability and make it possible to skip duplicate messages.
For further details, please see PDI-18104 and the step documentation.
Text file output: Header support with AEL
You can set up the Text file output step to run on the Spark engine via AEL, and its Header option now works with AEL.
For further details, please see PDI-18083 and the Using the Text File Output step on the Spark engine documentation.
Transformation & Job Executor steps, Transformation & Job entries: UX improvement
Before 9.0, when passing parameters to transformations and jobs, the options “Stream column name” vs. “Value” (“Field to use” vs. “Static input value”) were ambiguous and led to hard-to-find issues.
In 9.0, PDI prevents a user from entering values into both fields, avoiding these situations.
For further details, please see PDI-17974.
Spoon.sh Exit code improvement
In certain situations, Spoon.sh (which is called by kitchen.sh and pan.sh) sent the wrong exit status.
In 9.0, we added a new environment variable, FILTER_GTK_WARNINGS, to control this behavior for warnings that affect the exit code. If the variable is set to any value, a filter is applied that ignores GTK warnings. If you do not want to filter any warnings, unset FILTER_GTK_WARNINGS.
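As a usage sketch (the job path below is a placeholder), export the variable before invoking kitchen.sh so that GTK warnings no longer mask the real exit status:
export FILTER_GTK_WARNINGS=1
./kitchen.sh -file=/path/to/my_job.kjb
echo $?
The echoed value then reflects the job result rather than a GTK warning.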
For further details, please see PDI-17271.
Dashboard: Option for exporting an Analyzer report to CSV format.
It is now possible to export an Analyzer report to a CSV file even when the report is embedded in a dashboard.
In the previous release, the export option was available, but without the CSV format; CSV export was available only when using Analyzer outside of dashboards. This change provides functional parity between standalone Analyzer charts and charts embedded in dashboards.
For further details, please see PDB-1327.
Analyzer: Use of date picker when selecting ranges for a Fiscal Date level relative filter.
Before 9.0, for a level in a Time dimension with an AnalyzerFiscalDateFormat annotation, Analyzer did not show the “Select from date picker” link.
Now, relative dates can be looked up from the current date on the Date level, and the date picker can also be used to select the nearest fiscal time period.
For further details, please see ANALYZER-3149.
Mondrian: Option for setting the ‘cellBatchSize’ default value.
In a default installation, mondrian.properties does not include mondrian.rolap.cellBatchSize as a configurable property. With this improvement, new builds include the property in mondrian.properties by default, with its default value clearly indicated, so customers do not run into performance issues caused by a value that is set too low.
The default value has been updated to mondrian.rolap.cellBatchSize=1000000.
This value was chosen because it can run a very large report with a 25M-cell space while keeping total server memory usage around 6.7 GB, which is under the 8 GB we list as the minimum memory required for a Pentaho server.
For further details, please see MONDRIAN-1713.