Pentaho 9.0 is available

February 6, 2020   Pentaho

Without further ado: Get Enterprise Edition here, and get Community Edition here

PDI Multi Cluster Hadoop Integration

Capability

Pentaho connects to Cloudera Distribution for Hadoop (CDH), Hortonworks Data Platform (HDP), and Amazon Elastic MapReduce (EMR). Pentaho also supports many related services such as HDFS, HBase, Oozie, ZooKeeper, and Spark.

Before this release, the Pentaho Server, as well as the PDI design-time environment (Spoon), could work with only one Hadoop cluster at a time; working against multiple Hadoop clusters required multiple transformations, instances, and pipelines. The 9.0 release introduces major architectural changes that make it easy to configure, connect to, and manage multiple Hadoop clusters.

·       Users can access and process data from multiple Hadoop clusters, across different distros and versions, all from a single transformation and a single instance of Pentaho.

·       Also, within Spoon, users can now set up three distinct cluster configs, each referencing its specific cluster, without having to restart Spoon. There is also a new configuration UI for easily configuring your Hadoop drivers to manage different clusters.

·       Improved cluster configuration experience and secure connection with the new UI

·       Supports the following distros: Hortonworks HDP v3.0 and 3.1; Cloudera CDH v6.1 and 6.2; Amazon EMR v5.21 and 5.24.


Existing single cluster/shim functionality will continue to work.  


The following example shows multi-cluster support implemented in the same data pipeline by connecting to both Hortonworks HDP and Cloudera CDH clusters.

[Figure 1: A single data pipeline connecting to both Hortonworks HDP and Cloudera CDH clusters]


Use Cases and Benefits

·       Enables hybrid big data processing (on-premises or cloud), all within a single pipeline

·       Simplifies Pentaho’s integration with Hadoop clusters, including an enhanced UX for cluster configuration

Key Considerations

·       The Adaptive Execution Layer (AEL) for Spark isn’t validated to execute pipelines connecting to multiple Hadoop clusters.

·       Pentaho Map Reduce isn’t validated to execute pipelines connecting to multiple Hadoop clusters.

Additional Resources

See Adding a new driver for how to add a driver. See Connecting to a Hadoop cluster with the PDI client for how to create a named connection.

·       Follow the suggestions in the Big Data issues troubleshooting sections to help resolve common issues when working with Big Data and Hadoop, especially “Legacy mode activated when named cluster configuration cannot be located”.

PDI AEL-Spark Enhancements

Capability

[Figure 2: The Pentaho Adaptive Execution Layer (AEL) with the Spark engine]

The Pentaho Adaptive Execution Layer (AEL) provides flexible and transparent data processing with Spark, in addition to the native Kettle engine. The goal of AEL is to let you develop complex pipelines visually and then execute them in Kettle or Spark, depending on data volume and SLA requirements. AEL allows PDI users to designate Spark, rather than Kettle, as the execution engine for their transformations.


The v9.0.0 release includes the following performance and flexibility enhancements to AEL-Spark:

·       Step-level, Spark-specific performance tuning options

·       Enhanced logging configuration and more information written to PDI logs

·       Added support for Spark 2.4, in addition to the existing Spark 2.3 support

·       Supports the following distros: Hortonworks HDP v3.0 and 3.1; Cloudera CDH v6.1 and 6.2; Amazon EMR v5.21 and 5.24.


The following example showcases Spark App and Spark Tuning on specific steps within a PDI transformation:

[Figure 3: Spark App and Spark Tuning options on specific steps within a PDI transformation]
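As an illustration only, these are generic Spark configuration properties of the kind a user might tune for a resource-heavy step; the exact option names exposed in the PDI step dialog may differ:

    spark.executor.memory=4g            # more memory per executor for a heavy join or sort
    spark.executor.cores=2              # cores per executor
    spark.sql.shuffle.partitions=200    # partition count for shuffle-heavy operations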

                            

Use Cases and Benefits

·       Eliminates the black-box feel by giving better visibility into Spark execution

·       Provides advanced Spark users with tools to improve performance

Key Considerations

Users must be aware of the following additional items related to AEL v9.0.0:

·       Spark v2.2 is not supported.

·       Native HBase steps are only available for CDH and HDP distributions.

·       Spark 2.4 is the highest Spark version currently supported.

Additional Resources

See the following documentation for more details: About Spark Tuning in PDI, Setup Spark Tuning, Configuring Application Tuning Parameters for Spark

Virtual File System (VFS) Enhancements

Capability

The changes to the VFS are in two main areas:

1. We added Amazon S3 and Snowflake Staging as VFS providers for named VFS Connections and introduced the Pentaho VFS (pvfs), which can reference defined VFS Connections and their protocols. For the S3 protocol, 9.0 supports S3A and session tokens.

The general format of a Pentaho VFS URL is pvfs://VFS_Connection/path, where the path includes a namespace, bucket, or similar container (see the illustrative URLs below).

[Figure 4: Named VFS Connections referenced through Pentaho VFS (pvfs) URLs]


2. A new file browsing experience has been added. The enhanced VFS browser allows users to browse any preconfigured VFS location using named connections, their local filesystem, configured clusters via HDFS, and a Pentaho repository, if connected.

[Figure 5: The new VFS file browser]
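To make the pvfs URL format above concrete, here is a minimal sketch; the connection names, bucket, and paths are hypothetical:

    pvfs://my-s3-connection/my-bucket/input/sales.csv
    pvfs://my-snowflake-staging/my-stage/export/orders.csv

Because each URL references a named VFS Connection rather than a provider directly, repointing my-s3-connection at a different provider later leaves the URLs used in jobs and transformations unchanged.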

Use Cases and Benefits

·       Through the support of Pentaho VFS, you get an abstraction over the protocol. That means that when you change your provider in the future, all your jobs and transformations keep working seamlessly after you update the VFS Connection. Today you reference S3; tomorrow you may want to reference another provider, for example HCP or Google Cloud. With Pentaho VFS, your maintenance burden in these cases is much lower.

·       VFS Connections also enable you to use different accounts and servers (including namespaces, buckets, or similar) within one PDI transformation. For example, you can process data within one transformation from S3 using different buckets and accounts.

·       Combining named VFS connections with the new file browsing experience provides a convenient way to access remote locations and extends the reach of PDI. The new file browser also offers the ability to manage files across those remote locations. For example, a user can easily copy files from Google Cloud into an S3 bucket using the browser’s copy and paste capabilities. A user can then reference those files using their named connections in supported steps and job entries.

[Figure 6: Managing files across remote locations with the new file browser]


A user can manage all files, whether local or remote, in one central location. For example, there is no need to log in to the Amazon S3 Management Console to create folders or to rename, delete, move, or copy files. Even copying between the local filesystem and S3 is possible, and you can upload and download files from within Spoon.

The new file browser also offers capabilities such as search, which lets a user find filenames that match a specified search string. The file browser also remembers a user’s most recently accessed jobs and transformations for easy reference.

Key Considerations

As of PDI 9.0, the following protocols are supported: Amazon S3, Snowflake Staging (read only), HCP, and Google Cloud Storage.

The following steps and job entries have been updated to use the new file open save dialog for 9.0: Avro input, Avro output, Bulk load into MSSQL, Bulk load into MySQL, Bulk load from MySQL, CSV File Input, De-serialize from file, Fixed File Input, Get data from XML, Get file names, Get files rows count, Get subfolder names, Google Analytics, GZip CSV input, Job (job entry), JSON Input, JSON Output, ORC input, ORC output, Parquet Input, Parquet output, Text file output, Transformation (job entry)

The File / Open dialog still uses the old browsing dialog. The new VFS browser for opening jobs and transformations can be reached through the File / Open URL menu entry.

Additional Resources

See Virtual File System connections, Apache Supported File Systems and Open a transformation for more information.

Cobol copybook steps

Capability

PDI now has two transformation steps that can be used to read mainframe records from a file and transform them into PDI rows.

·       Copybook input: This step reads the mainframe binary data files that were originally created using the copybook definition file and outputs the converted data to the PDI stream for use in transformations.

·       Read metadata from Copybook: This step reads the metadata of a copybook definition file to use with ETL Metadata Injection in PDI.

The Copybook steps also support metadata injection and extended error handling, and can work with REDEFINES. Extensive examples for these use cases are available in the PDI samples folder.

[Figure 7: Copybook steps in a PDI transformation]
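For illustration, a fixed-length copybook definition of the kind these steps consume might look like the following; the record and field names are hypothetical:

           01  CUSTOMER-RECORD.
               05  CUST-ID        PIC 9(6).
               05  CUST-NAME      PIC X(30).
               05  ACCT-BALANCE   PIC S9(7)V99 COMP-3.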

Use Cases and Benefits

Pentaho Data Integration supports simplified integration with fixed-length records in mainframe binary data files, so that more users can ingest, integrate, and blend mainframe data as part of their data integration pipelines. This capability is critical if your business relies on massive amounts of customer and transactional datasets generated in mainframes that you want to search and query to create reports.

Key Considerations

This step works with fixed-length COBOL records only. Variable record types, such as VB, VBS, and OCCURS DEPENDING ON, are not supported.

Additional Resources

For more information about using copybook steps in PDI, see Copybook steps in PDI

Additional Enhancements

New Pentaho Server Upgrade Installer

The Pentaho Server Upgrade Installer is an easy-to-use graphical user interface that automatically applies the new release version to your archive installation of the Pentaho Server. You can upgrade Pentaho Server versions 7.1 and later directly to version 9.0 using this simplified upgrade process.

See Upgrade the Pentaho Server for instructions.

Snowflake Bulk Loader improvement

The Snowflake Bulk Loader has added support for a table preview in PDI 9.0. When connected to Snowflake, select a table in the drop-down menu on the Output tab; the preview window is then populated with the columns and data types associated with that table, so the user can see the expected column layout and data types to match up with the data file.

For more information, please see the job entry documentation of the Snowflake Bulk Loader.

Redshift IAM security support and Bulk load improvements

With this release, you have more Redshift database connection authentication choices:

·       Standard credentials (default): user name and password

·       IAM credentials

·       Profile located on the local drive in an AWS credentials file (see the example below)
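For reference, a profile in the standard AWS credentials file (typically ~/.aws/credentials) looks like the following; the profile name and key values are placeholders:

    [redshift-profile]
    aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
    aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx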

Bulk load into Amazon Redshift enhancements: a new Options tab, and a Columns option in the Output tab of the Bulk load into Amazon Redshift PDI entry. Use the settings on the Options tab to indicate whether all existing data in the database table should be removed before bulk loading. Use the Columns option to preview the column names and associated data types within your selected database table.

See Bulk load into Amazon Redshift for more information.

Improvements in AMQP and UX changes in Kinesis

The AMQP Consumer step now supports binary messages, for example allowing you to process Avro-formatted data.

Within the Kinesis Consumer step, users can change the output field names and types.

See the documentation of the AMQP Consumer and Kinesis Consumer steps for more details.

Metadata Injection (MDI) Improvements

In PDI 9.0.0, we continue to enable more steps to support metadata injection (MDI):

·       Split Field to Rows

·       Delete

·       String operations

In the Excel Writer step, the previously missing MDI option “Start writing at cell” has been added. This option can now also be injected.

Additionally, the metadata injection example is now available in the samples folder:
/samples/transformations/metadata-injection-example

See ETL metadata injection for more details.

Excel Writer: Performance improvement

The performance of the Excel Writer has been drastically improved when using templates. A sample test file with 40,000 rows needed about 90 seconds before 9.0 and now processes in about 5 seconds.

For further details, please see PDI-18422.

JMS Consumer changes

In PDI 9.0, we added the following fields to the JMS Consumer step: MessageID, JMS timestamp and JMS Redelivered.

This addition enables restartability and allows duplicate messages to be omitted.

For further details, please see PDI-18104 and the step documentation.

Text file output: Header support with AEL

You can set up the Text file output step to run on the Spark engine via AEL. The Header option of the Text file output step now works with AEL.

For further details, please see PDI-18083 and the Using the Text File Output step on the Spark engine documentation.

Transformation & Job Executor steps, Transformation & Job entries: UX improvement

Before 9.0, when passing parameters to transformations and jobs, the options “Stream column name” vs. “Value” (“Field to use” vs. “Static input value”) were ambiguous and led to hard-to-find issues.

In 9.0, we added behavior that prevents a user from entering values into both fields, avoiding these situations.

For further details, please see PDI-17974.

Spoon.sh Exit code improvement

Spoon.sh (which is called by kitchen.sh and pan.sh) could send the wrong exit status in certain situations.

In 9.0, we added a new environment variable, FILTER_GTK_WARNINGS, to control this behavior for warnings that affect the exit code. If the variable is set to any value, a filter is applied that ignores GTK warnings. If you do not want to filter any warnings, unset FILTER_GTK_WARNINGS.
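A minimal sketch of using the variable when running a job from the command line (the job file path is hypothetical):

    # Filter GTK warnings so they do not affect the exit code reported by kitchen.sh
    export FILTER_GTK_WARNINGS=1
    ./kitchen.sh -file=/path/to/my_job.kjb
    echo $?    # exit status of the job

    # Stop filtering GTK warnings
    unset FILTER_GTK_WARNINGS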

For further details, please see PDI-17271.

Dashboard: Option for exporting Analyzer reports in CSV format.

It is now possible to export an Analyzer report to a CSV file even when the report is embedded in a dashboard.

In the previous release the export option was available, but without the CSV format.

The CSV format was previously available only when using Analyzer outside dashboards; this change provides functional parity between standalone Analyzer charts and charts embedded in dashboards.

For further details, please see PDB-1327.

Analyzer: Use of date picker when selecting ranges for a Fiscal Date level relative filter.

Before 9.0, for a level in a Time dimension with an AnalyzerFiscalDateFormat annotation, Analyzer did not show the “Select from date picker” link.

Now, relative dates can be looked up from the current date on the Date level, and the date picker can also be used to select the nearest fiscal time period.

For further details, please see ANALYZER-3149.

Mondrian: Option for setting the ‘cellBatchSize’ default value.

In a default installation, the mondrian.properties file does not include mondrian.rolap.cellBatchSize as a configurable property.

The purpose of this improvement is to include this property in mondrian.properties by default in new builds, so customers do not run into performance issues caused by a default value that is set too low. The default value of the property should be clearly indicated in the properties file as well.

The default value has been updated to mondrian.rolap.cellBatchSize=1000000.

This value was chosen because it allows a very large report with a 25M-cell space to run while keeping total server memory usage around 6.7 GB, which is under the 8 GB we list as the minimum memory required for a Pentaho Server.
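As a sketch, the corresponding entry in mondrian.properties looks like this:

    # Batch size for cell requests; the 9.0 default avoids performance issues
    # caused by a value that is set too low
    mondrian.rolap.cellBatchSize=1000000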

For further details, please see MONDRIAN-1713.

Pedro Alves on Business Intelligence
