
Category Archives: Pentaho

Pentaho 9.1 is available!

October 19, 2020   Pentaho


 

It’s that time of the year! A new release is available!


Go get EE through the support portal, and CE in the usual place!

Main features

 

  • Google Dataproc support
  • Catalog steps in Spoon
  • New Upgrade Utility
  • And a bunch of consolidation stuff:
    • 20+ continuous improvements
    • 10+ platform updates
    • 200+ performance/quality bug fixes

 

Google Dataproc

You can now access and process data from a Google Dataproc cluster in PDI. Google Dataproc is a cloud-native Spark and Hadoop managed service that has built-in integration with other Google Cloud Platform services, such as BigQuery and Cloud Storage. With PDI and Google Dataproc, you can migrate from on-premise to the Google Cloud.

You can use PDI’s Google Dataproc driver and named connection feature to access data on your Google Dataproc cluster as you would other Hadoop clusters, like Cloudera and Amazon EMR. See Set up the Pentaho Server to connect to a Hadoop cluster for further instructions.

 

 

  • What’s new:
    • New Hadoop driver
    • AEL-Spark support
  • Version:
    • Google Dataproc 1.4 (Ubuntu 18.04 LTS, Hadoop 2.9, Spark 2.4)
  • Benefits:
    • Enables processing large data sets in Google Dataproc clusters
    • On-premises data movement/migration
  • The Hadoop driver supports the following:
    • Multi-cluster
    • HDFS
    • Hive
    • PMR Hive
    • Oozie
    • Sqoop
    • Hadoop Job Executor
    • Pig
    • Parquet / Avro / ORC
  • VFS support for GCS
  • HBase is not supported

 

Lumada Data Catalog steps for PDI

Lumada Data Catalog lets data engineers, data scientists, and business users accelerate metadata discovery and data categorization, and permits data stewards to manage sensitive data. Data Catalog collects metadata for various types of data assets and points to the asset’s location in storage. Data assets registered in Data Catalog are known as data resources.

You can use the following four new PDI steps to work with Data Catalog metadata and data resources within your PDI transformations:

  • Read Metadata: Searches Data Catalog’s existing metadata for specific data resources, including their storage location.
  • Write Metadata: Revises the existing Data Catalog tags associated with an existing data resource.
  • Catalog Input: Reads the CSV text file types or Parquet data formats of a Data Catalog data resource stored in a Hadoop or S3 ecosystem and outputs the data payload as rows for use in a transformation.
  • Catalog Output: Encodes CSV text file types or Parquet data formats using the schema defined in PDI to create a new data resource, or to replace or update an existing data resource in Data Catalog.

 

 

New Upgrade utility

 

  • Current scope:
    • 9.0 to 9.1 only (will extend to 8.3 LTS later)
  • Reliable upgrades and rollback:
    • An initial environment check detects which product components are present and upgrades only what is there
    • A whitelist persists customizations
    • All plug-ins are persisted across the upgrade
    • All database driver JARs are automatically whitelisted

 

Compatibility Updates

 

[Compatibility updates matrix shown as an image in the original post]

 

Other improvements:

 


 

  • Data Integration
    • S3 Multipart Upload now allows configurable part sizes (PDI-16606)
    • The MongoDB plug-in now allows PLAIN credentials for LDAP integration (PDI-17228)
  • Dashboards / Reporting
    • 10-100x performance improvement for certain large slices and roll-ups of Mondrian cubes (JIRA Link)
    • Option to remove/hide the filter panel when used in a dashboard (ANALYZER-2270)
    • Count and Count Distinct summaries on currency fields use the default format (PIR-699)
    • Admins can now customize the template(s) used for exporting to PDF and Excel (ANALYZER-12)
  • Platform
    • Passwords stored in the BA Server config files and repository are now encrypted (BISERVER-3497)
    • Users are now able to change their own password (BISERVER-13699)

 

 

 

-pedro 

 

 


Pentaho 9.0 is available

February 6, 2020   Pentaho


Without further ado: Get Enterprise Edition here, and get Community Edition here

PDI Multi Cluster Hadoop Integration

Capability

Pentaho connects to Cloudera Distribution for Hadoop (CDH), Hortonworks Data Platform (HDP), and Amazon Elastic MapReduce (EMR). Pentaho also supports many related services such as HDFS, HBase, Oozie, ZooKeeper, and Spark.

Before this release, the Pentaho Server, as well as the PDI design-time environment (Spoon), could work with only one Hadoop cluster at a time, so executing against multiple Hadoop clusters required multiple transformations, instances, and pipelines. With the 9.0 release, major architecture changes make it easy to configure, connect to, and manage multiple Hadoop clusters.

·       Users can access and process data from multiple Hadoop clusters, from different distros and versions, all from a single transformation and instance of Pentaho.

·       Also, within Spoon, users can now set up three distinct cluster configurations, each referencing its specific cluster, without having to restart Spoon. There is also a new configuration UI to easily configure your Hadoop drivers for managing different clusters.

·       Improved cluster configuration experience and secure connections with the new UI

·       Supports the following distros: Hortonworks HDP v3.0, 3.1; Cloudera CDH v6.1, 6.2; Amazon EMR v5.21, 5.24.


Existing single cluster/shim functionality will continue to work.  


The following example shows the Multi-cluster implemented in the same data pipeline via connecting to both Hortonworks HDP and Cloudera CDH clusters.

[Screenshot: a single pipeline connecting to both Hortonworks HDP and Cloudera CDH clusters]


Use Cases and Benefits

·       Enables hybrid big data processing support (on-prem or cloud)- all within single pipeline

·       Simplifies Pentaho’s integration with Hadoop clusters including enhanced UX of cluster configurations

Key Considerations

·       Adaptive Execution Layer Spark isn’t validated to execute pipelines connecting to multiple Hadoop clusters.

·       Pentaho Map Reduce isn’t validated to execute pipelines connecting to multiple Hadoop clusters.

Additional Resources

See Adding a new driver for how to add a driver. See Connecting to a Hadoop cluster with the PDI client for how to create a named connection.

·       Follow the suggestions in the Big Data issues troubleshooting sections to help resolve common issues when working with Big Data and Hadoop, especially Legacy mode activated when named cluster configuration cannot be located.

PDI AEL-Spark Enhancements

Capability


The Pentaho Adaptive Execution Layer (AEL) is intended to provide flexible and transparent data processing with Spark, in addition to the native Kettle engine. The goal of AEL is to develop complex pipelines visually and then execute in Kettle or Spark based on data volume and SLA requirements.  AEL allows PDI users to designate Spark as execution engine for their transformations apart from Kettle.


The v9.0.0 release includes the following performance and flexibility enhancements to AEL-Spark:

·       Step level spark specific performance tuning options

·       Enhanced logging configuration and information entered into PDI logs

·       Added support for Spark 2.4, with existing 2.3 support

·       Supports the following distros: Hortonworks HDP v3.0, 3.1; Cloudera CDH v6.1, 6.2; Amazon EMR v5.21, 5.24.


The following example showcases Spark App and Spark Tuning on specific steps within a PDI transformation:

[Screenshot: Spark application and Spark tuning options on specific steps within a PDI transformation]

                            

Use Cases and Benefits

·       Eliminates black box feel with better visibility

·       Enable advanced Spark users with tools to improve performance

Key Considerations

Users must be aware of the following additional items related to AEL v9.0.0:

·       Spark v2.2 is not supported.

·       Native HBase steps are only available for CDH and HDP distributions.

·       Spark 2.4 is the highest Spark version currently supported.

Additional Resources

See the following documentation for more details: About Spark Tuning in PDI, Setup Spark Tuning, Configuring Application Tuning Parameters for Spark

Virtual File System (VFS) Enhancements

Capability

The changes to the VFS are in two main areas:

1. We added Amazon S3 and Snowflake Staging as VFS providers for named VFS Connections and introduced the Pentaho VFS (pvfs) scheme, which can reference defined VFS Connections and their protocols. In the S3 protocol, we support S3A and Session Tokens in 9.0.

The general format of a Pentaho VFS URL is:
pvfs://VFS_Connection/path (including a namespace, bucket or similar)
See the illustrative example further below.



2. A new file browsing experience has been added. The enhanced VFS browser allows users to browse any preconfigured VFS locations using named connections, their local filesystem, configured clusters via HDFS, as well as a Pentaho repository, if connected.

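To make the URL format concrete with a made-up connection name: a step could read pvfs://sales-data/2019/orders.csv, where sales-data is a named VFS connection that currently points at an S3 bucket. If sales-data is later redefined to point at, say, Google Cloud Storage or HCP instead, the URL and the steps that use it stay exactly the same; only the connection definition changes.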

Use Cases and Benefits

·       Through the support of Pentaho VFS, you get an abstraction of the protocol. That means that when you change your provider in the future, all your jobs and transformations keep working seamlessly after updating the VFS Connection. Today, you reference S3. Tomorrow, you want to reference another provider, for example HCP or Google Cloud. Using Pentaho VFS, your maintenance burden in these cases is much lower.

·       VFS Connections also enable you to use different accounts and servers (including namespaces, buckets or similar) within one PDI transformation. Example: you want to process data within one transformation from S3 with different buckets and accounts.

·       Combining named VFS connections with the new file browsing experience provides a convenient way to easily access remote locations and extend the reach of PDI. The new file browser also offers the ability to manage files across those remote locations. For example, a user can easily copy files from Google Cloud into an S3 bucket using the browser’s copy and paste capabilities. A user can then easily reference those files using their named connections, in supported steps and job entries.



A user can manage all files, whether they are local or remote in a central location. For example, there is no need to login to the Amazon S3 Management Console to create folders, rename, delete, move or copy files. Even a copy between the local filesystem and S3 is possible and you can upload/download files from within Spoon.

The new file browser also offers capabilities such as search, which allows a user to find filenames which match a specified search string. The file browser also remembers a user’s most recently accessed jobs and transformations for easy reference.

Key Considerations

As of PDI 9.0, the following protocols are supported: Amazon S3, Snowflake Staging (read only), HCP, Google CS

The following steps and job entries have been updated to use the new file open save dialog for 9.0: Avro input, Avro output, Bulk load into MSSQL, Bulk load into MySQL, Bulk load from MySQL, CSV File Input, De-serialize from file, Fixed File Input, Get data from XML, Get file names, Get files rows count, Get subfolder names, Google Analytics, GZip CSV input, Job (job entry), JSON Input, JSON Output, ORC input, ORC output, Parquet Input, Parquet output, Text file output, Transformation (job entry)

The File / Open dialog is still using the old browsing dialog. The new VFS browser for opening jobs and transformations can be reached through the File / Open URL menu entry.

Additional Resources

See Virtual File System connections, Apache Supported File Systems and Open a transformation for more information.

Cobol copybook steps

Capability

PDI now has two transformation steps that can be used to read mainframe records from a file and transform them into PDI rows.

·       Copybook input: This step reads the mainframe binary data files that were originally created using the copybook definition file and outputs the converted data to the PDI stream for use in transformations.

·       Read metadata from Copybook: This step reads the metadata of a copybook definition file to use with ETL Metadata Injection in PDI.

The Copybook steps also support metadata injection, extended error handling and can work with redefines. Extensive examples for these use cases are available in the PDI samples folder.


Use Cases and Benefits

Pentaho Data Integration supports simplified integration with fixed-length records in mainframe binary data files, so that more users can ingest, integrate, and blend mainframe data as part of their data integration pipelines. This capability is critical if your business relies on massive amounts of customer and transactional datasets generated in mainframes that you want to search and query to create reports.

Key Considerations

These steps work with fixed-length COBOL records only. Variable record types such as VB, VBS, and OCCURS DEPENDING ON are not supported.

Additional Resources

For more information about using copybook steps in PDI, see Copybook steps in PDI

Additional Enhancements

New Pentaho Server Upgrade Installer

The Pentaho Server Upgrade Installer is an easy to use graphical user interface that automatically applies the new release version to your archive installation of the Pentaho Server. You can upgrade versions 7.1 and later of the Pentaho Server directly to version 9.0 using this simplified upgrade process via the user interface of the Pentaho Server Upgrade Installer.

See Upgrade the Pentaho Server for instructions.

Snowflake Bulk Loader improvement

The Snowflake Bulk Loader has added support for doing a table preview in PDI 9.0. When connected to Snowflake and on the Output tab, select a table in the drop-down menu. The preview window is populated, showing the columns and data types associated with that table. The user can see the expected column layout and data types to match up with the data file.

For more information, please see the job entry documentation of the Snowflake Bulk Loader.

Redshift IAM security support and Bulk load improvements

With this release, you have more Redshift database connection authentication choices. These are:

·       Standard credentials (default) – user password

·       IAM credentials

·       Profile located on local drive in AWS credentials file 

Bulk load into Amazon Redshift enhancements: New Options tab and Columns option in the Output tab of the Bulk load into Amazon Redshift PDI entry. Use the settings on the Options tab to indicate if all the existing data in the database table should be removed before bulk loading. Use the Columns option to preview the column names and associated data types within your selected database table.

See Bulk load into Amazon Redshift for more information.

Improvements in AMQP and UX changes in Kinesis

The AMQP Consumer step provides binary message support, for example allowing you to process Avro-formatted data.

Within the Kinesis Consumer step, users can change the output field names and types.

See the documentation of the AMQP Consumer and Kinesis Consumer steps for more details.

Metadata Injection (MDI) Improvements

In PDI 9.0.0, we continue to enable more steps to support metadata injection (MDI):

·       Split Field to Rows

·       Delete

·       String operations

In the Excel Writer step, the previously missing MDI option “Start writing at cell” has been added. This option can now be injected as well.

Additionally, the metadata injection example is now available in the samples folder:
/samples/transformations/metadata-injection-example

See ETL metadata injection for more details.

Excel Writer: Performance improvement

The performance of the Excel Writer has been drastically improved when using templates. A sample test file with 40,000 rows needed about 90 seconds before 9.0 and now processes in about 5 seconds.

For further details, please see PDI-18422.

JMS Consumer changes

In PDI 9.0, we added the following fields to the JMS Consumer step: MessageID, JMS timestamp and JMS Redelivered.

This addition enables restartability and allows duplicate messages to be omitted.

For further details, please see PDI-18104 and the step documentation.

Text file output: Header support with AEL

You can set up the Text file output step to run on the Spark engine via AEL. The Header option of the Text file output step now works with AEL.

For further details, please see PDI-18083 and the Using the Text File Output step on the Spark engine documentation.

Transformation & Job Executor steps, Transformation & Job entries: UX improvement

Before 9.0, when passing parameters to transformations/jobs, the options “Stream column name” vs. “Value” (“Field to use” vs. “Static input value”) were ambiguous and led to hard to find issues.

In 9.0, we added behavior which prevents a user from entering values into both fields to avoid these situations.

For further details, please see PDI-17974.

Spoon.sh Exit code improvement

Spoon.sh (which gets called by kitchen.sh or pan.sh) could send the wrong exit status in certain situations.

In 9.0, we added a new environment variable, FILTER_GTK_WARNINGS, to control this behavior for warnings that affect the exit code. If the variable is set to anything, a filter is applied to ignore any GTK warnings. If you don’t want to filter any warnings, then unset FILTER_GTK_WARNINGS.
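For instance, a small wrapper that calls kitchen.sh and acts on the exit code could enable the filter like this (a sketch only; the job path is a placeholder):

  # Sketch: run a job via kitchen.sh with GTK warnings filtered, then act on the exit code.
  import os
  import subprocess

  env = dict(os.environ, FILTER_GTK_WARNINGS="1")  # any value enables the filter
  result = subprocess.run(
      ["./kitchen.sh", "-file=/path/to/my_job.kjb"],  # placeholder job path
      env=env,
  )
  print("kitchen.sh exit code:", result.returncode)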

For further details, please see PDI-17271.

Dashboard: Option for exporting analyzer report into CSV format.

Now it’s possible to export an Analyzer report into a CSV format file even when it is embedded in a dashboard.

In the previous release the export option was available, but without the CSV format.

The CSV format was available only when using Analyzer outside dashboards, in this way we provide functional parity between Analyzer standalone charts and charts embedded in dashboards.

For further details, please see PDB-1327.

Analyzer: Use of date picker when selecting ranges for a Fiscal Date level relative filter.

Before 9.0 and for an AnalyzerFiscalDateFormat annotation on a level in a Time dimension, Analyzer did not show the “Select from date picker” link.

Now, relative dates can be looked up from the current date on the Date level, then the date picker can also be used to select the nearest fiscal time period.

For further details, please see ANALYZER-3149.

Mondrian: Option for setting the ‘cellBatchSize’ default value.

From a default installation the mondrian.properties does not include mondrian.rolap.cellBatchSize as a configurable property.

The purpose of this improvement is to include this property in the mondrian.properties by default in new builds so customers do not run into performance issues due to the default value for this property being set too low. The default value of the property should be clearly indicated in the properties file as well.

The default value has been updated to mondrian.rolap.cellBatchSize=1000000.

This value was chosen because this setting can run a very large 25M cell space report while keeping total server memory usage around 6.7 GB which is under the 8GB we list as the minimum memory required on a Pentaho server.

For further details, please see MONDRIAN-1713.


Home

February 3, 2019   Pentaho

This website contains links to useful resources concerning the Kettle open source data integration project.

Blog

Go to the Kettle blog: Matt Casters on Data Integration and Graphs

Downloads

Download the nightly built Kettle REMIX version 8.2.0.1 (>1GB)
Download the nightly built Kettle REMIX version 8.2.0.1 with initial Apache Beam support (UNSTABLE, >1GB)
WebSpoon docker image with Neo4j solutions plugins

Kettle plugins for Neo4j

Neo4j plugins for Kettle
Kettle Neo4j Logging

Kettle plugins

Environment
Needful Things
Data sets and unit testing
Debugging
Load Text From File (Apache Tika)
Metastore utilities
Read from MongoDB changes stream
Azure Event Hubs

Kettle integration with Apache Beam

Kettle Beam




Pentaho 8.2 is available!

December 13, 2018   Pentaho

I’ve come to accept my inefficiency on keeping up with the technical blog posts. This is the point where one accepts his complete uselessness (and I don’t even know if this is a real word!)

Anyway – up to the good things:

Pentaho 8.2 is available!

Get it here!
A really really solid release! A huge list of things that will make a serious impact on the development effort and production releases out there.

Release overview

Here’s the release at a glimpse:
  • Enhance Eco System Integration
    • Hitachi Content Platform (HCP) Connector I
    • MapR DB Support
    • Google Encryption Support
  • Improve Edge to Cloud Processing
    • Enhanced AEL 
    • Streaming AMQP
  • Better Data Operation
    • Expanded Lineage     
    • Status Monitoring UX
    • OpenJDK support
  • Enable Data Science & Visualization
    • Python Executor
    • PDI Data Science Notebook (Jupyter) Integration
    • Push Streaming
  • Improve Platform Stability and Usability
    • JSON Enhancements
    • BA Chinese Language Localization for PUC
    • Expanded MDI
  • Additional Improvements
And now a little bit of detail into each of them:

Ecosystem Integration

Hitachi Content Platform (HCP) Connectivity

HCP is a distributed storage system designed to support large, growing repositories of fixed-content data from simple text files to images, video to multi-gigabyte database images. HCP stores objects that include both data and metadata that describes that data and presents these objects as files in a standard directory structure.
An HCP repository is partitioned into namespaces owned and managed by tenants, providing access to objects through a variety of industry-standard protocols, as well as through various HCP-specific interfaces.
There are many use cases for using HCP in the Enterprise context:
  • Globally Compliant Retention Platform (GCRP)
    • Meet Compliance & Legal Retention requirements (WORM, SEC 17A-4, CFTC and MSRB)
  • Secure Analytics Archive
    • Big data source/target (land) for secure analytic workflows
    • Better Data portability
    • Multi-tenant
  • Protect data with much higher durability (up to fifteen 9s) and availability (up to ten 9s) with HCP
The PDI+HCP combo will allow many more resources to serve these use cases: by leveraging PDI’s connectivity to a wide variety of data, we can use HCP as a “staging data lake” for semi-structured and unstructured data, and/or use it as an execution environment for running data science algorithms against this type of content, like enriching HCP metadata or doing deep learning for image recognition.
In this release we implemented a VFS driver for HCP; Next versions will include a deeper, metadata level integration with HCP’s functionality.

MapR DB support

Simple but important improvement: MapR DB is now supported! It’s an enterprise-grade, high performance, global NoSQL database management system. It is a multi-model database that converges operations and analytics in real-time, including the HBase API to run HBase applications, even though not all features are compatible.
It’s now validated to read/write data from MapR-DB as HBase. In terms of what use cases this enables, I’d call out: Operational Data Hub/Real-Time BI, Customer 360 and several IoT related ones.

Google Cloud Encryption

Google CMEK allows data owners to have a multilayered security model that secures data and controls access to the data encryption keys. With this new capability, Pentaho users can use these custom encryption keys to access data in Google Cloud Storage and Google Big Query enhancing the security of the data. And we’re very happy to say that we were able to test that it just works with no product change required! Damn, feels good when it happens 😁

Edge to Cloud Processing

Adaptive Execution Layer (AEL) Improvements

AEL is our cluster-version of the “Write Once, Run Everywhere”, an abstraction layer where we currently have engines for Kettle (the classic one) and Spark.
Initially available in 7.1, we’ve been expanding it not only in terms of features but also in terms of vendor support. Here’s the current matrix:
[AEL compatibility matrix]
The current spark version supported is 2.3. There are many other point improvements to AEL. I won’t go into details on them but here’s a small overview:
  • Support for execution of MDI driven transformation via “ETL Metadata Injection” step
  • Support for sub-transformation steps Simple Mapping/Mapping (Transformation Executor was already supported)
  • Native Spark implementation for HBase and Hive
  • Support for S3 Cloud storage from AEL with native integration
One relevant change though is that starting from 8.2 AEL is available only on the EE version. It’s something that from the beginning was being debated, on opening it completely or not (the code was never available as we were actively changing the APIs and couldn’t guarantee stability in external contributions). After looking at all the data we made the tough call to pull it to EE land.

Streaming AMQP

8.0 introduced a new paradigm for handling streaming transformations in a continuous way. A new set of steps works in conjunction with a newly introduced step (“Get Records from Stream”) to process micro-batches of a continuous stream of records.
In the meantime, a few steps were introduced to ingest/produce streaming data from/to Kafka/MQTT/JMS, and 8.1 introduced a mechanism to pass all streaming data together downstream.
And now two steps were added: AMQP Consumer and AMQP Producer. Nuff said

Data Operations

Lineage improvements

Getting to the point:
  • What’s New
    • Architecture improvements for 3rd party lineage bridges (like IGC)
    • Add step and job entry “description” fields to lineage data output
    • Continued upgrading to Custom Lineage Analyzers for the following steps and job entries: Hadoop File Input & Output, Spark Submit, Mapping (sub-transformation), ETL Metadata Injection step (added relation to the sub-transformation being executed)
  • Benefits
    • Better and easier integration of 3rd party lineage bridges also for future partnering
    • Improved use of lineage information for documentation and compliance use cases
    • Expand completeness of data lineage steps and job entries

Monitoring Status Page Update

Probably one of the most requested features of the decade. Our “vintage-look” PDI status page has been refreshed, and along with it comes some extra functionality.

OpenJDK support

If you’ve been waiting for this one as I have, you’ll surely scream “FINALLY!!”! We now support OpenJDK 8 JRE in the server, AEL and client tools.

Advanced Analytics and Visualizations

There must be a reason why these 2 are connected on the same part – I just don’t know why 😅

Python Executor

If you’re an EE customer you’ll be able to benefit from this refreshed, AEL compliant Python executor. Feature-wise, I’d call out:
  • Automated ability to Get fields from Python script
  • Allows for multiple inputs (Row by Row or All Rows)
  • Ability to Pick a Python environment from one or more installed Python installations, i.e. virtual environments
  • Each Step gets its own Python session 
Used in conjunction with the existing R step and Spark Submit Job step to add overall Data Science offering and capabilities. 

PDI Data Science Notebook (Jupyter) Integration

Not a feature per se but an extremely useful consequence of the python step and all the improvements on data services.
Data scientists develop analytical models to achieve specific business results, and they perform much of their work within Notebooks, such as Jupyter. By using Pentaho Data Integration (PDI) and the new Python Executor step, Data Engineers can create data sets within PDI and make them available to be consumed within a Jupyter Notebook by the Data Scientist, as shown below in a collaborative workflow:
[Diagram: collaborative workflow between PDI and a Jupyter Notebook]

Using PDI and the Python Executor step, the required data set is created using a PDI Data Service (virtual table), which can be consumed in the Jupyter Notebook via a notebook template file. The file can be created programmatically via the Python Executor step in PDI and pre-filled with the required connection info for the PDI Data Service. A minimal connection sketch follows the list of considerations below.
Some technical considerations for the integration solution are as follows:
  • Pentaho Server needs to be running to host a PDI Data Service
  • PDI Spoon needs to be connected to the repository to save/deploy/edit the Data Service
  • PDI Data Service Client Jars need to be made available to be used by Jupyter Notebook
  • Compatible with Python 2.7.x or Python 3.5.x
  • Compatible with Jupyter Notebook 5.6.x
  • Python JDBC package dependencies include JayDeBeApi and jpype
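As a rough illustration of the kind of notebook cell this enables (the driver class, JDBC URL format, jar path, credentials and data service name below are assumptions to be checked against the Pentaho Data Services documentation and your own environment):

  # Minimal sketch: query a PDI Data Service (virtual table) from a Jupyter Notebook.
  # Assumes the PDI Data Service client JARs are available locally and a data
  # service named "sales_data" is published on a running Pentaho Server.
  import jaydebeapi
  import pandas as pd

  conn = jaydebeapi.connect(
      "org.pentaho.di.trans.dataservice.jdbc.ThinDriver",    # assumed driver class
      "jdbc:pdi://localhost:8080/kettle?webappname=pentaho", # assumed URL format
      ["admin", "password"],                                 # demo credentials
      jars="/path/to/pdi-dataservice-client.jar",            # placeholder jar path
  )
  df = pd.read_sql("SELECT * FROM sales_data", conn)         # data service name is made up
  print(df.head())
  conn.close()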

Streaming Visualizations and CTools (Push)

I obviously love this one! We finished the connection between the (really awesome) Streaming Data Services all the way through CTools dashboards. Now the dashboard components that are ready for it (tables and charts are) will receive data as soon as it’s ready, versus polling every N seconds.
When using CDE to create push-based streaming dashboards, the ‘Refresh period’ property for components and streaming over data services queries can now be left blank, as data can be continuously received and rendered as soon as new windows are available (see image below).
[Screenshot: CDE component with the ‘Refresh period’ property left blank]
More information can be accessed here:
https://help.pentaho.com/Documentation/8.2/Products/CTools/Create_Streaming_Service_Dashboard

Platform Updates

JSON Input updates

The JSON Input step now features a new Select Fields window for specifying what fields you want to extract from your source file. The window displays the structure of the source JSON file. Each field in the structure is displayed with a checkbox for you to indicate if it should be extracted from the file. You can also search within the structure for a specific field. Overall, these enhancements provide a drastic improvement in step usability.

BA Chinese Language Localization for PUC

请不要让我失望,谷歌翻译 (Please don’t let me down, Google Translate)

Expanded Metadata Injection support

Metadata injection enables the passage of metadata to transformation templates at runtime in order to drastically increase productivity, reusability, and automation of transformation workflow. This supports use cases like the onboarding of data from many files and tables to data lakes. In addition to existing metadata injection enabled steps, as of 8.2 you now can inject metadata into any field in the following Pentaho Data Integration (PDI) steps:
  • Get System Data
  • Execute Row SQL Script
  • Execute SQL Script
  • User Defined Java Class
  • AMQP Consumer
  • AMQP Producer
  • JMS Consumer
  • JMS Producer
  • Add a Checksum
  • Set Field Value
  • Set Field Value to a Constant

BA Analyzer Numeric Level Comparison Filters

In 8.2, users can filter Analyzer reports using numeric level comparison filters, which provide an added degree of flexibility and productivity to Business Analytics customers. Previously, level filters treated all levels as text-based/non-numeric, and as such required filter criteria based on either a picklist or string matching. 
The new filters for numeric levels include greater than, less than, greater than or equals, less than or equals, and between. For instance, as seen in the insurance example below, a numeric level representing monthly auto premiums can be filtered according to a numeric range, keeping only records and measure amounts (of individual customers) where the premium level is between $ 150 and $ 400 per month.
Additional considerations:
  • The numeric level comparison filters can be parametrized for use with Dashboard Designer
  • The filters can be applied via the report URL
  • If you are working with a high cardinality level, it may make sense to optimize performance by adjusting the mondrian.olap.maxConstraints property (ensure joins are handled by the underlying database) and/or rounding your data to manage cardinality

Additional Enhancements

Here are some not-so-minor other improvements that were done on the release:

PDI Step & Job Entry Improvements

  • User Defined Java Class step: Support of Java 1.8
    • Allow PDI users to make use of newer Java language features (e.g. enhanced for loops, lambda expressions, varargs, etc.)
  • Text File Output step: Added support of variables in the “Split every…rows” property
    • Improve creating of flexible output file sizes controlled by variables.
  • FTPS job entries: Support “Advanced server protection level”
    • All FTPS steps have been enhanced by supporting “private protection level”, so the data is secured by integrity and confidentiality.
  • Rest Client step: Allow to provide custom content type headers.
    • Many REST servers require custom content types to be sent to them. In particular W3C Semantic compliant data stores such as Allegrograph and MarkLogic Server. 
  • Text File Input Step: Provide the full stack trace when a file cannot be opened
    • The full stack trace will provide very valuable debugging information and allow root cause analysis of problems to resolve them more quickly.
  • Calculator step: Added exceptions when a file is not found.
    • Instead of providing bad data when a file is not available, the process ends with an error to notify the user of the issue.

BA Improvements

  • PUC Upload/Download: Users with ‘publish content’ permission can now upload/download files to PUC
    • No longer need to rely on a few users with complete ‘admin’ rights to move content btwn environments
  • Scheduling Access: PUC users without scheduling permissions can no longer see the scheduling perspective
    • More logical permissions and user experience for BA customers
  • MDX Performance: MDX optimizations for some scenarios that incl. subtotals, numeric filters, and percentages
    • Better performance in some Analyzer/Mondrian query scenarios
  • Analyzer Business Groups: Global setting option to expand or collapse Analyzer business groups
    • Long lists of fields can be rolled up by default when a report is opened, reducing scrolling / improving UX
  • Analyzer Numeric Dimension Filters: (*Stretch Goal*) Comparison filters ( < , > , btwn, …) on  numeric levels (i.e. age, credit score, customer id)
    • Much greater flexibility to query data with numeric levels (i.e. show me sales for customers between ages of 18 and 30).  Previously every distinct level value would have to be manually added to an include filter criteria.
Get it here and Enjoy!!!


Catching up with Kettle REMIX

November 23, 2018   Pentaho

Dear Kettle and Neo4j friends,

Since I joined the Neo4j team in April I haven’t given you any updates despite the fact that a lot of activity has been taking place in both the Neo4j and Kettle realms.

First and foremost, you can grab the cool Neo4j plugins from neo4j.kettle.be (the plugin in the marketplace is always out of date since it takes weeks to update the metadata).

Then based on valuable feedback from community members we’ve updated the DataSet plugin (including unit testing) to include relative paths for filenames (for easier git support), to avoid modifying transformation metadata and to set custom variables or parameters.

Kettle unit testing ready for prime time

I’ve also created a plugin to debug transformations and jobs a bit easier.  You can do things like set specific logging levels on steps (or only for a few rows) and work with zoom levels.

Clicking right on a step you can choose “Logging…” and set logging specifics.

Then, back on the subject of Neo4j, I’ve created a plugin to log the execution results of transformations and jobs (and a bit of their metadata) to Neo4j.

Graph of a transformation executing a bunch of steps. Metadata on the left, logging nodes on the right.

Those working with Azure might enjoy the Event Hubs plugins for a bit of data streaming action in Kettle.

The Kettle Needful Things plugin aims to fix bugs and solve silly problems in Kettle.  For now it sets the correct local metastore on Carte servers AND… features a new launcher script called Maitre. Maitre supports transformations and jobs, local, remote and clustered execution.

The Kettle Environment plugin aims to take a stab at life-cycle management by allowing you to define a list of Environments:

The Environments dialog shown at the start of Spoon

In each Environment you can set all sorts of metadata but also the location of the Kettle and MetaStore home folders.

Finally, because downloading, patching, installing and configuring all this is a lot of work, I’ve created an automated process which does this for you on a daily basis (for testing), so you can download Kettle Community Edition version 8.1.0.0 patched to 8.1.0.4 with all the extra plugins above in its 1GB glory at: remix.kettle.be

To get it on your machine simply run:

wget remix.kettle.be -O remix.zip

You can also give these plugins (Except for Needful-things and Environment) a try live on my sandbox WebSpoon server.  You can easily run your own WebSpoon from the also daily updated docker container.

If you have suggestions, bugs, rants, please feel free to leave them here or in the respective GitHub projects.  Any feedback is as always more than welcome.  In fact, thank you all for the feedback given so far.  It’s making all the difference.  If you feel the need to contribute more opinions on the subjects of Kettle feel free to send me a mail (mattcasters at gmail dot com) to join our kettle-community Slack channel.

Enjoy!

Matt


Pentaho Community Meeting – PCM18! Bologna, Italy, November 23-25!

August 2, 2018   Pentaho


PCM 18!!

If you’ve been in one, no more words are needed, just go ahead and register! If you don’t know what I’m talking about, just go ahead and register as well!

It’s the best example of what Pentaho – now part of Hitachi Vantara – is all about. A very passionate group of people that are absolutely world class at what they do and still know how to have a good time!

PCM17 group photo

Now shamelessly copy-pasting the content from it-novum:

Pentaho Community Meeting 2018

Pentaho Community Meeting 2018 will take place in Bologna from November 23-25. It will be organized by Italia Pentaho User Group and by it-novum, the host of PCM17. As always, it will be a 3-days event full of presentations, networking and fun and we invite Pentaho users of every kind to participate!

For PCM18 we will meet in the beautiful city of Bologna. The guys of Italia User Group will take care of the venue and the program. With Virgilio Pierini as group representative we not only have a Pentaho enthusiast but also a native of Bologna guiding us to the beautiful corners of the hometown of Europe’s oldest university!

What is Pentaho Community Meeting?

Pentaho Community Meeting is an informal gathering for Pentaho users from around the world. We meet to discuss the latest and greatest in Pentaho products and exciting geek stuff (techie track) as well as best practices of Pentaho implementations and successful projects (business track). Read this summary of Pentaho Community Meeting 2017 to learn more.

PCM18 is open to everyone who does something with Pentaho (development, extensions, implementation) or plans to do data integration, analytics or big data with Pentaho. Several Pentaho folks – architects, designers, product managers – will share their latest developments with us.

The event is community-oriented and open-minded. There’s room for networking and exchanging ideas and experiences. Participants are free to break off into groups and work together.

Call for Papers

For sure, this is intended to be a community event – for the community and by the community. To register your proposal for the agenda, please use the contact form to send a brief description including your name and title in English until September 30th.

Agenda

The agenda will be updated continuously, so stay tuned for updates! All updates will be posted on twitter, too.

Friday, November 23 | Hackathon

We start the three-day PCM with a hackathon, snacks and drinks. After a 2-hour hackathon, a highly esteemed jury will award the most intelligent/awkward/funny hacks.

Saturday, November 24 | Conference Day

Still a lot to be determined! We’re still receiving papers

  • Welcome speech | Stefan Müller and the org team
  • The future of Pentaho in Hitachi Vantara | Pedro Alves, Hitachi Vantara
  • What’s new in PDI 9.0 | Jens Bleuel, Hitachi Vantara
  • Useful Kettle plugins | Matt Casters, Neo4j (and founder of Kettle)
  • IoT and AI: Why innovation is a societal imperative | Wael Elrifai, VP for Solution Engineering – Big Data, IOT & AI, Hitachi Vantara
  • Pentaho at CERN | Gabriele Thiede, CERN
  • Pentaho User Group Italia
  • SSBI (Self Service BI ) – Pentaho Plugin Update | Pranav Lakhani, SPEC INDIA
  • Scaling Pentaho Server with Kubernetes | Diethard Steiner
  • Capitalizing on Lambda & Kappa Architectures for IoT with Pentaho | Issam Hizaji, Lead Sales Engineer, Data Analytics & IoT | Emerging & Southern

After the lunch, everybody splits up to join the business or the techie track.

Sunday, November 25 | Social Event

Brunch, sightseeing and… let’s see!

—-

Anyway, believe me, you want to go! GO REGISTER HERE!


Pentaho 8.1 is available

June 12, 2018   Pentaho


The team has once again over delivered on a dot release! Below are what I think are the many highlights of Pentaho 8.1 as well as a long list of additional updates.
If you don’t have time to read to the end of my very long blog, just save some time and download it now. Go get your Enterprise Edition or trial version from the usual places


For CE, you can find it on the community home!

Cloud

One of the biggest themes of the release: Increased support for Cloud. A lot of vendors are fighting for becoming the best providers, and what we do is try to make sure Pentaho users watch all that comfortably sitting on their chairs, having a glass of wine, and really not caring about the outcome. Like in a lot of areas, we want to be agnostic – which is not saying that we’ll leverage the best of each – and really focus on logic and execution.
It’s hard to do this as a one time effort, so we’ve been adding support as needed (and by “as needed” I really mean based on the prioritization given by the market and our customers). A big focus of this release was Google and AWS:

Google Storage (EE)

Google Cloud Storage is a RESTful unified storage service for storing and accessing data on Google’s infrastructure. PDI support for importing and exporting data to/from Cloud Storage is now provided through a new VFS driver (gs://). You can use it in the several steps that support VFS as well as browse its contents.
These are the roles required on Google Storage for this to work:
●     Storage Admin
●     Storage Object Admin
●     Storage Object Creator
●     Storage Object Viewer
In terms of authentication, you’ll need the following environment variable defined:
GOOGLE_APPLICATION_CREDENTIALS="/opt/Pentaho81BigQuery.json"
From this point on, just treat it as a normal VFS source.

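As a side note, the same service account JSON and GOOGLE_APPLICATION_CREDENTIALS variable also drive Google’s own client libraries, so a quick way to confirm (outside of PDI) that the credentials and Storage roles are in place is a couple of lines of Python; the bucket name below is a placeholder:

  # Credential sanity check (not PDI code): relies on GOOGLE_APPLICATION_CREDENTIALS
  # pointing at the same service account JSON file used for the PDI VFS connection.
  from google.cloud import storage  # pip install google-cloud-storage

  client = storage.Client()
  for blob in client.list_blobs("my-staging-bucket", max_results=5):  # placeholder bucket
      print(blob.name, blob.size)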

 Google BigQuery – JDBC Support  (EE/CE)

BigQuery is Google’s serverless, highly scalable, low cost enterprise data warehouse. Fancy name for a database, and that’s how we treat it.
In order to connect to it first we need the appropriate drivers. Steps here are pretty simple:
1.      Download free driver:  https://cloud.google.com/bigquery/partners/simba-drivers/
2.      Copy google*.* files from Simba driver to /pentaho/design-tools/data-integration/libs folder
Host Name will default to https://www.googleapis.com/bigquery/v2 but your mileage may vary.
Unlike the previous item, authentication doesn’t use the previously defined environment variable the way the Google VFS driver does. Authentication here is done at the JDBC driver level, through a driver option, OAuthPvtKeyPath, set in the Database Connection options, which you point to the Google Storage certificate in the P12 key format (a quick way to sanity-check the service account follows the role list below).
The following Google BigQuery roles are required:
1.      BigQuery Data Viewer
2.      BigQuery User
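Along the same lines, it can be handy to confirm the service account actually has the two BigQuery roles above before wiring up the JDBC connection. A few lines with Google’s Python client will do it (again outside of PDI, and note it takes the JSON key, whereas the Simba JDBC driver is configured with the P12 key):

  # Role/credential sanity check (not PDI code); uses the service account JSON key.
  from google.cloud import bigquery  # pip install google-cloud-bigquery

  client = bigquery.Client.from_service_account_json("/opt/Pentaho81BigQuery.json")
  rows = client.query("SELECT 1 AS ok").result()  # runs a trivial query job
  for row in rows:
      print(row.ok)  # prints 1 if the job ran successfully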

Google BigQuery – Bulk Loader  (EE)

While you can use a regular table output to insert data into BigQuery that’s going to be slow as hell (who said hell was slow? This expression makes no sense at all!). So we’ve added a step for that: Google BigQuery Loader.
This step leverages Google’s loading abilities, and the load is processed on Google’s side, not in PDI. So the data, which has to be in either Avro, JSON or CSV format, has to be previously copied to Google Storage. From that point on it’s pretty straightforward. Authentication is done via the GOOGLE_APPLICATION_CREDENTIALS environment variable pointing to the Google JSON file.
Google Drive  (EE/CE)
While Google Storage will probably be seen more frequently in production scenarios, we also added support for Google Drive, a file storage and synchronization service that allows users to store files on their servers, synchronize files across devices, and share files.
This is also done through a VFS driver, but given it’s a per user authentication a few steps need to be fulfilled to leverage this support:
●     Copy your Google client_secret.json file into (The Google Drive option will not appear as a Location until you copy the client_secret.json file into the credentials directory and restart)
o   Spoon: data-integration/plugins/pentaho-googledrive-vfs/credentials directory, and restart spoon.
o   Pentaho Server:  pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-googledrive-vfs/credentials directory and restart the server
●     Select Google Drive as your Location. You are prompted to login to your Google account.
●     Once you have logged in, the Google Drive permission screen displays.
●     Click Allow to access your Google Drive Resources.
●     A new file called StoredCredential will be added to the same place where you had the client_secret.json file. This file will need to be added to the Pentaho Server credential location and that authentication will be used

Analytics over BigQuery  (EE/CE, depending on the tool used)

This JDBC connectivity to Google BigQuery, as defined previously for Spoon, can also be used throughout all the other Business Analytics browser and client tools – Analyzer, CTools, PIR, PRD, modeling tools, etc. Some care has to be taken here, though, as BigQuery’s pricing is related to 2 factors:
●     Data stored
●     Data queried
While the first one is relatively straightforward, the second one is harder to control, as you’re charged according to total data processed in columns selected. For instance, a ‘select *’ query should be avoided if only specific columns are needed. To be absolutely clear, this has nothing to do with Pentaho, these are Google BigQuery pricing rules.
So ultimately, and a bit like we need to do on all databases / data warehouses, we need to be smart and work around the constraints (usually speed and volume, on this case price as well) to leverage best what these technologies have to offer. Some examples are given here:
●     By default, there is BigQuery caching and cached queries are free. For instance, if you run a report in Analyzer, clear the Mondrian cache, and then reload the report, you will not be charged (thanks to the BigQuery caching)
●     Analyzer: Turn off auto refresh, i.e, this way you design your report layout first, including calculations and filtering, without querying the database automatically after each change
●     Analyzer: Drag in filters before levels to reduce data queried (i.e. filter on state = California BEFORE dragging city, year, sales, etc. onto canvas)
●     Pre-aggregate data in BigQuery tables so they are smaller in size where possible (to avoid queries across all raw data)
●     GBQ administrators can set query volume limits by user, project, etc. (quotas)

AWS S3 Security Improvements (IAM) (EE/CE)

PDI is now able to get IAM security keys from the following places (in this order):
1.      Environment Variables
2.      Machine’s home directory
3.      EC2 instance profile
This added flexibility helps accommodate different AWS security scenarios, such as integration with S3 data via federated SSO from a local workstation, by providing secure PDI read/write access to S3 without making the user provide hardcoded credentials.
The IAM user secret key and access key can be stored in one place so they can be leveraged by PDI without repeated hardcoding in Spoon. These are the environment variables that point to them (a short illustration of the same credential chain follows the list):
●     AWS_ACCESS_KEY_ID
●     AWS_SECRET_ACCESS_KEY
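For what it’s worth, this lookup order mirrors the standard AWS SDK default credential chain (environment variables, then the credentials file in the home directory, then the EC2 instance profile), so a quick way to check which credentials a machine will resolve before pointing PDI at S3 is a short diagnostic script like the one below (boto3, not PDI code):

  # Diagnostic sketch: show which AWS credentials the default provider chain resolves
  # on this machine (env vars, then ~/.aws/credentials, then EC2 instance profile).
  import boto3

  session = boto3.Session()
  creds = session.get_credentials()
  if creds is None:
      print("No AWS credentials found by the default chain")
  else:
      print("Access key in use:", creds.get_frozen_credentials().access_key[:4] + "...")
      s3 = session.client("s3")
      print("Buckets visible:", [b["Name"] for b in s3.list_buckets().get("Buckets", [])])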


Big Data / Adaptive Execution Layer (AEL) Improvements


Bigger and Better (EE/CE)

AEL provides spectacular scale out capabilities (or is it scale up? I can’t cope with these terminologies…) by seamlessly allowing a very big transformation to leverage a clustered processing engine.
Currently we have support for Spark through the AEL layer, and throughout the latest releases we’ve been improving it in 3 distinct areas:
●     Performance and resource optimizations
o   Added Spark Context Reuse which, under certain circumstances, can speed up startup performance in the range of 5x faster, proving especially useful under development conditions
o   Spark History Server integration, providing centralized administration, auditing and performance reviews of the transformations executed in Spark
o   Ability to pass customized Spark properties down to the cluster, allowing finer-grained control of the execution process
●     Increased support for native steps (e.g., leveraging the Spark-specific group by instead of the PDI engine one)
●     Adding support for more cloud vendors – and we just did that for EMR 5.9 and MapR 5.2
This is the current support matrix for Cloud Vendors:

[AEL cloud vendor support matrix]

Sub Transformation support (EE/CE)

This one is big, as it was the result of a big and important refactor on the kettle engine. AEL Now supports executing sub transformations through the Transformation Executor step, a long-standing request since the times of good-old PMR (Pentaho Map Reduce)

Big Data formats: Added support for Orc (EE/CE)

Not directly related to AEL, but for most of the use cases where we want AEL execution we’ll need to input data in a big-data-specific format. In previous releases we added support for Parquet and Avro, and we now added support for ORC (Optimized Row Columnar), a format favored by Hortonworks.
Like the others, ORC will be handled natively when transformations are executed in AEL.

Worker Nodes (EE)


Jumping from scale-out to scale-up (or the opposite, like I mentioned, I never know), we continue to do lots of improvements on the Worker Nodes project. This is an extremely strategic project for us as we integrate with the larger Hitachi Vantara portfolio.
Worker nodes allow you to execute Pentaho work items, such as PDI jobs and transformations, with parallel processing and dynamic scalability with load balancing in a clustered environment. It operates easily and securely across an elastic architecture, which uses additional machine resources as they are required for processing, operating on premise or in the cloud.
It uses the Hitachi Vantara Foundry project, that leverages popular technologies under the hood such as Docker (Container Platform), Chronos (Scheduler) and Mesos/Marathon (Container Orchestration).
For 8.1 there are several other improvements:
●     Improvements in monitoring, with accurate propagation of work item status
●     Performance improvements by optimizing the startup times for executing the work items
●     Customizations are now externalized from docker build process
●     Job clean up functionality


Streaming


In Pentaho 8.0 we introduced a new paradigm to handle streaming datasources. The fact that it’s a permanently running transformation required a different approach: The new streaming steps define the windowing mode and point to a sub transformation that will then be executed on a micro batch approach.
That works not only for ETL within the kettle engine but also in AEL, enabling spark transformations to feed from Kafka sources.

New Streaming Datasources: MQTT, and JMS (Active MQ / IBM MQ) (EE/CE)

Leveraging on the new streaming approach, there are 2 new steps available – well, one new and one (two, actually) refreshed.
The new one is MQTT – Message Queuing Telemetry Transport – an ISO standard publish-subscribe-based messaging protocol that works on top of the TCP/IP protocol. It is designed for connections with remote locations where a “small code footprint” is required or the network bandwidth is limited.  Alternative IoT centric protocols include AMQP, STOMP, XMPP, DDS, OPC UA, WAMP


There are two new steps, MQTT Input and MQTT Output, which connect to the broker for consuming messages and publishing back the results.
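To make the moving parts concrete, here is a tiny stand-alone publisher using the Eclipse Paho Python client (not a Pentaho component); an MQTT consumer step pointed at the same broker and topic would receive these messages as its streaming input. The broker address and topic are placeholders:

  # Minimal MQTT publisher sketch (pip install paho-mqtt; paho-mqtt 1.x API shown,
  # 2.x additionally requires a CallbackAPIVersion argument to Client()).
  import json
  import time
  import paho.mqtt.client as mqtt

  client = mqtt.Client()
  client.connect("broker.example.com", 1883)  # placeholder broker
  client.loop_start()
  for i in range(5):
      payload = json.dumps({"sensor_id": "s-01", "reading": 20.0 + i, "ts": time.time()})
      client.publish("factory/sensors/temperature", payload)  # placeholder topic
      time.sleep(1)
  client.loop_stop()
  client.disconnect()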
Other than this new, IoT-centered streaming source, there are two new steps, JMS Input and JMS Output. These steps replace the old JMS Consumer/Producer and the IBM WebSphere MQ steps and, in the new mode, support the following message queue platforms:
●     ActiveMQ
●     IBM MQ
Safe Stop (EE/CE)
This new paradigm for handling streaming sources introduced a new challenge that we never had to face before. Usually, when we triggered jobs and transformations, they had a well-defined start and end; our stop functionality was used when we wanted to basically kill a running process because something was not going well.
However, in these streaming use cases, a transformation may never finish. So stopping a transformation the way we’ve always done – by stopping all steps at the same time – could have unwanted results.
So we implemented a different approach: we added a new option to safe-stop a transformation, implemented within Spoon, Carte and the Abort step, that instead of killing all the step threads stops the input steps and lets the other steps gracefully finish the processing, so no records currently being processed are lost.


This is especially useful in real-time scenarios (for example reading from a message bus). It’s one of those things that when we look back seems pretty dumb that it wasn’t there from the start. It actually makes a lot of sense, so we went ahead and made this the default behavior.

Streaming results (EE/CE)

When we launched streaming in Pentaho 8.0 we focused on the processing piece: we could launch the sub-transformation, but we could not get results back. Now we can define which step in the sub-transformation sends its results back to the rest of the flow.

Why is this important? Because of what comes next…
Streaming Dataservices (EE/CE)
There's a new option to run a data service in streaming mode. This allows consumers (in this case, CTools dashboards) to get streaming data from the data service.

Once defined, we can test these options on the data service test page and see the results as they come in.

This screen exposes the functionality as it would be called from a client (a client-side sketch follows the property list below). It's important to note that the windows we define here are not the same as the ones defined for the micro-batching sub-transformation. The window properties are the following:
●     Window Size – The number of rows a window will hold (row based), or the time frame during which new rows are captured into a window (time based).
●     Every – The number of rows (row based) or milliseconds (time based) that should elapse before a new window is created.
●     Limit – The maximum number of milliseconds (row based) or rows (time based) to wait before a new window is generated.
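
For reference, this is roughly what calling a data service from a client looks like over PDI's thin JDBC driver. Treat the driver class name, URL format, credentials and the my_streaming_service name as assumptions to verify against your own server; the streaming window options above are supplied by the consumer tooling (the test page here and the CTools datasource described below), so the sketch only shows the general call pattern.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DataServiceClientSketch {
    public static void main(String[] args) throws Exception {
        // Assumed thin-driver class and URL format; adjust host, port and webapp
        // name to match your Pentaho Server. "my_streaming_service" is a placeholder.
        Class.forName("org.pentaho.di.trans.dataservice.client.ThinDriver");
        String url = "jdbc:pdi://localhost:8080/kettle?webappname=pentaho";

        try (Connection conn = DriverManager.getConnection(url, "admin", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM \"my_streaming_service\"")) {
            while (rs.next()) {
                // Print the first column of each row as it arrives.
                System.out.println(rs.getString(1));
            }
        }
    }
}
```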

CTools and Streaming Visualizations (EE/CE)

We took a holistic approach to this feature: we want to make sure we can build a real-time / streaming dashboard leveraging everything that was set up before. And this is where CTools comes in. There's a new datasource available in CDE to connect to streaming data services:

Then, in the component configuration, we select the kind of query we want – time based or number-of-records based – along with window size, frequency and limit. This gives us good control over a lot of use cases.

This then allows us to connect a component the usual way. While this will probably be most relevant for components like tables and charts, ultimately all of them will work.
It is possible to achieve a level of multi-tenancy by passing a user name parameter from the PUC session (via CDE) to the transformation as a data service push-down parameter. This enables restricting the data viewed on a user-by-user basis.
One important note: the CTools streaming visualizations do not yet operate on a 'push' paradigm – that is on the current roadmap. In 8.1, the visualizations poll the streaming data service at a constant interval, with a lower refresh limit of 1 second. But then again… if you're building a dashboard of this type and need a 1-second refresh, you're definitely doing something wrong…

Time Series Visualizations (EE/CE)

One of the biggest use cases for streaming, from a visualization perspective, is time series. We improved CCC's support for time-series line charts, so data trends over time are now shown without needing workarounds.
This applies not only to CTools but also to Analyzer.

Data Exploration Tool Updates (EE)

We're continuing on our path of improving the Data Exploration Tool. It's no secret that we want to make it feature-complete so that it can become the standard data analysis tool for the entire portfolio.
This time we worked on adding filters to the Stream view.
We'll keep improving this. Next in the queue, hopefully, will be filters on the model view and date filters!

Additional Updates

As usual, there were several additional updates that did not make it into my highlights above. So, for the sake of your time and to avoid a 100-page blog post, here are even more updates in Pentaho 8.1.
Additional updates:
●     Salesforce connector API update (API version 41)
●     Splunk connection updated to version 7
●     Mongo version updated to 3.6.3 driver (supporting 3.4 and 3.6)
●     Cassandra version updated to support version 3.1 and Datastax 5.1
●     PDI repository browser performance updates, including lazy loading
●     Improvements to the Text file and Hadoop file output steps, including limit and control file handling
●     Improved logging by removing auto-refresh from the Kettle logging servlet
●     Admins can empty other users' trash folders in PUC
●     Clear button in the PDI step search in Spoon
●     Override the JDBC driver class and URL for a connection
●     Suppressed the Pentaho 'session expired' pop-up in SSO scenarios, redirecting to the proper login page
●     Added the option to schedule report generation with a timestamp to avoid overwriting content
In summary (and wearing my marketing hat), with Pentaho 8.1 you can:
●      Deploy in hybrid and multi-cloud environments with comprehensive support for Google Cloud Platform, Microsoft Azure and AWS for both data integration and analytics
●      Connect, process and visualize streaming data, from MQTT, JMS, and IBM MQ message queues and gain insights from time series visualizations
●      Get better platform performance and increase user productivity with improved logging, additional lineage information, and faster repository access

Download it

Go get your Enterprise Edition or trial version from the usual places


For CE, you can find it on the community home!

Pedro

Pedro Alves on Business Intelligence

Farewell Pentaho

February 20, 2018   Pentaho

Dear Kettle friends,

12 years ago I joined a wonderful team of people at Pentaho who thought they could make a real change in the world of business analytics. At that point I had recently open sourced my own data integration tool (then still called 'ETL') called Kettle, and so I joined in the role of Chief Architect of Data Integration. The title sounded great and the job included everything from writing articles (and a book), massive amounts of coding, testing, software releases, giving support, doing training, workshops, … In other words, life was simply doing everything I possibly and impossibly could to make our software succeed when deployed by our users. With Kettle now being one of the most popular data integration tools on the planet, I think it's safe to say that this goal has been reached and that it's time for me to move on.

I don’t just want to announce my exit from Pentaho/Hitachi Vantara. I would also like to thank all the people involved in making our success happen. First and foremost I want to express my gratitude to the founders (Richard, Doug, James, Marc, …) for even including a crazy Belgian like myself on the team but I also want to extend my warmest thanks to everyone who I got to become friends with at Pentaho for the always positive and constructive attitude. Without exaggeration I can say it’s been a lot of fun.

I would also explicitly like to thank the whole community of users of Kettle (now called Pentaho Data Integration). Without your invaluable support in the form of new plugins, bug reports, documentation, forum posts, talks, … we could never have pulled off what we did in the past 12 years! I hope we will continue to meet at one of the many popular community events.

Finally I want to thank everyone at Hitachi and Hitachi Vantara for being such a positive and welcoming group of people. I know that Kettle is used all over Hitachi and I’m quite confident this piece of software will not let you down any time soon.

Now I’m going to go skiing for a week and when I get back it’s time to hunt for a new job. I can’t wait to see what impossible problems need solving out there…

Cheers,
Matt

Matt Casters on Data Integration

Pentaho 8 is now available!

November 30, 2017   Pentaho

I recently wrote about everything you needed to know about Pentaho 8. And now it's available! Go get your Enterprise Edition or trial version from the usual places.

For CE, you can find it on the new community home!

Enjoy!

-pedro

Pedro Alves on Business Intelligence

A new collaboration space

November 9, 2017   Pentaho

With the move to Hitachi Vantara we're not letting the community go away – quite the contrary. One of the first steps is giving the community a new home, here: http://community.pentaho.com

We're trying to gather people from the forums, user groups and elsewhere, and give them a better, more modern collaboration space. The existing spaces will remain open, not least because their content is extremely valuable, so the ultimate decision is yours.

Your mission, should you choose to accept it, is to register and try this new home. We're counting on your help to make it a better space.

See you in http://community.pentaho.com

Cheers!

-pedro

Pedro Alves on Business Intelligence
