Useful, Wanted and Needful Kettle Plugins

Matt Casters, matt.casters@neo4j.com

mattcasters

Kettle Project
Founder

Honorary Soup Nazi
Angry Belgian

Neo4j Chief Solutions Architect

Including cheesy graphics!!!!

11th PCM!

Agenda

  • Opinions on Kettle @ Hitachi Vantara
  • Plugins:
    • Useful
    • Wanted
    • Needful
  • Links
  • Recap
  • Q&A

Since March, when I left Hitachi Vantara

Welcome to PCM!

My opinions on Kettle at Hitachi Vantara

Trigger Warning!

Correct
Biased Opinion!

Kettle @ Hitachi Vantara

  • Kettle / PDI → still the best kept secret
  • Developers lack DI/BD/IoT/ML domain knowledge
  • Deteriorating ${functionality}
  • Lack of innovation
  • Lack of architectural vision
  • Too much unfinished business (AEL, worker nodes, ...)
  • Focus on marketing Dog & Pony (IoT, ML, AI, ...)
  • Unstable x.x.0.0 : that don’t impress me much
  • New non-resizable dialogs are bad

Kettle @ Hitachi Vantara

  • Negative stance on WebSpoon
  • Working against community wrt git, file repo, ...
  • Not listening to what Kettle users want / need
  • Marketplace too slow and cumbersome to update
  • Marketplace containing too many stale plugins
  • JIRA cases
    • Unclear states and processes, slow or meaningless responses, “Won’t fix” resolutions for “unclear reasons”.

Kettle @ Hitachi Vantara

  • Huge codebase
    • over 1M lines of code in pentaho-kettle
    • Large enough so Hitachi couldn’t ruin it all
  • Lots of plugin entry points to fix issues
  • I looked at the 8.2 codebase and all is well!
  • Great community of people
  • Willingness to move forward
  • Still a kick-ass ETL tool

Useful

Azure Event Hubs

  • Writer
  • Listener
  • Streaming data processing

Load text from file

  • Apache Tika
  • Original from Matthew Burgess
  • Extract text itself
  • More file formats
  • Extra metadata
  • Updated to latest Tika version

MongoDB steps

  • Original from Ivy / Harris Ward
  • Updated to latest MongoDB API
  • MongoDB Insert
  • MongoDB Lookup
  • MongoDB Map/Reduce

MongoDB Changes

  • Stream data from MongoDB ChangeStream
  • New feature in MongoDB 3.6
  • → docs.mongodb.com/manual/changeStreams

Wanted

Kettle Carte

  • Goal:
    • Improve Carte services with extra plugins
  • Ideas:
    • Re-implement removed functionality
    • Add basic scheduler with Spoon UI
    • Improved status web UI
    • Add MetaStore elements CRUD
    • ...

Environment

  • Environment lifecycle management
    • GUI and Batch
    • KETTLE_HOME
    • PENTAHO_METASTORE_HOME
    • ...
  • In progress
  • Needs SpoonGit integration
  • Add LCM functionality:
    • compare, promote, package, ...
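
One way to picture an environment is as a named set of values for KETTLE_HOME and PENTAHO_METASTORE_HOME that gets applied before anything runs. The sketch below only illustrates that idea and is not the plugin's code; the environment names and paths are made up, and `compare` is a hypothetical first step towards the LCM "compare" function.

```python
import os

# Hypothetical environments: names and paths are illustrative only.
ENVIRONMENTS = {
    "dev": {
        "KETTLE_HOME": "/opt/kettle/dev",
        "PENTAHO_METASTORE_HOME": "/opt/kettle/dev/metastore",
    },
    "prod": {
        "KETTLE_HOME": "/opt/kettle/prod",
        "PENTAHO_METASTORE_HOME": "/opt/kettle/prod/metastore",
    },
}


def build_env(name, base=None):
    """Return a copy of the base environment with the named environment applied."""
    if name not in ENVIRONMENTS:
        raise KeyError(f"Unknown environment: {name}")
    env = dict(os.environ if base is None else base)
    env.update(ENVIRONMENTS[name])
    return env


def compare(a, b):
    """Return the variables whose values differ between two environments."""
    left, right = ENVIRONMENTS[a], ENVIRONMENTS[b]
    keys = sorted(set(left) | set(right))
    return {k: (left.get(k), right.get(k)) for k in keys if left.get(k) != right.get(k)}
```

The mapping returned by `build_env` could be passed as `env=` to `subprocess.run` when launching a batch tool, which keeps the GUI and batch sides on the same settings.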

Kettle Beam

  • Run Kettle transformations on the Apache Beam API
  • Just starting: looking at dependencies, learning the API
  • Project for the next few months
  • Help and requirements welcome

Needful

Needful things

  • Fixing some important broken Kettle things
  • Metastore fixes, Pan/Kitchen/Carte
  • Job entries:
    • Repeat and End Repeat
  • Steps:
    • Execute on Slave, Get Slave Status

Needful things: Maitre

  • Pan/Kitchen replacement
  • Modern arguments handling (picocli library)
  • Handles transformations and jobs
  • Environment support
  • Execution:
    • Local
    • Remote
    • Clustered
  • Proper local Metastore support
  • Proper “Run Configuration” support
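
To give a feel for what a single runner for both transformations and jobs looks like from the command line, here is a minimal sketch. It uses Python's argparse as a stand-in for picocli, and every option name in it is invented for illustration; these are not Maitre's actual options. The only Kettle facts it relies on are the `.ktr` (transformation) and `.kjb` (job) file extensions.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical Pan/Kitchen-style runner."""
    parser = argparse.ArgumentParser(prog="runner-sketch")
    parser.add_argument("file", help="path to a .ktr or .kjb file")
    parser.add_argument("-e", "--environment", default=None,
                        help="name of the environment to use (hypothetical option)")
    parser.add_argument("-m", "--mode", default="local",
                        choices=["local", "remote", "clustered"],
                        help="where to execute (hypothetical option)")
    parser.add_argument("-p", "--param", action="append", default=[],
                        metavar="NAME=VALUE", help="parameter to set, repeatable")
    return parser


def parse(argv):
    """Parse arguments and decide whether the file is a transformation or a job."""
    args = build_parser().parse_args(argv)
    if args.file.endswith(".ktr"):
        kind = "transformation"
    elif args.file.endswith(".kjb"):
        kind = "job"
    else:
        raise ValueError(f"Expected a .ktr or .kjb file, got: {args.file}")
    params = {}
    for item in args.param:
        name, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"Parameter must look like NAME=VALUE: {item}")
        params[name] = value
    return {"kind": kind, "file": args.file, "environment": args.environment,
            "mode": args.mode, "params": params}
```

For example, `parse(["load.ktr", "-m", "remote", "-p", "DATE=2018-11-24"])` describes a remote run of a transformation with one parameter.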

No repository support!1!İ!

WebSpoon / docker / REMIX

  • Dockerfile for WebSpoon + all plugins
    • Updated daily on Docker Hub
    • Configured to run alongside Neo4j in Docker
    • docker-compose script
    • Added samples/ + Neo4j plugin samples
  • Remix
    • Daily build
    • Patched 8.1.0.0 to 8.1.0.4 + latest plugins
    • remix.kettle.be (WARNING: 1GB download)

MY PLUGINS WORK WITH WEBSPOON!

THANK YOU HOTA-SAN!!

Data sets : unit testing

  • Major update: version 2.1
  • Fixed a lot of bugs
  • No longer changing transformation metadata
  • Relative path configurations (a.k.a. git support)
  • Setting parameters and variables
  • Working on
    • File-based data sets
    • Lookup data sets

Kettle Debug

  • Print duration at end of trans / job
  • Steps:
    • Set specific log levels on individual steps.
    • Only on certain rows
    • Only under certain conditions
  • Job Entries
    • Set specific log levels on individual entries.
    • Log result after execution, variables, rows, files
  • Remember and work with zoom levels
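
The step-level options above boil down to a small decision per row: use a more verbose log level only for a given step, only in a row range, and only when a condition holds. A minimal sketch of that decision, with an invented rule format and step name (the level names Basic and Rowlevel are Kettle's own):

```python
DEFAULT_LEVEL = "BASIC"

# Hypothetical rule: verbose logging for one step, rows 100-110,
# and only for rows without a customer_id.
RULES = {
    "Lookup customer": {
        "level": "ROWLEVEL",
        "first_row": 100,
        "last_row": 110,
        "condition": lambda row: row.get("customer_id") is None,
    },
}


def level_for(step, row_number, row, rules=RULES):
    """Return the log level to use for this step and row."""
    rule = rules.get(step)
    if rule is None:
        return DEFAULT_LEVEL
    in_range = rule["first_row"] <= row_number <= rule["last_row"]
    if in_range and rule["condition"](row):
        return rule["level"]
    return DEFAULT_LEVEL
```

Everything outside the rule stays at the default level, so the log stays readable while the rows of interest get full detail.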

Neo4j Plugins

  • Work with Neo4j graphs
  • CRUD through GUI or Cypher code
  • High performance
  • Written / supported by Neo4j & community
  • Examples on GitHub:
    • neo4j-examples/kettle-plugin-examples

Neo4j Logging

  • Log job and transformation executions to Neo4j
  • Stores full execution lineage and logging
  • Fast error resolution
    • Path to parent
    • Path to error
  • Delta window
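
"Path to parent" and "path to error" are graph traversals over the execution lineage. In Neo4j these would be path queries; the sketch below uses a plain Python dictionary as a stand-in so the idea can be run anywhere. The execution names and the data layout are invented for illustration.

```python
# Hypothetical execution lineage: each execution knows its parent
# and how many errors it reported.
EXECUTIONS = {
    "job:nightly": {"parent": None, "errors": 1},
    "job:load-dims": {"parent": "job:nightly", "errors": 0},
    "job:load-facts": {"parent": "job:nightly", "errors": 1},
    "trans:load-sales": {"parent": "job:load-facts", "errors": 1},
    "step:Table output": {"parent": "trans:load-sales", "errors": 1},
}


def path_to_parent(name, executions=EXECUTIONS):
    """Walk up from an execution to the top-level execution."""
    path = [name]
    while executions[path[-1]]["parent"] is not None:
        path.append(executions[path[-1]]["parent"])
    return path


def path_to_error(root, executions=EXECUTIONS):
    """Walk down from root, following the first failing child at each level."""
    path = [root]
    while True:
        failing = [name for name, execution in executions.items()
                   if execution["parent"] == path[-1] and execution["errors"] > 0]
        if not failing:
            return path
        path.append(failing[0])
```

Starting from the failed top-level job, `path_to_error` ends at the deepest execution that reported an error, which is usually where to start looking.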

(:Graphs)-[:ARE]->(:Everywhere)

[Diagram]

  • CSV files in input/ → Get details → :Queue
  • Slave servers → Get status → :SlaveServer
  • File queued? Capacity available? → Process file on least busy slave server
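
The dispatch decision on this slide (is a file queued, is capacity available, which slave server is least busy) can be sketched in a few lines. This is only an illustration with invented field names; "least busy" is assumed here to mean the lowest ratio of running work to capacity, which the slide does not specify.

```python
def pick_slave(slaves):
    """Return the least busy slave server with free capacity, or None."""
    candidates = [s for s in slaves if s["running"] < s["capacity"]]
    if not candidates:
        return None
    # Assumption: "least busy" means lowest running/capacity ratio.
    return min(candidates, key=lambda s: s["running"] / s["capacity"])


def dispatch(queue, slaves):
    """Assign queued files to slave servers until the queue or capacity runs out."""
    assignments = []
    while queue:
        slave = pick_slave(slaves)
        if slave is None:
            break  # no capacity available, leave the rest queued
        filename = queue.pop(0)
        slave["running"] += 1
        assignments.append((filename, slave["name"]))
    return assignments


slaves = [
    {"name": "slave-a", "capacity": 2, "running": 1},
    {"name": "slave-b", "capacity": 2, "running": 0},
]
queue = ["f1.csv", "f2.csv", "f3.csv", "f4.csv"]
assignments = dispatch(queue, slaves)
```

Files that cannot be placed stay in the queue and are picked up on a later pass, once a slave server reports free capacity again.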

Links

Link to this presentation

Recap

  • The Kettle community is making Kettle stable, usable, git-ready and container-ready
  • Try the new plugins
  • Join the effort, give feedback
  • (:Graphs)-[:ARE]->(:Everywhere)

Enjoy Kettle!!

Useful, Wanted and Needful Kettle Plugins : Q&A

20181124 Pentaho Community Meetup - Bologna