1 of 73

Knime Analytics

Advanced Training Part-3

itsmecevi.github.io

2 of 73

Workflow Control�Loops, Switches, Try-Catch

2

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

3 of 73

Workflow Control Structures

  • Loops
    • Iterate over a workflow snippet with variable inputs.
  • Switches
    • Direct the path of a workflow by selectively executing one or more workflow branches.
  • Try-Catch
    • Handle workflow branches that may fail in execution and you don‘t know before execution

3

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

4 of 73

The Loop Block

  • A loop block is defined by appropriate loop start and loop end nodes.
  • Loop body = Nodes in between and side branches.

4

Loop start node

Loop end node

Loop body

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

5 of 73

New Node: Group Loop Start

5

  • Similar to GroupBy except without aggregation tab.
  • Each iteration of the loop passes the next group of rows.
  • You implement the aggregation task. It can be anything from a complex calculation to updating a database.

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

6 of 73

New Node: Create File Name

6

  • Inputs: Directory, base file name, file Extension
  • Output: Flow variable -> use as input for e.g. writer node

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

7 of 73

Example: Writing aggregated files

  • Group Loop Start 🡪 Variable Loop End
  • Group data by specific column values
  • Iterate over all groups of data
  • Create an appropriate file name
  • Write grouped data to tables with new file name

7

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

8 of 73

Workflow Control Exercise, Activity I

Start with exercise: Workflow Control, Activity I

  • Read the file: CurrentDetailData.table
  • Group over all of the values in the Products column
  • For each group of data, write a new KNIME table to disk. Give it an appropriate filename.

8

(Hint: Group Loop Start creates a flow variable naming the current group)

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

9 of 73

New Node: List Files

9

  • List all files in a directory
  • Restrict to:
    • Top level directory (i.e. not recursive),
    • Specific file extensions
    • Matching name patterns (regex or wildcard)
  • Provides file references as a table of URLs and absolute paths

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

10 of 73

New Node: Table Row to Variable Loop Start

10

  • Similar to Table Row to Variable node
  • Each iteration of the loop converts the next row of the input table into flow variables
  • Inject variables into other nodes (often file readers) to re-execute subflows with a progression of settings

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

11 of 73

Example: Reading Many Files

  • List all files in a directory
  • Convert each file to a flow variable (1 per iteration)
  • In each iteration, read the file and collect the results

11

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

12 of 73

Workflow Control Exercise, Activity II

Start with exercise: Workflow Control, Activity II

  • Use List files to find the file names of the files created in exercise 2
  • Iterate over that list of files, and read them into a single KNIME Table

12

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

13 of 73

Switches

  • A switch allows you to selectively activate branches of a workflow
  • Inactive branches are marked with a red x on their output ports. Inactive nodes propagate down stream.

13

Active

Inactive

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

14 of 73

New Node: Single Selection

14

  • Quickform: Select single value from list of strings
  • Returns selection as string type flow variable
  • Choose between different layout options (dropdown, radio buttons...)

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

15 of 73

New Node: Rule Engine/Rule Engine Variable

15

  • Define custom logic to using simple rules.
  • Rules like: <Antecedent> => <Consequence>
    • (1=1 => “true”)
  • May be used in flow variables or tables
  • Easiest way to encode logic for switches

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

16 of 73

New Node: If Switch

16

  • Control which branches of your workflow are active programmatically
  • Controlled with a flow variable, setting the value to the literal strings: “top”, “bottom”, “both”
  • May be used in flow variables or tables (different nodes)

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

17 of 73

New Nodes: Case Switch Data

  • Similar to If-Switch: Takes data from single input port and passes it to the active output port
  • Nodes connected to inactive branches are not executed
  • Configure via node dialog, or pass port index as flow variable
    • 0, 1, 2 for top, middle, and bottom port
  • Case switches also available for flow variable and model ports

17

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

18 of 73

The Difference between Loops and Switches

Loops

    • The Loop Start is connected to the Loop End node, they form a pair.
    • A loop iterates over a workflow part.

Switches

    • A Switch Start can be used without a corresponding Switch End. They can also be combined.

18

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

19 of 73

Workflow Control Exercise, Activity III

  • Extend the workflow below with a switch to select the type of visualization.
  • Use a Single Input Quickform to let a user choose the values "scatter" or "bar"
  • Use a Rule Engine Variable node to convert the selection into the port index
  • Use a CASE Switch Data (Start) to create either a scatter plot or a bar plot depending on the input.
  • Combine the two paths with a CASE Switch Data (End)

19

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

20 of 73

Try-Catch

  • A way to catch errors in workflows
  • Useful when it is hard to know if a node will execute (for example, when connecting to a web service)
  • KNIME tries to execute the nodes, but if it fails will fall back to an alternate branch

20

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

21 of 73

Streaming

  • Standard execution: Node by node. Node processes all data, finishes, then passes data to next node, etc.
  • Streaming: Nodes executed concurrently, each nodes passes data to the next as soon as it is available, i.e. before node is fully executed
    • Faster execution, esp. for reading/preprocessing data
  • Create wrapped metanode -> Configure -> Job Manager Selection -> Simple Streaming
    • Not available for all nodes (show in node repository)
    • Can only execute entire metanode, not individual nodes
    • Intermediate results not available since nothing is cached

21

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

22 of 73

Streaming

22

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

23 of 73

Advanced Data Mining�Random Forest, Tree Ensembles, Parameter Optimization, Cross Validation

23

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

24 of 73

Overview

  • Advanced analytics nodes
    • Random Forest / Tree Ensembles
    • Gradient Boosted Trees
  • Parameter optimization
  • Cross validation
  • H2O and Keras integration in KNIME

24

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

25 of 73

KNIME’s Tree Ensemble Models

  • The general idea is to take advantage of the “wisdom of the crowd”
  • Ensemble models: Combining predictions from a large number of weak predictors, e.g. decision trees
  • Leads to a more accurate and robust model
  • This is called ”bagging”

25

Typically: for classification the individual models vote and the majority wins; for regression, the individual predictions are averaged

X

1

5

2

2

9

6

7

4

2

7

6

8

9

3

1

7

6

3

9

5

7

y

P1

P2

Pn

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

26 of 73

How Does Bagging Work?

  • Pick a different random subset of the training data for each model in the ensemble (bag)

26

1

5

2

2

9

6

7

4

5

7

2

8

9

3

1

7

6

3

9

5

7

Build tree

Build tree

Build tree

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

27 of 73

An Extra Benefit of Bagging: Out of Bag Estimation

  • Allows testing the model using the training data: when validating, each model should only vote on data points that were not used to train it

27

X2

1

5

2

2

9

6

7

4

2

7

6

8

9

3

1

7

6

3

9

5

7

y2OOB

P1

P2

Pn

X1

1

5

2

2

9

6

7

4

2

7

6

8

9

3

1

7

6

3

9

5

7

P1

P2

Pn

y1OOB

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

28 of 73

Random Forest

  •  

28

1

5

2

2

9

6

7

Build tree

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

29 of 73

New Nodes: Random Forest Learner

  • The output model describes a random forest and is applied in the corresponding predictor node using a simple majority vote
  • The statistics table on the attributes tells how often each attribute…
    • … is used in the first three splits
    • … was a possible candidate in the first three splits

29

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

30 of 73

New Nodes: Random Forest

  • Ensemble learning method for classification and regression tasks
  • It consists of a chosen number of decision trees
  • Each of the decision tree models is learned on a different set of rows (records) and a different set of columns (describing attributes)

30

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

31 of 73

Tree Ensembles

  • Random Forest variant
  • More options to set
  • Trees may be trained using subsets of rows and/or columns and this approach may lead to greater accuracy

31

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

32 of 73

Tree Ensembles

  • Optimization of a tree ensemble is complex due to a surplus of configuration options
    • Number of models
    • Number of columns
    • Number of rows
    • Tree depth
    • ...

32

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

33 of 73

New Nodes: Tree Ensemble Learner/Predictor

  • Choose which columns to include
  • Configure a prototype tree (depth, split criteria etc.)
  • Setup ensemble parameters (model count, row/column subsampling)

33

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

34 of 73

New Node: Gradient Boosted Trees Learner

  • Another algorithm for creating ensembles of decision trees
  • Starts with a tree built on a subset of the data
  • Builds additional trees to fit the residual errors
  • Typically uses fairly shallow trees
  • Can introduce randomness in choice of data subsets (“stochastic gradient boosting”) and in variable choice (Advanced Options)

34

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

35 of 73

Advanced Data Mining Exercise, Activity I

Start with exercise: Advanced Data Mining, Activity I

  • Read the data file CurrentDetailData.table
  • Partition the data 50/50 using stratified sampling on the Products column
  • Create a Tree Ensemble model to predict the “Products” column
    • Use a tree depth of 5, 50 models, and 75% of rows and columns for each iteration.

35

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

36 of 73

Parameter Optimization

  • Some modeling approaches are very sensitive to their configuration.
  • Calculating optimum settings is not always possible.
  • Parameter Optimization loops may help find a good configuration

36

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

37 of 73

New Node: Parameter Optimization Loop Start

  • Define some parameters to optimize
  • Set upper/lower bounds and step sizes (and flag integers)
  • Choose an optimization method
    • Brute force for maximum accuracy but slower computation
    • Hill climbing for better faster runtimes but may get stuck in local optimum settings

37

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

38 of 73

New Node: Parameter Optimization Loop End

  • Collects some value to optimize in a flow variable.
  • Value may be maximized (accuracy) or minimized (error)

38

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

39 of 73

Advanced Data Mining Exercise, Activity II

Start with exercise: Advanced Data Mining, Activity II

  • Add a parameter optimization loop to your Tree Ensemble Model
  • Use Hill climbing to determine the optimum number of models (min=10, max=200, step=10, int = yes)
  • Maximize the accuracy in the Loop End node.
  • What were the optimal settings?

(Hint: don’t forget to use the flow variable in your learner)

39

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

40 of 73

Cross Validation

  • Used to evaluate model stability
  • Re-execute the modeling process many times using different data partitions
  • Collect aggregated statistics on model accuracy

40

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

41 of 73

Example: Cross Validation

  • X-Partitioner 🡪 X-Aggregator
  • X-Partitioner replaces Partition
  • X-Aggregator replaces Scorer
  • Can be used with any learner/predictor

41

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

42 of 73

Advanced Data Mining Exercise, Activity III

Start with exercise: Advanced Data Mining, Activity III

  • Create a 10-fold cross validation for your Tree Ensemble Learner.
  • Calculate the mean error for the cross validation.
  • Does the model seem stable?

42

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

43 of 73

H2O Integration

  • KNIME integrates the H2O machine learning library
  • H2O: Open source, focus on scalability and performance
  • Supports many different models
    • Generalized Linear Model
    • Gradient Boosting Machine
    • Random Forest
    • k-Means, PCA, Naive Bayes, etc. and more to come!
  • Includes support for MOJO model objects for deployment

43

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

44 of 73

H2O Integration - Example

44

Starting point: create local H2O context

Add data from KNIME to H2O

Model training and prediction

Data import

Scoring

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

45 of 73

Deep Learning Integration

  • Keras integration:
    • Many different layer nodes.
    • Define your network, train and apply a network without a single line of code.
  • DL Python integration
  • TensorFlow integration

45

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

46 of 73

Sentiment Analysis Using Keras

46

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

47 of 73

Databases�

47

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

48 of 73

Database Extension

  • Visually assemble complex SQL statements (no SQL coding needed)
  • Connect to all JDBC-compliant databases
  • Harness the power of your database within KNIME

48

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

49 of 73

Database Port Types

49

Database JDBC Connection Port (red)

  • Connection information

Database Connection Port (brown)

  • Connection information
  • SQL statement

Database Connection Ports can be connected to

Database JDBC Connection Ports

but not vice versa

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

50 of 73

Database Table Selector

  • Takes connection information and constructs a query
  • Explore DB metadata
  • Outputs a SQL query

50

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

51 of 73

Database Connection Table Reader

  • Executes incoming SQL Query on Database
  • Reads results into a KNIME data table

51

Database Connection Port

KNIME Data Table

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

52 of 73

Database Connectors

  • Dedicated nodes to connect to specific Databases
    • Necessary JDBC driver included
    • Easy to use
    • Import DB specific behavior/capability
  • Hive and Impala connector part of �the KNIME Big Data Connectors extension
  • General Database Connector
    • Can connect to any JDBC source
    • Register new JDBC driver via�File -> Preferences -> KNIME -> Databases

52

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

53 of 73

Dedicated Database Connectors

  • MySQL, MS SQL Server, Postgres, SQLite, Amazon Redshift, etc.
  • Propagate connection information to other �DB nodes

53

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

54 of 73

«General» Database Connector Node

54

Database type defines SQL dialect

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

55 of 73

Register JDBC Driver

55

Open KNIME and go to File -> Preferences

Increase connection timeout for long running database operations

Register single jar file JDBC drivers

Register new JDBC driver with companion files

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

56 of 73

In-Database Processing

  • Database Manipulation node generates a SQL query on top of the input SQL query (brown square port)

56

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

57 of 73

Query Nodes

  • Filter rows and columns
  • Join tables/queries
  • Extract samples
  • Bin numeric columns
  • Sort your data
  • Write your own query
  • Aggregate your data

57

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

58 of 73

58

Data Aggregation

Aggregated on “Group” by method:

sum(“Value”)

RowID

Group

Value

r1

m

2

r2

f

3

r3

m

1

r4

f

5

r5

f

7

r6

m

5

RowID

Group

Sum(Value)

r1+r3+r6

m

8

r2+r4+r5

f

15

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

59 of 73

Database GroupBy

  • Aggregate to summarize data

59

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

60 of 73

Database GroupBy

60

Returns number of rows per group

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

61 of 73

Database GroupBy – DB Specific Aggregation Methods

61

PostgreSQL: 25 aggregation functions

SQLite: 7 aggregation functions

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

62 of 73

Database Joiner

  • Combines columns from 2 different tables
  • Top port contains “Left” data table
  • Bottom port contains the “Right” data table

62

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

63 of 73

Database Row Filter

  • Filters rows that do not match the filter criteria
  • Use the IS NULL or IS NOT NULL operator to filter missing values

63

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

64 of 73

Database Sorter

  • Sorts the input data by one or multiple columns

64

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

65 of 73

Database Query

  • Executes arbitrary SQL queries
  • #table# is replaced with input query

65

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

66 of 73

Database Connection Port View

66

Copy SQL statement

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

67 of 73

Export Data

  • Writing data back into database
  • Exporting data into KNIME
  • SQL operations are executed on the database!

67

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

68 of 73

Database Writing Nodes

  • Create table as select
  • Insert/append data
  • Update values in table
  • Delete rows from table

68

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

69 of 73

Database Writer

  • Writes data from a KNIME data table �directly into a database table

69

Append to or drop existing table

Increase batch size for better performance

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

70 of 73

Database Update

  • Updates all database records that �match the update criteria

70

Columns to update

Columns that identify the records to update

Increase batch size for better performance

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

71 of 73

Database Delete

  • Deletes all database records that match the values �of the selected columns

71

Increase batch size for better performance

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

72 of 73

Utility

  • Drop table
    • missing table handling
    • cascade option
  • Execute any SQL statement e.g. DDL
  • Manipulate existing queries

72

Execute queries separated by ; and new line

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.

73 of 73

73

Database Exercise

  • Connect to the H2 database, using the H2 Connector node.
  • Write the Fully Joined Data into the Database as a new table called "adult"
  • Use the Database Table Selector to select the adult table.
  • Group the data by products and count the number of occurrences.
  • Filter the adult data table to only contain products which occur more often than 1000 times.
  • Read the filtered table into a KNIME data table. (Triangle ports)

These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY-NC-SA 4.0.