1 of 54

The use of AI in biological research

Anup Kumar

Bioinformatics group and Freiburg Galaxy team

Department of Computer Science

Faculty of Engineering

University of Freiburg

March 26, 2026

2 of 54

Agenda

  1. Who are we?
  2. Introduction to machine learning
  3. Machine learning for bioinformatics and plant biology
  4. European Galaxy server and Galaxy Training Network
  5. Hands-on: Predict chronological age from DNA methylation datasets using machine learning tools in Galaxy

2

26 March 2026

Anup Kumar | The Use of AI in Biological Research |

3 of 54

Bioinformatics group, Freiburg

4 of 54

Galaxy Project

5 of 54

Introduction to machine learning

6 of 54

What is machine learning?

7 of 54

Machine learning: General idea

8 of 54

Machine learning: General idea

9 of 54

Supervised machine learning: Classification and regression

Classification: targets as categories

Regression: targets as real numbers
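The distinction between the two target types can be sketched in a few lines. This is a minimal illustration on made-up toy data, not code from the talk: the nearest-centroid classifier, the least-squares line, and all variable names are assumptions chosen for brevity.

```python
from statistics import mean


def fit_nearest_centroid(X, y):
    """Classification: learn one centroid (mean feature vector) per category."""
    centroids = {}
    for label in set(y):
        rows = [x for x, target in zip(X, y) if target == label]
        centroids[label] = [mean(column) for column in zip(*rows)]
    return centroids


def predict_category(centroids, x):
    """Return the category whose centroid is closest to x."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def fit_line(xs, ys):
    """Regression: ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    slope = numerator / denominator
    return slope, y_bar - slope * x_bar


# Classification: the target is a category (hypothetical toy features).
X_cls = [[1.0, 1.2], [0.8, 1.0], [3.0, 3.1], [3.2, 2.9]]
y_cls = ["healthy", "healthy", "diseased", "diseased"]
centroids = fit_nearest_centroid(X_cls, y_cls)
predicted_label = predict_category(centroids, [2.9, 3.0])

# Regression: the target is a real number (toy data lying on y = 2x + 1).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
predicted_value = slope * 5.0 + intercept
```

The same features could feed either task; only the type of the target (category versus real number) changes which kind of model applies.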

10 of 54

Decision tree

Decision tree

Random forest

11 of 54

Support vector machines (SVM)

12 of 54

Deep learning architectures

13 of 54

Machine learning: Convolutional neural network (CNN)

14 of 54

Rise of machine learning based approaches in bioinformatics, …

15 of 54

Machine learning: Varied approaches, datasets, …

AI in bioinformatics leverages machine learning, deep learning, and reinforcement learning to analyze multimodal inputs (such as genomic sequences, medical images, and chemical structures) in order to solve core biological problems ranging from protein structure prediction and gene regulation to single-cell analysis and the de novo design of molecules.

Reinforcement learning in bioinformatics is focused on decision-making processes, where algorithms iteratively learn to make a series of decisions in a dynamic environment to achieve specific goals. This approach is especially beneficial in drug discovery and genomics, where models adapt and optimize outcomes based on feedback from biological simulations or experimental data.

16 of 54

Machine learning: Varied approaches, datasets, …

Overview of key biological research domains where machine learning (ML) is actively applied. Each panel indicates representative tasks and commonly used ML approaches, including both unsupervised learning (UL) and supervised learning (SL). In genomics and proteomics, ML helps evaluate gene expression patterns, identify SNPs, and model protein function or metabolic networks. In systems biology, models support network modeling and cell interaction prediction while in agriculture, ML enables crop yield prediction and pest management. In medicine and disease modeling, models like logistic regression and random forest are used for disease prediction and personalized treatment strategies, while PCA and t-SNE assist in patient stratification. In ecology and environmental biology, classification tasks such as species distribution modeling often leverage random forests and SVMs, while PCA and clustering methods help explore change across gradients.

17 of 54

Multiple domains and ML models in Bioinformatics

18 of 54

Machine learning with plant datasets: Features, labels, …

Construction of deep learning-based disease detection model in plants

19 of 54

Machine learning with plant datasets: Data augmentation

Construction of deep learning-based disease detection model in plants

20 of 54

Machine learning with plant datasets: Classification models

Step 1: The crop classification model was constructed using diseased and healthy leaf images: 1911 of bell pepper, 1448 of potato, and 3150 of tomato. The crop species was recognized by a submodel and assigned to one of three categories: bell pepper, potato, or tomato.

Step 2: After accurate crop recognition, detection models were used to determine disease occurrence for each crop by detecting the presence or absence of disease symptoms or patterns of symptoms.

Step 3: A disease classification model was created for potato and tomato.
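The three steps above form a cascade in which each stage decides which model runs next. The sketch below shows only that control flow. The function `diagnose`, the dictionary-based "images", and the label `disease_A` are hypothetical stand-ins; in the actual study each stage would be a trained deep learning model.

```python
def diagnose(image, crop_model, detectors, disease_classifiers):
    """Run the three-step cascade on one leaf image."""
    crop = crop_model(image)                    # Step 1: which crop is it?
    is_diseased = detectors[crop](image)        # Step 2: crop-specific disease detection
    disease = None
    if is_diseased:
        classifier = disease_classifiers.get(crop)  # Step 3: only some crops have one
        if classifier is not None:
            disease = classifier(image)
    return {"crop": crop, "diseased": is_diseased, "disease": disease}


# Toy stand-ins: an "image" is a dict that already carries its own answers.
def toy_crop_model(image):
    return image["crop"]


def toy_detector(image):
    return image["symptom"] is not None


def toy_disease_classifier(image):
    return image["symptom"]


detectors = {
    "bell pepper": toy_detector,
    "potato": toy_detector,
    "tomato": toy_detector,
}
# Per the slide, a disease classification model exists for potato and tomato only.
disease_classifiers = {
    "potato": toy_disease_classifier,
    "tomato": toy_disease_classifier,
}

tomato_result = diagnose(
    {"crop": "tomato", "symptom": "disease_A"},
    toy_crop_model, detectors, disease_classifiers,
)
pepper_result = diagnose(
    {"crop": "bell pepper", "symptom": "disease_A"},
    toy_crop_model, detectors, disease_classifiers,
)
healthy_result = diagnose(
    {"crop": "potato", "symptom": None},
    toy_crop_model, detectors, disease_classifiers,
)
```

One consequence of a cascade like this is that an error in step 1 sends the image to the wrong downstream models, which is presumably why the slide stresses accurate crop recognition first.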

Construction of deep learning-based disease detection model in plants

21 of 54

Machine learning with plant datasets: CNN architecture

Construction of deep learning-based disease detection model in plants

22 of 54

Machine learning with plant datasets: Segmentation using CNN

Classifying three different legume species: white bean, red bean and soybean

23 of 54

Machine learning with plant datasets: features

Classifying three different legume species (white bean, red bean and soybean). Heatmaps show the input image regions where partial occlusion most changes the network's output probability for the correct class. Red regions correspond to a decrease in that probability, while green regions indicate an increase. Each panel shows 9 examples from each class.
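Occlusion heatmaps of this kind can be computed by masking one patch at a time and recording how the correct-class probability changes. The following is a minimal sketch under stated assumptions: `toy_model` is a made-up stand-in for a trained CNN, and the patch size and fill value are arbitrary choices, not values from the cited study.

```python
def occlusion_sensitivity(image, predict_proba, patch=2, fill=0.0):
    """Return a heatmap of probability changes caused by occluding each patch.

    image: 2D list of pixel intensities.
    predict_proba: callable returning the correct-class probability for an image.
    Negative values mean occlusion lowered the probability (red in the figure);
    positive values mean it raised the probability (green in the figure).
    """
    height, width = len(image), len(image[0])
    baseline = predict_proba(image)
    heatmap = [[0.0] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            occluded = [row[:] for row in image]  # copy so the original stays intact
            rows = range(top, min(top + patch, height))
            cols = range(left, min(left + patch, width))
            for r in rows:
                for c in cols:
                    occluded[r][c] = fill
            delta = predict_proba(occluded) - baseline
            for r in rows:
                for c in cols:
                    heatmap[r][c] = delta
    return heatmap


def toy_model(image):
    """Hypothetical 'network' that only looks at the top-left 2x2 corner."""
    return sum(image[r][c] for r in range(2) for c in range(2)) / 4.0


image = [[1.0] * 4 for _ in range(4)]
heatmap = occlusion_sensitivity(image, toy_model)
```

In this toy case only the top-left patch matters to the model, so it is the only region of the heatmap with a non-zero (negative) value.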

24 of 54

Advancing plant biology through deep learning-powered natural language processing

ML approaches in plant biology, to accelerate knowledge and genetic improvement

25 of 54

Machine learning: Plant disease diagnosis framework

An ML pipeline for detecting plant diseases. The approach begins with data collection in agricultural fields, where pictures are taken. The data are then pre-processed, for example by rotating, resizing, and rescaling the images. Several data augmentation methods are used to enlarge the dataset, create variability, and improve the model. The training and testing datasets are both drawn from the pre-processed data collected in the earlier steps and are used to develop and test the model. After evaluation on the testing data, the trained model can detect crop diseases, allowing for improved management and control of the diseases.

In general, the order of operations is data collection, pre-processing, model building, and evaluation before the DL model is used for plant leaf disease detection. The initial dataset consists of field-collected leaf photographs of potato, pepper, and tomato crops. In the data collection and storage step, the photos are collected and stored for future use. In the pre-processing step, the photographs are resized, rescaled, and rotated to improve the quality and consistency of the data and to provide normalized photos for training the model. The dataset is also artificially expanded to add variability and reduce overfitting. After pre-processing, the dataset is split into training and testing sets.
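The pre-processing, splitting, and augmentation steps can be sketched with plain nested lists standing in for images. This is an illustrative sketch only: the helper names, the 80/20 split, and the choice to augment only the training portion are assumptions, not details taken from the slide.

```python
import random


def rescale(image):
    """Map 8-bit pixel values (0-255) to the range 0-1."""
    return [[pixel / 255.0 for pixel in row] for row in image]


def rotate90(image):
    """Rotate a 2D image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image):
    """Mirror a 2D image left to right."""
    return [row[::-1] for row in image]


def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle reproducibly, then hold out a fraction of samples for testing."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def augment(samples):
    """Add a rotated and a flipped copy of every (image, label) pair."""
    augmented = []
    for image, label in samples:
        augmented.append((image, label))
        augmented.append((rotate90(image), label))
        augmented.append((flip_horizontal(image), label))
    return augmented


# Ten hypothetical 2x2 "leaf photos" with alternating labels.
raw = [([[i, 255], [0, i]], "healthy" if i % 2 == 0 else "unhealthy") for i in range(10)]
preprocessed = [(rescale(image), label) for image, label in raw]
train, test = train_test_split(preprocessed)
train_augmented = augment(train)
```

Augmenting after the split, and only on the training portion, keeps transformed copies of a test photo out of the training set; whether the original study did this is not stated on the slide.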

The training dataset is used to build and optimize the DL model, whereas the testing dataset is used to evaluate its performance impartially. A CNN or equivalent architecture is trained to categorize leaf pictures as healthy or unhealthy using discriminative characteristics. To verify the trained model, performance metrics including accuracy, precision, recall, and F1-score are used. This performance study is intended to establish the proposed system’s dependability and resilience.
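The four metrics named above follow directly from the counts of true and false positives and negatives. A minimal sketch on made-up predictions, treating "unhealthy" as the positive class (an assumption made here for illustration):

```python
def binary_metrics(y_true, y_pred, positive="unhealthy"):
    """Compute accuracy, precision, recall, and F1-score for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical test-set labels and model predictions.
y_true = ["unhealthy", "unhealthy", "unhealthy", "healthy",
          "healthy", "healthy", "healthy", "unhealthy"]
y_pred = ["unhealthy", "unhealthy", "healthy", "healthy",
          "healthy", "unhealthy", "healthy", "unhealthy"]
metrics = binary_metrics(y_true, y_pred)
```

Here there are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so all four metrics come out at 0.75. On imbalanced datasets the values typically diverge, which is why accuracy alone is usually not reported.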

26 of 54

Machine learning in plant science and plant breeding

Overview of biochemical and cellular measurements. A variety of “omics” (genomics, transcriptomics, proteomics, metabolomics) data can be measured. Machine learning is used to analyze these data at various levels and with various goals (bottom).

27 of 54

Machine learning in plant science and plant breeding

Overview of plant phenotyping systems. Plants can be observed at different levels (development, growth, production) using different types of sensors and sensor systems. Machine learning plays an important role in processing the sensor data to measure traits at the various levels (red box).

28 of 54

Machine Learning for Plant Stress Modeling: A Perspective towards Hormesis Management

The hormetic behavior of plant stress responses: At low doses, an overcompensation of the damage caused by the stressor increases plant fitness, whereas, at high doses, the stressors disrupt the homeostasis of the organism. The controlled exposure of crops to low doses of stressors is therefore called hormesis management, and it is a promising method to increase crop productivity and quality. Nevertheless, hormesis management has severe limitations derived from the complexity of plant physiological responses to stress. Many technological advances assist plant stress science in overcoming such limitations, which results in extensive datasets originating from the multiple layers of the plant defensive response. For that reason, artificial intelligence tools, particularly Machine Learning (ML) and Deep Learning (DL), have become crucial for processing and interpreting data to accurately model plant stress responses such as genomic variation, gene and protein expression, and metabolite biosynthesis.

29 of 54

Machine Learning for Plant Stress Modeling: A Perspective towards Hormesis Management

Hormesis characterization through Deep Learning. Plant science uses highly sensitive techniques for detecting variations in gene expression, phenotype, and metabolism caused by environmental interactions. Machine learning, including deep learning with Convolutional Neural Networks (CNN) as well as decision trees and Support Vector Machine (SVM) algorithms, allows big data processing and interpretation for modeling non-linear biological processes, such as hormesis.

Process of ML implementation for improving hormesis management. Analyzing plant stress responses generates large amounts of data, and ML integrates these data to model complex systems. Considering the hormetic behavior of plant responses, ML could be used to model the dose-response relationship and predict eustress doses, simplifying controlled elicitation in agriculture.
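As a rough illustration of predicting a eustress dose from dose-response data, one can fit a curve with a single peak and read off the dose at the maximum. The sketch below uses a least-squares quadratic purely as a stand-in; the quadratic form, the function names, and the toy measurements are all assumptions, and established hormesis models are more elaborate.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_quadratic(doses, responses):
    """Least-squares fit of response = c0 + c1 * dose + c2 * dose**2."""
    basis = [[1.0, d, d * d] for d in doses]
    A = [[sum(row[i] * row[j] for row in basis) for j in range(3)] for i in range(3)]
    b = [sum(row[i] * y for row, y in zip(basis, responses)) for i in range(3)]
    return solve_linear(A, b)


def estimate_eustress_dose(doses, responses):
    """Dose at the peak of the fitted curve, or None if the curve has no peak."""
    c0, c1, c2 = fit_quadratic(doses, responses)
    if c2 >= 0:
        return None  # fitted curve opens upward: no stimulatory maximum
    return -c1 / (2.0 * c2)


# Hypothetical measurements: fitness rises at low doses and falls at high doses.
doses = [0.0, 1.0, 2.0, 3.0, 4.0]
responses = [100.0, 115.0, 120.0, 115.0, 100.0]
eustress_dose = estimate_eustress_dose(doses, responses)
```

With these toy values the fitted peak lies at a dose of 2.0. Real stress-response data are noisier and multi-layered (genomic, expression, metabolite), which is where the ML models discussed on the slide come in.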

30 of 54

Machine learning: biological foundation models (FMs)

31 of 54

Machine learning: landscape of bioinformatics FMs.

32 of 54

Deep learning applications advance plant genomics research

LLM applied in genomics studies

Deep learning tools in genomics studies

33 of 54

A guide to machine learning for biologists

The overall procedure for training a machine learning method is shown along the top. A decision tree to assist researchers in selecting a model is given below. This flowchart is intended to be used as a visual guide linking the concepts outlined in this Review. However, a simple overview such as this cannot cover every case. For example, the number of data points required for machine learning to become applicable depends on the number of features available for each data point, with more features requiring more data points, and also depends on the model being used. There are also deep learning models that work on unlabelled data.

34 of 54

European Galaxy Server

35 of 54

European Galaxy server

https://usegalaxy.eu/

36 of 54

European Galaxy server: Tools

Tools

37 of 54

European Galaxy server: Tool user interface

Tool definition / UI

38 of 54

European Galaxy server: Dataset uploader

Datasets uploader

39 of 54

European Galaxy server: History

History

40 of 54

European Galaxy server: Workflows

Workflows

41 of 54

European Galaxy server: Workflows

42 of 54

National research data infrastructures (NFDI) and DataPLANT

43 of 54

Galaxy Training Network

44 of 54

Galaxy Training Network

An amazing teaching platform

🎓 Focus on the science, not the technical details of tools

🖥️ No installation required; all you need is a browser

📚 Huge library of free, high quality tutorials

📊 Visualizations of results and workflows

👩‍💻 Enable remote teaching and follow learners' progress with TIaaS (Training Infrastructure as a Service)

45 of 54

Galaxy Training Network: Stats

46 of 54

Galaxy Training Network: Regression

47 of 54

Galaxy Training Network Hands-on tutorial: Regression

  1. https://usegalaxy.eu/

48 of 54

End of talk

49 of 54

Extra slides

50 of 54

Application of machine learning and genomics for orphan crop improvement

Orphan crops are important sources of nutrition in developing regions and many are tolerant to biotic and abiotic stressors; however, modern crop improvement technologies have not been widely applied to orphan crops due to the lack of resources available. There are orphan crop representatives across major crop types and the conservation of genes between these related species can be used in crop improvement. Machine learning (ML) has emerged as a promising tool for crop improvement. Transferring knowledge from major crops to orphan crops and using machine learning to improve accuracy and efficiency can be used to improve orphan crops.

ML models are trained using data from major crops. These trained ML models can then be used to predict traits in orphan crops, which have limited available data. The trait predictions are used to choose breeding candidates to improve orphan crop varieties.

51 of 54

Multimodal deep learning methods for genomic-enabled prediction in plant breeding

Example of a model with intermediate data fusion. Note that after the fusion module, where the representations for all modalities are combined, a final sub-model produces the final output. The sub-output denotes a marginal representation (marginal latent factor), that is, a representation of the data features specific to each modality.

a) Early data fusion with raw data for all modalities; b) early data fusion with some modalities preprocessed separately; and c) early data fusion with all modalities preprocessed jointly.
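The difference between early and intermediate fusion is where the modalities are combined: before any model sees them, or after each modality has passed through its own sub-model. A minimal sketch, in which the two modalities, the simple linear models, and the summary encoders are hypothetical placeholders for trained networks:

```python
def linear(weights, bias=0.0):
    """Return a toy model that computes a weighted sum of its input features."""
    def apply(features):
        return sum(w * f for w, f in zip(weights, features)) + bias
    return apply


def early_fusion(modalities, model):
    """Concatenate the raw features of all modalities, then apply one model."""
    fused = [value for features in modalities for value in features]
    return model(fused)


def intermediate_fusion(modalities, encoders, head):
    """Encode each modality separately, fuse the representations, then predict."""
    # Each encoder output plays the role of a modality-specific sub-output.
    representations = [encode(features) for encode, features in zip(encoders, modalities)]
    fused = [value for representation in representations for value in representation]
    return head(fused)  # final sub-model producing the output


# Hypothetical modalities: three genomic marker values and two environmental covariates.
genomic = [1.0, 0.0, 1.0]
environment = [0.5, 2.0]

early_prediction = early_fusion([genomic, environment], linear([0.5] * 5))

encoders = [
    lambda features: [sum(features)],                   # genomic sub-model
    lambda features: [sum(features) / len(features)],   # environment sub-model
]
intermediate_prediction = intermediate_fusion(
    [genomic, environment], encoders, linear([1.0, 2.0])
)
```

Intermediate fusion lets each modality have an encoder suited to its structure, at the cost of more components to train.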

52 of 54

Deep learning applications advance plant genomics research

Overview of deep learning (DL) workflow for genomics study (A) Common DL workflow. The raw data and its corresponding labels are divided into training, validation, and test sets. They are fed into different neural network architectures. The trained models are evaluated using metrics and then applied to new data for prediction. (B) A typical pipeline for sequence-based feature learning. Sequencing data are collected from plant tissues and labeled based on experimental evidence. The sequences are encoded into a matrix using one-hot encoding and processed by a DL model to predict functional labels such as promoters, enhancers, or methylation sites. (C) DL models initially developed and trained on animal genomic datasets undergo subsequent fine-tuning using substantially smaller plant-specific genomic datasets. This approach enables the effective adaptation of pre-trained models for plant genomic studies while significantly reducing computational resource requirements and data dependency.
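The one-hot encoding step in panel (B) turns each nucleotide into a row with a single 1 in the column for that base. A minimal sketch follows; the A/C/G/T column order and the all-zero row for ambiguous bases such as N are conventions assumed here, not details from the figure.

```python
ALPHABET = "ACGT"  # assumed column order


def one_hot_encode(sequence):
    """Encode a DNA sequence as a list of rows, one row per nucleotide."""
    index = {base: position for position, base in enumerate(ALPHABET)}
    matrix = []
    for base in sequence.upper():
        row = [0] * len(ALPHABET)
        if base in index:  # unknown or ambiguous bases stay all-zero (assumption)
            row[index[base]] = 1
        matrix.append(row)
    return matrix


encoded = one_hot_encode("acgtn")
```

The resulting matrix has one row per position and four columns, which is the shape a sequence-based DL model would typically receive as input.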

53 of 54

Machine learning with plant datasets: CNN

Classifying three different legume species: white bean, red bean and soybean

54 of 54

Deep learning applications advance plant genomics research

LLM applied in genomics studies

(A) Pre-training of plant DNA large language models. The training process includes the selection of the representative plant reference genome and training data generation, plant-specific tokenizer training, and the construction of five models based on different architectures.

(B) Graphical representation of the genetic and epigenetic features used in the downstream tasks on which all models were fine-tuned.

(C) Number of parameters and loading time of five trained models and three public models.

(D) Memory used for loading models and maximum batch sizes that different models can be assigned (measured on Nvidia RTX4090 GPU).

(E) Summary of the best models’ performances on different downstream tasks. Predictions of core promoter and sequence conservation are fixed-length binary classification tasks, and predictions of histone modifications and lncRNAs are variable-length binary classification tasks; all these tasks use the F1 score as the metric. The prediction of open chromatin is a variable-length multi-class classification task, which uses the area under the curve (AUC) score as the metric. Predictions of promoter strength are regression tasks, which use R-squared as the metric. Words under the scores represent the model’s tokenizer, and the best scores are marked in red.

(F) Performance of Plant DNAMamba and PlncRNA–Hdeep in predicting lncRNAs.

(G) Performance of Plant DNAMamba and convolutional neural network (CNN) models in predicting promoter strength in tobacco leaves and maize protoplasts.

(H) Performance of Plant DNAMamba and PlantDeepSEA in predicting chromatin accessibilities in different plant species.

(I) Command-line tools, local build, and online version of plant DNA LLMs (PDLLMs) are provided for prediction.
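Two of the metrics named in panel (E), AUC and R-squared, can be computed from first principles as follows. This is a simplified sketch on made-up numbers: the AUC shown is the binary ROC AUC computed by pairwise comparison, whereas the open chromatin task in the figure is multi-class, and how the authors aggregate it is not stated here.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def binary_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    correct = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                correct += 1.0
            elif pos == neg:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical regression targets (for example promoter strength) and predictions.
r2 = r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])

# Hypothetical binary labels and predicted scores.
auc = binary_auc([1, 1, 0, 0, 1], [0.9, 0.6, 0.7, 0.2, 0.8])
```

The pairwise AUC is quadratic in the number of samples, which is fine for a small illustration but would be replaced by a rank-based computation on large test sets.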