Monday 21.10. at 10.30–11.30. Room: Tekla Hultin (F3003). Chair: Lili Aunimo
# | Presenter | Affiliation (presenter) | Title | Authors | Abstract
#38. Presenter: Aleksei Tiulpin (University of Oulu)
Title: Image-Level Regression For Uncertainty-Aware Retinal Image Segmentation
Authors: Trung Dang, Huy Hoang Nguyen and Aleksei Tiulpin

1. Outer context: The retina offers valuable diagnostic
insights into various clinical conditions non-invasively.
Quantitative assessment of retinal vasculature is crucial
for diagnosing retinal diseases and identifying systemic
conditions like hypertension, diabetes, and cardiovascular
diseases. Numerous studies aim to automate retinal blood
vessel segmentation (BVS) using Deep Learning (DL)
approaches, typically through semantic segmentation.

2. Research problem / question: Training DL models for BVS
typically requires annotation masks. We aim to address the
uncertainty in the annotation process, particularly in
regions close to retinal vessel boundaries. In addition, we
question the necessity of utilizing high-resolution (HR)
retinal images.

3. Key Motivation: Conventionally, the BVS problem is
formulated as pixel-wise classification, which heavily
relies on annotated masks. However, these masks often
contain high levels of uncertainty, especially in the area
surrounding retinal vessels.

4. Prior attempts at addressing it: These include label
smoothing techniques. In addition, some studies incorporate
multiple annotations per image.

5. Why are prior attempts not enough? Label smoothing-based
approaches cannot address the intra-class uncertainty of
pixels around objects of interest. Collecting
multi-annotation data is costly.

6. Method: To address the uncertainty of annotation masks,
we formulate retinal vessel segmentation as image-level
regression (Figure 1). Firstly, we introduce the
Segmentation Annotation Uncertainty-Aware (SAUNA)
transform, converting binary masks into soft labels that
capture the uncertainty around retinal vessels.
Additionally, we adapt the Jaccard metric loss (JML) [1] to
operate in any hypercube, enabling effective training for
image-level regression.
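The two ingredients above can be sketched in a few lines. The code below is an illustrative simplification, not the paper's implementation: a distance-based soft-labeling step stands in for the SAUNA transform (whose exact definition is in the paper), and a soft Jaccard loss with min/max in place of intersection/union shows one way the metric extends to targets in the unit hypercube rather than {0, 1}:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def soft_vessel_labels(mask, sigma=2.0):
    """Toy stand-in for the SAUNA transform: turn a binary vessel mask
    into soft labels that decay smoothly across the vessel boundary,
    encoding annotation uncertainty near vessel edges."""
    mask = mask.astype(bool)
    d_in = distance_transform_edt(mask)    # distance to background
    d_out = distance_transform_edt(~mask)  # distance to foreground
    signed = d_in - d_out                  # positive inside the vessel
    return 1.0 / (1.0 + np.exp(-signed / sigma))

def soft_jaccard_loss(pred, target, eps=1e-8):
    """Soft Jaccard loss extended to the unit hypercube: with min/max in
    place of intersection/union it accepts soft targets in [0, 1]."""
    inter = np.minimum(pred, target).sum()
    union = np.maximum(pred, target).sum()
    return 1.0 - inter / (union + eps)

mask = np.zeros((7, 7), dtype=np.uint8)
mask[2:5, 2:5] = 1                         # a small square "vessel"
soft = soft_vessel_labels(mask)
```

The sigmoid width sigma is a hypothetical knob controlling how wide the uncertain boundary band is.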

7. Results: We applied our method to the UNet++ [3] and
Swin-UNet [5] architectures, and then compared them to a
diverse array of 15 HR and low-resolution (LR)-based
baselines. On the FIVES dataset (Figure 2), our method with
UNet++ was the only LR-based approach that substantially
outperformed the best HR-based reference, MAGF-Net [8],
with a difference of 1.15% IoU while being over 149 times
more efficient. With the same DL architecture, applying our
image-level regression method resulted in significant
improvements compared to the pixel-wise classification
approach. Our method also generalized better than other
LR-based baselines on 4 external datasets: DRIVE,
STARE, HRF, and CHASEDB1 (Table 1).

8. Conclusion: This study introduces a regression-based
method for retinal vessel segmentation. We employed the
newly developed SAUNA transform to produce soft labels,
addressing the uncertainty inherent in the annotation
process. Through comprehensive experimental assessment, we
established that our approach surpasses existing methods.
Our findings suggest a reconsideration of the necessity for
HR retinal images in retinal image segmentation.
#61. Presenter: Mikko Kurimo (Aalto University)
Title: Unlocking the Potential of Radio and Television Archives: Combination of Strengths in Advancing Speech Recognition
Authors: Mikko Kurimo, Tamás Grósz, Yaroslav Getman, Tommi Lehtonen and Mervi Leino-Niemelä

We will present results of combining the latest research in
automatic speech recognition (ASR) with novel European high
performance computing (HPC) and large quantities of raw
audiovisual data contained in national radio and television
archives. The aim of the work was two-fold, firstly, to
advance ASR by building models on large public data
collections and secondly, to harness the large audio-visual
media archives for large-scale qualitative and quantitative
media research by generating an automatic indexation based
on all spoken content that is decoded by ASR. For most
languages spoken in the world, reaching these goals
requires creative solutions, because the required resources
rarely exist in one place. Only the largest global
companies have access to the latest ASR development, huge
computing resources, and huge audio collections all at
once, and their commercial interests do not treat all
languages equally.
In Europe, most languages are spoken in small countries
which, however, have advanced radio and television archives
containing millions of hours of broadcasted media content.
The latest publicly funded HPC initiatives have also given
researchers access to unprecedented computational
resources. By combining this computing power with the
archives, it is
now possible for researchers to develop and publish large
pre-trained speech models for many languages without
depending on the commercial interests of the large global
companies. The large speech models can be pre-trained in a
self-supervised fashion, which also lets them benefit from
raw, untranscribed, and uncategorized audio collections.
When openly published, these models then make it quick and
easy to develop speech technology applications for these
languages, such as accurate recognizers for speech content
and for speaker and audio characteristics, by fine-tuning
the models with a feasible amount of transcribed target data.
In a case study for Finnish, we developed a large
monolingual pre-trained speech model and a framework for
media researchers to decode the audiovisual archive content
using the best ASR with large computing resources available
for them in Finnish IT Center for Science (CSC).
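The self-supervised pre-training mentioned above is, in the wav2vec 2.0 family of models, built around a contrastive objective: a masked frame's representation must identify its true target among distractors. A minimal numpy sketch of such an InfoNCE-style loss (illustrative only; the real models use learned quantized targets and many sampled negatives):

```python
import numpy as np

def info_nce(anchor, positive, negatives, temp=0.1):
    """InfoNCE-style contrastive loss used in self-supervised speech
    pretraining: the masked-frame representation (anchor) should match
    its true target (positive) against distractor frames (negatives)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temp
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                     # correct class is index 0
```

Minimizing this loss over huge untranscribed audio collections is what makes the subsequent supervised fine-tuning step data-efficient.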
#95. Presenter: Sumita Sharma (University of Oulu)
Title: Child-centered AI: Imagining fair, inclusive, and diverse future classrooms with children
Authors: Sumita Sharma

Children interact with AI applications in a myriad of ways
including social media and recommendation systems,
generative AI (such as ChatGPT, Dalle2), and indirectly
through various algorithmic decision-making for public and
private services (e.g., social services, banking). While
there are several initiatives focusing on AI literacy for
children, children are rarely introduced to the
limitations and ethical implications of the design and use
of such AI systems. To address this gap, through the
Research Council of Finland funded PAIZ project
(https://interact.oulu.fi/paiz), I conducted hands-on
workshops on critical AI literacy with young children (10-12
years) to explore the role of AI in children’s everyday
lives, and how to design ethical, inclusive, and fair
AI-futures, that is, envisioning child-centered AI
applications and futures. In the workshops, participants
generated AI art and text, discussed who owns the content
and where and how they can be used. They explored image
recognition (Teachable Machines) and what it means for an
AI to “see” (see Figure 1), and contemplated the ethical
implications of self-driving cars. Children then oriented
to the future to imagine fair, inclusive, diverse
applications for future classrooms (Figure 1).

Workshops were conducted with children in Finland (45
participants), India (45), Japan (102), & USA (27).
Preliminary data analysis shows how children understood
critical concepts such as fairness in AI and tech access
and use, and their visions of fair, inclusive future
techno-social societies. Participants critiqued the
ethical use of AI art, highlighting that “we can use
AI-generated art for inspiration but not submit it to art
competitions as our own work” (Oulu) and that “Dalle-2 can
draw things quickly, creatively, making our work easier,
but it is only fair if everyone has access to it. If an art
competition is for people, it is a competition of the
human mind, not machines” (New Delhi). Participants in the
US exclaimed that AI-generated artwork “doesn’t feel as
authentic…” and that “…Maybe artists who were making art
before might feel useless” (see Figure 2 for artwork
examples).

The project contributes to work on Children and AI by
UNICEF and the STN Generation AI project, by showcasing
multicultural perspectives towards fairness and inclusion
and young children’s ethical sense-making abilities. It
adds to the Child-Computer Interaction research focusing on
designing Child-Centered AI through its future imaginings
of future classrooms, building on children’s everyday
experiences.
#111. Presenter: Erjon Skenderi (University of Helsinki)
Title: Group Conversational AI: Introducing Effervesce
Authors: Erjon Skenderi, Salla-Maria Laaksonen, Kaisa Lindholm, Mia Leppälä and Jukka Huhtamäki

Communication tools enhanced with Large Language Models
(LLMs) are important for facilitating effective group
conversations in digital workspaces, and it is crucial to
develop these models to support many-to-many
conversations as well.
Recent conversational AI applications are designed for
one-to-one interactions in the form of chat [1,2]. Our
study investigates the challenges and opportunities of
fine-tuning and deploying an AI-driven group conversational
bot, Effervesce, within a multi-member Slack environment.
Exploring the applications of large language models for
group conversation is crucial due to the increased use of
digital communication tools in organizations. Dynamic group
conversational bots can be more helpful than conventional
one-to-one bots and can help increase human interaction [3].

However, models designed for one-to-one chat interactions
are not trained to handle group conversations by default.
In our initial
experimentation with the Effervesce Slack bot, we employed
various open-source LLMs, which showed limited capabilities
in handling complex, multi-actor conversations. To tackle
this issue, we employed the open-source Mistral 7B model
[4], fine-tuned using the QLoRA framework [5], and a
dataset of 1.6k Slack messages extracted from a group
conversation.
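The core idea behind LoRA-style fine-tuning, which QLoRA combines with 4-bit quantization of the frozen base weights, can be sketched in a few lines. This is an illustrative numpy version of the low-rank update, not the actual peft/QLoRA implementation used for Effervesce:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 4                  # hidden size, adapter rank, scaling

W = rng.normal(size=(d, d))            # frozen pretrained weight (stand-in)
A = rng.normal(size=(r, d)) * 0.01     # trainable down-projection
B = np.zeros((d, r))                   # trainable up-projection, zero-init

def lora_forward(x, W, A, B, alpha, r):
    """Linear layer with a LoRA adapter: only A and B are trained, so the
    number of trainable parameters is 2*d*r instead of d*d."""
    return x @ (W + (alpha / r) * (B @ A)).T

x = rng.normal(size=(1, d))
```

Because B starts at zero, the adapted model is exactly the pretrained model at the beginning of fine-tuning, and gradient updates to A and B gradually specialize it to the multi-actor Slack data.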
According to our preliminary results, the fine-tuned model
results in an improved understanding of conversation
structure and engagement in group discussions. We evaluate
the performance of the fine-tuned Mistral model
quantitatively on another similar group-discussion dataset,
showing that the fine-tuned version performs better than
the original. Additionally, the evaluation of Effervesce
through five workshops involving 50 individual participants
showed positive impacts on organizational communication
dynamics. We also received feedback for further
improvements.
The fine-tuned LLM-powered Effervesce bot shows positive
results in facilitating multi-actor conversations within
Slack, showing the potential to enhance group communication
dynamics in organizations. Additionally, our work analyses
the advantages and limitations of current LLMs when applied
in a group communication setting. Future work will focus on
addressing suggestions from user feedback to further
improve the bot’s conversational abilities and extend its
functionalities.
Monday 21.10. at 13.00–14.00. Room: Tekla Hultin (F3003). Chair: Tapio Pahikkala
#8. Presenter: Aidan Scannell (Aalto University)
Title: Sample-Efficient Reinforcement Learning with Implicitly Quantized Representations
Authors: Aidan Scannell, Kalle Kujanpaa, Yi Zhao, Arno Solin and Joni Pajarinen

Learning representations for reinforcement learning (RL)
has shown much promise for continuous control. We propose
an efficient representation learning method using only a
self-supervised latent-state consistency loss. Our approach
employs an encoder and a dynamics model to map observations
to latent states and predict future states, respectively.
We achieve high performance and prevent representation
collapse by normalizing and bounding the latent
representation such that rank in the representation is
empirically preserved. Our method is straightforward,
compatible with any model-free RL algorithm, and
demonstrates state-of-the-art performance in continuous
control locomotion benchmarks from DeepMind Control Suite
and manipulation benchmarks from MetaWorld.
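The latent-state consistency objective described above can be sketched as follows. This is a hypothetical simplification: plain unit-sphere normalization stands in for the paper's normalization and implicit quantization scheme, which is what actually preserves rank in the representation:

```python
import numpy as np

def l2_normalize(z, eps=1e-8):
    """Project latent vectors onto the unit sphere, bounding the
    representation."""
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + eps)

def consistency_loss(z_pred, z_target):
    """Self-supervised latent consistency: the dynamics model's predicted
    next latent should match the encoder's embedding of the next
    observation. Normalizing both is one simple way to discourage
    representation collapse (in practice the target is stop-gradient)."""
    z_pred = l2_normalize(z_pred)
    z_target = l2_normalize(z_target)
    return np.mean(np.sum((z_pred - z_target) ** 2, axis=-1))
```

A model-free RL algorithm can then be run directly on the learned latent states, which is what makes the method compatible with any such algorithm.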
#27. Presenter: Abdullah Tokmak (Aalto University)
Title: PACSBO: Probably approximately correct safe Bayesian optimization
Authors: Abdullah Tokmak, Thomas B. Schön and Dominik Baumann

In recent years, reinforcement learning (RL) has achieved
remarkable success in controlling high-dimensional systems
without requiring a dynamics model. However, most of the
impressive results have been obtained in simulation
environments. Applying RL algorithms to real-world robotic
systems is challenging: (i) When interacting with
real-world environments, it is crucial to guarantee safety,
which popular RL algorithms fail to provide; (ii) Each
sample corresponds to a potentially expensive experiment,
and hence sample efficiency is essential, whereas RL is
inherently sample-inefficient.
Combining Gaussian process (GP) regression with Bayesian
optimization (BO) provides a sample-efficient alternative
to RL. Based on GP regression and BO, several algorithms
have been proposed that can, in addition, provide
probabilistic safety guarantees [1]. These safe learning
algorithms aim at optimizing an unknown reward function
while satisfying constraints. In exchange for the safety
guarantees, they require smoothness assumptions.
Particularly, they assume that reward and constraint
functions have a known upper bound in a reproducing kernel
Hilbert space (RKHS). The RKHS norm is a norm in a
potentially infinite dimensional space, and assuming
knowledge of that norm in unknown environments is highly
unrealistic. Notably, an overly loose upper bound on the RKHS
norm leads to conservative algorithms, whereas an
underestimation might lead to constraint violations.
An alternative to assuming a known RKHS norm upper bound is
estimating it from data. References [2,3] discuss ways to
underestimate the unknown RKHS norm; such underestimates
are overly optimistic and hence can cause constraint
violations.
We draw inspiration from these ideas and propose an
approach to estimating an upper bound on the RKHS norm from
data, eliminating the need for guessing the correct RKHS
norm. Moreover, we investigate the theoretical properties
of the RKHS norm over-estimation by proving that the
estimate is probably approximately correct. Furthermore, we
treat the RKHS norm as a local object and thus improve
exploration, which in total yields PACSBO. PACSBO
successfully estimates the RKHS norm, outperforms [1] in a
toy example, and controls a hardware system.

Unlike prior works, we drop the assumption of knowing a
tight upper bound on the RKHS norm of reward and constraint
functions. Instead, we estimate the upper bound from data
and theoretically investigate the estimation. Besides, we
treat the RKHS norm as a local object in the area in which
we are interested. This local treatment reduces
conservatism compared to assuming one global upper bound on
the entire parameter space. We successfully evaluate PACSBO
in numerical and hardware experiments.
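The safety mechanism shared by these methods can be sketched with a plain GP posterior: a point is deemed safe when its pessimistic confidence estimate clears the safety threshold. The sketch below uses a fixed confidence scaling beta and toy data; in SafeOpt-style algorithms [1], beta is derived from the RKHS norm bound, which PACSBO estimates from data instead of assuming:

```python
import numpy as np

def rbf(a, b, ls=0.3):
    """Squared-exponential kernel between 1-D input arrays."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-4):
    """GP posterior mean and standard deviation at the test points."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf(x_test, x_train)
    Kinv = np.linalg.inv(K)
    mu = Ks @ Kinv @ y_train
    var = np.diag(rbf(x_test, x_test) - Ks @ Kinv @ Ks.T)
    return mu, np.sqrt(np.maximum(var, 0.0))

# Safe set: points whose pessimistic estimate mu - beta*sd clears the
# safety threshold h (toy values, not the paper's experiments).
x_train = np.array([0.0, 0.5, 1.0])
y_train = np.array([1.0, 1.2, 0.9])
x_test = np.linspace(0.0, 1.0, 21)
mu, sd = gp_posterior(x_train, y_train, x_test)
beta, h = 2.0, 0.0
safe = mu - beta * sd >= h
```

An overestimated beta shrinks the safe set and makes exploration conservative; an underestimated one admits unsafe points, which is exactly the trade-off the RKHS norm estimation addresses.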

[1] Y. Sui, A. Gotovos, J. Burdick, and A. Krause. “Safe Exploration for Optimization with Gaussian Processes”. In: International Conference on Machine Learning. 2015, pp. 997–1005.
[2] P. Scharnhorst, E. T. Maddalena, Y. Jiang, and C. N. Jones. “Robust Uncertainty Bounds in Reproducing Kernel Hilbert Spaces: A Convex Optimization Approach”. In: IEEE Transactions on Automatic Control 68.5 (2023), pp. 2848–2861.
[3] K. Hashimoto, A. Saoud, M. Kishida, T. Ushio, and D. V. Dimarogonas. “Learning-based symbolic abstractions for nonlinear control systems”. In: Automatica 146 (2022), p. 110646. Extended version at arXiv:1612.05327v3.
#70. Presenter: Sahar Salimpour (University of Turku)
Title: Sim-to-Real Transfer for Autonomous Lidar-based Navigation with Reinforcement Learning: from NVIDIA Isaac Sim to Gazebo and Real ROS 2 Robots
Authors: Sahar Salimpour, Jorge Peña Queralta, Jukka Heikkonen and Tomi Westerlund

Unprecedented agility and dexterous manipulation have been
demonstrated with controllers based on deep reinforcement
learning (RL), with a significant impact on legged and
humanoid robots. Modern tooling and simulation platforms,
such as NVIDIA Isaac Sim, have been powering such advances.
This paper focuses on demonstrating the applications of
Isaac in local planning and obstacle avoidance as one of
the most fundamental ways in which a mobile robot interacts
with its environment. The literature includes extensive
reproducible research in areas where the RL policy state
space is largely sensed through proprioception. We argue in
this paper that approaches to interaction with the
environment in end-to-end learning with exteroception are
less standardized and reproducible. At the same time, the
article aims to provide a base tutorial for end-to-end
local navigation policies and how a custom robot can be
trained in such a simulation environment. We benchmark
end-to-end policies with the state-of-the-art Nav2 package
in ROS2. We also cover the sim-to-real transfer process by
demonstrating the zero-shot transferability of policies
trained in the Isaac simulator to real-world robots. This
is further evidenced by the tests with different simulated
robots, showcasing the generalization of the learned
policy. Finally, the benchmarks demonstrate comparable
performance to Nav2, opening the door to quick deployment
of state-of-the-art end-to-end local planners for custom
robot platforms, but importantly furthering the
possibilities by expanding the state and action spaces, or
task definitions for more complex missions. Overall, with
this paper, we introduce the most important steps, and
aspects to consider, in deploying RL policies for local
path planning and obstacle avoidance with Isaac Sim
training, Gazebo testing, and ROS2 for real-time inference
in real robots.
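To make the exteroceptive interface concrete, here is a toy hand-coded stand-in for such an end-to-end local planner: it maps a 1-D lidar scan directly to (linear, angular) velocity commands, the same observation and action spaces a learned policy would use. This is purely illustrative; the paper trains RL policies in Isaac Sim rather than using a rule like this:

```python
import numpy as np

def reactive_policy(scan, fov=np.pi):
    """Toy exteroceptive policy: given a 1-D lidar scan (ranges across
    the forward field of view), steer toward the beam with the most free
    space and slow down near obstacles."""
    angles = np.linspace(-fov / 2, fov / 2, len(scan))
    best = int(np.argmax(scan))                     # beam with most clearance
    angular = float(np.clip(angles[best], -1.0, 1.0))   # steering command
    linear = float(np.clip(scan.min(), 0.0, 1.0))       # speed capped by
    return linear, angular                              # closest obstacle

scan = np.full(9, 3.0)     # 9 beams, 3 m of clearance everywhere...
scan[0:3] = 0.2            # ...except an obstacle on the left
v, w = reactive_policy(scan)
```

A learned end-to-end policy replaces this rule with a network trained on the same scan-in, twist-out interface, which is what allows zero-shot transfer between simulators and real ROS 2 robots.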
#77. Presenter: Denys Iablonskyi (University of Helsinki)
Title: Towards Industry 4.0: Physics-informed AI for Ultrasound Structural Health Monitoring
Authors: Denys Iablonskyi, Burla Korkmaz, Moontasir Soumik, Shayan Gharib, Julius Korsimaa, Petteri Salminen, Martin Weber, Edward Hæggström, Ari Salmi and Arto Klami

Industrial structures such as pipelines can become fouled
for several reasons when unwanted substances accumulate on
the inner surfaces. Detection and characterization of the
fouled area are thus of critical importance for sustainable
industrial operations, localized cleaning, and predictive
maintenance. Ultrasonic waves are sensitive to such
deposits or defects and are typically generated and
recorded using a network of ultrasonic transducers. Typical
tomographic methods rely on solving full-wave equations or
iterative methods that are time-consuming. In this work, we
present a method for accurate reconstruction of the fouling
maps using pre-trained neural networks that opens a way to
real-time structural health monitoring. The synthetic
training dataset is obtained using a physics-informed
forward model that incorporates the fouling effect on the
propagating ultrasonic signals. Thus the AI models can be
trained offline and applied to experimental signals on the
fly to generate defect maps. The variational autoencoder
variant of the network also provides the error estimate of
the reconstruction quality. The proposed method is verified
experimentally on the pipe structure using a
sensor-efficient configuration with only four transducers
benefiting from the high-order helical trajectories that
guided waves can propagate along. Moreover, the method can
be easily extended to more complex structures, e.g., storage
tanks or pipeline connections, as it only requires the
guided wave trajectories and corresponding signal
attenuation information. Modern non-invasive cleaning
methods employ high-power ultrasound and operate in a
brute-force manner according to a schedule. Identifying
fouled areas and guiding the cleaning process locally will
evidently lead to energy efficiency, production increase,
and sustainability, enabling Industry 4.0.
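The kind of physics-informed forward model used to generate the synthetic training data can be illustrated with a toy ray-attenuation version: each transducer-pair path loses amplitude in proportion to the fouling it crosses. This is a hypothetical simplification; the paper's model accounts for guided-wave physics along the actual helical trajectories:

```python
import numpy as np

def forward_attenuation(fouling, paths, alpha=0.5):
    """Toy physics-informed forward model: each transducer-pair path
    accumulates attenuation proportional to the fouling thickness of the
    grid cells it crosses, giving the signal amplitude at the receiver."""
    amplitudes = []
    for path in paths:                        # path = list of (row, col) cells
        total = sum(fouling[r, c] for r, c in path)
        amplitudes.append(np.exp(-alpha * total))   # exponential decay
    return np.array(amplitudes)

fouling = np.zeros((4, 4))
fouling[1, 1] = 2.0                           # one fouled cell
paths = [[(0, 0), (1, 1), (2, 2)],            # path crossing the deposit
         [(3, 0), (3, 1), (3, 2)]]            # clean path
amps = forward_attenuation(fouling, paths)
```

Sampling many random fouling maps through such a model yields (map, signals) pairs, so the inverse network can be trained offline and applied to experimental signals on the fly.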