The group holds monthly meetings featuring either a speaker or a reading group. Membership is open to researchers from all faculties, but the group's focus is on method development for data science challenges in the health science domain.

Agenda

Upcoming events

11-11-2025 (13:00-14:00) | Gorlaeus room BM 2.26: seminar with Hielke Muizelaar (PhD candidate at the Translational Data Science and AI lab, LUMC/LIACS)

Title: Quantifying the Predictive Power of Social Determinants of Health in Cardiometabolic Disease Progression Using XGBoost

Abstract: TBA

25-11-2025 (13:00-14:00): seminar by Dr. Tanja AJ Houweling – Assistant Professor, Department of Public Health, Erasmus MC

Title: Big data: big promises for precision public health?

Abstract: In this era of AI, the availability of big data holds big promises for improved risk screening and tailored referral, and for tailoring policies to needs in the population. A danger is that models are built for purposes or in ways that are not acceptable or meaningful to citizens or care providers. Another risk is that such models distract attention from population-level prevention. With the childcare benefits scandal fresh in our memories, these dangers are, arguably, even more pronounced when it comes to using big data relating to young children and families in vulnerable circumstances. Presenting results of the “Making big data meaningful for a promising start” project, Houweling will discuss how models can be built for purposes and in ways that are acceptable and meaningful to stakeholders, and how big data can be used to improve prevention in early childhood and promote health equity.

Past events

14-10-2025: seminar with Henk van der Pol (PhD candidate at LUMC, Department of Medical Oncology)

Title: A consortium to Merge European Prospective Trials in Precision Cancer Medicine

Abstract: In this talk, I will present the PRIME-ROSE project, a European-funded initiative that merges multiple national prospective trials in Precision Cancer Medicine (PCM). These so-called umbrella basket trials allow patients to receive targeted treatments, approved by the EMA and FDA, based on their biomarker. In such trials, cohorts are defined by treatment, biomarker, and tumour type. Because there are many available targeted treatments and tumour types, numerous cohorts are formed, often with only a few patients. By merging the PRIME-ROSE trials, cohorts that could not be completed within a single country can now be aggregated across trials. The first part of the talk will outline the progress made in this project and plans for future data aggregation in PCM, including potential options for Federated Analysis. The clinical data will be harmonised using the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM). Finally, since these are single-arm treatments, we aim to evaluate them using External Control Arms (ECAs). I will cover current literature and recently approved studies which used ECAs and our plans to create our ECAs.

13-05-2025: seminar with Ilaria Prosepe (PhD candidate at LUMC, Department of Biomedical Data Sciences)

Title: Estimating conditional survival benefit from observational data: an application to liver transplantation

Abstract: When treatment resources are limited, allocating based on survival benefit—the difference in expected survival with versus without treatment—can be a valid allocation strategy. However, estimating survival benefit from observational data presents two challenges: time-dependent confounding and multiple time scales. We propose a method that combines the use of cross-sections with inverse probability of treatment weighting (IPTW) to address these challenges and enable dynamic estimation of conditional survival benefit for all eligible patients. We apply the method to estimate the survival benefit of liver transplantation using data from the Eurotransplant region.

10-04-2025: seminar with Prof. Jildau Bouwman (TNO and LACDR)

Title: Personalized Digital Health

Abstract: The future of healthcare is digital, patient-centric, and focused on prevention. Digital solutions can ensure better patient outcomes, empower individuals to take control of their own health, and unburden the healthcare system. For this, it is key to be able to reuse data: by doing so, we can learn from data and thereby improve public and personal health. Within C4yourself, we have shown that individual data can be unlocked via the Personal Health environment, but to understand health in all its complexity, data from the whole system is needed and must be combined with multiple sources. Within Heracles, we combined data from several medical resources to create a system for learning from the journeys of other patients. From this project, we learned that the medical data currently stored in the Netherlands is not of the quality needed for a learning healthcare system. Collecting data and making it reusable takes time and requires standardization. The next step is bringing together knowledge and data, using models, to arrive at fitting and explainable advice. For instance, data from other patients can help predict what can benefit you, but then the data should be representative of you. We have developed a tool to visualize the effect of this representativeness on the fit of the model. The last step in giving personal advice is making it easy to adhere to. Here too, AI can be helpful by giving the advice in a way that fits the preferences of the individual. For personal advice, LLMs can help tailor the communication. We have developed a RAG-based system for dietary advice.

11-03-2025: seminar with dr. Leon Mei (Manager Sequencing Analysis Support Core, Molecular Epidemiology, LUMC)

Title: LncRNA-BERT: An RNA Language Model for Classifying Coding and Long Non-Coding RNA

Abstract: Understanding (novel) RNA transcripts generated in next-generation sequencing experiments requires accurate classification, given the increasing evidence that long non-coding RNAs (lncRNAs) play crucial regulatory roles. Recent developments in Large Language Models present opportunities for classifying RNA coding potential with sequence-based algorithms that can overcome the limitations of classical approaches, which assess coding potential based on a set of predefined features. We present lncRNA-BERT, an RNA language model pre-trained and fine-tuned on human RNAs collected from the GENCODE, RefSeq, and NONCODE databases to classify lncRNAs. LncRNA-BERT matches or outperforms state-of-the-art classifiers on three test datasets, including the cross-species RNAChallenge benchmark. The pre-trained lncRNA-BERT model distinguishes coding from long non-coding RNA without supervised learning, which confirms that coding potential is a sequence-intrinsic characteristic. LncRNA-BERT has been shown to benefit from pre-training on human data from GENCODE, RefSeq, and NONCODE, improving upon configurations pre-trained on the commonly used RNAcentral dataset. In addition, we propose a novel Convolutional Sequence Encoding method that is shown to be more effective and efficient than K-mer Tokenization and Byte Pair Encoding for training with long RNA sequences that would otherwise exceed the common context window size. lncRNA-BERT is available at https://www.biorxiv.org/content/10.1101/2025.01.09.632168v2.abstract

18-02-2025: seminar with Yilin Jiang (PhD candidate at MI & Princess Máxima Center for Pediatric Oncology Utrecht)

Title: A general approach to fitting multistate cure models based on an extended-long-format data structure

Abstract: A multistate cure model is a statistical framework used to analyze and represent the transitions that individuals undergo between different states over time, taking into account the possibility of being cured by initial treatment. This model is particularly useful in pediatric oncology, where a proportion of the patient population achieves cure through treatment and will therefore never experience some events. Our study develops a generalized algorithm based on the extended long data format, an extension of the long data format in which a transition can be split into up to two rows, each with an assigned weight reflecting the posterior probability of its cure status. The multistate cure model is fitted on top of the existing frameworks for multistate models and mixture cure models. The proposed algorithm makes use of the Expectation-Maximization (EM) algorithm and a weighted likelihood representation, so that it is easy to implement with standard packages. As an example, the proposed algorithm is applied to data from the European Society for Blood and Marrow Transplantation (EBMT).

16-01-2025: seminar with Simone Smits (MSc. Engineering and Policy Analysis & MSc. Computer Science Bioinformatics at TU Delft & Universiteit Leiden)

Title: Type 2 Diabetes in Context

Abstract: Type 2 Diabetes (T2D) is a significant public health issue with multifactorial influences. This study investigates how an individual’s social network, personal lifestyle, socioeconomic status, and living environment correlate, and in which direction, with the prevalence of T2D among Dutch adults. Using a random forest and a logistic regression model as binary classifiers, this research predicts diabetes medication use based on various variables derived from national data. The study sample consists of over 290,000 individuals aged 40 and older who participated in the Dutch health monitor survey in 2016. Both models demonstrate similar performance in terms of average precision on unseen data. For social networks, the findings reveal that T2D is more prevalent among individuals whose social networks have a high prevalence of T2D and lower education. For socioeconomic status, individuals with lower socioeconomic status are more likely to have T2D. In terms of lifestyle, BMI and exercise engagement are highly predictive of the prevalence of T2D. However, for the living environment, no clear association was found between T2D prevalence and access to healthy food. This research provides quantitative evidence for the importance of identifying and understanding social networks where T2D is either very prevalent or almost absent. The findings imply that low-SES family, work, and neighbor networks are correlated with unhealthy behaviors and, consequently, T2D. Regarding the two models used, the random forest model turns out to be more useful for exploratory research with little prior domain knowledge, while the logistic regression model has slightly better performance.

12-11-2024, 13.00: seminar with Armel Lefebvre (Postdoc at LIACS & LUMC)

Title: Empowering Translational Health Data Science Capabilities in Population Health Management

Abstract: During this seminar, Armel Lefebvre will present the outcomes of a survey conducted among the research community of a Population Health Management department at the LUMC. The goal is to investigate how (translational) data science applications can be supported in a complex ecosystem of data sources and regulations of secondary healthcare data use. The envisioned solution is the creation of a data competence center as a multidisciplinary unit mixing research and professional support staff to provide data science technology, training, and resources to (early-career) researchers to address current challenges that are considerably impacting data quality and reproducibility in PHM research. In addition, we will have a more interactive part of the seminar where we will discuss how to better integrate the use of the PHM data infrastructure (named ELAN) with health data science research done at LIACS.

02-10-2024, 13.00: seminar with Marta Cipriani (PhD candidate at MI & Sapienza University of Rome)

Title: A multiple imputation approach to distinguish curative from life-prolonging effects in the presence of missing covariates

Abstract: Medical advancements have increased cancer survival rates and the possibility of finding a cure. Hence, it is crucial to evaluate the impact of treatments in terms of both curing the disease and prolonging survival. We may use a Cox proportional hazards (PH) cure model to achieve this. However, a significant challenge in applying such a model is the potential presence of partially observed covariates in the data. We aim to refine the methods for imputing partially observed covariates based on multiple imputation and fully conditional specification (FCS) approaches. To be more specific, we consider a more general case, where different covariate vectors are used to model the cure probability and the survival of patients who are not cured. We also propose an approximation of the exact conditional distribution using a regression approach, which helps draw imputed values at a lower computational cost. To assess its effectiveness, we compare the proposed approach with a complete case analysis and an analysis without any missing covariates. We discuss the application of these techniques to a real-world dataset from the BO06 clinical trial on osteosarcoma.

19-09-2024, 13.00: seminar with Mirjam van Reisen (Professor of FAIR Data Science at LUMC)

Title: Curation of federated patient data: a proposed landscape for the African Health Data Space

Abstract: This chapter analyzes new aspects of data curation in a federated, machine-actionable, and semantic format. It presents the results of use cases developed by the virus outbreak data network research group in relation to a distributed data infrastructure for federated learning and privacy-preserving analytics. This infrastructure has been deployed across 9 African countries, connecting clinical and research data from 88 health facilities as well as data from population groups without access to health clinics. This chapter sets out the curational aspects, choices, tools, techniques, and future prospects of the federated data platform. The findings are relevant to decisions about what parameters are needed for an African Health Data Space and beyond.

11-06-2024, 13.15: seminar with Niccolò Bianchi (Erasmus+ MSc student at LIACS)

Title: Negative sample integration in an updated semi-automated drug repurposing pipeline

Abstract: Predicting associations between genes involved in a given disease and drugs against it exclusively on the basis of positive links is not effective enough: all missing links would be considered a lack of association. Integrating negative links in the prediction model allows for more nuanced association networks that display both known lack of associations and unknown associations. In this talk, we will discuss the implementation of an embedding model to obtain negative links, and their integration into an automated drug-repurposing pipeline that has been updated to new data standards.

14-05-2024, 13.15: seminar with Victor van der Horst (MSc student at MI)

Title: Missing data in illness-death model: imputation methods and comparisons

Abstract: Missing data arises in many data sets. Common imputation methods, such as the Multivariate Imputations by Chained Equations (MICE) and the multiple imputation by Substantive Model Compatible Fully Conditional Specification (SMC-FCS), have been worked out and analyzed for various settings, including competing risk analysis and traditional survival processes. In this talk, we will discuss the implementation for multi-state models, specifically the illness-death model, and compare different methods.

09-04-2024, 13.15: seminar with Saber Salehkaleybar (Assistant Professor at LIACS)

Title: A Cross-Moment Approach for Causal Effect Estimation

Abstract: In this talk, I mainly focus on the problem of estimating the causal effect of a treatment on an outcome with latent confounders when we have access to a single proxy variable. Several methods (such as the difference-in-difference (DiD) estimator or negative outcome control) have been proposed in this setting in the literature. However, these approaches require either restrictive assumptions on the data generating model or access to at least two proxy variables. We propose a method to estimate the causal effect using cross moments between the treatment, the outcome, and the proxy variable. In particular, we show that the causal effect can be identified with simple arithmetic operations on the cross moments if the latent confounder is non-Gaussian. In this setting, the DiD estimator provides an unbiased estimate only in the special case where the latent confounder has exactly the same direct causal effects on the outcomes in the pre-treatment and post-treatment phases. This translates to the common trend assumption in DiD, which we effectively relax. Additionally, we provide an impossibility result that shows the causal effect cannot be identified if the observational distribution over the treatment, the outcome, and the proxy is jointly Gaussian.

13-02-2024, 13.15: seminar with Jim Achterberg (PhD student at the LUMC)

Title: On the evaluation of synthetic longitudinal EHRs

Abstract: Longitudinal Electronic Health Records are widely used to provide evidence for medical research and the development of medical applications. However, since the information contained within them is sensitive, getting access is often an arduous process. Synthetic EHRs can be a solution, provided they are similar (enough) to real EHRs and adequately preserve privacy. In this paper we discuss existing methods and give recommendations going forward for evaluating the quality of synthetic longitudinal EHRs. We support the discussion by applying the discussed metrics to synthetic EHRs generated from the MIMIC-IV dataset. Firstly, we provide a method to visualize real and synthetic samples to assess their similarity, through t-SNE using Dynamic Time Warping with Gower’s distance for sample distance computations. Secondly, we discuss the widespread use of classifiers to discriminate synthetic and real data as a Goodness-of-Fit metric. We show it has some benefits, like explicitly testing equivalence of the synthetic and real latent distribution, but that this method is very restrictive: it implies the latent distribution can be represented as a univariate binary feature. Lastly, we assess the utility of synthetic EHRs in clinical tasks like mortality prediction and diagnosis prediction, and assess privacy-preserving capabilities through a targeted Attribute Inference Attack on sensitive personal identifiers.

13-02-2024, 13.15: seminar with Tijn Jacobs (PhD student at Amsterdam UMC)

Title: Long-term and short-term effects of treatment for osteosarcoma: a cure model reanalysis of the EURAMOS-I trial

Abstract: Osteosarcoma, a malignant bone tumor prevalent in children and young adults, has seen improved prognoses due to advanced treatments, with a substantial number of patients never experiencing a recurrence. Cure models can disentangle the short-term and long-term effects of treatment by allowing for a fraction of the patients to be cured, that is, insusceptible to the event of interest. The analysis of the EURAMOS-I randomised controlled trial data is revisited by employing a mixture cure model. We study the effect of experimental treatment conditional on histological response to neoadjuvant chemotherapy. A total of 1275 patients who completed the treatment according to protocol are considered eligible for the analysis. For the poor responders, the experimental treatment was not associated with higher chances of cure, but showed a minor protective effect on survival if uncured. The experimental treatment had no effect on cure or survival of the uncured in the group of good responders.

09-01-2024, 13.15: seminar with Anton Schreuder (Postdoc at LIACS)

Title: The magnitude and population burden of socioeconomic inequality in adverse birth outcomes: a 2016-2019 Dutch population register data study

Abstract: Parental socioeconomic position affects offspring health from as early as conception. Evidence on the magnitude and population burden of inequality in adverse birth outcomes by maternal education using nationwide registration data remains scant. Our objective was to describe the magnitude and population burden of inequality by maternal education for adverse birth outcomes in the Netherlands between 2016 and 2019. Outcomes of interest were stillbirth, neonatal mortality, preterm birth, small-for-gestational-age (SGA), low Apgar score at 5 minutes, neonatal intensive care unit (NICU) admission, and severe congenital anomalies. Inequalities were expressed as the rate ratio (RR) between the lowest and highest education group. We used logistic regression to estimate the population attributable fraction (PAF) and the absolute number of adverse cases avoided in the scenario that the entire population had the same birth outcomes as the highest education group. Our study included 639,007 births. One in six births had an adverse outcome. Each step up the educational ladder was associated with better outcomes. Nearly 15% of adverse outcomes would be avoided if these inequalities were addressed. Inequalities in stillbirth and neonatal mortality rates were large (RR = 2.94 [95% confidence interval = 2.33-3.55] and 2.25 [1.71-2.79], respectively), and mortality would reduce by a third if the entire population had the mortality rates of the highest education group (PAF for stillbirth = 35.0% [24.4-45.6%]; PAF for neonatal mortality = 27.1% [14.4-39.7%]). Inequalities were smaller for preterm birth, SGA, low Apgar score, NICU admission, and severe congenital anomalies (RR range = 1.32-1.77; PAF range = 13.8-17.7%), but most of these outcomes affected a much larger proportion of infants. The large middle education group with moderately elevated risk contributed most to the population burden of inequality.
Socioeconomic inequalities in adverse birth outcomes are substantial and pervade the entire Dutch society. Birth outcomes would substantially improve if these inequalities were addressed. Population health gains would be largest if addressed through population-wide preventive approaches, rather than by only targeting those at highest risk.

15-06-2023, 13.15: seminar with Samar Samir Khalil (PhD candidate at LIACS)

Title: Multilingual Federated Learning for Mental Health Disorder Detection

Abstract: Mental illnesses have become so prevalent that they threaten society's productivity. Depression, suicide, and similar mental health issues are often reflected in the language used by patients who suffer from such conditions. Early detection of mental health disorders saves lives and enhances the well-being of affected people. An increasing number of studies in Natural Language Processing (NLP) have been devoted to the recognition and detection of early symptoms of related disorders. Meanwhile, the non-availability of data due to patients' data privacy hinders this research direction. Federated learning (FL) is a new paradigm that provides a collaborative machine learning environment in which a centralized model is trained and updated on decentralized data, so the data never leaves the client side. Mental health research can benefit from the privacy-preserving trait offered by FL to reconcile the need for big deep-learning datasets with the high sensitivity of data ownership. While FL has been employed in the medical domain to promote the adoption of machine learning models in clinical settings, little research has investigated how multilingual text impacts FL algorithms. In this research, we investigate the potential of Federated Learning combined with Natural Language Processing, applied in the psychology field, by simulating a multilingual data setting. We train and compare results from a collaborative learning model versus a data-centralized model.

06-04-2023, 16.00: seminar with Georgy Gomon (PhD candidate at LUMC)

Title: Ideas about the introduction of hierarchical and survival models within the DRUP (Drug Rediscovery Protocol) study, a large Dutch basket and umbrella study for oncological targeted therapy

Abstract: The Drug Rediscovery Protocol (DRUP trial) is a Dutch oncology trial that has been accruing patients since September 2016. The trial tests the efficacy of commercially available targeted anti-cancer drugs in patients with advanced cancer. These patients have no remaining standard treatment options but have a potentially actionable mutation for which the corresponding targeted therapy has not (yet) been approved for the patient’s specific tumor type. The trial has a theoretically unlimited number of possible parallel cohorts, defined by the patient's tumor type, targetable mutation and the targeted therapy being used. Since its inception in 2016, over 1300 patients have initiated treatment in one of ≈250 open cohorts, with 33% of patients experiencing clinical benefit, a response rate of 13% and a complete response of 2% (of all included patients), even though these are patients who were initially told that there were no treatment options left. The focus of this presentation is to explore the implementation of statistical methods in the DRUP study. Among others, we shall examine the pooling of information across different cohorts to better analyze the effectiveness of targeted therapy. To achieve this, we will explore the use of hierarchical models such as mixed-effects models and Bayesian hierarchical models. We will also address the absence of a control group by examining the application of survival models with historical data.

26-01-2023, 13.15: seminar with Marta Spreafico (Postdoc at MI)

Title: Causal effects of chemotherapy regimen intensity on survival outcome in osteosarcoma patients through Marginal Structural Cox Models

Abstract: In cancer trials, longitudinal chemotherapy data are problematic to analyse due to the presence of negative feedback between exposure to cytotoxic drugs and consequent toxic side effects. Toxicities act as time-dependent confounders for the effect of chemotherapy intensity exposure on survival, leading to toxicity-treatment-adjustment bias if not properly considered. Novel methodologies are hence needed to control for exposure-affected (time-varying) confounders in longitudinal chemotherapy data. Marginal Structural Cox Models (Cox MSMs) in combination with Inverse Probability of Treatment Weighting (IPTW) are a proper tool to provide unbiased estimates of the causal effects of therapy modifications on survival outcomes. In this work, using novel definitions of Received Dose Intensity (RDI) and Multiple Overall Toxicity, suitable IPTW-based techniques and Cox MSMs are designed to mimic a randomized trial where chemotherapy exposure is no longer confounded by toxicities. In this pseudo-population, a crude analysis suffices to estimate the causal effect of modifications in joint-exposure, seen in terms of both histological response and reduced RDI compared to protocol. Data from the control arms of the MRC BO03 and BO06 clinical trials on chemotherapy in osteosarcoma were analysed. During the seminar, the process of building proper causal models based on joint-exposure using two alternative strategies will be discussed, together with a tutorial-like explanation of the difficulties encountered.

24-11-2022, 13.15: seminar with Hine van Os (LUMC)

Title: Developing clinical prediction models using primary care electronic health record data – the impact of data preparation choices on model performance

Abstract: The objective of this study is to quantify prediction model performance in relation to data preparation choices when using electronic health records (EHR). Cox proportional hazards models were developed to predict first-ever major adverse cardiovascular events using Dutch primary care EHR data. The reference model was based on a one-year run-in period, cardiovascular events were defined based on both EHR diagnosis and medication codes, and missing values were multiply imputed. We compared data preparation choices regarding i) length of the run-in period (two- or three-year run-in); ii) outcome definition (EHR diagnosis codes or medication codes only); and iii) methods addressing missing values (mean imputation or complete case analysis) by making variations on the derivation set and testing their impact in a validation set.

11-10-2022, 12.00-13.00: seminar with Salvatore Battaglia (MI & University of Palermo, Italy)

Title: Vertical models in the presence of random effects

Abstract: The vertical model is an alternative to the competing risks model, in particular when the proportionality assumption is relaxed or when the cause of failure is missing. The novelty is to accommodate a random component, one for each part of the model, in order to take into account unobserved heterogeneity in the presence of clusters. The data come from the EMUR database, which includes Emergency Department (ED) accesses from 63 Sicilian EDs. The vertical model is able to compute the risk of being discharged or hospitalized once admitted to the ED, including a random centre effect. Depending on the value of the correlation coefficient between the pair of random effects (Ui, Vi), two types of vertical model are computed: the separated one, when no correlation is assumed, and the joint vertical model otherwise. Due to the large computational times recorded for the separated vertical model under a frequentist approach, we decided to use the R-INLA package to perform a joint vertical model analysis under a Bayesian framework.

29-09-2022, 13.15: seminar with Ahnjili Zhuparris (CHDR/LUMC)

Title: Unipolar depression classification and estimation from smartphone, wearable, and electronic Patient Reported Outcomes (ePro) data

Abstract: Drug development for mood disorders is expected to benefit from reliable behavioral biomarkers that quantify drug effects outside the clinic. The incorporation of smartphones and wearable devices in clinical trials provides a unique opportunity to monitor depression remotely. Our study investigated the application of the CHDR Monitoring Remotely Platform (MORE™) to monitor and identify potential digital biomarkers (related to physical activity, social activity, geolocation data, and self-assessments) that could differentiate between unipolar depressed patients and healthy controls and estimate depression severity.

30-06-2022, 13.15: seminar with Salvatore Battaglia (MI & University of Palermo, Italy)

Title: Overcrowding in Emergency Departments: a frailty competing risk model to analyze discharge and hospitalization in 11 Sicilian EDs

Abstract: Emergency Department (ED) overcrowding has become increasingly prevalent throughout the nation in recent years. Severe overcrowding is usually associated with a lack of ED personnel and of beds in the hospital wards. Both issues negatively affect patient care outcomes, such as increased length of stay (LoS) at EDs, the number of patients who leave the ED without consulting a medical doctor, and other factors. The COVID-19 outbreak had a huge impact on accesses to EDs; as a consequence, it is difficult to provide suitable analyses of ED overcrowding flows, due to the enormous heterogeneity between EDs in how accesses are organized. The aim of this talk is to provide proper information to the heads of EDs, based on data collected just before the outbreak. This analysis may be used to develop processes able to deal with high ED volume.

21-04-2022, 13.15: seminar with Gerard van Oortmerssen (guest LIACS & patient platform sarcomas)

Title: Health data science applied to a patient Facebook group

Abstract: For many years, patients have used social media as a place for peer contact and a source of information. They share their patient journey, give advice to new members of the group and support each other emotionally. The accumulated discussions are a potentially valuable source of information for patients, carers, medical specialists and researchers. Data Science techniques such as Natural Language Processing and Machine Learning can be used to extract information from these spontaneous and unstructured discussions. Since 2015, a series of projects has been carried out applying these techniques to discussions in a Facebook Group of patients with a rare type of cancer. Results will be presented which clearly show the added value of patient-generated data.

24-02-2022, 13.15: webinar with Dirk Hoevenaars (VU)

Title: Applying wearable use in mHealth for wheelchair users

Abstract: Physical impairments and becoming wheelchair dependent cause additional daily physical or societal barriers, resulting in even lower physical activity levels compared to the general population. Wheelchair users also tend to show poorer diet quality compared to the general population, often characterized by increased intake of fat and sugar and limited intake of fruit, whole grains and dairy. Many different technological developments are available in the field of health promotion to support achieving and maintaining a healthy lifestyle, and this field is expected to continue developing rapidly in the coming decades. The development of mobile health (mHealth) and wearable technology allows incorporation of advanced electronic and computer technologies to support the transition towards, and maintenance of, a healthy lifestyle on an individual and group level. Wearable technology allows individuals to monitor and track multiple lifestyle-related variables, such as their physical activity and its intensity, sleep quality, (resting) heart rate and, in more recent developments, even blood pressure. However, mHealth and wearables are often not specifically designed or tested for individuals with chronic disabilities. So, on top of their additional daily barriers and challenges, fewer opportunities and less support are available in the form of mHealth and wearable technology to help this specific population maintain a healthy lifestyle. During the seminar, the accuracy of heart rate measurement with a Fitbit using photoplethysmography (PPG) in individuals with spinal cord injury will be discussed, together with the results of a 12-week mHealth intervention study with wheelchair users in combination with the Fitbit.

27-01-2022, 13.15: webinar with Daniel Gomon (PhD candidate at MI)

Title: Continuous time control charts: detecting changes in the quality of care

Abstract: Rapidly detecting problems in the quality of care is of utmost importance for the well-being of patients. Without proper inspection schemes, such problems can go undetected for years. Cumulative sum (CUSUM) charts have proven to be useful for quality control, yet available methodology for survival outcomes is limited. The few available continuous time inspection charts usually require the researcher to specify an expected increase in the failure rate in advance, thereby requiring prior knowledge about the problem at hand. Misspecifying parameters can lead to false positive alerts and large detection delays. To solve this problem, we take a more general approach to derive the new CGR-CUSUM chart. We find an expression for the approximate average run length (average time to detection) and illustrate the possible gain in detection speed by using the CGR-CUSUM over the funnel plot, Bernoulli CUSUM and Biswas & Kalbfleisch (2008) CUSUM on a real-life data set from the Dutch Arthroplasty Register as well as in simulation studies. Besides the inspection of medical procedures, the CGR-CUSUM can also be used for other real-time inspection schemes such as industrial production lines and quality control of services.

16-12-2021, 13.15: webinar with Marta Spreafico (PhD candidate at MI)

Title: Modelling longitudinal Latent Overall Toxicity (LOTox) profiles in osteosarcoma patients

Abstract: Due to the presence of multiple types of adverse events with different extents of toxicity burden, studying the toxicity evolution during chemotherapy is a challenging problem in cancer research. Statistical methods able to deal with the complexity of chemotherapy data, considering both the longitudinal and the categorical aspects of the progression of toxicity levels, are necessary, yet still not well developed. We will discuss a Latent Markov (LM) procedure to identify and reconstruct the longitudinal Latent Overall Toxicity (LOTox) profiles over time for each patient. The latent variables determining the progression of the observed toxicity levels can be thought of as the outcomes of an underlying latent process which may reflect patients’ quality of life. Data from the MRC BO06/EORTC 80931 randomised controlled trial for osteosarcoma patients are analysed. This approach represents a novelty for osteosarcoma treatment, providing new insights for childhood cancer.

25-11-2021, 13.15: webinar with Emil Rijcken (TUe)

Title: A Comparative Study of Fuzzy Topic Models and LDA for Interpretability in Text Classification

Abstract: In many domains that employ machine learning models, both high-performing and interpretable models are needed. A typical machine learning task is text classification, where models are hardly interpretable. Topic models, used as topic embeddings, carry the potential to better understand the decisions made by text classification algorithms. With this goal in mind, we propose two new fuzzy topic models: FLSA-W and FLSA-V. Both models are derived from the topic model Fuzzy Latent Semantic Analysis (FLSA). After training each model ten times, we use the mean coherence score to compare the different models with the benchmark models Latent Dirichlet Allocation (LDA) and FLSA. Our proposed models generally lead to higher coherence scores and lower standard deviations than the benchmark models. These proposed models are specifically useful as topic embeddings in text classification, since the coherence scores do not drop for a high number of topics, as opposed to the decay that occurs with LDA and FLSA.

28-10-2021, 13.15: webinar with Anne Dirkson (PhD candidate at LIACS)

Title: Real-world evidence from online patient forums can complement current medical perspectives: The example of gastrointestinal stromal tumor patients

Abstract: Disease-specific internet discussion forums have the potential to provide real-time, uncensored and unsolicited information on both adverse drug effects (ADEs) and the advice patients give each other on how to cope with them. The automatic extraction of ADEs could complement current post-market monitoring of drugs, which suffers from severe under-reporting. To this end, we have developed a text mining pipeline to automatically extract and aggregate side effects from messages on online discussion forums. We show that our automated approach has the potential to reveal side effects that were not found in the original clinical trial, as well as long-term side effects and the side effects that matter most to the patients on a patient forum for Gastrointestinal Stromal Tumor (GIST) patients. The automatic extraction of self-reported coping strategies could empower patients by providing them with aggregate insights, as well as facilitating medical research into why certain strategies work. In fact, some strategies may work to the detriment of the medication efficacy. We will share some work in progress on this novel clinical NLP task that includes many interesting challenges such as fuzzy entities, a large and long-tailed label space, and cross-document relations.

14-10-2021, 13.15: webinar with Vera Arntzen (PhD candidate at MI)

Title: Estimation of incubation and latency time distributions of Covid-19

Abstract: The distributions of incubation time (from infection to symptom onset) and latency time (from infection to start of infectiousness) are key quantities in the analysis of infectious diseases. Despite their importance in decisions on contact tracing and quarantine policies, estimation methods for incubation time suffer from limitations, while estimates of the latency time for Covid-19 are currently lacking. This talk will answer the question “What is the quarantine period based on?” and show how this can be improved. The methodology will be applied to data from the pandemic in Vietnam.

29-07-2021, 13.15: webinar with Dr. Justin Dauwels (TU Delft)

Title: Machine Learning Methods to Predict Symptoms of Schizophrenia and Depression Patients from Behavioral Cues

Abstract: Can automated analysis of audio-visual signals predict the severity of negative, cognitive, and general psychiatric symptoms of schizophrenia and depression and differentiate patients from healthy controls? In this observational study, we extracted a comprehensive set of audio-visual behavioral cues from interview recordings of 103 schizophrenia and 50 depression patients, and 75 healthy participants. We developed machine learning models that, by leveraging these audio-visual behavioral cues, are able to detect overall and specific expression-related negative, cognitive, and general psychiatric symptoms at a balanced accuracy (BAC) of at least 75%, and to distinguish schizophrenia and depression patients from healthy subjects (BAC > 82%). These results suggest that machine learning models leveraging audio-visual characteristics can help diagnose, assess, and monitor schizophrenia and depression patients with negative, cognitive, and general psychiatric symptoms.

24-06-2021, 16.00: webinar with Pablo Mosteiro (Utrecht University)

Title: Towards improving psychiatric treatment with natural language processing

Abstract: In this talk, I will use the problem of assessing violence risk in an inpatient psychiatric institution to outline the challenges associated with using natural language processing to improve psychiatric treatment. I will then mention some of the strategies currently being used to tackle those challenges, and the work currently being done to implement those strategies.

Speaker’s webpage: https://www.uu.nl/staff/PJMosteiroRomero

15-04-2021, 16.00: webinar with Saskia Koldijk (UMC Utrecht)

Title: Study with Empatica wearables: Detecting physiological arousal in children using a wearable

Abstract: Aggression is one of the main causes of psychiatric admission, and manifests itself in different disorders. Coping with aggression is important for the children themselves, as well as for staff. We aim to deploy wearables in clinical practice to support emotion regulation. In our research, we asked 25 children from the psychiatry ward to wear an Empatica wearable for 5 days. Observations of behavior, especially aggressive incidents, were made. Currently we are analyzing the relation between measured physiology over time and observed aggression. We are considering multilevel modeling, but are also interested in discussing alternative analysis approaches with the SIG members.