1 of 8

2 of 8

FUTURE PLANS (take one): GenAI

Jonas Almeida, PhD

Trans-Divisional Research Program

Data Science and Engineering

Research Group

Research Areas

1. Data Science and Engineering for cancer research

Data Science is an interdisciplinary field that uses scientific methods, processes, and computational systems to extract or extrapolate knowledge and insights from data.

At DCEG we also contribute software engineering to infrastructure initiatives such as Federated Digital Pathology, Polygenic Risk Scoring, and Connect.

2. FAIR RoIs in Digital Pathology

The complexity of AI applications to digital pathology is a significant challenge to the identification of morphologies and regions of interest (RoI) in whole slide images.

3. Privacy-preserving polygenic risk scoring

Modern Web computing (in-browser) offers an opportunity to have the code travel to where the individual’s data is.
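As a rough illustration of a calculation that can travel to the data, the sketch below computes a polygenic risk score as the usual weighted sum of risk-allele dosages, entirely on the machine that holds the genotype. It is a minimal sketch and not the group's implementation: the function name, the variant IDs, and the effect weights are all hypothetical, and Python is used only for brevity (an in-browser version would run as JavaScript).

```python
from typing import Dict


def polygenic_risk_score(weights: Dict[str, float], dosages: Dict[str, int]) -> float:
    """Weighted sum of risk-allele dosages.

    weights: effect size per variant (e.g. a log odds ratio).
    dosages: number of risk alleles carried per variant (0, 1, or 2).
    Variants with no genotype available are skipped.
    """
    score = 0.0
    for variant_id, beta in weights.items():
        dosage = dosages.get(variant_id)
        if dosage is None:
            continue  # variant not genotyped for this individual
        if dosage not in (0, 1, 2):
            raise ValueError(f"invalid dosage for {variant_id}: {dosage}")
        score += beta * dosage
    return score


# Hypothetical effect weights for made-up variant IDs (an assumption).
example_weights = {"rsA": 0.5, "rsB": -0.25, "rsC": 0.125}

# Hypothetical genotype that stays on the individual's device.
example_genotype = {"rsA": 2, "rsB": 1, "rsD": 0}

example_score = polygenic_risk_score(example_weights, example_genotype)
print(example_score)  # 0.75
```

Only the small table of weights needs to be downloaded; the genotype is read locally and never leaves the device.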

4. Federated Artificial Intelligence

Generative embedded spaces (see future plans) create metric spaces where observational multimodal data can be projected onto high-dimensional coordinate spaces.

5. Exploratory data analysis: interactive spatial-temporal clustering.

In exploratory multivariate statistics, interacting is understanding.

6. Leveraging language models of biomedical reports: from occupation codes (SOCcer) to pathology reports.

Key Approaches and/or Resources

Artificial Intelligence, Cloud Computing, distributed Web computing – STRIDES, TCGA, critical resources.

Collaboration Network

BY RESEARCH AREA

1 – Data Science and engineering infrastructure for epidemiology.

2 – FAIR architectures combining Cloud and Web computing.

3 – Artificial intelligence, distributed learning, from risk calculation to digital pathology.

Study Team

Jonas Almeida, PhD - Senior Investigator
Daniel Russ, PhD - Staff Scientist
Jeya Balaji, PhD - postdoctoral fellow
Lee Mason, PhD - postdoctoral fellow
Praphulla Bhawsar, MS - GPP student
Lorena Sandoval, MS - GPP student


~8 peer-reviewed papers + ~830 citations / year

Within the Branch/Lab
Peter Kraft, Ph.D.	Area 1,3
Jeya B. Balasubramanian, Ph.D. (Research Fellow – 30%)	Area 2,3
Daniel Russ, Ph.D. (Staff Scientist – 75%)	Area 1,2,3
Mia Gaudet, Ph.D. (Senior Scientist for the Connect Cohort Study)	Area 1

Outside the Branch/Lab (DCEG/NCI/NIH)
Meredith Shields, Ph.D., DCEG/IIB	Area 1,2
Stephen Chanock, M.D., DCEG/OD	Area 2,3
Tongwu Zhang, Ph.D., DCEG/BB	Area 2
Maria Teresa Landi, M.D., Ph.D., DCEG/ITEB	Area 2,3
Charles Matthews, Ph.D., DCEG/MEB	Area 1,2
Melissa Friesen, Ph.D., DCEG/OEEB	Area 2,3
Sue Pan, NCI/CBIIT (Digital Services and Solutions Branch Chief)	Area 1
Kailing Chen (software engineer), NCI/CBIIT	Area 2

Outside the NIH
Montserrat Garcia-Closas, M.D., Ph.D., Institute of Cancer Research, UK	Area 1,2,3
Joel Saltz, M.D., Ph.D., Stony Brook University, SUNY, Long Island, NY	Area 2,3
Ana Tereza Vasconcelos, Ph.D., National Laboratory for Scientific Computing (LNCC), Brazil	Area 2,3
Arlindo Oliveira, Ph.D., University of Lisbon, Portugal	Area 3
Raja Mazumder, Ph.D., George Washington University, DC	Area 1
Wolfgang Maass, Ph.D., German Research Center for Artificial Intelligence (DFKI), Saarland University	Area 3

3 of 8

What are embeddings
https://g.co/gemini/share/61da50b57a16
Created with 2.5 Pro July 8, 2025 at 09:48 AM. Published July 8, 2025 at 09:55 AM

In the realm of machine learning, embeddings are numerical representations of complex data, such as text, images, or sounds. They translate high-dimensional data, which is data with a large number of features, into a more compact and meaningful lower-dimensional space. This process allows machine learning models to understand and process intricate information more efficiently.

At its core, an embedding is a vector—a list of numbers—that captures the essential characteristics and semantic relationships of the original data. For instance, in the context of natural language processing, words with similar meanings will have embeddings that are close to each other in this vector space.
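A minimal sketch of what "close to each other in this vector space" can mean in practice, using cosine similarity as the closeness measure. The three-dimensional vectors below are invented for illustration; real embeddings come from a trained model and typically have hundreds or thousands of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy embeddings, made up for this example (not output of any model).
toy_embeddings = {
    "tumor": [0.9, 0.8, 0.1],
    "carcinoma": [0.8, 0.9, 0.2],
    "bicycle": [0.1, 0.0, 0.9],
}

sim_related = cosine_similarity(toy_embeddings["tumor"], toy_embeddings["carcinoma"])
sim_unrelated = cosine_similarity(toy_embeddings["tumor"], toy_embeddings["bicycle"])

print(f"tumor vs carcinoma: {sim_related:.3f}")
print(f"tumor vs bicycle:   {sim_unrelated:.3f}")
```

With these toy vectors the related pair scores much higher than the unrelated pair, which is the property that similarity search over embeddings relies on.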

4 of 8

Generative AI, dynamic multi-modal systems
(intro)

- chatting with pathology reports: episphere.github.io/gemini/chat
(present)

- time series: episphere.github.io/forecast

(past)

- Non-generative pathology models: mathbiol.github.io/tcgatil

- Generative pathology models: (txt) bit.ly/tcgareps, (img) bit.ly/wsiTiles

- episphere.github.io/ese

- biological sequences: usm.github.io/fold

- NLP (Jeya, Daniel)

5 of 8

Overflow

…

6 of 8

“What new mathematical or ML/AI modeling approaches will impact our ability to predict responses to TME-targeting therapies?”

Jonas S Almeida, PhD, Senior Investigator; Director, Data Science

Lab: Data Science and Engineering Research Group (TDRP/DSERG)

Artificial Intelligence,
from a) Deep Learning to b) Generative Language Models, underlying modern
a) Multivariate Discriminant Analysis and b) Exploratory Multimodal Latent Spaces.

a) bit.ly/tcgatil

b) bit.ly/tcgareps

c) bit.ly/wsiTiles

7 of 8

Diagram labels: encoder, transformer, decoder, …, embeddings, …

8 of 8

AlphaGenome