FUTURE PLANS (take one): GenAI
Jonas Almeida, PhD
Trans-Divisional Research Program
Data Science and Engineering
Research Group
Research Areas
1. Data Science and Engineering for cancer research
Data Science is an interdisciplinary field that uses scientific methods, processes, and computational systems to extract or extrapolate knowledge and insights from data.
At DCEG we also contribute software engineering to infrastructure initiatives such as Federated Digital Pathology, Polygenic Risk Scoring, and Connect.
2. FAIR RoIs in Digital Pathology
The complexity of AI applications in digital pathology is a significant challenge to the identification of morphologies and regions of interest (RoI) in whole slide images.
3. Privacy-preserving polygenic risk scoring
Modern Web computing (in-browser) offers an opportunity to have the code travel to where the individual’s data is.
4. Federated Artificial Intelligence
Generative embedded spaces (see future plans) create metric spaces where observational multimodal data can be projected onto high-dimensional coordinates.
5. Exploratory data analysis: interactive spatio-temporal clustering.
In exploratory multivariate statistics, interacting is understanding.
6. Leveraging language models for biomedical reports: from occupation codes (SOCcer) to pathology reports.
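As an illustration of research area 3, a polygenic risk score is commonly computed as a weighted sum of effect-allele dosages, PRS = Σ βᵢ × dosageᵢ. The sum is small enough to run wherever the genotype resides, so only the weights need to travel. The sketch below is a minimal version of that sum under stated assumptions: the variant identifiers, effect sizes, and dosages are hypothetical placeholders, and the function name is illustrative rather than taken from any DCEG tool.

```python
from typing import Mapping


def polygenic_risk_score(
    weights: Mapping[str, float],
    dosages: Mapping[str, float],
) -> float:
    """Weighted sum of effect-allele dosages.

    Variants missing from the individual's genotype are skipped.
    """
    return sum(
        beta * dosages[variant]
        for variant, beta in weights.items()
        if variant in dosages
    )


# Hypothetical effect sizes (these could be downloaded from a public catalog).
weights = {"rs0001": 0.12, "rs0002": -0.05, "rs0003": 0.30}

# Hypothetical genotype: effect-allele counts that never leave the local machine.
dosages = {"rs0001": 2, "rs0002": 1, "rs0003": 0}

score = polygenic_risk_score(weights, dosages)
print(round(score, 2))
```

Only the weights are fetched in this sketch; the dosages and the resulting score remain local.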
Key Approaches and/or Resources
Artificial Intelligence, Cloud Computing, distributed Web computing – STRIDES, TCGA, critical resources.
Collaboration Network
BY RESEARCH AREA
1 – Data Science and engineering infrastructure for epidemiology.
2 – FAIR architectures combining Cloud and Web computing.
3 – Artificial intelligence, distributed learning, from risk calculation to digital pathology.
Study Team
(1)
(2)
(3)
(4)
(5)
(6)
~8 peer-reviewed papers and ~830 citations per year
Within the Branch/Lab
Peter Kraft, Ph.D. | Area 1,3 |
Jeya B. Balasubramanian, Ph.D. (Research Fellow – 30%) | Area 2,3 |
Daniel Russ, Ph.D. (Staff Scientist – 75%) | Area 1,2,3 |
Mia Gaudet, Ph.D. (Senior Scientist for the Connect Cohort Study) | Area 1 |
Outside the Branch/Lab (DCEG/NCI/NIH)
Meredith Shields, Ph.D., DCEG/IIB | Area 1,2 |
Stephen Chanock, M.D., DCEG/OD | Area 2,3 |
Tongwu Zhang, Ph.D., DCEG/BB | Area 2 |
Maria Teresa Landi, M.D., Ph.D., DCEG/ITEB | Area 2,3 |
Charles Matthews, Ph.D., DCEG/MEB | Area 1,2 |
Melissa Friesen, Ph.D., DCEG/OEEB | Area 2,3 |
Sue Pan, NCI/CBIIT (Digital Services and Solutions Branch Chief) | Area 1 |
Kailing Chen, NCI/CBIIT (software engineer) | Area 2 |
Outside the NIH
Montserrat Garcia-Closas, M.D., Ph.D., Institute of Cancer Research, UK | Area 1,2,3 |
Joel Saltz, M.D., Ph.D., Stony Brook University, SUNY, Long Island, NY | Area 2,3 |
Ana Tereza Vasconcelos, Ph.D., National Laboratory for Scientific Computing (LNCC), Brazil | Area 2,3 |
Arlindo Oliveira, Ph.D., University of Lisbon, Portugal | Area 3 |
Raja Mazumder, Ph.D., George Washington University, DC | Area 1 |
Wolfgang Maass, Ph.D., German Research Center for Artificial Intelligence (DFKI), Saarland University | Area 3 |
What are embeddings?
https://g.co/gemini/share/61da50b57a16
Created with Gemini 2.5 Pro, July 8, 2025 at 09:48 AM; published July 8, 2025 at 09:55 AM
In the realm of machine learning, embeddings are numerical representations of complex data, such as text, images, or sounds. They translate high-dimensional data, which is data with a large number of features, into a more compact and meaningful lower-dimensional space. This process allows machine learning models to understand and process intricate information more efficiently.
At its core, an embedding is a vector—a list of numbers—that captures the essential characteristics and semantic relationships of the original data. For instance, in the context of natural language processing, words with similar meanings will have embeddings that are close to each other in this vector space.
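The notion of "close to each other in this vector space" is usually made concrete with cosine similarity. The sketch below illustrates the idea under stated assumptions: the three-dimensional vectors are made up for the example and do not come from any trained model, and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings, invented for illustration only.
embeddings = {
    "tumor": [0.90, 0.80, 0.10],
    "carcinoma": [0.85, 0.75, 0.20],
    "bicycle": [0.10, 0.20, 0.95],
}


def nearest(word):
    """Return the other word whose embedding is most similar to `word`."""
    return max(
        (w for w in embeddings if w != word),
        key=lambda w: cosine_similarity(embeddings[word], embeddings[w]),
    )


print(nearest("tumor"))
```

In practice the vectors would come from an embedding model and have hundreds or thousands of dimensions, but the comparison works the same way.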
Generative AI, dynamic multi-modal systems
(intro)
- chatting with pathology reports: episphere.github.io/gemini/chat
(present)
- time series: episphere.github.io/forecast
(past)
- Non-generative Pathology models: mathbiol.github.io/tcgatil
- Generative Pathology models: (txt) bit.ly/tcgareps , (img) bit.ly/wsiTiles
- biological sequences: usm.github.io/fold
- NLP (Jeya, Daniel)
Overflow
…
“What new mathematical or ML/AI modeling approaches will impact our ability to predict responses to TME-targeting therapies?”
Jonas S Almeida, PhD, Senior Investigator; Director, Data Science
Lab: Data Science and Engineering Research Group (TDRP/DSERG)
Artificial Intelligence, from a) Deep Learning to b) Generative Language Models, underlying modern a) Multivariate Discriminant Analysis and b) Exploratory Multimodal Latent Spaces.
(Diagram labels: encoder, transformer, decoder, …, embeddings, …, AlphaGenome)