| A | B | C | D | E | |
|---|---|---|---|---|---|
1 | Given (First) Name | Family (Last) Name | Institution | POSTER Presentation title | POSTER Presentation abstract |
2 | Tal | Ayalon | Drexel University | FAIR and AI-Ready Data and Reproducible Research (FARR): A Pathway to Four-Year STEM Degrees Through Data Foundations, AI, and Materials Science (FARR-STEM) | The Drexel-FARR-STEM program will help prepare students to advance in any STEM degree field at a four-year college. The program covers 1) data foundations, 2) the fundamentals of data management and FAIR principles, and 3) AI basics. The emphasis of FARR-STEM will be on materials science in connection with the NSF-HDR Institute for Data Driven Dynamical Design. The program will take place virtually and onsite at Drexel University in Philadelphia, Pennsylvania. Participants will join from the following four Philadelphia-area community colleges: Community College of Philadelphia (CCP), Bucks County Community College (Bucks CCC), Delaware County Community College (DCCC), and Montgomery County Community College (MCCC). The program will include hands-on independent and group activities. All participants will gain an understanding of concepts in materials science, research data management, and the principles of FAIR and FARR for AI-ready, reproducible research. |
3 | Nrupen | Bhavsar | Duke University | FAIR Principles for Multiscale Infectious and Immune Mediated Diseases (IID) AI Models | Understanding the dynamics of infectious and immune-mediated diseases (IID) requires complex, multiscale models. These models are often not reused or integrated because of logistical or methodological challenges, which limits their usability. Developing infrastructure to promote model reuse and integration is a central goal of the NIAID-funded (U54AI191253) Center for Multiscale Immune Systems Modeling (MISM). MISM includes a Model and Data Sharing Core (MDSC) charged with 1) developing infrastructure to facilitate access to and sharing of multimodal models and data, and 2) innovating and advancing ways of achieving FAIR (Findable, Accessible, Interoperable, Reusable) principles for sharing IID models and artificial intelligence (AI) models. The MDSC model sharing effort will complement the existing NIAID Data Discovery Portal. Understanding how to apply FAIR principles to AI models is critical for this purpose. While there is a robust literature and a practical adoption pathway for data that follows FAIR principles, comparable guidance for FAIR AI models is still emerging. Herein, we will review the state of FAIR models for IID, including AI models and their intersection with reproducibility, openness, and domain-specific standards. The goal is to identify gaps and best practices that can be applied to the development of infrastructure that can support IID multiscale modeling. |
4 | Elizabeth | Campolongo | Imageomics Institute and AI & Biodiversity Change Global Center, The Ohio State University | Making Collaborative, Distributed, and Interdisciplinary AI Research more FAIR: A Reusable Guide and Template | Both the Imageomics Institute and the AI & Biodiversity Change (ABC) Global Center aim to inspire AI innovation through domain science advancements. They bring together researchers from computer science, biology, ecology, and related fields for interdisciplinary collaborations, where the domain science knowledge informs an AI solution to aid in answering their question. Though there are many common standards within the AI/ML community, they are not all applied evenly, nor are they necessarily familiar to non-computer scientists. Building a common lexicon and workflow is key to this collaboration. The Collaborative Distributed Science Guide started as an Imageomics internal guide in a GitHub Wiki, developed to help bridge the gaps between our multidisciplinary team, with a focus on providing guidance and best practices for collaborative and interdisciplinary work following FAIR principles. Recognizing that the topics and suggestions are broadly applicable to anyone working in similar or adjacent fields, and further inspired by the response from this community, we moved the vast majority of the content to this publicly available, template website. In this way, we provide an open-source, general resource, ready for re-use and personalization. In this presentation, we will walk through the different elements of the guide, from specifying workflows and expected repository contents, to the FAIR guide with checklists and templates to ensure data, models, and code adhere to the FAIR principles and are reproducible. We will ground this in our project lifecycle and helpful tools that we use to facilitate it—both of our own creation and those created by others that we have appreciated. 
Our hope is that the broader community will use or contribute to this template guide to help provide a greater resource for scientists working in computer or data science, biology, ecology, or related interdisciplinary fields. |
5 | Mostafa | Cham | University of Maryland Baltimore County | AI-Ready and Reproducible Subglacial Bed Mapping from Sparse Radar: DeepTopoNet and Physics-Guided Residual Workflows | Accurate maps of subglacial bed topography are foundational for ice-sheet modeling and sea-level projections, yet direct ice-penetrating radar measurements are sparse, unevenly distributed, and vary in confidence across space and campaigns. This poster presents two complementary, AI-ready workflows that convert heterogeneous cryosphere observations and community priors into reproducible machine-learning products designed for geoscience data providers and downstream modelers. First, DeepTopoNet learns Greenland bed topography by integrating sparse radar-derived ice thickness observations with a widely used gridded prior (e.g., BedMachine-style products) through a training objective that balances fidelity to observations and consistency with prior structure. This formulation addresses a common “AI readiness” gap in geoscience: supervision is incomplete and heterogeneous, yet repositories often provide strong priors that should be used responsibly rather than treated as ground truth everywhere. Second, a physics-guided residual learning framework predicts thickness residuals over a prior and reconstructs bed elevation from the observed surface. Training couples a masked robust fit to radar picks (modulated by a confidence map) with lightweight physical and regularization terms—multi-scale mass conservation, flow-aligned total variation, Laplacian/high-pass damping, and non-negativity—plus a ramped prior-consistency term where radar constraints are weak. To support community reproducibility, we adopt leakage-safe, block-wise spatial holdouts with safety buffers and report metrics on held-out cores to avoid receptive-field leakage, alongside stable inference practices (e.g., EMA weights and test-time augmentation). 
Together, these methods highlight practical, FAIR-aligned signals and artifacts for AI-ready geoscience: standardized inputs/targets, confidence-aware supervision, leakage-safe evaluation protocols, and uncertainty/validation outputs suitable for repository dissemination and benchmarking. We conclude with recommendations for packaging datasets, splits, and evaluation scripts to enable transparent comparison and long-term maintainability for cryosphere AI products. |
6 | Tom | Cram | NSF National Center for Atmospheric Research | Analysis Ready Data Access at the NSF NCAR Geoscience Data Exchange | Traditional data workflows often force researchers to download massive files before they can even begin their work, leading to significant delays and storage issues. By shifting toward a data streaming model, users can now access and analyze specific slices of information directly over the network without ever needing to download the full dataset. The NSF NCAR Geoscience Data Exchange (GDEX) makes this possible by providing Analysis Ready, Cloud Optimized (ARCO) data. Using tools like Zarr, Kerchunk, and intake-esm catalogs, researchers can treat remote data as if it were on their own local machine. This modern approach removes the "busy work" of data management, allowing scientists to focus entirely on their analysis and reach results much faster. |
7 | Joel | Cutcher-Gershenfeld | Brandeis University | Research Data Work | Research data work is an emerging profession. Historically, researchers have been responsible for their own research data work. The rise of digital technologies has led to a growing body of professionals with expertise in computing and data work. Most research organizations (universities, government labs, and independent research centers) are in various stages of formalizing these career paths and valuing this work, which underpins all of the UN Sustainable Development Goals. The work is important and too often underappreciated. |
8 | Ellie | Davidson | UMBC | Mapping Deep Ocean Temperature and Salinity Variability using Deep Argo Data and Machine Learning | The deep ocean below 2000 meters is a critical yet poorly observed component of the Earth's climate system, storing a significant portion of the anthropogenic heat uptake and contributing to global and regional sea level rise. The emerging Deep Argo array provides direct measurements of full-depth temperature and salinity profiles, but its point measurements are too sparse for traditional mapping techniques to reconstruct coherent, high-resolution fields. This project proposes a novel machine learning (ML) framework to overcome this limitation. We train a deep learning model to predict deep-ocean temperature and salinity profiles from upper-ocean (0-2000 m) profiles alone, both collected by Deep Argo. A fully trained and validated model can then be used to predict deep-ocean temperature and salinity (T/S) variability from densely mapped upper-ocean (0-2000 m) fields collected by traditional Argo floats from 2004 to the present. |
9 | Achala | Denagamage | University of Maryland, Baltimore County & Institute for Harnessing Data and Model Revolution in the Polar Regions | Implementing an Open Science Workflow: A Case Study of NSF HDR Institute for Harnessing Data and Model Revolution in the Polar Regions (iHARP) | Open science is the practice of ensuring that research artifacts, including data, code, and publications, are accessible, transparent, and reproducible amongst research communities. Some key strategies to actively implement open science include the utilization of open science repositories, open access journals, adopting open source code, pre-registration, and institutional activism. With the rising campaign for open science, the successful engineering of the process leaves much to be understood. This is particularly complex for interdisciplinary research communities, involving diverse disciplines, professional/career levels, institutions, and research methodologies, to mention a few. In this study, we describe a multi-pronged operation in open science at the NSF HDR Institute for Harnessing Data and Model Revolution in the Polar Regions (iHARP), an interdisciplinary research institute where researchers and practitioners in polar science and data science converge to advance domain knowledge through integrated physics-informed, data-driven discoveries. First, we leverage a comprehensive assessment of open science repositories, where we identify suitable platforms accessible to a niche interdisciplinary polar data science research community. 
We select five platforms: GitHub for hosting code and algorithms; ScholarWorks@UMBC for institutional open access or links to publications; arXiv for access to pre-prints that have been accepted but are pending publication; Zenodo for publishing datasets, with integration with GitHub; and Ghub as a dashboard-style showcase that serves as a central discovery point to the other repositories, thus increasing accessibility of research artifacts among a niche community of polar science and related disciplines. Furthermore, unlike GitHub and Ghub, Zenodo also generates Digital Object Identifiers (DOIs), which provide a reliable, permanent reference to a given research artifact, thus enabling findability and tracking. Second, we implement a multi-platform workflow that maps specific research artifacts to the appropriate repositories while maintaining connectivity across platforms through clear, public-facing linkages. Here, we determine the eligibility of a research artifact to be added to an open science repository. Our primary criterion is that an artifact must be associated with a published work, such as a journal paper or conference proceedings. The objective of the eligibility check is to preserve research copyright. Following that, we categorize eligible publications by research team, that is, by their authors. We then identify an open science liaison for each research team. We screen each publication to determine the specific artifacts involved (code/algorithms and dataset(s), among other preliminary outcomes) and to verify research artifacts for accuracy, completeness, and compliance. This also ensures that access restrictions on artifacts such as proprietary datasets are not violated. We set up repositories for each publication and its associated artifacts, using each of the five platforms where applicable. 
Moreover, we establish an institutional collection on each platform, which serves as a focal point and gateway to all iHARP contributions. Finally, we conduct expert evaluation to ensure that each repository meets open science requirements and best practices. Third, to amplify the purpose of open science at iHARP, we host open science workshops where we provide tutorials, demos, and hands-on guidance on using the selected open science platforms. Additionally, we hold office hours: one-on-one sessions with open science liaisons that provide tailored support. Overall, our framework is designed to promote research outreach to polar data science communities in alignment with FAIR (Findable, Accessible, Interoperable, and Reusable) principles. Nevertheless, the proposed workflow has limitations, particularly its reliance on dedicated institutional platforms like ScholarWorks@UMBC for publications, which excludes some research publications in iHARP’s multi-institutional ecosystem. To address this, our future directions aim to broadly engage all researchers and expand to alternative institutional or cross-institutional open access repositories to ensure equitable access to all research artifacts. |
10 | Arnell | Garrett | University of Maryland, Baltimore County NOAA CESSRST Fellow (Cohort IV) National Oceanic and Atmospheric Administration | Decoupling Environmental Data Collection from Analysis: AI-Enabled Smart Glasses as a Tool for Increasing Efficiency, Data Quality, Reproducibility and Research Productivity in Climate Science | Recent advances in wearable camera technologies have expanded the study of autobiographical memory by enabling continuous, first-person capture of lived experiences. Prior research demonstrates that such naturalistic stimuli enhance episodic recall, particularly when individuals review images of their own experiences, with notable benefits observed in both clinical and nonclinical populations. Neuroimaging studies further suggest that wearable-generated stimuli engage core memory networks while extending understanding of how autobiographical information is encoded, stored, and retrieved. Collectively, this body of work highlights the value of wearable technologies in capturing rich, ecologically valid data that can be revisited for analysis. Building on this foundation, wearable, AI-enabled smart glasses (e.g., Meta smart glasses) present a novel opportunity to extend these capabilities beyond memory research into environmental and climate science. In extreme environments such as the Arctic, researchers face substantial risks, including extreme cold, unstable ice conditions, wildlife threats (e.g., polar bears), pathogen exposure from thawing permafrost, and significant logistical and psychological challenges associated with isolation. These conditions often require bulky protective gear and limit the time researchers can safely spend collecting data. Wearable smart glasses provide a hands-free method of capturing high-resolution, context-rich environmental data without interrupting fieldwork or requiring removal of protective equipment. 
Critically, this technology enables the decoupling of data collection from analysis, allowing researchers to capture data efficiently in hazardous environments and conduct detailed analysis later in safer, controlled, or warmer settings. From a policy and operational perspective, this decoupling has significant implications. Reducing the amount of time spent in high-risk environments can lower exposure to physical and environmental hazards while also decreasing operational costs associated with extended field deployments. Additionally, the ability to capture comprehensive datasets in a single pass may reduce the number of personnel required in the field, addressing both safety concerns and budget constraints. This approach also has the potential to increase research productivity by enabling faster data collection, supporting larger volumes of data, and enhancing the richness of research outputs through multimodal documentation (e.g., images, audio, and contextual metadata). To explore the feasibility of this approach, we conducted a pilot study with ten middle school students at a botanical garden to assess whether wearable capture technology enhances the speed and accuracy of environmental pattern recognition. Participants were asked to identify and describe patterns of greenery under two conditions: (1) direct observation with traditional recall and (2) observation supplemented by first-person image capture using smart glasses. Results indicated no statistically significant difference in the speed or accuracy of pattern detection between the two conditions. The findings suggest functional equivalence between in situ observation and later image-based analysis, supporting the feasibility of deferring analysis without compromising performance. 
Taken together, these findings position wearable AI technologies as a promising methodological innovation for climate research, offering safer, more efficient, and scalable approaches to conducting field-based studies while maintaining data quality and integrity. |
11 | Daniel | Howard | NSF NCAR | Providing Pre-Trained Inference Models for Accessible AI4NWP | Advances in artificial intelligence (AI) are rapidly transforming numerical weather prediction (NWP), offering new methods for generating accurate, efficient forecasts. However, the ability to compare, evaluate, and iterate on these AI-based models is often hindered by limited accessibility and reproducibility of pre-trained inference systems and datasets. We present ongoing work to provide pre-trained AI inference models for NWP through a FAIR (Findable, Accessible, Interoperable, and Reusable) science gateway framework, prioritizing ease of use and user needs. A comparison of current platforms, including Anemoi and Earth2Studio, will be presented in order to identify standardized metadata and reproducibility requirements. This work advances these community frameworks towards enabling researchers to efficiently deploy and benchmark AI-driven weather models across diverse environments and datasets. The resulting infrastructure not only supports transparent evaluation and accelerated experimentation in AI-based forecasting but also serves as a scalable template for FAIR-enabled scientific modeling across domains such as subseasonal to seasonal weather forecasting, hydrology, and Earth system science. |
12 | Jenna | Kline | The Ohio State University | FAIR² Drones: An AI-Ready Standard for Cross-Domain Wildlife Drone Datasets | Animal ecology data collection using drones requires a substantial investment of time, expertise, and financial resources. Yet most existing datasets serve only a single research community, limiting interdisciplinary reuse. We propose a unified drone dataset standard, FAIR² Drones, that bridges ecology, robotics, and computer vision by building on existing FAIR and AI-ready data frameworks while adding essential platform metadata and annotation specifications. Our standard enables datasets to simultaneously support ecological analysis, robotics algorithm development, and computer vision model training and benchmarking. We provide open-source validation tools, reference implementations, and multimodal extensions linking drone imagery with complementary sensors, such as camera traps, GPS, and acoustics. By standardizing metadata across disciplines, this framework maximizes the scientific return on investment for costly field deployments and accelerates cross-domain collaboration in environmental monitoring. |
13 | Chhaya | Kulkarni | Towson University | Evaluating AI Readiness of Multi-Source Geoscience Data Workflows | Modern AI-based applications in Earth science often use data workflows that combine information from multiple geoscience sources. For such workflows to support automated analysis and reuse, data must be harmonized, well-documented, uncertainty-aware, and reproducible. While these requirements are often discussed in the context of FAIR data principles and emerging applications such as Digital Twins, the extent to which commonly used geoscience data workflows meet these expectations has not been systematically examined. In this work, we evaluate the AI readiness of existing multi-source geoscience data preprocessing and integration workflows. We introduce a lightweight, data-centric evaluation framework that examines workflow readiness along four dimensions: data harmonization, metadata completeness, uncertainty representation, and reproducibility. The framework is applied to representative open Earth-system workflows that integrate reanalysis-style products with satellite-derived geophysical data. We further discuss how potential limitations or ambiguities in current workflows may affect downstream reuse in automated AI pipelines, including Digital Twin applications and emerging geospatial foundation models, which place strong demands on consistent and well-documented inputs. Our analysis has practical implications for data repositories and resource providers who are seeking to support reproducible, AI-ready geoscience workflows that align with FAIR principles. |
14 | Hilmar | Lapp | Neuromatch | Imageomics: FAIR ML Products for Biological Knowledge Discovery | A broad goal of the Imageomics Institute is to inspire ML innovation while increasing biological knowledge extraction from images. In furtherance of this goal, we create large and diverse datasets, processing and data exploration tools, and models—big and small—to aid in biological discovery. In this poster we outline many of these open-source tools (on which the poster authors have worked to various degrees) to engage with the broader research community. |
15 | Monica | Morrison | NSF National Center for Atmospheric Research | Toward AI-Ready Data: A Fitness-for-Purpose Framework for Responsible Use and Reproducibility | Data in the geosciences is big, often characterized in terms of its volume and velocity: data sets are large, and they are generated and processed at high speeds. It varies in the sources that generate it, the formats it is presented in, and its intended uses, all of which can make it difficult for users to navigate effectively. Moreover, the validity of data (the degree to which quality and reliability can be guaranteed) can be a significant barrier to the responsible reuse and repurposing of data, with a misunderstanding of validity leading to potential misuse. Despite these challenges, big geoscience data is of immense value to the scientific community and the public, which is evident in the proliferation of open-source data platforms and repositories for data producers and consumers. But this open-source data carries certain risks, rooted in the difficulty users can experience in gaining literacy about the fitness-for-purpose of data, which is a function of data's representational, relational, contextual, and error characteristics. Misuse is a serious concern given the increase in machine learning and artificial intelligence in Earth system modeling and prediction, where there is a strong push to provide instruments for actionable applications. A lack of understanding of the fitness-for-purpose of data in these contexts (that is, of what phenomena the data represents and how, and of its uncertainties and potential for error) is a critical hazard, as the use of data that is unfit for training and evaluating models can lead to downstream harms that impact society. 
This talk will present a philosophical view of data that details the attributes important for determining data fitness-for-purpose that are most salient when considering how to manage the risks of unintentional misuse that might result in downstream harm. This view informs the second component of the presentation: the outline of a framework for identifying and tackling the epistemic and ethical issues related to data fitness-for-purpose literacy and data readiness for AI, i.e., guidelines for responsible data use, and initial considerations for integrating responsible data use protocols into AI/ML pipelines in the geosciences. |
16 | Josephine | Namayanja | University of Maryland, Baltimore County | Evaluating Reproducibility of Benchmark Algorithms for Supraglacial Lake Detection | Due to rising summer temperatures, the Greenland Ice Sheet has experienced accelerated surface melting, resulting in the formation of supraglacial lakes, which are water bodies that develop on the ice sheet surface during the melt season. Supraglacial lakes play an integral role in ice sheet stability and future sea level rise, and understanding their formation, persistence, and drainage is critical to predicting the future behavior of the Greenland Ice Sheet, especially in a warmer climate. However, consistently monitoring these lakes remains challenging because they are optically complex. This limits the use of automated processes employing machine learning and image analysis to track the behavior of supraglacial lakes. Humans, on the other hand, can reliably identify these lakes by drawing on contextual features and visual cues in satellite imagery. Their annotations can be used in supervised learning to train models to automatically detect surface lakes on the Greenland Ice Sheet from satellite imagery. This entails identifying lakes as tagged polygons in a single image, enabling scientists to easily track their behavior across repeated summer melt seasons. With increasing research interest in both global and local climatic impacts, particularly sea level rise, there is a demand for suitable benchmark models to evaluate the detection of such complex ice sheet phenomena. It is also essential to ensure the reliability and validity of these benchmark models, going beyond accessibility to effective reproducibility. 
In this study, our objective is to assess the usability and quality of annotated data for training benchmark models to detect supraglacial lakes on the Greenland Ice Sheet, and to engineer reproducibility in these models. To achieve this, we evaluate a set of top-performing algorithms submitted to the GIS Cup 2023 competition, each designed to detect supraglacial lakes from high-resolution satellite imagery. Our findings indicate that the selected models represent diverse methodological families, including convolutional neural networks, transformer-based architectures, and instance segmentation techniques. Moreover, the choice of processing environment (a single computer, high-performance computing, or the AWS cloud) plays a role in the efficiency of reproducibility. Overall, reproducibility, especially in transdisciplinary research, remains integral to providing a solid foundation for building new scientific knowledge. |
17 | Rhoda | Nankabirwa | University of Maryland Baltimore County | Using LLMs with Knowledge Graphs to Harness Study Findings For Future Polar Science Research | Polar science publications contain a large amount of actionable knowledge (datasets, methods, and findings) that could guide future research directions and improve productivity. However, the lack of structured access to this knowledge creates barriers to manually identifying, extracting, and validating relevant past and ongoing research against published articles. Our Research Validation System transforms unstructured open-access literature into a structured, queryable, and evidence-grounded platform for literature review and claim verification. Our system ingests scientific papers, performs section-aware hierarchical chunking, and extracts key entities. Extracted artifacts are stored in a knowledge graph that preserves relationships among papers, authors, concepts, and evidence-bearing text segments, while text chunks are embedded into a vector index to support semantic search. To improve controllability, we implement the workflow as a modular multi-agent pipeline, separating planning, retrieval, knowledge-graph enrichment, and validation into explicit stages. This design enables robust fallbacks (e.g., abstract-only vs full-text retrieval), supports chunk-level provenance tracking, and facilitates targeted evaluation of each module. Given a research question or claim, the system retrieves candidate evidence via vector search, enriches it with graph-based context expansion (e.g., linked entities, citations, and related work), and produces a structured validation response that includes confidence, supporting citations, and identified gaps where evidence is insufficient, thereby providing a qualitative evaluation of research findings against prior literature. Our research validation system can be extended to other domains with open-access literature. |
18 | Michael | Pimenta | Permafrost Data Group / University of Connecticut | Reproducible Generalized Arctic Geospatial ML at Scale | Reliable AI workflows in the geosciences require reproducibility, transparent data handling, and adherence to FAIR principles, yet large-scale remote sensing projects often involve complex preprocessing steps, inconsistent metadata, and opaque model pipelines. Here, I present a fully documented, end-to-end workflow for training, evaluating, and scaling deep learning models to map Tundra Capillary Networks (TCNs) across Alaska using sub-meter Maxar imagery. The pipeline integrates standardized data preprocessing, multi-architecture benchmarking (CNN and ViT models), automated experiment tracking, and strict versioning of training configurations, annotations, and model checkpoints. All segmentation outputs are post-processed using a consistent, open graph-theoretic framework, enabling reproducible network-level metrics across >1 million km² of imagery (e.g., node counts, edge lengths, component sizes). The workflow emphasizes AI Readiness by enforcing structured metadata, portable configuration files, and fully containerized execution suitable for HPC environments. This project demonstrates how FAIR-aligned practices can be embedded into geospatial ML research without sacrificing scale or flexibility, and offers a template for reproducible, transparent, and openly shareable Earth-science AI pipelines. |
19 | Kio | Polson | Drexel University | We've Gotten Too Meta: Developing a conceptual framework to facilitate progress on complex AI and data analysis projects | Artificial Intelligence (AI) presents new opportunities and challenges for data sharing within the context of FAIR practices. The adoption of FAIR has allowed researchers and industry specialists to automate different facets of data management, with humans managing the databases and the warehousing of data. Meanwhile, advances in AI have motivated new efforts to automate data warehousing so that humans can focus on managing artificially intelligent analysis systems. This automation has added complexity to already complex data management systems, requiring additional research into FARR data principles, which add AI readiness and reproducibility to the FAIR principles. To handle this added complexity, data infrastructure staff should have a shared understanding, framework, and language around the complexity of the system and the underlying collections of data. We are developing such a framework as part of an NSF MRI at Drexel University to support this shared understanding. This presentation will introduce the framework and describe how it is being used to assess the problem space we are facing at Drexel. We will also share a few additional example uses of the framework for clarity. The framework consists of a scale of data and metadata complexity, differentiation between prioritizing centralized data vs decentralized data, identification of stakeholders, and conceptualizing what is being “collected” according to each stakeholder. The framework can be further extended and serve as a foundation for discussing complex AI and data analysis systems. |
20 | Alex | Quistberg | Drexel University | AI and Community Engagement for Climate, Safety and Health in Colombia | Climate change poses serious risks to health, yet in Latin America there have been few studies examining these risks, mitigation and adaptation, in part due to gaps in data linking climate and health. Informal settlements are particularly vulnerable to climate change due to high-risk locations and precarious construction. We used a mixed methods approach to assess climate change and health effects in informal settlements in Bogota and Barranquilla, Colombia using focus groups, Citizen Science, epidemiological methods, and artificial intelligence models. We engaged with community leaders and members in informal settlements in both cities to assess their understanding of climate change impacts on their health and communities via Citizen Science and focus groups, relying on established relationships to recruit participants. We also engaged with local stakeholders from government agencies in combined community, stakeholder, and academic meetings. We collected longitudinal (2000-2023), secondary data on neighborhood-level health (e.g., mortality, homicides, road traffic collisions, infectious diseases, hospitalizations), the built environment (e.g., road infrastructure, urban form), the social environment (e.g., household water and sewage connections, education completion), green space, and climate (e.g., air and land temperature, precipitation, air quality). We worked with local government agencies and officials to obtain data that were not publicly available when possible. We collected city definitions and maps of historic and current informal settlements, as well as street-level imagery data and satellite data to train artificial intelligence and computer vision models to detect informal and precarious dwellings. Street-level imagery was collected in collaboration with local community members who accompanied the street-camera vehicle through their neighborhoods while it captured 360-degree images of each street selected by the study team. Data were harmonized, cleaned and linked temporally and geographically at the finest level possible. For data sharing, we aggregated individual-level data to ensure confidentiality and privacy. To adhere to FAIR data principles, we also prepared metadata documentation, using guidelines like Model Cards for Models, Datasheets for Data, and others. These datasets will be used to assess differences in health indicators, built environment, transportation and climate across different administrative units (e.g., neighborhoods) with a particular emphasis on informal vs. formal ones. |
21 | Enrique | Rojas Villalba | MIT | MadVI: A Virtual Assistant for Finding and Using Madrigal Ionospheric Data | Geoscience repositories hold rich data, but many users still struggle with a basic workflow: find the right dataset, understand the metadata, and produce a first analysis that is correct and reproducible. We are developing MadVI (Madrigal Virtual Assistant) as part of an open-science effort at MIT Haystack Observatory to improve “AI readiness” and reproducibility for the Madrigal database, a widely used repository for ionospheric and upper-atmosphere observations. MadVI is a retrieval-augmented assistant that connects user questions to grounded, citable context from repository documentation, metadata, and curated examples, and then helps users generate reproducible analysis and visualization workflows (e.g., notebooks and scripts) that follow clear data provenance. To support community AI reproducibility, we are designing a task-based benchmark for common repository actions (dataset selection, data access, basic plotting, and interpretation of caveats) and evaluating MadVI using simple metrics such as task completion rate, time-to-completion, and output correctness checks. |
22 | Jonathan | Starfeldt | University of Maryland, College Park | Urban Heat MiniCubes: An AI-Ready Dataset for Urban Heat Research and Applications | Heat is amplified in urban areas due to impermeable surfaces and complex human-modified environments. While urban heat is well understood, the extent to which variations occur on hyper-local scales (i.e., street-level) is less understood. With over eighty percent of the United States’ population living in urban areas, it is critically important to advance our understanding and observational capabilities of the environments where most humans live. Recent advances in artificial intelligence present new opportunities to leverage the extensive and disparate amounts of data available to model urban heat. However, a considerable problem hindering research advances in urban heat is the current lack of datasets that are ready to implement in AI frameworks. Such AI applications involve complex reformatting and processing of remote sensing data, including reprojection and spatiotemporal alignment. “AI-ready datasets” focus on making data FAIR (findable, accessible, interoperable, and reusable) to reduce the amount of time spent on data preprocessing and accelerate research progress. We document the creation of a publicly available urban heat AI-ready dataset from remote sensing images, named “Urban Heat MiniCubes.” The dataset provides 90 km x 90 km grids over 48 different cities in the Western Hemisphere for the years 2022 and 2023. The dataset contains two distinct file types. The first contains high-resolution but infrequent observations with (i) Landsat 8 land surface temperatures, surface reflectances, and cloud mask, and (ii) Sentinel-1 SAR backscatter values and incidence angle. The second contains coarse but frequent observations with (i) GOES-16, 17, or 18 longwave infrared brightness temperatures and (ii) a microwave land surface temperature product calculated from diurnal temperature cycles. An autoencoder neural network is used for quality assessment of the quantitative bands of the dataset. A spatial analysis of variables in each city is provided as a general summary of the dataset's contents. Potential use cases and limitations of the dataset are also discussed. |
23 | Bo | Xiao | Michigan Tech University | An Open Dataset in Construction and Infrastructure Management | ACID (Advanced Construction Image Dataset Suite) is a large-scale open dataset platform designed to advance computer vision applications in construction and infrastructure management. The dataset contains diverse, high-resolution images captured from real construction sites, annotated in detail for object detection, instance segmentation, and image captioning. ACID supports research in automated safety monitoring, productivity analysis, equipment detection, and site understanding. By providing standardized, high-quality training data, the platform enables development and benchmarking of deep learning algorithms for challenging construction environments. The dataset, available at www.acidb.net, has been adopted by over 700 research groups worldwide. This poster presents dataset structure, annotation methodology, current applications, and future expansion plans. |
24 | Yan | Xie | University of Oklahoma | Bias correction in data preparation and its implication for AI models | Recent years have seen a surge in data-driven machine learning weather prediction (MLWP) models. However, training datasets for these MLWP models are often treated as ground truth despite documented biases relative to quality-controlled in-situ observations. Ignoring these biases can degrade model performance and trustworthiness. This study aims to (1) measure and mitigate data biases using in-situ observations, and (2) assess the impacts of bias correction on MLWP model performance. Building upon prior work from the NSF AI2ES project on bias identification and categorization, we develop a framework for data bias measurement and correction, and demonstrate it on two cases focused on extreme heat events. Machine learning models will be trained using both raw and bias-corrected datasets utilizing NSF NCAR’s CREDIT platform. This study highlights effective bias correction strategies to enhance AI model trustworthiness in weather prediction. |