ID | Title | Abstract | Event name | Speaker names | Author names | Subjects | Tag | Package | Awards
ID: 146033
Title: Presidential Panel on the Future of AI Research
Abstract: Over the past few years, Artificial Intelligence has bounded into the mainstream of society. Remarkable technical achievements in the use of Deep Learning and Large Language Models have given rise to expectations and hype regarding the possibility of achieving artificial general intelligence, as well as general concerns over the potential deleterious consequences of emerging AI technologies and how to ensure their responsible use. In this panel we engage four Past AAAI Presidents to discuss their views on questions relating to the current state and future of AI research, including such topics as important emerging application areas, current technical challenges, the eventual prospects for achieving artificial general intelligence, and potential AI risks and solutions.
Event name: AAAI 2026 Main Conference
Speaker names: Stephen Smith, Raj Reddy, Eric Horvitz, Manuela Veloso, Bart Selman
Tag: panel
Package: free
ID: 146031
Title: The Essence of Intelligence is Appropriate Action (not thinking, reasoning, learning or language) and other things every student of AI should know
Abstract: An agent acts in its world to achieve its objectives. Intelligence allows the agent to make decisions and act. In natural domains, sensing is limited, so acting is gambling. It’s a myth that passive learning and more data are all we need. An agent cannot learn from observations alone. It needs a real body to carry out experiments in its world, testing hypotheses, to determine causation, refining its model of the world’s dynamics. The agent is acting as a scientist: refining its model through experiments and acting appropriately to achieve its objectives. Its objectives depend on its preferences and values, and those of other agents its actions impact. Determining which values to use, and how preferences can be acquired fairly, is a major non-technical challenge. We address three primary questions: What should an agent believe? What should an agent do, given its beliefs, preferences, and abilities? What should the preferences of an agent be? Integrating these issues motivates the design of our latest AI textbook, Artificial Intelligence: Foundations of Computational Agents (3rd Ed. 2023).

Alan Mackworth is a Professor Emeritus of Computer Science at the University of British Columbia. He works on artificial intelligence with applications in constraint satisfaction, cognitive robotics, assistive technology, hybrid systems and constraint-based agents. He invented the world’s first soccer-playing robots. He has co-authored two books: Computational Intelligence: A Logical Approach (1998) and Artificial Intelligence: Foundations of Computational Agents (2023, 3rd Ed.). Alan co-founded the UBC Cognitive Systems Program, the Centre for AI, Decision-making and Action (CAIDA) and the AI network of BC (AInBC). He has served as President of AAAI, IJCAI and CAIAC. He is a Fellow of AAAI, CAIAC, AGE-WELL, CIFAR and the Royal Society of Canada.

David Poole is a Professor Emeritus of Computer Science at the University of British Columbia. He is known for his work on combining logic and probability, probabilistic inference, relational probabilistic models, statistical relational AI and semantic science. He is a co-author of two AI textbooks (Cambridge University Press, 3rd edition 2023, and Oxford University Press, 1998), and co-author of "Statistical Relational Artificial Intelligence: Logic, Probability, and Computation". He is a former chair of the Association for Uncertainty in Artificial Intelligence, the winner of the Canadian AI Association (CAIAC) 2013 Lifetime Achievement Award, and is a Fellow of the Association for the Advancement of Artificial Intelligence (AAAI) and CAIAC.
Event name: AAAI 2026 Main Conference
Speaker names: David Poole
Subjects: AAAI 2026 Invited Talks
Tag: invited talk
Package: free
Awards: Patrick Henry Winston Outstanding Educator Award
ID: 146029
Title: Small Data: A New Paradigm for the Next Generation of AI
Event name: AAAI 2026 Main Conference
Speaker names: Derek Haoyang Li
Subjects: AAAI 2026 Invited Talks
Tag: invited talk
Package: free
ID: 146028
Title: AI for Reskilling, Upskilling, and Workforce Development
Abstract: As AI becomes increasingly powerful and ubiquitous, it is disrupting skills and displacing workers. NSF’s National AI Institute for Adult Learning and Online Education (AI-ALOE) posits that AI can be part of the solution to the growing problem if we can use AI for reskilling, upskilling, and workforce development at scale. The long-term vision of AI-ALOE is to develop and use AI technologies to enhance the proficiency of online education for all adult learners, using in-person education as a benchmark. The day-to-day mission of AI-ALOE is to conduct responsible research into AI that is grounded in theories of human cognition and learning and derived from the scientific process of learning engineering. I will describe ongoing research at AI-ALOE.
Event name: AAAI 2026 Main Conference
Speaker names: Ashok Goel
Subjects: AAAI 2026 Invited Talks
Tag: invited talk
Package: free
Awards: Robert S. Engelmore Memorial Lecture Award
ID: 146026
Title: Professor Edward Feigenbaum: a Tribute to and Lecture by a Pioneer of AI on his 90th Birthday
Event name: AAAI 2026 Main Conference
Speaker names: Raj Reddy, Eric Horvitz, Bart Selman, Edward Feigenbaum, Yolanda Gil, Peter Friedland
Subjects: AAAI 2026 Invited Talks
Tag: invited talk
Package: free
ID: 146024
Title: Navigating the AI Horizon: Promises, Perils, and the Power of Collaboration
Abstract: We stand at the dawn of the AI era, a technological revolution poised to be the most consequential of our generation, presenting unprecedented opportunities. But this promise is shadowed by significant challenges. To build the future we want, we must move beyond the hype and the headlines to confront the most pressing open problems—technical, sociotechnical, and multidisciplinary. This talk will review the rapid progress, dissect the challenges ahead, and argue that our greatest task isn’t simply building smarter machines, but fostering the human wisdom to guide them towards a future that is not only intelligent but also equitable, safe, and profoundly human.
Event name: AAAI 2026 Main Conference
Speaker names: Ece Kamar
Subjects: AAAI 2026 Invited Talks
Tag: invited talk
Package: free
ID: 146023
Title: AI and Program Reviewing Panel
Event name: AAAI 2026 Main Conference
Speaker names: Odest Chadwicke Jenkins, Kevin Leyton-Brown, Joydeep Biswas, Matthew E. Taylor
Tag: panel
Package: free
ID: 146019
Title: Towards Embodied Agents that See, Simulate, and Reason
Abstract: Large language models have revolutionized textual reasoning, yet their ability to act meaningfully in multimodal, real-world environments remains limited. They struggle to ground their decisions in visual context, adapt to changing goals, and plan actions over time—shortcomings that stem from a lack of structured, goal-driven reasoning and insufficient representations of the physical world.

In this talk, I present a unified framework for building embodied agents that can see, simulate, and reason. I begin by introducing methods for learning world simulators from data, arguing that visual reasoning—like textual reasoning—benefits from step-by-step processing. Inverting a physics simulator becomes key: agents must infer structured 3D neural representations of objects, parts, motions, and scenes directly from raw video. I describe methods for extracting such representations using generative priors, injecting them into vision-language models (VLMs), and scaling up their supervision via generative 3D simulation and fast, modular physics engines. These simulators enable agents to anchor their predictions in grounded physical reality, reducing hallucinations and improving control.

Complementing this simulation capability, I explore techniques that enable agents to reason over time and adapt their behavior. By integrating structured memory systems, agents learn to retain and retrieve relevant experiences to inform long-horizon plans. Language-based reflective feedback allows them to refine their strategies beyond what sparse rewards offer, forming abstractions that generalize across tasks. When trained to ground their reasoning directly in the visual environment, agents gain the ability to set subgoals, explore effectively, and verify their own hypotheses.

Together, these advances point toward autonomous systems that simulate before they act, reflect after they fail, and maintain an ongoing awareness of goals, constraints, and context. I will illustrate these capabilities across web automation, robotics, and interactive assistance, showing how agents that see, simulate, and reason offer a promising path toward general-purpose embodied intelligence.
Event name: AAAI 2026 Main Conference
Speaker names: Katerina Fragkiadaki
Subjects: AAAI 2026 Invited Talks
Tag: invited talk
Package: free
ID: 146018
Title: From Workflows to Water Coolers: AI That Can Navigate Human Nature
Event name: AAAI 2026 Main Conference
Speaker names: Yolanda Gil
Subjects: AAAI 2026 Invited Talks
Tag: invited talk
Package: free
ID: 146016
Title: Fundamental physics and science communication
Abstract: Physicists aim to explain the Universe in terms of a compact, interpretable set of principles. Deducing those principles from experiments poses many challenging problems which are ripe for the application of AI and present opportunities to develop new AI techniques. I will describe how AI has changed the way particle physicists work and speculate about the role of AI in the future of fundamental physics. Finally, I will describe my experience in science communication, as an author, podcaster and television producer.
Event name: AAAI 2026 Main Conference
Speaker names: Daniel Whiteson
Subjects: AAAI 2026 Invited Talks
Tag: invited talk
Package: free
ID: 146014
Title: Quest of AI towards Specializable Generalist: From Reasoning to Scientific Discovery
Abstract: The pursuit of high-efficiency Artificial General Intelligence (AGI) requires more than brute-force scaling of model size and data. While scaling remains a key driver of capability, equally important are scalable architectures and principles—designs that continue to work, improve, and remain controllable as we vary model scale, domains, and modalities. Central to our approach is the concept of the “Specialized Generalist” – a pathway that achieves deep expertise across multiple domains without sacrificing broad generalization capabilities. In this talk, we introduce the “Specialized Generalist” paradigm and our implementation of it, SAGE (Synergistic Architecture for Generalized Expertise), a three-layer architecture designed to balance specialization and generalization in a systematic way. We will describe how SAGE’s Base Model, Synergy Fusion, and Exploration-Evolution layers interact in practice, focusing on concrete mechanisms for coordinating domain-specific expertise with broad general reasoning. We will share empirical results and recent advances in large reasoning models, embodied AI, and scientific applications to further illustrate the approach. A central motivation is to support “AGI for Science” by building a stable plateau of capabilities that can reliably assist with complex scientific workflows rather than isolated demos. Finally, we will outline the safety and governance questions that arise when deploying Specialized Generalist systems in high-impact settings, and discuss what we have learned so far about monitoring, alignment, and operational safeguards.
Event name: AAAI 2026 Main Conference
Speaker names: Bowen Zhou
Subjects: AAAI 2026 Invited Talks
Tag: invited talk
Package: free
ID: 146013
Title: From How to learn to What to learn in Multiagent Systems and Robotics
Abstract: There has been a lot of exciting recent progress on new and powerful machine learning algorithms and architectures: how to learn. But for autonomous agents acting in the dynamic, uncertain world, it is at least as important to be able to identify which concepts and subproblems to focus on: what to learn. This talk presents methods for identifying what to learn within the framework of reinforcement learning, focusing especially on applications in multiagent systems and robotics.
Event name: AAAI 2026 Main Conference
Speaker names: Peter Stone
Subjects: AAAI 2026 Invited Talks
Tag: invited talk
Package: free
ID: 146002
Title: Exact Combinatorial Multi-Class Graph Cuts for Semi-Supervised Learning
Abstract: Semi-supervised learning (SSL) on graphs is critical in applications where labeled data are scarce and costly, yet existing graph-based methods often degrade under extreme label sparsity or class imbalance, yielding trivial or unstable solutions. We introduce CombCut, the first exact combinatorial optimization framework for multi-class graph-based semi-supervised learning that operates directly on binary one-hot assignments, without any convex relaxation or heuristic volume constraints. By employing a minorization–maximization (MM) scheme, CombCut transforms each step into a structured linear assignment problem solved efficiently via network-flow algorithms. Total unimodularity guarantees integral iterates, and our theoretical analysis establishes both monotonic ascent of the true discrete objective and convergence of every limit point to a Karush–Kuhn–Tucker (KKT) stationary solution of the original combinatorial problem. Our approach requires no hyperparameter tuning and scales near-linearly in the number of vertices. Empirical evaluation on MNIST, Fashion-MNIST, and CIFAR-10 with as few as 1–5 labels per class significantly outperforming state-of-the-art graph-SSL baselines and yielding more stable and accurate label propagation under severe supervision constraints. CombCut excels in worst-case labeling scenarios.
Event name: AAAI 2026 Main Conference
Subjects: Artificial Intelligence
Tag: poster
Package: free
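The MM scheme in the abstract above minorizes the discrete objective and, per the authors, solves each surrogate exactly as a flow-based assignment. As a loose toy sketch only (our own illustration, not CombCut's algorithm: we replace the flow-based assignment step with a per-vertex maximization of the linearized objective, and every name and detail below is invented):

```python
# Toy minorize-maximize (MM) loop for multi-class graph label assignment.
# Each MM step maximizes a linearized surrogate of the smoothness objective
# row by row (a weighted majority vote over neighbours), so every iterate
# stays a hard one-hot assignment, mimicking the "integral iterates" idea.

def mm_label_assignment(W, labels, n_classes, iters=20):
    """W: symmetric weight matrix (zero diagonal) as list of lists;
    labels: dict node -> class for the few labelled nodes.
    Returns a class index for every node."""
    n = len(W)
    y = [labels.get(i, 0) for i in range(n)]       # init unlabelled to class 0
    for _ in range(iters):
        changed = False
        for i in range(n):
            if i in labels:                        # supervision stays fixed
                continue
            # linearized objective: weighted votes per class from neighbours
            scores = [0.0] * n_classes
            for j in range(n):
                scores[y[j]] += W[i][j]
            best = max(range(n_classes), key=lambda c: scores[c])
            if best != y[i]:
                y[i] = best
                changed = True
        if not changed:                            # fixed point reached
            break
    return y
```

On a toy graph of two weakly connected triangles with one or two seed labels per triangle, the loop propagates labels cluster by cluster while keeping the assignment integral at every step.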
ID: 145957
Title: Engaging with Bias in Computer Vision: A Group Assignment for Remote Learning
Event name: AAAI 2026 Main Conference
Subjects: Artificial Intelligence
Tag: technical paper
Package: free
ID: 145955
Title: Discover Combinatorial Structures using Deep Cross-Entropy Method
Event name: AAAI 2026 Main Conference
Subjects: Artificial Intelligence
Tag: technical paper
Package: free
ID: 145953
Title: Dimensionality Reduction Adventures with Animal Faces
Event name: AAAI 2026 Main Conference
Subjects: Artificial Intelligence
Tag: technical paper
Package: free
ID: 145584
Title: Solving Connections: Thinking Like Wyna
Event name: AAAI 2026 Main Conference
Subjects: Artificial Intelligence
Tag: technical paper
Package: free
ID: 145582
Title: ArguBot Arena: Prompt Engineering a Debate on Responsible AI
Event name: AAAI 2026 Main Conference
Subjects: Artificial Intelligence
Tag: technical paper
Package: free
ID: 143560
Title: Towards Generalist Robot Learning from Internet Video: A Survey
Abstract: Scaling deep learning to massive and diverse internet data has driven remarkable breakthroughs in domains such as video generation and natural language processing. Robot learning, however, has thus far failed to replicate this success and remains constrained by a scarcity of available data. Learning from Videos (LfV) methods aim to address this data bottleneck by augmenting traditional robot data with large-scale internet video. This video data provides foundational information regarding physical dynamics, behaviours, and tasks, and can be highly informative for general-purpose robots.

This survey systematically examines the emerging field of LfV. We first outline essential concepts, including detailing fundamental LfV challenges such as distribution shift and missing action labels in video data. Next, we comprehensively review current methods for extracting knowledge from large-scale internet video, overcoming LfV challenges, and improving robot learning through video-informed training. The survey concludes with a critical discussion of future opportunities. Here, we emphasize the need for scalable foundation model approaches that can leverage the full range of available internet video and enhance the learning of robot policies and dynamics models. Overall, the survey aims to inform and catalyse future LfV research, driving progress towards general-purpose robots.
Event name: AAAI 2026 Main Conference
Author names: Robert McCarthy
Subjects: Artificial Intelligence
Tag: technical paper
Package: free
ID: 143559
Title: Towards an Ontology-Driven Approach to Document Bias
Abstract: Machine learning (ML)-powered systems are capable of reproducing and often amplifying undesired biases embedded in society, emphasizing the importance of operating under practices that enable the study and understanding of the intrinsic characteristics of ML pipelines. This supports the emergence of documentation frameworks with the idea that “any remedy for bias starts with awareness of its existence.” However, a resource that can formally describe ML pipelines in terms of detected biases is still missing. To address this gap, we present the Doc-BiasO ontology, a resource that sets out to create an integrated vocabulary of biases defined in the Trustworthy AI literature and their measures, as well as to incorporate relevant domain terminology and relationships between them. Following ontology engineering best practices, we reuse existing vocabularies on machine learning and AI to foster knowledge sharing and interoperability between the actors concerned with its research, development, regulation, and others. In addition, we demonstrate the potential of Doc-BiasO with an experiment on an existing benchmark and as part of a neuro-symbolic system. Overall, our main objective is to contribute towards clarifying existing terminology on bias research as it rapidly expands to all areas of AI and to improve the interpretation of bias in data and downstream impact through its documentation.
Event name: AAAI 2026 Main Conference
Author names: Mayra Russo
Subjects: Artificial Intelligence
Tag: technical paper
Package: free
ID: 143558
Title: The Search for Stability: Learning Dynamics of Strategic Publishers with Initial Documents
Abstract: We study a game-theoretic information retrieval model in which strategic publishers aim to maximize their chances of being ranked first by the search engine while maintaining the integrity of their original documents. We show that the commonly used Probability Ranking Principle (PRP) ranking scheme results in an unstable environment where games often fail to reach pure Nash equilibrium. We propose two families of ranking functions that do not adhere to the PRP. We provide both theoretical and empirical evidence that these methods lead to a stable search ecosystem, by providing positive results on the learning dynamics convergence. We also define the publishers’ and users’ welfare, demonstrate a possible publisher-user trade-off, and provide means for a search system designer to control it. Finally, we show how instability harms long-term users’ welfare.
Event name: AAAI 2026 Main Conference
Author names: Moshe Tennenholtz, Omer Madmon, Itamar Reinman, Idan Pipano
Subjects: Artificial Intelligence
Tag: technical paper
Package: free
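The instability claim above is the classic failure mode of best-response dynamics in games without a pure Nash equilibrium. As a generic stand-in (our own toy, not the paper's ranking game: matching-pennies payoffs substitute for the PRP-induced publisher game, and all names are invented), the cycling behaviour can be demonstrated in a few lines:

```python
# Best-response dynamics on a two-player, two-action game. When the game has
# no pure Nash equilibrium (matching pennies), the dynamics cycle forever
# instead of converging - the kind of instability attributed to PRP ranking.

def best_response_dynamics(payoffs, start=(0, 0), steps=12):
    """payoffs[p][(a0, a1)] = payoff to player p for joint action (a0, a1).
    Players alternate best-responding; returns the visited joint actions."""
    state = start
    seq = [state]
    for t in range(steps):
        p = t % 2                                 # player to move this round
        a_other = state[1 - p]
        best = max((0, 1), key=lambda a: payoffs[p][(a, a_other)] if p == 0
                   else payoffs[p][(a_other, a)])
        state = (best, state[1]) if p == 0 else (state[0], best)
        seq.append(state)
    return seq
```

Running this on matching pennies visits all four joint actions in a cycle of length four and never stabilizes, illustrating why a ranking scheme that induces such a game fails to reach pure Nash equilibrium.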
ID: 143557
Title: Symbolic Task Inference in Deep Reinforcement Learning
Abstract: This paper proposes DeepSynth, a method for effective training of deep reinforcement learning agents when the reward is sparse or non-Markovian, but at the same time progress towards the reward requires achieving an unknown sequence of high-level objectives. Our method employs a novel algorithm for synthesis of compact finite state automata to uncover this sequential structure automatically. We synthesise a human-interpretable automaton from trace data collected by exploring the environment. The state space of the environment is then enriched with the synthesised automaton, so that the generation of a control policy by deep reinforcement learning is guided by the discovered structure encoded in the automaton. The proposed approach is able to cope with both high-dimensional, low-level features and unknown sparse or non-Markovian rewards. We have evaluated DeepSynth’s performance in a set of experiments that includes the Atari game Montezuma’s Revenge, known to be challenging. Compared to approaches that rely solely on deep reinforcement learning, we obtain a reduction of two orders of magnitude in the iterations required for policy synthesis, and a significant improvement in scalability.
Event name: AAAI 2026 Main Conference
Author names: Hosein Hasanbeig
Subjects: Artificial Intelligence
Tag: technical paper
Package: free
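The step of "enriching the state space with the synthesised automaton" can be pictured as a product construction. The sketch below is our own minimal toy (not DeepSynth's implementation; the two-objective "key then door" task and all names are invented): a hand-written automaton tracks progress through the objective sequence, making the augmented reward Markovian.

```python
# Product of an environment state with a finite automaton that tracks
# progress through a sequence of high-level objectives ("key", then "door").
# Reward fires only when the automaton reaches its accepting state, so the
# augmented state (env_state, q) carries the memory a sparse reward lacks.

DFA = {("q0", "key"): "q1", ("q1", "door"): "q2"}   # transition table
ACCEPT = "q2"

def step_automaton(q, event):
    """Advance the automaton; unmatched events leave the state unchanged."""
    return DFA.get((q, event), q)

def product_step(env_state, q, event):
    """Return the augmented state and a shaped reward (+1 on acceptance)."""
    q_next = step_automaton(q, event)
    reward = 1.0 if (q_next == ACCEPT and q != ACCEPT) else 0.0
    return (env_state, q_next), reward
```

A policy trained over the augmented state (env_state, q) can distinguish "door before key" from "door after key", which the raw environment state cannot.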
ID: 143555
Title: Scalable Synthesis of Formally Verified Neural Value Function for Hamilton-Jacobi Reachability Analysis
Abstract: Hamilton-Jacobi (HJ) reachability analysis provides a formal method for guaranteeing safety in constrained control problems. It synthesizes a value function to represent a long-term safe set called the feasible region. Early synthesis methods based on state space discretization cannot scale to high-dimensional problems, while recent methods that use neural networks to approximate value functions result in unverifiable feasible regions. To achieve both scalability and verifiability, we propose a framework for synthesizing verified neural value functions for HJ reachability analysis. Our framework consists of three stages: pre-training, adversarial training, and verification-guided training. We design three techniques to address three challenges to improve scalability respectively: boundary-guided backtracking (BGB) to improve counterexample search efficiency, entering state regularization (ESR) to enlarge the feasible region, and activation pattern alignment (APA) to accelerate neural network verification. We also provide a neural safety certificate synthesis and verification benchmark called Cersyve-9, which includes nine commonly used safe control tasks and supplements existing neural network verification benchmarks. Our framework successfully synthesizes verified neural value functions on all tasks, and our proposed three techniques exhibit superior scalability and efficiency compared with existing methods.
Event name: AAAI 2026 Main Conference
Author names: Yujie Yang
Subjects: Artificial Intelligence
Tag: technical paper
Package: free
ID: 143554
Title: On Generating Monolithic and Model Reconciling Explanations in Probabilistic Scenarios
Abstract: Explanation generation frameworks aim to make AI systems’ decisions transparent and understandable to human users. However, generating explanations in uncertain environments characterized by incomplete information and probabilistic models remains a significant challenge. In this paper, we propose a novel framework for generating probabilistic monolithic explanations and model reconciling explanations. Monolithic explanations provide self-contained reasons for an explanandum without considering the agent receiving the explanation, while model reconciling explanations account for the knowledge of the agent receiving the explanation. For monolithic explanations, our approach integrates uncertainty by utilizing probabilistic logic to increase the probability of the explanandum. For model reconciling explanations, we propose a framework that extends the logic-based variant of the model reconciliation problem to account for probabilistic human models, where the goal is to find explanations that increase the probability of the explanandum while minimizing conflicts between the explanation and the probabilistic human model. We introduce explanatory gain and explanatory power as quantitative metrics to assess the quality of these explanations. Further, we present algorithms that exploit the duality between minimal correction sets and minimal unsatisfiable sets to efficiently compute both types of explanations in probabilistic contexts. Extensive experimental evaluations on various benchmarks demonstrate the effectiveness and scalability of our approach in generating explanations under uncertainty.
Event name: AAAI 2026 Main Conference
Author names: Stylianos Vasileiou
Subjects: Artificial Intelligence
Tag: technical paper
Package: free
ID: 143553
Title: Interpreting Capsule Networks for Image Classification by Routing Path Visualization
Abstract: Artificial neural networks are popular for computer vision as they often give state-of-the-art performance, but are difficult to interpret because of their complexity. This black box modeling is especially troubling when the application concerns human well-being such as in medical image analysis or autonomous driving. In this work, we propose a technique called routing path visualization for capsule networks, which reveals how much of each region in an image is routed to each capsule. In turn, this technique can be used to interpret the entity that a given capsule detects, and speculate how the network makes a prediction. We demonstrate our new visualization technique on several real world datasets. Experimental results suggest that routing path visualization can precisely localize the predicted class from an image, even though the capsule networks are trained using just images and their respective class labels, without additional information defining the location of the class in the image.
Event name: AAAI 2026 Main Conference
Author names: Amanjot Bhullar
Subjects: Artificial Intelligence
Tag: technical paper
Package: free
ID: 143551
Title: Choosing Abstraction Levels for Model-Based Software Debugging: A Theoretical and Empirical Analysis for Spreadsheet Programs
Abstract: Model-based diagnosis is a generally applicable, principled approach to the systematic debugging of a wide range of system types such as circuits, knowledge bases, physical devices, or software. Based on a formal description of the system, it enables precise and deterministic reasoning about potential faults responsible for observed misbehavior. In software, such a formal system description can often even be extracted from the buggy program fully automatically. As logical reasoning is central to diagnosis, the performance of model-based debuggers is largely influenced by reasoning efficiency, which in turn depends on the complexity and expressivity of the system description. Since highly detailed models capturing exact semantics often exceed the capabilities of current reasoning tools, researchers have proposed more abstract representations.

In this work, we thoroughly analyze system modeling techniques with a focus on fault localization in spreadsheets—one of the most widely used end-user programming paradigms. Specifically, we present three constraint model types characterizing spreadsheets at different abstraction levels, show how to extract them automatically from faulty spreadsheets, and provide theoretical and empirical investigations of the impact of abstraction on both diagnostic output and computational performance. Our main conclusions are that (i) for the model types, there is a trade-off between the conciseness of generated fault candidates and computation time, (ii) the exact model is often impractical, and (iii) a new model based on qualitative reasoning yields the same solutions as the exact one in up to more than half the cases while being orders of magnitude faster.

Due to their ability to restrict the solution space in a sound way, the explored model-based techniques, rather than being used as standalone approaches, are expected to realize their full potential in combination with iterative sequential diagnosis or indeterministic but more performant statistical debugging methods.
Event name: AAAI 2026 Main Conference
Author names: Patrick Rodler
Subjects: Artificial Intelligence
Tag: technical paper
Package: free
ID: 143549
Title: A Simple Proof-Theoretic Characterization of Stable Models: Reduction to Difference Logic and Experiments
Abstract: Stable models of logic programs have been studied and characterized in relation with other formalisms by many researchers. As already argued in previous papers, such characterizations are interesting for diverse reasons, including theoretical investigations and the possibility of leading to new algorithms for computing stable models of logic programs. At the theoretical level, complexity and expressiveness comparisons have brought about fundamental insights. Beyond that, practical implementations of the developed reductions enable the use of existing solvers for other logical formalisms to compute stable models. In this paper, we first provide a simple characterization of stable models that can be viewed as a proof-theoretic counterpart of the standard model-theoretic definition. We further show how it can be naturally encoded in difference logic. Such an encoding, compared to the existing reductions to classical logics, does not require Boolean variables. Then, we implement our novel translation to a Satisfiability Modulo Theories (SMT) formula. We finally compare our approach, employing the SMT solver yices, to the translation-based ASP solver lp2diff and to clingo on domains from the “Basic Decision” track of the 2017 Answer Set Programming competition. The results show that our approach is competitive with and often better than lp2diff, and that it can also be faster than clingo on non-tight domains.
Event name: AAAI 2026 Main Conference
Author names: Marco Maratea
Subjects: Artificial Intelligence
Tag: technical paper
Package: free
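For readers unfamiliar with the "standard model-theoretic definition" that the paper's characterization is a counterpart of: a stable model is a set of atoms that equals the least model of its Gelfond–Lifschitz reduct. A minimal background sketch (ours, for ground normal programs only; rule representation is our own, not the paper's encoding):

```python
# Stable models via the Gelfond-Lifschitz reduct. A ground rule is a triple
# (head, positive_body, negative_body) of atom names.

def reduct(program, candidate):
    """Delete rules whose negative body intersects the candidate set, then
    drop the negative literals from the remaining rules."""
    return [(h, pos) for (h, pos, neg) in program
            if not (set(neg) & candidate)]

def minimal_model(positive_program):
    """Least model of a negation-free program by fixpoint iteration."""
    model = set()
    changed = True
    while changed:
        changed = False
        for h, pos in positive_program:
            if set(pos) <= model and h not in model:
                model.add(h)
                changed = True
    return model

def is_stable(program, candidate):
    """candidate is a stable model iff it is exactly the least model
    of its own reduct."""
    return minimal_model(reduct(program, candidate)) == set(candidate)
```

For the classic program {p :- not q. q :- not p.} this check confirms that {p} and {q} are stable models while {} and {p, q} are not.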
ID: 143547
Title: A Fortiori Case-Based Reasoning: From Theory to Data
Abstract: The widespread application of uninterpretable machine learning systems for sensitive purposes has spurred research into elucidating the decision-making process of these systems. These efforts have their background in many different disciplines, one of which is the field of AI & law. In particular, recent works have observed that machine learning training data can be interpreted as legal cases. Under this interpretation, the formalism developed to study case law, called the theory of precedential constraint, can be used to analyze the way in which machine learning systems draw on training data—or should draw on them—to make decisions. In the present work, we advance the theory underlying these explanation methods, by relating it to order theory and logic. This allows us to write a software implementation of the theory that can be used to compute with the definitions and give automatic proofs of the properties of the model. We use this implementation to evaluate the model on a series of datasets. Through this analysis, we characterize the types of datasets that are more, or less, suitable to be described by the theory.
Event name: AAAI 2026 Main Conference
Author names: Wijnand van Woerkom
Subjects: Artificial Intelligence
Tag: technical paper
Package: free
ID: 143546
Title: The Cost and Complexity of Minimizing Envy in House Allocations
Abstract: We study almost envy-freeness in house allocation, where m houses are to be allocated among n agents so that every agent receives exactly one house. An envy-free allocation need not exist, and therefore we may have to settle for relaxations. We study different aggregate measures of envy as markers of fairness. In particular, we define the amount of envy experienced by an agent a w.r.t. an allocation to be the number of agents that agent a envies under that allocation. [KMS2021] studied the problem of minimizing the number of envious agents and showed that it is NP-complete to find allocations that minimize the number of envious agents, even for binary utilities, and this quantity is hard to approximate for general utilities. In this paper, we explore envy minimization in house allocation from a broader perspective and prove algorithmic results not only for minimizing the number of envious agents but for two other measures of envy as well—minimizing the amount of maximum envy experienced by any agent and minimizing the amount of total envy experienced by all the agents put together. We prove a host of algorithmic and hardness results. We also suggest practical approaches for these problems via integer linear program (ILP) formulations and report the findings of our experimental evaluation of ILPs. Finally, we study the price of fairness, which quantifies the loss of welfare we must suffer due to the fairness requirements, and present tight bounds as well as algorithms that simultaneously optimize both welfare and fairness.
Event name: AAAI 2026 Main Conference
Author names: Jayakrishnan Madathil
Subjects: Artificial Intelligence
Tag: technical paper
Package: free
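The total-envy objective described above is easy to make concrete. The sketch below is our own brute-force toy for tiny instances (not the paper's ILP or algorithms; utilities and helper names are invented), enumerating all ways to give each of n agents one of m houses:

```python
# Brute-force minimizer of total envy in house allocation. Agent a's envy is
# the number of agents b whose house a strictly prefers (by a's own utility)
# to a's own house; total envy sums this over all agents.

from itertools import permutations

def total_envy(assign, utils):
    """assign[a] = house given to agent a; utils[a][h] = a's utility for h."""
    n = len(assign)
    return sum(
        sum(1 for b in range(n) if utils[a][assign[b]] > utils[a][assign[a]])
        for a in range(n)
    )

def min_total_envy(utils, m):
    """Enumerate all injective allocations of n agents to m houses and
    return (minimum total envy, one optimal allocation)."""
    n = len(utils)
    best = None
    for houses in permutations(range(m), n):
        e = total_envy(list(houses), utils)
        if best is None or e < best[0]:
            best = (e, list(houses))
    return best
```

With opposed preferences an envy-free allocation exists (total envy 0); when both agents rank the houses identically, some envy is unavoidable, matching the observation that envy-free allocations need not exist.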
ID: 143545
Title: TactGen: Tactile Sensory Data Generation via Zero-Shot Sim-to-Real Transfer
Abstract: Recent advances in machine learning have driven a step-change in robot perception with modalities such as vision, where large amounts of training data are readily available or cheap to collect. However, in tactile perception, the relatively high cost of data collection still largely impedes the adoption of such data-driven learning solutions. In this article, we introduce TactGen, a novel, cross-modal framework to tackle this challenge. In particular, using a two-step data generation pipeline, we leverage easily accessible vision data to synthesise artificial tactile data for downstream classifier training. Specifically, we use readily collected video data of objects of interest to efficiently learn neural radiance field (NeRF) representations. The NeRF models are then used to render red–green–blue-depth (RGBD) images from any desired vantage points. In the second stage, the RGBD images are translated into corresponding tactile images typically generated by camera-based tactile sensors using a conditional generative adversarial network (cGAN). The cGAN model is itself trained with a large set of visuo-tactile images collected in simulation, and can be transferred into the real world without fine-tuning or additional data collection. We extensively validate this approach in the context of tactile object classification, showing that it effectively reduces data collection time by a factor of 20 while achieving similar performance to training a classifier on manually collected real data.
Event name: AAAI 2026 Main Conference
Author names: Shaohong Zhong
Subjects: Artificial Intelligence
Tag: technical paper
Package: free
32
143544MAGIC-VFM - Meta-Learning Adaptation for Ground Interaction
Control With Visual Foundation Models
Control of off-road vehicles is challenging due to the
complex dynamic interactions with the terrain. Accurate
modeling of these interactions is important to optimize
driving performance, but the relevant physical phenomena,
such as slip, are too complex to model from first
principles. Therefore, we present an offline meta-learning
algorithm to construct a rapidly-tunable model of residual
dynamics and disturbances. Our model processes terrain
images into features using a visual foundation model (VFM),
then maps these features and the vehicle state to an
estimate of the current actuation matrix using a deep
neural network (DNN). We then combine this model with
composite adaptive control to modify the last layer of the
DNN in real time, accounting for the remaining terrain
interactions not captured during offline training. We
provide mathematical guarantees of stability and robustness
for our controller, and demonstrate the effectiveness of
our method through simulations and hardware experiments
with a tracked vehicle and a car-like robot. We evaluate
our method outdoors on different slopes with varying
slippage and actuator degradation disturbances, and compare
against an adaptive controller that does not use the VFM
terrain features. We show significant improvement over the
baseline in both hardware experimentation and simulation.
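The idea of adapting only the DNN's last layer online can be sketched in a toy scalar setting (a minimal sketch under strong simplifying assumptions: fixed features, a hand-picked gain, no stability machinery; not the paper's composite adaptive law, and the function name is hypothetical):

```python
def adapt_last_layer(w, phi, y, gamma=0.1):
    """One real-time update of the last-layer weights w given a
    frozen-feature vector phi and a measured residual target y.
    Scalar-output toy: move w along phi to shrink the error."""
    pred = sum(wi * pi for wi, pi in zip(w, phi))
    e = y - pred                       # residual the adaptation acts on
    return [wi + gamma * e * pi for wi, pi in zip(w, phi)]

# Adapt online to a constant unmodeled terrain disturbance of 2.0,
# with fixed illustrative features phi = [1.0, 0.5].
w = [0.0, 0.0]
for _ in range(200):
    w = adapt_last_layer(w, [1.0, 0.5], 2.0)
residual = 2.0 - sum(wi * pi for wi, pi in zip(w, [1.0, 0.5]))
```

The geometric decay of the residual here is what the paper's Lyapunov analysis establishes rigorously for the full controller.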
AAAI 2026 Main ConferenceElena-Sorina LupuArtificial Intelligencetechnical paperfree
33
143539Towards automated self-supervised learning for truly
unsupervised graph anomaly detection
Self-supervised learning (SSL) is an emerging paradigm that
exploits supervisory signals generated from the data
itself, and many recent studies have leveraged SSL to
conduct graph anomaly detection. However, we empirically
found that three important factors can substantially impact
detection performance across datasets: (1) the specific SSL
strategy employed; (2) the tuning of the strategy’s
hyperparameters; and (3) the allocation of combination
weights when using multiple strategies. Most SSL-based
graph anomaly detection methods circumvent these issues by
arbitrarily or selectively (i.e., guided by label
information) choosing SSL strategies, hyperparameter
settings, and combination weights. While an arbitrary
choice may lead to subpar performance, using label
information in an unsupervised setting constitutes label
leakage and leads to severe overestimation of a method’s
performance. Leakage has been criticized as “one of the top
ten data mining mistakes”, yet many recent studies on
SSL-based graph anomaly detection have used label
information to select hyperparameters. To mitigate this
issue, we propose to use an internal evaluation strategy
(with theoretical analysis) to select hyperparameters in
SSL for unsupervised anomaly detection. We perform
extensive experiments using 10 recent SSL-based graph
anomaly detection algorithms on various benchmark datasets,
demonstrating both the prior issues with hyperparameter
selection and the effectiveness of our proposed strategy.
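A label-free selection loop of the kind advocated above might look as follows (the internal criterion here is an illustrative separation score, not the paper's strategy; all names are hypothetical):

```python
def internal_score(anomaly_scores, k=5):
    """A label-free internal criterion (illustrative): how well the
    k highest anomaly scores separate from the rest, normalised by
    the spread of the rest. No label information is touched."""
    s = sorted(anomaly_scores, reverse=True)
    top, rest = s[:k], s[k:]
    mu_r = sum(rest) / len(rest)
    spread = (sum((x - mu_r) ** 2 for x in rest) / len(rest)) ** 0.5
    return (sum(top) / k - mu_r) / (spread + 1e-12)

def select_hyperparams(candidates, score_fn):
    """Pick the hyperparameter setting whose detector output
    maximises the internal score."""
    return max(candidates, key=lambda c: internal_score(score_fn(c)))
```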
AAAI 2026 Main ConferenceZhong LiArtificial Intelligencetechnical paperfree
34
143538IEEE Transactions on Robotics (T-RO) Publication: "SICNav:
Safe and Interactive Crowd Navigation Using Model
Predictive Control and Bilevel Optimization"
Abstract of T-RO publication: "Robots need to predict and
react to human motions to navigate through a crowd without
collisions. Many existing methods decouple prediction from
planning, which does not account for the interaction
between robot and human motions and can lead to the robot
getting stuck. We propose SICNav, a Model Predictive
Control (MPC) method that jointly solves for robot motion
and predicted crowd motion in closed-loop. We model each
human in the crowd to be following an Optimal Reciprocal
Collision Avoidance (ORCA) scheme and embed that model as a
constraint in the robot's local planner, resulting in a
bilevel nonlinear MPC optimization problem. We use a
KKT-reformulation to cast the bilevel problem as a single
level and use a nonlinear solver to optimize. Our MPC
method can influence pedestrian motion while explicitly
satisfying safety constraints in a single-robot multi-human
environment. We analyze the performance of SICNav in two
simulation environments and indoor experiments with a real
robot to demonstrate safe robot motion that can influence
the surrounding humans. We also validate the trajectory
forecasting performance of ORCA on a human trajectory
dataset."
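The KKT trick of collapsing a bilevel problem to a single level can be illustrated on a 1-D toy stand-in (a closed-form lower level in place of ORCA, a grid search in place of nonlinear MPC; the cost terms and names are hypothetical):

```python
def lower_level_kkt(v_des, cap):
    """Lower level: the human minimises (v - v_des)**2 subject to
    v <= cap, where cap is induced by the robot's plan (a toy
    stand-in for an ORCA half-plane constraint). The KKT
    conditions give the closed form v* = min(v_des, cap), so the
    bilevel problem collapses to a single level."""
    return min(v_des, cap)

def solve_robot(goal_speed, v_des, weight=1.0, grid=101):
    """Upper level: brute-force the robot's speed u over a grid,
    trading progress toward goal_speed against how much the
    induced constraint slows the human (illustrative cost)."""
    best_u, best_cost = None, float("inf")
    for k in range(grid):
        u = 2.0 * k / (grid - 1)            # u in [0, 2]
        v = lower_level_kkt(v_des, cap=u)   # embedded KKT solution
        cost = (u - goal_speed) ** 2 + weight * (v_des - v) ** 2
        if cost < best_cost:
            best_u, best_cost = u, cost
    return best_u
```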
AAAI 2026 Main ConferenceSepehr SamaviArtificial Intelligencetechnical paperfree
35
143535HI-SLAM2: Geometry-Aware Gaussian SLAM for Fast Monocular
Scene Reconstruction
We present HI-SLAM2, a geometry-aware Gaussian SLAM system
that achieves fast and accurate monocular scene
reconstruction using only RGB input. Whereas existing
Neural SLAM and 3DGS-based SLAM methods often trade off
rendering quality against geometry accuracy, our research
demonstrates that both can be achieved simultaneously with
RGB input alone. The key idea of our approach is to
strengthen geometry estimation by combining easy-to-obtain
monocular priors with learning-based dense
SLAM, and then using 3D Gaussian splatting as our core map
representation to efficiently model the scene. Upon loop
closure, our method ensures on-the-fly global consistency
through efficient pose graph bundle adjustment and instant
map updates by explicitly deforming the 3D Gaussian units
based on anchored keyframe updates. Furthermore, we
introduce a grid-based scale alignment strategy to maintain
improved scale consistency in prior depths for finer depth
details. Through extensive experiments on Replica, ScanNet,
Waymo Open, ETH3D SLAM and ScanNet++ datasets, we
demonstrate significant improvements over existing Neural
SLAM methods and even surpass RGB-D-based methods in both
reconstruction and rendering quality. The project page and
source code are available at https://hi-slam2.github.io/.
AAAI 2026 Main ConferenceWei ZhangArtificial Intelligencetechnical paperfree
36
143534PRIMP: PRobabilistically-Informed Motion Primitives for
Efficient Affordance Learning From Demonstration
This article proposes a Learning-from-Demonstration (LfD)
method using probability densities on the workspaces of
robot manipulators. The method, named
PRobabilistically-Informed Motion Primitives (PRIMP),
learns the probability distribution of the end effector
trajectories in the 6-D workspace that includes both
positions and orientations. It is able to adapt to new
situations such as novel via points with uncertainty and a
change of viewing frame. The method itself is
robot-agnostic, in that the learned distribution can be
transferred to another robot with adaptation to its
workspace density. Workspace-STOMP, a new version of the
existing STOMP motion planner, is also introduced, which
can be used as a postprocess to improve the performance of
PRIMP and any other reachability-based LfD method. The
combination of PRIMP and Workspace-STOMP can further help
the robot avoid novel obstacles that are not present during
the demonstration process. The proposed methods are
evaluated with several sets of benchmark experiments. PRIMP
runs more than five times faster than existing
state-of-the-art methods while generating trajectories more
than twice as close to both the demonstrations and novel
desired poses. The methods are then combined with our lab's
robot imagination method that learns object affordances,
illustrating the applicability to learn tool use through
physical experiments.
AAAI 2026 Main ConferenceSipu RuanArtificial Intelligencetechnical paperfree
37
143533Motion Planning Diffusion: Learning and Adapting Robot
Motion Planning With Diffusion Models
The performance of optimization-based robot motion planning
algorithms is highly dependent on the initial solutions,
commonly found by running a sampling-based planner to
obtain a collision-free path. However, these methods can be
slow in high-dimensional and complex scenes and produce
nonsmooth solutions. Given previously solved path-planning
problems, it is highly desirable to learn their
distribution and use it as a prior for new similar
problems. Several works propose utilizing this prior to
bootstrap the motion planning problem, either by sampling
initial solutions from it, or using its distribution in a
maximum-a-posteriori formulation for trajectory
optimization. In this work, we introduce motion planning
diffusion (MPD), an algorithm that learns trajectory
distribution priors with diffusion models. These generative
models have shown increasing success in encoding multimodal
data and have desirable properties for gradient-based
motion planning, such as cost guidance. Given a motion
planning problem, we construct a cost function and sample
from the posterior distribution using the learned prior
combined with the cost function gradients during the
denoising process. Instead of learning the prior on all
trajectory waypoints, we propose learning a lower
dimensional representation of a trajectory using linear
motion primitives, particularly B-spline curves. This
parametrization guarantees that the generated trajectory is
smooth, can be interpolated at higher frequencies, and
needs fewer parameters than a dense waypoint
representation. We demonstrate the results of our method
ranging from simple 2-D to more complex tasks using a 7-DOF
robot arm manipulator. In addition to learning from
simulated data, we also use human demonstrations on a
real-world pick-and-place task. The experiment results show
that diffusion models are strong priors for encoding
multimodal trajectory distributions for optimization-based
motion planning.
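The B-spline parametrization mentioned above can be sketched with a uniform cubic evaluator (the standard textbook basis, shown for scalar control points; the function name is hypothetical):

```python
def cubic_bspline_point(ctrl, t):
    """Evaluate a uniform cubic B-spline at t in [0, 1] given a
    list of at least four scalar control points: a low-dimensional
    trajectory parametrisation. The uniform cubic basis makes the
    curve C2-smooth by construction."""
    n = len(ctrl) - 3                 # number of polynomial spans
    u = t * n
    i = min(int(u), n - 1)            # active span index
    s = u - i                         # local parameter in [0, 1]
    b0 = (1 - s) ** 3 / 6.0
    b1 = (3 * s**3 - 6 * s**2 + 4) / 6.0
    b2 = (-3 * s**3 + 3 * s**2 + 3 * s + 1) / 6.0
    b3 = s**3 / 6.0
    return sum(b * c for b, c in zip((b0, b1, b2, b3), ctrl[i:i + 4]))
```

Because the basis functions sum to one, a handful of control points fixes a smooth curve that can be sampled at any frequency, which is the parameter saving over dense waypoints.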
AAAI 2026 Main ConferenceJoao CarvalhoArtificial Intelligencetechnical paperfree
38
143532Generative Graphical Inverse Kinematics
Quickly and reliably finding accurate inverse kinematics
(IK) solutions is a challenging problem for many robot
manipulators. Existing numerical solvers are widely
applicable but typically only produce a single solution and
rely on local search techniques to minimize nonconvex
objectives. Recent learning-based approaches that
approximate the entire feasible set of solutions have shown
promise in generating multiple fast and accurate IK results
in parallel. However, existing learning-based techniques
have a significant drawback: each robot of interest
requires a specialized model that must be trained from
scratch. To address this key shortcoming, we propose a
novel distance-geometric robot representation coupled with
a graph structure that allows us to leverage the
generalizability of graph neural networks (GNNs). Our
approach, which we call generative graphical IK (GGIK), is
the first learned IK solver that is able to efficiently
yield a large number of diverse solutions in parallel while
also displaying the ability to generalize---a single
learned model can be used to produce IK solutions for a
variety of different robots. When compared to several other
learned IK methods, GGIK provides more accurate solutions
with the same amount of training data. GGIK can also
generalize reasonably well to robot manipulators unseen
during training. In addition, GGIK is able to learn a
constrained distribution that encodes joint limits and
scales well with the number of robot joints and sampled
solutions. Finally, GGIK can be used to complement local IK
solvers by providing a reliable initialization for the
local optimization process.
AAAI 2026 Main ConferenceOliver LimoyoArtificial Intelligencetechnical paperfree
39
143531Path-Constrained Haptic Motion Guidance via Adaptive
Phase-Based Admittance Control
Robots have surpassed humans in terms of strength and
precision, yet humans retain an unparalleled ability for
decision-making in the face of unpredictable disturbances.
This article aims to combine the strengths of both entities
within a singular task: human motion guidance under strict
geometric constraints, particularly adhering to
predetermined paths. To tackle this challenge, a modular
haptic guidance law is proposed that takes the
human-applied wrench as an input. Using an auxiliary
variable called the phase, the proposed law guarantees that
the generated desired motion consistently adheres to the
constraint path.
It is demonstrated how the guidance policy can be
generalized into physically interpretable terms, adjustable
either prior to initiating the task or dynamically while
the task is in progress. Additionally, an illustrative
guidance adaptation policy is showcased that takes into
account the human’s manipulability. Leveraging passivity
analysis, potential sources of instability are pinpointed,
and subsequently, overall system stability is ensured by
incorporating an augmented virtual energy tank. Lastly, a
comprehensive set of experiments, including a
20-participant user study, explores various aspects of the
approach in practice, encompassing both technical and
usability considerations.
AAAI 2026 Main ConferenceErfan ShahriariArtificial Intelligencetechnical paperfree
40
143529Multimodal Super-Resolution: Discovering Hidden Physics and
Its Application to Fusion Plasmas
Understanding complex physical systems often requires
integrating data from multiple diagnostics, each with
limited resolution or coverage. We present a machine
learning framework that reconstructs synthetic
high-temporal-resolution data for a target diagnostic using
information from other diagnostics, without direct target
measurements during inference. This multimodal
super-resolution technique improves diagnostic robustness
and enables monitoring even in case of measurement failures
or degradation. Applied to fusion plasmas, our method
targets edge-localized modes (ELMs), which can damage
plasma-facing materials. By reconstructing super-resolution
Thomson Scattering data from complementary diagnostics, we
uncover fine-scale plasma dynamics and validate the role of
resonant magnetic perturbations (RMPs) in ELM suppression
through magnetic island formation. The approach provides
new observations supporting plasma profile flattening
due to these islands. Our results demonstrate the
framework’s ability to generate high-fidelity synthetic
diagnostics, offering a powerful tool for ELM control
development in future reactors like ITER. The approach is
broadly transferable to other domains facing sparse,
incomplete, or degraded diagnostic data, opening new
avenues for discovery.
AAAI 2026 Main ConferenceAzarakhsh JalalvandArtificial Intelligencetechnical paperfree
41
143527CODEI: Resource-Efficient Task-Driven Co-Design of
Perception and Decision Making for Mobile Robots Applied to
Autonomous Vehicles
This paper discusses integration challenges and design
strategies for mobile robots, focusing on the task-driven,
optimal selection of hardware and software to
balance safety, efficiency, and minimal usage of resources
such as costs, energy, computational requirements, and
weight. We emphasize the interplay between perception and
motion planning in decision-making by introducing the
concept of occupancy queries to quantify the perception
requirements for sampling-based motion planners. Sensor and
algorithm performance are evaluated using False Negative
Rate (FNR) and False Positive Rate (FPR) across various
factors such as geometric relationships, object properties,
sensor resolution, and environmental conditions. By
integrating perception requirements with perception
performance, an Integer Linear Programming (ILP) approach
is proposed for efficient sensor and algorithm selection
and placement. This forms the basis for a co-design
optimization that includes the robot body, motion planner,
perception pipeline, and computing unit. We refer to this
framework for solving the co-design problem of mobile
robots as CODEI, short for Co-design of Embodied
Intelligence. A case study on developing an Autonomous
Vehicle (AV) for urban scenarios provides actionable
information for designers, and shows that complex tasks
escalate resource demands, with task performance affecting
choices of the autonomy stack. The study demonstrates that
resource prioritization influences sensor
choice: cameras are preferred for cost-effective and
lightweight designs, while lidar sensors are chosen for
better energy and computational efficiency.
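The sensor-selection step can be illustrated with an exhaustive stand-in for the ILP (toy scale only; the independence assumption on sensor misses and all names are hypothetical):

```python
from itertools import combinations

def _prod(xs):
    p = 1.0
    for x in xs:
        p *= x
    return p

def select_sensors(sensors, queries, fnr_budget):
    """Pick the cheapest sensor subset whose combined miss rate on
    every occupancy query stays within fnr_budget.
    sensors: {name: (cost, {query: fnr})}. Assumes independent
    misses, so a query's joint FNR is the product over sensors;
    an uncovered query contributes FNR 1.0."""
    best, best_cost = None, float("inf")
    names = list(sensors)
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            cost = sum(sensors[s][0] for s in subset)
            ok = all(
                _prod(sensors[s][1].get(q, 1.0) for s in subset)
                <= fnr_budget
                for q in queries)
            if ok and cost < best_cost:
                best, best_cost = subset, cost
    return best, best_cost
```

The exhaustive loop is exactly what an ILP solver avoids; the sketch only fixes the constraint structure (coverage within an FNR budget at minimum cost).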
AAAI 2026 Main ConferenceDejan MilojevicArtificial Intelligencetechnical paperfree
42
143526Analysing Satellite Imagery Classification under Spatial
Domain Shift across Geographic Regions
Deep learning models are designed based on the i.i.d.
assumption; consequently, they experience a significant
performance drop due to distribution shifts when
deployed in real environments. Domain Generalisation (DG)
aims to bridge the distribution shift between the source
and target domains by improving the generalisability of the
model to Out-Of-Distribution (OOD) data. This challenge is
prominent in satellite imagery classification due to the
scarcity of data from underrepresented regions such as
Africa and Oceania. In this paper, we address the
limitations of existing datasets in capturing distribution
shifts caused by geospatial differences between geographic
regions by constructing a new, large-scale dataset called
Domain Shift across Geographic Regions (DSGR). This dataset
aims to help researchers better understand the impact of
distribution shifts on satellite imagery classification.
Furthermore, we perform rigorous experiments on DSGR to
investigate and benchmark the robustness of existing DG
techniques under single- and multi-source domain settings
and the role of foundation models in enhancing the DG
techniques. Our evaluations reveal that recent DG
techniques achieve comparable yet weak performance on
DSGR. However, when combined with a foundation model like
CLIP, ERM (introduced in 1999) achieves highly competitive
results, surpassing even recent state-of-the-art DG
solutions in enhancing the generalisability of deep
learning models across different geographic regions. Our
dataset and code are available at
https://github.com/RWGAI/DSGR.
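ERM in this context simply pools the examples from all source domains and minimizes the average loss with one shared model; a 1-D logistic-regression sketch (an illustrative stand-in for a linear probe on frozen foundation-model features; names and constants are hypothetical):

```python
import math

def erm_train(domains, steps=500, lr=0.5):
    """Plain ERM: pool all source domains and minimise the average
    logistic loss with one shared linear model (w, b) by gradient
    descent. 1-D toy features stand in for frozen embeddings."""
    data = [xy for d in domains for xy in d]   # pool all domains
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += (p - y)
        w -= lr * gw / len(data)
        b -= lr * gb / len(data)
    return w, b

def predict(w, b, x):
    return 1 if w * x + b > 0 else 0
```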
AAAI 2026 Main ConferenceSara Al-EmadiArtificial Intelligencetechnical paperfree
43
143525Causal Explanations for Sequential Decision Making
Stochastic sequential decision-making systems — such as
Markov decision processes and their variants — are
increasingly used in areas such as transportation,
healthcare, and communication. However, the ability to
explain these systems’ outputs to non-technical end users
has not kept pace with their widespread adoption. This
paper addresses that gap by extending prior work and
presenting a unified framework for generating causal
explanations of agent behavior in sequential
decision-making settings, grounded in the structural causal
model (SCM) paradigm. Our framework supports the generation
of multiple, semantically distinct explanations for agent
actions — capabilities that were previously unattainable.
In addition to introducing a novel taxonomy of explanations
for MDPs to guide empirical investigation, we develop both
exact and approximate causal inference methods within the
SCM framework. We analyze their applicability and derive
run-time bounds for each. This leads to the proposed
algorithm, MeanRESP, which operates flexibly across a
spectrum of approximations tailored to external
constraints. We further analyze the sample complexity and
error rates of approximate MeanRESP, and provide a detailed
comparison of its outputs — under varying definitions of
responsibility — with popular Shapley-value-based methods.
Empirically, we performed a series of experiments to
evaluate the practicality and effectiveness of the proposed
system, focusing on real-world computational demands and
the validity and reliability of metrics for comparing
approximate and exact causal methods. Finally, we present
two user studies that reveal user preferences for certain
types of explanations and demonstrate a strong preference
for explanations generated by our framework compared to
those from other state-of-the-art systems.
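The Shapley-value baseline that the framework is compared against can be computed exactly at toy scale (the standard permutation definition; names are hypothetical):

```python
from itertools import permutations

def shapley_values(players, value_fn):
    """Exact Shapley values: average each player's marginal
    contribution over all orderings. O(n!), toy scale only."""
    phi = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = set()
        for p in order:
            before = value_fn(frozenset(coalition))
            coalition.add(p)
            phi[p] += value_fn(frozenset(coalition)) - before
    for p in phi:
        phi[p] /= len(orders)
    return phi
```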
AAAI 2026 Main ConferenceSamer B. NashedArtificial Intelligencetechnical paperfree
44
143524Feature Hallucination for Self-supervised Action Recognition
Understanding human actions in videos requires robust
integration of multimodal cues beyond raw pixels. This work
introduces a deep self-supervised action recognition
framework that jointly predicts action concepts and
auxiliary features from RGB video, then hallucinates
missing modalities at test time to improve recognition
without added runtime cost. Two new domain-specific
descriptors, Object Detection Features (ODF) and Saliency
Detection Features (SDF), are proposed to capture spatial
context and motion saliency, and are integrated with other
modalities such as optical flow, skeleton, audio, and
improved dense trajectories. The framework incorporates
aleatoric uncertainty modeling to handle noisy or
unreliable features, along with a robust loss for stable
multimodal fusion. Compatible with popular architectures
including I3D, AssembleNet, Video Transformer Network,
VideoMAE V2, and InternVideo2, the approach achieves
state-of-the-art results on Kinetics-400, Kinetics-600, and
Something-Something V2.
AAAI 2026 Main ConferenceLei WangArtificial Intelligencetechnical paperfree
45
143523Super Level Sets and Exponential Decay: A Synergistic
Approach to Stable Neural Network Training
This paper presents a theoretically grounded optimization
framework for neural network training that integrates an
Exponentially Decaying Learning Rate with Lyapunov-based
stability analysis. We develop a dynamic learning rate
algorithm and prove that it induces connected and stable
descent paths through the loss landscape by maintaining the
connectivity of super-level sets Sλ = {θ ∈ ℝⁿ : ℒ(θ) ≥ λ}.
Under the condition that the Lyapunov function V(θ) = ℒ(θ)
satisfies ΔV(θ) ⋅ Δℒ(θ) ≥ 0, we establish that these
super-level sets are not only connected but also
equiconnected across epochs, providing uniform topological
stability. We further derive convergence guarantees using
a second-order Taylor expansion and demonstrate that our
exponentially scheduled learning rate with gradient-based
modulation leads to a monotonic decrease in loss. The
proposed algorithm incorporates this schedule into a
stability-aware update mechanism that adapts step sizes
based on both curvature and energy-level geometry. This
work formalizes the role of topological structure in
convergence dynamics and introduces a provably stable
optimization algorithm for high-dimensional, non-convex
neural networks.
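The exponentially decaying schedule and the claimed monotone loss decrease can be checked on a convex quadratic (a minimal sketch; the constants are illustrative, not from the paper):

```python
import math

def train_quadratic(theta0=5.0, eta0=0.8, decay=0.01, steps=100):
    """Gradient descent on the convex loss L(theta) = theta**2 with
    the exponentially decaying rate eta_t = eta0 * exp(-decay * t).
    Records the loss at each step so monotone decrease can be
    checked."""
    theta, losses = theta0, []
    for t in range(steps):
        eta = eta0 * math.exp(-decay * t)
        theta -= eta * 2.0 * theta    # gradient of theta**2 is 2*theta
        losses.append(theta * theta)
    return losses
```

On this loss each step scales theta by (1 - 2 * eta_t); since eta_t stays below 1 for these constants, the loss strictly decreases, the quadratic analogue of the paper's schedule-driven monotonicity result.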
AAAI 2026 Main ConferenceJatin ChaudharyArtificial Intelligencetechnical paperfree
46
143521GAICo: A Deployed and Extensible Framework for Evaluating Diverse and Multimodal Generative AI Outputs
The rapid proliferation of Generative AI (GenAI) into diverse, high-stakes domains necessitates robust and reproducible evaluation methods. However, practitioners often resort to ad-hoc, non-standardized scripts, as common metrics are often unsuitable for specialized, structured outputs (e.g., automated plans, time-series) or holistic comparison across modalities (e.g., text, audio, and image). This fragmentation hinders comparability and slows AI system development. To address this challenge, we present GAICo (Generative AI Comparator): a deployed, open-source Python library that streamlines and standardizes GenAI output comparison. GAICo provides a unified, extensible framework supporting a comprehensive suite of reference-based metrics for unstructured text, specialized structured data formats, and multimedia (images, audio). Its architecture features a high-level API for rapid, end-to-end analysis, from multi-model comparison to visualization and reporting, alongside direct metric access for granular control. We demonstrate GAICo's utility through a detailed case study evaluating and debugging complex, multi-modal AI Travel Assistant pipelines. GAICo empowers AI researchers and developers to efficiently assess system performance, make evaluation reproducible, improve development velocity, and ultimately build more trustworthy AI systems, aligning with the goal of moving faster and safer in AI deployment. **Since its release on PyPI in Jun 2025, the tool has been downloaded over 13K times, across versions, by Aug 2025, demonstrating growing community interest.**
AAAI 2026 Main ConferenceKausik Lakkaraju, Biplav Srivastava, Nitin Gupta, Pallav KoppisettiArtificial Intelligencetechnical paperfree
47
143520Deploying Atmospheric and Oceanic AI Models on Chinese Hardware and Framework: Migration Strategies, Performance Optimization and Analysis
With the growing role of artificial intelligence in climate and weather research, efficient model training and inference are in high demand. Current models like FourCastNet and AI-GOMS depend heavily on GPUs, limiting hardware independence, especially for Chinese domestic hardware and frameworks. To address this issue, we present a framework for migrating large-scale atmospheric and oceanic models from PyTorch to MindSpore, optimizing them for Chinese chips, and evaluating their performance against GPUs. The framework focuses on software-hardware adaptation, memory optimization, and parallelism. Furthermore, the models' performance is evaluated across multiple metrics, including training speed, inference speed, model accuracy, and energy efficiency, with comparisons against GPU-based implementations. Experimental results demonstrate that the migration and optimization process preserves the models' original accuracy while significantly reducing system dependencies and improving operational efficiency by leveraging Chinese chips as a viable alternative for scientific computing. This work provides valuable insights and practical guidance for leveraging Chinese domestic chips and frameworks in atmospheric and oceanic AI model development, offering a pathway toward greater technological independence.
AAAI 2026 Main ConferenceXiaomeng Huang, Li Jiahao, Jiancheng Pan, Yanfei Xiang, Yuze Sun, Luo Wentao, Quan ZhangArtificial Intelligencetechnical paperfree
48
143519From Benchmarks to Business Impact: Deploying IBM Generalist Agent in Enterprise Production
Agents are rapidly advancing in automating digital work, but enterprises face a harder challenge: moving beyond prototypes to deployed systems that deliver measurable business value. This path is complicated by fragmented frameworks, slow development, and the absence of standardized evaluation practices. Generalist agents have emerged as a promising direction, excelling on academic benchmarks and offering flexibility across task types, applications, and modalities. Yet, evidence of their deployment in production enterprise settings is still limited. This paper reports IBM’s experience developing and deploying the Computer Using Generalist Agent (CUGA). CUGA adopts a hierarchical planner–executor architecture with strong analytical foundations, achieving state-of-the-art performance on AppWorld and WebArena. Beyond benchmarks, it was deployed in the Business-Process-Outsourcing talent acquisition domain, meeting enterprise requirements for scalability, auditability, safety, and governance. To support evaluation, we introduce BPO-TA, a 26-task benchmark spanning 13 analytics endpoints. In deployment, CUGA matched the accuracy of specialized agents while reducing development time by 91.9% and cost by 52.3%. Our contribution is twofold: demonstrating generalist agents operating at enterprise scale, and distilling technical and organizational lessons from this deployment. We outline requirements and next steps for advancing research-grade architectures like CUGA into robust, enterprise-ready systems.
AAAI 2026 Main ConferenceSegev Shlomov, Lukasz Strak, Eilam Shapira, Alon Oved, Nir Mashkif, Sami Marreed, Ido Levy, Offer Akrabi, Avi Yaeli, Elizabeth Koumpan, Yinon Goldshtein, Asaf AdiArtificial Intelligencetechnical paperfree
49
143518A Metacognitive Architecture for Correcting LLM Errors in AI Agents
The ability to self-revise is critical for AI agents. To maintain trust and foster positive perceptions, AI systems must correct their mistakes and adapt to users’ changing needs. We present a metacognitive architecture for self-revision in SAMI, an AI social agent deployed in Georgia Tech’s OMSCS program. Over the past ten semesters, SAMI has facilitated social connections for more than 11,000 students. Real-world deployments revealed frequent requests from students to revise the knowledge database, either to correct errors or to update their information. To address this need, we present a self-revision architecture that integrates Knowledge-Based AI (KBAI) and Generative AI (GenAI). The architecture (1) localizes the task requiring revision by introspecting on its self-model, (2) updates the knowledge database, and (3) communicates the revision process back to the user. We evaluate the framework using feedback cases derived from real student data and observed revision needs. This work introduces a novel metacognitive approach to improving explainability through the integration of KBAI and GenAI, with a clear path toward real-world deployment.
AAAI 2026 Main ConferenceAshok Goel, Jisu Kim, Mahimul IslamArtificial Intelligencetechnical paperfree
50
143517Interpretable Machine Learning for In-Home Mild Cognitive Impairment Detection
This paper introduces a novel system for in-home cognitive health assessment using ambient sensors and machine learning technology that can robustly detect mild cognitive impairment (MCI) despite noisy and sparsely available data. The learned model can transparently explain which aspects of individuals' daily lives led to the prediction while reliably predicting MCI, providing healthcare workers with more insights for further clinical interventions. We developed a robust, transparent machine learning model based on the fusion adaptive resonance theory (Fusion ART) neural network to learn individuals' daily patterns of activity from continuous sensor data in terms of a suite of digital biomarkers reflecting four key domains: physical activity, daily routines, cognitive engagement, and sleep patterns.
In a five-year longitudinal study of over one hundred participants, whose homes were equipped with non-intrusive sensors while they underwent parallel clinical evaluation, our model successfully identified individuals with MCI, achieving high predictive accuracy despite the noisy and sparse availability of data. As a transparent neural network, the learned model can also serve as classification rules to distinguish MCI from normal cognition (NC) cases based on the digital biomarkers. These results demonstrate that passively collected, sensor-derived digital biomarkers can be leveraged to indicate cognitive status and potentially provide clinically meaningful insights into impairment conditions. We also discuss the practical challenges and lessons learned from this real-world deployment to inform future large-scale implementations of such AI-driven health monitoring systems.
AAAI 2026 Main ConferenceBudhitama Subagdja, Ah-Hwee Tan, Shanthoshigaa D, Iris RawtaerArtificial Intelligencetechnical paperfree
51
143516InfrastructureSentinel: Policy Enforced Guardrails for Secure MCP-driven Infrastructure AgentsThe proliferation of Model Context Protocol (MCP) servers in enterprise infrastructure management has revolutionized AI-driven automation while introducing critical multi-layered security vulnerabilities that traditional cybersecurity frameworks cannot adequately address. This paper presents a comprehensive intelligent guardrail system that addresses the unique security challenges of MCP-driven infrastructure management through a novel four-layer defense architecture. Our solution employs a dedicated guardian LLM that interprets natural language policies and applies contextual reasoning to complex infrastructure scenarios, providing dynamic policy enforcement that adapts to user roles, operational timing, and system context. Unlike existing rule-based security systems, our approach implements guardrails at four distinct control points: input message filtering, tool selection validation, execution-time verification, and post-action auditing. The system addresses critical gaps in existing security solutions by providing infrastructure-specific threat modeling, real-time policy adaptation, and comprehensive audit trails with explainable decision-making through confidence scores and detailed reasoning. Our evaluation demonstrates the system's effectiveness in preventing command injection, privilege escalation, and tool poisoning attacks across various enterprise infrastructure scenarios while maintaining operational agility essential for modern data center management.AAAI 2026 Main ConferenceGayathri Saranathan, Tarun Kumar, Martin Foltin, Suparna Bhattacharya, Aalap Tripathy, Scott Hinchley, Donald Bahls, David Brookshire, Larry Kaplan, Robert WisniewskiArtificial Intelligenceposterfree
52
143515AquaSentinel: Next-Generation AI System Integrating Sensor Networks for Urban Underground Water Pipeline Anomaly Detection via Collaborative MoE-LLM Agent ArchitectureUnderground pipeline leaks and infiltrations pose significant threats to water security and environmental safety. Traditional manual inspection methods provide limited coverage and delayed response, often missing critical anomalies. This paper proposes AquaSentinel, a novel physics-informed AI system for real-time anomaly detection in urban underground water pipeline networks. We introduce four key innovations: (1) strategic sparse sensor deployment at high-centrality nodes combined with physics-based state augmentation to achieve network-wide observability from minimal infrastructure; (2) the RTCA (Real-Time Cumulative Anomaly) detection algorithm, which employs dual-threshold monitoring with adaptive statistics to distinguish transient fluctuations from genuine anomalies; (3) a Mixture of Experts (MoE) ensemble of spatiotemporal graph neural networks that provides robust predictions by dynamically weighting model contributions; (4) causal flow-based leak localization that traces anomalies upstream to identify source nodes and affected pipe segments. Our system strategically deploys sensors at critical network junctions and leverages physics-based modeling to propagate measurements to unmonitored nodes, creating virtual sensors that enhance data availability across the entire network. Experimental evaluation using 110 leak scenarios demonstrates that AquaSentinel achieves 100% detection accuracy. This work advances pipeline monitoring by demonstrating that physics-informed sparse sensing can match the performance of dense deployments at a fraction of the cost, providing a practical solution for aging urban infrastructure.AAAI 2026 Main ConferenceWenlu Wang, Hua Zhang, Qiming Guo, Bishal Khatri, Wenbo Sun, Jinwen TangArtificial Intelligencetechnical paperfree
53
143512Centralized training with hybrid execution in multi-agent reinforcement learning via predictive observation imputationWe study hybrid execution in multi-agent reinforcement learning (MARL), a paradigm where agents aim to complete cooperative tasks with arbitrary communication levels at execution time by taking advantage of information-sharing among the agents. Under hybrid execution, the communication level can range from a setting in which no communication is allowed between agents (fully decentralized), to a setting featuring full communication (fully centralized), but the agents do not know beforehand which communication level they will encounter at execution time. We contribute MARO, an approach that makes use of an auto-regressive predictive model, trained in a centralized manner, to estimate missing agents' observations at execution time. We evaluate MARO on standard scenarios and extensions of previous benchmarks tailored to emphasize the impact of partial observability in MARL. Experimental results show that our method consistently outperforms relevant baselines, allowing agents to act with faulty communication while successfully exploiting shared information.AAAI 2026 Main ConferencePedro SantosArtificial Intelligencetechnical paperfree
54
143511Generative AI Against Poaching: Latent Composite Flow Matching for Poaching PredictionPoaching poses significant threats to wildlife and biodiversity. A valuable step in reducing poaching is to forecast poacher behavior, which can inform patrol planning and other conservation interventions. Existing poaching prediction methods based on linear models or decision trees lack the expressivity to capture complex, nonlinear spatiotemporal patterns. Recent advances in generative modeling, particularly flow matching, offer a more flexible alternative. However, training such models on real-world poaching data faces two central obstacles: imperfect detection of poaching events and limited data. To address imperfect detection, we integrate flow matching with an occupancy-based detection model and train the flow in latent space to infer the underlying occupancy state. To mitigate data scarcity, we adopt a composite flow initialized from a linear-model prediction rather than the random noise that is standard in diffusion models, injecting prior knowledge and improving generalization. Evaluations on datasets from two national parks in Uganda show consistent gains in predictive accuracy.AAAI 2026 Main ConferenceMilind Tambe, Lily Xu, Lingkai Kong, Haichuan Wang, Charles Emogor, Vincent Boersch-SupanArtificial Intelligencetechnical paperfree
55
143510Life, Machine Learning, and the Search for Habitability: Predicting Biosignature Fluxes for the Habitable Worlds ObservatoryFuture direct-imaging flagship missions, such as NASA's Habitable Worlds Observatory (HWO), face critical decisions in prioritizing observations due to extremely stringent time and resource constraints. In this paper, we introduce two advanced machine-learning architectures tailored for predicting biosignature gas fluxes from exoplanetary reflected-light spectra: a Bayesian Convolutional Neural Network (BCNN) and our novel model architecture, the Spectral Query Adaptive Transformer (SQuAT). The BCNN robustly quantifies both epistemic and aleatoric uncertainties, offering reliable predictions under diverse observational conditions, whereas SQuAT employs query-driven attention mechanisms to enhance interpretability by explicitly associating spectral features with specific biosignature gases. We demonstrate that both models achieve comparably high predictive accuracy on an augmented dataset spanning a wide range of exoplanetary conditions, while highlighting their distinct advantages in uncertainty quantification and spectral interpretability. These capabilities position our methods as promising tools for accelerating target triage, optimizing observation schedules, and maximizing scientific return for upcoming flagship missions such as HWO.AAAI 2026 Main ConferenceMark Moussa, Amber Young, Brianna Isola, Vasuda Trehan, Nicholas Wogan, Michael Himes, Giada ArneyArtificial Intelligencetechnical paperfree
56
143509Deploying Rapid Damage Assessments from sUAS Imagery for Disaster ResponseThis paper presents the first AI/ML system for automating building damage assessment in small uncrewed aerial systems (sUAS) imagery to be deployed operationally during federally declared disasters (Hurricanes Debby and Helene). In response to major disasters, sUAS teams are dispatched to collect imagery of the affected areas to assess damage; however, at recent disasters, teams collectively delivered between
47GB and 369GB of imagery per day, representing more imagery than can reasonably be transmitted or interpreted by subject matter experts in the disaster scene, thus delaying response efforts. To alleviate this data avalanche encountered in practice, computer vision and machine learning techniques are necessary. While prior work has been deployed to automatically assess damage in satellite imagery, there is no current state of practice for sUAS-based damage assessment systems for operational use, as all known work has been confined
to academic settings. This work establishes the state of practice via the development and deployment of models for building damage assessment with sUAS imagery. The development of the models consisted of training on the largest known dataset of post-disaster sUAS aerial imagery, which consists of 21,716 building damage labels, and the operational training of 91 disaster practitioners. The deployment of the system was during the responses to Hurricanes Debby and Helene, where it assessed a combined 415 buildings in approximately 18 minutes. This work contributes detailed documentation of the actual use of AI/ML for damage assessment during a disaster and lessons learned to the benefit of the AI/ML research and user communities.
AAAI 2026 Main ConferenceThomas Manzini, Priyankari Perali, Robin MurphyArtificial Intelligencetechnical paperfree
57
143508Clinician-in-the-Loop Smart Home System to Detect Urinary Tract Infection Flare-Ups via Uncertainty-Aware Decision SupportUrinary tract infection (UTI) flare-ups pose a significant health risk for older adults with chronic conditions. These infections often go unnoticed until they become severe, making early detection through innovative smart home technologies crucial. Traditional machine learning (ML) approaches relying on simple binary classification for UTI detection offer limited utility to nurses and practitioners as they lack insight into prediction uncertainty, hindering informed clinical decision-making. This paper presents a clinician-in-the-loop (CIL) smart home system that leverages ambient sensor data to extract meaningful behavioral markers, train robust predictive ML models, and calibrate them to enable uncertainty-aware decision support. The system incorporates a statistically valid uncertainty quantification method called Conformal-Calibrated Interval (CCI), which quantifies uncertainty and abstains from making predictions ("I don’t know") when the ML model's confidence is low. Evaluated on real-world data from eight smart homes, our method outperforms baseline methods in recall and other classification metrics while maintaining the lowest abstention proportion and interval width. A survey of 42 nurses confirms that our system's outputs are valuable for guiding clinical decision-making, underscoring their practical utility in improving informed decisions and effectively managing UTIs and other condition flare-ups in older adults.AAAI 2026 Main ConferenceJana Doppa, Chibuike Ugwu, Roschelle Fritz, Diane CookArtificial Intelligenceposterfree
58
143507Driving Engagement in Daily Fantasy Sports with a Scalable and Urgency-Aware Ranking EngineIn daily fantasy sports (DFS), match participation is highly time-sensitive.
Users must act within a narrow window before a game begins, making match recommendation a time-critical task to prevent missed engagement and revenue loss. Existing recommender systems, typically designed for static item catalogs, are ill-equipped to handle the hard temporal deadlines inherent in these live events. To address this, we designed and deployed a recommendation engine using the Deep Interest Network (DIN) architecture.
We adapt the DIN architecture by injecting temporality at two levels: first, through real-time urgency features for each candidate match (e.g., time-to-round-lock), and second, via temporal positional encodings that represent the time-gap between each historical interaction and the current recommendation request, allowing the model to dynamically weigh the recency of past actions.
This approach, combined with a listwise NeuralNDCG loss function, produces highly relevant and urgency-aware rankings. To support this at industrial scale, we developed a multi-node, multi-GPU training architecture on Ray and PyTorch. Our system, validated on a massive industrial dataset with over 650k users and over 100B interactions, achieves a +9% lift in nDCG@1 over a heavily optimized LightGBM baseline with handcrafted features. The strong offline performance of this model establishes its viability as a core component for our planned on-device (edge) recommendation system, where online A/B testing will be conducted.
AAAI 2026 Main ConferenceUnmesh PadalkarArtificial Intelligencetechnical paperfree
59
143506Automated Unified Reasoning with Vision-Language Models for Multi-modal Burn AssessmentIn emerging clinical applications such as ultrasound-based burn assessment, the lack of domain-specific data presents a significant challenge for developing robust AI systems. Vision-language models (VLMs) have shown strong performance in general computer vision tasks, yet their application to medical imaging remains limited, particularly due to insufficient reasoning capabilities and the scarcity of high-quality training data. We introduce AURA (Automated Unified Reasoning for Burn Assessment), a multi-modal approach that integrates pre-trained VLMs with symbolic first-order logic (FOL) reasoning to improve diagnostic accuracy and interpretability in this data-limited setting. For this study, we collected real-patient data over a one-year period at a U.S. burn center, performing all experiments in a real clinical setting to ensure practical relevance. The dataset includes both conventional B-Mode ultrasound and Tissue Doppler Imaging (TDI), with TDI introduced here for the first time in burn assessment, underscoring the emerging nature of this work. Beyond burn severity classification, we assess the system’s ability to produce expert-level surgical insight directly from imaging data. On the retrospective dataset, it achieves up to 93% accuracy in surgical classification and 87% in fine-grained burn depth prediction, comparable to expert-informed predictions and substantially exceeding the 70% accuracy of traditional visual inspection by human experts. These results, obtained from a novel multi-modal dataset collected in a real clinical burn center setting, highlight the potential of this approach to improve decision-making in burn care. 
To further support future deployment, we demonstrate a prototype integration with an Electronic Medical Record (EMR) system that aligns with clinical workflows and supports scalable, real-world implementation.AAAI 2026 Main ConferenceJuan Wachs, Md Masudur Rahman, Mohamed Masry, Gayle GordilloArtificial Intelligencetechnical paperfree
60
143506DEEP: A Discourse Evolution Engine for Predictions About Social MovementsNumerous social movements (SMs) around the world help support the UN's Sustainable Development Goals (SDGs). Understanding how key events shape SMs is central to achieving the SDGs. We have developed SMART (Social Media Analysis & Reasoning Tool) to track social movements related to the SDGs. SMART was designed by a multidisciplinary team of AI researchers, journalists, communications scholars, and legal experts. This paper describes SMART's transformer-based multivariate time series Discourse Evolution Engine for Predictions about Social Movements (DEEP), which predicts the volume of future articles/posts and the emotions expressed. DEEP outputs probabilistic forecasts with uncertainty estimates, providing critical support for editorial planning and strategic decision-making. We evaluate DEEP with a case study of the #MeToo movement by creating a novel longitudinal dataset (433K Reddit posts and 121K news articles) from September 2024 to June 2025 that will be publicly released for research purposes upon publication of this paper.AAAI 2026 Main ConferenceAaron Shaw, Venkatramanan Subrahmanian, Marco Postiglione, Valerio La Gatta, Jeremy Gilbert, Daniel Linna, Morgan GreenfieldArtificial Intelligencetechnical paperfree
61
143504Lightweight Additive Blend Maps for Texture-Preserving Face Retouching: A Neural Approach to Traditional Photographic TechniquesProfessional face retouching in photography requires balancing texture preservation with quality enhancement, a problem that conventional automated methods struggle to handle effectively. We present a new lightweight neural architecture that converts traditional dodge-and-burn photographic techniques into predictions of learnable additive blend maps. Instead of rebuilding whole images, our method uses a small U-Net that predicts pixel-level changes, allowing precise brightness adjustments while maintaining the original skin texture. With a 6MB model that operates effectively on common hardware, the technique produces high-quality results while preserving texture fidelity, which is crucial for professional applications. Experimental validation demonstrates competitive performance with current methods while offering significant computational advantages.AAAI 2026 Main ConferenceAbinash PeguArtificial Intelligencetechnical paperfree
62
143503LiRA: A Multi-Agent Framework for Reliable and Readable Literature Review GenerationThe rapid growth of scientific publications has made it increasingly difficult to keep literature reviews comprehensive and up-to-date. Though prior work has focused on automating retrieval and screening, the writing phase of systematic reviews remains largely under-explored, especially with regard to readability and factual accuracy. To address this, we present LiRA (Literature Review Agents), a multi-agent collaborative workflow which emulates the human literature review process. LiRA utilizes specialized agents for content outlining, subsection writing, editing, and reviewing, producing cohesive and comprehensive review articles. Evaluated on SciReviewGen and a proprietary ScienceDirect dataset, LiRA outperforms current baselines such as AutoSurvey and MASS-Survey in writing and citation quality, while maintaining competitive similarity to human-written reviews. We further evaluate LiRA in real-world scenarios using document retrieval and assess its robustness to reviewer model variation. Our findings highlight the potential of agentic LLM workflows, even without domain-specific tuning, to improve the reliability and usability of automated scientific writing.AAAI 2026 Main ConferenceAnders Søgaard, Maarten de Rijke, Seyed Amin Tabatabaei, Xinyi Chen, Gregory Hok Tjoan Go, Khang LyArtificial Intelligencetechnical paperfree
63
143502Integrating Fourier Neural Operators into High-Fidelity Helicopter Flight Simulation for Real-Time Urban Wind PredictionHigh-fidelity helicopter flight simulators are essential for preparing pilots for complex and hazardous environments, yet realistic urban wind dynamics are difficult to reproduce in real time when relying on precomputed computational fluid dynamics (CFD) data. We present the first integration of a Fourier Neural Operator (FNO) into a Level D full flight simulator for real-time, physics-based urban wind field generation. Trained on high-resolution urban flow simulations, the FNO predicts one-minute-averaged 3D wind fields that dynamically adapt to flight state and location, replacing static wind inputs in the simulator pipeline. Turbulence levels are computed from the predictions and injected directly into the simulation loop. Professional pilots evaluated the system in an urban scenario and reported that it reproduced wind effects they would expect, such as turbulence and directional changes when landing behind buildings. They highlighted its value for less experienced pilots to develop wind awareness and for realistic training in critical operations, including offshore platform landings.AAAI 2026 Main ConferenceMaximilian Dauner, Michael Kurz, Gudrun Socher, Alexander KnollArtificial Intelligencetechnical paperfree
64
143501NOVAID: Natural-language Observability Visualization Assistant for ITOps Dashboard Widget GenerationManual creation of IT monitoring dashboard widgets is slow, error-prone, and a barrier for both novice and expert users. We present NOVAID, an interactive chatbot that leverages Large Language Models (LLMs) to generate IT monitoring widgets directly from natural language queries. Unlike general natural language–to-visualization tools, NOVAID addresses IT operations–specific challenges: specialized widget types like SLO charts, dynamic API-driven data retrieval, and complex contextual filters. The system combines a domain-aware semantic parser, fuzzy entity matching, and schema completion to produce standardized widget JSON specifications. An interactive clarification loop ensures accuracy in underspecified queries. On a curated dataset of 271 realistic queries, NOVAID achieves promising accuracy (up to 94.10% in metric extraction) across multiple LLMs. A user study with IT engineers yielded a System Usability Scale score of 74.2 for NOVAID, indicating good usability. By bridging natural language intent with operational dashboards, NOVAID demonstrates clear potential and a path for deployment in enterprise ITOps monitoring platforms.AAAI 2026 Main ConferencePrateeti Mohapatra, Seema Nagar, Arthur de Magalhaes, Pratik Mishra, Caner Gözübüyük, Raya WittichArtificial Intelligencetechnical paperfree
65
143500CausalTrace: A Neurosymbolic Causal Analysis Agent for Smart ManufacturingModern manufacturing environments demand not only accurate predictions but also interpretable insights into process anomalies, root causes, and potential interventions. Existing AI systems often function as isolated black boxes, lacking the seamless integration of prediction, explanation, and causal reasoning required for a unified decision-support solution. This fragmentation limits their trustworthiness and practical utility in high-stakes industrial environments. In this work, we present CausalTrace, a neurosymbolic causal analysis module integrated into the SmartPilot industrial CoPilot. CausalTrace performs data-driven causal analysis enriched by industrial ontologies and knowledge graphs, including advanced functions such as causal discovery, counterfactual reasoning, and root cause analysis (RCA). It supports real-time operator interaction and is designed to complement existing agents by offering transparent, explainable decision support. We conducted a comprehensive evaluation of CausalTrace using multiple causal assessment methods and the C3AN framework (i.e., Custom, Compact, Composite AI with Neurosymbolic Integration), which spans principles of robustness, intelligence, and trustworthiness. In an academic rocket assembly testbed, CausalTrace achieved substantial agreement with domain experts (ROUGE-1: 0.91 in ontology QA) and strong RCA performance (MAP@3: 94%, PR@2: 97%, MRR: 0.92, Jaccard: 0.92). It also attained 4.59/5 in the C3AN evaluation, demonstrating precision and reliability for live deployment.AAAI 2026 Main ConferenceAmit Sheth, Utkarshani Jaimini, Chathurangi Shyalika Jayakody Kankanamalage, Aryaman Sharma, Cory Henson, Fadi Kalach, Ramy HarikArtificial Intelligencetechnical paperfree
66
143499Diversity Meets Relevancy: Multi-Agent Knowledge Probing for Industry 4.0 ApplicationsIndustrial data scientists modeling an asset's condition need to build domain understanding by asking questions about the asset. Example questions include what failure modes it can experience, under which operating conditions they occur, and how the manufacturer and weather affect them.
Traditionally, the main source of domain information comes from Subject Matter Experts (SMEs) and Failure Modes and Effects Analysis (FMEA) documents which are not always available and may not be detailed enough to cover different external factors (e.g., operating mode, manufacturer, weather).
Now that Large Language Models (LLMs) have become a commodity, there is a significant opportunity to leverage them to bridge this gap.
Inspired by prior work on LLM knowledge probing, we present a Multi-Agent System (MAS) specialized in helping industrial data scientists guide their modeling decisions. One challenge we address is the linguistic diversity and relevance of the generated questions, which we optimize using popular information diversity metrics and a grounded relevancy classifier.
We continuously monitor the set of newly generated instruction sets at the end of each round, compare their linguistic diversity against common baselines, and show high coverage of the generated knowledge on the downstream FMEA task.
We also conduct user studies to validate the quality of the questions.
We finally present the real-world implications of providing diverse, asset-specific information to aid data scientists' modeling decisions through our deployed MAS.
Through the deployed system, we show its generalizability to different assets and its extensibility to further downstream tasks such as work order scheduling, failure mode sensor analysis, and machine learning model recipe generation.
AAAI 2026 Main ConferenceDhaval Patel, Christodoulos Constantinides, Scott Kimbleton, Nishu Garg, Muhammad ParachaArtificial Intelligencetechnical paperfree
67
143498Automated Creation and Enrichment Framework for Improved Invocation of Enterprise APIs as ToolsRecent advancements in Large Language Models (LLMs) have led to the development of agents capable of complex reasoning and interaction with external tools. In enterprise contexts, the effective use of such tools, which are often enabled by application programming interfaces (APIs), is hindered by poor documentation, complex input or output schemas, and a large number of operations. These challenges make tool selection difficult and reduce the accuracy of payload formation by up to 25%. We propose ACE, an automated tool creation and enrichment framework that transforms enterprise APIs into LLM-compatible tools. ACE (i) generates enriched tool specifications with parameter descriptions and examples to improve selection and invocation accuracy, and (ii) incorporates a dynamic shortlisting mechanism that filters relevant tools at runtime, reducing prompt complexity while maintaining scalability. We validate our framework on both proprietary and open-source APIs and demonstrate its integration with agentic frameworks. To the best of our knowledge, ACE is the first end-to-end framework that automates the creation, enrichment, and dynamic selection of enterprise API tools for LLM agents.AAAI 2026 Main ConferenceHimanshu Gupta, Sameep Mehta, Prerna Agarwal, Renuka Sindhgatta, Soujanya Soni, Rohith VallamArtificial Intelligencetechnical paperfree
68
143497Digital Scale: Open-Source On-Device BMI Estimation from Smartphone Camera Images Trained on a Large-Scale Real-World DatasetEstimating Body Mass Index (BMI) from camera images with machine learning models enables rapid weight assessment when traditional methods are unavailable or impractical, such as in telehealth or emergency scenarios. Existing computer vision approaches have been limited to datasets of up to 14,500 images. In this study, we present a deep learning-based BMI estimation method trained on our WayBED dataset, a large proprietary collection of 84,963 smartphone images from 25,353 individuals. We introduce an automatic filtering method that uses posture clustering and person detection to curate the dataset by removing low-quality images, such as those with atypical postures or incomplete views. This process retained 71,322 high-quality images suitable for training. We achieve a Mean Absolute Percentage Error (MAPE) of 7.9% on our hold-out test set (WayBED data) using full-body images, the lowest value in the published literature to the best of our knowledge. Further, we achieve a MAPE of 13% on the completely unseen (during training) VisualBodyToBMI dataset, comparable with state-of-the-art approaches trained on it, demonstrating robust generalization. Lastly, we fine-tune our model on VisualBodyToBMI and achieve a MAPE of 8.56%, the lowest reported value on this dataset so far. We deploy the full pipeline, including image filtering and BMI estimation, on Android devices using the CLAID framework. We release our complete code for model training, filtering, and the CLAID package for mobile deployment as open-source contributions.AAAI 2026 Main ConferenceFrederik Manichand, Robin Deuber, Robert Jakob, Steve Swerling, Jamie Rosen, Elgar Fleisch, Patrick LangerArtificial Intelligenceposterfree
69
143496AdaptJobRec: Enhancing Conversational Career Recommendation Through an LLM-Powered Agentic SystemIn recent years, recommendation systems have evolved from providing a single list of recommendations to offering a comprehensive suite of topic-focused services. To better accomplish this task, conversational recommendation systems (CRS) have progressed from basic retrieval-augmented LLM generation to agentic systems with advanced reasoning and self-correction capabilities. However, agentic systems come with notable response latency, a longstanding challenge for conversational recommendation systems. To balance the trade-off between handling complex queries and minimizing latency, we propose AdaptJobRec, the first conversational job recommendation system that leverages an autonomous agent to integrate personalized recommendation algorithm tools. The system employs a user query complexity identification mechanism to minimize response latency. For straightforward queries, the agent directly selects the appropriate tool for rapid responses. For complex queries, the agent uses the memory processing module to filter chat history for relevant content, then passes the results to the intelligent task decomposition planner, and finally executes the tasks using personalized recommendation tools. Evaluation on Walmart’s real-world career recommendation scenarios demonstrates that AdaptJobRec reduces average response latency by up to 53.3% compared to competitive baselines, while significantly improving recommendation accuracy.AAAI 2026 Main ConferenceXintao Wu, Qixin Wang, Dawei Wang, Kun Chen, Yaowei Hu, Puneet Girdhar, Ruoteng Wang, Aadesh Gupta, Chaitanya Devella, Wenlai Guo, Shangwen Huang, Bachir Aoun, Greg Hayworth, Han LiArtificial Intelligenceposterfree
70
143494Transferable RL for Real-World Navigation Using Semantic Segmentation and Bird’s-Eye View AbstractionReinforcement Learning (RL) has shown significant promise in developing autonomous navigation algorithms for complex environments. However, the direct application of RL policies trained in simulation to real-world scenarios often faces challenges due to the reality gap. This paper proposes a two-stage system incorporating a segmentation strategy and a bird’s-eye-view (BEV) representation to mitigate the domain gap between simulation and reality. In the first stage, the segmentation transforms sensor data into a simplified and interpretable representation of the surrounding area, facilitating transferability across different deployments. In the second stage, the agent navigates using the BEV map and can be trained in a vectorized simulation environment, a setup that runs multiple parallel instances of the environment to provide a wide range of training scenarios. This vectorization enables rapid exposure to varied environmental conditions, thereby accelerating and diversifying the training of a deep RL agent to achieve optimal navigation behaviors while maintaining high-speed, in-bound trajectories. The segmentation is crucial because it supports generalization of the learned policy across different robotic platforms. The contribution of this paper lies in combining real-time semantic segmentation with a bird’s-eye-view navigation policy, resulting in a transferable and scalable framework for real-world deployment of RL-based navigation agents.
Experimental results demonstrate that agents trained with this methodology exhibit robust navigation performance and adaptability in both simulated and real-world environments, validating the efficacy of combining vectorized simulation with real-world segmentation for practical robotic navigation.AAAI 2026 Main ConferenceBenedikt Schlereth-Groh, Sakir Yöndem, Ramin KolagariArtificial Intelligencetechnical paperfree
71
143493Building Domain-Specific Small Language Models via Guided Data GenerationLarge Language Models (LLMs) have shown remarkable success in supporting a wide range of knowledge-intensive tasks. In specialized domains, there is growing interest in leveraging LLMs to assist subject matter experts with domain-specific challenges. However, deploying LLMs as SaaS solutions raises data privacy concerns, while many open-source models demand significant computational resources for effective domain adaptation and deployment. A promising alternative is to develop smaller, domain-specialized LLMs, though this approach is often constrained by the lack of high-quality domain-specific training data. In this work, we address these limitations by presenting a cost-efficient and scalable training pipeline that combines guided synthetic data generation from a small seed corpus with bottom-up domain data curation. Our pipeline integrates Domain-Adaptive Pre-training (DAPT), Domain-specific Supervised Fine-Tuning (DSFT), and Direct Preference Optimization (DPO) to train effective small-scale models for specialized use cases. We demonstrate this approach through DiagnosticSLM, a 3B-parameter language model tailored for fault diagnosis, root cause analysis, and repair recommendation in industrial settings. To evaluate model performance, we introduce four domain-specific benchmarks: multiple-choice questions (DiagnosticMCQ), question answering (DiagnosticQA), sentence completion (DiagnosticComp), and summarization (DiagnosticSum). DiagnosticSLM achieves up to 25% accuracy improvement over open-source models of comparable or larger size (2B-9B) on the MCQ task, while also outperforming or matching them in other tasks, demonstrating strong domain-specific reasoning and generalization capabilities.AAAI 2026 Main ConferenceChetan Gupta, Aman Kumar, Lasitha Vidyaratne, Ekant Amin, Xian Lee, Ahmed Farahat, Yuta KoreedaArtificial Intelligenceposterfree
72
143492Trauma THOMPSON: A Dataset and Realistic Generative Framework for AI Copilots in Emergency CareWe introduce Trauma THOMPSON, a dataset and suite of benchmarks designed to accelerate the development of AI-powered copilots for real-time decision-making in emergency and resource-limited medical settings. This work proposes a method to address a critical bottleneck for future deployment: models trained in simulation may not work well in the real world. The dataset features 3,717 unscripted, first-person video clips of five emergency procedures, uniquely including "just-in-time" (JIT) interventions that mirror the improvisational nature of field medicine. To obtain realistic patient data without the ethical and identity concerns that often accompany medical data, we also propose TraumaGen, a novel framework for generating photorealistic patient and wound images from manikins while preserving clinical context. We establish benchmarks for action recognition, anticipation, and visual question answering (VQA), evaluating state-of-the-art models to demonstrate the challenges and potential of our dataset. By focusing on realism and improvisation, Trauma THOMPSON provides a crucial resource and a clear path toward developing and validating robust AI assistants for future deployment in real-world emergency care. The dataset is available at https://anonymous.4open.science/r/dataset-58F3.AAAI 2026 Main ConferenceJuan Wachs, Yupeng Zhuo, Eddie Zhang, Xiangchen Yu, Aditya Pachpande, Andrew Kirkpatrick, Jessica MckeeArtificial Intelligenceposterfree
73
143491Automatic Funny Scene Extraction from Long-form Cinematic VideosAutomatically extracting engaging and high-quality humorous scenes from cinematic titles is pivotal for creating captivating video previews and snackable content, boosting user engagement on streaming platforms. Long-form cinematic titles, with their extended duration and complex narratives, challenge scene localization, while humor’s reliance on diverse modalities and its nuanced style add further complexity. This paper introduces an end-to-end system for automatically identifying and ranking humorous scenes from long-form cinematic titles, featuring shot detection, multimodal scene localization, and humor tagging optimized for cinematic content. Key innovations include a novel scene segmentation approach combining visual and textual cues, improved shot representations via guided triplet mining, and a multimodal humor tagging framework leveraging both audio and text modalities. Our system achieves an 18.3% AP improvement over state-of-the-art scene detection on the OVSD dataset and an F1 score of 0.834 for detecting humor in long text. Extensive evaluations across five cinematic titles demonstrate that 87% of clips extracted by our pipeline are intended to be funny, while 98% of scenes are accurately localized. With successful generalization to trailers, these results showcase the pipeline’s potential to enhance content creation workflows, improve user engagement, and streamline snackable content generation for diverse cinematic media formats.AAAI 2026 Main ConferenceSibendu Paul, Haotian Jiang, Caren ChenArtificial Intelligenceposterfree
74
143489Octopus: Entropy-Controlled Science Fiction Literature Generation with Persistent Memory-Context BindingLong-form science fiction generation demands rigorous maintenance of narrative coherence across evolving plots, character dynamics, and speculative world-building. We propose Octopus, an entropy-controlled neural framework with persistent memory-context binding that addresses these challenges through two key innovations: 1) dynamic entropy regulation balancing creativity and structural stability via narrative divergence thresholds, and 2) hierarchical memory architecture preserving character states, plot events, and scientific rules over 10K+ token spans. Evaluations across 12 sci-fi subgenres demonstrate Octopus's superiority over GPT-4 and ReAlign baselines, achieving 15.2% higher coherence scores (SciClarity) and 62% fewer contextual contradictions in extended narratives. Human evaluations confirm its effectiveness in maintaining speculative logic (4.7/5 vs. 3.1/5 baseline) while preserving creative diversity. The framework resolves the "hard sci-fi paradox" of enforcing scientific rigor without compromising narrative flexibility, establishing new capabilities for AI-assisted cross-media universe development.AAAI 2026 Main ConferenceLuqi Gong, Xu Wang, Jiaju Kang, Puyu Han, Zeyu AiArtificial Intelligencetechnical paperfree
75
143488PeerCoPilot: A Language Model-Powered Assistant for Behavioral Health OrganizationsBehavioral health conditions, which include mental health and substance use disorders, are the leading disease burden in the United States. Peer-run behavioral health organizations (PROs) critically assist individuals facing these conditions by combining mental health services with assistance for needs such as income, employment, and housing. However, limited funds and staffing make it difficult for PROs to address all service user needs. To assist peer providers at PROs with their day-to-day tasks, we introduce PeerCoPilot, a large language model (LLM)-powered assistant that helps peer providers create wellness plans, construct step-by-step goals, and locate organizational resources to support these goals. PeerCoPilot ensures information reliability through a retrieval-augmented generation pipeline backed by a large database of over 1,300 vetted resources. We conducted human evaluations with 15 peer providers and 6 service users and found that over 90% of users supported using PeerCoPilot. Moreover, we demonstrate that PeerCoPilot provides more reliable and specific information than a baseline LLM. PeerCoPilot is now used by a group of 5-10 peer providers at CSPNJ, a large behavioral health organization serving over 10,000 service users, and we are actively expanding PeerCoPilot's use.AAAI 2026 Main ConferenceFei Fang, Nev Jones, Hong Shen, Naveen Raman, Gao Mo, Megan Chai, Cindy Peng, Shannon Pagdon, Margaret SwarbrickArtificial Intelligencetechnical paperfree
76
143487PRECISE: Reducing the Bias of LLM Evaluations Using Prediction-Powered Ranking EstimationEvaluating the quality of e-commerce search systems traditionally requires a significant number of human relevance annotations. Recently, several deployed systems have explored using Large Language Models (LLMs) as automated judges for this task, although their inherent biases prevent direct use for metric estimation. We present a statistical framework extending Prediction-Powered Inference (PPI) that combines minimal human annotations with LLM judgments to produce reliable estimates of metrics which require sub-instance annotations. Our method requires as few as 100 human-annotated queries and 10,000 unlabeled examples, reducing annotation requirements significantly compared to traditional approaches. We formulate our proposed framework (PRECISE) for inference of relevance uplift for an LLM-based query reformulation application, extending PPI to sub-instance annotations at the query-document level. By reformulating the metric-integration space, we reduce the computational complexity from O(2^|C|) to O(2^K), where |C| represents the corpus size (on the order of millions). Detailed experiments across prominent retrieval datasets demonstrate that our method reduces the variance of estimates for the business-critical Precision@K metric, while effectively correcting for LLM bias in low-resource settings.AAAI 2026 Main ConferenceAnirban Majumder, Abhishek DivekarArtificial Intelligencetechnical paperfree
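The PRECISE abstract above builds on Prediction-Powered Inference. The core PPI idea, in its simplest mean-estimation form, can be sketched as follows; this is an illustrative sketch of generic PPI, not the paper's sub-instance extension, and the function name and toy numbers are hypothetical:

```python
import numpy as np

def ppi_mean(y_human, y_llm_paired, y_llm_pool):
    """Prediction-powered estimate of a mean metric.

    y_human:      human annotations on a small labeled set
    y_llm_paired: LLM judgments on that same labeled set
    y_llm_pool:   LLM judgments on a large unlabeled pool

    The estimate is the LLM mean over the large pool plus a
    "rectifier" that corrects the LLM's average bias, measured
    on the small human-labeled set.
    """
    rectifier = np.mean(np.asarray(y_human) - np.asarray(y_llm_paired))
    return float(np.mean(y_llm_pool) + rectifier)

# Toy example: the LLM judge systematically over-scores by 0.1;
# the rectifier learned from 4 labeled items removes that bias.
est = ppi_mean([1.0, 1.0, 0.0, 0.0],
               [1.1, 1.1, 0.1, 0.1],
               [0.6] * 10_000)
```

Here the rectifier is -0.1, so the biased pool mean of 0.6 is corrected to 0.5, which is why a small human-labeled set can de-bias a much larger set of LLM judgments.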
77
143486Balancing Accuracy and Efficiency in Multi-Turn Intent Classification for LLM-Powered Dialog Systems in ProductionAccurate multi-turn intent classification is essential for advancing conversational AI systems. Yet it remains challenging: the scarcity of comprehensive datasets and the complexity of contextual dependencies across dialogue turns hinder progress. This paper presents two novel approaches leveraging Large Language Models (LLMs) to enhance scalability and reduce latency in production dialogue systems. First, we introduce Symbol Tuning, which simplifies intent labels to reduce task complexity and improve performance in multi-turn dialogues. Second, we propose C-LARA (Consistency-aware, Linguistics Adaptive Retrieval Augmentation), a framework that employs LLMs for data augmentation and pseudo-labeling to generate synthetic multi-turn dialogues. These enriched datasets are used to fine-tune a small, efficient model suitable for deployment. Experiments conducted on multilingual dialogue datasets demonstrate significant improvements in classification accuracy and resource efficiency. Our methods enhance multi-turn intent classification accuracy by 5.09%, reduce annotation costs by 40%, and enable scalable deployment in low-resource multilingual industrial systems, highlighting their practicality and impact.AAAI 2026 Main ConferenceKwan Hui Lim, Bin Fu, Junhua Liu, Tan KeatArtificial Intelligencetechnical paperfree
78
143485LLM-Based Agent for Competitive Landscape Mapping in Drug Asset Due DiligenceIn this paper, we describe and benchmark a competitor-discovery component, an essential part of an agentic AI system for fast drug asset due diligence. A competitor-discovery AI agent, given an indication, retrieves all drugs comprising the competitive landscape of that indication and extracts canonical attributes for these drugs. The competitor definition is investor-specific, and the data is paywalled/licensed, fragmented across registries, ontology-mismatched by indication, alias-heavy for drug names, multimodal, and rapidly changing. Although considered the best tool for this problem, current LLM-based AI systems are not capable of reliably retrieving all competing drug names, and there is no accepted public benchmark for this task. To address the lack of evaluation, we use LLM-based agents to transform five years of multi-modal, unstructured due diligence memos from a private biotech VC fund into a structured evaluation corpus mapping indications to competitor drugs with normalized attributes. We also introduce a competitor-validating LLM-as-a-judge agent that filters out false positives from the list of predicted competitors to maximize precision and suppress hallucinations. On our benchmark, our competitor-discovery agent achieves 83% recall, exceeding OpenAI Deep Research (65%) and Perplexity Labs (60%). The system is deployed in production with enterprise users; in a case study with a biotech VC investment fund, analyst turnaround time dropped from 2.5 days to ~3 hours (~20x) for the competitive analysis.AAAI 2026 Main ConferenceAlisa Vinogradova, Vlad Vinogradov, Dmitrii Radkevich, Katsiaryna Yanchanka, Ilya Yasny, Dmitry Kobyzev, Andrey Doronichev, Ivan IzmailovArtificial Intelligencetechnical paperfree
79
143483A Deployed Investigative AI Search Engine for Combating Human Trafficking at Web ScaleHuman trafficking, affecting over 50 million people globally, is a complex criminal enterprise in which traffickers actively conceal and distribute information across fragmented and often illicit online platforms. Traditional investigative tools are ill-suited for detecting patterns across such obfuscated, heterogeneous data. This paper presents Domain-specific Insight Graphs (DIG), an investigative AI search engine designed to operate at web scale and enable non-technical decision-makers, such as law enforcement and prosecutors, to rapidly uncover actionable leads in human trafficking investigations. DIG employs a novel AI pipeline that ingests large, diverse web corpora (including trafficking-relevant advertisements), cleans and normalizes extracted information, and links entities into a semantic knowledge graph. A domain-optimized search layer allows investigators to traverse these graphs to identify potential victims, perpetrators, and trafficking networks. Unlike commercial alternatives, DIG was released free of charge, open-sourced, and deployed to over 200 U.S. state and local law enforcement agencies through the DARPA Memex program. Deployment results demonstrate measurable impact: in New York, agencies using DIG reported a drop in sex worker arrests and an increase in trafficking-related arrests from <1% to over 60%, disrupting cycles of victim re-victimization. The system has been credited in high-profile prosecutions and received endorsements from District Attorneys. This paper details the problem context, AI approach, deployment process, operational challenges, and lessons learned from maintaining DIG post-federal funding, including navigating intellectual property for open release and sustaining the system via philanthropic support. DIG exemplifies how AI-driven investigative tools can deliver lasting societal benefit through targeted, innovative application in high-stakes domains.AAAI 2026 Main ConferenceMayank KejriwalArtificial Intelligencetechnical paperfree
80
143482Deployed AI Agents for Industrial Asset Management: CodeReAct Framework for Event Analysis and Work Order AutomationMaintenance of mission-critical industrial assets is frequently hindered by fragmented data, inconsistent record-keeping, and limited access to analytical expertise, resulting in reactive rather than predictive practices. We present CodeReAct, an AI-powered agentic framework deployed in large-scale facilities to automate event analysis and work order (WO) management. CodeReAct extends the ReAct paradigm by embedding executable Python code within the Thought-Action-Observation (TAO) loop, enabling natural language interaction, grounding heterogeneous alerts and work orders into structured Business Objects (BOs), and dynamically invoking analytic functions for forecasting, anomaly correlation, and maintenance recommendations. This architecture reduces manual data science intervention, improves adaptability, and supports reuse across asset types. Deployed in a mission-critical data center and productionized in Maximo, CodeReAct manages pumps, chillers, AHUs, compressors, cooling towers, and other mechanical and electrical systems. Evaluation with 36 representative maintenance utterances showed that outer-loop reflection and adaptive temperature improved task completion by up to 20%, while ablation studies confirmed the importance of reasoning in addition to code execution. Business validation revealed seasonal failure patterns, bundling opportunities, and predictive accuracy trends. In production, site engineers reported 25-40% faster diagnostics, fewer unplanned downtime events, and reduced reliance on specialized analysts. Lessons learned highlight the importance of structured BOs for grounding analytics, runtime safeguards to mitigate hallucinations, and adaptive model control for consistent execution. These results demonstrate how deployed agentic AI can deliver measurable business value in predictive and strategic maintenance planning.AAAI 2026 Main ConferenceNianjun Zhou, Dhaval Patel, Anamitra BhattacharyyaArtificial Intelligenceposterfree
81
143481SARA: Leveraging LLM Agents and Jurisprudential Ontologies for Automated Legal ReasoningDelivering judicial decisions requires interpreting complex legal texts, analyzing evidence, and reasoning over jurisprudence and legal principles. Recent advances in Generative Artificial Intelligence, particularly Large Language Models (LLMs), have shown potential to automate parts of this process, yet practical, measurable benefits in real-world judicial settings remain limited. This paper introduces SARA, an LLM-powered legal reasoning platform deployed in a regional Brazilian court, which demonstrates significant efficiency and quality gains through the integration of LLM agents with a Jurisprudential Knowledge Graph (Jur-KG). SARA automatically extracts and structures key elements from legal documents—including claims, requests, and evidence—and generates reasoning grounded in retrieved jurisprudential precedents. The Jur-KG, modeled through an ontology encompassing concepts such as LegalRelation, LegalGrounds, and LegalClaims, enables semantic matching and retrieval of relevant case law. By representing cases according to the Legal Case Ontology for the Brazilian Judicial System, SARA supports traceable reasoning and addresses competence questions to assess coverage, coherence, and justification of AI-generated outputs. Deployment results indicate measurable improvements in processing time, consistency, and explainability, while ensuring compliance with ethical and legal guidelines established by Brazil’s National Council of Justice. This work demonstrates that combining LLM-based agents with domain-specific knowledge graphs can yield both innovative capabilities and proven impact in judicial decision-making.AAAI 2026 Main ConferenceVasco Furtado, Joao Neto, Vladia Pinheiro, Francisco Bonfim, Sara Silva, Alicia Neves, Henrique Santos, Jorge Araujo, Rilder Pires, Ricardo CostaArtificial Intelligencetechnical paperfree
82
143480Discovery of Feasible 3D Printing Configurations for Metal Alloys via AI-Driven Adaptive Experimental DesignConfiguring the parameters of additive manufacturing processes for metal alloys is a challenging problem due to complex relationships between input parameters (e.g., laser power, scan speed, and material feed rate) and the quality of printed outputs. The standard trial-and-error approach to find feasible parameter configurations is highly inefficient because validating each input configuration is expensive in terms of resources (physical and human labor) and the configuration space is very large. This paper applies the general principle of AI-driven adaptive experimental design for optimization to the more challenging problem of discovering feasible configurations. The key idea is to build a probabilistic surrogate model from past experiments to intelligently select a small batch of input configurations for validation in each iteration. To demonstrate the effectiveness of this methodology, we deploy it for the Directed Energy Deposition (DED) process to print GRCop-42, a high-performance copper–chromium–niobium alloy developed by NASA for extreme-temperature aerospace applications. Within weeks, our approach yielded multiple defect-free outputs across a range of laser powers—dramatically reducing time-to-result and resource expenditure compared to four months of manual experimentation by our collaborators with little to no success. By enabling high-quality GRCop-42 fabrication on readily available infrared laser platforms for the first time, we democratize access to this critical alloy, paving the way for cost-effective, decentralized production of rocket engine chambers, heat exchangers, and other high-heat-flux components.AAAI 2026 Main ConferenceAryan Deshwal, Jana Doppa, Azza Fadhel, Nathaniel Zuckschwerdt, Susmita Bose, Amit BandyopadhyayArtificial Intelligencetechnical paperfree
83
143479Optimizing Preferential Rate in Retail Lending with Causal Inference and Domain AdaptationIn retail lending, offering preferential interest rates is a core marketing instrument for balancing customer acquisition with portfolio profitability. Accurately predicting the effect of interest-rate discounts for each customer is pivotal for optimizing the discount strategy: offering overly generous discounts erodes margins, while insufficient discounts drive price-sensitive customers to defect. Off-the-shelf machine learning uplift models rarely respect the complex operational constraints of financial businesses, such as tiered rate grids, regulatory guardrails, and marketing budget ceilings. We propose an integrated system that fuses causal inference and domain adaptation to produce constraint-aware, customer-specific discount recommendations. To further enhance practitioner adoption, a large language model layer translates model outputs into actionable narratives. Developed at Hyundai Capital Services, the system boosted transaction volume by 13%, demonstrating both technical soundness and material business impact.AAAI 2026 Main ConferenceWooyoung Kim, Jaehyun Kim, Kee-Eung Kim, Sumin Shin, Jimyung Choi, Yujin Lee, Hyeryeong OhArtificial Intelligencetechnical paperfree
84
143478From Natural Language to Executable ETL Flows: The IBM DataStage AssistantModern ETL (Extract, Transform, Load) tools offer graphical, no-code interfaces for workflow creation but still require users to manually identify transformation functions and configure their properties, which is time-consuming and demands prior expertise. We present the research and engineering foundations of the IBM DataStage Assistant, a deployed capability that generates complete multi-stage ETL flows directly from natural language (NL) descriptions. Our framework infers transformation functions, their properties, and transformer expressions, enabling novices to discover relevant functions and allowing experts to bypass manual configuration. The proposed framework achieves a prediction accuracy of 96.4% for flow predictions, 87.0% for properties, and 83.6% for transformer expressions. We also show a document exploration module that uses retrieval-augmented generation (RAG) over product documentation to answer tool-specific questions in NL. Implemented in IBM DataStage, this approach supports iterative, in-environment workflow design and reduces context switching. In initial studies, it achieves up to 90% time savings for novices and 50% for experts.AAAI 2026 Main ConferenceSameep Mehta, Nitin Gupta, Thomas Gschwind, Shramona Chakraborty, Tristan Tyler, Shreya Sisodia, Ben ClermontArtificial Intelligencetechnical paperfree
85
143477ConstructAI: From Real-Time Safety Insight to Skill Growth in Deployed Construction AI SystemsEnsuring safety in power grid construction remains a critical yet challenging task, as existing monitoring approaches often lack scalability, timeliness, and adaptability to diverse on-site conditions. To address these limitations, we present ConstructAI, a deployed AI-driven safety management system that integrates multi-source image and video acquisition devices with advanced multimodal large model reasoning. The system combines text, image, and video prompts through an efficient workflow powered by LLaMA3 and Meta SAM2 backbones, enhanced with LoRA and adaptor modules for multimodal fusion. Once deployed, ConstructAI continuously processes real-time construction footage to identify violations, assess risk levels, and generate standardized rectification requirements. The deployment has demonstrated measurable benefits across multiple sites, including a >70% increase in violation rectification rates, reduction of average rectification delays from hours to minutes, and a 45% decline in repeat violations. Beyond technical gains, ConstructAI has delivered significant business impacts, such as reduced safety incidents, improved compliance with national regulations, and higher operational efficiency. By enabling proactive risk management and structured safety feedback loops, our system exemplifies how innovative use of AI can translate into tangible improvements for industrial safety. The lessons learned from deployment highlight the importance of balancing algorithmic advances with practical integration into organizational workflows.AAAI 2026 Main ConferenceWei Wang, Jiang Zheng, Gaowei Zhang, Tiong Kong, Kai Xing, Huan LiArtificial Intelligencetechnical paperfree
86
143476Scalable and Efficient Large-Scale Log Analysis with LLMs: An IT Software Support Case StudyIT environments typically have logging mechanisms to monitor system health and detect issues. However, the huge volume of generated logs makes manual inspection impractical, highlighting the importance of automated log analysis in IT Software Support. In this paper, we propose a log analytics tool that leverages Large Language Models (LLMs) for log data processing and issue diagnosis, enabling the generation of automated insights and summaries. We further present a novel approach for efficiently running LLMs on CPUs to process massive log volumes in minimal time without compromising output quality. We share the insights and lessons learned from deployment of the tool - in production since March 2024 - scaled across 70 software products, processing over 2000 tickets for issue diagnosis, achieving a time savings of 300+ man hours and an estimated $15,444 per month in manpower costs compared to the traditional log analysis practices.AAAI 2026 Main ConferenceDebanjana Kar, Prateeti Mohapatra, Harshit Kumar, Seema Nagar, Pranjal Gupta, Karan BhukarArtificial Intelligencetechnical paperfree
87
143475Optimizing Product Provenance Verification Using Data Valuation MethodsDetermining and verifying product provenance remains a critical challenge in global supply chains, particularly as geopolitical conflicts and shifting borders create new incentives for misrepresentation of commodities, such as hiding the origin of illegally harvested timber or stolen agricultural products. Stable Isotope Ratio Analysis (SIRA), combined with Gaussian process regression-based isoscapes, has emerged as a powerful tool for geographic origin verification. While these models are now actively deployed in operational settings supporting regulators, certification bodies, and companies, they remain constrained by data scarcity and suboptimal dataset selection. In this work, we introduce a novel deployed data valuation framework designed to enhance the selection and utilization of training data for machine learning models applied in SIRA. By quantifying the marginal utility of individual samples using Shapley values, our method guides strategic, cost-effective, and robust sampling campaigns within active monitoring programs. By prioritizing highly informative samples, our approach improves model robustness and predictive accuracy across diverse datasets and geographies. Our framework has been implemented and validated in a live provenance verification system currently used by enforcement agencies, demonstrating tangible, real-world impact. Through extensive experiments and this live deployment, we show that the framework significantly enhances provenance verification, mitigates fraudulent trade practices, and strengthens regulatory enforcement of global supply chains.AAAI 2026 Main ConferenceChang-Tien Lu, Jakub Truszkowski, Naren Ramakrishnan, Ruoxi Jia, Hoang Just, Raquib Bin Yousuf, Shengzhe Xu, Brian Mayer, Victor Deklerck, John Simeone, Jade SaundersArtificial Intelligencetechnical paperfree
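The data valuation idea in the abstract above, quantifying each sample's marginal utility with Shapley values, can be made concrete with a minimal exact implementation for tiny datasets. This is illustrative only: the paper's deployed system and utility function are not shown, and at realistic scale Monte Carlo sampling over permutations replaces the exact sum.

```python
import itertools
import math

def shapley_values(n, utility):
    """Exact Shapley value for each of n training samples.

    utility(S) maps a set of sample indices to model utility
    (e.g., validation accuracy of a model trained on S). Each
    sample's value is its marginal contribution averaged over
    all coalitions, weighted by the number of orderings that
    place the coalition before the sample. Exponential in n,
    so this is for illustration only.
    """
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # weight = |S|! (n - |S| - 1)! / n!
            weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
            for subset in itertools.combinations(others, k):
                s = set(subset)
                values[i] += weight * (utility(s | {i}) - utility(s))
    return values
```

For an additive utility such as utility(S) = len(S), every sample receives value 1.0, matching the Shapley efficiency axiom; informative samples in a real utility would stand out with larger values.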
88
143474Layout-Aware Document Parsing with Visual-Linguistic Fusion: The DATA-LUX with Academic Content Service ProviderMany organizations are increasingly relying on unstructured documents such as PDFs and scanned forms to support downstream large language model (LLM) services, including search, summarization, and recommendation. However, traditional OCR systems struggle with diverse document layouts, leading to frequent errors and high labor costs. This study therefore developed DATA-LUX, a robust document layout system that transforms unstructured documents into structured, machine-readable data suitable for automation. Built on a transformer-based detector, DATA-LUX incorporates several modules for layout refinement, text-visual fusion, and layer-wise optimization to improve coherence and generalization across diverse layouts. Around January 2025, we deployed DATA-LUX at one of the largest academic content service firms in South Korea (Nurimedia). The firm faced the challenge of extracting metadata and references from thousands of academic papers submitted in various formats; existing LLM-based tools gave unreliable results, so papers had to be processed manually, creating bottlenecks in both labor and time. DATA-LUX enabled the automatic structuring of over 100,000 research papers a year, improving extraction accuracy to over 97%, reducing costs by more than USD 185K annually, and accelerating processing by 8.7 times. These deployment results suggest that DATA-LUX enables scalable and efficient document automation in complex, high-volume environments. We thus believe DATA-LUX can have a significant impact on both academic and industry practice.AAAI 2026 Main ConferenceJae Hong Park, Min Kim, Yeonkyung Kim, Jae Lee, Ki Kim, Ji KwakArtificial Intelligencetechnical paperfree
89
143473Large Scale Retrieval for the LinkedIn Feed Using Causal Language ModelsIn large-scale recommendation systems like LinkedIn’s, the retrieval stage is critical for narrowing billions of potential candidates to a manageable subset for ranking. LinkedIn's feed now serves suggested content based on the topical interests of members, where 2000 candidates are retrieved from several million candidates within a latency budget of a few milliseconds at an inbound QPS of several thousand per second. This paper presents a novel retrieval approach that fine-tunes a large causal language model (Meta’s LLaMA 3) as a dual encoder to generate high-quality embeddings for both users (members) and content (items), using only textual input. We describe the end-to-end pipeline, including prompt design for embedding generation, techniques for fine-tuning at LinkedIn scale, and infrastructure for low-latency, cost-effective online serving. We share findings on how quantizing numerical features in the prompt allows that information to be encoded in the embedding, improving alignment between the retrieval and ranking layers. The system was evaluated using offline metrics and an online A/B test, which showed substantial improvements in member engagement. We observed significant gains among newer members, who often lack strong network connections, indicating that high-quality suggested content aids retention. This work demonstrates how generative language models can be effectively adapted for real-time, high-throughput retrieval in industrial applications.AAAI 2026 Main ConferenceHamed Firooz, Hejian Sang, Sudarshan Ramanujam, Siddharth Dangi, Birjodh Tiwana, Saurabh Kataria, Antonio Alonso, David Byrne, Sojeong Ha, Manas Somaiya, Sen Zhou, Zhoutao Pei, Andrei Akterskii, Zhanglong Liu, Samira Sriram, Zihan Xiong, Akhilesh Gupta, Angela Shao, Alex Li, Caitlin Kolb, Thomas Kistler, Zach MooreArtificial Intelligencetechnical paperfree
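At serving time, the dual-encoder retrieval described in the abstract above reduces to nearest-neighbor search over member and item embeddings. A minimal cosine-similarity sketch follows; it is illustrative only, with plain vectors standing in for the LLaMA-3-derived embeddings, and a production system would use an approximate nearest-neighbor index rather than a full scan:

```python
import numpy as np

def top_k(member_emb, item_embs, k=3):
    """Score every candidate item by cosine similarity with the
    member embedding and return the indices of the k best.

    member_emb: (d,) vector for one member
    item_embs:  (n, d) matrix, one row per candidate item
    """
    m = member_emb / np.linalg.norm(member_emb)
    items = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
    scores = items @ m                # cosine similarity per item
    return np.argsort(-scores)[:k]   # highest-scoring items first
```

With both towers producing embeddings offline, only this similarity search runs per request, which is what makes a millisecond-scale latency budget over millions of candidates feasible.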
90
143472Who Is a Better Matchmaker? Human vs. Algorithmic Judge Assignment in a High-Stakes Startup CompetitionThere is increasing interest in applying artificial intelligence (AI) to automate and support complex decision-making tasks. However, it remains unclear how algorithms compare to human judgment in contexts requiring semantic understanding and domain expertise. We examine this in the context of the judge assignment problem (matching submissions to suitably qualified evaluators) at a prominent U.S. university startup competition. Awarding over $500,000 annually, this is a real-world setting where high-quality judge assignment is critical. We develop and deploy HLSE (Hybrid Lexical–Semantic Similarity Ensemble), an AI-based approach, at the competition and compare algorithmic against human expert assignments by collecting blinded match quality scores from judges for 309 judge-venture matches. Using a test based on the Mann–Whitney U statistic, we found no statistically significant difference in assignment quality between the two approaches (AUC = 0.48, p = 0.40). On average, algorithmic matches are rated 3.90 and manual matches 3.94 on a 5-point scale, where 5 indicates an excellent match. Furthermore, manual assignments that took a full week in past years can be completed in under ten minutes by the algorithm during deployment. These results demonstrate that HLSE achieves human-expert-level matching quality while offering greater scalability and efficiency, underscoring the potential of AI-driven solutions to robustly support and enhance human decision-making for judge assignment in high-stakes settings.AAAI 2026 Main ConferenceNihar B. Shah, Yang (Sarina) Xi, Orelia Pi, Miaomiao Zhang, Rebecca Xiong, Jacqueline LaneArtificial Intelligencetechnical paperfree
91
143471Reducing Alert Fatigue Through AI Ranking: A Deployed Public Health Data Monitoring SystemPublic health experts need scalable methods to monitor large volumes of health data (e.g., human-reported cases, hospitalizations, deaths). These methods must identify individual data points that may indicate significant events, such as outbreaks, or reveal data quality issues. Identifying, triaging, and analyzing these data points in real time is critical for preventing downstream errors in forecasting or policy. Traditional alert-based data monitoring systems, used in practice for decades, fail to identify relevant data events for several reasons: for example, they may not produce real-time results at large data volumes, or they may return tens of thousands of unhelpful alerts.

We introduce a human-in-the-loop AI system for public health data monitoring that uses a ranking-based AI anomaly detection method. This system was developed through a multi-year interdisciplinary collaboration with participatory design from researchers, engineers, and public health data experts. Through this process, we identified system goals, such as user control and efficiency, and designed a system that balances them. The system has since been deployed at a national public health organization and analyzes up to 5 million data points daily. A three-month longitudinal deployment evaluation revealed significant improvements on these goals, including a 54x increase in data reviewer efficiency and increased engagement compared to traditional alert-based methods.
AAAI 2026 Main ConferenceBryan Wilder, Nolan Gormley, Roni Rosenfeld, Catalina Vajiac, Tina Townes, Ananya Joshi, Richa GadgilArtificial Intelligencetechnical paperfree
92
143469TTF: A Trapezoidal Temporal Fusion Framework for LTV Forecasting in DouyinIn the user growth scenario, Internet companies invest heavily in paid acquisition channels to acquire new users, but sustainable growth depends on acquired users generating lifetime value (LTV) that exceeds customer acquisition cost (CAC). To maximize the LTV/CAC ratio, it is crucial to predict channel-level LTV at an early stage for further optimization of budget allocation. The LTV forecasting problem differs significantly from traditional time series forecasting, and it poses three main challenges. First, it is an unaligned multi-time-series forecasting problem, in which each channel has many LTV series with different activation dates. Second, predicting at an early stage faces the imbalanced short-input long-output (SILO) challenge. Third, compared with commonly used time series datasets, real LTV series are volatile and non-stationary, with more frequent fluctuations and higher variance. In this work, we propose a novel framework called Trapezoidal Temporal Fusion (TTF) to address these challenges. We introduce a trapezoidal multi-time-series module to deal with the data unalignment and SILO challenges, and output accurate predictions with a multi-tower structure called MT-FusionNet. The framework has been deployed in the online system of Douyin. Compared to the previously deployed online model, MAPE_p decreased by 4.3% and MAPE_a decreased by 3.2%, where MAPE_p denotes the point-wise MAPE of the LTV curve and MAPE_a denotes the MAPE of the aggregated LTV.AAAI 2026 Main ConferenceFan Wu, Zhenzhe Zheng, Chaoli Zhang, Yibing Wan, Zhengxiong Guan, Xiaoyang Li, Lai Xu, Beibei JiaArtificial Intelligencetechnical paperfree
93
143468Save, Revisit, Retain: A Scalable Framework for Enhancing User Retention in Large-Scale Recommender SystemsUser retention is a critical objective for online platforms like Pinterest, as it strengthens user loyalty and drives growth through repeated engagement. A key indicator of retention is revisitation, i.e., when users return to view previously saved content, a behavior often sparked by personalized recommendations and user satisfaction. However, modeling and optimizing revisitation poses significant challenges. One core difficulty is accurate attribution: it is often unclear which specific user actions or content exposures trigger a revisit, since many confounding factors (e.g., content quality, user interface, notifications, or even changing user intent) can influence return behavior. Additionally, the scale and timing of revisitations introduce further complexity; users may revisit content days or even weeks after their initial interaction, requiring the system to maintain and associate extensive historical records across millions of users and sessions. These complexities render existing methods insufficient for robustly capturing and optimizing long-term revisitation.

To address these gaps, we introduce a novel, lightweight, and interpretable framework for modeling revisitation behavior and optimizing long-term user retention in Pinterest’s search-based recommendation context. By defining a surrogate attribution process that links saves to subsequent revisitations, we reduce noise in the causal relationship between user actions and return visits. Our scalable event aggregation pipeline enables large-scale analysis of user revisitation patterns and enhances the ranking system’s ability to surface items with high retention value. Deployed on Pinterest’s Related Pins surface to serve 500+ million users, the framework led to a significant lift of 0.1% in daily active users (DAU) and 0.08% in weekly active users (WAU) without additional computational costs. Our data analysis reveals novel insights, such as the impact of content topics on revisitation rates; for example, users are more likely to revisit aesthetically pleasing topics.
AAAI 2026 Main ConferenceWeijie Jiang, Armando Ordorica, Jaewon Yang, Olafur Gudmundsson, Yucheng Tu, Huizhong DuanArtificial Intelligencetechnical paperfree
94
143467TRUST: Transaction Risk via Unified Sequence and TopologyAbuse detection in e-commerce platforms is critical for preventing operational losses, particularly for transaction types vulnerable to abuse such as Return-to-Origin (RTO) in Cash-on-Delivery (COD) workflows. Detecting such abuse requires accurate, real-time decisions to intercept malicious orders before placement, imposing stringent sub-second latency requirements on deployed systems. In this work, we present TRUST, a deployed, production-scale abuse detection system based on a unified architecture of heterogeneous Graph Neural Networks (GNNs) and Transformer-based sequence encoders. This design enables joint reasoning over multi-relational entity interactions and temporal behavioural signals, allowing the model to combine complementary information for effective abuse detection when either modality is sparse or absent. TRUST processes millions of transactions daily with an average inference latency of ~25 ms, achieving a ~9.6% absolute precision improvement over a strong XGBoost baseline in live RTO detection. We report systematic ablation studies across both graph and sequence stages, evaluating GNN variants, sampling strategies, sequence lengths, and positional encoding schemes to guide architectural choices. Deployed end-to-end in a high-throughput environment, TRUST demonstrates that GNN–Transformer cascades can deliver state-of-the-art accuracy, scalability, and operational reliability in real-world abuse detection, offering a reproducible blueprint for similar industry-scale applications.AAAI 2026 Main ConferenceDebdoot Mukherjee, Bhavuk Singhal, Anshu Aditya, Shubham Jain, Debashis Mukherjee, Rithvik Y, Akshat Garg, Karan TanwarArtificial Intelligenceposterfree
95
143466RescueLens: LLM-Powered Triage and Action on Volunteer Feedback for Food RescueFood rescue organizations simultaneously tackle food insecurity and waste by working with volunteers to redistribute food from donors who have excess to recipients who need it. Volunteer feedback allows food rescue organizations to identify issues early and ensure volunteer satisfaction. However, food rescue organizations monitor feedback manually, which can be cumbersome and labor-intensive, making it difficult to prioritize which issues are most important. In this work, we investigate how large language models (LLMs) can assist food rescue organizers in understanding and acting on volunteer experiences. We work with 412 Food Rescue, a large food rescue organization based in Pittsburgh, Pennsylvania, to design RescueLens, an LLM-powered tool that automatically categorizes volunteer feedback, suggests donors and recipients to follow up with, and updates volunteer directions based on feedback. We evaluate the performance of RescueLens on an annotated dataset and show that it can recover 96% of volunteer issues at 71% precision. Moreover, by ranking donors and recipients according to their rates of volunteer issues, RescueLens allows organizers to focus on the 0.5% of donors responsible for more than 30% of volunteer issues. RescueLens is now deployed at 412 Food Rescue, and through semi-structured interviews with organizers we find that RescueLens streamlines the feedback process so that organizers can better allocate their time.AAAI 2026 Main ConferenceFei Fang, Zhiyu Chen, Zheyuan Ryan Shi, Naveen Raman, Jingwu Tang, Sean Hudson, Ameesh KapoorArtificial Intelligencetechnical paperfree
96
143465EssayBench: Evaluating Large Language Models in Multi-Genre Chinese Essay WritingPrompt-based essay writing is an effective and common way to assess students' critical thinking skills.
Recent work has evaluated the impressive capabilities of Large Language Models (LLMs) on this task. However, most studies focus primarily on English. Those examining LLMs' performance in Chinese often rely on coarse-grained text quality metrics, overlooking the structural and rhetorical complexities of Chinese essays, particularly across diverse genres. We therefore propose EssayBench, a multi-genre benchmark specifically designed for Chinese essay writing, along with a fine-grained, genre-specific scoring framework that hierarchically aggregates scores to better align with human preferences.
The dataset comprises 728 real-world prompts across four major genres (Argumentative, Narrative, Descriptive, and Expository), and includes both Open-Ended and Constrained types.
Our evaluation protocol is validated through a comprehensive human agreement study. The results show that our protocol aligns well with human judgments, achieving a Spearman correlation of up to 0.816 and outperforming coarse-grained evaluation methods by an average of 8.6%. Finally, we benchmark 15 LLMs, analyzing their strengths and limitations across genres and instruction types. We believe EssayBench offers a more reliable framework for evaluating Chinese essay generation and provides valuable insights for improving LLMs in this domain.
AAAI 2026 Main ConferenceBaojun Wang, Fei Mi, Lifeng Shang, Fan Gao, Dongyuan Li, Yasheng Wang, Ding XiaArtificial Intelligenceposterfree
97
143464GEM: Generative Entropy-Guided Preference Modeling for Few-Shot Alignment of LLMsAlignment of large language models (LLMs) with human preferences typically relies on supervised reward models or external judges, which in turn require abundant preference data. We propose a generative preference modeling approach for low-resource and domain-specific scenarios, reframing preference learning as an inverse reinforcement learning problem. Instead of training a discriminative reward model, we train the LLM itself to infer and maximize an implicit reward function underlying high-quality reasoning. Specifically, we leverage Chain-of-Thought (CoT) sampling to generate diverse candidate solutions for each query and derive fine-grained preferences from them without additional human labels. We also introduce an entropy-guided token scoring mechanism to rank and weight the sampled CoTs, boosting the importance of high-confidence answers and strategically high-entropy tokens. Building on this, we train the model with our Self-Evaluated Group Advantage (SEGA) algorithm, which efficiently utilizes the fine-grained preference information in group candidate solutions to update the policy. Our method eliminates dependence on external judges or reward classifiers, instead relying on the generative model’s own judgments. Experiments on general benchmarks and domain-specific tasks—such as mathematical reasoning and medical question answering—demonstrate that our generative preference model achieves significant improvements with limited data.AAAI 2026 Main ConferenceXuejiao Zhao, Yiyang Zhao, Bai HuiyuArtificial Intelligencetechnical paperfree
98
143463Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language ModelsLarge Language Models (LLMs) are prone to generating fluent but incorrect content, known as confabulation, which poses increasing risks in multi-turn or agentic applications where outputs may be reused as context. In this work, we investigate how in-context information influences model behavior and whether LLMs can identify their unreliable responses. We propose a reliability estimation method that leverages token-level uncertainty to guide the aggregation of internal model representations. Specifically, we compute aleatoric and epistemic uncertainty from output logits to identify salient tokens and aggregate their hidden states into compact representations for response-level reliability prediction. Through controlled experiments on open QA benchmarks, we find that correct in-context information improves both answer accuracy and model confidence, while misleading context often induces confidently incorrect responses, revealing a misalignment between uncertainty and correctness. Our probing-based method captures these shifts in model behavior and improves the detection of unreliable outputs across multiple open-source LLMs. These results underscore the limitations of direct uncertainty signals and highlight the potential of uncertainty-guided probing for reliability-aware generation.AAAI 2026 Main ConferenceSanjay Chawla, Tianyi Zhou, Johanne MedinaArtificial Intelligenceposterfree
99
143462TWINFUZZ: Dual-Model Fuzzing for Robustness Generalization in Deep LearningDeep learning (DL) models are increasingly deployed in safety-critical applications such as face recognition, autonomous driving, and medical diagnosis. Despite their impressive accuracy, they remain vulnerable to adversarial examples: subtle perturbations that can cause incorrect predictions, i.e., robustness issues. While adversarial training improves robustness against known attacks, it often fails to generalize to unseen or stronger threats, revealing a critical gap in robustness generalization. In this work, we propose a dual-model fuzzing framework to enhance generalized robustness in DL models. Central to our method is a lightweight metric, the Lagrangian Information Bottleneck (LIB), which guides entropy-based mutation toward semantically meaningful and high-risk regions of the input space. The executor uses a resistant model and a more error-prone vulnerable model; their prediction consistency forms the basis of agreement mining, a label-free oracle for isolating decision-boundary samples. To ensure fuzzing effectiveness, we further introduce a task-driven seed selection strategy (e.g., SSIM for vision) that filters out low-quality inputs. We implement a prototype, TWINFUZZ, and evaluate it on six benchmark datasets and nine DL models. Compared with state-of-the-art testing approaches, TWINFUZZ achieves superior improvements in both training-specific and generalized robustness.AAAI 2026 Main ConferenceXi Xiao, Xiaogang Zhu, Shaohua Wang, Kun Hu, Wentao Mo, Enze Dai, Sheng Wen, Yang XiangArtificial Intelligenceposterfree
100
143461ACID Test: A Benchmark for Cultural Safety and Alignment in LALMsLarge Audio Language Models (LALMs) are transforming AI by directly processing and generating human language from audio. As these models proliferate in real-world applications, evaluating their performance for equitable and safe use across diverse linguistic and cultural contexts becomes paramount. This paper presents the first comprehensive study on cultural preferences and biases in LALMs across multilingual and multicultural settings. We extend existing cultural harm frameworks from text-based models to the audio domain, analysing how linguistic and cultural diversity influence LALM behaviour, sensitivity, and output quality. Our research uncovers unique challenges in interpreting cultural nuances from audio and linguistic variations. We introduce a novel multilingual audio-text dataset (10 languages, including English), the Audio Cultural Intelligence Dataset (ACID Benchmark), spanning 1,315 hours of audio, specifically for evaluating LALM cultural biases, marking the first such examination in this emerging area. Our comprehensive analysis includes 10 open-source and 2 closed-source models, demonstrating significant performance disparities across languages and cultural contexts and highlighting the audio modality's influence on bias manifestation. These findings highlight the critical need to evaluate LALMs not only for technical accuracy but also for fair and culturally sensitive performance, urging the development of inclusive datasets and cultural awareness for building safer and more equitable large audio language models. The ACID benchmark will be made publicly available.AAAI 2026 Main ConferenceRicha Singh, Mayank Vatsa, Bikash Dutta, Adit Jain, Rishabh RanjanArtificial Intelligenceposterfree