The intersection of generative artificial intelligence and digital accessibility has entered a transformative phase with the introduction of "agentic" browser features. These technologies, exemplified by Google’s Auto Browse in Gemini for Chrome and Microsoft’s Copilot Actions in Edge, represent a departure from static screen-reading software toward dynamic, reasoning-based assistants.1 While traditional assistive technologies (AT) rely on the explicit semantic structure of the Document Object Model (DOM) and the Accessibility Object Model (AOM), agentic AI seeks to interpret the visual and functional intent of a web application to perform multi-step tasks autonomously.4 However, as observed in enterprise environments such as the Salesforce Experience Cloud and Trailhead Marketplace, the performance of these agents remains inconsistent. The unreliability of these systems is not merely a byproduct of early-stage development but is rooted in fundamental tensions between autonomous interaction, browser security architectures, and the structural opacity of modern web frameworks.7
To understand the operational hurdles facing these technologies, it is necessary to examine the technical foundations of native browser agents. Unlike standard extensions that interact with the web through high-level APIs, Gemini in Chrome and Copilot in Edge are integrated directly into the browser’s runtime environment.4
Google’s Auto Browse, launched as a headline feature for Gemini in Chrome on January 28, 2026, utilizes the Gemini 3 multimodal model to bridge the gap between user intent and browser execution.13 The system functions by launching a browsing session within a protected Chrome profile sandbox, leveraging the Chrome DevTools Protocol (CDP) to interact with web elements.4 This native integration allows the AI to "see" the page through a hybrid lens of raw HTML structure and multimodal visual interpretation.4
| Metric | Gemini Auto Browse Technical Specification |
| --- | --- |
| Core Model | Gemini 3 (Multimodal) 13 |
| Execution Layer | Native Chrome Runtime / CDP 4 |
| Context Integration | Google Workspace (Gmail, Calendar, Drive, Maps) 13 |
| Interaction Modes | Multi-step navigation, form filling, price comparison 1 |
| Quota Limits | 20–200 tasks per day, depending on subscription tier 15 |
| Security Framework | Permission-based control model / Profile Sandboxing 4 |
Gemini's demonstrated ability to select options from previously inaccessible comboboxes in the Salesforce Trailhead Career Marketplace illustrates how the model resolves elements that lack traditional accessibility markup.13 Because Gemini can process images and video content in real time, it can identify the visual "bounds" of a dropdown menu even if the aria-expanded state or role="combobox" attribute is missing or misconfigured.5 However, this reliance on visual reasoning introduces non-determinism; if the layout shifts or the rendering engine delays the display of the combobox list, the agent may fail to locate the target coordinates, leading to the "hit-and-miss" results reported by users.10
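This fallback behavior can be sketched conceptually. The snippet below is an illustrative model only, not Gemini's actual implementation: the agent prefers deterministic semantic markup, then falls back to visually estimated coordinates, which may already be stale by the time the click fires. The `Element` structure and `resolve_target` helper are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional, List, Tuple

@dataclass
class Element:
    role: Optional[str]          # e.g. "combobox", or None if ARIA markup is missing
    center: Tuple[int, int]      # (x, y) from the last rendered frame

def resolve_target(elements: List[Element], visual_guess: Tuple[int, int]):
    """Hypothetical resolution strategy: prefer semantic markup,
    fall back to coordinates estimated from a screenshot."""
    for el in elements:
        if el.role == "combobox":        # deterministic path: markup present
            return ("semantic", el.center)
    # Non-deterministic path: coordinates inferred from a possibly stale frame
    return ("visual", visual_guess)

# A combobox that is missing role markup forces the visual fallback
elements = [Element(role=None, center=(120, 340))]
strategy, point = resolve_target(elements, visual_guess=(118, 352))
```

If the layout shifts between the screenshot and the click, `point` no longer corresponds to the live element, which is exactly the non-determinism described above.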
Microsoft’s approach through Copilot Actions in Edge focuses on a chat-driven assistant that "sees" the browser through periodic screenshots.2 This visual grounding allows Copilot to provide summaries, compare products across tabs, and automate certain tasks like unsubscribing from newsletters or booking travel.3
The Edge architecture defines three distinct security tiers that govern how Copilot interacts with the DOM. The "Light" mode provides minimal protection, allowing for broad autonomy, while the "Balanced" mode—the recommended default—requires user approval for unfamiliar sites.2 The "Strict" mode enforces a permission request for every site interaction.2 These tiers are critical because they dictate the agent's ability to "accessify" a page. If a site is not on Microsoft’s curated "allow list," the agent is functionally paralyzed unless the user manually intervenes.2
| Security Tier | Description of Permissions and Constraints |
| --- | --- |
| Light | Least secure; minimal protections; acts on most sites without asking permission 2 |
| Balanced | Recommended default; trusts popular sites; asks for approval on unfamiliar sites 2 |
| Strict | Most secure; always asks for permission before any site interaction 2 |
| Restricted Data | Blocked from accessing Autofill, Saved Passwords, and Wallet data 2 |
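The policy above can be modeled as a simple decision function. This is a sketch of the publicly described behavior, not Microsoft's actual logic; the `is_familiar` flag and the return values are assumptions made for illustration.

```python
def copilot_decision(tier: str, is_familiar: bool, touches_restricted_data: bool) -> str:
    """Sketch of how the three Edge security tiers might gate an agent action."""
    if touches_restricted_data:          # Autofill, passwords, Wallet: always blocked
        return "block"
    if tier == "strict":                 # always ask, on every site
        return "ask_user"
    if tier == "balanced":               # default: act only on familiar/curated sites
        return "act" if is_familiar else "ask_user"
    if tier == "light":                  # minimal protection, broad autonomy
        return "act"
    raise ValueError(f"unknown tier: {tier}")
```

Under this model, an enterprise builder UI that is not on the curated allow list stalls in "Balanced" mode until the user approves each step, which matches the "functionally paralyzed" behavior described above.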
The failure of Copilot to interact with the Salesforce Experience Builder is a direct consequence of these security boundaries. While the user attempted to use Copilot to bridge an accessibility gap, the browser's security model perceived the action—dragging a component to a canvas—as a high-risk operation involving cross-origin manipulation or potential UI redressing.2
Salesforce Experience Builder is a sophisticated, single-page application (SPA) that relies heavily on the Shadow DOM, iframes, and dynamic rendering.8 These technologies, while beneficial for modularity and style encapsulation, represent an "impenetrable fortress" for many automated agents.20
The Shadow DOM allows developers to encapsulate a component's internal HTML and CSS, preventing it from being modified by global styles or scripts.20 For a native browser agent like Gemini or Copilot, the Shadow DOM creates a two-fold problem. First, traditional selectors like XPath and CSS often stop at the "shadow root," rendering the internal elements (like the drag-handle of a headline component) invisible to the agent's DOM parser.8 Second, even if an agent uses visual perception to identify a button, it cannot easily map those pixels to a specific, interactable DOM node without "piercing" the shadow boundary.21
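The difference between a naive query and a shadow-piercing traversal can be illustrated with a small mock of a DOM tree. The node structure below is invented for the example; real agents would work against the CDP or the browser's flattened accessibility tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)   # light-DOM children
    shadow_root: Optional["Node"] = None           # encapsulated subtree, if any

def naive_query(node: Node, tag: str) -> Optional[Node]:
    """Mimics a selector that stops at shadow boundaries."""
    if node.tag == tag:
        return node
    for child in node.children:
        found = naive_query(child, tag)
        if found:
            return found
    return None                                    # never descends into shadow roots

def piercing_query(node: Node, tag: str) -> Optional[Node]:
    """Also descends into shadow roots, as Playwright-style locators do."""
    if node.tag == tag:
        return node
    subtrees: List[Node] = list(node.children)
    if node.shadow_root:
        subtrees.append(node.shadow_root)
    for child in subtrees:
        found = piercing_query(child, tag)
        if found:
            return found
    return None

# A drag handle hidden inside a component's shadow root
handle = Node("drag-handle")
component = Node("headline-component", shadow_root=Node("shadow", children=[handle]))
page = Node("body", children=[component])
```

The naive query returns nothing even though the handle is visibly rendered, which is precisely the two-fold problem described above: the element exists on screen but is absent from the agent's flattened DOM map.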
Modern testing frameworks like Playwright can handle Shadow DOM traversal, but the reasoning engines in Gemini and Copilot often rely on a simplified, flattened representation of the page to save on token costs.8 When a Salesforce page contains nested shadow roots, the agent's internal "map" of the page becomes fragmented.8
| Salesforce Structural Feature | Impact on AI Agent Navigation |
| --- | --- |
| Shadow DOM | Encapsulates internal structure; hides elements from standard DOM queries 8 |
| Dynamic IDs | Generates temporary identifiers (e.g., `60:220;a`) that change per session 8 |
| Iframe Nesting | Requires the agent to switch context between different origins 20 |
| Delayed Rendering | Elements appear only when scrolled into view or after user interaction 8 |
The user's specific failure with drag-and-drop highlights a "grounding gap" in agentic AI. Unlike a "click" action, which is a discrete event at a single coordinate, drag-and-drop is a continuous, state-dependent workflow involving a sequence of events: mousedown, mousemove (potentially spanning multiple layout shifts as the ghost image follows the pointer), and mouseup.18
For an AI agent to execute a drag-and-drop, it must maintain a real-time, high-fidelity model of the page's geometry. Most vision-based agents, however, operate on a latency-heavy loop where a screenshot is taken, processed by the model (taking 1-2 seconds), and then a command is sent.22 By the time the agent attempts to "drop" the component on the Experience Builder canvas, the underlying JavaScript may have timed out the drag event, or the "drop zone" may have failed to initialize because the agent's simulated movement did not trigger the necessary mouseover events.10
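The event sequence above can be expressed as a minimal state machine. This is a conceptual sketch with assumed semantics, not any vendor's code; it shows why an agent that jumps straight from press to release leaves the drop zone uninitialized.

```python
class DropZone:
    def __init__(self):
        self.armed = False        # becomes True only after a hover during the drag

class DragSession:
    """Toy model of a drag-and-drop interaction's required event ordering."""
    def __init__(self, zone: DropZone):
        self.zone = zone
        self.dragging = False

    def dispatch(self, event: str):
        if event == "mousedown":
            self.dragging = True
        elif event == "mousemove" and self.dragging:
            self.zone.armed = True            # hover handlers initialize the drop zone
        elif event == "mouseup":
            dropped = self.dragging and self.zone.armed
            self.dragging = False
            return dropped
        return None

# Agent that skips the intermediate movement events: the drop fails
zone = DropZone()
session = DragSession(zone)
session.dispatch("mousedown")
failed_drop = session.dispatch("mouseup")     # no mousemove -> zone never armed

# Human-like sequence: the drop succeeds
zone2 = DropZone()
session2 = DragSession(zone2)
for ev in ("mousedown", "mousemove", "mousemove", "mouseup"):
    result = session2.dispatch(ev)
```

Real builder UIs add timing constraints on top of this ordering, so a latency-heavy screenshot loop can fail even when the agent emits all the right events.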
The "security features" cited by Copilot—and the less informative refusal by Gemini—are rooted in the industry-wide concern regarding "Same-Origin Policy (SOP) Collapse" and "Privileged Browser Function" abuse.7
The Same-Origin Policy is the cornerstone of web security, ensuring that a script from one site cannot access data from another site.7 When an AI agent is granted the power to read content in one tab (e.g., an email with a headline) and act on another tab (e.g., the Salesforce canvas), the agent itself becomes the bridge between origins.9
If an agent were allowed to perform drag-and-drop freely, a malicious website could theoretically "trick" the agent into dragging sensitive data (like a session token displayed on a page) into an attacker-controlled form.9 Consequently, Microsoft and Google have implemented "hard blocks" on complex UI interactions in sensitive applications.2 These blocks are particularly aggressive when the agent detects a "Builder" or "Admin" interface, as the potential for catastrophic data loss or system misconfiguration is significantly higher.27
A critical factor in Google's "less informative" response may be related to vulnerabilities like CVE-2026-0628, where attackers found they could tap into the browser environment by hijacking the Gemini browser panel.7 This vulnerability allowed malicious extensions to access the user's camera, microphone, and local files by injecting JavaScript into the privileged Gemini interface.7 To mitigate these risks, browser vendors have restricted the agent's ability to execute arbitrary JavaScript or simulate complex mouse movements in "sensitive" browser windows, which include authenticated sessions in enterprise software like Salesforce.7
The "hit-and-miss" results are a function of the agent's failure at different stages of the execution loop. If we represent the probability of task success (P_task) as the product of the probabilities of successful Perception (P_perceive), Reasoning (P_reason), and Action (P_act):

P_task = P_perceive × P_reason × P_act
The current state of technology sees high variance in all three variables when applied to inaccessible web apps.5
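The compounding effect of this product is easy to quantify. The 0.9 figures below are hypothetical, not measured values; even individually strong stages yield a mediocre end-to-end rate once a task requires many actions.

```python
def task_success(p_perceive: float, p_reason: float, p_act: float) -> float:
    """End-to-end success as the product of independent stage probabilities."""
    return p_perceive * p_reason * p_act

single_step = task_success(0.9, 0.9, 0.9)   # ~0.73 for one action
ten_step_workflow = single_step ** 10       # a 10-action task drops below 5%
```

This is why multi-step workflows like "drag a component, then configure it" feel far less reliable than single clicks: per-step error rates multiply across the whole sequence.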
Agents that rely on DOM scraping fail when the DOM is messy, which is typical of inaccessible sites.10 Agents that rely on vision (screenshots) fail due to "coordinate guessing".18 A vision model might see a "Headline" component and estimate its center at one coordinate pair, but if the actual clickable "handle" sits at a point offset by even a few pixels from that estimate, the simulated click will miss the target.18 Research shows that vision-based agents have an error rate of 10-20% simply in misidentifying similar-looking elements or missing elements obscured by overlays (like Salesforce's floating property panels).22
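Coordinate drift can be made concrete with a simple hit test. The bounding boxes below are invented for illustration: the vision model's estimated center lands inside the large component but outside the small drag handle in its corner.

```python
def hit(box, point) -> bool:
    """True if point (x, y) falls inside box = (left, top, width, height)."""
    x, y = point
    left, top, w, h = box
    return left <= x <= left + w and top <= y <= top + h

component_box = (100, 200, 300, 80)   # the visible "Headline" component
handle_box = (106, 204, 16, 16)       # the tiny drag handle in its corner

estimate = (250, 240)                 # vision model's guess: the component's center
on_component = hit(component_box, estimate)   # the click lands on the component...
on_handle = hit(handle_box, estimate)         # ...but not on the 16x16 handle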
Even if the agent perceives the element, it may fail to reason through the sequence. Salesforce Experience Builder requires the user to click and hold before moving. Many AI agents are programmed for simple click or type actions and do not have a robust "state machine" for long-duration drag events.24 This leads to "reasoning loops" where the agent tries to click the component repeatedly, gets no response, and eventually gives up.10
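A basic guard against such loops is to abort when the same action/parameter pair fails repeatedly. The sketch below (class name and thresholds are arbitrary choices for this example) shows the kind of circuit breaker that the repeated-click behavior suggests current agents lack.

```python
from collections import Counter

class LoopGuard:
    """Abort when an identical failed action repeats too many times."""
    def __init__(self, max_repeats: int = 3):
        self.failures = Counter()
        self.max_repeats = max_repeats

    def record_failure(self, action: str, params: tuple) -> bool:
        """Returns True when the agent should stop retrying this exact action."""
        self.failures[(action, params)] += 1
        return self.failures[(action, params)] >= self.max_repeats

guard = LoopGuard(max_repeats=3)
give_up = False
for _ in range(3):                       # the agent clicks the same label three times
    give_up = guard.record_failure("click", (250, 240))
```

A real agent would pair this with a strategy change (e.g., switching from click to a different selector) rather than simply giving up.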
The action phase is where security and infrastructure limits are felt. Chrome's "Auto Browse" has a hard execution limit of 60 minutes per task, but many agents time out much earlier—often after 7 seconds of inactivity from an external API or slow Salesforce server.31
| Failure Mode | Root Cause | Example in Salesforce Context |
| --- | --- | --- |
| Context Overflow | Page DOM exceeds the LLM token limit 23 | The Experience Builder DOM is too large to process in one pass |
| Reasoning Loop | Agent repeats a failed action with the same parameters 31 | Trying to drag an un-draggable element label instead of the handle |
| Coordinate Drift | Probabilistic coordinate guessing misses the target 18 | Clicking 5 pixels to the left of a tiny combobox arrow |
| Authentication/Authorization State | Agent loses its session or fails to handle a popup 10 | A Salesforce "session expired" modal appears, blocking the agent |
Given the limitations of native browser agents, other specialized technologies may offer more reliable alternatives for "accessifying" web content for blind and low-vision users.
MultiOn is an agentic layer that sits on top of the browser rather than being "baked in".33 It focuses on "what" the user wants rather than the technical selectors of the page.33 MultiOn uses vision-capable models that are specifically trained to adapt to website layout changes.30
While MultiOn is excellent for research (e.g., "find the cheapest flight"), its latency makes it poorly suited for real-time interface design in the Experience Builder, where the user needs immediate feedback on component placement.30
The "Guide" application is a specialized Windows desktop tool designed specifically for blind and low-vision users.37 Unlike browser-based agents, Guide takes a screenshot of the entire computer and uses AI (Claude) to interpret the screen content.24
Guide's primary innovation is that it executes commands by taking direct control of the Windows mouse and keyboard.24 This is a critical distinction from native agents like Copilot, which try to act "within" the browser's DOM.2 Because Guide simulates a physical human input at the OS level, it is not subject to the same "Same-Origin Policy" restrictions that block Copilot and Gemini.24
| Feature | Browser-Native Agent (Gemini/Copilot) | Guide (Accessibility Assistant) |
| --- | --- | --- |
| Interaction Layer | Browser DOM / CDP 2 | OS-level Mouse/Keyboard Simulation 24 |
| Perception | Native DOM + Screenshots 2 | Full Desktop Screenshots 24 |
| Security Barriers | Strict SOP / Browser Sandboxing 7 | Standard OS Permissions 38 |
| Best Use Case | Web research and simple form-filling 11 | Navigating "broken" or inaccessible UIs 16 |
| User Feedback | "Hit-and-miss" on complex apps [User Query] | Successful with drag-and-drop/comboboxes 16 |
Guide has been specifically highlighted as a tool that allows blind users to "drag and drop components onto canvases in builder apps" that were not designed with accessibility in mind.16 The tool's ability to tell the user the steps it is taking allows the user to learn the workflow and even "record" successful interaction patterns for future use.24
A different path is taken by technologies like rtrvr.ai, which eschew vision-based models in favor of "DOM-native intelligence".22 By using a Chrome extension to traverse the live, rendered DOM, these tools avoid the OCR hallucinations and coordinate drift that plague vision agents.22 They can natively navigate Shadow DOMs and handle infinite scrolls by understanding the underlying framework logic rather than just the pixels.22
For a user struggling with the Salesforce Experience Builder, a DOM-native tool would likely be more successful at identifying the "Headline" component because it can see the hidden semantic markers that the visual model might miss or misinterpret as decorative text.17
The urgency of solving these unreliability issues is driven by shifting legal requirements. Since June 2025, the European Accessibility Act (EAA) has required a wide range of digital products and services, including SaaS platforms like Salesforce, to meet strict accessibility standards (WCAG 2.1/2.2).40
The global market for assistive technologies for visually impaired users was estimated at roughly $6 billion in 2024 and is projected to double by 2030.41 However, blind users often find that "stapling an accessibility AI agent on top of a broken UI" is less effective than fixing the underlying semantic issues.41 As organizations like Salesforce face increased legal accountability, the "automated remediation" provided by agentic AI will likely shift from a "helper" role to a "compliance" layer, potentially forcing browser vendors to relax security restrictions for certified accessibility tools.42
The current unreliability of Auto Browse and Copilot Actions in "accessifying" web apps like Salesforce is a multi-layered technical failure. At the architectural level, native agents are hampered by their own security sandboxes, which prioritize data protection over the autonomous flexibility required to manipulate complex interfaces.2 At the perceptual level, the probabilistic nature of visual coordinate guessing is too imprecise for the "high-stakes" interactions of a builder canvas, where a miss of a few pixels renders an action void.18
For users seeking to bridge these gaps today, the most reliable approach is not to wait for native browser improvements but to adopt "Computer Using Agents" (CUAs) that operate at the OS level.24
In conclusion, while Gemini Auto Browse and Copilot Actions represent a major step forward in "smart browsing," they are currently optimized for consumer tasks rather than the rigorous requirements of enterprise accessibility remediation.3 Users with physical or visual disabilities should look toward OS-integrated CUAs like Guide or agentic layers like MultiOn, while advocating for broader adoption of the AOM standard to provide a more stable foundation for the next generation of AI-driven inclusion.5