Ghosts in the Codex Machine
Oct 31, 2025 Tibo & the Codex Team
We have not found a single, conclusive issue that would explain a consistent degradation of Codex over time. Instead, we believe the cause is a combination of shifts in behavior over time, some of which were encouraged by new features such as compaction, and concrete smaller issues that we found through our investigation and document below. We have already landed a series of improvements, more are coming, and this week has shaped where we will invest next to further improve Codex.
This work would not have been possible without Codex itself, which helped us throughout: redesigning existing features, instrumenting obscure parts of the stack, running evals, and analyzing data. We hope you enjoy the read.
Over the last few weeks, we saw increasing public reports of Codex performing less well, starting around the launch of gpt-5-codex on September 15, 2025. Despite no immediate evidence from our own usage or top-line product metrics, we felt the issue was not well understood by the team and deserved the full-time attention of some of our best engineers.
We shared the following plan last Friday:
1) Upgrades to /feedback command in CLI
- Add structured options (bug, good result, bad result, other) with freeform text for detailed feedback
- Allow us to tie feedback to a specific cluster, hardware, etc.
- Socialize the existence of /feedback more broadly; we want enough feedback volume to flag anomalies for any cluster or hardware configuration
2) Reduce the surface area of things that could cause issues
- All employees, not just the Codex team, will go through the exact same setup as all of our external traffic until we consider this investigation resolved
- Audit the infrastructure optimizations we have landed, and the feature flags we use to land them safely, to ensure we leave no stone unturned
3) Evals and qualitative checks
- We continuously run evals, but we will run an additional battery of evals across our cluster and hardware combinations to see if we can pick up anything
Here is what we did in the days immediately following the plan.
The improved /feedback mechanism in particular proved to be a valuable addition: it allowed us to trace specific problematic Codex conversations all the way through our systems, down to the precise hardware the inference ran on. We developed internal tooling that lets engineers quickly pull up information related to reports and helps with analysis. With better tooling, and more importantly, help from users, we started triaging 100+ issues per day.
We then assembled a squad whose sole mission was to continuously come up with creative hypotheses about what could be wrong and investigate them one by one, either rejecting each hypothesis or fixing the related finding. This squad operated from Monday to Friday without other distractions.
We trained predictive models to look at relationships between hourly user retention and a slew of request features such as model, CLI build version, operating system, time of day, serving cluster, hardware, and user plan type. We also ran evals against each type of hardware and noticed slight performance issues with some of our older hardware, an observation that found some support in the predictive analysis on retention. We removed that hardware from the fleet.
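To make the approach concrete, here is a minimal sketch of this kind of analysis, assuming a flat table with one row per request and an hourly-retention label. The file name, column names, and gradient-boosted model are illustrative rather than our production pipeline.

```python
# Illustrative sketch (not our production pipeline): relate hourly retention
# to per-request features and inspect which features carry signal.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical export: one row per request, labeled with whether the user
# returned within the following hour.
df = pd.read_csv("requests_sample.csv")  # columns below are assumptions
features = ["model", "cli_version", "os", "hour_of_day", "cluster", "hardware", "plan"]
X, y = df[features], df["retained_next_hour"]

pre = ColumnTransformer(
    [("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), features)]
)
clf = Pipeline([("pre", pre), ("gbdt", HistGradientBoostingClassifier())])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf.fit(X_tr, y_tr)

# Permutation importance highlights features (for example, a hardware type)
# whose shuffling hurts predictions the most; those become candidates for
# closer investigation.
imp = permutation_importance(clf, X_te, y_te, n_repeats=5, random_state=0)
for name, score in sorted(zip(features, imp.importances_mean), key=lambda t: -t[1]):
    print(f"{name:12s} {score:+.4f}")
```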
Separately, we discovered an opportunity in our load-balancing strategy to reduce latency under load; these improvements are rolling out over the course of next week.
One early trend in the feedback we received from users centered around compaction. When the context window is nearly exhausted, we prompt the model to summarize the conversation, clear the context, and begin anew with the summary as the first message. This enables sessions to proceed more naturally, without forcing users to start a new session and lose the context of ongoing work.
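Conceptually, the mechanism looks something like the sketch below; the function names and threshold are illustrative, not the actual Codex implementation.

```python
# Minimal sketch of the compaction idea (names and thresholds are illustrative).
COMPACTION_THRESHOLD = 0.9  # fraction of the context window that triggers compaction

def maybe_compact(history, context_used, context_limit, summarize):
    """If the context window is nearly exhausted, replace the conversation
    history with a model-written summary and continue from there."""
    if context_used / context_limit < COMPACTION_THRESHOLD:
        return history
    summary = summarize(history)  # prompt the model to summarize the session so far
    # Start fresh: the summary becomes the first message of the continued session.
    return [{"role": "user", "content": f"Summary of the session so far:\n{summary}"}]
```

Note that if compaction fires several times in one session, a naive version like this ends up summarizing earlier summaries, which is exactly the recursive behavior discussed below.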
We noticed both that (a) a higher percentage of sessions were using compaction over time, and (b) our compaction implementation could be improved to yield better results. Evals confirmed that performance degrades with the number of /compact or auto-compactions used within a single session.
We landed improvements to avoid recursive summaries when compaction happens multiple times and also added a warning to help nudge users towards keeping conversations more targeted. 
Heads up: Long conversations and multiple compactions can cause the model to be less accurate. Start a new conversation when possible to keep conversations small and targeted.
Codex uses a special tool called apply_patch to make changes to files. The apply_patch tool takes a unified diff as input. Most of the time the model is very good at generating diffs for changes. In the rare cases where the model provides an incorrect patch, the harness prompts it to try again.
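As a rough illustration of that retry flow (this is not the Codex source; `git apply` stands in for the real apply_patch implementation and `generate_patch` for the model call):

```python
# Simplified sketch of a harness-side retry loop around a patch tool.
import subprocess

def try_apply(patch: str, repo_dir: str) -> tuple[bool, str]:
    """Apply a unified diff to the working tree; return (ok, error_message)."""
    proc = subprocess.run(
        ["git", "apply", "--whitespace=nowarn", "-"],
        input=patch, text=True, capture_output=True, cwd=repo_dir,
    )
    return proc.returncode == 0, proc.stderr

def apply_with_retry(generate_patch, repo_dir: str, max_attempts: int = 2) -> bool:
    feedback = None
    for _ in range(max_attempts):
        patch = generate_patch(feedback)        # ask the model for a diff
        ok, error = try_apply(patch, repo_dir)  # attempt to apply it to the workspace
        if ok:
            return True
        # Feed the failure back to the model and ask it to try again.
        feedback = f"apply_patch failed:\n{error}\nPlease produce a corrected patch."
    return False
```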
We found some reports of the model resorting to deleting and re-creating files when it fails to apply a patch. This is correct in the limit, but it can cause issues when the agent is interrupted or fails to apply the second patch after the deletion. Our approach to fixing this is to improve future models so they don’t exhibit this behavior and to land immediate mitigations in the coming week to limit high-risk sequences of edits.
Another trend in user feedback centered around latency: users reported that the model was taking much longer than expected to complete a task. We monitor latency closely and have not seen any clear regressions across the board; in fact, latency has improved over time.
However, we noticed in a few example feedback sessions that the model was retrying tasks with longer and longer timeouts. We designed GPT-5-Codex to be a very persistent model, but escalating timeouts in this way can be inefficient. We are investing in making models better at handling long-running or interactive processes.
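To illustrate why this pattern is costly, here is a reconstruction of the behavior with made-up timeout values: if the underlying process is genuinely stuck, for example waiting on interactive input, every escalated timeout is paid in full before the step finally fails.

```python
# Illustrative reconstruction of the pattern seen in feedback: a command that
# hangs gets retried with progressively longer timeouts, so a single stuck step
# can consume many minutes of wall-clock time. Timeout values are made up.
import subprocess

def run_with_escalating_timeouts(cmd, timeouts=(10, 60, 300)):
    for t in timeouts:
        try:
            return subprocess.run(cmd, timeout=t, capture_output=True, text=True)
        except subprocess.TimeoutExpired:
            print(f"timed out after {t}s, retrying with a longer timeout...")
    # If the process never finishes, the retries above cost 10 + 60 + 300 = 370
    # seconds of wall-clock time and still fail.
    return None
```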
We use a version of constrained sampling [1] to achieve something similar to structured outputs. We found a subtle bug in the implementation that caused the token sequence to become out-of-distribution for the model. We later confirmed that reported issues where the model would switch languages mid-sentence in the final answer could be attributed to this bug. Less than 0.25% of overall sessions are understood to have been affected. The fix for this will roll out over the next couple of days.
Example answer from gpt-5-codex in the Codex CLI where it switches to Korean:
The only place AddMemberToProject is invoked today is inside the screen module’s data seeder (ActivityDataService.cs). That routine bootstraps demo data for the kiosk/dashboard environment: it creates a customer (“Mobily”), two sample projects, assigns hard-coded users, seeds workspace structures (groups, statuses, priorities, custom fields), and drops in a couple of project tasks and activity rows. In short, it’s purely a development/demo seeding script; no production API depends on the helper. 제거하면, 그 시더에서 membership 넣을 때 바로 ProjectAppService.AddMemberToProject 대신 MediatR command를 직접 호출하도록 고쳐주면 됩니다. [Translation of the Korean sentence: If you remove it, just change that seeder to call the MediatR command directly instead of ProjectAppService.AddMemberToProject when adding the membership.]
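For readers unfamiliar with the technique, constrained sampling in general works by masking the model’s logits so that only tokens permitted by a grammar or schema can be sampled. The sketch below shows the textbook version of that idea, not our implementation; it also illustrates the bug class, since an incorrectly computed mask can force the model onto tokens it would essentially never choose, leaving the rest of the generation out of distribution.

```python
# Generic sketch of constrained (grammar-guided) sampling via logit masking.
import numpy as np

def sample_constrained(logits: np.ndarray, allowed_token_ids: set[int]) -> int:
    """Sample the next token, but only from the tokens the grammar allows."""
    masked = np.full_like(logits, -np.inf)
    allowed = np.fromiter(allowed_token_ids, dtype=int)
    masked[allowed] = logits[allowed]
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return int(np.random.choice(len(logits), p=probs))

# If `allowed_token_ids` is computed incorrectly, the model can be forced onto
# very low-probability tokens, and subsequent generation continues from an
# out-of-distribution prefix.
```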
Codex uses the Responses API to make requests to the model. You can think of the Responses API as a proxy that turns REST API requests into the stream of tokens the model uses for inference, and parses the resulting stream of tokens back into a format convenient for application development.
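For readers who have not used it, a streaming call through the OpenAI Python SDK looks roughly like this (Codex itself is not a Python program; this is just to show the shape of the interface): the application submits a high-level request and consumes parsed events, while the raw token stream to and from the engine is handled by the API layer.

```python
# Rough shape of a streaming Responses API call via the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()
stream = client.responses.create(
    model="gpt-5-codex",
    input="Explain what apply_patch does in one sentence.",
    stream=True,
)
for event in stream:
    # Each event is already parsed into a typed structure rather than raw tokens.
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
```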
The Responses API became another target of our investigation. The goal was to determine whether any changes to the API implementation could affect the input to the engine or the way its output is interpreted. We re-reviewed over 100 PRs to the Responses API and the libraries it uses. We also compared the raw token values sent to the engine across multiple versions of the Responses API spanning months.
We found a small difference in request encoding: two extra newline characters were being added around the section describing the tools available to the model. After additional analysis, we concluded that this change does not affect model performance.
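A sketch of what "comparing raw token values" means in practice: tokenize the prompt rendered by two versions of the API layer and report the first position where the sequences diverge. The tokenizer choice and helper function here are illustrative.

```python
# Illustrative check: tokenize the prompt rendered by two versions of the API
# layer and report where the token sequences first diverge.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # tokenizer choice is illustrative

def first_divergence(prompt_a: str, prompt_b: str):
    a, b = enc.encode(prompt_a), enc.encode(prompt_b)
    for i, (ta, tb) in enumerate(zip(a, b)):
        if ta != tb:
            # Return the index plus a little decoded context around it.
            return i, enc.decode(a[max(0, i - 5): i + 5]), enc.decode(b[max(0, i - 5): i + 5])
    return None if len(a) == len(b) else (min(len(a), len(b)), "", "")

# A difference like two extra "\n" tokens around the tools section shows up as
# an early divergence with otherwise identical surrounding context.
```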
We constructed and ran a number of evaluations directly against the production infrastructure.
We investigated the prevalence of errors setting the working directory and found no changes.
We ran a deep analysis of end-to-end query latencies across the Codex back-end systems to understand whether certain users were unevenly affected by tail latencies. We found lower-than-expected authentication cache hit rates, which can add an extra 50ms of latency per request. We are working to resolve this in the coming days.
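The shape of that analysis, with a hypothetical log schema, looks roughly like this:

```python
# Sketch of the per-user tail-latency check (file name and columns are hypothetical).
import pandas as pd

logs = pd.read_parquet("request_logs.parquet")  # columns: user_id, latency_ms, auth_cache_hit

# Per-user p50/p95/p99 to see whether tail latency is concentrated in a subset of users.
per_user = logs.groupby("user_id")["latency_ms"].quantile([0.5, 0.95, 0.99]).unstack()
print(per_user.describe())

# Compare latency with and without an authentication cache hit; a consistent
# gap on the order of ~50ms per request is the effect described above.
print(logs.groupby("auth_cache_hit")["latency_ms"].median())
```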
More people are using Codex, and they’re using it more often. The distribution of turns and tokens per session has held fairly steady over the last several weeks; however, we do see an increase in the complexity of setups, with more MCP tools used over time. Based on evals, we continue to recommend using minimalist setups and keeping conversations small and targeted to get the best performance out of Codex.
We have learned a lot during the course of this investigation, and we thank our users for continuously engaging with us. We always encourage respectful discourse, but rest assured that we appreciate all the feedback and take it seriously, no matter how harsh. Please keep it coming.
We are carrying the momentum of this week into a permanently staffed team that will obsess every day over the real-world performance of Codex. We are actively recruiting and looking for top talent to join us. Apply if you have prior experience with the kind of work described in this investigation and want to help us build the future of software.
Open roles