GPT 5.4: Native Computer Use Meets Finance Workflows

Author: Łukasz Grochal

GPT 5.4 is a new general-purpose model that focuses on three things at once: stronger reasoning, native computer use and a bigger push into everyday office and finance workflows. It merges what used to be separate “reasoning”, “coding” and “computer-use” models into one system that can think through longer problems, operate a computer directly and still write code or documents with relatively low error rates. OpenAI highlights better benchmark scores on complex knowledge-work tests and on realistic computer-use suites like OSWorld-Verified and WebArena Verified, which are designed to check how well the model can navigate interfaces, click around and finish multi-step tasks without constant hand-holding. In practice, that means things like opening files, moving data between apps or updating dashboards can often be delegated end to end, with users mostly giving higher-level instructions.

Under the hood, the API adds a new tool search system so the model does not have to see full definitions for every possible tool on each request. Instead, it looks up only what it needs when it decides to call a tool, which helps keep context smaller, speeds up responses and cuts costs for setups that rely on many plugins or connectors. Tool calling itself is more accurate and efficient as well, with better results on benchmarks such as Toolathlon, where agents have to chain tools together to complete tasks like processing emails, grading attachments and writing results into spreadsheets. The context window in the API can now reach up to one million tokens, which gives enough room for large codebases, long research packets or multi-document financial models in a single conversation.
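The tool search idea can be illustrated with a toy sketch. This is not OpenAI's implementation or API: the tool catalog, the `search_tools` and `payload_size` helpers and the word-overlap ranking are all hypothetical, chosen only to show why sending a few matched definitions is cheaper than sending every definition on each request.

```python
import json

# Hypothetical tool catalog; names, descriptions and schemas are invented for illustration.
TOOLS = {
    "read_email": {
        "description": "fetch unread email messages from an inbox",
        "parameters": {"folder": "string", "limit": "integer"},
    },
    "grade_attachment": {
        "description": "score an email attachment against a rubric",
        "parameters": {"attachment_id": "string", "rubric": "string"},
    },
    "write_spreadsheet": {
        "description": "append rows to a spreadsheet file",
        "parameters": {"path": "string", "rows": "array"},
    },
    "get_weather": {
        "description": "look up the weather forecast for a city",
        "parameters": {"city": "string"},
    },
}


def keywords(text):
    """Lowercase the text and keep words longer than two characters."""
    return {word for word in text.lower().split() if len(word) > 2}


def search_tools(query, catalog, top_k=2):
    """Rank tools by word overlap between the query and each tool's name and description."""
    query_words = keywords(query)
    scored = []
    for name, spec in catalog.items():
        tool_words = keywords(name.replace("_", " ") + " " + spec["description"])
        overlap = len(query_words & tool_words)
        if overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


def payload_size(names, catalog):
    """Use the serialized length of the definitions as a rough proxy for context cost."""
    return len(json.dumps({name: catalog[name] for name in names}))


selected = search_tools("write results into a spreadsheet", TOOLS)
partial = payload_size(selected, TOOLS)
full = payload_size(list(TOOLS), TOOLS)
print(selected, partial, full)
```

A production system would presumably use a stronger retrieval method than word overlap, but the saving comes from the same place: only the definitions that match the current step enter the context.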

On reliability, OpenAI reports that GPT 5.4 makes fewer factual mistakes than GPT 5.2, with lower error rates both for individual claims and for full answers. Safety teams rate the “Thinking” variant as highly capable under OpenAI’s Preparedness Framework, especially on cyber benchmarks, but still within the same general safety band as earlier high-end models. Evaluations such as CyScenarioBench show higher success on long-horizon scenarios, but no dramatic jump in the model’s ability to hide or manipulate its own chain-of-thought, which matters because monitoring depends on that reasoning staying visible. For everyday users, the practical takeaway is that answers tend to be more grounded, and that long reasoning traces can be inspected and corrected during generation when the Thinking mode is enabled.

Another big pillar is a finance-focused suite that integrates the model with Excel, Google Sheets and data providers like FactSet, MSCI, Third Bridge and Moody’s. In spreadsheets, GPT 5.4 can help build, audit and update complex financial models using existing formulas and structures, instead of forcing teams to change their workflows. OpenAI is also pushing reusable “skills” for recurring financial tasks such as earnings previews, comparables analysis, discounted cash flow modeling and investment memo drafting, effectively turning typical finance playbooks into semi-automated templates. This does not remove the need for human judgment, but it can shorten the time from raw data to a first pass of analysis or a draft presentation.
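For readers unfamiliar with the discounted cash flow modeling mentioned above, here is a minimal sketch of the standard textbook calculation. It is not taken from OpenAI's finance skills; the function name, the Gordon-growth terminal value and the input figures are assumptions made for illustration only.

```python
def discounted_cash_flow(cash_flows, discount_rate, terminal_growth):
    """Value forecast cash flows plus a Gordon-growth terminal value (textbook DCF)."""
    if not cash_flows:
        raise ValueError("cash_flows must contain at least one forecast year")
    if discount_rate <= terminal_growth:
        raise ValueError("discount_rate must exceed terminal_growth")

    # Present value of each explicit forecast year (year 1, 2, ...).
    pv_forecast = sum(
        cash_flow / (1 + discount_rate) ** year
        for year, cash_flow in enumerate(cash_flows, start=1)
    )

    # Terminal value at the end of the last forecast year, then discounted to today.
    terminal_value = (
        cash_flows[-1] * (1 + terminal_growth) / (discount_rate - terminal_growth)
    )
    pv_terminal = terminal_value / (1 + discount_rate) ** len(cash_flows)

    return pv_forecast + pv_terminal


# Illustrative inputs only, not real company data.
value = discounted_cash_flow(
    [100.0, 110.0, 121.0], discount_rate=0.10, terminal_growth=0.02
)
print(round(value, 2))
```

The result is highly sensitive to the discount rate and terminal growth assumptions, which is one reason a model-generated first pass still needs review by an analyst.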

Performance-wise, early tests suggest that GPT 5.4 is an incremental rather than revolutionary step over GPT 5.2 and GPT 5.3: it posts better scores on computer-use and tool-use benchmarks, improves token efficiency and raises knowledge-work metrics like GDPval, but the gap is modest rather than night-and-day. The “Thinking” mode in particular helps on tasks where hallucinations used to be a concern, because the model can spend more time reasoning and its chains of thought are evaluated for safety and accuracy. At the same time, OpenAI’s own documentation stresses that the model should still be treated as fallible and that high-stakes uses need monitoring and guardrails. Overall, GPT 5.4 looks like a solid, more tool-savvy evolution of the GPT 5 line, especially for people who care about agents that can actually drive a computer and slot into financial workflows, rather than a dramatic change in raw text quality alone.