Teaching GPT-5 to Use a Computer

San Francisco · Posted by Surya Dantuluri

Archon demo animation

Over the weekend, I took 3rd place at OpenAI's GPT-5 Hackathon with Archon, a copilot for your computer. It pairs a mini vision model for speed with GPT-5's variable reasoning for planning. I took some time to write about how it works, our approach to building a self-driving computer, the inference math behind it, and the tradeoffs we made.

Archon is a small bar that sits at the bottom of your Mac/Windows screen where you type what you want your computer to do in natural language. It takes screenshots to see what's on screen, uses GPT-5's reasoning to plan, then a custom fine-tuned model executes the clicks and keystrokes. In a racing game demo, given the single instruction 'start playing', it recognized the view, used WASD, and navigated the track. It didn't win the race this time due to latency, but its instruction-following was clearly superior to prior models. The goal is to make computers self-driving: Archon is a lightweight client demonstrating that GPT-5's powerful reasoning, combined with tiny fine-tuned models, can control any interface through natural language.

Full demo video sped up 2x

GPT-5: Why it worked for us

Archon was built entirely on GPT-5's advanced reasoning capabilities. We leaned on nearly every aspect of GPT-5, from initial development to debugging to training. Codex CLI running GPT-5 with high reasoning effort let us build the entire app, and GPT-5 with vision let us see and perceive the screen. GPT-5's reasoning ability was crucial for instruction following and planning. This quite simply wasn't possible with any other model.

What makes GPT-5 particularly suited for computer control is its ability to reason through complex multi-step processes while maintaining context across long interactions. Unlike previous models that might hallucinate or lose track of the current state, GPT-5's chain-of-thought reasoning allows it to break down "start playing this game" into discrete, executable steps while adapting to unexpected UI changes.

We calibrated how much compute to use, trading off accuracy against latency. For complex workflows, high reasoning effort mapped out interaction sequences with error handling. GPT-5-mini with function-calling preambles let us show the user what we were thinking while simultaneously calling our grounding model. This adaptive approach keeps the user in mind: whether they need to navigate complex, changing UIs or just need to get something done, we can trade reasoning for latency and vice versa.

How it actually works

Pipeline: User intent (natural language) → Planner (GPT-5) → prava-fc-small (Fast Click grounding) → Executor (click + type)

👁 See: screenshot, 10ms → 💭 Think: what to click?, 0–450ms → 📍 Find: where exactly?, 20ms → 👆 Act: click & type, 15ms

Archon uses a hierarchical approach: a large reasoning model (GPT-5/o-series) decides what to do, and prava-fc-small (Prava's Fast Click grounding model) figures out exactly where to click. This split matters because reasoning and grounding are fundamentally different problems with different computational requirements.

The reasoning model sees the screen and your request, then outputs a semantic action: "click the blue Submit button at the bottom." Descriptions enable reasoning to be done in natural language. prava-fc-small takes that description plus the screenshot and outputs exact pixel coordinates: (523, 412). One model for the "what," another for the "where."

prava-fc-small (Prava's Fast Click grounding model) is a vision transformer (ViT) fine-tuned specifically for finding UI elements. It outputs exact (x, y) screen coordinates for clicking.
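In code, the loop looks roughly like this; plan_step and ground are stand-ins for the GPT-5 planner call and the prava-fc-small grounder, and pyautogui is used as the executor purely for illustration:

from typing import Callable
import pyautogui  # illustrative executor; any click/type backend works

def run_step(screenshot, goal: str,
             plan_step: Callable,   # reasoning model: (screenshot, goal) -> semantic action
             ground: Callable):     # grounding model: (screenshot, description) -> (x, y)
    # 1) GPT-5 decides WHAT to do, as a natural-language action.
    action = plan_step(screenshot, goal)          # e.g. {"kind": "click", "target": "blue Submit button at the bottom"}
    # 2) prava-fc-small decides WHERE, as exact pixel coordinates.
    x, y = ground(screenshot, action["target"])   # e.g. (523, 412)
    # 3) The executor performs the click or keystrokes.
    pyautogui.click(x, y)
    if action["kind"] == "type":
        pyautogui.typewrite(action.get("text", ""))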

Why vision tokens are expensive (and how we optimize them)

For GPT-5's computer-using agent, each action involves vision, reasoning, and response. A 1920×1080 screenshot becomes 6 tiles at 170 tokens each, plus reasoning tokens billed as output.

  • Per step: 3,200–9,400 tokens
  • 100 steps: $3.20–$9.40
  • With caching: $0.32–$0.94 (90% discount)
  • Latency: 2–5 seconds per action (100-step task = 3–8 minutes)

Running the same workflow 100 times daily costs $940 per day, over $28,000/month, without caching. Each run takes 3–8 minutes, so what would take a human 50 minutes would take 5–13 hours of compute time. And because they're LLMs, they aren't deterministic every time, compounding the cost and time.
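The back-of-envelope math, using the uncached per-run figures above:

# Uncached cost/time for 100 runs/day of a 100-step workflow (upper end of the ranges above).
cost_per_run = 9.40                                # $ for one 100-step run without caching
runs_per_day = 100
daily_cost = cost_per_run * runs_per_day           # $940/day
monthly_cost = daily_cost * 30                     # ~$28,200/month
daily_compute_hours = (3 * runs_per_day / 60,      # 5 hours at 3 min/run
                       8 * runs_per_day / 60)      # ~13 hours at 8 min/run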

Our approach: split reasoning from grounding. GPT-5 decides "click the blue Submit button," and prava-fc-small finds the exact coordinates. We avoid re-encoding most patches by caching the patches themselves and reconstructing the differences between images over time. Sometimes this is inefficient for tasks that involve a lot of window switches, however. Combined with a 3MB saliency scorer that identifies interactive regions, we achieve 70%+ cache hits and 10–50ms grounding latency.
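A rough sketch of the patch cache idea (hash fixed-size patches and only re-encode the ones whose pixels changed; the names and the 32×32 size here are illustrative):

import hashlib
import numpy as np

def changed_patches(frame: np.ndarray, cache: dict, patch: int = 32):
    # Return only the patches that changed since the last frame; unchanged
    # patches keep their cached encodings, so most of the screen is never re-encoded.
    dirty = []
    for y in range(0, frame.shape[0] - patch + 1, patch):
        for x in range(0, frame.shape[1] - patch + 1, patch):
            tile = frame[y:y + patch, x:x + patch]
            digest = hashlib.blake2b(tile.tobytes(), digest_size=8).hexdigest()
            if cache.get((y, x)) != digest:
                cache[(y, x)] = digest
                dirty.append(((y, x), tile))
    return dirty  # a window switch dirties almost everything, which is the inefficient case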

Tiles: 512×512 (262K pixels), 170 tokens each. Patches: 32×32 (1K pixels), 256× more spatially precise.
Patches give us the precision to find UI elements exactly where they are, not where they might be.

Instead of throwing away dead space entirely, we downsample irrelevant regions while keeping the important UI elements at full resolution.

Saliency heatmap showing attention regions on a screenshot
Illustration of the saliency heatmap.
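A sketch of that mixed-resolution trick, assuming an HxW saliency map in [0, 1] from the scorer (the threshold and downsample factor are made up for illustration):

import numpy as np
from PIL import Image

def mixed_resolution(img: Image.Image, saliency: np.ndarray,
                     thresh: float = 0.5, factor: int = 4) -> Image.Image:
    # Downsample low-saliency regions; keep high-saliency (interactive) regions at full resolution.
    low = img.resize((img.width // factor, img.height // factor)).resize(img.size)
    mask = saliency[..., None] > thresh                   # broadcast over RGB channels
    out = np.where(mask, np.asarray(img), np.asarray(low))
    return Image.fromarray(out.astype(np.uint8))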

GRPO

We trained prava-fc-small with GRPO (Group Relative Policy Optimization), where rewards are binary: 1 if the click lands inside the target UI element, 0 otherwise. Patches work well for this because they're small enough that clicking anywhere within a patch-covered element still gets rewarded.

To scale training data, we used trajectory augmentation on human demonstrations. From one recorded workflow, we generate multiple related trajectories by varying timing, UI states, and interaction patterns - effectively "boosting" the grounding model's robustness across different scenarios.

Learning from group performance: each attempt's advantage is its reward minus the group baseline; the model learns to prefer actions that beat the group average.
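A minimal sketch of the group-relative advantage with binary click rewards (the group size and names are illustrative, not the actual training code):

import torch

def grpo_advantages(hits: torch.Tensor) -> torch.Tensor:
    # hits[i] = 1 if sampled click i landed inside the target element, else 0.
    # Advantage = reward minus the group baseline, normalized by the group std.
    rewards = hits.float()
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# Example: 8 sampled clicks on the same prompt, 3 landed inside the target.
adv = grpo_advantages(torch.tensor([1, 0, 0, 1, 0, 1, 0, 0]))
# The policy gradient then weights each sample's log-prob by its advantage:
# loss = -(adv * logprobs).mean()  (plus a KL term to a reference policy in full GRPO)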

While testing, prava-fc-small was much worse at clicking bright red buttons than the tiny blue buttons it clicked reliably. We suspect this is because the bright red buttons tend to sit at the center of their element, while the tiny blue buttons tend to sit at the edge. More work is needed to make the model more robust and to interpret all of its capabilities.

Speed: adaptive compute that feels instant

Test-time compute is getting extremely hyped these days, particularly off the success of the o-series models. In my experience, I get a lot of use out of GPT-5 Pro, and previously o3-pro, because most of my day-to-day work is "knowledge work". The good thing for prava-fc-small is that its job is mostly "grounding work", not "knowledge work". You can get a lot of mileage out of a 7B model if you instead vary the reasoning and pipeline the tasks properly.

Fast path: Observe (10ms) → Ground (20ms) → Execute (15ms) → Verify (5ms)

On this path, prava-fc-small runs alone (no planner call), hitting ~50 ms per action on an A100. The router only escalates when signals are uncertain: high saliency entropy, too many candidate targets, recent misclicks, or ambiguous copy (e.g., multiple "Submit" buttons). When that trips, we pipeline one step ahead: step N (simple) executes now while the reasoner prepares a short plan for step N+1. The router itself is a simple policy that looks at these signals and decides whether to escalate.
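A sketch of what such a router can look like; the thresholds and signal names are illustrative:

from dataclasses import dataclass

@dataclass
class StepSignals:
    saliency_entropy: float   # entropy of the saliency map over candidate regions
    num_candidates: int       # plausible targets found by the grounder
    recent_misclicks: int     # misclicks over the last few steps
    duplicate_labels: bool    # e.g., multiple "Submit" buttons on screen

def should_escalate(s: StepSignals) -> bool:
    # Stay on the ~50 ms fast path unless an uncertainty signal trips.
    return (s.saliency_entropy > 2.0
            or s.num_candidates > 3
            or s.recent_misclicks >= 2
            or s.duplicate_labels)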

Simple UI (~50ms): a simple interface with one obvious button; "Click the submit button". 70% of tasks, 95% accurate.
Complex UI (~500ms): a complex spreadsheet interface; "Click cell B3, which contains the quarterly revenue data, located in the second column, third row". 30% of tasks, 99% accurate.
The system automatically chooses speed or precision based on visual complexity.

The fundamental tradeoff is simple: consumers want one thing done fast, enterprises want many things done efficiently. Same model, different routing strategy.

For the typical consumer, we think it's better to bias toward the fast path (planner stays cold unless ambiguity is detected). In enterprise, we enable continuous batching for planner calls, short aggregation windows, and aggressive prefix caching; prava-fc-small stays on-GPU so grounding still feels immediate.

After ~1 hour of use we typically see a high patch-cache hit rate, where similar patches (imagine a screenshot of a button) are cached and reused. Verification is cheap (a single screenshot plus a state predicate), so we keep iterating quickly without silent drift.

The net effect is that, compared to today's computer-use models, many steps finish in under 100 ms end-to-end; a 20-step flow can land in a few seconds without the "stop-and-think" feel.

What's next: streaming control and unifying the stack

Diagram: screenshot loop feeding the streaming input pipeline.

In the future we hope to run a streaming capture pipeline similar to Gemma 3: consume frames at 20–30 fps, emit actions at 5–10 Hz, and verify state on each commit. This closes the perception-action loop for drag/hover/scroll and makes motion feel natural. The planner hooks into the same stream, but only for escalations.
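A toy version of that loop; capture, ground, and execute are placeholders for the real pipeline stages:

import time
from typing import Callable, Optional

def streaming_loop(capture: Callable, ground: Callable, execute: Callable,
                   fps: int = 30, action_hz: int = 10):
    # Consume frames at ~fps; emit at most action_hz actions per second.
    min_gap = 1.0 / action_hz
    last_action = 0.0
    while True:
        frame = capture()                           # latest screen frame
        now = time.monotonic()
        if now - last_action >= min_gap:
            action: Optional[dict] = ground(frame)  # grounder proposes an action (or None)
            if action is not None:
                execute(action)                     # commit click / drag / scroll, then verify state
                last_action = now
        time.sleep(1.0 / fps)                       # pace frame consumption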

We also plan to compile solved steps into micro-policies. If you're running something like an RPA task or a workflow you've done before, you can simply run the execution locally (with prava-fc-small running locally) and not worry about planning. Over time, the planner becomes a background teacher, not a crutch. We also found that recording computer screens is a great way to get enough data for RL training, which materially boosts the model's performance for the specific use cases in each vertical or profession.

We will distill those plans into the local model so more steps stay on the fast path. The path forward is to adopt an end-to-end approach to the problem. For Tesla that's camera, steering, acceleration. For us it's screen, mouse, keyboard.

Eventually we'll get rid of all the brittle policies and controls and have a model that can reason at a second order about how much compute a task requires. Today we keep a planner in the loop for rare edge cases and safety; as the executor absorbs those patterns (via streaming, macros, distillation), the system becomes simpler and end-to-end.

Self-driving cars (Tesla FSD): input cameras; output steering, acceleration; model: end-to-end neural net.
Self-driving computers: input screenshots; output mouse, keyboard; model: vision transformer.
Same principle, different domain: end-to-end vision models replacing complex rules.

Related Work

He, Y., Jin, J., & Liu, P. (2025). Efficient Agent Training for Computer Use. arXiv preprint arXiv:2505.13909.

Yang, Y., Li, D., Dai, Y., Yang, Y., Luo, Z., Zhao, Z., Hu, Z., Huang, J., Saha, A., Chen, Z., Xu, R., Pan, L., Xiong, C., & Li, J. (2025). GTA1: GUI Test-time Scaling Agent. arXiv preprint arXiv:2507.05791.


Can You Infinitely Learn with Online RL?

San Francisco · Posted by Surya Dantuluri

Last week I made Geospot Infinity, a photo-to-GPS model. For each upload, we retrieve 10 candidate coordinates and rank them for the user, who then picks the closest. We attempted to learn from every interaction via online RL. In reality, over 65% of users clicked the first guess regardless of accuracy, effectively nullifying our ability to learn anything. In fact, the online RL policy degraded estimates by 414 km, 17% worse than baseline.

GeoSpot Infinity online RL results screenshot

Metric Results

Metric                      Error (km)    vs Baseline
Baseline (GeoCLIP top-1)    2,464         —
Baseline@10 (best-of-10)    1,276         −48% ✓
Our Policy (REINFORCE)      2,878         +17% ✗

I froze GeoCLIP, which is a ViT-L/14 vision encoder plus a small image MLP and a location encoder that pairs images with 10 candidate coordinates. On top of it I put a tiny 3-layer MLP head that re-scores those 10 candidates:

import torch.nn as nn

# Input per candidate: [z_img(512), z_loc(512), similarity(1), base_prob(1)] → 1026-d
ranker = nn.Sequential(
    nn.Linear(1026, 256), nn.ReLU(),
    nn.Linear(256, 256),  nn.ReLU(),
    nn.Linear(256, 1),    # one scalar score per candidate
)

Effectively it's a re-ranker, to see if it could learn geoguessing better than the backbone. My intention was to build an online RL policy in under 12 hours, which this simple architecture satisfied. The policy head samples rankings via Plackett-Luce (a probabilistic model over ranking permutations) and optimizes a REINFORCE objective with KL regularization:

\[ \mathcal{L} = -\mathbb{E}_{\pi \sim \pi_\theta}\big[(R - b) \cdot \log \pi_\theta(\pi \mid s)\big] + \beta \cdot D_{\text{KL}}(\pi_{\text{behavior}} \,\|\, \pi_\theta) \]

where R = reward from user click position, b = EMA baseline, β = KL penalty against a Polyak-averaged behavior policy (a slowly updated copy of the policy that prevents the model from changing too quickly between updates).
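A condensed sketch of that update; the sampling helper and hyperparameters are illustrative, not the production code:

import torch
import torch.nn.functional as F

def sample_plackett_luce(scores: torch.Tensor):
    # Sample a ranking and its log-prob under Plackett-Luce:
    # repeatedly softmax over the remaining candidates and draw one.
    remaining = list(range(scores.numel()))
    order, logp = [], scores.new_zeros(())
    while remaining:
        probs = F.softmax(scores[remaining], dim=0)
        i = torch.multinomial(probs, 1).item()
        logp = logp + torch.log(probs[i])
        order.append(remaining.pop(i))
    return order, logp

def reinforce_kl_loss(scores, behavior_scores, reward, baseline, beta=0.05):
    # REINFORCE with an EMA baseline plus a KL penalty to the Polyak-averaged behavior policy.
    _, logp = sample_plackett_luce(scores)
    kl = F.kl_div(F.log_softmax(scores, dim=0),        # D_KL(pi_behavior || pi_theta)
                  F.softmax(behavior_scores, dim=0),
                  reduction="sum")
    return -(reward - baseline) * logp + beta * kl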

Over the next few days, the KL stayed under 0.01, meaning we weren't learning much, and rewards were high because users clicked the first option. The shape of those rewards correlated with position, so if there was no "good" prediction similar to the uploaded photo, the reward would go negative. In short, the interface optimized itself for preference tuning, not geodesic accuracy.

Interface optimization showing preference tuning over geodesic accuracy

Designing for the right ceiling

On October 15th, Meta published a paper about scaling RL for LLMs, conceptually similar to the Chinchilla scaling-laws paper, and I recommend reading it. Its abstract shows that LLM RL scaling follows a sigmoid:

\[ R(C) = R_0 + \frac{A - R_0}{1 + (C_{\text{mid}}/C)^{B}} \]

where A = ceiling (asymptote), B = efficiency, and C = compute.
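In code, the fitted curve is just (the parameter values come from fitting and aren't reproduced here):

def rl_reward(C: float, R0: float, A: float, C_mid: float, B: float) -> float:
    # Sigmoidal compute-to-reward curve: A is the ceiling (asymptote),
    # B the efficiency, C the RL training compute.
    return R0 + (A - R0) / (1 + (C_mid / C) ** B)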

To lift the A you want:

  • Bigger models: 8B → 17B×16 MoE lifts A from 0.610 → 0.710
  • More context: 14k → 32k tokens = 0.610 → 0.645
  • Larger batches: 512 → 2048 = 0.605 → 0.645
  • Better precision: FP32 logits + CISPO loss = 0.52 → 0.61
Illustration of the precision improvement: FP32 logits + CISPO.

The implication is that RL (in this context, online vs. offline is just how you sample from the distribution) is a capacity, context, and, for us, a reward/data-framing issue. Structured, dense verifiers like math and coding are the axes along which RL ceilings get pushed, because we can verify them cheaply and quickly. The next step is applying this in economically valuable domains. What if we rolled up the US economy and turned it into a huge verifiable engine where the reward is USD and the rollout is you being assigned a job to do? You can't grind past the ceiling (A), but you can raise it. And ultimately your distribution also defines the ceiling.

Distribution defines the ceiling

Learning to reward the reward

Yesterday I refactored Geospot Infinity to use DPO, since the previous REINFORCE setup was implicitly preference tuning. Each click generates pairs comparing the winner y_w against the losers y_l, optimizing:

\[ \mathcal{L}_\text{DPO} = -\mathbb{E}\big[ \log \sigma\big( \beta \cdot ( r_\theta(y_w) - r_\theta(y_l) - r_\text{ref}(y_w) + r_\text{ref}(y_l) ) \big) \big] \]

where y_w = the clicked candidate, y_l = a lower-ranked candidate, r_θ = the trainable policy logits, r_ref = the frozen reference baseline, and β = the KL constraint strength. The trainable policy improves via gradient descent against the frozen reference. Try it: geospot.sdan.io. I also made the DPO code available here.
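For reference, the pairwise loss over the candidate logits is only a few lines (a sketch, not necessarily the released code):

import torch
import torch.nn.functional as F

def dpo_pair_loss(policy_logits: torch.Tensor, ref_logits: torch.Tensor,
                  winner: int, loser: int, beta: float = 0.1) -> torch.Tensor:
    # DPO loss for one (clicked, lower-ranked) candidate pair.
    # policy_logits / ref_logits: scores over the 10 candidates from the
    # trainable ranker and the frozen reference.
    policy_margin = policy_logits[winner] - policy_logits[loser]
    ref_margin = ref_logits[winner] - ref_logits[loser]
    return -F.logsigmoid(beta * (policy_margin - ref_margin))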

If the top 100 players on Geoguessr dot com only used my model, the online RL policy would be great forever. Similarly, if the top coders used Cursor Tab completion it would get better, but what happens if those users just keep accepting everything? Learning how to reward the reward seems to be the implicit challenge. You want a never-ending rotation of... smart users.

You can't optimize past your distribution πŸ’« but you can change it πŸ’«