I Tried AI Desktop Agents. They're Not Ready — But the Opportunity Is Massive.

Tags: Agentic AI, Claude, Anthropic, Cowork
Published: March 25, 2026
Author: Akshansh Bhatt
I recently sat down with Claude's Cowork feature, fully expecting to be blown away. The demo videos looked incredible — fast cursor movements, precise keystrokes, tasks getting done without hand-holding. So I gave it something I'd actually want automated: editing a photo in Lightroom Classic to match my specifications.
It failed. Royally.
It spent an absurd amount of time just figuring out what was on the screen, and despite all that processing, the final edit looked like something a random number generator would produce. Not even close to what I asked for. I sat there watching an AI agent fumble through slider adjustments like someone using Lightroom for the first time, blindfolded.
That experience sent me down a rabbit hole. I started pulling apart why it was so bad, and I came out the other side with a framework for thinking about this problem — and honestly, with more excitement about the space than I had going in.

The Screenshot Problem

Here's the fundamental issue: every major AI desktop agent right now — Claude's Cowork, OpenAI's Operator, Google's Project Mariner — works by taking screenshots of your screen and trying to figure out what to do next. Take a screenshot, identify UI elements, click something, take another screenshot, repeat.
For web apps, this is annoying but workable. Websites have the DOM, semantic HTML, ARIA labels — structure that agents can reason about. But native desktop apps? It's pure pixel guessing. The agent has zero access to the underlying code, application state, or even what a button does until it clicks it and checks what happened.
Think about what that means in practice. Context menus don't exist until you right-click. Keyboard shortcuts — the thing that makes power users fast — are completely invisible to an agent that only sees pixels. And the agent has no idea if a save operation actually worked unless there's a visual confirmation on screen. It's like asking someone to use your computer while they can only look at it through a series of Polaroid photos.
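To make the loop concrete, here is a minimal sketch of the perceive-decide-act cycle these agents run. Every function is a stub standing in for a real component (a screen-capture library, a vision-language model, an input-synthesis library); the names and the action format are my own invention for illustration.

```python
import time

def capture_screen(state):
    # Stub: a real agent would grab raw pixels here (e.g. via a
    # screen-capture library). There is no DOM and no accessibility tree.
    return {"pixels": state["screen"]}

def model_decide(shot, goal):
    # Stub: a real agent would send the pixels to a vision model and get
    # back an action. Here we pretend the goal is "text visible on screen".
    if goal in shot["pixels"]:
        return {"type": "done"}
    return {"type": "click", "x": 412, "y": 880}

def perform(action, state):
    # Stub: a real agent would synthesize actual mouse/keyboard events.
    state["screen"] += " clicked"

def agent_loop(goal, state, max_steps=10, delay=0.0):
    """Pixels in, input events out -- no app state, no semantics."""
    for step in range(max_steps):
        shot = capture_screen(state)
        action = model_decide(shot, goal)
        if action["type"] == "done":
            return step  # how many screenshots it took to "see" success
        perform(action, state)
        time.sleep(delay)  # the usual blind fixed delay between actions
    return max_steps
```

The only success signal available is "does the next screenshot look right", which is exactly why invisible state (shortcuts, pending saves, hidden menus) never enters the loop.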

Why Lightroom Was the Worst Possible Test (and the Most Honest One)

Lightroom Classic is basically a torture test for screenshot-based agents, and I didn't realize that until after my frustration cooled off. The problems stack up fast.
Creative apps have continuous parameter spaces. A button is either clicked or not — that's easy. But a slider that needs to go from +15 to +37? That's an agonizing loop of move-cursor-screenshot-check-adjust for an agent that can only interact through mouse movements. A human makes fifty micro-adjustments in under a minute, feeling their way through an edit. The agent took twenty minutes to make ten clumsy ones.
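The slider problem is easy to see in code. In this sketch, every check of the slider's position costs one full screenshot round-trip, and each drag is imprecise, so the agent converges by repeated overshoot-and-correct steps. The specific values and the "closes ~70% of the gap" drag behavior are invented for illustration.

```python
def read_slider_from_screenshot(true_value):
    # Stub for OCR-ing the slider's numeric label out of a screenshot.
    return true_value

def drag_slider(current, target):
    # Stub for an imprecise mouse drag that only closes ~70% of the gap,
    # mimicking overshoot/undershoot on a continuous control.
    return current + 0.7 * (target - current)

def move_slider(start, target, tolerance=0.5):
    value = start
    screenshots = 0
    while True:
        screenshots += 1  # one full capture-and-inspect round-trip per check
        seen = read_slider_from_screenshot(value)
        if abs(seen - target) <= tolerance:
            return screenshots
        value = drag_slider(seen, target)
```

Going from +15 to +37 with this loop takes five screenshot round-trips for a single slider; a real edit touches a dozen sliders, and a real round-trip involves a vision-model call, not a cheap function return.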
Then there's the feedback problem. When I nudge an exposure slider, I immediately see the image get brighter, and I know intuitively if I've gone too far. The agent sees pixels change but has no aesthetic judgment. It's optimizing without knowing what "good" looks like.
And here's the kicker — Lightroom has a Lua scripting API. It supports XMP sidecar files where every edit is defined as structured data. Presets are literally just text files. If an agent expressed "warm tones, lifted shadows, slight vignette" as actual parameter values and wrote them to an XMP file, the edit would happen instantly and precisely. No cursor fumbling, no screenshot loops.
The agent was trying to impersonate a human using a GUI, when it should have been using the perfectly good back door that already existed.
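Here is roughly what the back door looks like: express the intent as parameter values and write them into an XMP sidecar. The tag names follow Adobe's Camera Raw Settings (`crs:`) namespace, but treat the exact set as illustrative rather than a complete, valid sidecar, and the intent-to-parameter mapping is a hypothetical placeholder for what a model would do.

```python
XMP_TEMPLATE = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description xmlns:crs="http://ns.adobe.com/camera-raw-settings/1.0/"
   crs:Temperature="{temperature}"
   crs:Shadows2012="{shadows}"
   crs:PostCropVignetteAmount="{vignette}"/>
 </rdf:RDF>
</x:xmpmeta>
"""

def intent_to_params(intent):
    # Hypothetical mapping from a natural-language intent to parameter
    # values; a real system would use a model plus per-app calibration.
    params = {"temperature": 5500, "shadows": 0, "vignette": 0}
    if "warm" in intent:
        params["temperature"] = 6500
    if "lifted shadows" in intent:
        params["shadows"] = 40
    if "vignette" in intent:
        params["vignette"] = -15
    return params

def write_sidecar(path, intent):
    # One file write replaces the entire cursor-and-screenshot loop.
    params = intent_to_params(intent)
    with open(path, "w") as f:
        f.write(XMP_TEMPLATE.format(**params))
    return params
```

No pixels involved: the edit is exact, instant, and verifiable by reading the file back.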

A Hierarchy of Automation Difficulty

After thinking about this for a while, I started seeing a pattern. Not all desktop tasks are created equal, and the difficulty of automating them falls into a pretty clear hierarchy.
Tier 1 — Already solved. Apps with API connectors, CLI tools, and structured data pipelines. File management, calendar, email, Slack. These tasks involve discrete actions, offer programmatic access points, and carry little state. MCP integrations have basically handled this tier.
Tier 2 — Mostly works. Simpler GUI apps with predictable, static layouts. Settings panels, form-filling, basic browser navigation. The UI is simple enough that screenshot-based agents can muddle through.
Tier 3 — This is where the gold is. Complex apps that have scripting backdoors, but agents don't use them yet. Lightroom (Lua + XMP), Photoshop (ExtendScript), Excel (VBA), Blender (Python API), VS Code (extensions). These are "hard" only because nobody has connected the agent to the programmatic layer. That's an integration problem, not an AI problem.
Tier 4 — Genuinely hard. Complex GUI apps with high statefulness, continuous parameters, and little or no scripting API. Drag-and-drop design tools, DAWs like Ableton, CAD software. Vision-dependent, subjective outputs, tight feedback loops humans handle intuitively.
Tier 5 — Currently unreachable. Tasks requiring taste, physical intuition, or real-time reactive judgment. Color grading a film to match a mood, mixing audio where you need to feel the bass, competitive gaming. The kind of thing where an expert says, "I just know when it's right."
Most people in the agent space are trying to solve Tiers 4 and 5 by throwing better vision models at the problem. I think the real opportunity is in Tier 3 — where the problem is already solved in principle, but no one has built the bridge.

The Problem Nobody's Talking About: Time

Here's something I haven't seen discussed anywhere, and it might be one of the sneakiest failure modes.
Humans have an intuitive sense of "the app is still thinking." We recognize a spinning cursor, a slight freeze, the feel of the interface being sluggish. We unconsciously wait before doing the next thing. It's so natural we don't even notice we're doing it.
Screenshot-based agents have almost none of this awareness. The agent takes a screenshot, sees a slider at 40, assumes the action completed, and moves to the next step. But the preview hasn't rendered yet. So the next screenshot captures a half-rendered or stale state. The agent "corrects" based on wrong information, and everything spirals.
There are at least three flavors of this:
Obvious loading — spinners, progress bars, grayed-out UI. Detectable if the agent bothers to look, but completely non-standardized across apps.
Implicit loading — the UI looks normal, but the output hasn't updated. The number says 40, but the image still reflects 30. No visual indicator that anything is pending. Humans catch this because the image "pops" when rendering finishes. Agents don't have that instinct.
Cascading state changes — one edit triggers a chain of recalculations downstream. The UI might look settled for a moment before everything shifts again. In Excel, changing one cell might cascade through hundreds of formulas. In audio production, adjusting one track's volume might trigger a compressor on the master bus to react differently.
The current workaround? Most frameworks just add fixed delays. "Wait two seconds after every action." That's terrible because it's either too long (making everything painfully slow) or too short (causing errors on heavier operations). It's a Band-Aid on a wound that needs stitches.
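A better baseline than a fixed sleep is cheap to sketch: poll the screen and treat "several consecutive identical captures" as a ready signal, with a timeout as a safety net. This is still a heuristic (it can be fooled by animations and by cascading recalculations), but it adapts to slow operations instead of guessing. `capture` here stands in for a real screen-grab of the region of interest.

```python
import time

def wait_until_stable(capture, poll=0.05, stable_polls=2, timeout=5.0):
    """Return True once `capture()` yields the same frame `stable_polls`
    times in a row; return False if the UI is still churning at timeout."""
    deadline = time.monotonic() + timeout
    prev = capture()
    streak = 0
    while time.monotonic() < deadline:
        time.sleep(poll)
        cur = capture()
        streak = streak + 1 if cur == prev else 0
        if streak >= stable_polls:
            return True   # UI settled: safe to take the "real" screenshot
        prev = cur
    return False          # still changing: escalate, don't act on stale pixels
```

The key difference from `sleep(2)` is the failure mode: a slow render makes the agent wait longer, and a render that never settles surfaces as an explicit `False` instead of a silent wrong action.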

The Frontier Isn't Fixed

Here's the thought that actually got me excited about this space, despite the frustration.
The difficulty tiers I described aren't permanent. They're collapsing.
Think about it from an app developer's perspective. If your competitor's software works seamlessly with AI agents and yours doesn't, users will migrate. "Agent-friendliness" is becoming a feature differentiator the same way "mobile-friendly" was in 2012. Nobody planned to rebuild their website for phones either — until you had to, or you were irrelevant.
MCP adoption is already exploding. Notion, Slack, Gmail, Stripe — they're all racing to ship agent connectors. That's Tier 3 apps voluntarily dropping themselves into Tier 1. Not out of charity, but because if agents can't interact with your tool, you get cut out of the workflow.
And it'll go deeper than just connectors. Imagine if apps exposed a simple "I'm done processing" signal through an accessibility API or local socket. Trivial to implement on the app side, transformative for agent reliability. Imagine if creative apps published structured schemas of their adjustable parameters — not just "here's a button," but "here's a brightness slider, range -100 to +100, current value +15, affects exposure of selected image."
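The app-side cost of such a signal really is tiny. As a sketch of the idea, here the "app" pushes one small status message over a local socket and the "agent" blocks on it instead of diffing screenshots; `socketpair()` stands in for a real local socket or accessibility channel, and the message schema is invented for illustration.

```python
import json
import socket

def app_finished_rendering(sock, operation):
    # App side: one line per long-running operation.
    msg = json.dumps({"event": "done", "op": operation}) + "\n"
    sock.sendall(msg.encode())

def agent_wait_for(sock, operation, timeout=5.0):
    # Agent side: block on a structured signal, not on a pixel heuristic.
    sock.settimeout(timeout)
    buf = b""
    while not buf.endswith(b"\n"):
        buf += sock.recv(1024)
    msg = json.loads(buf)
    return msg["event"] == "done" and msg["op"] == operation

# Demo wiring: in reality these would be two separate processes.
app_side, agent_side = socket.socketpair()
app_finished_rendering(app_side, "develop_preview")
assert agent_wait_for(agent_side, "develop_preview")
```

One `sendall` on the app side eliminates the entire class of stale-screenshot failures described earlier, which is why I think this kind of interface will get built.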
The apps that win in the next few years won't necessarily be the ones with the best GUI. They'll be the ones with the best agent integration surface. We saw this exact shift play out with APIs — Salesforce was generating over 50% of its revenue through API calls rather than its UI as early as 2012. The same thing is about to happen with agent-accessible interfaces.

Where I'm Headed

I'm a full-stack web developer, not a systems engineer. I'm not going to out-engineer ByteDance or Simular at the OS layer, and I'm honest about that. But after digging into this, I believe the most impactful gap isn't at the OS level anyway. It's in the middleware — the layer that connects agents to the scripting backends that already exist in complex applications.
My plan is to start small and specific: build an open-source framework that maps natural language intent to application scripting APIs, starting with Lightroom. Build the bridge between "make this warmer with lifted shadows" and the actual parameter writes. Bake in smart temporal awareness from day one — not dumb fixed delays, but actual ready-state detection.
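To show the shape of that middleware (not a design I've committed to), here is a toy end-to-end pass: parse intent into parameter writes, apply them through the app's scripting surface, and gate completion on a ready check rather than a fixed sleep. Every name here is hypothetical, and the rule-based intent parser is a stand-in for a model.

```python
import time

class LightroomBackend:
    """Stand-in for a real backend that writes XMP or drives the Lua API."""
    def __init__(self):
        self.params = {}
        self.busy = False
    def apply(self, params):
        self.params.update(params)
    def is_ready(self):
        return not self.busy

# Toy intent rules; a real system would map language to parameters
# with a model plus a per-app schema of adjustable values.
INTENT_RULES = {
    "warmer": {"temperature": +500},
    "lifted shadows": {"shadows": +40},
}

def run(intent, backend, poll=0.01, timeout=1.0):
    writes = {}
    for phrase, delta in INTENT_RULES.items():
        if phrase in intent:
            writes.update(delta)
    backend.apply(writes)
    deadline = time.monotonic() + timeout
    while not backend.is_ready():        # ready-state gate, not sleep(2)
        if time.monotonic() > deadline:
            raise TimeoutError("backend never settled")
        time.sleep(poll)
    return backend.params
```

The point of the sketch is the separation of concerns: intent parsing, a per-app scripting backend, and temporal awareness are three pluggable pieces, so adding Photoshop or Blender later means writing a new backend, not a new framework.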
Ship that. See if it resonates. Then generalize.
The big players are all racing to build better agents. Very few people are building the infrastructure that makes those agents actually reliable on real desktop software. That's where I want to be — not competing with Anthropic and OpenAI, but building something they'd want to integrate with.
I went into this expecting to be wowed by AI desktop automation. I came out disappointed by the current state, but genuinely excited about what needs to be built next. Sometimes the best startup ideas come from the exact moment you yell at your computer, "Why is this so bad?"
I intend to make it less bad.