
Oct 21, 2025

Computer Use Agents represent a significant advancement in AI automation, allowing artificial intelligence to interact with computers the same way humans do: by seeing screens, clicking buttons, typing text, and navigating interfaces.
Unlike traditional API-based automation that requires specific endpoints and documentation, Computer Use Agents interact with applications through their visual interfaces. The agent observes screen content, makes decisions about what actions to take, and executes those actions through simulated mouse clicks and keyboard input. This universal approach works with any application that has a visual interface—from legacy systems without APIs to modern web applications.
This guide will take you through building practical Computer Use Agents using OpenAI's computer-use-preview model. You'll learn the underlying architecture, implement working examples from weather lookups to form automation, and understand both the potential and current limitations of this beta technology. As usual, you can follow along with the video or read the written version below.
Important Note: This technology is currently in beta. OpenAI explicitly states it should not be used for production applications.
Understanding Computer Use Agents
Computer Use Agents operate on a fundamentally different paradigm than traditional automation. Instead of making structured API calls, these agents visually perceive your screen using OpenAI's vision capabilities, interpret what they see, and execute actions through browser automation tools like Playwright.
The power of this approach lies in its universal applicability. Whether you're dealing with legacy systems, web applications with complex interfaces, or desktop software without programmatic access, Computer Use Agents provide a path forward. The agent doesn't need to understand the underlying code or API structure; it just needs to see the interface and understand how to interact with it.
The CUA Loop: How Everything Works

The Computer Use Agent operates through what's known as the "CUA Loop", a continuous feedback cycle that forms the backbone of the agent's functionality.
The Loop:
Objective Input → You provide a task goal to the agent
Screenshot & Analysis → Agent captures current screen state and interprets it via OpenAI's vision API
Action Decision → Model determines next step (click, type, scroll, navigate)
Execute & Evaluate → Action is performed in the browser and results are checked
Repeat or Exit → Process continues until success or unrecoverable error
The process begins when you provide an objective like "find me tickets for Friday's Detroit Tigers game versus Cleveland." The agent sends this to the computer-use-preview model, which understands both textual instructions and visual interface elements. It immediately takes a screenshot, which serves as its "eyes," and analyzes the image to identify clickable elements, text fields, buttons, and other relevant components.
Based on this analysis, the agent determines the next action (clicking, typing, scrolling, or navigating), executes it through Playwright, then captures another screenshot to observe the results. This creates a continuous feedback loop where the agent evaluates progress and determines next steps. Throughout the loop, it maintains context about previous actions, allowing sophisticated multi-step workflows like logging into websites by finding login buttons, identifying credential fields, entering information, and submitting forms while adapting to each site's specific layout.
Setting Up Your Development Environment
Getting started with Computer Use Agents requires proper setup with several key dependencies. The foundation relies on OpenAI's sample application, which provides the framework for browser interaction and API communication.
Installation Steps
First, ensure you have Python 3.8 or higher installed. Then follow these steps:
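The commands below assume you're working from the companion repository listed under Additional Resources; any checkout of OpenAI's sample application follows the same pattern:

```bash
# Clone the companion repository (see Additional Resources below)
git clone https://github.com/godfreynolan/computer-use-preview
cd computer-use-preview

# Install the Python dependencies
pip install -r requirements.txt

# Download the Chromium browser Playwright will control
playwright install
```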
The requirements.txt file includes several critical packages:
OpenAI Python library - for API communication with the computer-use-preview model
Playwright - for browser automation and control
Pillow - for image processing and screenshot handling
The playwright install command downloads and configures the Chromium browser that Playwright will use. This controlled browser environment ensures consistent behavior and proper screenshot capture: the browser runs in a standardized configuration with specific dimensions (typically 1024x768) to provide predictable layouts for the vision model.
Testing the CLI
Once installed, you can test the system with a simple query:
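For example (the entry point and flag names here follow OpenAI's sample application and may differ in your checkout):

```bash
python cli.py --computer local-playwright --input "What is the weather in Port Sanilac, Michigan?"
```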
The agent will launch a browser, navigate to a search engine, enter your weather query, and return the results. While this CLI interface is useful for initial experimentation, most practical applications require programmatic control that can run autonomously without manual intervention.
Moving Beyond the CLI: Programmatic Control
While the command-line interface is useful for testing, practical applications require code you can run autonomously, schedule, or integrate into larger workflows. This involves:
Launching a browser with Playwright
Sending the initial query to computer-use-preview model
Passing the response to the CUA loop
Printing the final response
Closing the browser
Let's build this step by step.
Launching the Browser with Playwright
The foundation is a controlled browser environment. Here's how to set it up:
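A minimal sketch using Playwright's synchronous API (Bing as the starting page is an illustrative choice):

```python
from playwright.sync_api import sync_playwright

playwright = sync_playwright().start()
browser = playwright.chromium.launch(
    headless=False,          # visible window so you can watch the agent
    chromium_sandbox=True,   # isolate the browser process
    args=["--disable-extensions", "--disable-file-system"],
)
page = browser.new_page()
# Viewport must match the display_width/height sent to the API
page.set_viewport_size({"width": 1024, "height": 768})
# Navigate somewhere useful and give the page time to stabilize
page.goto("https://www.bing.com", timeout=10000)
```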
Several aspects of this configuration are important:
Non-headless mode (headless=False) allows you to watch the agent work in real time, which is invaluable for debugging.
Standardized viewport (1024x768) ensures predictable screen dimensions. The AI model needs consistent sizing to accurately target click coordinates and understand layout. These dimensions must match what you specify in API calls.
Disabled extensions and file system (--disable-extensions, --disable-file-system) provide a clean, isolated environment, preventing unexpected popups or dialogs that could confuse the agent.
Chromium sandbox (chromium_sandbox=True) adds an extra security layer by isolating the browser process.
Initial navigation and timeout ensure the page is fully loaded before the agent begins interacting, preventing errors from attempting to interact with elements that haven't rendered yet. The 10-second (10000ms) timeout gives sufficient time for the page to stabilize.
Making the Initial API Call
With the browser ready, we send our first request to establish the agent's objective:
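A sketch of that first request; the parameter names follow the Responses API and are explained below:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="computer-use-preview",
    tools=[{
        "type": "computer_use_preview",
        "display_width": 1024,    # must match the browser viewport
        "display_height": 768,
        "environment": "browser",
    }],
    input=[{
        "role": "user",
        "content": [{
            "type": "input_text",
            "text": "What is the weather in Port Sanilac, Michigan?",
        }],
    }],
    reasoning={"summary": "concise"},
    truncation="auto",
)
```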
Notice several critical elements:
Model specification - model="computer-use-preview" is the specialized OpenAI model for computer control, not a standard GPT model.
Responses API - Uses client.responses.create() rather than the standard chat.completions endpoint.
Tools configuration - The tools parameter specifies:
type: "computer_use_preview" - Enables computer control capabilities
display_width/height: 1024/768 - Must match your browser viewport dimensions
environment: "browser" - Specifies browser-based automation (alternatives: "mac", "windows", "ubuntu")
Input structure - Uses an input array with role: "user" and content containing an input_text item with your task description.
Reasoning parameter - "summary": "concise" controls how much internal thought process the model exposes (the alternative is "detailed", though I could only get "concise" to work at the time of writing).
Truncation - "auto" allows OpenAI to manage token limits automatically.
Initial response - Won't contain a final answer. Instead, response.output includes a request for a screenshot of the current browser state, which initiates the agent loop.
The response typically looks like this:
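In abridged form, with placeholder IDs (the SDK actually returns typed objects rather than raw dictionaries; this just shows the shape):

```python
# response.output, abridged; IDs are placeholders
[
    {
        "type": "computer_call",
        "call_id": "call_abc123",
        "action": {"type": "screenshot"},
        "pending_safety_checks": [],
    }
]
```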
This tells us the agent needs a screenshot to proceed.
Implementing the Complete Agent Loop
The agent loop is where automation happens. This continuous cycle of observation, decision, and action enables the agent to navigate complex workflows autonomously.
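Below is a minimal sketch of the loop, assuming the page, client, and tools objects from the previous sections; field names follow the Responses API as described above, and the key-name comment flags a detail you may need to adapt:

```python
import base64
import time

def agent_loop(client, page, response, tools):
    """Simplified CUA loop: observe, act, screenshot, repeat."""
    while True:
        computer_calls = [item for item in response.output
                          if item.type == "computer_call"]
        if not computer_calls:
            # No further actions requested: return the final text answer
            for item in response.output:
                if item.type == "message":
                    return item.content[0].text
            return None

        call = computer_calls[0]
        action = call.action

        # Translate the model's action into a Playwright interaction
        if action.type == "click":
            page.mouse.click(action.x, action.y)
        elif action.type == "type":
            page.keyboard.type(action.text)
        elif action.type == "keypress":
            for key in action.keys:
                # Key names may need mapping to Playwright's
                # (e.g. "ENTER" -> "Enter")
                page.keyboard.press(key)
        elif action.type == "scroll":
            page.mouse.wheel(action.scroll_x, action.scroll_y)
        time.sleep(1)  # give the page a moment to respond

        # Send the new screen state back to the model, acknowledging
        # any pending safety checks so the loop can continue
        screenshot_b64 = base64.b64encode(page.screenshot()).decode()
        response = client.responses.create(
            model="computer-use-preview",
            previous_response_id=response.id,
            tools=tools,
            input=[{
                "type": "computer_call_output",
                "call_id": call.call_id,
                "acknowledged_safety_checks": [
                    {"id": c.id, "code": c.code, "message": c.message}
                    for c in call.pending_safety_checks
                ],
                "output": {
                    "type": "computer_screenshot",
                    "image_url": f"data:image/png;base64,{screenshot_b64}",
                },
            }],
            truncation="auto",
        )
```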
This loop implements several critical features:
Computer Call Processing - Each iteration checks whether the response contains computer calls. If not, the task is complete and we extract the final answer.
Action Handling - Different actions (click, type, scroll, key press) require different browser interactions. Each is handled appropriately with timing delays to allow pages to respond.
Screenshot Feedback - After each action, a screenshot is captured and sent back to the model, creating the continuous feedback loop essential for adaptive behavior.
Safety Check Acknowledgment - The agent may pause before certain actions (like form submissions) and present safety checks. These must be acknowledged to proceed.
Message History via Response ID - The previous_response_id parameter maintains conversation context, helping the agent understand what it has already tried and what the current state represents.
Iteration Safety - While not shown in this simplified version, production code should include a maximum iteration limit to prevent infinite loops.
Example 1: Weather Lookup Agent
Let's put everything together with a complete weather lookup example. This demonstrates basic web navigation, text input, and information extraction.
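A compact sketch wiring the pieces together; it reuses the agent_loop() function from the previous section, and Bing is an illustrative starting page:

```python
from openai import OpenAI
from playwright.sync_api import sync_playwright

client = OpenAI()
tools = [{"type": "computer_use_preview", "display_width": 1024,
          "display_height": 768, "environment": "browser"}]

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, chromium_sandbox=True,
                                args=["--disable-extensions",
                                      "--disable-file-system"])
    page = browser.new_page()
    page.set_viewport_size({"width": 1024, "height": 768})
    page.goto("https://www.bing.com", timeout=10000)

    response = client.responses.create(
        model="computer-use-preview",
        tools=tools,
        input=[{"role": "user", "content": [{
            "type": "input_text",
            "text": "What is the weather in Port Sanilac, Michigan?"}]}],
        reasoning={"summary": "concise"},
        truncation="auto",
    )
    print(agent_loop(client, page, response, tools))
    browser.close()
```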
When you run this agent, you'll observe it:
Open the browser and navigate to Bing
Identify the search input field
Click on the search field
Type "weather in Port Sanilac, Michigan"
Press Enter or click the search button
Wait for results to load
Analyze the page for weather information
Extract temperature, conditions, and forecast
Return the formatted weather data
The entire process typically takes 15-30 seconds depending on page load times and network conditions.
Example 2: Job Search Automation
Job searching typically involves repetitive tasks: visiting multiple job boards, entering the same search criteria, filtering through irrelevant listings, and compiling results.
This example shows how to automate that process using Computer Use, demonstrating multi-step navigation, handling of commercial websites with popups and cookie banners, and intelligent filtering of search results. The agent adapts to different site layouts without hardcoded selectors, making it useful for aggregating opportunities across multiple job boards efficiently. The implementation targets Dice.com and illustrates both the power of Computer Use for complex automation and its limitations with anti-bot protections on commercial sites.
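A sketch of the kickoff, reusing the browser setup, client, tools, and agent_loop() from earlier; the objective wording is illustrative:

```python
objective = (
    "Go to Dice.com and search for Android developer jobs in Detroit, "
    "Michigan. Dismiss any cookie banners, filter out remote positions "
    "if possible, and list the title and company of the first five results."
)

page.goto("https://www.dice.com", timeout=10000)
response = client.responses.create(
    model="computer-use-preview",
    tools=tools,
    input=[{"role": "user", "content": [{"type": "input_text",
                                         "text": objective}]}],
    reasoning={"summary": "concise"},
    truncation="auto",
)
print(agent_loop(client, page, response, tools))
```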
As you can see in the code, this agent demonstrates several advanced capabilities:
Cookie Consent Handling - The agent recognizes and dismisses cookie consent banners automatically.
Form Field Identification - Even though Dice.com's interface differs from Indeed or other job sites, the agent visually identifies the correct input fields.
Filtering Attempts - The agent may try to use filters to refine results (e.g., removing remote jobs when searching for on-site positions).
Result Navigation - The agent clicks through job listings to gather more detailed information.
Challenges with Indeed - When testing with Indeed.com, you may encounter Cloudflare protection that blocks the automated browser. This is one of the current limitations of the technology.
Example 3: Automated Form Filling
Form filling addresses one of the most tedious business workflows: manually entering data into web forms repeatedly. Whether it's lead generation across multiple platforms, bulk data entry tasks, or customer service inquiries, this capability can save hours of repetitive work.
This example demonstrates how to build an agent that can navigate to any form, intelligently identify fields, fill them with provided data, and handle the submission process, including the safety confirmations that Computer Use requires before executing potentially irreversible actions like form submissions.
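A sketch of the setup; the URL and field values are placeholders you would swap for a real form, and the scaffolding (page, client, tools) carries over from earlier sections:

```python
# Placeholder form data; substitute your real values
form_data = {
    "name": "Jane Smith",
    "email": "jane.smith@example.com",
    "message": "I'd like more information about your services.",
}

objective = (
    f"Fill out the contact form on this page with the name "
    f"'{form_data['name']}', the email '{form_data['email']}', and the "
    f"message '{form_data['message']}', then submit the form."
)

page.goto("https://example.com/contact", timeout=10000)  # placeholder URL
response = client.responses.create(
    model="computer-use-preview",
    tools=tools,
    input=[{"role": "user", "content": [{"type": "input_text",
                                         "text": objective}]}],
    reasoning={"summary": "concise"},
    truncation="auto",
)
```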
Handling Safety Checks and Form Submission
Form submission is where safety checks become critical. The agent will pause before submitting and ask for confirmation. You need to handle this programmatically:
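A sketch of both confirmation paths; the structured fields follow the Responses API, while the text-pattern check is an assumption you would adapt to your agent's phrasing:

```python
def acknowledge_safety_checks(call):
    # Echo each pending safety check back in the next
    # computer_call_output so the flagged action can proceed
    return [{"id": c.id, "code": c.code, "message": c.message}
            for c in call.pending_safety_checks]

def extract_text(response):
    # Pull the agent's message text, if any, out of response.output
    for item in response.output:
        if item.type == "message":
            return item.content[0].text
    return None

def answer_textual_confirmation(client, response, tools):
    # If the agent asks in plain language ("Should I submit?"),
    # detect the question and reply affirmatively
    text = (extract_text(response) or "").lower()
    if "submit" in text and "?" in text:
        return client.responses.create(
            model="computer-use-preview",
            previous_response_id=response.id,
            tools=tools,
            input=[{"role": "user", "content": [{
                "type": "input_text",
                "text": "Yes, please submit the form."}]}],
            truncation="auto",
        )
    return None
```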
This enhanced loop handles two types of confirmations:
Programmatic Safety Checks - Structured confirmations in the API response that must be acknowledged with safety check IDs.
Natural Language Confirmations - The agent asking textually "Should I submit?" which requires detecting the question pattern and responding affirmatively.
The key insight is catching these confirmations before the loop exits and providing automatic approval to continue the workflow.
Complete Code Example
Here's a full, runnable example that combines all the components:
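This is a runnable sketch assembled from the pieces above; field names follow the Responses API as described in this guide, the weather task is the default objective, and the iteration cap implements the safety measure mentioned earlier:

```python
"""computer_use_agent.py

A runnable sketch combining the pieces from this guide. Requires the
openai, playwright, and python-dotenv packages. Adapt the objective
and starting URL to your task.
"""
import base64
import time

from dotenv import load_dotenv
from openai import OpenAI
from playwright.sync_api import sync_playwright

load_dotenv()  # loads OPENAI_API_KEY from .env
client = OpenAI()

TOOLS = [{
    "type": "computer_use_preview",
    "display_width": 1024,
    "display_height": 768,
    "environment": "browser",
}]

OBJECTIVE = "What is the weather in Port Sanilac, Michigan?"


def agent_loop(page, response, max_iterations=30):
    """CUA loop with an iteration cap to prevent infinite loops."""
    for _ in range(max_iterations):
        calls = [i for i in response.output if i.type == "computer_call"]
        if not calls:
            # Done: return the agent's final text message, if any
            for item in response.output:
                if item.type == "message":
                    return item.content[0].text
            return None
        call = calls[0]
        action = call.action
        if action.type == "click":
            page.mouse.click(action.x, action.y)
        elif action.type == "type":
            page.keyboard.type(action.text)
        elif action.type == "keypress":
            for key in action.keys:
                page.keyboard.press(key)  # key names may need mapping
        elif action.type == "scroll":
            page.mouse.wheel(action.scroll_x, action.scroll_y)
        time.sleep(1)  # let the page settle before the next screenshot
        screenshot_b64 = base64.b64encode(page.screenshot()).decode()
        response = client.responses.create(
            model="computer-use-preview",
            previous_response_id=response.id,
            tools=TOOLS,
            input=[{
                "type": "computer_call_output",
                "call_id": call.call_id,
                "acknowledged_safety_checks": [
                    {"id": c.id, "code": c.code, "message": c.message}
                    for c in call.pending_safety_checks],
                "output": {
                    "type": "computer_screenshot",
                    "image_url": f"data:image/png;base64,{screenshot_b64}",
                },
            }],
            truncation="auto",
        )
    return "Stopped after hitting the iteration limit."


def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=False, chromium_sandbox=True,
            args=["--disable-extensions", "--disable-file-system"])
        page = browser.new_page()
        page.set_viewport_size({"width": 1024, "height": 768})
        page.goto("https://www.bing.com", timeout=10000)
        response = client.responses.create(
            model="computer-use-preview",
            tools=TOOLS,
            input=[{"role": "user", "content": [{
                "type": "input_text", "text": OBJECTIVE}]}],
            reasoning={"summary": "concise"},
            truncation="auto",
        )
        print(agent_loop(page, response))
        browser.close()


if __name__ == "__main__":
    main()
```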
Save this as computer_use_agent.py and run it after setting your OpenAI API key in a .env file:
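```
OPENAI_API_KEY=your-api-key-here
```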
Then execute:
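```bash
python computer_use_agent.py
```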
Current Limitations and Caveats
While Computer Use Agents show tremendous promise, understanding their current limitations is essential for setting realistic expectations.
Browser-Based Only (For Now)
The current implementation focuses primarily on browser-based automation. While OpenAI mentions support for full desktop environments (Mac, Windows, Ubuntu), these implementations are less mature.
For now, stick with browser-based automation where the technology is most reliable.
Beta Status and Non-Determinism
OpenAI explicitly states this technology is in beta and should not be used for production applications. A critical limitation is the lack of deterministic behavior:
Running the same task twice may take different paths
Success rates vary between attempts
The agent may make different decisions about which elements to click
Timing issues can cause intermittent failures
Unlike traditional automation that follows scripted logic, Computer Use Agents make probabilistic decisions. This makes them unsuitable for scenarios requiring guaranteed repeatability or critical business processes.
Security Measures Block Access
Many websites employ anti-bot measures that Computer Use Agents cannot bypass:
Cloudflare Protection - Many modern websites use Cloudflare's bot detection, which recognizes automated browsers and blocks access entirely.
CAPTCHAs - Any CAPTCHA challenge requires human intervention, breaking the automation workflow. The agent cannot solve visual puzzles or audio challenges designed to distinguish humans from bots.
Sophisticated Bot Detection - Websites may detect non-human patterns through:
Mouse movement analysis
Typing rhythm patterns
Time between actions
Browser fingerprinting
Network request patterns
When an agent encounters these obstacles, you may need to manually intervene by "taking over" the browser, solving the CAPTCHA, and then letting the agent continue.
Performance Considerations
Computer Use Agents are significantly slower than traditional automation:
Simple tasks: 15-30 seconds (like weather lookups)
Moderate complexity: 5-10 minutes (like job searches)
Complex workflows: 20+ minutes (form filling with multiple steps)
When to Use Computer Use Agents
Given these limitations, when does it make sense to use this technology?
Good Use Cases
Automating legacy systems without APIs - If you're dealing with old software that has no programmatic interface, CUAs provide the only automation path forward.
One-off or occasional tasks - For one-off jobs, or recurring tasks that run daily or weekly where occasional failures are acceptable, CUAs can save significant manual effort.
Research and data collection - When gathering information from multiple sources where perfect reliability isn't critical, agents can explore websites more thoroughly than manual searches.
Rapid prototyping - CUAs let you test automation workflows quickly before investing in traditional API integrations.
Dynamic or changing interfaces - When websites frequently redesign their interfaces, traditional automation breaks. CUAs adapt to visual changes without code updates.
Poor Use Cases
Production systems requiring high reliability - Any critical business process should use traditional automation with proper error handling.
Time-sensitive operations - Anything requiring fast response times isn't suitable for current CUA performance.
Financial transactions - Never use beta technology for payments, transfers, or sensitive financial operations.
High-volume batch processing - The cost and time per operation make bulk processing impractical.
Compliance-critical operations - If you need audit trails, deterministic behavior, or regulatory compliance, stick with traditional automation.
Conclusion
By following this tutorial, you've learned how to build Computer Use Agents that automate browser tasks through visual understanding rather than APIs, from weather lookups to form submissions. You now understand the CUA loop architecture, proper safety check handling, and key limitations around reliability and cost. Start with the simple examples provided, experiment with your own use cases, and watch for updates as this technology matures.
Additional Resources
The presentation includes several valuable resources for further exploration:
Official Documentation:
Community Resources:
https://github.com/leonvanzyl/openai-responses-api-tutorial-python
https://github.com/godfreynolan/computer-use-preview
app.py - Find images of a red sports car
app1.py - Initial query example (checking OpenAI news)
app2.py - Weather lookup
app3.py - Job search
app4.py - Pending safety checks handling
app5.py - Complete contact form submission