Oct 21, 2025

Building Computer Use Agents with OpenAI's API

Complete guide to building OpenAI Computer Use Agents with Python and Playwright. Learn visual automation and implement weather-lookup, job-board, and form-filling examples.

Computer Use Agents represent a significant advancement in AI automation, allowing artificial intelligence to interact with computers the same way humans do, by seeing screens, clicking buttons, typing text, and navigating interfaces.

Unlike traditional API-based automation that requires specific endpoints and documentation, Computer Use Agents interact with applications through their visual interfaces. The agent observes screen content, makes decisions about what actions to take, and executes those actions through simulated mouse clicks and keyboard input. This universal approach works with any application that has a visual interface—from legacy systems without APIs to modern web applications.

This guide will take you through building practical Computer Use Agents using OpenAI's computer-use-preview model. You'll learn the underlying architecture, implement working examples from weather lookups to form automation, and understand both the potential and current limitations of this beta technology. As usual, you can follow along with the video or read the written version below.

Important Note: This technology is currently in beta. OpenAI explicitly states it should not be used for production applications.

Understanding Computer Use Agents

Computer Use Agents operate on a fundamentally different paradigm than traditional automation. Instead of making structured API calls, these agents visually perceive your screen using OpenAI's vision capabilities, interpret what they see, and execute actions through browser automation tools like Playwright.

The power of this approach lies in its universal applicability. Whether you're dealing with legacy systems, web applications with complex interfaces, or desktop software without programmatic access, Computer Use Agents provide a path forward. The agent doesn't need to understand the underlying code or API structure; it just needs to see the interface and understand how to interact with it.

The CUA Loop: How Everything Works

The Computer Use Agent operates through what's known as the "CUA Loop": a continuous feedback cycle that forms the backbone of the agent's functionality.

The Loop:

  1. Objective Input → You provide a task goal to the agent

  2. Screenshot & Analysis → Agent captures current screen state and interprets it via OpenAI's vision API

  3. Action Decision → Model determines next step (click, type, scroll, navigate)

  4. Execute & Evaluate → Action is performed in the browser and results are checked

  5. Repeat or Exit → Process continues until success or unrecoverable error

The process begins when you provide an objective like "find me tickets for Friday's Detroit Tigers game versus Cleveland." The agent sends this to the computer-use-preview model, which understands both textual instructions and visual interface elements. It immediately takes a screenshot, which serves as its "eyes," and analyzes the image to identify clickable elements, text fields, buttons, and other relevant components.

Based on this analysis, the agent determines the next action (clicking, typing, scrolling, or navigating), executes it through Playwright, then captures another screenshot to observe the results. This creates a continuous feedback loop where the agent evaluates progress and determines next steps. Throughout the loop, it maintains context about previous actions, allowing sophisticated multi-step workflows like logging into websites by finding login buttons, identifying credential fields, entering information, and submitting forms while adapting to each site's specific layout.
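The loop above can be sketched in a few lines of Python. The callables here (take_screenshot, decide_action, execute) are hypothetical stand-ins for the real Playwright and OpenAI calls shown later in this guide:

```python
def cua_loop(objective, take_screenshot, decide_action, execute, max_steps=25):
    """Skeleton of the CUA feedback cycle with pluggable callables."""
    for step in range(max_steps):
        screenshot = take_screenshot()                 # the agent's "eyes"
        action = decide_action(objective, screenshot)  # model picks the next move
        if action is None:                             # model returned a final answer
            return "done"
        execute(action)                                # click/type/scroll via Playwright
    return "max_steps_reached"                         # bail out of runaway loops
```

With stub callables you can trace this control flow without ever touching a browser or the API.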

Setting Up Your Development Environment

Getting started with Computer Use Agents requires proper setup with several key dependencies. The foundation relies on OpenAI's sample application, which provides the framework for browser interaction and API communication.

Installation Steps

First, ensure you have Python 3.8 or higher installed. Then follow these steps:

# Clone the OpenAI Computer Use Agent sample application
git clone https://github.com/openai/openai-cua-sample-app
cd openai-cua-sample-app

# Install Python dependencies
pip install -r requirements.txt

# Install Playwright and its browser
playwright install

# Test the installation with the CLI
python cli.py --computer local-playwright

The requirements.txt file includes several critical packages:

  • OpenAI Python library - for API communication with the computer-use-preview model

  • Playwright - for browser automation and control

  • Pillow - for image processing and screenshot handling

The playwright install command downloads and configures the Chromium browser that Playwright will use. This controlled browser environment ensures consistent behavior and proper screenshot capture—the browser runs in a standardized configuration with specific dimensions (typically 1024x768) to provide predictable layouts for the vision model.
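Before running the agent, a quick sanity check of the environment can save debugging time. This is a minimal sketch that checks the Python version and whether the key packages resolve (note the import names differ from the package names: Pillow is imported as PIL):

```python
import sys
import importlib.util

def check_environment(min_version=(3, 8), packages=("openai", "playwright", "PIL")):
    """Return (version_ok, missing_packages) for the guide's requirements."""
    version_ok = sys.version_info >= min_version
    missing = [pkg for pkg in packages if importlib.util.find_spec(pkg) is None]
    return version_ok, missing

if __name__ == "__main__":
    ok, missing = check_environment()
    print("Python OK:", ok, "| Missing:", missing or "none")
```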

Testing the CLI

Once installed, you can test the system with a simple query:

python cli.py --computer local-playwright
> what's the weather like in Port Sanilac, MI

The agent will launch a browser, navigate to a search engine, enter your weather query, and return the results. While this CLI interface is useful for initial experimentation, most practical applications require programmatic control that can run autonomously without manual intervention.

Moving Beyond the CLI: Programmatic Control

While the command-line interface is useful for testing, practical applications require code you can run autonomously, schedule, or integrate into larger workflows. This involves:

  1. Launching a browser with Playwright

  2. Sending the initial query to computer-use-preview model

  3. Passing the response to the CUA loop

  4. Printing the final response

  5. Closing the browser

Let's build this step by step.

Launching the Browser with Playwright

The foundation is a controlled browser environment. Here's how to set it up:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=False,
        chromium_sandbox=True,
        env={},
        args=[
            "--disable-extensions",
            "--disable-file-system"
        ]
    )
    page = browser.new_page()
    page.set_viewport_size({"width": 1024, "height": 768})
    page.goto("https://bing.com")
    
    page.wait_for_timeout(10000)

Several aspects of this configuration are important:

Non-headless mode (headless=False) allows you to watch the agent work in real-time, invaluable for debugging.

Standardized viewport (1024x768) ensures predictable screen dimensions. The AI model needs consistent sizing to accurately target click coordinates and understand layout. These dimensions must match what you specify in API calls.

Disabled extensions and file system (--disable-extensions, --disable-file-system) provide a clean, isolated environment preventing unexpected popups or dialogs that could confuse the agent.

Chromium sandbox (chromium_sandbox=True) adds an extra security layer by isolating the browser process.

Initial navigation and timeout ensure the page is fully loaded before the agent begins interacting, preventing errors from attempting to interact with elements that haven't rendered yet. The 10-second (10000ms) timeout gives sufficient time for the page to stabilize.

Making the Initial API Call

With the browser ready, we send our first request to establish the agent's objective:

from openai import OpenAI
from dotenv import load_dotenv
import os

load_dotenv()
client = OpenAI()

response = client.responses.create(
    model="computer-use-preview",
    tools=[{
        "type": "computer_use_preview",
        "display_width": 1024,
        "display_height": 768,
        "environment": "browser"  # other possible values: "mac", "windows", "ubuntu"
    }],
    input=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_text",
                    "text": "Check the latest OpenAI news on bing.com."
                }
            ]
        }
    ],
    reasoning={
        "summary": "concise",
    },
    truncation="auto"
)

print(response.output)

Notice several critical elements:

Model specification - model="computer-use-preview" is the specialized OpenAI model for computer control, not standard GPT models.

Responses API - Uses client.responses.create() rather than the standard chat.completions endpoint.

Tools configuration - The tools parameter specifies:

  • type: "computer_use_preview" - Enables computer control capabilities

  • display_width/height: 1024/768 - Must match your browser viewport dimensions

  • environment: "browser" - Specifies browser-based automation (alternatives: "mac", "windows", "ubuntu")

Input structure - Uses input array with role: "user" and content containing input_text type with your task description.

Reasoning parameter - "summary": "concise" controls how much of its internal thought process the model exposes (the alternative is "detailed", though only "concise" worked at the time of writing).

Truncation - "auto" allows OpenAI to manage token limits automatically.

Initial response - Won't contain a final answer. Instead, response.output includes a request for a screenshot of the current browser state, which initiates the agent loop.

The response typically looks like this:

{
    "ResponseComputerToolCall": {
        "id": "cu_68d1e0746ee881939f997cb435893ba10e7707dd36830812",
        "action": {
            "type": "screenshot"
        },
        "call_id": "call_Gd9Y9G86Om9ofrPuPV8ijBh5",
        "pending_safety_checks": [],
        "status": "completed",
        "type": "computer_call"
    }
}

This tells us the agent needs a screenshot to proceed.
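Programmatically, you can detect this with a small helper that scans response.output for computer_call items. This sketch uses SimpleNamespace stand-ins rather than real SDK response objects:

```python
from types import SimpleNamespace

def next_action_type(output_items):
    """Return the type of the first pending computer_call action, or None if done."""
    for item in output_items:
        if getattr(item, "type", None) == "computer_call":
            return item.action.type
    return None

# Stand-in mimicking the response shown above:
fake_output = [SimpleNamespace(type="computer_call",
                               action=SimpleNamespace(type="screenshot"))]
print(next_action_type(fake_output))  # screenshot
```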

Implementing the Complete Agent Loop

The agent loop is where automation happens. This continuous cycle of observation, decision, and action enables the agent to navigate complex workflows autonomously.

import base64
import time

def get_screenshot(page):
    """Capture and encode screenshot"""
    screenshot_bytes = page.screenshot()
    screenshot_base64 = base64.b64encode(screenshot_bytes).decode("utf-8")
    return screenshot_base64

def handle_model_action(browser, page, action):
    """Execute different types of actions in the browser"""
    # `action` is an SDK object, so use attribute access rather than dict .get()
    action_type = action.type

    if action_type == "click":
        page.mouse.click(action.x, action.y)
        time.sleep(1)  # Allow page to respond

    elif action_type == "type":
        page.keyboard.type(action.text, delay=100)  # Simulate natural typing

    elif action_type == "keypress":
        for key in action.keys:
            page.keyboard.press(key)
        time.sleep(0.5)

    elif action_type == "scroll":
        page.mouse.move(action.x, action.y)
        page.mouse.wheel(action.scroll_x, action.scroll_y)

    elif action_type == "move":
        page.mouse.move(action.x, action.y)

    # "screenshot" and unrecognized actions fall through; the loop screenshots anyway
    return page

def computer_use_loop(browser, page, response):
    """Execute the core agent loop until task completion or failure"""

    while True:
        # Extract computer calls from response
        computer_calls = [
            item for item in response.output
            if item.type == "computer_call"
        ]

        # If no computer calls, we're done
        if not computer_calls:
            print("No more computer calls. Output from model:")
            for item in response.output:
                print(item)
            break

        # We expect at most one computer call per response
        computer_call = computer_calls[0]
        last_call_id = computer_call.call_id
        action = computer_call.action

        # Perform the action in Playwright
        page = handle_model_action(browser, page, action)
        time.sleep(1)

        # Send screenshot back to the model (get_screenshot already returns base64)
        screenshot_base64 = get_screenshot(page)

        # Handle safety checks if present
        pending_safety_checks = getattr(
            computer_call,
            "pending_safety_checks",
            []
        ) or []

        acknowledged_safety_checks = [
            {"id": sc.id} for sc in pending_safety_checks
        ]

        if acknowledged_safety_checks:
            print("Acknowledging safety checks:", acknowledged_safety_checks)

        # Make next API call with updated state
        response = client.responses.create(
            model="computer-use-preview",
            previous_response_id=response.id,
            tools=tools,  # same tools config as the initial call
            input=[{
                "call_id": last_call_id,
                "type": "computer_call_output",
                "acknowledged_safety_checks": acknowledged_safety_checks,
                "output": {
                    "type": "input_image",
                    "image_url": f"data:image/png;base64,{screenshot_base64}"
                }
            }],
            truncation="auto"
        )

        print("Response:", response.output)

    return response

This loop implements several critical features:

Computer Call Processing - Each iteration checks whether the response contains computer calls. If not, the task is complete and we extract the final answer.

Action Handling - Different actions (click, type, scroll, key press) require different browser interactions. Each is handled appropriately with timing delays to allow pages to respond.

Screenshot Feedback - After each action, a screenshot is captured and sent back to the model, creating the continuous feedback loop essential for adaptive behavior.

Safety Check Acknowledgment - The agent may pause before certain actions (like form submissions) and present safety checks. These must be acknowledged to proceed.

Message History via Response ID - The previous_response_id parameter maintains conversation context, helping the agent understand what it has already tried and what the current state represents.

Iteration Safety - While not shown in this simplified version, production code should include a maximum iteration limit to prevent infinite loops.
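A minimal guard against runaway loops might look like this. In this sketch, `step` is a hypothetical callable standing in for one iteration of the loop above, returning True when the task is finished:

```python
MAX_ITERATIONS = 25  # tune to your task's expected length

def run_with_limit(step, max_iterations=MAX_ITERATIONS):
    """Run step(i) until it reports completion or the iteration limit is hit."""
    for i in range(max_iterations):
        if step(i):
            return i + 1  # number of iterations used
    raise RuntimeError(f"Agent did not finish within {max_iterations} iterations")
```

Raising instead of silently exiting makes stuck runs visible in logs and lets callers decide whether to retry.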

Example 1: Weather Lookup Agent

Let's put everything together with a complete weather lookup example. This demonstrates basic web navigation, text input, and information extraction.

def weather_agent(location="Port Sanilac, Michigan"):
    """Complete example: Weather lookup agent"""

    # Initialize browser
    playwright, browser, page = launch_browser()

    try:
        # Make the client and tools visible to computer_use_loop
        global tools, client
        client, initial_response = initialize_agent(
            f"What's the weather like in {location}?"
        )

        tools = [{
            "type": "computer_use_preview",
            "display_width": 1024,
            "display_height": 768,
            "environment": "browser"
        }]

        # Run the agent loop
        final_response = computer_use_loop(browser, page, initial_response)

        # Extract and display result
        for item in final_response.output:
            if hasattr(item, 'content'):
                print(f"\nWeather Result: {item.content}")

    except Exception as e:
        print(f"Error occurred: {e}")
        raise
    finally:
        browser.close()
        playwright.stop()

# Run the agent
if __name__ == "__main__":
    weather_agent()

When you run this agent, you'll observe it:

  1. Open the browser and navigate to Bing

  2. Identify the search input field

  3. Click on the search field

  4. Type "weather in Port Sanilac, Michigan"

  5. Press Enter or click the search button

  6. Wait for results to load

  7. Analyze the page for weather information

  8. Extract temperature, conditions, and forecast

  9. Return the formatted weather data

The entire process typically takes 15-30 seconds depending on page load times and network conditions.

Example 2: Job Search Automation

Job searching typically involves repetitive tasks: visiting multiple job boards, entering the same search criteria, filtering through irrelevant listings, and compiling results.

This example shows how to automate that process using Computer Use, demonstrating multi-step navigation, handling of commercial websites with popups and cookie banners, and intelligent filtering of search results. The agent adapts to different site layouts without hardcoded selectors, making it useful for aggregating opportunities across multiple job boards efficiently. The implementation targets Dice.com and illustrates both the power of Computer Use for complex automation and its limitations with anti-bot protections on commercial sites.

def job_search_agent(job_title="Android developer", location="Detroit, MI"):
    """Search for jobs on job boards"""

    playwright, browser, page = launch_browser()

    try:
        # Navigate directly to Dice.com
        page.goto("https://www.dice.com")
        page.wait_for_timeout(2000)

        # Construct detailed instructions
        instructions = f"""I want you to identify open job positions for an {job_title} in {location}.

Please:
1. Enter '{job_title}' in the job title field
2. Enter '{location}' in the location field
3. Click search
4. Look through the results
5. Provide a summary of what jobs are available"""

        # Make the client and tools visible to computer_use_loop
        global tools, client
        client, initial_response = initialize_agent(instructions)

        tools = [{
            "type": "computer_use_preview",
            "display_width": 1024,
            "display_height": 768,
            "environment": "browser"
        }]

        # Run agent loop
        final_response = computer_use_loop(browser, page, initial_response)

        # Display results
        for item in final_response.output:
            if hasattr(item, 'content'):
                print(f"\nJob Search Results:\n{item.content}")

    except Exception as e:
        print(f"Error during job search: {e}")
        raise
    finally:
        browser.close()
        playwright.stop()

# Run the agent
if __name__ == "__main__":
    job_search_agent()

As you can see in the code, this agent demonstrates several advanced capabilities:

Cookie Consent Handling - The agent recognizes and dismisses cookie consent banners automatically.

Form Field Identification - Even though Dice.com's interface differs from Indeed or other job sites, the agent visually identifies the correct input fields.

Filtering Attempts - The agent may try to use filters to refine results (e.g., removing remote jobs when searching for on-site positions).

Result Navigation - The agent clicks through job listings to gather more detailed information.

Challenges with Indeed - When testing with Indeed.com, you may encounter Cloudflare protection that blocks the automated browser. This is one of the current limitations of the technology.

Example 3: Automated Form Filling

Form filling addresses one of the most tedious business workflows: manually entering data into web forms repeatedly. Whether it's lead generation across multiple platforms, bulk data entry tasks, or customer service inquiries, this capability can save hours of repetitive work.

This example demonstrates how to build an agent that can navigate to any form, intelligently identify fields, fill them with provided data, and handle the submission process, including the safety confirmations that Computer Use requires before executing potentially irreversible actions like form submissions.

def fill_contact_form(url, name, email, message):
    """
    Automated contact form completion agent

    Args:
        url: Target form URL
        name: Name to enter
        email: Email address
        message: Message content
    """

    playwright, browser, page = launch_browser()

    try:
        # Navigate to form
        print(f"Navigating to {url}...")
        page.goto(url)
        page.wait_for_load_state('networkidle')

        # Construct detailed instructions
        instructions = f"""Fill out the contact form on this page with the following information:

Name: {name}
Email: {email}
Message: {message}

After filling out all fields, submit the form.
If you're asked for confirmation before submitting, please proceed with submission."""

        # Initialize agent
        client, initial_response = initialize_agent(instructions)

        # Set up tools and share the client with the loop
        global tools, global_client
        global_client = client
        tools = [{
            "type": "computer_use_preview",
            "display_width": 1024,
            "display_height": 768,
            "environment": "browser"
        }]

        # Run agent loop with form-specific handling
        final_response = computer_use_loop_with_form_submission(
            browser,
            page,
            initial_response
        )

        return final_response

    except Exception as e:
        print(f"Error during form filling: {e}")
        raise
    finally:
        browser.close()
        playwright.stop()

Handling Safety Checks and Form Submission

Form submission is where safety checks become critical. The agent will pause before submitting and ask for confirmation. You need to handle this programmatically:

def _asks_to_submit(text):
    """Check if the agent is asking for permission to submit"""
    if not text:
        return False

    text_lower = text.lower()
    confirmation_phrases = [
        "should i submit",
        "shall i proceed",
        "would you like me to",
        "should i go ahead",
        "ready to submit",
        "go ahead and submit"
    ]

    return any(phrase in text_lower for phrase in confirmation_phrases)

def computer_use_loop_with_form_submission(browser, page, response):
    """Enhanced loop that handles form submission confirmations"""

    while True:
        # Extract computer calls
        computer_calls = [
            item for item in response.output
            if item.type == "computer_call"
        ]

        # Check if we're done OR if model is asking for confirmation
        if not computer_calls:
            # Check the output text for submission questions
            out_text = getattr(response, "output_text", "") or ""

            if _asks_to_submit(out_text):
                # Auto-confirm inside the loop and continue
                print("Model asked for confirmation; auto-confirming: 'Yes, submit the form now.'")
                response = global_client.responses.create(
                    model="computer-use-preview",
                    previous_response_id=response.id,
                    tools=tools,
                    input=[{
                        "role": "user",
                        "content": "Yes, submit the form now."
                    }],
                    truncation="auto"
                )
                # Loop back to process the next computer_call (the actual submit)
                continue
            else:
                # Truly done - print the model's final output and break
                print("No more computer calls. Output from model:")
                for item in response.output:
                    print(item)
                break

        # Process the computer call
        computer_call = computer_calls[0]
        last_call_id = computer_call.call_id
        action = computer_call.action

        # Perform the action
        page = handle_model_action(browser, page, action)
        time.sleep(1)

        # Capture screenshot (get_screenshot already returns base64)
        screenshot_base64 = get_screenshot(page)

        # Acknowledge any safety checks on THIS call
        pending_safety_checks = getattr(computer_call, "pending_safety_checks", []) or []
        acknowledged_safety_checks = [{"id": sc.id} for sc in pending_safety_checks]
        if acknowledged_safety_checks:
            print("Acknowledging safety checks:", acknowledged_safety_checks)

        # Make next API call
        response = global_client.responses.create(
            model="computer-use-preview",
            previous_response_id=response.id,
            tools=tools,
            input=[{
                "call_id": last_call_id,
                "type": "computer_call_output",
                "acknowledged_safety_checks": acknowledged_safety_checks,
                "output": {
                    "type": "input_image",
                    "image_url": f"data:image/png;base64,{screenshot_base64}"
                }
            }],
            truncation="auto"
        )

        print("Response:", response.output)

    return response

# Example usage
if __name__ == "__main__":
    fill_contact_form(
        url="https://www.riis.com/contact",
        name="Godfrey Nolan",
        email="godfrey@riis.com",
        message="Hello from the OpenAI meetup"
    )

This enhanced loop handles two types of confirmations:

Programmatic Safety Checks - Structured confirmations in the API response that must be acknowledged with safety check IDs.

Natural Language Confirmations - The agent asking textually "Should I submit?" which requires detecting the question pattern and responding affirmatively.

The key insight is catching these confirmations before the loop exits and providing automatic approval to continue the workflow.

Complete Code Example

Here's a full, runnable example that combines all the components:

from playwright.sync_api import sync_playwright
from openai import OpenAI
from dotenv import load_dotenv
import os
import base64
import time

# Global variables for loop
tools = None
global_client = None

def launch_browser():
    """Initialize browser with standardized configuration"""
    playwright = sync_playwright().start()
    browser = playwright.chromium.launch(
        headless=False,
        chromium_sandbox=True,
        env={},
        args=["--disable-extensions", "--disable-file-system"]
    )
    page = browser.new_page()
    page.set_viewport_size({"width": 1024, "height": 768})
    page.goto("https://www.bing.com")
    page.wait_for_timeout(10000)
    return playwright, browser, page

def get_screenshot(page):
    """Capture and encode screenshot"""
    screenshot_bytes = page.screenshot()
    return base64.b64encode(screenshot_bytes).decode("utf-8")

def handle_model_action(browser, page, action):
    """Execute actions in the browser (action is an SDK object, so use attributes)"""
    action_type = action.type

    if action_type == "click":
        page.mouse.click(action.x, action.y)
        time.sleep(1)
    elif action_type == "type":
        page.keyboard.type(action.text, delay=100)
    elif action_type == "keypress":
        for key in action.keys:
            page.keyboard.press(key)
        time.sleep(0.5)
    elif action_type == "scroll":
        page.mouse.move(action.x, action.y)
        page.mouse.wheel(action.scroll_x, action.scroll_y)
    elif action_type == "move":
        page.mouse.move(action.x, action.y)

    return page

def initialize_agent(task_description):
    """Send initial task to OpenAI Computer Use Agent"""
    load_dotenv()
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    response = client.responses.create(
        model="computer-use-preview",
        tools=[{
            "type": "computer_use_preview",
            "display_width": 1024,
            "display_height": 768,
            "environment": "browser"
        }],
        input=[{
            "role": "user",
            "content": [{
                "type": "input_text",
                "text": task_description
            }]
        }],
        reasoning={"summary": "concise"},
        truncation="auto"
    )

    return client, response

def _asks_to_submit(text):
    """Check if agent is asking for submission permission"""
    if not text:
        return False
    text_lower = text.lower()
    return any(phrase in text_lower for phrase in [
        "should i submit", "shall i proceed", "would you like me to",
        "should i go ahead", "ready to submit"
    ])

def computer_use_loop(browser, page, response):
    """Execute the core agent loop"""
    global tools, global_client

    while True:
        computer_calls = [
            item for item in response.output
            if item.type == "computer_call"
        ]

        if not computer_calls:
            out_text = getattr(response, "output_text", "") or ""
            if _asks_to_submit(out_text):
                print("Auto-confirming submission...")
                response = global_client.responses.create(
                    model="computer-use-preview",
                    previous_response_id=response.id,
                    tools=tools,
                    input=[{"role": "user", "content": "Yes, submit the form now."}],
                    truncation="auto"
                )
                continue
            else:
                print("Task completed!")
                for item in response.output:
                    print(item)
                break

        computer_call = computer_calls[0]
        action = computer_call.action

        page = handle_model_action(browser, page, action)
        time.sleep(1)

        screenshot_base64 = get_screenshot(page)

        pending_safety_checks = getattr(computer_call, "pending_safety_checks", []) or []
        acknowledged_safety_checks = [{"id": sc.id} for sc in pending_safety_checks]
        if acknowledged_safety_checks:
            print("Acknowledging safety checks:", acknowledged_safety_checks)

        response = global_client.responses.create(
            model="computer-use-preview",
            previous_response_id=response.id,
            tools=tools,
            input=[{
                "call_id": computer_call.call_id,
                "type": "computer_call_output",
                "acknowledged_safety_checks": acknowledged_safety_checks,
                "output": {
                    "type": "input_image",
                    "image_url": f"data:image/png;base64,{screenshot_base64}"
                }
            }],
            truncation="auto"
        )

    return response

def main():
    """Main execution function"""
    global tools, global_client

    playwright, browser, page = launch_browser()

    try:
        # Choose your task
        task = "What's the weather like in Port Sanilac, Michigan?"
        # task = "Find Android developer jobs in Detroit, Michigan"
        # task = "Fill out the contact form at https://www.riis.com/contact with name: Godfrey Nolan, email: godfrey@riis.com, message: Hello from the OpenAI meetup. Then submit the form."

        client, initial_response = initialize_agent(task)
        global_client = client

        tools = [{
            "type": "computer_use_preview",
            "display_width": 1024,
            "display_height": 768,
            "environment": "browser"
        }]

        computer_use_loop(browser, page, initial_response)

    except Exception as e:
        print(f"Error: {e}")
        raise
    finally:
        browser.close()
        playwright.stop()

if __name__ == "__main__":
    main()

Save this as computer_use_agent.py and run it after setting your OpenAI API key in a .env file:

OPENAI_API_KEY=your_api_key_here

Then execute:

python computer_use_agent.py

Current Limitations and Caveats

While Computer Use Agents show tremendous promise, understanding their current limitations is essential for setting realistic expectations.

Browser-Based Only (For Now)

The current implementation focuses primarily on browser-based automation. While OpenAI mentions support for full desktop environments (Mac, Windows, Ubuntu), these implementations are less mature.

For now, stick with browser-based automation where the technology is most reliable.

Beta Status and Non-Determinism

OpenAI explicitly states this technology is in beta and should not be used for production applications. A critical limitation is the lack of deterministic behavior:

  • Running the same task twice may take different paths

  • Success rates vary between attempts

  • The agent may make different decisions about which elements to click

  • Timing issues can cause intermittent failures

Unlike traditional automation that follows scripted logic, Computer Use Agents make probabilistic decisions. This makes them unsuitable for scenarios requiring guaranteed repeatability or critical business processes.
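Because individual runs can fail for transient reasons, a thin retry wrapper around the whole task is a pragmatic mitigation. The sketch below assumes you wrap the agent run (for example, the `initialize_agent` / `computer_use_loop` sequence from `main()`) in a zero-argument callable; `run_with_retries` is a hypothetical helper, not part of OpenAI's API:

```python
import time

def run_with_retries(task_fn, max_attempts=3, backoff_seconds=2.0):
    """Retry a non-deterministic agent task a few times before giving up.

    task_fn is any zero-argument callable that raises on failure and
    returns a result on success.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task_fn()
        except Exception as e:  # in practice, catch narrower agent/network errors
            last_error = e
            print(f"Attempt {attempt} failed: {e}")
            if attempt < max_attempts:
                time.sleep(backoff_seconds * attempt)  # linear backoff between tries
    raise RuntimeError(f"Task failed after {max_attempts} attempts") from last_error
```

Keep `max_attempts` small: every retry replays the whole visual loop, so failures multiply both cost and wall-clock time.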

Security Measures Block Access

Many websites employ anti-bot measures that Computer Use Agents cannot bypass:

Cloudflare Protection - Many modern websites use Cloudflare's bot detection, which recognizes automated browsers and blocks access entirely.

CAPTCHAs - Any CAPTCHA challenge requires human intervention, breaking the automation workflow. The agent cannot solve visual puzzles or audio challenges designed to distinguish humans from bots.

Sophisticated Bot Detection - Websites may detect non-human patterns through:

  • Mouse movement analysis

  • Typing rhythm patterns

  • Time between actions

  • Browser fingerprinting

  • Network request patterns

When an agent encounters these obstacles, you may need to manually intervene by "taking over" the browser, solving the CAPTCHA, and then letting the agent continue.
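One way to structure that hand-off is a blocking pause: the script stops, you solve the challenge in the visible browser window, and the agent resumes with a fresh screenshot. This is a minimal sketch, assuming the browser was launched in headed (non-headless) mode and that `page` is the Playwright page object from `launch_browser()`; `pause_for_human` is a hypothetical helper, not part of the tutorial's code:

```python
def pause_for_human(page, reason="CAPTCHA or login challenge detected"):
    """Hand control to a human, then resume the agent loop.

    This only works when Playwright runs headed, so the same window
    the agent drives is visible to you.
    """
    print(f"\n*** Manual intervention needed: {reason} ***")
    print("Solve the challenge in the browser window, then press Enter here.")
    input()  # blocks until the human confirms
    return page.screenshot()  # fresh screenshot so the agent sees the new state
```

You would call this from the action loop whenever the model reports it is blocked, then feed the returned screenshot back as the next `computer_call_output`.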

Performance Considerations

Computer Use Agents are significantly slower than traditional automation:

  • Simple tasks: 15-30 seconds (like weather lookups)

  • Moderate complexity: 5-10 minutes (like job searches)

  • Complex workflows: 20+ minutes (form filling with multiple steps)
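If you want to verify these numbers against your own tasks, a small timing context manager around the agent loop is enough. A minimal sketch (the `timed` helper is an assumption, not part of the tutorial's code):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print wall-clock time for a block, so agent runs can be compared."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.1f}s")
```

Usage inside `main()` might look like `with timed("weather lookup"): computer_use_loop(browser, page, initial_response)`, letting you compare your measurements against the rough figures above.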

When to Use Computer Use Agents

Given these limitations, when does it make sense to use this technology?

Good Use Cases

Automating legacy systems without APIs - If you're dealing with old software that has no programmatic interface, CUAs provide the only automation path forward.

One-off or occasional tasks - For tasks that run daily or weekly where occasional failures are acceptable, CUAs can save significant manual effort.

Research and data collection - When gathering information from multiple sources where perfect reliability isn't critical, agents can explore websites more thoroughly than manual searches.

Rapid prototyping - CUAs let you test automation workflows quickly before investing in traditional API integrations.

Dynamic or changing interfaces - When websites frequently redesign their interfaces, traditional automation breaks. CUAs adapt to visual changes without code updates.

Poor Use Cases

Production systems requiring high reliability - Any critical business process should use traditional automation with proper error handling.

Time-sensitive operations - Anything requiring fast response times isn't suitable for current CUA performance.

Financial transactions - Never use beta technology for payments, transfers, or sensitive financial operations.

High-volume batch processing - The cost and time per operation make bulk processing impractical.

Compliance-critical operations - If you need audit trails, deterministic behavior, or regulatory compliance, stick with traditional automation.

Conclusion

By following this tutorial, you've learned how to build Computer Use Agents that automate browser tasks through visual understanding rather than APIs, from weather lookups to form submissions. You now understand the CUA loop architecture, proper safety check handling, and key limitations around reliability and cost. Start with the simple examples provided, experiment with your own use cases, and watch for updates as this technology matures.

Additional Resources

The presentation includes several valuable resources for further exploration:

Official Documentation:

Community Resources:

If you had fun with this tutorial, be sure to join the OpenAI Application Explorers Meetup Group to learn more about awesome apps you can build with AI.