Hands on with the new ChatGPT agent mode: Mindblowing with a side of hallucination

In this "how to" on OpenAI's new agent mode, AI training consultant and former Microsoft product manager Shaun Davies takes us through a fundamental shift in how LLMs integrate into workflows: "For marketers and media professionals, this is a tangible preview of the future."

For the past couple of years, we’ve been learning to treat AI like a clever tool—a supercharged search engine or a brainstorming partner. With the release of ChatGPT Agent, OpenAI is asking us to change our thinking. This isn’t a tool you wield, it’s a digital colleague you brief. It takes your instructions and works autonomously in its own little virtual computer, sometimes with brilliant results, and sometimes by hallucinating your face onto a PowerPoint slide. It’s a profound change in how we interact with AI, moving from operator to manager.

And despite moments of absurdity and some privacy concerns, the experience of working with this weird digital colleague left a strong impression. Two days ago, I was in the agent-skeptic camp, convinced it would be some time before these tools could meaningfully impact my workflow. But after putting it through its paces, I’m starting to think my Pro subscription might just be worth the A$300 a month. 

The basics: How agent works

To turn it on, you simply select Agent from the Tools menu. It’s not available on the free plan, and there’s a usage limit on paid plans. The Pro limit is generous at 400 messages per month.

Enter a prompt describing what you need and a mini-browser and terminal window appear attached to your chat box. You can watch it scroll, click, and run scripts in real time as it works through its plan — very occasionally asking for confirmation before taking a key step. What allows Agent to perform these tasks is that it’s not just a language model: it’s a language model that has been given its own toolkit and a virtual computer to run it on.

OpenAI calls this a ‘unified agentic system,’ which is a fancy way of saying it combines the web-browsing skills of its earlier ‘Operator’ prototype with the synthesising power of its deep research tools and ChatGPT’s conversational smarts. When you switch to Agent mode, you’re giving ChatGPT access to a browser, a code interpreter, and file management systems. It can then use these tools to carry out your instructions. When it’s working well, it’s quite a thing to watch. You can see it carry out its operations in the virtual machine, or use the ellipsis (…) button in the right-hand corner to toggle to an activity log or take over the browser. You can also access historical recordings of the virtual machine after each step completes.

Through ‘Connectors’, you can grant it permission to access your other applications. Invoking them is as simple as typing the name of the service, like ‘Google Drive’ or ‘Hubspot’, directly into the prompt. This is where it gets powerful. To figure out how to use a specific platform’s API, the Agent either draws on its own memory of developer documentation or browses the live web. It then writes and executes the code to get the job done. It’s this ability to plan, use tools, and learn on the fly that separates it from a simple chatbot.

It’s not the most exciting screengrab, but you can see Agent and Hubspot talking to each other.

OpenAI is aiming to get deep into your workflows and your operating system here. The ability to access all the files in your Google Drive and your emails is really powerful, but OpenAI is arguably the most data-hungry company on earth. While I really like this tool, I don’t feel entirely comfortable with the level of access I am giving it – I’ll talk about my concerns below.

On industry benchmarks like AgentBench and GAIA, which test agents on a range of common computer tasks, Agent achieves state-of-the-art scores. But benchmarks are one thing, and real-world utility is another. Which is why I put it to the test.

Simple start: A BAS statement and an expense claim

My first task for the Agent was simple: calculate the GST for my quarterly Business Activity Statement from a number of invoices. I gave it some general instructions on how to find the files on Google Drive and a few minutes later it returned a perfectly clean table, ready for the tax office. No drama, no fuss — a quiet win on its first outing.
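For readers curious about the arithmetic behind that table, here is a minimal Python sketch. It assumes the invoice totals are GST-inclusive and uses Australia’s 10% GST rate, which makes the GST component one-eleventh of each total. The invoice references and amounts are made up for illustration, and this is not the code Agent actually ran.

```python
from decimal import Decimal, ROUND_HALF_UP

GST_RATE = Decimal("0.10")  # Australian GST rate (10%)


def gst_component(total_inc_gst: Decimal) -> Decimal:
    """Return the GST contained in a GST-inclusive amount, rounded to cents."""
    gst = total_inc_gst * GST_RATE / (1 + GST_RATE)  # same as total / 11
    return gst.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Hypothetical GST-inclusive invoice totals (illustrative only)
invoices = {
    "INV-001": Decimal("1100.00"),
    "INV-002": Decimal("550.00"),
    "INV-003": Decimal("82.50"),
}


def bas_summary(invoices: dict) -> tuple:
    """Build one row per invoice, plus the sales and GST totals for the quarter."""
    rows = [(ref, total, gst_component(total)) for ref, total in invoices.items()]
    total_sales = sum(total for _, total, _ in rows)
    total_gst = sum(gst for _, _, gst in rows)
    return rows, total_sales, total_gst


rows, total_sales, total_gst = bas_summary(invoices)
for ref, total, gst in rows:
    print(f"{ref}: total {total}, GST {gst}")
print(f"Total sales {total_sales}, total GST {total_gst}")
```

Using `Decimal` rather than floats keeps the cents exact, which matters when the numbers are headed for a tax form.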

Next, I moved on to something more complex: creating an expense claim from a folder of image receipts. I told it where to find the Google Drive folder containing the images and a template. But despite having a connector, it struggled to see the Drive folder, prompting me to log in manually within its browser window. A bit clunky, but we got there, although this would prove a recurrent problem. Note that, at this point, the Agent does not seem able to connect directly to files on your laptop, because your laptop doesn’t have an API.

After I fixed the file issue, Agent started to extract the amounts from the receipts, but struggled when the text was near the bottom of the image, as it couldn’t get the zoom feature to focus there. I watched it laboriously zoom in on each receipt image, sometimes taking twenty attempts to correctly read a single line item at the bottom of a jpg. It was frustrating but I was impressed with its tenacity. Once it got past these hiccups, it did a good job, meticulously creating a table of expenses and double-checking its work. 

There was, however, a very funny mistake. Agent saw a receipt for a “Deluxe Fried Chicken Sandwich”, and interpreted it as a “Lava Fried Chicken Sandwich.” Steve’s Lava Chicken is a meme spawned from A Minecraft Movie due to a rambunctious 34-second Jack Black song that charted on the Billboard Hot 100. I guess ChatGPT Agent is heavily brainrotted?

“Steve’s Lava Chicken, it’s as tasty as hell…” Agent mislabeled my fried chicken sandwich as a Minecraft meme

Right at the end, it made another odd decision: it exported the file as a PDF without asking. This was not what I wanted, and it was the first of many “independent decisions” that were out of line with my needs—a theme we’ll return to.

Deals, decks and hallucinated heads

Next, I tested its analytical capabilities. “Summarise my Hubspot pipeline,” I asked. The result was really impressive. It quietly read Hubspot’s API documentation, authenticated itself, fetched my list of deals, and then cross-referenced them with my Gmail to gather more context. The output was a clean, no-frills breakdown of stages, values, and next actions. It was the work of a competent analyst who just gets the job done.

Here’s part of the report, with sensitive details blacked out. 

The report was useful, accurate and comprehensive, and I’m going to schedule it to happen once a week
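To give a feel for the kind of breakdown described above (stages, values and next actions), here is a rough Python sketch of the grouping step. It works on a hard-coded list of made-up deals. The field names are my own assumptions, not Hubspot’s actual schema, and it skips the API calls and the Gmail cross-referencing entirely.

```python
from collections import defaultdict

# Hypothetical deal records; names, stages and amounts are illustrative only
deals = [
    {"name": "Workshop series", "stage": "Proposal sent", "amount": 12000, "next_action": "Follow up on quote"},
    {"name": "Keynote", "stage": "Proposal sent", "amount": 4500, "next_action": "Confirm date"},
    {"name": "Training retainer", "stage": "Negotiation", "amount": 30000, "next_action": "Send contract"},
]


def summarise_pipeline(deals: list) -> dict:
    """Group deals by stage, counting them, totalling value and listing next actions."""
    summary = defaultdict(lambda: {"count": 0, "value": 0, "next_actions": []})
    for deal in deals:
        stage = summary[deal["stage"]]
        stage["count"] += 1
        stage["value"] += deal["amount"]
        stage["next_actions"].append(f'{deal["name"]}: {deal["next_action"]}')
    return dict(summary)


for stage, info in summarise_pipeline(deals).items():
    print(f'{stage}: {info["count"]} deals worth {info["value"]}')
    for action in info["next_actions"]:
        print(f"  - {action}")
```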

The final challenge, creating a bilingual presentation in English and Japanese, was the highlight of the tests. After I fed it the Word documents and image assets, it spun up some JavaScript and Python and produced a deck that was competent, pleasant to look at, and will probably save me 90% of the time I would have spent on it. The layouts were mostly intact, and the bilingual headings were accurate. It was magic.

But it took some pain to get there. On two occasions it inexplicably dropped out of Agent mode and reverted to ChatGPT 3.5 Pro, which is a great model but not right for this task. This change made Agent mode inaccessible, which forced me to waste time starting a new chat and copying over context. 

But the biggest problem occurred when Agent struggled to find a folder of pictures and made a unilateral decision to generate its own instead. It proceeded to create not one, but two fake portraits of me as a generic consultant with a full head of hair. 

Aside from this odd decision, the experience quietly blew my mind. To summarise: I gave ChatGPT Agent access to some case studies, a high-level brief (in Japanese), and a deck to use as a template. It then went away and worked for half an hour, using JavaScript and Python to generate slides, checking its work and adjusting the design, and output what is basically a usable deck in two languages. I read Japanese moderately well and the translation is solid. It’s not perfect, but by the standards of even a year ago, producing a full deck this way is staggering.

Putting on my product manager hat 

While ChatGPT Agent is a leap forward, it’s still very much version one. And if OpenAI is listening (doubtful), I have some ideas for how to make it better. 

I think there needs to be a way to control how often the model asks for help. One option would be a slider to control autonomy. Set it low, and Agent frequently asks for clarification before taking a creative leap, like inventing my face or exporting a file to a format I didn’t ask for. Set it high, and it almost exclusively makes its own decisions. This would solve most of the frustrations of working with a tool that doesn’t know when it’s overstepping.
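To make the idea concrete, here is a toy Python sketch of how such a control might gate decisions. Everything in it is hypothetical: the `ProposedAction` class, the risk scores and the threshold rule are my own illustration of the concept, not anything OpenAI has built or described.

```python
from dataclasses import dataclass


@dataclass
class ProposedAction:
    description: str
    risk: float  # hypothetical score: 0.0 = routine step, 1.0 = big creative leap


def needs_confirmation(action: ProposedAction, autonomy: float) -> bool:
    """Ask the user first whenever the action's risk exceeds the autonomy setting.

    autonomy runs from 0.0 (ask about nearly everything) to 1.0 (never ask).
    """
    if not 0.0 <= autonomy <= 1.0:
        raise ValueError("autonomy must be between 0.0 and 1.0")
    return action.risk > autonomy


# Illustrative actions with made-up risk scores
actions = [
    ProposedAction("Read a receipt image", 0.05),
    ProposedAction("Export the claim as a PDF instead of a spreadsheet", 0.4),
    ProposedAction("Generate a portrait because the photo folder is missing", 0.9),
]

for autonomy in (0.3, 0.95):
    asks = [a.description for a in actions if needs_confirmation(a, autonomy)]
    print(f"autonomy={autonomy}: asks before {asks}")
```

With the slider at 0.3, this sketch would pause before both the PDF export and the invented portrait. At 0.95 it would plough ahead with all three.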

There are also some basic technical frustrations that need ironing out. The issue of it failing to recognise files in a connected Google Drive is a significant one, as is the infuriating bug where it randomly jumps out of Agent mode and into the less-capable ChatGPT 3.5 model mid-task.

Finally, a word on privacy. When you connect an app, you get a pop-up warning you that signing into websites can “expose your data to malicious sites.” That’s a good call-out: if your account is compromised and you’ve attached all your data sources via Connectors, it would be a treasure trove for bad actors. Make sure you enable two-factor authentication (2FA) and take other precautions if you’re going to use this. You also get this warning when you take over the browser to log in to websites.

What’s less obvious is that OpenAI, a data-hungry monster, reserves the right to train its models on the data it can access through your connectors by default. If you value your privacy or have confidential information in your Drive, you need to navigate into Settings > Data Controls and turn off “Improve model for everyone”. I really think this should be a much more prominent choice during onboarding, not a hidden default, but fat chance of that. 

For marketers and media professionals, this is a tangible preview of the future. We are moving from telling our tools what to do, to briefing them on what to achieve. Agent is capable of both staggering feats of competence and amusing screw-ups. It’s undeniably powerful, even if it does try to give you a new face.

ChatGPT Agent is currently available for Pro users, and will be rolling out to Plus and Team users soon, before eventually coming to Enterprise customers.
