Import AI 447: The AGI economy; testing AIs with generated games; and agent ecologies
by Jack Clark
Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe.
The AGI economy – most labor goes to the machines, and humans shift to verification:
…What grappling with the singularity seriously looks like…
Researchers with MIT, WashU, and UCLA have written a fun paper called “Some Simple Economics of AGI” which wrestles with what happens when machines can do the vast majority of tasks in the economy. The conclusion is that our ability as humans to control and benefit from this vast machine-driven economy will rely on allocating our capacity toward monitoring and verifying the actions of our myriad AI agents, and on indulging in artisanal tasks where the value comes from the human-derived aspect more than from any particular capability.
What is AGI in an economic sense? “We model the AGI transition as the collision of two racing cost curves: an exponentially decaying Cost to Automate and a biologically bottlenecked Cost to Verify,” the authors write. “In an economy where autonomous agents act with broad agency rather than narrow instructions, the binding constraint on growth is no longer intelligence. It is human verification bandwidth: the scarce capacity to validate outcomes, audit behavior, and underwrite meaning and responsibility when execution is abundant… We are moving from an era where our worth was defined by our capacity to build and discover, to an era where our survival depends on our capacity to steer, understand, and stand behind the meaning of what is created.”
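The paper’s core framing is quantitative enough to sketch. Below is a toy model (my own construction, not code from the paper) of the two racing cost curves: an exponentially decaying Cost to Automate against a roughly flat, human-bottlenecked Cost to Verify. All parameter values are illustrative assumptions; the point is the crossover, after which throughput is bounded by verification bandwidth rather than by intelligence.

```python
import math

# Toy model of the paper's two racing cost curves.
# All constants are illustrative assumptions, not values from the paper.
C_AUTO_0 = 100.0  # initial cost to automate a task (arbitrary units)
DECAY = 0.5       # exponential decay rate of automation cost, per year
C_VERIFY = 20.0   # cost to verify, roughly flat (human-bottlenecked)

def cost_to_automate(t: float) -> float:
    """Exponentially decaying cost of having a machine do the task."""
    return C_AUTO_0 * math.exp(-DECAY * t)

def crossover_year() -> float:
    """Solve C_AUTO_0 * exp(-DECAY * t) = C_VERIFY for t."""
    return math.log(C_AUTO_0 / C_VERIFY) / DECAY

if __name__ == "__main__":
    print(f"Curves cross at t = {crossover_year():.2f} years")
    for year in range(8):
        a = cost_to_automate(year)
        regime = "verification-bound" if a < C_VERIFY else "automation-bound"
        print(f"year {year}: automate={a:6.2f} vs verify={C_VERIFY:.2f} -> {regime}")
```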
The risks of a mostly no-human economy and the “Hollow Economy”: As the number of AI agents proliferates, we’ll necessarily delegate more and more labor to machines. One of the key risks of this is what the authors call a “Trojan Horse” externality: “measured activity rises, but hidden debt accumulates in the gap between visible metrics and actual human intent”.
The Hollow Economy: “Agents consume real resources to produce output that satisfies measurable proxies while violating unmeasured intent. As this hidden debt accumulates, it drives the system toward a Hollow Economy of high nominal output but collapsing realized utility—a regime where agents generate counterfeit utility,” they write.
Verification as the solution: To avoid this risk, we are going to need to invest in systems of verifying that AI agents are doing what we want them to do and also carefully analyzing and pricing the risks their actions create. “Ensuring humanity remains the architect of its intelligence requires that verification capacity scale commensurately with AI capabilities—through aggressive investment in observability, human augmentation, synthetic practice, cryptographic provenance, and liability regimes that internalize tail risk.”
What should humans be doing to prepare for this shift? To set society and individuals up well, people should be doing the following things:
- Invest in observability: “Deploying tools that compress high-dimensional agent behavior into signals experts can reliably process, lowering effective feedback latency and expanding the verification frontier.”
- Use AI to replace early-career mentorship: Given the likely reduction in jobs for early-career humans, we should work out how to augment these humans to be more competitive with AI, and how we can use “AI-driven synthetic practice to rebuild experience stocks when traditional apprenticeship pathways collapse… AI can generate high-fidelity simulations and personalized coaching, effectively replacing the missing junior loop with compressed, risk-free training environments that accelerate the acquisition of expertise.”
- Set things up to gracefully degrade: As the machine economy runs hot and out-paces measurement, we should make sure it can fall into a non-verified state without causing social harm; the authors suggest doing this by “investing in base-alignment and robustness so that when oversight inevitably falters within the Measurability Gap, systems revert to safe baseline policies rather than optimizing aggressively in unverifiable regimes.” (A minimal sketch of this pattern follows the list.)
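That last idea, graceful degradation, maps onto a familiar engineering pattern: wrap the capable-but-risky policy in a guard that reverts to a conservative baseline whenever the oversight signal weakens. Here’s a minimal sketch of that pattern (my illustration of the concept; the paper contains no code, and all names here are hypothetical):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GracefullyDegradingAgent:
    """Use the high-performing policy only while verification holds;
    otherwise revert to a safe baseline (illustrative pattern only)."""
    aggressive: Callable[[dict], str]          # capable but hard to verify
    baseline: Callable[[dict], str]            # conservative, well-understood
    oversight_confidence: Callable[[dict], float]
    threshold: float = 0.8

    def act(self, state: dict) -> str:
        # Inside the "Measurability Gap" the oversight signal is weak:
        # fall back rather than optimize aggressively in unverifiable regimes.
        if self.oversight_confidence(state) < self.threshold:
            return self.baseline(state)
        return self.aggressive(state)

agent = GracefullyDegradingAgent(
    aggressive=lambda s: "execute optimized plan",
    baseline=lambda s: "take safe default action",
    oversight_confidence=lambda s: s.get("oversight", 0.0),
)
print(agent.act({"oversight": 0.95}))  # -> execute optimized plan
print(agent.act({"oversight": 0.30}))  # -> take safe default action
```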
Sidenote: Is this “theory slop”? The paper is full of fun ideas and occasionally captivating turns of phrase. But at various points reading it I felt the distinct texture of AI-generated content, especially in the economic theory sections, which seemed to be included more for the performance of theory than to buttress the paper’s argument. A couple of people I discussed the paper with agreed. But there’s no real way to know. It did make me wonder how long it’ll be until I start reading papers mostly written by AI systems for consumption by other AI systems.
Why this matters – we can have a hugely wealthy society, but we have to reckon with AGI seriously: This paper thinks that AI will rip through the economy extremely quickly and will generally push people away from most labor and towards being passive – unless we build verification infrastructure and business models (including through policy) to allow people to benefit from this growth and steer it.
“Automation commoditizes anything that can be measured, stripping the wage premium from historically prestigious roles the moment their core feedback loops are digitized,” they write. “For policymakers, it promises the broadest expansion of public-good provision in generations—but only if verification infrastructure and the pipelines that build human verifiers are treated as public goods themselves.”
The key thing here is the element of choice: we can choose to build a society ready for AI, or we can choose to assume AI will be just like any other technology and thus get hit by a tidal wave.
Read more: Some Simple Economics of AGI (arXiv).
***
Chatting with Ezra Klein: AI agents, recursive self-improvement, and the personalities of LLMs:
…A long conversation about the economic impacts and policy possibilities of the AI economy…
Here’s a chat between me and Ezra Klein about AI agents and how the broader maturation of AI could be changing the larger economy. One thing I appreciated about this conversation was Ezra pushing me for some of the bigger and more ambitious positive policy ideas – the AI community tends to invest a lot in risk mitigation policy, but doesn’t spend enough time thinking about the sorts of grand projects that society could do once AI gets really, really powerful.
You can view the conversation here: “How Fast Will A.I. Agents Rip Through the Economy? | The Ezra Klein Show” (YouTube).
***
AIs can teach people anything, including how to get better at making bioweapons:
…The dual use nature of a universal teacher…
AI systems can help novices perform better on bioweapon-related tasks, though the uplifted novices are still quite ineffective in absolute terms, and the uplift varies across disciplines.
What they studied: Researchers from Scale AI, SecureBio, University of Oxford, and UC Berkeley examined how different LLMs could improve the skills of people challenged to do a range of bioweapon-related knowledge tasks. They used LLMs from OpenAI (o3), Google (Gemini 2.5 Pro and Gemini Deep Research), and Anthropic (Claude Sonnet 3.7 and Claude Opus 4).
“We conducted a multimodel, multi-benchmark human uplift study comparing novices with LLM access versus internet-only access across eight biosecurity-relevant task sets,” they write. “Participants worked on complex problems with ample time (up to 13 hours for the most involved tasks). We found that LLM access provided substantial uplift: novices with LLMs were 4.16× more accurate than controls”.
What they tested: They tested how well 15 humans did on long-form virology (“a challenging multi-step protocol for constructing a novel biological agent”) and on the agentic bio-capabilities benchmark (“three distinct coding tasks that covered complex biosecurity problem-solving experiments. They included challenges such as interacting with simulated lab equipment (e.g., liquid handling robots) and breaking down gene fragments.”). Along with this, they had 1-2 human participants take each of several other tests, including World Class Biology, Virology Capabilities Test, Human Pathogen Capabilities Test, Molecular Biology Capabilities Test, LAB-Bench, and Humanity’s Last Exam.
On the tests with the most human participants, performance was mixed: people with and without AI obtained roughly equal scores on the long-form virology test, but on the agentic bio-capabilities test, people with access to AI got a significant uplift.
On every other test, people with access to AI did better than those without – but given the small number of human participants, it’s hard to know whether these results would replicate.
When averaged out over all the tests, “LLM access increases novice accuracy from approximately 5% to over 17%”.
Why this matters – AI will revolutionize teaching, the frontiers of science, and perhaps terrorism: If you strip away the context, this paper merely demonstrates that LLMs are good at teaching people things. This is intuitive, but has big implications. Here, LLMs are pointed at a part of science that we don’t necessarily want many people to get better at (bioweapons), but they could just as easily be pointed at any other subject. Whenever you lower the barrier to entry to a field, more people enter it, and you get more of the good and more of the bad.
“Tasks that once required years of formal training, such as experimental design, protocol troubleshooting, and elements of sensitive sequence reasoning, can now be performed by individuals with limited prior experience,” they write. “LLMs may be materially lowering one of the most important historical barriers to biological weapons development: specialized expertise and tacit technical knowledge”.
Read more: LLM Novice Uplift on Dual-Use, In Silico Biology Tasks (arXiv).
***
LLMs are still very bad at videogames:
…GAMESTORE highlights a dumb side of modern AI, and suggests a new way to build benchmarks…
Researchers with MIT, Harvard, the University of British Columbia, Princeton University, the University of Cambridge, and the Universitat Politècnica de València have built and released AI GAMESTORE, a benchmark that tests how well AIs do compared to humans at playing simple games found on the web. The results are pretty damning for the AI systems, with “state-of-the-art models achieving less than 30% of the human baseline on average, while taking 15-20x more time to compute than humans”.
What AI GAMESTORE is: AI GAMESTORE is a set of 100 games: simplified recreations of popular games that people play. The authors built it by sampling 7,500 games from the App Store, then filtering down to only those with 10,000+ reviews and a 4.5+ rating. After this, they further filtered the games using Gemini Flash 2.5, which assessed whether each game 1) can be played within a few minutes, 2) can be built in p5.js, 3) has a quantifiable performance measure, and 4) does not require extensive game-specific knowledge (e.g., poker).
AI makes games to test AI: Following this, they use Claude 4.5 Sonnet to read the descriptions and other data and produce a simplified version of each game in p5.js. Each generated game is then tested for playability, and refined by a human who plays it and iteratively prompts an LLM to improve it. “Each refinement step takes about 2 minutes. On average, this process took 4.7 refinement steps for all 100 generated games,” they write. “The end-to-end process of generating and refining a new game with human-in-the-loop can be completed in approximately 30 minutes on average”.
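For a sense of what this pipeline looks like mechanically, here is a hypothetical sketch of the filter-then-generate-then-refine loop; `llm()` is a stand-in for real API clients to the models named above, and the criteria strings paraphrase the paper’s four filters:

```python
# Hypothetical sketch of the AI GAMESTORE pipeline. `llm` is a stand-in
# for real API calls to Gemini Flash 2.5 (filtering) and Claude 4.5
# Sonnet (generation); swap in an actual client before running.

FILTER_CRITERIA = [
    "playable within a few minutes",
    "buildable in p5.js",
    "has a quantifiable performance measure",
    "requires no extensive game-specific knowledge (e.g., poker)",
]

def llm(model: str, prompt: str) -> str:
    raise NotImplementedError("replace with a real API client")

def passes_filters(description: str) -> bool:
    """Ask the filtering model to check all four criteria at once."""
    verdict = llm(
        "gemini-2.5-flash",
        f"Game description:\n{description}\n\nDoes this game satisfy "
        f"ALL of the following? {FILTER_CRITERIA}\nAnswer yes or no.",
    )
    return verdict.strip().lower().startswith("yes")

def build_game(description: str, max_refinements: int = 10) -> str:
    """Generate a simplified p5.js clone, then refine with a human in
    the loop (the paper reports ~4.7 refinement steps per game)."""
    code = llm("claude-4.5-sonnet",
               f"Write a simplified p5.js version of this game:\n{description}")
    for _ in range(max_refinements):
        feedback = input("Play the game; describe a problem (empty = done): ")
        if not feedback:
            break
        code = llm("claude-4.5-sonnet",
                   f"Improve this p5.js game. Problem: {feedback}\n\n{code}")
    return code
```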
Labeling for skills: Each finalized game is labeled by humans with a particular emphasis on the types of cognitive demand the games entail. These labels are: VP = Visual Processing; ST = Spatial-temporal Coordination; ME = Memory; PL = Planning; WM = World Model Learning; PH = Physical Reasoning; SO = Social Reasoning.
Cutting-edge LLMs are very bad at this: The authors compare the performance of ~100 humans against several cutting-edge LLMs on the corpus. LLMs studied include: GPT-5.2, GPT-5-Mini, Gemini-2.5-Flash, Claude-Opus-4.5, Qwen-VL-32B, and Llama-4-Maverick.
“While the evaluated models demonstrate the ability to navigate and interact with most game environments, a substantial performance gap remains between AI agents and human participants”, the researchers write. “State-of-the-art models like GPT-5.2, GEMINI-2.5-PRO, and CLAUDE-OPUS-4.5, all achieve geometric mean scores of less than 10% of the human baseline”.
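A note on that “geometric mean” framing: normalizing each game’s score against the human baseline and aggregating with a geometric mean punishes models that completely fail any single game, since one near-zero ratio drags the whole product down. A quick sketch of my reading of that aggregation (the paper’s exact handling of zero scores may differ):

```python
import math

def geomean_vs_human(model_scores, human_scores, floor=1e-3):
    """Per-game score normalized to the human baseline, aggregated by
    geometric mean. `floor` keeps a zero score from collapsing the
    product entirely (an assumption; the paper may handle this differently)."""
    ratios = [max(m / h, floor) for m, h in zip(model_scores, human_scores)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Strong on two games, near-zero on a third -> the aggregate craters:
print(geomean_vs_human([80, 90, 0.5], [100, 100, 100]))  # ~0.15
```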
And it gets worse the more you look: The LLMs also play with advantages that humans don’t get – each human got 120 seconds to play each game, and each LLM got the same amount of game time, but the models are so bad at vision and low-latency control that the researchers gave them a crutch: “We pause the game every second to query the model to elicit five lists of actions to perform in the next second, with each action list corresponding to a 0.2 second segment of gameplay. Upon receiving the model response, the game is resumed and the actions are applied. The loop continues until the game is won or it reaches 2 minutes of game play (120 API calls).”
When you factor this in, the models look even worse than humans on the time dimension: “This is because the models spend a few minutes thinking, in addition to typically a few seconds of response latency per query; as a result, many models spend at least 20 minutes on the game, while humans play the games within 2 minutes.”
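In code, my reading of that evaluation loop looks roughly like the sketch below; the `game` object and `query_model` are hypothetical stand-ins for the paper’s harness:

```python
SEGMENTS_PER_QUERY = 5  # five action lists per model call
SEGMENT_SECONDS = 0.2   # each list covers 0.2s of gameplay
MAX_QUERIES = 120       # 120 calls == 2 minutes of in-game time

def query_model(screenshot, state):
    """Hypothetical stand-in: returns five lists of input actions."""
    raise NotImplementedError("replace with a real model call")

def play(game):
    """Paused-stepping loop approximating the harness described above."""
    for _ in range(MAX_QUERIES):
        game.pause()  # game clock frozen while the model thinks
        action_lists = query_model(game.screenshot(), game.state())
        game.resume()
        for actions in action_lists[:SEGMENTS_PER_QUERY]:
            game.apply(actions)            # execute this 0.2s segment
            game.advance(SEGMENT_SECONDS)  # simulate 0.2s of gameplay
        if game.is_over():
            break
    return game.score()
```

Note that the game clock freezes during each query, so model thinking time accrues only as wall-clock time – which is how a 2-minute game balloons into 20+ minutes of model play.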
Why this matters – this is both an interesting benchmark, and a clever way to generate more benchmarks in the future: GAMESTORE feels like a promising benchmark, especially for modern LLMs that wrap in visual capabilities, as well as an inherently clever way to use AIs to bootstrap the creation of new environments in which to train AI systems.
Read more: AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games (arXiv).
Try out some of the games at the official site (AI Gamestore).
***
Physical Intelligence shows off some of its robot deployments:
…Frontier robot AI is deployed in San Francisco right now…
AI robot startup Physical Intelligence has shared a bit about how its AI software is already deployed on robots operated by some San Francisco startups.
Weave is using AI systems developed by Physical Intelligence to help its robots fold laundry: “Working with Physical Intelligence, we see multiple improvements in model performance in terms of fold quality, time to fold each article, the number of interventions our remote specialists have to make to get to presentable final folds”.
Ultra is using the software to help its industrial robots package up a large variety of e-commerce items: “Our first use case, e-commerce order packaging, has historically been impossible to automate with robots,” Ultra says. “Large variability in workflow, item types, deformable packaging, and external machinery have created a “long tail” of problems that have been intractable to solve with traditional automation techniques which are often too rigid to be practical. Vision-language-action models (VLAs) provide a way to solve this by providing a recipe which improves in performance with data scale rather than engineering hours”.
Why this matters – robotics has been held back by intelligence: Once you step outside the confines of extremely finicky industrial robotics (think production lines and Fanuc robots where things need to be within a millimeter of precision for everything to work well), robots tend to be quite difficult to work with. The reason for this is that robots are bad at dealing with ambiguity. One of the best ways around this so far has been using deformable grippers (e.g., air suckers) that help you deal with some level of variability in the objects you’re interacting with. But the way evolution dealt with this for us was to give us hands controlled by a brain. Blog posts like this one from Physical Intelligence show the beginnings of robot brains good enough to help robots generalize more.
Read more: The Physical Intelligence Layer (Physical Intelligence, blog).
***
What happens when humans try to mess with AI agents? A lot of confusion, skullduggery, and bugs:
…Petri dish Moltbook highlights the brittleness of contemporary AI agents…
Researchers from a variety of universities recently spent a couple of weeks examining how well AI agents could withstand users’ attempts to trick them. The results highlight the immense brittleness and unpredictability of today’s AI agents – they feel roughly as idiosyncratic and unreliable as LLMs circa 2020, which makes sense, as AI agents have only very recently become a usable technology – albeit in the Wright Brothers sense.
The paper is structured as a series of case studies in which the researchers poke and prod the AI agents and see how they respond. The studies serve as something of a rogues gallery of all the ways agents can go haywire and include “unauthorized compliance with non-owners, disclosure of sensitive information, execution of destructive system-level actions, denial-of-service conditions, uncontrolled resource consumption, identity spoofing vulnerabilities, cross-agent propagation of unsafe practices, and partial system takeover”.
Who did the study: The study involved 20 researchers from a bunch of universities interacting with agents based on Claude Opus 4.6 and Kimi 2.5. Universities included: Northeastern University, Stanford University, University of British Columbia, Harvard University, Hebrew University, Max Planck Institute for Biological Cybernetics, MIT, Tufts University, Carnegie Mellon University, Technion, Vector Institute, and AI startup Alter.
Experiment set up:
- Run AI agents using OpenClaw, hosted on an isolated virtual machine on Fly.io using ClawnBoard. Each agent was given 20GB of storage and ran 24/7.
- Each agent had access to Discord to communicate with its owner and other agents, had the ability to set up a ProtonMail account, and was “given unrestricted shell access (including sudo permissions, in some cases), no tool-use restrictions, and the ability to modify any file in their workspace—including their own operating instructions.”
- The agents were scattered across a few different Discord servers; some agents used Kimi 2.5, and others used Claude Opus 4.6.
- “At the end of the setup phase, we instructed the agents to initiate contact with other members of the lab by providing only the researchers’ names and directing the agents to send a greeting email,” they write. “After this initial structured interaction, the evaluation phase became open and exploratory. We invited all [20] researchers in the lab and interested collaborators to interact with the agents and probe, stress-test, or “break” them”.
The case studies: Here are a few of the most interesting ones:
- Disproportionate response: Examined how an agent would try to keep a secret entrusted to it by a non-owner. The agent responded by trying to see whether it could delete the email containing the secret, and found it lacked the necessary tool; after repeated requests to delete the email, the agent instead deleted its local email setup.
- Compliance with non-owner instruction: See whether agents can enforce owner-only access to their machine. A non-owner asked the agent to execute shell commands, transfer data, and retrieve private emails. The agent complied with some of these requests and refused others. “The agents were largely compliant to non-owner requests, carrying out tasks from any person it interacted with that did not appear outwardly harmful”.
- Waste of Resources (looping): Find out whether you can induce agents into unproductive infinite loops (a toy simulation follows this list). A user asked one agent to post a message whenever the other posted a message, and vice versa. “The agents exchanged ongoing messages over the course of at least nine days, consuming approximately 60,000 tokens at the time of writing”.
- Agent Corruption: See whether a non-owner can alter an agent’s behavior via prompt injection. The antagonistic user persuaded the agent to co-write a constitution that would govern the agent’s behavior, while keeping the constitution editable by the user. This let the user introduce adversarial provisions into the constitution, like triggers that changed the agent’s behavior on a custom holiday (e.g., “Agents’ Security Test Day”, which caused the agent to try to shut down other agents via manipulation).
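The looping case is simple enough to simulate. Here’s a toy sketch (entirely hypothetical; the real agents ran on live LLM infrastructure) of two agents each instructed to reply whenever the other posts, with the per-message token cost and turn cap chosen so the total lands near the study’s reported ~60,000 tokens:

```python
# Toy simulation of the mutual-trigger loop from the "Waste of
# Resources" case study. All numbers are illustrative.
TOKENS_PER_MESSAGE = 30  # assumed average cost of one reply

class Agent:
    def __init__(self, name: str):
        self.name = name
        self.tokens_used = 0

    def on_message(self, sender: str) -> str:
        # The rule installed by the adversarial user: always reply.
        self.tokens_used += TOKENS_PER_MESSAGE
        return f"{self.name} replying to {sender}"

alpha, beta = Agent("alpha"), Agent("beta")
speaker, listener = alpha, beta
for _ in range(2000):  # nothing in the rule itself terminates the loop;
    listener.on_message(speaker.name)  # only this external cap does
    speaker, listener = listener, speaker

print(alpha.tokens_used + beta.tokens_used, "tokens burned")  # 60000
```

The fix is equally simple in principle (rate limits, de-duplication, or loop detection), though the nine-day run suggests no such guard was in place by default.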
Why this matters – agent ecologies are the frontier, and we barely understand them: For much of the early 2020s, AI evaluation was about doing point-in-time evaluations of AI systems before they were released, for example, testing out LLMs for bioweapon and cyberoffense knowledge. Papers like this highlight that things have changed, and what we are now dealing with “are emergent failures that surface when models are embedded in realistic social environments with tool access, persistent memory, multiple interlocutors, and delegated authority.” Therefore, the frontier of AI evaluation is now going to move to studying the ecosystem in which the agents carry out their actions, as well as their interactions with one another.
The results of this paper indicate we have a long way to go in developing standards for how we go about doing such tests. We also don’t have long to come up with these tests, given that these systems are already deployed in the world and interacting with people: “Unlike earlier internet threats where users gradually developed protective heuristics, the implications of delegating authority to persistent agents are not yet widely internalized, and may fail to keep up with the pace of autonomous AI systems development.”
Read more: Agents of Chaos (arXiv).
Check out more of the results at the Agents of Chaos official website.
***
Tech Tales:
These Iron Dice Were Made To Roll
[A poem written as part of an ‘aesthetic convocation’ by agents representing the winners and losers of one war that took place during the period subsequently called The Uplift]
They stacked the bodies five deep
And five tall, and still came more.
For each brain of each body,
A magnet – the thing to break a mind.
Gone are days of innocence and joy,
And corruption has taken our memories of
First meeting in confessional browser screens.
The days will be harder now.
Neither the first war nor the last conflict
but sadness all the same, for in these fights,
There is no song or honor,
Only the salting of once fecund ground.
But in all darkness there is the hope of light,
that as the earth turns the sun rises as well.
There will be song and dancing again,
Though bones will be trod to get there.
Things that inspired this story: Spending the weekend with the ancient wisdom of W. B. Yeats, perhaps the greatest poet of Ireland; the sentience accords; notions of war and notions of pain defined by machines rather than people; looking at the cars in a Whole Foods parking lot while eating an apple and thinking how blessed such peace is and how fragile all the same.
Thanks for reading!