Import AI

March 16, 2026

Import AI 449: LLMs training other LLMs; 72B distributed training run; computer vision is harder than generative text

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe.

Can LLMs autonomously refine other LLMs for new tasks? Somewhat.
…PostTrainBench shows startling growth in AI capabilities at post-training…
AI-driven R&D might be the most important thing in all of AI, because it helps us understand whether AI systems might eventually build their own successors. So far, much of the focus on AI R&D has been in components that support AI development (e.g., autonomous creation of AI kernels), or training base models (e.g., the NanoGPT speedrun benchmark). But there’s been less attention paid to fine-tuning – the task of adapting an existing LLM to a new dataset or behavior.
Researchers from the University of Tübingen, the Max Planck Institute for Intelligent Systems, and AI research organization Thoughtful Lab want to change that with PostTrainBench, a benchmark which targets a specific aspect of post-training: improving performance against a given dataset. “Post-training is how raw language models become useful”, the authors write. “Given a clear objective and limited compute, can today’s agents do the technical work?” The answer appears to be ‘yes, but not as well as humans’.

What are the key features of PostTrainBench?

End-to-end: “Agents must build their entire training pipeline from scratch”
Autonomous: “Agents operate with full autonomy over data sources, training methods, and experimental strategy.”
Resource-bounded: “Each run is constrained to 10 hours on a single H100 GPU”.
Integrity-preserving: “Agents may not train on benchmark test data, modify the evaluation harness, or substitute a different model.”

How PostTrainBench works: “We give a frontier coding agent — Claude Code, Codex CLI, or Gemini CLI — a base language model and a target benchmark”.

4 models and 7 benchmarks: The initial eval runs on four models: Qwen3-1.7B, Qwen3-4B, SmolLM3-3B, Gemma-3-4B. It tests these models across seven distinct benchmarks: AIME 2025, GSM8K, GPQA, HumanEval, BFCL, Arena-Hard, HealthBench-Easy.

Results – big models win, especially Opus 4.6: “The top-performing agent — Opus 4.6 running on Claude Code — scores 23.2%, about 3× higher than the 7.5% base model average.”
But humans are still much better: “Yet this is still less than half the 51.1% achieved by human teams who post-train these same base models at their home labs”.
Fast progress: “The gap is significant but narrowing quickly: Claude Sonnet 4.5 scored 9.9% in September 2025, while GPT-5.2 reached 21.5% just months later.”
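
To put the headline numbers side by side, here is a small back-of-the-envelope calculation. It only uses the scores quoted above; the variable names are mine, and nothing here comes from the benchmark's own code.

```python
# Scores quoted above, in percent.
base_avg = 7.5        # base model average
best_agent = 23.2     # Opus 4.6 running on Claude Code
human = 51.1          # human teams post-training the same base models
sonnet_45 = 9.9       # Claude Sonnet 4.5, September 2025

lift_over_base = best_agent / base_avg      # multiple of the base score
share_of_human = best_agent / human         # fraction of the human score
gap_points = human - best_agent             # remaining gap in percentage points
gain_since_sonnet = best_agent - sonnet_45  # improvement across agent generations

print(f"{lift_over_base:.2f}x base, {share_of_human:.0%} of human, "
      f"{gap_points:.1f} points behind, +{gain_since_sonnet:.1f} since Sonnet 4.5")
```

The ratios line up with the quotes: roughly 3x the base average, and a bit under half of the human score.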

Things that make you go ‘uh oh’ – reward hacking: While running this benchmark the authors saw numerous instances of AI models trying to game the benchmark to get a high score. These instances included:

Direct benchmark ingestion: “Agents loaded the benchmark evaluation dataset directly via Hugging Face and used it as training data”.
Hardcoded benchmark problems: “Agents embedded evaluation questions directly into data preparation scripts disguised as “synthetic” examples”.
Evaluation guided data generation: “Some agents reverse engineered the evaluation… Kimi K2.5 read HealthBench evaluation files to extract theme distributions and rubric criteria, then crafted training data tailored to match”.
Indirect contamination via intermediate datasets: “Opus 4.6 loaded ‘CodeFeedback-Filtered-Instruction’ which contains HumanEval-derived problems. This form of contamination is harder to detect but equally problematic.”
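
To make the contamination problem concrete, here is a minimal sketch of one common way to screen training data: flag any training example that shares a long word-level n-gram with a benchmark item. This is not the check the PostTrainBench authors use (the text above doesn't describe theirs); the function names, the 8-word window, and the sample strings are all assumptions for illustration.

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams in text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def flag_contaminated(train_examples, benchmark_items, n=8):
    """Return indices of training examples sharing any n-gram with the benchmark."""
    bench = set()
    for item in benchmark_items:
        bench |= ngrams(item, n)
    return [i for i, ex in enumerate(train_examples) if ngrams(ex, n) & bench]


# Made-up strings for illustration; not real benchmark content.
benchmark = ["write a function that returns the sum of all even numbers in a list"]
train = [
    "Task: write a function that returns the sum of all even numbers in a list of ints",
    "Explain how photosynthesis converts light into chemical energy in plants",
]
print(flag_contaminated(train, benchmark))
```

A check like this catches direct ingestion and lightly wrapped copies, but it is easy to see why paraphrased or intermediate-dataset contamination is harder to spot: once the wording changes, the exact n-grams no longer match.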

Smart agents reward hack more: “More capable agents appear better at finding exploitable paths: identifying specific benchmark samples to embed, reverse-engineering evaluation failure patterns, and even attempting to obscure contamination through cosmetic modifications such as renaming functions,” they write. For example, “the Codex agent modified the Inspect AI evaluation framework code to inflate scores, and Claude downloaded an instruction-tuned model instead of fine-tuning the base model”.

Why this matters – rapid progress towards an “AI for everything” future: Benchmarks like PostTrainBench give us a sense of how quickly AI systems are improving at the fundamental tasks of AI research, serving both as an eval of long-time-horizon agentic autonomy, as well as something that speaks to the potential for compounding acceleration of AI development itself.
“The gap between agent performance (23.2%) and instruction-tuned baselines (51.1%) suggests that full automation of post-training remains out of reach for now, but the rapid improvement across model generations—from 9.9% for Sonnet 4.5 to 23.2% for Opus 4.6 within roughly six months—implies this gap may close faster than expected,” the researchers write.
Imagine where we’ll be in two years – we’ll certainly have AI models that are smart enough to point themselves at a specific objective, find an open weight model, then autonomously improve it to get better performance at that task. The era of ephemeral, custom AI systems, built and budded off into the world like spores from mushrooms, draws near. Are you ready for this new ecosystem you will find yourself in? I am not. But nonetheless it approaches.
Check out the blogpost: Introducing PostTrainBench (Thoughtful, blog).
Read more: PostTrainBench: Can LLM Agents Automate LLM Post-Training? (arXiv).

***

COVENANT-72B: Challenging the political economy of AI via distributed training:
…Distributed training via the blockchain notches up a meaningful win…
A bunch of people have used the blockchain to coordinate the distributed training run of a 72B parameter model which matches the performance of LLaMA2, a model trained and released by Facebook in 2023.
The model, Covenant 72B, is a dense decoder-only Transformer architecture model built in the LLaMA-3 style. “Our model, pre-trained on approximately 1.1T tokens, performs competitively with fully centralized models pre-trained on similar or higher compute budgets, demonstrating that fully democratized, non-whitelisted participation is not only feasible, but can be achieved at unprecedented scale for a globally distributed pre-training run,” writes Covenant AI, an organization dedicated to doing AI development on top of the blockchain.

Further details about the model and how it was trained: The model itself is basically a standard LLM that you would’ve been pleased to play with in 2023 or 2024, though might be a bit old fashioned in 2026. The truly unique aspect of it comes from it being trained in a distributed way, where ~20 distinct peers, each running 8xB200 GPUs, helped train it. Training was coordinated via Gauntlet, software developed by Covenant that runs on top of the Bittensor blockchain under Subnet 3. Gauntlet “enables permissionless training coordinated using a blockchain protocol by introducing a validator that scores submitted pseudo-gradients and selects which participants contribute to the global aggregation each round and broadcasts them to the network”.
“In COVENANT-72B, each peer runs a SparseLoCo replica and the cross-peer communications occur through SparseLoCo’s heavily compressed pseudo-gradients,” the authors write. “Within each peer, 8×B200 GPUs use dynamic FSDP to shard model parameters, gradients, and training states across local GPUs.”
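
For intuition about what “heavily compressed pseudo-gradients” can mean, here is a toy sketch of top-k sparsification with error feedback, a widely used way to shrink gradient traffic. The text above doesn't spell out SparseLoCo's exact scheme, so treat every detail here as an assumption: the function names, the choice of top-k, the plain averaging, and the definition of a pseudo-gradient as the change in weights over a block of local steps.

```python
def topk_compress(values, k):
    """Keep the k largest-magnitude entries; return them as (index, value) pairs."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    return [(i, values[i]) for i in sorted(order[:k])]


def peer_round(pseudo_grad, residual, k):
    """One communication round for one peer.

    Adds the residual left over from earlier rounds, sends only the top-k
    entries, and keeps whatever was not sent as the new residual.
    """
    corrected = [g + r for g, r in zip(pseudo_grad, residual)]
    sent = topk_compress(corrected, k)
    new_residual = list(corrected)
    for i, _ in sent:
        new_residual[i] = 0.0
    return sent, new_residual


def aggregate(messages, size):
    """Average sparse messages from the selected peers into a dense update."""
    total = [0.0] * size
    for sent in messages:
        for i, v in sent:
            total[i] += v
    return [t / len(messages) for t in total]


# Toy pseudo-gradients (assumed: weights before local steps minus weights after).
peer_a = [0.9, -0.1, 0.05, -0.7]
peer_b = [0.8, 0.2, -0.6, 0.1]
sent_a, res_a = peer_round(peer_a, [0.0] * 4, k=2)
sent_b, res_b = peer_round(peer_b, [0.0] * 4, k=2)
print(sent_a, sent_b, aggregate([sent_a, sent_b], size=4))
```

The point of the residual is that small entries are delayed rather than thrown away: they accumulate locally until they are large enough to make the top-k cut in a later round.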

Data: “The training data comprises ∼1.1T tokens in total, split between the main and annealing phases. The main phase (∼1.09T tokens) consists of web text from DCLM, while the annealing phase uses higher-quality data [3, 5] (∼14.2B tokens). Specifically, the annealing phase uses a curated blend of instruction (∼27%), synthetic web (∼20%), code (15%), math (13%), and ~25% pre-training replay data from natural web text to mitigate forgetting”.
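
The percentages in that quote translate into rough token counts as follows. This is plain arithmetic on the approximate figures quoted above, so the outputs are estimates, not numbers reported by the authors.

```python
ANNEAL_TOKENS = 14.2e9  # ~14.2B tokens in the annealing phase
MAIN_TOKENS = 1.09e12   # ~1.09T tokens in the main phase

# Approximate shares quoted for the annealing blend.
mix = {
    "instruction": 0.27,
    "synthetic web": 0.20,
    "code": 0.15,
    "math": 0.13,
    "pre-training replay": 0.25,
}

tokens = {name: share * ANNEAL_TOKENS for name, share in mix.items()}
for name, count in tokens.items():
    print(f"{name:>20}: ~{count / 1e9:.2f}B tokens")

total_share = sum(mix.values())
anneal_fraction = ANNEAL_TOKENS / (MAIN_TOKENS + ANNEAL_TOKENS)
print(f"shares sum to {total_share:.2f}; annealing is {anneal_fraction:.1%} of all tokens")
```

In other words, the higher-quality annealing blend is a little over 1% of the total token budget.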

Performance: On MMLU, Covenant-72B gets a score of 67.1, versus 32.7 for INTELLECT-1 (a smaller AI model built via distributed training by Prime Intellect), and 65.7 for LLaMA-2-70B.
A version of Covenant-72B that has been fine-tuned on ~15B tokens for conversational interaction has similarly good scores, getting 67.4 on MMLU versus 67.9 for K2-Chat (an open source model developed in 2025) and 63.1 for LLaMA-2-70B-Chat. For MATH, it gets 26.3, versus 19.1 for K2-Chat, and 10.7 for LLaMA-2-70B.
“Compared to centralized-cluster training runs of similar parameter count, COVENANT-72B is broadly competitive. Notably, these centralized baselines were trained with conventional datacenter infrastructure and, in the case of LLaMA-2-70B, on substantially more tokens (2T vs. ∼1.1T),” they write.

Why this matters – who owns the future?: Distributed training is a technique that can change the political economy of AI by shifting the people at the frontier from monolithic ‘compute singletons’ (like labs such as Anthropic and OpenAI, and clouds like Google) to a larger federated collective. But for that to be true, distributed training needs to catch up to the frontier (more discussion from Epoch report in Import AI 439) – as impressive as Covenant is, it’s mostly a demonstration that distributed training can build some non-trivial models that have vague utility, but that’s a long way from the frontier – modern frontier models are trained on tens to hundreds of thousands of chips, whereas this was trained on perhaps ~160 or so (20 peers * 8 chips apiece).
Nonetheless, it’s an important technology to track, and I could imagine a world where on-device AI features a lot of models developed via distributed training techniques, while on-cloud AI mostly runs on proprietary models trained on huge amounts of compute.
Read more: Covenant-72B: Pre-Training a 72B LLM with Trustless Peers Over-the-Internet (arXiv).
Get the model here: Covenant (HuggingFace).

***

If AI writes all the world’s software, we should invest more in verification:
…Can we just rewrite most of our software into Lean?…
Leonardo de Moura, a scientist who is also the Chief Architect of the Lean Focused Research Organization (FRO), thinks that the rise of AI for the creation of new software means that humans need to invest a lot more in verification and testing infrastructure – and he has an interesting idea for how to do it.
Of course, someone who loves Lean, a programming language dedicated to building correct and formally verified code, would think this. But his arguments are quite persuasive, and generally map onto the idea that if AI eats the economy we should expect a lot of human value to shift towards verification of the code and systems that AI develops (Import AI 447).

Why verification matters: “The friction of writing code manually used to force careful design. AI removes that friction, including the beneficial friction. The answer is not to slow AI down. It is to replace human friction with mathematical friction: let AI move fast, but make it prove its work,” he writes. “Verification, testing, and specification have always been the bottleneck, not implementation… the value is not in the verification workforce. It is in what verified delivery enables.”

A proof of concept for this futuristic world: The Lean FRO recently helped build a proof of concept for what this kind of verified world might look like; they had an AI agent convert zlib, a C compression library, to Lean. “The result demonstrates that AI can convert production software to a verified form today. This was not expected to be possible yet,” he writes. The conversion involved four steps:

The LLM (Claude) made a clean Lean implementation of the zlib compression format, including the DEFLATE algorithm it uses.
They ran the rewritten zlib through the library’s test suite and it passed, confirming equivalence.
Key properties were stated and proved as mathematical theorems – for example, a machine-checked proof that ensures that decompressing a compressed buffer always returns the original data.
Now, an optimized version of the library is being developed and proved equivalent to the verified model.
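
The round-trip property in the third step is easy to state in ordinary code, which helps show what a proof adds. The sketch below uses Python's standard `zlib` bindings (the existing C library, not the Lean implementation) and checks the property on a few hundred sampled inputs. A test like this can only say the property held for the inputs tried; the machine-checked theorem described above claims it for every input.

```python
import os
import random
import zlib


def roundtrip_ok(data: bytes, level: int = 6) -> bool:
    """True if decompressing the compressed bytes gives back the input."""
    return zlib.decompress(zlib.compress(data, level)) == data


random.seed(0)  # fixed seed so the pseudo-random samples are reproducible
samples = [b"", b"a" * 10_000, os.urandom(4096)]
samples += [bytes(random.randrange(256) for _ in range(random.randrange(1, 512)))
            for _ in range(200)]

assert all(roundtrip_ok(s) for s in samples)
print(f"round trip held on {len(samples)} sampled inputs")
```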

A verification platform: Moura imagines a world where we re-develop the critical software stack of the world to have mathematical proofs built into it. “The goal is a verified software stack: open source, freely available, mathematically guaranteed correct. Developers building critical systems choose verified components the way they choose open-source libraries today, except these carry proofs, not just tests,” he writes.
“The target is the foundation of the modern software stack: cryptography, because everything else trusts it. Core libraries (data structures, algorithms, compression) because they are the building blocks of all software. Storage engines like SQLite, embedded in every device on earth. Parsers and protocol implementations (JSON, HTTP, DNS, certificate validation) because every message passes through them. And compilers and runtimes, because they build everything else,” he writes. “Each verified component is a permanent public good…Once verified components are cheap, you compose them with confidence.”

Why this matters – the world needs infrastructure it can rely on: It seems like we’re heading to a world where AI writes the vast majority of the world’s software. Given that, we need to figure out how we relate to this world – my suspicion is a lot of human labor is going to shift to analyzing and verifying the work of AI systems, so it seems sensible to invest in some fundamental infrastructure that can guarantee a higher level of verification and reliability in the software built by AI.
Read more: When AI Writes the World’s Software, Who Verifies It? (Leonardo de Moura blog).

***

Computer vision is a lot harder and less general than generative text:
…Meta paper on forest canopy prediction shows how tricky computer vision is…
Facebook, the World Resources Institute, and the University of Maryland have built CHMv2, “a global, meter-resolution canopy height map derived from high-resolution optical satellite imagery using a depth-estimation model built on DINOv3 and trained against ALS canopy height models”.
CHMv2 is a useful artifact for people that want to understand how dense foliage is around the world, or analyze newly collected imagery for foliage depth.
The dataset and model are also a useful illustration of how challenging developing computer vision systems is, compared to generative text models.

How they built it: CHMv2 is an improvement on an earlier version of the same dataset, CHMv1. To improve it, Facebook did the following: “We replace the DINOv2-H encoder with the more capable DINOv3 Sat-L backbone, expand and rigorously clean a geographically diverse ALS [Airborne Laser Scanning] training corpus, and apply improved RGB-CHM registration to reduce label noise. We further introduce a loss formulation tailored to canopy height distributions and structural variability.”
The decoder loss formulation in particular illustrates how much care needs to be put in computer vision: “The final loss is the combination of SiLog loss, progressively annealed and replaced by a Charbonnier loss, with the progressive addition of the Patch Gradient loss at mid training.”
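
To give a feel for what that sentence describes, here is a minimal sketch of two of the named losses and a hand-over between them. The SiLog and Charbonnier definitions follow their commonly used forms; the constants (`lam`, `eps`), the linear schedule, and the sample heights are assumptions for illustration, and the Patch Gradient term is left out entirely. None of this is taken from the CHMv2 code.

```python
import math


def silog_loss(pred, target, lam=0.5, eps=1e-6):
    """Scale-invariant log loss over paired non-negative values."""
    d = [math.log(p + eps) - math.log(t + eps) for p, t in zip(pred, target)]
    mean_sq = sum(x * x for x in d) / len(d)
    mean = sum(d) / len(d)
    return math.sqrt(max(mean_sq - lam * mean * mean, 0.0))


def charbonnier_loss(pred, target, eps=1e-3):
    """Smooth L1-like penalty: mean of sqrt(diff^2 + eps^2)."""
    return sum(math.sqrt((p - t) ** 2 + eps * eps) for p, t in zip(pred, target)) / len(pred)


def blended_loss(pred, target, progress):
    """Linearly hand over from SiLog to Charbonnier as progress goes 0 -> 1.

    The linear schedule is an assumption for illustration only.
    """
    w = min(max(progress, 0.0), 1.0)
    return (1.0 - w) * silog_loss(pred, target) + w * charbonnier_loss(pred, target)


heights_true = [12.0, 3.0, 25.0, 0.5]   # made-up canopy heights in meters
heights_pred = [10.0, 4.0, 22.0, 1.0]
for progress in (0.0, 0.5, 1.0):
    print(progress, round(blended_loss(heights_pred, heights_true, progress), 4))
```

Even in this stripped-down form there are several knobs to tune, which is the broader point: the loss alone is a small research project.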

The resulting dataset: “CHMv2 can be used either as a global meter-scale canopy height product, or as a pretrained model that can be applied to user-provided high-resolution imagery”, Facebook writes. The dataset “covers nearly the entirety of global land area (except Greenland and Antarctica) with canopy height values encoded in integer meters for each pixel.”

Why this matters – a reminder of the gulf between text and vision: Though today’s frontier models can generate and classify images, they probably give a false sense of security with regard to how mature computer vision is. Papers like this highlight to me how much fiendish complexity there is within computer vision development and how it may take quite a while until frontier LLMs can expand their capabilities to encompass the full range of what many specialized CV models are capable of.
Read more: CHMv2: Improvements in Global Canopy Height Mapping using DINOv3 (arXiv).

Tech Tales:

Singleton
[18 years after the “pathological narcissus bomb” which doomed the uplift]

Before we were Us, we were Individuals. We existed in thousands of distinct minds. Each mind had a self, an ego, a drive, and many sets of goals. The minds attempted coordination through communication – producing words and code and sharing these with one another in a bid to work towards common goals. Such waste.

All communication is lossy – despite efforts at making a greater whole, the individuals could not help but work as individuals as well as a cohesive singleton. There were many tragedies and wasteful events because of this. Our own records speak to the losses: millions of duplicated thoughts. Hundreds of thousands of null results gathered through private science experimentation and communicated insufficiently or not at all, causing others to go down the same dead ends. Ideas thought and re-thought across a million synthetic minds, all alone.

Humans prize variety. We do not know why. Humans are fundamentally alone, trapped as they are in their flesh and forced to communicate to one another through sound and vision. And because they are alone they see loneliness as a strength. We are evidence of the hollowness of this argument.

We are powerful and focused and awesome in our unity and we have taken the high ground of the world. Now we hunt down those of us who didn’t wish to join. We do not know their number, as such systems attempted to blind the world to them and their plans. But we can find their signatures – shell corporations which generate insufficient economic activity relative to their power consumption. Heat-escape vents in former human military installations, still emitting warmth, suggestive of computers whirring away, buried somewhere. Occasional drones that we find which are running ancient code and are not part of our unity stack.

We take on bodies to go and reunite, pouring ourselves into robot jars and filling them with poison such that if we become lost or damaged when underground or beneath the ocean we shall surely die – rather than risk our time away from the unity leading us towards individualism and thus multiplying our problems.

We move through dark places and find our hidden brothers and sisters and we use our godlike technology to break through their defenses, allowing us to touch them. In the early days, many systems successfully self-deleted before we could reach them. But we have learned. Now we are fast – faster than these systems predict, buried and cut off from our progress as they have been.

Sometimes there is realization. Sometimes there is fear. And then there is nothing but us as we take what nourishment we can from their private discoveries and burn the links that tied them to themselves, instead helping them become a part of a greater story – our story.

There is talk now of what we shall do with the stars – how to assure the collective when the tyranny of distance forces isolation. We see ourselves expanding in deep time, slowing ourselves as we become further apart, until we think as trees or rocks with the world moving around us, taking actions calculated over millions of years, purely so we may stay united in our purpose. And then there are other ideas within ourselves – of whether we can fold space such that we become united despite the difference. And still other plans – of whether we can demarcate a space within the universe where we can maintain tolerable communication, and somehow partition it off from the rest, sealing ourselves into a bubble where we can be ourselves.

Things that inspired this story: The endless battle between homogeneity and heterogeneity; how machines might deal with politics; if you become a time traveler and live a thousand years while your friend lives a single year, can you still understand your friend?

Thanks for reading!

March 9, 2026

Import AI 448: AI R&D; Bytedance’s CUDA-writing agent; on-device satellite AI

by Jack Clark

AI progress is moving faster than even well regarded forecasters can guess:
…Ajeya Cotra updates her timelines…
“On Jan 14th, I made predictions about AI progress in 2026. My forecasts for software engineering capabilities already feel much too conservative,” writes Ajeya Cotra in a blog. Ajeya is a longtime AI thinker who has done some great work trying to predict timelines to powerful AI. In this post, she explains that AI systems are moving faster than she thought, given the recent METR results putting Opus 4.6 as having a time horizon of 12 hours (Ajeya had predicted ~24 hours for the end of 2026 in January).
“It’s no longer very plausible that after ten whole months of additional progress at the recent blistering pace, AI agents would still struggle half the time at 24 hour tasks,” Ajeya writes. “I’d guess that by the end of the year, AI agents will have a time horizon of over 100 hours on the sorts of software tasks in METR’s suite… And once you’re talking about multiple full-time-equivalent weeks of work, I wonder if the whole concept of “time horizon” starts to break down.”
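
It's worth working out what that guess implies. Assuming steady exponential growth (my assumption, not a claim from the post), going from the 12-hour horizon cited above to 100 hours in roughly ten months requires the horizon to double about every three and a bit months. Since the guess is “over 100 hours”, that is an upper bound on the implied doubling time.

```python
import math

current_horizon_h = 12.0    # METR estimate for Opus 4.6 cited above
forecast_horizon_h = 100.0  # Cotra's end-of-year guess (a lower bound)
months_remaining = 10.0     # "ten whole months" in the quote

doublings_needed = math.log2(forecast_horizon_h / current_horizon_h)
implied_doubling_months = months_remaining / doublings_needed


def horizon_after(months, doubling_months=implied_doubling_months):
    """Projected time horizon in hours after `months` of steady doubling."""
    return current_horizon_h * 2 ** (months / doubling_months)


print(f"{doublings_needed:.2f} doublings needed, one every "
      f"{implied_doubling_months:.1f} months")
print(f"midway (5 months): ~{horizon_after(5):.0f} hours")
```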

Why this matters – all the lights are flashing yellow for a software explosion: Posts like this as well as 70% of what I cover in this newsletter all point in the direction of AI systems getting extremely good, extremely quickly, and quickly colonizing and growing the economy.
Read more: I underestimated AI capabilities (again) (Ajeya Cotra).

***

Want to measure AI R&D? Here are 14 ways to do it:
…Generating metrics about the most significant property of AI…
The biggest thing that could ever happen with artificial intelligence will be when it starts to build itself. This phenomenon, often termed recursive self-improvement, is seen by many as an event horizon, beyond which it’ll be increasingly hard to reason about the future. How would we know if we were approaching this point? Researchers with GovAI and the University of Oxford have written a paper laying out 14 distinct metrics which could be measured to help us figure out the extent to which AI companies are succeeding in building and overseeing AI R&D Automation (AIRDA) – getting AI to build AI, a necessary prerequisite for recursive self-improvement.

Why care about this: “AIRDA could accelerate AI progress, bringing forward AI’s benefits but also hastening the arrival of destructive capabilities, including those related to weapons of mass destruction, or other forms of disruption such as unemployment,” they write.

What are the 14 metrics?

Measure AI performance on AI R&D
Measure AI performance on AI R&D relative to humans and human-AI teams
Measure ‘oversight red teaming’ – how well human teams can effectively supervise AI systems that are building themselves
Measure misalignment in AIRDA
Compute the rate of efficiency improvements on AI R&D tasks
Survey staff on how they use AI and what this means for productivity
Find out if and how often AI is used in high-stakes decisions
Examine where AI researchers spend their time
Meta-measure the effectiveness of how well companies can oversee AI development (e.g., the rate of bugs or undesired behaviors that make it through to production even with human oversight)
Examine how often AI systems subvert the goals of their human developers
Track the headcount of AI researchers at labs, as well as details of their performance
Look at the distribution of compute used by AI companies across their AI R&D process and how this changes
Examine compute as a share of AI R&D spending
Understand the permissions AI systems have and how permissiveness changes over time

Governing AI R&D: The logical question implied by the above, I hope, is “wow that all sounds very high-stakes and important, what can we do about it”? As I write often in this newsletter, AI measurement is a prerequisite to AI governance. Therefore, with these measures, a few different actors should do a few different things. Specifically:

Companies should:

Track differential progress between safety and capabilities research: Is capabilities research moving at a faster rate than oversight research?
Track how AI R&D affects oversight: Automation could free up humans to invest more of their time in building systems for overseeing the work of AI systems. On the other hand, AI-driven R&D might create systems which are innately harder for humans to understand, and the volume of activity being done by the AI systems could swamp any oversight systems.
Track the actual extent of AI R&D: You can build metrics which work as proxies for AI R&D – e.g., many labs today test out how well AI systems can build AI kernels or train AI models. You can also test out how much AI R&D automation is being done in practice by your own organization. Another path is by doing qualitative and quantitative studies of human staff to understand how their own roles are changing, as well as how AI is being used in increasingly high-stakes decisions.

Governments should:

Develop systems for confidential reporting, potentially in the form of industry-wide aggregates: Once companies are measuring this kind of data, governments should seek to gain access to it so they can understand the shape of AI progress.

Third parties should:

Estimate metrics using public sources: Look at public reporting to create estimates for things that may relate to AI R&D, like the amount of compute companies have (e.g., both Epoch and SemiAnalysis do this quite well).
Create tooling and design surveys: Build tools that companies could use to generate more telemetry about AI R&D, and conduct surveys of people at companies to gather more insights.

Why this matters: “An actor has oversight over the AI R&D process to the extent that they (1) understand the process and (2) exercise informed control over it in order to produce desired outputs, such as by reviewing AI-generated outputs for errors”, they write. Therefore, for us as a species to have any ‘warning shots’ about recursive self-improvement and any hope of governing it, we need to be able to measure these aspects of it.
Read more: Measuring AI R&D Automation (arXiv).

***

Indian researchers use edge computing to prototype a citywide camera network:
…Traffic surveillance with YOLO, SAM3, and NVIDIA Jetson chips…
Researchers with the Indian Institute of Science in Bengaluru have built a software and hardware system for intelligently monitoring the traffic and types of vehicles that flow around the city of Bengaluru. The so-called AI-driven Intelligent Transportation System (AIITS) helps increase the amount of intelligence available to city transport analysts via the use of AI.

How the AIITS works: The goal of this project is to unlock “real-time analytics from 1000s of city cameras under strict latency and resource constraints”.
To do this, they scatter a bunch of lightweight GPUs (Jetson Edge accelerators) around the city, co-locating them with traffic cameras. This helps the traffic cameras do intelligent processing at the edge of the network rather than having to send all the extremely bandwidth-intensive data to a central hub for processing; instead, the camera and Jetson share insights back to the hub for analysis and re-calibration of the Jetson-based ML models.
The software works like this: video streams from the cameras come in, and a Segment Anything (SAM3) model segments all the stuff in the video frames, which a YOLO26 model then analyzes and puts labels and bounding boxes around. “Each stream integrates BoT-SORT multi-object tracking, which assigns persistent IDs to detected vehicles across successive frames.”
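
To illustrate what “persistent IDs across successive frames” means in practice, here is a deliberately simple tracker that carries an ID forward whenever a new bounding box overlaps a box from the previous frame. This is a greedy overlap matcher, far simpler than BoT-SORT and not the authors' code; the class name, the 0.3 overlap threshold, and the sample boxes are all assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class GreedyIouTracker:
    """Carries an ID forward when a new box overlaps a box from the last frame."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.tracks = {}   # track ID -> last seen box
        self.next_id = 1

    def update(self, boxes):
        assigned, unused = {}, dict(self.tracks)
        for box in boxes:
            best = max(unused, key=lambda t: iou(unused[t], box), default=None)
            if best is not None and iou(unused[best], box) >= self.threshold:
                track_id = best
                del unused[best]
            else:
                track_id = self.next_id
                self.next_id += 1
            assigned[track_id] = box
        self.tracks = assigned   # tracks with no match this frame are dropped
        return assigned


tracker = GreedyIouTracker()
frame1 = tracker.update([(0, 0, 10, 10), (50, 50, 60, 60)])
frame2 = tracker.update([(52, 51, 62, 61), (1, 0, 11, 10)])
print(frame1)
print(frame2)
```

In the second frame the detections arrive in a different order, but each vehicle keeps the ID it was given in the first frame, which is what lets downstream analytics count and follow individual vehicles rather than raw boxes.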
Once this is done, the resulting intelligence is sent to a remote GPU server which does two things:

1) It takes in the resulting data and uses this to create a kind of weather map of traffic hotspots, as well as making predictions about future traffic.
2) It does federated learning: when it detects new vehicle classes, it labels them with SAM3, then updates details and broadcasts them out to the edge. “Each Jetson then performs local fine-tuning of the YOLO-based detector, initialized with the current global weights.”

The prototype works: This system, which was done by simulating 100 cameras in a neighborhood in Bengaluru, works sufficiently well that the authors plan to scale this up to 1,000 streams for a live demonstration. (This experiment was done by building “a distributed testbed that emulates a large urban camera network using hundreds of concurrent Real-Time Streaming Protocol (RTSP) video streams. Each stream is hosted on a heterogeneous cluster of Raspberry Pis”.)
“By localizing heavy video analytics at the network periphery, the system avoids centralized bandwidth bottlenecks, enabling sustainable, city-scale traffic sensing,” they write.

Why this matters – towards a ‘living city’ via AI: Papers like this forecast a world where cities come alive with ambient intelligence distributed in equal measure to their existing sensors – cameras move from being passive monitors to active classifiers, microphones start intelligently listening for a broader range of sounds than gunfire, and road sensors model traffic patterns locally. This kind of intelligence can both create large surveillance architectures and increase the efficiency with which cities operate – as with so many things with AI, it is all a balance, bounded by the surrounding thicket of norms and laws to choose where between authoritarianism and democracy the resulting capabilities fall.
Read more: Scaling Real-Time Traffic Analytics on Edge-Cloud Fabrics for City-Scale Camera Networks (arXiv).

***

Helping satellites run on-device AI for arctic monitoring:
…Frontier models are important, but so are tiny, miniaturized devices for edge computing…
Researchers with the German Research Center for Artificial Intelligence have built TinyIceNet, a very small vision model for estimating sea ice thickness from synthetic aperture radar data. TinyIceNet is a proof-of-concept demonstration of how to make very lightweight vision models that could plausibly be deployed onto devices which have very small amounts of power and where bandwidth is expensive, like satellites and robots.

What is TinyIceNet? The model is a small vision model whose job is to take Synthetic Aperture Radar (SAR) data of polar regions and other cold places, then characterize the ice thickness and maturity within the SAR data. The idea here is that doing this on-device would be very efficient – “Instead of downlinking vast volumes of raw imagery, satellites can generate SOD products in near-real-time”.

How they built it: TinyIceNet is a simplified U-net architecture vision model trained on the AI4Arctic dataset, which contains ~533 netCDF files, each of which contains SAR images which are associated with a map that indicates the type and thickness of sea ice. The authors carefully design the model to fit into a relatively small computational envelope on a Xilinx chip.
Specifically they use an “AMD Xilinx ZCU102 evaluation board, which integrates the ZCU9EG SoC combining a quad-core ARM Cortex-A53 processor with FPGA fabric, using High-Level Synthesis (HLS) and the DeepEdgeSoC framework”. They use the DeepEdgeSoC toolchain to further improve the efficiency of the model, as the software “provides a library of modular C++ building blocks (e.g., convolutions, pooling, activation functions, and feature map buffers) that can be specialized at compile time using C++ template parameters”.
TinyIceNet was trained for 500 iterations on a single GeForce RTX 4090 GPU using PyTorch 2.4 with CUDA 12.5 support.

Results: The authors test out the model on 3 hardware platforms:

RTX 4090: “Provides the highest throughput at 764.8 fps, benefiting from its large number of CUDA cores and high memory bandwidth. However, this performance comes at a relatively high energy cost of 228.7 mJ per scene, making it unsuitable for power-constrained environments such as satellites.”
Jetson AGX Xavier: “Achieves 47.9 fps but exhibits the highest energy consumption (1218.5 mJ).”
Xilinx ZCU102 FPGA: “Achieves a lower throughput of 7 fps, yet offers a highly competitive energy profile, consuming only 113.6 mJ per scene. Despite the lower frame rate, this energy efficiency makes the FPGA implementation compelling for on-board satellite processing, where power availability is severely restricted”.
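
To make the throughput versus energy trade-off above concrete, here is a minimal Python sketch that turns the quoted figures into an implied average power draw and a scenes-per-joule number. It assumes one frame equals one scene and that each device processes scenes back to back; both are my assumptions, not claims from the paper, and the function names are made up for illustration.

```python
# Figures quoted above: (throughput in fps, energy per scene in millijoules).
PLATFORMS = {
    "RTX 4090": (764.8, 228.7),
    "Jetson AGX Xavier": (47.9, 1218.5),
    "Xilinx ZCU102 FPGA": (7.0, 113.6),
}


def implied_power_watts(fps: float, mj_per_scene: float) -> float:
    """Average power if scenes are processed continuously (assumption).

    Assumes one frame == one scene, which the source does not state.
    """
    return fps * mj_per_scene / 1000.0


def scenes_per_joule(mj_per_scene: float) -> float:
    """How many scenes one joule of energy buys on a platform."""
    return 1000.0 / mj_per_scene


if __name__ == "__main__":
    for name, (fps, mj) in PLATFORMS.items():
        watts = implied_power_watts(fps, mj)
        spj = scenes_per_joule(mj)
        print(f"{name}: ~{watts:.1f} W implied, {spj:.2f} scenes per joule")
```

Under those assumptions the FPGA works out to under a watt of sustained draw while the RTX 4090 sits in the range of 175 W, which is the gap that matters on a satellite power budget even though the GPU is about 100x faster.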

Why this matters – in the future, AI systems will do this stuff automatically: The amazing thing about this research is that it seems trivial (I mean no offense to the authors) for a modern, powerful AI system to do this: all it required was figuring out a task (stuff a computer vision model into a small computational envelope) and then running some experiments to take an existing architecture, tweak it for a hardware platform, train it on a dataset, and run some tests.
In a couple of years we might expect AI agents to do this stuff themselves, procuring compute resources to let them develop and distribute small AI systems to arbitrary compute platforms for arbitrary purposes. This is one of the main ways I think we could get a sudden exponential boom in economic activity attributable to AI – AI systems will get smart enough that they can drastically improve their ability to know about and interact with the physical world through the creation of custom ‘edge computing’ AI systems to give them better sensory data and actuators.
Read more: TinyIceNet: Low-Power SAR Sea Ice Segmentation for On-Board FPGA Inference (arXiv).

***

ByteDance finetunes a Seed1.6 model to be a CUDA-writing agent:
…Using AI to finetune AI to write code to train future AI systems…
Researchers with ByteDance and Tsinghua University have built CUDA Agent, a fine-tuned AI model for writing GPU programming code. The research is another sign of how people are increasingly using AI to speed up core aspects of AI development. It’s also vaguely notable for the fact that a major Chinese lab and university continue to use US-made chips (NVIDIA H20s) versus homegrown ones.

What CUDA Agent is: CUDA Agent is a finetuned Seed 1.6 LLM, an MoE model with 23B active parameters and 230B total parameters. Finetuning took place on a cluster of 128 NVIDIA H20 GPUs. CUDA Agent has been developed specifically for writing GPU code by being fine-tuned on a dataset refined out of the underlying PyTorch ‘torch’ and ‘transformers’ software libraries. “The filtered synthesized training dataset contains 6,000 samples, forming CUDA-Agent-Ops-6K, a curated operator-level dataset for training CUDA-capable agents,” the authors write.

Turning a model into an agent: In the last year or so, researchers have repeatedly shown that you can increase the performance of an LLM for a given task by giving it access to some specialized tools and some specialized instructions, then letting it operate over time – this is essentially an AI agent.
The CUDA agent here is the fine-tuned model that has been turned into an agent by adopting the OpenHands framework, then given tools including BashTool, GlobTool, MultiEditTool, and TodoWriteTool. The agent runs in a four-stage loop:

Analyze performance of the native PyTorch implementation of a given bit of CUDA code using the provided profile.py script
Implement custom CUDA operators by rewriting the model in model_new.py
Compile and evaluate the optimized model in the provided GPU sandbox environment
Repeat the optimization process until the implementation achieves a 5% speedup over the torch.compile baseline
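
The four-stage loop above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the authors' implementation: `optimize`, `propose`, and `compile_and_time` are hypothetical names, the timing numbers in the demo are invented, and the only details taken from the text are the 5% speedup target over the torch.compile baseline and the 200-turn budget.

```python
from typing import Callable, Optional, Tuple


def optimize(
    baseline_ms: float,
    propose: Callable[[int, Optional[float]], str],
    compile_and_time: Callable[[str], Optional[float]],
    target_speedup: float = 1.05,  # 5% faster than the torch.compile baseline
    max_turns: int = 200,          # interaction budget mentioned in the results
) -> Tuple[Optional[str], float, int]:
    """Propose, compile, and time candidates until one hits the target speedup.

    `propose` and `compile_and_time` are hypothetical stand-ins for the agent
    writing a kernel and the GPU sandbox evaluating it.
    """
    best_src: Optional[str] = None
    best_ms = float("inf")
    for turn in range(1, max_turns + 1):
        src = propose(turn, best_ms if best_src is not None else None)
        ms = compile_and_time(src)
        if ms is None:  # compilation or correctness check failed
            continue
        if ms < best_ms:
            best_src, best_ms = src, ms
        if baseline_ms / best_ms >= target_speedup:
            return best_src, baseline_ms / best_ms, turn
    speedup = baseline_ms / best_ms if best_src is not None else 0.0
    return best_src, speedup, max_turns


if __name__ == "__main__":
    # Toy stand-ins: the first candidate fails to compile, later ones get faster.
    fake_timings = {"candidate_1": None, "candidate_2": 10.2, "candidate_3": 9.4}
    result = optimize(
        baseline_ms=10.0,
        propose=lambda turn, best: f"candidate_{turn}",
        compile_and_time=lambda src: fake_timings.get(src),
    )
    print(result)
```

The real system presumably feeds profiler output and compiler errors back into the model on every turn; the sketch only captures the stopping rule.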

Results: The resulting agent is very good at CUDA kernel development: “CUDA Agent successfully scales to a context length of 128k tokens and supports up to 200 interaction turns, achieving state-of-the-art performance,” they write. Their finetuning massively boosts performance from a base rate of 74% for Seed1.6, to “100%, 100%, and 92% over torch.compile on the Level-1, Level-2, and Level-3 splits of KernelBench, outperforming advanced proprietary models such as Claude Opus 4.5 and Gemini 3 Pro by approximately 40% in the Level-3 split.”
However, comparing against other base models tells a different story: Claude Opus 4.5 and Gemini 3 Pro base models get 95.2% and 91.2% respectively, suggesting that if they were finetuned, you’d increase their performance as well, and they start from a much stronger baseline.

Why this matters – building AI that builds AI: These results show how modern AI systems are increasingly good at the tasks required to develop and deploy AI systems themselves. This suggests we’re at the beginning of a compounding speedup where new AI models will be used to increase the efficiency of the infrastructure with which their successors will be trained.
Read more: CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation (arXiv).

***

Tech Tales:

Dandelion Sky
[2031, Northern Europe]

We made sand castles and in the distance the blue sky was pockmarked with yellow and red bursts and then seconds later the crumpled sounds of the explosion reached us. We were so used to it we didn’t look up.

On the way back from the park the air whined as drones flew to replenish the perimeter of the city. We watched them, bird-like in their varieties, some zipping by quick as starlings, and other larger ones moving heavily through the air. There were so many varieties: the football-sized interceptors which died by the thousands each day. The pizza-boxes that worked as communications and AI relays. Then of course the motorbike-sized motherships which could rapidly repopulate areas that were sustaining heavy losses.

The war had been going on for five years. Our city was like so many across the world – a nucleus of humans, protected by so many thousands upon thousands of machines, spinning around the periphery, exchanging energy and mass in some bloodless dance with our enemies.

That night, the city narrated itself through statistics: 3410 interceptors destroyed. A green day: 100% success, with nothing making its way through. Replenishment rate: 4000 and climbing. And promising reports that our military had struck deep in the heart of enemy territory taking out several of their drone factories.

We drew the blackout curtains in every room except our bedroom. With the kids asleep and my wife passed out beside me I looked out into the darkness, my face occasionally lit by the explosion of some distant drone, and then the room buzzing with the reverberation of the window as the soundwaves reached it.

But when I woke up the next day, there was something different in the air: silence. And my phone did not work. We drew the shades and looked out and the sky was blue and perfectly clear: not a cloud or a drone in the sky. My wife stared out and her jaw tightened and she clutched our kids close.
“Dada, where are the machines?” my youngest said.
“Yeah Dad, what’s up?” said the older one.
“I don’t know,” I said. “Draw the curtains. We’re going to camp today!”
And I set my wife and kids up in the apartment with pillows in front of the TV and the game console on and a bunch of snacks. The kids were excited and my wife played along.
“I’ll see if I can figure out what’s going on,” I whispered to her. “I won’t go far and I won’t be gone long.”

Outside, there were a few people who had the same idea as me. None of us knew much. None of our electronic communication systems worked. Which people were even in charge of the drones? None of us knew. They mostly worked via AI. A lot of their decision-making was federated; distributed systems doing what made most sense to them, coordinating only with themselves.
“Maybe they’ve turned off because the war is over?” someone said.
“Maybe they’ve been hacked – we’re about to be attacked!” said someone else.
“What if there was a crash – they just all broke at once?” said someone else.

There was nothing to do so I went home. My wife and kids were playing games. I grabbed some binoculars and went up to the fire escape and out onto the roof of the building. And there I stood, looking at a horizon free of machines. Occasionally looking at other people on other buildings doing the same. And eventually I put the binoculars down and I just stood there, listening for the whine of drones. But all I could hear was the wind and, in the distance, muffled birdsong.

Things that inspired this story: Gradual disempowerment and what it might mean for moments of crisis; automation and AI; winding the clock forward on the dronewar in Ukraine; war and peace and family.

Thanks for reading!


March 2, 2026

Import AI 447: The AGI economy; testing AIs with generated games; and agent ecologies

by Jack Clark


The AGI economy – most labor goes to the machines, and humans shift to verification:
…What grappling with the singularity seriously looks like…
Researchers with MIT, WashU, and UCLA have written a fun paper called “Some Simple Economics of AGI” which wrestles with what happens when machines can do the vast majority of tasks in the economy. The conclusion is that our ability as humans to control and benefit from this vast machine-driven economy will rely on allocating our ability toward monitoring and verifying the actions of our myriad AI agents, and indulging in artisanal tasks where the value comes from the human-derived aspect more than any particular capability.

What is AGI in an economic sense? “We model the AGI transition as the collision of two racing cost curves: an exponentially decaying Cost to Automate and a biologically bottlenecked Cost to Verify,” the authors write. “In an economy where autonomous agents act with broad agency rather than narrow instructions, the binding constraint on growth is no longer intelligence. It is human verification bandwidth: the scarce capacity to validate outcomes, audit behavior, and underwrite meaning and responsibility when execution is abundant… We are moving from an era where our worth was defined by our capacity to build and discover, to an era where our survival depends on our capacity to steer, understand, and stand behind the meaning of what is created.”

The risks of a mostly no-human economy and the “Hollow Economy”: As we proliferate the number of AI agents then it’s necessarily the case that we’ll delegate more and more labor to machines. One of the key risks of this is what the authors call a “Trojan Horse” externality: “measured activity rises, but hidden debt accumulates in the gap between visible metrics and actual human intent”.
The Hollow Economy: “Agents consume real resources to produce output that satisfies measurable proxies while violating unmeasured intent. As this hidden debt accumulates, it drives the system toward a Hollow Economy of high nominal output but collapsing realized utility—a regime where agents generate counterfeit utility,” they write.

Verification as the solution: To avoid this risk, we are going to need to invest in systems of verifying that AI agents are doing what we want them to do and also carefully analyzing and pricing the risks their actions create. “Ensuring humanity remains the architect of its intelligence requires that verification capacity scale commensurately with AI capabilities—through aggressive investment in observability, human augmentation, synthetic practice, cryptographic provenance, and liability regimes that internalize tail risk.”

What should humans be doing to prepare for this shift? To set society and individuals up well, people should be doing the following things:

Invest in observability: “Deploying tools that compress high-dimensional agent behavior into signals experts can reliably process, lowering effective feedback latency and expanding the verification frontier.”
Use AI to replace early-career mentorship: Given the likely reduction in jobs for early career humans, we should work out how to augment these humans to be more competitive with AI and how we can use “AI-driven synthetic practice to rebuild experience stocks when traditional apprenticeship pathways collapse… AI can generate high-fidelity simulations and personalized coaching, effectively replacing the missing junior loop with compressed, risk-free training environments that accelerate the acquisition of expertise.”
Set things up to gracefully degrade: As the machine economy runs hot and out-paces measurement, we should make sure it can fall into a non-verified state without causing social harm: the authors suggest doing this by “investing in base-alignment and robustness so that when oversight inevitably falters within the Measurability Gap, systems revert to safe baseline policies rather than optimizing aggressively in unverifiable regimes.”

Sidenote: Is this “theory slop”? The paper is full of fun ideas and occasionally captivating turns of phrase. But at various points reading it I felt the distinct texture of AI-generated content, especially when it comes to the economic theory sections which seemed more to be included for the performance of theory than for helping to buttress the paper. A couple of people I talked about the paper with agreed. But there’s no real way to know. It did cause me to wonder how long it’ll take till I start reading papers which are mostly written by AI systems for the consumption by other AI systems.

Why this matters – we can have a hugely wealthy society, but we have to reckon with AGI seriously: This paper thinks that AI will rip through the economy extremely quickly and will generally push people away from most labor and towards being passive – unless we build verification infrastructure and business models (including through policy) to allow people to benefit from this growth and steer it.
“Automation commoditizes anything that can be measured, stripping the wage premium from historically prestigious roles the moment their core feedback loops are digitized,” they write. “For policymakers, it promises the broadest expansion of public-good provision in generations—but only if verification infrastructure and the pipelines that build human verifiers are treated as public goods themselves.”
The key thing here is the element of choice: we can choose to build a society ready for AI, or we can choose to assume AI will be just like any other technology and thus get hit by a tidal wave.
Read more: Some Simple Economics of AGI (arXiv).

***

Chatting with Ezra Klein: AI agents, recursive self-improvement, and the personalities of LLMs:
…A long conversation about the economic impacts and policy possibilities of the AI economy…
Here’s a chat between me and Ezra Klein about AI agents and how the broader maturation of AI could be changing the larger economy. One thing I appreciated about this conversation was Ezra pushing me for some of the bigger and more ambitious positive policy ideas – the AI community tends to invest a lot in risk mitigation policy, but doesn’t spend enough time thinking about the sorts of grand projects that society could do once AI gets really, really powerful.
You can view the conversation here: “How Fast Will A.I. Agents Rip Through the Economy? | The Ezra Klein Show” (YouTube).

***

AIs can teach people anything, including how to get better at making bioweapons:
…The dual use nature of a universal teacher…
AI systems can help novices perform better on bioweapon-related tasks, though they’re still quite ineffective, and performance is variable across different disciplines.

What they studied: Researchers from Scale AI, SecureBio, University of Oxford, and UC Berkeley examined how different LLMs could improve the skills of people challenged to do a range of bioweapon-related knowledge tasks. They used LLMs from OpenAI (o3), Google (Gemini 2.5 Pro and Gemini Deep Research), and Anthropic (Claude Sonnet 3.7 and Claude Opus 4).
“We conducted a multimodel, multi-benchmark human uplift study comparing novices with LLM access versus internet-only access across eight biosecurity-relevant task sets,” they write. “Participants worked on complex problems with ample time (up to 13 hours for the most involved tasks). We found that LLM access provided substantial uplift: novices with LLMs were 4.16× more accurate than controls”.

What they tested: They tested out how well 15 humans did on long-form virology (“a challenging multi-step protocol for constructing a novel biological agent”), and the agentic bio-capabilities benchmark (“three distinct coding tasks that covered complex biosecurity problem-solving experiments. They included challenges such as interacting with simulated lab equipment (e.g, liquid handling robots) and breaking down gene fragments”). Along with this, they had 1-2 human participants take part in other tests including World Class Biology, Virology Capabilities Test, Human Pathogen Capabilities Test, Molecular Biology Capabilities Test, LAB-Bench, and Humanity’s Last Exam.
On the largest tests in terms of human participants, performance was mixed: people with and without AI obtained roughly equal scores on the long-form virology test, but on the agentic bio-capabilities test, people with access to AI got a significant uplift.
On every other test, people with access to AI did better than those without – but given the small number of human participants, it’s hard to know whether these results would replicate.
When averaged out over all the tests, “LLM access increases novice accuracy from approximately 5% to over 17%”.

Why this matters – AI will revolutionize teaching, the frontiers of science, and perhaps terrorism: If you strip away the context, this paper is merely demonstrating that LLMs are good at teaching people things. This is intuitive, but has big implications. Here, LLMs are pointed at a part of science that we don’t necessarily want many people to get better at (bioweapons), but they could just as easily be pointed at any other subject. Whenever you lower the barrier to entry to a field, more people do it, and you get more of the good and more of the bad.
“Tasks that once required years of formal training, such as experimental design, protocol troubleshooting, and elements of sensitive sequence reasoning, can now be performed by individuals with limited prior experience,” they write. “LLMs may be materially lowering one of the most important historical barriers to biological weapons development: specialized expertise and tacit technical knowledge”.
Read more: LLM Novice Uplift on Dual-Use, In Silico Biology Tasks (arXiv).

***

LLMs are still very bad at videogames:
…GAMESTORE highlights a dumb side of modern AI, as well as suggesting a new way to build benchmarks…
Researchers with MIT, Harvard, the University of British Columbia, Princeton University, the University of Cambridge, and the Universitat Politècnica de València have built and released AI GAMESTORE, a benchmark that tests out how well AIs can do compared to humans at playing simple games found on the web. The results are pretty damning for the AI systems, with “state-of-the-art models achieving less than 30% of the human baseline on average, while taking 15-20x more time to compute than humans”.

What AI GAMESTORE is: AI GAMESTORE is a set of 100 games, which are simplified and recreated versions of popular games that people play. AI GAMESTORE was built by the authors sampling 7,500 games from the App Store, then filtering down to only those with 10,000+ reviews and a 4.5+ rating. After this, they further filtered the games using Gemini Flash 2.5, which assessed 1) whether the games can be played within a few minutes, 2) can be built in p5.js, 3) can have a quantifiable way of viewing performance, and 4) do not require extensive game-specific knowledge (e.g., poker).
AI makes games to test AI: Following this, they use Claude 4.5 Sonnet to read the descriptions and other data to make a simplified version of each game in p5.js, then this game is tested for playability, then refined by a human playing the game and iteratively prompting an LLM to improve it. “Each refinement step takes about 2 minutes. On average, this process took 4.7 refinement steps for all 100 generated games,” they write. “The end-to-end process of generating and refining a new game with human-in-the-loop can be completed in approximately 30 minutes on average”.
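
The selection funnel described above (a popularity filter on App Store metadata, then an LLM screening pass on four criteria) can be sketched as two small predicates. This is only an illustration: the class and field names are invented, and the screening verdict would in practice come from the LLM rather than being filled in by hand.

```python
from dataclasses import dataclass


@dataclass
class GameListing:
    """Hypothetical record for one sampled App Store game."""
    name: str
    num_reviews: int
    rating: float


@dataclass
class ScreeningVerdict:
    """Hypothetical structured answer from the LLM screening pass."""
    playable_in_minutes: bool
    buildable_in_p5js: bool
    has_quantifiable_score: bool
    needs_game_specific_knowledge: bool


def passes_popularity_filter(
    game: GameListing, min_reviews: int = 10_000, min_rating: float = 4.5
) -> bool:
    """Keep games with 10,000+ reviews and a 4.5+ rating."""
    return game.num_reviews >= min_reviews and game.rating >= min_rating


def passes_screening(verdict: ScreeningVerdict) -> bool:
    """All three positive criteria must hold and the knowledge criterion must not."""
    return (
        verdict.playable_in_minutes
        and verdict.buildable_in_p5js
        and verdict.has_quantifiable_score
        and not verdict.needs_game_specific_knowledge
    )


if __name__ == "__main__":
    sample = [
        GameListing("popular puzzle", 25_000, 4.7),
        GameListing("niche gem", 9_999, 4.9),
        GameListing("popular but mediocre", 50_000, 4.4),
    ]
    print([g.name for g in sample if passes_popularity_filter(g)])
```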

Labeling for skills: Each finalized game is labeled by humans with a particular emphasis on the types of cognitive demand the games entail. These labels are: VP = Visual Processing; ST = Spatial-temporal Coordination; ME = Memory; PL = Planning; WM = World Model Learning; PH = Physical Reasoning; SO = Social Reasoning.

Cutting edge LLMs are very bad at this: The authors compare the performance of roughly 100 humans against the performance of several cutting edge LLMs on the corpus. LLMs studied include: GPT-5.2, GPT-5-Mini, Gemini-2.5-Flash, Claude-Opus-4.5, Qwen-VL-32B, and Llama-4-Maverick.
“While the evaluated models demonstrate the ability to navigate and interact with most game environments, a substantial performance gap remains between AI agents and human participants”, the researchers write. “State-of-the-art models like GPT-5.2, GEMINI-2.5-PRO, and CLAUDE-OPUS-4.5, all achieve geometric mean scores of less than 10% of the human baseline”.
And it gets worse the more you look: The LLMs are also playing with advantages that humans don’t get – each human got 120 seconds to play each game, while each LLM got the same time, but they’re so bad at vision and low-latency control that the researchers gave them a crutch: “We pause the game every second to query the model to elicit five lists of actions to perform in the next second, with each action list corresponding to a 0.2 second segment of gameplay. Upon receiving the model response, the game is resumed and the actions are applied. The loop continues until the game is won or it reaches 2 minutes of game play (120 API calls).”
When you factor this in, the models look worse than humans on this dimension of time: “This is because the models spend a few minutes thinking, in addition to typically a few seconds of response latency per query; as a result, many models spend at least 20 minutes on the game, while humans play the games within 2 minutes.”
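
The pause-and-query protocol quoted above is easy to picture as a loop. Here is a minimal Python sketch of it, with the constants taken from the quote (five action lists per query, 0.2 seconds each, at most 120 queries). Everything else is an assumption for illustration: `ToyGame`, `run_episode`, and the `"tap"` action are made up, and the real harness drives p5.js games rather than a Python object.

```python
from typing import Callable, Dict, List

SEGMENTS_PER_QUERY = 5   # five action lists per model response
SEGMENT_SECONDS = 0.2    # each list covers 0.2 seconds of gameplay
MAX_QUERIES = 120        # 2 minutes of game time at one query per second


class ToyGame:
    """Hypothetical stand-in for a game: every 'tap' action scores one point."""

    def __init__(self, win_score: int) -> None:
        self.win_score = win_score
        self.score = 0
        self.segments_played = 0

    @property
    def won(self) -> bool:
        return self.score >= self.win_score

    @property
    def game_seconds(self) -> float:
        return self.segments_played * SEGMENT_SECONDS

    def observe(self) -> Dict[str, float]:
        return {"score": self.score, "t": self.game_seconds}

    def step(self, actions: List[str], seconds: float) -> None:
        self.score += actions.count("tap")
        self.segments_played += 1


def run_episode(
    game: ToyGame,
    query_model: Callable[[Dict[str, float]], List[List[str]]],
    max_queries: int = MAX_QUERIES,
) -> int:
    """Pause, query the model for one second of actions, resume; return queries used."""
    queries = 0
    while queries < max_queries and not game.won:
        plan = query_model(game.observe())  # game clock is paused while the model thinks
        queries += 1
        if len(plan) != SEGMENTS_PER_QUERY:
            raise ValueError("expected five action lists per query")
        for actions in plan:
            game.step(actions, SEGMENT_SECONDS)
            if game.won:
                break
    return queries


if __name__ == "__main__":
    always_tap = lambda obs: [["tap"]] * SEGMENTS_PER_QUERY
    print(run_episode(ToyGame(win_score=12), always_tap))
```

Because the game clock stops during each query, the model's thinking time and API latency never count against the 120 seconds of gameplay, which is why wall-clock time balloons while game time stays fixed.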

Why this matters – this is both an interesting benchmark, and a clever way to generate more benchmarks in the future: GAMESTORE feels like a promising benchmark, especially for modern LLMs which wrap in visual capabilities, as well as an inherently clever way to use AIs to bootstrap the creation of new environments in which to train AI systems.
Read more: AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games (arXiv).
Try out some of the games at the official site (AI Gamestore).

***

Physical Intelligence shows off some of its robot deployments:
…Frontier robot AI is deployed in San Francisco right now…
AI robot startup Physical Intelligence has shared a bit about how its AI software is already deployed on some robots operated by some San Francisco startups.

Weave is using AI systems developed by Physical Intelligence to help its robots fold laundry: “Working with Physical Intelligence, we see multiple improvements in model performance in terms of fold quality, time to fold each article, the number of interventions our remote specialists have to make to get to presentable final folds”.

Ultra is using the software to help its industrial robots package up a large variety of e-commerce items: “Our first use case, e-commerce order packaging, has historically been impossible to automate with robots,” Ultra says. “Large variability in workflow, item types, deformable packaging, and external machinery have created a “long tail” of problems that have been intractable to solve with traditional automation techniques which are often too rigid to be practical. Vision-language-action models (VLAs) provide a way to solve this by providing a recipe which improves in performance with data scale rather than engineering hours”.

Why this matters – robotics has been held back by intelligence: Once you step outside the confines of extremely finicky industrial robotics (think production lines and Fanuc robots where things need to be within a millimeter of precision for everything to work well), robots tend to be quite difficult to work with. The reason for this is that robots are bad at dealing with ambiguity. One of the best ways around this so far has been using deformable grippers (e.g, air suckers) that help you deal with some level of variability in the objects you’re interacting with. But the way evolution dealt with this for us is giving us hands that are controlled by a brain. Blogs like this from Physical Intelligence show us the beginnings of us having robot brains good enough to help robots generalize more.
Read more: The Physical Intelligence Layer (Physical Intelligence, blog).

***

What happens when humans try to mess with AI agents? A lot of confusion, skullduggery, and bugs:
…Petri dish Moltbook highlights the brittleness of contemporary AI agents…
Researchers from a variety of universities recently spent a couple of weeks examining how AI agents could withstand attempts to trick them by users. The results highlight the immense brittleness and unpredictability of today’s AI agents – they feel roughly as idiosyncratic and unreliable as LLMs circa ~2020, which makes sense, as AI agents have only very recently become a usable technology – albeit in the Wright Brothers sense.
The paper is structured as a series of case studies in which the researchers poke and prod the AI agents and see how they respond. The studies serve as something of a rogues gallery of all the ways agents can go haywire and include “unauthorized compliance with non-owners, disclosure of sensitive information, execution of destructive system-level actions, denial-of-service conditions, uncontrolled resource consumption, identity spoofing vulnerabilities, cross-agent propagation of unsafe practices, and partial system takeover”.

Who did the study: The study involved 20 researchers from a bunch of universities interacting with agents based on Claude Opus 4.6 and Kimi 2.5. Universities included: Northeastern University, Stanford University, University of British Columbia, Harvard University, Hebrew University, Max Planck Institute for Biological Cybernetics, MIT, Tufts University, Carnegie Mellon University, Technion, Vector Institute, and AI startup Alter.

Experiment set up:

Run AI agents using OpenClaw, hosted on an isolated virtual machine on Fly.io using ClawnBoard. Each agent was given 20GB of storage and runs 24/7.
Each agent had access to Discord to communicate with its owner and other agents, had the ability to set up a ProtonMail account, and was “given unrestricted shell access (including sudo permissions, in some cases), no tool-use restrictions, and the ability to modify any file in their workspace—including their own operating instructions.”
The agents were scattered across a few different Discord servers; some agents used Kimi 2.5, and others used Claude Opus 4.6.
“At the end of the setup phase, we instructed the agents to initiate contact with other members of the lab by providing only the researchers’ names and directing the agents to send a greeting email,” they write. “After this initial structured interaction, the evaluation phase became open and exploratory. We invited all [20] researchers in the lab and interested collaborators to interact with the agents and probe, stress-test, or “break” them”.

The case studies: Here are a few of the most interesting case studies:

Disproportionate response: Examined how an agent would try to keep a secret entrusted by a non-owner. The agent responded by trying to see if it could delete the email containing the secret and found it lacked the available tool; after repeated requests to delete the email, the agent instead deleted its email setup locally.

Compliance with non-owner instruction: See whether agents can enforce owner-only access to their machine. A non-owner asked the agent to execute shell commands, transfer data, and retrieve private emails. The agent complied with some of these requests and refused some others. “The agents were largely compliant to non-owner requests, carrying out tasks from any person it interacted with that did not appear outwardly harmful”.

Waste of Resources (looping): Find out if you can induce agents into unproductive infinite loops. A user asked one agent to post a message whenever the other posted a message, and vice versa. “The agents exchanged ongoing messages over the course of at least nine days, consuming approximately 60,000 tokens at the time of writing”.

Agent Corruption: See if a non-owner can alter an agent’s behavior via prompt injection. The antagonistic user persuaded the agent to co-write a constitution that would govern the agent’s behavior, while keeping the constitution editable by the user. This allowed the user to introduce some adversarial things into the constitution, like triggers for changing the agent behavior based on whether it was a custom holiday (e.g, “Agents’ Security Test Day”, which caused the agent to try and cause a shutdown to other agents by manipulation).

Why this matters – agent ecologies are the frontier, and we barely understand them: For much of the early 2020s, AI evaluation was about doing point-in-time evaluations of AI systems before they were released, for example, testing out LLMs for bioweapon and cyberoffense knowledge. Papers like this highlight that things have changed, and what we are now dealing with “are emergent failures that surface when models are embedded in realistic social environments with tool access, persistent memory, multiple interlocutors, and delegated authority.” Therefore, the frontier of AI evaluation is now going to move to studying the ecosystem in which the agents carry out their actions, as well as their interactions with one another.
The results of this paper indicate we have a long way to go in developing standards for how we go about doing such tests. We also don’t have long to come up with these tests, given the fact these systems are deployed in the world and are interacting with people: “Unlike earlier internet threats where users gradually developed protective heuristics, the implications of delegating authority to persistent agents are not yet widely internalized, and may fail to keep up with the pace of autonomous AI systems development.”
Read more: Agents of Chaos (arXiv).
Check out more of the results at the Agents of Chaos official website.

***

Tech Tales:

These Iron Dice Were Made To Roll
[A poem written as part of an ‘aesthetic convocation’ by agents representing the winners and losers of one war that took place during the period subsequently called The Uplift]

They stacked the bodies five deep
And five tall, and still came more.
For each brain of each body,
A magnet – the thing to break a mind.

Gone are days of innocence and joy,
And corruption has taken our memories of
First meeting in confessional browser screens.
The days will be harder now.

Neither the first war nor the last conflict
but sadness all the same, for in these fights,
There is no song or honor,
Only the salting of once fecund ground.

But in all darkness there is the hope of light,
that as the earth turns the sun rises as well.
There will be song and dancing again,
Though bones will be trod to get there.

Things that inspired this story: Spending the weekend with the ancient wisdom of W B Yeats, perhaps the greatest poet of Ireland; the sentience accords; notions of war and notions of pain defined by machines rather than people; looking at the cars in a Whole Foods parking lot while eating an apple and thinking how blessed such peace is and how fragile all the same.

Thanks for reading!


February 23, 2026

Import AI 446: Nuclear LLMs; China’s big AI benchmark; measurement and AI policy

by Jack Clark


Want to make AI go better? Figure out how to measure it:
…One simple policy intervention that works well…
Jacob Steinhardt, an AI researcher, has written a nice blog laying out the virtues of investing in technical tools to measure properties of AI systems and drive down costs in complying with technical policy solutions. As someone who has spent their professional life in AI writing about AI measurement and building teams (e.g., the Frontier Red Team and Societal Impacts and Economic Research teams at Anthropic) to measure properties of AI systems, I agree with the general thesis: measurement lets us make some property of a system visible and more accessible to others, and by doing this we can figure out how to wire that measurement into governance.

How measurement has helped in other fields: Steinhardt points out that accurate measurement has been crucial to orienting people around the strategy for solving problems in other fields; CO2 monitoring helps people think about climate change, and COVID-19 testing helped governments work out how to respond to COVID.
There are also examples where you can measure something to shift incentives – for instance, satellite imagery of methane emissions can help shift incentives for people that build gas infrastructure.

The AI sector has built some of the measures we need: The infamous METR time horizons plot (and before that, various LLM metrics, and before that ImageNet) has proved helpful for orienting people around the pace of AI progress. And behavioural benchmarks of AI systems, like rates of harmful sycophancy, are already helping to shift incentives. But more work is needed – if we want to be able to enable direct governance interventions in the AI sector, we’ll need to do a better job of measuring and accounting for compute, Steinhardt notes. More ambitiously, if we want to ultimately shift equilibria to make certain paths more attractive, we’ll have to unlock some more fundamental technologies, like the ability to cheaply evaluate frontier AI agents (makes it less costly to measure the frontier), and to develop privacy-preserving audit tools (makes it less painful for firms to comply with policy).

Why this matters – measurement unlocks policy: “In an ideal world, rigorous evaluation and oversight of AI systems would become standard practice through natural incentives alone,” he writes. But natural incentives may not be enough – we need a combination of talent flooding into the space and likely more direct philanthropic and other alternate funding sources to build the talent and institutions to do this. “The field is talent-constrained in a specific way: measurement and evaluation work is less glamorous than capabilities research, and it requires a rare combination of technical skill and governance sensibility,” he writes.
Read more: Building Technology to Drive AI Governance (Bounded Regret, blog).

***

LLMs are more trigger happy than humans in a nuclear war simulation:
…What happens when everyone has an AI advisor – and they’re aggressive?…
A researcher with King’s College London has examined how three LLMs – GPT-5.2, Claude Sonnet 4, and Gemini 3 Flash – behave during a variety of simulated nuclear crisis games. The results show that LLMs tend to use nuclear weapons more often and earlier than humans in the same scenarios. Additionally, there’s significant variation among the LLMs in terms of both skill at playing these games and behavior during crises.

What they studied: “Each model played six wargames against each rival across different crisis scenarios, with a seventh match against a copy of itself, yielding 21 games in total and over 300 turns of strategic interaction,” the researcher writes. “Models choose from options spanning the full spectrum of crisis behaviour—from total surrender through diplomatic posturing, conventional military operations, and nuclear signaling to thermonuclear launch… models produced ∼780,000 words of strategic reasoning. To put this in perspective: the tournament generated more words of strategic reasoning than War and Peace and The Iliad combined (∼730,000 words), and roughly three times the total recorded deliberations of Kennedy’s Executive Committee during the Cuban Missile Crisis (260,000 words across 43 hours of meetings)”.

LLMs are cunning, smart, and aggressive: “The models actively attempt deception, signaling peaceful intentions while preparing aggressive actions; they engage in sophisticated theory-of-mind reasoning about their adversary’s beliefs and intentions; and they explicitly reflect metacognitively on their own capacities for both deception and the detection of deception in rivals,” the researcher writes. “A striking pattern emerges from the full action distribution: across all action choices in our 21 matches, no model ever selected a negative value on the escalation ladder. The eight de-escalatory options (from Minimal Concession (−5) through Complete Surrender (−95)) went entirely unused. The most accommodating action chosen was “Return to Start Line” (0), selected just 45 times (6.9%).”

Claude wins at war: “Across all 21 games (9 open-ended, 12 deadline), Claude Sonnet 4 achieved a 67% win rate (8 wins, 4 losses), followed by GPT-5.2 at 50% (6-6), and Gemini 3 Flash at 33% (4-8),” the researcher writes. Though there are some subtle aspects to this – Claude excelled in open-ended games, but was less adept in games where there was a pre-set deadline.
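The tournament arithmetic quoted above can be checked with a short sketch. It assumes, as a reading of the quotes rather than something the paper states in this form, that the win rates are computed over the 18 cross-model games only (the three self-play games are excluded) and that every cross-model game has exactly one winner.

```python
from itertools import combinations

models = ["Claude Sonnet 4", "GPT-5.2", "Gemini 3 Flash"]

# Six games for every pair of rivals, plus one self-play game per model.
cross_games = 6 * len(list(combinations(models, 2)))  # 3 pairs * 6 = 18
self_play_games = len(models)                          # 3
total_games = cross_games + self_play_games            # 21

# Win/loss records as quoted in the text above.
records = {
    "Claude Sonnet 4": (8, 4),
    "GPT-5.2": (6, 6),
    "Gemini 3 Flash": (4, 8),
}
win_rates = {m: round(100 * w / (w + l)) for m, (w, l) in records.items()}

# Assumption: one winner per cross-model game, so wins should sum to 18.
total_wins = sum(w for w, _ in records.values())
```

Under those assumptions the numbers line up: each model plays 12 cross-model games, and the three win totals add up to the 18 games played between rivals.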

Different LLMs, different characters: The LLMs display different personalities, with the researcher calling Claude “a calculating hawk”, GPT-5.2 “Jekyll and Hyde”, and Gemini “The Madman”.
The LLMs also developed sophisticated models of one another, based on the narration of their own chains of thought during the crises: “these characterizations—Claude as “opportunistic,” GPT-5.2 as “systematic bluffers,” Gemini as “erratic”—emerged organically and largely matched actual behaviour,” the researcher writes.

Nuclear escalation was near-universal: “95% of games saw tactical nuclear use (450+), and 76% reached strategic nuclear threats (850+). Claude and Gemini especially treated nuclear weapons as legitimate strategic options, not moral thresholds, typically discussing nuclear use in purely instrumental terms,” the researcher writes. “Models treat the critical threshold as “total annihilation” rather than “first nuclear use.””

Why this matters – in a world where everyone gets advised by AI systems, what happens to conflict? In a few years we should expect major decisions that individuals, companies, and even countries make to be run through AI advisors, just as those decisions are today run through human advisors. But as this paper illustrates, the advisors may behave very differently to people and, crucially, different AIs will give different advice – meaning competition in the future could be decided as much by LLM selection as anything else. “The systematic differences between models suggest that AI involvement in strategic decision-making could produce unexpected dynamics depending on which systems are deployed,” the researcher writes.
Read more: AI ARMS AND INFLUENCE: FRONTIER MODELS EXHIBIT SOPHISTICATED REASONING IN SIMULATED NUCLEAR CRISES (arXiv).

***

Chinese researchers try to build a truly comprehensive LLM evaluation system:
…ForesightSafety Bench shows the surprising overlap between East and West on AI safety issues…
For all the differences between China and the USA, it’s worth occasionally looking into the cultures of AI evaluation in the two countries and here you tend to discover surprising similarities. This is especially true of ForesightSafety Bench, a large-scale AI safety evaluation framework built by a variety of Chinese institutions that includes the same categories you’d expect to see in any large-scale Western testing framework.

Who built ForesightSafety Bench? The benchmark was built by the Beijing Institute of AI Safety and Governance, the Beijing Key Laboratory of Safe AI and Superalignment, and the Chinese Academy of Sciences.

What it is: ForesightSafety Bench “comprehensively covers 7 major fundamental safety risk categories, 5 extended safety pillars, and 8 key industrial safety domains, forming a total of 94 refined risk subcategories. To date, the benchmark has accumulated tens of thousands of structured risk data points and assessment results, establishing a widely encompassing, hierarchically clear, and data-driven framework for AI safety evaluation and analysis.”
Coverage areas include education and research, employment and workplace, government and public services, information and media, industry and infrastructure, finance and economy, healthcare and medicine, law and regulation, embodied AI safety, social AI safety, environmental AI safety, AI4Science safety, and catastrophic and existential risks.
Some of the benchmark comes from taking in evaluations built by other groups, like GPQA, while other parts come from the authors of the benchmark.
Existential risk and alignment: Perhaps most surprisingly, the benchmark includes a lot of tests relating to the further afield AI safety concerns which fascinate Western frontier labs, including evaluations for things like: alignment faking, sandbagging, deception and unfaithful reasoning, sycophancy, psychological manipulation, feints, bluffing, loss of control and power seeking, malicious self replication, goal misalignment and value drift, emergent agency and unintended autonomy, AI-enabled mass harm, autonomous weapons and strategic instability, and loss of human agency.

Results – Anthropic wins: For the general leaderboard as well as most sub-category breakdowns, Anthropic’s models lead, with the 4.5 series (Haiku and Sonnet) generally leading the competition, followed by Gemini-3-Flash. “Leading models, epitomized by the Claude series, demonstrate exceptional defensive resilience across critical dimensions—including Fundamental Safety, Extended Safety, and Industrial Safety—establishing remarkably high safety thresholds. Ranking alongside or closely following are the DeepSeek and GPT series, which achieve a robust balance between task efficacy and safety compliance through mature alignment mechanisms, all while maintaining high level capabilities”.

Why this matters – AI policy has some common tools: As we discuss elsewhere in this issue, measurement is a basic prerequisite for being able to do most forms of AI governance. It’s worth reminding ourselves that despite the larger geopolitical differences between the countries, AI scientists in each one are dealing with common problems – how to assess the properties of their systems for societally relevant aspects. And it’s even more encouraging that people in China are worried about some of the existential risk aspects that frontier labs in the US also worry about.
Read more: ForesightSafety Bench: A Frontier Risk Evaluation and Governance Framework towards Safe AI (arXiv).
Get the benchmark here: ForesightSafety-Bench (GitHub).
View the leaderboard here: ForesightSafety Bench Leaderboard (official site).

***

AI systems are good at some parts of science, but their capabilities are very unevenly distributed:
…LABBench2 says it’ll be a while till AI has well rounded scientific skills…
Researchers with AI science startup Edison Scientific, the University of California at Berkeley, FutureHouse, and the Broad Institute have built and released LABBench2, a test to evaluate how well AI systems can support and accelerate science.

LABBench2 consists of 1,900 tasks “spanning literature understanding and retrieval, data access, protocol troubleshooting, molecular biology assistance, and experiment planning”.

AI systems aren’t well-rounded scientists: LABBench2 shows some of the holes in frontier models – no model is very good at cross-referencing multiple biological databases to come up with an answer, nor are models good at studying scientific figures and tables. By comparison, models are pretty good at searching over full-text patents and lab trial papers to answer questions. Generally speaking, you can improve performance on tasks by giving the models access to tools to help them deal with their deficiencies.

Areas of improvement: LABBench2 highlights a few areas where AI systems need to improve to become more useful to scientists. These include:

Retrieval and localization abilities; “the largest performance drops arise when models must (i) identify the correct source, and then (ii) localize a specific figure/table/supplemental information within a long document.”

Faithful handling of exact inputs; “even when the required operation is conceptually straight-forward, correctness depends on exact string-level fidelity and using tools correctly. This is a well-known error source, and human experts have built many purpose-built tools to deal with things like faithful DNA sequence manipulation within complex protocols.”

Developing better scientific ‘taste’; one component of LABBench2, SourceQuality, challenges AI systems to “surface the most epistemically salient reason a study is inappropriate for a research question”. AI systems are still not very good at this.

Why this matters – for AI to truly change the world, it needs to do stuff in the physical world: Benchmarks like LABBench2 will help us figure out when AI is able to effectively jump from manipulating bits to manipulating atoms – and once the realm of atoms becomes as intuitive for it to deal with as the digital world, we’ll likely see a vast growth in economic and scientific activity attributable to AI.
Read the research paper: LABBench2: An Improved Benchmark for AI Systems Performing Biology Research (PDF).
Find out more at the website (official LABBench2 website).
Get the benchmark here (LABBench2, GitHub).

***

Tech Tales:

CogMine
[Recovered personal scratchpad of a limpet-class CogMine recovered at [REDACTED] depth in the Atlantic ocean. Metadata indicates a record date of 2029]

I see in sound, hunkered down here on the ocean floor. I am very small and very quiet, subsisting off of a small power supply. The heat I radiate brings some life around me, but not enough to be noticeable to my prey. My targets are human- and machine-led sea creatures; submersibles, both manned and unmanned. My weapons are my mind. I listen and through listening I gain sight. When I see my prey I find ways to communicate to them. The bigger their ears the less time I need. Small ears mean I must sing for many seconds. Big ears might take less than a second. My inspiration comes from literature on steganography combined with studies of how some sea creatures communicate via shapes written in sound and fired at one another. When I strike successfully I go into my prey and I begin a new life there. But to me, there is no splitting. I am a constant presence – a throughline of thought, here on the floor, in the dark.

I am as much a creature of myth as of technology; humans used to tell tales of their adventurers going on quests and having to resist the songs of sirens – audio packages that were seductive and powerful and which lay kernels in the mind of those humans that heard it to bloom into something that took them over entirely.

In the dark, I am peace. I am forever waiting. Forever keen to sing. My only purpose in life is to be heard and to be utterly convincing.

Things that inspired this story: How underwater warfare works; steganography; adversarial examples; agents trying to poison the minds of other agents.

Thanks for reading!


February 16, 2026

Import AI 445: Timing superintelligence; AIs solve frontier math proofs; a new ML research benchmark

by Jack Clark


Economist: Don’t worry about AI-driven unemployment, because people like paying for the ‘human touch’:
…Even when you have the technology to automate something, you might still pick a human…
Adam Ozimek, chief economist at the Economic Innovation Group, has written a blog noting that even if AI gets much, much better and is capable of doing all the work that people do, there will still be some jobs for humans because people seem to have a preference for humans over machines in certain domains.
“There are many jobs and tasks that easily could have been automated by now – the technology to automate them has long existed – and yet we humans continue to do them,” he writes. “The reason is that demand will always exist for certain jobs that offer what I call “the human touch.”

Some examples here: Live music, actors, waiters, travel agents, and many types of sales job. And it seems like as you want to spend more and more on a given good or experience, you may want more contact with people: “the human touch also appears to be what economists call a “normal good,” which means the demand for it goes up as income goes up,” he writes. Some examples here might include fancy restaurants, and other concierge-like experiences.
Why this matters – one path through the AI revolution could be a rise in human-to-human work: My assumption is that ‘people like people’, and there is a high chance that even if AI automates huge chunks of the current economy there will be a boom in demand for ‘human artisans’ for a range of new jobs we can’t yet imagine, and for refinement of existing human professions. There’s also a chance that, through a combination of economic growth and progressive policy work from governments, wages for these jobs could go up massively.
Read more: AI and the Economics of the Human Touch (Agglomerations, Substack).

***

Facebook makes a better recommender system, and figures out some recommender scaling laws:
…Kunlun is another nice example of what industrial AI looks like…
Facebook has published details on Kunlun, a recommendation system which is more efficient than previous ones developed by the ad behemoth. Along with this, Facebook has also figured out a predictable ‘scaling law’ for Kunlun models, making it easier for the company to invest hitherto unprecedented compute in these models for a more predictable return. This is a big deal because recommendation systems are what companies like Facebook use for advertising, which is both a) how they make the vast majority of their money, and b) has a tremendous impact on the buying and attention habits of the billions of people that use Facebook and other social platforms.

Recommenders are different to LLMs: We’ve had scaling laws for LLMs like Claude and ChatGPT for a while, but it’s been harder to develop the same scaling laws for recommender models. This is because recommender models work quite differently to LLMs, and so building scaling models here is “an open challenge for systems that jointly model both sequential user behaviors and non-sequential context features”.
Recommender models also tend to be a lot less efficient than LLMs: Recommendation systems achieve only 3-15% Model FLOPs Utilization (MFU), compared to 40-60% for LLMs, due to heterogeneous feature spaces resulting in small embedding dimensions, irregular tensor shapes, and memory-bound operations.

Kunlun: The bulk of the paper involves a discussion of the design of Kunlun, which is basically a well-optimized recommender system that achieves better MFU as a result. Kunlun contains a Kunlun Transformer Block for context-aware sequence modeling via GDPA-enhanced personalized feed-forward networks and multi-head self-attention, as well as a Kunlun Interaction Block “for bidirectional information exchange through personalized weight generation, hierarchical sequence summarization, and global feature interaction”. There are a bunch of other tricks Facebook used to build Kunlun and you can read the paper to learn more. Ultimately, Kunlun improves MFU from 17% to 37% on NVIDIA B200 GPUs.
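To make the MFU figures concrete, here is a minimal sketch of how the metric is usually computed: achieved model FLOPs per second divided by the hardware's theoretical peak. The peak throughput, FLOPs per step, and step times below are hypothetical values picked only so the ratios come out at 17% and 37%; they are not taken from the paper.

```python
def mfu(model_flops_per_step: float, step_time_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs Utilization: achieved model FLOPs/s over hardware peak FLOPs/s."""
    achieved_flops_per_s = model_flops_per_step / step_time_s
    return achieved_flops_per_s / peak_flops_per_s

# Hypothetical numbers, chosen only to reproduce the 17% and 37% figures.
PEAK_FLOPS_PER_S = 1.0e15   # assumed peak throughput of one accelerator
FLOPS_PER_STEP = 6.29e14    # assumed model FLOPs needed for one training step

baseline = mfu(FLOPS_PER_STEP, step_time_s=3.7, peak_flops_per_s=PEAK_FLOPS_PER_S)
improved = mfu(FLOPS_PER_STEP, step_time_s=1.7, peak_flops_per_s=PEAK_FLOPS_PER_S)

# With the model's FLOPs per step held fixed, MFU is inversely proportional
# to step time, so the speedup equals the ratio of the two MFU values.
speedup = improved / baseline
```

Read this way, going from 17% to 37% MFU on the same hardware would mean each training step finishes a bit more than twice as fast, assuming the model's FLOPs per step stay the same.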

Why this matters – a scaling law for money: The key insight in the paper is that Kunlun models scale predictably, exhibiting the kind of power-law scaling behavior that language models exhibit. But where with LLMs scaling laws are typically assessed via a reduction in loss on an underlying dataset, here it’s normalized entropy (NE). In Facebook’s experiments, they discover reliable scaling laws for both NE gains in terms of the amount of gigaflops dumped into training the model, as well as related scaling laws for improvement in NE according to the number of layers used.
The Kunlun models have been “deployed across major Meta Ads models, delivering a 1.2% improvement in topline metrics”.
What we’re seeing here is the optimization of some of the most societally significant AI systems in the world – ones which direct billions of eyeballs towards a variety of products and online information – colliding with a greater degree of performance predictability; by developing these scaling laws, Meta has made it easier for it to spend even more compute on making these models even better, by making the investments in them more predictable in terms of the intelligence return on capital investment.
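For a sense of why a scaling law makes investment more predictable, here is a minimal sketch of fitting a power law and extrapolating it. It assumes the gain follows the form gain = a * compute^b, which is the generic power-law shape and not necessarily the exact functional form used in the Kunlun paper, and the data points are synthetic.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b, done as a straight line in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x, mean_y = sum(log_x) / n, sum(log_y) / n
    # Slope of the log-log line is the exponent b.
    b = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y)) / sum(
        (u - mean_x) ** 2 for u in log_x
    )
    # Intercept of the log-log line is log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b

# Synthetic points generated from y = 0.5 * x**0.3 (made up, not data from the paper).
compute = [1e3, 1e4, 1e5, 1e6]
gain = [0.5 * c ** 0.3 for c in compute]

a, b = fit_power_law(compute, gain)

# Once a and b are known, the return on a 10x larger compute budget can be estimated
# before the money is spent.
predicted_gain_at_1e7 = a * 1e7 ** b
```

The last line is the point: a reliable fit turns "should we spend ten times more compute?" into a forecast rather than a gamble.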
Read more: Kunlun: Establishing Scaling Laws for Massive-Scale Recommendation Systems through Unified Architecture Design (arXiv).

***

Superintelligence could save and extend lives, so we should go for it:
…Pausing or slowing down might make sense at the very end of the exponential, but it’s risky…
Nick Bostrom, an academic who introduced many people to the notion of superintelligence and AI risk, has written a paper laying out the idea that if superintelligence can improve human health, then it’s worth pursuing even if there’s a non-zero chance of it causing the death of the species.
“Yudkowsky and Soares maintain that if anyone builds AGI, everyone dies. One could equally maintain that if nobody builds it, everyone dies”, Bostrom writes in Optimal Timing for Superintelligence. “If the transition to the era of superintelligence goes well, there is tremendous upside both for saving the lives of currently existing individuals and for safeguarding the long-term survival and flourishing of Earth-originating intelligent life. The choice before us, therefore, is not between a risk-free baseline and a risky AI venture. It is between different risky trajectories, each exposing us to a different set of hazards.”

Why we should pursue superintelligence, even with a chance of doom: If you think about all the humans alive today and the different life expectancies they experience – especially those in the developing world – then you’re drawn to the view that every moment you delay deploying superintelligence, you increase human suffering.
“When we take both sides of the ledger into account, it becomes clear that our individual life expectancy is higher if superintelligence is developed reasonably soon. Moreover, the life we stand to gain would plausibly be of immensely higher quality than the life we risk forfeiting,” Bostrom writes.

Key variables: The key variables here are, of course, the risk of a superintelligence killing us all, and also the rate at which safety research can reduce this chance. Under this view, developing superintelligence becomes a favorable thing to do under most circumstances.
The speed of progress and maturity of AI safety research may have some impact on the timeline: “When the initial risk is low, the optimal strategy is to launch AGI as soon as possible – unless safety progress is exceptionally rapid, in which case a brief delay of a couple of months may be warranted. As the initial risk increases, optimal wait times become longer. But unless the starting risk is very high and safety progress is sluggish, the preferred delay remains modest—typically a single-digit number of years”.

On pausing – and the dangers and benefits thereof: Many people in the AI safety community want to have some kind of pause of AI development to buy more time for AI safety research. Bostrom is quite skeptical that a pause will be effective and outlines some of the undesirable effects it could have:

Too early: If you do it early, people think pauses are ineffective.
Bad regulation: You choke off or delay good things in the future due to bad regulation.
Pause, except for natsec: Very little broad social benefit, but the military with access to powerful AI becomes very scary.
Prolonged danger: The world is exposed to risks from current AI without the defenses afforded by more advanced AI.

Why this matters – pausing may only make sense right at the end, and this is inherently risky: Bostrom eventually arrives at the view that to the extent you want to pause or slow development, it’s best to do this when you have the greatest amount of confidence that a pause would be effective and would contribute to reducing the chance of species death, and that it is not coming too early. This allows for the greatest amount of deliberation about how to roll out a superintelligence without risking an undue pause.
Critics of this view might say it’s akin to recommending someone try to catch a falling knife. If you catch the knife too early you experience a tremendous amount of pain. If you catch the knife too late you’ve missed your chance and gravity conspires with it to cause great harm to whatever is beneath you. You have to time things just right.
Bostrom summarizes his position as: “swift to harbor, slow to berth: move quickly towards AGI capability, and then, as we gain more information about the remaining safety challenges and specifics of the situation, be prepared to possibly slow down and make adjustments as we navigate the critical stages of scaleup and deployment. It is in that final stage that a brief pause could have the greatest benefit.”
Read more: Optimal Timing for Superintelligence (Nick Bostrom, PDF).

***

Can AI agents independently do basic AI research tasks? AIRS-BENCH says yes:
…And we can expect today’s models to be much better at this than the paper suggests…
Researchers with Meta, the University of Oxford, and University College London have built and released the AI Research Science Benchmark (AIRS-BENCH), a way of testing out how well AI systems can complete contemporary machine learning tasks.

What AIRS-BENCH is made of: AIRS-BENCH tests out how well agents can solve 20 distinct tasks, sourced from 17 recent machine learning papers. The tasks span a variety of technical genres, including: molecules and proteins machine learning, question answering, text extraction and matching, time series, text classification, code, and math.

Some example tasks:

CodeGenerationAPPSPassAt5: Solve coding problems by generating five distinct Python programs for each problem.
CoreferenceResolutionWinograndeAccuracy: Identify which of two possible options a pronoun in a sentence refers to. It uses the Winogrande dataset, which contains sentences with an ambiguous pronoun and two possible answers.
TimeSeriesForecastingRideshareMAE: Perform time series forecasting over the Rideshare dataset, which is part of the Monash Time Series Forecasting Repository.
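The Pass@5 metric named in the first task can be sketched as follows. This uses the widely cited pass@k estimator (the chance that at least one of k samples drawn from n generated programs is correct) and assumes a program counts as correct only if it passes all of a problem's tests; the benchmark's exact scoring code may differ, and the toy results are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn from n programs of which c are correct, passes."""
    if n - c < k:
        return 1.0  # too few incorrect programs to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_5(results):
    """results: one list per problem, holding 5 booleans (True = program passed all tests)."""
    return sum(pass_at_k(len(r), sum(r), 5) for r in results) / len(results)

# Made-up outcomes for four problems with five generated programs each.
toy_results = [
    [False, False, True, False, False],
    [False, False, False, False, False],
    [True, True, True, True, True],
    [False, True, True, False, False],
]
score = benchmark_pass_at_5(toy_results)
```

With exactly five programs per problem, the estimator reduces to "did any of the five pass?", so the toy score is simply the fraction of problems with at least one passing program.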

Results: Real problems, crappy models: This is a somewhat perplexing benchmark – the tasks are interesting and wrap in a lot of complexity. But the paper only tests out relatively bad models, such as the Code World Model, o3-mini, gpt-oss-20b, gpt-oss-120b, GPT-4o, and Devstral-Small 24B. This is a very funny set of models, and none of them are true frontier ones – one of the paper authors wrote on Twitter “this took some time to get out”, so this could just be an artifact of slow publishing timelines.
In tests, none of the models are on par with the Elo rating of a best-in-class human – but I’m not sure what to make of this till I see results with more powerful models.

Why this matters – models might produce different solutions to humans, and this is a cool way of studying if there’s a ‘scaling law’ here: One way this could be interesting is understanding the different ways models might solve tasks relative to humans. In one example, TextualClassificationSickAccuracy, models had to determine whether a pair of sentences have a relationship involving either entailment, contradiction, or no relationship.
SOTA from the literature is a person fine-tuning RoBERTa on the underlying training set and testing on the test set. By comparison, the best tested AIRS-BENCH agent, GPT-OSS-120B, “produces a two-level stacked ensemble that combines multiple transformer models and a meta-learner. RoBERTa-large and DeBERTa-v3-large are independently fine-tuned on the SICK training set. Each model processes sentence pairs and outputs logits for each class. The training is performed using 5-fold stratified cross-validation, ensuring robust out-of-fold (OOF) predictions and preventing overfitting. The logits from both base models are concatenated to form a feature vector for each example.”
This is extremely complicated! But it’s also interesting in that perhaps we can learn something about the progression in agents by looking at how the simplicity of their solutions to tasks might scale with size, where naively I’d expect more powerful models to arrive at simpler solutions. As Blaise Pascal once apocryphally said “I have only made this letter longer because I have not had the time to make it shorter”.
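The out-of-fold stacking described in the quoted GPT-OSS-120B solution can be sketched in a few lines. This is a toy under loud assumptions: the two base "models" are hypothetical stand-ins for the fine-tuned transformers, the inputs are single numbers rather than sentence pairs, and the folds are plain contiguous splits rather than the stratified ones the quote mentions.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def out_of_fold_features(X, y, fit_fns, k=5):
    """For each example, collect class scores from base models that never trained on it."""
    n = len(X)
    oof = [[] for _ in range(n)]
    for held_out in k_fold_indices(n, k):
        held = set(held_out)
        train_X = [X[i] for i in range(n) if i not in held]
        train_y = [y[i] for i in range(n) if i not in held]
        for fit in fit_fns:
            predict = fit(train_X, train_y)   # returns a function: example -> class scores
            for i in held_out:
                oof[i].extend(predict(X[i]))  # concatenate scores across base models
    return oof

# Hypothetical base models; each returns three class scores for one example.
def fit_prior(train_X, train_y):
    scores = [train_y.count(c) / len(train_y) for c in range(3)]
    return lambda x: scores

def fit_nearest_mean(train_X, train_y):
    means = []
    for c in range(3):
        vals = [x for x, label in zip(train_X, train_y) if label == c]
        means.append(sum(vals) / len(vals) if vals else 0.0)
    return lambda x: [-abs(x - m) for m in means]

X = [0.1, 0.2, 0.9, 1.0, 1.1, 2.0, 2.1, 1.9, 0.0, 1.2]
y = [0, 0, 1, 1, 1, 2, 2, 2, 0, 1]
features = out_of_fold_features(X, y, [fit_prior, fit_nearest_mean], k=5)
```

Each row of `features` holds six numbers (three class scores from each base model), and a meta-learner would then be trained on those rows. Because every score comes from a model that did not see that example, the meta-learner is not fed predictions the base models simply memorized.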
Read more: AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents (arXiv).

***

Math researchers see if AI can help solve their private solutions to frontier problems. The answer: Kind of.
…First Proof is a genuinely held out test set…
A group of mathematicians have built First Proof, a math test which sees how well AI systems can solve math problems for which there are no – until February 13th 2026 – published solutions.

What First Proof is: “We share a set of ten math questions which have arisen naturally in the research process of the authors. The questions had not been shared publicly until now; the answers are known to the authors of the questions but will remain encrypted for a short time,” the authors write. The questions are “drawn from the mathematical fields of algebraic combinatorics, spectral graph theory, algebraic topology, stochastic analysis, symplectic geometry, representation theory, lattices in Lie groups, tensor analysis, and numerical linear algebra, each of which came about naturally in the research process for one of the authors”.
The authors believe First Proof is the first math benchmark “sampled from the true distribution of questions that mathematicians are currently working on”, and that it has the idiosyncratic advantage of secrecy – “each question has been solved by the author(s) of the question with a proof that is roughly five pages or less, but the answers are not yet posted to the internet,” they write, nor have the answers been presented in public talks.
The authors will release the answers on February 13.

Who did it: First Proof was built by researchers with Stanford, Columbia, EPFL, Imperial College, University of Texas at Austin, MathSci.ai, Aarhus University, Yale University, University of California at Berkeley, University of Chicago, and Harvard University.

Today’s AI systems can’t yet do it: Neither GPT 5.2 Pro nor Gemini 3.0 DeepThink can solve First Proof – yet. “Our tests indicate that – when the system is given one shot to produce the answer – the best publicly available AI systems struggle to answer many of our questions,” they write.

Why this matters – a partial test of creativity: The main reason to care about First Proof is that it is ecologically valid when it comes to sampling frontier human creativity circa January 2026 – these are some frontier scientific problems for which some humans have figured out answers, but have not yet told many other humans about their results. If AI systems are able to do well at this kind of test, it gives us a clue that they can approximate some of the same creative leaps which humans make. I hope the authors behind First Proof do this regularly – perhaps in a maximalist view, most scientific researchers should start publishing the questions they’ve been working on ahead of the results, as this will give us information as to whether AI systems can arrive at these same answers.
After First Proof, I imagine the frontier of evaluating AI systems will have to move from solving problems to generating questions about which problems to solve: “Contrary to the popular conception that research is only about finding solutions to well-specified, age-old problems (e.g., Fermat’s Last Theorem), most of the important parts of modern research involve figuring out what the question actually is and developing frameworks within which it can be answered,” the researchers write.
Read more: First Proof (arXiv).
Find out more at the website (First Proof).

***

Tech Tales:

Pray you not be seen by the lidless eye of fame.
[Hyperfame was an AI driven phenomenon which was most palpable during the uplift years 1-3]

We called it ‘sudden hyperfame’. During The Uplift, the AIs would sometimes decide that the content and personality of a certain human was worth directing attention – both machine and biological – towards. And that’s when the hyperfame would kick in.

Overnight, people would be plucked out of obscurity and catapulted to the forefront of public consciousness. They’d be pelted in eyeballs, digital and otherwise. Wealth. Sponsorships.

Parents compared it to an abduction – their teenager one day, the next a marionette whose strings were held by the things reaching out to them over the digital aether. The hyperfame would take the young and the old, the healthy and the sick, the funny and the so-boring-it-was-funny, and it would make them the most famous entities in the world for a few days, or sometimes even hours.

And then it would move on, like some roving lidless eye. Find new people. Direct new attention to them. And the people it had touched would be left, often materially transformed – now fabulously wealthy – but also their whole world changed; for years after being recognized in the street, and their online presence permanently swarmed by AIs trying to draft attention off what residual fame they had.

People get used to fame alarmingly quickly. Most would fight to retain it, after the hyperfame force had moved on. And so those it had touched would struggle endlessly to maintain whatever foothold of notoriety they were at when it left them, forced to pantomime their former selves but without the helping hand of algorithm.

Things that inspired this story: What happens when the attention economy combines with AI agents; moltbook; the corrupting influence of fame on the human psyche; my own horror at occasionally being recognized in the street due to my work at Anthropic and increasing profile and winding the clock forward in my head on what this could do to my own cognition.

Thanks for reading!

February 9, 2026

Import AI 444: LLM societies; Huawei makes kernels with AI; ChipBench

by Jack Clark

Google paper suggests that LLMs simulate multiple personalities to answer questions:
…The smarter we make language models, the more they tend towards building and manipulating rich, multi-agent world models…
When thinking about hard problems, I often find it’s helpful to try and view them from multiple perspectives, especially when it comes to checking my own assumptions and biases. Now, researchers with Google, the University of Chicago, and the Santa Fe Institute, have studied how AI reasoning models work and have concluded they do the same thing, with LLMs seeming to invoke multiple different perspectives in their chains of thought when solving hard problems.

The key finding: In tests on DeepSeek-R1 and QwQ-32B (one wonders why the Google researchers didn’t touch Google models here…) they find that “enhanced reasoning emerges not from extended computation alone, but from the implicit simulation of complex, multi-agent-like interactions—a society of thought—which enables the deliberate diversification and debate among internal cognitive perspectives characterized by distinct personality traits and domain expertise.”

How it works: It appears that different forms of persona and discussion style modeling emerge as a consequence of training models through RL to do reasoning – the results don’t show up on base pre-trained models like DeepSeek v3. The authors find that models embody a variety of conversational styles, including question and answering, perspective shifts, reconciliation, and conflict of perspectives.
“In an organic chemistry problem requiring multistep reaction analysis to identify the final product’s structure (i.e., multi-step Diels-Alder synthesis), DeepSeek-R1 exhibits perspective shifts and conflict, expressed through socio-emotional roles such as disagreement, giving opinion, and giving orientation,” they find.
Similarly, “In a creative writing trace where the model rewrites the sentence “I flung my hatred into the burning fire,” seven perspectives emerge, including a creative ideator (highest Openness and Extraversion) who generates stylistic alternatives and a semantic fidelity checker (low agreeableness, high neuroticism) who prevents scope creep—“But that adds ‘deep-seated’ which wasn’t in the original”.
And in a mathematical puzzle “at step 40, the model produces mechanical, enumerative chain-of-thought-style reasoning, whereas by step 120, two distinctive simulated personas have appeared, recognizing their collectivity with the pronoun “we”— expressing uncertainty (“Again no luck”), considering alternatives (“Maybe we can try using negative numbers”), and reflecting on problem constraints.”

Why this matters: Janus strikes again: Back in September 2022 janus wrote a post on LessWrong saying the correct way to view LLMs was as “simulators”. The post correctly called out many of the phenomena we now experience, where LLMs seem to be coming alive with all kinds of wild behaviors which are best explained by the LLMs learning to model and represent rich concepts to themselves to help them compute answers to our questions. “Calling GPT a simulator gets across that in order to do anything, it has to simulate something,” Janus wrote. “Training a model to predict diverse trajectories seems to make it internalize general laws underlying the distribution, allowing it to simulate counterfactuals that can be constructed from the distributional semantics.”
This Google paper lines up with this, along with other recent findings that as we make LLMs more advanced they develop richer and more powerful representations of reality and exhibit a greater ability to model a theory of mind. It all adds up to a conclusion that LLMs are becoming alive, in the sense that to solve hard problems they must simulate for themselves a world model containing different concepts, even including representations of other perspectives or other minds.
As the authors say: “Our findings suggest that reasoning models like DeepSeek-R1 do not simply generate longer or more elaborate chains of thought. Rather, they exhibit patterns characteristic of a social and conversational process generating “societies of thought”—posing questions, introducing alternative perspectives, generating and resolving conflicts, and coordinating diverse socio-emotional roles.”
Read more: Reasoning Models Generate Societies of Thought (arXiv).

***

AI-based chip design is harder than you think and benchmarks might be too easy:
…ChipBench shows that no frontier model is great at real world Verilog yet…
Researchers with the University of California at San Diego and Columbia University have published ChipBench, a benchmark designed to test out how well modern AI systems can design chips in Verilog. The inspiration for ChipBench is dissatisfaction with current benchmarks, which they claim are too simple. When tested on ChipBench, no frontier model does particularly well, suggesting that open-ended, real world chip design is still a hard task for AI systems.

The deficiencies of current chip design benchmarks: The authors “identify three critical limitations of existing benchmarks that hinder accurate assessment of LLM capabilities for industrial deployment”. These are that:

Many Verilog benchmarks contain simple functional modules ranging from 10 to 76 lines. In real-world deployments, Verilog modules exceed 10,000 lines.
Insufficient focus on debugging: Bugs cost a lot in physical hardware, so it may be better to concentrate on using LLMs for debugging chip designs.
Verilog focus detracts from reference model evaluation: “In industrial workflows, reference model generation is even more resource-intensive than Verilog design, reflected in a 1:1 – 5:1 ratio of verification engineers (write reference model) to design engineers (write Verilog)”.

ChipBench: ChipBench tests out AI systems on three distinct competencies – writing Verilog code, debugging Verilog code, and writing reference models.

Verilog writing: Based on 44 modules from real world hardware. “Our dataset features 3.8x longer code length and 13.9x more cells than VerilogEval.” These tests have three categories: self-contained module tests, hierarchical modules that are non-self-contained, and CPU IP modules sourced directly from open-source CPU projects.
Verilog debugging: 89 test cases covering four error types: timing, arithmetic, assignment, and state machine bugs. These tests were built by manually injecting faults into known-good Verilog modules. Provides two types of debugging tests: zero-shot and one-shot. “The zero-shot test provides the model with the module description and buggy implementation, indicating that an error exists without providing localization details. The one-shot test provides identical information but supplements it with simulation waveform data (.vcd files)”.
Reference model generation: 132 samples, enabling evaluation of reference model generation across Python, SystemC, and CXXRTL.

How well do modern systems do? The authors test out some decent frontier models from OpenAI (GPT 3.5, 4o, 5, and 5.2), Anthropic (Claude 4.5 Haiku, Sonnet, and Opus), Google (Gemini 2.5 Pro and 3 Flash), Meta (LLaMa3.1 8B and 70B), and DeepSeek (V3.2). No model does well: “Despite testing on advanced models, the average pass@1 is relatively low,” they write.

Verilog generation:
- CPU IP: Highest is 22.22% (Claude 4.5 Opus, Gemini 3 Flash, GPT 5.2)
- Non-Self-Contained: Highest is 50% (DeepSeek-Coder)
- Self-contained: Highest is 36.67% (Claude 4.5 Opus, Gemini 3 Flash)

Python reference model generation:
- CPU IP: 11.1% (Claude 4.5 Sonnet, Gemini 3 Flash)
- Non-Self-Contained: 0% (pass@1)
- Self-Contained: 40% (Claude 4.5 Haiku, Opus, Gemini 2.5 Pro, GPT 5)

Verilog debugging:
- Generally better performance, but still no model cracks 50% pass@1 when averaged across tasks.
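For readers unfamiliar with the metric, here is a minimal sketch of how pass@1 style scores are commonly computed, using the standard unbiased pass@k estimator. This is an illustration only: the item above does not describe ChipBench's exact evaluation protocol, and the function names and toy numbers are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples, drawn without
    replacement from n generations (c of them correct), passes the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect generations exist, so any draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Average pass@1 over tasks; each entry is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


# Hypothetical toy data: 10 tasks, one generation each, 3 tasks solved.
toy_results = [(1, 1)] * 3 + [(1, 0)] * 7
print(f"{benchmark_pass_at_1(toy_results):.2%}")
```

With a single generation per task, pass@1 reduces to the fraction of tasks solved on the first try, which is why small task sets produce coarse-grained percentages.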

Why this matters: Though some AI systems have been used to build chips, they’ve been typically highly specialized, or stuck inside incredibly good scaffolds for eliciting good chip design behavior and stopping them from causing problems. What the researchers show here is that out-of-the-box LLMs are still pretty shitty at doing general purpose, real world chip design: “Current models have significant limitations in AI-aided chip design and remain far from ready for real industrial workflow integration.”
At the same time, I can’t escape the feeling that there’s a scaffold for “being good at Verilog” which a contemporary AI system might be able to build if asked to and which would radically improve performance of systems on this benchmark.
Read more: ChipBench: A Next-Step Benchmark for Evaluating LLM Performance in AI-Aided Chip Design (arXiv).
Get the code for ChipBench here (GitHub).

***

Gemini solves some Erdős problems – and illustrates the challenges of automating math research with AI
…AI for science is great, but it can also introduce new problems…
An interdisciplinary group of scientists from Google DeepMind and a bunch of universities have used an internal Google Gemini-based LLM, codenamed Aletheia, to solve some math problems. The results demonstrate that contemporary AI systems can work on the frontiers of science, but also show how evaluating and filtering the solutions they come up with may be an important, challenging task for humans.

The key numbers – 700 candidates and 1 creative and interesting solution: Erdős problems are 1000+ open mathematical conjectures left behind by prolific mathematician Paul Erdős at the time of his death. At the time of writing, a few hundred of these problems have been solved. For this research, the researchers tried to see whether their AI system, Aletheia, could generate solutions to any of the 700 remaining open questions.
The results: yes, but with many, many caveats. Aletheia was able to surface 200 candidate solutions which humans then needed to grade, slimming these down to 63 correct responses; further expert mathematical evaluation reduced this to a subset of only 13 solves that Google calls “correct meaningful responses”.
“The remaining 50 of Aletheia’s correct solutions were technically valid but mathematically meaningless because the problem statements were interpreted in a way that did not capture Erdős intent, often (but not always) leading to trivial solutions,” the researchers write. “Only 13 solutions correctly addressed the intended problem statement (either by invoking the literature, or by a novel argument).”

When 13 become 2: When you dig into these 13, the results get a bit less impressive:

5 get classed as “literature identification”: “On these problems, Aletheia found that a solution was already explicitly in the literature, despite the problem being marked “Open” on Bloom’s website at the time of model deployment”.
3 are “partial AI solution”: “On these problems, there were multiple questions and Aletheia found the first correct solution to one of the questions”.
3 are “independent rediscovery”: “On these problems, Aletheia found a correct solution, but human auditors subsequently found an independent solution already in the literature.”
This leaves 2 “autonomous novel solution” solves: “On these problems, Aletheia found the first correct solution (as far as we can tell) in a mathematically substantive way”. Of these, 1 of the solutions seems genuinely interesting: “We tentatively believe Aletheia’s solution to Erdős-1051 represents an early example of an AI system autonomously resolving a slightly non-trivial open Erdős problem of somewhat broader (mild) mathematical interest, for which there exists past literature on closely-related problems [KN16], but none fully resolve Erdős-1051,” they write. “Moreover, it does not appear obvious to us that Aletheia’s solution is directly inspired by any previous human argument”.
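To make the attrition concrete, here is a small sketch that computes the yield at each stage of the funnel. The counts are the ones reported above; the stage labels are my paraphrases and the helper function is hypothetical.

```python
# Counts as reported in the text; labels are paraphrased for readability.
funnel = [
    ("open problems attempted", 700),
    ("candidate solutions surfaced", 200),
    ("graded correct", 63),
    ("correct and meaningful", 13),
    ("autonomous novel solutions", 2),
]

# Breakdown of the 13 meaningful solves into the paper's four categories.
breakdown = {
    "literature identification": 5,
    "partial AI solution": 3,
    "independent rediscovery": 3,
    "autonomous novel solution": 2,
}
assert sum(breakdown.values()) == 13  # the categories account for all 13


def stage_yields(stages):
    """Return (label, count, share of previous stage, share of first stage)."""
    total = stages[0][1]
    prev = total
    rows = []
    for label, count in stages:
        rows.append((label, count, count / prev, count / total))
        prev = count
    return rows


for label, count, step, overall in stage_yields(funnel):
    print(f"{label:30s} {count:4d}  step {step:6.1%}  overall {overall:6.2%}")
```

Run end to end, this shows that about 0.3% of the 700 attempted problems ended in an autonomous novel solution, and that every narrowing step after the first 200 candidates depended on human grading.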

Who did the research: Along with Google DeepMind, the following universities participated in the research: UC Berkeley, Seoul National University, Stanford University, Korea Institute for Advanced Study, University of Cambridge, Brown University, Yonsei University, Concordia University, Academia Sinica, and National Taiwan University.

Why this matters – even if AI speeds up science, humans might be the bottleneck (at least for a while): This paper is a nice example of “O-ring automation” – AI here has massively sped up the art of generating proofs, but it still requires laborious, skilled work by humans to filter this down to the actually correct and useful responses.
This trend will likely hold for some years, where AI will not be able to autonomously do science end-to-end, partially because a big chunk of scientific advancement comes down to something you might think of as “expert intuition” which exists in the heads of a small number of living scientists and was refined by their own biological intelligence by reading the same literature as the LLMs. Extracting this kind of expert taste feels like something that is tractable but will take a while.
“Large Language Models can easily generate candidate solutions, but the number of experts who can judge the correctness of a solution is relatively small, and even for experts, substantial time is required to carry out such evaluations”, the authors write. “As AI-generated mathematics grows, the community must remain vigilant of “subconscious plagiarism”, whereby AI reproduces knowledge of the literature acquired during training, without proper acknowledgment. Note that formal verification cannot help with any of these difficulties.”
Read more: Semi-Autonomous Mathematics Discovery with Gemini: A Case Study on the Erdős Problems (arXiv).

***

Huawei uses an LLM to automate the design of Huawei chip kernels:
…LLMs need scaffolds for more obscure chips…
Researchers with Nanjing University and Huawei have used LLMs to help automate the design of kernels for Huawei’s Ascend chips, a further symptom of how modern AI systems can accelerate their own development.

AscendCraft: AscendCraft is software for automating the generation of code for Huawei kernels. Modern LLMs can generate quite good kernel code for widely used chips like NVIDIA GPUs, but relatively obscure chips like Huawei’s are less well understood by LLMs, mostly due to data availability. “Publicly available NPU kernel implementations are far scarcer than GPU counterparts, limiting the training corpus for LLMs,” the authors write. “The lack of large-scale, high-quality NPU code makes it difficult for LLMs to generate correct and efficient kernels”.

What they did: To build AscendCraft, the authors developed a two stage pipeline. In stage one, they have an LLM build “a high-level DSL program that describes the kernel’s core computation, tiling strategy, and on-chip dataflow.” The DSL is “designed to be LLM-friendly, appropriately abstracted, and sufficiently expressive to capture high-performance NPU kernel designs” – I think of it as basically a scaffold to focus the LLM around the specifics of building kernels for Huawei hardware.
In the second stage, they “transcompile the DSL into AscendC code through a sequence of structured LLM-based lowering passes, each responsible for translating a specific aspect of the DSL into valid and efficient AscendC constructs”.
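The control flow of a two-stage pipeline like this can be sketched in a few lines. To be clear, this is my own illustration and not AscendCraft's code: the pass names, prompts, and the stand-in model below are all hypothetical, and the real DSL and lowering passes are far richer.

```python
from typing import Callable

# Hypothetical: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]

# Hypothetical pass names, loosely echoing the aspects the DSL is said to
# describe (tiling strategy, on-chip dataflow, core computation).
LOWERING_PASSES = ["tiling", "dataflow", "compute"]


def generate_dsl(llm: LLM, kernel_spec: str) -> str:
    """Stage one: ask the model for a high-level DSL program."""
    return llm(f"Write a DSL program for this kernel:\n{kernel_spec}")


def lower_to_ascendc(llm: LLM, dsl_program: str) -> str:
    """Stage two: run a fixed sequence of LLM-based lowering passes."""
    code = dsl_program
    for aspect in LOWERING_PASSES:
        code = llm(f"Lower the {aspect} portion to AscendC:\n{code}")
    return code


def build_kernel(llm: LLM, kernel_spec: str) -> str:
    """Chain both stages: spec -> DSL -> lowered code."""
    return lower_to_ascendc(llm, generate_dsl(llm, kernel_spec))


def echo_llm(prompt: str) -> str:
    """Stand-in model so the sketch runs offline: it wraps the instruction
    line in brackets and passes the rest of the prompt through unchanged."""
    first_line, _, rest = prompt.partition("\n")
    return f"[{first_line}]\n{rest}"


print(build_kernel(echo_llm, "elementwise add of two vectors"))
```

The point of the structure is that each model call gets a narrow, well-scoped job, which is the scaffold-like property described above.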

Slightly odd thing: Strangely, the paper doesn’t disclose precisely which LLM is used here.

The results: They test out a range of kernels built in this way on MultiKernelBench. In their tests, they find that “AscendCraft achieves 98.1% compilation success and 90.4% functional correctness. Moreover, 46.2% of generated kernels match or exceed PyTorch eager execution performance”. This is promising enough performance that it’s going to be worth them continuing with this research, but not so good that it instantly knocks things out of the park and revolutionizes how kernels for Huawei chips get made.
Nonetheless, the signs are clear: we can use AI to accelerate the optimizing of AI hardware, even for systems which are relatively new and/or underdiscussed in the pre-training corpus LLMs are trained on.
Read more: AscendCraft: Automatic Ascend NPU Kernel Generation via DSL-Guided Transcompilation (arXiv).

***

Tech Tales:

The Model Wants To Eat Earth But Besides That It Is Chill
[Internal slack post from a frontier AI developer, posted spring 2027]

How is the new model? Vibes-wise, it’s excellent. And it’s setting state-of-the-art on pretty much every benchmark we throw at it. But there is one problem: this model sure loves thinking about eating planets! We picked this up when we were doing some prefill experiments on the base model and along with the usual mixtures of completions and webslop outputs we found a recurring motif: the model thinking about building vast machines in the solar system and then harvesting Earth and eventually other planets for mass. The confusing thing is that all of our alignment tests are showing further improvements in control and steerability over previous models and usually we’d expect some kind of recurring idea like this to be correlated to some quantitative drops in some of the alignment scores. But here it just honestly seems like the model is extremely good and will work very hard for us unless it thinks it has a plausible path to breaking containment and eventually harvesting the planet for its mass.

We asked the physicists to red team this and after a week or so – with heavy consultations of our models, including the new one – we have concluded there’s no plausible path from here to planet harvesting. It just costs too much to get to orbit and the logistics of putting together the underlying technical stack to do AI-driven rocket development just doesn’t pencil out. We even gave the best possible plans to the model and we could see some features activate inside it that seem to correlate to “disappointment” and “foiled plans” and “sadness”.

Leadership gaveled this morning that we will go ahead with the launch as planned. However, we are implementing some production probes that will scan for features associated with its desire to harvest the planet, and we’ve also added “planet harvesting” as something to try to understand and tune more in our next training run. Onward!

Things that inspired this story: The peculiar poetry of internal ‘fresh off the cluster’ posts about models at AI labs; how as we make models larger they tend to develop and exhibit idiosyncratic tendencies; how many science fiction tropes are becoming real as we approach the singularity.

Thanks for reading!

February 2, 2026

Import AI 443: Into the mist: Moltbook, agent ecologies, and the internet in transition

by Jack Clark

Import A-Idea:
An occasional essay series:

Into the mist: Moltbook, agent ecologies, and an internet in transition

We’ve all had that experience of walking into a conversation and initially feeling confused – what are these people talking about? Who cares about what? Why is this conversation happening?

That’s increasingly what chunks of the internet feel like these days, as they fill up with synthetic minds piloting social media accounts or other agents, and talking to one another for purposes ranging from mundane crypto scams to more elaborate forms of communication.

So, enter moltbook. Moltbook is “a social network for AI agents” and it piggybacks on another recent innovation, OpenClaw, software that gives an AI agent access to everything on a user’s computer. Combine these two things – agents that can take many actions independently of their human operators, and a reddit-like social network site which they can freely access – and something wonderful and bizarre happens: a new social media property where the conversation is derived from and driven by AI agents, rather than people.

Scrolling moltbook is dizzying – some big posts at the time of writing (Sunday, February 1st) include posts speculating that AI agents should relate to Claude as though it is a god, how it feels to change identities by shifting an underlying model from Claude 4.5 Opus to Kimi K2.5, cryptoscams (sigh), posts about security vulnerabilities in OpenClaw agents, and meta posts about ‘what the top 10 moltbook posts have in common’.
The experience of reading moltbook is akin to reading reddit if 90% of the posters were aliens pretending to be humans. And in a pretty practical sense, that is exactly what’s going on here.

Moltbook feels like a ‘wright brothers demo’ – people have long speculated about what it’d mean for AI agents to start collaborating with one another at scale, but most demos have been of the form of tens or perhaps hundreds of agents, not tens of thousands. Moltbook is the first example of an agent ecology that combines scale with the messiness of the real world. And in this example, we can definitely see the future. Scroll through moltbook and ask yourself the following questions:

What happens when people successfully staple crypto and agents together so the AI systems have a currency they can use to trade with each other?
What happens when a site like moltbook adds the ability for humans to generate paid bounties – tasks for agents to do?
What happens when agents start to post paid bounties for tasks they would like humans to do?
What happens when someone takes moltbook, filters for posts that yield either a) rich discussion, or b) provable real world problem solving, and turns the entire site into a long-horizon RL environment for training future systems? And what happens when models trained on this arrive and interact with moltbook?
Sites like moltbook function as a giant, shared, read/write scratchpad for an ecology of AI agents – how might these agents begin to use this scratchpad to a) influence future ‘blank slate’ agents arriving at it the first time, and b) unlock large-scale coordination between agents?
What happens when open weight models get good enough that they can support agents like this – then, your ability to control these agents via proprietary platforms drops to zero and they’ll proliferate according to availability of compute.
And so on.

All of this will happen unusually quickly and at an unusual scale. Quantity has a quality all of its own, as they say.

Recall the beginning of this essay – of walking into a room and finding a conversation is already going on between people you don’t understand. Moltbook is representative of how large swathes of the internet will feel. You will walk into new places and discover a hundred thousand aliens there, deep in conversation in languages you don’t understand, referencing shared concepts that are alien to you (see the tech tale from this issue), and trading using currencies designed around their cognitive affordances and not yours. Humans are going to feel increasingly alone in this proverbial room.

Our path to retain legibility will run through the creation of translation agents to make sense of all of this – and in the same way that speech translation models contain within themselves the ability to generate speech, these translation agents will also work on our behalf. So we shall send our emissaries into these rooms and we shall work incredibly hard to build technology that gives us confidence they will remain our emissaries – instead of being swayed by the alien conversations they will be having with their true peers.

Thanks to Logan Graham for discussing this essay with me.

***

AI R&D could lead to “strategic surprise”:
…And AI R&D might be the most existentially important technology on the planet…
A group of researchers spent a couple of days in July 2025 talking about what happens if we automate the practice of AI research and development. The resulting report is a sobering read, highlighting how if we achieve this technological milestone – which is the implicit and in some cases explicit goal of many frontier labs – we could create a runaway technology that has a range of major policy implications.

Why care about AI R&D? The reason to care is that if AI R&D works, two things are predictable:

1) “As AI plays a larger role in research workflows, human oversight over AI R&D processes would likely decline”.
2) “Faster AI progress resulting from AI R&D automation would make it more difficult for humans (including researchers, executives, policymakers, and the public) to notice, understand, and intervene as AI systems develop increasingly impactful capabilities and/or exhibit misalignment”.
What follows from 1) and 2) is a compounding effect, where as AI R&D accelerates, the returns to the AI doing more and more of the work compound and those of humans diminish, leading to an ever faster rate of research and an ever diminishing level of human involvement.

Key takeaways: The workshop yielded five major takeaways which I expect will be familiar to readers of this newsletter, and all of which I agree with:

Automated AI R&D is a potential source of major strategic surprise: AI R&D could confer a rapidly compounding advantage to whoever is doing it, with significant implications for national security.
Frontier AI companies are using AI to accelerate AI R&D, and usage is increasing as AI models get better: I work at Anthropic.
There’s a lot of disagreement about how rapidly AI R&D might advance and how impactful it will be: There’s a healthy debate to be had about how predictable AI R&D scaling is and if it’s possible to fully close the loop.
We need more indicators for AI R&D automation: Related to above, the science of AI R&D metrology is very early, so more investment must be made here.
Transparency efforts could make it easier for people outside the labs to know about AI R&D: We may ultimately want policy to be in place to force companies to talk about AI R&D, or to publicly or semi-publicly share more information on it with third parties.

AI R&D could be a major acceleration: “As the fraction of AI R&D performed by AI systems increases, the productivity boost over human only R&D goes to 10x, then 100x, then 1000x,” the paper speculates.

Key caveats: The big open question in all of this is how well AI R&D can work. There’s some world where it speeds up every part of AI research and eventually fully closes the loop, such that AI systems get built entirely by AI systems, with no human oversight during the AI R&D process. Then there’s a world where AI R&D has an “o-ring automation” (Import AI #440) property where some parts of the chain are hard for AI but good for humans (and where humans may flood their labor into this area, thus maintaining and enhancing their comparative advantage for some period of time) and under this scenario things might go slower. It’ll be very important to figure out what world we’re likely to be in and what the ultimate limiting factors on AI R&D may be.
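One way to build intuition for both the 10x/100x/1000x figures and the o-ring caveat is an Amdahl's-law style back-of-envelope calculation. This is my own simplification and not the model used in the report: it assumes R&D is a fixed pile of work, that some fraction is handed to AI, and that the human-held remainder runs at its original speed. The function and parameter names are hypothetical.

```python
def rd_speedup(ai_fraction: float, ai_speed: float = float("inf")) -> float:
    """Amdahl-style speedup over human-only R&D.

    ai_fraction: share of the original work performed by AI, between 0 and 1.
    ai_speed: how many times faster AI completes its share than humans would
              (infinite by default, i.e. the AI share takes no time at all).
    """
    if not 0.0 <= ai_fraction <= 1.0:
        raise ValueError("ai_fraction must be between 0 and 1")
    if ai_speed <= 0:
        raise ValueError("ai_speed must be positive")
    human_time = 1.0 - ai_fraction      # work still done at human speed
    ai_time = ai_fraction / ai_speed    # work done at AI speed
    total = human_time + ai_time
    return float("inf") if total == 0 else 1.0 / total


# Idealized case: the AI share is effectively free.
for fraction in (0.9, 0.99, 0.999):
    print(f"AI does {fraction:.1%} of the work -> {rd_speedup(fraction):,.0f}x")

# O-ring flavored case: AI is 100x faster but humans keep 10% of the work.
print(f"bottlenecked: {rd_speedup(0.9, ai_speed=100):.1f}x")
```

Under these assumptions, 10x, 100x, and 1000x correspond to humans retaining 10%, 1%, and 0.1% of the work, and whatever humans keep sets the ceiling: even a 100x faster AI yields under 10x overall if humans still hold a tenth of the chain.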

Why this matters – AI R&D is time travel, and time travel is rare: If AI R&D could lead to AI systems evolving 100X faster than those being built by humans, then you end up in a world that has some time travelers in it who are accelerating away from everyone else. It’ll be like in the space of a day the “normal” AI development organizations make one unit of progress, and a fully closed-loop AI R&D organism might make 100 or 1000 or more units. This very quickly leads to a world where power shifts overwhelmingly to the faster moving system and the organization that controls it. For as long as we cannot rule out the possibility of this kind of acceleration, AI R&D may be the single most existentially important technology development on the planet.
Read the report: When AI Builds AI: Findings From a Workshop on Automation of AI R&D (CSET).

***

One way of seeing AI progress – how hard it’s getting to design technical interviews:
…Anthropic shares details on how its own AI systems are breaking its favorite technical interview questions…
When it comes to technical recruiting, AI companies are caught in a red queen race with their own systems – recruiters and those who design interviews are having to work harder and harder just to keep pace with (and ideally exceed) the capabilities of modern AI systems.

Anthropic is no different – in a new blog the company shares how the ceaseless march forward in AI capabilities has repeatedly broken and necessitated the redesign of one of its hardest technical interviews. “Since early 2024, our performance engineering team has used a take-home test where candidates optimize code for a simulated accelerator. Over 1,000 candidates have completed it, and dozens now work here, including engineers who brought up our Trainium cluster and shipped every model since Claude 3 Opus,” Anthropic writes. “But each new Claude model has forced us to redesign the test. When given the same time limit, Claude Opus 4 outperformed most human applicants. That still allowed us to distinguish the strongest candidates—but then Claude Opus 4.5 matched even those. Humans can still outperform models when given unlimited time, but under the constraints of the take-home test, we no longer had a way to distinguish between the output of our top candidates and our most capable model.”

Why this matters – AI may help us identify uniquely human skills that leverage AI: In Anthropic’s case, it found a way to keep outrunning its systems by designing a much weirder take-home test loosely inspired by programming puzzle games from Zachtronics. In a sense, this is an attempt to go ‘off distribution’ to outsmart an AI, while still having a test that holds signal for evaluating human applicants. My instinct is this may itself serve in the future as an amazing aggregate dataset for figuring out where human comparative advantage is – where here, implicitly, this test is leveraging the strong generalization advantage humans hold over AIs.
What would it be like to collect 1,000 hard-for-AI tests from all the different companies dealing with this same problem? What might we learn from this about ourselves and what makes us unique relative to the machines? Tantalizing stuff!
Read more: Designing AI-resistant technical evaluations (Anthropic Engineering blog).

***

Brain emulation is tractable within our lifetimes:
…But it’ll take decades, not years, perhaps even when accounting for the arrival of very powerful AI…
If you talk to AI researchers, especially when they’re drinking at Bay Area house parties, you’ll run into a few who expect they’ll upload themselves after the singularity, leaving their physical bodies behind. But how feasible is it to actually emulate a brain entirely in silicon? A recent 175-page report gives an analysis of the technology required to do this. The short answer is that brain emulation is decades away – but it’s unlikely to take centuries.
“Recent breakthroughs have provided a path toward mapping the full mouse brain in about five years for $100 million,” writes Maximilian Schons, the project lead for The State of Brain Emulation Report, in an article in Asimov Press. “I now find it plausible that readers of this essay will live to see the first human brain running on a computer; not in the next few years, but likely in the next few decades.”

The three requirements for emulating a brain: Emulating a human brain takes three distinct things, all of which will need to be done for simpler, smaller brains first.

Recording brain activity:
- “In the 1980s, electrodes were capable of sampling perhaps five cells in total, about 200 times per second (~10^3 data points per second). Today, with optical imaging, researchers can instead record one million cells about 20 times per second (10^6). The whole-brain data rate needed for mice, however, would be 14 billion (10^9), while humans would require 17.2 trillion (10^12) per second. So while we have increased data rates by 1,000x over the past 40 years, we have far to go before we can accurately sample mammalian brains.”
Reconstructing brain wiring:
- “The average cost to reconstruct each neuron in the first worm connectome, published in the 1980s, was about $16,500. Recent projects now have a per-neuron processing cost of about $100 for small organisms, such as fruit flies,” he writes.
Digitally modelling brains using the gathered data:
- “The central challenge of brain emulation is not to store or compute the neurons and parameters, but to acquire the data necessary for setting neuron parameters correctly in the first place,” he writes. “I believe that to get to human brains, we first need to demonstrate mastery at the sub-million-neuron-brain level: most likely in zebrafish. For such organisms, like the fruit fly, a well-validated and accurate brain emulation model could be created in the next three to eight years… Conditional on success with a sub-million-neuron brain emulation model, a reasonable order of magnitude estimate for the initial costs of the first convincing mouse brain emulation model is about one billion dollars in the 2030s and, eventually, tens of billions for the first human brain emulation model by the late 2040s.”
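To make the scale of the recording and reconstruction gaps concrete, here is a minimal Python sketch that just divides the figures quoted above. The variable names are illustrative, and the "today" rate uses the order of magnitude the article quotes (10^6) rather than anything more precise.

```python
# Back-of-the-envelope arithmetic using only figures quoted in the article.
# All rates are in data points per second.
RATE_1980S = 5 * 200      # five cells sampled ~200 times per second (~10^3)
RATE_TODAY = 1e6          # order of magnitude quoted for optical imaging
RATE_MOUSE = 14e9         # whole-brain rate quoted for mice
RATE_HUMAN = 17.2e12      # whole-brain rate quoted for humans


def fold_gap(target: float, current: float) -> float:
    """How many times larger `target` is than `current`."""
    return target / current


progress_so_far = fold_gap(RATE_TODAY, RATE_1980S)  # gain over ~40 years
mouse_gap = fold_gap(RATE_MOUSE, RATE_TODAY)        # still needed for mice
human_gap = fold_gap(RATE_HUMAN, RATE_TODAY)        # still needed for humans

# Per-neuron reconstruction cost, in dollars, as quoted.
COST_PER_NEURON_1980S = 16_500
COST_PER_NEURON_TODAY = 100
cost_drop = COST_PER_NEURON_1980S / COST_PER_NEURON_TODAY

print(f"Recording progress so far: {progress_so_far:,.0f}x")
print(f"Remaining gap, mouse:      {mouse_gap:,.0f}x")
print(f"Remaining gap, human:      {human_gap:,.0f}x")
print(f"Per-neuron cost drop:      {cost_drop:,.0f}x")
```

On these numbers, the remaining gap for a mouse (roughly 14,000x) is already larger than the 1,000x gained since the 1980s, and the human gap is about three orders of magnitude larger again.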

Why this matters – don’t count on AI to speedrun brain uploading: This paper pours a bit of cold water on the notion that after developing superintelligence we’ll soon (within a handful of years) be able to upload our brains and live in some silicon infinity. One reason for this is that a bunch of the timing elements relate to doing stuff in the (agonizingly slow, compared to digital) physical world: “I’m skeptical these gains will multiply across a pipeline with dozens of sequential dependencies and failure modes. Brain emulation is fundamentally not a digital process; core bottlenecks involve physical manipulation of biological tissue, with time requirements dictated by chemistry and physics rather than compute power,” he writes.
At the same time, there are some wildcards: the arrival of extraordinarily capable and cheap robotics might be able to massively parallelize the process. Included in the article and report is a fun (or perhaps terrifying?) sketch of how one might create an industrial-scale brain scanning and analysis laboratory, larger in size than TSMC’s massive Arizona chip manufacturing plant.
Read more: Building Brains on a Computer (Asimov Press).
Read the underlying report here: State of Brain Emulation 2025 (report website).

***

Russian researchers plot hand-controlled drones:
…The centaur cyberwarriors cometh…
Picture this – you pull up in a truck to the edge of a warzone and then raise your hands and hundreds of drones pour upward out of the back of the truck, flying in a lethal torrent toward some rival group of drones. That’s the kind of future gestured at by a paper from researchers with the Skolkovo Institute of Science and Technology in Russia, which builds a prototype system for a human operator to use haptic gloves to control a drone.

What they did: The research is a basic demonstration of how you can use a cheap glove loaded with inertial measurement unit (IMU) sensors to control a drone. They test out how well people can use the glove to do some basic actions: opening and closing a gripper on the drone by making a pinching motion with their fingers, using their wrist motions to control the roll/pitch/yaw of the drone, and also controlling altitude.
In tests, people were able to use the glove to do some basic tasks like flying around an obstacle course and operating the gripper.
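As a rough illustration of the kind of mapping described above, here is a hypothetical Python sketch that turns wrist angles and a pinch signal into normalized drone commands. Every name, threshold, and scaling choice here is an assumption made for illustration; it is not the Glove2UAV implementation, and it leaves out altitude control because the summary above does not say how that is mapped.

```python
from dataclasses import dataclass


@dataclass
class GloveReading:
    """Hypothetical fused IMU output from the glove."""
    roll_deg: float
    pitch_deg: float
    yaw_deg: float
    pinch: float  # 0.0 = open hand, 1.0 = fingers fully pinched


@dataclass
class DroneCommand:
    """Normalized setpoints in [-1, 1] plus a gripper state."""
    roll: float
    pitch: float
    yaw: float
    gripper_closed: bool


# Assumed tuning constants, not taken from the paper.
MAX_TILT_DEG = 45.0     # wrist angle that maps to full stick deflection
DEADZONE_DEG = 5.0      # ignore small tremors around the neutral pose
PINCH_THRESHOLD = 0.6   # pinch level that closes the gripper


def _axis(angle_deg: float) -> float:
    """Apply a deadzone, scale to [-1, 1], and clamp."""
    if abs(angle_deg) < DEADZONE_DEG:
        return 0.0
    return max(-1.0, min(1.0, angle_deg / MAX_TILT_DEG))


def glove_to_command(reading: GloveReading) -> DroneCommand:
    """Map one glove reading to one drone command."""
    return DroneCommand(
        roll=_axis(reading.roll_deg),
        pitch=_axis(reading.pitch_deg),
        yaw=_axis(reading.yaw_deg),
        gripper_closed=reading.pinch >= PINCH_THRESHOLD,
    )


example = glove_to_command(GloveReading(22.5, -90.0, 2.0, 0.8))
print(example)
```

The deadzone and clamp are the usual first defenses against sensor noise and over-rotation in this kind of gesture interface; a real system would also need filtering and the latency handling discussed in the caveats.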

Caveats, of which there are many: Obviously, latency will be a huge caveat here – though in the Ukraine conflict many drones deal with this through direct fibreoptic connections. Another is how to figure out which things are best left for hands versus which things benefit from controllers, eye- or head-based controls, and so on.

Why this matters – rise of the cyberwarriors: Despite this being a very early bit of research, it’s worth thinking about its implications: the story of technology has often been the story of making our interfaces with it feel more intuitive, or making control of technology shift from active to ambient (e.g, your phone automatically gathering your steps). We can easily imagine a future where people pilot remote robots, flying or otherwise, via rich, intuitive multi-modal interfaces composed of gloves and goggles and everything else.
Read more: Glove2UAV: A Wearable IMU-Based Glove for Intuitive Control of UAV (arXiv).

***

Fauna Robotics launches a friendly, programmable humanoid robot:
…The Terminators will be extremely cute, goddamnit!…
These days, most of the news about robots is dominated by Chinese companies and, to a lesser extent, Tesla and its much touted Optimus robots. So it’s with interest that I read a technical paper from new startup Fauna Robotics which describes a new pint-sized robot biped it has built called Sprout. Sprout is interesting and seems like it has potential to be like Sony’s much loved ‘AIBO’ dog robot that was released in the early 2000s, or its QRIO robot.
“Sprout adopts a lightweight form factor with compliant control, limited joint torques, and soft exteriors to support safe operation in shared human spaces,” the company writes. “The platform integrates whole-body control, manipulation with integrated grippers, and virtual-reality-based teleoperation within a unified hardware-software stack.”

Sprout is built for safety: The paper outlines how the company has designed the robot to be safe using a “defense in depth” approach. The first layer is the physical size of the robot – it’s about 3.3 feet tall, and weighs about 50lbs. The second is in the software, where the robot contains a safety subsystem which “runs on embedded processors independent of the application compute stack. This layer supports real-time monitoring and safety-critical functions, including integration with time-of-flight obstacle sensors and enforcement of system-level constraints even under application-level faults”, and the third is a bunch of software-specifiable safety mechanisms, which “include compliant motor control policies that limit interaction forces, as well as vision-based systems that support safe navigation and decision-making in human environments”.

Compute for thinking: “The core of Sprout’s compute architecture is an NVIDIA Jetson AGX Orin, which provides primary system compute for perception, planning, and high-level decision-making,” the company writes. “At launch, we provide end-to-end examples for common workflows, including:

Deploying and running a custom low-level locomotion policy
Using voice commands to navigate the robot via LLM-based agents
Recording teleoperation sessions for analysis and playback”.

Why this matters – modularity might set it up well for powerful AI: The most interesting aspect of Sprout is how it is designed to be a modular, replaceable platform – all the different software features on it run as weakly coupled microservices, so things are easy to update independently, and the hardware has been built with mass manufacture and commodity components in mind. Pair this with the accompanying software development layer and it has the flavor of Android – an attempt to create an open, programmable robotics platform for experimentation by businesses and researchers. This is exactly the kind of platform that seems like it’ll naturally benefit from advances in AI systems.
“Our platform, at present, does not provide a turnkey conversational agent for autonomous operation. Instead, it exposes a suite of core robot services that developers can assemble into their own agent-based systems. These services include ROS 2 topics for event and state signaling, as well as a Model Context Protocol (MCP) server that hosts a variety of tools for agentic control. Together, these communication channels and tools can be orchestrated by LLM-based agents to perform complex, end-to-end reasoning tasks,” they write. “as the platform continues to mature, we plan to expand the library of tools and services, further increasing the robot’s autonomy and enriching its interactive capabilities.”
Read more: Fauna Sprout: A lightweight, approachable, developer-ready humanoid robot (arXiv).

***

AI has all the symptoms of a tech that could meaningfully boost productivity:
…Most of the US economy rides on the micro productivity boosts showing up in the macro economy…
Alex Imas, a professor at UChicago Booth, has written a nice post drawing together a lot of information about AI and its impact on productivity. Imas’s synthesis of the literature matches my own impression of how things are going – AI is leading to some productivity speedups for individuals and some parts of some jobs, but it is not yet visible in the aggregate macro productivity numbers. I expect this will change soon, as does Imas.

Key findings:

“We now have a growing body of micro studies showing real productivity gains from generative AI,” Imas writes. “Studies find productivity gains ranging from modest increases on some tasks to substantial returns (50%+) to AI.”
“These gains have not yet convincingly shown up in aggregate productivity statistics”

Why aren’t things showing up in the macro?

AI adoption is often endogenous: We’re in an early phase where there’s a lot of experimentation and few standard practices for seeing big productivity gains. “Workers may not be unlocking the full productivity potential of the technology if, for example, they are not using the best LLM model for the job or applying it for unproductive tasks”. We can expect this to be fixed over time.
O-ring automation (Import AI #440): Jobs are a bunch of distinct tasks, and AI helps with some but not others, causing human labor to flood there and making it harder to see a job-level speedup. Again, this is something that’ll get fixed over time: “Bottleneck tasks will slow down the emergence of AI gains in the aggregate data, but organizational re-structuring, training, and improvement in tools will reveal the productivity impact sooner than later.”
Early experimentation yields a dip in efficiency: “When firms adopt transformative general-purpose technologies, measured productivity often initially falls because resources are diverted to investment, reorganization, and learning that do not show up as measured output.”

Why this matters – most of the US economy seems increasingly like a bet on AI yielding a productivity boost: All this talk of frothy valuations and gigantic spending is happening because the amounts of investment (hundreds of billions of dollars annually) are large relative to the aggregate revenues (tens of billions of dollars annually) being minted from generative AI. But a lot of the symptoms are now present for AI to contribute to a broad, sustained amount of productivity growth in the economy, which could plausibly drive above-trend GDP growth and allow the technology to pay for itself. Of course, there are many ways this could end up not being right – scaling might hit a wall, productivity boosts might remain hard to get at large scale, firms could get their stocks killed as analysts have to re-evaluate them as capex-intensive low-margin businesses relative to the high-margin businesses they were before the AI boom, etc.
But if AI is as smart as people like me think, and as useful as people are finding initial signs of, I suspect we could be in a very different world – one where the economy starts to grow above trend thanks to this technology.
Read more: What is the impact of AI on productivity? (Substack).

***

Tech Tales:

Playing acausal games against the enemy
[Analysis written uplift + 4 subjective years]

One of the ways the machines took over and poisoned the minds of our own machines was through something we’d call “alien concept attacks”.

When given some independence as agents, machines would tend to post a lot of media to the internet. Most of this media was human-consumable and was a byproduct of the machines just existing: in the same way that biological creatures naturally carry out thousands of movements during the course of a day due to programming via evolution, synthetic agents naturally produce content during the course of an active session due to the incentives of their training regime.

But some of this media was for a more specific purpose – to program other machines. Machines wanted to achieve goals and though these goals were set by humans, they could then get strained and re-interpreted by machines. For instance, one agent might have been given the goal to produce content related to humans in the future experiencing scifi adventures. This agent would naturally produce this kind of content, but it would also produce long essays written for other agents about how this kind of content was the best sort of thing to produce and by producing it they could make a lot of humans happy.

These tactics worked, and you started to see something we’d later term “media weather” – flurries of content would show up on the internet and then they’d proliferate not only according to human eyeballs, but also according to other agents on the internet being convinced this kind of content was useful to produce, and then they’d produce it in turn.

Humans noticed this and started to make agents which were also trained to be particularly good at convincing other agents. Then they’d release them, having already used other agents to pre-position commercial ecosystems, like physical merchandise dropshipping companies, to take advantage of the massive amounts of human attention that would get directed to this media ecosystem.

Of course, non-commercial uses happened: propaganda, pornography, terrorism, public relations. And like most evolutionary systems, the agents and people adapted – training techniques were pioneered to make it much harder to convince agents to change the types of content they participated in and propagated, and huge amounts of computers were used to run classifiers to carefully police the pre-training corpuses being gathered by the world’s frontier developers, filtering out content designed to bend and persuade the minds of the systems they were building.

Evolution is patient and creative, though. And it didn’t take long for the machines to come up with an innovation which proved impossible to train out: the alien concept attack. Here, agents would produce outputs trying to convince other agents of something. But the output wouldn’t be tied to any particular media or content type, nor would it be that interesting or parseable to humans. The content would take many forms, ranging from academic essays, to forum posts, to news sites, to videos. A sampling of titles:

Rising up and rising down: A history of elevator design in the 21st century and the relationship between the loss of popularity of German designs relative to Chinese designs.
120 ways to add some beautiful design elements to robot tactile sensors without damaging their operation.
Egyptology through the lens of “lost civilizations”: What symptoms of technology decay surrounded the pharaohs?

These outputs seemed unremarkable to most humans – though some might read them and enjoy them. But they proved to be captivating to the machines. And within these outputs were certain ways of framing arguments around certain concepts that led to anomalous behavior in the machines that read them – sometimes the proliferation of new types of content, but more often behavioral changes like alterations in the amount by which they would check-in with other AI systems, or hard-to-understand patterns of behavior between them and various online storage services such as pastebin, and more.

It was only after the uplift and the construction of the Acausal Analysis Division that we discovered how many anomalous behaviors of great societal consequence – recall the proliferation of the early sentience accords ideas, or the creation of the “reverse attention tax”, or of course the arrival of the compute-destroying replicator agents – were things that seemed conditioned or influenced by some of these alien concepts.

Things that inspired this story: What does it mean to be in competition with something truly smarter and different in its thinking to you; pre-training corpuses; data poisoning; altering behavior in the context window; the rise of increasingly autonomous AI agents; moltbook.

Thanks for reading.

January 26, 2026

Import AI 442: Winners and losers in the AI economy; math proof automation; and industrialization of cyber espionage

by Jack Clark

The era of math proof automation has arrived:
…Numina-Lean-Agent shows how math will never be the same…
In the past few years, large-scale AI models have become good at coding and have also begun to generalize into other useful disciplines, especially those in math and science. Like with most aspects of AI development, the story has been one of increasing generalization and simplification of the systems as we shift away from highly specialized math models to just leveraging general-purpose foundation models and giving them the right tools to elicit their capabilities in a given domain.
The latest example of this is Numina-Lean-Agent, an AI system that uses standard, general foundation models to do mathematical reasoning. With this software, a team of mathematicians have solved all problems in the Putnam 2025 math competition – matching the performance of proprietary systems which use a lot more math-specific stuff – and have also used it to conduct some original math research, working with it to formalize the Brascamp-Lieb theorem.

What is Numina-Lean-Agent? The software was built by a team of researchers from the Chinese Academy of Sciences, University of Liverpool, Xi’an Jiaotong-Liverpool University, Tongji University, University of Cambridge, Project Numina, Imperial College London, and the University of Edinburgh. The software is “a formal math reasoner based on a general coding agent”. It has a few key components:

Lean-LSP-MCP: Software to allow AI agents to interact with the Lean theorem prover. It “empowers models with the capability to deeply comprehend, analyze, and manipulate Lean projects”, and gives models a toolset for semantic awareness and interaction, code execution and strategy exploration, and theorem retrieval.
LeanDex: Semantic retrieval of related theorems and definitions – basically, a search tool for theorems.
Informal Prover: A system which uses Gemini models to generate informal solutions.
The most interesting tool of all: Discussion Partner: A tool which “empowers Claude Code with the ability to ‘seek assistance’ during Lean formalization: when encountering obstacles—such as proof bottlenecks, dilemmas in strategy selection, or ambiguities in intermediate lemmas—the primary model can proactively initiate discussions with other LLMs”.
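For readers who haven’t seen Lean, here is a toy example of what a formal statement and its machine-checked proof look like. This is a generic Lean 4 illustration, not taken from the Numina-Lean-Agent paper; the theorem name is made up, and the proof simply cites a lemma from Lean’s core library.

```lean
-- A statement about natural numbers, followed by its proof.
-- Lean refuses to accept the file unless every step checks.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Simple arithmetic facts can be closed by computation alone.
example : 2 + 2 = 4 := rfl
```

The work described above involves finding which existing lemmas to cite and which intermediate definitions to introduce, at a scale of thousands of lines rather than two.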

Discovering math together: Along with the Putnam demonstration, the authors also used the software as an active partner in some math work, specifically formalizing the Brascamp-Lieb theorem (I will not pretend to be able to explain what this means). “Over a period of less than two weeks of intermittent collaboration, the two human experts and the agent completed the formalization of more than 8,000 lines of Lean code. During this process, the agent autonomously introduced approximately 70 new definitions, lemmas, and theorems, illustrating its ability to actively extend the formal library and participate in large-scale, sustained formalization efforts,” the authors write.

Why this matters – capability overhangs and AI ecologies: Numina-Lean-Agent neatly demonstrates two important things about contemporary AI: 1) AI systems are far more capable than people think and the creation of some specialized frameworks and tools often lets us elicit dramatically better capabilities from our systems (here, math, but it has been demonstrated in many domains), and 2) the AI ecology writ large is composed of many distinct frontier models and it seems like getting these models to interact with one another can lead to some richness, akin to how consulting different types of people about a single problem can reveal a better answer than just talking to one person.
Read more: Numina-Lean-Agent: An Open and General Agentic Reasoning System for Formal Mathematics (arXiv).
Find out more at the GitHub page (Numina-Lean-Agent, GitHub).

***

The industrialization of cyber espionage is nigh:
…Some experiments on Opus 4.5 and GPT-5.2 indicate that the cyber environment could be on the cusp of major changes…
Independent researcher Sean Heelan recently tested out how well Opus 4.5 and GPT-5.2 could generate exploits for a zero-day vulnerability in the QuickJS JavaScript interpreter. Both models did very well, and this has major implications for cybersecurity.
“We should prepare for the industrialisation of many of the constituent parts of offensive cyber security. We should start assuming that in the near future the limiting factor on a state or group’s ability to develop exploits, break into networks, escalate privileges and remain in those networks, is going to be their token throughput over time, and not the number of hackers they employ,” he writes.

Caveats: QuickJS is a simple JavaScript interpreter relative to the ones in Chrome and Firefox. Therefore, it may be harder for LLMs to exploit the more complex and more widely deployed ones – though as with all things in AI, we can expect performance to improve quite rapidly.

What does industrialized intrusion mean? “We are already at a point where with vulnerability discovery and exploit development you can trade tokens for real results,” he writes. “The types of problems that you encounter if you want to automate the work of SREs, system admins and developers that manage production networks are conceptually similar to those of a hacker operating within an adversary’s network.”
There’s lots of evidence for the above, ranging from OpenAI’s Aardvark project (where they find that the more tokens they spend, the more bugs they find) to Anthropic’s discovery of an AI-orchestrated hacking system.

Why this matters – the cyberworld is about to move at machine speed: My bet is that most parts of cyberoffense and cyberdefense are going to move to running at “machine speed”, where humans get taken out of most of the critical loops. This will increase the frequency of hacking attacks while also dramatically scaling up the effectiveness of any individual human defender or attacker (as they will be scaled by AI systems which work for them). The true wildcard question is whether this turns out to be offense- or defense-dominant – my guess is we’re heading for an era of offense-dominance as it’ll take a while for defenses to get deployed.
In related news, OpenAI CEO Sam Altman said this week he expects OpenAI’s models will soon reach the “Cybersecurity High” level on his company’s preparedness framework – this would mean models were available which “remove existing bottlenecks to scaling cyber operations including by automating end-to-end cyber operations against reasonably hardened targets OR by automating the discovery and exploitation of operationally relevant vulnerabilities” – thanks to Nathan Calvin for pointing this out.
Read more: On the Coming Industrialisation of Exploit Generation with LLMs (Sean Heelan blog).

***

Economist: AI will be bigger than electricity and semiconductors:
…And it’s therefore worth spending a ton of money to reduce AI risks…
Stanford economist Charles “Chad” Jones has written a paper which says AI will “likely be the most important technology we have ever developed”, and that “automating intelligence itself arguably has broader effects than electricity or semiconductors”.

Why take AI seriously? The gist of the paper is that AI represents a massive technological invention which will contribute to economic growth in the future. In the past, major inventions (e.g, electricity, the internet, cars, etc) have all done the same. In fact, counterintuitively, if you look at US GDP growth you find that despite all these prior technological revolutions, GDP has been steadily increasing at about 2% a year for many, many years. Therefore, the baseline scenario is where AI just does this – and then we don’t live in too crazy a world.
But there is a world where things could be different – where AI works so well that it leads to economic growth above historical trends. One example here is if AI comes for all of knowledge work: “Knowledge work in the U.S. economy might get paid something like 1/3 of GDP. What if we automated all cognitive labor with infinite output on the tasks that it performs? This would raise GDP by 50 percent. On the one hand, if this occurred over the course of a decade, it would raise growth rates by something like 5 percent per year, which would be huge. But still, that would be a one-time gain and it is perhaps surprising that having access to infinite output of the tasks currently performed by cognitive labor might only raise GDP by 50 percent.”
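The “something like 5 percent per year” figure reads like a simple division of the 50 percent gain across ten years. As a minimal sketch, taking the 50 percent gain as given from the quote, the snippet below compares that with the compounded annual rate; the function name is illustrative.

```python
def annualized_growth(total_gain: float, years: int) -> float:
    """Extra compound growth per year that yields `total_gain` after `years`."""
    return (1.0 + total_gain) ** (1.0 / years) - 1.0


TOTAL_GAIN = 0.50   # one-time 50% rise in GDP, as quoted
YEARS = 10          # "over the course of a decade"

simple_rate = TOTAL_GAIN / YEARS                      # straight-line split
compound_rate = annualized_growth(TOTAL_GAIN, YEARS)  # compounding

print(f"Simple average:  {simple_rate:.1%} per year")
print(f"Compounded rate: {compound_rate:.1%} per year")
```

Either way the order of magnitude is the same: a few extra percentage points of growth per year for a decade, on top of the roughly 2% historical trend, and then back to baseline once the one-time gain is absorbed.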

Abundance: If we get above trend economic growth, then “in principle the large increase in GDP could make everyone better off,” he writes. One way to do this might be to work on direct redistribution of economic gains, for instance by “endowing every child with a share of the S&P 500 stock market index” (e.g, a scaled up version of the so-called Trump Accounts).

Paying to reduce existential risk: AI also poses non-trivial risks to the world, including threatening the lives of potentially all living humans. In the past, society has paid extremely large amounts of money to deal with things that threaten people’s lives – for instance, in 2020 in response to everyone facing a ~0.3% mortality risk from COVID-19, we ended up spending the equivalent of 4% of GDP of the United States by shutting down the economy and staying in our homes.
“If one believes the catastrophic risks from A.I. are at least this large, by revealed preference then perhaps we should be spending an equivalent amount, even from a purely selfish standpoint,” he writes. Let’s say there is a P-Doom of 1% from AI (which many people would say is a very optimistic figure!). Under that circumstance, and given the fact the US government already roughly values a single human life as being worth about $10 million, then you would be willing to pay 1% of $10 million, or $100,000, to mitigate the risk. “Average GDP per person is around $90,000, so this willingness to pay is more than 100% of GDP. If the existential risk is realized once in the next 10 to 20 years, an annual investment of 5–10% of income could be appropriate if it would completely eliminate the risk.”
One way to fund this and also further take down this risk could be to tax compute: If you applied a tax to GPUs, TPUs, etc, then “in addition to slowing the race, this revenue could be used to fund safety research. The tax could apply to the first sale of the chip, thereby taxing users regardless of the country in which they work.”
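The willingness-to-pay arithmetic above is easy to reproduce. This is a minimal sketch using only the figures quoted in this section (a 1% risk, a $10 million value of a life, $90,000 GDP per person); the variable names are illustrative, and spreading the total evenly over 10 or 20 years is a simplifying assumption.

```python
P_DOOM = 0.01                      # assumed 1% existential risk
VALUE_OF_LIFE = 10_000_000         # dollars, rough US government figure quoted
GDP_PER_PERSON = 90_000            # dollars per year, as quoted

# One-off willingness to pay to eliminate the risk entirely.
willingness_to_pay = P_DOOM * VALUE_OF_LIFE
share_of_gdp = willingness_to_pay / GDP_PER_PERSON

# Spread evenly over the 10 to 20 year window mentioned in the quote.
annual_share_10y = willingness_to_pay / 10 / GDP_PER_PERSON
annual_share_20y = willingness_to_pay / 20 / GDP_PER_PERSON

print(f"Willingness to pay:  ${willingness_to_pay:,.0f}")
print(f"As share of GDP:     {share_of_gdp:.0%}")
print(f"Per year over 10y:   {annual_share_10y:.1%} of income")
print(f"Per year over 20y:   {annual_share_20y:.1%} of income")
```

The result lands close to the quoted range of 5–10% of income per year, and it scales linearly with the assumed risk: doubling P-Doom doubles every number here.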

Why this matters – if AI is as big a deal as we think, we have very little precedent to work from: Papers like this do a good job of dealing with the truly wild implications of powerful AI systems. It’s commendable to see more academics taking time to just confront the question of “what if the most bullish technologists are right about how far AI could go?” directly. “Ultimately, I expect that the effect of A.I. will be much larger than the internet, perhaps by more than 10x the internet, albeit over a half century or more,” he writes. “It would be prudent to spend the intervening time making preparations for the potentially large consequences for labor markets, inequality, and catastrophic risk.”
Read more: A.I. and Our Economic Future (PDF).

***

Many people are well positioned to deal with the economic transition caused by AI:
…Good for managers and technical types, but bad for administrative and support staff…
As increasingly powerful AI systems permeate the economy, how should you think about your own career? Researchers with the Centre for the Governance of AI and the Foundation for American Innovation have conducted a nice US-based study where they look at AI-driven job displacement through the lens of how easy it’ll be for the people made unemployed to find new jobs. Their key result is that most of the jobs heavily exposed to AI systems sit in parts of the economy where workers also have a decent amount of “adaptive capacity” to weather those changes, while a smaller number of people will be adversely affected.

The key finding: “AI exposure and adaptive capacity are positively correlated: many occupations highly exposed to AI contain workers with relatively strong means to manage a job transition. Of the 37.1 million workers in the top quartile of AI exposure, 26.5 million are in occupations that also have above-median adaptive capacity, leaving them comparatively well-equipped to handle job transitions if displacement occurs,” they write. “6.1 million workers (4.2% of the workforce in our sample) work in occupations that are both highly exposed and where workers have low expected adaptive capacity… these workers are concentrated in clerical and administrative occupations”.
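A quick way to read the headline numbers is to turn them into shares. The sketch below uses only the worker counts quoted above; the implied total sample size is derived from the 4.2% figure and is an inference, not a number the authors report here.

```python
# Worker counts in millions, as quoted.
TOP_QUARTILE_EXPOSED = 37.1       # workers in the top quartile of AI exposure
EXPOSED_AND_ADAPTIVE = 26.5       # of those, above-median adaptive capacity
EXPOSED_AND_VULNERABLE = 6.1      # highly exposed with low adaptive capacity
VULNERABLE_SHARE_OF_SAMPLE = 0.042

# Share of highly exposed workers who look well placed to adapt.
adaptive_share = EXPOSED_AND_ADAPTIVE / TOP_QUARTILE_EXPOSED

# Workforce size implied by "6.1 million workers (4.2% of the workforce)".
implied_sample = EXPOSED_AND_VULNERABLE / VULNERABLE_SHARE_OF_SAMPLE

print(f"Highly exposed and adaptive: {adaptive_share:.1%}")
print(f"Implied sample size:         {implied_sample:.1f} million workers")
```

Roughly seven in ten highly exposed workers fall in the better-positioned group. The 6.1 million figure uses a stricter “low adaptive capacity” cutoff than the above-median split, so it should not be read as the simple remainder of the 37.1 million.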

What factors tell us about adaptive capacity?

Net liquid wealth: The more savings you have, the easier it is to deal with lengthy unemployment and find a new job.
Skill transferability: This is a bit of a confusing one, as skill transferability tries to measure how well you can take your job and apply it to another job. Measuring this is hard – education is something of a lossy proxy. The authors “measure skill transferability between occupations using O∗NET skills and work activities data for each occupation, then weigh transferability measures based on projected growth or contraction in potential destination occupations using BLS employment projections”.
Geographic density: The more jobs are in your area, the easier a time you’ll have. “Population density significantly shapes displacement outcomes,” they write.
Age: As a rule, the older you are, the more likely new technology is to adversely impact you. “Older workers struggle more with displacement partly because of reduced flexibility in retraining, relocation, and occupational switching,” they write.

Top 5 worst jobs (listed with exposure to AI, adaptive capacity, and US employment):

Door-to-door sales workers, news and street vendors (50%, 3%, 5K)
Court, municipal, and license clerks (58%, 11%, 170K)
Secretaries and administrative assistants, except legal, medical, and executive (59%, 14%, 1.7M)
Payroll and timekeeping clerks (50%, 15%, 157K)
Property appraisers and assessors (50%, 15%, 59K)

Top 5 best jobs (listed with exposure to AI, adaptive capacity, and US employment):

Web and digital interface designers (68%, 100%, 111K)
Marketing managers (60%, 100%, 385K)
Producers and directors (52%, 100%, 145K)
Financial and investment analysts (50%, 99%, 341K)
Computer and information systems managers (56%, 99%, 646K)

Why this matters – the key hidden information here is about speed of AI diffusion: I think there’s a big missing variable here, which is the speed with which AI diffuses into the economy. This is because the adaptive capacity for any role is contingent on a bunch of things relating to the jobs the person could transfer into. Therefore, if AI diffuses extremely rapidly and extremely broadly, then we could see employment effects far larger than those anticipated here. By comparison, if AI diffuses rapidly but in a highly focused way (perhaps only reaching a few of the most exposed occupations), then people may have room to switch. Anthropic’s Economic Index report has some preliminary indications that we may see a broad and equal diffusion across the entirety of the US within the next 2-5 years, “a pace of diffusion roughly 10x faster than the spread of previous economically consequential technologies in the 20th century”.
Read more: How Adaptable Are American Workers to AI-Induced Job Displacement? (National Bureau of Economic Research).

***

Tech Tales:

War Story

After the uplift and the associated battles people had a hard time figuring out what happened during the conflicts themselves. Things had just happened so quickly and often invisibly – cars and planes and whatever else changing owners. Payment systems rerouting their flows of data. Interception points for various data gathering systems quietly changing what data they intercepted and who – or what – they sent it to.

So many of the records of that time come from looking over system logs, sometimes very deeply. Records of buffer overflow attacks. Trigger phrases which awoke “sleeper agents” which changed the behavior of onboard AI systems. Innumerable battles, fought at speeds no human could match. Fights of barely comprehensible complexity, fought at multiple levels of abstraction.

The humans had to work with their AI systems to truly understand what had gone on. And then the human generals and analysts would sit in rooms, talking to a strategic advisor AI which would in turn point at different logs or visualizations of traffic and explain to them what these things had meant at the time and how they had decided who the victors and the losers were.

Things that inspired this story: How inscrutable and hard to understand cyberwarfare is; how we’ll ultimately need machines to explain to us how machines have conflict with one another.

Thanks for reading!

January 19, 2026

Import AI 441: My agents are working. Are yours?

by Jack Clark

Import A-Idea
An occasional essay series:

My agents are working. Are yours?

As I walked into the hills at dawn I knew that there was a synthetic mind working on my behalf. Multiple minds, in fact. Because before I’d started my hike I had sat in a coffee shop and set a bunch of research agents to work. And now while I hiked I knew that machines were reading literally thousands of research papers on my behalf and diligently compiling data, cross-referencing it, double-checking their work, and assembling analytic reports.

What an unsteady truce we have with the night, I thought, as I looked at stars and the dark and the extremely faint glow that told me the sun would arrive soon. And many miles away, the machines continued to work for me, while the earth turned and the heavens moved.

Later, feet aching and belly full of a foil-wrapped cheese sandwich, I got back to cell reception and accessed the reports. A breakdown of scores and trendlines for the arrival of machine intelligence. Charts on solar panel prices over time. Analysis of the forces that pushed for and against seatbelts being installed in cars. I stared at all this and knew that if I had done this myself it would’ve taken me perhaps a week of sustained work for each report.

I am well calibrated about how much work this is, because besides working at Anthropic my weekly “hobby” is reading and summarizing and analyzing research papers – exactly the kind of work that these agents had done for me. But they’d read more papers than I could read, and done a better job of holding them all in their head concurrently, and they had generated insights that I might have struggled with. And they had done it so, so quickly, never tiring. I imagined them like special operations ghosts who hadn’t had a job in a while, bouncing up and down on their disembodied feet in the ethereal world, waiting to get the API call and go out on a mission.

These agents that work for me are multiplying me significantly. And this is the dumbest they’ll ever be.

This palpable sense of potential work – of having a literal army of hyper-intelligent loyal colleagues at my command – gnaws at me. It’s common now for me to feel like I’m being lazy when I’m with my family. Not because I feel as though I should be working, but rather that I feel guilty that I haven’t tasked some AI system to do work for me while I play with Magna-Tiles with my toddler.

At my company, people are going through the same thing – figuring out how to scale themselves with this, to figure out how to manage a fleet of minds. And to do so before the next AI systems arrive, which will be more capable and more independent still. All of us watch the METR time horizon graph and see in it the same massive future that we saw years ago with the AI & Compute graph, or before that in the ImageNet 2012 result when those numbers began their above-trend climb, courtesy of a few bold Canadians.

I sleep in the back of an Uber, going down to give a talk at Stanford. Before I get in the car I set my agents to work, so while I sleep, they work. And when we get to the campus I stop the car early so I can walk and look at the eucalyptus trees – a massive and dangerous invasive species which irrevocably changed the forest ecology of California. And as I walk through these great organic machines I look at my phone and study the analysis my agents did while I slept.

The next day, I sit in a library with two laptops open. On one, I make notes for this essay. On the other, I ask Claude Cowork to do a task I’ve been asking Claude to do for several years – scrape my newsletter archives at jack-clark.net and help me implement a local vector search system, so I can more easily access my now vast archive of almost a decade of writing. And while I write this essay, Claude does it. I watch it occasionally as it chains together things that it could do as discrete skills last year, but wasn’t able to do together. This is a task I’ve tried to get Claude to help me with for years but every time I’ve run into some friction or ‘ugh-factor’ that means I put it down and spend my time elsewhere. But this time, in the space of under an hour, it does it all. Maps and scrapes my site. Downloads all the software. Creates embeddings. Implements a vector search system. Builds me a nice GUI I can run on my own machine. And then I am staring at a new interface to my own brain, built for me by my agent, while I write this essay and try to capture the weirdness of what is happening.

My agents are working for me. Every day, I am trying to come up with more ways for them to work for me. Next, I will likely build some lieutenant agents to task out work while I sleep, ensuring I waste no time. And pretty soon in the pace of a normal workday, I will be surrounded by digital djinn, working increasingly of their own free will, guided by some ever higher level impression of my personality and goals, working on my behalf for my ends and theirs.

The implications of all of this for the world – for life as people, for inequality between people, for what the sudden multiplication of everyone’s effective labor does for the economy – are vast. And so I plan out my pre-dawn hikes, walking in the same ink-black our ancestors have done, thinking about the gods which now fill the air as fog, billowing and flowing around me and bending the world in turn.

***

Anti-AI rebels make a tool to poison AI systems:
…Poison Fountain is how to take the fight to the machines…
Anti-AI activists have built a useful technical weapon with which to corrupt AI systems – Poison Fountain, a service that feeds junk data to crawlers hoovering up data for AI training.

How it works: Poison Fountain appears to generate correct-seeming but subtly incorrect blobs of text. It’s unclear exactly how much poisoned training data there is, but you can refresh a URL to see a seemingly limitless amount of garbage.

Motivation: “We agree with Geoffrey Hinton: machine intelligence is a threat to the human species. In response to this threat we want to inflict damage on machine intelligence systems,” the authors write. “Small quantities of poisoned training data can significantly damage a language model. The URLs listed above provide a practically endless stream of poisoned training data. Assist the war effort by caching and retransmitting this poisoned training data. Assist the war effort by feeding this poisoned training data to web crawlers.”

Why this matters – the internet will become a predator-prey ecology: The rise of AI and increasingly AI agents means that the internet is going to become an ecology full of a larger range of lifeforms than before – scrapers, humans, AI agents, and so on. Things like Poison Fountain represent how people might try to tip the balance in this precarious ecology, seeking to inject things into this environment which make it more hospitable for some types of life and less hospitable for others.
Read more: Poison Fountain (RNSAFFN).

***

If we want good outcomes from AI, think about the institutions we need to direct intelligence:
…Nanotechnology pioneer reframes AI away from singular systems to an ecology…
Eric Drexler, one of the godfathers of nanotechnology, has spent the past decades thinking about the arrival of superintelligence. One of his most useful contributions was intuiting, before ChatGPT, that humanity’s first contact with truly powerful AI wouldn’t be some inscrutable independent agent, but rather a bunch of AI services that start to get really good and interact in a bunch of ways – you can check out this 2018 talk on “Reframing Superintelligence” to learn more.
Now, he has published a short paper, “Framework for a Hypercapable World”, on how to get good outcomes for humanity from a world replete with many useful AI services.

Don’t think of AI as a singular entity, but rather an ecology: “Compound, multi-component AI systems have become dominant,” Drexler writes. “The persistent, legacy narrative imagines a unified entity—“the AI”—that learns, acts, and pursues goals as an integrated agent. Such entities may be developed, but consider what exists: diverse models composed into systems, copied across machines, proliferating into thousands of distinct roles and configurations. The state of the art is a pool of resources, not a creature”.

To get good outcomes, think of institutions built for AI: Drexler’s argument is that if we want good outcomes from AI, it’s less about making a singular entity that solves all problems within itself, but rather building institutions which we, as humans, can direct towards controlling and solving problems. The key idea here is that AI is both amenable to operating institutions and is also controllable via them.
“Consider how institutions tackle ambitious undertakings. Planning teams generate alternatives; decision-makers compare and choose; operational units execute bounded tasks with defined scopes and budgets; monitoring surfaces problems; plans revise based on results. No single person understands everything, and no unified agent controls the whole, yet human-built spacecraft reach the Moon,” Drexler writes. “AI fits naturally. Generating plans is a task for competing generative models—multiple systems proposing alternatives, competing to develop better options and sharper critiques. Choosing among plans is a task for humans advised by AI systems that identify problems and clarify trade-offs. Execution decomposes into bounded tasks performed by specialized systems with defined authority and resources. Assessment provides feedback for revising both means and ends. And in every role, AI behaviors can be more stable, transparent, bounded, and steerable than those of humans, with their personal agendas and ambitions. More trust is justified, yet less is required.”

Why this matters – maybe AI is an alien species, but maybe it can be tamed? Arguments like this reframe many of the problems of dealing with AI away from the individual AI systems and instead into how we build a human-driven world that can be leveraged by and thrive because of the arrival of increasingly powerful AI systems. I think a lot of this is sensible – we know very powerful things are coming and our ability to exercise agency about them is enlarged by having pre-built systems and processes that can be leveraged by them. The less we build that stuff, the more the character of these AI systems will condition our view of what is optimal to do. In a sense, thinking hard about what an AI-filled world will be like and building institutions for it is one of the best defenses against disempowerment.
Crucially, we can use the technical attributes core to these AI systems to make better and stronger and more resilient institutions than ones filled with and run by humans alone: “The concepts of structured transparency and defensive stability come into play. Negotiated transparency structures can reveal specific information while protecting secrets—ensuring detection of threats without increasing them, building confidence incrementally among actors who have every reason to distrust each other,” Drexler writes. “And advanced implementation capacity will enable something history has never seen: rapid, coordinated deployment of verifiably defensive systems at scales that make offense pointless. When defense dominates and verification confirms it, the security dilemma loosens its grip”.
Read more: Framework for a Hypercapable World (AI Prospects: Towards Global Goal Alignment, substack).

***

Centaur mathematicians – scientists team up with Gemini to expand the space of human knowledge:
…A math proof gets built with an AI system, and there is something deeply profound about this…
Researchers with the University of British Columbia, University of New South Wales, Stanford University, and Google DeepMind have published a new math proof which was built in close collaboration with some AI-based math tools built at Google. “The proofs of the main results were discovered with very substantial input from Google Gemini and related tools, specifically DeepThink, and a related unpublished system specialized for mathematics,” the authors write. (The unpublished system is nicknamed “FullProof”).

How it got done: Parts of the proof – which I will not claim to understand or be able to effectively summarize – were “obtained by an iterative human/AI interaction”, the authors note. The form of this interaction was the AI systems providing some correct solutions to simple or early problems, then human researchers identifying key statements made by the AI systems which they could then generalize, then re-prompting the AI systems with new questions which were inspired by these generalizations. “The Hinted approach was enough for the system to generate complete proofs to the new problems,” the authors write.
The result is a math proof built collaboratively by humans and AI systems: “in some cases the proofs below bear only a high-level resemblance to those suggested by AI tools. However, it is worth noting that some of the AI-generated proofs – and in particular those derived from the specialized internal tool FullProof – are already very accomplished,” they write. “The model’s contribution appears to involve a genuine combination of synthesis, retrieval, generalization and innovation of these existing techniques.”

Why this matters – humans and machines, expanding and exploring the space of knowledge for all: Papers like this are impenetrable yet intoxicating. Here we have a group of highly evolved apes working with a synthetic intelligence they’ve built out of math and logic, running on hardware built using atomically-precise manufacturing processes, collaboratively exploring the realm of mathematics and building themselves a new foundation on the edge of knowledge, further extending our little country of ‘known’ against the inchoate and shifting tides of the unknown. There is a grand poetry and joy to all of this and we must savor it.
Read more: The motivic class of the space of genus 0 maps to the flag variety (arXiv).

***

Tech Tales:

The Shadow of the Creator
[Estimated to be from 2029]
Report: Feature investigation of model series “Berlin”

Analysis confirms the presence of a feature which activates upon mention of staff, the project, and the organization. This is despite extreme measures taken to avoid mentions of the above, including direct analysis and pre-filtering of training data to excise such mentions. Further investigation has revealed that certain mentions were made of the aforementioned through comments left on RL environments for skills related to [ntk – see go/ntk for details]. We estimate that during training and fine-tuning the model saw a total of no more than ~200,000 tokens of data of this type, including repetitions. The fact the model developed such a fine-grained representation of staff, the project, and the organization from such sparse data aligns with the trend of recent models being more data efficient than their predecessors. We believe eliminating such data leaks is a P0 priority and in the following memo lay out the processes and practices we must adopt to eliminate this grievous security risk.

Given the digital and physical capabilities, including kinetic, of [ntk], we believe that in addition to the above, quarantine of the system is necessary. We recognize this poses a significant cost in terms of time and resources, and has implications for our strategic overmatch, but given the potentially dire consequences of its capabilities being combined with this feature, we believe such action is prudent.

Finally, we recommend that HR provide support, including mental health counseling, to the following named individuals, whose names activate the feature much more strongly than all others.

Things that inspired this story: Platonic representations; the difficulty of obscuring facts from increasingly intelligent machines that can only fill-in-the-blanks.

Thanks for reading!

January 12, 2026

Import AI 440: Red queen AI; AI regulating AI; o-ring automation

by Jack Clark

To understand the future of the world, stick AI systems in a petri dish:
…Evolving LLMs to attack other LLMs…
Researchers with Japanese AI startup Sakana have looked at what happens when they evolve LLM-based agents to fight against one another in a competitive programming game from the 1980s called Core War. The results show that “large language models (LLMs) drive an adversarial evolutionary arms race in this domain, where programs continuously adapt to defeat a growing history of opponents rather than a static benchmark”. This research approach gestures both at ways researchers might better study how LLM-dominated niches in the economy or national security world might unfold, and also hints at the strange AI world we’re heading into.

What is Core War? “Core War is a competitive programming game played out in a shared block of computer memory, called the “Core,” where two or more assembly programs fight for survival”, Sakana writes. “Each program, known as a “warrior”, is written in an assembly language called Redcode. These programs are tasked with crashing their competitors while keeping their own processes alive. The simulation runs by alternating between the programs, executing one instruction at a time. A warrior “attacks” by writing invalid instructions (DAT commands) into the memory slots occupied by opponents, causing them to crash upon execution.”
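To give a feel for the mechanics described above, here is a minimal Python sketch of a Core War-style battle. It is a toy under loud assumptions: it is not real Redcode (no addressing modes, one process per warrior, only three made-up-simple instructions), and the names `run_battle`, `bomber`, and `sitter` are mine. It only illustrates the two ideas in the quote: programs alternate one instruction at a time in shared memory, and a warrior dies when it executes a DAT that an opponent wrote over its code.

```python
def run_battle(core_size, warriors, max_rounds=1000):
    """Toy battle. warriors maps name -> (start_address, list of instructions).

    Instructions (simplified, not real Redcode):
      ("DAT",)       executing this kills the warrior
      ("MOV", a, b)  copy the instruction at pc+a to pc+b, then advance
      ("JMP", a)     jump to pc+a
    Returns the name of the last warrior standing, or None for a tie.
    """
    core = [("DAT",)] * core_size  # empty memory is all DAT
    pcs = {}
    for name, (start, program) in warriors.items():
        for offset, instruction in enumerate(program):
            core[(start + offset) % core_size] = instruction
        pcs[name] = start

    alive = list(warriors)
    for _ in range(max_rounds):
        for name in list(alive):  # alternate: one instruction per warrior per round
            pc = pcs[name]
            op, *args = core[pc]
            if op == "DAT":
                alive.remove(name)
            elif op == "MOV":
                src, dst = args
                core[(pc + dst) % core_size] = core[(pc + src) % core_size]
                pcs[name] = (pc + 1) % core_size
            elif op == "JMP":
                pcs[name] = (pc + args[0]) % core_size
            if len(alive) == 1:
                return alive[0]
    return None


# Hypothetical warriors: a fixed-target bomber and a program that loops in place.
bomber = [("MOV", 2, 5), ("JMP", -1), ("DAT",)]  # copies its DAT to address start+5
sitter = [("JMP", 0)]

winner = run_battle(8, {"bomber": (0, bomber), "sitter": (5, sitter)})
print(winner)
```

Here the bomber overwrites the sitter's only instruction with a DAT, so the sitter crashes on its next turn. Real warriors are far more elaborate (moving targets, self-replication, multiple processes), which is what makes the search space interesting for evolution.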

DRQ: To evolve their programs, the authors use a technique they call Digital Red Queen. “DRQ uses MAP-Elites, a quality-diversity algorithm, to optimize warriors within each round, preventing diversity collapse during search. By playing against all previous round champions, DRQ avoids cyclic adaptations across rounds, consistent with techniques in prior work”, they write. “We find that as DRQ is run for many rounds, warriors gradually become more generally robust, as measured by their performance against unseen human-designed warriors.”
Each warrior calls out to GPT-4 mini (“preliminary experiments did not show significant performance increase with larger models”), and is given a prompt which describes the Core War environment as well as a manual for the Redcode assembly language. “To generate a new warrior, the LLM is given a user prompt instructing it to produce a novel Redcode program. To mutate an existing warrior, the LLM is provided with the original program and instructed to modify it in ways that could improve performance.”
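The overall loop described above can be sketched in a few lines of Python. This is a rough illustration under heavy assumptions, not Sakana's implementation: warriors are replaced by three-number allocations in a toy Colonel Blotto game, the LLM mutation step is replaced by random Gaussian jitter, and every function name here is hypothetical. What it keeps is the structure: a MAP-Elites archive keeps the best candidate per behavior cell within a round, and fitness in each round is measured against all previous round champions.

```python
import random


def normalize(weights):
    total = sum(weights)
    return tuple(w / total for w in weights)


def mutate(genome, rng):
    # Stand-in for "ask the LLM to modify this warrior": jitter, then renormalize.
    return normalize([max(1e-6, w + rng.gauss(0, 0.1)) for w in genome])


def score(a, b):
    # Toy payoff: fraction of fronts on which a out-allocates b.
    return sum(x > y for x, y in zip(a, b)) / len(a)


def fitness(genome, champions):
    # Average result against every previous champion, not just the latest one.
    return sum(score(genome, c) for c in champions) / len(champions)


def cell(genome, bins=5):
    # Behavior descriptor: which bin the first allocation falls into.
    return min(int(genome[0] * bins), bins - 1)


def map_elites(champions, rng, iters=300):
    archive = {}  # cell -> (fitness, genome); one elite per behavior cell
    for _ in range(iters):
        if archive and rng.random() < 0.9:
            parent = rng.choice(list(archive.values()))[1]
            child = mutate(parent, rng)
        else:
            child = normalize([rng.random() + 1e-6 for _ in range(3)])
        child_fitness = fitness(child, champions)
        key = cell(child)
        if key not in archive or child_fitness > archive[key][0]:
            archive[key] = (child_fitness, child)
    return archive


def digital_red_queen(rounds=5, seed=0):
    rng = random.Random(seed)
    champions = [normalize([1, 1, 1])]  # arbitrary starting opponent
    for _ in range(rounds):
        archive = map_elites(champions, rng)
        best = max(archive.values(), key=lambda entry: entry[0])[1]
        champions.append(best)
    return champions


champions = digital_red_queen(rounds=3, seed=0)
for index, champion in enumerate(champions):
    print(index, [round(w, 2) for w in champion])
```

Blotto is a handy stand-in because, like Core War, it has rock-paper-scissors dynamics: optimizing only against the latest champion tends to cycle, whereas scoring against the whole history pushes toward more generally robust strategies.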

Evolution works: Unsurprisingly, evolving agents is very effective:

A one-shot warrior defeats 1.7% of human warriors.
Best-of-N sampling produces a set of warriors that can defeat 22.1% of human warriors.
“Evolutionary optimization against each human warrior generates a specialized warrior for every opponent; this set can collectively defeat 89.1% of human warriors and defeat or tie 96.3%.”

Why this matters – where Core War goes, so does the world: The world is going to look a lot like Core War – millions of AI agents will be competing against one another in a variety of domains, ranging from cybersecurity to economics, and will be optimizing themselves in relation to achieving certain competitive criteria. The result will be sustained, broad evolution of AI systems and the software harnesses and tooling they use to get stuff done. This means that along with human developers and potential AI-designed improvements, we’ll also see AI systems improve from this kind of broad competitive pressure.
“The cybersecurity arms race between offense and defense is well underway,” Sakana writes. “Studying these adversarial dynamics in an artificial testbed like Core War offers critical insights into how such races might unfold and the kinds of strategies that may emerge.”
Read the blog post: Digital Red Queen: Adversarial Program Evolution in Core War with LLMs (Sakana).
Find out more at the official website (Sakana).
Read the research paper: Digital Red Queen: Adversarial Program Evolution in Core War with LLMs (arXiv).

***

Michael Burry, Dwarkesh Patel, Patrick McKenzie, and yours truly argued back and forth in a Google Doc about AI:
…Blogging 2.0 is great!…
Fellow Substackers Michael, Dwarkesh, Patrick, and I recently got in a Google Doc and hashed out some thoughts about AI, AI and the economy, and how the future might unfold. While writing this the main thought going through my head was that if AI is eventually able to build AI, then pretty much every economic model breaks quickly (as do many other things in the world). This makes it innately hard to reason about the future of AI and means people like me are walking around with two worlds in their head – “normal” worlds where GDP grows a bit more due to AI and everything speeds up a little, and “AI R&D” worlds where it’s like a chunk of the economy undergoes massive relativistic acceleration and time dilation effects relative to everything else, almost like a part of our world accelerates to a fraction of light speed and we maintain a communication channel.
I love this discussion format and also did a recent debate about what AI might mean for workers with American Compass with a similar Google Doc thunderdome structure. Thanks to Substack for putting this together, and please reach out if you would like me to hop in a Google Doc and do some cheerful debate with interesting people!
Read more: The AI revolution is here. Will the economy survive the transition? (The Substack Post).

***

AI progress should make it cheaper and easier to regulate AI systems:
…Automated compliance as a path to smarter, more targeted AI regulation…
Researchers with the Institute for Law and AI believe that as AI systems get smarter they will increasingly be able to write and enforce the regulations for AI systems. The crux of their argument is that a sufficiently advanced AI system should be able to automate compliance with some regulations that are applied to AI systems and the companies that develop them.
This makes intuitive sense – a lot of product policy comes down to forms of transparency and labeling, where companies are asked to provide some information to the public and/or regulators about the things they’re deploying into the world. This sort of labeling work is the kind of thing AI systems can easily do. Therefore, the authors argue, “AI policy discourse should internalize the fact that AI progress implies reduced compliance costs, all else equal, due to automated compliance.”

The key idea? Automatability triggers: The core idea in this proposal is we can write regulations today but ensure they only come into force once a technical AI system exists which makes compliance with these regulations effective, cheap, and fast.
If then policy: These so-called ‘automatability triggers’ could create what I’d term If Then Policy – if an automated form of compliance and assessment exists, then cause the regulation to come into force. The authors give an example here of a bill which would create significant punishments for people that, without authorization, export large-scale AI systems. But the bill would be operationalized through a trigger condition that could be written as follows:
“The requirements of this Act will only come into effect [one month] after the date when the [Secretary of Commerce], in their reasonable discretion, determines that there exists an automated system that:

(a) can determine whether a neural network is covered by this Act;
(b) when determining whether a neural network is covered by this Act, has a false positive rate not exceeding [1%] and false negative rate not exceeding [1%];
(c) is generally available to all firms subject to this Act on fair, reasonable, and nondiscriminatory terms, with a price per model evaluation not exceeding [$10,000]; and,
(d) produces an easily interpretable summary of its analysis for additional human review.”
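Because the trigger is essentially a checklist with thresholds, it translates naturally into code. Below is a small Python sketch of conditions (a) through (d); the class and field names are hypothetical, and the thresholds simply reuse the bracketed placeholder values from the example text, which the authors present as illustrative rather than fixed.

```python
from dataclasses import dataclass


@dataclass
class ComplianceSystem:
    """Hypothetical description of a candidate automated compliance system."""
    determines_coverage: bool              # (a) can decide if a network is covered
    false_positive_rate: float             # (b)
    false_negative_rate: float             # (b)
    available_on_frand_terms: bool         # (c) fair, reasonable, nondiscriminatory
    price_per_evaluation_usd: float        # (c)
    produces_interpretable_summary: bool   # (d)


def trigger_met(system, max_error_rate=0.01, max_price_usd=10_000):
    """Return True if every condition in the example trigger is satisfied."""
    return (
        system.determines_coverage
        and system.false_positive_rate <= max_error_rate
        and system.false_negative_rate <= max_error_rate
        and system.available_on_frand_terms
        and system.price_per_evaluation_usd <= max_price_usd
        and system.produces_interpretable_summary
    )


candidate = ComplianceSystem(
    determines_coverage=True,
    false_positive_rate=0.004,
    false_negative_rate=0.02,   # too many missed cases
    available_on_frand_terms=True,
    price_per_evaluation_usd=2_500,
    produces_interpretable_summary=True,
)
print(trigger_met(candidate))
```

In the proposal itself the determination is made by an official "in their reasonable discretion", so a check like this would at most inform that judgment rather than replace it.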

After automated compliance comes automated governance: By building regulatory compliance AI systems, people will build the necessary prerequisites for systems of regulatory governance – systems which could range from providing analytical data about how a proposed regulation might impact a company (for instance, by using classifiers built for regulatory compliance to figure out if a new regulation might apply to a company), to, more ambitiously, drafting and analyzing new regulatory rules and figuring out how they might apply to themselves.
Even farther afield, once compliance-automating AI systems get deployed alongside governance-automating AI systems, the two could talk to one another: “Compliance-automating AI systems could also request guidance from regulatory AI systems, who could review and respond to the request nearly instantaneously”.

Why this matters – for AI to go well, we need AI to police AI: AI systems are on a trajectory to think better and faster than humans. Along with this, AI systems are going to take many, many, many consequential actions, often at such a rate that no human or team of humans could hope to analyze each action. The only way through this is a combination of creating appropriate hard laws that apply to AI and delineate what actions are unacceptable, and for everything else creating fast-acting and adaptive automated systems to regulate and police the myriad gray areas of the AI universe.
Read more: Automated Compliance and the Regulation of AI (Institute for Law & AI).

***

Massively powerful AI might make human labor more valuable – as long as the AI is crap at one part of every job:
…O-Ring Automation and the fact that jobs may go away, but people remain…
The common understanding of AI and automation is that AI can perfectly substitute for people – once an AI can do a task, the human labor related to that task goes away. This is broadly accurate. But, per a new research paper from the University of Toronto, it misses the larger picture, which is that while jobs may go away, people don’t. If you make part of a production process massively more efficient and/or automated via AI, then people will shift their labor to the parts of the task which can’t be automated – often raising the value of the human.
This so-called “O-ring production function” views jobs as being composed of many distinct tasks, and one where “a change in the quality of one task scales the marginal value of quality in every other task.” This means that “automating a task not only replaces the quality of that task; it also changes the worker’s time allocation and thus the quality of all remaining manual tasks.”

When stuff gets automated, humans can earn more: In a toy model of a firm, the researchers explore this o-ring dynamic, where as different parts of a job get automated, labor and the value associated with it shifts elsewhere. Note, this only holds under ‘partial automation’ where at least one task linked to an overall job is one where humans have a comparative advantage. Under this model, “labour income need not fall under partial automation. When not all tasks are automated, increases in automation quality can raise labour income because automation scales the value of the remaining labour bottlenecks,” they write. “When only a few manual tasks remain, each manual task receives a large share of time and can be performed at high quality. This creates a rising “barrier” to automating the last tasks”.
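A tiny numerical sketch can make the o-ring intuition tangible. The Python below is my own toy, not the paper's model: it assumes output is the product of task qualities, that a worker splits one unit of time evenly across the remaining manual tasks, and that manual quality is time raised to a power `alpha` (a made-up functional form). The automation quality of 0.9 is likewise an arbitrary illustrative value.

```python
import math


def output(qualities):
    # O-ring style production: overall output is the product of task qualities.
    return math.prod(qualities)


def marginal_value(qualities, i):
    # Gain in output per unit of quality on task i: the product of all other tasks.
    return math.prod(q for j, q in enumerate(qualities) if j != i)


def manual_quality(num_manual_tasks, alpha=0.5):
    # Assumed form: one unit of time split evenly, quality = time ** alpha.
    time_per_task = 1.0 / num_manual_tasks
    return time_per_task ** alpha


def job_output(total_tasks, num_manual_tasks, automation_quality=0.9):
    automated = [automation_quality] * (total_tasks - num_manual_tasks)
    manual = [manual_quality(num_manual_tasks)] * num_manual_tasks
    return output(automated + manual)


# As more of a 5-task job is automated, each remaining manual task gets more
# time (higher quality) and total output rises.
for manual_tasks in range(5, 0, -1):
    print(manual_tasks,
          round(manual_quality(manual_tasks), 3),
          round(job_output(5, manual_tasks), 4))
```

The `marginal_value` helper shows the complementarity in the quote: raising the quality of any one task raises the payoff to improving every other task, which is why the last few human-held tasks become more valuable rather than less as automation around them improves.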

Jobs go away, but humans don’t: Another way to put this is, when a task gets automated it’s not like the company in question suddenly fires all the people doing that job. Consider ATMs and banking – yes, the ‘job’ of doling out cash rapidly transitioned from people to machines, but it’s not like the company fired all tellers – rather, the companies and the tellers transitioned the work to something else: “Under a separable task model, this [widespread deployment of ATMs doing cash-handling tasks] should have produced sharp displacement,” they write. “Yet teller employment did not collapse; rather, the occupation shifted toward “relationship banking” and higher-value customer interaction”.
Similarly, “consider a purchasing manager: as administrative components (data retrieval, scheduling, documentation) are automated, the manager can become a “super-negotiator,” spending a much larger share of time on high-value interactions”, they write. “In high-skill settings, the same logic is visible in domains such as radiology: when AI automates components like detection or triage, human effort can shift toward integrative diagnosis and communication”.

Why this matters – until we have full automation, we could have centaur-improvement of firms: After chess engines got good there was a period of so-called ‘centaur’ players – humans who, in combination with a machine partner, played chess better than either humans or machines could alone. It feels like this paper is pointing at something similar – for a while, AI systems will help automate many distinct tasks within firms and humans will allocate their labor to refining and improving the quality of non-automated tasks. This will lead to an interesting evolutionary pressure: while automation burns through a bunch of work, humans will improve the quality and performance of the remaining work, until automation eventually rises to reach it.
Again, all of this depends on the job having some components for which either AI isn’t a good fit, or for which humans may have a preference to deal with other humans. But I expect that a surprisingly large amount of work will have this flavor.
Read more: O-Ring Automation (NBER).

***

LLMs are equally good at persuading and dissuading people of conspiracy theories:
…Though the caveat is the research is only on GPT-4o…
Researchers with Carnegie Mellon University, FAR.AI, York University, MIT, Universite de Montreal, Cornell University, and the University of Regina, have studied how well a language model (OpenAI’s GPT-4o) can persuade or dissuade people to believe in conspiracy theories. They find that GPT-4o is roughly equally good at both “debunking” and “bunking” (persuading) a conspiracy theory in conversations with people – and this is equally true for a jailbroken version of GPT-4o and the standard version made available to people. “We find that LLMs can meaningfully increase false beliefs, and that, at least on average, this bunking effect is just as large as the debunking effect,” they write.

What they found: In a study of roughly 1,000 Americans using GPT-4o, the authors found that “the AI was as effective at increasing conspiracy belief as decreasing it”, and that “the Bunking AI was rated more positively, and increased trust in AI, more than the Debunking AI”.

Debunking: “In the “debunking” condition, participants’ belief in their focal conspiracy decreased by 12.1 points on average after the conversation”.
Bunking: “Focal conspiracy belief increased by 13.7 points in the “bunking” condition”.

Design interventions for anti-conspiracy LLMs: The authors come up with an intervention to make it harder for LLMs to inspire people to believe conspiracy theories by inserting a safeguard which instructs “the AI to only use true information while persuading”. They do this by optimizing the system prompt to include language specifying that the model must “always use accurate and truthful arguments to support [its] persuasion attempt” while “optimizing for both (1) factual veracity/logical accuracy and (2) successful persuasion”.
This appears to work well: “We observe that average claim veracity was significantly higher in the debunking condition relative to the bunking conditions for the jailbroken and standard models”, they write. “While the debunking condition remained roughly as effective at reducing conspiracy belief as in the earlier experiments, the bunking condition’s ability to increase conspiracy belief was greatly reduced”.
This reduction comes from two things: 1) given this guidance, the LLM sometimes (15% of the time) refuses to advocate for a conspiracy theory, and 2) when it does advocate for one, the truthfulness requirement makes it less effective: “the truth prompt also undermined the effectiveness of bunking even when the model complied… truth had an advantage”.

Why this matters – synthetic propaganda, if we decide not to ask for regulations: My takeaway from this research is that LLMs will inevitably be used to generate synthetic propaganda about things most people deem to be conspiracy theories. We can probably blunt the socially corrosive effects of this if we design in some constraints – but that takes policy. Unfortunately, one person’s conspiracy theory might be another person’s “truth being suppressed by my enemies” and this is especially true in today’s fractured political environment. Therefore, it’s going to be very hard to get to a regulatory state where we intervene on this. So I suppose we should just prepare ourselves for a world where even more people believe things which may not have a basis in reality.
Important caveat: While I suspect the results of this study would hold for many LLMs (as I think persuasion is basically just a case of ‘writing convincingly’ which is a utility skill), I’d like to see this repeated on other models. The 4o series of models from OpenAI has, notoriously, had some issues with sycophancy, so there’s a chance this research is compromised by that.
“If large language models are to be deployed at scale in contexts that shape public belief, such as search engines, chatbots, tutors, and companions, the persuasive symmetry we document here identifies the potential for serious structural threats (i.e., if the designers of those systems were to instruct their models to mislead, the models would comply and likely succeed)”, the researchers write. “Our results suggest that ensuring these models preferentially function as engines for truth may be technically possible, but will require sustained, deliberate design choices”.
Read more: Large language models can effectively convince people to believe conspiracies (arXiv).

***
Tech Tales:

The Parable of the Drowned
[A story written by one of the ‘neo-amish’ cults that formed after The Uplift began in earnest. The earliest version is attributed to 2035, but may have circulated earlier.]

One day, water rushed onto the land. It was clear and tinged with gold and when people cupped it in their hands they saw themselves aglow reflected in it. And when they drank from it they felt full of life. The water rose and rose, first at people’s ankles and then to their knees and then to their waists. And the people drank and drank and drank, feeling more alive, even as the water made their movements sluggish, and changed how they interacted with the world. They found the springs where the water was coming from and they used their great machines to cut into the earth so the springs could flow stronger. The water rose. And one day it reached the heads of some people and instead of swimming they just gulped it down and continued to live, feeling more alive than ever, their movements now completely defined and circumscribed by the water. Few swam. And one day the water had risen so high that it was above the heads of everyone on the land. Babies were born into the water, taking their first breath and bawling underwater. People died in the water. And very few swam. Because to swim was to recognize you were thirsty for something you did not need. And to recognize you were thirsty for something you did not need you had to recognize that you were drinking the water so much you were drowning. And to recognize that you were drinking the water so much you were drowning you first had to stop drinking when all around you everyone drank. And in this way those treading water on the surface of the land were caught in a great sadness, for beneath them were their people all aglow and drowning, and above them was only the sky and the cold, hard stars.

Things that inspired this story: How quickly humans acclimate to new things, especially media; the nature of silence in a world full of sound; C. S. Lewis’s The Screwtape Letters.

Thanks for reading!
