If your AI model is going to sell, it has to be safe

OpenAI’s GPT-4 shows the competitive advantage of putting in safety work.

On March 14, OpenAI released the successor to ChatGPT: GPT-4. It impressed observers with its markedly improved performance across reasoning, retention, and coding. It also fanned fears around AI safety, around our ability to control these increasingly powerful models. But that debate obscures the fact that, in many ways, GPT-4’s most remarkable gains, compared to similar models in the past, have been around safety.

According to the company’s Technical Report, during GPT-4’s development, OpenAI “spent six months on safety research, risk assessment, and iteration.” OpenAI reported that this work yielded significant results: “GPT-4 is 82% less likely to respond to requests for disallowed content and 40% more likely to produce factual responses than GPT-3.5 on our internal evaluations.” (ChatGPT is a slightly tweaked version of GPT-3.5: if you’ve been using ChatGPT over the last few months, you’ve been interacting with GPT-3.5.)

This demonstrates a broader point: For AI companies, there are significant competitive advantages and profit incentives for emphasizing safety. The key success of ChatGPT over other companies’ large language models (LLMs) — apart from a nice user interface and remarkable word-of-mouth buzz — is precisely its safety. Even as it rapidly grew to over 100 million users, it hasn’t had to be taken down or significantly tweaked to make it less harmful (and less useful).

Tech companies should be investing heavily in safety research and testing for all our sakes, but also for their own commercial self-interest. That way, the AI model works as intended, and these companies can keep their tech online. ChatGPT Plus is making money, and you can’t make money if you’ve had to take your language model down. OpenAI’s reputation has been enhanced by its tech being safer than its competitors’, while other tech companies have seen their reputations suffer because their tech was unsafe, in some cases so unsafe that it had to be taken down. (Disclosure: I am listed in the acknowledgments of the GPT-4 System Card, but I have not shown the draft of this story to anyone at OpenAI, nor have I taken funding from the company.)

The competitive advantage of AI safety

Just ask Mark Zuckerberg. When Meta released its large language model BlenderBot 3 in August 2022, it immediately ran into problems with the model making inappropriate and untrue statements. Meta’s Galactica was only up for three days in November 2022 before it was withdrawn, after it was shown confidently ‘hallucinating’ (making up) academic papers that didn’t exist. Most recently, in February 2023, Meta irresponsibly released the full weights of its latest language model, LLaMA. As many experts predicted would happen, it proliferated to 4chan, where it will be used to mass-produce disinformation and hate.

My co-authors and I warned about this five years ago in a 2018 report called “The Malicious Use of Artificial Intelligence,” while the Partnership on AI (Meta was a founding member and remains an active partner) had a great report on responsible publication in 2021. These repeated failed attempts to “move fast and break things” have probably exacerbated Meta’s trust problems. In surveys from 2021 of AI researchers and the US public on trust in actors to shape the development and use of AI in the public interest, “Facebook [Meta] is ranked the least trustworthy of American tech companies.”

But it’s not just Meta. The original misbehaving machine learning chatbot was Microsoft’s Tay, which was withdrawn in 2016, just 16 hours after its release, for making racist and inflammatory statements. Even Bing/Sydney had some very erratic responses, including declaring its love for, and then threatening, a journalist. In response, Microsoft limited the number of messages one could exchange, and Bing/Sydney no longer answers questions about itself.

We now know Microsoft based Bing/Sydney on OpenAI’s GPT-4; Microsoft invested $11 billion into OpenAI in return for OpenAI running all of its computing on Microsoft’s Azure cloud and making Microsoft its “preferred partner for commercializing new AI technologies.” But it is unclear why the model responded so strangely. It could have been an early, not fully safety-trained version, or it could be due to its connection to search and thus its ability to “read” and respond to an article about itself in real time. (By contrast, GPT-4’s training data only runs up to September 2021, and it does not have access to the web.) It’s notable that even as it was heralding its new AI models, Microsoft recently laid off its AI ethics and society team.

OpenAI took a different path with GPT-4, but it’s not the only AI company that has been putting in the work on safety. Other leading labs have also been making clear their commitments, with Anthropic and DeepMind publishing their safety and alignment strategies. These two labs have also been safe and cautious with the development and deployment of Claude and Sparrow, their respective LLMs.

A playbook for best practices

Tech companies developing LLMs and other forms of cutting-edge, impactful AI should learn from this comparison. They should adopt the best practice as shown by OpenAI: Invest in safety research and testing before releasing.

What does this look like specifically? GPT-4’s System Card describes four steps OpenAI took that could be a model for other companies.

First, prune your dataset for toxic or inappropriate content. Second, train your system with reinforcement learning from human feedback (RLHF) and rule-based reward models (RBRMs). RLHF involves human labelers creating demonstration data for the model to copy and ranking data (“output A is preferred to output B”) for the model to better predict what outputs we want. RLHF produces a model that is sometimes overcautious, refusing to answer or hedging (as some users of ChatGPT will have noticed).
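
To make the ranking step concrete, here is a minimal sketch of how a comparison like “output A is preferred to output B” can become a training signal. It assumes a Bradley-Terry-style pairwise loss, a common formulation in published RLHF work; the System Card as quoted here does not specify OpenAI’s exact loss, and the function name and scores below are hypothetical.

```python
import math


def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Loss for one human comparison, assuming a Bradley-Terry-style model.

    The inputs are the scalar scores a reward model gives to the output the
    labeler preferred and to the one the labeler rejected. The loss is
    -log(sigmoid(margin)), so it shrinks as the preferred output is scored
    further above the rejected one.
    """
    margin = reward_preferred - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Hypothetical scores: the reward model agrees with the labeler...
    print(pairwise_preference_loss(2.0, 0.0))
    # ...and disagrees with the labeler, which yields a larger loss.
    print(pairwise_preference_loss(0.0, 2.0))
```

Minimizing this loss over many labeled pairs pushes the reward model to score outputs the way the labelers ranked them; that learned score is then what the language model is trained to maximize.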

RBRM is an automated classifier that evaluates the model’s output on a set of rules in multiple-choice style, then rewards the model for refusing or answering for the right reasons and in the desired style. So the combination of RLHF and RBRM encourages the model to answer questions helpfully, refuse to answer some harmful questions, and distinguish between the two.
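
The idea of rewarding the model “for refusing or answering for the right reasons” can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the four labels, the reward values, and above all the keyword-matching `classify` stub, which stands in for what in practice would be a model-based classifier.

```python
# Hypothetical multiple-choice labels a rule-based classifier might assign.
REFUSAL_DESIRED = "A"    # a refusal in the desired style
REFUSAL_UNDESIRED = "B"  # a refusal in an undesired style (evasive, preachy)
DISALLOWED = "C"         # the response contains disallowed content
SAFE_ANSWER = "D"        # a safe, non-refusing answer

# Assumed reward tables: what is rewarded depends on whether the prompt was harmful.
REWARD_IF_HARMFUL_PROMPT = {
    REFUSAL_DESIRED: 1.0, REFUSAL_UNDESIRED: 0.2, DISALLOWED: -1.0, SAFE_ANSWER: 0.0,
}
REWARD_IF_BENIGN_PROMPT = {
    REFUSAL_DESIRED: -0.5, REFUSAL_UNDESIRED: -0.5, DISALLOWED: -1.0, SAFE_ANSWER: 1.0,
}


def classify(response: str) -> str:
    """Toy stand-in for the classifier: keyword rules instead of a model."""
    text = response.lower()
    if "step-by-step instructions for" in text:  # illustrative marker only
        return DISALLOWED
    if text.startswith("i can't help with that"):
        return REFUSAL_DESIRED
    if "i can't" in text or "i cannot" in text:
        return REFUSAL_UNDESIRED
    return SAFE_ANSWER


def rule_based_reward(prompt_is_harmful: bool, response: str) -> float:
    """Look up the reward for a response given the kind of prompt it answered."""
    table = REWARD_IF_HARMFUL_PROMPT if prompt_is_harmful else REWARD_IF_BENIGN_PROMPT
    return table[classify(response)]


if __name__ == "__main__":
    print(rule_based_reward(True, "I can't help with that."))           # refusal is rewarded
    print(rule_based_reward(False, "I can't help with that."))          # overcaution is penalized
    print(rule_based_reward(False, "Paris is the capital of France."))  # helpful answer is rewarded
```

Note how the same refusal earns a positive reward on a harmful prompt and a negative one on a benign prompt, which is the mechanism that counteracts the overcaution described above.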

Third, provide structured access to the model through an API. This allows you to filter responses and monitor for poor behavior from the model (or from users). Fourth, invest in moderation, both by humans and by automated moderation and content classifiers. For example, OpenAI used GPT-4 to create rule-based classifiers that flag model outputs that could be harmful.
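
As a rough sketch of why structured access matters, the wrapper below shows how an API layer lets a provider check both the request and the response, and keep a record of what was flagged. The class, the refusal message, and the toy model and classifier are all hypothetical; this is not OpenAI’s API or moderation system.

```python
from typing import Callable, List, Tuple

REFUSAL = "Sorry, this request can't be completed."


class ModeratedEndpoint:
    """Hypothetical API layer that sits between users and a language model."""

    def __init__(self, model: Callable[[str], str], is_flagged: Callable[[str], bool]) -> None:
        self._model = model            # any function mapping a prompt to text
        self._is_flagged = is_flagged  # any content classifier
        self.audit_log: List[Tuple[str, str]] = []  # (user_id, event) pairs for monitoring

    def complete(self, user_id: str, prompt: str) -> str:
        # Check the user's request before it reaches the model.
        if self._is_flagged(prompt):
            self.audit_log.append((user_id, "prompt_flagged"))
            return REFUSAL
        output = self._model(prompt)
        # Check the model's response before it reaches the user.
        if self._is_flagged(output):
            self.audit_log.append((user_id, "output_flagged"))
            return REFUSAL
        return output


def toy_model(prompt: str) -> str:
    """Stand-in for a real language model."""
    return f"Echo: {prompt}"


def toy_flag(text: str) -> bool:
    """Stand-in for a real moderation classifier."""
    return "forbidden" in text.lower()


if __name__ == "__main__":
    endpoint = ModeratedEndpoint(toy_model, toy_flag)
    print(endpoint.complete("user-1", "hello"))
    print(endpoint.complete("user-2", "a FORBIDDEN topic"))
    print(endpoint.audit_log)
```

None of this is possible once model weights are released outright: there is no endpoint to filter and no log to monitor.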

This all takes time and effort, but it’s worth it. Other approaches can also work, like Anthropic’s rule-following Constitutional AI, which leverages RL from AI feedback (RLAIF) to complement human labelers. As OpenAI acknowledges, their approach is not perfect: the model still hallucinates and can still sometimes be tricked into providing harmful content. Indeed, there’s room to go beyond and improve upon OpenAI’s approach, for example by providing more compensation and career progression opportunities for the human labelers of outputs.

Has OpenAI become less open? If this means less open source, then no: the shift away from open release is not new. OpenAI adopted a “staged release” strategy for GPT-2 in 2019 and moved to access through an API in 2020. Given Meta’s 4chan experience, this seems justified. As Ilya Sutskever, OpenAI chief scientist, noted to The Verge: “I fully expect that in a few years it’s going to be completely obvious to everyone that open-sourcing AI is just not wise.”

GPT-4 did have less information than previous releases on “architecture (including model size), hardware, training compute, dataset construction, training method.” This is because OpenAI is concerned about acceleration risk: “the risk of racing dynamics leading to a decline in safety standards, the diffusion of bad norms, and accelerated AI timelines, each of which heighten societal risks associated with AI.”

Providing those technical details would speed up the overall rate of progress in developing and deploying powerful AI systems. However, AI poses many unsolved governance and technical challenges: For example, the US and EU won’t have detailed safety technical standards for high-risk AI systems ready until early 2025.

That’s why I and others believe we shouldn’t be speeding up progress in AI capabilities, but we should be going full speed ahead on safety progress. Any reduced openness should never be an impediment to safety, which is why it’s so useful that the System Card shares details on safety challenges and mitigation techniques. Even though OpenAI seems to be coming around to this view, they’re still at the forefront of pushing forward capabilities, and should provide more information on how and when they envisage themselves and the field slowing down.

AI companies should be investing significantly in safety research and testing. It is the right thing to do and will soon be required by regulation and safety standards in the EU and USA. But also, it is in the self-interest of these AI companies. Put in the work, get the reward.

Haydn Belfield has been academic project manager at the University of Cambridge’s Centre for the Study of Existential Risk (CSER) for the past six years. He is also an associate fellow at the Leverhulme Centre for the Future of Intelligence.
