One of the coolest AI systems I’ve ever seen may also be the one that will kick me out of my job.

Earlier this week, I attended a demo with a research team at OpenAI, the San Francisco nonprofit that’s right up there with top tech companies in conducting impressive new research on the frontiers of AI. The system they showed me was a language-learning model that writes the news, answers reading comprehension problems, and is beginning to show promise at tasks like translation.

In a paper released Thursday, the OpenAI team demonstrates that we can get those results from an “unsupervised” AI — meaning the system learned from reading 8 million internet articles, not from being explicitly trained for the tasks. Their AI advances the state of the art — in some cases, by a lot. The OpenAI team says their system sets a record for performance on so-called Winograd schemas, a tough reading comprehension task; achieves near-human performance on the Children’s Book Test, another check of reading comprehension; and — most thrillingly to me — generates its own text, including highly convincing news articles and Amazon reviews.

Here’s what happens when you give the system a one-sentence prompt and invite it to write the rest of this article:


The AI selects words one at a time and then considers what the next one should be. It takes a few seconds to add sentences. It’s by no means perfect: The prose is pretty rough, there’s the occasional non sequitur, and the articles get less coherent the longer they get. “The model still does seem to drift off topic eventually, and the output is capped at a few hundred words,” Sam Bowman, who works on natural language processing and computational linguistics at NYU, told me in an email.
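For readers who want to see what "one word at a time" looks like in practice, here is a toy sketch. It is not GPT-2: the real system is a large neural network that looks at the whole preceding text, while this example only looks at the previous word and picks among the words that followed it in a tiny made-up corpus. The function names, the corpus, and the `seed` parameter are all illustrative assumptions.

```python
import random
from collections import defaultdict


def train_bigrams(text):
    """Map each word to the list of words that followed it in the text."""
    words = text.split()
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers


def generate(followers, prompt, max_words=20, seed=0):
    """Extend a non-empty prompt one word at a time, based on the last word so far."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same text
    output = prompt.split()
    for _ in range(max_words):
        candidates = followers.get(output[-1])
        if not candidates:  # no known continuation for this word: stop early
            break
        output.append(rng.choice(candidates))
    return " ".join(output)


# Hypothetical miniature "training set"; the real one was millions of web pages.
corpus = "the storm hit the coast and the storm closed the roads"
model = train_bigrams(corpus)
print(generate(model, "the storm", max_words=8))
```

Even this toy shows why longer outputs tend to drift: each new word is chosen from local context, with nothing enforcing a plan for the whole piece.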

And to be clear, while the AI can write news articles that are sometimes convincing enough that I wouldn’t be surprised to see them in the newspaper, it can’t write true news articles; the quotes and statistics are all made up.

Advantage human journalists — for now.

We’ve made huge strides in natural language processing over the past decade. Translation has improved, becoming high-quality enough that you can read news articles in other languages. Google demonstrated last summer that Google Assistant can make phone calls and book appointments while sounding just like a human (though the company promised it won’t use deceptive tactics in practice).

AI systems are seeing similarly impressive gains outside natural language processing. New techniques, along with more computing power, have allowed researchers to generate photorealistic images, excel at two-player games like Go, and compete with the pros in strategy video games like StarCraft and Dota 2.

But even for those of us who are used to seeing fast progress in this space, the latest release from OpenAI is pretty impressive.

Until now, researchers trying to get world-record results on language tasks would “fine-tune” their models to perform well on the specific task in question — that is, the AI would be trained for each task.

The OpenAI system, called GPT-2, needed no fine-tuning: It turned in a record-setting performance at lots of the core tasks we use to judge language AIs, without ever having seen those tasks before and without being specifically trained to handle them. It also started to demonstrate some talent for reading comprehension, summarization, and translation with no explicit training in those tasks.

GPT-2 is the result of an approach called “unsupervised learning.” Here’s what that means. The predominant approach in industry today is “supervised learning.” That’s where you have large, carefully labeled data sets that contain desired inputs and desired outputs. You teach the AI how to produce the outputs given the inputs.

That can get great results, but it requires building huge data sets and carefully labeling each bit of data. And it’s worth noting that supervised learning isn’t how humans acquire skills and knowledge. We make inferences about the world without the carefully delineated examples from supervised learning.

Many people believe that advances in general AI capabilities will require advances in unsupervised learning — that is, where the AI just gets exposed to lots of data and has to figure out everything else itself. Unsupervised learning is easier to scale since there’s lots more unstructured data than there is structured data, and unsupervised learning may generalize better across tasks.
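To make the contrast with labeled data concrete, here is a minimal sketch of how plain text can supply its own training examples. It assumes the training signal is next-word prediction, which fits the word-at-a-time behavior described earlier but is not spelled out in this article. The function name and the `context_size` parameter are hypothetical.

```python
def next_word_examples(text, context_size=3):
    """Turn raw text into (context, target) pairs.

    No human labeling is involved: the "answer" for each example
    is simply the word that actually came next in the text.
    """
    words = text.split()
    examples = []
    for i in range(1, len(words)):
        # Use up to `context_size` preceding words as the input.
        context = tuple(words[max(0, i - context_size):i])
        examples.append((context, words[i]))
    return examples


for context, target in next_word_examples("the trophy does not fit in the suitcase"):
    print(context, "->", target)
```

Every sentence on the internet yields examples like these for free, which is one way to see why this kind of learning scales more easily than hand-labeled data sets.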

One task that OpenAI used to test the capabilities of GPT-2 is a famous test in machine learning known as the Winograd schema test. A Winograd schema is a sentence that’s grammatically ambiguous but not ambiguous to humans — because we have the context to interpret it.

For example, take the sentence: “The trophy doesn’t fit in the brown suitcase because it’s too big.”

To a human reader, it’s obvious that this means the trophy is too big, not that the suitcase is too big, because we know how objects fitting into other objects works. AI systems, though, struggle with questions like these.

Before this paper, state-of-the-art AIs that could solve Winograd schemas got them right 63.7 percent of the time, OpenAI says. (Humans almost never get them wrong.) GPT-2 gets these right 70.7 percent of the time. That’s still well short of human-level performance, but it’s a striking gain over what was previously possible.

GPT-2 set records on other language tasks, too. LAMBADA is a task that tests a computer’s ability to use context mentioned earlier in a story in order to complete a sentence. The previous best performance had 56.25 percent accuracy; GPT-2 achieved 63.24 percent accuracy. (Again, humans get these right more than 95 percent of the time, so AI hasn’t replaced us yet — but this is a substantial jump in capabilities.)
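The size of those jumps is easier to compare side by side. The short calculation below uses only the accuracy figures quoted above. The 95 percent human figure for LAMBADA is treated as a floor, since the article says humans score "more than" that, so the remaining gap is a lower bound. The dictionary layout and function name are illustrative.

```python
# Accuracy figures (percent) as reported by OpenAI and quoted in the article.
results = {
    "Winograd schemas": {"previous": 63.7, "gpt2": 70.7},
    "LAMBADA": {"previous": 56.25, "gpt2": 63.24},
}

# Humans score "more than 95 percent" on LAMBADA, so this is a floor.
LAMBADA_HUMAN_FLOOR = 95.0


def gain_points(previous, gpt2):
    """Improvement in percentage points, rounded to two decimals."""
    return round(gpt2 - previous, 2)


for task, scores in results.items():
    print(task, "gain:", gain_points(scores["previous"], scores["gpt2"]), "points")

remaining_gap = round(LAMBADA_HUMAN_FLOOR - results["LAMBADA"]["gpt2"], 2)
print("LAMBADA gap to humans: at least", remaining_gap, "points")
```

On both tasks the gain is about seven percentage points, and on LAMBADA the distance still to cover is several times larger than the distance just gained.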

One skeptical perspective on text-generation AI systems, Bowman pointed out, is that “models like this can sometimes look deceptively good by just repeating the exact texts that they were trained on.” For example, it’s easy to have coherent paragraphs if you’re plagiarizing whole paragraphs from other sources. But that’s not what’s going on here: “This is set up in a way that it can’t really be doing that.” Since it selects one word at a time, it’s not plagiarizing.

Another skeptical perspective on AI advances like this one is that they don’t reflect “deep” advances in our understanding of computer systems, just shallow improvements that come from being able to use more data and more computing power. Critics argue that almost everything heralded as an AI advance is really just incremental progress from adding more computing power to existing approaches.

The team at OpenAI contested that. GPT-2 uses a new neural network design called the Transformer, invented 18 months ago by researchers at Google Brain. Some of the gains in performance are certainly thanks to more data and more computing power, but they’re also driven by powerful recent innovations in the field, as we’d expect if AI as a field is improving on all fronts.

“It’s more data, more compute, cheaper compute, and architectural improvements — designed by researchers at Google about a year and a half ago,” OpenAI researcher Jeffrey Wu told me. “We just want to try everything and see where the actual results take us.”

The team at OpenAI is making the unusual choice not to release their system publicly for everyone to interact with. That’s too bad — take it from me, it’s incredibly fun to try out — but they have a very good reason.

OpenAI has been active in trying to figure out how to limit the potential for misuse of AI, and they’ve concluded that in some cases, the right solution is limiting what they publish.

With a tool like this, for example, it’d be easy to spoof Amazon reviews and pump out fake news articles in a fraction of the time a human would need. A slightly more sophisticated version might be good enough to let students generate plagiarized essays and spammers improve their messaging to targets.

“I’m worried about trolly 4chan actors generating arbitrarily large amounts of garbage opinion content that’s sexist and racist,” OpenAI policy director Jack Clark told me. He also worries about “actors who do stuff like disinformation, who are more sophisticated,” and points out that there might be other avenues for misuse we haven’t yet thought of. So they’re keeping the tool offline, at least for now, while everyone can weigh in on how to use AIs like these safely. (There’s a smaller version publicly available to try.)


[Image: OpenAI’s generative language AI writes an article about a snowstorm. GPT-2 demo environment courtesy of OpenAI; image by Javier Zarracina/Vox.]

This story about a snowstorm in the Northeast — complete with invented quotes from local authorities — took about 10 seconds to “write.”

Of course, keeping some capabilities private might have fairly little effect. “I’m confident that a single person working alone with enough compute resources could reproduce these results within a month or two (either a hobbyist with a lot of equipment and time, or more likely, researchers at a tech company),” Bowman wrote me. “Given that it is standard practice to make models public, this decision is only delaying the release of models like this by a short time.” And keeping capabilities private has drawbacks — it makes it harder for the general public to independently evaluate the work that’s being done.

“We want to communicate about what we’ve done in a responsible manner that empowers other important stakeholders, like journalists and policymakers, to also understand and verify what we’ve done,” Clark wrote me. That’s a balance that will take some practice to strike appropriately, but they think that this discovery — with all its potential to be a dangerous tool for fake news — is the right place to start investigating how to strike it.

While they’re not making the language generator itself available online, the team at OpenAI gave me access to it so I could check out their results and write this piece.

So, of course, I did the obvious — I asked OpenAI’s new language AI to finish up this article for me.

“OpenAI released a new language model called GPT-2 which surpasses state-of-the-art on a variety of tasks,” I prompted it. Over the next 15 seconds, the AI expanded on that for me.

Skeptics say that language models cannot do reasoning, but GPT-2 is proving far more powerful than even the best artificial intelligence models can. […]

GPT-2 is not perfect — not even the best AI models are — but it is improving with each passing day. As the technology progresses, the more complex it becomes to build models. AI researchers continue to create models for tasks like image recognition and machine vision. […]

It went on:

Artificial intelligence is changing our lives in the fastest and most positive way possible. Artificial intelligence will allow intelligent robots and computers to work alongside us, and can even give us the ability to work alongside other intelligent agents, whether they will be humans or machines.

The AI ended on a positive note — or, depending on how you look at it, a terrifying one:

This is the future, and the field of AI is going places we could only have guessed at.

