At GDC 2024, Google AI senior engineers Jane Friedhoff (UX) and Feiyang Chen (software) presented the results of their Werewolf AI experiment, in which every player, innocent villager and cunning werewolf alike, was a large language model (LLM).
Friedhoff and Chen trained each LLM chatbot to generate conversations with a unique personality, strategize based on the character, reason about what the other players (AI or human) were hiding, and then vote for the most suspicious person (or werewolf scapegoat).
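That loop — a persona, a running memory of the conversation, a speaking turn, and an end-of-round vote — can be sketched as a simple agent wrapper. This is an illustrative sketch, not Google's actual implementation: the `WerewolfAgent` class and the `stub_llm` stand-in (used so the example runs without a real model) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class WerewolfAgent:
    """One LLM-driven Werewolf player: persona + memory + speak + vote."""
    name: str
    persona: str                                 # e.g. "paranoid", "theatrical"
    llm: callable                                # any prompt -> text function
    memory: list = field(default_factory=list)   # transcript the agent has seen

    def observe(self, speaker, line):
        # Everything said at the table goes into the agent's memory.
        self.memory.append(f"{speaker}: {line}")

    def speak(self):
        # Ask the model for a line of table-talk that fits the persona.
        prompt = (f"You are {self.name}, a {self.persona} player in Werewolf.\n"
                  "Conversation so far:\n" + "\n".join(self.memory) +
                  "\nSay one line that fits your personality and strategy.")
        line = self.llm(prompt)
        self.observe(self.name, line)            # the agent hears itself too
        return line

    def vote(self, candidates):
        # End-of-round vote for the most suspicious player.
        prompt = (f"You are {self.name}. Based on this conversation:\n" +
                  "\n".join(self.memory) +
                  f"\nWhich of {candidates} is most likely the werewolf? "
                  "Answer with a single name.")
        return self.llm(prompt).strip()

# Stand-in "model" so the sketch runs: it always suspects Scott.
def stub_llm(prompt):
    if "single name" in prompt:
        return "Scott"
    return "I think someone is lying."

agent = WerewolfAgent("Isaac", "confident seer", stub_llm)
agent.observe("Scott", "I'm just a villager, trust me.")
print(agent.speak())                  # prints "I think someone is lying."
print(agent.vote(["Scott", "Mia"]))   # prints "Scott"
```

Swapping `stub_llm` for a real model call is the only change needed to make the agent "live"; the memory list is what the ablation tests below remove.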
They then put the bots to the test, measuring their ability to detect lies and their susceptibility to gaslighting. They also tested how well the LLMs performed when specific abilities, such as memory or deductive reasoning, were removed, to see how each loss affected the results.
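One simple way to run that kind of ablation is to control what the prompt contains: a "memoryless" bot only sees the last line, and a bot without deductive reasoning gets no instruction to reason before voting. This is a toy illustration of the idea, not the talk's actual method; the `build_prompt` function and its flags are hypothetical.

```python
def build_prompt(transcript, use_memory=True, use_reasoning=True):
    """Toy ablation: remove a capability by changing what the prompt includes."""
    # Memoryless bots only see the most recent line of dialogue.
    history = transcript if use_memory else transcript[-1:]
    parts = ["You are a villager in Werewolf.",
             "Conversation so far:", *history]
    if use_reasoning:
        # Reasoning-enabled bots are told to deliberate before voting.
        parts.append("Think step by step about who is lying before voting.")
    parts.append("Vote for the most suspicious player.")
    return "\n".join(parts)

transcript = ["Isaac: Scott is acting strangely.",
              "Scott: I'm innocent."]
full = build_prompt(transcript)
ablated = build_prompt(transcript, use_memory=False, use_reasoning=False)
```

Comparing how often the full and ablated bots vote correctly over many games gives exactly the kind of nine-in-ten versus three-in-ten gap the talk reported.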
The Google engineering team spoke candidly about the successes and shortcomings of the experiment. Under ideal circumstances, nine out of ten villagers came to the correct conclusion; without proper reasoning and memory, that dropped to three out of ten. The ablated bots were too cautious to reveal useful information and too suspicious of any claim, resulting in essentially random accusations against unlucky targets.
However, even with every capability intact, the bots tended to be overly suspicious of anyone who made bold claims early on (such as seers). The engineers tracked each bot's expected end-of-turn vote after every line of dialogue and found that once that initial skepticism set in, its opinion rarely changed, no matter what was said.
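That measurement — re-asking for the bot's current vote after every utterance and logging the answer — is easy to reproduce. A minimal sketch, with hypothetical names: `track_votes` does the logging, and `sticky_vote` is a stand-in voter that mimics the reported behavior by locking onto the first player to make a bold claim.

```python
def track_votes(agent_vote, transcript, candidates):
    """After each line of dialogue, record who the agent would vote for."""
    history, trace = [], []
    for line in transcript:
        history.append(line)
        trace.append((line, agent_vote(history, candidates)))
    return trace

def sticky_vote(history, candidates):
    # Stand-in mimicking the observed failure: suspect the first bold claimant
    # (here, whoever claims to be the seer) and never change your mind.
    for line in history:
        speaker, _, text = line.partition(": ")
        if "seer" in text.lower() and speaker in candidates:
            return speaker
    return candidates[0]

transcript = ["Isaac: I am the seer, and Scott is the werewolf.",
              "Scott: Don't trust him.",
              "Mia: Isaac seems honest to me."]
trace = track_votes(sticky_vote, transcript, ["Isaac", "Scott", "Mia"])
for line, vote in trace:
    print(f"{vote!r:10} after: {line}")   # the vote is "Isaac" every time
```

A flat trace like this one, where the vote never moves no matter what is said, is exactly the pattern the engineers saw.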
Although Google’s human testers said playing Werewolf against the AI bots was a lot of fun, they rated the bots’ reasoning at 2/5 or 3/5 and found that the best strategy for winning was to stay silent and let the bots take the blame for one another.
As Friedhoff explained, this is a reasonable strategy for a werewolf, but not necessarily an interesting one, and not really the point of the game. Players had more fun changing the bots’ personalities: in one instance, a tester told a bot to talk like a pirate for the rest of the game, which the bot did, while also growing suspicious and asking, “Why are you guys doing this?”
Beyond that, the tests exposed the limits of the bots’ reasoning. The engineers gave bots personalities, such as a paranoid bot suspicious of everyone or a theatrical bot that talks like a Shakespearean actor, and the other bots reacted to those personalities without any context: they found the drama bot suspicious for its verbosity and circumlocution, even though that was simply its default way of speaking.
In real-life Werewolf, the goal is to notice when people say or do something different from usual. This is where these LLMs fall short.
Friedhoff also shared a hilarious example of a bot’s hallucination leading the villagers astray. When Isaac (the seer bot) accused Scott (the werewolf bot) of being suspicious, Scott shot back that Isaac had accused the innocent “Liam” of being a werewolf and had him unfairly banished. Isaac responded defensively, and suspicion turned on him, even though Liam does not exist and the whole episode was fabricated.
Google’s AI efforts, like Gemini, get smarter over time. Another GDC panel showcased Google’s vision for generative AI video games that automatically respond to player feedback in real time and feature “hundreds of thousands” of LLM-powered NPCs that can remember player interactions and respond to them organically.
Still, experiments like this temper the bold plans of Google executives, showing how far artificial intelligence has to go before it is ready to replace scripted dialogue or real-life players.
It’s truly impressive that Chen and Friedhoff have managed to mimic the complexities of dialogue, memory, and reasoning found in party games like Werewolf! But these LLM bots need to go back to school before they’re consumer-ready.
Friedhoff, meanwhile, says that such LLM experiments are a great way for game developers to “contribute to machine learning research through games,” and their experiments suggest that players are more interested in crafting LLM personalities than in playing against them.
Ultimately, the idea of a mobile game with text-based characters that respond organically to player input is intriguing, especially for interactive fiction, which often requires hundreds of thousands of words of dialogue to give players enough choices.
If the best Android phones with AI-capable NPUs could deliver fast on-device LLM responses, organic AI characters could be a real game-changer for gaming. This Werewolf experiment is a good reminder that that future is still a long way off.