New AI model can hallucinate a game of 1993 Doom in real time

Aurich Lawson

On Tuesday, researchers from Google and Tel Aviv University unveiled GameNGen, a new AI model that can interactively simulate the classic 1993 first-person shooter game Doom in real time using AI image generation techniques borrowed from Stable Diffusion. It is a neural network system that can act as a limited game engine, and it may open up new possibilities for real-time video game synthesis in the future.

For example, instead of drawing video frames using traditional rendering techniques, future games could potentially use an AI engine to “imagine” or hallucinate graphics in real time as a prediction task.

“The potential here is absurd,” wrote app developer Nick Dobos in response to the news. “Why write complex rules for software by hand when AI can just think every pixel for you?”

GameNGen can reportedly generate new frames of Doom gameplay at over 20 frames per second using a single Tensor Processing Unit (TPU), a specialized type of processor similar to a GPU that is optimized for machine learning tasks.

In tests, the researchers say, ten human evaluators were sometimes unable to distinguish between short clips (1.6 seconds and 3.2 seconds) of actual Doom gameplay footage and output generated by GameNGen, correctly identifying the true game footage in 58 and 60 percent of cases, respectively.

An example of GameNGen in action: Doom is simulated interactively using an image synthesis model.

Real-time synthesis of video games using so-called “neural rendering” is not an entirely new idea. Nvidia CEO Jensen Huang predicted in an interview in March – perhaps somewhat boldly – that most video game graphics could be generated in real time by AI within five to ten years.

GameNGen also builds on previous work in the field cited in the GameNGen paper, including World Models in 2018, GameGAN in 2020, and Google's own Genie in March. And a group of university researchers trained an AI model (called “DIAMOND”) earlier this year to simulate classic Atari video games using a diffusion model.

Ongoing research into “world models” or “world simulators,” often associated with AI video synthesis models such as Runway’s Gen-3 Alpha and OpenAI’s Sora, is heading in a similar direction. For example, during the debut of Sora, OpenAI showed demo videos of the AI generator simulating Minecraft.

Diffusion is key

In a preprint research paper titled “Diffusion Models Are Real-Time Game Engines,” authors Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter explain how GameNGen works. Their system uses a modified version of Stable Diffusion 1.4, a diffusion model for image synthesis released in 2022 that people use to create AI-generated images.

“The answer to 'Can it run DOOM?' is yes for diffusion models,” wrote Tanishq Mathew Abraham, research director at Stability AI, who was not involved in the research project.

A diagram of GameNGen's architecture provided by Google.

The diffusion model is driven by the player’s inputs and predicts the next game state from the previous ones after being trained on extensive footage of Doom in action.
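The prediction loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `predict_next_frame` is a hypothetical stand-in for the trained model, frames are reduced to single integers, and the `context_len` value is arbitrary rather than taken from the paper.

```python
from collections import deque


def predict_next_frame(past_frames, past_actions):
    """Hypothetical stand-in for the trained diffusion model.

    A real model would denoise an image conditioned on the frame and
    action history. This stub just mixes the latest frame with the
    latest action so that the loop below can run.
    """
    return (past_frames[-1] * 31 + past_actions[-1] + 1) % 256


def rollout(first_frame, actions, context_len=4):
    """Generate one frame per player action, feeding outputs back in."""
    frames = deque([first_frame], maxlen=context_len)  # recent frames only
    taken = deque(maxlen=context_len)                  # recent actions only
    generated = []
    for action in actions:
        taken.append(action)
        frame = predict_next_frame(list(frames), list(taken))
        frames.append(frame)       # the model's own output becomes context
        generated.append(frame)
    return generated


if __name__ == "__main__":
    print(rollout(first_frame=0, actions=[1, 2, 3]))
```

The key point is the line where each generated frame is appended back into the context: the model conditions on its own earlier output, not on frames from a real game engine.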

The development of GameNGen involved a two-phase training process. First, the researchers trained a reinforcement learning agent to play Doom, recording its gameplay sessions to create an automatically generated training dataset (the aforementioned footage). They then used this data to train the custom Stable Diffusion model.
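As a rough sketch of how recorded play could become training data, the Python below collects a trajectory and slices it into examples. Everything here is assumed for illustration: a seeded random policy replaces the trained agent, `toy_game_step` replaces the real game, and the function names and `context_len` are hypothetical.

```python
import random


def toy_game_step(frame, action):
    """Hypothetical toy 'game': the next frame depends on frame and action."""
    return (frame * 7 + action) % 256


def record_episode(num_steps, num_actions=8, seed=0):
    """Phase 1 stand-in: a random policy plays while we log (frame, action)."""
    rng = random.Random(seed)
    frame = 0
    trajectory = []
    for _ in range(num_steps):
        action = rng.randrange(num_actions)
        trajectory.append((frame, action))
        frame = toy_game_step(frame, action)
    return trajectory


def make_training_examples(trajectory, context_len=3):
    """Phase 2 input: (past frames, past actions) paired with the next frame."""
    examples = []
    for t in range(context_len, len(trajectory)):
        window = trajectory[t - context_len:t]
        past_frames = [f for f, _ in window]
        past_actions = [a for _, a in window]
        target_frame = trajectory[t][0]
        examples.append((past_frames, past_actions, target_frame))
    return examples


if __name__ == "__main__":
    episode = record_episode(num_steps=10)
    for example in make_training_examples(episode)[:2]:
        print(example)
```

No human labeling is needed in this setup, since every recorded step already supplies both the inputs (history and action) and the target (the frame that followed).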

However, using Stable Diffusion does introduce some graphical glitches, as the researchers note in their abstract: “Stable Diffusion v1.4's pre-trained auto-encoder, which compresses 8×8 pixel patches into 4 latent channels, introduces significant artifacts when predicting game frames, affecting small details and especially the bottom bar HUD.”
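To get a feel for how aggressive that compression is, the arithmetic can be worked through in Python. The 8×8 patch size and 4 latent channels come from the quote above; the 320×240 frame size and the 3 RGB input channels are assumptions made for this example.

```python
def latent_shape(height, width, patch=8, latent_channels=4):
    """Shape (channels, rows, cols) of the latent grid for one frame."""
    if height % patch or width % patch:
        raise ValueError("frame dimensions must be multiples of the patch size")
    return (latent_channels, height // patch, width // patch)


def compression_ratio(patch=8, rgb_channels=3, latent_channels=4):
    """Input values per patch divided by latent values per patch."""
    return (patch * patch * rgb_channels) / latent_channels


if __name__ == "__main__":
    # Assumed frame size of 320x240 pixels (width x height).
    print(latent_shape(height=240, width=320))  # (4, 30, 40)
    print(compression_ratio())                  # 48.0
```

Under these assumptions, each 8×8 patch of 192 color values is squeezed into 4 numbers, a 48-fold reduction, which helps explain why small details such as HUD digits are hard to reproduce faithfully.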

And that's not the only challenge. Keeping the images visually clear and consistent over time (often referred to as “temporal coherence” in the AI video space) can be difficult. The GameNGen researchers say that “simulating interactive worlds is more than just very fast video generation,” as they write in their paper. “The requirement to account for a stream of input actions that is only available throughout the generation contradicts some assumptions of existing diffusion model architectures,” including repeatedly generating new frames based on previous frames (so-called “autoregression”), which can lead to instability and a rapid decline in the quality of the generated world over time.
