Google’s DeepMind AI Can Now Produce Full-On Soundtracks For Video
- Martina
- 02 July 2024, Tuesday
Google’s DeepMind AI is rapidly advancing in its capabilities and creativity. Recent updates from Google show that the model can now generate music and sound to accompany video, creating entire soundtracks.
An unlimited number of soundtracks for a wide range of video inputs
As with its other creative content, the video-to-audio (V2A) generation process uses natural language text prompts in combination with video pixels to generate rich soundscapes for the video. Google pairs its V2A technology with video generation models like Veo to create shots that include a dramatic score, realistic sound effects, or even dialogue matching the characters and their tone of voice.
Additionally, the technology can generate soundtracks for more traditional footage, such as silent films and archival material. This opens up a world of possibilities for creative expression.
As Google explains, the V2A model has been designed to be effective and easy to use, offering creators enhanced creative control. Users can add a ‘positive prompt’ to guide the generated output towards the desired sounds, or a ‘negative prompt’ to steer it away from undesired sounds. Moreover, V2A can generate an unlimited number of soundtracks for any video input.
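To make the prompt-guided control concrete, here is a minimal sketch of how such a request could be structured. V2A has no public API, so every name below (V2ARequest, its fields, and the example prompts) is a hypothetical illustration of the positive/negative prompt idea, not Google’s actual interface.

```python
from dataclasses import dataclass

# Hypothetical request structure -- V2A has no public API, so these field
# names are illustrative, not Google's.
@dataclass
class V2ARequest:
    video_path: str              # source footage to score
    positive_prompt: str = ""    # sounds to steer the output towards
    negative_prompt: str = ""    # sounds to steer the output away from
    num_variants: int = 3        # V2A can produce many soundtracks per clip

request = V2ARequest(
    video_path="wolf_howling.mp4",
    positive_prompt="cinematic score, wolf howling at the moon, wind through trees",
    negative_prompt="human speech, traffic noise",
    num_variants=4,
)
print(request)
```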
How does the V2A actually work?
In its press release, Google revealed that its researchers experimented with ‘autoregressive and diffusion approaches to discover the most scalable AI architecture.’ The diffusion-based approach to audio generation reportedly provided the most realistic results for synchronizing audio and video information.
To generate a soundscape, the V2A system starts by ‘encoding video input into a compressed representation.’ The diffusion model then iteratively refines the audio from random noise—a process guided by the visual input and natural language prompts, resulting in synchronized, realistic audio that closely aligns with the prompt. Finally, the audio output is decoded, transformed into an audio waveform, and combined with the video data.
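As a rough sketch of that pipeline, the toy example below follows the same three stages: encode the video into a compressed representation, iteratively refine random noise guided by the visual representation and a text prompt, then decode the result into a waveform. The actual V2A networks are not public, so every function here (encode_video, embed_prompt, denoise_step, decode_audio) is a simplified stand-in, not Google’s implementation.

```python
import numpy as np

LATENT_DIM = 1024   # size of the compressed audio representation (illustrative)
STEPS = 50          # number of iterative refinement steps (illustrative)

rng = np.random.default_rng(0)

# Toy stand-ins for the learned components in the real system.
def encode_video(frames):
    """Compress video frames into a conditioning vector (stand-in encoder)."""
    flat = frames.reshape(frames.shape[0], -1).mean(axis=0)
    return flat[:LATENT_DIM]

def embed_prompt(prompt):
    """Map a text prompt to a fixed-size vector (stand-in text encoder)."""
    seed = abs(hash(prompt)) % (2**32)
    return np.random.default_rng(seed).standard_normal(LATENT_DIM)

def denoise_step(latent, video_cond, prompt_cond, step):
    """One refinement step: nudge the noisy latent towards the conditioning."""
    target = 0.5 * video_cond + 0.5 * prompt_cond
    return latent + (target - latent) / (step + 2)

def decode_audio(latent):
    """Turn the refined latent into an audio waveform (stand-in decoder)."""
    return np.tanh(latent)

# 1. Encode the video input into a compressed representation.
frames = rng.standard_normal((24, 64, 64))   # 24 placeholder video frames
video_cond = encode_video(frames)
prompt_cond = embed_prompt("rain on a tin roof, distant thunder")

# 2. Start from random noise and iteratively refine it, guided by the
#    visual representation and the natural language prompt.
latent = rng.standard_normal(LATENT_DIM)
for step in range(STEPS):
    latent = denoise_step(latent, video_cond, prompt_cond, step)

# 3. Decode the refined representation into an audio waveform.
waveform = decode_audio(latent)
print(waveform.shape)   # (1024,) -- ready to be combined with the video
```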
To achieve higher-quality audio output and allow the model to generate specific sounds, the development team added more information to the training process. This includes AI-generated annotations with detailed descriptions of sound and transcripts of spoken dialogue.
Moreover, by training on video, audio, and the additional annotations, the V2A technology learns to associate specific audio events with various visual scenes, while responding to the information provided in the annotations or transcripts.
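As an illustration of what such an enriched training example might contain, the sketch below pairs a video clip and its soundtrack with the two kinds of AI-generated annotations mentioned above. The class and field names are hypothetical; Google has not published its dataset format.

```python
from dataclasses import dataclass

# Hypothetical shape of an enriched training example; the real dataset
# format used by Google is not public.
@dataclass
class V2ATrainingExample:
    video_clip: str           # path or ID of the video segment
    audio_clip: str           # ground-truth soundtrack for that segment
    sound_description: str    # AI-generated description of what is heard
    dialogue_transcript: str  # AI-generated transcript of any spoken dialogue

example = V2ATrainingExample(
    video_clip="clip_0042.mp4",
    audio_clip="clip_0042.wav",
    sound_description="footsteps on gravel, birdsong, light wind",
    dialogue_transcript="Let's keep moving before it gets dark.",
)
print(example)
```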
Future research to address limitations
As Google claims, its V2A model stands out from its competitors mainly because it can understand raw video pixels and create an audio output without a text prompt (adding one is optional for users). Still, the company does not deny that its technology has limitations, which are being addressed in further research.
These limitations mostly stem from the fact that the quality of the audio output depends heavily on the quality of the video input. Currently, distortions and artifacts in the provided video that fall outside the model’s training distribution can lead to a substantial drop in audio quality. Additionally, improving lip synchronization for videos that involve speech remains a focus for future research.