• Author(s): Ziyang Chen, Daniel Geng, Andrew Owens

Spectrograms, which are 2D representations of sound, differ significantly from the images found in the visual world. When natural images are played as spectrograms, they produce unnatural sounds. However, this paper demonstrates the possibility of synthesizing spectrograms that simultaneously resemble natural images and sound like natural audio, referred to as “images that sound.”
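To make the premise concrete, here is a minimal sketch of what "playing an image as a spectrogram" means: a grayscale image is treated as a magnitude spectrogram and inverted to a waveform with Griffin-Lim. The file names, image size, and STFT parameters are illustrative assumptions, not details from the paper.

```python
# Minimal sketch: treat a grayscale image as a magnitude spectrogram and
# invert it to audio with Griffin-Lim. Paths and parameters are illustrative.
import numpy as np
from PIL import Image
import librosa
import soundfile as sf

n_fft = 1024                                         # spectrogram has n_fft // 2 + 1 frequency rows
img = Image.open("photo.png").convert("L")           # hypothetical input image
img = img.resize((512, n_fft // 2 + 1))              # (time frames, freq bins) for PIL's (width, height)
mag = np.asarray(img, dtype=np.float32) / 255.0      # shape: (freq bins, time frames), values in [0, 1]

# Griffin-Lim estimates a phase that is consistent with the given magnitudes.
wav = librosa.griffinlim(mag, n_iter=32, hop_length=n_fft // 4, win_length=n_fft)
sf.write("photo_as_audio.wav", wav, 22050)           # typically sounds unnatural, as the paper notes
```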

The approach presented in this study is simple and zero-shot, utilizing pre-trained text-to-image and text-to-spectrogram diffusion models that operate in a shared latent space. During the reverse process, noisy latents are denoised using both the audio and image diffusion models in parallel, resulting in a sample that is likely under both models.
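The following sketch illustrates the parallel-denoising idea described above: at each reverse step, both models predict the noise in the same shared latent, and the two predictions are combined before the update. The model and scheduler interfaces, the DDIM-style update, and the simple fixed averaging weight are placeholder assumptions for illustration, not the paper's exact implementation.

```python
# Schematic sketch of joint denoising with two diffusion models that share a
# latent space. `image_model`, `audio_model`, and `scheduler` are hypothetical
# interfaces standing in for the pre-trained components.
import torch

@torch.no_grad()
def joint_sample(image_model, audio_model, scheduler,
                 image_prompt, audio_prompt, shape, weight=0.5):
    latent = torch.randn(shape)                           # shared noisy latent
    for t in scheduler.timesteps:                         # reverse process, high noise -> low noise
        eps_img = image_model(latent, t, image_prompt)    # noise estimate from the image model
        eps_aud = audio_model(latent, t, audio_prompt)    # noise estimate from the spectrogram model
        eps = weight * eps_img + (1.0 - weight) * eps_aud # combine so the sample is likely under both
        latent = scheduler.step(eps, t, latent)           # one denoising update (e.g., DDIM)
    return latent                                         # decode with the shared VAE afterwards
```

The key design choice is that both models operate on the very same latent, so their combined guidance steers sampling toward a spectrogram that is plausible to each model on its own.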

Quantitative evaluations and perceptual studies show that the method generates spectrograms that match a desired audio prompt while also taking on the visual appearance of a desired image prompt. In doing so, it bridges the visual and auditory domains, producing spectrograms that retain the characteristics of both natural images and natural audio.

The implications of this research are significant: it opens up new possibilities for audio-visual synthesis and manipulation. Because the generated spectrograms look like natural images and sound like natural audio, the method could be applied in areas such as multimedia content creation, audio-visual synchronization, and data augmentation for machine learning tasks. In short, this simple, zero-shot use of pre-trained diffusion models demonstrates that a single sample can satisfy both an audio and a visual prompt, advancing audio-visual synthesis across a range of domains.