Jukebox Diffusion

An AI tool for conditional music generation

Jeff Sontag
Better Programming


Created using Deforum Stable Diffusion

It hit me with a sickening severity one night. While I stood staring down at a sink full of unwashed dishes, a feeling of nausea overwhelmed me. I had spent the day chasing down a single elusive error in a new credit scoring model. My eyes were bloodshot; my thoughts were static. It was my job. As a Data Scientist for a finance startup touting a global presence and a moral conscience, almost every day felt like a Hail Mary attempt at feeling useful in an apathetic world.

That's why weeks later, when I connected my first patch cable to a Make Noise MATHS module, my world suddenly blossomed into thousands of splintering possibilities. This is modular synthesis I’m talking about, and it redefined my reality. A couple of VCOs and a filter later, I was plugged into an unfamiliar realm of dissonance. It was thrilling, like charting new territory, where churning clouds of sound obscured limits.

This all led to January of this year, when I plunged my life into more obscurity by quitting my job to pursue independent AI music research. Inspired by the rise of generative algorithms for text and images in 2022, I longed for a realignment in my work/life direction.

Armed with ArXiv research papers, some experience with autoencoders, and a hunger for understanding the AI tools that create sound, I began my journey into the captivating realm of generative audio.

Today I’m excited to release a tool representing the past seven months of tinkering. Spanning both giddy moments of discovery and painful groans of anguish, it’s a wild instrument. It’s weird, alien-sounding, and riddled with imperfections. But it is also exciting, unpredictable, and bubbling with vitality. I have had so much fun making noise with this tool. It has an aesthetic charm that I have already fallen deeply for. I hope you find something in it too. I’m happy to share this first prototype. There’s much more to improve upon.

What does Jukebox Diffusion sound like?

All examples heard in this video are unaltered raw outputs from Jukebox Diffusion

What is it?

At its core, Jukebox Diffusion is a hierarchical latent diffusion model. JBDiff uses a Jukebox model's encoder and decoder layers to travel between audio space and multiple differently compressed latent spaces.

At each of the three latent levels, a denoising U-Net model is trained to iteratively denoise a normally distributed variable into vectors representing compressed audio.

Architecture Outline: At each layer, the frozen Jukebox encoder begins by encoding x and the conditioning signals. During training, x is diffused according to a noise schedule, and the yellow U-Net model is trained to reverse this process, generating x̃ at test time. The conditioning signal is provided to the U-Net at various layers via cross-attention. Every level is trained independently in this way. The top-level Dance Diffusion model is finetuned on the same dataset as all other layers. The layers are then frozen for sampling: beginning with the bottom layer, random noise is denoised by iterative steps through the layer's U-Net model. The Jukebox decoder then decodes the latent window, and this audio is passed to the level above for upsampling. Once the final audio is obtained through the Dance Diffusion model, all conditioning windows slide over and use the newly generated audio to continue generating the next time step.

The final layer of JBDiff is a [Dance Diffusion] Denoising U-Net model, providing a bump in audio quality and transforming the mono output of Jukebox into final stereo audio.
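
To make the per-level training setup concrete, here is a minimal PyTorch-style sketch of one training step for a single level. All of the names (`jukebox_encoder`, `denoise_unet`, `diffusion`) are placeholders for illustration rather than the actual JBDiff API, and it assumes a standard noise-prediction (DDPM-style) objective with cross-attention conditioning as described above.

```python
import torch
import torch.nn.functional as F

def train_step(audio_window, context_audio, jukebox_encoder, denoise_unet,
               diffusion, optimizer):
    """One hypothetical training step for a single JBDiff level.

    audio_window:  raw audio for the current window, shape (B, 1, T)
    context_audio: raw audio for the previous two windows, shape (B, 1, 2*T)
    """
    with torch.no_grad():                        # the Jukebox encoder stays frozen
        x0 = jukebox_encoder(audio_window)       # e.g. (B, 512, 64) latent target
        cond = jukebox_encoder(context_audio)    # conditioning latents

    # Pick a random diffusion timestep and noise the clean latents accordingly
    t = torch.randint(0, diffusion.num_steps, (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = diffusion.q_sample(x0, t, noise)       # forward (noising) process

    # The U-Net predicts the noise; context is injected via cross-attention
    pred_noise = denoise_unet(x_t, t, cross_attn_cond=cond)

    loss = F.mse_loss(pred_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```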

Why Jukebox?

The first time I heard output from Jukebox, I was absolutely blown away. Released in 2020 by OpenAI, the generative model wove together novel, interesting musical phrases, accurately emulated distinct instruments and timbres, and even ventured into otherworldly vocal expressions. But what set it apart was its gift of control: the ability to shape the output dramatically with a prompt of genre, artist, or raw audio. Like a guitar or a keyboard or a synth, Jukebox became an instrument.

Example of Jukebox output. Made by prompting the artist Sufjan Stevens and providing lyrics generated by GPT-3. Odd artifacts are the main complaint in raw Jukebox output. Learn more about Jukebox here: https://openai.com/research/jukebox

Despite its groundbreaking capabilities, Jukebox did come with some drawbacks. One common complaint revolved around aesthetically unappealing ‘artifacts’ in the music. Users also expressed frustration with the long wait times for audio generation. Even so, Jukebox, to me, was the first tool to crack the world of AI music wide open.

Because of my past captivation with Jukebox, it naturally came into focus when I first started my exploration of generative audio tools in January. My goal was to quickly evaluate the capabilities of latent diffusion, and to make this task easier, I wanted to focus most of my time on training diffusion models. My hope was to discover pre-trained encoders/decoders with a broad scope, allowing for seamless zero-shot application to my personalized dataset for compressed latent training data.

Upon delving into Jukebox, I was excited to discover that the top-level encoders/decoders were actually excellent at handling the original signal with few unappealing artifacts. This led me to the conclusion that Jukebox’s generative potential was limited primarily by the transformer-based prior and upsamplers in its original implementation. I decided to try to update the old transformer samplers, replacing them with denoising diffusion models.

Why diffusion?

Diffusion methods offer some advantages over traditional transformer-based generation methods. One of the most apparent benefits is their innate ability to resample initial audio. By partially re-noising an initial sample and then denoising it, the algorithm can distill abstract information, like pitch and timing, from the original audio source and reinterpret it in a fresh, novel manner (a rough sketch follows below the list). As a musician, this opens up exciting opportunities across various domains, such as:

  • Audio Mastering
  • Remixing Tracks
  • Building up instrumentation from scratch vocals or humming
The same process in the image domain: a diffusion model takes an init image, distills meta information such as subject, location, palette, and setting, and then reimagines it differently. Source
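
Here is a rough sketch of that resampling idea in the latent-diffusion setting: push an existing latent partway into the noise schedule, then denoise it back, so only a fraction of the original structure (pitch, timing) survives and the rest is reimagined. The helper names (`diffusion.q_sample`, `diffusion.p_sample`) are assumed placeholders, not the actual JBDiff API.

```python
import torch

@torch.no_grad()
def resample_from_init(init_latent, denoise_unet, diffusion, strength=0.5):
    """Reinterpret existing audio instead of generating from pure noise.

    strength=0.0 returns the input nearly untouched; strength=1.0 is
    essentially a fresh generation. (Hypothetical helper, for illustration.)
    """
    # How far into the noise schedule to push the init latent
    t_start = int(strength * (diffusion.num_steps - 1))
    t = torch.full((init_latent.shape[0],), t_start, device=init_latent.device)

    # Forward process: partially re-noise the original latent
    x = diffusion.q_sample(init_latent, t, torch.randn_like(init_latent))

    # Reverse process: denoise from t_start back to 0, reimagining the details
    for step in reversed(range(t_start + 1)):
        x = diffusion.p_sample(denoise_unet, x, step)
    return x
```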

Some methods have been proposed for conditioning transformer-based language models on melodies [MusicGen], but I prefer the results obtained through diffusion methods.

Why Dance Diffusion?

My confidence in denoising diffusion models for audio received a significant boost with the release of HarmonAI’s [Dance Diffusion] model in 2022. HarmonAI, the audio wing of Stability AI (the brains behind Stable Diffusion), trailblazed the potential of diffusion methods for audio generation. Their U-Net-based model could create novel and interesting audio snippets, reimagine input audio in new ways, and it sounded good.

Dance Diffusion offered a clear and encouraging path to follow, affirming the viability of denoising diffusion models for creating compelling audio. However, a couple of limitations were apparent:

  • The model was unconditional, meaning control over the genre/style of the output was limited
  • Generation lengths were limited because the model generates directly at the level of raw audio samples

I was determined to introduce conditioning capabilities and a means to generate longer audio samples, envisioning a more versatile generative tool.

Why condition?

Conditioning is really a question of empowerment for musicians — a gateway to control, a realm of knobs to turn, and parameters to fine-tune. Through conditioning, a model transcends the algorithm and transforms into a versatile instrument.

Drawing inspiration from Jukebox, I opted for audio -> audio conditioning signals. While text -> audio conditional models have garnered attention lately, I hold a personal preference for the potential of audio -> audio models.

Like Jukebox, I leverage the previous two context windows of compressed audio as conditioning signals. These windows’ lengths vary based on the Jukebox encoder level. To start, my initial model was trained with windows of size (batch_size, 512, 64).

Windowed sampling: Jukebox Diffusion uses three layers of differently compressed audio representations. The diffusion models are trained to sample new windows using the previous two windows as context. The lower level is sampled first, and the output is iteratively passed up to higher levels for further upsampling. Windows are shifted forward when all sampling above has been completed.

512 represents the number of time steps in compressed space, and 64 represents the size of the quantized latent vector at each time step. The compression at each level follows (8x, 32x, 128x), meaning the deepest layer window represents audio of 512 x 128 = 65536 samples or ~1.5s of audio sampled at 44.1 kHz.

My final model was trained on window size 768, but when sampling, I have been able to extend this window up to 32x on a single NVIDIA A10 GPU representing 32 x 768 x 128 = 3145728 samples or ~70s of generated audio in one pass for the most compressed layer.
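
For reference, the window arithmetic above works out as follows (plain Python, simply reproducing the numbers in the text; the per-level compression rates are the (8x, 32x, 128x) mentioned earlier):

```python
SAMPLE_RATE = 44_100
COMPRESSION = {"top": 8, "middle": 32, "bottom": 128}  # audio samples per latent step

def window_length(latent_steps, level):
    samples = latent_steps * COMPRESSION[level]
    return samples, samples / SAMPLE_RATE

print(window_length(512, "bottom"))       # (65536, ~1.49 s)  original training window
print(window_length(32 * 768, "bottom"))  # (3145728, ~71.3 s) extended sampling window
```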

These context windows were provided to the model using cross-attention at specific levels of the U-Net.
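
Putting those pieces together, a minimal sketch of the windowed sampling loop for one level might look like the following. The names and helper methods are hypothetical stand-ins rather than the real implementation; the point is only to show the two-window cross-attention context sliding forward after each generated window.

```python
import torch

@torch.no_grad()
def sample_level(num_windows, window_shape, denoise_unet, diffusion, device="cuda"):
    """Autoregressive windowed sampling for one JBDiff level (sketch only).

    window_shape: e.g. (1, 512, 64), i.e. (batch, latent timesteps, latent dim)
    """
    # Start with empty (zero) context; a real run could seed this with init audio
    context = [torch.zeros(window_shape, device=device) for _ in range(2)]
    generated = []

    for _ in range(num_windows):
        # Conditioning: the two most recent windows, concatenated along time
        cond = torch.cat(context, dim=1)

        # Denoise pure Gaussian noise step by step, conditioned via cross-attention
        x = torch.randn(window_shape, device=device)
        for step in reversed(range(diffusion.num_steps)):
            x = diffusion.p_sample(denoise_unet, x, step, cross_attn_cond=cond)

        generated.append(x)
        context = [context[1], x]           # slide the context windows forward
    return torch.cat(generated, dim=1)      # (1, num_windows * 512, 64)
```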

How did the diffusion models do?

My first real excitement in this project came when my bottom layer diffusion model began generating music-like outputs. As the most compressed layer, it started unraveling intricate kick and bass patterns from noise. Hearing the diffusion model capture higher-level musical semantics was extremely encouraging, a testament that latent diffusion was viable for audio generation.

This excitement didn’t last long as I immediately ran into a problem. Despite capturing high-level information, the bottom layer showed weakness during the decoding process — it couldn’t recreate the audio at a perceptually high-quality level. The high compression in the bottom latent space hindered my ability to achieve musically meaningful phrases with the desired level of perceptual quality. This limitation meant the generated output might not be as useful as I had hoped.

The next few months were spent banging my head against the wall, trying to devise an upsampling method that would preserve the essential information at the bottom layer while effectively eliminating noise and infusing intricate details as the audio signal traveled up each level.

It’s worth noting here that [AudioLM] deals with this high-level/low-level tradeoff by modeling audio based on two differently compressed tokens they call ‘semantic tokens’ and ‘acoustic tokens,’ modeling in stages from semantic -> acoustic. [MusicGen] deals with this by using Residual Vector Quantization.

Investigating diffusion for upsampling

How did I try to balance this tradeoff? I investigated the following methods for this process in the context of diffusion:

1. The first thing I tried was training the higher layers of JBDiff in exactly the same way as the bottom layer. During sampling, I would hand the noisier, more heavily compressed result up a level and use it as the init input for upsampling at the next level.

This worked, but unfortunately, every time a higher level took a generation from a lower level, it would divide the sample into four, meaning one generated window from the bottom layer would eventually be sliced into 16 parts to be upsampled. Keeping these 16 slices coherent using diffusion proved to be difficult, and I would end up with ‘skipping’-like artifacts.

I think this approach could be better investigated, especially using smarter choices for noise as input into each layer, but I hoped for an easier, more direct way to use diffusion as an upsampler.
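
For concreteness, here is roughly what that first attempt looked like, with hypothetical helpers rather than the real code. It assumes the lower-level output is decoded to audio, re-encoded one level up (where it occupies 4x as many latent steps), split into four windows, and each window is partially re-noised and denoised as an init.

```python
import torch

@torch.no_grad()
def upsample_via_init(lower_latents, lower_decoder, upper_encoder,
                      upper_unet, diffusion, strength=0.4):
    """Use one lower-level window as the init for the level above (sketch)."""
    audio = lower_decoder(lower_latents)           # decode the noisier lower result
    upper_latents = upper_encoder(audio)           # 4x less compression -> 4x more steps

    # One lower window becomes four upper windows of the same latent length
    chunks = torch.chunk(upper_latents, 4, dim=1)
    out = []
    for chunk in chunks:
        t_start = int(strength * (diffusion.num_steps - 1))
        t = torch.full((chunk.shape[0],), t_start, device=chunk.device)
        x = diffusion.q_sample(chunk, t, torch.randn_like(chunk))
        for step in reversed(range(t_start + 1)):
            x = diffusion.p_sample(upper_unet, x, step)
        out.append(x)

    # Keeping these independently diffused chunks coherent is where the
    # 'skipping'-like artifacts came from.
    return torch.cat(out, dim=1)
```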

2. Next, I tried denoising while providing the lower-level generation as the conditioning signal. I was hoping the conditioning signal of the lower level would provide most of the information desired for upsampling on the current level. Unfortunately, the results did not converge to anything resembling upsampling and instead failed in the same ways as the first method did.

3. The final approach I tried was something odd. Rather than denoise from random samples, I tried teaching a U-Net to denoise from a lossier-quality sample generated from a lower layer. I made the assumption that each layer captured slightly more information as compression decreased. I hoped a U-Net might be able to learn to recreate the residual information lost at each step up in compression. This way, it could achieve true upsampling through diffusion: something like residual denoising.
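
One simple way to read that idea in code is sketched below: build the ‘lossier’ version of a window by round-tripping it through the level below (encode, decode, re-encode at the current level), then train the U-Net to map from the lossy latents back to the clean ones. This is an assumption-laden sketch of the concept rather than my actual training code, and it shows the idea as a direct lossy-to-clean mapping rather than the full iterative diffusion formulation.

```python
import torch
import torch.nn.functional as F

def residual_denoise_step(audio_window, current_encoder, lower_encoder,
                          lower_decoder, upsample_unet, optimizer):
    """Train a U-Net to recover the detail lost by the more compressed level below."""
    with torch.no_grad():
        target = current_encoder(audio_window)     # clean latents at this level
        # What the level below can represent: same content, minus the residual
        # detail that this level is supposed to add back.
        lossy = current_encoder(lower_decoder(lower_encoder(audio_window)))

    pred = upsample_unet(lossy)                    # start from the lossy sample, not noise
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```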

This is the approach I decided to keep for the final model, although, in retrospect, I would revisit my first attempts again to see if they could work better. Because…

Did it work?

Kind of. Eh.

The upsamplers trained in this way do remove artifacts inherent in the lower level but do not provide any additional information. Instead, they introduce artifacts of their own.

Here, listen:

The best thing I can say for them is that they clean up the high-frequency areas a bit before the audio is finally handed over to Dance Diffusion. This helps keep the final samples less chaotic, especially in the high-frequency ranges.

Why add Dance Diffusion?

As demonstrated above, the upsampled JBDiff generations failed to blossom into perceptually listenable music. I needed a final upsampling layer that could flesh out the backbone of my compressed latent phrases with detail and clarity.

Earlier this year, I participated in an AI production challenge hosted by HarmonAI. For this challenge, I wanted to use my bottom-level Jukebox Diffusion model to alter personal recordings and transform them into AI-mangled goodies. During this time, I was still struggling with upsampling and, without much time left, decided to try upsampling from the bottom layer straight into Dance Diffusion.

Track produced in HarmonAI production challenge mentioned above. My first attempt at Jukebox Diffusion -> Dance Diffusion pipeline. The main synth line was created by using acoustic guitar as initial audio and diffusing through the bottom layer, then using that noisy output as initial audio diffused through DD.

To my delight, this worked better than I could have imagined.

The generations from DD stayed faithful to the original phrases while adding depth, clarity, and a unique aesthetic touch. Going straight from the bottom layer to DD did result in chaos in the high frequencies of the outputs, but luckily, adding in the two other upsamplers fixes this and brings the whole pipeline into another territory.

This layer of DD also provides the nice advantage of turning Jukebox’s decoded mono output into stereo audio.

To keep Dance Diffusion’s unconditioned outputs consistent, I implemented controls over the random noise used in the denoising process. This helped to smooth transitions between unconditioned blocks of generation.
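
One simple way to get this kind of consistency, sketched here with made-up helpers rather than the code I actually use, is to derive every block's starting noise from one shared base tensor, mixing in only a small amount of fresh noise per block:

```python
import torch

def make_block_noises(num_blocks, shape, variation=0.3, seed=0):
    """Correlated starting noise for consecutive unconditioned DD blocks (sketch).

    Each block's noise is mostly a shared base tensor plus a small fresh
    component, so neighbouring blocks start from similar points and the
    transitions between them stay smoother.
    """
    g = torch.Generator().manual_seed(seed)
    base = torch.randn(shape, generator=g)
    noises = []
    for _ in range(num_blocks):
        fresh = torch.randn(shape, generator=g)
        mixed = (1.0 - variation) * base + variation * fresh
        noises.append(mixed / mixed.std())   # keep roughly unit variance
    return noises
```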

What’s missing?

Since the start of this journey for me in January, major releases have popped up in the AI audio space. Most notably [MusicLM] from Google Research and [MusicGen] from Meta. Both models utilize a transformer-based sequence-to-sequence Language Model architecture and include notable advances that could be adopted into this project. Ideas for improvement include:

  1. The Jukebox encoders and decoders are wildly out of date, while MusicGen, for example, utilizes SOTA EnCodec encoders and decoders. EnCodec, at a minimum, allows for stereo encoding/decoding, whereas Jukebox works only in mono. Additionally, EnCodec is trained with a novel MS-STFT discriminator on the decoder’s output, which dramatically increases the perceptual quality of decoded audio. By replacing Jukebox in the enc/dec stage with something like EnCodec, I feel confident diffusion generation quality would immediately increase (see the sketch after this list).
  2. While the upsampling technique I tried didn’t quite work out, I feel confident that some sort of ‘primed’ noise based on lower-level generations could really be useful to investigate.
  3. More compression. If upsampling can be cracked, there’s no reason even more heavily compressed levels couldn’t be used to understand full song structures: start with generations at ultra-high compression rates and iteratively diffuse at better and better quality as you move from layers concerned with sections, down to phrases, down to samples.
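
To give a sense of what swapping in EnCodec would look like, here is a minimal round-trip through the open-source `encodec` package, following its standard usage. This is stock EnCodec rather than an integration with JBDiff, and the file path is a placeholder.

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# The 48 kHz EnCodec model is stereo; Jukebox's codec, by contrast, is mono only
model = EncodecModel.encodec_model_48khz()
model.set_target_bandwidth(6.0)  # kbps

wav, sr = torchaudio.load("input.wav")                           # placeholder path
wav = convert_audio(wav, sr, model.sample_rate, model.channels)  # resample to 48 kHz stereo
wav = wav.unsqueeze(0)                                           # (1, channels, samples)

with torch.no_grad():
    frames = model.encode(wav)            # list of (codes, scale) per fixed-length segment
    reconstruction = model.decode(frames)
```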

There are also various small tweaks to the sampling method that I will be updating in the coming weeks to months.

Is that it?

For now, this has been one of the coolest deep dives I’ve done in my life. I’m appreciative of the opportunity to learn and fiddle with these amazing tools. I hope to continue this research and improve this architecture over the next months to years. Thank you for reading! I hope it was worthwhile.

Where can I find it?

The code is available on GitHub here:

Soon, I will make this tool available for more users via a Hugging Face Space or another alternative. More updates, tutorials, and videos will be posted on my YouTube channel:

Thank you

I want to give big thank yous to the researchers at OpenAI who created Jukebox and the researchers at HarmonAI who created Dance Diffusion. Both of these tools were the backbone of Jukebox Diffusion, and I would not have been able to make this tool without them.

Thank you to Kyle for always being my sounding board and keeping me focused.

Thank you to Alicia for humoring me and always listening to my weird devil noises.
