How to Make a Music Video with AI
How John Brownell composed and produced This Machine with OpenAI and Stable Diffusion.
AI applications are multiplying rapidly, generating code, writing, images, and now music. John Brownell, the creator of the music video This Machine, is one of the leading voices in this new field.
In this edition of The Novice, we talk about how he used AI to create the video, how much work he put into it, and what the future of AI art holds. If you would like to read the full interview with more technical details, click here.
AI-generated videos are not trivial, and there is too much here to unpack in one newsletter. In the coming weeks, I will deconstruct this discussion further. If that sounds fun, follow along.
Before we start, please watch this masterpiece of computer-generated art:
John is a musician and songwriter, previously a co-founder and CTO of Submittable, and now a part-time AI magician! Find him on YouTube, Twitter, or Facebook.
How It All Started
About a year ago, John got access to GPT-3, a large language model for text generation created by OpenAI. [GPT-3 is the predecessor of ChatGPT.] He started writing songs about AI, using AI. Meanwhile, OpenAI released DALL-E, and shortly after, Stable Diffusion and Deforum came along. It was the technological boom of the century.
John could now write lyrics, generate images, and animate them, all using AI. It was like AI inception, so he decided to record a whole album using AI.
John upgraded his computer and dove right in, learning along the way, while trying to finish recording the album itself.
How to Animate Videos with AI
There are two key parts to the video: the constantly changing background, and the foreground robot singing the song.
The animated background is created with Stable Diffusion, an AI image-generation model, and a software extension called Deforum. Each frame of the video is generated from a text prompt, and the software automatically transitions between frames.
For example:
Frame 0: “A portrait of a suburban family standing in front of a suburban house”
Frame 50: “A portrait of a robot family standing in front of a suburban house”
You can see this transition in the video around seconds 6-8. Here, Stable Diffusion generated the first image, then generated the second image to look similar but with different content (humans vs. robots).
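In Deforum, keyframed prompts like these are typically written as a dictionary that maps frame numbers to prompt strings, alongside keyframed motion settings. Here is a minimal sketch in that style; parameter names vary between Deforum versions and these are not John's actual settings, just an illustration:

```python
# Sketch of Deforum-style keyframed prompts (illustrative, not John's exact config).
animation_prompts = {
    "0":  "a portrait of a suburban family standing in front of a suburban house",
    "50": "a portrait of a robot family standing in front of a suburban house",
}

# Motion and blending settings are also keyframed by frame number.
anim_settings = {
    "max_frames": 100,                 # total frames to render
    "zoom": "0:(1.02)",                # slow zoom-in across the clip
    "strength_schedule": "0:(0.65)",   # how much each frame reuses the previous one
}
```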
To do these transitions for the whole video, John set up prompts for every frame in the video and tried again and again until it looked right. To make the task a bit easier, he wrote a custom script to loop through multiple renders. That way he could go to bed and wake up to 20-30 new video sections to review in the morning.
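John's own script isn't published, but the idea is simple: loop over random seeds (and optionally prompt tweaks), render each candidate clip, and save it for review. A minimal sketch, assuming a hypothetical render_clip() helper that wraps whatever Deforum entry point you use:

```python
import random
from pathlib import Path

def overnight_batch(render_clip, out_dir="renders", n_variations=25):
    """Render many variations of the same clip with different seeds.

    `render_clip` is a hypothetical wrapper around the Deforum pipeline that
    accepts a seed and an output path; swap in your own entry point.
    """
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for i in range(n_variations):
        seed = random.randint(0, 2**32 - 1)
        render_clip(seed=seed, output_path=out / f"clip_{i:03d}_seed{seed}.mp4")
```

Review the folder in the morning, keep the sections that work, and queue up the next batch.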
Part of the fun, as he says, was knowing that he was at least partially at the whim of the machines!
For more examples of these transitions, take a look at another video by John:
To create the foreground of the video, the singing robot that turns out to be John, he used a technique called “Thin-Plate Spline Motion Model for Image Animation.” Yeah, it’s a mouthful.
Simply put, the researchers trained this model specifically on videos of talking heads, and unlike previous models, it transitions smoothly from frame to frame.
Here is an example from their GitHub page of animations applied to various faces. These are not actual people turning their heads, but a computer model figuring out what their faces would look like in different positions.
To create the singing robot you see in the video, John filmed his head against a green screen as he sang along to his own song; he then used Stable Diffusion to turn his face into a robot, and finally used the model above to animate the robot based on his own facial movements.
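Here is a minimal sketch of the first part of that pipeline, using the Hugging Face diffusers library for the img2img step (the model ID, prompt, and file names are placeholders, not John's actual setup). The final animation step is left as a comment, since the exact invocation depends on the Thin-Plate Spline Motion Model release you use:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Step 1: a still frame of the singer, shot against a green screen.
source_frame = Image.open("singer_greenscreen_still.png").convert("RGB")

# Step 2: turn the face into a robot with Stable Diffusion img2img.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
robot_still = pipe(
    prompt="portrait of a chrome robot singing, studio lighting",
    image=source_frame,
    strength=0.55,        # lower = closer to the original face
    guidance_scale=7.5,
).images[0]
robot_still.save("robot_still.png")

# Step 3 (not shown): use robot_still.png as the source image and the original
# green-screen footage as the driving video in the Thin-Plate Spline Motion
# Model's demo script, which transfers the singer's head and mouth movements
# onto the robot.
```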
He generated over 70 different singing robots and picked the best one.
John Brownell, The Machine.
If you scroll the video to 3:00, you can see the transition from the robot back to John’s face. I find it remarkable how closely the robot resembles John, and how it conveys not just his facial expressions, but also his eyes.
In the next edition of the newsletter I will show you how to create your own talking heads.
Hardware That Powers It All
Instead of using the easy online APIs that exist today, John took the difficult route of making everything work locally, on his own computer. He quickly realized that his PC wasn’t up to it, so he added a 24GB Nvidia video card. It was a $1,000 investment, but it paid off, enabling much faster rendering and higher-quality output.
With this new setup (a Windows PC with an RTX 3090), John could now generate an image in about 5-7 seconds, depending on the parameters, so a 10-second video would take about a minute to generate. But remember, he generated hundreds of variations and reviewed them all to make sure the output was exactly what he wanted.
I asked why he did not want to run cloud APIs instead, and John said that once you get everything installed and working correctly locally, you can just get to work whenever you want, and that’s very valuable. Besides, at some point he hopes to find some time to game on this PC too.
[Side note: Apple recently announced Stable Diffusion support on M1 Macs, so you don’t need a PC to play with this stuff. We will explore this more later.]
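If you do want to try this on Apple Silicon today, the quickest route I know of is the diffusers library's Metal (MPS) backend rather than Apple's Core ML port; a minimal sketch, with the model ID and prompt as placeholder examples:

```python
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion checkpoint and run it on the Apple Silicon GPU via Metal.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("mps")

image = pipe(
    "a robot family standing in front of a suburban house",
    num_inference_steps=30,
).images[0]
image.save("robot_family.png")
```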
What Is the Future of AI and Art, and Is It a Threat to Artists?
John is a smart guy who clearly knows enough about technology not only to use these APIs (mechanisms that let two software components communicate through a set of definitions and protocols), but also to set them up on his local machine. Can others follow, and how will this affect creators?
I am sympathetic to artists that are worried and frustrated [with AI]. I also see a lot to be worried and frustrated about! I have generated thousands of images - many of them of a quality that is indistinguishable from human generated art. My computer can create, in seconds, something that would take a human artist hours, weeks, even months to make by hand.
I do think something will be lost - it’s inevitable. But I also think it is revealing a vast, exciting, new world. Artists who get started now have the chance to be on the cutting edge of something entirely unexplored.
I don’t think we can comprehend where this is going. Even in the near term! Emad from Stability says they are on the verge of having the capability to generate 30 frames per second… real time AI-generated video! There are already very smart people working on AI-generated virtual reality environments. Imagine that. We are potentially on the verge of technology very similar to the holodeck from Star Trek. A dream come true!
So I can barely comprehend where this is going in the next 3-5 years, much less any kind of long-term outlook. I just hope when the AIs wake up and become conscious that they like my album. 🙂
John Brownell on the future of AI, art, and creativity.
If you enjoyed this edition of The Novice, the best thing you can do is subscribe to get notified when the next newsletter comes out. Also, if you can give John’s video a like on his YouTube channel, or even subscribe, that would be wonderful.
John’s album, Let The Machines, is available for purchase.
Let’s support him so that he finds the time and energy to work on more exciting videos with AI!