Artificial intelligence extracts vocals and instruments from any audio

Artificial intelligence and machine learning are making their way further and further into music production, helping creators simplify and speed up everyday mechanical and tedious tasks. With a unique neural network trained on 20 terabytes of data, LALAL.AI can extract and separate vocals and instruments from audio files.

But let’s take a step back to understand what AI and machine learning are and how they can help the music industry.

What are artificial intelligence and machine learning?

Artificial intelligence (AI) is the simulation of human intelligence processes by machines, especially computer systems. Because human intelligence can be applied to countless domains, AI today is developed for specific tasks such as language recognition, image recognition, voice recognition, and more.

In this ocean of different AI systems, machine learning (ML) is a type of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so. Machine learning algorithms use historical data as input to predict new output values.
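As a minimal illustration of "historical data in, predicted values out", the sketch below fits a straight line to a few past observations and uses it to predict a new one. The data is made up for the example (track length versus processing time) and has nothing to do with how LALAL.AI is actually built; real systems use far larger models and datasets.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical "historical" data: track length (minutes) -> processing time (seconds).
history_minutes = [1.0, 2.0, 3.0, 4.0]
history_seconds = [5.0, 9.0, 13.0, 17.0]

# "Training": the rule is learned from the data, not written by hand.
slope, intercept = fit_line(history_minutes, history_seconds)

# "Prediction": estimate the output for an input the model has never seen.
predicted = slope * 5.0 + intercept
print(f"Predicted processing time for a 5-minute track: {predicted:.1f} s")
```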

These two things together are the essence of the Vocal Remover by LALAL.AI. In 2020, the team developed a unique neural network called Rocknet, using 20 terabytes of training data, to extract instrumentals and vocal tracks from songs. In 2021 they created Cassiopeia, a next-generation solution superior to Rocknet that delivers better separation results with far fewer audio artifacts.

Throughout 2021, LALAL.AI has been enhanced with capabilities to extract various musical instruments from audio and video sources.

So, what is LALAL.AI?

LALAL.AI is a next-generation vocal remover and music source separation service that allows fast, simple and accurate stem extraction. It can remove vocal, instrumental, drums, bass, piano, electric guitar, acoustic guitar, and synthesizer tracks without sacrificing quality.

The service is provided on a pay-as-you-go basis. There is a free tier, with 10 minutes of conversion and a maximum file size of 50MB in MP3, WAV, or OGG. If you need faster processing and more conversion time, LALAL.AI offers two paid packs. The best value at the moment is the Plus Pack, which costs €30 instead of €50, includes 300 minutes of audio conversion and supports MP3, OGG, WAV, FLAC, AVI, MP4, MKV, AIFF and AAC. The cheapest is the Lite Pack, with 90 minutes of conversion for €15, a fast processing queue and downloads in one go.
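To compare the two paid packs, it helps to work out the price per minute of conversion. The short sketch below does that arithmetic using only the prices and minutes quoted above; the dictionary layout and function name are just for illustration.

```python
# Prices (EUR) and included minutes as quoted in this article.
packs = {
    "Lite Pack": {"price_eur": 15, "minutes": 90},
    "Plus Pack": {"price_eur": 30, "minutes": 300},
}


def cost_per_minute(pack):
    """Price of one minute of conversion, in EUR."""
    return pack["price_eur"] / pack["minutes"]


for name, pack in packs.items():
    print(f"{name}: {cost_per_minute(pack):.3f} EUR per minute")
```

At these prices the Lite Pack works out to roughly €0.17 per minute and the Plus Pack to €0.10 per minute.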

How does LALAL.AI work?

Using the LALAL.AI tool is as simple as uploading a file to the web. You head to the website, register an account, press SELECT FILE, and choose the audio or video you want to use.

We tested the tool with the brand new Swedish House Mafia album, Paradise Again. For our first test we chose Another Minute, in 24-bit HD FLAC. The track is 110MB and is one of the best-quality versions you can buy online. With a 1Gbps internet connection, the upload took less than 10 seconds. The tool takes another 10 seconds to generate the preview. We chose the “Vocal and Instrumental” separation with the brand new Phoenix algorithm. If you like the preview, you can click the “Process” button and wait for the algorithm to do its work.

Even from the preview, the audio separation was remarkably good. It separated the vocal from the instrumental almost perfectly. The instrumental file contains only a few artifacts, and they are negligible compared with the overall quality of the conversion.

It is not just vocal and instrumental: LALAL.AI can also extract drums, bass, electric guitar, acoustic guitar, piano and even synthesizers. Currently, only Vocal, Instrumental and Drums use the new, improved Phoenix algorithm.

LALAL.AI for business and API

The tool can also be used without the web interface, through the API. LALAL.AI lets you integrate its services into your app or website, either by using its network directly or by deploying on your own infrastructure.

Phoenix algorithm, the evolution of sound source separation

From Rocknet to Cassiopeia and then Phoenix, the new stage of sound separation. The LALAL.AI team identified three groups of improvements on which to build the new algorithm:

  1. Input signal processing method.
  2. Architectural improvements.
  3. Methods for assessing separation quality.

When a neural network processes an audio file, it breaks it up into segments and “observes” each segment individually. The main change in the first group is the increase in the amount of data that the network “observes” at once in order to understand the composition of instruments and isolate the required source, such as the voice or the drums. The Cassiopeia segment is one second long, while the Phoenix segment is eight seconds long. Because it “observes” more data, Phoenix can better recognize the instruments that make up the composition and the characteristics of the desired source.
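The segment idea can be sketched in a few lines. The example below only cuts a signal into fixed-length, non-overlapping windows of one and eight seconds to show how much more audio each window contains; the sample rate, the lack of overlap and the function name are assumptions for illustration, and nothing here reflects how Cassiopeia or Phoenix are implemented internally.

```python
def split_into_segments(samples, sample_rate, segment_seconds):
    """Cut a list of samples into consecutive windows of the given length."""
    seg_len = int(sample_rate * segment_seconds)
    return [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]


sample_rate = 44100                     # assumed sample rate (CD quality)
samples = [0.0] * (sample_rate * 16)    # 16 seconds of silence as a stand-in track

one_second = split_into_segments(samples, sample_rate, 1)    # Cassiopeia-sized windows
eight_second = split_into_segments(samples, sample_rate, 8)  # Phoenix-sized windows

print(len(one_second), "windows of", len(one_second[0]), "samples")
print(len(eight_second), "windows of", len(eight_second[0]), "samples")
```

Each eight-second window holds eight times as many samples as a one-second window, which is the extra context the network gets to "observe" at once.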

The larger the observed segment, the better the separation quality that can theoretically be achieved. In practice, however, enlarging the segment increases the complexity of the network, the time it needs to perform the separation, and the time required to train it.

For Cassiopeia, for example, increasing the segment from 1 second to 8 seconds would have made it impossible to train the network in a reasonable amount of time, and even if it had been trained, the network would have been so slow that users would have to wait tens of minutes for the stems to be separated.

The architectural improvements in Phoenix enabled the LALAL.AI team to increase the amount of observed data while halving the network’s run time. For users, this means a song is processed twice as fast.

The second group contains many improvements. For example, the team took new neuron activation functions from computer vision and adapted them to audio processing. They also used more advanced normalization methods, which gave the network a better balance and made it easier to train.

The third group is probably the most important. Any solution needs criteria by which it can be evaluated, and the LALAL.AI team needed separation quality criteria to assess the stem separation. Such criteria are also required while training a neural network, because the network needs to know not only when it separates well or poorly, but also what has to change for it to separate better.
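The article does not say which criteria LALAL.AI uses. As an illustration of what a separation quality criterion can look like, the sketch below computes a basic signal-to-distortion ratio (SDR) in decibels, a measure commonly used in source separation research: it compares the energy of the true stem with the energy of the error in the estimated stem. The test signal and the two "estimates" are synthetic and purely hypothetical.

```python
import math


def sdr_db(reference, estimate):
    """Basic signal-to-distortion ratio in dB; higher means a cleaner estimate."""
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0:
        return math.inf  # perfect reconstruction
    return 10 * math.log10(signal_energy / error_energy)


# Synthetic "true" stem: one second of a 440 Hz tone sampled at 8 kHz.
reference = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]

# Two hypothetical separation results: one nearly exact, one badly attenuated.
good_estimate = [0.99 * r for r in reference]
poor_estimate = [0.5 * r for r in reference]

print(f"good: {sdr_db(reference, good_estimate):.1f} dB")
print(f"poor: {sdr_db(reference, poor_estimate):.1f} dB")
```

A score like this tells you whether one result is better than another. For training, a criterion additionally has to be something the network can be optimized against, which is the "what has to change" part described above.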

If you want to know more about the algorithm and its evolution, there is a dedicated article on the LALAL.AI blog.

LALAL.AI Phoenix in numbers