Wednesday, July 12, 2023

Playing around with voice cloning

Recently, I have been playing around with neural-network-based voice cloning. There are several popular frameworks out there, such as the SoftVC VITS Singing Voice Conversion Fork (aka so-vits-svc, or just SVC) and Retrieval-based Voice Conversion (aka RVC).

At first, I started with SVC, but I found that it requires several thousand epochs to produce a good model. That makes it hard to train on Google Colab within the free GPU usage time, so I had to train models on my Proxmox workstation using an Nvidia T1000 8GB, which was REALLY slow.

Then, I came upon RVC, which gave quite acceptable results after 300 epochs of training. This means I can easily run it on Google Colab. The trained models can then be downloaded and used for local inference even without a GPU (but CPU inference can be a lot slower).

Here is my workflow.

First, I had to gather voice samples. I used Kdenlive to cut out scenes that featured the voice of the character I wanted to clone, pieced them together into a single clip with about 2 to 10 minutes of speech, and exported it as a WAV file.
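The cutting itself happens in Kdenlive, but the concatenation step can also be sketched in plain Python with the standard-library wave module, assuming all clips share the same sample rate, channel count, and sample width (the file names below are hypothetical placeholders, not my actual clips):

```python
import wave

def concat_wavs(inputs, output):
    """Concatenate WAV clips that share the same format into one file."""
    params = None
    chunks = []
    for path in inputs:
        with wave.open(path, "rb") as w:
            if params is None:
                params = w.getparams()  # remember rate/channels/width of first clip
            chunks.append(w.readframes(w.getnframes()))
    with wave.open(output, "wb") as out:
        out.setparams(params)  # wave fixes up the frame count on close
        for frames in chunks:
            out.writeframes(frames)

# Hypothetical file names -- replace with your own exported clips:
# concat_wavs(["scene1.wav", "scene2.wav"], "voice_samples.wav")
```

This only works if every clip was exported with identical settings, which Kdenlive guarantees when you export everything from one project.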

Next, I used Ultimate Vocal Remover GUI to remove the background music and other sounds. First, with the "MDX-Net" process method, I used the "UVR-MDX-NET Inst HQ 3" model to separate the vocals from the background music. Then, I used the "UVR-MDX-NET Karaoke 2" model to extract only the main character's voice. Finally, I used the "VR Architecture" process method's "UVR-DeEcho-DeReverb" model to remove echo and reverb from the extracted voice.

Then, I used an audio slicer tool to split the voice clip into smaller samples of several seconds each.
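The slicer I used cuts on silence, but the basic idea can be sketched with a much simpler fixed-length split using only the standard-library wave module (this is an illustrative simplification, not the tool's actual algorithm):

```python
import wave

def slice_wav(path, out_prefix, seconds=5):
    """Split a WAV file into fixed-length chunks of `seconds` each.

    Returns the list of chunk file paths; the last chunk may be shorter.
    """
    outputs = []
    with wave.open(path, "rb") as w:
        params = w.getparams()
        frames_per_chunk = int(w.getframerate() * seconds)
        index = 0
        while True:
            frames = w.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{out_prefix}_{index:03d}.wav"
            with wave.open(out_path, "wb") as chunk:
                chunk.setparams(params)  # frame count is corrected on close
                chunk.writeframes(frames)
            outputs.append(out_path)
            index += 1
    return outputs
```

A real slicer detects pauses and cuts there so that no word is chopped in half, which matters for training quality; this fixed-length version is just to show the mechanics.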
 
I then uploaded these samples to my Google Drive, which gets mounted in Google Colab during the training process. Next, I ran the RVC Python notebook (v2) on Google Colab. A link appears in the "terminal" output of Google Colab, which you can click to access the web UI; the "Train" tab is used for training.
 

Once training is completed, the weights can be downloaded for local inference.

Here is one of the videos that I watched to learn about how to use Google Colab for RVC training.


Here are some samples created using two models that were trained by me. The base sample was created using Google's text-to-speech service using default voices for English, Japanese, and Chinese.

Alice Synthesis Thirty from Sword Art Online, trained using RVC for 300 epochs
Kubo Nagisa from Kubo Won't Let Me Be Invisible (久保さんは僕を許さない), trained using RVC for 300 epochs
Violet Evergarden, trained using RVC for 600 epochs
 
What do you think? Personally, I think the base sample created using Google TTS was quite artificial in the first place, so the converted samples sounded quite artificial as well. The models perform a lot better when used to clone songs, but due to copyright issues, I have not uploaded such samples here.
