Sunday, November 16, 2025

My thoughts on the technology behind AI vtuber Neuro-sama

I came upon the vtuber world more or less by accident: I first heard about Ironmouse leaving VShojo, and then learned more about her actual health situation. This led me to the AI vtuber Neuro-sama, created by a British guy who goes by the name Vedal. I was very intrigued by how lifelike and entertaining Neuro can be. And today, Neuro made her debut as a virtual 3D model.

As someone who has studied artificial intelligence and dabbled a bit in machine learning and deep neural networks, I was very interested in the technology behind Neuro. Here, I write about how I think Neuro was implemented.
 
Neuro started as an AI created by Vedal to play a game, but was later turned into an AI vtuber streaming on Twitch. Neuro also has access to a Discord server and her (for convenience, I will use this pronoun) own Twitter/X account. Neuro can also play certain games, although this requires Vedal to code a specific interface for each game so that she can interact with it. Neuro also has a "sister" called Evil Neuro, who started as a more unfiltered clone of Neuro, with similar access to Twitch, Discord, and Twitter/X.
 
The underlying technology behind Neuro is a large language model (LLM). Vedal has mentioned that running Neuro is a costly affair, which suggests that she uses a commercially available LLM (such as ChatGPT) for her core "brain". Her Twitch channel also states that she runs on an RTX 4090, although I think the local GPU is used for other tasks that use smaller neural network models or require low latency. The backend is probably a series of Python scripts that make API calls to external services (like the core LLM) and to other services running locally.
 
Besides the core LLM, Neuro has text-to-speech (TTS) for speaking and speech-to-text (STT) for hearing. Given that even small STT and TTS models can give very good results, I think these two services run locally on the RTX 4090. Neuro was also given the ability to "see" the screen, which suggests either that the external LLM is multi-modal and receives screen captures directly, or that a local image recognition model describes the captured screen and feeds that description to the external LLM. Neuro also has a filter, which is likely a small language model running locally to catch certain words, phrases, and sensitive topics.
 
The backend would probably work like this. Input comes in as either voice or text (Twitch chat, Discord, or Twitter/X) and is fed to the external LLM. The response goes through the filter, which passes the final response either to the TTS model and vtuber studio software for output to Twitch, or to Discord or Twitter/X.
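
The flow described above can be sketched in a few lines of Python. This is only my guess at the shape of such a pipeline, not Vedal's actual code: the names Message, handle, simple_filter, and echo_llm are all made up for illustration, the LLM is replaced by a stub function, and the filter is a plain word list standing in for the small language model I speculated about.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Message:
    source: str  # e.g. "voice", "twitch", "discord", "twitter"
    text: str


# Placeholder word list; a real filter would likely be a small local model.
BLOCKED_WORDS = {"badword"}


def simple_filter(reply: str) -> str:
    """Replace the whole reply if it contains a blocked word."""
    words = (w.strip(".,!?") for w in reply.lower().split())
    if any(w in BLOCKED_WORDS for w in words):
        return "Filtered."
    return reply


def handle(
    message: Message,
    llm: Callable[[str], str],
    content_filter: Callable[[str], str],
    outputs: Iterable[Callable[[str], None]],
) -> str:
    """Input -> LLM -> filter -> every output channel (TTS, Discord, ...)."""
    prompt = f"[{message.source}] {message.text}"
    reply = llm(prompt)
    safe_reply = content_filter(reply)
    for send in outputs:
        send(safe_reply)
    return safe_reply


def echo_llm(prompt: str) -> str:
    """Stub standing in for the call to the external LLM."""
    return f"You said: {prompt}"


if __name__ == "__main__":
    # print stands in for the TTS and vtuber studio output.
    handle(Message("twitch", "hello"), echo_llm, simple_filter, [print])
```

Passing the LLM, the filter, and the outputs in as plain functions means each piece could be swapped for a real service without touching the flow itself.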
 
Neuro also has memory, and I think this is where her "sleep" comes in. Her interactions for each session are probably kept in a "short-term memory" file, which is then summarised at the end of each session and added to a "long-term memory" file. This long-term memory then serves as part of the input prompt that is fed to the external LLM each time Neuro is booted up. If Neuro is kept online after each session to do the summarising herself, that would be much like humans sorting out their memories through sleep.
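
Here is a minimal sketch of that short-term/long-term idea, again purely as an illustration of my speculation. The Memory class and its method names are my own invention, and the summarise argument is a placeholder for what would presumably be another call to the LLM.

```python
from pathlib import Path
from typing import Callable, List


class Memory:
    """Short-term memory in RAM, long-term memory in a text file."""

    def __init__(self, long_term_path: Path) -> None:
        self.long_term_path = long_term_path
        self.short_term: List[str] = []

    def remember(self, line: str) -> None:
        """Record one interaction from the current session."""
        self.short_term.append(line)

    def sleep(self, summarise: Callable[[List[str]], str]) -> None:
        """Summarise the session, append it to long-term memory, then forget."""
        if not self.short_term:
            return
        summary = summarise(self.short_term)
        with self.long_term_path.open("a", encoding="utf-8") as f:
            f.write(summary + "\n")
        self.short_term.clear()

    def boot_prompt(self, persona: str) -> str:
        """Build the prompt fed to the LLM when a new session starts."""
        long_term = ""
        if self.long_term_path.exists():
            long_term = self.long_term_path.read_text(encoding="utf-8")
        return f"{persona}\n\nThings you remember:\n{long_term}"
```

A session would call remember() as things happen, sleep() once at the end, and boot_prompt() at the start of the next stream.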
 
The control of a virtual 3D model is a bit more complex. Ideally, there should be a learned model running locally that acts as the middleman between the LLM and the 3D model. This 3D movement model would be trained to recognise high-level commands from the LLM and translate them into the low-level commands that control the virtual joints of the 3D model. Such a model would also help in the future if Neuro is given a physical body. A physical motor model could then sit between the 3D movement model and the physical body, translating the low-level commands for 3D model movement into the actual commands for moving motors and actuators. These models would have to reside locally for low latency.
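
To make the two layers concrete, here is a toy sketch. Note that it swaps the trained movement model for a simple lookup table, just to show where each translation step sits. The pose names, joint names, and angles are invented, and the mapping from angle to servo pulse width (1000 to 2000 microseconds over 0 to 180 degrees) is only an assumed convention for a hypothetical motor.

```python
from typing import Dict

# Hypothetical poses: high-level command -> joint angles in degrees.
# A trained movement model would replace this table.
POSES: Dict[str, Dict[str, float]] = {
    "wave": {"right_shoulder": 90.0, "right_elbow": 45.0},
    "nod": {"neck_pitch": 15.0},
}


def command_to_joints(command: str) -> Dict[str, float]:
    """Layer 1: high-level command from the LLM -> virtual joint angles."""
    return dict(POSES.get(command, {}))


def joints_to_motor(
    joints: Dict[str, float],
    min_us: float = 1000.0,
    max_us: float = 2000.0,
    max_deg: float = 180.0,
) -> Dict[str, float]:
    """Layer 2: joint angles -> pulse widths in microseconds (assumed servo)."""
    pulses = {}
    for name, degrees in joints.items():
        clamped = max(0.0, min(max_deg, degrees))
        pulses[name] = min_us + (max_us - min_us) * clamped / max_deg
    return pulses


if __name__ == "__main__":
    joints = command_to_joints("wave")
    print(joints)
    print(joints_to_motor(joints))
```

The point of the split is that the LLM only ever has to say "wave"; the virtual model consumes the joint angles directly, and a physical body would add the second step on top.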
 
Of course, this is just my speculation on how Neuro is implemented. The individual technologies are not new. However, a single person weaving them together to implement his own AI vtuber is an amazing feat. I look forward to seeing how Neuro will "grow" in the future.
