Giving a Virtual Human Its Voice and Memory: Lessons from the Humboldt Project

Written by Fabiano Busca and Samer Al Hussban (MSc Artificial Intelligence at Vrije Universiteit Amsterdam)

Introduction

During our internship at Cradle, we worked on a project called Humboldt: an effort to bring a real person’s knowledge and presence into a conversational AI agent. The result is a digital human, a virtual character modelled on a real client, that can speak in that person’s own voice and answer questions grounded in real, verified data.

The project combined two distinct technical challenges. Samer focused on voice cloning: capturing the client’s voice and reproducing it convincingly using AI. Fabiano focused on Retrieval-Augmented Generation (RAG): making sure the agent’s answers are factually correct and grounded in real datasets, rather than relying on guesswork.

Together, these two pieces form the backbone of an agent that feels both human and trustworthy.

Part One: Cloning the Client’s Voice

Why voice matters

A digital human without its own voice is just an avatar. The voice is what makes the character feel like a real person rather than a text-to-speech machine. For Humboldt, the client needed to be represented authentically. The goal was not a generic AI voice, but their voice. To achieve this, we used ElevenLabs’ Professional Voice Clone (PVC) feature, which trains a personalised voice model from recorded audio samples.

Recording the audio

The quality of the voice clone depends almost entirely on the quality of the audio going in. We recorded the client in a quiet room, using a Focusrite Scarlett 2i2 interface with a professional XLR microphone and a pop filter. The signal level on the interface was kept green roughly 90% of the time, with only brief yellow peaks on strong consonants and no red (distortion) at any point.
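The same kind of level check can be repeated on a recording after the fact. The sketch below is illustrative only and is not the procedure used in the project: it splits normalised samples (values between -1.0 and 1.0) into windows and reports the share of windows that would count as green, yellow, or red. The function names, the window size, and the -6 dBFS "yellow" threshold are assumptions; the real thresholds of the interface's meter may differ.

```python
import math

# Assumed thresholds (illustrative only; a real meter may use different values).
YELLOW_DBFS = -6.0   # window peaks above this count as "yellow"
RED_DBFS = 0.0       # window peaks at full scale count as "red" (clipping)

def peak_dbfs(samples):
    """Peak level of normalised samples (-1.0..1.0) in dB relative to full scale."""
    peak = max((abs(s) for s in samples), default=0.0)
    return -math.inf if peak == 0.0 else 20.0 * math.log10(peak)

def meter_shares(samples, window=4):
    """Split samples into windows and return the share of green/yellow/red windows."""
    counts = {"green": 0, "yellow": 0, "red": 0}
    for start in range(0, len(samples), window):
        level = peak_dbfs(samples[start:start + window])
        if level >= RED_DBFS:
            counts["red"] += 1
        elif level > YELLOW_DBFS:
            counts["yellow"] += 1
        else:
            counts["green"] += 1
    total = sum(counts.values()) or 1
    return {colour: n / total for colour, n in counts.items()}

# Tiny synthetic signal: mostly quiet, one louder burst, no clipping.
signal = [0.1, -0.2, 0.15, -0.1] * 9 + [0.7, -0.8, 0.6, -0.5]
shares = meter_shares(signal)
print(shares)
```

On a real recording the samples would come from the decoded audio file and the window would span a fraction of a second rather than four samples.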

We captured audio in two complementary ways: first, the client read from a prepared script covering the project’s main themes; second, we ran an informal interview-style conversation to capture natural speech patterns, emotions, and the particular rhythm of how they express themselves. This combination gave us richer training data than scripted reading alone.

Processing the audio

Raw recordings rarely go straight into a voice model. Before uploading, we used ElevenLabs’ built-in Speaker Separator and Background Noise Remover to clean the audio and isolate the client’s voice from any room noise or incidental sounds. The processed recordings added up to around 2 hours and 25 minutes of clean audio, enough to reach the “perfect” rating on ElevenLabs’ training quality scale.

Training and model selection

Once the audio was uploaded, the client verified their consent by recording a single sentence directly in the platform. Training then ran for approximately six hours. After training, we compared the cloned voice across several of ElevenLabs’ synthesis models, including Eleven Multilingual v2, Eleven Flash v2.5, Eleven Turbo v2, and the newer v3 alpha, using the same test paragraph each time. The differences between models are immediately noticeable: some prioritise similarity to the original voice, others prioritise stability across long passages. The client listened to all the samples and selected Eleven Turbo v2 as the model that sounded most like them and felt most natural for the use case.
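A like-for-like comparison of this kind can be scripted by sending the same text to each model. The sketch below only builds the request specifications and does not send anything. The endpoint path, the `xi-api-key` header, and the model identifiers are assumed to follow ElevenLabs' public text-to-speech REST API and should be checked against the current documentation; `build_comparison_requests` and the output file names are hypothetical.

```python
import json

# Assumptions: endpoint and model identifiers as published for the ElevenLabs
# text-to-speech REST API; verify against the current API reference before use.
API_BASE = "https://api.elevenlabs.io/v1/text-to-speech"
MODEL_IDS = [
    "eleven_multilingual_v2",
    "eleven_flash_v2_5",
    "eleven_turbo_v2",
    "eleven_v3",
]

def build_comparison_requests(voice_id, text, api_key):
    """Return one request spec per model, all using the same voice and test text."""
    specs = []
    for model_id in MODEL_IDS:
        specs.append({
            "url": f"{API_BASE}/{voice_id}",
            "headers": {"xi-api-key": api_key, "Content-Type": "application/json"},
            "body": json.dumps({"text": text, "model_id": model_id}),
            "output_file": f"sample_{model_id}.mp3",  # hypothetical naming scheme
        })
    return specs

specs = build_comparison_requests("VOICE_ID", "The same test paragraph.", "API_KEY")
for spec in specs:
    print(spec["output_file"])
```

Each specification can then be sent as a POST request with any HTTP client, and the returned audio saved under the given file name for side-by-side listening.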

Part Two: Grounding the Agent in Real Data

The problem with ungrounded AI

Language models are impressive, but they have a significant limitation: they answer based on patterns in their training data. That data has a cutoff date and does not include private datasets, and the model can produce plausible-sounding but incorrect information, a problem known as hallucination. For Humboldt, the agent needed to discuss specific municipalities, neighbourhoods, demographic statistics, and transport accessibility data for the West Brabant region of the Netherlands. None of that exists in a general-purpose language model in the precise, up-to-date form needed. The solution is Retrieval-Augmented Generation, or RAG.

What RAG does

Rather than asking the language model to retrieve facts from memory, RAG supplies the model with relevant documents at the moment it needs to answer a question. The process works in three steps:

The user’s question is converted into a vector embedding, a mathematical representation of its meaning.
That embedding is compared against a pre-built knowledge base to retrieve the most relevant documents.
The language model reads those documents and formulates its answer based only on what they contain.

The model is not querying a database directly. It is reading carefully prepared text documents, much like a human researcher consulting a briefing pack before a meeting.
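The three steps can be sketched in a few lines. This is a toy illustration, not the Humboldt implementation: a bag-of-words count vector stands in for a learned embedding model, the documents are placeholders, and the function names are hypothetical. The final step here only assembles the prompt that a language model would read.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: word counts (a stand-in for a learned embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, documents, k=2):
    """Steps 1 and 2: embed the question and return the k most similar documents."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, documents, k=2):
    """Step 3: hand the retrieved documents to the model as its only context."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents, k))
    return f"Answer using only the documents below.\n{context}\nQuestion: {question}"

# Placeholder documents, invented for illustration.
docs = [
    "Neighbourhood profile: Example Buurt A has many elderly residents.",
    "Municipality overview: Example Gemeente B population and age structure.",
    "Ranking: neighbourhoods with fastest bike access to a train station.",
]
top = retrieve("Which neighbourhoods have the fastest bike access?", docs, k=1)
print(top[0])
```

In a production system the `embed` function would call an embedding model and the sorted scan would be replaced by a vector index, but the flow from question to retrieved text to prompt stays the same.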

Building the knowledge base

The knowledge base for Humboldt was built from two primary sources: CBS neighbourhood statistics (demographic data per buurt for 2023) and isochrone layers showing how long it takes to reach a train station by walking, cycling, or e-bike from any given point in the region.

From these datasets, we generated four types of documents:

Neighbourhood profiles – one per buurt, summarising population, age structure, household composition, migration background, and rail accessibility.
Municipality overviews – aggregated summaries for each gemeente, including population-weighted demographic averages and distributions of accessibility scores.
A regional overview – a single document covering all 425 neighbourhoods and roughly 657,000 residents in the West Brabant area.
Ranking documents – precomputed lists such as “top 5 neighbourhoods with fastest bike access” or “municipalities with the highest share of elderly residents,” which help the agent answer comparative questions reliably.
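To make the municipality and ranking documents concrete, here is a minimal sketch of how a population-weighted average and a precomputed ranking could be produced. The `Buurt` fields, function names, and sample values are all hypothetical and chosen for illustration; they are not the project's schema or data.

```python
from dataclasses import dataclass

@dataclass
class Buurt:
    name: str
    gemeente: str
    population: int
    pct_65_plus: float   # share of residents aged 65+, in percent
    bike_minutes: float  # cycling time to the nearest train station

def weighted_pct_65_plus(buurten):
    """Population-weighted average share of 65+ residents across neighbourhoods."""
    total = sum(b.population for b in buurten)
    if total == 0:
        return 0.0
    return sum(b.pct_65_plus * b.population for b in buurten) / total

def ranking_document(buurten, n=5):
    """Render a precomputed 'fastest bike access' ranking as plain text."""
    fastest = sorted(buurten, key=lambda b: b.bike_minutes)[:n]
    lines = [
        f"{i}. {b.name} ({b.gemeente}): {b.bike_minutes:.0f} min by bike"
        for i, b in enumerate(fastest, 1)
    ]
    title = f"Top {len(fastest)} neighbourhoods with fastest bike access to rail"
    return title + "\n" + "\n".join(lines)

# Invented sample records.
sample = [
    Buurt("Buurt A", "Gemeente X", 1000, 20.0, 12),
    Buurt("Buurt B", "Gemeente X", 3000, 10.0, 5),
    Buurt("Buurt C", "Gemeente Y", 2000, 30.0, 8),
]
print(weighted_pct_65_plus(sample[:2]))  # weighted over Gemeente X only
print(ranking_document(sample, n=2))
```

For the two Gemeente X records the weighted average is 12.5%, whereas a plain mean of the two percentages would give 15%; the larger neighbourhood pulls the figure towards its own value. Writing rankings out as text in advance means the model reads the answer rather than sorting numbers itself.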

Each document is stored alongside metadata (numeric fields like population counts, percentages, and travel times) that allows the retrieval system to filter and rank results before they reach the model.
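A minimal sketch of metadata-based filtering, assuming each document is a dictionary with a `text` field and a `metadata` dictionary. The field names (`population`, `bike_minutes`), the function name, and the sample values are hypothetical.

```python
def filter_and_rank(documents, max_bike_minutes, min_population=0, limit=3):
    """Keep documents whose metadata passes the numeric filters, fastest bike time first."""
    kept = [
        d for d in documents
        if d["metadata"]["bike_minutes"] <= max_bike_minutes
        and d["metadata"]["population"] >= min_population
    ]
    kept.sort(key=lambda d: d["metadata"]["bike_minutes"])
    return kept[:limit]

# Invented sample documents.
documents = [
    {"text": "Profile of Buurt A", "metadata": {"population": 1200, "bike_minutes": 12}},
    {"text": "Profile of Buurt B", "metadata": {"population": 300, "bike_minutes": 4}},
    {"text": "Profile of Buurt C", "metadata": {"population": 2500, "bike_minutes": 7}},
]
selected = filter_and_rank(documents, max_bike_minutes=10, min_population=500)
print([d["text"] for d in selected])
```

Filtering on numeric metadata first narrows the candidate set with exact comparisons, so the similarity search and the model only see documents that already satisfy the hard constraints of the question.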

What the agent can answer

Because the knowledge base is explicit about what it contains, we also know its limits. The agent can reliably answer questions such as:

What is the demographic profile of Sportpark in Breda?
Which municipalities in West Brabant have the highest share of elderly residents and how accessible are they by bike?
What share of the regional population has excellent cycling access to rail?

If a fact is not present in the retrieved documents, the model cannot answer reliably. In that case the agent says so rather than guessing.
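One common way to get this behaviour is to combine a retrieval-score threshold with an explicit instruction in the prompt. The sketch below shows that pattern under stated assumptions: the fallback wording, the `min_score` value, and the function name are hypothetical, and this is not necessarily how the Humboldt agent implements it.

```python
FALLBACK = "I don't have that information in my sources."

def grounded_prompt(question, scored_docs, min_score=0.3):
    """scored_docs is a list of (similarity, text) pairs.

    Returns (prompt, None) when usable context exists, else (None, FALLBACK).
    """
    relevant = [text for score, text in scored_docs if score >= min_score]
    if not relevant:
        # Nothing relevant was retrieved: answer with the fallback, skip the model.
        return None, FALLBACK
    context = "\n\n".join(relevant)
    prompt = (
        "Answer the question using only the context. "
        f"If the context does not contain the answer, reply exactly: {FALLBACK}\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return prompt, None

prompt, fallback = grounded_prompt(
    "What is the population of Buurt A?",
    [(0.82, "Profile of Buurt A (placeholder text)."), (0.10, "Unrelated document.")],
)
print(prompt)
```

The threshold handles questions with no matching documents at all, while the instruction covers the case where documents were retrieved but do not contain the specific fact.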

What We Learned

Working on Humboldt taught us that building a convincing and trustworthy digital human is not one problem but several, each requiring its own careful approach.

Voice quality lives or dies with the input audio. No amount of AI processing compensates for poor microphone technique or a noisy recording environment. The investment of time in recording proper samples, and in cleaning them thoroughly before training, made an audible difference in the final result.

On the knowledge side, the biggest insight was that structured preparation beats clever prompting. The more deliberately the knowledge base was designed, with consistent document formats, precomputed rankings, and validated data, the better and more reliable the agent’s answers became. The model performs best when it has clear, well-organised information to read, rather than having to infer or estimate.

Future Directions

What began as an internship project became the foundation for two independent research directions, each of which grew into a thesis.

Samer’s work on voice cloning and lip synchronisation led to a formal study investigating how the quality of these technologies affects the way people perceive a digital human. Using a within-subjects 2×2 experimental design with 175 participants, he compared two generations of technology: Microsoft Azure’s widely adopted speech synthesis and lip-sync pipeline against ElevenLabs Professional Voice Cloning combined with the Runtime MetaHuman lip-sync system. Participants watched videos of the same virtual human discussing different topics across four audiovisual conditions, created by systematically varying voice realism and lip-sync quality. The study measured humanness, eeriness, human-like appearance and behaviour, user acceptance, and trust, probing where the line sits between a convincing digital human and one that tips into the uncanny.

Fabiano’s work on RAG and grounding led to a credibility study examining how a digital human’s trustworthiness is perceived when it can back up its claims. The research focused on the impact of grounded, clickable citations and the role of embodiment, asking whether users find an agent more credible when its answers are visibly sourced, and whether the physical presence of a virtual human changes that judgement compared to a text-only interface.

Together, these two threads suggest that the design of a digital human cannot be reduced to a single variable. Voice, motion, knowledge, and transparency each play a role, and understanding how they interact is the next frontier.