Introduction
AIDA (Artificial Intelligence for Data Analysis) is an accessible and user-centric tool designed to optimize transcription and speaker diarization (“who spoke when”) for qualitative researchers. Qualitative research often involves numerous interviews and focus groups, where accurate and timely transcription is a time-consuming step prior to analysis. Researchers depend on the ability to capture not only what was said but also how, when, and by whom. Current commercial transcription tools (e.g., Notta.ai, Otter.ai, Microsoft Word), while increasingly available, are frequently costly, inflexible, and poorly adapted to the specific needs of qualitative inquiry. Moreover, these tools often lack contextual awareness and nuance, requiring significant manual editing by researchers before analysis can even begin. This creates a bottleneck in research productivity and can compromise data quality.
AIDA integrates multiple state-of-the-art AI models adapted to provide high-quality, nuanced transcription and diarization, ensuring robust performance across diverse speaker profiles, including those with pronounced accents. Built on top of open-source AI models, AIDA is optimized for both inference speed and transcription nuance. It offers a customizable framework that allows researchers to adapt the tool to their specific language, domain, or contextual needs. Preliminary user testing indicates that AIDA outperforms several leading commercial tools, offering researchers a more effective and accessible solution for processing audio data.
Development Work
The development of a tool such as AIDA necessitates the integration of multiple AI models into a cohesive system. AIDA is specifically designed for academics engaged in qualitative research, who typically do not interact with AI models at a deep technical level. Consequently, the tool prioritizes usability, offering a user-friendly interface that abstracts away the complexity of individual model components.
To ensure broad applicability across diverse disciplines, AIDA is designed as a domain-agnostic tool and therefore avoids fine-tuning or custom training. Customizing models to particular datasets can improve contextual accuracy but also risks introducing biases and limiting generalizability. Consequently, AIDA relies on robust, pre-trained models that deliver consistent performance across a wide spectrum of use cases. To support this generalizability, mature and actively maintained open-source models were evaluated for transcription and diarization.
Transcription
Among the available speech-to-text systems, OpenAI’s Whisper emerged as the most suitable for AIDA’s requirements. Whisper has demonstrated low word error rates when tested against benchmark data. Additionally, Whisper supports multilingual transcription, integrates easily with Hugging Face pipelines, and offers accelerated inference through its FlashAttention implementation, making it well suited to our purpose.
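For illustration, the sketch below shows one way Whisper can be loaded through a Hugging Face pipeline with FlashAttention 2 enabled; the model size, device, and audio file name are placeholder assumptions rather than AIDA’s exact configuration.

```python
# Illustrative sketch: Whisper through the Hugging Face pipeline with
# FlashAttention 2 (requires the flash-attn package and a compatible GPU).
# Model size, device, and file name are placeholders.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
    model_kwargs={"attn_implementation": "flash_attention_2"},
)

result = asr("interview.wav", return_timestamps=True)
print(result["text"])
```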
Accurate punctuation, including the correct use of commas, periods, exclamation marks, and ellipses, is essential for qualitative researchers, as it reflects the tone, pacing, and hesitation in a participant’s speech. Such nuances are crucial in interpreting the cognitive or emotional state of interviewees, particularly when analyzing responses that involve reflection, uncertainty, or hesitation. To enhance punctuation fidelity, a custom prompt was passed to the Whisper model. This prompt sets a contextual style for the transcription output. Best results were achieved using the prompt: “Umm, let me think like, hmm… Okay, here’s what I’m, like, thinking.” This prompt encourages the model to reflect natural speech patterns and effectively utilize ellipses and filler words, which are common in natural conversation. For non-English audio inputs, the prompt was translated into the respective target language before being passed to the model. Testing showed that this practice improved transcription quality in non-English languages, particularly in preserving conversational style and punctuation.
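As a minimal sketch of this prompting step (assuming the Hugging Face transformers API; the model size and audio file name are placeholders), the style prompt can be supplied to Whisper as decoder prompt tokens:

```python
# Minimal sketch: biasing Whisper's output style with a contextual prompt.
# The model size and audio file name are placeholder assumptions.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
)

# The style-setting prompt described above; for non-English audio, a
# translated version of this prompt would be used instead.
style_prompt = "Umm, let me think like, hmm… Okay, here's what I'm, like, thinking."
prompt_ids = asr.tokenizer.get_prompt_ids(style_prompt, return_tensors="pt").to("cuda:0")

result = asr("interview.wav", generate_kwargs={"prompt_ids": prompt_ids})
print(result["text"])
```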
Speaker Diarization
For speaker diarization, pyannote.audio is currently one of the most widely adopted tools and was initially considered for AIDA. Although it performs well in two-speaker scenarios, many qualitative studies involve focus groups with multiple participants. In these more complex audio environments, pyannote frequently misattributed speech segments, leading to poor diarization performance. Additionally, the tool proved difficult to customize, with many clustering-related hyperparameters hard-coded into the source code.
Alternative speaker separation models from SpeechBrain, such as Sepformer-Libri2Mix and Sepformer-Libri3Mix, were also investigated. However, these models are constrained to fixed speaker counts (two and three speakers, respectively) and performed inconsistently even within those limits. In particular, they struggled with overlapping speech, a common feature of natural group conversations.
Finally, it was decided to use an end-to-end neural diarization (EEND) model for AIDA. Unlike traditional diarization systems that use multiple separate stages (speech activity detection, speaker embedding extraction, clustering), EEND performs the entire task in a single neural network model. This integrated approach enables superior handling of overlapping speech compared to clustering-based methods. NVIDIA’s NeMo SortFormer diarizer was selected for AIDA due to its demonstrated capability to minimize speaker confusion in extended audio recordings featuring multiple concurrent speakers.
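A minimal usage sketch is shown below, assuming NVIDIA’s publicly released Sortformer checkpoint on Hugging Face; the audio file name is a placeholder.

```python
# Minimal sketch: end-to-end neural diarization with NeMo's Sortformer.
# The checkpoint name is NVIDIA's public release; the audio file is a
# placeholder.
from nemo.collections.asr.models import SortformerEncLabelModel

diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_sortformer_4spk-v1")
diar_model.eval()

# Produces per-file speaker segment predictions ("start end speaker")
predicted_segments = diar_model.diarize(audio="focus_group.wav", batch_size=1)
print(predicted_segments)
```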
Conclusions
AIDA is a tool developed to help alleviate the bottlenecks faced by qualitative researchers while transcribing audio data. Qualitative studies span a wide range of disciplines and subject areas, often involving complex, conversational audio data such as user interviews and focus groups. Given this broad scope, it is impractical to rely on domain-specific fine-tuning or custom training. AIDA leverages a combination of pre-trained models, integrated into a cohesive pipeline to deliver high-quality transcription and diarization without requiring technical expertise from the end user.
AIDA demonstrates that multiple open-source AI models can be adapted and combined effectively to meet the nuanced needs of qualitative research. Preliminary user testing indicates that the tool provides transcription quality on par with or superior to several leading commercial solutions. Notably, AIDA generates high-quality transcriptions with accurate punctuation and subtle discourse features, such as the use of ellipses to reflect speaker hesitations or pauses during reflective speech. End-to-end neural diarization further ensures accurate speaker attribution, with fewer instances of speaker confusion.
Future Work
Preliminary user testing of AIDA has yielded promising results, particularly in comparison to existing commercial transcription tools. Users noted improvements in both transcription quality and speaker labeling accuracy, especially in capturing nuances of speech. However, to more rigorously assess AIDA’s performance, future work includes benchmarking against standard evaluation datasets. Specifically, Word Error Rate (WER) and Diarization Error Rate (DER) will be calculated to provide a quantitative measure of performance.
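Both metrics can be computed with standard open-source packages; the sketch below assumes the jiwer and pyannote.metrics libraries, with toy reference and hypothesis values for illustration.

```python
# Sketch of the planned evaluation, assuming jiwer (WER) and
# pyannote.metrics (DER). All texts, segments, and labels are toy examples.
from jiwer import wer
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Word Error Rate against a reference transcript
reference_text = "okay here is what I am thinking"
hypothesis_text = "okay here's what I'm thinking"
print(f"WER: {wer(reference_text, hypothesis_text):.3f}")

# Diarization Error Rate against reference speaker segments
reference = Annotation()
reference[Segment(0.0, 10.0)] = "speaker_A"
reference[Segment(10.0, 20.0)] = "speaker_B"

hypothesis = Annotation()
hypothesis[Segment(0.0, 11.0)] = "speaker_1"
hypothesis[Segment(11.0, 20.0)] = "speaker_2"

metric = DiarizationErrorRate()
print(f"DER: {metric(reference, hypothesis):.3f}")
```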
In addition to transcription and diarization capabilities, the team is currently developing an NLP pipeline that will extend AIDA’s analytical functionality. This pipeline will leverage transcription and diarization outputs to deliver automated qualitative analysis support. Using large language models (LLMs) and sentiment analysis models, the system will extract key phrases and sentiments from the transcripts. These will be used to generate an initial codebook, a structured set of thematic codes tailored to the content of the conversation. Subsequently, these codes will be aggregated into higher-level research themes, facilitating the early stages of qualitative data analysis.
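As a rough sketch of the sentiment component (the model choice, transcript segments, and speaker labels below are illustrative assumptions, not AIDA’s finalized pipeline):

```python
# Rough sketch: per-utterance sentiment over diarized transcript segments.
# The model and the example segments are illustrative assumptions.
from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# (speaker, utterance) pairs as produced by the transcription/diarization steps
segments = [
    ("speaker_A", "I really liked the onboarding flow."),
    ("speaker_B", "Hmm... the settings page was confusing, honestly."),
]

for speaker, utterance in segments:
    result = sentiment(utterance)[0]
    print(f"{speaker}: {result['label']} ({result['score']:.2f}) - {utterance}")
```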
Ultimately, AIDA is envisioned as a comprehensive, end-to-end tool that supports the entire workflow of qualitative researchers, from data collection and transcription through speaker diarization to thematic analysis, while remaining accessible, customizable, and grounded in advanced AI methodologies.