What It Takes to Train a Versatile Speech AI System

This section outlines the core tasks used to train a multitask speech AI model, covering transcription, translation, emotion detection, keyword analysis, and more.


This content originally appeared on HackerNoon and was authored by Phonology Technology

Abstract and 1 Introduction

2 Approach

2.1 Architecture

2.2 Multimodal Instruction Finetuning

2.3 Curriculum Learning with Parameter Efficient Finetuning

3 Experiments

4 Results

4.1 Evaluation of SpeechVerse models

4.2 Generalization Across Instructions

4.3 Strategies for Improving Performance

5 Related Work

6 Conclusion, Limitations, Ethics Statement, and References

A Appendix

A.1 Audio Encoder Pre-training

A.2 Hyper-parameters

A.3 Tasks

A.3 Tasks

We provide details about our training tasks below, along with qualitative examples in Table 10 to better illustrate each task.

ASR: We use a combination of 5 publicly available datasets for the ASR task, totaling 3k hours of paired audio and text data. We evaluate performance on standard ASR benchmarks.

ST: We train our models to predict translations into multiple languages from audio recordings of English speech. The choice of potential target languages is constrained by the tokenizer of the backbone LLM. In our case, we train and evaluate on German, French, and Romanian translations from the EuroParl dataset [53]. We also augment the training data with German and Catalan translations from the CoVoST2 [28] dataset.
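As a rough illustration of that tokenizer constraint, the sketch below checks how candidate target languages are handled by an LLM tokenizer; the checkpoint name and the sample sentences are placeholders for the example, not the paper's actual backbone or data.

```python
# Illustrative sketch (not from the paper): probe how well a backbone LLM's
# tokenizer covers candidate target languages before choosing ST targets.
# "google/flan-t5-base" is a placeholder checkpoint, not the paper's backbone.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

samples = {
    "German": "Das Europäische Parlament tritt heute zusammen.",
    "French": "Le Parlement européen se réunit aujourd'hui.",
    "Romanian": "Parlamentul European se reunește astăzi.",
}

for lang, text in samples.items():
    ids = tokenizer(text).input_ids
    # A high share of <unk> tokens suggests the language is poorly covered
    # by this tokenizer and is therefore a weak choice of ST target.
    unk_rate = sum(i == tokenizer.unk_token_id for i in ids) / len(ids)
    print(f"{lang}: {len(ids)} tokens, unk rate {unk_rate:.2%}")
```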

IC/SF: We train and evaluate our models on a subset of the SLURP dataset [25] consisting of 10 intent classes and 4 slot labels. This also allows us to study the generalization of our models to unseen class labels, which we examine separately in Section 4.2. The intent classes and slot labels chosen for the "seen" subset are those that occur most frequently in the training data. The training prompt for this task is designed to contain a description of each class label.
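To make the last point concrete, here is a minimal sketch of a training prompt that embeds a description for each class label; the intent and slot names below are hypothetical stand-ins, not the actual SLURP subset or the wording used in the paper.

```python
# Hypothetical sketch of an IC/SF instruction prompt that includes a short
# description of every class label; labels and descriptions are illustrative only.
INTENT_CLASSES = {
    "set_alarm": "the user wants to create or modify an alarm",
    "play_music": "the user wants to play a song, artist, or playlist",
    "get_weather": "the user asks about current or future weather",
}

SLOT_LABELS = {
    "time": "a clock time or relative time expression",
    "artist_name": "the name of a musical artist",
}

def build_prompt() -> str:
    lines = ["Classify the intent of the spoken utterance and extract slot values."]
    lines.append("Possible intents:")
    for name, desc in INTENT_CLASSES.items():
        lines.append(f"- {name}: {desc}")
    lines.append("Possible slots:")
    for name, desc in SLOT_LABELS.items():
        lines.append(f"- {name}: {desc}")
    return "\n".join(lines)

print(build_prompt())
```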

KWE: The goal of this task is to identify important keywords from the spoken content of the audio. Since no publicly available dataset exists for this task, we synthetically extract keywords from the ground-truth transcripts using a text-based keyword extraction model [4]. These are then used as labels for training and evaluating our models.
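As a sketch of this synthetic labeling step, the snippet below runs the text-based keyword extraction model referenced in footnote [4] over a transcript; the "Keywords:" task prefix and the generation settings are assumptions based on common usage of that model, not the authors' exact pipeline.

```python
# Sketch of extracting keyword labels from a ground-truth transcript with the
# text-based model from footnote [4]; generation settings are assumptions.
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "Voicelab/vlt5-base-keywords"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

transcript = "we are planning the quarterly budget review for the marketing team next monday"

# The model card suggests a "Keywords: " task prefix; treated here as an assumption.
inputs = tokenizer("Keywords: " + transcript, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=32, num_beams=4, no_repeat_ngram_size=3)
keywords = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(keywords)  # e.g. a comma-separated keyword list used as the training label
```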

KWS: This is a binary classification task to detect whether a specified keyword was spoken in the audio. We create positive samples by randomly selecting keywords from the ground-truth transcripts and negative samples by choosing a keyword that does not appear in the transcript. Positive and negative examples are created in a 70:30 ratio for both training and evaluation.
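A minimal sketch of how such positive and negative examples could be assembled at a 70:30 ratio is shown below; the keyword granularity and the vocabulary used to sample negatives are assumptions rather than the paper's exact recipe.

```python
# Illustrative construction of keyword-spotting examples from transcripts:
# ~70% positives (keyword present) and ~30% negatives (keyword absent).
import random

def make_kws_example(transcript: str, vocab: list[str], pos_ratio: float = 0.7):
    words = set(transcript.lower().split())
    if random.random() < pos_ratio:
        keyword = random.choice(sorted(words))         # keyword taken from the transcript
        label = "yes"
    else:
        absent = [w for w in vocab if w not in words]  # keyword not in the transcript
        keyword = random.choice(absent)
        label = "no"
    prompt = f"Was the keyword '{keyword}' spoken in the audio?"
    return prompt, label

vocab = ["budget", "holiday", "doctor", "weather", "music"]
print(make_kws_example("please play some music in the kitchen", vocab))
```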

ER: For emotion recognition, we classify speech into one of four main emotion classes: neutral, happy, sad, and angry, chosen based on the availability of training samples in the MSP-Podcast v1.11 dataset [54]. We report metrics on the corresponding four-emotion subset of the Test1 split of the dataset.

ASC: For audio sentiment classification, we classify speech as positive, negative, or neutral in sentiment. The sentiment labels were obtained by thresholding the valence scale (annotated from 1 to 7) at 3 and 5. We train on the entire training split of the MSP-Podcast v1.11 dataset and evaluate on the corresponding Test1 split.
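The thresholding rule itself is simple to write down; how the boundary values 3 and 5 are assigned is not stated, so the comparisons below are an assumption.

```python
# Sketch of mapping an MSP-Podcast valence annotation (1-7 scale) to a sentiment
# label by thresholding at 3 and 5; boundary handling is an assumption.
def valence_to_sentiment(valence: float) -> str:
    if valence <= 3.0:
        return "negative"
    if valence >= 5.0:
        return "positive"
    return "neutral"

for v in (1.5, 3.0, 4.2, 5.5):
    print(v, "->", valence_to_sentiment(v))
```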

SC: For speaker counting, we identify whether one or two speakers are present. We train on segments from Fisher dataset transcripts [29, 30] with one or two speakers, and evaluate on the Fisher test split used in [55].

AC: We train our models to classify speech into five English accents: Canadian, Indian, Australian, British, and American, using metadata from the Mozilla Common Voice dataset.

SNS: In this task, we identify whether speech is present in the audio. We collect a diverse set of audio samples with and without speech for training our models, and evaluate them on a combination of speech segments from the Hub5 [56] dataset and held-out non-speech segments from our in-house collection.


:::info Authors:

(1) Nilaksh Das, AWS AI Labs, Amazon (Equal Contributions);

(2) Saket Dingliwal, AWS AI Labs, Amazon (skdin@amazon.com);

(3) Srikanth Ronanki, AWS AI Labs, Amazon;

(4) Rohit Paturi, AWS AI Labs, Amazon;

(5) Zhaocheng Huang, AWS AI Labs, Amazon;

(6) Prashant Mathur, AWS AI Labs, Amazon;

(7) Jie Yuan, AWS AI Labs, Amazon;

(8) Dhanush Bekal, AWS AI Labs, Amazon;

(9) Xing Niu, AWS AI Labs, Amazon;

(10) Sai Muralidhar Jayanthi, AWS AI Labs, Amazon;

(11) Xilai Li, AWS AI Labs, Amazon;

(12) Karel Mundnich, AWS AI Labs, Amazon;

(13) Monica Sunkara, AWS AI Labs, Amazon;

(14) Daniel Garcia-Romero, AWS AI Labs, Amazon;

(15) Kyu J. Han, AWS AI Labs, Amazon;

(16) Katrin Kirchhoff, AWS AI Labs, Amazon.

:::


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::

[4] https://huggingface.co/Voicelab/vlt5-base-keywords

