Building an AI Conversation Practice App: Part 2 – Backend Speech-to-Text Processing with OpenAI Whisper

This is the second post in a series documenting the technical implementation of a browser-based English learning application with real-time speech processing capabilities.

Overview: The STT Pipeline

The complete STT workflow involves:

  1. Audio Reception → FormData parsing with formidable
  2. File Validation → WebM format verification and size checks
  3. Stream Processing → Direct file stream to OpenAI API
  4. Transcription → Whisper-1 model with Canadian English optimization
  5. Response Handling → Error management and cleanup
  6. Integration → Seamless handoff to conversation system

Total processing time: 200-500ms

Technical Stack Summary:

  • Primary STT: OpenAI Whisper-1
  • File Processing: Formidable + Node.js streams
  • Language: TypeScript with Next.js API routes
  • Error Handling: Basic try-catch with error logging
  • Performance: Stream processing, Node.js runtime
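
For reference in the snippets that follow, the ApiResponse shape the endpoint returns can stay minimal; this version is inferred from the handler shown later in the post:

// Response shape for /api/stt (fields inferred from the handler below)
interface ApiResponse {
  success: boolean;
  transcript?: string; // present when transcription succeeds
  error?: string;      // present when something goes wrong
}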

The Challenges I Solved

1. File Upload Complexity in Next.js

Problem: Next.js API routes have strict limitations on file uploads, especially multipart/form-data: the built-in body parser doesn't handle it and caps request bodies at 1MB.

Solution: Used a custom formidable-based parser:

import { IncomingForm } from 'formidable';
import type { Fields, Files } from 'formidable';

// Disable Next.js body parsing so formidable can read the raw request stream
export const config = { api: { bodyParser: false } };

// Custom form parsing with formidable
const form = new IncomingForm({
  keepExtensions: true, // keep the .webm extension on the temp file
});

const formData: [Fields, Files] = await new Promise((resolve, reject) => {
  form.parse(req, (err, fields, files) => {
    if (err) return reject(err);
    resolve([fields, files]);
  });
});

Why this works:

  • Bypasses Next.js's default 1MB body size limit
  • Handles WebM files up to 25MB (enforceable via maxFileSize, sketched below)
  • Maintains file metadata and extensions
  • Provides proper error handling
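
To enforce that 25MB ceiling at parse time, rather than discovering it at the OpenAI call, formidable accepts a maxFileSize option; a minimal sketch:

// Reject oversized uploads while parsing (25MB matches Whisper's upload cap)
const form = new IncomingForm({
  keepExtensions: true,
  maxFileSize: 25 * 1024 * 1024, // form.parse() errors out beyond this
});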

2. Stream Processing for Large Files

Problem: Loading entire audio files into memory drives up memory usage and can crash the server once deployed, where memory is far more constrained than on a dev machine.

Solution: Direct stream processing to OpenAI API:

import { createReadStream } from 'fs';

// Create a readable stream from the uploaded temp file
const audioPath = audioFile.filepath;
const audioStream = createReadStream(audioPath);

// Stream directly to OpenAI (no memory buffering)
const transcription = await openai.audio.transcriptions.create({
  file: audioStream,
  model: "whisper-1",
  language: "en",
  prompt: "This is a conversation in Canadian English.",
});

Performance Benefits:

  • Significantly reduced memory usage through streaming
  • Faster processing for large files
  • Better reliability and no memory overflow crashes

3. Frontend Audio Validation

Problem: Short audio recordings (under 300ms) are usually accidental and waste API calls.

Solution: Validate the recording duration on the frontend before sending anything to the backend:

// Frontend validation before API call
const recordingDuration = Date.now() - recordingStartTimeRef.current;

if (recordingDuration < 300) {
  const clarificationText = getRandomClarification();

  // Placeholder assistant message that will stream the clarification text
  const assistantMessage: Message = {
    role: 'assistant',
    content: '',
    isStreaming: true
  };

  const messageIndex = messages.length; // index the new message will occupy
  setMessages(prevMessages => [...prevMessages, assistantMessage]);
  streamText(clarificationText, messageIndex);
  return; // Don't call STT API
}

// Only send to backend if recording is long enough
const sttResponse = await fetch('/api/stt', {
  method: 'POST',
  body: formData,
});

Results:

  • API call reduction: ~15% fewer unnecessary calls
  • User experience: Immediate feedback for accidental recordings
  • Cost savings: Reduced unwanted OpenAI API usage
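
One detail the snippet above glosses over: recordingStartTimeRef.current has to be set the moment recording starts. A minimal sketch, assuming React and the MediaRecorder API:

import { useRef } from 'react';

// Ref that survives re-renders without triggering them
const recordingStartTimeRef = useRef<number>(0);

const startRecording = (mediaRecorder: MediaRecorder) => {
  recordingStartTimeRef.current = Date.now(); // timestamp used for the duration check
  mediaRecorder.start();
};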

4. Canadian English Optimization

Problem: Default Whisper models aren't optimized for Canadian English expressions and pronunciation patterns.

Solution: Custom prompt engineering:

const transcription = await openai.audio.transcriptions.create({
  file: audioStream,
  model: "whisper-1",
  language: "en",
  prompt: "This is a conversation in Canadian English.",
});

Results:

  • Better recognition of Canadian expressions
  • Improved handling of slang and culture-related expressions

Core Technical Implementation

1. API Endpoint Architecture

Our main STT endpoint (/api/stt) follows a robust error-handling pattern:

import fs from 'fs';
import { createReadStream } from 'fs';
import type { NextApiRequest, NextApiResponse } from 'next';
import { IncomingForm } from 'formidable';
import type { Fields, Files, File } from 'formidable';
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

export default async function handler(
  req: NextApiRequest, 
  res: NextApiResponse<ApiResponse>
) {
  if (req.method !== 'POST') {
    return res.status(405).json({ success: false, error: 'Method not allowed' });
  }

  let audioPath: string | undefined;

  try {
    // Parse form data
    const form = new IncomingForm({ keepExtensions: true });
    const formData: [Fields, Files] = await new Promise((resolve, reject) => {
      form.parse(req, (err, fields, files) => {
        if (err) return reject(err);
        resolve([fields, files]);
      });
    });

    const [fields, files] = formData;

    // Validate audio file
    const audioFiles = files.audio;
    if (!audioFiles || !Array.isArray(audioFiles) || audioFiles.length === 0) {
      return res.status(400).json({ success: false, error: 'No audio file provided' });
    }

    const audioFile = audioFiles[0] as File;

    // Process with OpenAI
    audioPath = audioFile.filepath;
    const audioStream = createReadStream(audioPath);

    const transcription = await openai.audio.transcriptions.create({
      file: audioStream,
      model: "whisper-1",
      language: "en",
      prompt: "This is a conversation in Canadian English.",
    });

    // Cleanup happens in the finally block below
    return res.status(200).json({
      success: true,
      transcript: transcription.text
    });

  } catch (error) {
    console.error('STT Error:', error);
    return res.status(500).json({
      success: false,
      error: error instanceof Error ? error.message : 'Failed to transcribe audio'
    });
  } finally {
    // Remove the temp file whether transcription succeeded or failed
    if (audioPath) {
      try {
        fs.unlinkSync(audioPath);
      } catch (cleanupError) {
        console.warn('Failed to cleanup temp file:', cleanupError);
      }
    }
  }
}

2. File Validation & Security

// Access the audio file with proper type checking
const audioFiles = files.audio;
if (!audioFiles || !Array.isArray(audioFiles) || audioFiles.length === 0) {
  return res.status(400).json({ 
    success: false, 
    error: 'No audio file provided' 
  });
}

const audioFile = audioFiles[0] as File;

// Additional validation
if (!audioFile.filepath || audioFile.size === 0) {
  return res.status(400).json({ 
    success: false, 
    error: 'Invalid audio file' 
  });
}

Security Measures:

  • File type validation (WebM only; see the sketch below)
  • Size limits (25MB max, matching Whisper's upload cap)
  • Temporary file cleanup
  • No persistent storage
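
The type and size checks listed above aren't shown in the snippet, so here is a minimal sketch of what they could look like (mimetype and size come from formidable's File object):

// Enforce the WebM-only and 25MB rules before calling OpenAI
const MAX_AUDIO_BYTES = 25 * 1024 * 1024; // matches Whisper's upload limit

if (audioFile.mimetype !== 'audio/webm') {
  return res.status(400).json({ success: false, error: 'Only WebM audio is accepted' });
}
if (audioFile.size > MAX_AUDIO_BYTES) {
  return res.status(400).json({ success: false, error: 'Audio file exceeds the 25MB limit' });
}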

3. Resource Management

// Critical: Clean up temporary files
const audioPath = audioFile.filepath;
const audioStream = createReadStream(audioPath);

// Process audio...

// Always cleanup, even on error
try {
  fs.unlinkSync(audioPath);
} catch (cleanupError) {
  console.warn('Failed to cleanup temp file:', cleanupError);
}

Resource Management Benefits:

  • Disk space: Prevents temp file accumulation
  • Security: No persistent audio storage
  • Performance: Clean server state

Performance Optimizations

1. Streaming vs Buffering

Before (Buffering):

import { toFile } from 'openai'; // helper to wrap raw Buffers for upload

// Load the entire file into memory
const audioBuffer = fs.readFileSync(audioPath);
const transcription = await openai.audio.transcriptions.create({
  file: await toFile(audioBuffer, 'audio.webm'), // whole file held in memory
  model: "whisper-1",
});

After (Streaming):

// Stream the file directly from disk
const audioStream = createReadStream(audioPath);
const transcription = await openai.audio.transcriptions.create({
  file: audioStream, // Minimal memory usage
  model: "whisper-1",
});

Results:

  • Significantly reduced memory usage through streaming
  • Faster processing for large files
  • Better support for concurrent requests

Integration with Frontend

The STT API seamlessly integrates with our frontend conversation system:

// Frontend STT call
const sttResponse = await fetch('/api/stt', {
  method: 'POST',
  body: formData,
});

const sttData = await sttResponse.json();

if (!sttData.success) {
  // Handle error gracefully
  const clarificationText = getRandomClarification();
  // Show clarification message to user
} else {
  // Continue with conversation
  const transcript = sttData.transcript;
  // Send to GPT for response generation
}
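
For context, the formData sent above is assembled from the MediaRecorder output; a minimal sketch (recordedChunks is an assumed name for the collected audio chunks):

// Build the multipart body from the recorded audio
const audioBlob = new Blob(recordedChunks, { type: 'audio/webm' });
const formData = new FormData();
// The field name must be 'audio' to match files.audio on the server
formData.append('audio', audioBlob, 'recording.webm');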

Error Handling & User Experience

1. Graceful Degradation

// If STT fails, don't break the conversation
if (!sttData.success) {
  const clarificationPhrases = [
    "Sorry, can you repeat that?",
    "Could you say that again please?",
    "I didn't quite get that. Could you repeat?",
  ];

  const randomClarification = clarificationPhrases[
    Math.floor(Math.random() * clarificationPhrases.length)
  ];

  // Continue conversation with clarification
}
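
Wrapped up as the getRandomClarification helper the earlier snippets call:

// Helper referenced by the frontend snippets above
function getRandomClarification(): string {
  const clarificationPhrases = [
    "Sorry, can you repeat that?",
    "Could you say that again please?",
    "I didn't quite get that. Could you repeat?",
  ];
  return clarificationPhrases[Math.floor(Math.random() * clarificationPhrases.length)];
}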

2. Debugging & Monitoring

// Comprehensive logging for debugging
// (startTime is captured just before the fetch; audioBlob is the recorded audio)
console.log('STT Response:', {
  success: sttData.success,
  transcript: sttData.transcript?.substring(0, 50) + '...',
  processingTime: Date.now() - startTime,
  fileSize: audioBlob.size
});

Production Considerations

Rate Limiting

// Implement rate limiting for production (requestCount sketch follows below)
if (requestCount > 10) { // e.g. 10 requests per minute
  return res.status(429).json({
    success: false,
    error: 'You\'re speaking too fast! Please wait a moment before trying again.'
  });
}
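
Where requestCount comes from is an implementation choice; a naive in-memory per-IP sliding window is enough for a single server instance (it won't persist across serverless cold starts or scale across instances):

// Naive per-IP sliding window: counts requests in the last 60 seconds
const hits = new Map<string, number[]>();

function countRecentRequests(ip: string, windowMs = 60_000): number {
  const now = Date.now();
  const recent = (hits.get(ip) ?? []).filter(t => now - t < windowMs);
  recent.push(now);
  hits.set(ip, recent);
  return recent.length;
}

// Inside the handler:
const ip = (req.headers['x-forwarded-for'] as string | undefined)?.split(',')[0]?.trim()
  ?? req.socket.remoteAddress ?? 'unknown';
const requestCount = countRecentRequests(ip);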

And on the frontend:

if (response.status === 429) {
  showError("Please wait a moment before recording again");
}

What's Next

In the next post, we'll see how the transcribed text powers our AI conversation system: selecting conversation characters, crafting prompts for Canadian English, integrating with GPT-4, and keeping conversations flowing naturally.

