This project is a Go backend server that provides real-time, low-latency Speech-to-Speech (S2S) streaming. It captures audio from a client over a WebSocket connection, transcribes it in real time using Deepgram, and sends the transcribed text to an OpenAI-compatible Large Language Model (LLM) for a text response. The LLM's text is then converted to audio, sentence by sentence, using the ElevenLabs API, and the resulting audio is streamed back to the client over the same WebSocket as it is generated.
The primary goal is to minimize perceived latency for the end-user by:
- Streaming audio input for live transcription.
- Quickly processing the transcribed text with an LLM.
- Starting audio playback of the AI's response as soon as the first sentence is synthesized, while subsequent sentences are still being generated and processed.
- WebSocket Communication: Uses WebSockets for real-time, bidirectional audio and data communication.
- Live Speech-to-Text: Integrates Deepgram for real-time audio transcription with Voice Activity Detection (VAD).
- OpenAI/LLM Integration: Streams transcribed text to any OpenAI API-compatible LLM (e.g., GitHub Models, OpenAI's GPT).
- Conversation Memory: Maintains conversation history during each WebSocket session, providing context for the LLM to generate more relevant responses.
- PostgreSQL Integration: Stores conversation history in a PostgreSQL database for persistence across sessions, with non-blocking database operations to maintain low latency.
- ElevenLabs Text-to-Speech Integration:
- 🆕 WebSocket Streaming (Recommended): Real-time WebSocket connection to ElevenLabs for ultra-low latency audio generation. Text is streamed incrementally as OpenAI generates it, providing faster Time-to-First-Byte (TTFB).
- HTTP Streaming (Fallback): Traditional HTTP-based streaming for compatibility, processes text sentence by sentence.
- Utilizes advanced buffering control with `chunk_length_schedule` and context-aware generation with `previous_text` and `next_text` parameters.
- System Metrics: Provides a real-time metrics endpoint for monitoring active connections, CPU/memory usage, and other system statistics.
- Status Dashboard: Includes a simple web interface for viewing system metrics.
- Pipelined Streaming Workflow:
- Client streams audio to the backend.
- Backend streams audio to Deepgram for live transcription.
- Deepgram sends back transcript segments (interim and final).
- Upon utterance end (pause in client's speech), the accumulated final transcript is:
- Added to the client's conversation history as a user message
- Sent to the LLM along with previous conversation context
- LLM receives the full conversation history and streams text response back.
- LLM response is processed into sentences for real-time streaming while the full response is also captured.
- The complete LLM response is stored in the conversation history as an assistant message.
- Sentences are sent to ElevenLabs for TTS using a sliding window approach:
  - The first sentence is sent with its `next_text` (the second sentence) once available.
  - Subsequent sentences are sent with their `previous_text` and `next_text`.
  - The final sentence is sent with its `previous_text`.

  This provides context to ElevenLabs for improved speech continuity.
- Audio for each sentence is streamed back to the client as soon as it's synthesized by ElevenLabs, allowing for low-latency playback while subsequent text is still being generated and processed.
- Low Latency Focus: Optimized HTTP client for ElevenLabs (TCP_NODELAY, HTTP/2).
- Concurrent Handling: Designed to handle multiple client connections concurrently.
- Context-Aware Cancellation: Gracefully handles client disconnects and server-side cancellations.
- Modular Design: Code is organized into packages for configuration, handlers, services, and utilities.
- Configuration: API keys and service parameters are managed via environment variables or defaults.
- Error Handling & Logging: Includes structured logging and error propagation.
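The sliding-window TTS context described above (each sentence sent with its `previous_text` and `next_text`) can be sketched in Go. This is an illustrative sketch, not the project's actual code; `sentenceContext` and `buildContexts` are hypothetical names:

```go
package main

import "fmt"

// sentenceContext pairs a sentence with the neighbouring text that would be
// sent to ElevenLabs as previous_text / next_text.
type sentenceContext struct {
	Text, PreviousText, NextText string
}

// buildContexts applies the sliding-window rule: the first sentence gets only
// next_text, middle sentences get both, and the last gets only previous_text.
func buildContexts(sentences []string) []sentenceContext {
	out := make([]sentenceContext, len(sentences))
	for i, s := range sentences {
		c := sentenceContext{Text: s}
		if i > 0 {
			c.PreviousText = sentences[i-1]
		}
		if i < len(sentences)-1 {
			c.NextText = sentences[i+1]
		}
		out[i] = c
	}
	return out
}

func main() {
	for _, c := range buildContexts([]string{"Hello.", "How are you?", "Goodbye."}) {
		fmt.Printf("%+v\n", c)
	}
}
```

In the real pipeline the window is built incrementally as sentences arrive (the first sentence is held until its `next_text` is known), but the context rule is the same.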
The project is organized into the following directory structure:
chat_audio_streamer/
├── go.mod # Go module definition
├── go.sum # Go module checksums
├── main.go # Main application entry point, HTTP server setup
├── index.html # Example HTML/JavaScript client for Speech-to-Speech
├── config/
│ └── config.go # Configuration loading (API keys, URLs, model IDs)
├── database/
│ └── db.go # PostgreSQL database connection and operations
├── handlers/
│ ├── init.go # API client initialization (OpenAI, ElevenLabs, Deepgram SDK)
│ ├── metrics_handler.go # Endpoint handler for system metrics
│ └── websocket_handler.go # WebSocket connection handling, S2T, and core orchestration
├── metrics/
│ └── metrics.go # System metrics collection and tracking
├── services/
│ ├── elevenlabs_service.go # Logic for ElevenLabs API interaction and audio streaming
│ └── openai_service.go # Logic for OpenAI API interaction and text streaming
└── utils/
├── http_client.go # Custom HTTP client setup for ElevenLabs
└── json_utils.go # Utility for pretty-printing JSON (for logging)
- Go (version 1.21 or higher recommended)
- A Deepgram account and API key for speech-to-text.
- Access to an OpenAI-compatible LLM API endpoint and an API key/token.
- An ElevenLabs account and API key for text-to-speech.
- An ElevenLabs Voice ID.
- (Optional) PostgreSQL database for persistent conversation storage.
- Clone the Repository (or create the project files): If you haven't already, create the project directory (`chat_audio_streamer`) and populate it with the Go files and `index.html`.
- Initialize Go Module: Navigate to the root directory (`chat_audio_streamer`) and run `go mod init chat_audio_streamer` (or your chosen module name) followed by `go mod tidy`. This will download the necessary dependencies, including:
  - `github.com/gorilla/websocket`
  - `github.com/openai/openai-go`
  - `github.com/deepgram/deepgram-go-sdk/v2`
- Configure API Keys and Endpoints: The application loads configuration from environment variables. Set the following environment variables before running the application:
  - `DEEPGRAM_API_KEY`: Your API key for Deepgram.
  - `OPENAI_API_KEY`: Your API key/token for the OpenAI-compatible LLM. For GitHub Models, this is likely a GitHub Personal Access Token (PAT) with appropriate scopes.
  - `OPENAI_BASE_URL`: (Optional, defaults to `https://api.cerebras.ai/v1` in code) The base URL for the LLM API.
  - `OPENAI_MODEL`: (Optional, defaults to `llama-4-scout-17b-16e-instruct` in code) The model to use.
  - `OPENAI_SYSTEM_PROMPT`: (Optional, defaults to "You are a helpful assistant. Respond clearly and concisely, and do not use markdown formatting. Also give short responses.") The system prompt for the LLM.
  - `ELEVENLABS_API_KEY`: Your API key for ElevenLabs.
  - `ELEVENLABS_VOICE_ID`: (Optional, defaults to a sample ID like `ecp3DWciuUyW7BYM7II1` in code) The ElevenLabs Voice ID you want to use.
  - `ELEVENLABS_MODEL_ID`: (Optional, defaults to `eleven_flash_v2_5` in code) The ElevenLabs model ID.
  - `ELEVENLABS_OUTPUT_FORMAT`: (Optional, defaults to `mp3_44100_128` in code) The desired audio output format.
  - `ELEVENLABS_BASE_URL`: (Optional, defaults to `https://api.elevenlabs.io/v1` in code) The base URL for the ElevenLabs API.
  - `ELEVENLABS_MIME_TYPE`: (Optional, defaults to `audio/mpeg` in code) Corresponds to the output format.
  - `ELEVENLABS_USE_WEBSOCKET`: (Optional, defaults to `true`) Set to `true` to use WebSocket streaming for lower latency, or `false` to use HTTP streaming (fallback mode).
  - `PG_DB_URL`: (Optional) PostgreSQL database connection string in the format `postgres://username:password@host:port/dbname?sslmode=disable`. If not provided, the application will function without database persistence.
  - `FIREBASE_SERVICE_ACCOUNT_KEY_PATH`: (Optional, defaults to `serviceAccountKey.json`) Path to your Firebase service account credentials JSON file. This file is required for Firebase Storage integration.
  - `FIREBASE_STORAGE_BUCKET`: (Optional, defaults to `saahara-1.appspot.com`) Your Firebase Storage bucket name where audio recordings will be stored.
Example (bash/zsh):
    export DEEPGRAM_API_KEY="YOUR_DEEPGRAM_KEY"
    export OPENAI_API_KEY="ghp_YOUR_GITHUB_PAT_OR_OPENAI_KEY"
    export ELEVENLABS_API_KEY="sk_YOUR_ELEVENLABS_KEY"
    export ELEVENLABS_VOICE_ID="YOUR_VOICE_ID"
    export ELEVENLABS_USE_WEBSOCKET="true"  # Enable WebSocket streaming for lower latency
    # ... set other variables as needed
Alternatively, you can modify the fallback default values directly in `config/config.go`, but using environment variables is highly recommended for security and flexibility.
- Port Configuration: The server listens on port `8080` by default. You can change this by setting the `PORT` environment variable: `export PORT=8888`
Navigate to the root directory of the project (chat_audio_streamer) and run:
    go run main.go

You should see log messages indicating that the API clients are initialized and the WebSocket server is starting:
    YYYY/MM/DD HH:MM:SS Initializing API clients...
    YYYY/MM/DD HH:MM:SS Deepgram client initialized.
    YYYY/MM/DD HH:MM:SS OpenAI client configured for BaseURL: <your_openai_base_url>
    YYYY/MM/DD HH:MM:SS ElevenLabs HTTP client initialized with custom transport.
    YYYY/MM/DD HH:MM:SS Successfully connected to Firebase Storage for bucket: <your_firebase_bucket>
    YYYY/MM/DD HH:MM:SS Starting WebSocket server on ws://localhost:8080/ws/chat-audio
WebSocket Endpoint: ws://localhost:<PORT>/ws/chat-audio
Default: ws://localhost:8080/ws/chat-audio
Metrics Endpoint: http://localhost:<PORT>/api/metrics
Default: http://localhost:8080/api/metrics
This endpoint provides real-time system metrics in JSON format including:
- Active and total connections
- CPU and memory usage
- Goroutine count
- Uptime and other runtime statistics
History Endpoint: http://localhost:<PORT>/api/history?sessionId=<SESSION_ID>
Default: http://localhost:8080/api/history?sessionId=<SESSION_ID>
Retrieves the conversation history for a specific session, including:
- Message ID
- Role (user or assistant)
- Content (decrypted)
- Creation timestamp
- Feedback (if provided)
Sessions Endpoint: http://localhost:<PORT>/api/sessions?userId=<USER_ID>
Default: http://localhost:8080/api/sessions?userId=<USER_ID>
Retrieves a list of all sessions for a specific user ID, including:
- Session ID
- User ID
- Creation timestamp
- Last activity timestamp
- Model ID and Voice ID
Feedback Endpoint: http://localhost:<PORT>/api/feedback
Default: http://localhost:8080/api/feedback
Accepts POST requests with a JSON payload to update feedback for a specific message:

    {
      "message_id": 123,
      "feedback": "positive"  // or "negative"
    }

Status Dashboard: http://localhost:<PORT>/status
Default: http://localhost:8080/status
A simple web interface that displays system metrics in a user-friendly format with auto-refresh.
The application integrates with PostgreSQL for persistent storage of conversation history:
- Asynchronous Operations: All database operations (reads and writes) are implemented using goroutines to ensure they don't block the main conversation flow, preserving low latency in the real-time audio streaming experience.
- Connection Pooling: The database connection pool is configured with optimized settings (25 max open connections, 25 max idle connections, 5-minute connection lifetime) to handle concurrent requests efficiently.
- Schema Design:
  - `sessions` table: Stores session metadata including:
    - `session_id` - Unique identifier for the session
    - `user_id` - Client's user identifier
    - `created_at` - When the session was created
    - `last_activity_at` - Timestamp of the most recent activity
    - `model_id` - The OpenAI model used for this session
    - `voice_id` - The ElevenLabs voice used for this session
  - `messages` table: Stores conversation messages including:
    - `id` - Unique message identifier
    - `session_id` - Reference to the parent session
    - `role` - Message role ('user' or 'assistant')
    - `content` - Encrypted message content
    - `audio_url` - Optional URL to an audio file (if stored)
    - `feedback` - Optional user feedback on the message ('positive' or 'negative')
    - `created_at` - When the message was created
- Conversation Persistence: Messages from both user and assistant are stored in the database as they occur, enabling:
- History retrieval across sessions
- Conversation continuity even after disconnections
- Potential for analytics and user experience improvements
- Feedback collection for message quality assessment
- Optional Integration: The database integration is optional; if no database connection string is provided (`PG_DB_URL`), the application will function with in-memory conversation history only.
- Selective Encryption: Only the message content is encrypted before storage to enhance data security, while keeping other fields searchable.
- Messages are stored asynchronously in separate goroutines to prevent blocking the main conversation flow
- Database indexes are created automatically for optimized query performance on `session_id` and `user_id`
- Error handling is robust, with detailed logging that doesn't interrupt the user experience
- Feedback can be provided on messages via the `/api/feedback` endpoint
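The non-blocking write pattern described above (fire a goroutine per insert, never stall the audio pipeline) can be sketched as follows. This is an illustrative sketch with an in-memory stand-in for the PostgreSQL layer; `messageStore` and `StoreAsync` are hypothetical names:

```go
package main

import (
	"fmt"
	"sync"
)

// messageStore stands in for the database layer; the real code issues an
// INSERT, but the concurrency pattern is the same.
type messageStore struct {
	mu       sync.Mutex
	wg       sync.WaitGroup
	messages []string
}

// StoreAsync persists a message in a separate goroutine so the caller —
// the real-time audio pipeline — never blocks on the database.
func (s *messageStore) StoreAsync(role, content string) {
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		s.mu.Lock()
		defer s.mu.Unlock()
		s.messages = append(s.messages, role+": "+content)
	}()
}

// Wait flushes pending writes (useful at shutdown or in tests).
func (s *messageStore) Wait() { s.wg.Wait() }

func main() {
	var store messageStore
	store.StoreAsync("user", "Hello there")
	store.StoreAsync("assistant", "Hi! How can I help?")
	store.Wait()
	fmt.Println(len(store.messages), "messages persisted")
}
```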
In addition to real-time streaming, the application also includes an audio recording and storage feature that:
- Concatenates Audio Chunks: As audio is streamed through the system, both user speech and AI-generated responses are concatenated in memory buffers.
  - `userAudioBuffer` stores raw audio data from the client's microphone
  - `assistantAudioBuffer` stores the synthesized audio from ElevenLabs
- Firebase Integration: Completed audio recordings are automatically uploaded to Firebase Storage:
  - User audio is converted from raw PCM to properly formatted WAV files
  - Assistant audio is stored in the format specified by `ELEVENLABS_OUTPUT_FORMAT` (default: MP3)
  - Files are organized in the storage bucket using a hierarchical structure: `audio_recordings/{user_id}/{session_id}/`
  - Each filename includes a UUID and timestamp to ensure uniqueness
- Audio Duration Calculation and Metadata: Enhanced audio storage with comprehensive metadata:
- Automatic Duration Detection: Calculates audio duration for multiple formats including WAV, MP3, and FLAC
- Format-Specific Parsing: Implements dedicated parsers for different audio formats:
- WAV: Full implementation with RIFF header and chunk parsing
- MP3: Complete MPEG frame analysis with support for MPEG 1/2/2.5 and bitrate detection
- FLAC: STREAMINFO block parsing for sample rate and total samples
- WebM/OGG/AAC: Framework in place for future implementation
- Smart Format Detection: Automatically detects audio format by file signature when content type is unclear
- Rich Metadata Storage: Each uploaded file includes comprehensive metadata:
- Duration in seconds (calculated automatically)
- Upload timestamp
- File size in bytes
- User ID and session ID for organization
- Content type and format information
- Metadata Retrieval Functions: API functions for retrieving audio metadata and duration information
- Session Audio Listing: Ability to list all audio files for a session with their durations and metadata
- Database References: After successful upload, the audio file URLs are encrypted and stored in the database:
  - The `audio_url` field in the `messages` table is updated with the encrypted Firebase Storage URL
  - URLs are retrieved and decrypted when conversation history is requested
- Asynchronous Processing: All audio processing and uploading operations happen asynchronously:
- Audio uploads occur in the background after message processing completes
- Database updates happen non-blocking to maintain low latency in the primary conversation flow
- Audio processing continues even if a client disconnects, ensuring complete conversation archiving
- Configuration Options:
  - `FIREBASE_SERVICE_ACCOUNT_KEY_PATH`: Path to your Firebase service account credentials JSON file (default: `serviceAccountKey.json`)
  - `FIREBASE_STORAGE_BUCKET`: Your Firebase Storage bucket name
This feature enables:
- Complete conversation archiving with both text and audio
- Playback of previous conversations with accurate duration information
- Training data collection for AI improvement with detailed audio metadata
- Quality assurance and user experience analysis with comprehensive audio metrics
- Session analytics including total conversation duration and audio file organization
- Efficient audio file management with metadata-based search and filtering
The audio storage system is designed to be lightweight on the main processing thread, maintaining the application's focus on low-latency real-time communication while providing comprehensive conversation history.
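The duration calculation for raw PCM (and for WAV once the header has been parsed) is simple arithmetic over the stream parameters. A minimal sketch, assuming 16-bit linear PCM; `pcmDurationSeconds` is an illustrative name:

```go
package main

import "fmt"

// pcmDurationSeconds computes the playback length of raw PCM audio from its
// byte length and stream parameters — the same arithmetic a WAV duration
// parser performs after reading the RIFF header:
//
//	duration = dataBytes / (sampleRate * channels * bytesPerSample)
func pcmDurationSeconds(dataBytes, sampleRate, channels, bitsPerSample int) float64 {
	bytesPerSecond := sampleRate * channels * (bitsPerSample / 8)
	if bytesPerSecond == 0 {
		return 0
	}
	return float64(dataBytes) / float64(bytesPerSecond)
}

func main() {
	// 16 kHz, mono, 16-bit audio: 32,000 bytes per second.
	fmt.Println(pcmDurationSeconds(64000, 16000, 1, 16), "seconds")
}
```

MP3 and FLAC need real frame/STREAMINFO parsing because their byte rate is not constant, which is why the project implements format-specific parsers.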
The provided index.html serves as an example client for this Speech-to-Speech system.
- Connect: The client establishes a WebSocket connection to the endpoint upon page load.
- Send Audio (Recording):
- The user clicks "Start Recording".
- The browser captures audio from the microphone (typically at 16kHz, 16-bit PCM as configured).
- This audio data is sent as binary WebSocket messages to the server.
- When the user clicks "Stop Recording", a JSON message `{"type": "closeMicrophone"}` is sent to signal the end of audio input.
- Receive Status and Transcription Updates:
- The server may send JSON text messages to update the client on the status (e.g., "Transcription complete. AI is processing...", "AI response finished.").
- Interim transcription updates from Deepgram might also be relayed.
- Receive AI's Spoken Audio:
- Once the AI generates text and ElevenLabs synthesizes speech, the server streams back binary WebSocket messages. Each message contains a chunk of audio data (e.g., MP3).
- The `index.html` client uses Media Source Extensions (MSE) to play this streamed audio in real-time.
- Receive Errors (Optional): If an error occurs server-side, the server may send a JSON text message: `{ "error": "Description of the error" }`
- Connection Close: The server closes the WebSocket connection if an unrecoverable error occurs. The client also handles connection closures.
To use the example HTML client:
- Ensure the Go backend server is running and configured with your API keys.
- Open `index.html` (located in the project root) in a modern web browser (Chrome, Firefox, Edge recommended).
- Allow microphone access when prompted.
- Click "Start Recording", speak, and then click "Stop Recording".
The server logs various stages of processing to the console, including:
- Client connections and disconnections
- Deepgram connection status and transcription events (like UtteranceEnd)
- Transcribed text being sent to the LLM
- Sentences produced by the LLM
- Requests to ElevenLabs for TTS
- Audio streaming events for the response
- Errors encountered during any stage
Check the server console output for these logs. The client-side JavaScript in index.html also logs extensively to the browser's developer console.
Initializes configuration, API clients (Deepgram, OpenAI, ElevenLabs), and starts the HTTP server with the WebSocket handler.
Loads application configuration from environment variables with defaults. This includes API keys, model IDs, service URLs for all three services (Deepgram, OpenAI, ElevenLabs), and the PostgreSQL database connection string.
Manages the PostgreSQL database connection and operations:
- Initializes the database connection pool with optimized settings
- Creates the schema if it doesn't exist (sessions and messages tables)
- Provides functions for storing and retrieving messages
- Implements session management functionality
- All database operations are designed to be non-blocking
Contains the `InitializeClients` function. It initializes the Deepgram SDK, the `openai.Client`, and the custom `http.Client` for ElevenLabs.
This is the core of the server-side WebSocket logic:
- Upgrades HTTP requests to WebSocket connections.
- Speech-to-Text (Deepgram):
- Initializes a Deepgram live transcription client for each WebSocket connection.
- Implements a `DeepgramCallbackHandler` to process events from Deepgram (e.g., `Open`, `Message`, `UtteranceEnd`, `Error`).
- Forwards binary audio data received from the client to Deepgram.
- Accumulates the transcript from Deepgram.
- Uses Deepgram's `UtteranceEnd` event to determine when the user has paused speaking.
- LLM Processing (OpenAI-compatible):
  - Once a complete utterance is transcribed, sends the text to `services.StreamOpenAIText`.
- Text-to-Speech (ElevenLabs):
  - Pipes sentences from the LLM response to `services.StreamTTSWebSocket` for audio synthesis.
- Manages concurrent goroutines for all streaming operations.
- Handles context cancellation for graceful shutdown if the client disconnects or an error occurs.
- Streams binary audio data (the AI's speech) back to the client.
- Sends JSON status and error messages to the client.
`StreamOpenAIText`: Connects to the LLM, sends the prompt (transcribed text), and streams back the text response. It parses the incoming text into sentences and sends each to an output channel.
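The sentence parsing is, as noted elsewhere in this README, deliberately basic. A minimal sketch of terminator-based splitting of streamed text; `splitSentences` is an illustrative name, not the service's actual tokenizer:

```go
package main

import (
	"fmt"
	"strings"
)

// splitSentences cuts text at the terminators '.', '!' and '?' — a simplified
// stand-in for the sentence splitting applied to the LLM's streamed output.
func splitSentences(text string) []string {
	var sentences []string
	var b strings.Builder
	for _, r := range text {
		b.WriteRune(r)
		if r == '.' || r == '!' || r == '?' {
			if s := strings.TrimSpace(b.String()); s != "" {
				sentences = append(sentences, s)
			}
			b.Reset()
		}
	}
	if s := strings.TrimSpace(b.String()); s != "" {
		sentences = append(sentences, s) // trailing fragment without a terminator
	}
	return sentences
}

func main() {
	fmt.Println(splitSentences("Hello there! How are you today? I am fine."))
}
```

A real tokenizer must also handle abbreviations, decimals, and quoted punctuation, which is why "More Sophisticated Sentence Tokenization" appears under potential improvements below.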
`StreamTTSWebSocket`: Takes a sentence of text, makes a POST request to the ElevenLabs API, and streams the resulting audio chunks via a callback, suitable for WebSocket transmission.
`NewElevenLabsClient`: Creates a configured `http.Client` for low-latency communication with ElevenLabs (TCP_NODELAY, HTTP/2).
`PrettyJSON`: A helper for indenting JSON for logging.
Manages Firebase Storage interactions for audio recording persistence with advanced metadata and duration calculation:
- `InitializeFirebase`: Sets up the Firebase app and storage client using the service account credentials
- `UploadAudioToFirebase`: Uploads audio data to Firebase Storage with comprehensive metadata including calculated duration
- `CalculateAudioDuration`: Main function that determines audio duration based on content type and format detection
- Format-Specific Duration Calculators:
  - `calculateWAVDuration`: Parses WAV RIFF headers and data chunks for precise duration calculation
  - `calculateMP3Duration`: Advanced MP3 frame parsing with MPEG version detection and bitrate analysis
  - `calculateFLACDuration`: FLAC STREAMINFO block parsing for sample rate and total samples
  - `detectAndCalculateDuration`: Smart format detection using file signatures when content type is unclear
- Metadata Management Functions:
  - `GetAudioMetadata`: Retrieves comprehensive metadata for uploaded audio files
  - `GetAudioDuration`: Convenience function to get duration from stored metadata
  - `ListAudioFilesWithDuration`: Lists all audio files for a session with duration and metadata
- `CreateWavFromPCM`: Converts raw PCM audio data to WAV format by adding the appropriate header
- Helper functions for determining content types, file extensions, and supported audio formats
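What a helper like `CreateWavFromPCM` has to do is prepend a standard 44-byte RIFF/WAVE header describing the raw PCM payload. This sketch follows the standard WAV layout; the project's actual signature and error handling may differ:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// createWavFromPCM wraps raw PCM samples in a minimal RIFF/WAVE container.
func createWavFromPCM(pcm []byte, sampleRate, channels, bitsPerSample int) []byte {
	var buf bytes.Buffer
	byteRate := sampleRate * channels * bitsPerSample / 8
	blockAlign := channels * bitsPerSample / 8

	buf.WriteString("RIFF")
	binary.Write(&buf, binary.LittleEndian, uint32(36+len(pcm))) // file size minus 8
	buf.WriteString("WAVE")

	buf.WriteString("fmt ")
	binary.Write(&buf, binary.LittleEndian, uint32(16)) // PCM fmt chunk size
	binary.Write(&buf, binary.LittleEndian, uint16(1))  // audio format: linear PCM
	binary.Write(&buf, binary.LittleEndian, uint16(channels))
	binary.Write(&buf, binary.LittleEndian, uint32(sampleRate))
	binary.Write(&buf, binary.LittleEndian, uint32(byteRate))
	binary.Write(&buf, binary.LittleEndian, uint16(blockAlign))
	binary.Write(&buf, binary.LittleEndian, uint16(bitsPerSample))

	buf.WriteString("data")
	binary.Write(&buf, binary.LittleEndian, uint32(len(pcm)))
	buf.Write(pcm)
	return buf.Bytes()
}

func main() {
	wav := createWavFromPCM(make([]byte, 32000), 16000, 1, 16)
	fmt.Println("WAV size:", len(wav)) // 44-byte header + 32000 bytes of PCM
}
```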
- Client-Side Resampling: The `index.html` client has basic audio resampling. For production, if the browser's audio capture rate doesn't match Deepgram's expected rate (e.g., 16kHz), a more robust client-side resampling library would improve transcription accuracy.
- AudioWorklets: For client-side audio processing, `AudioWorklet` is more modern and performant than `ScriptProcessorNode` and should be considered for production applications.
- Transcription Accuracy & Model Choice: Experiment with different Deepgram models (`nova-2`, `nova-3`, etc.) and settings (e.g., `interim_results`, `endpointing`) for optimal transcription.
- More Sophisticated Sentence Tokenization: The current LLM response sentence splitting is basic.
- Buffer Management (MSE Client): The example MSE client has basic buffer handling for playback.
- Authentication/Authorization: Implement proper authentication for clients.
- Rate Limiting: Protect backend services (Deepgram, LLM, TTS) by implementing rate limiting.
- Database Enhancements:
- Implement a connection retry mechanism for database operations
- Add a caching layer to reduce database load for frequently accessed conversations
- Support for database sharding for high-volume deployments
- Create advanced analytics tools for conversation and feedback data analysis
- Implement database schema migrations for version control
- Scalability: For high-volume traffic, consider load balancing and horizontal scaling.
- Detailed Metrics & Observability: Integrate metrics for latency at each S2S stage.
- Configuration Management: Use a robust configuration system for production.
- Backpressure: Implement more sophisticated backpressure mechanisms if needed.
- Noise Reduction/Echo Cancellation: Explore advanced options if client-side audio quality is an issue. The current `index.html` requests browser-based echo cancellation and noise suppression.
The application provides a metrics endpoint and status dashboard for monitoring the system's health and performance:
The conversation memory feature maintains context throughout a WebSocket session, allowing the AI to respond more coherently to multi-turn conversations:
- Per-Session Memory: Each WebSocket connection maintains its own conversation history.
- Message Structure: Conversation is stored as a sequence of role-based messages:
- `systemMessage`: The initial system prompt that defines the AI's behavior
- `userMessage`: Transcribed speech from the client
- `assistantMessage`: Responses generated by the LLM
- Implementation Details:
- Messages are stored in the `clientState` struct for each connection
- The conversation history begins with the system prompt when a client connects
- User messages are added to history as soon as transcription is complete
- Assistant responses are captured in full and added to history after processing
- The entire conversation context is sent to the LLM with each new user message
- Thread Safety: Proper locking mechanisms ensure thread-safe updates to conversation history
- Efficient Processing: The design maintains the original real-time processing flow while adding memory capabilities in parallel
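The thread-safe, role-based history described above can be sketched as follows. This is an illustrative sketch, not the project's `clientState` implementation; `chatMessage`, `conversation`, and the method names are hypothetical:

```go
package main

import (
	"fmt"
	"sync"
)

// chatMessage is one turn of the conversation.
type chatMessage struct {
	Role    string // "system", "user", or "assistant"
	Content string
}

// conversation guards the history with a mutex so the transcription and LLM
// goroutines can append concurrently.
type conversation struct {
	mu      sync.Mutex
	history []chatMessage
}

// newConversation seeds the history with the system prompt, mirroring what
// happens when a client connects.
func newConversation(systemPrompt string) *conversation {
	return &conversation{history: []chatMessage{{Role: "system", Content: systemPrompt}}}
}

// Append adds a message under the lock.
func (c *conversation) Append(role, content string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.history = append(c.history, chatMessage{Role: role, Content: content})
}

// Snapshot copies the history; the copy is what would be sent to the LLM
// with each new user message.
func (c *conversation) Snapshot() []chatMessage {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make([]chatMessage, len(c.history))
	copy(out, c.history)
	return out
}

func main() {
	conv := newConversation("You are a helpful assistant.")
	conv.Append("user", "Hello!")
	conv.Append("assistant", "Hi, how can I help?")
	fmt.Println(len(conv.Snapshot()), "messages in context")
}
```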
The /api/metrics endpoint returns a JSON object with the following information:
- Connection statistics (active and total connections)
- System resource usage (CPU, memory)
- Go runtime metrics (goroutines, memory allocation)
- Server uptime and status
Example request:
    curl http://localhost:8080/api/metrics

Example response:
    {
      "timestamp": "2023-07-01T12:34:56Z",
      "active_connections": 2,
      "total_connections": 15,
      "goroutines": 12,
      "allocated_memory_bytes": 2097152,
      "total_allocated_memory_bytes": 10485760,
      "system_memory_usage_percent": 45.7,
      "heap_objects": 8724,
      "cpu_usage_percent": 2.5,
      "system_cpu_usage_percent": 32.1,
      "process_memory_bytes": 15728640,
      "uptime_seconds": 3600,
      "last_collection_time": 1688216096
    }

The /status endpoint provides a simple web interface for viewing system metrics in a user-friendly format. It automatically refreshes every 5 seconds and displays:
- Current active connections and total connections since startup
- CPU and memory usage (system and process)
- Go runtime metrics (goroutines, memory allocation)
- Server uptime
This dashboard is useful for quick visual monitoring of the system's health without using additional monitoring tools.