๐๏ธ TTS Multi-Mode Interface
This interface provides four different modes for text-to-speech and audio processing:
- Mode 1: Text + Features โ Audio (with predefined examples)
- Mode 2: Text โ Features + Audio
- Mode 3: Audio โ Text Features
- Mode 4: Text + Instruction โ Features (using OpenRouter Gemini)
Mode 1: Text + Features to Audio
Input text along with prosodic features to generate speech audio. Use the example buttons below to load predefined test cases.
๐ Predefined Examples
Mode 2: Text to Features + Audio
Input only text to generate both prosodic features and speech audio. The model will automatically generate appropriate features internally.
Mode 3: Audio to Text Features
Upload an audio file to extract transcribed text and word-level features. The system will perform speech recognition and feature extraction.
Mode 4: Text + Instruction to Features
Generate prosodic features from text and emotional/stylistic instructions using OpenRouter Gemini API.
โ ๏ธ Note about Prompt Templates:
- Template 1: Standard template for reliable feature generation
- Template 2: Experimental template that may be more expressive but could generate additional words not in the original text
๐ Usage Notes:
- Mode 1: Best for precise control over prosodic features
- Mode 2: Best for quick text-to-speech with automatic feature generation
- Mode 3: Best for analyzing existing audio files
- Mode 4: Best for generating features with specific emotional characteristics
๐ง Technical Requirements:
- CUDA-compatible GPU recommended for optimal performance
- Sufficient GPU memory for model loading
- Valid OpenRouter API key for Mode 4