HeroVoice
Voice-to-markdown transcription server with real-time speech recognition, AI-powered text transformation, and live preview.
Features
- Real-time voice transcription - Stream audio from browser to server via WebSocket
- Voice Activity Detection - Silero VAD V5 neural network detects speech/silence transitions
- Automatic segmentation - Transcribes on natural pauses (350ms silence threshold)
- AI transcription - Uses Groq WhisperLargeV3Turbo with automatic failover
- Live markdown preview - Split-screen editor with real-time rendered HTML
- Text transformations - 14 built-in AI transformation styles:
  - spellcheck - Grammar and spelling correction
  - specs - Technical specifications
  - code - Software architecture documentation
  - docs - User-friendly documentation
  - legal - Legal document formatting
  - story - Creative narrative
  - summary - Bullet-point summary
  - technical - Technical documentation
  - business - Business analysis
  - meeting - Meeting minutes
  - email - Professional email
  - Language translations: Dutch, French, Arabic
- Topic organization - Hierarchical folder structure for transcriptions
- Audio archival - Saves recordings as WAV and compressed OGG
Requirements
- Rust 1.92+
- Groq API key (required for transcription)
- Modern browser with Web Audio API and microphone support
Configuration
# Required
export GROQ_API_KEY=your-groq-api-key
# Optional fallback providers
export OPENROUTER_API_KEY=your-openrouter-key
export SAMBANOVA_API_KEY=your-sambanova-key
# Server configuration (optional)
export HOST=0.0.0.0 # Bind address (default: 0.0.0.0)
export PORT=2756 # Listen port (default: 2756)
export RUST_LOG=hero_voice=info # Log level
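As a rough illustration of how the optional HOST and PORT variables combine with their documented defaults, the following Rust sketch resolves a bind address. The function names `resolve_bind_addr` and `bind_addr_from_env` are hypothetical, and falling back to the default on an unparsable port is an assumption; HeroVoice's actual startup code may behave differently.

```rust
use std::env;

// Defaults as documented in the configuration comments above.
const DEFAULT_HOST: &str = "0.0.0.0";
const DEFAULT_PORT: u16 = 2756;

/// Hypothetical helper: combine optional HOST and PORT values into "host:port".
/// Assumption: a missing or empty host, or a port that is not a valid u16,
/// falls back to the documented default.
pub fn resolve_bind_addr(host: Option<&str>, port: Option<&str>) -> String {
    let host = host.filter(|h| !h.is_empty()).unwrap_or(DEFAULT_HOST);
    let port = port
        .and_then(|p| p.trim().parse::<u16>().ok())
        .unwrap_or(DEFAULT_PORT);
    format!("{host}:{port}")
}

/// Reads HOST and PORT from the process environment.
pub fn bind_addr_from_env() -> String {
    let host = env::var("HOST").ok();
    let port = env::var("PORT").ok();
    resolve_bind_addr(host.as_deref(), port.as_deref())
}
```

With neither variable set, `bind_addr_from_env()` yields the default address, which matches the URL given under Usage.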
Usage
make run
Open http://localhost:2756 in your browser.
Recording
- Click the microphone button or press Ctrl+Shift+R to start recording
- Speak naturally - transcription happens automatically on pauses
- Click stop or press the shortcut again to end recording
- Use Ctrl+Shift+C to copy the markdown content
Topics
- Create folders and topics in the sidebar tree
- Each topic stores its content, audio recordings, and transforms
- Right-click for rename, move, and delete options
Transformations
- Select a transformation style from the dropdown
- Click "Transform" to apply AI formatting
- Transforms are saved per-topic
Architecture
HeroVoice uses an OSchema-generated OpenRPC server with a custom WebSocket route for real-time audio streaming.
Browser (WebSocket)
    │
    ▼
Axum Server (AxumRpcServer)
    ├── JSON-RPC endpoint (/api/root/voice/rpc)
    │   └── OSchema-generated CRUD + VoiceService methods
    │
    ├── WebSocket Handler (/ws)
    │   ├── AudioRecorder (WAV file saving)
    │   └── AudioProcessor (VAD-based segmentation)
    │
    ├── Transcriber (herolib-ai)
    │   └── Groq → OpenRouter → SambaNova (failover)
    │
    └── TextTransformer (LLM transformations)
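The Groq → OpenRouter → SambaNova failover in the diagram can be pictured as trying providers in order until one succeeds. The sketch below is a generic illustration of that pattern only: the `Provider` alias and `transcribe_with_failover` are hypothetical names, and nothing here reflects the real herolib-ai API.

```rust
/// A provider is a name plus a function that attempts a transcription.
/// Both are stand-ins for illustration; they are not the herolib-ai API.
pub type Provider<'a> = (&'a str, &'a dyn Fn(&[u8]) -> Result<String, String>);

/// Try each provider in order and return the first success together with
/// the name of the provider that produced it. If every provider fails,
/// return all error messages in the order the providers were tried.
pub fn transcribe_with_failover(
    providers: &[Provider<'_>],
    audio: &[u8],
) -> Result<(String, String), Vec<String>> {
    let mut errors = Vec::new();
    for (name, call) in providers {
        match call(audio) {
            Ok(text) => return Ok((name.to_string(), text)),
            Err(e) => errors.push(format!("{name}: {e}")),
        }
    }
    Err(errors)
}
```

Collecting the per-provider errors, rather than keeping only the last one, makes it easier to see why every fallback was exhausted.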
API
JSON-RPC Endpoint
All data operations use JSON-RPC 2.0 at POST /api/root/voice/rpc.
Auto-generated CRUD (Topic and Folder root objects):
- topic.new, topic.get, topic.set, topic.delete, topic.list
- folder.new, folder.get, folder.set, folder.delete, folder.list
Custom service methods (VoiceService):
- voiceservice.create_topic / voiceservice.create_folder
- voiceservice.rename_topic / voiceservice.rename_folder
- voiceservice.move_topic / voiceservice.move_folder
- voiceservice.delete_topic / voiceservice.delete_folder
- voiceservice.save_content / voiceservice.transform_content
- voiceservice.register_audio / voiceservice.delete_audio
- voiceservice.reset_topic / voiceservice.get_audio_path
Inspector: GET /api/root/voice/inspector
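For readers unfamiliar with JSON-RPC 2.0, a request is a JSON object with `jsonrpc`, `id`, `method`, and `params` fields. The sketch below builds such an envelope as a string using only the standard library. The helper name `jsonrpc_request` is hypothetical, and the empty `{}` params in the usage note is a placeholder: the parameter shapes each HeroVoice method expects are not documented here, so consult the inspector endpoint for those.

```rust
/// Build a JSON-RPC 2.0 request envelope as a string.
/// Assumptions: `method` contains no characters that need JSON escaping,
/// and `params_json` is already valid JSON (an object or an array).
pub fn jsonrpc_request(id: u64, method: &str, params_json: &str) -> String {
    format!(
        r#"{{"jsonrpc":"2.0","id":{id},"method":"{method}","params":{params_json}}}"#
    )
}
```

For example, `jsonrpc_request(1, "topic.list", "{}")` produces a body that could be sent with POST to /api/root/voice/rpc, presumably with a JSON content type.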
WebSocket
GET /ws - Audio streaming endpoint
Client to Server:
{ "type": "start", "topic": "optional-topic-sid", "audio_dir": "optional-dir" }
{ "type": "stop" }
In addition to these JSON messages, the client sends binary audio data (16-bit PCM, 16kHz, mono, little-endian).
Server to Client:
{ "type": "transcription", "text": "...", "is_final": true }
{ "type": "status", "message": "..." }
{ "type": "error", "message": "..." }
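To make the binary audio format concrete, here is a minimal sketch of decoding 16-bit little-endian mono PCM into floating-point samples. The function names are hypothetical, and whether the server converts to `f32` at this stage (or ignores an odd trailing byte) is an assumption, not something the protocol description states.

```rust
/// Decode 16-bit little-endian mono PCM bytes into f32 samples in [-1.0, 1.0).
/// Assumption: a trailing odd byte (an incomplete sample) is ignored.
pub fn decode_pcm16le(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(2)
        .map(|pair| i16::from_le_bytes([pair[0], pair[1]]) as f32 / 32768.0)
        .collect()
}

/// Duration in whole milliseconds of a buffer of mono samples at 16 kHz.
pub fn duration_ms(sample_count: usize) -> u64 {
    (sample_count as u64 * 1000) / 16_000
}
```

At 16 kHz mono with two bytes per sample, one second of audio is 32,000 bytes, and a 512-sample chunk lasts 32 ms.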
Static Files
- GET /files/audio/{filename} - Audio file downloads
- GET /files/transforms/{filename} - Transform file downloads
Project Structure
hero_voice/
├── schemas/voice/voice.oschema           # Domain schema (source of truth)
├── build.rs                              # OSchema code generation
├── src/
│   ├── main.rs                           # Server setup (AxumRpcServer + WebSocket)
│   ├── lib.rs                            # Library root
│   ├── ws.rs                             # WebSocket audio streaming handler
│   ├── audio.rs                          # Voice Activity Detection (Silero VAD V5)
│   ├── convert.rs                        # WAV to OGG conversion
│   ├── transcriber.rs                    # AI transcription + text transformation
│   └── voice/                            # OSchema-generated domain
│       ├── core/types_generated.rs       # Generated types (DO NOT EDIT)
│       └── server/
│           ├── osis_server_generated.rs  # Generated server (DO NOT EDIT)
│           ├── rpc_generated.rs          # Generated trait (DO NOT EDIT)
│           └── rpc.rs                    # Business logic implementation
├── static/
│   ├── index.html                        # Single-page application
│   └── app.js                            # Frontend (JSON-RPC client)
├── data/                                 # Runtime data (OTOML storage, audio, transforms)
├── Cargo.toml
├── Makefile
└── buildenv.sh
Audio Processing
- Sample rate: 16kHz (required for Silero VAD)
- Chunk size: 512 samples for VAD analysis
- Silence threshold: 350ms triggers transcription
- Speech threshold: 0.20 probability
- Maximum buffer: 30 seconds before forced transcription
- Compression: OGG Vorbis at quality 0.4 (~10% of WAV size)
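The parameters above imply some simple arithmetic: a 512-sample chunk at 16 kHz lasts 32 ms, so a 350 ms pause corresponds to 11 consecutive silent chunks (352 ms), and the 30-second cap is reached after 938 chunks. The sketch below shows one way such a segmentation state machine could look. It is not the project's `AudioProcessor`: the `Segmenter` type is hypothetical, the speech probabilities are supplied by the caller rather than computed by Silero VAD, and details such as treating a probability equal to the threshold as speech or skipping leading silence are assumptions.

```rust
const SAMPLE_RATE: usize = 16_000;
const CHUNK_SAMPLES: usize = 512;
const SPEECH_THRESHOLD: f32 = 0.20;
const SILENCE_MS: usize = 350;
const MAX_BUFFER_SECS: usize = 30;

/// Hypothetical segmenter: tracks buffered audio and trailing silence,
/// one VAD chunk at a time.
#[derive(Default)]
pub struct Segmenter {
    buffered_samples: usize, // samples buffered since speech began
    silent_samples: usize,   // consecutive samples judged silent
    in_speech: bool,         // true once speech has been seen in this segment
}

impl Segmenter {
    /// Feed the speech probability for one 512-sample chunk.
    /// Returns true when the buffered segment should be transcribed.
    pub fn push(&mut self, speech_prob: f32) -> bool {
        if speech_prob >= SPEECH_THRESHOLD {
            self.in_speech = true;
            self.silent_samples = 0;
        } else if self.in_speech {
            self.silent_samples += CHUNK_SAMPLES;
        } else {
            return false; // assumption: leading silence is not buffered
        }
        self.buffered_samples += CHUNK_SAMPLES;

        // Pause detected once trailing silence reaches 350 ms.
        let pause = self.silent_samples * 1000 >= SILENCE_MS * SAMPLE_RATE;
        // Forced flush once 30 seconds of audio are buffered.
        let full = self.buffered_samples >= MAX_BUFFER_SECS * SAMPLE_RATE;
        if pause || full {
            *self = Segmenter::default(); // start a fresh segment
            return true;
        }
        false
    }
}
```

Counting in samples rather than chunks keeps the thresholds exact even though neither 350 ms nor 30 seconds is a whole number of 512-sample chunks.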
Browser Support
- Chrome 120+
- Firefox 120+
- Safari 17+
- Edge 120+
Requires microphone permission and WebSocket support.
Embedding & CORS
HeroVoice allows iframe embedding (no X-Frame-Options restrictions), cross-origin API calls, and WebSocket connections from any origin.
License
Apache-2.0