supertonic 3
@@ -1,13 +1,11 @@
|
|||||||
assets/*
|
assets/
|
||||||
assets/.git
|
|
||||||
assets/.gitignore
|
|
||||||
assets/.gitattributes
|
|
||||||
|
|
||||||
*.onnx
|
*.onnx
|
||||||
onnx
|
onnx
|
||||||
|
|
||||||
# Output files
|
# Output files
|
||||||
results
|
results
|
||||||
|
results_v3_sdk/
|
||||||
|
|
||||||
# Python
|
# Python
|
||||||
__pycache__
|
__pycache__
|
||||||
@@ -21,6 +19,11 @@ __pycache__
|
|||||||
venv/
|
venv/
|
||||||
ENV/
|
ENV/
|
||||||
env/
|
env/
|
||||||
|
.dotnet/
|
||||||
|
.m2/
|
||||||
|
.uv-cache/
|
||||||
|
.clang-module-cache/
|
||||||
|
.swift-home/
|
||||||
|
|
||||||
# Node.js
|
# Node.js
|
||||||
node_modules/
|
node_modules/
|
||||||
|
|||||||
@@ -1,93 +1,51 @@
|
|||||||
# Supertonic — Lightning Fast, On-Device TTS
|
# Supertonic — Lightning Fast, On-Device, Accurate TTS
|
||||||
|
|
||||||
[](https://huggingface.co/spaces/Supertone/supertonic-2)
|
[](https://huggingface.co/spaces/Supertone/supertonic-3)
|
||||||
[](https://huggingface.co/Supertone/supertonic-2)
|
[](https://huggingface.co/Supertone/supertonic-3)
|
||||||
|
[](https://github.com/supertone-inc/supertonic/tree/release/supertonic-2)
|
||||||
[Demo](https://huggingface.co/spaces/Supertone/supertonic#interactive-demo)
|
[Demo](https://huggingface.co/spaces/Supertone/supertonic#interactive-demo)
|
||||||
[Models](https://huggingface.co/Supertone/supertonic)
|
[Models](https://huggingface.co/Supertone/supertonic)
|
||||||
|
|
||||||
<p align="center">
|
<p align="center">
|
||||||
<img src="img/supertonic_preview_0.1.jpg" alt="Supertonic Banner">
|
<img src="img/Supertonic3_HeroImage.png" alt="Supertonic 3 Banner">
|
||||||
</p>
|
</p>
|
||||||
|
|
||||||
**Supertonic** is a lightning-fast, on-device text-to-speech system designed for **extreme performance** with minimal computational overhead. Powered by ONNX Runtime, it runs entirely on your device—no cloud, no API calls, no privacy concerns.
|
**Supertonic** is a lightning-fast, on-device text-to-speech system designed for local inference with minimal overhead. Powered by ONNX Runtime, it runs entirely on your device—no cloud, no API calls, no privacy concerns.
|
||||||
|
|
||||||
### 📰 Update News
|
### 📰 Update News
|
||||||
|
|
||||||
|
- **2026.04.29** - 🎉 **Supertonic 3** released with **31-language support**, improved reading accuracy, fewer repeat/skip failures, and v2-compatible public ONNX assets. [Demo](https://huggingface.co/spaces/Supertone/supertonic-3) | [Models](https://huggingface.co/Supertone/supertonic-3)
|
||||||
- **2026.01.22** - **[Voice Builder](https://supertonic.supertone.ai/voice_builder)** is now live! Turn your voice into a deployable, edge-native TTS with permanent ownership.
|
- **2026.01.22** - **[Voice Builder](https://supertonic.supertone.ai/voice_builder)** is now live! Turn your voice into a deployable, edge-native TTS with permanent ownership.
|
||||||
|
- **2026.01.06** - 🎉 **Supertonic 2** released with 5-language support. The v2 code path is preserved on the [`release/supertonic-2`](https://github.com/supertone-inc/supertonic/tree/release/supertonic-2) branch.
|
||||||
<p align="center">
|
|
||||||
<img src="img/voicebuilder_img.png" alt="Voice Builder" width="600">
|
|
||||||
</p>
|
|
||||||
|
|
||||||
- **2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
|
|
||||||
- **2025.12.10** - Added `supertonic` PyPI package! Install via `pip install supertonic`. For details, visit [supertonic-py documentation](https://supertone-inc.github.io/supertonic-py)
|
- **2025.12.10** - Added `supertonic` PyPI package! Install via `pip install supertonic`. For details, visit [supertonic-py documentation](https://supertone-inc.github.io/supertonic-py)
|
||||||
- **2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
|
- **2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
|
||||||
- **2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic)
|
- **2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic)
|
||||||
- **2025.11.24** - Added Flutter SDK support with macOS compatibility
|
- **2025.11.24** - Added Flutter SDK support with macOS compatibility
|
||||||
|
|
||||||
### Table of Contents
|
## Quick Start
|
||||||
|
|
||||||
- [Demo](#demo)
|
Install the Python SDK and generate speech immediately. On the first run, Supertonic downloads the model assets from Hugging Face automatically.
|
||||||
- [Why Supertonic?](#why-supertonic)
|
|
||||||
- [Language Support](#language-support)
|
|
||||||
- [Getting Started](#getting-started)
|
|
||||||
- [Performance](#performance)
|
|
||||||
- [Built with Supertonic](#built-with-supertonic)
|
|
||||||
- [Citation](#citation)
|
|
||||||
- [License](#license)
|
|
||||||
|
|
||||||
## Demo
|
```bash
|
||||||
|
pip install supertonic
|
||||||
|
```
|
||||||
|
|
||||||
### Raspberry Pi
|
### Python
|
||||||
|
|
||||||
Watch Supertonic running on a **Raspberry Pi**, demonstrating on-device, real-time text-to-speech synthesis:
|
```python
|
||||||
|
from supertonic import TTS
|
||||||
|
|
||||||
https://github.com/user-attachments/assets/ea66f6d6-7bc5-4308-8a88-1ce3e07400d2
|
# First run downloads the model from Hugging Face automatically.
|
||||||
|
tts = TTS(auto_download=True)
|
||||||
|
|
||||||
### E-Reader
|
style = tts.get_voice_style(voice_name="M1")
|
||||||
|
|
||||||
Experience Supertonic on an **Onyx Boox Go 6** e-reader in airplane mode, achieving an average RTF of 0.3× with zero network dependency:
|
text = "A gentle breeze moved through the open window while everyone listened to the story."
|
||||||
|
wav, duration = tts.synthesize(text, voice_style=style, lang="en")
|
||||||
|
|
||||||
https://github.com/user-attachments/assets/64980e58-ad91-423a-9623-78c2ffc13680
|
tts.save_audio(wav, "output.wav")
|
||||||
|
print(f"Generated {duration:.2f}s of audio")
|
||||||
### Chrome Extension
|
```
|
||||||
|
|
||||||
Turns any webpage into audio in under one second, delivering lightning-fast, on-device text-to-speech with zero network dependency—free, private, and effortless:
|
|
||||||
|
|
||||||
https://github.com/user-attachments/assets/cc8a45fc-5c3e-4b2c-8439-a14c3d00d91c
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
> 🎧 **Try it now**: Experience Supertonic in your browser with our [**Interactive Demo**](https://huggingface.co/spaces/Supertone/supertonic-2), or get started with pre-trained models from [**Hugging Face Hub**](https://huggingface.co/Supertone/supertonic-2)
|
|
||||||
|
|
||||||
## Why Supertonic?
|
|
||||||
|
|
||||||
- **⚡ Blazingly Fast**: Generates speech up to **167× faster than real-time** on consumer hardware (M4 Pro)—unmatched by any other TTS system
|
|
||||||
- **🪶 Ultra Lightweight**: Only **66M parameters**, optimized for efficient on-device performance with minimal footprint
|
|
||||||
- **📱 On-Device Capable**: **Complete privacy** and **zero latency**—all processing happens locally on your device
|
|
||||||
- **🎨 Natural Text Handling**: Seamlessly processes numbers, dates, currency, abbreviations, and complex expressions without pre-processing
|
|
||||||
- **⚙️ Highly Configurable**: Adjust inference steps, batch processing, and other parameters to match your specific needs
|
|
||||||
- **🧩 Flexible Deployment**: Deploy seamlessly across servers, browsers, and edge devices with multiple runtime backends.
|
|
||||||
|
|
||||||
## Language Support
|
|
||||||
|
|
||||||
We provide ready-to-use TTS inference examples across multiple ecosystems:
|
|
||||||
|
|
||||||
| Language/Platform | Path | Description |
|
|
||||||
|-------------------|------|-------------|
|
|
||||||
| [**Python**](py/) | `py/` | ONNX Runtime inference |
|
|
||||||
| [**Node.js**](nodejs/) | `nodejs/` | Server-side JavaScript |
|
|
||||||
| [**Browser**](web/) | `web/` | WebGPU/WASM inference |
|
|
||||||
| [**Java**](java/) | `java/` | Cross-platform JVM |
|
|
||||||
| [**C++**](cpp/) | `cpp/` | High-performance C++ |
|
|
||||||
| [**C#**](csharp/) | `csharp/` | .NET ecosystem |
|
|
||||||
| [**Go**](go/) | `go/` | Go implementation |
|
|
||||||
| [**Swift**](swift/) | `swift/` | macOS applications |
|
|
||||||
| [**iOS**](ios/) | `ios/` | Native iOS apps |
|
|
||||||
| [**Rust**](rust/) | `rust/` | Memory-safe systems |
|
|
||||||
| [**Flutter**](flutter/) | `flutter/` | Cross-platform apps |
|
|
||||||
|
|
||||||
> For detailed usage instructions, please refer to the README.md in each language directory.
|
|
||||||
|
|
||||||
## Getting Started
|
## Getting Started
|
||||||
|
|
||||||
@@ -107,18 +65,30 @@ Before running the examples, download the ONNX models and preset voices, and pla
|
|||||||
> - Generic: see `https://git-lfs.com` for installers
|
> - Generic: see `https://git-lfs.com` for installers
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
git clone https://huggingface.co/Supertone/supertonic-2 assets
|
git lfs install
|
||||||
|
git clone https://huggingface.co/Supertone/supertonic-3 assets
|
||||||
```
|
```
|
||||||
|
|
||||||
### Quick Start
|
Some language examples need native runtimes:
|
||||||
|
- **Go**: install the ONNX Runtime C library. On macOS, `brew install onnxruntime` is enough; the Go example auto-detects Homebrew paths.
|
||||||
|
- **Java**: use a JDK, not just a JRE. On macOS, `brew install openjdk@17` works.
|
||||||
|
- **C#**: targets .NET 9 and allows major-version roll-forward, so .NET 9 or newer runtimes can run it.
|
||||||
|
|
||||||
|
Then run the Python example:
|
||||||
|
|
||||||
**Python Example** ([Details](py/))
|
|
||||||
```bash
|
```bash
|
||||||
cd py
|
cd py
|
||||||
uv sync
|
uv sync
|
||||||
uv run example_onnx.py
|
uv run example_onnx.py
|
||||||
```
|
```
|
||||||
|
|
||||||
|
This generates `outputs/output.wav` using the default preset voice.
|
||||||
|
|
||||||
|
### Other Runtime Examples
|
||||||
|
|
||||||
|
<details>
|
||||||
|
<summary><b>Run Supertonic in other languages and platforms</b></summary>
|
||||||
|
|
||||||
**Node.js Example** ([Details](nodejs/))
|
**Node.js Example** ([Details](nodejs/))
|
||||||
```bash
|
```bash
|
||||||
cd nodejs
|
cd nodejs
|
||||||
@@ -182,95 +152,130 @@ cd ios/ExampleiOSApp
|
|||||||
xcodegen generate
|
xcodegen generate
|
||||||
open ExampleiOSApp.xcodeproj
|
open ExampleiOSApp.xcodeproj
|
||||||
```
|
```
|
||||||
- In Xcode: Targets → ExampleiOSApp → Signing: select your Team
|
|
||||||
- Choose your iPhone as run destination → Build & Run
|
In Xcode: Targets → ExampleiOSApp → Signing: select your Team, then choose your iPhone as run destination and build.
|
||||||
|
|
||||||
|
</details>
|
||||||
|
|
||||||
|
|
||||||
### Technical Details
|
### Technical Details
|
||||||
|
|
||||||
- **Runtime**: ONNX Runtime for cross-platform inference (CPU-optimized; GPU mode is not tested)
|
- **Runtime**: ONNX Runtime for cross-platform inference
|
||||||
- **Browser Support**: onnxruntime-web for client-side inference
|
- **Browser Support**: onnxruntime-web for client-side inference
|
||||||
- **Batch Processing**: Supports batch inference for improved throughput
|
- **Batch Processing**: Supports batch inference for improved throughput
|
||||||
- **Audio Output**: Outputs 16-bit WAV files
|
- **Audio Output**: Outputs 16-bit WAV files
|
||||||
|
|
||||||
## Performance
|
## Performance Highlights
|
||||||
|
|
||||||
We evaluated Supertonic's performance (with 2 inference steps) using two key metrics across input texts of varying lengths: Short (59 chars), Mid (152 chars), and Long (266 chars).
|
Supertonic 3 is designed for practical on-device inference: compact enough to run locally, while staying competitive with much larger open TTS systems.
|
||||||
|
|
||||||
**Metrics:**
|
### Reading Accuracy
|
||||||
- **Characters per Second**: Measures throughput by dividing the number of input characters by the time required to generate audio. Higher is better.
|
|
||||||
- **Real-time Factor (RTF)**: Measures the time taken to synthesize audio relative to its duration. Lower is better (e.g., RTF of 0.1 means it takes 0.1 seconds to generate one second of audio).
|
|
||||||
|
|
||||||
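The two metrics defined above reduce to simple ratios. The following is a minimal sketch of how they could be computed; the helper names and the timing numbers are hypothetical and purely illustrative, not measurements from this project.

```python
def chars_per_second(num_chars: int, synthesis_seconds: float) -> float:
    """Throughput: input characters divided by the time spent synthesizing."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return num_chars / synthesis_seconds


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF: synthesis time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Illustrative values only: 152 input characters, 0.5 s of compute,
# 5.0 s of generated audio.
cps = chars_per_second(152, 0.5)
rtf = real_time_factor(0.5, 5.0)
print(f"{cps:.0f} chars/s, RTF {rtf:.3f}")
```

With these made-up inputs, an RTF of 0.1 means the system spends 0.1 seconds of compute per second of audio, which matches the definition given for the metric.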
### Characters per Second
|
<p align="center">
|
||||||
| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|
<img src="img/metrics/s3_vs_measured_wer_range_voxcpm2.png" alt="Supertonic 3 reading accuracy compared with measured model ranges and VoxCPM2">
|
||||||
|--------|-----------------|----------------|-----------------|
|
</p>
|
||||||
| **Supertonic** (M4 pro - CPU) | 912 | 1048 | 1263 |
|
|
||||||
| **Supertonic** (M4 pro - WebGPU) | 996 | 1801 | 2509 |
|
|
||||||
| **Supertonic** (RTX4090) | 2615 | 6548 | 12164 |
|
|
||||||
| `API` [ElevenLabs Flash v2.5](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) | 144 | 209 | 287 |
|
|
||||||
| `API` [OpenAI TTS-1](https://platform.openai.com/docs/guides/text-to-speech) | 37 | 55 | 82 |
|
|
||||||
| `API` [Gemini 2.5 Flash TTS](https://ai.google.dev/gemini-api/docs/speech-generation) | 12 | 18 | 24 |
|
|
||||||
| `API` [Supertone Sona speech 1](https://docs.supertoneapi.com/en/api-reference/endpoints/text-to-speech) | 38 | 64 | 92 |
|
|
||||||
| `Open` [Kokoro](https://github.com/hexgrad/kokoro/) | 104 | 107 | 117 |
|
|
||||||
| `Open` [NeuTTS Air](https://github.com/neuphonic/neutts-air) | 37 | 42 | 47 |
|
|
||||||
|
|
||||||
> **Notes:**
|
Across measured languages, Supertonic 3 stays within a competitive WER/CER range against much larger open TTS models such as VoxCPM2, while preserving a lightweight on-device deployment path. Asterisked languages use CER; the others use WER.
|
||||||
> `API` = Cloud-based API services (measured from Seoul)
|
|
||||||
> `Open` = Open-source models
|
|
||||||
> Supertonic (M4 pro - CPU) and (M4 pro - WebGPU): Tested with ONNX
|
|
||||||
> Supertonic (RTX4090): Tested with PyTorch model
|
|
||||||
> Kokoro: Tested on M4 Pro CPU with ONNX
|
|
||||||
> NeuTTS Air: Tested on M4 Pro CPU with Q8-GGUF
|
|
||||||
|
|
||||||
### Real-time Factor
|
### Supertonic 2 to Supertonic 3
|
||||||
|
|
||||||
| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|
<p align="center">
|
||||||
|--------|-----------------|----------------|-----------------|
|
<img src="img/metrics/supertonic2_vs_3_comparison.png" alt="Supertonic 2 and Supertonic 3 comparison">
|
||||||
| **Supertonic** (M4 pro - CPU) | 0.015 | 0.013 | 0.012 |
|
</p>
|
||||||
| **Supertonic** (M4 pro - WebGPU) | 0.014 | 0.007 | 0.006 |
|
|
||||||
| **Supertonic** (RTX4090) | 0.005 | 0.002 | 0.001 |
|
|
||||||
| `API` [ElevenLabs Flash v2.5](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) | 0.133 | 0.077 | 0.057 |
|
|
||||||
| `API` [OpenAI TTS-1](https://platform.openai.com/docs/guides/text-to-speech) | 0.471 | 0.302 | 0.201 |
|
|
||||||
| `API` [Gemini 2.5 Flash TTS](https://ai.google.dev/gemini-api/docs/speech-generation) | 1.060 | 0.673 | 0.541 |
|
|
||||||
| `API` [Supertone Sona speech 1](https://docs.supertoneapi.com/en/api-reference/endpoints/text-to-speech) | 0.372 | 0.206 | 0.163 |
|
|
||||||
| `Open` [Kokoro](https://github.com/hexgrad/kokoro/) | 0.144 | 0.124 | 0.126 |
|
|
||||||
| `Open` [NeuTTS Air](https://github.com/neuphonic/neutts-air) | 0.390 | 0.338 | 0.343 |
|
|
||||||
|
|
||||||
<details>
|
Compared with Supertonic 2, Supertonic 3 reduces repeat and skip failures, improves speaker similarity across the shared-language set, and expands language coverage from 5 to 31 languages. It keeps the v2-compatible public ONNX interface, so existing integrations can move to v3 with the same inference contract.
|
||||||
<summary><b>Additional Performance Data (5-step inference)</b></summary>
|
|
||||||
|
|
||||||
<br>
|
### Runtime Footprint
|
||||||
|
|
||||||
**Characters per Second (5-step)**
|
<p align="center">
|
||||||
|
<img src="img/metrics/runtime_cpu_gpu_latency_memory.png" alt="Supertonic CPU runtime compared with GPU baselines">
|
||||||
|
</p>
|
||||||
|
|
||||||
| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|
Supertonic 3 runs fast on CPU, even compared with larger baselines measured on A100 GPU, and uses substantially less memory. The open-weight fixed-voice setting does not require a GPU, which makes local, browser, and edge deployment much easier.
|
||||||
|--------|-----------------|----------------|-----------------|
|
|
||||||
| **Supertonic** (M4 pro - CPU) | 596 | 691 | 850 |
|
|
||||||
| **Supertonic** (M4 pro - WebGPU) | 570 | 1118 | 1546 |
|
|
||||||
| **Supertonic** (RTX4090) | 1286 | 3757 | 6242 |
|
|
||||||
|
|
||||||
**Real-time Factor (5-step)**
|
### Model Size
|
||||||
|
|
||||||
| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|
<p align="center">
|
||||||
|--------|-----------------|----------------|-----------------|
|
<img src="img/metrics/model_size_comparison.png" alt="Model size comparison">
|
||||||
| **Supertonic** (M4 pro - CPU) | 0.023 | 0.019 | 0.018 |
|
</p>
|
||||||
| **Supertonic** (M4 pro - WebGPU) | 0.024 | 0.012 | 0.010 |
|
|
||||||
| **Supertonic** (RTX4090) | 0.011 | 0.004 | 0.002 |
|
|
||||||
|
|
||||||
</details>
|
At about 99M parameters across the public ONNX assets, Supertonic 3 is much smaller than 0.7B to 2B class open TTS systems. The smaller model size is a practical advantage for download size, startup time, and on-device inference.
|
||||||
|
|
||||||
### Natural Text Handling
|
## Demo
|
||||||
|
|
||||||
Supertonic is designed to handle complex, real-world text inputs that contain numbers, currency symbols, abbreviations, dates, and proper nouns.
|
> **Try it now**: Experience Supertonic in your browser with our [**Interactive Demo**](https://huggingface.co/spaces/Supertone/supertonic-3), or get started with pre-trained models from [**Hugging Face Hub**](https://huggingface.co/Supertone/supertonic-3)
|
||||||
|
|
||||||
> 🎧 **View audio samples more easily**: Check out our [**Interactive Demo**](https://huggingface.co/spaces/Supertone/supertonic#text-handling) for a better viewing experience of all audio examples
|
### Raspberry Pi
|
||||||
|
|
||||||
|
Watch Supertonic running on a **Raspberry Pi**, demonstrating on-device, real-time text-to-speech synthesis:
|
||||||
|
|
||||||
|
https://github.com/user-attachments/assets/ea66f6d6-7bc5-4308-8a88-1ce3e07400d2
|
||||||
|
|
||||||
|
### E-Reader
|
||||||
|
|
||||||
|
Experience Supertonic on an **Onyx Boox Go 6** e-reader in airplane mode, achieving an average RTF of 0.3× with zero network dependency:
|
||||||
|
|
||||||
|
https://github.com/user-attachments/assets/64980e58-ad91-423a-9623-78c2ffc13680
|
||||||
|
|
||||||
|
### Chrome Extension
|
||||||
|
|
||||||
|
Turns any webpage into audio in under one second, delivering lightning-fast, on-device text-to-speech with zero network dependency—free, private, and effortless:
|
||||||
|
|
||||||
|
https://github.com/user-attachments/assets/cc8a45fc-5c3e-4b2c-8439-a14c3d00d91c
|
||||||
|
|
||||||
|
## Why Supertonic?
|
||||||
|
|
||||||
|
- **Blazingly Fast**: Optimized for low-latency, on-device speech generation across desktop, browser, and edge deployments
|
||||||
|
- **Lightweight**: Compact ONNX assets designed for efficient local execution
|
||||||
|
- **On-Device Capable**: Complete privacy and zero network dependency
|
||||||
|
- **Accurate Reading**: Improved reading stability with fewer repeat and skip failures
|
||||||
|
- **Expressive Tags**: Supports simple expression tags such as `<laugh>`, `<breath>`, and `<sigh>`
|
||||||
|
- **Flexible Deployment**: Ready-to-use examples across Python, JavaScript, browser, mobile, and native runtimes
|
||||||
|
|
||||||
|
## Language Support
|
||||||
|
|
||||||
|
Supertonic 3 supports 31 languages:
|
||||||
|
|
||||||
|
| Code | Language | Code | Language | Code | Language | Code | Language |
|
||||||
|
|------|----------|------|----------|------|----------|------|----------|
|
||||||
|
| `en` | English | `ko` | Korean | `ja` | Japanese | `ar` | Arabic |
|
||||||
|
| `bg` | Bulgarian | `cs` | Czech | `da` | Danish | `de` | German |
|
||||||
|
| `el` | Greek | `es` | Spanish | `et` | Estonian | `fi` | Finnish |
|
||||||
|
| `fr` | French | `hi` | Hindi | `hr` | Croatian | `hu` | Hungarian |
|
||||||
|
| `id` | Indonesian | `it` | Italian | `lt` | Lithuanian | `lv` | Latvian |
|
||||||
|
| `nl` | Dutch | `pl` | Polish | `pt` | Portuguese | `ro` | Romanian |
|
||||||
|
| `ru` | Russian | `sk` | Slovak | `sl` | Slovenian | `sv` | Swedish |
|
||||||
|
| `tr` | Turkish | `uk` | Ukrainian | `vi` | Vietnamese | | |
|
||||||
|
|
||||||
|
We provide ready-to-use TTS inference examples across multiple ecosystems:
|
||||||
|
|
||||||
|
| Language/Platform | Path | Description |
|
||||||
|
|-------------------|------|-------------|
|
||||||
|
| [**Python**](py/) | `py/` | ONNX Runtime inference |
|
||||||
|
| [**Node.js**](nodejs/) | `nodejs/` | Server-side JavaScript |
|
||||||
|
| [**Browser**](web/) | `web/` | WebGPU/WASM inference |
|
||||||
|
| [**Java**](java/) | `java/` | Cross-platform JVM |
|
||||||
|
| [**C++**](cpp/) | `cpp/` | High-performance C++ |
|
||||||
|
| [**C#**](csharp/) | `csharp/` | .NET ecosystem |
|
||||||
|
| [**Go**](go/) | `go/` | Go implementation |
|
||||||
|
| [**Swift**](swift/) | `swift/` | macOS applications |
|
||||||
|
| [**iOS**](ios/) | `ios/` | Native iOS apps |
|
||||||
|
| [**Rust**](rust/) | `rust/` | Memory-safe systems |
|
||||||
|
| [**Flutter**](flutter/) | `flutter/` | Cross-platform apps |
|
||||||
|
|
||||||
|
> For detailed usage instructions, please refer to the README.md in each language directory.
|
||||||
|
|
||||||
|
## Natural Text Handling
|
||||||
|
|
||||||
|
Supertonic is designed to handle complex, real-world text inputs that contain natural prose, punctuation, abbreviations, and proper nouns.
|
||||||
|
|
||||||
|
> 🎧 **View audio samples more easily**: Check out our [**Interactive Demo**](https://huggingface.co/spaces/Supertone/supertonic-3) for a better viewing experience of all audio examples
|
||||||
|
|
||||||
**Overview of Test Cases:**
|
**Overview of Test Cases:**
|
||||||
|
|
||||||
| Category | Key Challenges | Supertonic | ElevenLabs | OpenAI | Gemini | Microsoft |
|
| Category | Key Challenges | Supertonic | ElevenLabs | OpenAI | Gemini | Microsoft |
|
||||||
|:--------:|:--------------:|:----------:|:----------:|:------:|:------:|:---------:|
|
|:--------:|:--------------:|:----------:|:----------:|:------:|:------:|:---------:|
|
||||||
| Financial Expression | Decimal currency, abbreviated magnitudes (M, K), currency symbols, currency codes | ✅ | ❌ | ❌ | ❌ | ❌ |
|
| Financial Expression | Decimal currency, abbreviated magnitudes (M, K), currency symbols, currency codes | ✅ | ❌ | ❌ | ❌ | ❌ |
|
||||||
| Time and Date | Time notation, abbreviated weekdays/months, date formats | ✅ | ❌ | ❌ | ❌ | ❌ |
|
|
||||||
| Phone Number | Area codes, hyphens, extensions (ext.) | ✅ | ❌ | ❌ | ❌ | ❌ |
|
| Phone Number | Area codes, hyphens, extensions (ext.) | ✅ | ❌ | ❌ | ❌ | ❌ |
|
||||||
| Technical Unit | Decimal numbers with units, abbreviated technical notations | ✅ | ❌ | ❌ | ❌ | ❌ |
|
| Technical Unit | Decimal numbers with units, abbreviated technical notations | ✅ | ❌ | ❌ | ❌ | ❌ |
|
||||||
|
|
||||||
@@ -300,33 +305,7 @@ Supertonic is designed to handle complex, real-world text inputs that contain nu
|
|||||||
</details>
|
</details>
|
||||||
|
|
||||||
<details>
|
<details>
|
||||||
<summary><b>Example 2: Time and Date</b></summary>
|
<summary><b>Example 2: Phone Number</b></summary>
|
||||||
|
|
||||||
<br>
|
|
||||||
|
|
||||||
**Text:**
|
|
||||||
> "The train delay was announced at **4:45 PM** on **Wed, Apr 3, 2024** due to track maintenance."
|
|
||||||
|
|
||||||
**Challenges:**
|
|
||||||
- Time expression with PM notation (4:45 PM)
|
|
||||||
- Abbreviated weekday (Wed)
|
|
||||||
- Abbreviated month (Apr)
|
|
||||||
- Full date format (Apr 3, 2024)
|
|
||||||
|
|
||||||
**Audio Samples:**
|
|
||||||
|
|
||||||
| System | Result | Audio Sample |
|
|
||||||
|--------|--------|--------------|
|
|
||||||
| **Supertonic** | ✅ | [🎧 Play Audio](https://drive.google.com/file/d/1ehkZU8eiizBenG2DgR5tzBGQBvHS0Uaj/view?usp=sharing) |
|
|
||||||
| ElevenLabs Flash v2.5 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1ta3r6jFyebmA-sT44l8EaEQcMLVmuOEr/view?usp=sharing) |
|
|
||||||
| OpenAI TTS-1 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1sskmem9AzHAQ3Hv8DRSZoqX_pye-CXuU/view?usp=sharing) |
|
|
||||||
| Gemini 2.5 Flash TTS | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1zx9X8oMsLMXW0Zx_SURoqjju-By2yh_n/view?usp=sharing) |
|
|
||||||
| VibeVoice Realtime 0.5B | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1ZpGEstZr4hA0EdAWBMCUFFWuAkIpYsVh/view?usp=sharing) |
|
|
||||||
|
|
||||||
</details>
|
|
||||||
|
|
||||||
<details>
|
|
||||||
<summary><b>Example 3: Phone Number</b></summary>
|
|
||||||
|
|
||||||
<br>
|
<br>
|
||||||
|
|
||||||
@@ -352,7 +331,7 @@ Supertonic is designed to handle complex, real-world text inputs that contain nu
|
|||||||
</details>
|
</details>
|
||||||
|
|
||||||
<details>
|
<details>
|
||||||
<summary><b>Example 4: Technical Unit</b></summary>
|
<summary><b>Example 3: Technical Unit</b></summary>
|
||||||
|
|
||||||
<br>
|
<br>
|
||||||
|
|
||||||
@@ -444,9 +423,8 @@ This paper describes the self-purification technique for training flow matching
|
|||||||
|
|
||||||
This project's sample code is released under the MIT License; see the [LICENSE](https://github.com/supertone-inc/supertonic?tab=MIT-1-ov-file) for details.
|
This project's sample code is released under the MIT License; see the [LICENSE](https://github.com/supertone-inc/supertonic?tab=MIT-1-ov-file) for details.
|
||||||
|
|
||||||
The accompanying model is released under the OpenRAIL-M License; see the [LICENSE](https://huggingface.co/Supertone/supertonic-2/blob/main/LICENSE) file for details.
|
The accompanying model is released under the OpenRAIL-M License; see the [LICENSE](https://huggingface.co/Supertone/supertonic-3/blob/main/LICENSE) file for details.
|
||||||
|
|
||||||
This model was trained using PyTorch, which is licensed under the BSD 3-Clause License but is not redistributed with this project; see the [LICENSE](https://docs.pytorch.org/FBGEMM/general/License.html) for details.
|
This model was trained using PyTorch, which is licensed under the BSD 3-Clause License but is not redistributed with this project; see the [LICENSE](https://docs.pytorch.org/FBGEMM/general/License.html) for details.
|
||||||
|
|
||||||
Copyright (c) 2026 Supertone Inc.
|
Copyright (c) 2026 Supertone Inc.
|
||||||
|
|
||||||
|
|||||||
@@ -4,7 +4,7 @@ High-performance text-to-speech inference using ONNX Runtime.
|
|||||||
|
|
||||||
## 📰 Update News
|
## 📰 Update News
|
||||||
|
|
||||||
**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
|
**2026.04.29** - 🎉 **Supertonic 3** released with 31-language support, improved reading accuracy, and v2-compatible public ONNX assets. [Demo](https://huggingface.co/spaces/Supertone/supertonic-3) | [Models](https://huggingface.co/Supertone/supertonic-3)
|
||||||
|
|
||||||
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
|
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
|
||||||
|
|
||||||
@@ -51,8 +51,10 @@ vcpkg integrate install
|
|||||||
## Building
|
## Building
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
cd cpp && mkdir build && cd build
|
cd cpp
|
||||||
cmake .. && cmake --build . --config Release
|
mkdir -p build && cd build
|
||||||
|
cmake ..
|
||||||
|
cmake --build . --config Release
|
||||||
./example_onnx
|
./example_onnx
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -65,17 +67,17 @@ Run inference with default settings:
|
|||||||
```
|
```
|
||||||
|
|
||||||
This will use:
|
This will use:
|
||||||
- Voice style: `../assets/voice_styles/M1.json`
|
- Voice style: `../../assets/voice_styles/M1.json`
|
||||||
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
||||||
- Output directory: `results/`
|
- Output directory: `results/`
|
||||||
- Total steps: 5
|
- Total steps: 8
|
||||||
- Number of generations: 4
|
- Number of generations: 4
|
||||||
|
|
||||||
### Example 2: Batch Inference
|
### Example 2: Batch Inference
|
||||||
Process multiple voice styles and texts at once:
|
Process multiple voice styles and texts at once:
|
||||||
```bash
|
```bash
|
||||||
./example_onnx \
|
./example_onnx \
|
||||||
--voice-style ../assets/voice_styles/M1.json,../assets/voice_styles/F1.json \
|
--voice-style ../../assets/voice_styles/M1.json,../../assets/voice_styles/F1.json \
|
||||||
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 좋아서 한참을 멈춰 서서 들었어요." \
|
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 좋아서 한참을 멈춰 서서 들었어요." \
|
||||||
--lang en,ko \
|
--lang en,ko \
|
||||||
--batch
|
--batch
|
||||||
@@ -93,19 +95,19 @@ Increase denoising steps for better quality:
|
|||||||
```bash
|
```bash
|
||||||
./example_onnx \
|
./example_onnx \
|
||||||
--total-step 10 \
|
--total-step 10 \
|
||||||
--voice-style ../assets/voice_styles/M1.json \
|
--voice-style ../../assets/voice_styles/M1.json \
|
||||||
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
|
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
|
||||||
```
|
```
|
||||||
|
|
||||||
This will:
|
This will:
|
||||||
- Use 10 denoising steps instead of the default 5
|
- Use 10 denoising steps instead of the default 8
|
||||||
- Produce higher quality output at the cost of slower inference
|
- Produce higher quality output at the cost of slower inference
|
||||||
|
|
||||||
### Example 4: Long-Form Inference
|
### Example 4: Long-Form Inference
|
||||||
For long texts, the system automatically chunks the text into manageable segments and generates a single audio file:
|
For long texts, the system automatically chunks the text into manageable segments and generates a single audio file:
|
||||||
```bash
|
```bash
|
||||||
./example_onnx \
|
./example_onnx \
|
||||||
--voice-style ../assets/voice_styles/M1.json \
|
--voice-style ../../assets/voice_styles/M1.json \
|
||||||
--text "Once upon a time, in a small village nestled between rolling hills, there lived a young artist named Clara. Every morning, she would wake up before dawn to capture the first light of day. The golden rays streaming through her window inspired countless paintings. Her work was known throughout the region for its vibrant colors and emotional depth. People from far and wide came to see her gallery, and many said her paintings could tell stories that words never could."
|
--text "Once upon a time, in a small village nestled between rolling hills, there lived a young artist named Clara. Every morning, she would wake up before dawn to capture the first light of day. The golden rays streaming through her window inspired countless paintings. Her work was known throughout the region for its vibrant colors and emotional depth. People from far and wide came to see her gallery, and many said her paintings could tell stories that words never could."
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -121,19 +123,19 @@ This will:
|
|||||||
|
|
||||||
| Argument | Type | Default | Description |
|
| Argument | Type | Default | Description |
|
||||||
|----------|------|---------|-------------|
|
|----------|------|---------|-------------|
|
||||||
| `--onnx-dir` | str | `../assets/onnx` | Path to ONNX model directory |
|
| `--onnx-dir` | str | `../../assets/onnx` | Path to ONNX model directory |
|
||||||
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
|
| `--total-step` | int | 8 | Number of denoising steps (higher = better quality, slower) |
|
||||||
| `--speed` | float | 1.05 | Speech speed factor (higher = faster, lower = slower) |
|
| `--speed` | float | 1.05 | Speech speed factor (higher = faster, lower = slower) |
|
||||||
| `--n-test` | int | 4 | Number of times to generate each sample |
|
| `--n-test` | int | 4 | Number of times to generate each sample |
|
||||||
| `--voice-style` | str | `../assets/voice_styles/M1.json` | Voice style file path(s) (comma-separated for batch) |
|
| `--voice-style` | str | `../../assets/voice_styles/M1.json` | Voice style file path(s) (comma-separated for batch) |
|
||||||
| `--text` | str | (long default text) | Text(s) to synthesize (pipe-separated for batch) |
|
| `--text` | str | (long default text) | Text(s) to synthesize (pipe-separated for batch) |
|
||||||
| `--lang` | str | `en` | Language(s) for text(s): `en`, `ko`, `es`, `pt`, `fr` (comma-separated for batch) |
|
| `--lang` | str | `en` | Language(s) for text(s); see the main README for all 31 codes (comma-separated for batch) |
|
||||||
| `--save-dir` | str | `results` | Output directory |
|
| `--save-dir` | str | `results` | Output directory |
|
||||||
| `--batch` | flag | False | Enable batch mode (disables automatic text chunking) |
|
| `--batch` | flag | False | Enable batch mode (disables automatic text chunking) |
|
||||||
|
|
||||||
## Notes
|
## Notes
|
||||||
|
|
||||||
- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
|
- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
|
||||||
- **Multilingual Support**: Use `--lang` to specify language(s). Available: `en` (English), `ko` (Korean), `es` (Spanish), `pt` (Portuguese), `fr` (French)
|
- **Multilingual Support**: Use `--lang` to specify language(s). Available: 31 languages; see the main README for the full list
|
||||||
- **Long-Form Inference**: Without `--batch` flag, long texts are automatically chunked and combined into a single audio file with natural pauses
|
- **Long-Form Inference**: Without `--batch` flag, long texts are automatically chunked and combined into a single audio file with natural pauses
|
||||||
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
|
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
|
||||||
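The batch rules in the Notes above (comma-separated `--voice-style`, pipe-separated `--text`, matching counts) can be sketched as follows. This is a minimal, language-neutral illustration in Python, not the parser used by `example_onnx`; the function name `parse_batch_args` is hypothetical, and applying a single `--lang` code to every text is an assumption that the README does not confirm.

```python
def parse_batch_args(voice_style: str, text: str, lang: str = "en"):
    """Split CLI-style strings into aligned (style, text, lang) triples.

    Hypothetical helper for illustration only.
    """
    # Voice styles and languages are comma-separated; texts are pipe-separated
    # so that commas inside a sentence are left alone.
    styles = [s.strip() for s in voice_style.split(",") if s.strip()]
    texts = [t.strip() for t in text.split("|") if t.strip()]
    langs = [code.strip() for code in lang.split(",") if code.strip()]

    # From the Notes: the number of voice styles must match the number of texts.
    if len(styles) != len(texts):
        raise ValueError(
            f"Got {len(styles)} voice style(s) but {len(texts)} text(s)"
        )

    # Assumption (not confirmed by the README): one language code is applied
    # to every text; otherwise the counts must match.
    if len(langs) == 1 and len(texts) > 1:
        langs = langs * len(texts)
    if len(langs) != len(texts):
        raise ValueError(
            f"Got {len(langs)} language code(s) but {len(texts)} text(s)"
        )

    return list(zip(styles, texts, langs))
```

Splitting texts on `|` rather than `,` is what lets a sentence such as "Hello, world." survive as one entry.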
|
|||||||
@@ -8,11 +8,11 @@
|
|||||||
namespace fs = std::filesystem;
|
namespace fs = std::filesystem;
|
||||||
|
|
||||||
struct Args {
|
struct Args {
|
||||||
std::string onnx_dir = "../assets/onnx";
|
std::string onnx_dir = "../../assets/onnx";
|
||||||
int total_step = 5;
|
int total_step = 8;
|
||||||
float speed = 1.05f;
|
float speed = 1.05f;
|
||||||
int n_test = 4;
|
int n_test = 4;
|
||||||
std::vector<std::string> voice_style = {"../assets/voice_styles/M1.json"};
|
std::vector<std::string> voice_style = {"../../assets/voice_styles/M1.json"};
|
||||||
std::vector<std::string> text = {
|
std::vector<std::string> text = {
|
||||||
"This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
"This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
||||||
};
|
};
|
||||||
|
|||||||
@@ -12,7 +12,7 @@
|
|||||||
using json = nlohmann::json;
|
using json = nlohmann::json;
|
||||||
|
|
||||||
// Available languages for multilingual TTS
|
// Available languages for multilingual TTS
|
||||||
const std::vector<std::string> AVAILABLE_LANGS = {"en", "ko", "es", "pt", "fr"};
|
const std::vector<std::string> AVAILABLE_LANGS = {"en", "ko", "ja", "ar", "bg", "cs", "da", "de", "el", "es", "et", "fi", "fr", "hi", "hr", "hu", "id", "it", "lt", "lv", "nl", "pl", "pt", "ro", "ru", "sk", "sl", "sv", "tr", "uk", "vi"};
|
||||||
|
|
||||||
// Global tensor buffers for memory management
|
// Global tensor buffers for memory management
|
||||||
static std::vector<std::vector<float>> g_tensor_buffers_float;
|
static std::vector<std::vector<float>> g_tensor_buffers_float;
|
||||||
@@ -190,7 +190,7 @@ std::string UnicodeProcessor::preprocessText(const std::string& text, const std:
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
if (!valid_lang) {
|
if (!valid_lang) {
|
||||||
throw std::runtime_error("Invalid language: " + lang + ". Available: en, ko, es, pt, fr");
|
throw std::runtime_error("Invalid language: " + lang + ". See AVAILABLE_LANGS for supported codes.");
|
||||||
}
|
}
|
||||||
|
|
||||||
// Wrap text with language tags
|
// Wrap text with language tags
|
||||||
@@ -695,7 +695,7 @@ TextToSpeech::SynthesisResult TextToSpeech::call(
|
|||||||
throw std::runtime_error("Single speaker text to speech only supports single style");
|
throw std::runtime_error("Single speaker text to speech only supports single style");
|
||||||
}
|
}
|
||||||
|
|
||||||
int max_len = (lang == "ko") ? 120 : 300;
|
int max_len = (lang == "ko" || lang == "ja") ? 120 : 300;
|
||||||
auto text_list = chunkText(text, max_len);
|
auto text_list = chunkText(text, max_len);
|
||||||
std::vector<float> wav_cat;
|
std::vector<float> wav_cat;
|
||||||
float dur_cat = 0.0f;
|
float dur_cat = 0.0f;
|
||||||
|
|||||||
@@ -10,11 +10,11 @@ namespace Supertonic
|
|||||||
class Args
|
class Args
|
||||||
{
|
{
|
||||||
public bool UseGpu { get; set; } = false;
|
public bool UseGpu { get; set; } = false;
|
||||||
public string OnnxDir { get; set; } = "assets/onnx";
|
public string OnnxDir { get; set; } = "../assets/onnx";
|
||||||
public int TotalStep { get; set; } = 5;
|
public int TotalStep { get; set; } = 8;
|
||||||
public float Speed { get; set; } = 1.05f;
|
public float Speed { get; set; } = 1.05f;
|
||||||
public int NTest { get; set; } = 4;
|
public int NTest { get; set; } = 4;
|
||||||
public List<string> VoiceStyle { get; set; } = new List<string> { "assets/voice_styles/M1.json" };
|
public List<string> VoiceStyle { get; set; } = new List<string> { "../assets/voice_styles/M1.json" };
|
||||||
public List<string> Text { get; set; } = new List<string>
|
public List<string> Text { get; set; } = new List<string>
|
||||||
{
|
{
|
||||||
"This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
"This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
||||||
|
|||||||
@@ -13,7 +13,7 @@ namespace Supertonic
|
|||||||
// Available languages for multilingual TTS
|
// Available languages for multilingual TTS
|
||||||
public static class Languages
|
public static class Languages
|
||||||
{
|
{
|
||||||
public static readonly string[] Available = { "en", "ko", "es", "pt", "fr" };
|
public static readonly string[] Available = { "en", "ko", "ja", "ar", "bg", "cs", "da", "de", "el", "es", "et", "fi", "fr", "hi", "hr", "hu", "id", "it", "lt", "lv", "nl", "pl", "pt", "ro", "ru", "sk", "sl", "sv", "tr", "uk", "vi" };
|
||||||
}
|
}
|
||||||
|
|
||||||
// ============================================================================
|
// ============================================================================
|
||||||
@@ -440,7 +440,7 @@ namespace Supertonic
|
|||||||
throw new ArgumentException("Single speaker text to speech only supports single style");
|
throw new ArgumentException("Single speaker text to speech only supports single style");
|
||||||
}
|
}
|
||||||
|
|
||||||
int maxLen = lang == "ko" ? 120 : 300;
|
int maxLen = (lang == "ko" || lang == "ja") ? 120 : 300;
|
||||||
var textList = Helper.ChunkText(text, maxLen);
|
var textList = Helper.ChunkText(text, maxLen);
|
||||||
var wavCat = new List<float>();
|
var wavCat = new List<float>();
|
||||||
float durCat = 0.0f;
|
float durCat = 0.0f;
|
||||||
|
|||||||
@@ -4,7 +4,7 @@ This guide provides examples for running TTS inference using `ExampleONNX.cs`.
|
|||||||
|
|
||||||
## 📰 Update News
|
## 📰 Update News
|
||||||
|
|
||||||
**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
|
**2026.04.29** - 🎉 **Supertonic 3** released with 31-language support, improved reading accuracy, and v2-compatible public ONNX assets. [Demo](https://huggingface.co/spaces/Supertone/supertonic-3) | [Models](https://huggingface.co/Supertone/supertonic-3)
|
||||||
|
|
||||||
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
|
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
|
||||||
|
|
||||||
@@ -19,9 +19,11 @@ This guide provides examples for running TTS inference using `ExampleONNX.cs`.
|
|||||||
## Installation
|
## Installation
|
||||||
|
|
||||||
### Prerequisites
|
### Prerequisites
|
||||||
- .NET 9.0 SDK or later
|
- .NET 9.0 SDK/runtime, or a newer .NET runtime with major-version roll-forward
|
||||||
- [Download .NET SDK](https://dotnet.microsoft.com/download)
|
- [Download .NET SDK](https://dotnet.microsoft.com/download)
|
||||||
|
|
||||||
|
The project targets `net9.0` and sets `RollForward=Major`, so systems with only a newer runtime such as .NET 10 can still run the example.
|
||||||
|
|
||||||
### Install dependencies
|
### Install dependencies
|
||||||
```bash
|
```bash
|
||||||
dotnet restore
|
dotnet restore
|
||||||
@@ -36,17 +38,17 @@ dotnet run
|
|||||||
```
|
```
|
||||||
|
|
||||||
This will use:
|
This will use:
|
||||||
- Voice style: `assets/voice_styles/M1.json`
|
- Voice style: `../assets/voice_styles/M1.json`
|
||||||
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
||||||
- Output directory: `results/`
|
- Output directory: `results/`
|
||||||
- Total steps: 5
|
- Total steps: 8
|
||||||
- Number of generations: 4
|
- Number of generations: 4
|
||||||
|
|
||||||
### Example 2: Batch Inference
|
### Example 2: Batch Inference
|
||||||
Process multiple voice styles and texts at once:
|
Process multiple voice styles and texts at once:
|
||||||
```bash
|
```bash
|
||||||
dotnet run -- \
|
dotnet run -- \
|
||||||
--voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json \
|
--voice-style ../assets/voice_styles/M1.json,../assets/voice_styles/F1.json \
|
||||||
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 좋아서 한참을 멈춰 서서 들었어요." \
|
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 좋아서 한참을 멈춰 서서 들었어요." \
|
||||||
--lang en,ko \
|
--lang en,ko \
|
||||||
--batch
|
--batch
|
||||||
@@ -64,19 +66,19 @@ Increase denoising steps for better quality:
|
|||||||
```bash
|
```bash
|
||||||
dotnet run -- \
|
dotnet run -- \
|
||||||
--total-step 10 \
|
--total-step 10 \
|
||||||
--voice-style assets/voice_styles/M1.json \
|
--voice-style ../assets/voice_styles/M1.json \
|
||||||
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
|
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
|
||||||
```
|
```
|
||||||
|
|
||||||
This will:
|
This will:
|
||||||
- Use 10 denoising steps instead of the default 5
|
- Use 10 denoising steps instead of the default 8
|
||||||
- Produce higher quality output at the cost of slower inference
|
- Produce higher quality output at the cost of slower inference
|
||||||
|
|
||||||
### Example 4: Long-Form Inference
|
### Example 4: Long-Form Inference
|
||||||
For long texts, the system automatically chunks the text into manageable segments and generates a single audio file:
|
For long texts, the system automatically chunks the text into manageable segments and generates a single audio file:
|
||||||
```bash
|
```bash
|
||||||
dotnet run -- \
|
dotnet run -- \
|
||||||
--voice-style assets/voice_styles/M1.json \
|
--voice-style ../assets/voice_styles/M1.json \
|
||||||
--text "Once upon a time, in a small village nestled between rolling hills, there lived a young artist named Clara. Every morning, she would wake up before dawn to capture the first light of day. The golden rays streaming through her window inspired countless paintings. Her work was known throughout the region for its vibrant colors and emotional depth. People from far and wide came to see her gallery, and many said her paintings could tell stories that words never could."
|
--text "Once upon a time, in a small village nestled between rolling hills, there lived a young artist named Clara. Every morning, she would wake up before dawn to capture the first light of day. The golden rays streaming through her window inspired countless paintings. Her work was known throughout the region for its vibrant colors and emotional depth. People from far and wide came to see her gallery, and many said her paintings could tell stories that words never could."
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -93,20 +95,20 @@ This will:
|
|||||||
| Argument | Type | Default | Description |
|
| Argument | Type | Default | Description |
|
||||||
|----------|------|---------|-------------|
|
|----------|------|---------|-------------|
|
||||||
| `--use-gpu` | flag | False | Use GPU for inference (not supported yet) |
|
| `--use-gpu` | flag | False | Use GPU for inference (not supported yet) |
|
||||||
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
|
| `--onnx-dir` | str | `../assets/onnx` | Path to ONNX model directory |
|
||||||
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
|
| `--total-step` | int | 8 | Number of denoising steps (higher = better quality, slower) |
|
||||||
| `--speed` | float | 1.05 | Speech speed factor (higher = faster, lower = slower) |
|
| `--speed` | float | 1.05 | Speech speed factor (higher = faster, lower = slower) |
|
||||||
| `--n-test` | int | 4 | Number of times to generate each sample |
|
| `--n-test` | int | 4 | Number of times to generate each sample |
|
||||||
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s) (comma-separated) |
|
| `--voice-style` | str+ | `../assets/voice_styles/M1.json` | Voice style file path(s) (comma-separated) |
|
||||||
| `--text` | str+ | (long default text) | Text(s) to synthesize (pipe-separated: `|`) |
|
| `--text` | str+ | (long default text) | Text(s) to synthesize (pipe-separated: `|`) |
|
||||||
| `--lang` | str+ | `en` | Language(s) for text(s): `en`, `ko`, `es`, `pt`, `fr` (comma-separated) |
|
| `--lang` | str+ | `en` | Language(s) for text(s); see the main README for all 31 codes (comma-separated) |
|
||||||
| `--save-dir` | str | `results` | Output directory |
|
| `--save-dir` | str | `results` | Output directory |
|
||||||
| `--batch` | flag | False | Enable batch mode (disables automatic text chunking) |
|
| `--batch` | flag | False | Enable batch mode (disables automatic text chunking) |
|
||||||
|
|
||||||
## Notes
|
## Notes
|
||||||
|
|
||||||
- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
|
- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
|
||||||
- **Multilingual Support**: Use `--lang` to specify language(s). Available: `en` (English), `ko` (Korean), `es` (Spanish), `pt` (Portuguese), `fr` (French)
|
- **Multilingual Support**: Use `--lang` to specify language(s). Available: 31 languages; see the main README for the full list
|
||||||
- **Long-Form Inference**: Without `--batch` flag, long texts are automatically chunked and combined into a single audio file with natural pauses
|
- **Long-Form Inference**: Without `--batch` flag, long texts are automatically chunked and combined into a single audio file with natural pauses
|
||||||
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
|
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
|
||||||
- **GPU Support**: GPU mode is not supported yet
|
- **GPU Support**: GPU mode is not supported yet
|
||||||
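The long-form behaviour described in the Notes above (chunk the text, synthesize each chunk, join the audio with pauses) can be sketched as below. This is a rough Python illustration under stated assumptions, not the C# implementation: only the 120/300 character limits for `ko`/`ja` versus other languages come from this change; the sentence-splitting rule, the function names, and the `sample_rate` parameter are hypothetical, and the 0.3-second default pause is borrowed from the Flutter and Go examples.

```python
import re


def max_chunk_len(lang: str) -> int:
    # Limits taken from the diff: Korean and Japanese use shorter chunks.
    return 120 if lang in ("ko", "ja") else 300


def chunk_text(text: str, max_len: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_len characters.

    Assumed splitting rule: break after '.', '!' or '?' followed by whitespace.
    A single sentence longer than max_len is kept as one oversized chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_len:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def concat_with_silence(wavs, sample_rate: int, silence_sec: float = 0.3):
    """Join per-chunk sample lists, inserting silence between (not after) them."""
    silence = [0.0] * round(sample_rate * silence_sec)
    out: list[float] = []
    for i, wav in enumerate(wavs):
        if i > 0:
            out.extend(silence)
        out.extend(wav)
    return out
```

With `--batch` enabled this path is skipped, which is why each batch entry should already be short enough to synthesize in one pass.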
@@ -134,4 +136,3 @@ csharp/
|
|||||||
└── results/ # Output directory (created automatically)
|
└── results/ # Output directory (created automatically)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -5,6 +5,7 @@
|
|||||||
<TargetFramework>net9.0</TargetFramework>
|
<TargetFramework>net9.0</TargetFramework>
|
||||||
<LangVersion>13.0</LangVersion>
|
<LangVersion>13.0</LangVersion>
|
||||||
<Nullable>enable</Nullable>
|
<Nullable>enable</Nullable>
|
||||||
|
<RollForward>Major</RollForward>
|
||||||
</PropertyGroup>
|
</PropertyGroup>
|
||||||
|
|
||||||
<ItemGroup>
|
<ItemGroup>
|
||||||
@@ -14,4 +15,3 @@
|
|||||||
|
|
||||||
</Project>
|
</Project>
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -1,13 +1,13 @@
|
|||||||
# Supertonic Flutter Example
|
# Supertonic Flutter Example
|
||||||
|
|
||||||
This example demonstrates how to use Supertonic 2 in a Flutter application using ONNX Runtime.
|
This example demonstrates how to use Supertonic 3 in a Flutter application using ONNX Runtime.
|
||||||
|
|
||||||
> **Note:** This project uses the `flutter_onnxruntime` package ([https://pub.dev/packages/flutter_onnxruntime](https://pub.dev/packages/flutter_onnxruntime)). At the moment, only the macOS platform has been tested. Although the flutter_onnxruntime package supports several other platforms, they have not been tested in this project yet and may require additional verification.
|
> **Note:** This project uses the `flutter_onnxruntime` package ([https://pub.dev/packages/flutter_onnxruntime](https://pub.dev/packages/flutter_onnxruntime)). At the moment, only the macOS platform has been tested. Although the flutter_onnxruntime package supports several other platforms, they have not been tested in this project yet and may require additional verification.
|
||||||
|
|
||||||
|
|
||||||
## 📰 Update News
|
## 📰 Update News
|
||||||
|
|
||||||
**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
|
**2026.04.29** - 🎉 **Supertonic 3** released with 31-language support, improved reading accuracy, and v2-compatible public ONNX assets. [Demo](https://huggingface.co/spaces/Supertone/supertonic-3) | [Models](https://huggingface.co/Supertone/supertonic-3)
|
||||||
|
|
||||||
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
|
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
|
||||||
|
|
||||||
@@ -17,12 +17,7 @@ This example demonstrates how to use Supertonic 2 in a Flutter application using
|
|||||||
|
|
||||||
## Multilingual Support
|
## Multilingual Support
|
||||||
|
|
||||||
Supertonic 2 supports multiple languages. Select the appropriate language from the dropdown:
|
Supertonic 3 supports 31 languages. Select the appropriate language from the dropdown; see the main README for the full code list.
|
||||||
- **English (en)**: Default language
|
|
||||||
- **한국어 (ko)**: Korean
|
|
||||||
- **Español (es)**: Spanish
|
|
||||||
- **Português (pt)**: Portuguese
|
|
||||||
- **Français (fr)**: French
|
|
||||||
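Since the per-language bullets are replaced by a pointer to the main README, a language check is the piece most callers will need. The sketch below is a Python illustration of such a check; the 31 codes are copied from the `availableLangs` list in this change, while the `validate_lang` name and the error wording are hypothetical.

```python
# The 31 language codes listed in this change.
AVAILABLE_LANGS = (
    "en", "ko", "ja", "ar", "bg", "cs", "da", "de", "el", "es", "et",
    "fi", "fr", "hi", "hr", "hu", "id", "it", "lt", "lv", "nl", "pl",
    "pt", "ro", "ru", "sk", "sl", "sv", "tr", "uk", "vi",
)


def validate_lang(lang: str) -> str:
    """Return lang unchanged if it is supported, otherwise raise ValueError."""
    if lang not in AVAILABLE_LANGS:
        supported = ", ".join(AVAILABLE_LANGS)
        raise ValueError(f"Invalid language: {lang}. Supported codes: {supported}")
    return lang
```

Listing the supported codes in the error message keeps the message accurate if the list grows again, at the cost of a longer string.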
|
|
||||||
## Requirements
|
## Requirements
|
||||||
|
|
||||||
@@ -35,4 +30,3 @@ flutter clean
|
|||||||
flutter pub get
|
flutter pub get
|
||||||
flutter run -d macos
|
flutter run -d macos
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|||||||
@@ -12,7 +12,7 @@ final logger = Logger(
|
|||||||
);
|
);
|
||||||
|
|
||||||
// Available languages for multilingual TTS
|
// Available languages for multilingual TTS
|
||||||
const List<String> availableLangs = ['en', 'ko', 'es', 'pt', 'fr'];
|
const List<String> availableLangs = ['en', 'ko', 'ja', 'ar', 'bg', 'cs', 'da', 'de', 'el', 'es', 'et', 'fi', 'fr', 'hi', 'hr', 'hu', 'id', 'it', 'lt', 'lv', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sv', 'tr', 'uk', 'vi'];
|
||||||
|
|
||||||
bool isValidLang(String lang) => availableLangs.contains(lang);
|
bool isValidLang(String lang) => availableLangs.contains(lang);
|
||||||
|
|
||||||
@@ -285,7 +285,7 @@ class TextToSpeech {
|
|||||||
Future<Map<String, dynamic>> call(
|
Future<Map<String, dynamic>> call(
|
||||||
String text, String lang, Style style, int totalStep,
|
String text, String lang, Style style, int totalStep,
|
||||||
{double speed = 1.05, double silenceDuration = 0.3}) async {
|
{double speed = 1.05, double silenceDuration = 0.3}) async {
|
||||||
final maxLen = lang == 'ko' ? 120 : 300;
|
final maxLen = (lang == 'ko' || lang == 'ja') ? 120 : 300;
|
||||||
final chunks = _chunkText(text, maxLen: maxLen);
|
final chunks = _chunkText(text, maxLen: maxLen);
|
||||||
final langList = List.filled(chunks.length, lang);
|
final langList = List.filled(chunks.length, lang);
|
||||||
List<double>? wavCat;
|
List<double>? wavCat;
|
||||||
|
|||||||
@@ -14,7 +14,7 @@ class SupertonicApp extends StatelessWidget {
|
|||||||
@override
|
@override
|
||||||
Widget build(BuildContext context) {
|
Widget build(BuildContext context) {
|
||||||
return MaterialApp(
|
return MaterialApp(
|
||||||
title: 'Supertonic 2',
|
title: 'Supertonic 3',
|
||||||
theme: ThemeData(
|
theme: ThemeData(
|
||||||
colorScheme: ColorScheme.fromSeed(seedColor: Colors.deepPurple),
|
colorScheme: ColorScheme.fromSeed(seedColor: Colors.deepPurple),
|
||||||
useMaterial3: true,
|
useMaterial3: true,
|
||||||
@@ -42,7 +42,7 @@ class _TTSPageState extends State<TTSPage> {
|
|||||||
bool _isLoading = false;
|
bool _isLoading = false;
|
||||||
bool _isGenerating = false;
|
bool _isGenerating = false;
|
||||||
String _status = 'Not initialized';
|
String _status = 'Not initialized';
|
||||||
int _totalSteps = 5;
|
int _totalSteps = 8;
|
||||||
double _speed = 1.05;
|
double _speed = 1.05;
|
||||||
String _selectedLang = 'en';
|
String _selectedLang = 'en';
|
||||||
bool _isPlaying = false;
|
bool _isPlaying = false;
|
||||||
@@ -210,7 +210,7 @@ class _TTSPageState extends State<TTSPage> {
|
|||||||
return Scaffold(
|
return Scaffold(
|
||||||
appBar: AppBar(
|
appBar: AppBar(
|
||||||
backgroundColor: Theme.of(context).colorScheme.inversePrimary,
|
backgroundColor: Theme.of(context).colorScheme.inversePrimary,
|
||||||
title: const Text('Supertonic 2'),
|
title: const Text('Supertonic 3'),
|
||||||
),
|
),
|
||||||
body: Padding(
|
body: Padding(
|
||||||
padding: const EdgeInsets.all(16.0),
|
padding: const EdgeInsets.all(16.0),
|
||||||
@@ -323,13 +323,10 @@ class _TTSPageState extends State<TTSPage> {
|
|||||||
child: DropdownButton<String>(
|
child: DropdownButton<String>(
|
||||||
value: _selectedLang,
|
value: _selectedLang,
|
||||||
isExpanded: true,
|
isExpanded: true,
|
||||||
items: const [
|
items: availableLangs
|
||||||
DropdownMenuItem(value: 'en', child: Text('English')),
|
.map((lang) =>
|
||||||
DropdownMenuItem(value: 'ko', child: Text('한국어')),
|
DropdownMenuItem(value: lang, child: Text(lang)))
|
||||||
DropdownMenuItem(value: 'es', child: Text('Español')),
|
.toList(),
|
||||||
DropdownMenuItem(value: 'pt', child: Text('Português')),
|
|
||||||
DropdownMenuItem(value: 'fr', child: Text('Français')),
|
|
||||||
],
|
|
||||||
onChanged: _isLoading || _isGenerating
|
onChanged: _isLoading || _isGenerating
|
||||||
? null
|
? null
|
||||||
: (value) => setState(() => _selectedLang = value!),
|
: (value) => setState(() => _selectedLang = value!),
|
||||||
|
|||||||
@@ -4,7 +4,7 @@ This guide provides examples for running TTS inference using `example_onnx.go`.
|
|||||||
|
|
||||||
## 📰 Update News
|
## 📰 Update News
|
||||||
|
|
||||||
**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
|
**2026.04.29** - 🎉 **Supertonic 3** released with 31-language support, improved reading accuracy, and v2-compatible public ONNX assets. [Demo](https://huggingface.co/spaces/Supertone/supertonic-3) | [Models](https://huggingface.co/Supertone/supertonic-3)
|
||||||
|
|
||||||
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
|
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
|
||||||
|
|
||||||
@@ -30,6 +30,8 @@ This project uses Go modules for dependency management.
|
|||||||
brew install onnxruntime
|
brew install onnxruntime
|
||||||
```
|
```
|
||||||
|
|
||||||
|
The Go example auto-detects Homebrew ONNX Runtime paths on both Apple Silicon and Intel Macs.
|
||||||
|
|
||||||
**Linux:**
|
**Linux:**
|
||||||
```bash
|
```bash
|
||||||
# Download ONNX Runtime from GitHub releases
|
# Download ONNX Runtime from GitHub releases
|
||||||
@@ -48,7 +50,7 @@ go mod download
|
|||||||
|
|
||||||
### Configure ONNX Runtime Library Path (Optional)
|
### Configure ONNX Runtime Library Path (Optional)
|
||||||
|
|
||||||
If the ONNX Runtime library is not in a standard location, set the environment variable:
|
If the ONNX Runtime library is not in a standard or Homebrew location, set the environment variable:
|
||||||
|
|
||||||
**Automatic Detection (Recommended):**
|
**Automatic Detection (Recommended):**
|
||||||
|
|
||||||
@@ -77,10 +79,10 @@ go run example_onnx.go helper.go
|
|||||||
```
|
```
|
||||||
|
|
||||||
This will use:
|
This will use:
|
||||||
- Voice style: `assets/voice_styles/M1.json`
|
- Voice style: `../assets/voice_styles/M1.json`
|
||||||
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
||||||
- Output directory: `results/`
|
- Output directory: `results/`
|
||||||
- Total steps: 5
|
- Total steps: 8
|
||||||
- Number of generations: 4
|
- Number of generations: 4
|
||||||
|
|
||||||
### Example 2: Batch Inference
|
### Example 2: Batch Inference
|
||||||
@@ -88,7 +90,7 @@ Process multiple voice styles and texts at once:
|
|||||||
```bash
|
```bash
|
||||||
go run example_onnx.go helper.go \
|
go run example_onnx.go helper.go \
|
||||||
--batch \
|
--batch \
|
||||||
-voice-style "assets/voice_styles/M1.json,assets/voice_styles/F1.json" \
|
-voice-style "../assets/voice_styles/M1.json,../assets/voice_styles/F1.json" \
|
||||||
-text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 기분 좋았어요." \
|
-text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 기분 좋았어요." \
|
||||||
-lang "en,ko"
|
-lang "en,ko"
|
||||||
```
|
```
|
||||||
@@ -104,12 +106,12 @@ Increase denoising steps for better quality:
|
|||||||
```bash
|
```bash
|
||||||
go run example_onnx.go helper.go \
|
go run example_onnx.go helper.go \
|
||||||
-total-step 10 \
|
-total-step 10 \
|
||||||
-voice-style "assets/voice_styles/M1.json" \
|
-voice-style "../assets/voice_styles/M1.json" \
|
||||||
-text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
|
-text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
|
||||||
```
|
```
|
||||||
|
|
||||||
This will:
|
This will:
|
||||||
- Use 10 denoising steps instead of the default 5
|
- Use 10 denoising steps instead of the default 8
|
||||||
- Produce higher quality output at the cost of slower inference
|
- Produce higher quality output at the cost of slower inference
|
||||||
|
|
||||||
### Example 4: Long-Form Inference
|
### Example 4: Long-Form Inference
|
||||||
@@ -117,7 +119,7 @@ The system automatically chunks long texts into manageable segments, synthesizes
|
|||||||
|
|
||||||
```bash
|
```bash
|
||||||
go run example_onnx.go helper.go \
|
go run example_onnx.go helper.go \
|
||||||
-voice-style "assets/voice_styles/M1.json" \
|
-voice-style "../assets/voice_styles/M1.json" \
|
||||||
-text "This is a very long text that will be automatically split into multiple chunks. The system will process each chunk separately and then concatenate them together with natural pauses between segments. This ensures that even very long texts can be processed efficiently while maintaining natural speech flow and avoiding memory issues."
|
-text "This is a very long text that will be automatically split into multiple chunks. The system will process each chunk separately and then concatenate them together with natural pauses between segments. This ensures that even very long texts can be processed efficiently while maintaining natural speech flow and avoiding memory issues."
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -134,18 +136,18 @@ This will:
|
|||||||
| Argument | Type | Default | Description |
|
| Argument | Type | Default | Description |
|
||||||
|----------|------|---------|-------------|
|
|----------|------|---------|-------------|
|
||||||
| `-use-gpu` | flag | false | Use GPU for inference (default: CPU) |
|
| `-use-gpu` | flag | false | Use GPU for inference (default: CPU) |
|
||||||
| `-onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
|
| `-onnx-dir` | str | `../assets/onnx` | Path to ONNX model directory |
|
||||||
| `-total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
|
| `-total-step` | int | 8 | Number of denoising steps (higher = better quality, slower) |
|
||||||
| `-n-test` | int | 4 | Number of times to generate each sample |
|
| `-n-test` | int | 4 | Number of times to generate each sample |
|
||||||
| `-voice-style` | str | `assets/voice_styles/M1.json` | Voice style file path(s), comma-separated |
|
| `-voice-style` | str | `../assets/voice_styles/M1.json` | Voice style file path(s), comma-separated |
|
||||||
| `-text` | str | (long default text) | Text(s) to synthesize, pipe-separated |
|
| `-text` | str | (long default text) | Text(s) to synthesize, pipe-separated |
|
||||||
| `-lang` | str | `en` | Language(s) for synthesis, comma-separated (en, ko, es, pt, fr) |
|
| `-lang` | str | `en` | Language(s) for synthesis, comma-separated; see the main README for all 31 codes |
|
||||||
| `-save-dir` | str | `results` | Output directory |
|
| `-save-dir` | str | `results` | Output directory |
|
||||||
| `--batch` | flag | false | Enable batch mode (multiple text-style pairs, disables automatic chunking) |
|
| `--batch` | flag | false | Enable batch mode (multiple text-style pairs, disables automatic chunking) |
|
||||||
|
|
||||||
## Notes
|
## Notes
|
||||||
|
|
||||||
- **Multilingual Support**: Use `-lang` to specify the language for each text. Available: `en` (English), `ko` (Korean), `es` (Spanish), `pt` (Portuguese), `fr` (French)
|
- **Multilingual Support**: Use `-lang` to specify the language for each text. Available: 31 languages; see the main README for the full list
|
||||||
- **Batch Processing**: When using `--batch`, the number of `-voice-style`, `-text`, and `-lang` entries must match
|
- **Batch Processing**: When using `--batch`, the number of `-voice-style`, `-text`, and `-lang` entries must match
|
||||||
- **Automatic Chunking**: Without `--batch`, long texts are automatically split and concatenated with 0.3s pauses
|
- **Automatic Chunking**: Without `--batch`, long texts are automatically split and concatenated with 0.3s pauses
|
||||||
- **Quality vs Speed**: Higher `-total-step` values produce better quality but take longer
|
- **Quality vs Speed**: Higher `-total-step` values produce better quality but take longer
|
||||||
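The batch rule in the notes above (equal numbers of `-voice-style`, `-text`, and `-lang` entries, with comma-separated styles and languages and pipe-separated texts) can be sketched as follows. This is a minimal illustration, not the code in `example_onnx.go`: the `BatchItem` type and `parseBatch` function are hypothetical names, and trimming whitespace around entries is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// BatchItem pairs one text with its voice style and language.
// Hypothetical type for illustration; not part of the example code.
type BatchItem struct {
	VoiceStyle string
	Text       string
	Lang       string
}

// parseBatch splits the raw flag values (styles and languages on commas,
// texts on pipes) and returns an error when the three counts differ.
func parseBatch(voiceStyles, texts, langs string) ([]BatchItem, error) {
	styles := strings.Split(voiceStyles, ",")
	textList := strings.Split(texts, "|")
	langList := strings.Split(langs, ",")

	if len(styles) != len(textList) || len(textList) != len(langList) {
		return nil, fmt.Errorf(
			"batch mode needs equal counts: %d voice styles, %d texts, %d languages",
			len(styles), len(textList), len(langList))
	}

	items := make([]BatchItem, len(textList))
	for i := range textList {
		// Trimming surrounding whitespace is an assumption of this sketch.
		items[i] = BatchItem{
			VoiceStyle: strings.TrimSpace(styles[i]),
			Text:       strings.TrimSpace(textList[i]),
			Lang:       strings.TrimSpace(langList[i]),
		}
	}
	return items, nil
}

func main() {
	items, err := parseBatch(
		"../assets/voice_styles/M1.json,../assets/voice_styles/F1.json",
		"Hello world.|안녕하세요.",
		"en,ko",
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, it := range items {
		fmt.Printf("%s [%s]: %s\n", it.VoiceStyle, it.Lang, it.Text)
	}
}
```

Checking the counts before any model is loaded means a mismatched command line fails immediately instead of partway through synthesis.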
@@ -162,4 +164,3 @@ Then run it:
|
|||||||
```bash
|
```bash
|
||||||
./tts_example -voice-style "../assets/voice_styles/M1.json" -text "Hello world"
|
./tts_example -voice-style "../assets/voice_styles/M1.json" -text "Hello world"
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|||||||
@@ -28,17 +28,17 @@ func parseArgs() *Args {
|
|||||||
args := &Args{}
|
args := &Args{}
|
||||||
|
|
||||||
flag.BoolVar(&args.useGPU, "use-gpu", false, "Use GPU for inference (default: CPU)")
|
flag.BoolVar(&args.useGPU, "use-gpu", false, "Use GPU for inference (default: CPU)")
|
||||||
flag.StringVar(&args.onnxDir, "onnx-dir", "assets/onnx", "Path to ONNX model directory")
|
flag.StringVar(&args.onnxDir, "onnx-dir", "../assets/onnx", "Path to ONNX model directory")
|
||||||
flag.IntVar(&args.totalStep, "total-step", 5, "Number of denoising steps")
|
flag.IntVar(&args.totalStep, "total-step", 8, "Number of denoising steps")
|
||||||
flag.Float64Var(&args.speed, "speed", 1.05, "Speech speed factor (higher = faster)")
|
flag.Float64Var(&args.speed, "speed", 1.05, "Speech speed factor (higher = faster)")
|
||||||
flag.IntVar(&args.nTest, "n-test", 4, "Number of times to generate")
|
flag.IntVar(&args.nTest, "n-test", 4, "Number of times to generate")
|
||||||
flag.StringVar(&args.saveDir, "save-dir", "results", "Output directory")
|
flag.StringVar(&args.saveDir, "save-dir", "results", "Output directory")
|
||||||
flag.BoolVar(&args.batch, "batch", false, "Enable batch mode (multiple text-style pairs)")
|
flag.BoolVar(&args.batch, "batch", false, "Enable batch mode (multiple text-style pairs)")
|
||||||
|
|
||||||
var voiceStyleStr, textStr, langStr string
|
var voiceStyleStr, textStr, langStr string
|
||||||
flag.StringVar(&voiceStyleStr, "voice-style", "assets/voice_styles/M1.json", "Voice style file path(s), comma-separated")
|
flag.StringVar(&voiceStyleStr, "voice-style", "../assets/voice_styles/M1.json", "Voice style file path(s), comma-separated")
|
||||||
flag.StringVar(&textStr, "text", "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.", "Text(s) to synthesize, pipe-separated")
|
flag.StringVar(&textStr, "text", "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.", "Text(s) to synthesize, pipe-separated")
|
||||||
flag.StringVar(&langStr, "lang", "en", "Language(s) for synthesis, comma-separated (en, ko, es, pt, fr)")
|
flag.StringVar(&langStr, "lang", "en", "Language(s) for synthesis, comma-separated")
|
||||||
|
|
||||||
flag.Parse()
|
flag.Parse()
|
||||||
|
|
||||||
|
|||||||
@@ -19,7 +19,7 @@ import (
|
|||||||
)
|
)
|
||||||
|
|
||||||
// Available languages for multilingual TTS
|
// Available languages for multilingual TTS
|
||||||
var AvailableLangs = []string{"en", "ko", "es", "pt", "fr"}
|
var AvailableLangs = []string{"en", "ko", "ja", "ar", "bg", "cs", "da", "de", "el", "es", "et", "fi", "fr", "hi", "hr", "hu", "id", "it", "lt", "lv", "nl", "pl", "pt", "ro", "ru", "sk", "sl", "sv", "tr", "uk", "vi"}
|
||||||
|
|
||||||
// Config structures
|
// Config structures
|
||||||
type SpecProcessorConfig struct {
|
type SpecProcessorConfig struct {
|
||||||
@@ -801,7 +801,7 @@ func (tts *TextToSpeech) _infer(textList []string, langList []string, style *Sty
|
|||||||
// Call synthesizes speech from a single text with automatic chunking
|
// Call synthesizes speech from a single text with automatic chunking
|
||||||
func (tts *TextToSpeech) Call(text string, lang string, style *Style, totalStep int, speed float32, silenceDuration float32) ([]float32, float32, error) {
|
func (tts *TextToSpeech) Call(text string, lang string, style *Style, totalStep int, speed float32, silenceDuration float32) ([]float32, float32, error) {
|
||||||
maxLen := 300
|
maxLen := 300
|
||||||
if lang == "ko" {
|
if lang == "ko" || lang == "ja" {
|
||||||
maxLen = 120
|
maxLen = 120
|
||||||
}
|
}
|
||||||
chunks := chunkText(text, maxLen)
|
chunks := chunkText(text, maxLen)
|
||||||
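`Call` takes a `silenceDuration` argument, and the README describes chunks being joined with 0.3 s pauses. The joining step might look roughly like the sketch below. It is an illustration only: `concatWithSilence` is a hypothetical helper, the sample rate is passed in explicitly because the real value comes from the model config (not shown here), and the gap is assumed to be inserted between chunks but not after the last one.

```go
package main

import "fmt"

// concatWithSilence joins per-chunk waveforms, inserting silenceDuration
// seconds of zero samples between consecutive chunks (none after the last).
// Hypothetical helper; the actual implementation may differ.
func concatWithSilence(chunks [][]float32, sampleRate int, silenceDuration float32) []float32 {
	gap := int(silenceDuration * float32(sampleRate)) // pause length in samples
	var out []float32
	for i, chunk := range chunks {
		if i > 0 {
			out = append(out, make([]float32, gap)...)
		}
		out = append(out, chunk...)
	}
	return out
}

func main() {
	// Two tiny "chunks" and a toy sample rate of 4 Hz keep the arithmetic visible:
	// 2 samples + 2 samples of silence (0.5 s * 4 Hz) + 3 samples.
	chunks := [][]float32{{0.1, 0.2}, {0.3, 0.4, 0.5}}
	wav := concatWithSilence(chunks, 4, 0.5)
	fmt.Println(len(wav)) // prints 7
}
```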
@@ -920,17 +920,28 @@ func LoadTextToSpeech(onnxDir string, useGPU bool, cfg Config) (*TextToSpeech, e
|
|||||||
func InitializeONNXRuntime() error {
|
func InitializeONNXRuntime() error {
|
||||||
libPath := os.Getenv("ONNXRUNTIME_LIB_PATH")
|
libPath := os.Getenv("ONNXRUNTIME_LIB_PATH")
|
||||||
if libPath == "" {
|
if libPath == "" {
|
||||||
libPath = "/usr/local/lib/libonnxruntime.so"
|
candidates := []string{
|
||||||
if _, err := os.Stat("/usr/local/lib/libonnxruntime.dylib"); err == nil {
|
"/opt/homebrew/opt/onnxruntime/lib/libonnxruntime.dylib",
|
||||||
libPath = "/usr/local/lib/libonnxruntime.dylib"
|
"/usr/local/opt/onnxruntime/lib/libonnxruntime.dylib",
|
||||||
} else if _, err := os.Stat("/usr/lib/libonnxruntime.so"); err == nil {
|
"/opt/homebrew/lib/libonnxruntime.dylib",
|
||||||
libPath = "/usr/lib/libonnxruntime.so"
|
"/usr/local/lib/libonnxruntime.dylib",
|
||||||
|
"/usr/local/lib/libonnxruntime.so",
|
||||||
|
"/usr/lib/libonnxruntime.so",
|
||||||
|
}
|
||||||
|
for _, candidate := range candidates {
|
||||||
|
if _, err := os.Stat(candidate); err == nil {
|
||||||
|
libPath = candidate
|
||||||
|
break
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if libPath == "" {
|
||||||
|
libPath = "/usr/local/lib/libonnxruntime.so"
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
ort.SetSharedLibraryPath(libPath)
|
ort.SetSharedLibraryPath(libPath)
|
||||||
|
|
||||||
if err := ort.InitializeEnvironment(); err != nil {
|
if err := ort.InitializeEnvironment(); err != nil {
|
||||||
return fmt.Errorf("failed to initialize ONNX Runtime: %w\nHint: Set ONNXRUNTIME_LIB_PATH environment variable", err)
|
return fmt.Errorf("failed to initialize ONNX Runtime: %w\nHint: install ONNX Runtime (macOS: brew install onnxruntime) or set ONNXRUNTIME_LIB_PATH", err)
|
||||||
}
|
}
|
||||||
return nil
|
return nil
|
||||||
}
|
}
|
||||||
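The lookup order in the new `InitializeONNXRuntime` (environment override first, then the first candidate path that exists, then a fixed fallback) can be isolated into a small function that is easy to test. The sketch below assumes that refactoring; `resolveLibPath` and `fileExists` are hypothetical names, and the candidate list is shortened for brevity.

```go
package main

import (
	"fmt"
	"os"
)

// resolveLibPath returns override if it is non-empty, otherwise the first
// candidate for which exists reports true, otherwise fallback.
// Hypothetical helper; it mirrors the lookup order shown in the diff.
func resolveLibPath(override string, candidates []string, fallback string, exists func(string) bool) string {
	if override != "" {
		return override
	}
	for _, c := range candidates {
		if exists(c) {
			return c
		}
	}
	return fallback
}

// fileExists reports whether os.Stat succeeds for path.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	candidates := []string{
		"/opt/homebrew/opt/onnxruntime/lib/libonnxruntime.dylib",
		"/usr/local/lib/libonnxruntime.dylib",
		"/usr/local/lib/libonnxruntime.so",
		"/usr/lib/libonnxruntime.so",
	}
	path := resolveLibPath(
		os.Getenv("ONNXRUNTIME_LIB_PATH"),
		candidates,
		"/usr/local/lib/libonnxruntime.so",
		fileExists,
	)
	fmt.Println("ONNX Runtime library:", path)
}
```

Passing the existence check as a function keeps the search order testable without touching the real filesystem.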
|
|||||||
@@ -12,7 +12,7 @@ struct ContentView: View {
|
|||||||
Spacer()
|
Spacer()
|
||||||
|
|
||||||
VStack(spacing: 12) {
|
VStack(spacing: 12) {
|
||||||
Text("Supertonic 2 iOS Demo")
|
Text("Supertonic 3 iOS Demo")
|
||||||
.font(.title2.weight(.semibold))
|
.font(.title2.weight(.semibold))
|
||||||
.foregroundColor(.primary)
|
.foregroundColor(.primary)
|
||||||
|
|
||||||
|
|||||||
@@ -6,17 +6,69 @@ final class TTSService {
|
|||||||
enum Language: String, CaseIterable {
|
enum Language: String, CaseIterable {
|
||||||
case en = "en"
|
case en = "en"
|
||||||
case ko = "ko"
|
case ko = "ko"
|
||||||
|
case ja = "ja"
|
||||||
|
case ar = "ar"
|
||||||
|
case bg = "bg"
|
||||||
|
case cs = "cs"
|
||||||
|
case da = "da"
|
||||||
|
case de = "de"
|
||||||
|
case el = "el"
|
||||||
case es = "es"
|
case es = "es"
|
||||||
case pt = "pt"
|
case et = "et"
|
||||||
|
case fi = "fi"
|
||||||
case fr = "fr"
|
case fr = "fr"
|
||||||
|
case hi = "hi"
|
||||||
|
case hr = "hr"
|
||||||
|
case hu = "hu"
|
||||||
|
case id = "id"
|
||||||
|
case it = "it"
|
||||||
|
case lt = "lt"
|
||||||
|
case lv = "lv"
|
||||||
|
case nl = "nl"
|
||||||
|
case pl = "pl"
|
||||||
|
case pt = "pt"
|
||||||
|
case ro = "ro"
|
||||||
|
case ru = "ru"
|
||||||
|
case sk = "sk"
|
||||||
|
case sl = "sl"
|
||||||
|
case sv = "sv"
|
||||||
|
case tr = "tr"
|
||||||
|
case uk = "uk"
|
||||||
|
case vi = "vi"
|
||||||
|
|
||||||
var displayName: String {
|
var displayName: String {
|
||||||
switch self {
|
switch self {
|
||||||
case .en: return "English"
|
case .en: return "English"
|
||||||
case .ko: return "한국어"
|
case .ko: return "한국어"
|
||||||
|
case .ja: return "日本語"
|
||||||
|
case .ar: return "العربية"
|
||||||
|
case .bg: return "Bulgarian"
|
||||||
|
case .cs: return "Czech"
|
||||||
|
case .da: return "Danish"
|
||||||
|
case .de: return "Deutsch"
|
||||||
|
case .el: return "Greek"
|
||||||
case .es: return "Español"
|
case .es: return "Español"
|
||||||
case .pt: return "Português"
|
case .et: return "Estonian"
|
||||||
|
case .fi: return "Finnish"
|
||||||
case .fr: return "Français"
|
case .fr: return "Français"
|
||||||
|
case .hi: return "Hindi"
|
||||||
|
case .hr: return "Croatian"
|
||||||
|
case .hu: return "Hungarian"
|
||||||
|
case .id: return "Indonesian"
|
||||||
|
case .it: return "Italian"
|
||||||
|
case .lt: return "Lithuanian"
|
||||||
|
case .lv: return "Latvian"
|
||||||
|
case .nl: return "Dutch"
|
||||||
|
case .pl: return "Polish"
|
||||||
|
case .pt: return "Português"
|
||||||
|
case .ro: return "Romanian"
|
||||||
|
case .ru: return "Russian"
|
||||||
|
case .sk: return "Slovak"
|
||||||
|
case .sl: return "Slovenian"
|
||||||
|
case .sv: return "Swedish"
|
||||||
|
case .tr: return "Turkish"
|
||||||
|
case .uk: return "Ukrainian"
|
||||||
|
case .vi: return "Vietnamese"
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -4,7 +4,7 @@ import AVFoundation
|
|||||||
@MainActor
|
@MainActor
|
||||||
final class TTSViewModel: ObservableObject {
|
final class TTSViewModel: ObservableObject {
|
||||||
@Published var text: String = "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
@Published var text: String = "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
||||||
@Published var nfe: Double = 5
|
@Published var nfe: Double = 8
|
||||||
@Published var voice: TTSService.Voice = .male
|
@Published var voice: TTSService.Voice = .male
|
||||||
@Published var language: TTSService.Language = .en
|
@Published var language: TTSService.Language = .en
|
||||||
@Published var isGenerating: Bool = false
|
@Published var isGenerating: Bool = false
|
||||||
|
|||||||
@@ -1,10 +1,10 @@
|
|||||||
# Supertonic iOS Example App
|
# Supertonic iOS Example App
|
||||||
|
|
||||||
A minimal iOS demo that runs Supertonic 2 (ONNX Runtime) on-device. The app shows:
|
A minimal iOS demo that runs Supertonic 3 (ONNX Runtime) on-device. The app shows:
|
||||||
- Multiline text input
|
- Multiline text input
|
||||||
- NFE (denoising steps) slider
|
- NFE (denoising steps) slider
|
||||||
- Voice toggle (M/F)
|
- Voice toggle (M/F)
|
||||||
- Language selector (en, ko, es, pt, fr)
|
- Language selector for 31 supported languages
|
||||||
- Generate & Play buttons
|
- Generate & Play buttons
|
||||||
- RTF display (Elapsed / Audio seconds)
|
- RTF display (Elapsed / Audio seconds)
|
||||||
|
|
||||||
@@ -12,7 +12,7 @@ All ONNX models/configs are reused from `Supertonic/assets/onnx`, and voice styl
|
|||||||
|
|
||||||
## 📰 Update News
|
## 📰 Update News
|
||||||
|
|
||||||
**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
|
**2026.04.29** - 🎉 **Supertonic 3** released with 31-language support, improved reading accuracy, and v2-compatible public ONNX assets. [Demo](https://huggingface.co/spaces/Supertone/supertonic-3) | [Models](https://huggingface.co/Supertone/supertonic-3)
|
||||||
|
|
||||||
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
|
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
|
||||||
|
|
||||||
@@ -61,18 +61,13 @@ These references are defined in `project.yml` and added to the app bundle by Xco
|
|||||||
|
|
||||||
## App Controls
|
## App Controls
|
||||||
- **Text**: Multiline `TextEditor`
|
- **Text**: Multiline `TextEditor`
|
||||||
- **NFE**: Denoising steps (default 5)
|
- **NFE**: Denoising steps (default 8)
|
||||||
- **Voice**: M/F voice style selector
|
- **Voice**: M/F voice style selector
|
||||||
- **Language**: Language selector (English, 한국어, Español, Português, Français)
|
- **Language**: Language selector for 31 supported languages
|
||||||
- **Generate**: Runs end-to-end synthesis
|
- **Generate**: Runs end-to-end synthesis
|
||||||
- **Play/Stop**: Controls playback of the last output
|
- **Play/Stop**: Controls playback of the last output
|
||||||
- **RTF**: Shows Elapsed / Audio seconds for quick performance intuition
|
- **RTF**: Shows Elapsed / Audio seconds for quick performance intuition
|
||||||
|
|
||||||
## Multilingual Support
|
## Multilingual Support
|
||||||
|
|
||||||
Supertonic 2 supports multiple languages. Select the appropriate language for your input text:
|
Supertonic 3 supports 31 languages. Select the appropriate language for your input text; see the main README for the full code list.
|
||||||
- **English (en)**: Default language
|
|
||||||
- **한국어 (ko)**: Korean
|
|
||||||
- **Español (es)**: Spanish
|
|
||||||
- **Português (pt)**: Portuguese
|
|
||||||
- **Français (fr)**: French
|
|
||||||
|
|||||||
@@ -13,11 +13,11 @@ public class ExampleONNX {
|
|||||||
*/
|
*/
|
||||||
static class Args {
|
static class Args {
|
||||||
boolean useGpu = false;
|
boolean useGpu = false;
|
||||||
String onnxDir = "assets/onnx";
|
String onnxDir = "../assets/onnx";
|
||||||
int totalStep = 5;
|
int totalStep = 8;
|
||||||
float speed = 1.05f;
|
float speed = 1.05f;
|
||||||
int nTest = 4;
|
int nTest = 4;
|
||||||
List<String> voiceStyle = Arrays.asList("assets/voice_styles/M1.json");
|
List<String> voiceStyle = Arrays.asList("../assets/voice_styles/M1.json");
|
||||||
List<String> text = Arrays.asList(
|
List<String> text = Arrays.asList(
|
||||||
"This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
"This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
||||||
);
|
);
|
||||||
|
|||||||
@@ -22,7 +22,7 @@ import java.util.regex.Matcher;
|
|||||||
* Available languages for multilingual TTS
|
* Available languages for multilingual TTS
|
||||||
*/
|
*/
|
||||||
class Languages {
|
class Languages {
|
||||||
public static final List<String> AVAILABLE = Arrays.asList("en", "ko", "es", "pt", "fr");
|
public static final List<String> AVAILABLE = Arrays.asList("en", "ko", "ja", "ar", "bg", "cs", "da", "de", "el", "es", "et", "fi", "fr", "hi", "hr", "hu", "id", "it", "lt", "lv", "nl", "pl", "pt", "ro", "ru", "sk", "sl", "sv", "tr", "uk", "vi");
|
||||||
|
|
||||||
public static boolean isValid(String lang) {
|
public static boolean isValid(String lang) {
|
||||||
return AVAILABLE.contains(lang);
|
return AVAILABLE.contains(lang);
|
||||||
@@ -450,7 +450,7 @@ class TextToSpeech {
|
|||||||
*/
|
*/
|
||||||
public TTSResult call(String text, String lang, Style style, int totalStep, float speed, float silenceDuration, OrtEnvironment env)
|
public TTSResult call(String text, String lang, Style style, int totalStep, float speed, float silenceDuration, OrtEnvironment env)
|
||||||
throws OrtException {
|
throws OrtException {
|
||||||
int maxLen = lang.equals("ko") ? 120 : 300;
|
int maxLen = (lang.equals("ko") || lang.equals("ja")) ? 120 : 300;
|
||||||
List<String> chunks = Helper.chunkText(text, maxLen);
|
List<String> chunks = Helper.chunkText(text, maxLen);
|
||||||
|
|
||||||
List<Float> wavCat = new ArrayList<>();
|
List<Float> wavCat = new ArrayList<>();
|
||||||
@@ -952,4 +952,3 @@ public class Helper {
|
|||||||
return result;
|
return result;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|||||||
@@ -4,7 +4,7 @@ This guide provides examples for running TTS inference using `ExampleONNX.java`.
|
|||||||
|
|
||||||
## 📰 Update News
|
## 📰 Update News
|
||||||
|
|
||||||
**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
|
**2026.04.29** - 🎉 **Supertonic 3** released with 31-language support, improved reading accuracy, and v2-compatible public ONNX assets. [Demo](https://huggingface.co/spaces/Supertone/supertonic-3) | [Models](https://huggingface.co/Supertone/supertonic-3)
|
||||||
|
|
||||||
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
|
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
|
||||||
|
|
||||||
@@ -22,9 +22,17 @@ This project uses [Maven](https://maven.apache.org/) for dependency management.
|
|||||||
|
|
||||||
### Prerequisites
|
### Prerequisites
|
||||||
|
|
||||||
- Java 11 or higher
|
- Java Development Kit (JDK) 11 or higher, not just a JRE
|
||||||
- Maven 3.6 or higher
|
- Maven 3.6 or higher
|
||||||
|
|
||||||
|
On macOS with Homebrew:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
brew install openjdk@17 maven
|
||||||
|
export JAVA_HOME="$(brew --prefix openjdk@17)/libexec/openjdk.jdk/Contents/Home"
|
||||||
|
export PATH="$(brew --prefix openjdk@17)/bin:$PATH"
|
||||||
|
```
|
||||||
|
|
||||||
### Install dependencies
|
### Install dependencies
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
@@ -40,16 +48,16 @@ mvn exec:java
|
|||||||
```
|
```
|
||||||
|
|
||||||
This will use:
|
This will use:
|
||||||
- Voice style: `assets/voice_styles/M1.json`
|
- Voice style: `../assets/voice_styles/M1.json`
|
||||||
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
||||||
- Output directory: `results/`
|
- Output directory: `results/`
|
||||||
- Total steps: 5
|
- Total steps: 8
|
||||||
- Number of generations: 4
|
- Number of generations: 4
|
||||||
|
|
||||||
### Example 2: Batch Inference
|
### Example 2: Batch Inference
|
||||||
Process multiple voice styles and texts at once:
|
Process multiple voice styles and texts at once:
|
||||||
```bash
|
```bash
|
||||||
mvn exec:java -Dexec.args="--batch --voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json --text 'The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 기분 좋았어요.' --lang en,ko"
|
mvn exec:java -Dexec.args="--batch --voice-style ../assets/voice_styles/M1.json,../assets/voice_styles/F1.json --text 'The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 기분 좋았어요.' --lang en,ko"
|
||||||
```
|
```
|
||||||
|
|
||||||
This will:
|
This will:
|
||||||
@@ -61,18 +69,18 @@ This will:
|
|||||||
### Example 3: High Quality Inference
|
### Example 3: High Quality Inference
|
||||||
Increase denoising steps for better quality:
|
Increase denoising steps for better quality:
|
||||||
```bash
|
```bash
|
||||||
mvn exec:java -Dexec.args="--total-step 10 --voice-style assets/voice_styles/M1.json --text 'Increasing the number of denoising steps improves the output fidelity and overall quality.'"
|
mvn exec:java -Dexec.args="--total-step 10 --voice-style ../assets/voice_styles/M1.json --text 'Increasing the number of denoising steps improves the output fidelity and overall quality.'"
|
||||||
```
|
```
|
||||||
|
|
||||||
This will:
|
This will:
|
||||||
- Use 10 denoising steps instead of the default 5
|
- Use 10 denoising steps instead of the default 8
|
||||||
- Produce higher quality output at the cost of slower inference
|
- Produce higher quality output at the cost of slower inference
|
||||||
|
|
||||||
### Example 4: Long-Form Inference
|
### Example 4: Long-Form Inference
|
||||||
The system automatically chunks long texts into manageable segments, synthesizes each segment separately, and concatenates them with natural pauses (0.3 seconds by default) into a single audio file. This happens by default when you don't use the `--batch` flag:
|
The system automatically chunks long texts into manageable segments, synthesizes each segment separately, and concatenates them with natural pauses (0.3 seconds by default) into a single audio file. This happens by default when you don't use the `--batch` flag:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
mvn exec:java -Dexec.args="--voice-style assets/voice_styles/M1.json --text 'This is a very long text that will be automatically split into multiple chunks. The system will process each chunk separately and then concatenate them together with natural pauses between segments. This ensures that even very long texts can be processed efficiently while maintaining natural speech flow and avoiding memory issues.'"
|
mvn exec:java -Dexec.args="--voice-style ../assets/voice_styles/M1.json --text 'This is a very long text that will be automatically split into multiple chunks. The system will process each chunk separately and then concatenate them together with natural pauses between segments. This ensures that even very long texts can be processed efficiently while maintaining natural speech flow and avoiding memory issues.'"
|
||||||
```
|
```
|
||||||
|
|
||||||
This will:
|
This will:
|
||||||
@@ -110,21 +118,20 @@ java -jar target/tts-example.jar --total-step 10 --text "Your custom text here"
|
|||||||
| Argument | Type | Default | Description |
|
| Argument | Type | Default | Description |
|
||||||
|----------|------|---------|-------------|
|
|----------|------|---------|-------------|
|
||||||
| `--use-gpu` | flag | False | Use GPU for inference (default: CPU) |
|
| `--use-gpu` | flag | False | Use GPU for inference (default: CPU) |
|
||||||
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
|
| `--onnx-dir` | str | `../assets/onnx` | Path to ONNX model directory |
|
||||||
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
|
| `--total-step` | int | 8 | Number of denoising steps (higher = better quality, slower) |
|
||||||
| `--n-test` | int | 4 | Number of times to generate each sample |
|
| `--n-test` | int | 4 | Number of times to generate each sample |
|
||||||
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s), comma-separated |
|
| `--voice-style` | str+ | `../assets/voice_styles/M1.json` | Voice style file path(s), comma-separated |
|
||||||
| `--text` | str+ | (long default text) | Text(s) to synthesize, pipe-separated |
|
| `--text` | str+ | (long default text) | Text(s) to synthesize, pipe-separated |
|
||||||
| `--lang` | str+ | `en` | Language(s) for synthesis, comma-separated (en, ko, es, pt, fr) |
|
| `--lang` | str+ | `en` | Language(s) for synthesis, comma-separated; see the main README for all 31 codes |
|
||||||
| `--save-dir` | str | `results` | Output directory |
|
| `--save-dir` | str | `results` | Output directory |
|
||||||
| `--batch` | flag | False | Enable batch mode (multiple text-style pairs, disables automatic chunking) |
|
| `--batch` | flag | False | Enable batch mode (multiple text-style pairs, disables automatic chunking) |
|
||||||
|
|
||||||
## Notes
|
## Notes
|
||||||
|
|
||||||
- **Multilingual Support**: Use `--lang` to specify the language for each text. Available: `en` (English), `ko` (Korean), `es` (Spanish), `pt` (Portuguese), `fr` (French)
|
- **Multilingual Support**: Use `--lang` to specify the language for each text. Available: 31 languages; see the main README for the full list
|
||||||
- **Batch Processing**: When using `--batch`, the number of `--voice-style`, `--text`, and `--lang` entries must match
|
- **Batch Processing**: When using `--batch`, the number of `--voice-style`, `--text`, and `--lang` entries must match
|
||||||
- **Automatic Chunking**: Without `--batch`, long texts are automatically split and concatenated with 0.3s pauses
|
- **Automatic Chunking**: Without `--batch`, long texts are automatically split and concatenated with 0.3s pauses
|
||||||
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
|
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
|
||||||
- **GPU Support**: GPU mode is not supported yet
|
- **GPU Support**: GPU mode is not supported yet
|
||||||
- **Voice Styles**: Uses pre-extracted voice style JSON files for fast inference
|
- **Voice Styles**: Uses pre-extracted voice style JSON files for fast inference
|
||||||
|
|
||||||
|
|||||||
@@ -4,7 +4,7 @@ Node.js implementation for TTS inference. Uses ONNX Runtime to generate speech f
|
|||||||
|
|
||||||
## 📰 Update News
|
## 📰 Update News
|
||||||
|
|
||||||
**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
|
**2026.04.29** - 🎉 **Supertonic 3** released with 31-language support, improved reading accuracy, and v2-compatible public ONNX assets. [Demo](https://huggingface.co/spaces/Supertone/supertonic-3) | [Models](https://huggingface.co/Supertone/supertonic-3)
|
||||||
|
|
||||||
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
|
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
|
||||||
|
|
||||||
@@ -33,7 +33,7 @@ npm install
|
|||||||
### Example 1: Default Inference
|
### Example 1: Default Inference
|
||||||
Run inference with default settings:
|
Run inference with default settings:
|
||||||
```bash
|
```bash
|
||||||
npm start
|
node example_onnx.js
|
||||||
```
|
```
|
||||||
|
|
||||||
Or:
|
Or:
|
||||||
@@ -42,17 +42,17 @@ node example_onnx.js
|
|||||||
```
|
```
|
||||||
|
|
||||||
This will use:
|
This will use:
|
||||||
- Voice style: `assets/voice_styles/M1.json`
|
- Voice style: `../assets/voice_styles/M1.json`
|
||||||
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
||||||
- Output directory: `results/`
|
- Output directory: `results/`
|
||||||
- Total steps: 5
|
- Total steps: 8
|
||||||
- Number of generations: 4
|
- Number of generations: 4
|
||||||
|
|
||||||
### Example 2: Batch Inference
|
### Example 2: Batch Inference
|
||||||
Process multiple voice styles and texts at once:
|
Process multiple voice styles and texts at once:
|
||||||
```bash
|
```bash
|
||||||
node example_onnx.js \
|
node example_onnx.js \
|
||||||
--voice-style "assets/voice_styles/M1.json,assets/voice_styles/F1.json" \
|
--voice-style "../assets/voice_styles/M1.json,../assets/voice_styles/F1.json" \
|
||||||
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 좋아서 한참을 멈춰 서서 들었어요." \
|
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 좋아서 한참을 멈춰 서서 들었어요." \
|
||||||
--lang "en,ko" \
|
--lang "en,ko" \
|
||||||
--batch
|
--batch
|
||||||
@@ -70,19 +70,19 @@ Increase denoising steps for better quality:
|
|||||||
```bash
|
```bash
|
||||||
node example_onnx.js \
|
node example_onnx.js \
|
||||||
--total-step 10 \
|
--total-step 10 \
|
||||||
--voice-style "assets/voice_styles/M1.json" \
|
--voice-style "../assets/voice_styles/M1.json" \
|
||||||
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
|
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
|
||||||
```
|
```
|
||||||
|
|
||||||
This will:
|
This will:
|
||||||
- Use 10 denoising steps instead of the default 5
|
- Use 10 denoising steps instead of the default 8
|
||||||
- Produce higher quality output at the cost of slower inference
|
- Produce higher quality output at the cost of slower inference
|
||||||
|
|
||||||
### Example 4: Long-Form Inference
|
### Example 4: Long-Form Inference
|
||||||
For long texts, the system automatically chunks the text into manageable segments and generates a single audio file:
|
For long texts, the system automatically chunks the text into manageable segments and generates a single audio file:
|
||||||
```bash
|
```bash
|
||||||
node example_onnx.js \
|
node example_onnx.js \
|
||||||
--voice-style "assets/voice_styles/M1.json" \
|
--voice-style "../assets/voice_styles/M1.json" \
|
||||||
--text "Once upon a time, in a small village nestled between rolling hills, there lived a young artist named Clara. Every morning, she would wake up before dawn to capture the first light of day. The golden rays streaming through her window inspired countless paintings. Her work was known throughout the region for its vibrant colors and emotional depth. People from far and wide came to see her gallery, and many said her paintings could tell stories that words never could."
|
--text "Once upon a time, in a small village nestled between rolling hills, there lived a young artist named Clara. Every morning, she would wake up before dawn to capture the first light of day. The golden rays streaming through her window inspired countless paintings. Her work was known throughout the region for its vibrant colors and emotional depth. People from far and wide came to see her gallery, and many said her paintings could tell stories that words never could."
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -99,20 +99,20 @@ This will:
|
|||||||
| Argument | Type | Default | Description |
|
| Argument | Type | Default | Description |
|
||||||
|----------|------|---------|-------------|
|
|----------|------|---------|-------------|
|
||||||
| `--use-gpu` | flag | False | Use GPU for inference (not supported yet) |
|
| `--use-gpu` | flag | False | Use GPU for inference (not supported yet) |
|
||||||
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
|
| `--onnx-dir` | str | `../assets/onnx` | Path to ONNX model directory |
|
||||||
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
|
| `--total-step` | int | 8 | Number of denoising steps (higher = better quality, slower) |
|
||||||
| `--speed` | float | 1.05 | Speech speed factor (higher = faster, lower = slower) |
|
| `--speed` | float | 1.05 | Speech speed factor (higher = faster, lower = slower) |
|
||||||
| `--n-test` | int | 4 | Number of times to generate each sample |
|
| `--n-test` | int | 4 | Number of times to generate each sample |
|
||||||
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s). Separate multiple files with commas |
|
| `--voice-style` | str+ | `../assets/voice_styles/M1.json` | Voice style file path(s). Separate multiple files with commas |
|
||||||
| `--text` | str+ | (long default text) | Text(s) to synthesize. Separate multiple texts with pipes |
|
| `--text` | str+ | (long default text) | Text(s) to synthesize. Separate multiple texts with pipes |
|
||||||
| `--lang` | str+ | `en` | Language(s) for text(s): `en`, `ko`, `es`, `pt`, `fr`. Separate multiple with commas |
|
| `--lang` | str+ | `en` | Language(s) for text(s); see the main README for all 31 codes. Separate multiple with commas |
|
||||||
| `--save-dir` | str | `results` | Output directory |
|
| `--save-dir` | str | `results` | Output directory |
|
||||||
| `--batch` | flag | False | Enable batch mode (disables automatic text chunking) |
|
| `--batch` | flag | False | Enable batch mode (disables automatic text chunking) |
|
||||||
|
|
||||||
## Notes
|
## Notes
|
||||||
|
|
||||||
- **Batch Processing**: The number of voice style files must match the number of texts. Use commas to separate files and pipes to separate texts
|
- **Batch Processing**: The number of voice style files must match the number of texts. Use commas to separate files and pipes to separate texts
|
||||||
- **Multilingual Support**: Use `--lang` to specify language(s). Available: `en` (English), `ko` (Korean), `es` (Spanish), `pt` (Portuguese), `fr` (French)
|
- **Multilingual Support**: Use `--lang` to specify language(s). Available: 31 languages; see the main README for the full list
|
||||||
- **Long-Form Inference**: Without `--batch` flag, long texts are automatically chunked and combined into a single audio file with natural pauses
|
- **Long-Form Inference**: Without `--batch` flag, long texts are automatically chunked and combined into a single audio file with natural pauses
|
||||||
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
|
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
|
||||||
- **GPU Support**: GPU mode is not supported yet
|
- **GPU Support**: GPU mode is not supported yet
|
||||||
|
|||||||
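The delimiter convention described in the Notes above (commas between voice style files and language codes, pipes between texts) can be illustrated with a short sketch. This is written in Python for illustration only and is not the parsing code in `example_onnx.js`; the function name `split_cli_lists` is hypothetical.

```python
from typing import List, Tuple


def split_cli_lists(voice_style: str, text: str, lang: str) -> Tuple[List[str], List[str], List[str]]:
    """Split delimiter-joined CLI strings into parallel lists (illustrative helper)."""
    # Voice style paths and language codes are comma-separated.
    styles = [s.strip() for s in voice_style.split(",") if s.strip()]
    langs = [c.strip() for c in lang.split(",") if c.strip()]
    # Texts are pipe-separated, so commas inside a sentence are preserved.
    texts = [t.strip() for t in text.split("|") if t.strip()]

    # The Notes state that the number of voice styles must match the number of texts.
    if len(styles) != len(texts):
        raise ValueError(
            f"Got {len(styles)} voice style(s) but {len(texts)} text(s); counts must match"
        )
    return styles, texts, langs
```

Using a pipe for texts avoids splitting a sentence at its own commas, which is presumably why the two options use different delimiters.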
@@ -13,11 +13,11 @@ const __dirname = path.dirname(__filename);
|
|||||||
function parseArgs() {
|
function parseArgs() {
|
||||||
const args = {
|
const args = {
|
||||||
useGpu: false,
|
useGpu: false,
|
||||||
onnxDir: 'assets/onnx',
|
onnxDir: '../assets/onnx',
|
||||||
totalStep: 5,
|
totalStep: 8,
|
||||||
speed: 1.05,
|
speed: 1.05,
|
||||||
nTest: 4,
|
nTest: 4,
|
||||||
voiceStyle: ['assets/voice_styles/M1.json'],
|
voiceStyle: ['../assets/voice_styles/M1.json'],
|
||||||
text: ['This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.'],
|
text: ['This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.'],
|
||||||
lang: ['en'],
|
lang: ['en'],
|
||||||
saveDir: 'results',
|
saveDir: 'results',
|
||||||
|
|||||||
@@ -5,7 +5,7 @@ import * as ort from 'onnxruntime-node';
|
|||||||
|
|
||||||
const __filename = fileURLToPath(import.meta.url);
|
const __filename = fileURLToPath(import.meta.url);
|
||||||
|
|
||||||
const AVAILABLE_LANGS = ["en", "ko", "es", "pt", "fr"];
|
const AVAILABLE_LANGS = ["en", "ko", "ja", "ar", "bg", "cs", "da", "de", "el", "es", "et", "fi", "fr", "hi", "hr", "hu", "id", "it", "lt", "lv", "nl", "pl", "pt", "ro", "ru", "sk", "sl", "sv", "tr", "uk", "vi"];
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* Unicode text processor
|
* Unicode text processor
|
||||||
@@ -275,7 +275,7 @@ class TextToSpeech {
|
|||||||
if (style.ttl.dims[0] !== 1) {
|
if (style.ttl.dims[0] !== 1) {
|
||||||
throw new Error('Single speaker text to speech only supports single style');
|
throw new Error('Single speaker text to speech only supports single style');
|
||||||
}
|
}
|
||||||
const maxLen = lang === 'ko' ? 120 : 300;
|
const maxLen = (lang === 'ko' || lang === 'ja') ? 120 : 300;
|
||||||
const textList = chunkText(text, maxLen);
|
const textList = chunkText(text, maxLen);
|
||||||
let wavCat = null;
|
let wavCat = null;
|
||||||
let durCat = 0;
|
let durCat = 0;
|
||||||
|
|||||||
@@ -4,7 +4,7 @@ This guide provides examples for running TTS inference using `example_onnx.py`.
|
|||||||
|
|
||||||
## 📰 Update News
|
## 📰 Update News
|
||||||
|
|
||||||
**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
|
**2026.04.29** - 🎉 **Supertonic 3** released with 31-language support, improved reading accuracy, and v2-compatible public ONNX assets. [Demo](https://huggingface.co/spaces/Supertone/supertonic-3) | [Models](https://huggingface.co/Supertone/supertonic-3)
|
||||||
|
|
||||||
**2025.12.10** - Added `supertonic` PyPI package! Install via `pip install supertonic` for a streamlined experience. This is a separate usage method from the ONNX examples in this directory. For more details, visit [supertonic-py documentation](https://supertone-inc.github.io/supertonic-py) and see `example_pypi.py` for usage.
|
**2025.12.10** - Added `supertonic` PyPI package! Install via `pip install supertonic` for a streamlined experience. This is a separate usage method from the ONNX examples in this directory. For more details, visit [supertonic-py documentation](https://supertone-inc.github.io/supertonic-py) and see `example_pypi.py` for usage.
|
||||||
|
|
||||||
@@ -46,17 +46,17 @@ uv run example_onnx.py
|
|||||||
```
|
```
|
||||||
|
|
||||||
This will use:
|
This will use:
|
||||||
- Voice style: `assets/voice_styles/M1.json`
|
- Voice style: `../assets/voice_styles/M1.json`
|
||||||
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
||||||
- Output directory: `results/`
|
- Output directory: `results/`
|
||||||
- Total steps: 5
|
- Total steps: 8
|
||||||
- Number of generations: 4
|
- Number of generations: 4
|
||||||
|
|
||||||
### Example 2: Batch Inference
|
### Example 2: Batch Inference
|
||||||
Process multiple voice styles and texts at once:
|
Process multiple voice styles and texts at once:
|
||||||
```bash
|
```bash
|
||||||
uv run example_onnx.py \
|
uv run example_onnx.py \
|
||||||
--voice-style assets/voice_styles/M1.json assets/voice_styles/F1.json \
|
--voice-style ../assets/voice_styles/M1.json ../assets/voice_styles/F1.json \
|
||||||
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange." "오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 좋아서 한참을 멈춰 서서 들었어요." \
|
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange." "오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 좋아서 한참을 멈춰 서서 들었어요." \
|
||||||
--lang en ko \
|
--lang en ko \
|
||||||
--batch
|
--batch
|
||||||
@@ -74,19 +74,19 @@ Increase denoising steps for better quality:
|
|||||||
```bash
|
```bash
|
||||||
uv run example_onnx.py \
|
uv run example_onnx.py \
|
||||||
--total-step 10 \
|
--total-step 10 \
|
||||||
--voice-style assets/voice_styles/M1.json \
|
--voice-style ../assets/voice_styles/M1.json \
|
||||||
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
|
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
|
||||||
```
|
```
|
||||||
|
|
||||||
This will:
|
This will:
|
||||||
- Use 10 denoising steps instead of the default 5
|
- Use 10 denoising steps instead of the default 8
|
||||||
- Produce higher quality output at the cost of slower inference
|
- Produce higher quality output at the cost of slower inference
|
||||||
|
|
||||||
### Example 4: Long-Form Inference
|
### Example 4: Long-Form Inference
|
||||||
For long texts, the system automatically chunks the text into manageable segments and generates a single audio file:
|
For long texts, the system automatically chunks the text into manageable segments and generates a single audio file:
|
||||||
```bash
|
```bash
|
||||||
uv run example_onnx.py \
|
uv run example_onnx.py \
|
||||||
--voice-style assets/voice_styles/M1.json \
|
--voice-style ../assets/voice_styles/M1.json \
|
||||||
--text "Once upon a time, in a small village nestled between rolling hills, there lived a young artist named Clara. Every morning, she would wake up before dawn to capture the first light of day. The golden rays streaming through her window inspired countless paintings. Her work was known throughout the region for its vibrant colors and emotional depth. People from far and wide came to see her gallery, and many said her paintings could tell stories that words never could."
|
--text "Once upon a time, in a small village nestled between rolling hills, there lived a young artist named Clara. Every morning, she would wake up before dawn to capture the first light of day. The golden rays streaming through her window inspired countless paintings. Her work was known throughout the region for its vibrant colors and emotional depth. People from far and wide came to see her gallery, and many said her paintings could tell stories that words never could."
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -103,13 +103,13 @@ Control the speed of speech synthesis:
|
|||||||
```bash
|
```bash
|
||||||
# Faster speech (speed > 1.0)
|
# Faster speech (speed > 1.0)
|
||||||
uv run example_onnx.py \
|
uv run example_onnx.py \
|
||||||
--voice-style assets/voice_styles/F2.json \
|
--voice-style ../assets/voice_styles/F2.json \
|
||||||
--text "This text will be synthesized at a faster pace." \
|
--text "This text will be synthesized at a faster pace." \
|
||||||
--speed 1.2
|
--speed 1.2
|
||||||
|
|
||||||
# Slower speech (speed < 1.0)
|
# Slower speech (speed < 1.0)
|
||||||
uv run example_onnx.py \
|
uv run example_onnx.py \
|
||||||
--voice-style assets/voice_styles/M2.json \
|
--voice-style ../assets/voice_styles/M2.json \
|
||||||
--text "This text will be synthesized at a slower, more deliberate pace." \
|
--text "This text will be synthesized at a slower, more deliberate pace." \
|
||||||
--speed 0.9
|
--speed 0.9
|
||||||
```
|
```
|
||||||
@@ -125,20 +125,20 @@ This will:
|
|||||||
| Argument | Type | Default | Description |
|
| Argument | Type | Default | Description |
|
||||||
|----------|------|---------|-------------|
|
|----------|------|---------|-------------|
|
||||||
| `--use-gpu` | flag | False | Use GPU for inference (with CPU fallback) |
|
| `--use-gpu` | flag | False | Use GPU for inference (with CPU fallback) |
|
||||||
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
|
| `--onnx-dir` | str | `../assets/onnx` | Path to ONNX model directory |
|
||||||
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
|
| `--total-step` | int | 8 | Number of denoising steps (higher = better quality, slower) |
|
||||||
| `--speed` | float | 1.05 | Speech speed factor (higher = faster, lower = slower) |
|
| `--speed` | float | 1.05 | Speech speed factor (higher = faster, lower = slower) |
|
||||||
| `--n-test` | int | 4 | Number of times to generate each sample |
|
| `--n-test` | int | 4 | Number of times to generate each sample |
|
||||||
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s) |
|
| `--voice-style` | str+ | `../assets/voice_styles/M1.json` | Voice style file path(s) |
|
||||||
| `--text` | str+ | (long default text) | Text(s) to synthesize |
|
| `--text` | str+ | (long default text) | Text(s) to synthesize |
|
||||||
| `--lang` | str+ | `en` | Language(s) for text(s): `en`, `ko`, `es`, `pt`, `fr` |
|
| `--lang` | str+ | `en` | Language(s) for text(s); see the main README for all 31 codes |
|
||||||
| `--save-dir` | str | `results` | Output directory |
|
| `--save-dir` | str | `results` | Output directory |
|
||||||
| `--batch` | flag | False | Enable batch mode (disables automatic text chunking) |
|
| `--batch` | flag | False | Enable batch mode (disables automatic text chunking) |
|
||||||
|
|
||||||
## Notes
|
## Notes
|
||||||
|
|
||||||
- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
|
- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
|
||||||
- **Multilingual Support**: Use `--lang` to specify language(s). Available: `en` (English), `ko` (Korean), `es` (Spanish), `pt` (Portuguese), `fr` (French)
|
- **Multilingual Support**: Use `--lang` to specify language(s). Available: 31 languages; see the main README for the full list
|
||||||
- **Long-Form Inference**: Without `--batch` flag, long texts are automatically chunked and combined into a single audio file with natural pauses
|
- **Long-Form Inference**: Without `--batch` flag, long texts are automatically chunked and combined into a single audio file with natural pauses
|
||||||
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
|
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
|
||||||
- **GPU Support**: GPU mode is not supported yet
|
- **GPU Support**: GPU mode is not supported yet
|
||||||
|
|||||||
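As a rough illustration of how `--lang` values could be checked against the 31 supported codes, here is a minimal sketch. The list is copied from the `AVAILABLE_LANGS` constant in this change; `is_valid_lang` mirrors the name of the Rust helper, while `check_langs` is a hypothetical name introduced here and may not match how the examples actually report errors.

```python
from typing import List

# Copied from the AVAILABLE_LANGS constant introduced in this change (31 codes).
AVAILABLE_LANGS = [
    "en", "ko", "ja", "ar", "bg", "cs", "da", "de", "el", "es", "et",
    "fi", "fr", "hi", "hr", "hu", "id", "it", "lt", "lv", "nl", "pl",
    "pt", "ro", "ru", "sk", "sl", "sv", "tr", "uk", "vi",
]


def is_valid_lang(lang: str) -> bool:
    """Return True if the code is one of the supported languages."""
    return lang in AVAILABLE_LANGS


def check_langs(langs: List[str]) -> List[str]:
    """Hypothetical helper: reject the whole list if any code is unsupported."""
    unknown = [code for code in langs if not is_valid_lang(code)]
    if unknown:
        raise ValueError(f"Unsupported language code(s): {', '.join(unknown)}")
    return langs
```

Validating up front gives a clear error before any model is loaded, rather than a failure partway through synthesis.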
@@ -18,13 +18,13 @@ def parse_args():
|
|||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"--onnx-dir",
|
"--onnx-dir",
|
||||||
type=str,
|
type=str,
|
||||||
default="assets/onnx",
|
default="../assets/onnx",
|
||||||
help="Path to ONNX model directory",
|
help="Path to ONNX model directory",
|
||||||
)
|
)
|
||||||
|
|
||||||
# Synthesis parameters
|
# Synthesis parameters
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"--total-step", type=int, default=5, help="Number of denoising steps"
|
"--total-step", type=int, default=8, help="Number of denoising steps"
|
||||||
)
|
)
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"--speed",
|
"--speed",
|
||||||
@@ -44,7 +44,7 @@ def parse_args():
|
|||||||
"--voice-style",
|
"--voice-style",
|
||||||
type=str,
|
type=str,
|
||||||
nargs="+",
|
nargs="+",
|
||||||
default=["assets/voice_styles/M1.json"],
|
default=["../assets/voice_styles/M1.json"],
|
||||||
help="Voice style file path(s). Can specify multiple files for batch processing",
|
help="Voice style file path(s). Can specify multiple files for batch processing",
|
||||||
)
|
)
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
|
|||||||
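To see how the updated defaults fit together, the following is a condensed sketch of an `argparse` setup using the values shown in this diff (`../assets/onnx`, 8 steps, `../assets/voice_styles/M1.json`). It is not the full `parse_args()` from the example: the default text is a placeholder, and the `parse_and_check` helper with its batch-count check is an assumption based on the README Notes.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="TTS inference (condensed sketch)")
    parser.add_argument("--onnx-dir", type=str, default="../assets/onnx",
                        help="Path to ONNX model directory")
    parser.add_argument("--total-step", type=int, default=8,
                        help="Number of denoising steps")
    parser.add_argument("--speed", type=float, default=1.05,
                        help="Speech speed factor")
    parser.add_argument("--voice-style", type=str, nargs="+",
                        default=["../assets/voice_styles/M1.json"],
                        help="Voice style file path(s)")
    # Placeholder default; the real example ships a longer default sentence.
    parser.add_argument("--text", type=str, nargs="+", default=["Hello."],
                        help="Text(s) to synthesize")
    parser.add_argument("--lang", type=str, nargs="+", default=["en"],
                        help="Language code(s)")
    parser.add_argument("--batch", action="store_true",
                        help="Enable batch mode")
    return parser


def parse_and_check(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Hypothetical wrapper: parse, then enforce matching counts in batch mode."""
    args = build_parser().parse_args(argv)
    if args.batch and len(args.voice_style) != len(args.text):
        raise ValueError("--voice-style and --text must have the same number of entries")
    return args
```

Because the defaults are now relative to the parent directory, a script run from its own language folder would resolve `../assets/` to the shared assets directory at the repository root.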
@@ -10,7 +10,7 @@ import onnxruntime as ort
|
|||||||
|
|
||||||
import re
|
import re
|
||||||
|
|
||||||
AVAILABLE_LANGS = ["en", "ko", "es", "pt", "fr"]
|
AVAILABLE_LANGS = ["en", "ko", "ja", "ar", "bg", "cs", "da", "de", "el", "es", "et", "fi", "fr", "hi", "hr", "hu", "id", "it", "lt", "lv", "nl", "pl", "pt", "ro", "ru", "sk", "sl", "sv", "tr", "uk", "vi"]
|
||||||
|
|
||||||
|
|
||||||
class UnicodeProcessor:
|
class UnicodeProcessor:
|
||||||
@@ -226,7 +226,7 @@ class TextToSpeech:
|
|||||||
assert (
|
assert (
|
||||||
style.ttl.shape[0] == 1
|
style.ttl.shape[0] == 1
|
||||||
), "Single speaker text to speech only supports single style"
|
), "Single speaker text to speech only supports single style"
|
||||||
max_len = 120 if lang == "ko" else 300
|
max_len = 120 if lang in ("ko", "ja") else 300
|
||||||
text_list = chunk_text(text, max_len=max_len)
|
text_list = chunk_text(text, max_len=max_len)
|
||||||
wav_cat = None
|
wav_cat = None
|
||||||
dur_cat = None
|
dur_cat = None
|
||||||
|
|||||||
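The `max_len` change above means Japanese now uses the same shorter chunk limit as Korean. A minimal sketch of that selection, paired with a simple sentence-based chunker, might look like the following. The chunker here is a stand-in written for illustration; it is not the repository's `chunk_text`, whose splitting rules are not shown in this diff, and the names `max_chunk_len` and `chunk_by_sentence` are hypothetical.

```python
import re
from typing import List


def max_chunk_len(lang: str) -> int:
    """Chunk limit from the diff: 120 characters for Korean and Japanese, else 300."""
    return 120 if lang in ("ko", "ja") else 300


def chunk_by_sentence(text: str, max_len: int) -> List[str]:
    """Illustrative chunker: greedily pack whole sentences up to max_len characters."""
    # Split after Latin sentence punctuation followed by whitespace,
    # or directly after CJK sentence punctuation (which has no trailing space).
    sentences = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_len:
            # Adding this sentence would exceed the limit, so start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Note: a single sentence longer than max_len is kept as one oversized chunk.
    return chunks
```

For example, `chunk_by_sentence(long_text, max_chunk_len("ja"))` would keep each Japanese chunk near 120 characters, while English text would be packed up to 300.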
@@ -8,7 +8,7 @@ edition = "2021"
|
|||||||
ort = "2.0.0-rc.7"
|
ort = "2.0.0-rc.7"
|
||||||
|
|
||||||
# Array processing (like NumPy)
|
# Array processing (like NumPy)
|
||||||
ndarray = { version = "0.16", features = ["rayon"] }
|
ndarray = { version = "0.17", features = ["rayon"] }
|
||||||
rand = "0.8"
|
rand = "0.8"
|
||||||
rand_distr = "0.4"
|
rand_distr = "0.4"
|
||||||
|
|
||||||
@@ -41,4 +41,3 @@ libc = "0.2"
|
|||||||
[[bin]]
|
[[bin]]
|
||||||
name = "example_onnx"
|
name = "example_onnx"
|
||||||
path = "src/example_onnx.rs"
|
path = "src/example_onnx.rs"
|
||||||
|
|
||||||
|
|||||||
@@ -4,7 +4,7 @@ This guide provides examples for running TTS inference using Rust.
|
|||||||
|
|
||||||
## 📰 Update News
|
## 📰 Update News
|
||||||
|
|
||||||
**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
|
**2026.04.29** - 🎉 **Supertonic 3** released with 31-language support, improved reading accuracy, and v2-compatible public ONNX assets. [Demo](https://huggingface.co/spaces/Supertone/supertonic-3) | [Models](https://huggingface.co/Supertone/supertonic-3)
|
||||||
|
|
||||||
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
|
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
|
||||||
|
|
||||||
@@ -47,10 +47,10 @@ cargo run --release --bin example_onnx
|
|||||||
```
|
```
|
||||||
|
|
||||||
This will use:
|
This will use:
|
||||||
- Voice style: `assets/voice_styles/M1.json`
|
- Voice style: `../assets/voice_styles/M1.json`
|
||||||
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
||||||
- Output directory: `results/`
|
- Output directory: `results/`
|
||||||
- Total steps: 5
|
- Total steps: 8
|
||||||
- Number of generations: 4
|
- Number of generations: 4
|
||||||
|
|
||||||
### Example 2: Batch Inference
|
### Example 2: Batch Inference
|
||||||
@@ -59,14 +59,14 @@ Process multiple voice styles and texts at once:
|
|||||||
# Using cargo run
|
# Using cargo run
|
||||||
cargo run --release --bin example_onnx -- \
|
cargo run --release --bin example_onnx -- \
|
||||||
--batch \
|
--batch \
|
||||||
--voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json \
|
--voice-style ../assets/voice_styles/M1.json,../assets/voice_styles/F1.json \
|
||||||
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 기분 좋았어요." \
|
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 기분 좋았어요." \
|
||||||
--lang en,ko
|
--lang en,ko
|
||||||
|
|
||||||
# Or using the binary directly
|
# Or using the binary directly
|
||||||
./target/release/example_onnx \
|
./target/release/example_onnx \
|
||||||
--batch \
|
--batch \
|
||||||
--voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json \
|
--voice-style ../assets/voice_styles/M1.json,../assets/voice_styles/F1.json \
|
||||||
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 기분 좋았어요." \
|
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 기분 좋았어요." \
|
||||||
--lang en,ko
|
--lang en,ko
|
||||||
```
|
```
|
||||||
@@ -83,18 +83,18 @@ Increase denoising steps for better quality:
|
|||||||
# Using cargo run
|
# Using cargo run
|
||||||
cargo run --release --bin example_onnx -- \
|
cargo run --release --bin example_onnx -- \
|
||||||
--total-step 10 \
|
--total-step 10 \
|
||||||
--voice-style assets/voice_styles/M1.json \
|
--voice-style ../assets/voice_styles/M1.json \
|
||||||
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
|
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
|
||||||
|
|
||||||
# Or using the binary directly
|
# Or using the binary directly
|
||||||
./target/release/example_onnx \
|
./target/release/example_onnx \
|
||||||
--total-step 10 \
|
--total-step 10 \
|
||||||
--voice-style assets/voice_styles/M1.json \
|
--voice-style ../assets/voice_styles/M1.json \
|
||||||
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
|
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
|
||||||
```
|
```
|
||||||
|
|
||||||
This will:
|
This will:
|
||||||
- Use 10 denoising steps instead of the default 5
|
- Use 10 denoising steps instead of the default 8
|
||||||
- Produce higher quality output at the cost of slower inference
|
- Produce higher quality output at the cost of slower inference
|
||||||
|
|
||||||
### Example 4: Long-Form Inference
|
### Example 4: Long-Form Inference
|
||||||
@@ -103,12 +103,12 @@ The system automatically chunks long texts into manageable segments, synthesizes
|
|||||||
```bash
|
```bash
|
||||||
# Using cargo run
|
# Using cargo run
|
||||||
cargo run --release --bin example_onnx -- \
|
cargo run --release --bin example_onnx -- \
|
||||||
--voice-style assets/voice_styles/M1.json \
|
--voice-style ../assets/voice_styles/M1.json \
|
||||||
--text "This is a very long text that will be automatically split into multiple chunks. The system will process each chunk separately and then concatenate them together with natural pauses between segments. This ensures that even very long texts can be processed efficiently while maintaining natural speech flow and avoiding memory issues."
|
--text "This is a very long text that will be automatically split into multiple chunks. The system will process each chunk separately and then concatenate them together with natural pauses between segments. This ensures that even very long texts can be processed efficiently while maintaining natural speech flow and avoiding memory issues."
|
||||||
|
|
||||||
# Or using the binary directly
|
# Or using the binary directly
|
||||||
./target/release/example_onnx \
|
./target/release/example_onnx \
|
||||||
--voice-style assets/voice_styles/M1.json \
|
--voice-style ../assets/voice_styles/M1.json \
|
||||||
--text "This is a very long text that will be automatically split into multiple chunks. The system will process each chunk separately and then concatenate them together with natural pauses between segments. This ensures that even very long texts can be processed efficiently while maintaining natural speech flow and avoiding memory issues."
|
--text "This is a very long text that will be automatically split into multiple chunks. The system will process each chunk separately and then concatenate them together with natural pauses between segments. This ensures that even very long texts can be processed efficiently while maintaining natural speech flow and avoiding memory issues."
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -125,18 +125,18 @@ This will:
|
|||||||
| Argument | Type | Default | Description |
|
| Argument | Type | Default | Description |
|
||||||
|----------|------|---------|-------------|
|
|----------|------|---------|-------------|
|
||||||
| `--use-gpu` | flag | False | Use GPU for inference (default: CPU) |
|
| `--use-gpu` | flag | False | Use GPU for inference (default: CPU) |
|
||||||
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
|
| `--onnx-dir` | str | `../assets/onnx` | Path to ONNX model directory |
|
||||||
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
|
| `--total-step` | int | 8 | Number of denoising steps (higher = better quality, slower) |
|
||||||
| `--n-test` | int | 4 | Number of times to generate each sample |
|
| `--n-test` | int | 4 | Number of times to generate each sample |
|
||||||
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s), comma-separated |
|
| `--voice-style` | str+ | `../assets/voice_styles/M1.json` | Voice style file path(s), comma-separated |
|
||||||
| `--text` | str+ | (long default text) | Text(s) to synthesize, pipe-separated |
|
| `--text` | str+ | (long default text) | Text(s) to synthesize, pipe-separated |
|
||||||
| `--lang` | str+ | `en` | Language(s) for synthesis, comma-separated (en, ko, es, pt, fr) |
|
| `--lang` | str+ | `en` | Language(s) for synthesis, comma-separated; see the main README for all 31 codes |
|
||||||
| `--save-dir` | str | `results` | Output directory |
|
| `--save-dir` | str | `results` | Output directory |
|
||||||
| `--batch` | flag | False | Enable batch mode (multiple text-style pairs, disables automatic chunking) |
|
| `--batch` | flag | False | Enable batch mode (multiple text-style pairs, disables automatic chunking) |
|
||||||
|
|
||||||
## Notes
|
## Notes
|
||||||
|
|
||||||
- **Multilingual Support**: Use `--lang` to specify the language for each text. Available: `en` (English), `ko` (Korean), `es` (Spanish), `pt` (Portuguese), `fr` (French)
|
- **Multilingual Support**: Use `--lang` to specify the language for each text. Available: 31 languages; see the main README for the full list
|
||||||
- **Batch Processing**: When using `--batch`, the number of `--voice-style`, `--text`, and `--lang` entries must match
|
- **Batch Processing**: When using `--batch`, the number of `--voice-style`, `--text`, and `--lang` entries must match
|
||||||
- **Automatic Chunking**: Without `--batch`, long texts are automatically split and concatenated with 0.3s pauses
|
- **Automatic Chunking**: Without `--batch`, long texts are automatically split and concatenated with 0.3s pauses
|
||||||
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
|
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
|
||||||
|
|||||||
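The automatic chunking note above mentions that chunks are concatenated with 0.3s pauses. The idea can be sketched as follows, in Python for illustration rather than as a port of the Rust code. The function name `concat_with_pauses` and the plain-list sample representation are assumptions, and the sample rate must come from the model configuration, which is not shown in this diff.

```python
from typing import List, Tuple


def concat_with_pauses(
    chunks: List[List[float]],
    sample_rate: int,
    silence_duration: float = 0.3,
) -> Tuple[List[float], float]:
    """Join per-chunk waveforms with silence in between; return samples and duration in seconds."""
    # Number of zero-valued samples that make up one pause.
    silence = [0.0] * round(silence_duration * sample_rate)
    out: List[float] = []
    for index, wav in enumerate(chunks):
        if index > 0:
            # Insert a pause before every chunk except the first.
            out.extend(silence)
        out.extend(wav)
    duration = len(out) / sample_rate
    return out, duration
```

Silence is inserted only between chunks, so a text that fits in a single chunk comes out unchanged.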
@@ -19,11 +19,11 @@ struct Args {
|
|||||||
use_gpu: bool,
|
use_gpu: bool,
|
||||||
|
|
||||||
/// Path to ONNX model directory
|
/// Path to ONNX model directory
|
||||||
#[arg(long, default_value = "assets/onnx")]
|
#[arg(long, default_value = "../assets/onnx")]
|
||||||
onnx_dir: String,
|
onnx_dir: String,
|
||||||
|
|
||||||
/// Number of denoising steps
|
/// Number of denoising steps
|
||||||
#[arg(long, default_value = "5")]
|
#[arg(long, default_value = "8")]
|
||||||
total_step: usize,
|
total_step: usize,
|
||||||
|
|
||||||
/// Speech speed factor (higher = faster)
|
/// Speech speed factor (higher = faster)
|
||||||
@@ -35,14 +35,14 @@ struct Args {
|
|||||||
n_test: usize,
|
n_test: usize,
|
||||||
|
|
||||||
/// Voice style file path(s)
|
/// Voice style file path(s)
|
||||||
#[arg(long, value_delimiter = ',', default_values_t = vec!["assets/voice_styles/M1.json".to_string()])]
|
#[arg(long, value_delimiter = ',', default_values_t = vec!["../assets/voice_styles/M1.json".to_string()])]
|
||||||
voice_style: Vec<String>,
|
voice_style: Vec<String>,
|
||||||
|
|
||||||
/// Text(s) to synthesize
|
/// Text(s) to synthesize
|
||||||
#[arg(long, value_delimiter = '|', default_values_t = vec!["This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.".to_string()])]
|
#[arg(long, value_delimiter = '|', default_values_t = vec!["This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.".to_string()])]
|
||||||
text: Vec<String>,
|
text: Vec<String>,
|
||||||
|
|
||||||
/// Language(s) for synthesis (en, ko, es, pt, fr)
|
/// Language(s) for synthesis; see the main README for all supported codes
|
||||||
#[arg(long, value_delimiter = ',', default_values_t = vec!["en".to_string()])]
|
#[arg(long, value_delimiter = ',', default_values_t = vec!["en".to_string()])]
|
||||||
lang: Vec<String>,
|
lang: Vec<String>,
|
||||||
|
|
||||||
|
|||||||
@@ -15,7 +15,7 @@ use rand_distr::{Distribution, Normal};
|
|||||||
use regex::Regex;
|
use regex::Regex;
|
||||||
|
|
||||||
// Available languages for multilingual TTS
|
// Available languages for multilingual TTS
|
||||||
pub const AVAILABLE_LANGS: &[&str] = &["en", "ko", "es", "pt", "fr"];
|
pub const AVAILABLE_LANGS: &[&str] = &["en", "ko", "ja", "ar", "bg", "cs", "da", "de", "el", "es", "et", "fi", "fr", "hi", "hr", "hu", "id", "it", "lt", "lv", "nl", "pl", "pt", "ro", "ru", "sk", "sl", "sv", "tr", "uk", "vi"];
|
||||||
|
|
||||||
pub fn is_valid_lang(lang: &str) -> bool {
|
pub fn is_valid_lang(lang: &str) -> bool {
|
||||||
AVAILABLE_LANGS.contains(&lang)
|
AVAILABLE_LANGS.contains(&lang)
|
||||||
@@ -688,7 +688,7 @@ impl TextToSpeech {
|
|||||||
speed: f32,
|
speed: f32,
|
||||||
silence_duration: f32,
|
silence_duration: f32,
|
||||||
) -> Result<(Vec<f32>, f32)> {
|
) -> Result<(Vec<f32>, f32)> {
|
||||||
let max_len = if lang == "ko" { 120 } else { 300 };
|
let max_len = if lang == "ko" || lang == "ja" { 120 } else { 300 };
|
||||||
let chunks = chunk_text(text, Some(max_len));
|
let chunks = chunk_text(text, Some(max_len));
|
||||||
|
|
||||||
let mut wav_cat: Vec<f32> = Vec::new();
|
let mut wav_cat: Vec<f32> = Vec::new();
|
||||||
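Review note: the hunk above picks a shorter chunk limit for Korean and Japanese and then accumulates audio into `wav_cat`. The sketch below illustrates that flow under stated assumptions: `max_chunk_len` and `concat_with_silence` are hypothetical names, the sample rate is passed in because the diff does not show where it comes from, and the real method may compute the duration differently.

```rust
// Mirrors the changed line: Korean and Japanese get the shorter limit.
fn max_chunk_len(lang: &str) -> usize {
    if lang == "ko" || lang == "ja" { 120 } else { 300 }
}

// Hypothetical helper (an assumption): join per-chunk waveforms, inserting
// `silence_duration` seconds of zero samples between consecutive chunks.
fn concat_with_silence(
    chunks: &[Vec<f32>],
    sample_rate: u32,
    silence_duration: f32,
) -> (Vec<f32>, f32) {
    let silence_len = (silence_duration * sample_rate as f32).round() as usize;
    let mut wav_cat: Vec<f32> = Vec::new();
    for (i, chunk) in chunks.iter().enumerate() {
        if i > 0 {
            // Pause only between chunks, never before the first one.
            wav_cat.extend(std::iter::repeat(0.0f32).take(silence_len));
        }
        wav_cat.extend_from_slice(chunk);
    }
    // Total duration in seconds, including the inserted pauses.
    let duration = wav_cat.len() as f32 / sample_rate as f32;
    (wav_cat, duration)
}
```

Keeping the limit in one small function means a future language with the same constraint needs a change in only one place.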
|
|||||||
@@ -4,7 +4,7 @@ This guide provides examples for running TTS inference using `example_onnx`.
|
|||||||
|
|
||||||
## 📰 Update News
|
## 📰 Update News
|
||||||
|
|
||||||
**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
|
**2026.04.29** - 🎉 **Supertonic 3** released with 31-language support, improved reading accuracy, and v2-compatible public ONNX assets. [Demo](https://huggingface.co/spaces/Supertone/supertonic-3) | [Models](https://huggingface.co/Supertone/supertonic-3)
|
||||||
|
|
||||||
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
|
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
|
||||||
|
|
||||||
@@ -38,10 +38,10 @@ Run inference with default settings:
|
|||||||
```
|
```
|
||||||
|
|
||||||
This will use:
|
This will use:
|
||||||
- Voice style: `assets/voice_styles/M1.json`
|
- Voice style: `../assets/voice_styles/M1.json`
|
||||||
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
||||||
- Output directory: `results/`
|
- Output directory: `results/`
|
||||||
- Total steps: 5
|
- Total steps: 8
|
||||||
- Number of generations: 4
|
- Number of generations: 4
|
||||||
|
|
||||||
### Example 2: Batch Inference
|
### Example 2: Batch Inference
|
||||||
@@ -49,7 +49,7 @@ Process multiple voice styles and texts at once:
|
|||||||
```bash
|
```bash
|
||||||
.build/release/example_onnx \
|
.build/release/example_onnx \
|
||||||
--batch \
|
--batch \
|
||||||
--voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json \
|
--voice-style ../assets/voice_styles/M1.json,../assets/voice_styles/F1.json \
|
||||||
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 기분 좋았어요." \
|
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 기분 좋았어요." \
|
||||||
--lang en,ko
|
--lang en,ko
|
||||||
```
|
```
|
||||||
@@ -65,12 +65,12 @@ Increase denoising steps for better quality:
|
|||||||
```bash
|
```bash
|
||||||
.build/release/example_onnx \
|
.build/release/example_onnx \
|
||||||
--total-step 10 \
|
--total-step 10 \
|
||||||
--voice-style assets/voice_styles/M1.json \
|
--voice-style ../assets/voice_styles/M1.json \
|
||||||
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
|
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
|
||||||
```
|
```
|
||||||
|
|
||||||
This will:
|
This will:
|
||||||
- Use 10 denoising steps instead of the default 5
|
- Use 10 denoising steps instead of the default 8
|
||||||
- Produce higher quality output at the cost of slower inference
|
- Produce higher quality output at the cost of slower inference
|
||||||
|
|
||||||
### Example 4: Long-Form Inference
|
### Example 4: Long-Form Inference
|
||||||
@@ -78,7 +78,7 @@ The system automatically chunks long texts into manageable segments, synthesizes
|
|||||||
|
|
||||||
```bash
|
```bash
|
||||||
.build/release/example_onnx \
|
.build/release/example_onnx \
|
||||||
--voice-style assets/voice_styles/M1.json \
|
--voice-style ../assets/voice_styles/M1.json \
|
||||||
--text "This is a very long text that will be automatically split into multiple chunks. The system will process each chunk separately and then concatenate them together with natural pauses between segments. This ensures that even very long texts can be processed efficiently while maintaining natural speech flow and avoiding memory issues."
|
--text "This is a very long text that will be automatically split into multiple chunks. The system will process each chunk separately and then concatenate them together with natural pauses between segments. This ensures that even very long texts can be processed efficiently while maintaining natural speech flow and avoiding memory issues."
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -95,28 +95,22 @@ This will:
|
|||||||
| Argument | Type | Default | Description |
|
| Argument | Type | Default | Description |
|
||||||
|----------|------|---------|-------------|
|
|----------|------|---------|-------------|
|
||||||
| `--use-gpu` | flag | False | Use GPU for inference (default: CPU) |
|
| `--use-gpu` | flag | False | Use GPU for inference (default: CPU) |
|
||||||
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
|
| `--onnx-dir` | str | `../assets/onnx` | Path to ONNX model directory |
|
||||||
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
|
| `--total-step` | int | 8 | Number of denoising steps (higher = better quality, slower) |
|
||||||
| `--n-test` | int | 4 | Number of times to generate each sample |
|
| `--n-test` | int | 4 | Number of times to generate each sample |
|
||||||
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s) |
|
| `--voice-style` | str+ | `../assets/voice_styles/M1.json` | Voice style file path(s) |
|
||||||
| `--text` | str+ | (long default text) | Text(s) to synthesize |
|
| `--text` | str+ | (long default text) | Text(s) to synthesize |
|
||||||
| `--lang` | str+ | `en` | Language(s) for synthesis (en, ko, es, pt, fr) |
|
| `--lang` | str+ | `en` | Language(s) for synthesis; see the main README for all 31 codes |
|
||||||
| `--save-dir` | str | `results` | Output directory |
|
| `--save-dir` | str | `results` | Output directory |
|
||||||
| `--batch` | flag | False | Enable batch mode (multiple text-style-lang triplets, disables automatic chunking) |
|
| `--batch` | flag | False | Enable batch mode (multiple text-style-lang triplets, disables automatic chunking) |
|
||||||
|
|
||||||
## Multilingual Support
|
## Multilingual Support
|
||||||
|
|
||||||
Supertonic 2 supports multiple languages. Use the `--lang` argument to specify the language:
|
Supertonic 3 supports 31 languages. Use the `--lang` argument to specify the language; see the main README for the full code list.
|
||||||
|
|
||||||
- `en` - English (default)
|
|
||||||
- `ko` - Korean (한국어)
|
|
||||||
- `es` - Spanish (Español)
|
|
||||||
- `pt` - Portuguese (Português)
|
|
||||||
- `fr` - French (Français)
|
|
||||||
|
|
||||||
## Notes
|
## Notes
|
||||||
|
|
||||||
- **Batch Processing**: When using `--batch`, the number of `--voice-style`, `--text`, and `--lang` entries must match
|
- **Batch Processing**: When using `--batch`, the number of `--voice-style`, `--text`, and `--lang` entries must match
|
||||||
- **Automatic Chunking**: Without `--batch`, long texts are automatically split and concatenated with 0.3s pauses
|
- **Automatic Chunking**: Without `--batch`, long texts are automatically split and concatenated with 0.3s pauses
|
||||||
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
|
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
|
||||||
- **GPU Support**: GPU mode is not supported yet
|
- **GPU Support**: GPU mode is not supported yet
|
||||||
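Review note: the Batch Processing note requires equal numbers of `--voice-style`, `--text`, and `--lang` entries. A pre-flight check along these lines would enforce that; it is sketched in Rust to match the Rust example in this changeset, and `check_batch_lengths` and its error text are hypothetical, not taken from any of the example sources.

```rust
// Hypothetical pre-flight check for --batch mode (an assumption, not in the diff):
// returns the batch size when all three lists have the same length.
fn check_batch_lengths(
    voice_styles: &[String],
    texts: &[String],
    langs: &[String],
) -> Result<usize, String> {
    let n = voice_styles.len();
    if texts.len() != n || langs.len() != n {
        return Err(format!(
            "--batch needs matching counts: {} voice styles, {} texts, {} langs",
            n,
            texts.len(),
            langs.len()
        ));
    }
    Ok(n)
}
```

Running the check before loading any models means a mismatched command line fails immediately instead of partway through synthesis.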
|
|||||||
@@ -3,11 +3,11 @@ import OnnxRuntimeBindings
|
|||||||
|
|
||||||
struct Args {
|
struct Args {
|
||||||
var useGpu: Bool = false
|
var useGpu: Bool = false
|
||||||
var onnxDir: String = "assets/onnx"
|
var onnxDir: String = "../assets/onnx"
|
||||||
var totalStep: Int = 5
|
var totalStep: Int = 8
|
||||||
var speed: Float = 1.05
|
var speed: Float = 1.05
|
||||||
var nTest: Int = 4
|
var nTest: Int = 4
|
||||||
var voiceStyle: [String] = ["assets/voice_styles/M1.json"]
|
var voiceStyle: [String] = ["../assets/voice_styles/M1.json"]
|
||||||
var text: [String] = ["This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."]
|
var text: [String] = ["This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."]
|
||||||
var lang: [String] = ["en"]
|
var lang: [String] = ["en"]
|
||||||
var saveDir: String = "results"
|
var saveDir: String = "results"
|
||||||
@@ -32,7 +32,7 @@ func parseArgs() -> Args {
|
|||||||
}
|
}
|
||||||
case "--total-step":
|
case "--total-step":
|
||||||
if i + 1 < arguments.count {
|
if i + 1 < arguments.count {
|
||||||
args.totalStep = Int(arguments[i + 1]) ?? 5
|
args.totalStep = Int(arguments[i + 1]) ?? 8
|
||||||
i += 1
|
i += 1
|
||||||
}
|
}
|
||||||
case "--speed":
|
case "--speed":
|
||||||
|
|||||||
@@ -4,7 +4,7 @@ import OnnxRuntimeBindings
|
|||||||
|
|
||||||
// MARK: - Available Languages
|
// MARK: - Available Languages
|
||||||
|
|
||||||
let AVAILABLE_LANGS = ["en", "ko", "es", "pt", "fr"]
|
let AVAILABLE_LANGS = ["en", "ko", "ja", "ar", "bg", "cs", "da", "de", "el", "es", "et", "fi", "fr", "hi", "hr", "hu", "id", "it", "lt", "lv", "nl", "pl", "pt", "ro", "ru", "sk", "sl", "sv", "tr", "uk", "vi"]
|
||||||
|
|
||||||
func isValidLang(_ lang: String) -> Bool {
|
func isValidLang(_ lang: String) -> Bool {
|
||||||
return AVAILABLE_LANGS.contains(lang)
|
return AVAILABLE_LANGS.contains(lang)
|
||||||
@@ -701,7 +701,7 @@ class TextToSpeech {
|
|||||||
}
|
}
|
||||||
|
|
||||||
func call(_ text: String, _ lang: String, _ style: Style, _ totalStep: Int, speed: Float = 1.05, silenceDuration: Float = 0.3) throws -> (wav: [Float], duration: Float) {
|
func call(_ text: String, _ lang: String, _ style: Style, _ totalStep: Int, speed: Float = 1.05, silenceDuration: Float = 0.3) throws -> (wav: [Float], duration: Float) {
|
||||||
let maxLen = lang == "ko" ? 120 : 300
|
let maxLen = (lang == "ko" || lang == "ja") ? 120 : 300
|
||||||
let chunks = chunkText(text, maxLen: maxLen)
|
let chunks = chunkText(text, maxLen: maxLen)
|
||||||
let langList = Array(repeating: lang, count: chunks.count)
|
let langList = Array(repeating: lang, count: chunks.count)
|
||||||
|
|
||||||
|
|||||||
@@ -4,10 +4,16 @@
|
|||||||
# This script runs inference tests for all supported languages except web
|
# This script runs inference tests for all supported languages except web
|
||||||
|
|
||||||
set -e # Exit on error
|
set -e # Exit on error
|
||||||
|
set -o pipefail
|
||||||
|
|
||||||
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
|
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
|
||||||
cd "$SCRIPT_DIR"
|
cd "$SCRIPT_DIR"
|
||||||
|
|
||||||
|
export UV_CACHE_DIR="${UV_CACHE_DIR:-$SCRIPT_DIR/.uv-cache}"
|
||||||
|
export CLANG_MODULE_CACHE_PATH="${CLANG_MODULE_CACHE_PATH:-$SCRIPT_DIR/.clang-module-cache}"
|
||||||
|
SWIFT_HOME="${SWIFT_HOME:-$SCRIPT_DIR/.swift-home}"
|
||||||
|
mkdir -p "$UV_CACHE_DIR" "$CLANG_MODULE_CACHE_PATH" "$SWIFT_HOME"
|
||||||
|
|
||||||
echo "=================================="
|
echo "=================================="
|
||||||
echo "Supertonic - Testing All Examples"
|
echo "Supertonic - Testing All Examples"
|
||||||
echo "=================================="
|
echo "=================================="
|
||||||
@@ -110,6 +116,28 @@ NC='\033[0m' # No Color
|
|||||||
declare -a PASSED=()
|
declare -a PASSED=()
|
||||||
declare -a FAILED=()
|
declare -a FAILED=()
|
||||||
|
|
||||||
|
# Local toolchain fallbacks for Homebrew keg-only installs.
|
||||||
|
DOTNET_CMD="${DOTNET_CMD:-dotnet}"
|
||||||
|
if ! "$DOTNET_CMD" --list-runtimes 2>/dev/null | grep -q "Microsoft.NETCore.App 9\\."; then
|
||||||
|
if [ -x "/opt/homebrew/opt/dotnet@9/bin/dotnet" ]; then
|
||||||
|
DOTNET_CMD="/opt/homebrew/opt/dotnet@9/bin/dotnet"
|
||||||
|
export DOTNET_ROOT="${DOTNET_ROOT:-/opt/homebrew/opt/dotnet@9/libexec}"
|
||||||
|
elif [ -x "/usr/local/opt/dotnet@9/bin/dotnet" ]; then
|
||||||
|
DOTNET_CMD="/usr/local/opt/dotnet@9/bin/dotnet"
|
||||||
|
export DOTNET_ROOT="${DOTNET_ROOT:-/usr/local/opt/dotnet@9/libexec}"
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
|
||||||
|
if ! javac -version >/dev/null 2>&1; then
|
||||||
|
if [ -x "/opt/homebrew/opt/openjdk@17/bin/javac" ]; then
|
||||||
|
export JAVA_HOME="/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home"
|
||||||
|
export PATH="/opt/homebrew/opt/openjdk@17/bin:$PATH"
|
||||||
|
elif [ -x "/usr/local/opt/openjdk@17/bin/javac" ]; then
|
||||||
|
export JAVA_HOME="/usr/local/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home"
|
||||||
|
export PATH="/usr/local/opt/openjdk@17/bin:$PATH"
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
|
||||||
# Helper function to show statistics
|
# Helper function to show statistics
|
||||||
show_stats() {
|
show_stats() {
|
||||||
local name=$1
|
local name=$1
|
||||||
@@ -181,10 +209,10 @@ if [ "$TEST_DEFAULT" = true ]; then
|
|||||||
run_test "Python (default)" "py" "uv run example_onnx.py"
|
run_test "Python (default)" "py" "uv run example_onnx.py"
|
||||||
fi
|
fi
|
||||||
if [ "$TEST_BATCH" = true ]; then
|
if [ "$TEST_BATCH" = true ]; then
|
||||||
run_test "Python (batch)" "py" "uv run example_onnx.py --batch --voice-style $BATCH_VOICE_STYLE_1 $BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1' '$BATCH_TEXT_2' --lang $BATCH_LANG_1 $BATCH_LANG_2"
|
run_test "Python (batch)" "py" "uv run example_onnx.py --batch --voice-style ../$BATCH_VOICE_STYLE_1 ../$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1' '$BATCH_TEXT_2' --lang $BATCH_LANG_1 $BATCH_LANG_2"
|
||||||
fi
|
fi
|
||||||
if [ "$TEST_LONGFORM" = true ]; then
|
if [ "$TEST_LONGFORM" = true ]; then
|
||||||
run_test "Python (long-form)" "py" "uv run example_onnx.py --voice-style $LONGFORM_VOICE_STYLE --text '$LONGFORM_TEXT'"
|
run_test "Python (long-form)" "py" "uv run example_onnx.py --voice-style ../$LONGFORM_VOICE_STYLE --text '$LONGFORM_TEXT'"
|
||||||
fi
|
fi
|
||||||
|
|
||||||
# ====================================
|
# ====================================
|
||||||
@@ -197,10 +225,10 @@ if [ "$TEST_DEFAULT" = true ]; then
|
|||||||
run_test "JavaScript (default)" "nodejs" "node example_onnx.js"
|
run_test "JavaScript (default)" "nodejs" "node example_onnx.js"
|
||||||
fi
|
fi
|
||||||
if [ "$TEST_BATCH" = true ]; then
|
if [ "$TEST_BATCH" = true ]; then
|
||||||
run_test "JavaScript (batch)" "nodejs" "node example_onnx.js --batch --voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2' --lang $BATCH_LANG_1,$BATCH_LANG_2"
|
run_test "JavaScript (batch)" "nodejs" "node example_onnx.js --batch --voice-style ../$BATCH_VOICE_STYLE_1,../$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2' --lang $BATCH_LANG_1,$BATCH_LANG_2"
|
||||||
fi
|
fi
|
||||||
if [ "$TEST_LONGFORM" = true ]; then
|
if [ "$TEST_LONGFORM" = true ]; then
|
||||||
run_test "JavaScript (long-form)" "nodejs" "node example_onnx.js --voice-style $LONGFORM_VOICE_STYLE --text '$LONGFORM_TEXT'"
|
run_test "JavaScript (long-form)" "nodejs" "node example_onnx.js --voice-style ../$LONGFORM_VOICE_STYLE --text '$LONGFORM_TEXT'"
|
||||||
fi
|
fi
|
||||||
|
|
||||||
# ====================================
|
# ====================================
|
||||||
@@ -209,15 +237,20 @@ fi
|
|||||||
echo -e "${YELLOW}Testing Go...${NC}"
|
echo -e "${YELLOW}Testing Go...${NC}"
|
||||||
echo "Cleaning Go cache..."
|
echo "Cleaning Go cache..."
|
||||||
cd go && go clean && cd ..
|
cd go && go clean && cd ..
|
||||||
export ONNXRUNTIME_LIB_PATH=$(brew --prefix onnxruntime 2>/dev/null)/lib/libonnxruntime.dylib
|
if [ -z "${ONNXRUNTIME_LIB_PATH:-}" ] && command -v brew >/dev/null 2>&1; then
|
||||||
|
ORT_PREFIX="$(brew --prefix onnxruntime 2>/dev/null || true)"
|
||||||
|
if [ -n "$ORT_PREFIX" ]; then
|
||||||
|
export ONNXRUNTIME_LIB_PATH="$ORT_PREFIX/lib/libonnxruntime.dylib"
|
||||||
|
fi
|
||||||
|
fi
|
||||||
if [ "$TEST_DEFAULT" = true ]; then
|
if [ "$TEST_DEFAULT" = true ]; then
|
||||||
run_test "Go (default)" "go" "go run example_onnx.go helper.go"
|
run_test "Go (default)" "go" "go run example_onnx.go helper.go"
|
||||||
fi
|
fi
|
||||||
if [ "$TEST_BATCH" = true ]; then
|
if [ "$TEST_BATCH" = true ]; then
|
||||||
run_test "Go (batch)" "go" "go run example_onnx.go helper.go --batch -voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 -text '$BATCH_TEXT_1|$BATCH_TEXT_2' -lang $BATCH_LANG_1,$BATCH_LANG_2"
|
run_test "Go (batch)" "go" "go run example_onnx.go helper.go --batch -voice-style ../$BATCH_VOICE_STYLE_1,../$BATCH_VOICE_STYLE_2 -text '$BATCH_TEXT_1|$BATCH_TEXT_2' -lang $BATCH_LANG_1,$BATCH_LANG_2"
|
||||||
fi
|
fi
|
||||||
if [ "$TEST_LONGFORM" = true ]; then
|
if [ "$TEST_LONGFORM" = true ]; then
|
||||||
run_test "Go (long-form)" "go" "go run example_onnx.go helper.go -voice-style $LONGFORM_VOICE_STYLE -text '$LONGFORM_TEXT'"
|
run_test "Go (long-form)" "go" "go run example_onnx.go helper.go -voice-style ../$LONGFORM_VOICE_STYLE -text '$LONGFORM_TEXT'"
|
||||||
fi
|
fi
|
||||||
|
|
||||||
# ====================================
|
# ====================================
|
||||||
@@ -230,10 +263,10 @@ if [ "$TEST_DEFAULT" = true ]; then
|
|||||||
run_test "Rust (default)" "rust" "cargo run --release"
|
run_test "Rust (default)" "rust" "cargo run --release"
|
||||||
fi
|
fi
|
||||||
if [ "$TEST_BATCH" = true ]; then
|
if [ "$TEST_BATCH" = true ]; then
|
||||||
run_test "Rust (batch)" "rust" "cargo run --release -- --batch --voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2' --lang $BATCH_LANG_1,$BATCH_LANG_2"
|
run_test "Rust (batch)" "rust" "cargo run --release -- --batch --voice-style ../$BATCH_VOICE_STYLE_1,../$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2' --lang $BATCH_LANG_1,$BATCH_LANG_2"
|
||||||
fi
|
fi
|
||||||
if [ "$TEST_LONGFORM" = true ]; then
|
if [ "$TEST_LONGFORM" = true ]; then
|
||||||
run_test "Rust (long-form)" "rust" "cargo run --release -- --voice-style $LONGFORM_VOICE_STYLE --text '$LONGFORM_TEXT'"
|
run_test "Rust (long-form)" "rust" "cargo run --release -- --voice-style ../$LONGFORM_VOICE_STYLE --text '$LONGFORM_TEXT'"
|
||||||
fi
|
fi
|
||||||
|
|
||||||
# ====================================
|
# ====================================
|
||||||
@@ -241,15 +274,15 @@ fi
|
|||||||
# ====================================
|
# ====================================
|
||||||
echo -e "${YELLOW}Testing C#...${NC}"
|
echo -e "${YELLOW}Testing C#...${NC}"
|
||||||
echo "Building C# project..."
|
echo "Building C# project..."
|
||||||
cd csharp && dotnet clean && cd ..
|
cd csharp && DOTNET_CLI_HOME="$SCRIPT_DIR/.dotnet" "$DOTNET_CMD" clean && cd ..
|
||||||
if [ "$TEST_DEFAULT" = true ]; then
|
if [ "$TEST_DEFAULT" = true ]; then
|
||||||
run_test "C# (default)" "csharp" "dotnet run --configuration Release"
|
run_test "C# (default)" "csharp" "DOTNET_CLI_HOME='$SCRIPT_DIR/.dotnet' '$DOTNET_CMD' run --configuration Release"
|
||||||
fi
|
fi
|
||||||
if [ "$TEST_BATCH" = true ]; then
|
if [ "$TEST_BATCH" = true ]; then
|
||||||
run_test "C# (batch)" "csharp" "dotnet run --configuration Release -- --batch --voice-style ../$BATCH_VOICE_STYLE_1,../$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2' --lang $BATCH_LANG_1,$BATCH_LANG_2"
|
run_test "C# (batch)" "csharp" "DOTNET_CLI_HOME='$SCRIPT_DIR/.dotnet' '$DOTNET_CMD' run --configuration Release -- --batch --voice-style ../$BATCH_VOICE_STYLE_1,../$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2' --lang $BATCH_LANG_1,$BATCH_LANG_2"
|
||||||
fi
|
fi
|
||||||
if [ "$TEST_LONGFORM" = true ]; then
|
if [ "$TEST_LONGFORM" = true ]; then
|
||||||
run_test "C# (long-form)" "csharp" "dotnet run --configuration Release -- --voice-style ../$LONGFORM_VOICE_STYLE --text '$LONGFORM_TEXT'"
|
run_test "C# (long-form)" "csharp" "DOTNET_CLI_HOME='$SCRIPT_DIR/.dotnet' '$DOTNET_CMD' run --configuration Release -- --voice-style ../$LONGFORM_VOICE_STYLE --text '$LONGFORM_TEXT'"
|
||||||
fi
|
fi
|
||||||
|
|
||||||
# ====================================
|
# ====================================
|
||||||
@@ -257,15 +290,15 @@ fi
|
|||||||
# ====================================
|
# ====================================
|
||||||
echo -e "${YELLOW}Testing Java...${NC}"
|
echo -e "${YELLOW}Testing Java...${NC}"
|
||||||
echo "Building Java project..."
|
echo "Building Java project..."
|
||||||
cd java && mvn clean install -q && cd ..
|
cd java && mvn -Dmaven.repo.local="$SCRIPT_DIR/.m2/repository" clean install -q && cd ..
|
||||||
if [ "$TEST_DEFAULT" = true ]; then
|
if [ "$TEST_DEFAULT" = true ]; then
|
||||||
run_test "Java (default)" "java" "mvn exec:java -q"
|
run_test "Java (default)" "java" "mvn -Dmaven.repo.local='$SCRIPT_DIR/.m2/repository' exec:java -q"
|
||||||
fi
|
fi
|
||||||
if [ "$TEST_BATCH" = true ]; then
|
if [ "$TEST_BATCH" = true ]; then
|
||||||
run_test "Java (batch)" "java" "mvn exec:java -q -Dexec.args='--batch --voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text \"$BATCH_TEXT_1|$BATCH_TEXT_2\" --lang $BATCH_LANG_1,$BATCH_LANG_2'"
|
run_test "Java (batch)" "java" "mvn -Dmaven.repo.local='$SCRIPT_DIR/.m2/repository' exec:java -q -Dexec.args='--batch --voice-style ../$BATCH_VOICE_STYLE_1,../$BATCH_VOICE_STYLE_2 --text \"$BATCH_TEXT_1|$BATCH_TEXT_2\" --lang $BATCH_LANG_1,$BATCH_LANG_2'"
|
||||||
fi
|
fi
|
||||||
if [ "$TEST_LONGFORM" = true ]; then
|
if [ "$TEST_LONGFORM" = true ]; then
|
||||||
run_test "Java (long-form)" "java" "mvn exec:java -q -Dexec.args='--voice-style $LONGFORM_VOICE_STYLE --text \"$LONGFORM_TEXT\"'"
|
run_test "Java (long-form)" "java" "mvn -Dmaven.repo.local='$SCRIPT_DIR/.m2/repository' exec:java -q -Dexec.args='--voice-style ../$LONGFORM_VOICE_STYLE --text \"$LONGFORM_TEXT\"'"
|
||||||
fi
|
fi
|
||||||
|
|
||||||
# ====================================
|
# ====================================
|
||||||
@@ -273,15 +306,15 @@ fi
|
|||||||
# ====================================
|
# ====================================
|
||||||
echo -e "${YELLOW}Testing Swift...${NC}"
|
echo -e "${YELLOW}Testing Swift...${NC}"
|
||||||
echo "Building Swift project..."
|
echo "Building Swift project..."
|
||||||
cd swift && swift build -c release && cd ..
|
cd swift && HOME="$SWIFT_HOME" CLANG_MODULE_CACHE_PATH="$CLANG_MODULE_CACHE_PATH" swift build --disable-sandbox -c release && cd ..
|
||||||
if [ "$TEST_DEFAULT" = true ]; then
|
if [ "$TEST_DEFAULT" = true ]; then
|
||||||
run_test "Swift (default)" "swift" ".build/release/example_onnx"
|
run_test "Swift (default)" "swift" "HOME='$SWIFT_HOME' CLANG_MODULE_CACHE_PATH='$CLANG_MODULE_CACHE_PATH' .build/release/example_onnx"
|
||||||
fi
|
fi
|
||||||
if [ "$TEST_BATCH" = true ]; then
|
if [ "$TEST_BATCH" = true ]; then
|
||||||
run_test "Swift (batch)" "swift" ".build/release/example_onnx --batch --voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2' --lang $BATCH_LANG_1,$BATCH_LANG_2"
|
run_test "Swift (batch)" "swift" "HOME='$SWIFT_HOME' CLANG_MODULE_CACHE_PATH='$CLANG_MODULE_CACHE_PATH' .build/release/example_onnx --batch --voice-style ../$BATCH_VOICE_STYLE_1,../$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2' --lang $BATCH_LANG_1,$BATCH_LANG_2"
|
||||||
fi
|
fi
|
||||||
if [ "$TEST_LONGFORM" = true ]; then
|
if [ "$TEST_LONGFORM" = true ]; then
|
||||||
run_test "Swift (long-form)" "swift" ".build/release/example_onnx --voice-style $LONGFORM_VOICE_STYLE --text '$LONGFORM_TEXT'"
|
run_test "Swift (long-form)" "swift" "HOME='$SWIFT_HOME' CLANG_MODULE_CACHE_PATH='$CLANG_MODULE_CACHE_PATH' .build/release/example_onnx --voice-style ../$LONGFORM_VOICE_STYLE --text '$LONGFORM_TEXT'"
|
||||||
fi
|
fi
|
||||||
|
|
||||||
# ====================================
|
# ====================================
|
||||||
@@ -289,15 +322,15 @@ fi
|
|||||||
# ====================================
|
# ====================================
|
||||||
echo -e "${YELLOW}Testing C++...${NC}"
|
echo -e "${YELLOW}Testing C++...${NC}"
|
||||||
echo "Building C++ project..."
|
echo "Building C++ project..."
|
||||||
cd cpp && mkdir -p build && cd build && cmake .. && make && cd ../..
|
cmake -S cpp -B cpp/build && cmake --build cpp/build --config Release
|
||||||
if [ "$TEST_DEFAULT" = true ]; then
|
if [ "$TEST_DEFAULT" = true ]; then
|
||||||
run_test "C++ (default)" "cpp/build" "./example_onnx"
|
run_test "C++ (default)" "cpp/build" "./example_onnx"
|
||||||
fi
|
fi
|
||||||
if [ "$TEST_BATCH" = true ]; then
|
if [ "$TEST_BATCH" = true ]; then
|
||||||
run_test "C++ (batch)" "cpp/build" "./example_onnx --batch --voice-style ../$BATCH_VOICE_STYLE_1,../$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2' --lang $BATCH_LANG_1,$BATCH_LANG_2"
|
run_test "C++ (batch)" "cpp/build" "./example_onnx --batch --voice-style ../../$BATCH_VOICE_STYLE_1,../../$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2' --lang $BATCH_LANG_1,$BATCH_LANG_2"
|
||||||
fi
|
fi
|
||||||
if [ "$TEST_LONGFORM" = true ]; then
|
if [ "$TEST_LONGFORM" = true ]; then
|
||||||
run_test "C++ (long-form)" "cpp/build" "./example_onnx --voice-style ../$LONGFORM_VOICE_STYLE --text '$LONGFORM_TEXT'"
|
run_test "C++ (long-form)" "cpp/build" "./example_onnx --voice-style ../../$LONGFORM_VOICE_STYLE --text '$LONGFORM_TEXT'"
|
||||||
fi
|
fi
|
||||||
|
|
||||||
# ====================================
|
# ====================================
|
||||||
@@ -327,4 +360,3 @@ else
|
|||||||
echo -e "${GREEN}All tests passed! 🎉${NC}"
|
echo -e "${GREEN}All tests passed! 🎉${NC}"
|
||||||
exit 0
|
exit 0
|
||||||
fi
|
fi
|
||||||
|
|
||||||
|
|||||||
@@ -4,7 +4,7 @@ This example demonstrates how to use Supertonic in a web browser using ONNX Runt
|
|||||||
|
|
||||||
## 📰 Update News
|
## 📰 Update News
|
||||||
|
|
||||||
**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
|
**2026.04.29** - 🎉 **Supertonic 3** released with 31-language support, improved reading accuracy, and v2-compatible public ONNX assets. [Demo](https://huggingface.co/spaces/Supertone/supertonic-3) | [Models](https://huggingface.co/Supertone/supertonic-3)
|
||||||
|
|
||||||
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
|
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
|
||||||
|
|
||||||
@@ -20,7 +20,7 @@ This example demonstrates how to use Supertonic in a web browser using ONNX Runt
|
|||||||
|
|
||||||
- 🌐 Runs entirely in the browser (no server required for inference)
|
- 🌐 Runs entirely in the browser (no server required for inference)
|
||||||
- 🚀 WebGPU support with automatic fallback to WebAssembly
|
- 🚀 WebGPU support with automatic fallback to WebAssembly
|
||||||
- 🌍 Multilingual support: English (en), Korean (ko), Spanish (es), Portuguese (pt), French (fr)
|
- 🌍 Multilingual support: 31 languages
|
||||||
- ⚡ Pre-extracted voice styles for instant generation
|
- ⚡ Pre-extracted voice styles for instant generation
|
||||||
- 🎨 Modern, responsive UI
|
- 🎨 Modern, responsive UI
|
||||||
- 🎭 Multiple voice style presets (5 Male, 5 Female)
|
- 🎭 Multiple voice style presets (5 Male, 5 Female)
|
||||||
@@ -58,14 +58,10 @@ This will start a local development server (usually at http://localhost:3000) an
|
|||||||
- **Male 1-5 (M1-M5)**: Male voice styles
|
- **Male 1-5 (M1-M5)**: Male voice styles
|
||||||
- **Female 1-5 (F1-F5)**: Female voice styles
|
- **Female 1-5 (F1-F5)**: Female voice styles
|
||||||
3. **Select Language**: Choose the language that matches your input text
|
3. **Select Language**: Choose the language that matches your input text
|
||||||
- **English (en)**: Default language
|
- Supertonic 3 supports 31 language codes; see the main README for the full list.
|
||||||
- **한국어 (ko)**: Korean
|
|
||||||
- **Español (es)**: Spanish
|
|
||||||
- **Português (pt)**: Portuguese
|
|
||||||
- **Français (fr)**: French
|
|
||||||
4. **Enter Text**: Type or paste the text you want to convert to speech
|
4. **Enter Text**: Type or paste the text you want to convert to speech
|
||||||
5. **Adjust Settings** (optional):
|
5. **Adjust Settings** (optional):
|
||||||
- **Total Steps**: More steps = better quality but slower (default: 5)
|
- **Total Steps**: More steps = better quality but slower (default: 8)
|
||||||
6. **Generate Speech**: Click the "Generate Speech" button
|
6. **Generate Speech**: Click the "Generate Speech" button
|
||||||
7. **View Results**:
|
7. **View Results**:
|
||||||
- See the full input text
|
- See the full input text
|
||||||
@@ -75,7 +71,7 @@ This will start a local development server (usually at http://localhost:3000) an
|
|||||||
|
|
||||||
## Multilingual Support
|
## Multilingual Support
|
||||||
|
|
||||||
Supertonic 2 supports multiple languages. Make sure to select the correct language for your input text to get the best results. The model will automatically handle text preprocessing and pronunciation for the selected language.
|
Supertonic 3 supports 31 languages. Make sure to select the correct language for your input text to get the best results. The model will automatically handle text preprocessing and pronunciation for the selected language.
|
||||||
|
|
||||||
## Technical Details
|
## Technical Details
|
||||||
|
|
||||||
@@ -118,4 +114,4 @@ This demo uses:
|
|||||||
### Slow generation
|
### Slow generation
|
||||||
- If using WebAssembly, try a browser that supports WebGPU
|
- If using WebAssembly, try a browser that supports WebGPU
|
||||||
- Ensure no other heavy processes are running
|
- Ensure no other heavy processes are running
|
||||||
- Consider using fewer denoising steps for faster (but lower quality) results
|
- Consider using fewer denoising steps for faster (but lower quality) results
|
||||||
|
|||||||
@@ -1,7 +1,7 @@
|
|||||||
import * as ort from 'onnxruntime-web';
|
import * as ort from 'onnxruntime-web';
|
||||||
|
|
||||||
// Available languages for multilingual TTS
|
// Available languages for multilingual TTS
|
||||||
export const AVAILABLE_LANGS = ['en', 'ko', 'es', 'pt', 'fr'];
|
export const AVAILABLE_LANGS = ['en', 'ko', 'ja', 'ar', 'bg', 'cs', 'da', 'de', 'el', 'es', 'et', 'fi', 'fr', 'hi', 'hr', 'hu', 'id', 'it', 'lt', 'lv', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sv', 'tr', 'uk', 'vi'];
|
||||||
|
|
||||||
export function isValidLang(lang) {
|
export function isValidLang(lang) {
|
||||||
return AVAILABLE_LANGS.includes(lang);
|
return AVAILABLE_LANGS.includes(lang);
|
||||||
@@ -272,7 +272,7 @@ export class TextToSpeech {
         if (style.ttl.dims[0] !== 1) {
             throw new Error('Single speaker text to speech only supports single style');
         }
-        const maxLen = lang === 'ko' ? 120 : 300;
+        const maxLen = (lang === 'ko' || lang === 'ja') ? 120 : 300;
         const textList = chunkText(text, maxLen);
         const langList = new Array(textList.length).fill(lang);
         let wavCat = [];

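Review note: the hunk above only changes the per-language limit; the body of `chunkText` is not shown in this diff. To illustrate how the 120 and 300 character limits affect chunk counts, here is a minimal sketch. `maxLenFor` and `chunkBySentence` are hypothetical stand-ins, and the real `chunkText` may split text differently.

```javascript
// Mirrors the limit chosen in the hunk: 120 for Korean and Japanese, 300 otherwise.
function maxLenFor(lang) {
  return (lang === 'ko' || lang === 'ja') ? 120 : 300;
}

// Hypothetical stand-in for chunkText (assumption, not the repository's code).
// Lengths are counted in UTF-16 code units, like String.prototype.length.
function chunkBySentence(text, maxLen) {
  // Each piece is a sentence plus its trailing whitespace, so that
  // concatenating all pieces reproduces the input exactly.
  const pieces = text.match(/[^.!?。！？]*[.!?。！？]+\s*|[^.!?。！？]+$/g) ?? [];
  const chunks = [];
  let current = '';
  for (const piece of pieces) {
    // Close the current chunk if adding this sentence would exceed the limit.
    if (current && (current + piece).trimEnd().length > maxLen) {
      chunks.push(current.trim());
      current = '';
    }
    current += piece;
    // A single sentence longer than maxLen is cut at the limit.
    while (current.trimEnd().length > maxLen) {
      chunks.push(current.slice(0, maxLen).trim());
      current = current.slice(maxLen);
    }
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks.filter((chunk) => chunk.length > 0);
}

const sample = '안녕하세요. '.repeat(40);
console.log(chunkBySentence(sample, maxLenFor('ko')).length);
console.log(chunkBySentence(sample, maxLenFor('en')).length);
```

The same input yields more chunks under the 120-character limit than under the 300-character one, so each `ko` or `ja` request runs more inference passes for a given text length.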
@@ -3,13 +3,13 @@
 <head>
     <meta charset="UTF-8">
     <meta name="viewport" content="width=device-width, initial-scale=1.0">
-    <title>Supertonic - Web Demo</title>
+    <title>Supertonic 3 - Web Demo</title>
     <link rel="stylesheet" href="/style.css">
 </head>
 <body>
     <div class="container">
-        <h1>🎤 Supertonic 2</h1>
-        <p class="subtitle">Multilingual Text-to-Speech with ONNX Runtime Web</p>
+        <h1>🎤 Supertonic 3</h1>
+        <p class="subtitle">31-language Text-to-Speech with ONNX Runtime Web</p>
 
         <div id="statusBox" class="status-box">
             <div class="status-text-wrapper">
@@ -46,9 +46,35 @@
     <select id="langSelect">
         <option value="en" selected>English (en)</option>
         <option value="ko">한국어 (ko)</option>
+        <option value="ja">日本語 (ja)</option>
+        <option value="ar">العربية (ar)</option>
+        <option value="bg">Bulgarian (bg)</option>
+        <option value="cs">Czech (cs)</option>
+        <option value="da">Danish (da)</option>
+        <option value="de">Deutsch (de)</option>
+        <option value="el">Greek (el)</option>
         <option value="es">Español (es)</option>
-        <option value="pt">Português (pt)</option>
+        <option value="et">Estonian (et)</option>
+        <option value="fi">Finnish (fi)</option>
         <option value="fr">Français (fr)</option>
+        <option value="hi">Hindi (hi)</option>
+        <option value="hr">Croatian (hr)</option>
+        <option value="hu">Hungarian (hu)</option>
+        <option value="id">Indonesian (id)</option>
+        <option value="it">Italian (it)</option>
+        <option value="lt">Lithuanian (lt)</option>
+        <option value="lv">Latvian (lv)</option>
+        <option value="nl">Dutch (nl)</option>
+        <option value="pl">Polish (pl)</option>
+        <option value="pt">Português (pt)</option>
+        <option value="ro">Romanian (ro)</option>
+        <option value="ru">Russian (ru)</option>
+        <option value="sk">Slovak (sk)</option>
+        <option value="sl">Slovenian (sl)</option>
+        <option value="sv">Swedish (sv)</option>
+        <option value="tr">Turkish (tr)</option>
+        <option value="uk">Ukrainian (uk)</option>
+        <option value="vi">Vietnamese (vi)</option>
     </select>
 </div>
 
@@ -62,7 +88,7 @@
 <div class="section">
     <label for="totalStep">Total Steps (higher = better
         quality):</label>
-    <input type="number" id="totalStep" value="5"
+    <input type="number" id="totalStep" value="8"
         min="1" max="50">
 </div>
 

@@ -1,6 +1,38 @@
+import { createReadStream, existsSync, statSync } from 'node:fs';
+import path from 'node:path';
+import { fileURLToPath } from 'node:url';
 import { defineConfig } from 'vite';
 
+const __dirname = path.dirname(fileURLToPath(import.meta.url));
+const rootAssetsDir = path.resolve(__dirname, '../assets');
+
+function serveRootAssets() {
+    return {
+        name: 'serve-root-assets',
+        configureServer(server) {
+            server.middlewares.use('/assets', (req, res, next) => {
+                const urlPath = decodeURIComponent((req.url || '').split('?')[0]);
+                const filePath = path.resolve(rootAssetsDir, `.${urlPath}`);
+
+                if (!filePath.startsWith(rootAssetsDir) || !existsSync(filePath)) {
+                    next();
+                    return;
+                }
+
+                const stat = statSync(filePath);
+                if (!stat.isFile()) {
+                    next();
+                    return;
+                }
+
+                createReadStream(filePath).pipe(res);
+            });
+        }
+    };
+}
+
 export default defineConfig({
+    plugins: [serveRootAssets()],
     server: {
         port: 3000,
         open: true

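Review note: the new middleware guards against path traversal with `filePath.startsWith(rootAssetsDir)`. A plain string prefix test also accepts sibling paths that share the prefix, for example a hypothetical `../assets-private` directory next to `assets`. This only matters for the dev server, but a stricter containment check is cheap. The sketch below uses `path.relative`; the helper name `resolveInside` is an assumption and is not part of this commit.

```javascript
import path from 'node:path';

// Hypothetical helper (assumption, not in the commit): resolves a request path
// against rootDir and returns null when the result falls outside rootDir.
function resolveInside(rootDir, urlPath) {
  const root = path.resolve(rootDir);
  const target = path.resolve(root, `.${urlPath}`);
  const rel = path.relative(root, target);
  // A relative path that is '..', starts with '../', or is absolute
  // (different drive on Windows) points outside the root directory.
  const escapes =
    rel === '..' || rel.startsWith(`..${path.sep}`) || path.isAbsolute(rel);
  return escapes ? null : target;
}

const root = path.resolve('/repo/assets');
console.log(resolveInside(root, '/onnx/model.onnx'));
console.log(resolveInside(root, '/../assets-private/key.txt')); // null
```

Inside the middleware, a `null` result would take the same `next()` branch as a missing file. The existing `stat.isFile()` check still applies, since a request for `/` resolves to the root directory itself.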