text-to-speech-en-multi-0001 (composite)¶
Use Case and High-Level Description¶
This is a speech synthesis composite model that simultaneously reconstructs mel-spectrogram and wave form from text. The model generates wave form from symbol sequences separated by space for forty speakers. The speaker voice characteristics are represented by vector of two numbers. The model is built on top of the modified ForwardTacotron and modified MelGAN frameworks.
Composite model specification¶
Metric |
Value |
---|---|
Source framework |
PyTorch* |
Duration prediction model specification¶
The text-to-speech-en-multi-0001-duration-prediction model is a ForwardTacotron-based duration predictor for symbols.
Metric |
Value |
---|---|
GFlops |
28.75 |
MParams |
26.18 |
Inputs¶
Sequence, name:
input_seq
, shape:1, 512
, format:B,C
, where:B
- batch sizeC
- number of symbols in sequence
Mask for input sequence, name:
input_mask
, shape:1, 1, 512
, format:B, D, C
, where:B
- batch sizeD
- extra dimension for multiplicationC
- number of symbols in sequence
Mask for relative position representation in attention, name:
pos_mask
, shape:1, 1, 512, 512
, format:B, D, C, C
, where:B
- batch sizeD
- extra dimension for multiplicationC
- number of symbols in sequence
Vector for representing the speaker voice embedding, name:
speaker_embedding
, shape:1, 2
, format:B, D
, where:B
- batch sizeD
- size of the embedding vector
Outputs¶
Duration for input symbols, name:
duration
, shape:1, 512, 1
, formatB, C, H
. Contains predicted duration for each of the symbol in sequence.B
- batch sizeC
- number of symbols in sequenceH
- empty dimension
Processed embeddings, name:
embeddings
, shape:1, 512, 256
, formatB, C, H
. Contains processed embeddings for each symbol in sequence.B
- batch sizeC
- number of symbols in sequenceH
- height of the intermediate feature map
Mel-spectrogram regression model specification¶
The text-to-speech-en-multi-0001-regression model accepts aligned by duration processed embeddings (for example: if duration is [2, 3] and processed embeddings is [[1, 2], [3, 4]], aligned embeddings is [[1, 2], [1, 2], [1,2], [3, 4], [3, 4]]) and produces mel-spectrogram.
Metric |
Value |
---|---|
GFlops |
7.81 |
MParams |
5.12 |
Inputs¶
Processed embeddigs aligned by durations, name:
data
, shape:1, 512, 256
, format:B, T, C
, where:B
- batch sizeT
- time in mel-spectrogramC
- processed embedding dimension
Mask for
data
by time dimension, name:data_mask
, shape:1, 1, 512
, format:B, D, T
, where:B
- batch sizeD
- extra dimension for multiplicationT
- time in mel-spectrogram
Mask for relative position representation in attention, name:
pos_mask
, shape:1, 1, 512, 512
, format:B, D, C, C
, where:B
- batch sizeD
- extra dimension for multiplicationC
- number of symbols in sequence
Vector for representing the speaker voice embedding, name:
speaker_embedding
, shape:1, 2
, format:B, D
, where:B
- batch sizeD
- size of the embedding vector
Output¶
Mel-spectrogram, name: mel
, shape: 80, 512
, format: C, T
, where:
T
- time in mel-spectrogramC
- number of rows in mel-spectrogram
Audio generation model specification¶
The text-to-speech-en-multi-0001-generation model is a MelGAN based audio generator.
Metric |
Value |
---|---|
GFlops |
48.38 |
MParams |
12.77 |
Inputs¶
Mel-spectrogram, name: mel
, shape: 1, 80, 128
, format: B, C, T
, where:
B
- batch sizeC
- number of rows in mel-spectrogramT
- time in mel-spectrogram
Outputs¶
Audio, name: audio
, shape: 32768
, format: T
, where:
T
- time in audio with sampling rate 22050 (~1.5 sec).
Demo usage¶
The model can be used in the following demos provided by the Open Model Zoo to show its capabilities:
Legal Information¶
[*] Other names and brands may be claimed as the property of others.