High-fidelity 21-language neural text-to-speech for smartphones

- Developed a fast, high-fidelity neural text-to-speech technology covering 21 languages

- The developed model can synthesize one second of speech in just 0.1 seconds on a single CPU core, about eight times faster than conventional methods

- The developed model achieves fast synthesis with a latency of 0.5 seconds on a smartphone without a network connection

- The technology is expected to make its way into speech applications such as multilingual speech translation and car navigation

The Universal Communication Research Institute of the National Institute of Information and Communications Technology (NICT, President: TOKUDA Hideyuki, Ph.D.) has successfully developed a fast, high-fidelity neural text-to-speech technology supporting 21 languages. The developed model can synthesize one second of speech in just 0.1 seconds using a single CPU core, which is about eight times faster than conventional methods. It also enables fast synthesis with a latency of 0.5 seconds on a mid-range smartphone without a network connection (see Figure 1).
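In standard terms, this corresponds to a real-time factor (RTF) of 0.1. The minimal Python illustration below uses only the figures quoted above; the conventional RTF is inferred from the "eight times faster" claim and is not stated in the release.

```python
# Real-time factor (RTF) = synthesis time / duration of generated audio.
synthesis_time_s = 0.1   # time to synthesize on a single CPU core
audio_duration_s = 1.0   # duration of the synthesized speech

rtf = synthesis_time_s / audio_duration_s   # 0.1; values below 1.0 are faster than real time
conventional_rtf = rtf * 8                  # ~0.8, inferred from the "8x faster" claim

print(f"developed model RTF: {rtf:.1f}")
print(f"conventional RTF:   ~{conventional_rtf:.1f}")
```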

Additionally, the developed neural text-to-speech models for the 21 languages have been installed on the server of VoiceTra, a multilingual speech translation application for smartphones operated by NICT, and made available to the public. In the future, the technology is expected to be introduced into various speech applications, such as multilingual speech translation and car navigation, through commercial licensing.

These results will be presented in the Show & Tell session of INTERSPEECH 2024, an international conference organized by the International Speech Communication Association (ISCA), in September 2024.

The Universal Communication Research Institute of NICT conducts research and development (R&D) of multilingual speech translation technology to realize spoken-language communication that overcomes language barriers. The R&D results have been released to the public through a field experiment with VoiceTra, a speech translation application for smartphones, and have also reached society through commercial licensing. Text-to-speech technology, which synthesizes translated text as human-like speech, is as essential to multilingual speech translation as automatic speech recognition and machine translation. Thanks to the introduction of neural network technology, the quality of synthesized speech has improved dramatically in recent years and has reached a level comparable to natural speech. However, the large amount of computation required has been a major issue, making synthesis on a smartphone without a network connection impossible.

Additionally, NICT is currently conducting R&D on multilingual simultaneous interpretation technology. Simultaneous interpretation requires outputting interpreted speech continuously, without waiting for the speaker to finish speaking. This demands further acceleration not only of automatic speech recognition and machine translation but also of text-to-speech.

Text-to-speech models are typically composed of an acoustic model, which converts the input text into intermediate features, and a waveform generation model, which converts those intermediate features into speech waveforms.
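As a rough sketch of this two-stage structure (a minimal Python/PyTorch illustration; all class names, layers, and shapes here are assumptions for exposition, not NICT's implementation):

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Maps a sequence of text/phoneme IDs to intermediate acoustic
    features (e.g., a mel-spectrogram). Illustrative stand-in only."""
    def __init__(self, vocab_size=100, feat_dim=80, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.proj = nn.Linear(hidden, feat_dim)

    def forward(self, text_ids):                 # (batch, text_len)
        return self.proj(self.embed(text_ids))   # (batch, text_len, feat_dim)

class WaveformGenerator(nn.Module):
    """Upsamples intermediate features to a speech waveform; a neural
    vocoder such as HiFi-GAN plays this role. Toy stand-in only."""
    def __init__(self, feat_dim=80, hop=256):
        super().__init__()
        self.hop = hop
        self.net = nn.Linear(feat_dim, hop)      # one frame -> hop samples (toy)

    def forward(self, feats):                    # (batch, frames, feat_dim)
        return self.net(feats).flatten(1)        # (batch, frames * hop) samples

text_ids = torch.randint(0, 100, (1, 12))
features = AcousticModel()(text_ids)      # stage 1: text -> intermediate features
waveform = WaveformGenerator()(features)  # stage 2: features -> waveform
```

Real systems also handle duration and alignment in the acoustic model, but the interface between the two stages is as above: the intermediate features are the only thing passed between them.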

While Transformer-based neural networks (Transformer encoder + Transformer decoder), which are widely used in machine translation, automatic speech recognition, and large language models (e.g., ChatGPT), are the mainstream for acoustic modeling in neural text-to-speech, we introduced the fast, high-performance ConvNeXt networks (ConvNeXt encoder + ConvNeXt decoder), recently proposed for image recognition, into acoustic modeling and achieved about three times faster synthesis than conventional methods without performance degradation [1].
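For reference, a 1-D ConvNeXt-style block of the kind such acoustic models stack might look as follows (an illustrative PyTorch sketch; the dimensions and kernel size are assumptions, not the paper's values):

```python
import torch
import torch.nn as nn

class ConvNeXtBlock1d(nn.Module):
    """1-D adaptation of a ConvNeXt block: depthwise convolution ->
    LayerNorm -> pointwise MLP with GELU -> residual connection."""
    def __init__(self, dim=256, expansion=4, kernel_size=7):
        super().__init__()
        self.dwconv = nn.Conv1d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, expansion * dim)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(expansion * dim, dim)

    def forward(self, x):                        # x: (batch, time, dim)
        residual = x
        x = self.dwconv(x.transpose(1, 2)).transpose(1, 2)  # mix over time
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        return residual + x

x = torch.randn(1, 100, 256)
print(ConvNeXtBlock1d()(x).shape)  # torch.Size([1, 100, 256])
```

Unlike self-attention, whose cost grows quadratically with sequence length, the depthwise convolution here costs linear time, which is one plausible source of the speedup on CPUs.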

In 2021, we introduced MS-HiFi-GAN [5], which extends the conventional HiFi-GAN model, capable of synthesizing speech comparable to human speech, by representing a signal-processing-based subband method [2-4] as a trainable neural network, and achieved about twice as fast synthesis without degradation of synthesis quality. In 2023, we further accelerated MS-HiFi-GAN to develop MS-FC-HiFi-GAN, achieving about four times faster synthesis than conventional HiFi-GAN without quality degradation.
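The "MS" (multi-stream) idea can be sketched as follows: the generator outputs several low-sample-rate streams, and a trainable synthesis filter merges them into one full-rate waveform. The snippet below is an illustrative stand-in in which a transposed convolution plays the role of the learned filterbank; it is not the actual MS-HiFi-GAN code, and the stream count and filter length are assumptions.

```python
import torch
import torch.nn as nn

class MultiStreamCombiner(nn.Module):
    """Merges N low-rate streams into one full-rate waveform with a
    trainable synthesis filter (here a transposed 1-D convolution)."""
    def __init__(self, num_streams=4, taps=63):
        super().__init__()
        # Upsample by the number of streams while mixing the streams.
        self.synthesis = nn.ConvTranspose1d(
            num_streams, 1, kernel_size=taps,
            stride=num_streams, padding=(taps - num_streams) // 2)

    def forward(self, streams):          # (batch, num_streams, frames)
        return self.synthesis(streams)   # (batch, 1, ~num_streams * frames)

streams = torch.randn(1, 4, 4000)        # four streams at 1/4 sample rate
waveform = MultiStreamCombiner()(streams)
print(waveform.shape)                     # about 4x the input length
```

The speedup comes from running the expensive generator layers at a fraction of the output sample rate and deferring full-rate processing to this single cheap combining step.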

As the culmination of these achievements, we have developed a new, fast, high-quality neural text-to-speech model that combines an acoustic model (Transformer encoder + ConvNeXt decoder) with a waveform generation model (MS-FC-HiFi-GAN), as shown in Figure 2. The developed model can synthesize one second of speech in only 0.1 seconds using a single CPU core, which is about eight times faster than conventional models. Furthermore, by introducing a method in which incremental synthesis is applied only to the waveform generation model (see Figure 3), the developed model achieves fast synthesis with a latency of 0.5 seconds on a mid-range smartphone, without a network connection and without degradation of synthesis quality. This eliminates the need for conventional server-based synthesis over the Internet and brings high-quality neural text-to-speech to smartphones, PCs, and other devices at reduced communication cost. The incremental synthesis processing also makes it possible to synthesize translated text immediately in simultaneous multilingual interpretation.
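A minimal sketch of the incremental idea: the waveform generation model is run chunk by chunk over the intermediate features, with a few frames of surrounding context to avoid boundary artifacts, so audio can be played back before the whole utterance is synthesized. The chunk size, context width, and trimming scheme below are assumptions for illustration, not the paper's exact method.

```python
import torch

def incremental_vocode(vocoder, feats, chunk=20, context=5, hop=256):
    """Yield waveform pieces as soon as each feature chunk is vocoded.
    feats: (batch, frames, feat_dim); vocoder maps frames -> samples."""
    num_frames = feats.size(1)
    for start in range(0, num_frames, chunk):
        lo = max(0, start - context)                 # left context frames
        hi = min(num_frames, start + chunk + context)  # right context frames
        audio = vocoder(feats[:, lo:hi])             # (batch, samples)
        # Keep only the audio belonging to the centre chunk.
        left = (start - lo) * hop
        right = left + min(chunk, num_frames - start) * hop
        yield audio[:, left:right]

# Usage with the toy WaveformGenerator sketched earlier:
# for piece in incremental_vocode(WaveformGenerator(), features):
#     play(piece)  # hypothetical audio sink
```

Because only the waveform generator runs incrementally, the first audio is ready after one chunk is vocoded rather than after the whole waveform is generated, which is what makes the 0.5-second latency feasible on-device.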

As of March 2024, the developed technology has been deployed for neural text-to-speech in the 21 languages† supported by VoiceTra and made available to the public.

†21 languages: Japanese, English, Chinese, Korean, Thai, French, Indonesian, Vietnamese, Spanish, Myanmar, Filipino, Brazilian Portuguese, Khmer, Nepali, Mongolian, Arabic, Italian, Ukrainian, German, Hindi and Russian

In the future, we will promote social implementation through commercial licensing, particularly in smartphone applications such as multilingual speech translation and car navigation systems.

Publication information

Journal: Proceedings of INTERSPEECH 2024

Title: Mobile PresenTra: NICT fast neural text-to-speech system on smartphones with MS-FC-HiFi-GAN incremental inference for low-latency synthesis

Authors: Takuma Okamoto, Yamato Ohtani, Hisashi Kawai

References

[1] T. Okamoto, Y. Ohtani, T. Toda, and H. Kawai, “ConvNeXt-TTS and ConvNeXt-VC: Fast text-to-speech and voice conversion based on ConvNeXt,” in Proc. ICASSP, April 2024, pp. 12456–12460.

[2] T. Okamoto, K. Tachibana, T. Toda, Y. Shiga, and H. Kawai, “Subband WaveNet with one-side-overlapped filterbanks,” in Proc. ASRU, December 2017, pp. 698–704.

[3] T. Okamoto, K. Tachibana, T. Toda, Y. Shiga, and H. Kawai, “An investigation of WaveNet subband vocoder covering the entire audible frequency range with limited acoustic features,” in Proc. ICASSP, April 2018, pp. 5654–5658.

[4] T. Okamoto, T. Toda, Y. Shiga, and H. Kawai, “FFTNet vocoder improvement with noise shaping and subband approaches,” in Proc. SLT, December 2018, pp. 304–311.

[5] T. Okamoto, T. Toda, and H. Kawai, “Multi-stream HiFi-GAN with data-driven waveform decomposition,” in Proc. ASRU, December 2021, pp. 610–617.

[6] T. Okamoto, H. Yamashita, Y. Ohtani, T. Toda, and H. Kawai, “WaveNeXt: ConvNeXt-based fast neural vocoder without iSTFT layer,” in Proc. ASRU, December 2023.

[7] H. Yamashita, T. Okamoto, R. Takashima, Y. Ohtani, T. Takiguchi, T. Toda, and H. Kawai, “Generative neural fast speech waveform models with upsampling based on fully connected layers,” IEEE Access, vol. 12, pp. 31409–31421, 2024.

