Text-to-speech and lip-sync in UE5


I have successfully integrated Microsoft SAPI into Unreal Engine 5 to generate text-to-speech (TTS) along with my lip-syncing method. Let me know what you think.

Progress on the update of More than words to Unreal Engine 5:

- Main interface [OK].
- Updated Marian's 3D model [OK].
- Microsoft SAPI integration in UE5 to generate text-to-speech [OK].
- State machine integration: Wait, Think, Speak, and Listen [OK].
- State machine integration for lip-sync [OK].
  + The lip-sync now uses a simpler algorithm with better results.
- Integration of the thread that handles speaking [OK].
  + Fixed a minor bug that prevented the thread from being completely destroyed.

Next steps: Integrate speech recognition.

Comments


Looks good. Any chance you could add face (camera) tracking too so that she looks at the player? Bonus points if you could do so with a model that incorporates glances away instead of just staring the whole time.

On the TTS specifically, I'm sure it's for computational load and ease of programming that you're using the Microsoft built-in system, but have you looked into local TTS models? This game uses one that sounds pretty convincingly human for what it is: https://jetro30087.itch.io/ai-companion-miku If you can get in contact with the dev, maybe he'll tell you what system he used.

"Any chance you could add face (camera) tracking so that she looks at the player? Bonus points if you could do so with a model that incorporates glances away instead of just staring the whole time."

What you describe can be achieved with BlendSpace animations; when I work on the animation module, I'll see how far I can go.

"I'm sure it's for computational load and ease of programming that you're using the Microsoft built in system, but have you looked into local TTS models?"

Currently, to generate the static voice I am using this one (Cortana's voice in the new update):

But as you say, the main reason is performance: the Microsoft system generates both the voice and the voice analysis (it gives you the list of phonemes) very efficiently. It adds practically no load to the game, and since it has to run at the same time as Unreal Engine's rendering and the LLMs, that's why I'm still holding on to it. But I'm not closing my eyes to alternatives. We'll see how much local TTS advances this year and whether similar performance can be achieved.