Voice cloning best practices

Better cloning starts with better source material. Small improvements in the source audio can make a noticeable difference in the final voice.

Keep the samples clean

Aim for speech that is:
  • clear
  • dry (little or no room echo or reverb)
  • low-noise
  • free from heavy music or crowd sound
The quality of the capture matters more than the file format itself.

Use consistent speech

The model responds better when the sample material sounds like one coherent speaking style. Try to keep:
  • a similar recording environment
  • a similar speaking tone
  • a similar energy level
If the source swings between whispering, shouting, noisy clips, and clean narration, the cloned result can become less stable.

Match the performance you want

Cloned voices tend to inherit the style of the sample material:
  • calm samples produce calmer output
  • expressive samples produce more expressive output
  • rushed delivery can create rushed output
If you want a clean, steady narrator feel, source clips with that same style usually work better.

Use enough audio, but not too much

As a rule of thumb:
  • use at least about 1 minute of useful speech when possible
  • 1 to 2 minutes of clear material is usually a strong range
  • more than about 5 minutes gives little practical gain in most cases
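
The ranges above can be turned into a quick check before uploading. This is a minimal Python sketch, not part of any product API: the function names and labels are hypothetical, and it assumes WAV input. It measures total file length, so pauses and silence count toward the duration even though only useful speech matters.

```python
import wave

# Thresholds in seconds, taken from the rule of thumb above.
MIN_USEFUL = 60      # about 1 minute
STRONG_MAX = 120     # upper end of the 1 to 2 minute range
DIMINISHING = 300    # about 5 minutes


def classify_duration(seconds: float) -> str:
    """Map a total sample duration to a rule-of-thumb label (labels are illustrative)."""
    if seconds < MIN_USEFUL:
        return "short"
    if seconds <= STRONG_MAX:
        return "strong"
    if seconds <= DIMINISHING:
        return "acceptable"
    return "diminishing returns"


def wav_duration(path: str) -> float:
    """Return the length of a WAV file in seconds (includes silence)."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()
```

For a set of clips, sum the individual durations and pass the total to `classify_duration`.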

Keep the volume balanced

Try to avoid samples that are:
  • too quiet
  • clipped or distorted
  • heavily normalized to the point of sounding unnatural
A balanced recording level is better than an aggressively loud one.
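
A rough way to spot the first two problems is to measure peak and RMS level. The sketch below is an illustration under stated assumptions: it expects 16-bit PCM WAV input, the function names are hypothetical, and the thresholds (-35 dBFS RMS for "too quiet", -0.1 dBFS peak for "possible clipping") are arbitrary starting points rather than documented limits. It does not detect over-normalization, which is still best judged by ear.

```python
import array
import math
import sys
import wave

FULL_SCALE = 32767  # largest positive 16-bit PCM sample value


def read_pcm16(path):
    """Read all samples from a 16-bit PCM WAV file as signed integers."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        samples = array.array("h")
        samples.frombytes(wf.readframes(wf.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return samples


def to_dbfs(value):
    """Convert a linear amplitude to decibels relative to full scale."""
    return 20 * math.log10(value / FULL_SCALE) if value > 0 else float("-inf")


def check_levels(samples, quiet_rms_db=-35.0, clip_peak_db=-0.1):
    """Return a list of level problems; thresholds are assumed, not official."""
    if len(samples) == 0:
        raise ValueError("no samples to analyze")
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    issues = []
    if to_dbfs(rms) < quiet_rms_db:
        issues.append("too quiet")
    if to_dbfs(peak) >= clip_peak_db:
        issues.append("possible clipping")
    return issues
```

An empty list from `check_levels(read_pcm16("sample.wav"))` means neither check fired; it does not guarantee the recording sounds natural.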

Choose the right isolation mode

For most projects:
  • start with Studio
  • switch to Realistic only when the character of the recording environment is part of the experience