According to Datamation, Resemble AI today open-sourced the voice generation model DramaBox on Hugging Face. Billed as the first voice engine with director-level controllability, it lets AI voices break out of the monotonous robot-assistant mode entirely.
The core mechanism is decoupled cue control. Users place the spoken lines inside straight quotation marks and put stage directions, such as a sigh, a long pause, a whisper, or a voice turning husky with sadness, outside the quotes. The model does not read the directions aloud; instead it renders them as emotionally charged vocalizations, turning the output from plain speech synthesis into a genuine character performance. This capability can replace human voice-over work or cumbersome post-production workflows.
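The quoting convention above can be made concrete with a small parser. This is an illustrative sketch, not part of the released model's API: the function name `split_cues` and the parenthesized direction style are assumptions, and the snippet only shows how straight quotes separate what is voiced from what merely shapes the delivery.

```python
import re

def split_cues(prompt: str):
    """Split a DramaBox-style prompt into spoken lines (inside straight
    quotes) and stage directions (everything outside the quotes).

    Hypothetical helper for illustration only: text in quotes would be
    rendered as speech, while the surrounding directions would steer the
    emotional performance without being read aloud.
    """
    spoken = re.findall(r'"([^"]*)"', prompt)            # rendered as speech
    directions = [part.strip(' ,.')                      # rendered as performance
                  for part in re.split(r'"[^"]*"', prompt)
                  if part.strip(' ,.')]
    return spoken, directions

# Example prompt mixing one line with two stage directions.
lines, cues = split_cues('(sighs) "I never wanted it to end this way." (long pause)')
```

Here `lines` holds the single quoted sentence and `cues` holds the two parenthesized directions, mirroring how the engine treats the two streams differently.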
Technically, DramaBox offers zero-shot voice cloning: roughly 10 seconds of reference audio is enough to match a target voice, and the character's age, accent, and emotion can be set directly through natural-language cues. The model natively outputs studio-quality 48kHz stereo audio. To deter deepfakes, every generated clip is automatically embedded with an inaudible Perth watermark that survives MP3 compression and common audio-editing operations.
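The two numbers in that paragraph, 10 seconds of reference audio and a 48 kHz sample rate, can be tied together in a small pre-flight check. This is a hypothetical utility, not part of any DramaBox API; it only converts the stated duration requirement into a frame count.

```python
def reference_audio_ok(num_frames: int, sample_rate: int = 48_000,
                       min_seconds: float = 10.0) -> bool:
    """Check whether a reference clip is long enough for zero-shot cloning.

    Illustrative only: `num_frames` counts per-channel sample frames, so
    a stereo clip does not double the requirement. At the model's native
    48 kHz rate, 10 seconds corresponds to 480,000 frames per channel.
    """
    return num_frames / sample_rate >= min_seconds
```

For example, a 480,000-frame clip at 48 kHz just meets the 10-second threshold, while the same frame count at a lower sample rate would exceed it comfortably.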
Under the hood, the model is fine-tuned from Lightricks' 33-billion-parameter LTX-2.3 Audio Megamodel, combining a Diffusion Transformer (DiT) with a flow-matching objective and leveraging text embeddings from the 12B Gemma 3 model.
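The flow-matching objective mentioned above can be sketched in a few lines. This is a generic illustration of the technique, not DramaBox's actual training code: a training point is linearly interpolated between noise and data, and the network regresses the constant velocity that carries one to the other.

```python
def flow_matching_pair(x0, x1, t):
    """Build one (input, target) training pair for flow matching.

    Generic sketch of the objective, assuming the common linear
    (rectified-flow) path: the input x_t lies on the straight line from
    noise x0 (t=0) to data x1 (t=1), and the regression target is the
    constant velocity (x1 - x0) along that line.
    """
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]   # point on the path
    velocity = [b - a for a, b in zip(x0, x1)]            # regression target
    return x_t, velocity
```

At sampling time, a model trained this way integrates the learned velocity field from noise toward data, which is what lets a DiT backbone generate audio latents.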
