TL;DR
Generating string instrument performances with intricate movements and complex interactions poses significant challenges. To address these, we present ELGAR—the first framework for whole-body instrument performance motion generation solely from audio. We further contribute novel losses, metrics, and a dataset, marking a first attempt at this emerging task with promising results.
Abstract
The art of instrument performance stands as a vivid manifestation of human creativity and emotion.
Nonetheless, generating instrument performance motions is a highly challenging task, as it requires
not only capturing intricate movements but also reconstructing the complex dynamics of the
performer-instrument interaction.
While existing works primarily focus on modeling partial-body motions, we propose Expressive ceLlo performance motion Generation for Audio Rendition (ELGAR), a state-of-the-art diffusion-based framework for whole-body, fine-grained instrument performance motion generation solely from audio.
To emphasize the interactive nature of instrument performance, we introduce the Hand Interactive Contact Loss (HICL) and the Bow Interactive Contact Loss (BICL), which enforce authentic contact in the performer-instrument interplay.
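For intuition, a minimal sketch of such an interactive contact loss is given below: it penalizes the distance between predicted fingertip positions and the string at frames annotated as in contact (a bow-side term can be formed analogously). The tensor names, shapes, and single-string simplification are illustrative assumptions for this sketch, not the authors' implementation.

import torch

def point_to_segment_distance(p, a, b):
    # Distance from points p to line segments (a, b); all shapes end in 3.
    ab = b - a
    t = ((p - a) * ab).sum(-1, keepdim=True) / ab.pow(2).sum(-1, keepdim=True).clamp(min=1e-8)
    t = t.clamp(0.0, 1.0)                      # project onto the segment
    closest = a + t * ab
    return (p - closest).norm(dim=-1)

def interactive_contact_loss(fingertips, string_a, string_b, contact_mask):
    # fingertips:   (T, F, 3) predicted fingertip positions over T frames
    # string_a/b:   (T, 3) endpoints of the relevant string per frame
    # contact_mask: (T, F) float mask, 1 where a fingertip should touch the string
    d = point_to_segment_distance(fingertips, string_a[:, None], string_b[:, None])
    return (d * contact_mask).sum() / contact_mask.sum().clamp(min=1.0)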
Moreover, to better evaluate whether the generated motions align with the semantic context of the
music audio, we design novel metrics specifically for string instrument performance motion
generation, including finger-contact distance, bow-string distance, and bowing score.
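As a rough illustration, finger-contact distance can be measured as the mean distance from each annotated contacting fingertip to the nearest point on its target string; bow-string distance follows the same idea with the bow-hair segment in place of the fingertip. The keypoint layout, shapes, and contact annotations below are assumptions made for this sketch, not the metric definitions from the paper.

import numpy as np

def finger_contact_distance(fingertips, strings, contact_mask):
    # fingertips:   (T, F, 3) generated fingertip positions
    # strings:      (T, S, 2, 3) per-frame string endpoints (nut end, bridge end)
    # contact_mask: (T, F, S) 1 where finger f should press string s at frame t
    # Returns the mean fingertip-to-string distance over annotated contacts.
    a, b = strings[..., 0, :], strings[..., 1, :]              # (T, S, 3)
    ab = b - a
    p = fingertips[:, :, None, :]                              # (T, F, 1, 3)
    t = np.clip(((p - a[:, None]) * ab[:, None]).sum(-1)
                / np.maximum((ab[:, None] ** 2).sum(-1), 1e-8), 0.0, 1.0)
    closest = a[:, None] + t[..., None] * ab[:, None]          # (T, F, S, 3)
    d = np.linalg.norm(p - closest, axis=-1)                   # (T, F, S)
    return float((d * contact_mask).sum() / max(contact_mask.sum(), 1))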
Extensive evaluations and ablation studies are conducted to validate the efficacy of the proposed methods. In addition, we put forward SPD-GEN, a motion generation dataset collated and normalized from the MoCap dataset SPD.
As demonstrated, ELGAR shows great potential for generating instrument performance motions with complex and fast interactions, which will promote further development in areas such as animation, music education, and interactive art creation.
Video
Play in Different Tempos
The model generalizes well across tempo variations, producing plausible motions for the same musical passage played at different speeds.
Test Set Sample
The model generates plausible performance motions from test audio clips in the SPD-GEN dataset.
In-the-wild Sample
The model is capable of generating plausible performance motions from in-the-wild audio beyond the curated dataset.
Retargeting
In this work, we leverage Unreal Engine to retarget motions from the SMPL-X model to alternative avatars, aiming to promote the broader applicability of motion retargeting methods to complex interactive motions.
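For readers unfamiliar with retargeting, the core idea is to transfer per-joint rotations from the source skeleton to a target skeleton through a joint-name mapping; the sketch below illustrates this conceptually. It is not the Unreal Engine pipeline used in this work, and the joint names and data layout are hypothetical.

import numpy as np

# Hypothetical mapping from SMPL-X joint names to target-avatar bone names.
JOINT_MAP = {
    "left_wrist": "hand_l",
    "right_wrist": "hand_r",
    "left_index3": "index_03_l",
    # ... remaining joints omitted for brevity
}

def retarget_frame(source_local_rotations, target_rest_pose):
    # source_local_rotations: dict joint name -> 3x3 local rotation (SMPL-X)
    # target_rest_pose:       dict bone name  -> 3x3 rest-pose rotation (avatar)
    # Returns local rotations for the target avatar for one frame.
    target = dict(target_rest_pose)            # start from the rest pose
    for src, dst in JOINT_MAP.items():
        if src in source_local_rotations:
            # layer the source joint's rotation on top of the target rest pose
            target[dst] = target_rest_pose[dst] @ source_local_rotations[src]
    return target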
BibTeX
@article{qiu2025elgar,
  title   = {ELGAR: Expressive Cello Performance Motion Generation for Audio Rendition},
  author  = {Qiu, Zhiping and Jin, Yitong and Wang, Yuan and Shi, Yi and Wang, Chongwu and Tan, Chao and Li, Xiaobing and Yu, Feng and Yu, Tao and Dai, Qionghai},
  journal = {arXiv e-prints},
  pages   = {arXiv--2505},
  year    = {2025}
}

@inproceedings{10.1145/3721238.3730756,
  author    = {Qiu, Zhiping and Jin, Yitong and Wang, Yuan and Shi, Yi and Tan, Chao and Wang, Chongwu and Li, Xiaobing and Yu, Feng and Yu, Tao and Dai, Qionghai},
  title     = {ELGAR: Expressive Cello Performance Motion Generation for Audio Rendition},
  year      = {2025},
  isbn      = {9798400715402},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3721238.3730756},
  doi       = {10.1145/3721238.3730756},
  booktitle = {Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers},
  articleno = {54},
  numpages  = {9},
  keywords  = {Motion Generation, Musical Instrument Performance},
  series    = {SIGGRAPH Conference Papers '25}
}