ELGAR: Expressive CeLlo Performance Motion Generation for Audio Rendition

Zhiping Qiu1,2     Yitong Jin1,2     Yuan Wang2     Yi Shi1,2     Chongwu Wang2    
Chao Tan3     Xiaobing Li2     Feng Yu2     Tao Yu1,✉     Qionghai Dai1,✉
1Tsinghua University      2Central Conservatory of Music      3Weilan Tech, Beijing
✉ Corresponding authors
TL;DR Generating string instrument performances with intricate movements and complex interactions poses significant challenges. To address them, we present ELGAR, the first framework for whole-body instrument performance motion generation solely from audio. We further contribute novel losses, metrics, and a dataset, marking a first attempt at this emerging task with promising results.

Abstract

The art of instrument performance stands as a vivid manifestation of human creativity and emotion. Nonetheless, generating instrument performance motions is a highly challenging task, as it requires not only capturing intricate movements but also reconstructing the complex dynamics of the performer-instrument interaction. While existing works primarily focus on modeling partial body motions, we propose Expressive ceLlo performance motion Generation for Audio Rendition (ELGAR), a state-of-the-art diffusion-based framework for whole-body, fine-grained instrument performance motion generation solely from audio. To emphasize the interactive nature of instrument performance, we introduce a Hand Interactive Contact Loss (HICL) and a Bow Interactive Contact Loss (BICL), which effectively guarantee the authenticity of the performer-instrument interplay. Moreover, to better evaluate whether the generated motions align with the semantic context of the music audio, we design novel metrics specifically for string instrument performance motion generation, including finger-contact distance, bow-string distance, and bowing score. Extensive evaluations and ablation studies are conducted to validate the efficacy of the proposed methods. In addition, we put forward a motion generation dataset, SPD-GEN, collated and normalized from the MoCap dataset SPD. As demonstrated, ELGAR shows great potential for generating instrument performance motions with complicated and fast interactions, which will promote further development in areas such as animation, music education, and interactive art creation.
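For readers who wish to reproduce contact-aware evaluation of this kind, the sketch below shows how a finger-contact distance and a bow-string distance could be computed from generated joint positions and instrument geometry. The function names, tensor shapes, and the point-to-segment formulation are illustrative assumptions rather than the exact metric definitions used in the paper.

    import numpy as np

    # Illustrative sketch only: contact-distance metrics in the spirit of the
    # finger-contact and bow-string distances described above. Shapes, names,
    # and the point-to-segment formulation are assumptions, not the paper's code.

    def point_to_segment_distance(p, a, b):
        # Distance from points p with shape (..., 3) to the segment a-b (each (3,)).
        ab = b - a
        t = np.clip(np.einsum('...i,i->...', p - a, ab) / np.dot(ab, ab), 0.0, 1.0)
        closest = a + t[..., None] * ab
        return np.linalg.norm(p - closest, axis=-1)

    def finger_contact_distance(fingertips, string_ends):
        # fingertips:  (T, F, 3) fingertip positions over T frames.
        # string_ends: (S, 2, 3) endpoints (nut side, bridge side) of S strings.
        dists = np.stack([point_to_segment_distance(fingertips, a, b)
                          for a, b in string_ends], axis=-1)      # (T, F, S)
        return dists.min(axis=-1).mean()                          # nearest string, averaged

    def bow_string_distance(bow_samples, string_ends):
        # bow_samples: (T, K, 3) points sampled along the bow hair per frame.
        dists = np.stack([point_to_segment_distance(bow_samples, a, b)
                          for a, b in string_ends], axis=-1)      # (T, K, S)
        return dists.min(axis=(-1, -2)).mean()                    # closest hair point to any string

In such a formulation, smaller values would indicate tighter performer-instrument contact in the generated motion.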

Video

Play in Different Tempos

The model generalizes well across tempo variations, generating plausible performance motions for the same musical passage played at different speeds.

Test Set Sample

The model generates plausible performance motions from test audio clips in the SPD-GEN dataset.

In-the-wild Sample

The model is capable of generating plausible performance motions from in-the-wild audio beyond the curated dataset.

Retargeting

In this work, we leverage Unreal Engine to retarget the generated motions from the SMPL-X model to alternative avatars, aiming to promote the broader applicability of motion retargeting methods to complex interactive motions.

BibTeX



    @article{qiu2025elgar,
      title={ELGAR: Expressive Cello Performance Motion Generation for Audio Rendition},
      author={Qiu, Zhiping and Jin, Yitong and Wang, Yuan and Shi, Yi and Wang, Chongwu and Tan, Chao and Li, Xiaobing and Yu, Feng and Yu, Tao and Dai, Qionghai},
      journal={arXiv e-prints},
      pages={arXiv--2505},
      year={2025}
    }


    @inproceedings{10.1145/3721238.3730756,
      author = {Qiu, Zhiping and Jin, Yitong and Wang, Yuan and Shi, Yi and Tan, Chao and Wang, Chongwu and Li, Xiaobing and Yu, Feng and Yu, Tao and Dai, Qionghai},
      title = {ELGAR: Expressive Cello Performance Motion Generation for Audio Rendition},
      year = {2025},
      isbn = {9798400715402},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      url = {https://doi.org/10.1145/3721238.3730756},
      doi = {10.1145/3721238.3730756},
      booktitle = {Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers},
      articleno = {54},
      numpages = {9},
      keywords = {Motion Generation, Musical Instrument Performance},
      series = {SIGGRAPH Conference Papers '25}
    }