Motion capture of musical instrument performance is challenging even with markers. By extracting the playing cues inherent in the audio, our markerless, video-based method recovers subtle finger-string contacts and intricate playing movements. We further contribute the first large-scale String Performance Dataset (SPD) with high-quality motion and contact annotations.
Abstract
In this paper, we address the problem of markerless multi-modal human motion capture, with a focus on string performance, which involves inherently subtle hand-string contacts and intricate movements. To this end, we first collect a dataset, the String Performance Dataset (SPD), featuring cello and violin performances. The dataset includes videos captured from up to 23 different views, audio signals, and detailed 3D motion annotations of the body, hands, instrument, and bow. To acquire these detailed motion annotations, we propose an audio-guided multi-modal motion capture framework that explicitly incorporates hand-string contacts detected from the audio signals when solving for detailed hand poses. The framework serves as a baseline for string performance capture in a completely markerless manner, imposing no external devices on performers and thus avoiding any distortion of such delicate movements. We argue that the movements of performers, particularly their sound-producing gestures, carry subtle information that is often elusive to visual methods but can be inferred and retrieved from audio cues. Consequently, we refine the vision-based motion capture results with our audio-guided approach, simultaneously resolving the contact relationship between the performer and the instrument as deduced from the audio. We validate the proposed framework and conduct ablation studies to demonstrate its efficacy. Our results outperform current state-of-the-art vision-based algorithms, underscoring the feasibility of augmenting visual motion capture with the audio modality. To the best of our knowledge, SPD is the first dataset for musical instrument performance that covers fine-grained hand motion details in a multi-modal, large-scale collection. It holds significant implications for string instrument pedagogy, animation, and virtual concerts, as well as for musical performance analysis and generation.
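The audio cue the abstract alludes to can be made concrete with a small sketch. Under the ideal-string model, stopping a string so that a fraction r of its length vibrates raises the pitch of the open string by a factor 1/r, so a tracked fundamental frequency constrains where a finger can be pressing. The sketch below is a minimal illustration, not the authors' implementation: it uses librosa's pYIN tracker to recover f0 and maps each voiced frame to a candidate string and contact position on a cello; the input file name, pitch search range, and string-selection heuristic are assumptions.

# A minimal sketch, not the paper's method: it illustrates how a tracked
# fundamental frequency constrains a finger-string contact on a cello.
# Open-string tunings are standard; the file name and pYIN range are
# hypothetical.
import librosa

# Open-string fundamentals of a cello in Hz (C2, G2, D3, A3).
OPEN_STRINGS = {"C": 65.41, "G": 98.00, "D": 146.83, "A": 220.00}

def contact_position(f0):
    """Map a fundamental frequency to (string, fractional distance of the
    finger from the nut) under the ideal-string model."""
    candidates = {}
    for name, f_open in OPEN_STRINGS.items():
        if f0 >= f_open:                 # stopping can only raise the pitch
            r = f_open / f0              # vibrating fraction of the string
            candidates[name] = 1.0 - r   # finger distance from the nut
    if not candidates:
        return None
    # Heuristic: prefer the string played closest to the nut (low position).
    string = min(candidates, key=candidates.get)
    return string, candidates[string]

# Track f0 frame by frame with pYIN, then turn each voiced frame into a
# candidate contact that a pose solver could use as a constraint.
y, sr = librosa.load("cello_take.wav", sr=None)      # hypothetical input
f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=1000, sr=sr)
contacts = [contact_position(f) for f, v in zip(f0, voiced) if v]

In practice the same pitch is reachable on several strings, and double stops, harmonics, and vibrato break the one-note assumption; the paper's framework resolves such ambiguities by fusing audio-derived contacts with the multi-view visual evidence.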
Video
Examples in SPD
Ablation Study
Acknowledgments
We would like to extend our heartfelt gratitude to all the musicians who participated in the creation of the String Performance Dataset. Their dedication, skill, and passion have been invaluable in advancing our research. We especially thank:
Chongwu Wang, associate professor of cello, Central Conservatory of Music.
Xinghong Wang, undergraduate student (class of 2017), Orchestral Department, Central Conservatory of Music.
Ziyi Huang, undergraduate student (class of 2022), Orchestral Department, Central Conservatory of Music.
Yutong Ding, master's student (class of 2022), Orchestral Department, Central Conservatory of Music.
Shihao Yao, master's student (class of 2023), Orchestral Department, Central Conservatory of Music.
Haotian Zhou, doctoral student (class of 2020), Department of Music AI and Music Information Technology, Central Conservatory of Music.
Yuan Wang, doctoral student (class of 2022), Department of Music AI and Music Information Technology, Central Conservatory of Music.
Yuetonghui Xu, doctoral student (class of 2022), Department of Music AI and Music Information Technology, Central Conservatory of Music.
Yitong Jin, doctoral student (class of 2021), Department of Music AI and Music Information Technology, Central Conservatory of Music.
Without their contributions, this project would not have been possible.
BibTeX
@article{jin2024audio,
  title={Audio Matters Too! Enhancing Markerless Motion Capture with Audio Signals for String Performance Capture},
  author={Jin, Yitong and Qiu, Zhiping and Shi, Yi and Sun, Shuangpeng and Wang, Chongwu and Pan, Donghao and Zhao, Jiachen and Liang, Zhenghao and Wang, Yuan and Li, Xiaobing and others},
  journal={arXiv preprint arXiv:2405.04963},
  year={2024}
}
@article{10.1145/3658235,
  author = {Jin, Yitong and Qiu, Zhiping and Shi, Yi and Sun, Shuangpeng and Wang, Chongwu and Pan, Donghao and Zhao, Jiachen and Liang, Zhenghao and Wang, Yuan and Li, Xiaobing and Yu, Feng and Yu, Tao and Dai, Qionghai},
  title = {Audio Matters Too! Enhancing Markerless Motion Capture with Audio Signals for String Performance Capture},
  year = {2024},
  issue_date = {July 2024},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  volume = {43},
  number = {4},
  issn = {0730-0301},
  url = {https://doi.org/10.1145/3658235},
  doi = {10.1145/3658235},
  abstract = {In this paper, we touch on the problem of markerless multi-modal human motion capture especially for string performance capture which involves inherently subtle hand-string contacts and intricate movements. To fulfill this goal, we first collect a dataset, named String Performance Dataset (SPD), featuring cello and violin performances. The dataset includes videos captured from up to 23 different views, audio signals, and detailed 3D motion annotations of the body, hands, instrument, and bow. Moreover, to acquire the detailed motion annotations, we propose an audio-guided multi-modal motion capture framework that explicitly incorporates hand-string contacts detected from the audio signals for solving detailed hand poses. This framework serves as a baseline for string performance capture in a completely markerless manner without imposing any external devices on performers, eliminating the potential of introducing distortion in such delicate movements. We argue that the movements of performers, particularly the sound-producing gestures, contain subtle information often elusive to visual methods but can be inferred and retrieved from audio cues. Consequently, we refine the vision-based motion capture results through our innovative audio-guided approach, simultaneously clarifying the contact relationship between the performer and the instrument, as deduced from the audio. We validate the proposed framework and conduct ablation studies to demonstrate its efficacy. Our results outperform current state-of-the-art vision-based algorithms, underscoring the feasibility of augmenting visual motion capture with audio modality. To the best of our knowledge, SPD is the first dataset for musical instrument performance, covering fine-grained hand motion details in a multi-modal, large-scale collection. It holds significant implications and guidance for string instrument pedagogy, animation, and virtual concerts, as well as for both musical performance analysis and generation. Our code and SPD dataset are available at https://github.com/Yitongishere/string_performance.},
  journal = {ACM Trans. Graph.},
  month = {jul},
  articleno = {90},
  numpages = {10},
  keywords = {marker-less motion capture, string performance, multi-modality}
}