Motion capture of musical instrument performance is challenging even with markers. By extracting the playing cues inherent in the audio, our markerless, video-based method recovers subtle finger-string contacts and intricate playing movements. We further contribute the first large-scale String Performance Dataset (SPD) with high-quality motion and contact annotations.
Abstract
In this paper, we address the problem of markerless multi-modal human motion capture, with a focus on string performance, which involves inherently subtle hand-string contacts and intricate movements. To this end, we first collect a dataset, the String Performance Dataset (SPD), featuring cello and violin performances. The dataset includes videos captured from up to 23 different views, audio signals, and detailed 3D motion annotations of the body, hands, instrument, and bow. To acquire these detailed motion annotations, we propose an audio-guided multi-modal motion capture framework that explicitly incorporates hand-string contacts detected from the audio signals when solving for detailed hand poses. The framework serves as a baseline for string performance capture in a completely markerless manner, imposing no external devices on performers and thus avoiding any distortion of such delicate movements. We argue that the movements of performers, particularly their sound-producing gestures, carry subtle information that is often elusive to visual methods but can be inferred and retrieved from audio cues. Consequently, we refine the vision-based motion capture results with our audio-guided approach, simultaneously resolving the contact relationship between the performer and the instrument as deduced from the audio. We validate the proposed framework and conduct ablation studies to demonstrate its efficacy. Our results outperform current state-of-the-art vision-based algorithms, underscoring the feasibility of augmenting visual motion capture with the audio modality. To the best of our knowledge, SPD is the first dataset for musical instrument performance that covers fine-grained hand motion details in a multi-modal, large-scale collection. It holds significant implications for string instrument pedagogy, animation, and virtual concerts, as well as for musical performance analysis and generation.
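The audio cue the abstract alludes to can be made concrete with a small sketch. Under the ideal-string model, stopping a string so that a fraction r of its length vibrates raises the pitch of the open string by a factor 1/r, so a tracked fundamental frequency constrains where a finger can be pressing. The sketch below is a minimal illustration, not the authors' implementation: it uses librosa's pYIN tracker to recover f0 and maps each voiced frame to a candidate string and contact position on a cello; the input file name, pitch search range, and string-selection heuristic are assumptions.

# A minimal sketch, not the paper's method: it illustrates how a tracked
# fundamental frequency constrains a finger-string contact on a cello.
# Open-string tunings are standard; the file name and pYIN range are
# hypothetical.
import librosa

# Open-string fundamentals of a cello in Hz (C2, G2, D3, A3).
OPEN_STRINGS = {"C": 65.41, "G": 98.00, "D": 146.83, "A": 220.00}

def contact_position(f0):
    """Map a fundamental frequency to (string, fractional distance of the
    finger from the nut) under the ideal-string model."""
    candidates = {}
    for name, f_open in OPEN_STRINGS.items():
        if f0 >= f_open:                 # stopping can only raise the pitch
            r = f_open / f0              # vibrating fraction of the string
            candidates[name] = 1.0 - r   # finger distance from the nut
    if not candidates:
        return None
    # Heuristic: prefer the string played closest to the nut (low position).
    string = min(candidates, key=candidates.get)
    return string, candidates[string]

# Track f0 frame by frame with pYIN, then turn each voiced frame into a
# candidate contact that a pose solver could use as a constraint.
y, sr = librosa.load("cello_take.wav", sr=None)      # hypothetical input
f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=1000, sr=sr)
contacts = [contact_position(f) for f, v in zip(f0, voiced) if v]

In practice the same pitch is reachable on several strings, and double stops, harmonics, and vibrato break the one-note assumption; the paper's framework resolves such ambiguities by fusing audio-derived contacts with the multi-view visual evidence.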
Video
Examples in SPD
Ablation Study
Acknowledgments
We would like to extend our heartfelt gratitude to all the musicians who participated in the creation of the String Performance Dataset. Their dedication, skill, and passion have been invaluable in advancing our research. We especially thank:
Chongwu Wang, associate professor of cello, Central Conservatory of Music.
Xinghong Wang, undergraduate student (class of 2017), Orchestral Department, Central Conservatory of Music.
Ziyi Huang, undergraduate student (class of 2022), Orchestral Department, Central Conservatory of Music.
Yutong Ding, master's student (class of 2022), Orchestral Department, Central Conservatory of Music.
Shihao Yao, master's student (class of 2023), Orchestral Department, Central Conservatory of Music.
Haotian Zhou, doctoral student (class of 2020), Department of Music AI and Music Information Technology, Central Conservatory of Music.
Yuan Wang, doctoral student (class of 2022), Department of Music AI and Music Information Technology, Central Conservatory of Music.
Yuetonghui Xu, doctoral student (class of 2022), Department of Music AI and Music Information Technology, Central Conservatory of Music.
Yitong Jin, doctoral student (class of 2021), Department of Music AI and Music Information Technology, Central Conservatory of Music.
Without their contributions, this project would not have been possible.
BibTeX
@article{jin2024audio,
  title={Audio Matters Too! Enhancing Markerless Motion Capture with Audio Signals for String Performance Capture},
  author={Jin, Yitong and Qiu, Zhiping and Shi, Yi and Sun, Shuangpeng and Wang, Chongwu and Pan, Donghao and Zhao, Jiachen and Liang, Zhenghao and Wang, Yuan and Li, Xiaobing and others},
  journal={arXiv preprint arXiv:2405.04963},
  year={2024}
}
@article{10.1145/3658235,
  author = {Jin, Yitong and Qiu, Zhiping and Shi, Yi and Sun, Shuangpeng and Wang, Chongwu and Pan, Donghao and Zhao, Jiachen and Liang, Zhenghao and Wang, Yuan and Li, Xiaobing and Yu, Feng and Yu, Tao and Dai, Qionghai},
  title = {Audio Matters Too! Enhancing Markerless Motion Capture with Audio Signals for String Performance Capture},
  year = {2024},
  issue_date = {July 2024},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  volume = {43},
  number = {4},
  issn = {0730-0301},
  url = {https://doi.org/10.1145/3658235},
  doi = {10.1145/3658235},
  abstract = {In this paper, we touch on the problem of markerless multi-modal human motion capture especially for string performance capture which involves inherently subtle hand-string contacts and intricate movements. To fulfill this goal, we first collect a dataset, named String Performance Dataset (SPD), featuring cello and violin performances. The dataset includes videos captured from up to 23 different views, audio signals, and detailed 3D motion annotations of the body, hands, instrument, and bow. Moreover, to acquire the detailed motion annotations, we propose an audio-guided multi-modal motion capture framework that explicitly incorporates hand-string contacts detected from the audio signals for solving detailed hand poses. This framework serves as a baseline for string performance capture in a completely markerless manner without imposing any external devices on performers, eliminating the potential of introducing distortion in such delicate movements. We argue that the movements of performers, particularly the sound-producing gestures, contain subtle information often elusive to visual methods but can be inferred and retrieved from audio cues. Consequently, we refine the vision-based motion capture results through our innovative audio-guided approach, simultaneously clarifying the contact relationship between the performer and the instrument, as deduced from the audio. We validate the proposed framework and conduct ablation studies to demonstrate its efficacy. Our results outperform current state-of-the-art vision-based algorithms, underscoring the feasibility of augmenting visual motion capture with audio modality. To the best of our knowledge, SPD is the first dataset for musical instrument performance, covering fine-grained hand motion details in a multi-modal, large-scale collection. It holds significant implications and guidance for string instrument pedagogy, animation, and virtual concerts, as well as for both musical performance analysis and generation. Our code and SPD dataset are available at https://github.com/Yitongishere/string_performance.},
  journal = {ACM Trans. Graph.},
  month = {jul},
  articleno = {90},
  numpages = {10},
  keywords = {marker-less motion capture, string performance, multi-modality}
}