Article

Late multimodal fusion for image and audio music transcription

Journal

EXPERT SYSTEMS WITH APPLICATIONS
Volume 216

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.eswa.2022.119491

Keywords

Optical Music Recognition; Automatic Music Transcription; Multimodality; Deep learning; Connectionist Temporal Classification; Sequence labeling; Word graphs

Abstract

Music transcription, which deals with the conversion of music sources into a structured digital format, is a key problem for Music Information Retrieval (MIR). When addressing this challenge computationally, the MIR community follows two lines of research depending on the input data: music documents, the domain of Optical Music Recognition (OMR), and audio recordings, the domain of Automatic Music Transcription (AMT). The different nature of these inputs has led the two fields to develop modality-specific frameworks. However, their recent formulation as sequence labeling tasks leads to a common output representation, which enables research on a combined paradigm. In this respect, multimodal image and audio music transcription poses the challenge of effectively combining the information conveyed by the image and audio modalities. In this work, we explore this question at the late-fusion level: we study four combination approaches in order to merge, for the first time, the hypotheses of end-to-end OMR and AMT systems in a lattice-based search space. The results obtained for a series of performance scenarios, in which the corresponding single-modality models yield different error rates, showed the benefits of these approaches. In addition, two of the four strategies considered significantly improve on the corresponding standard unimodal recognition frameworks.
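
To illustrate the general idea of late fusion (not the specific lattice-based approaches evaluated in the paper), the Python sketch below averages the per-frame symbol posteriors of a hypothetical OMR model and a hypothetical AMT model and then applies greedy CTC decoding. All names, shapes, and the assumption that the two posterior streams are time-aligned over a shared vocabulary are illustrative; in practice, image and audio sequences differ in length, which is one reason the paper merges decoded hypotheses in a word-graph (lattice) search space instead.

# Minimal, hypothetical sketch of a generic late-fusion step: linear
# interpolation of per-frame symbol posteriors from an image (OMR) model
# and an audio (AMT) model, followed by greedy CTC decoding.
# This is NOT one of the four strategies studied in the paper.

import numpy as np

BLANK = 0  # index of the CTC blank symbol (assumed)

def greedy_ctc_decode(posteriors):
    # Take the arg-max path, collapse repeated labels, and drop blanks.
    best_path = posteriors.argmax(axis=1)
    decoded, prev = [], None
    for label in best_path:
        if label != prev and label != BLANK:
            decoded.append(int(label))
        prev = label
    return decoded

def late_fusion(post_omr, post_amt, weight=0.5):
    # Weighted average of two (frames x vocabulary) posterior matrices.
    # Assumes both streams are time-aligned and share one vocabulary;
    # the paper's lattice-based combination avoids this requirement.
    fused = weight * post_omr + (1.0 - weight) * post_amt
    return fused / fused.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(seed=0)
    frames, vocab = 20, 8  # dummy sizes for the demonstration
    post_omr = rng.dirichlet(np.ones(vocab), size=frames)  # fake OMR posteriors
    post_amt = rng.dirichlet(np.ones(vocab), size=frames)  # fake AMT posteriors
    print(greedy_ctc_decode(late_fusion(post_omr, post_amt, weight=0.6)))

Here the interpolation weight acts as a crude modality-confidence parameter; a comparable weighting idea underlies many late-fusion schemes, although the strategies compared in the paper operate on the decoded hypotheses of the OMR and AMT systems rather than on raw posteriors.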
