"A Sequence-to-Sequence Framework Based on Transformer With Masked Language Model for Optical Music Recognition"

Optical music recognition technology is of great significance in the development of the digital music. In recent years, the convolutional recurrent neural network framework with connectionist temporal classification has been used in music recognition. However, its loss function is calculated in serial mode, which leads to low efficiency in training and difficulty in convergence. Additionally, because of the gradient disappearance of excessive long music sequences, the existing music recognition models are hard to learn the relationships between musical symbols, resulting in high sequence error rate. Therefore, we propose a sequence-to-sequence framework based on transformer with masked language model to deal with these problems. The context representation between musical symbols can be captured further by the self-attention module in the transformer, which will reduce the sequence error rate. In addition, we refer to the masked language model and design a mask matrix to predict each musical symbol in a parallel way, so as to speed up the training process. Our experiments are carried out on the printed images of music stave dataset, and the results show that our proposed method is training-efficient and has great improvement in sequence accuracy rate.

Keywords: music; language model; music recognition; sequence; masked language

Journal Title: IEEE Access
Year Published: 2022

Link to full text (if available)

Share on Social Media: Sign Up to like & get
recommendations!
1

LAUSR

You are not signed in:

Sign Up!

Related content

More Information News Social Media Video Recommended