Improving piano music signal recognition through enhanced frequency domain analysis

Feature extraction is a crucial component in the analysis of piano music signals. This article introduced three methods for feature extraction based on frequency domain analysis, namely short-time Fourier transform (STFT), linear predictive cepstral coefficient (LPCC), and Mel-frequency cepstral coefficient (MFCC). An improvement was then made to the MFCC: the inverse MFCC (IMFCC) and the mid-frequency MFCC (MidMFCC) were combined with it, and the Fisher criterion was used to select the 12 orders with the largest Fisher ratio. These orders were combined into the F-MFCC feature, which was used to recognize 88 single piano notes through a support vector machine. The results indicated that, compared with the STFT and LPCC, the MFCC exhibited superior performance in recognizing piano music signals, with an accuracy rate of 78.03 % and an F1 value of 85.92 %. The proposed F-MFCC achieved an accuracy rate of 90.91 %, an improvement of 12.88 percentage points over the MFCC alone. These findings provide evidence for the effectiveness of the designed F-MFCC feature for piano music signal recognition as well as its potential application in practical music signal analysis.


Introduction
With the continuous development of computer technology, music can be stored and produced through computers [1], making the analysis and processing of music more convenient. Compared with speech signals, music signals have a richer timbre and more complex frequency variations. Therefore, the analysis and processing methods for speech signals are not completely applicable to music signals. The analysis and processing of music signals can support tasks such as music information retrieval and music genre classification, making it a highly important research direction in the field of music [2]. Numerous methods have already been applied [3], such as deep learning (DL) [4], convolutional neural networks [5], and deep neural networks [6]. Li et al. [7] designed a supervised robust non-negative matrix factorization method to enhance the separation performance of instrumental music signals, such as piano and trombone. Experimental results demonstrated that this method yielded better separation effects compared to traditional approaches. Waghmare et al. [8] conducted a study on the classification and labeling of Indian music, proposed that Mel-frequency cepstral coefficients (MFCCs) can provide timbre information, and demonstrated the effectiveness of this method through experimental analysis. O'Brien et al. [9] conducted a study on the transcription of polyphonic music and proposed a probabilistic latent component analysis model. Their experiments demonstrated that this method effectively decomposed the signal into distinct hierarchical smooth structures, resulting in high-quality transcriptions. Hashemi et al. [10] introduced a DL-based approach for separating Persian musical sources and found that it performed well in isolating two audio sources and could also be applied to various audio sources and to combinations of more than two audio sources.
Feature extraction is an important step in music signal analysis and can generally be divided into two types: time domain and frequency domain. The traditional approach to music signal processing often emphasizes temporal characteristics, whereas the piano, being a polyphonic instrument, presents more intricate musical signals and larger volumes of temporal data. Compared with time domain analysis, frequency domain analysis of piano music has a smaller computational load and expresses musical information better. Currently, there is a dearth of research on piano music signal recognition, and the applicability of conventional speech signal analysis methods in this context is also limited. Therefore, this paper focused on the signal of piano music and extracted its features through frequency domain analysis. Several different features in the frequency domain were compared, and an improved MFCC feature was designed. Taking the recognition of 88 single piano notes as an example, the performance of the extracted features in recognizing piano music signals was demonstrated.
The features extracted by the proposed method effectively represent the information embedded in piano music signals and emphasize crucial details to enhance recognition accuracy. Research on feature extraction, rather than on recognition algorithms, contributes to improving the interpretability of features. It reduces feature dimensions while preserving essential musical information, thus alleviating the computational burden of subsequent recognition algorithms and enabling their adaptation to complex music signal environments. Consequently, this directly enhances system performance. This work provides a novel approach for analyzing and processing music signals and promotes further advancements in digital music. The proposed features can be applied to the signal recognition of other musical instruments, which in turn can be extended to the field of speech signal processing.

Frequency analysis-based feature extraction
In music signal analysis and processing tasks, features must first be extracted from the signal to support subsequent research. Time domain analysis, such as short-time energy and zero-crossing rate [11], involves a large amount of data in computation; therefore, frequency domain analysis is more commonly used in signal analysis.
(1) Short-time Fourier transform (STFT). STFT is a common feature extraction method based on frequency domain analysis [12], widely used in audio signal processing. It analyzes the time-frequency distribution of local signals to obtain the patterns of amplitude variation in the signal. The calculation formula is:
$$X(n, \omega) = \sum_{m=-\infty}^{\infty} x(m)\, w(n - m)\, e^{-j\omega m},$$
where $x(m)$ represents an input signal, $w(m)$ is a window function, $w(n - m)$ is a sliding window, and $n$ is the step length of Fourier transform.
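As a minimal sketch of the STFT described above, the following Python code slides a window over a signal and takes a discrete Fourier transform of each windowed frame. The Hann window, the frame length of 64, the hop of 32, and the function names are illustrative assumptions, not settings taken from this paper; a naive DFT is used so that the example needs only the standard library.

```python
import cmath
import math


def hann(length):
    """Hann window of the given length (an assumed choice of window)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * m / (length - 1)) for m in range(length)]


def stft(signal, frame_len=64, hop=32):
    """Return one magnitude spectrum per frame, for bins 0 .. frame_len // 2."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        # Signal segment multiplied by the window positioned at this frame
        seg = [signal[start + m] * window[m] for m in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            # Naive DFT of the windowed segment; an FFT would be used in practice
            acc = sum(seg[m] * cmath.exp(-2j * math.pi * k * m / frame_len)
                      for m in range(frame_len))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Toy input: a sinusoid completing 8 cycles per 64-sample frame
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
frames = stft(tone)
peak_bin = max(range(len(frames[0])), key=lambda k: frames[0][k])
```

For this toy tone, the largest magnitude in each frame falls in bin 8, which is the kind of local amplitude pattern the STFT feature captures.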
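(2) Linear predictive cepstral coefficient (LPCC). LPCC is a feature based on linear predictive coefficients (LPC) [13], which has certain advantages in suppressing low-frequency and high-frequency noise. It is assumed that after the LPC analysis of signal $x(n)$, the obtained system transfer function is written as:
$$H(z) = \frac{1}{1 - \sum_{i=1}^{p} a_i z^{-i}},$$
where $p$ denotes the model order and $a_i$ is a real number. A $p$-order linear predictor is defined as:
$$F(z) = \sum_{i=1}^{p} a_i z^{-i}.$$
The current sample is predicted using the previous $p$ samples. The predictive value is:
$$\hat{x}(n) = \sum_{i=1}^{p} a_i\, x(n - i).$$
Then, the error function is obtained: $e(n) = x(n) - \hat{x}(n)$. The coefficients $a_i$ that minimize the mean square prediction error are known as the LPC.
After obtaining $a_i$, the cepstrum is obtained by using the following recursion formula:
$$c_1 = a_1, \qquad c_n = a_n + \sum_{k=1}^{n-1} \frac{k}{n}\, c_k\, a_{n-k} \quad (1 < n \le p), \qquad c_n = \sum_{k=n-p}^{n-1} \frac{k}{n}\, c_k\, a_{n-k} \quad (n > p),$$
where $c_n$ is the LPCC.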
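The LPC-to-cepstrum recursion can be sketched in a few lines of Python. This is an illustration under the assumption that the transfer function is $H(z) = 1 / (1 - \sum_i a_i z^{-i})$; with the opposite sign convention for the coefficients, the signs in the recursion change. The function name `lpc_to_lpcc` is hypothetical.

```python
def lpc_to_lpcc(a, n_ceps):
    """Convert LPC coefficients a_1..a_p (stored as a[0..p-1]) into the
    cepstral coefficients c_1..c_n_ceps of H(z) = 1 / (1 - sum_i a_i z^-i)."""
    p = len(a)
    c = []
    for n in range(1, n_ceps + 1):
        # The direct term a_n only exists while n does not exceed the model order
        acc = a[n - 1] if n <= p else 0.0
        # Sum over k = max(1, n - p) .. n - 1 of (k / n) * c_k * a_(n-k)
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c


# Sanity check with a first-order predictor: ln(1 / (1 - a z^-1)) has
# series coefficients a^n / n, so c_n should equal 0.5 ** n / n here.
lpcc = lpc_to_lpcc([0.5], 3)
```

The first-order case is a convenient check because its cepstrum is known in closed form.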
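(3) Mel-frequency cepstral coefficient (MFCC). MFCC is a feature that references the characteristics of human auditory perception [14]. The relationship between Mel frequency $f_{\mathrm{Mel}}$ and linear frequency $f$ is written as:
$$f_{\mathrm{Mel}} = 2595 \log_{10}\left(1 + \frac{f}{700}\right).$$
The preprocessed time-domain signal $x(n)$ is transformed to the frequency domain through fast Fourier transform (FFT):
$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi n k / N}, \quad 0 \le k \le N - 1,$$
where $N$ stands for the number of points in Fourier transform. Then, the spectrum is smoothed through $M$ triangular bandpass filters to obtain output response $H_m(k)$. The energy output of every filter is calculated, and the logarithm is taken. Then:
$$S(m) = \ln\left(\sum_{k=0}^{N/2} |X(k)|^2\, H_m(k)\right), \quad 1 \le m \le M.$$
Discrete cosine transform (DCT) is performed on $S(m)$ to obtain MFCC:
$$C(l) = \sum_{m=1}^{M} S(m) \cos\left(\frac{\pi l (m - 0.5)}{M}\right), \quad l = 1, 2, \cdots, L,$$
where $L$ stands for the order of MFCC.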
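The MFCC steps (Mel-spaced triangular filters, log filter energies, DCT) can be sketched as follows. This is a simplified standard-library illustration, not the implementation used in this study: the 24 filters, the 512-point FFT, the bin-rounding rule, and all function names are assumptions, and the power spectrum is passed in directly instead of being computed from audio.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    n_bins = n_fft // 2 + 1
    top = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 edge frequencies, mapped back to FFT bin indices
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * f / sample_rate)) for f in edges]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        row = [0.0] * n_bins
        for k in range(lo, mid):          # rising slope
            row[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):          # falling slope, peak of 1 at mid
            row[k] = (hi - k) / (hi - mid)
        bank.append(row)
    return bank


def mfcc_from_power(power, bank, n_ceps=12):
    """Log filter-bank energies followed by a DCT; returns n_ceps coefficients."""
    n_filters = len(bank)
    # A small constant guards against log(0) for an empty filter
    log_e = [math.log(sum(h * p for h, p in zip(row, power)) + 1e-12) for row in bank]
    return [sum(log_e[m] * math.cos(math.pi * n * (m + 0.5) / n_filters)
                for m in range(n_filters))
            for n in range(1, n_ceps + 1)]


bank = mel_filterbank(n_filters=24, n_fft=512, sample_rate=44100)
flat_power = [1.0] * (512 // 2 + 1)      # placeholder spectrum, not real audio
coeffs = mfcc_from_power(flat_power, bank, n_ceps=12)
```

The code indexes filters from 0, so the DCT term uses `m + 0.5`; this is the same transform as a 1-based sum with `m - 0.5`.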
However, MFCC has a poor ability to extract information from mid- and high-frequency audio. To address this issue, improvements need to be made to the coefficients of MFCC. Firstly, information in the high-frequency region can be captured through a reversed filter bank structure known as inverse MFCC (IMFCC) [15]. The response of the reversed filters can be expressed as $\hat{H}_i(k) = H_{M+1-i}(N/2 - k)$ for $0 \le k \le N/2$, where $M$ stands for the number of filters, $H_i(k)$ is the response of the $i$-th original Mel filter, and $N$ is the number of FFT points. The relationship between the IMFCC scale and linear frequency $f$ follows [15]. The MFCC in the middle frequency region is referred to as mid-frequency MFCC (MidMFCC), and its relationship with linear frequency $f$ follows [16]. By combining MFCC, IMFCC, and MidMFCC together, it is possible to extract complete information about the high-frequency, mid-frequency, and low-frequency regions of piano audio. However, a simple combination would greatly increase the dimensionality of the features. For example, if each parameter is taken as 12 orders, the total would be 36 orders, which is not conducive to subsequent recognition and analysis of piano audio. Therefore, in order to reduce feature dimensionality, this paper applies the Fisher criterion [17] to select features from MFCC+IMFCC+MidMFCC.
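The Fisher criterion determines the amount of information in each feature dimension by calculating the Fisher ratio. The corresponding formulas are:
$$r_{\mathrm{Fisher}} = \frac{\sigma_{\mathrm{between}}}{\sigma_{\mathrm{within}}},$$
$$\sigma_{\mathrm{between}} = \sum_{i=1}^{C} \left(m_k^{(i)} - m_k\right)^2,$$
$$\sigma_{\mathrm{within}} = \sum_{i=1}^{C} \left[\frac{1}{n_i} \sum_{c \in \omega_i} \left(c_k^{(i)} - m_k^{(i)}\right)^2\right],$$
where $\sigma_{\mathrm{between}}$ refers to the between-class distance, $\sigma_{\mathrm{within}}$ is the inner-class distance, $\omega_i$ refers to the $i$-th kind of piano audio feature sequence, $m_k^{(i)}$ refers to the average value of the feature parameter of the $i$-th kind of piano audio on the $k$-th dimension, $m_k$ refers to the mean value of the $k$-th dimensional feature on all classes, and $c_k^{(i)}$ is the $k$-th component of the $i$-th kind of piano audio feature sequence. Here $C$ is the number of classes and $n_i$ is the number of feature vectors in $\omega_i$.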
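A small Python sketch of the Fisher ratio for a single feature dimension is given below. It assumes the ratio is the between-class distance divided by the inner-class distance, and that the mean over all classes is taken as the mean of the class means (the two coincide when classes have equal size, as in a balanced data set). The function name and the toy data are hypothetical.

```python
def fisher_ratio(classes, k):
    """classes: list of classes, each a list of feature vectors.
    Returns the Fisher ratio of feature dimension k."""
    class_means = [sum(v[k] for v in vecs) / len(vecs) for vecs in classes]
    overall = sum(class_means) / len(class_means)
    # Between-class distance: spread of the class means around the overall mean
    between = sum((m - overall) ** 2 for m in class_means)
    # Inner-class distance: average squared deviation inside each class, summed
    within = sum(
        sum((v[k] - m) ** 2 for v in vecs) / len(vecs)
        for vecs, m in zip(classes, class_means)
    )
    return between / within


# Toy data: dimension 0 separates the two classes, dimension 1 does not
toy = [
    [[0.0, 1.0], [2.0, 3.0]],      # class A
    [[10.0, 1.0], [12.0, 3.0]],    # class B
]
ratio_dim0 = fisher_ratio(toy, 0)  # 25.0
ratio_dim1 = fisher_ratio(toy, 1)  # 0.0
```

A dimension with a larger ratio separates the classes better relative to its spread inside each class, which is why the dimensions with the largest ratios are retained.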
The Fisher ratio of MFCC, IMFCC, and MidMFCC is calculated, and the results are presented in Fig. 1.
The top 12 dimensions with the largest Fisher ratio in Fig. 1 are extracted as features for subsequent piano music recognition. For MFCC, the chosen orders include 1, 4, 5, 6, 7, and 10. For IMFCC, the chosen orders include 5, 6, and 9. For MidMFCC, the chosen orders include 6, 7, and 8.
The features extracted by the Fisher criterion are referred to as F-MFCC. The process of extracting F-MFCC is illustrated in Fig. 2.
After 12 orders of MFCC, IMFCC, and MidMFCC are extracted from the piano music signal, 12 orders of feature dimension with the largest Fisher ratio are selected using the Fisher criterion to obtain F-MFCC as the feature input of the subsequent piano music signal recognition.
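To make the selection step concrete, the sketch below picks the 12 reported orders out of a 36-dimensional vector. The orders are those listed in the text; the concatenation order MFCC, IMFCC, MidMFCC and the helper names are assumptions made for illustration.

```python
# Orders reported in the text (1-based within each 12-order block)
SELECTED = {"MFCC": [1, 4, 5, 6, 7, 10], "IMFCC": [5, 6, 9], "MidMFCC": [6, 7, 8]}
BLOCKS = ["MFCC", "IMFCC", "MidMFCC"]   # assumed concatenation order
N_ORDERS = 12


def f_mfcc_indices():
    """0-based positions of the selected orders in the 36-dimensional vector."""
    idx = []
    for b, name in enumerate(BLOCKS):
        idx.extend(b * N_ORDERS + (order - 1) for order in SELECTED[name])
    return idx


def to_f_mfcc(vec36):
    """Reduce a 36-dimensional MFCC+IMFCC+MidMFCC vector to 12 dimensions."""
    return [vec36[i] for i in f_mfcc_indices()]


indices = f_mfcc_indices()
reduced = to_f_mfcc(list(range(36)))    # placeholder vector 0..35
```

Six of the twelve retained dimensions come from the MFCC block and three each from the IMFCC and MidMFCC blocks.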

Piano music signal recognition method
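For the recognition of piano music signals, this paper employs the support vector machine (SVM) method. SVM is a statistical learning-based approach that offers effective solutions for nonlinearity and the curse of dimensionality [18]. It has a simple structure and high flexibility [19], and it has been widely utilized in image classification, data prediction, and other domains [20]. It is assumed that there is a linearly separable sample set $(x_i, y_i)$, $i = 1, 2, \cdots, n$, $y_i \in \{+1, -1\}$. The equation of the classification plane can be written as $w \cdot x + b = 0$, satisfying:
$$y_i (w \cdot x_i + b) - 1 \ge 0.$$
Moreover, the classification plane that minimizes $\|w\|$ is optimal. The Lagrange function is defined:
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[y_i (w \cdot x_i + b) - 1\right],$$
where $\alpha_i$ is a Lagrange coefficient. By setting the derivatives with respect to $w$ and $b$ to zero, the original problem can be transformed into a dual problem. Under the condition of $\sum_{i=1}^{n} \alpha_i y_i = 0$, $\alpha_i \ge 0$, the following problem is solved:
$$\max_{\alpha} Q(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j).$$
If there is an optimal solution $\alpha^*$, then $w^* = \sum_{i=1}^{n} \alpha_i^* y_i x_i$. The optimal classification function is written as $f(x) = \mathrm{sgn}\left(\sum_{i=1}^{n} \alpha_i^* y_i (x_i \cdot x) + b^*\right)$.
In the selection of the kernel function, the Gaussian kernel function is used: $K(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / \sigma^2\right)$, where $\sigma$ refers to the kernel function parameter.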
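The kernelized decision function can be illustrated with a short Python sketch. The kernel is assumed here to have the form $\exp(-\|x - y\|^2 / \sigma^2)$; other texts scale the exponent by $2\sigma^2$ or use a parameter $\gamma$, so the meaning of a given parameter value depends on that convention. The support vectors, multipliers, and bias below are made-up toy values, not a trained model.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Gaussian kernel under the assumed form exp(-||x - y||^2 / sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / sigma ** 2)


def decide(x, support_vectors, labels, alphas, bias, sigma):
    """Sign of the kernel expansion sum_i alpha_i * y_i * K(x_i, x) + b."""
    score = sum(a * y * gaussian_kernel(sv, x, sigma)
                for sv, y, a in zip(support_vectors, labels, alphas))
    return 1 if score + bias >= 0 else -1


# Toy model with one support vector per class (hypothetical values)
svs = [[0.0], [1.0]]
ys = [1, -1]
alphas = [1.0, 1.0]
near_positive = decide([0.1], svs, ys, alphas, bias=0.0, sigma=0.5)
near_negative = decide([0.9], svs, ys, alphas, bias=0.0, sigma=0.5)
```

Replacing the inner product in the linear classification function with the kernel is what allows the same dual formulation to handle samples that are not linearly separable.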

Result and analysis
Data for eighty-eight single tones were collected from a regular piano, with a sampling rate of 44,100 Hz and a sampling time of 5 s. A total of ten sets were recorded, resulting in 880 samples. The collected data were saved in .wav format, with 70 % allocated to the training set and 30 % to the test set. In the SVM, the parameter of the kernel function was determined through grid search and ultimately set to 0.5. The recognition performance of different features was evaluated based on the confusion matrix (Table 1), with the following evaluation indicators:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
where $TP$, $FP$, $FN$, and $TN$ are the numbers of true positives, false positives, false negatives, and true negatives in the confusion matrix. The final results were obtained by averaging the results from the 88 single tones.
Table 2 shows the recognition accuracy of the SVM method for the training set. From Table 2, it can be observed that in the ten experiments, the SVM method achieved a recognition accuracy of approximately 90 % on the training set, with an average value of 90.07 %. This result indicated that the SVM method exhibited excellent precision in recognizing the training set.
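The evaluation indicators and the averaging over tones can be sketched as follows. The formulas are the standard confusion-matrix definitions, which is an assumption about what the paper used, and the counts in the example are invented solely to show the arithmetic.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


def average_over_classes(counts):
    """Average each indicator over a list of per-class (tp, fp, fn, tn) tuples."""
    per_class = [metrics(*c) for c in counts]
    return tuple(sum(m[i] for m in per_class) / len(per_class) for i in range(4))


# Hypothetical counts for two tones, not taken from the experiments
example_counts = [(8, 2, 2, 88), (9, 1, 1, 89)]
single = metrics(8, 2, 2, 88)
averaged = average_over_classes(example_counts)
```

Averaging per-class indicators in this way gives every tone the same weight, regardless of how often it occurs in the test set.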
Firstly, the impact of the feature dimensionality selected by the Fisher criterion on recognition accuracy was analyzed using the test set. The results are presented in Table 3. From Table 3, it can be observed that an accuracy of 87.64 % was achieved when selecting the top 6 dimensions by Fisher ratio as input. When choosing the top 12 dimensions by Fisher ratio, the accuracy increased to 90.91 %, an improvement of 3.27 percentage points over the 6-dimensional case. However, as the dimensionality continued to increase, the accuracy gradually declined. Therefore, in subsequent experiments, the top 12 dimensions by Fisher ratio were selected as F-MFCC and input into the SVM method for recognition.
A comparison was made among three frequency domain analysis methods, STFT, LPCC, and MFCC, using the test set. They all used the SVM method to recognize the 88 piano single-tone signals, and the results are presented in Table 4. According to the results in Table 4, the recognition performance of the STFT, LPCC, and MFCC was calculated and presented in Table 5. From Table 5, it can be observed that among the three frequency domain analysis-based features, the STFT performed the worst in recognizing piano music signals, with an accuracy rate of only 59.09 % and an F1 value of 69.14 %. When using LPCC as the feature input for the SVM method, the accuracy rate for recognizing piano single-tone signals reached 74.24 %, an improvement of 15.15 percentage points over the STFT. The F1 value also increased to 81.01 %, an improvement of 11.87 percentage points over the STFT. The MFCC achieved an accuracy rate of 78.03 % in single-tone recognition, a 3.79 percentage point improvement over the LPCC. The recall rate and precision of the MFCC were 82.71 % and 89.39 %, respectively, both higher than those of the STFT and LPCC. The F1 value was 85.92 %, an increase of 4.91 percentage points over the LPCC. These results indicated that among the three frequency domain features compared, the MFCC performed the best in recognizing piano music signals.
Then, the MFCC feature was further analyzed. The recognition results of the MFCC, IMFCC, and MidMFCC were compared (Table 6). According to Table 6, the recognition performance of the MFCC, IMFCC, and MidMFCC was calculated, and the results are shown in Table 7. From Table 7, it can be observed that compared to the MFCC, the IMFCC and MidMFCC contained less information as they were based on the recomputation of the MFCC. Therefore, when used alone, their recognition performance was inferior to the MFCC. The F1 values of the IMFCC and MidMFCC were 84.16 % and 85.43 %, respectively, both lower than that of the MFCC. The accuracy comparison between different combinations of the MFCC, IMFCC, and MidMFCC is shown in Fig. 3.
According to Fig. 3, when the MFCC, IMFCC, and MidMFCC were combined pairwise, there was no significant improvement in recognition accuracy compared to the MFCC. When all three features (MFCC+IMFCC+MidMFCC) were used as input for the SVM, the recognition accuracy dropped to 71.32 %, a decrease of 6.71 percentage points compared to using only the MFCC. The results demonstrated that an excessive number of feature dimensions could result in a decrease in recognition performance.

Fig. 3. The accuracy of piano music signal recognition using different MFCC features
The F-MFCC was used as the SVM input and compared with the MFCC. The results of piano music signal recognition are shown in Fig. 4. From Fig. 4, it can be observed that when using the F-MFCC as the feature, the SVM demonstrated a significant improvement in recognizing piano music signals. In terms of accuracy, the F-MFCC achieved 90.91 %, an increase of 12.88 percentage points over the MFCC. In terms of the recall rate, the F-MFCC achieved 95.71 %, an increase of 13 percentage points over the MFCC. The F-MFCC achieved a precision of 93.06 %, a 3.67 percentage point improvement over the MFCC. In terms of the F1 value, the F-MFCC achieved 94.37 %, an 8.45 percentage point improvement over the MFCC. The F-MFCC selected the 12 orders of MFCC parameters with the highest information content and combined them as the feature input for the SVM, thereby achieving improved performance in recognizing piano music signals.
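As a quick arithmetic check of the figures quoted above, the following sketch recomputes the F1 values from the reported precision and recall and the gains as differences in percentage points. It assumes the usual harmonic-mean definition of F1; the variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values reported in the text, in percent
mfcc = {"accuracy": 78.03, "recall": 82.71, "precision": 89.39}
f_mfcc = {"accuracy": 90.91, "recall": 95.71, "precision": 93.06}

mfcc_f1 = round(f1_score(mfcc["precision"], mfcc["recall"]), 2)        # 85.92
f_mfcc_f1 = round(f1_score(f_mfcc["precision"], f_mfcc["recall"]), 2)  # 94.37

# Gains of F-MFCC over MFCC, expressed in percentage points
gains = {name: round(f_mfcc[name] - mfcc[name], 2) for name in mfcc}
f1_gain = round(f_mfcc_f1 - mfcc_f1, 2)                                 # 8.45
```

The recomputed F1 values match the reported 85.92 % and 94.37 %, and the differences reproduce the stated gains of 12.88, 13, 3.67, and 8.45 percentage points.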
Taking the accuracy of piano music signal recognition as an example, ten-fold cross-validation was performed to obtain ten accuracy values, which were then averaged. A t-test was conducted to compare the accuracy of the F-MFCC with the other features, and the $p$ value was calculated. The recognition accuracy of the other features was lower than that of the F-MFCC, as observed from Table 8. Through comparison, it can be concluded that there was a significant difference between the accuracy obtained from the F-MFCC and the other features ($p$ < 0.05). This result demonstrated the distinct advantage of the F-MFCC in piano music signal recognition.
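The comparison of ten fold-wise accuracies can be sketched with a paired t statistic. The text does not state whether a paired or an independent test was used, so the paired form here is an assumption, and the fold accuracies below are made-up placeholder values, not results from this study. Instead of computing a $p$ value, the statistic is compared with 2.262, the two-sided 5 % critical value of Student's t distribution with 9 degrees of freedom.

```python
import math
import statistics

T_CRIT_DF9 = 2.262  # two-sided 5 % critical value, 9 degrees of freedom


def paired_t(a, b):
    """Paired t statistic for two equally long lists of fold accuracies."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical fold accuracies for two features (placeholders only)
feature_a = [0.91, 0.90, 0.92, 0.89, 0.91, 0.90, 0.92, 0.91, 0.90, 0.93]
feature_b = [0.78, 0.77, 0.80, 0.76, 0.79, 0.77, 0.78, 0.79, 0.76, 0.80]

t_value = paired_t(feature_a, feature_b)
significant = abs(t_value) > T_CRIT_DF9   # corresponds to p < 0.05
```

With ten folds there are nine degrees of freedom, so exceeding the critical value in absolute terms is equivalent to the $p$ < 0.05 criterion used in the text.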

Discussion
Music signal processing has extensive practical applications, such as audio content recognition and analysis, as well as the enhancement and noise reduction of audio. In the field of music composition, music signal processing makes it possible to synthesize virtual instruments and achieve automated note and melody recognition, thereby enhancing the intelligence of music creation. Feature extraction plays a crucial role in music signal processing as it directly affects the recognition and classification results of musical signals. Therefore, this paper focuses on improving the effectiveness of piano music signal recognition through improved frequency domain analysis.
MFCC is a commonly used feature in signal processing. This paper further enhanced the extraction of piano audio features by incorporating IMFCC and MidMFCC on the basis of MFCC. Subsequently, the Fisher criterion was applied to filter the obtained features, resulting in the F-MFCC feature. Through the recognition experiment on 88 individual piano notes, it can be observed that compared to the STFT and LPCC, the MFCC exhibited better performance in recognizing piano music signals. MFCC is a feature that aligns more closely with human auditory characteristics; the SVM method based on MFCC therefore achieved higher accuracy in single-note recognition, which supports the choice of MFCC as the basis for further research.
When comparing the MFCC, IMFCC, and MidMFCC, it can be observed that neither the IMFCC nor the MidMFCC performed as well as the MFCC in single-tone recognition. Additionally, when combined pairwise, they also did not achieve higher recognition accuracy. Surprisingly, when all 36-dimensional features of the MFCC, IMFCC, and MidMFCC were used as inputs for SVM classification, the obtained accuracy actually decreased. This suggested that an excessive number of dimensions led to a decrease in precision. The F-MFCC features selected by the Fisher criterion achieved the highest recognition accuracy, i.e., 90.91 %. This represented a 12.88 percentage point improvement over the MFCC and demonstrated both the reliability of the designed F-MFCC as a feature for piano music signal recognition and its potential for further application in practical music signal processing.

Conclusions
This article studied the extraction of piano music signal features from the perspective of frequency domain analysis. A Fisher criterion-based F-MFCC feature was designed, and the SVM was used to recognize 88 single piano notes. From the results, it can be observed that the STFT and LPCC exhibited poor performance in recognizing piano music signals. The accuracy and F1 value of the MFCC were found to be 78.03 % and 85.92 %, respectively, which were superior to those of the STFT and LPCC. When comparing different MFCC features, it can be observed that an excessive number of feature parameters led to a decrease in recognition performance. However, the proposed F-MFCC achieved an accuracy of 90.91 % and an F1 score of 94.37 %, demonstrating significant improvements compared to the MFCC. The findings highlight the effectiveness of the proposed method and its potential for practical applications. However, this study also has some limitations. For example, it solely focuses on extracting piano music signal features while overlooking the optimization of recognition algorithms. Additionally, the size of the experimental data was relatively small. In future work, we will make further improvements and optimizations to the SVM method and conduct experiments on a wider range of data to validate the reliability of the proposed approach.

Fig. 4. The performance of F-MFCC in piano music signal recognition

Table 2. The recognition accuracy of the SVM method for the training set

Table 3. The impact of the feature dimensionality selected by Fisher's criterion on the recognition accuracy

Table 4. Recognition results of the STFT, LPCC, and MFCC

Table 5. Comparison of the recognition performance between the STFT, LPCC, and MFCC

Table 6. Recognition results of the MFCC, IMFCC, and MidMFCC

Table 7. Comparison of recognition performance between the MFCC, IMFCC, and MidMFCC

Table 8. Statistical analysis of recognition accuracy. A value of $p$ < 0.05 indicated a significant difference between the F-MFCC and the other features.