4.6 Article

Self-organizing speech recognition that processes acoustic and articulatory features

Journal

MULTIMEDIA TOOLS AND APPLICATIONS
Volume -, Issue -, Pages -

Publisher

SPRINGER
DOI: 10.1007/s11042-023-17080-4

Keywords

Robust speech recognition; Neural network; Self-organizing map; Acoustic-to-articulatory inversion; Articulatory feature; Acoustic feature


This article introduces an ASR model called the Self-Organizing Speech Recognizer (SOSR), which uses acoustic and articulatory features, employs unsupervised and incremental learning, and is suitable for real-time applications. SOSR learns quickly and can handle noisy signals, various speakers, different types of speech, and utterances of assorted lengths.
In automatic speech recognition (ASR) systems, minimizing the harmful effects of mismatched background noise between training and operating conditions has been a challenging task for many years. An ASR system that is robust to noise and can deal with different types of speech and various speakers remains an open research problem. Conventional ASR models for missing-feature reconstruction and robust speech descriptors typically employ acoustic features and statistical methods. Despite improved performance in dealing with noise, such methods still degrade when several background noises co-exist with the main signal. More recent approaches use neural networks, particularly deep learning models, for ASR; such models increase performance at a high training cost. To mitigate these limitations, we propose an ASR model called the Self-Organizing Speech Recognizer (SOSR). Unlike most conventional ASRs, SOSR uses acoustic and articulatory features, employs unsupervised and incremental learning, and is suitable for real-time applications thanks to its quick training stage. SOSR processes an audio signal simultaneously along two branches. In the first branch, acoustic features are extracted from the original signal, whereas in the second branch an acoustic-to-articulatory inversion is performed by several Self-Organizing Maps. The outputs of both branches are delivered to a Self-Organizing Map with a time-varying structure, which is responsible for recognizing the input speech signal. Four datasets (TIMIT, Aurora 2, Aurora 4, and CHIME 2) were used to assess SOSR. Word Error Rate (WER) was the metric chosen to compare the experimental results of tests with different noise levels and signal variations. The experimental results suggest that SOSR learns quickly and can handle noisy signals, various speakers, different types of speech, and utterances of assorted lengths.
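The Self-Organizing Map named in the abstract is the core building block of both SOSR branches. A minimal sketch of the general SOM technique is shown below; the grid size, learning-rate schedule, and toy 2-D data are illustrative assumptions, not details of the paper's model (which additionally uses a time-varying structure).

```python
import numpy as np

def train_som(data, grid_size=8, epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Train a 1-D Self-Organizing Map on `data` (n_samples x n_features)."""
    rng = np.random.default_rng(seed)
    weights = rng.random((grid_size, data.shape[1]))  # randomly initialized units
    positions = np.arange(grid_size)                  # unit coordinates on the grid
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)               # decaying learning rate
        sigma = sigma0 * (1 - epoch / epochs) + 0.5   # shrinking neighborhood width
        for x in data:
            # Best-matching unit: the unit closest to the input vector.
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            # Gaussian neighborhood: units near the BMU move more.
            h = np.exp(-((positions - bmu) ** 2) / (2 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)
    return weights

def quantize(weights, x):
    """Map an input vector to the index of its best-matching unit."""
    return int(np.argmin(np.linalg.norm(weights - x, axis=1)))

# Toy data: two well-separated clusters. After training, inputs from
# different clusters are mapped to different units of the grid.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.05, (50, 2)),
                  rng.normal(1.0, 0.05, (50, 2))])
W = train_som(data)
print(quantize(W, np.array([0.0, 0.0])), quantize(W, np.array([1.0, 1.0])))
```

In an ASR setting like the one the abstract describes, the input vectors would be frame-level acoustic (or inverted articulatory) feature vectors rather than 2-D points, and the learned map provides an unsupervised, incrementally trainable quantization of the feature space.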


