Relevant Courses: 04361 - 04363 - 04364 - 04365 - 04461 - 04462.
- 1. Decimation For Reducing Complexity of Speech Recognition
In speech recognition, a piece-wise stationary statistical model of speech is constructed from a sequence of so-called feature vectors. The feature vectors are the output of a speech preprocessor, which typically computes a set of coefficients every 10 ms based on a fixed window of 20-30 ms of the speech signal. The feature vectors are typically composed of LPC or so-called cepstral coefficients. Recent studies have observed that under certain conditions the stream of feature vectors from the preprocessor can be down-sampled by a factor of 2-5 without loss in recognition performance. These studies have primarily focused on so-called whole-word based speech recognition, where a separate model is created for each word in the recognition vocabulary. Unfortunately, whole-word based speech recognition has the limitation that only fairly small vocabularies are feasible in practice, as a separate model is required for each word. Therefore, more and more systems today are based on so-called sub-word models (phonemes), from which any word of a given language can be constructed.
In this project, decimation (down-sampling) of the preprocessor output for sub-word based speech recognition is investigated. Both fixed decimation, where the down-sampling factor is constant, and variable decimation, where the down-sampling factor is set according to the difference between consecutive feature vectors, can be investigated.
For evaluating the decimation techniques, a sub-word based speech recognition engine will be provided.
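The two decimation schemes can be sketched as follows (the function names and the distance threshold are illustrative assumptions, not part of the provided engine): fixed decimation keeps every k-th feature vector, while variable decimation keeps a vector only when it differs sufficiently from the last kept one.

```python
import numpy as np

def fixed_decimation(features, factor=2):
    """Keep every `factor`-th feature vector (rows of a [frames x coeffs] array)."""
    return features[::factor]

def variable_decimation(features, threshold=1.0):
    """Keep a frame only if its Euclidean distance to the last kept frame
    exceeds `threshold`; the first frame is always kept."""
    kept = [features[0]]
    for vec in features[1:]:
        if np.linalg.norm(vec - kept[-1]) > threshold:
            kept.append(vec)
    return np.array(kept)

# 100 frames of 12 cepstral-like coefficients (synthetic stand-in data)
frames = np.random.randn(100, 12)
print(fixed_decimation(frames, 2).shape)   # (50, 12)
```

With variable decimation, stationary stretches of speech (where consecutive feature vectors barely change) are thinned out aggressively, while transient regions are kept at full rate.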
- 2. Improved Preprocessing for Speech Recognition
-
- 3. Out-Of-Vocabulary Word Rejection for Speech Recognition
-
This project aims at investigating utterance rejection algorithms for small-vocabulary isolated word recognition, such as name dialing and command word recognition. In name dialing applications for portable devices like mobile phones, the ability of the speech recognizer to reject utterances is very important. If rejection is not used, an erroneous recognition may result, and consequently the phone will place a call to a wrong number. This situation is especially likely in noisy conditions, or when the user utters a name that is not part of the recognizer vocabulary (out-of-vocabulary word rejection).
Most utterance rejection algorithms are based on so-called log likelihood ratios, that is, the difference between the log-probability of the "winning" model and that of the second-best model or of a "filler" ("garbage") model. When the ratio is above some threshold, the recognition has high confidence, whereas a low ratio implies low confidence, and consequently the utterance should be rejected. Unfortunately, the threshold that gives a good trade-off between rejection and recognition is very sensitive to the signal-to-noise ratio (SNR). In some applications the SNR can be estimated only roughly, based on a single utterance. It is therefore desirable to develop a rejection measure that is less sensitive to SNR. One possibility in this direction is to use a posterior probability based measure, or to set the rejection threshold proportional to the log likelihood ratio between speech and non-speech segments of the waveform.
The project starts with a literature study of current methods, followed by an evaluation and possibly improvement of a few selected approaches. The selected approaches must be suitable primarily for so-called sub-word (phoneme) based isolated word recognition. In the project, various phoneme based speech recognition engines will be available for evaluating the developed rejection algorithms.
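The baseline confidence test can be sketched in a few lines (the scores and the threshold value below are hypothetical; in a real system the log-probabilities come from the recognizer, and the threshold is tuned on held-out data):

```python
def should_reject(log_p_best, log_p_filler, threshold=5.0):
    """Reject the utterance when the log likelihood ratio between the
    winning model and the filler/garbage model falls below `threshold`."""
    llr = log_p_best - log_p_filler   # ratio of likelihoods = difference of log-probs
    return llr < threshold

# High-confidence recognition: winner clearly beats the filler model
print(should_reject(-120.0, -140.0))  # False (LLR = 20 >= 5)
# Low-confidence recognition: scores nearly tie -> reject
print(should_reject(-130.0, -132.0))  # True  (LLR = 2 < 5)
```

The SNR sensitivity discussed above shows up here as the fact that no single `threshold` value separates correct from incorrect recognitions well across noise conditions.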
- 4. Rate of Speech and Phoneme Count Estimation for Speech Recognition
-
This project investigates and compares methods for estimating the speaking rate of a speaker, also known as the rate of speech (ROS). The ROS is typically measured as the number of sound units (phonemes) per time unit. A reliable estimate of the ROS can be used to improve the performance of a speech recognizer significantly, as it is well known that performance is very poor for very fast or very slow speakers. The poor performance for these speakers can be improved by appropriately taking the ROS into account, e.g., by modelling phoneme durations according to the estimated ROS for each phoneme, or by using separate models for "outlier" ROS speakers. The ROS estimator can also be used for estimating the number of phonemes in an utterance. The phoneme count estimate can be used to constrain the recognition task to words of a particular length, so as to improve performance and reduce recognition complexity.
In this project, various methods for ROS and phoneme count estimation based on the speech recognizer itself are compared to neural network based approaches. Both standard feed-forward and feed-back (recurrent) networks can be evaluated.
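As an illustration of the quantities being estimated (the phoneme alignment below is invented; in practice it would come from the recognizer or a neural network estimator):

```python
def rate_of_speech(phoneme_segments):
    """ROS in phonemes per second, from (start_s, end_s) segments of one utterance."""
    duration = phoneme_segments[-1][1] - phoneme_segments[0][0]
    return len(phoneme_segments) / duration

def expected_phoneme_count(ros, utterance_duration):
    """Phoneme count estimate, usable to constrain recognition to words of similar length."""
    return round(ros * utterance_duration)

# Hypothetical alignment: 6 phonemes over 0.5 s -> 12 phonemes/s
segments = [(0.00, 0.08), (0.08, 0.16), (0.16, 0.25),
            (0.25, 0.33), (0.33, 0.42), (0.42, 0.50)]
ros = rate_of_speech(segments)
print(ros)                               # 12.0
print(expected_phoneme_count(ros, 1.0))  # 12
```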
- 5. Text-to-phoneme mapping with Hidden Markov Models
- Speaker-independent speech recognition systems often employ statistical phoneme models to model a specific language. In order to recognize a pre-specified vocabulary, it is necessary to translate the "spelled" word strings into strings of phonemes. In languages like Japanese or Finnish this is easy, since the pronunciation is uniquely determined by the spelling. In other languages the translation is not known in advance, and a dictionary of phonetic transcriptions is needed. However, due to their large size, such dictionaries are not suitable for handheld devices like mobile phones.
A widely used approach to statistical Text-To-Phoneme (TTP) mapping is the so-called decision tree. However, if the vocabulary of the application is unconstrained, the decision trees typically have to be very large in order to provide an acceptable mapping accuracy.
In this project, alternative methods for text-to-phoneme mapping are compared to the decision tree based approach. The main objective is that the developed model should be significantly smaller than the decision tree model without compromising mapping accuracy. Potential frameworks to consider are so-called Hidden Markov Models (HMMs) and feed-forward or recurrent neural networks.
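The TTP task itself can be sketched as classifying each letter given its neighbours. The tiny rule table below is invented purely for illustration; a decision tree or HMM would learn many such context-dependent rules from a pronunciation dictionary, which is why unconstrained vocabularies make the tree large:

```python
# Invented context-dependent rules: (left, letter, right) -> phoneme.
# A decision tree compactly encodes thousands of such rules; the project
# seeks a model that is smaller still.
RULES = {
    (None, 'p', 'h'): 'f',   # "ph" as in "phone" maps to /f/
    ('p', 'h', 'o'): '',     # second half of "ph" maps to nothing
}
DEFAULT = {'p': 'p', 'h': 'h', 'o': 'oU', 'n': 'n', 'e': ''}  # toy fallbacks

def text_to_phonemes(word):
    phonemes = []
    for i, letter in enumerate(word):
        left = word[i - 1] if i > 0 else None
        right = word[i + 1] if i + 1 < len(word) else None
        phonemes.append(RULES.get((left, letter, right), DEFAULT.get(letter, letter)))
    return [p for p in phonemes if p]

print(text_to_phonemes("phone"))  # ['f', 'oU', 'n']
```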
- 6. Pre-processing for Speech Recognition Using Advanced Auditory Models
-
A typical automatic speech recognition system includes a pre-processing module, which extracts features from the audio waveform. Mel Frequency Cepstral Coefficient (MFCC) features have more or less been adopted as the standard by the speech processing community. MFCCs model the basilar membrane by a mel-scaled frequency axis, and turn the convolution with the vocal tract into a sum by using the cepstrum instead of the spectrum.
MFCCs provide a very good front end when the speech is relatively clean, whereas in noisy environments the performance of the recognizer may degrade quite severely.
In order to improve the noise robustness of the recognizer, this project will focus on applying advanced auditory models for feature extraction, e.g. the PEMO model developed by the Medical Physics Group at Oldenburg University (http://medi.uni-oldenburg.de/members/juergen/asr.html) or the Auditory Image Model from the Medical Research Council, Cambridge (ftp://ftp.essex.ac.uk/pub/omard/dsam/).
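The baseline MFCC front end that the auditory models would replace can be sketched roughly as follows (frame length, filter count, and coefficient count are typical values assumed here, not prescribed by the project):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced linearly on the mel scale (basilar-membrane model)."""
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fbank[i, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fbank[i, k] = (r - k) / max(r - c, 1)   # falling edge
    return fbank

def mfcc(frame, sample_rate=8000, n_filters=23, n_ceps=13, n_fft=256):
    """One 20-30 ms frame -> cepstral coefficients: |FFT|^2, mel filterbank,
    log (turns the vocal-tract convolution into a sum), then DCT-II."""
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    log_energies = np.log(mel_filterbank(n_filters, n_fft, sample_rate) @ power + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return dct @ log_energies

frame = np.random.randn(200)   # 25 ms of synthetic "speech" at 8 kHz
print(mfcc(frame).shape)       # (13,)
```

Auditory-model front ends such as PEMO replace the filterbank-plus-cepstrum stages with a more detailed simulation of peripheral hearing, which is where the hoped-for noise robustness comes from.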