The performance is shown in terms of the word error rate, which is defined as the sum of word substitutions, deletions, and insertions, as a percentage of the actual number of words in the test. All training and test speakers were native speakers of American English. The error rates are for speaker-independent recognition, that is, test speakers were different from the speakers used for training.
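As an illustration of this definition, the word error rate can be computed from a word-level edit-distance alignment between a reference transcript and the recognizer's output. The sketch below is a minimal plain-Python implementation of that formula; the function name and example sentences are our own, not from the evaluations discussed here.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / (words in reference) * 100."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)
```

For example, scoring the hypothesis "show all the flights" against the reference "show me all flights" counts one deletion ("me") and one insertion ("the"), giving 2 errors over 4 reference words, or a 50% word error rate.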
All the results in Figure 7 are for laboratory systems; they were obtained from the literature (Bates et al.). The first two corpora were collected in very quiet rooms at TI, while the latter two were collected in office environments at several different sites. The ATIS corpus was collected from subjects trying to access airline information by voice using natural English queries; it is the only corpus of the four presented here for which the training and test speech are spontaneous rather than read sentences.
The WSJ corpus consists largely of read sentences from the Wall Street Journal, with some spontaneous sentences used for testing. Shown in Figure 7 are the vocabulary size for each corpus and whether the vocabulary is closed or open. The vocabulary is closed when all the words in the test are guaranteed to be in the system's lexicon, while in the open condition the test may contain words that are not in the system's lexicon and, therefore, will cause errors in the recognition.
The perplexity is the test-set perplexity defined above. Strictly speaking, perplexity is not defined for the open-vocabulary condition, so the value of the perplexity that is shown was obtained by making some simple assumptions about the probability of n-grams that contain the unknown words. The results shown in Figure 7 are average results over a number of test speakers. The error rates for individual speakers vary over a relatively wide range and may be several times lower or higher than the average values shown.
Since much of the data were collected in relatively benign conditions, one would expect the performance to degrade in the presence of noise and channel distortion. It is clear from Figure 7 that higher perplexity, open vocabulary, and spontaneous speech tend to increase the word error rate. We shall quantify some of these effects next and discuss some important issues that affect performance. It is well recognized that increasing the amount of training data generally decreases the word error rate. However, it is important that the increased training be representative of the types of data in the test.
Otherwise, the increased training might not help. With the RM corpus, it has been found that the error rate is inversely proportional to the square root of the amount of training data, so that quadrupling the training data results in cutting the word error rate by a factor of 2. This large reduction in error rate by increasing the training data may have been the result of an artifact of the RM corpus, namely, that the sentence patterns of the test data were the same as those in the training.
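This inverse-square-root relationship can be written as a one-line projection. The function and the numbers in the example are illustrative, not measurements from the RM corpus.

```python
def projected_error_rate(base_error, base_data, new_data):
    """Rule of thumb from the text: error rate scales as 1/sqrt(training data)."""
    return base_error * (base_data / new_data) ** 0.5
```

Quadrupling the training data halves the projected error: starting from a hypothetical 8% error rate with one unit of data, `projected_error_rate(8.0, 1, 4)` gives 4.0.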
In a realistic corpus, where the sentence patterns of the test can often be quite different from those in the training, such a large reduction may not be achieved. For example, recent experiments with the WSJ corpus have failed to show a significant reduction in error rate from doubling the amount of training. However, it is possible that increasing the complexity of the models as the training data are increased could result in a larger reduction in the error rate. This is still very much a research issue. Word error rates generally increase with an increase in grammar perplexity. A general rule of thumb is that the error rate increases as the square root of the perplexity, everything else being equal.
This rule of thumb may not always be a good predictor of performance, but it is a reasonable approximation.
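The perplexity rule of thumb can be expressed the same way; again, the numbers in the example are purely illustrative.

```python
def projected_error_from_perplexity(base_error, base_perplexity, new_perplexity):
    """Rule of thumb from the text: error rate grows as sqrt(perplexity), all else equal."""
    return base_error * (new_perplexity / base_perplexity) ** 0.5
```

For instance, moving from a hypothetical task of perplexity 25 to one of perplexity 100 would be expected, all else being equal, to double the error rate: `projected_error_from_perplexity(3.0, 25.0, 100.0)` gives 6.0.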
Note that the size of the vocabulary as such is not the primary determiner of recognition performance, but rather the freedom with which the words can be put together, which is represented by the grammar. A less constrained grammar, such as in the WSJ corpus, results in higher error rates. The terms speaker-dependent (SD) and speaker-independent (SI) recognition are often used to describe different modes of operation of a speech recognition system. SD recognition refers to the case when a single speaker is used to train the system and the same speaker is used to test it.
SI recognition refers to the case where the test speaker is not included in the training. In SD mode training speech is collected from a single speaker only, while in SI mode training speech is collected from a variety of speakers. SD and SI modes of recognition can be compared in terms of the word error rates for a given amount of training. A general rule of thumb is that, if the total amount of training speech is fixed at some level, the SI word error rates are about four times the SD error rates.
Another way of stating this rule of thumb is that, for SI recognition to match SD performance, about 15 times the amount of training data is required (Schwartz et al.). These results were obtained when one hour of speech was used to compute the SD models. However, as the amount of training speech for the SD and SI models is made larger and larger, it is not clear that any amount of training data will allow SI performance to approach SD performance.
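The two figures quoted here are roughly consistent with the earlier square-root training-data rule: closing a four-fold error gap under an inverse-square-root law would require the square of that factor in extra data, close to the ~15x figure quoted. A quick arithmetic check (illustrative only):

```python
# If error scales as 1/sqrt(training data), reducing the error by a factor
# of 4 requires 4**2 = 16 times the data -- close to the ~15x quoted above.
error_gap = 4                  # SI error is about 4x the SD error at fixed data
data_factor = error_gap ** 2   # extra data needed for SI to match SD
print(data_factor)             # prints 16
```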
The idea behind SI recognition is that the training is done once, after which any speaker can use the system with good performance. SD recognition, which requires each new user to supply training speech, is seen as an inconvenience for potential users. If the system is used in a domain in which it was not trained, the performance degrades. It has been a historical axiom that, for optimal SI recognition performance, it is best to collect training speech from as many speakers as possible in each domain. For example, it was believed that collecting, say, 10 utterances from each of a large number of speakers was far superior to collecting many utterances from each of a few speakers.
Recent experiments have shown that, for some applications, collecting speech from only a dozen speakers may be sufficient for good SI performance. In an experiment with the WSJ corpus, for a fixed amount of training data, training with 12 speakers gave basically the same SI performance as training with 84 speakers (Schwartz et al.). This is a welcome result; it makes it easier to develop SI models in a new domain, since collecting data from fewer speakers is cheaper and more convenient.
The ultimate goal of speech recognition research is a system that is domain-independent (DI), that is, one that is trained once and for all so that it can be used in any new domain and for any speaker without retraining. Currently, the only method used for DI recognition is to train the system on a very large amount of data from different domains. However, preliminary tests have shown that DI recognition on a new domain not included in the training can increase the error rate by a factor of 1.
If a good grammar is not available from the new domain, performance can be several times worse. It is possible to improve the performance of an SI or DI system by incrementally adapting to the voice of a new speaker as the speaker uses the system. This would be especially needed for atypical speakers with high error rates who might otherwise find the system unusable.
Such speakers would include speakers with unusual dialects and those for whom the SI models simply are not good models of their speech. However, incremental adaptation could require hours of usage and a lot of patience from the new user before the performance becomes adequate. A good solution to the atypical speaker problem is to use a method known as rapid speaker adaptation.
In this method, only a small amount of speech (about two minutes) is collected from the new speaker before using the system. By collecting the same utterances from the new speaker as were collected from the training speakers, the system's models can be adapted to the new voice. It is possible with these methods to achieve average SI performance for speakers who otherwise would have several times the error rate. One salient example of atypical speakers is nonnative speakers, given that the SI system was trained on only native speakers. In a pilot experiment in which four nonnative speakers were tested in the RM domain in SI mode, there was an eight-fold increase in the word error rate over that of native speakers!
By collecting two minutes of speech from each of these speakers and using rapid speaker adaptation, the average word error rate for the four speakers decreased five-fold. Out-of-vocabulary words cause recognition errors and degrade performance. There have been very few attempts at automatically detecting the presence of new words, and those with limited success (Asadi et al.).
Most systems simply do not do anything special to deal with the presence of such words. After the user realizes that some of the errors are being caused by new words and determines what these words are, it is possible to add them to the system's vocabulary. In word-based recognition, where whole words are modeled without an intermediate phonetic stage, adding new words to the vocabulary requires specific training of the system on the new words (Bahl et al.). However, in phonetically based recognition, such as the phonetic HMM approach presented in this paper, adding new words to the vocabulary can be accomplished by including their phonetic pronunciations in the system's lexicon.
If the new word is not in the lexicon, a phonetic pronunciation can be derived from a combination of a transcription and an actual pronunciation of the word by the speaker (Bahl et al.). The HMMs for the new words are then automatically compiled from the preexisting phonetic models, as shown in Figure 6. The new words must also be added to the grammar in an appropriate manner.
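The lexicon-driven procedure just described can be sketched as follows. The phone set, pronunciations, and function names are all hypothetical, with strings standing in for trained phone HMMs; a real system would concatenate the states and transitions of the phone models themselves.

```python
# Preexisting phone models (stand-ins for trained phone HMMs).
phone_models = {"g": "HMM(g)", "r": "HMM(r)", "iy": "HMM(iy)", "t": "HMM(t)"}

# System lexicon: word -> phonetic pronunciation (illustrative entries).
lexicon = {"greet": ["g", "r", "iy", "t"]}

def compile_word_hmm(word):
    """Build a word model by concatenating the phone HMMs named in its pronunciation."""
    return [phone_models[phone] for phone in lexicon[word]]

def add_word(word, pronunciation):
    """Adding a new word only requires a lexicon entry; no acoustic retraining
    is needed (though the grammar must also be updated to admit the word)."""
    lexicon[word] = pronunciation
```

For example, after `add_word("tree", ["t", "r", "iy"])`, the new word's model is compiled entirely from the preexisting phone models, mirroring the process shown in Figure 6.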