What’s Wrong with Current Tech

Current speech processors rest on the assumption that, given a speech waveform, a statistical model can calculate the probability that the waveform represents a particular word. The model is essentially a list of probabilities associating words with their waveforms. The ideal model would contain every word of a language as spoken by every person who speaks, or will ever speak, that language. Obviously, that is not possible, so in practice the model is built from a sample. Choosing a sample that is representative of the population is a cornerstone of statistical science: if the sample’s characteristics are not representative, the results will be biased and may not generalize to the population. The same applies to current statistics-based speech processors. Their predictive capacity depends on which subset of the ideal model is chosen. If the group of people whose speech is recorded is not sufficiently diverse, the model may fail to find the correct word for a waveform produced by a speaker outside that group, i.e. it will not generalize.

The usual solution to this problem is to collect as much data as possible, but that solution is also one of the method’s drawbacks, because the method depends completely on collecting a huge amount of data for each language. Speech recordings are readily available on the Internet for certain languages, but for most languages there aren’t enough recordings to build a speech processor. For some languages the recordings available on the Internet may be sufficient in quantity but often cover a single topic, and so may produce a biased model. Recording people’s speech in the field is resource intensive and may also lead to a biased model if the recordings are too local.
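The idea of a model as a list of word and waveform associations, and its failure on speakers outside the sample, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: real waveforms are not reduced to discrete labels like "sig_a", and the names train, recognize and the sample data are hypothetical, not taken from any actual speech processor.

```python
from collections import Counter, defaultdict


def train(samples):
    """Count how often each waveform signature was labelled with each word."""
    counts = defaultdict(Counter)
    for signature, word in samples:
        counts[signature][word] += 1
    return counts


def recognize(model, signature):
    """Return (word, probability) for the most likely word,
    or None if the signature never appeared in the training sample."""
    if signature not in model:
        return None  # the model cannot generalize beyond its sample
    word_counts = model[signature]
    total = sum(word_counts.values())
    word, count = word_counts.most_common(1)[0]
    return word, count / total


# Hypothetical training sample recorded from a single group of speakers.
training_sample = [
    ("sig_a", "hello"),
    ("sig_a", "hello"),
    ("sig_a", "hollow"),
    ("sig_b", "world"),
]

model = train(training_sample)
result_seen = recognize(model, "sig_a")    # signature covered by the sample
result_unseen = recognize(model, "sig_z")  # speaker outside the sampled group
```

In this sketch the only remedy for the unseen signature is to add more recordings to the training sample, which mirrors the data dependence described above.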

The morphology of the human speech apparatus limits the range of sounds that can be produced. A subset of these sounds, called phonemes, is used as the elements of speech. All languages of the world draw on this set of phonemes to transmit information. In general, a language uses only a subset of the available phonemes, and every language has at least a few phonemes in common with some other language. The method used to create a speech model with current technology does not take advantage of these commonalities, so for every language the model must be created from scratch, even if a model already exists for a very similar language.
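The overlap between phoneme inventories can be pictured as a simple set intersection. The sketch below is only illustrative: the two inventories are made up and do not describe any real language, and the function names shared_phonemes and overlap_ratio are hypothetical.

```python
def shared_phonemes(inventory_a, inventory_b):
    """Phonemes that appear in both inventories."""
    return inventory_a & inventory_b


def overlap_ratio(inventory_a, inventory_b):
    """Shared phonemes as a fraction of all phonemes used by either language."""
    union = inventory_a | inventory_b
    if not union:
        return 0.0
    return len(inventory_a & inventory_b) / len(union)


# Made-up inventories for two hypothetical languages.
language_x = {"p", "t", "k", "m", "n", "a", "i", "u"}
language_y = {"p", "t", "k", "s", "m", "a", "i", "e"}

common = shared_phonemes(language_x, language_y)
ratio = overlap_ratio(language_x, language_y)
```

A model built per language from scratch ignores the phonemes in common, so whatever was learned about them for one language is learned again for the other.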

The large quantity of data needed to model a language for current speech recognition results in very large models. Their size forces speech recognition service providers to implement the service on a system of remote servers. The models would not be usable on handheld devices because of the devices’ limited memory and processing speed. On personal computers the models could be usable, but service providers prefer to keep their data centralized to facilitate upgrades. Some providers are trying to offer users the option of a local model, but at lower accuracy because of its smaller size.