Microsoft Translation Demo
  • 14 Replies
  • It's a funny new feature to simulate a specific person's voice in speech synthesis, but that feature is not really relevant for the listeners to understand the message.

    The rest - speech recognition and automatic translation - is the much more relevant and much harder part, and I don't see that much of an improvement in there over what was state of the art 15 years ago (when I stopped working in the field of speech recognition).

    The problem is that even a single word that is misunderstood or mis-translated can turn the message of the speaker upside down, so nobody will rely on such a system unless it is as safe as using a human translator. Human translators still have a huge advantage in avoiding misunderstandings due to their semantic understanding of what is being said, and due to their semantic context knowledge.

    I'm kind of convinced that we will see a real "automatic translation breakthrough" to make such technology useful for everyday use only after the problem of teaching a computer "common sense" has been solved. And the long history of failures in that regard (see e.g. "Cyc") seems to indicate that it may still take quite a while until this is really solved...

  • Speech synthesis hasn't improved since 1980, when the Texas Instruments TI-99/4A home computer speech ROM came out. Ah, memories :-)

  • @karl

    The rest - speech recognition and automatic translation - is the much more relevant and much harder part, and I don't see that much of an improvement in there over what was state of the art 15 years ago (when I stopped working in the field of speech recognition).

    I absolutely do not agree here.

    @driftwood

    Speech synthesis hasn't improved since 1980, when the Texas Instruments TI-99/4A home computer speech ROM came out. Ah, memories :-)

    How long ago did you check it? It was 1982, I guess?

  • @Vitaliy_Kiselev: I was involved in the research project Verbmobil; you can download a video about it here. This system essentially did 15 years ago what Microsoft presented recently, with the exception that Verbmobil did not try to emulate the speaker's voice.

    Back then, automatic speech recognition software already achieved better recognition rates than humans when processing single, isolated words uttered by unknown speakers. But the ugly truth is that while humans do not correctly recognize many of the words they hear, they are very good at guessing the ones they could not recognize, and they are very good at adapting to the voice of a certain speaker quickly. This, in addition to "knowledge of the world" and common sense, allows a human to successfully understand what other people say. (A small sketch of this kind of guessing from context follows at the end of this post.)

    The results of automatic translations are still more funny than accurate, at least when they are not restricted to a narrow topic.

    What definitely has improved is the computational power available per money and weight/size: what required a network of expensive parallel computing servers back then is now available at moderate prices and fits on your desktop. (But still, services like "Siri" rely on fat server farms to do the recognition remotely, instead of implementing the whole process on the phone itself.)
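
    A small Python sketch of this kind of guessing from context - not code from Verbmobil or from the Microsoft demo; the function, the scores and the tiny context table are all invented for illustration:

        # Toy acoustic scores for two confusable hypotheses of the same audio
        # snippet (made-up numbers; a real recognizer derives these from its models).
        acoustic_score = {"nein": 0.48, "neun": 0.52}

        # Toy "context prior": how likely each word is after the preceding words.
        # Real systems use large statistical language models for this step.
        context_prior = {
            ("ich", "sage"): {"nein": 0.9, "neun": 0.1},  # after "ich sage ..." -> "nein" is likely
            ("seite",):      {"nein": 0.2, "neun": 0.8},  # after "Seite ..."    -> "neun" is likely
        }

        def best_hypothesis(preceding_words, priors, acoustics):
            """Pick the word with the highest combined acoustic * context score."""
            prior = priors.get(tuple(preceding_words), {})
            return max(acoustics, key=lambda w: acoustics[w] * prior.get(w, 0.5))

        print(best_hypothesis(["ich", "sage"], context_prior, acoustic_score))  # -> nein
        print(best_hypothesis(["seite"], context_prior, acoustic_score))        # -> neun

    The acoustically "better" hypothesis loses to the one that fits the context, which is roughly what the human guessing described above achieves.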

  • I was involved in the research project Verbmobil; you can download a video about it here. This system essentially did 15 years ago what Microsoft presented recently, with the exception that Verbmobil did not try to emulate the speaker's voice.

    I really do not like the claim that it "essentially did 15 years ago what Microsoft presented recently, with the exception that Verbmobil did not try to emulate the speaker's voice". The whole idea here is to reduce the delay, not only to emulate the voice. Plus, the approach to translation can be quite different.

    Back then, automatic speech recognition software already achieved better recognition rates than humans when processing single, isolated words uttered by unknown speakers.

    Do you have some links to papers? Such recognition rates usually imply some special restrictions - like random, out-of-context words spoken by non-native speakers. And your later words are really about exactly this.

    Also, as far as I remember, systems recognizing isolated words and real speech are very different, and most are based on different algorithms, so you can't just take one and scale it up.

    This, in addition to "knowledge of the world" and common sense, allows a human to successfully understand what other people say.

    And this is exactly the "knowledge of the world" that is different from the old academic attempts.

    The results of automatic translations are still more funny than accurate, at least when they are not restricted to a narrow topic.

    But the results are getting better. Available data is growing fast, so "knowledge of the world" is improving. It is not the same as for humans, but it is getting better.

    What definitely has improved is the computational power available per money and weight/size: what required a network of expensive parallel computing servers back then is now available at moderate prices and fits on your desktop. (But still, services like "Siri" rely on fat server farms to do the recognition remotely, instead of implementing the whole process on the phone itself.)

    Google voice recognition can also work on the device, but usually the result will be better with remote recognition - not because of computing power, but because of a very big, specially built database of context and speaker data.

  • @Vitaliy_Kiselev:

    The whole idea here is to reduce the delay, not only to emulate the voice.

    Hmmm... the delay of any translation is primarily caused by the requirement to wait for the speaker to have completed at least enough words of his sentence that the following words will not alter the semantics of the already spoken part.

    The contribution of the computational effort is, in comparison, pretty low, and just depends on the amount of computing power one is willing to invest. (A rough sketch of this buffering follows at the end of this post.)

    Plus, the approach to translation can be quite different.

    Sure, but the results of automatic translation are today still so far from the quality of manual translations that I (subjectively) don't see much improvement in the last few years.

    Back then, automatic speech recognition software already achieved better recognition rates than humans when processing single, isolated words uttered by unknown speakers.

    Do you have some links to papers?

    Those would be difficult to dig out. I remember a live demo vividly: our professor replayed a recording where different people each spoke just one word, either the (German) word "Nein" ("no") or "Neun" (the number 9). We were asked to recognize them, and our (human) average recognition rate was about 80%, while our automatic speech recognizer achieved 92%.

    Sure, this is a very artificial scenario, but you probably know a similar, much more common situation: have a receptionist write down the names of people who are calling him - without asking for repetition or spelling. He will have an extremely hard time getting most of the names right.

    BTW: All the recognizers that were developed for "Verbmobil" were speaker-independent - which is much more difficult than building a recognizer that is "trained with the voice of its later user". According to the Microsoft presentation, the speaker had to deliver a lengthy sample of his voice to allow for emulating his voice. I hope they didn't cheat by using the same material to also train the recognizer.

    Also, as far as I remember, systems recognizing isolated words and real speech are very different, and most are based on different algorithms, so you can't just take one and scale it up.

    During my time in speech recognition research, we wrote software for single word recognition (useful for navigating through menus on a phone), word spotting (recognizing a limited set of words within fluent speech) and recognition of complete sentences. While the later parts of processing differ, the extraction of observables, their filtering and preprocessing, their feeding into stochastic models or neural networks and the principles of how to determine the hypothesis with the highest likelihood are pretty much the same for all three use cases. (A sketch of this shared pipeline follows at the end of this post.)

    This, in addition to "knowledge of the world" and common sense, allows a human to successfully understand what other people say. And this is exactly the "knowledge of the world" that is different from the old academic attempts.

    Where do you see that in the Microsoft presentation?

    Available data is growing fast, so "knowledge of the world" is improving. It is not the same as for humans, but it is getting better.

    Sure, the systems today can much more easily be based on larger databases, and that is certainly helpful, especially for all the stochastic approaches. But no system has yet won even the $25,000 Loebner Prize, and I would be very surprised if we saw an automatic translator of human-equivalent quality before this is achieved.
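
    A rough sketch of the buffering point above (nothing from a real product - translate() is a placeholder and the "end of segment" rule is just a toy): the translator can only run once a semantically stable segment has arrived, so waiting for the speaker dominates the delay, not the computation.

        def translate(segment_words):
            # Placeholder for the actual machine translation step; assumed to be
            # fast compared to the time the speaker needs to finish the segment.
            return "[translation of: " + " ".join(segment_words) + "]"

        def streaming_translator(word_stream, is_segment_end):
            buffer = []
            for word in word_stream:
                buffer.append(word)          # the delay accumulates here, word by word
                if is_segment_end(word):     # e.g. end of sentence or a long pause
                    yield translate(buffer)  # the compute cost is small in comparison
                    buffer = []
            if buffer:
                yield translate(buffer)

        words = "this is a demo sentence . and another one .".split()
        for output in streaming_translator(words, lambda w: w == "."):
            print(output)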
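
    And a Python sketch of the shared pipeline point above: the feature extraction and model scoring are common, only the final decision stage differs between the use cases. Everything here (the frame size, the crude energy features, the Gaussian-style scoring, the toy word "models") is an invented stand-in for the real components (e.g. spectral features and HMMs), not code from any actual system.

        import numpy as np

        def extract_features(audio_samples):
            # Shared front end: turn raw samples into a sequence of feature vectors.
            # Log frame energies here are a crude stand-in for real spectral features.
            frames = audio_samples.reshape(-1, 160)  # 10 ms frames at 16 kHz
            return np.log(np.sum(frames ** 2, axis=1) + 1e-9)[:, None]

        def score_against_models(features, word_models):
            # Shared scoring step: a likelihood-like score of the features under
            # each word model (a crude distance instead of an HMM forward pass).
            return {word: -np.mean((features - model) ** 2)
                    for word, model in word_models.items()}

        # Only this last stage really differs between the use cases
        # (full-sentence recognition would add a search over word sequences):
        def decode_isolated_word(scores):
            return max(scores, key=scores.get)  # single best word

        def decode_word_spotting(scores, threshold):
            return [w for w, s in scores.items() if s > threshold]  # words above threshold

        audio = np.random.randn(16000).astype(np.float32)             # 1 s of fake audio
        models = {"nein": np.zeros((1, 1)), "neun": np.ones((1, 1))}  # toy word "models"
        scores = score_against_models(extract_features(audio), models)
        print(decode_isolated_word(scores))
        print(decode_word_spotting(scores, threshold=-50.0))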

  • Hmmm... the delay of any translation is primarily caused by the requirement to wait for the speaker to have completed at least enough words of his sentence that the following words will not alter the semantics of the already spoken part.

    The contribution of the computational effort is, in comparison, pretty low, and just depends on the amount of computing power one is willing to invest.

    Yep. But I think that the whole goal in a real situation is to be close to humans, who can guess from context and knowledge much sooner.

    Sure, but the results of automatic translation are today still so far from the quality of manual translations that I (subjectively) don't see much improvement in the last few years.

    Improvements are constant, at least in practical usage. The databases collected by corporations are huge and grow very fast.

    During my time in speech recognition research, we wrote software for single word recognition (useful for navigating through menus on a phone), word spotting (recognizing a limited set of words within fluent speech) and recognition of complete sentences. While the later parts of processing differ, the extraction of observables, their filtering and preprocessing, their feeding into stochastic models or neural networks and the principles of how to determine the hypothesis with the highest likelihood are pretty much the same for all three use cases.

    My position is that all these "stochastic models or neural networks" are just a big waste of time, as translation, in fact, requires understanding. And understanding means that you need the extremely complex associative models that humans have. And translation is not really translation, but resynthesis.

    But no system has yet won even the $25,000 Loebner Prize, and I would be very surprised if we saw an automatic translator of human-equivalent quality before this is achieved.

    If you ask me, we'll reach the goal if we fire most of the academic researcher-parasites in this field, leave it to commercial firms, and start looking at reality. Look for different approaches.

  • As usual, the truth is in the middle. Speech recognition has developed in the past 20 years, but at a much slower rate than other computational tasks. Think of 3D graphics: in the '80s it took minutes to render a low-poly sphere with Gouraud shading at 320x240; now we have scenes of around 1,000,000 polygons with complex lighting and reflections in real time at 1920x1080. If speech recognition had the same rate of development, our mobile phones would now have instant translation into any language with perfect, non-robotic voices.

  • @jazzroy And the answer to the slow/nonexistent development in speech recognition is probably: not enough money to be made. Which, for most companies, means a waste of time. I'll bet if one small company makes a breakthrough in this and puts it to good use with an iPhone, or something alike, there will be hundreds of these programs working just fine in a year or two. And school (foreign language) grades will go to shit a couple of years after that.

  • @fix

    It is not about money. It is about skills and concepts. Researchers spent billions of government money, yet it looks like the approach was not too good.

  • Well, there are a lot of skilled people out there. Someone will surely make some big bucks if they can break the code. Still, I'm quite sure it would become more "pop" in research if someone made a breakthrough.

  • @Vitaliy_Kiselev: I really wonder why you call academic researchers parasites. I cannot speak for the situation in other countries, but in Germany the majority of research work at universities is done by unpaid students, some of the work is done by low-paid graduate assistants (on temporary work contracts) and only a small fraction of the total effort is actually done by relatively well paid professors - and with "relatively well paid" I mean "still paid less than an average middle manager in the industry".

    In reality, it is the companies participating in BMBF (government-)funded research projects that one could call "parasites", as they spend money collected from the taxpayers to pay for part of the development of new products.

    And only very few corporations ever invest in true basic research, because they usually expect to see a positive ROI within a very short time frame, and that is unlikely to be gained from basic research - with speech recognition being a perfect example, where it took many years until the first marketable products appeared.

    Instead, corporations buy "spin-offs" that originated from government-funded non-profit research institutes - like Nuance or Siri (which spun off from SRI International), or Entropic, a spin-off from Cambridge University, which was acquired by Microsoft.

    It might help bring technology to market when you commercialize what has already been researched, but basic research would mostly just not happen if we had to rely on some MBA to sign off on a budget for it.

  • @fix: If automatic translation were completely solved and available everywhere, nobody would miss being taught foreign languages. A few would still learn foreign languages as a hobby - like some still play chess today, even though computers can do that better.