Intellego/Research

From MozillaWiki

The problems

  • One area where Chrome has beaten Firefox is automatic translation of web content using Google Translate. Google has spent a lot of time and incorporated some interesting strategies into building a complex, proprietary machine translation engine to handle this. The feature within Chrome not only allows users to call and retrieve machine translation output through the Google Translate engine, but also gives them an interface for recommending improvements to the translation, thus allowing the engine to become more sophisticated and accurate.
  • Before Chrome, Google Translate had an open API, which allowed Google to collect content for use in its engine, but also made the web a generally more multilingual place. Using this open API, any website could add a snippet of code and see its content translated on the fly. Over three years ago, Google closed this API and began charging for the service, resulting in many websites becoming monolingual once again. Closing the Google Translate API has left a massive gap in the web, and nothing has yet been able to fill the need.
  • The open MT ecosystem currently suffers from not being able to provide potential users with a quality web service or API which both MT end users and web admins could use for their projects.
  • Many Mozilla l10n teams consist of only 1-2 people. While they would love to provide coverage in their language for all of the support content and websites used to market to users and help them with issues, they do not have the time to commit. Users thus have a localized Firefox, but lack troubleshooting support in their language.
  • More and more Mozillians are non-English speakers or do not have English writing skills. There have been efforts to provide language education for Mozillians; however, the opportunities are limited to a small percentage of Mozillians. These Mozillians are thus limited in their participation due to the significant language barrier.
  • Language support selection for machine translation projects is driven, in part, by ROI and availability of resources. This often results in minority languages, and even some majority languages (see Indic languages), being under-represented in the machine translation ecosystem. As long as ROI continues to be a primary motivator for incorporating support for these languages, they will remain under-represented and unsupported.
  • Data for machine translation corpuses is often collected by crawling the web and by consuming data that users unknowingly hand over to these engines, either because their content is crawled or because they agreed to obscure terms and conditions when using the MT service. Open data collection for MT corpuses is either non-existent or an obscure practice.

Research questions

How does machine translation work?

There are four general approaches to Machine Translation. Most of the early work, before massive corpora, was done with Rule-based machine translation ( http://en.wikipedia.org/wiki/Rule-based_machine_translation ). However, most of the current work being done is with Statistical Machine Translation ( http://en.wikipedia.org/wiki/Statistical_machine_translation ). A brief description of each is available below.

Rule-Based Machine Translation

Uses pre-defined grammatical and syntactic rules and large bilingual dictionaries to translate text. The necessary resources can be very costly to produce, but according to http://blog.globalizationpartners.com/machine-translation.aspx this approach can actually "produce better quality for language pairs with very different word orders (for, example English to Japanese)".
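
As a rough illustration of the dictionary-plus-rules idea, here is a minimal sketch in Python. The three-word French-to-English dictionary, the part-of-speech tags, and the single reordering rule are all hypothetical toy data, not taken from any real engine; real rule-based systems use far larger dictionaries and many more rules.

```python
# Hypothetical toy French-to-English dictionary: word -> (translation, part of speech).
DICTIONARY = {
    "la": ("the", "DET"),
    "maison": ("house", "NOUN"),
    "bleue": ("blue", "ADJ"),
}


def translate_rule_based(sentence):
    """Look up each word, then apply one reordering rule (NOUN ADJ -> ADJ NOUN)."""
    # Words missing from the toy dictionary raise KeyError; a real system needs a fallback.
    tagged = [DICTIONARY[word] for word in sentence.lower().split()]
    i = 0
    while i < len(tagged) - 1:
        if tagged[i][1] == "NOUN" and tagged[i + 1][1] == "ADJ":
            # French puts most adjectives after the noun; English puts them before.
            tagged[i], tagged[i + 1] = tagged[i + 1], tagged[i]
            i += 2
        else:
            i += 1
    return " ".join(word for word, _ in tagged)


print(translate_rule_based("la maison bleue"))
```

Even this toy version shows where the cost comes from: every word and every word-order difference has to be written down by hand.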

Statistical Machine Translation

Uses statistical information to choose the "best" translation from the possible translations of a text. As far as I know, all work with statistical machine translation requires a bilingual corpus for calculating the necessary probabilities.
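
One common way to frame "choosing the best translation" is the noisy-channel formulation: score each candidate by how fluent it is in the target language, P(e), multiplied by how well it explains the source text, P(f | e), and keep the highest-scoring candidate. The sketch below assumes that formulation. The candidate list and every probability in it are made-up toy values; a real system would estimate them from a monolingual and a bilingual corpus.

```python
import math

# Hypothetical toy language model: P(e), the fluency of each English candidate.
LANGUAGE_MODEL = {
    "the blue house": 0.02,
    "the house blue": 0.001,
    "blue the house": 0.0001,
}

# Hypothetical toy translation model: P(f | e) for the source "la maison bleue".
TRANSLATION_MODEL = {
    ("la maison bleue", "the blue house"): 0.5,
    ("la maison bleue", "the house blue"): 0.6,
    ("la maison bleue", "blue the house"): 0.5,
}


def best_translation(source, candidates):
    """Return the candidate e that maximizes log P(e) + log P(source | e)."""
    def score(candidate):
        return (math.log(LANGUAGE_MODEL[candidate])
                + math.log(TRANSLATION_MODEL[(source, candidate)]))
    return max(candidates, key=score)


print(best_translation("la maison bleue", list(LANGUAGE_MODEL)))
```

Note that "the house blue" has the highest translation-model score in this toy data, but the language model penalizes it enough that the fluent word order wins.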

Example-based Machine Translation

Uses cases and analogies, along with a parallel corpus, to determine the best translation. Somewhat similar to Rule-Based (http://en.wikipedia.org/wiki/Example-based_machine_translation).

Hybrid Machine Translation

A combination of the previously mentioned approaches.

What are the benefits and drawbacks to each methodology?

How do you measure the output quality of a machine translation engine?

Automated evaluation
  • BLEU Score - http://en.wikipedia.org/wiki/BLEU
    • Compares MT output against reference translations consisting of professional human translation, assigning a score (based on n-gram precision) to determine how close to the human translation the MT output arrives.
  • NIST - http://en.wikipedia.org/wiki/NIST_(metric)
    • Similar to BLEU, but not all correct n-grams are created equal: correct n-grams are weighted according to rarity of occurrence, so rarer n-grams count for more.
  • METEOR - http://en.wikipedia.org/wiki/METEOR
    • Evaluation based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision (BLEU and NIST rely on precision alone).
  • LEPOR - http://en.wikipedia.org/wiki/LEPOR
    • New MT evaluation model that is based on evaluating precision, recall, sentence-length and n-gram based word order.
  • WER score - https://en.wikipedia.org/wiki/Word_error_rate
    • The Word Error Rate calculates the word-level Levenshtein distance between MT output and a reference translation. Should correlate with the difficulty of post-editing machine translation output for publication.
    • PWER (Position-independent WER) is a variant where reorderings are disregarded.
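
To make two of the metrics above more concrete, here is a minimal sketch in Python of the clipped n-gram precision that BLEU builds on and of the word error rate. It is a simplification: full BLEU also combines several n-gram orders and applies a brevity penalty, and the function names are my own, not from any library.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def word_error_rate(hypothesis, reference):
    """Word-level Levenshtein distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(reference)


reference = "the cat sat on the mat".split()
hypothesis = "the cat sat on mat".split()
print(modified_precision(hypothesis, reference, 2))
print(word_error_rate(hypothesis, reference))
```

For this pair, three of the four hypothesis bigrams appear in the reference (bigram precision 0.75), and one dropped word out of six reference words gives a WER of 1/6.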

What prominent machine translation engines are out there and what are they known for?

This is a much more concise table of the current offerings. Includes both open and closed source engines that have front-end applications.
This is a list of all open source MT engines. Some have web services, many do not.
Name | Owner | Method | Open/Closed | # of supported languages | Web hosted?
Google Translate | Google | Statistical | Closed | +70 | translate.google.com
Microsoft Translator | Microsoft | | Closed | |
Babelfish | Yahoo! | | Closed | |
MosesMT | | Statistical | Open | |
Apertium | | Rule-based | Open | 30+ | apy
Other | | | | |
Other | | | | |
Other | | | | |

What prominent corpuses are currently available?

Name | Owner | Method | Open/Closed | # of languages | Noteworthy
Google Translate | | | | |
Microsoft Translator | | | | |
Babelfish | | | | |
EuroParl | European Parliament | | Open | 22 | Sentence aligned text
JRC-Acquis | European Union | | Open | 22 | Sentence aligned text
Hansards Corpus | Canadian Govt | | Open | 2 | Sentence or smaller aligned text
OPUS | | | Open | Many | Contains a variety of different corpora including some of those mentioned above
MultiUN | United Nations | | Open | 7 | Sentence alignment

What are the pros and cons of having a Mozilla MT engine?

What technology resources would be needed to build our own MT engine?

What human resources would be needed to build our own MT engine?

What partnership opportunities could be available for this project?

See https://www.taus.net/taus-machine-translation-showcase.

User stories

Firefox end-users

  • I want to automatically translate web sites into my native language in Firefox desktop.
  • I want to automatically translate web sites into my native language in Firefox for Android.
  • I want to automatically translate web sites into my native language in the Firefox OS browser.
  • I want to be able to give feedback and make corrections to machine translation output within these products.
  • My minority language has a very small presence online, but it's my native language and I want to see the web translated into that language.

Browser users in general

  • I want or need language tools in my browser, but am currently forced to either use Chrome or go without.
  • I want language tools in my browser of choice.

Web admins

  • I want an open API to an open MT engine that will allow my users to automatically translate the page's content into their native language with the press of a button.

Businesses

  • My product is popular in many countries, but I just don't have the resources to offer support in other languages. I want to better serve and retain customers who don't speak my language.

Non-English speaking Mozillians

  • I want to be able to read emails sent to me in my native language.
  • I want to be able to send emails to other Mozillians who don't speak my language, knowing that my message will be understood by anyone who reads it.
  • I want to be able to participate in Mozilla forum discussions in my native language.

Non-English speaking potential Mozillians

  • I want to support Mozilla, but my English is not good enough (or I have none) to participate.

Mozilla localizers

  • I want to translate support pages (or marketing campaigns, or other projects) for my localization of Firefox, but it requires a lot of time to translate. I want to be able to post-edit MT output in order to still provide language coverage without the massive time commitment.

State of the Field

Researchers

  • Andrew Ng (Stanford University)
  • Philipp Koehn (University of Edinburgh) - Maintains http://statmt.org/
  • Daniel Marcu (University of Southern California)

Bibliography

Overview

I have broken down the bibliography into two sections below. The first lists pages that collect papers, such as conference proceedings. The second lists specific papers that would be good to read.

Websites/Conference Proceedings

Individual Papers/Articles/Presentations