Intellego/Research
⚡ Warning: The content of this page is obsolete and kept for archiving purposes of past processes.
Contents
- 1 The problems
- 2 Research questions
- 2.1 How does machine translation work?
- 2.2 What are the benefits and drawbacks to each methodology?
- 2.3 How do you measure the output quality of a machine translation engine?
- 2.4 What prominent machine translation engines are out there and what are they known for?
- 2.5 What prominent corpuses are currently available?
- 2.6 What are the pros and cons of having a Mozilla MT engine?
- 2.7 What technology resources would be needed to build our own MT engine?
- 2.8 What human resources would be needed to build our own MT engine?
- 2.9 What partnership opportunities could be available for this project?
- 3 User stories
- 4 State of the Field
The problems
- One of the features where Chrome has beaten Firefox is providing users with automatic translation of web content using Google Translate. Google has spent a lot of time and incorporated some interesting strategies into building a complex, proprietary machine translation engine to handle this. The feature within Chrome not only allows users to call and retrieve machine translation output through the Google Translate engine, but Google Translate has an interface to allow users to make recommendations for improving the translation, thus allowing the engine to become more sophisticated and accurate.
- Before Chrome, Google Translate had an open API, which allowed them to collect content for use in their engine, but also made the web a generally more multilingual place. Using this open API, any website could add a snippet of code and see their site translated on the fly. Over three years ago, Google closed this API and began charging for the service, resulting in many websites becoming monolingual once again. Closing Google Translate has left a massive gap in the web and nothing yet has been able to fill the need.
- The open MT ecosystem currently suffers from not being able to provide potential users with a quality web service or API which both MT end users and web admins could use for their projects.
- Many Mozilla l10n teams consist of only 1-2 people. While they would love to provide coverage in their language for all of the support content and websites used to market to users and assist them with issues, they do not have the time to commit. Users thus have a localized Firefox but lack troubleshooting support in their language.
- More and more Mozillians are non-English speakers or do not have English writing skills. There have been efforts to provide language education for Mozillians; however, the opportunities are limited to a small percentage of Mozillians. These Mozillians are thus limited in their participation due to the significant language barrier.
- Language support selection for machine translation projects is driven, in part, by ROI and availability of resources. This often results in minority languages, and even some majority languages (see Indic languages), being under-represented in the machine translation ecosystem. As long as ROI continues to be a primary motivator for incorporating support for these languages, they will remain under-represented and unsupported.
- Data for machine translation corpora is often collected via web crawling and by consuming data that users unknowingly offer to these engines, whether through web crawling or by agreeing to the obscure terms and conditions of an MT service. Open data collection for MT corpora is either non-existent or an obscure practice.
Research questions
How does machine translation work?
There are four general approaches to Machine Translation. Most of the early work, before massive corpora, was done with Rule-based machine translation ( http://en.wikipedia.org/wiki/Rule-based_machine_translation ). However, most of the current work being done is with Statistical Machine Translation ( http://en.wikipedia.org/wiki/Statistical_machine_translation ). A brief description of each is available below.
Rule-Based Machine Translation
Uses pre-defined grammatical and syntactic rules and large bilingual dictionaries to translate text. Producing the necessary resources for this type of translation can be very costly, but according to http://blog.globalizationpartners.com/machine-translation.aspx it can actually "produce better quality for language pairs with very different word orders (for, example English to Japanese)".
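To make the "dictionary plus rules" idea concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the four-word English to Spanish lexicon, the part-of-speech tags, and the single adjective-noun reordering rule. A real rule-based engine needs far larger dictionaries and many more rules, which is where the cost comes from.

```python
# Toy rule-based translation: a bilingual dictionary plus one reordering rule.
# The lexicon, tags and rule are illustrative assumptions, not a real system.

# Hypothetical English -> Spanish lexicon: word -> (translation, part of speech)
LEXICON = {
    "the": ("el", "DET"),
    "red": ("rojo", "ADJ"),
    "car": ("coche", "NOUN"),
    "runs": ("corre", "VERB"),
}


def translate_rule_based(sentence):
    """Look up each word, then apply an adjective-noun reordering rule."""
    # Dictionary lookup; a word missing from the lexicon raises KeyError.
    tagged = [LEXICON[word] for word in sentence.lower().split()]

    # Rule: English "ADJ NOUN" becomes Spanish "NOUN ADJ".
    i = 0
    while i < len(tagged) - 1:
        if tagged[i][1] == "ADJ" and tagged[i + 1][1] == "NOUN":
            tagged[i], tagged[i + 1] = tagged[i + 1], tagged[i]
            i += 2
        else:
            i += 1

    return " ".join(word for word, _tag in tagged)


print(translate_rule_based("the red car runs"))  # el coche rojo corre
```

The sketch ignores agreement, ambiguity and unknown words, all of which a real system has to handle with additional hand-written resources.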
Statistical Machine Translation
Uses statistical information to choose the "best" translation from the possible translations of a text. As far as I know, all work with statistical machine translation requires a bilingual corpus for calculating the necessary probabilities.
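As a rough illustration of where those probabilities come from, the sketch below estimates P(target | source) by relative frequency from a handful of aligned pairs and then picks the most probable candidate. The pairs are made up for the example, and the function names are hypothetical. Real statistical systems learn alignments from sentence-aligned text and usually combine the translation probabilities with a language model, which this sketch leaves out.

```python
from collections import Counter, defaultdict

# Hypothetical aligned (source, target) pairs standing in for a bilingual corpus.
ALIGNED_PAIRS = [
    ("casa", "house"),
    ("casa", "house"),
    ("casa", "home"),
    ("perro", "dog"),
]


def train_translation_table(pairs):
    """Estimate P(target | source) by relative frequency of aligned pairs."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        counts[source][target] += 1

    table = {}
    for source, target_counts in counts.items():
        total = sum(target_counts.values())
        table[source] = {t: c / total for t, c in target_counts.items()}
    return table


def best_translation(table, source):
    """Return the candidate with the highest estimated probability."""
    candidates = table[source]
    return max(candidates, key=candidates.get)


table = train_translation_table(ALIGNED_PAIRS)
print(best_translation(table, "casa"))  # house
```

With more aligned data the estimates become more reliable, which is why corpus size and quality matter so much for this approach.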
Example-based Machine Translation
Uses cases and analogies, along with a parallel corpus, to determine the best translation. Somewhat similar to Rule-Based (http://en.wikipedia.org/wiki/Example-based_machine_translation).
Hybrid Machine Translation
A combination of the previously mentioned approaches.
What are the benefits and drawbacks to each methodology?
How do you measure the output quality of a machine translation engine?
- Automated evaluation
- BLEU Score - http://en.wikipedia.org/wiki/BLEU
- Compares MT output against reference translations consisting of professional human translation, assigning a score (based on n-gram precision) to determine how close to the human translation the MT output arrives.
- NIST - http://en.wikipedia.org/wiki/NIST_(metric)
- Similar to BLEU, but not all correct n-grams count equally: each correct n-gram is weighted according to its rarity of occurrence, so rarer n-grams contribute more.
- METEOR - http://en.wikipedia.org/wiki/METEOR
- Evaluation based on the harmonic mean of unigram precision and recall, with recall weighted more heavily than precision, whereas BLEU and NIST rely on precision.
- LEPOR - http://en.wikipedia.org/wiki/LEPOR
- A newer MT evaluation metric that combines precision, recall, sentence length, and n-gram based word order.
- WER score - https://en.wikipedia.org/wiki/Word_error_rate
- The Word Error Rate is the word-level Levenshtein distance between MT output and a reference translation, normalized by the length of the reference. It should correlate with the difficulty of post-editing machine translation output for publication.
- PWER (Position-independent WER) is a variant where reorderings are disregarded.
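Two of the ideas above, n-gram precision (the core ingredient of BLEU) and word-level Levenshtein distance (WER), can be sketched in a few lines. This is a simplified illustration and not a full implementation of either metric: it uses a single reference, whitespace tokenization, and omits BLEU's brevity penalty and its averaging over several n-gram sizes. The function names and example sentences are assumptions made for the sketch.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_ngram_precision(candidate, reference, n):
    """Clipped n-gram precision against a single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


def word_error_rate(candidate, reference):
    """Word-level Levenshtein distance divided by reference length.

    Assumes a non-empty reference.
    """
    # previous[j] holds the distance between the candidate words seen so far
    # and the first j reference words.
    previous = list(range(len(reference) + 1))
    for i, cand_word in enumerate(candidate, start=1):
        current = [i]
        for j, ref_word in enumerate(reference, start=1):
            substitution = previous[j - 1] + (cand_word != ref_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / len(reference)


reference = "the cat is on the mat".split()
candidate = "the cat sat on mat".split()

print(modified_ngram_precision(candidate, reference, 1))  # 0.8
print(modified_ngram_precision(candidate, reference, 2))  # 0.25
print(round(word_error_rate(candidate, reference), 3))    # 0.333
```

In the example, four of the five candidate words appear in the reference, one of the four candidate bigrams matches, and two edits (one substitution, one insertion) are needed to reach the six-word reference.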
What prominent machine translation engines are out there and what are they known for?
- This is a much more concise table of the current offerings. Includes both open and closed source engines that have front-end applications.
- This is a list of all open source MT engines. Some have web services, many do not.
Name | Owner | Method | Open/Closed | # of supported languages | Web hosted? |
---|---|---|---|---|---|
Google Translate | | Statistical | Closed | 70+ | translate.google.com |
Microsoft Translator | Microsoft | | Closed | | |
Babelfish | Yahoo! | | Closed | | |
MosesMT | | Statistical | Open | | |
Apertium | | Rule-based | Open | 30+ | apy |
Other | | | | | |
Other | | | | | |
Other | | | | | |
What prominent corpuses are currently available?
Name | Owner | Method | Open/Closed | # of languages | Noteworthy |
---|---|---|---|---|---|
Google Translate | | | | | |
Microsoft Translator | | | | | |
Babelfish | | | | | |
EuroParl | European Parliament | | Open | 22 | Sentence aligned text |
JRC-Acquis | European Union | | Open | 22 | Sentence aligned text |
Hansards Corpus | Canadian Govt | | Open | 2 | Sentence or smaller aligned text |
OPUS | | | Open | Many | Contains a variety of different corpora including some of those mentioned above |
MultiUN | United Nations | | Open | 7 | Sentence alignment |
What are the pros and cons of having a Mozilla MT engine?
What technology resources would be needed to build our own MT engine?
What human resources would be needed to build our own MT engine?
What partnership opportunities could be available for this project?
See https://www.taus.net/taus-machine-translation-showcase.
User stories
Firefox end-users
- I want to automatically translate web sites into my native language in Firefox desktop.
- I want to automatically translate web sites into my native language in Firefox for Android.
- I want to automatically translate web sites into my native language in the Firefox OS browser.
- I want to be able to give feedback and make corrections to machine translation output within these products.
- My minority language has a very small presence online, but it's my native language and I want to see the web translated into that language.
Browser users in general
- I want or need language tools in my browser, but am currently forced to either use Chrome or go without.
- I want language tools in my browser of choice
Web admins
- I want an open API to an open MT engine that will allow my users to automatically translate the page's content into their native language with the press of a button.
Businesses
- My product is popular in many countries, but I just don't have the resources to offer support in other languages. I want to better serve and retain customers who don't speak my language.
Non-English speaking Mozillians
- I want to be able to read emails sent to me in my native language.
- I want to be able to send emails to other Mozillians who don't speak my language, knowing that my message will be understood by anyone who reads it.
- I want to be able to participate in Mozilla forum discussions in my native language.
Non-English speaking potential Mozillians
- I want to support Mozilla, but my English is not good enough (or I have none) to participate.
Mozilla localizers
- I want to translate support pages (or marketing campaigns, or other projects) for my localization of Firefox, but it requires a lot of time to translate. I want to be able to post-edit MT output in order to still provide language coverage without the massive time commitment.
State of the Field
Researchers
- Andrew Ng (Stanford University)
- Philipp Koehn (University of Edinburgh) - Maintains http://statmt.org/
- Daniel Marcu (University of Southern California)
Bibliography
Overview
I have broken down the bibliography into two sections below. The first lists pages that collect many papers, such as conference proceedings and archives. The second lists specific papers that would be good to read.
Websites/Conference Proceedings
- http://statmt.org/
- AMTA 2012 Proceedings
- Machine Translation Archive
- An Introduction to Machine Translation
- A (Brief) History of Machine Translation
- TAUS Tutorial Requires you to be logged in to a free TAUS account
- TAUS Blog Post - Creating Quality MT with sparse parallel corpora
- Heafield, K., & Lavie, A. (2010). Combining Machine Translation Output with Open Source. The Prague Bulletin of Mathematical Linguistics (PBML), (93), 27–36. doi:10.2478/v10108-010-0008-4
- Vasi, A. (2012). Enabling Users to Create Their Own Web-Based Machine Translation Engine, 295–298.
Individual Papers/Articles/Presentations
- https://en.wikipedia.org/wiki/Machine_translation
- http://ice.he.net/~hedden/intro_mt.html (A little old but has some good info)
- http://michaelnielsen.org/blog/introduction-to-statistical-machine-translation/
- Machine Translation: An Introductory Guide; Arnold, Douglas and Balkan, Lorna and Meijer, Siety and Humphreys, R Lee and Sadler, Louisa; 2001; http://promethee.philo.ulg.ac.be/engdep1/download/bacIII/Arnold%20et%20al%20Machine%20Translation.pdf (Direct to PDF link)