English-Afaan Oromoo Translation Study

This document summarizes an experiment on English to Afaan Oromoo machine translation using a statistical approach. The experiment used 20,000 sentence pairs from various documents to build translation and language models. The models achieved an average BLEU score of 17.74% after correcting alignment errors. While the score is fair for this language pair given the limited data, increasing the size and quality of training data could improve accuracy. Next steps include growing the parallel corpus through system output and using comparable corpora.


English – Afaan Oromoo Machine Translation:
An Experiment Using a Statistical Approach

Sisay Adugna, Haramaya University, Ethiopia, sisayie@[Link]
Andreas Eisele, DFKI GmbH, Germany, eisele@[Link]

Outline

  Introduction
  Objectives
  Experiment
  Result and Discussion
  Conclusion
  Next Steps
  Acknowledgement
Introduction

  Afaan Oromoo (ISO language code: om)
    Mother tongue of 17 million people (MS Encarta)
    Official working language of 24,395,000 people (CSA)
    Also spoken in Kenya and Somalia

  English (ISO language code: en)
    Lingua franca of online information
    71% of all web pages ([Link])
Objectives

  The paper has two main goals:

1. To test how far we can go with the limited parallel corpus available for the English – Oromo language pair, and how applicable existing Statistical Machine Translation (SMT) systems are to this pair.
2. To analyze the output of the system in order to identify the challenges that need to be tackled.
Experiment
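
[Figure: system architecture. The monolingual corpus feeds language modeling, which produces the language model. The bilingual corpus is split into a training set and a test set; the training set feeds translation modeling, which produces the translation model. Decoding uses both models to turn the test-set source into the target, and evaluation compares the target with the test-set reference to give the performance metric.]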
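
How the language model and the translation model combine during decoding can be written in the classic noisy-channel form. This is a textbook formulation given here as an assumption about the setup, not a formula taken from the slides; the symbols e (English source sentence) and o (Oromo target sentence) are chosen for illustration. Moses in practice scores candidates with a log-linear combination of features, of which this product is a special case.

```latex
\hat{o} \;=\; \arg\max_{o} P(o \mid e)
        \;=\; \arg\max_{o} \underbrace{P(e \mid o)}_{\text{translation model}} \;
                           \underbrace{P(o)}_{\text{language model}}
```

The translation model is estimated from the bilingual training set, and the language model from the monolingual Oromo corpus.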
Experiment ...

  Data
    Documents include the Constitution of the FDRE (Federal Democratic Republic of Ethiopia),
    Proclamations of the Council of Oromia Regional State,
    the Universal Declaration of Human Rights and the Kenyan Refugee Act
    Religious and medical documents

  Source
    Council of Oromia Regional State (Caffee Oromiyaa)
    WWW
Experiment ...

  Size and organization
    20K sentence pairs (EN, OM), about 300,000 words, for the translation model (TM)
    62K sentences (OM), 1,024,156 words, for the language model (LM)
    90% for training and 10% for testing
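
A 90/10 split of a parallel corpus could be sketched as follows. The slides do not say how the split was drawn, so the random shuffle, the fixed seed, and the function name `split_parallel_corpus` are assumptions made for illustration only.

```python
import random


def split_parallel_corpus(pairs, test_fraction=0.1, seed=0):
    """Split aligned (source, target) pairs into training and test sets.

    Hypothetical helper: shuffling with a fixed seed is an assumption,
    not the procedure described in the slides.
    """
    shuffled = list(pairs)                     # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)      # reproducible shuffle
    n_test = round(len(shuffled) * test_fraction)
    test_set = shuffled[:n_test]
    training_set = shuffled[n_test:]
    return training_set, test_set


if __name__ == "__main__":
    # Placeholder sentence pairs, not real corpus data.
    corpus = [(f"english sentence {i}", f"oromo sentence {i}") for i in range(20)]
    train, test = split_parallel_corpus(corpus)
    print(len(train), len(test))
```

Each pair is moved as a unit, so the English and Oromo sides stay aligned after the split.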
Experiment ...

  Software tools used
    Preprocessing: Perl and Python scripts
    Language modeling: SRILM
    Alignment: GIZA++
    Phrase-based translation modeling: Moses
    Decoding: Moses
    Postprocessing: Perl scripts
    Evaluation: Perl script
    Demonstration: Python scripts
Result and Discussion

  Sentence aligner mistake in tokenization
    Caused by the apostrophe, called hudhaa ('), in Oromo
    Wrong tokenization: bal'ina is split into bal ' ina
    Results in wrong alignment
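
One way to avoid splitting words at the hudhaa is to treat an apostrophe between two word characters as part of the word. The sketch below is a minimal illustration using only Python's standard `re` module; it is not the preprocessing script used in the experiment, and the function names are hypothetical.

```python
import re

# A word may contain internal apostrophes (hudhaa); any other
# non-space, non-word character becomes its own token.
HUDHAA_AWARE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

# Naive pattern that reproduces the error described above.
NAIVE = re.compile(r"\w+|[^\w\s]")


def tokenize_naive(text):
    """Split on every punctuation mark, including the hudhaa."""
    return NAIVE.findall(text)


def tokenize_hudhaa_aware(text):
    """Keep word-internal apostrophes attached to the word."""
    return HUDHAA_AWARE.findall(text)


if __name__ == "__main__":
    print(tokenize_naive("bal'ina."))         # the word is broken into three tokens
    print(tokenize_hudhaa_aware("bal'ina."))  # the word stays whole
```

An apostrophe used as a quotation mark at the start or end of a word is still split off, because it is not surrounded by word characters on both sides.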
Result and Discussion ...

  Impurity in the data
    Misaligned sentence pairs were found to cause a low BLEU score of 5.06%
    Example of a wrongly aligned sentence pair [figure omitted]

    Correcting the sentence pairs manually improved the BLEU score to 17.74%
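
The correction in the experiment was done manually. As a hedged illustration of how candidate pairs might be flagged for such a review, the sketch below uses a simple length-ratio heuristic. This is a swapped-in technique, not the authors' method; the function name `flag_suspect_pairs` and the threshold `max_ratio=2.0` are assumptions.

```python
def flag_suspect_pairs(pairs, max_ratio=2.0):
    """Return indices of sentence pairs whose word counts differ suspiciously.

    Hypothetical heuristic: a pair is flagged when one side is empty or
    when the longer side has more than `max_ratio` times the words of
    the shorter side.
    """
    suspects = []
    for index, (source, target) in enumerate(pairs):
        shorter, longer = sorted((len(source.split()), len(target.split())))
        if shorter == 0 or longer / shorter > max_ratio:
            suspects.append(index)
    return suspects


if __name__ == "__main__":
    # Placeholder tokens, not real corpus data.
    sample = [
        ("a b c", "x y z"),          # similar length: kept
        ("a b c d e f g", "x"),      # very different length: flagged
        ("a", ""),                   # empty side: flagged
    ]
    print(flag_suspect_pairs(sample))
```

A heuristic like this only narrows down which pairs to inspect; the flagged pairs still need a human decision.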
Result and Discussion ...

•  Result after improving the alignment

•  Average BLEU score of 17.74%

•  As the n-gram order n increases, n-gram precision decreases sharply
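
The drop in precision with growing n follows from how BLEU is computed. The sketch below is a minimal sentence-level BLEU with a single reference, written for illustration. It is not the Perl evaluation script used in the experiment, which would normally aggregate counts over the whole test set; the function names and the toy sentences are assumptions.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate, reference, max_n=4):
    """Geometric mean of the 1..max_n precisions times the brevity penalty."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # one zero precision zeroes the geometric mean
    log_mean = sum(math.log(p) for p in precisions) / max_n
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_mean)


if __name__ == "__main__":
    # Placeholder tokens: one wrong word out of five.
    candidate = "a b c d e".split()
    reference = "a b c x e".split()
    for n in range(1, 5):
        print(n, modified_precision(candidate, reference, n))
```

In the toy example a single wrong word leaves four of five unigrams matching, but it breaks every 4-gram that contains it, so higher-order precision falls much faster than unigram precision.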
Result and Discussion ...

  In addition to the limited size and impurity of the data, the BLEU score was affected by:
    Availability of only a single reference translation
    Domain of the test data
      The system performs better when tested on religious documents than on documents from other domains
Conclusion

  How well has this system performed?
    Average BLEU score was 17.74%
    Compared with what?
      No previous MT system for Oromoo to compare against
      Compared to other systems
        A fair score, as shown in the tables on the following slide
Conclusion (Cont.)

•  Size

•  Score

(From Koehn, 2005)

Next Steps

  Grow the parallel corpora for this language pair using the output of the system
  Consider collecting and using comparable corpora
  Build linguistic models of Oromo morphology in a suitable finite-state formalism
Relation to ongoing projects

EuroMatrix Plus plans to build
  easy-to-access MT engines for many EU language pairs
  a platform for translation and post-editing of Wikipedia articles
Languages like Oromoo could be easily incorporated

ACCURAT works on learning MT models from comparable corpora, which would be highly applicable to Oromoo

We would need additional manpower to make this happen
Acknowledgement

  EU projects EuroMatrix and EuroMatrix Plus


  Saarland University
  DFKI GmbH
  Addis Ababa University
  German Academic Exchange Service (DAAD)
