Indonesian – English Parallel Texts for
Statistical Machine Translation
(Hammam Riza, Adiansya Prasetya, Henky Mulyadi)
Background
The Republic of Indonesia:
• An archipelago of 13,000 islands spread over an area of 1,900,000 square
kilometers
• A population of 245,000,000 (July 2006 estimate)
• GDP growth of 7% per year was recorded
• Indonesia's economic and political conditions are gradually stabilizing
• Indonesia is back on track to become an industrialized nation
• Bahasa Indonesia became the formal language of the country, uniting its
citizens who speak different languages
• Bahasa Indonesia has become the language that bridges the language
barrier among Indonesians who have different mother-tongues
• The vocabulary of Bahasa Indonesia has been extensively influenced by
outside languages, especially Sanskrit, Arabic, Chinese, Dutch, and
English, as well as local languages such as Javanese and Batavian
Research Topics
[Figure: research topics diagram. Labels recovered from the diagram:]
• Application areas: Tourism in Asia; Business in Asia; Social Service in Asia; Safety and Security in Asia; Education in Asia; Archiving of Asian Language
• Multi-lingual Speech translation; Multi-lingual Transcription; Speech and Language formats
• Multi-lingual Speech Translation; Multi-lingual Speech Transcription; Multi-lingual Speech and Text Archive
• Parallel Corpus (Synonymous Speech + Text): Indonesian Language Speech+Text, English Language Speech+Text
• Parallel Corpus Format
• Dictionary
Corpus Collection and Processing
Data collection schema:
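[Figure: data collection schema. The diagram shows two paths, both ending in Indonesian-English aligned sentences; the order of the steps is given as far as it can be recovered:]
• Antara News Agency article DB (Oracle DB) → Transformation → Selected Corpus (SQL 2000) → Selection & Alignment → Indonesian-English aligned sentences
• Web news → Collection & Conversion to Text → Indonesian text and English text → Cleaning → Clean text corpus → Corpus Alignment → Indonesian-English aligned sentences (Toggle)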
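The cleaning step that precedes alignment can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the authors' actual procedure: the function names `clean_text` and `clean_parallel`, the tag-stripping rule, and the length-ratio threshold `max_ratio` are all hypothetical, and the sketch assumes the two sides are already paired line by line.

```python
import re

def clean_text(line):
    """Strip leftover HTML tags and collapse runs of whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", line)
    return re.sub(r"\s+", " ", without_tags).strip()

def clean_parallel(indonesian_lines, english_lines, max_ratio=3.0):
    """Keep line-paired sentences that survive cleaning.

    Assumption: a pair is dropped when either side is empty or when
    the word counts differ by more than max_ratio (hypothetical rule).
    """
    pairs = []
    for ind, eng in zip(indonesian_lines, english_lines):
        ind, eng = clean_text(ind), clean_text(eng)
        if not ind or not eng:
            continue
        n_ind, n_eng = len(ind.split()), len(eng.split())
        if max(n_ind, n_eng) / min(n_ind, n_eng) > max_ratio:
            continue
        pairs.append((ind, eng))
    return pairs

pairs = clean_parallel(
    ["<p>Dapatkah  saya check in sekarang</p>", "", "Ya"],
    ["Can I check in now", "Hello", "Yes this is a very long unrelated sentence"],
)
```

Only the first pair survives in this example: the second has an empty Indonesian side, and the third fails the length-ratio check.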
Translation of SMT System (1)
• A. Language model
– The SRI Language Modeling Toolkit (SRILM) extracts a 3-gram language model from
the data. Besides the SRILM distribution, building it requires the following freely
available tools: an ANSI-C/C++ compiler (gcc version 3.4.3 or higher), GNU make,
GNU gawk, GNU gzip, and Tcl. Building SRILM on a Microsoft Windows system
also requires the CYGWIN porting layer.
– Functionalities of SRILM:
• Generate the n-gram count file from the corpus
• Train the language model from the n-gram count file
• Calculate the test data perplexity using the trained language model
[Figure: SRILM workflow]
• Training corpus → ngram-count → count file
• Count file + lexicon → ngram-count → LM
• LM + test data → ngram → ppl
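The three SRILM functions listed above (count n-grams, train a model, compute test perplexity) can be sketched in plain Python. This is an illustrative toy, not SRILM itself: it uses add-one smoothing for brevity, which is a swapped-in technique and not the discounting SRILM applies, and the function names and sample sentences are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(sentences, n=3):
    """Count n-grams and their (n-1)-word contexts over padded sentences."""
    grams, contexts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            grams[context + (tokens[i],)] += 1
            contexts[context] += 1
    return grams, contexts

def train_lm(grams, contexts, vocab_size):
    """Return P(word | context) with add-one smoothing (assumption)."""
    def prob(context, word):
        return (grams[context + (word,)] + 1) / (contexts[context] + vocab_size)
    return prob

def perplexity(prob, sentences, n=3):
    """Perplexity = exp of the negative mean log-probability per token."""
    log_sum, count = 0.0, 0
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for i in range(n - 1, len(tokens)):
            log_sum += math.log(prob(tuple(tokens[i - n + 1:i]), tokens[i]))
            count += 1
    return math.exp(-log_sum / count)

train = ["saya mau check in sekarang", "saya mau makan sekarang"]
grams, contexts = ngram_counts(train)
vocab = {w for s in train for w in s.split()} | {"</s>"}
lm = train_lm(grams, contexts, len(vocab))
seen_ppl = perplexity(lm, ["saya mau check in sekarang"])
unseen_ppl = perplexity(lm, ["sekarang in check mau saya"])
```

A sentence seen in training gets a lower perplexity than the same words in scrambled order, which is the behaviour the perplexity test is meant to expose.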
Translation of SMT System (2)
• B. Translation model
– The bin directory contains GIZA++, an implementation based on the IBM models, and
mkcls, which divides words into probabilistically based classes.
– In order to compile GIZA++ you may need:
• a recent version of the GNU compiler (2.95 or higher)
• a recent version of assembler and linker which do not have restrictions with
respect to the length of symbol names
– The corpus directory is where the data should be placed when training the translation model.
[Figure: Pharaoh SMT system overview. Labels in the diagram: source; preparations (program (SRILM), compiler); data generation; train phrase model; translation model; Pharaoh SMT system; target]
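To give a feel for what a tool based on the IBM models estimates, here is a minimal sketch of IBM Model 1 trained with EM on a toy English-Indonesian corpus. It is not GIZA++ code: it omits the NULL word and the higher IBM models, and the function name `train_ibm_model1` and the three sentence pairs are hypothetical.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """Estimate t(target_word | source_word) with EM (no NULL word)."""
    corpus = [(src.split(), tgt.split()) for src, tgt in pairs]
    t = defaultdict(lambda: 1.0)  # uniform start, normalised by the first E-step
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in corpus:
            for f in tgt:
                # E-step: share one count for f among the source words
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise the expected counts per source word
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

toy_pairs = [
    ("the house", "rumah itu"),
    ("the book", "buku itu"),
    ("a book", "sebuah buku"),
]
t = train_ibm_model1(toy_pairs)
```

After a few iterations the probability mass for "book" concentrates on "buku" and for "house" on "rumah", even though no word alignments were given.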
Testing System Performance (1)
Sentences are translated with the Pharaoh decoder.
The files used for translation are:
– pharaoh (executable)
– [Link]
– [Link]
– phrase-table
Example:
Type the following command:
echo 'Can I check in now' | ./pharaoh -f ./[Link] > OUT
The command writes its output to the file OUT. To see the result, type:
cat OUT
The result displayed is “Dapatkah saya check in sekarang”
Testing System Performance (2)
Testing Performance
BLEU score
On a sample of 275,000 sentences, the BLEU score is 0.878
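For readers unfamiliar with the metric, the following is a minimal sketch of corpus-level BLEU: the geometric mean of modified 1- to 4-gram precisions multiplied by a brevity penalty. It assumes one reference per hypothesis, whitespace tokenisation, and no smoothing; the function names and example sentences are hypothetical, and this is not the scoring script used to obtain the figure above.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU, one reference per hypothesis, no smoothing."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # clipped matches: an n-gram counts at most as often as in the reference
            matched[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            total[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)

reference = ["dapatkah saya check in sekarang"]
exact = corpus_bleu(["dapatkah saya check in sekarang"], reference)
partial = corpus_bleu(["dapatkah saya check in hari ini"], reference)
```

An exact match scores 1.0, and a hypothesis that shares only part of its n-grams with the reference scores strictly between 0 and 1.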
Thank you...