Indonesian-English Translation Corpus Guide

This document discusses the development of parallel corpora and statistical machine translation systems between Indonesian and English. It describes collecting parallel texts from news sources, preprocessing and aligning the data to create corpora. It then explains training translation models using SRILM and GIZA++, and using the Pharaoh decoder to generate translations. Evaluation is done using BLEU scores, with a sample text achieving a score of 0.878.


Indonesian – English Parallel Texts for Statistical Machine Translation

( Hammam Riza, Adiansya Prasetya, Henky Mulyadi )

Background
The Republic of Indonesia:
• An archipelago of 13,000 islands spread over an area of 1,900,000 square kilometers
• Population of 245,000,000 (July 2006 estimate)
• GDP growth of 7% per year was recorded
• The Indonesian economy and political conditions are gradually stabilizing
• Indonesia is back on track to become an industrialized nation
• Bahasa Indonesia became the formal language of the country, uniting citizens who speak different languages
• Bahasa Indonesia has become the language that bridges the language barrier among Indonesians who have different mother tongues
• The vocabulary of Bahasa Indonesia has been extensively influenced by other languages, especially Sanskrit, Arabic, Chinese, Dutch, and English, as well as local languages such as Javanese and Batavian
Research Topics
[Diagram: layered overview of research topics, top to bottom]
• Tourism in Asia; Business in Asia; Social Service in Asia; Safety and Security in Asia; Education in Asia; Archiving of Asian Language
• Multi-lingual Speech translation; Multi-lingual Transcription; Speech and Language formats
• Multi-lingual Speech Translation; Multi-lingual Speech Transcription; Multi-lingual Speech and Text Archive
• Parallel Corpus (Synonymous Speech + Text): Indonesian Language Speech+Text; English Language Speech+Text
• Parallel Corpus Format
• Dictionary
Corpus Collection and Processing
Data collection schema:
[Flow diagram: labels recovered from the slide]
• Antara News Agency DB (Oracle DB) → Transformation → Article DB (SQL 2000) → Selection & Alignment → Selected Corpus, Indonesian-English → Alignment → Sentences
• Web News (Indonesian-English Toggle) → Collection & Cleaning → Web Corpus → Alignment → Sentences
• Indonesian Text and English Text → Conversion to Text → Clean Text Corpus
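The cleaning and selection steps in the schema above can be illustrated with a minimal sketch. It assumes the aligned data is already available as (Indonesian, English) sentence pairs; the function names and the `max_ratio` threshold are hypothetical choices for illustration, not taken from the slides.

```python
import re

def clean_sentence(text):
    """Collapse runs of whitespace and strip leading/trailing spaces."""
    return re.sub(r"\s+", " ", text).strip()

def filter_pairs(pairs, max_ratio=3.0):
    """Keep pairs that are non-empty and of plausibly similar length.

    max_ratio is a hypothetical limit on the token-count ratio
    between the two sides of a pair.
    """
    kept = []
    for ind, eng in pairs:
        ind, eng = clean_sentence(ind), clean_sentence(eng)
        if not ind or not eng:
            continue  # one side is missing, so the pair is not aligned
        n_ind, n_eng = len(ind.split()), len(eng.split())
        if max(n_ind, n_eng) / min(n_ind, n_eng) > max_ratio:
            continue  # lengths differ too much to be a likely translation
        kept.append((ind, eng))
    return kept

pairs = [
    ("Dapatkah  saya check in sekarang", "Can I check in now"),
    ("", "Orphan English sentence"),
    ("Ya", "This English sentence is far too long to match"),
]
cleaned = filter_pairs(pairs)
print(cleaned)
```

Only the first pair survives: the second has an empty Indonesian side and the third fails the length-ratio check.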
Translation of SMT System (1)
• A. Language model
– The SRI Language Modeling Toolkit (SRILM) extracts a 3-gram language model from the data. Besides the SRILM distribution, building it requires the following freely available tools: an ANSI C/C++ compiler (gcc version 3.4.3 or higher), GNU make, GNU gawk, GNU gzip, Tcl, and, on a Microsoft Windows system, the Cygwin porting layer.
– Functionalities of SRILM:
• Generate the n-gram count file from the corpus
• Train the language model from the n-gram count file
• Calculate the test data perplexity using the trained language model

[Diagram: SRILM workflow]
Training corpus → ngram-count → count file
Count file + lexicon → ngram-count → LM
Test data + LM → ngram → ppl (perplexity)
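The three steps (count n-grams, estimate a model, compute perplexity on test data) can be sketched in a few lines of Python. This is an illustration of the idea only, not SRILM itself: it uses add-one smoothing as a simplifying assumption, whereas SRILM provides more sophisticated discounting and backoff, and all function names here are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list padded with sentence markers."""
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

def train_trigram(sentences):
    """Count trigrams and their two-word histories; collect the vocabulary."""
    tri, hist, vocab = Counter(), Counter(), {"</s>"}
    for sent in sentences:
        tokens = sent.split()
        vocab.update(tokens)
        for gram in ngrams(tokens, 3):
            tri[gram] += 1
            hist[gram[:2]] += 1
    return tri, hist, vocab

def perplexity(model, sentences):
    """Perplexity with add-one smoothing (an assumption, not SRILM's method)."""
    tri, hist, vocab = model
    log_prob, n_events = 0.0, 0
    for sent in sentences:
        for gram in ngrams(sent.split(), 3):
            p = (tri[gram] + 1) / (hist[gram[:2]] + len(vocab))
            log_prob += math.log(p)
            n_events += 1
    return math.exp(-log_prob / n_events)

model = train_trigram(["saya check in sekarang", "saya check out sekarang"])
ppl_seen = perplexity(model, ["saya check in sekarang"])
ppl_scrambled = perplexity(model, ["sekarang in check saya"])
print(ppl_seen, ppl_scrambled)
```

A sentence seen in training gets a lower perplexity than the same words in scrambled order, which is the property the decoder relies on when it scores candidate translations.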

Translation of SMT System (2)
• B. Translation model
– The bin directory contains GIZA++, an implementation based on the IBM models, and mkcls, which groups words into statistically derived classes.
– To compile GIZA++ you may need:
• a recent version of the GNU compiler (2.95 or higher)
• a recent version of the assembler and linker that does not restrict the length of symbol names

– The corpus directory is where the data should be placed when training the translation model.

[Diagram: source and target data → Data preparations → Translation Model Generation (Program (SRILM), Compiler) → Train Phrase Model → Pharaoh SMT System]
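Since GIZA++ is described as an implementation based on the IBM models, the simplest of them, IBM Model 1, gives a feel for how word translation probabilities are learned from sentence pairs. The sketch below is a toy illustration under stated assumptions: it omits the NULL word and the higher models that GIZA++ also covers, and the function name and tiny corpus are hypothetical.

```python
def train_ibm1(pairs, iterations=10):
    """Estimate t(f | e) with EM, where f is a source word and e a target word."""
    corpus = [(f.split(), e.split()) for f, e in pairs]
    f_vocab = {w for f_sent, _ in corpus for w in f_sent}
    # Uniform initialisation over all co-occurring word pairs
    t = {(f, e): 1.0 / len(f_vocab)
         for f_sent, e_sent in corpus for f in f_sent for e in e_sent}
    for _ in range(iterations):
        count, total = {}, {}
        # E-step: collect expected alignment counts
        for f_sent, e_sent in corpus:
            for f in f_sent:
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    frac = t[(f, e)] / norm
                    count[(f, e)] = count.get((f, e), 0.0) + frac
                    total[e] = total.get(e, 0.0) + frac
        # M-step: renormalise the expected counts per target word
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

toy_pairs = [
    ("rumah itu", "the house"),
    ("buku itu", "the book"),
    ("sebuah buku", "a book"),
]
t = train_ibm1(toy_pairs)
print(round(t[("rumah", "house")], 3), round(t[("itu", "house")], 3))
```

Although "rumah" and "itu" both co-occur with "house" exactly once, the other sentence pairs tie "itu" to "the", so EM shifts probability toward the pairing of "rumah" with "house".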
Testing System Performance (1)
Sentences are translated with the Pharaoh decoder.
The files used for translation are:
– pharaoh (executable)
– [Link]
– [Link]
– phrase-table

Example:
Type the following command:
echo 'Can I check in now' | ./pharaoh -f ./[Link] > OUT
This writes the result to the file OUT. To see the result, type:
cat OUT
The result shown is "Dapatkah saya check in sekarang".
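To show the role the phrase table plays in producing the output above, here is a minimal sketch of monotone phrase-based decoding by dynamic programming. It is not Pharaoh's algorithm: Pharaoh is a beam search decoder that also uses a language model and allows reordering. The phrase table entries and probabilities are invented for illustration.

```python
import math

def decode_monotone(sentence, phrase_table, max_len=3):
    """Best left-to-right translation under a toy phrase table."""
    words = sentence.split()
    n = len(words)
    # best[i] = (log-probability, target phrases) covering the first i words
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for i in range(n):
        score_i, out_i = best[i]
        if score_i == -math.inf:
            continue  # no translation reaches position i
        for j in range(i + 1, min(i + max_len, n) + 1):
            src = " ".join(words[i:j])
            for tgt, prob in phrase_table.get(src, []):
                cand = score_i + math.log(prob)
                if cand > best[j][0]:
                    best[j] = (cand, out_i + [tgt])
    if best[n][0] == -math.inf:
        return None  # the table cannot cover the whole sentence
    return " ".join(best[n][1])

# Hypothetical entries: source phrase -> [(target phrase, probability)]
toy_table = {
    "Can I": [("Dapatkah saya", 0.7), ("Bisakah saya", 0.3)],
    "Can": [("Dapat", 0.4)],
    "I": [("saya", 0.9)],
    "check in": [("check in", 0.9)],
    "now": [("sekarang", 0.8)],
}
result = decode_monotone("Can I check in now", toy_table)
print(result)
```

With this toy table the two-word phrase "Can I" outscores translating "Can" and "I" separately, and the decoder returns "Dapatkah saya check in sekarang".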
Testing System Performance (2)
Testing Performance

BLEU Score

Sample: 275,000 sentences; BLEU score = 0.878
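For readers unfamiliar with the metric, BLEU combines clipped n-gram precisions (usually up to 4-grams) with a brevity penalty. The sketch below computes a corpus-level score with a single reference per sentence. It is a simplified illustration, not the scoring script used for the result above, and the function names and example sentences are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis."""
    matched, total = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # Clipping: credit an n-gram at most as often as the reference has it
            matched[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            total[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: penalise output that is shorter than the reference
    penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return penalty * math.exp(log_precision)

reference = ["Dapatkah saya check in sekarang"]
exact = corpus_bleu(["Dapatkah saya check in sekarang"], reference)
partial = corpus_bleu(["Bisakah saya check in sekarang"], reference)
print(exact, partial)
```

An exact match scores 1.0, and changing a single word lowers the score because every n-gram containing that word no longer matches.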

Thank you...

sunset in Kuta, Bali
