
U T

F II
C  D H

Language Segmentation

Author: David A
Supervisors: Prof. Dr. Caroline Sporleder, Dr. Sven N

August 18, 2015


Declaration on the Master's Thesis
I hereby declare that I wrote this master's thesis independently, that I used no sources or aids other than those indicated, and that I have marked as such all ideas taken directly or indirectly from other sources. I have not previously submitted this thesis in the same or a comparable form to any other examination office. It has not been published before.

Date Signature

Abstract
Language segmentation consists in finding the boundaries where one language ends and another language begins in a text written in more than one language. This is important for all natural language processing tasks.
The problem can be solved by training language models on language data. However, in the case of low- or no-resource languages, this is problematic. I therefore investigate whether unsupervised methods perform better than supervised methods when it is difficult or impossible to train supervised approaches. A special focus is given to difficult texts, i.e. texts that are rather short (one sentence) or that contain abbreviations, low-resource languages and non-standard language.
I compare three approaches: supervised n-gram language models, unsupervised clustering and weakly supervised n-gram language model induction. I devised the weakly supervised approach in order to deal with difficult text specifically. In order to test the approach, I compiled a small corpus of different text types, ranging from one-sentence texts to texts of about 300 words.
The weakly supervised language model induction approach works well on short and difficult texts, outperforming the clustering algorithm and reaching scores in the vicinity of the supervised approach. The results look promising, but there is room for improvement and a more thorough investigation should be undertaken.

Anowledgements
My thanks go to professor Caroline Sporleder for sharing her knowledge with me, for
her inspiring ideas and for agreeing to supervise my Bachelor’s and Master’s esis
despite her busy schedule. It was also thanks to the topic she suggested for my Bach-
elor’s esis that I met Jürgen Knauth and later was able to get a research assistant
position at the SeNeReKo project, collaborating closely with Jürgen.
Which brings me to the next person on the list. I would like to thank Jürgen Knauth
for the wonderful collaboration, for his patience, for his contagious enthusiasm, and
all the interesting conversations in passing that always lasted longer than intended.
I would like to thank Stephan Faber for his insightful comments when I couldn’t
see the wood for the trees, for his patience and optimism, for pushing me to go further
and to persevere.
I would also like to thank Julian Vaudroz for accompanying me throughout the
degree program. We both didn’t know what we were in for when we started, but we
persevered and it paid off. It wouldn’t have been the same without you.
Finally, I would like to thank all the people that volunteered to proofread my thesis
and all the people that helped me during the writing of this thesis. Unfortunately, I
cannot list everyone. You know who you are!

List of Figures
1 Out-of-place metric
2 Simple text illustration
3 Initial model creation
4 Initial model evaluation
5 Model update
6 Evaluation
7 New model creation
8 Multiple model evaluation
9 Updating relevant model
10 Multiple model evaluation 2
11 New model creation 2
12 Problematic text sample
13 Finding the most similar models
14 Merging most similar models
15 Word-Model assignment
16 Clustering preprocessor
17 WEKA: Cluster visualization
18 ELKI: Cluster visualization
19 Language model: Distribution 1
20 Language model: Distribution 2
21 Language model: Distribution 3
22 Alternating language structure

List of Tables
1 Training data: Size
2 Unambiguous encoding: distances
3 Simplified encoding: distances
4 N-Gram language model results: Latin script
5 N-Gram language model results: Mixed script
6 N-Gram language model results: Pali data
7 N-Gram language model results: Twitter data
8 Textcat results: Latin script
9 Textcat results: Mixed script
10 Textcat results: Pali data
11 Textcat results: Twitter data
12 Clustering results: Latin script
13 Clustering results: Mixed script
14 Clustering results: Pali data
15 Clustering results: Twitter data
16 Induction results: Latin script
17 Induction results: Mixed script
18 Induction results: Pali data
19 Induction results: Twitter data
20 ‘Twitter 3’: Textcat versus Gold clustering
21 ‘Twitter 4’: Textcat versus Gold clustering

List of Algorithms
1 N-gram numerical encoding
2 Model induction
3 Initial model creation
4 Max model and max score
5 Model merger
6 Distributional Similarity Calculation

Contents
1 Introduction

2 Related work
2.1 N-Grams and rank order statistics
2.2 N-Grams and maximum likelihood estimator
2.3 Trigrams and short words
2.4 N-Grams and clustering
2.5 Inclusion detection
2.6 Clustering and speech
2.7 Monolingual training data
2.8 Predictive suffix trees

3 Theory
3.1 Supervised language model
3.1.1 N-Gram models
3.1.2 Formal definition
3.1.3 Smoothing
3.2 Unsupervised clustering
3.3 Weakly supervised language model induction

4 Experimental setup
4.1 Data
4.2 Supervised language model
4.2.1 Implementation
4.2.2 Training phase
4.2.3 Application of the approach
4.2.4 Textcat and language segmentation
4.3 Unsupervised clustering
4.3.1 Preprocessing
4.3.2 Defining features
4.3.3 Mapping features to a common scale
4.3.4 The problem of unambiguous encoding
4.3.5 The clusterer
4.3.6 Evaluating clusterings
4.4 Weakly supervised language model induction
4.4.1 Distributional similarity
4.4.2 Evaluating results
4.4.3 Estimating the parameters

5 Results
5.1 N-Gram language model
5.2 Textcat
5.3 Clustering
5.4 Language model induction

6 Discussion
6.1 N-Gram language models
6.2 Textcat
6.3 Clustering
6.4 Language model induction
6.5 Scores

7 Conclusion

8 Appendix
8.1 Development data
8.1.1 Latin script data
8.1.2 Mixed script data
8.1.3 Twitter data
8.1.4 Pali dictionary data
8.2 Test data
8.2.1 Latin script data
8.2.2 Mixed script data
8.2.3 Twitter data
8.2.4 Pali dictionary data
8.3 Results
8.3.1 N-Gram Language Models
8.3.2 Textcat
8.3.3 Clustering
8.3.4 Language Model Induction

1 Introduction
Language segmentation and identification are important for all natural language pro-
cessing operations that are language-specific, such as taggers, parsers or machine
translation (Jain and Bhat, 2014; Zubiaga et al., 2014). Indeed, using “traditional” mono-
lingual natural language processing components on mixed language data leads to mis-
erable results (Jain and Bhat, 2014). Even if the results are not terrible, language identi-
fication and segmentation can improve the overall results. For example, by identifying
foreign language inclusions in an otherwise monolingual text, parser accuracy can be
increased (Alex et al., 2007).
One important point that has to be borne in mind is the difference between lan-
guage identification and language segmentation. Language identification is concerned
with recognizing the language at hand. It is possible to use language identification
for language segmentation. Indeed, by identifying the languages in a text, the seg-
mentation is implicitly obtained. Language segmentation on the other hand is only
concerned with identifying language boundaries. No claims about the languages in-
volved are made.
Aer giving an overview over related work and different approaches that can be
taken for language segmentation, I will present the theory behind supervised methods
as well as unsupervised methods. Finally, I will introduce a weakly supervised method
for language segmentation that I developed.
Aer the theoretical part, I will present experiments done with the different ap-
proaches, comparing their effectiveness on the task of language segmentation on dif-
ferent text types. A special focus will be given to difficult text types, such as short texts,
texts containing under-resourced languages or texts containing a lot of abbreviations
or other non-standard features.
A big advantage of unsupervised methods is language independence. If the ap-
proach used does not rely on language-specific details, the approach is more flexible
as no language resources have to be adapted for the method to work on other lan-
guages. ese advantages might be especially useful for under-resourced languages.
When there is no or insufficient data available to train a supervised language model,
an unsupervised approach might yield beer results.
Another advantage is that unsupervised methods do not require prior training.
ey are not dependent on training data and thus cannot be skewed by the data. In-
deed, supervised approaches that are trained on data are qualitatively tied to their
training data; different training data will, in all probability, yield different models.
is thesis aims at answering the question whether unsupervised language seg-
mentation approaches work beer on difficult text types than supervised language
approaches.

2 Related work
2.1 N-Grams and rank order statistics
Cavnar and Trenkle (1994) use an n-gram language model for language identification purposes. Their program ‘Textcat’ is intended to classify documents by language. The system calculates n-grams for $1 \le n \le 5$ from training data and orders the n-grams by descending frequency, i.e. from the most frequent n-grams to the least frequent n-grams. The numerical frequency data is then discarded and remains only implicitly present in the rank order.
During training, the program calculates an n-gram profile consisting of these n-gram lists for each category (i.e. language to classify).
New data is classified by first calculating the n-gram profile of the document and then comparing it to the existing category profiles. The category with the lowest difference score is taken as the category for the document.
The score they use for classification is called the out-of-place metric. For each n-gram in the document n-gram profile, the corresponding n-gram in the category profile is looked up and the absolute difference of ranks is taken as score. The sum is calculated over all n-grams. More formally, the out-of-place metric $m_{oop}$ is calculated as:

\[ m_{oop} = \sum_{i=1}^{n} \left| r(x_i, d) - r(x_i, c) \right| \tag{1} \]

with $n$ the number of n-grams in the document profile, $x_i$ the i-th n-gram, $r(x_i, d)$ the rank of the i-th n-gram in the document profile, and $r(x_i, c)$ the rank of the i-th n-gram in the category profile.
Figure 1 illustrates the out-of-place metric.

Rank                 Category profile   Document profile   Score
1 (most frequent)    AT                 ER                 no match → max
2                    ING                ING                0
3                    TH                 AT                 2
4 (least frequent)   WH                 TH                 1

Figure 1: Out-of-place metric

In figure 1, the document profile has ‘ER’ as its most frequent n-gram, at rank 1, followed by ‘ING’ at rank 2, etc. The category profile does not contain the n-gram ‘ER’; in that case, an arbitrary fixed maximum value is assigned. The category profile contains the n-gram ‘ING’ at rank 2, the same rank as in the document profile; the difference is 0. The category profile contains the n-gram ‘AT’ at rank 1, while in the document profile, it occurs at rank 3. The absolute difference is 2. The out-of-place metric consists of the sum of all scores thus calculated.
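The ranking and scoring steps described above can be illustrated with a short Python sketch. It is not the original Textcat code: the function names, the underscore used as padding symbol, the alphabetical tie-breaking and the default penalty for missing n-grams (the length of the category profile) are all assumptions made for illustration.

```python
from collections import Counter


def ngram_profile(text, max_n=5, size=400):
    """Rank the character n-grams (1 <= n <= max_n) of `text`, most frequent first."""
    counts = Counter()
    for token in text.split():
        padded = f"_{token}_"  # assumed padding symbol marking word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending frequency; ties are broken alphabetically (an assumption).
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:size]


def out_of_place(doc_profile, cat_profile, max_penalty=None):
    """Sum of absolute rank differences between a document and a category profile."""
    if max_penalty is None:
        # Assumed fixed maximum for n-grams missing from the category profile.
        max_penalty = len(cat_profile)
    cat_rank = {gram: rank for rank, gram in enumerate(cat_profile)}
    score = 0
    for rank, gram in enumerate(doc_profile):
        score += abs(rank - cat_rank[gram]) if gram in cat_rank else max_penalty
    return score


def classify(text, category_profiles):
    """Return the category whose profile has the lowest out-of-place score."""
    doc = ngram_profile(text)
    return min(category_profiles, key=lambda cat: out_of_place(doc, category_profiles[cat]))
```

Applied to the two profiles of figure 1 with an assumed maximum penalty of 4, `out_of_place(["ER", "ING", "AT", "TH"], ["AT", "ING", "TH", "WH"], max_penalty=4)` adds up the per-n-gram scores 4, 0, 2 and 1.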
Cavnar and Trenkle (1994) collected 3713 Usenet texts with a cultural theme in different languages. They filtered out non-monolingual texts and texts that had no useful content for language classification. In the end, they had 3478 articles ranging from a single line of text to 50 KB of text.
Their results indicated that length had no significant impact on the classification, contrary to what they had expected. Also, they found that training the system with 400 n-grams yielded the best result with a precision of 99.8%.
They also showed that their approach could be used for subject classification of texts in the same language with reasonable precision. This finding indicates that language and domain are linked to a certain degree.

2.2 N-Grams and maximum likelihood estimator


Dunning (1994) also uses an n-gram language model for language identification purposes. The program calculates n-grams and their frequencies from the training data and estimates the probability P of a given string using the Maximum Likelihood Estimator (MLE) with Laplace add-one smoothing. More formally:

\[ P(w_i \mid w_1, \dots, w_{i-1}) = \frac{C(w_1, \dots, w_i) + 1}{C(w_1, \dots, w_{i-1}) + |V|} \tag{2} \]

with $C(w_1, \dots, w_i)$ the number of times the n-gram $w_1, \dots, w_i$ occurred, $C(w_1, \dots, w_{i-1})$ the number of times the (n − 1)-gram $w_1, \dots, w_{i-1}$ occurred and $|V|$ the size of the vocabulary.
For a string S, the string is decomposed into n-grams and the log probability $l_k$ is calculated as:

\[ l_k = \sum_{w_1, \dots, w_k \in S} C(w_1, \dots, w_k) \log P(w_k \mid w_1, \dots, w_{k-1}) \tag{3} \]

where k is the order of the n-gram (k = n) used.
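As an illustration of equations (2) and (3), the following Python sketch estimates add-one smoothed character n-gram probabilities and scores a string. It is a minimal sketch and not Dunning's implementation: the class name, the use of characters as units and the choice of the training alphabet as vocabulary are assumptions.

```python
import math
from collections import Counter


class LaplaceNgramModel:
    """Character n-gram model with add-one smoothing, following equation (2)."""

    def __init__(self, text, n=2):
        self.n = n
        positions = range(len(text) - n + 1)
        # C(w_1, ..., w_i): counts of the full n-grams.
        self.ngrams = Counter(text[i:i + n] for i in positions)
        # C(w_1, ..., w_{i-1}): counts of the (n-1)-gram contexts.
        self.contexts = Counter(text[i:i + n - 1] for i in positions)
        # |V|: here the set of characters seen in training (an assumption).
        self.vocab = set(text)

    def prob(self, ngram):
        """P(last symbol | preceding symbols) with add-one smoothing."""
        return (self.ngrams[ngram] + 1) / (self.contexts[ngram[:-1]] + len(self.vocab))

    def log_prob(self, string):
        """Log probability of `string`, following equation (3)."""
        counts = Counter(string[i:i + self.n] for i in range(len(string) - self.n + 1))
        return sum(c * math.log(self.prob(gram)) for gram, c in counts.items())
```

To identify the language of a string, one model per language would be trained and the language whose model yields the highest log probability would be chosen.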


In order to test the system, Dunning (1994) uses a specially constructed test cor-
pus from a bilingual parallel translated English-Spanish corpus containing English and
Spanish texts with 10 texts varying from 1000 to 50000 bytes for the training set and
100 texts varying from 10 to 500 bytes for the test set.

The results indicate that bigram models perform better for shorter strings and less training data while trigram models work better for longer strings and more training data.
Dunning (1994) criticizes Cavnar and Trenkle (1994) for claiming that their system is insensitive to the length of the string to be classified, as the shortest text they classified was about 50 words. The system implemented by Dunning (1994) can classify strings of 10 characters in length “moderately well”, while strings of 50 characters or more are classified “very well”. Accuracies given vary from 92% for 20 bytes of training data to 99.9% for 500 bytes of text.

2.3 Trigrams and short words


Grefenstee (1995) compares trigrams versus short words for language identification.
Short words are oen function words that are typical for and highly frequent in a given
language.
e trigram language guesser was trained on one million characters of text in 10
languages: Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese,
Spanish and Swedish. From the same texts, all words with 5 or less characters were
counted for the short-word-strategy.
e results indicate that the trigram approach works beer for small text fragments
of up to 15 words, while for any text longer than 15 words, both methods work equally
well with reported accuracies of up to 100% in the 11-15 word range.

2.4 N-Grams and clustering


Gao et al. (2001) present a system that augments n-gram language models with clustering techniques. They cluster words by similarity and use these clusters in order to overcome the data sparsity problem.
In traditional cluster-based n-gram models, the probability of a word $w_i$ is defined as the product of the probability of the word given its cluster $c_i$ and the probability of the cluster $c_i$ given the preceding clusters. For a trigram model, the probability of a word $w_i$ is calculated as

\[ P(w_i \mid w_{i-2} w_{i-1}) = P(w_i \mid c_i) \times P(c_i \mid c_{i-2} c_{i-1}) \tag{4} \]

The probability of a word given a cluster is calculated as

\[ P(w_i \mid c_i) = \frac{C(w_i)}{C(c_i)} \tag{5} \]

with $C(w_i)$ the count of the word $w_i$ and $C(c_i)$ the count of the cluster $c_i$.
The probability of a cluster given the preceding clusters is calculated using the Maximum Likelihood Estimator

\[ P(c_i \mid c_{i-2} c_{i-1}) = \frac{C(c_{i-2} c_{i-1} c_i)}{C(c_{i-2} c_{i-1})} \tag{6} \]

Gao et al. (2001) derive from this three ways of using clusters to augment language models: predictive clustering (7), conditional clustering (8) and combined clustering (9).

\[ P(w_i \mid w_{i-2} w_{i-1}) = P(c_i \mid w_{i-2} w_{i-1}) \times P(w_i \mid w_{i-2} w_{i-1} c_i) \tag{7} \]

\[ P(w_i \mid w_{i-2} w_{i-1}) = P(w_i \mid c_{i-2} c_{i-1}) \tag{8} \]

\[ P(w_i \mid w_{i-2} w_{i-1}) = P(c_i \mid c_{i-2} c_{i-1}) \times P(w_i \mid c_{i-2} c_{i-1} c_i) \tag{9} \]
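The traditional cluster-based trigram model of equations (4) to (6) can be sketched in a few lines of Python. The sketch assumes that a hard word-to-cluster mapping is already given (here the hypothetical dictionary `cluster_of`); how Gao et al. (2001) obtain their clusters is not reproduced, and the function name and the return value of 0.0 for unseen contexts are illustrative choices.

```python
from collections import Counter


def train_cluster_trigram(words, cluster_of):
    """Return a function computing P(w_i | w_{i-2} w_{i-1}) as in equation (4)."""
    clusters = [cluster_of[w] for w in words]
    word_counts = Counter(words)        # C(w_i)
    cluster_counts = Counter(clusters)  # C(c_i)
    # C(c_{i-2} c_{i-1} c_i): cluster trigram counts.
    trigrams = Counter(zip(clusters, clusters[1:], clusters[2:]))
    # C(c_{i-2} c_{i-1}): cluster bigrams that are followed by another cluster.
    contexts = Counter(zip(clusters, clusters[1:-1]))

    def prob(word, prev2, prev1):
        c, c2, c1 = cluster_of[word], cluster_of[prev2], cluster_of[prev1]
        if contexts[(c2, c1)] == 0:
            return 0.0  # unseen context; a real model would smooth here
        p_word = word_counts[word] / cluster_counts[c]             # equation (5)
        p_cluster = trigrams[(c2, c1, c)] / contexts[(c2, c1)]      # equation (6)
        return p_word * p_cluster                                  # equation (4)

    return prob
```

Because the cluster trigram counts are shared by all words of a cluster, a word can receive a non-zero probability in a context in which it was never observed, which is how clustering mitigates data sparsity.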
Similarly, Dreyfuss et al. (2007) use clustering to cluster words by their context in order to improve trigram language models. Going beyond Gao et al. (2001), they also use information about the subject-verb and verb-object relations of the sentence. They show that their model, using clustering, subject-verb information, verb-object information, and the Porter stemmer, outperforms a traditional trigram model.
Carter (1994) clusters training sentences (i.e. the corpus) into subcorpora of similar
sentences and calculates separate language model parameters for each subcorpus in
order to capture contextual information. In contrast to other works, Carter (1994)
clusters sentences instead of single words (compare Pereira et al. (1993) and Ney et al.
(1994)). Carter (1994) shows that the subdivision into smaller clusters increases the
accuracy of bigram language models, but not trigram models.

2.5 Inclusion detection


Beatrice Alex (cf. Alex (2005, 2006, 2007); Alex et al. (2007); Alex and Onysko (2010))
addresses the problem of English inclusions in mainly non-English texts. For the lan-
guage pair German-English, inclusions are detected using a German and an English
lexicon as first resource. If a word is found only in the English lexicon, it is tagged
as unambiguously English. If the word is found in neither lexicon, a web search is
conducted, restricting the search options to either German or English and counting
the number of results. If the German search yields more results, the word is tagged as
German, otherwise as English inclusion. If a word is found in both lexicons, a post-
processing module resolves the ambiguity.
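The lookup procedure described above can be summarized as a small decision function. The following Python sketch is only an illustration: the function and parameter names are hypothetical, the web search and the post-processing module are passed in as callables rather than implemented, and the branch for words found only in the German lexicon is an assumption, as that case is not spelled out above.

```python
def tag_token(token, german_lexicon, english_lexicon, hit_count, resolve_ambiguous):
    """Tag a token as 'DE' or 'EN' following the lexicon and web search cascade.

    hit_count(word, lang) is a hypothetical callable returning the number of
    search results when the search is restricted to `lang`.
    resolve_ambiguous(token) stands in for the post-processing module.
    """
    word = token.lower()
    in_german = word in german_lexicon
    in_english = word in english_lexicon
    if in_english and not in_german:
        return "EN"  # unambiguously English
    if in_german and not in_english:
        return "DE"  # assumption: not stated explicitly in the description
    if in_german and in_english:
        return resolve_ambiguous(token)
    # Found in neither lexicon: compare language-restricted search result counts.
    if hit_count(word, "de") > hit_count(word, "en"):
        return "DE"
    return "EN"
```

Passing the search and disambiguation steps as parameters keeps the sketch self-contained and makes the decision order explicit.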
Alex is mainly concerned with the improvement of parsing results by inclusion detection. For example, Alex et al. (2007) report an increase in F-score of 4.3 by using inclusion detection when parsing a German text with a parser trained on the TIGER corpus (Brants et al., 2002).

2.6 Clustering and speech


In the area of clustering and spoken language identification, Yin et al. (2007) present a hierarchical clusterer for spoken language. They cluster 10 languages[1] using prosodic features and Mel Frequency Cepstral Coefficients (MFCC). MFCC vectors are a way of representing acoustic signals (Logan et al., 2000). The signal is first divided into smaller ‘frames’, each frame is passed through the discrete Fourier transform and only the logarithm of the amplitude spectrum is retained (Logan et al., 2000). The spectrum is then projected onto the ‘Mel frequency scale’, a scale that maps actual pitch to perceived pitch, “as apparently the human auditory system does not perceive pitch in a linear manner” (Logan et al., 2000). Finally, a discrete cosine transform is applied to the spectrum to get the MFCC representations of the original signal (Logan et al., 2000).
Yin et al. (2007) show that their hierarchical clusterer outperforms traditional Acoustic Gaussian Mixture Model systems.
As spoken language will not be further investigated in this thesis, I will not dive deeper into the matter at this point.

2.7 Monolingual training data


Yamaguchi and Tanaka-Ishii (2012), King and Abney (2013) and Lui et al. (2014) use
monolingual training data in order to train a system capable of recognizing the lan-
guages in a multilingual text.
Yamaguchi and Tanaka-Ishii (2012) use a dynamic programming approach to segment a text by language. Their test data contains fragments of 40 to 160 characters, and the approach achieves F-scores of 0.94 on the relatively ‘closed’ data set of the Universal Declaration of Human Rights[2] and 0.84 on the more ‘open’ Wikipedia data set. However, the approach is computationally intensive, not to say prohibitive; while Yamaguchi and Tanaka-Ishii (2012) self-report a processing time of 1 second for an input of 1000 characters, Lui et al. (2014) found that with 44 languages, the approach by Yamaguchi and Tanaka-Ishii (2012) takes almost 24 hours to complete the computation on a 16-core workstation.
King and Abney (2013) use weakly supervised methods to label the languages of words. They consider the task as a sequence labeling task. They have limited themselves to bilingual documents with a single language boundary, and the task consists in discriminating between English and non-English text. They found that a Conditional Random Field model augmented with Generalized Expectation criteria worked best, yielding accuracies of 88% with as little as 10 words used for training.
Lui et al. (2014) consider the task as a multi-label classification task. They represent a document as an n-gram distribution of byte sequences in a bag-of-words manner. They report F-scores of 0.957 and 0.959. They note that similar languages will pose problems when trying to identify a language, and solve this problem by identifying a set of languages that most probably are correct instead of a single language.
One problem that these approaches all have is that they need to know the languages that will occur in the test data (King and Abney, 2013; Lui et al., 2014).

[1] The authors do not explicitly list the languages clustered, except for two-letter abbreviations which seem to correspond to ISO 639-1. The languages under investigation could have been Vietnamese, German, Farsi, French, Japanese, Spanish, Korean, English, Tamil, and ‘ma’, though it is impossible to tell.
[2] [Link]

2.8 Predictive suffix trees


Seldin et al. (2001) propose a system for automatic unsupervised language segmentation and protein sequence segmentation. Their system uses Variable Memory Markov (VMM) sources, an alternative to Hidden Markov Models (HMM) implemented as Predictive Suffix Trees (PST).
Whereas HMMs require substantial amounts of training data and a deep understanding of the problem in order to restrict the model architecture, VMMs are simpler and less expressive than HMMs, but have been shown to “solve many applications with notable success” (Begleiter et al., 2004). In contrast to n-gram models that estimate the probability of w as P(w|N) with N the context (typically the n previous words), VMMs can vary N as a function of the available context (Begleiter et al., 2004). Thus, they can capture both small and large order dependencies, depending on the training data (Begleiter et al., 2004).
There is no single VMM algorithm, but rather a family of related algorithms. One of these algorithms is called Predictive Suffix Tree (PST) (Ron et al., 1996). A PST is a tree over an alphabet Σ, with each node either having 0 (leaf nodes) or |Σ| children (non-terminal nodes) (Ron et al., 1996). Each node is labeled with the result of the walk from that node up to the root (Ron et al., 1996). Each edge is labeled by a symbol s ∈ Σ and the probability for the next symbol being s (Ron et al., 1996).
By modifying the Predictive Suffix Tree (PST) algorithm using the Minimum Description Length (MDL) principle, Seldin et al. (2001) end up with a non-parametric self-regulating algorithm. The MDL principle avoids overfitting of the model by favoring low complexity over goodness-of-fit (Grünwald, 2007).
They embed the algorithm in a deterministic annealing (DA) procedure to refine the results. Finally, they iterate the Blahut-Arimoto algorithm, an algorithm for computing the rate-distortion function, until the system converges.
For the language segmentation task, they use 150000 letters of text, 30000 from each of the following languages: English, German, French, Italian, transliterated Russian. They used continuous language fragments of approximately 100 letters, yielding a synthetic multilingual text that switches language approximately every two sentences. One important point that they note is that “too short segments do not enable reliable discrimination between different models”. Therefore, they disallow switching models after every word.
They report very good results on the language segmentation task (and on the protein segmentation task). After 2000-3000 iterations of the Blahut-Arimoto algorithm, the correct number of languages is identified and the segmentation is accurate up to a few letters.

3 Theory
3.1 Supervised language model
3.1.1 N-Gram models
Among supervised language models, n-gram models are very popular (Gao et al., 2001).
An n-gram is a slice from the original string (Cavnar and Trenkle, 1994). These slices
can be contiguous or not. Non-contiguous n-grams are also called skip-grams (Guthrie
et al., 2006). In skip-grams, an additional parameter k indicates the maximum distance
that is allowed between units. In this parlance, contiguous n-grams can be regarded
as 0-skip-n-grams (Guthrie et al., 2006).
The following example demonstrates the difference between (traditional) n-grams
and skip-grams. Given the following sentence:
This is a sample sentence.
We can construct, for example, the following word k-skip-n-grams:
(0-skip-)2-grams: This is, is a, a sample, sample sentence
2-skip-2-grams: This is, This a, This sample, is a, is sample, is sentence, a sample, a
sentence, sample sentence
(0-skip-)3-grams: This is a, is a sample, a sample sentence
2-skip-3-grams: This is a, This is sample, This is sentence, This a sample, This a
sentence, This sample sentence, is a sample, is a sentence, is sample sentence, a sample
sentence

The result for 2-skip-2-grams does not include the skip-gram “This sentence”, as
the distance in words between these two words is 3, higher than the allowed k of 2. As
can be seen from this example, the number of skip-grams is more than two times higher
than the number of contiguous n-grams, and this trend continues the more skips are
allowed (Guthrie et al., 2006). Skip-grams, unlike n-grams, do not incur the problem
of data sparseness with an increase of n.
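The enumeration above can be reproduced with a minimal Python sketch. It assumes, in line with the example, that k bounds the total number of words skipped between the first and the last unit of a skip-gram; the function name skip_grams is illustrative and not taken from the thesis implementation.

```python
from itertools import combinations


def skip_grams(tokens, n, k):
    """Return all k-skip-n-grams of a token list as tuples.

    Assumption: k limits the total number of tokens skipped between
    the first and the last selected token.
    """
    grams = []
    for positions in combinations(range(len(tokens)), n):
        # Tokens lying between the first and last position that were not selected.
        skipped = positions[-1] - positions[0] - (n - 1)
        if skipped <= k:
            grams.append(tuple(tokens[i] for i in positions))
    return grams


sentence = "This is a sample sentence".split()
contiguous_bigrams = skip_grams(sentence, 2, 0)
two_skip_bigrams = skip_grams(sentence, 2, 2)
two_skip_trigrams = skip_grams(sentence, 3, 2)
```

Setting k = 0 yields the traditional contiguous n-grams, so one function covers both cases of the example.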
Instead of using words as units for n-gram decomposition, we can also choose characters.
Each word is then decomposed into sequences of n characters. For example,
the word
model
can be decomposed into the 2-grams: mo, od, de, el. Often, the word to decompose
is padded with start and end tags in order to improve the model (Cavnar and Trenkle,
1994). If we pad the word with <w> and </w>, the 2-gram decomposition yields:
<w>m, mo, od, de, el, l</w>. The use of padding allows the model to capture details about
character distribution with regard to the start and end of words (Cavnar and Trenkle,
1994). For example, in English the letter ‘y’ occurs more often at the end of words than
at the beginning of words, while the letter ‘w’ occurs mainly at the beginning of words
(Taylor, 2015). A non-padding model cannot capture this distinction, while a padding
model can.
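As a small illustration of the character decomposition, the following Python sketch treats each padding tag as a single symbol. The function name and the default tags are assumptions chosen to mirror the example in the text.

```python
def char_ngrams(word, n=2, pad=False, start="<w>", end="</w>"):
    """Decompose a word into character n-grams, optionally padded.

    Each padding tag counts as one symbol, so a padded 2-gram may look
    like '<w>m' or 'l</w>'.
    """
    symbols = list(word)
    if pad:
        symbols = [start] + symbols + [end]
    return ["".join(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]


plain = char_ngrams("model")
padded = char_ngrams("model", pad=True)
```

With padding, the first and last 2-gram carry the information that ‘m’ starts the word and ‘l’ ends it.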
One advantage of n-gram models is that the decomposition of a string into smaller
units reduces the impact of typing errors (Cavnar and Trenkle, 1994). Indeed, a typ-
ing error only affects a limited number of units (Cavnar and Trenkle, 1994). Due to
this property, n-gram models have been shown to be able to deal well with noisy text
(Cavnar and Trenkle, 1994).

3.1.2 Formal definition


Traditional n-gram language models predict the next word $w_i$ given the previous words
$w_1, \ldots, w_{i-1}$. This prediction uses the conditional probability
$P(w_i \mid w_1, \ldots, w_{i-1})$. Instead of using the entire history $w_1, \ldots, w_{i-1}$,
the probability is approximated by using only the $n-1$ previous words
$w_{i-n+1}, \ldots, w_{i-1}$:

$$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) \tag{10}$$

The probability can be estimated using Maximum Likelihood Estimation (MLE):

$$P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \frac{C(w_{i-n+1}, \ldots, w_i)}{C(w_{i-n+1}, \ldots, w_{i-1})} \tag{11}$$

where $C(w_{i-n+1}, \ldots, w_i)$ represents the number of times the n-gram sequence
$w_{i-n+1}, \ldots, w_i$ occurred in the training corpus and $C(w_{i-n+1}, \ldots, w_{i-1})$
represents the number of times the (n − 1)-gram sequence $w_{i-n+1}, \ldots, w_{i-1}$ was seen
in the training corpus.

3.1.3 Smoothing
The problem with MLE is that sequences not seen during training will have a probability
of zero. In order to avoid this problem, different smoothing techniques can
be used (Chen and Goodman, 1996). The simplest smoothing technique is additive
(Laplace) smoothing (Chen and Goodman, 1996). Let V be the vocabulary size (i.e. the
total number of unique words in the training corpus). The smoothed probability
$P_{Laplace}$ becomes:

$$P_{Laplace}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \frac{C(w_{i-n+1}, \ldots, w_i) + \lambda}{C(w_{i-n+1}, \ldots, w_{i-1}) + \lambda V} \tag{12}$$

with λ the smoothing factor. If we choose λ = 1, we speak of “add one” smoothing
(Jurafsky and Martin, 2000). In practice, λ < 1 is often chosen (Manning and Schütze,
1999).
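Equations (11) and (12) can be illustrated with a short Python sketch. It is not the thesis implementation: the helper names are hypothetical, the vocabulary is assumed to be the set of word types in the training tokens, and counts are recomputed on every call for brevity. With λ = 0 the function reduces to the MLE of equation (11), which is undefined for unseen histories.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-grams (as tuples) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def additive_prob(word, history, tokens, lam=1.0):
    """Additively smoothed P(word | history); history must be non-empty.

    lam=0 gives the unsmoothed MLE, lam=1 gives "add one" smoothing.
    """
    history = tuple(history)
    n = len(history) + 1
    numerator = ngram_counts(tokens, n)[history + (word,)]
    denominator = ngram_counts(tokens, n - 1)[history]
    vocabulary_size = len(set(tokens))
    return (numerator + lam) / (denominator + lam * vocabulary_size)


corpus = "the cat sat on the mat".split()
p_mle = additive_prob("cat", ["the"], corpus, lam=0.0)
p_add_one = additive_prob("cat", ["the"], corpus, lam=1.0)
p_unseen = additive_prob("dog", ["the"], corpus, lam=1.0)
```

The unseen continuation “the dog” receives a small non-zero probability once λ > 0, at the cost of lowering the probability of the observed continuations.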

An important estimate is the Good-Turing estimate (Chen and Goodman, 1996).
While not directly a smoothing method, it estimates the frequency of a given observation
with

$$c^{*} = (c + 1) \frac{N_{c+1}}{N_c} \tag{13}$$

where c is the number of times the observation was made, $N_c$ is the number of times
the frequency c was observed and $N_{c+1}$ the frequency of the frequency c + 1. Thus,
instead of using the actual count c, the count is taken to be $c^{*}$ (Chen and Goodman,
1996).
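A minimal sketch of equation (13) in Python follows. The fallback to the raw count when $N_{c+1} = 0$ is an assumption made here to keep the example total; practical implementations usually smooth the $N_c$ values instead.

```python
from collections import Counter


def good_turing_counts(counts):
    """Map each observation to its adjusted count c* = (c + 1) * N_{c+1} / N_c.

    Assumption: when no observation has frequency c + 1, the raw count
    is kept unchanged.
    """
    freq_of_freq = Counter(counts.values())
    adjusted = {}
    for item, c in counts.items():
        if freq_of_freq[c + 1] > 0:
            adjusted[item] = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
        else:
            adjusted[item] = float(c)
    return adjusted


observed = {"a": 3, "b": 2, "c": 2, "d": 1, "e": 1, "f": 1}
adjusted = good_turing_counts(observed)
```

Here three items were seen once and two items twice, so an item seen once is assigned the adjusted count 2 · 2/3.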
Another way to avoid assigning probabilities of zero to unseen sequences is by
using back-off models. There are linear and non-linear back-off models. In non-linear
back-off models, if the original n-gram probability falls below a certain threshold value,
the probability is estimated by the next lowest n-gram model. Katz’s back-off model
(Katz, 1987) for instance calculates the probability $P_{bo}$ using the formula:

$$P_{bo}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \begin{cases} d_{w_{i-n+1}, \ldots, w_i} \dfrac{C(w_{i-n+1}, \ldots, w_i)}{C(w_{i-n+1}, \ldots, w_{i-1})} & \text{if } C(w_{i-n+1}, \ldots, w_i) > k \\ \alpha_{w_{i-n+1}, \ldots, w_{i-1}} P_{bo}(w_i \mid w_{i-n+2}, \ldots, w_{i-1}) & \text{otherwise} \end{cases} \tag{14}$$

with d and α as smoothing parameters. The parameter k is often chosen as k = 0.
This means that if the probability given a high-order n-gram model is zero, we back
off to the next lowest model. For tri-gram models, the formula becomes:

$$P_{bo}(w_i \mid w_{i-2}, w_{i-1}) = \begin{cases} P(w_i \mid w_{i-2}, w_{i-1}) & \text{if } C(w_{i-2}, w_{i-1}, w_i) > 0 \\ \alpha_1 P(w_i \mid w_{i-1}) & \text{if } C(w_{i-2}, w_{i-1}, w_i) = 0 \text{ and } C(w_{i-1}, w_i) > 0 \\ \alpha_2 P(w_i) & \text{otherwise} \end{cases} \tag{15}$$
In contrast, linear back-off models use an interpolated probability estimate by combining
multiple probability estimates and weighting each estimate. The probability $P_{LI}$
for a tri-gram model is:

$$P_{LI}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_3 P(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_1 P(w_i) \tag{16}$$

with $\sum_i \lambda_i = 1$.
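The interpolation in equation (16) can be sketched as follows. The weights (0.1, 0.3, 0.6) are purely illustrative and not values from the thesis, and the component probabilities are plain MLE estimates that default to 0 when the history was never seen.

```python
from collections import Counter


def interpolated_trigram(tokens, lambdas=(0.1, 0.3, 0.6)):
    """Return a function computing P_LI(w | u, v) by linear interpolation.

    lambdas = (lambda_1, lambda_2, lambda_3) weight the unigram, bigram and
    trigram estimates; the values used here are illustrative assumptions.
    """
    l1, l2, l3 = lambdas
    if abs(l1 + l2 + l3 - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    total = len(tokens)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    def prob(w, u, v):
        """u is the word two positions back, v the immediately preceding word."""
        return (l3 * ratio(tri[(u, v, w)], bi[(u, v)])
                + l2 * ratio(bi[(v, w)], uni[v])
                + l1 * ratio(uni[w], total))

    return prob


p_li = interpolated_trigram("the cat sat on the mat".split())
seen = p_li("sat", "the", "cat")
unseen = p_li("mat", "on", "cat")
```

Unlike the non-linear back-off of equation (15), every order contributes to every estimate, so the unseen trigram still receives probability mass from the unigram term.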

3.2 Unsupervised clustering


Clustering consists in the grouping of objects based on their mutual similarity (Bie-
mann, 2006). Objects to be clustered are typically represented as feature vectors (Bie-
mann, 2006); from the original objects, a feature representation is calculated and used
for further processing.

Clustering can be partitional or hierarchical (Yin et al., 2007). Partitional clustering
divides the initial objects into separate groups in one step, whereas hierarchical clus-
tering builds a hierarchy of objects by first grouping the most similar objects together
and then clustering the next level hierarchy with regard to the existing clusters (Yin
et al., 2007).
The clustering algorithm uses a distance metric to measure the distance between the
feature vectors of objects (Biemann, 2006). The distance metric defines the similarity
of objects based on the feature space in which the objects are represented (Jain et al.,
1999). There are different metrics available. A frequently chosen metric is the cosine
similarity, which measures the cosine of the angle between two vectors
(Biemann, 2006).
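For illustration, cosine similarity and the distance derived from it can be written in a few lines of Python. Returning 0 for a zero vector is a convention chosen here, since the angle is undefined in that case.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for zero vectors (assumption)
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Distance derived from the similarity: 0 for vectors pointing the same way."""
    return 1.0 - cosine_similarity(u, v)
```

Two vectors that differ only in length, such as (1, 2) and (2, 4), have distance 0, which is why the measure is insensitive to the magnitude of the feature values.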
In order for a clustering algorithm to work, features that represent the object to be
clustered have to be defined (Jain et al., 1999). Features can be quantitative (e.g. word
length) or qualitative (e.g. word starts with a capital letter) (Jain et al., 1999).
Most clustering algorithms, e.g. k-means, need to be given the number of clusters to
generate (Jain et al., 1999). The question how to best choose this key number has been
addressed in-depth by Dubes (1987).
Clustering can be soft or hard. When hard-clustering, an object can belong to one
class only, while in soft-clustering, an object can belong to one or more classes,
sometimes with different probabilities (Jain et al., 1999).

3.3 Weakly supervised language model induction


The main idea behind language model induction is that by inducing language models
from the text itself, the models are highly specialized but the approach is generally
more flexible since genre or text specific issues do not arise.
This approach is similar in character to the work by Seldin et al. (2001) in that the
text itself is used as data set. However, the realization differs greatly. Whereas Seldin
et al. (2001) use predictive suffix trees, I use n-gram language models.
The intuition is to learn the language models from the text itself, in an iterative
manner. Suppose we have a document as follows where $w_i$ represents the word at
position i in the text. Suppose the text contains two languages, marked in red and
blue.

[Figure 2: Simple text illustration (word sequence w1 … w10)]

We take the first word and create a language model m1 from that word.

[Figure 3: Initial model creation]

We then evaluate the second word using the first language model. If the language
model score is high enough, we update the language model with the second word.

[Figure 4: Initial model evaluation]

[Figure 5: Model update]

If the score is below a certain threshold, the existing language model does not model
the word well enough and a new model is created.

[Figure 6: Evaluation]

[Figure 7: New model creation]

When there is more than one language model, each word is evaluated by every
language model, and the highest scoring model is updated, or a new model is created
if no language model models the word well enough.

[Figure 8: Multiple model evaluation]

[Figure 9: Updating relevant model]

[Figure 10: Multiple model evaluation 2]

[Figure 11: New model creation 2]

The last example shows that it is not necessarily the case that exactly one language
model is created per language; it often is the case that many language models are
created for one language.
At the beginning, the models are not very reliable, as they only have a few words
as basis, but the more text is analyzed, the more reliable the models become.
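The forward generation described so far can be sketched in Python. This is a heavily simplified illustration and not the thesis implementation: ToyModel is a hypothetical stand-in that scores a word by the fraction of its padded character bigrams already seen, whereas the thesis uses n-gram language models, and the threshold value is arbitrary.

```python
def padded_bigrams(word):
    """Set of character bigrams of a word padded with '^' and '$'."""
    padded = "^" + word + "$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


class ToyModel:
    """Hypothetical stand-in for an induced language model."""

    def __init__(self):
        self.seen = set()

    def score(self, word):
        grams = padded_bigrams(word)
        return len(grams & self.seen) / len(grams)

    def update(self, word):
        self.seen |= padded_bigrams(word)


def forward_generation(words, threshold=0.3):
    """One forward pass: update the best-scoring model or create a new one."""
    models, assignment = [], []
    for word in words:
        best = max(models, key=lambda m: m.score(word), default=None)
        if best is None or best.score(word) < threshold:
            best = ToyModel()  # no existing model fits well enough
            models.append(best)
        best.update(word)
        assignment.append(models.index(best))
    return models, assignment


models, assignment = forward_generation(["aaa", "aaaa", "zzz", "zzzz", "aaa"])
```

Under the same assumptions, a backwards generation is simply forward_generation(words[::-1]).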
However, the approach is problematic in that the text structure itself influences
the language models created. If the text starts with a foreign language inclusion, as
illustrated in figure 12, the initial model might be too frail to recognize the following
words as being a different language, updating the first model with the second and third
word and so on. Thus, the approach would fail at recognizing the foreign language
inclusion.

[Figure 12: Problematic text sample]

If we were to start from the end of the text and work towards the beginning, the
probability of having a relatively robust language model for the ‘blue’ language would
be high, and so, it would theoretically be easier to recognize the first word as not being
‘blue’.
Therefore, one induction step involves one forward generation and one backwards
generation. This yields two sets, the set of models from the forward generation
$F = \{f_1, f_2, \ldots, f_n\}$ and the set from the backwards generation
$B = \{b_1, b_2, \ldots, b_m\}$.
Then, from the two sets of models, the most similar models are selected. For this,
every model from F is compared to every model from B, as figure 13 shows. The most
similar models are then merged, as illustrated in figure 14. Indeed, if both the forward
and backwards generation yielded a similar language model, it is probable that the
model is correct.
Since even forward and backwards generation together cannot guarantee ideal results,
there is the option to run the generation from a random position. This random induction
picks a random position in the text and runs one induction step from that position,
meaning one forward and one backwards generation. Finally, the most similar models
are merged as for the general generation.

[Figure 13: Finding the most similar models (forward models f1 … f4 compared with backwards models b1 … b6)]

[Figure 14: Merging most similar models]

This only yields one probable language model; therefore, the induction is repeated,
with the difference that all probable models are taken into consideration as well. For
each word, if a probable model models the word well enough, no new model is created,
otherwise a new model is created.
At the end of the induction loop, the set of probable models P is examined. As long
as there are two models that have a similarity score below a certain threshold, the two
most similar models are merged.
Finally, after the language models have been induced, another pass is made over
the text and each word is assigned to the language model which yields the highest
score for that word, resulting in a word-to-model assignment as illustrated in figure
15.

[Figure 15: Word-Model assignment]

I have made the approach parametric, with the following parameters:

• Induction iterations: Number of induction iterations
• Random iterations: Number of random iterations
• Forward/Backwards threshold: Threshold for forward/backwards merging
• Silver threshold: Threshold for P model merging

These parameters can be adapted, in the hope that some parameter configurations
will work better on certain data sets than other configurations. Since the approach
has parameters that have to be learned from a development set, the approach is said to
be weakly supervised; the development set is not used to train any language specifics,
only for the estimation of the parameters of the approach.

4 Experimental setup
In this chapter I present experiments done using the approaches delineated in the
previous section in order to find out whether there are approaches that work better on
certain types of text.
The central hypothesis is that unsupervised language segmentation approaches are
more successful on difficult data. Difficult data is text for which there is not enough
data to train a language model, or text which contains a lot of non-standard language
such as abbreviations.
First, I present the data used to test the language segmentation systems and elabo-
rate on the different aspects that had to be considered for the data compilation.
I then present two supervised language segmentation experiments using n-gram
language models and Textcat.
For unsupervised language segmentation, I will first present experiments using
clustering algorithms before presenting experiments using language model induction.

4.1 Data
In order to test the different language segmentation approaches, I compiled different
sets of test data. As I want to focus on short texts, most texts from the test corpus are
rather small, sometimes consisting of only one sentence. However, in order to test the
general applicability of the approach, the test corpus also contains larger text samples.
The test corpus can be subdivided into different sub-corpora:
• Latin-based: Texts consisting of languages using Latin-based scripts, such as
German, English, Finnish or Italian

• Mixed script: Texts consisting of languages using Latin-based scripts and
languages using non-Latin-based scripts

• Twitter data: Short texts taken from Twitter

• Pali dictionary data: Unstructured texts containing many different language
inclusions such as Vedic Sanskrit, Sanskrit, Indogermanic reconstructions, Old
Bulgarian, Lithuanian, Greek, Latin, Old Irish, many abbreviations and references
to text passages
As every outcome has to be manually checked, the test corpus is rather small. Every
category consists of five texts. Each text consists of two or three languages with the
exception of the Pali dictionary data, which often contains inclusions from many different
languages in the etymological explanations.
For each text, I also created a gold standard version with the expected clusters. In
some cases it is not clear how to cluster certain objects. In that case, I use a clustering
that makes sense to me, but this need not mean that it is the correct or only possible
clustering.
For the parameter estimation of the language model induction approach, I also
compiled a set of development data. All texts can be found in the appendix under 8.1
and 8.2.

4.2 Supervised language model


4.2.1 Implementation
For the supervised language segmentation method, I implemented an n-gram language
model as described by Dunning (1994). The n-gram language model is implemented
as a character trigram model with non-linear back-off to bigram and unigram models.
The conditional probability P is calculated using the formula:

$$P(w_i \mid w_{i-2}, w_{i-1}) = \begin{cases} \alpha_1 \dfrac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})} & \text{if } C(w_{i-2}, w_{i-1}, w_i) > 0 \\ \alpha_2 \dfrac{C(w_{i-1}, w_i)}{C(w_{i-1})} & \text{if } C(w_{i-1}, w_i) > 0 \\ \alpha_3 \dfrac{C(w_i)}{V} & \text{if } C(w_i) > 0 \\ \alpha_4 \dfrac{1}{V + W + X} & \text{otherwise} \end{cases} \tag{17}$$

with $\alpha_1 = 0.7$, $\alpha_2 = 0.2$, $\alpha_3 = 0.09$, $\alpha_4 = 0.01$, V the number of
unigrams, W the number of bigrams and X the number of trigrams. Since the model is a
character model, $w_i$ here denotes the character at position i.
Each word is padded by two different start symbols and two different end symbols.
The joint probability for a word w of length n is calculated as

$$P(w) = \frac{1}{\sum_{i=2}^{n} \left| \log P(w_i \mid w_{i-2}, w_{i-1}) \right|} \tag{18}$$

In the denominator, I use the log probability instead of the probability to increase
numerical stability. Indeed, multiplying very small numbers can lead to the result
being approximated as zero by the computer when the numbers become too small to
be represented as a normalized number (Goldberg, 1991). Using the sum of logarithms
avoids this problem and is less computationally expensive (Bürgisser et al., 1997).
As the logarithm of a number approaching zero tends to negative infinity, its absolute
value grows without bound, so rare observations get a higher value than frequent
observations. As such, the denominator can be seen as a scale of rarity, with a higher
value corresponding to a rarer word. By taking the inverse of this scale, we get a score
corresponding to the “commonness” (≈ frequency) of a word.
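Equations (17) and (18) can be sketched in Python as follows. Several points are assumptions made for the sake of a runnable example and are not confirmed by the text: the padding symbol names, the reading of V, W and X as total (token) counts rather than numbers of distinct n-grams, and the choice to sum over every trigram position of the padded word.

```python
import math
from collections import Counter

ALPHAS = (0.7, 0.2, 0.09, 0.01)   # alpha_1 .. alpha_4 from equation (17)
PAD_START = ("<s1>", "<s2>")      # hypothetical names for the two start symbols
PAD_END = ("<e1>", "<e2>")        # hypothetical names for the two end symbols


def pad(word):
    return list(PAD_START) + list(word) + list(PAD_END)


class TrigramModel:
    """Character trigram model with back-off; must be trained before scoring."""

    def __init__(self):
        self.uni, self.bi, self.tri = Counter(), Counter(), Counter()

    def update(self, word):
        s = pad(word)
        self.uni.update(s)
        self.bi.update(zip(s, s[1:]))
        self.tri.update(zip(s, s[1:], s[2:]))

    def prob(self, a, b, c):
        """Back-off estimate of P(c | a, b) following equation (17)."""
        a1, a2, a3, a4 = ALPHAS
        # Assumption: V, W, X are total unigram, bigram and trigram counts.
        v, w, x = (sum(self.uni.values()), sum(self.bi.values()),
                   sum(self.tri.values()))
        if self.tri[(a, b, c)] > 0:
            return a1 * self.tri[(a, b, c)] / self.bi[(a, b)]
        if self.bi[(b, c)] > 0:
            return a2 * self.bi[(b, c)] / self.uni[b]
        if self.uni[c] > 0:
            return a3 * self.uni[c] / v
        return a4 / (v + w + x)

    def score(self, word):
        """Inverse of the summed absolute log probabilities, equation (18)."""
        s = pad(word)
        rarity = sum(abs(math.log(self.prob(a, b, c)))
                     for a, b, c in zip(s, s[1:], s[2:]))
        return 1.0 / rarity


model = TrigramModel()
for training_word in ["the", "then", "there"]:
    model.update(training_word)
```

A word built from familiar trigrams, such as “the”, receives a higher score than a word like “xqz” whose trigrams all fall through to the last back-off case.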

4.2.2 Training phase


First, models are trained on training data in the relevant languages. I have not included
the languages from the Pali dictionary data, as there are too many different languages
and there are typically only small inclusions of different languages in a dictionary
entry; as such, it would not have made sense to train a language model just to recognize
a single word. Another reason for not using the Pali dictionary data languages is that
sometimes it is not possible to find data for a language, e.g. Old Bulgarian or
reconstructed Indogermanic. In some cases, it would have been conceivable to train models
on similar languages, but again, the effort of training a model is disproportionately
high compared to the (uncertain) result of recognizing a single inclusion. Instead, an
additional catch-all language model is used to capture words that do not seem to belong
to a trained model.
The training data consists of Wikipedia dumps from the months June and July 2015;
a dump is a copy of the whole encyclopedia for a given language. Due to the difference
in size of the Wikipedia of the different languages, I chose the full dump for languages
with less than 3 GB of compressed data and limited the amount of data to at most
3 GB of compressed data for the other languages.
The Wikipedia data was processed using the Wikipedia Extractor version 2.8 in
order to extract the textual content from the article pages. Indeed, the Wikipedia pages
are written using the MediaWiki Markup Language. While this markup is useful for
meta-data annotation and cross-referencing, the encoded information is superfluous
for language model training and has to be removed before training a model on the
data. Table 1 shows the size of the training data per language after text extraction.

Language     Size in MB
Amharic               9
Arabic              747
Chinese            1005
English            2097
Finnish             570
French             2097
German             2097
Greek               464
Italian            2097
Polish             2097
Russian            2097
Spanish            2097
Turkish             386
Ukrainian          1456

Table 1: Training data: Size

As the test data only contains transliterated Amharic text, the Wikipedia data, written
in the Ge’ez script, had to be transliterated. The text was transliterated according
to the EAE transliteration scheme by the Encyclopaedia Aethiopica.
As the test data contains transliterated Greek, the Greek data was used once as-is
and once transliterated according to the ELOT (Hellenic Organization for Standardiza-
tion) transliteration scheme for Modern monotonic Greek.
It should be borne in mind that the training data influences the quality and accuracy
of the model. Furthermore, a model might work well on certain text types and less well
on other text types. It is not possible to train a perfect, universal model.

4.2.3 Application of the approach


In the second step, an input text is segmented into words. Then, each word is evaluated
by each language model and the model with the highest score is assigned as the word’s
language model.
The approach taken consists in classifying words as either belonging to a trained
language model or to the additional, catch-all model other, which simply means that
the word could not be assigned to a trained model class.

4.2.4 Textcat and language segmentation


I also tested how well Textcat is suited to the task of language segmentation. The
approach is similar to the n-gram approach, with the exception that I do not train any
models and rely on Textcat’s classifier for language prediction.
In the first step, an input text is segmented into words. Then, each word is passed
to Textcat and the guess made by Textcat is taken as the word’s language.

4.3 Unsupervised clustering


In order to test the efficiency of clustering algorithms on the task of language
segmentation, I looked at various algorithms readily available through WEKA, “a collection of
machine learning algorithms for data mining tasks” by the University of Waikato in
New Zealand (Hall et al., 2009) and the Environment for Developing KDD-Applications
Supported by Index-Structures (ELKI), “an open source data mining software […] with
an emphasis on unsupervised methods in cluster analysis and outlier detection” by
the Ludwig-Maximilians-Universität München (Achtert et al., 2013). I also looked at
JavaML, “a collection of machine learning and data mining algorithms” (Abeel et al.,
2009), in order to integrate clusterers into my own code framework. JavaML offers
different clustering algorithms and also offers access to WEKA’s clustering algorithms. In
contrast to WEKA and ELKI, which can be used in stand-alone mode, JavaML is meant
to be integrated into bigger programs and provides an application programming
interface (API) that allows the provided algorithms to be accessed in a programmatic way,
i.e. from inside a program.

4.3.1 Preprocessing
However, in order for the clustering algorithms to work, the document to segment has
to be preprocessed in a number of ways, as shown in figure 16.

[Figure 16: Clustering preprocessor (flowchart: read input; tokenize input; while a token is left: normalize, remove tags, skip the token if it is empty, otherwise extract features)]

First of all, the document has to be read in by the program. This step is
straightforward.
The document then has to be tokenized. Tokenization is not trivial and depends on
the definition of a ‘word’. For this task I have used a whitespace tokenizer that defines
a word as a continuous sequence of character literals separated by one or more
whitespace characters. While it can be objected that for scripts that don’t use whitespace to
separate words, such as Chinese, tokenization fails, this is not too big a concern.
Indeed, if a continuous block of Chinese characters is treated as one word, it is likely to
be clustered separately due to the difference in “word” length and the different character
set. If, however, a document contains two scripts that do not separate words by
whitespace, the approach totally fails. It is beyond the scope of this thesis, and possibly
of any thesis, to implement a universal tokenizer that works regardless of language
without prior knowledge about the languages at hand.
Each token is then normalized. Normalization of a non-Latin-based input (e.g. Arabic
or Cyrillic script) returns the input without modification. Otherwise, the following
modifications are made, if applicable:

• remove leading and trailing whitespace

• remove punctuation

• remove control characters

Control characters are defined as the set
( [ ] ) \
Punctuation is defined as the set
. , ” ’ : ; ! ? −
The token is then stripped of XML-like tags, if applicable. The following example
illustrates this step. Let us assume we have the following token:
<word id="1" lemma="go">goes</word>
The token is replaced by the text content of the node, thus the resulting token is ‘goes’.
If, after all these modifications, the token corresponds to the empty string, we continue
with the next token. Otherwise, the token is passed on to the feature extraction
module. The algorithm terminates when all tokens have been consumed.

4.3.2 Defining features


The final step consists in defining features by which to cluster and implementing
feature extractors that build the feature vectors from the input. Since the features are to
be language independent, features such as ‘occurs in an English lexicon’ cannot
be used. The following features were devised:

1. word length: the length of the word in characters

2. X tail bigrams: bigrams calculated from the end of the word

3. Y tail trigrams: trigrams calculated from the end of the word

4. X first bigrams: bigrams calculated from the beginning of the word

5. Y first trigrams: trigrams calculated from the beginning of the word

6. latin basic: is the word latin basic?

7. latin extended: is the word latin extended?

8. capitalized: is the word capitalized?

9. contains non-word: does the word contain a non-word?

10. is non-word: is the word a non-word?

11. number of latin letters: number of latin letters

12. number of non-latin letters: number of non-latin letters

13. vowel ratio: number of vowels divided by the word length

14. basic latin letter ratio: number of latin letters divided by the word length

15. max consonant cluster: the longest consonant cluster size in characters

16. is digit: is the word a digit?

17. is ideographic: is the word ideographic?

18. directionality: what directionality does the first character of the word have?

19. is BMP codepoint: does the word contain non-BMP characters?

20. general type: what is the general type of the first character of the word?

The last two features are based on the Java Character class. This class provides
methods to check for specific implementation-based properties of characters.
While most features are rather self-explanatory, a few require further explanation.
For the n-grams, the number of n-grams is restricted so as to keep the resulting vectors
the same size. This is important because the clustering algorithm considers one
data column as one feature, and having vectors of different length would disrupt this
precondition. Implementing the comparison of vectors of different lengths, or rather
of vectors containing vectors as features, would have been possible, but rather
time-consuming. If a word is too short to generate the required number of n-grams, only
the possible n-grams are generated and all other positions filled with 0.
The ‘latin’ features check whether the word consists only of the basic latin letters
A-Z and a-z (‘basic’) while the ‘extended’ feature also covers letters derived from the
latin letters (e.g. ë, ç, ṃ, ñ).
Non-words are defined as anything not consisting of letters, such as punctuation
marks or digits.
Directionality indicates which direction a character should be written. While the
actual list is much more exhaustive, this property basically indicates whether the
character is written from left to right or from right to left [5].
BMP stands for Basic Multilingual Plane and refers to an encoding unit known as
plane, which consists of 2^16 = 65536 codepoints (i.e. encoding slots for characters)
(The Unicode Consortium, 2014). The BMP is the first plane, covering the codepoints
U+0000 to U+FFFF (The Unicode Consortium, 2014). While it is not important to
understand the technical details fully, it is interesting to note that most characters are
covered by the BMP, including Chinese, Japanese and Korean characters (The Unicode
Consortium, 2014). The next plane, called Supplementary Multilingual Plane or Plane
1, contains historic scripts such as Egyptian hieroglyphs and cuneiform scripts, but also
musical notation, game symbols and various other scripts and symbols (The Unicode
Consortium, 2014). There are 17 planes in total (The Unicode Consortium, 2014).
The last feature in the list, General Type, is also an implementation-related property.
Type can be, for example [5], END_PUNCTUATION, LETTER_NUMBER or
MATH_SYMBOL. These constants are represented as numbers internally, which are
taken as feature for the clustering algorithm.
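A handful of the listed features can be sketched in Python. The vowel set is an assumption, as the text does not list it, and unicodedata.bidirectional and unicodedata.category are used here as Python analogues of the Java methods Character.getDirectionality and Character.getType; they return string codes rather than Java's integer constants.

```python
import re
import string
import unicodedata

VOWELS = set("aeiouAEIOU")  # assumption: the vowel set is not given in the text
# Maximal runs of letters that are not vowels (digits and '_' are excluded).
CONSONANT_RUN = re.compile(r"[^\W\d_aeiouAEIOU]+")


def extract_features(word):
    """Compute a subset of the listed features for one non-empty token."""
    first = word[0]
    return {
        "word_length": len(word),
        "latin_basic": int(all(ch in string.ascii_letters for ch in word)),
        "capitalized": int(first.isupper()),
        "is_digit": int(word.isdigit()),
        "vowel_ratio": sum(ch in VOWELS for ch in word) / len(word),
        "max_consonant_cluster": max(
            (len(run) for run in CONSONANT_RUN.findall(word)), default=0),
        # Python analogues of the Java Character properties:
        "directionality": unicodedata.bidirectional(first),
        "general_type": unicodedata.category(first),
    }


example = extract_features("Strength")
```

For “Strength”, the consonant runs are “Str” and “ngth”, so the longest cluster has size 4 and the vowel ratio is 1/8.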

4.3.3 Mapping features to a common scale


As JavaML requires numerical features, all features were mapped to numerical scales:
• Binary features were mapped to 0 (false) and 1 (true)

• Ternary features were mapped to 0 (false), 1 (true) and 99 (not applicable)

• Numerical features were represented as themselves, either as whole numbers
(e.g. word length) or as floating point numbers (e.g. vowel ratio)

• Java specific features (18, 20) take the underlying numerical value as feature

• N-grams were encoded numerically using algorithm 1

[5] The full list can be found in the documentation of the Java Character class:
http://[Link]/javase/7/docs/api/java/lang/[Link]
Algorithm 1 N-gram numerical encoding
1: function (word)
2: sum ← 0
3: for character in word do
4: value ←code-point of character
5: sum ← sum + value
6: end for
7: return sum
8: end function

While algorithm 1 does not encode n-grams in an unambiguous way (“en” and “ne”
are both encoded as 211), it provides a sufficiently good encoding.
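Algorithm 1, together with the zero-filling of missing n-gram positions described in the feature definitions, might look as follows in Python. The function names and the ordering of tail n-grams (starting from the end of the word) are assumptions for illustration.

```python
def encode_ngram(ngram):
    """Algorithm 1: sum of the code points of all characters."""
    return sum(ord(ch) for ch in ngram)


def ngram_features(word, n, count, from_end=False):
    """Encode the first (or last) `count` n-grams of a word, padded with 0.

    Assumption: tail n-grams are listed starting from the end of the word.
    """
    grams = [word[i:i + n] for i in range(len(word) - n + 1)]
    if from_end:
        grams = grams[::-1]
    codes = [encode_ngram(g) for g in grams[:count]]
    return codes + [0] * (count - len(codes))
```

The padding keeps every feature vector at the same length regardless of word length, which the clustering algorithm requires.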

4.3.4 The problem of unambiguous encoding


I have tried using unambiguous encodings. The main problem with unambiguous
encoding is that the notion of “distance” is distorted. The idea behind the unambiguous
encoding is that each “word” (i.e. string of characters) is encoded numerically so that
no two “words” are represented as the same number. Besides the encoding of each
separate character, the position of the character inside the string also has to be encoded.
A possible encoding e for a string $w_1 w_2 w_3$ could be

$$e_{w_1 w_2 w_3} = n(w_1) + x \cdot n(w_2) + y \cdot n(w_3) \tag{19}$$

with $w_i$ the character of the string at position i, $n(w_i)$ the numerical encoding of
the character $w_i$ and x and y parameters. If |A| is the alphabet size of the alphabet A
in which the word is encoded, the following constraints must be true for the encoding
to be unambiguous:

$$x \geq |A| \tag{20}$$
$$y \geq |A|^2 \tag{21}$$

If we take for example the English alphabet with 26 lowercase and 26 uppercase
letters, not counting punctuation, digits and other characters, it has to be true that
x ≥ 52 and y ≥ 2704. The problem is that we cannot know in advance what size
the alphabet will be. If we have English and German texts, the size can be estimated
around 60. However, if we have English, Russian and Arabic text, the size drastically
increases. We could choose any two very big numbers, but if we want to guarantee
our encoding to be unambiguous, we run the risk of ending up with numbers too big
to be represented efficiently.

In this encoding scheme, distance is skewed: changes to the first character result
in linear distance. ‘man’ and ‘nan’ have a distance of 1, because ‘m’ and ‘n’ have a
distance of 1. ‘lan’ and ‘nan’ have a distance of 2, etc. Changes to the second character
are multiplied by x. ‘man’ and ‘men’ have a distance of x ∗ (distance(a, e)) = 4 ∗ x.
Changes to the third character are scaled by y. For any sufficiently big x and y, the
distances are too skewed to be used for automatic cluster analysis. Let us consider the
following example with only two characters for simplicity. For this example, let us
assume x = 1373.

        na      ma      ne      me
na       0       1    5492    5491
ma       1       0    5493    5492
ne    5492    5493       0       1
me    5491    5492       1       0

Table 2: Unambiguous encoding: distances

It should be apparent from table 2 that the notion of “distance” is distorted. In
comparison, table 3 shows the encoding achieved with algorithm 1.

        na      ma      ne      me
na       0       1       4       3
ma       1       0       5       4
ne       4       5       0       1
me       3       4       1       0

Table 3: Simplified encoding: distances

While this encoding is not unambiguous, it is considered sufficiently good for our
purposes.

4.3.5 The clusterer
Most clustering algorithms such as k-means need to be passed the number of clusters to
generate. As we want to work as flexibly as possible, I ignored all algorithms that need
the number of clusters before clustering. In contrast, the x-means algorithm (Pelleg
and Moore, 2000) estimates the number of clusters to generate itself. This algorithm
has been chosen to perform the language clustering tasks.
While WEKA and ELKI offer a graphical user interface and various graphical rep-
resentations of the results, the output is not easily interpretable. Indeed, we can get a
visualization of a clustering operation as shown in figures 17 (WEKA) and 18 (ELKI).
However, all data points have to be manually checked by either clicking each point
in order to get additional information about that data point (WEKA) or by hovering
over the data points after having selected the Object Label Tooltip option (ELKI).
Figure 18 shows the information for the lowest orange rectangle data point in the ELKI
visualization.

Figure 17: WEKA: Cluster visualization

Figure 18: ELKI: Cluster visualization

Therefore, I have decided to embed the x-means clustering algorithm into a custom
framework. Originally part of the WEKA algorithms, the x-means algorithm has been
integrated into a Java program via the JavaML library. The framework takes an input
file, constructs the aforementioned feature vectors from the input, performs
normalization, passes the calculated feature vectors to the clustering algorithm and
displays the results in a text-based, easily interpretable manner.
Preliminary analyses have shown that the first clustering result often is not
discriminating enough. Hence, I perform a first clustering analysis, followed by a second
clustering analysis on the clusters obtained from the first analysis.

4.3.6 Evaluating clusterings


The clustering results are evaluated using four common similarity measures used in
evaluating the accuracy of clustering algorithms. These methods are based on counting
pairs (Wagner and Wagner, 2007).
Let us consider the clustering $C = \{C_1, \ldots, C_k\}$. $C$ is a set of non-empty disjoint
clusters $C_1, \ldots, C_k$. Let us consider the reference clustering $C' = \{C'_1, \ldots, C'_l\}$. We
define the following sets.

• S11 : set of pairs that are in the same cluster in C and C ′

• S00 : set of pairs that are in different clusters in C and C ′

• S10 : set of pairs that are in the same cluster in C and in different clusters in C ′

• S01 : set of pairs that are in different clusters in C and in the same cluster in C ′

Let nij = |Sij |, with i, j ∈ {0, 1} be the size of a given set Sij .
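The four pair counts can be obtained by enumerating all unordered pairs of items. The following is a minimal sketch, not the implementation used for the experiments; it assumes both clusterings are given as label lists of equal length, and the function name `pair_counts` is hypothetical.

```python
from itertools import combinations


def pair_counts(labels, reference):
    """Return (n11, n00, n10, n01) for a clustering and a reference clustering.

    Both arguments are lists in which position i holds the cluster label
    of item i (an assumed representation).
    """
    if len(labels) != len(reference):
        raise ValueError("both clusterings must label the same items")
    n11 = n00 = n10 = n01 = 0
    for i, j in combinations(range(len(labels)), 2):
        same_c = labels[i] == labels[j]
        same_ref = reference[i] == reference[j]
        if same_c and same_ref:
            n11 += 1  # together in C and in C'
        elif not same_c and not same_ref:
            n00 += 1  # separated in C and in C'
        elif same_c:
            n10 += 1  # together in C, separated in C'
        else:
            n01 += 1  # separated in C, together in C'
    return n11, n00, n10, n01


counts = pair_counts([0, 0, 0, 1], ["de", "de", "en", "en"])
print(counts)  # (1, 2, 2, 1)
```

The enumeration is quadratic in the number of words, which is unproblematic for short texts.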
The Rand Index is defined as

$$RI = \frac{n_{11} + n_{00}}{n_{11} + n_{10} + n_{01} + n_{00}} \tag{22}$$

The Rand Index measures the accuracy of the clustering given a reference partition
(Wagner and Wagner, 2007). However, it is criticized for being highly dependent on
the number of clusters (Wagner and Wagner, 2007).
The Jaccard Index measures the similarity of sets. It is similar to the Rand Index,
but it disregards S00, the set of pairs that are clustered into different clusters in C and
C′ (Wagner and Wagner, 2007). It is calculated as

$$J = \frac{n_{11}}{n_{11} + n_{10} + n_{01}} \tag{23}$$
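The Fowlkes-Mallows Index combines precision and recall; it is their geometric mean,
with P and R as defined in equations (26) and (27). It is calculated as

$$FM = \frac{n_{11}}{\sqrt{(n_{11} + n_{10})(n_{11} + n_{01})}} \tag{24}$$

The Fowlkes-Mallows Index has the undesired property of yielding high values
when the number of clusters is small (Wagner and Wagner, 2007).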
Finally, I will indicate the F-Score. According to Manning et al. (2008), in the context
of clustering evaluation the $F(\beta)$ score is defined as

$$F(\beta) = \frac{(\beta^2 + 1) \cdot P \cdot R}{\beta^2 \cdot P + R} \tag{25}$$

with precision $P$ and recall $R$ defined as

$$P = \frac{n_{11}}{n_{11} + n_{10}} \tag{26}$$

$$R = \frac{n_{11}}{n_{11} + n_{01}} \tag{27}$$
By varying β, it is possible to give more weight to either precision (β < 1) or recall
(β > 1) (Manning et al., 2008). As I value recall higher than precision, I will indicate F1
(β = 1) and F5 (β = 5) scores. Indeed, I want to penalize the algorithm for clustering
together pairs that are separate in the gold standard while not penalizing the algorithm
for splitting pairs that are together in the gold standard.
All measures of similarity lie in the interval [0, 1], with 0 being most dissimilar and 1
being identical. As there is no ultimate measure and all measures of similarity have their
drawbacks (Wagner and Wagner, 2007), all measures will be indicated in the results
section.
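Equations (22) to (27) translate directly into code once the four pair counts are known. The sketch below is illustrative only and makes two assumptions: the function name `clustering_scores` is hypothetical, and an undefined value (division by zero) is represented by `None`.

```python
import math


def clustering_scores(n11, n00, n10, n01, beta=1.0):
    """Rand, Jaccard, Fowlkes-Mallows and F(beta) from pair counts."""

    def ratio(numerator, denominator):
        # None marks a value that cannot be calculated (division by zero).
        return numerator / denominator if denominator else None

    precision = ratio(n11, n11 + n10)  # equation (26)
    recall = ratio(n11, n11 + n01)     # equation (27)
    if precision is None or recall is None:
        f_beta = None
    else:
        f_beta = ratio((beta ** 2 + 1) * precision * recall,
                       beta ** 2 * precision + recall)  # equation (25)
    return {
        "rand": ratio(n11 + n00, n11 + n10 + n01 + n00),                  # (22)
        "jaccard": ratio(n11, n11 + n10 + n01),                           # (23)
        "fowlkes_mallows": ratio(n11, math.sqrt((n11 + n10) * (n11 + n01))),  # (24)
        "f": f_beta,
    }


scores = clustering_scores(n11=1, n00=2, n10=2, n01=1, beta=1.0)
print(scores)
```

With the counts above, P = 1/3 and R = 1/2, so the Rand Index is 0.5, the Jaccard Index 0.25 and F1 is 0.4. If no pair is clustered together (n11 = n10 = 0), precision is undefined and the F-Score cannot be computed.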

4.4 Weakly supervised language model induction


The language model induction approach works in two stages. In the first stage, n-gram
language models are induced from the text. In the second stage, the text is mapped to
the induced models. The algorithm for the language model induction is as follows:

Algorithm 2 Model induction

1: IM
2: for word in words do
3:     modelAndScore ← MS(word)
4:     score ← modelAndScore.score
5:     if score < threshold then
6:         model ← M(word)
7:         models.add(model)
8:     else
9:         maxModel ← modelAndScore.model
10:        maxModel.update(word)
11:    end if
12: end for

First of all, an initial language model is created. For each word, the maximum model
and maximum score are calculated. These values correspond to the language model that
yielded the highest probability for the word in question, and the associated probability.
If the score falls below a threshold t (i.e. none of the existing language models model
the word well enough), a new language model is created on the basis of the word and
added to the list of language models. Otherwise, the top scoring language model is
updated with the word in question.
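The induction loop described above can be sketched in a few lines. This is a simplified illustration rather than the implementation used in this work: the class `CharModel` is a hypothetical stand-in that only tracks character unigrams, and its score (the fraction of a word's characters already known to the model) is an assumption that replaces the n-gram probability.

```python
from collections import Counter


class CharModel:
    """Toy character-unigram model (hypothetical stand-in for an n-gram model)."""

    def __init__(self, word):
        self.counts = Counter()
        self.update(word)

    def update(self, word):
        self.counts.update(word)

    def score(self, word):
        # Assumed scoring: share of the word's characters the model has seen.
        if not word:
            return 0.0
        return sum(1 for ch in word if ch in self.counts) / len(word)


def max_model_and_score(models, word):
    """Return the best-scoring model and its score (cf. algorithm 4)."""
    best_model, best_score = None, 0.0
    for model in models:
        score = model.score(word)
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score


def induce_models(words, threshold):
    """Single forward pass of the induction loop (cf. algorithm 2)."""
    models = [CharModel(words[0])]  # initial model from the first word
    for word in words:
        best_model, best_score = max_model_and_score(models, word)
        if best_model is None or best_score < threshold:
            models.append(CharModel(word))  # no model fits: induce a new one
        else:
            best_model.update(word)         # strengthen the best model
    return models


models = induce_models(["hello", "hell", "мир", "мир"], threshold=0.5)
print(len(models))  # 2
```

In this toy run the two Latin-script words update the initial model, the first Cyrillic word induces a second model, and the repeated Cyrillic word updates that second model.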

As the text structure itself influences the quality of the induced models, the
language model induction is run i times (i ≥ 1), with one iteration consisting of two
induction steps, once forward and once backward, and j times from a random position
(j ≥ 0). The initial model creation thus either picks the first word of the text (as shown
in algorithm 3 line 2), or the last word of the text, or a random word.

Algorithm 3 Initial model creation

1: function IM
2:     word ← words.first
3:     model ← createModel(word)
4:     models.add(model)
5: end function

Algorithm 4 Max model and max score

1: function MS(word)
2:     maxScore ← 0
3:     maxModel ← none
4:     for model in models do
5:         score ← model.score(word)
6:         if score > maxScore then
7:             maxScore ← score
8:             maxModel ← model
9:         end if
10:    end for
11:    return maxModel, maxScore
12: end function

Algorithm 4 returns both the max model and the max score wrapped as a custom
object. The individual values can then be read as necessary.
After the models have been induced, the most similar models are merged based
on distributional similarity. Distributional similarity is calculated as explained below.
This merging step only merges one model from the forward induction group with one
model from the backward induction group. The resulting model is added to the set of
probable (“silver”) models.
Merging is performed according to algorithm 5. The merging algorithm only retains
the common set of unigrams from both models, and all resulting bi- and trigrams,
excluding any bi- and trigrams that contain characters that occur in only one of the
models. The values for the resulting language model are calculated according to one
of four different merge modes.
The merge modes are:

Algorithm 5 Model merger
1: function (model1 ,model2 , mode)
2: merged ← ∅
3: for unigram u1 in model1 .unigrams do
4: for unigram u2 in model2 .unigrams do
5: if u1 = u2 then
6: v1 ← f (u1 ) ◃ f (u1 ) is the frequency of u1
7: v2 ← f (u2 )
8: value ← mode(v1 , v2 )
9: unigram ← u1 ◃ or u2 , since both are equal
10: merged ← (unigram, value)
11: else
12: exclude ← u1
13: exclude ← u2
14: end if
15: end for
16: end for
17: for all bigrams b in model1 and model2 do
18: if not exclude contains any char in b then
19: v1 ← f (b, model1 ) or 0 ◃ frequency of b in model1
20: ◃ or 0 if it does not exist
21: v2 ← f (b, model2 ) or 0
22: value ← mode(v1 , v2 )
23: merged ← (b, value)
24: end if
25: end for
26: for all trigrams t in model1 and model2 do
27: if not exclude contains any char in t then
28: v1 ← f (t, model1 ) or 0
29: v2 ← f (t, model2 ) or 0
30: value ← mode(v1 , v2 )
31: merged ← (t, value)
32: end if
33: end for
34: return merged
35: end function

• MAX: use the maximum value ($\max(v_1, v_2)$)

• MIN: use the minimum value ($\min(v_1, v_2)$)

• MEAN: use the mean value ($\frac{v_1 + v_2}{2}$)

• ADD: use the sum of the values ($v_1 + v_2$)
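The merging step with its four modes can be illustrated as follows. This is a sketch under stated assumptions, not the implementation used in this work: a model is represented as a dictionary mapping the n-gram order (1, 2, 3) to a frequency dictionary, the names `MODES` and `merge_models` are hypothetical, and the excluded characters are taken to be those occurring in exactly one of the two models, following the prose description of the merger.

```python
MODES = {
    "MAX": max,
    "MIN": min,
    "MEAN": lambda v1, v2: (v1 + v2) / 2,
    "ADD": lambda v1, v2: v1 + v2,
}


def merge_models(model1, model2, mode):
    """Merge two models given as {1: unigrams, 2: bigrams, 3: trigrams}."""
    combine = MODES[mode]
    uni1, uni2 = model1[1], model2[1]
    shared = uni1.keys() & uni2.keys()    # unigrams present in both models
    excluded = uni1.keys() ^ uni2.keys()  # characters present in only one model
    merged = {1: {u: combine(uni1[u], uni2[u]) for u in shared}, 2: {}, 3: {}}
    for n in (2, 3):
        for gram in model1[n].keys() | model2[n].keys():
            if any(ch in excluded for ch in gram):
                continue  # drop n-grams containing an excluded character
            # A missing n-gram counts as frequency 0 in that model.
            merged[n][gram] = combine(model1[n].get(gram, 0),
                                      model2[n].get(gram, 0))
    return merged


m1 = {1: {"a": 3, "b": 1, "c": 2}, 2: {"ab": 1, "ac": 2}, 3: {"abc": 1}}
m2 = {1: {"a": 1, "b": 4, "d": 1}, 2: {"ab": 2, "bd": 1, "ba": 1}, 3: {"aba": 2}}
merged_add = merge_models(m1, m2, "ADD")
print(merged_add)
```

In this example 'c' and 'd' each occur in only one model, so the n-grams 'ac', 'bd' and 'abc' are dropped, while 'ba' is kept with the frequency 0 + 1 = 1 in ADD mode.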

If the random iteration count j > 0, a random word is chosen and the induction
is run once forward and once backward starting from this position. Then, the most
similar models from each set are merged and added to the set of probable models.
It should be noted that setting the parameter j > 0 will make the algorithm
non-deterministic.
The model induction is then repeated while the iteration count i has not been
reached or until no more models are induced, with the difference that for each word,
each probable model is first consulted. If any of the probable models yields a score
higher than the threshold value t, it is assumed that the word is already well
represented by one of the probable models and no models are induced for this word. If the
score falls below the threshold value t, induction is run as described.
At the end of the induction loop, all probable models are checked against each other.
While there are two models that have a similarity below the silver threshold value s,
the two models are merged and added to the set of very probable (“gold”) models.
If the set of probable models is not empty after this merging step, all remaining
probable models are added to the set of very probable models.
In the second stage, the text is segmented according to the induced “gold” models.
For each word, the language model with the highest probability for the word is chosen
as that word’s hypothetical language model.
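The second stage amounts to an argmax over the induced models for each word. The sketch below only illustrates this assignment step; the function `segment`, the scoring function `char_coverage` and the representation of a model as a set of characters are hypothetical stand-ins for the induced n-gram models and their probabilities. Words are assumed to be non-empty.

```python
def segment(words, models, score):
    """Label each word with the index of its best-scoring model."""
    labels = []
    for word in words:
        best = max(range(len(models)), key=lambda k: score(models[k], word))
        labels.append(best)
    return labels


def char_coverage(model, word):
    # Assumed toy score: fraction of the word's characters known to the model.
    return sum(ch in model for ch in word) / len(word)


gold_models = [
    set("abcdefghijklmnopqrstuvwxyz"),
    set("абвгдежзийклмнопрстуфхцчшщъыьэюя"),
]
labels = segment(["hello", "мир", "world"], gold_models, char_coverage)
print(labels)  # [0, 1, 0]
```

Since every word is assigned independently, the language may change after each word, which is what allows single-word inclusions to be detected.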

4.4.1 Distributional similarity


Suppose we have three models with the distributions of letters as shown in figures 19,
20 and 21.[6] Similarity could be calculated based on the occurrence of unigrams/letters
alone, i.e. if model1 contains the letter ‘a’ and model2 also contains the letter ‘a’, their
similarity increases by 1.
However, if we calculate similarity in such a way, all three models are equally
similar to each other, as each of the letters occurs at least once in each model. Yet, it should
be clear that models 1 and 2 are very similar to each other while model 3 is dissimilar.
Therefore, in order to include the distribution of letters in the similarity measure,
similarity is calculated as shown in algorithm 6.

[6] The figures shown are used for illustration purposes only and do not necessarily
reflect real language models.

Figure 19: Language model: Distribution 1

Figure 20: Language model: Distribution 2

Figure 21: Language model: Distribution 3

Algorithm 6 Distributional Similarity Calculation

1: function (model1, model2)
2:     similarity ← 0
3:     difference ← 1    ◃ Initialize difference to 1 to avoid division by zero
4:     for unigram u1 in model1.unigrams do
5:         for unigram u2 in model2.unigrams do
6:             if u1 = u2 then    ◃ unigram occurs in both models
7:                 v1 ← f(u1) / model1.size    ◃ Normalize value by model size
8:                 v2 ← f(u2) / model2.size
9:                 q ← |v1 − v2| / (v1 + v2)
10:                similarity ← similarity + (2 − q)
11:            else
12:                difference ← difference + 1
13:            end if
14:        end for
15:    end for
16:    return similarity / difference
17: end function

with f(c) returning the frequency of the character c. The number 2 in (2 − q) in line
10 can be explained as follows: q expresses the dissimilarity of the models with regard
to a unigram distribution with 0 ≤ q ≤ 1, hence (1 − q) expresses the similarity. To
this, we add 1, as we increase similarity by 1 due to the match; we augment the simple
increase of 1 by the similarity of the distribution.
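Algorithm 6 can be transcribed almost literally. The sketch below follows the nested loops of the pseudocode; it assumes that a model is given as a dictionary of unigram frequencies with positive counts and that the model size is the sum of these frequencies. Both the representation and the function name `distributional_similarity` are assumptions made for illustration.

```python
def distributional_similarity(unigrams1, unigrams2):
    """Literal transcription of algorithm 6 for unigram frequency dicts."""
    size1 = sum(unigrams1.values())  # assumed definition of model size
    size2 = sum(unigrams2.values())
    similarity = 0.0
    difference = 1  # starts at 1 to avoid division by zero
    for u1, f1 in unigrams1.items():
        for u2, f2 in unigrams2.items():
            if u1 == u2:
                v1 = f1 / size1  # relative frequency in model 1
                v2 = f2 / size2  # relative frequency in model 2
                q = abs(v1 - v2) / (v1 + v2)  # 0 <= q <= 1
                similarity += 2 - q
            else:
                difference += 1
    return similarity / difference


model_a = {"a": 5, "b": 1, "c": 4}
model_b = {"a": 5, "b": 2, "c": 4}
model_c = {"a": 1, "b": 6, "c": 1}
print(distributional_similarity(model_a, model_b))
print(distributional_similarity(model_a, model_c))
```

All three toy models contain the same letters, so a measure based on occurrence alone could not tell them apart; with the distribution taken into account, the first pair scores higher than the second.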

4.4.2 Evaluating results
The results of this approach can be interpreted as clusters, where each language model
represents one cluster core and all words assigned to that model make up that cluster.
Evaluation will hence be analogous to the evaluation of the clustering approach.

4.4.3 Estimating the parameters


As the language model induction can be controlled by parameters, we have to find a
combination of parameters that works well for our task. The parameters i, j and “merge
mode” have been estimated on the development set. The development set contains
similar documents to those in the test set. The development set can be found in the
appendix.
It has been found that the parameter combination i = 4, j = 2, ADD yields good
results across the development set. Hence, these values have been used for the test set
evaluation.

5 Results
‘Baseline’ indicates the measurement where all words have been thrown into one
cluster, measured against the gold standard. For ‘Baseline 2’, every word has been put into
its own cluster and this clustering is evaluated against the gold standard. The column
‘F1’ stands for the F1 score and the ‘F5’ column stands for the F5 score.
If any of the ‘runs’ yields a higher score than any of the baseline values, the
maximum score is indicated in bold. If a field contains ‘n/a’, this means that the value
could not be calculated (most often because a division by zero would have
occurred).

5.1 N-Gram language model

Rand Jaccard Fowlkes-Mallows F1 F5
German–English
Baseline 0.9259 0.9259 0.9622 0.9615 0.9285
Baseline 2 0.0000 0.0740 n/a n/a n/a
NGLM 0.5200 0.4428 0.6597 0.6138 0.9275
German–Finnish–Turkish
Baseline 0.3312 0.3312 0.5755 0.4976 0.3400
Baseline 2 0.6721 0.0103 0.1015 0.0204 0.2132
NGLM 0.8104 0.3872 0.5615 0.5582 0.5081
English–French
Baseline 0.7038 0.7038 0.8389 0.8261 0.7119
Baseline 2 0.3064 0.0145 0.1207 0.0287 0.2777
NGLM 0.6246 0.3540 0.5322 0.5229 0.4459
English–Transliterated Greek
Baseline 0.8809 0.8809 0.9385 0.9385 0.8850
Baseline 2 0.1269 0.0090 0.0949 0.0178 0.1911
NGLM 0.6932 0.5492 0.7117 0.7090 0.7708
Italian–German
Baseline 0.5807 0.5807 0.7620 0.7347 0.5902
Baseline 2 0.4227 0.0060 0.0776 0.0119 0.1360
NGLM 0.7010 0.2977 0.4740 0.4589 0.5969

Table 4: N-Gram language model results: Latin script

Rand Jaccard Fowlkes-Mallows F1 F5
Greek–Russian
Baseline 0.5578 0.5578 0.7468 0.7161 0.5674
Baseline 2 0.4440 0.0034 0.0584 0.0068 0.0817
NGLM 0.7597 0.5108 0.6762 0.6762 0.6694
English–Greek
Baseline 0.9179 0.9179 0.9580 0.9571 0.9208
Baseline 2 0.0946 0.0136 0.1167 0.0269 0.2643
NGLM 0.5665 0.3867 0.5586 0.5577 0.5877
English–Spanish–Arabic
Baseline 0.3354 0.3354 0.5791 0.5023 0.3442
Baseline 2 0.6682 0.0109 0.1044 0.0215 0.2227
NGLM 0.9204 0.7489 0.8573 0.8564 0.8936
English–Chinese
Baseline 0.8474 0.8474 0.9205 0.9174 0.8524
Baseline 2 0.1595 0.0082 0.0909 0.0164 0.1781
NGLM 0.6573 0.4476 0.6259 0.6184 0.7208
Ukrainian–Russian
Baseline 0.4950 0.4950 0.7035 0.6622 0.5048
Baseline 2 0.5060 0.0022 0.0472 0.0044 0.0550
NGLM 0.6755 0.3857 0.5644 0.5567 0.4831

Table 5: N-Gram language model results: Mixed script

Rand Jaccard Fowlkes-Mallows F1 F5
Pali 1
Baseline 0.3131 0.3131 0.5595 0.4768 0.3216
Baseline 2 0.6906 0.0118 0.1089 0.0234 0.2379
NGLM 0.8153 0.2069 0.3434 0.3429 0.3608
Pali 2
Baseline 0.3589 0.3589 0.5991 0.5283 0.3680
Baseline 2 0.6495 0.0238 0.1543 0.0465 0.3880
NGLM 0.7173 0.1958 0.3336 0.3275 0.3971
Pali 3
Baseline 0.4947 0.4947 0.7033 0.6619 0.5045
Baseline 2 0.5075 0.0045 0.0676 0.0091 0.1067
NGLM 0.7874 0.0816 0.1692 0.1508 0.1064
Pali 4
Baseline 0.4000 0.4000 0.6324 0.5714 0.4094
Baseline 2 0.6000 0.0000 n/a n/a n/a
NGLM 0.3000 0.1250 0.2357 0.2222 0.1699
Pali 5
Baseline 0.5800 0.5800 0.7615 0.7341 0.5895
Baseline 2 0.4236 0.0063 0.0798 0.0126 0.1430
NGLM 0.4777 0.2496 0.4065 0.3995 0.4816

Table 6: N-Gram language model results: Pali data

Rand Jaccard Fowlkes-Mallows F1 F5
Twitter 1
Baseline 0.4615 0.4615 0.6793 0.6315 0.4712
Baseline 2 0.5384 0.0000 n/a n/a n/a
NGLM 0.8589 0.5925 0.7542 0.7441 0.8757
Twitter 2
Baseline 0.5555 0.5555 0.7453 0.7142 0.5652
Baseline 2 0.4444 0.0000 n/a n/a n/a
NGLM 0.7485 0.6090 0.7591 0.7570 0.8121
Twitter 3
Baseline 0.6583 0.6583 0.8113 0.7939 0.6670
Baseline 2 0.3416 0.0000 n/a n/a n/a
NGLM 0.6750 0.4347 0.6479 0.6060 0.8996
Twitter 4
Baseline 0.8750 0.8750 0.9354 0.9333 0.8792
Baseline 2 0.1250 0.0000 n/a n/a n/a
NGLM 0.7250 0.5822 0.7597 0.7360 0.9545
Twitter 5
Baseline 0.4285 0.4285 0.6546 0.6000 0.4382
Baseline 2 0.5714 0.0000 n/a n/a n/a
NGLM 0.6666 0.1250 0.2672 0.2222 0.4561

Table 7: N-Gram language model results: Twitter data

5.2 Textcat

Rand Jaccard Fowlkes-Mallows F1 F5
German–English
Baseline 0.9259 0.9259 0.9622 0.9615 0.9285
Baseline 2 0.0000 0.0740 n/a n/a n/a
Textcat 0.8632 0.8518 0.9200 0.9200 0.9200
German–Finnish–Turkish
Baseline 0.3312 0.3312 0.5755 0.4976 0.3400
Baseline 2 0.6721 0.0103 0.1015 0.0204 0.2132
Textcat 0.4095 0.1903 0.3823 0.3198 0.2124
English–French
Baseline 0.7038 0.7038 0.8389 0.8261 0.7119
Baseline 2 0.3064 0.0145 0.1207 0.0287 0.2777
Textcat 0.3890 0.3211 0.5476 0.4861 0.3411
English–Transliterated Greek
Baseline 0.8809 0.8809 0.9385 0.9385 0.8850
Baseline 2 0.1269 0.0090 0.0949 0.0178 0.1911
Textcat 0.5202 0.4853 0.6678 0.6535 0.5492
Italian–German
Baseline 0.5807 0.5807 0.7620 0.7347 0.5902
Baseline 2 0.4227 0.0060 0.0776 0.0119 0.1360
Textcat 0.4030 0.3057 0.5014 0.4682 0.3520

Table 8: Textcat results: Latin script

Rand Jaccard Fowlkes-Mallows F1 F5
Greek–Russian
Baseline 0.5578 0.5578 0.7468 0.7161 0.5674
Baseline 2 0.4440 0.0034 0.0584 0.0068 0.0817
Textcat 0.4468 0.2971 0.4769 0.4581 0.3644
English–Greek
Baseline 0.9179 0.9179 0.9580 0.9571 0.9208
Baseline 2 0.0946 0.0136 0.1167 0.0269 0.2643
Textcat 0.5357 0.4933 0.6730 0.6607 0.5619
English–Spanish–Arabic
Baseline 0.3354 0.3354 0.5791 0.5023 0.3442
Baseline 2 0.6682 0.0109 0.1044 0.0215 0.2227
Textcat 0.3956 0.2832 0.5042 0.4414 0.3052
English–Chinese
Baseline 0.8474 0.8474 0.9205 0.9174 0.8524
Baseline 2 0.1595 0.0082 0.0909 0.0164 0.1781
Textcat 0.5018 0.4468 0.6251 0.6177 0.5408
Ukrainian–Russian
Baseline 0.4950 0.4950 0.7035 0.6622 0.5048
Baseline 2 0.5060 0.0022 0.0472 0.0044 0.0550
Textcat 0.3787 0.2625 0.4472 0.4159 0.3105

Table 9: Textcat results: Mixed script

Rand Jaccard Fowlkes-Mallows F1 F5
Pali 1
Baseline 0.3131 0.3131 0.5595 0.4768 0.3216
Baseline 2 0.6906 0.0118 0.1089 0.0234 0.2379
Textcat 0.4531 0.2508 0.4849 0.4011 0.2641
Pali 2
Baseline 0.3589 0.3589 0.5991 0.5283 0.3680
Baseline 2 0.6495 0.0238 0.1543 0.0465 0.3880
Textcat 0.4307 0.2745 0.5088 0.4307 0.2888
Pali 3
Baseline 0.4947 0.4947 0.7033 0.6619 0.5045
Baseline 2 0.5075 0.0045 0.0676 0.0091 0.1067
Textcat 0.2032 0.0704 0.2502 0.1315 0.0736
Pali 4
Baseline 0.4000 0.4000 0.6324 0.5714 0.4094
Baseline 2 0.6000 0.0000 n/a n/a n/a
Textcat 0.5000 0.1666 0.2886 0.2857 0.2524
Pali 5
Baseline 0.5800 0.5800 0.7615 0.7341 0.5895
Baseline 2 0.4236 0.0063 0.0798 0.0126 0.1430
Textcat 0.5090 0.3458 0.5141 0.5140 0.5236

Table 10: Textcat results: Pali data

Rand Jaccard Fowlkes-Mallows F1 F5
Twitter 1
Baseline 0.4615 0.4615 0.6793 0.6315 0.4712
Baseline 2 0.5384 0.0000 n/a n/a n/a
Textcat 0.3736 0.2597 0.4460 0.4123 0.3049
Twitter 2
Baseline 0.5555 0.5555 0.7453 0.7142 0.5652
Baseline 2 0.4444 0.0000 n/a n/a n/a
Textcat 0.4678 0.4347 0.6158 0.6060 0.5207
Twitter 3
Baseline 0.6583 0.6583 0.8113 0.7939 0.6670
Baseline 2 0.3416 0.0000 n/a n/a n/a
Textcat 0.6838 0.6446 0.8011 0.7839 0.6586
Twitter 4
Baseline 0.8750 0.8750 0.9354 0.9333 0.8792
Baseline 2 0.1250 0.0000 n/a n/a n/a
Textcat 0.8833 0.8666 0.9309 0.9285 0.8711
Twitter 5
Baseline 0.4285 0.4285 0.6546 0.6000 0.4382
Baseline 2 0.5714 0.0000 n/a n/a n/a
Textcat 0.3333 0.3333 0.5773 0.5000 0.3421

Table 11: Textcat results: Twitter data

5.3 Clustering
The first run indicates the value after one clustering step, and the second run indicates
the value after applying the clustering algorithm to the results of the first run.

Rand Jaccard Fowlkes-Mallows F1 F5
German–English
Baseline 0.9259 0.9259 0.9622 0.9615 0.9285
Baseline 2 0.0000 0.0740 n/a n/a n/a
First run 0.4102 0.3929 0.6069 0.5642 0.8549
Second run 0.2336 0.1970 0.4199 0.3291 0.7712
German–Finnish–Turkish
Baseline 0.3312 0.3312 0.5755 0.4976 0.3400
Baseline 2 0.6721 0.0103 0.1015 0.0204 0.2132
First run 0.4841 0.1764 0.3369 0.2998 0.2110
Second run 0.6259 0.1611 0.2840 0.2775 0.2320
English–French
Baseline 0.7038 0.7038 0.8389 0.8261 0.7119
Baseline 2 0.3064 0.0145 0.1207 0.0287 0.2777
First run 0.4051 0.2980 0.5001 0.4592 0.3362
Second run 0.4601 0.1836 0.3116 0.3103 0.2857
English–Transliterated Greek
Baseline 0.8809 0.8809 0.9385 0.9385 0.8850
Baseline 2 0.1269 0.0090 0.0949 0.0178 0.1911
First run 0.5867 0.3977 0.5725 0.5691 0.6320
Second run 0.5423 0.3161 0.4909 0.4804 0.5934
Italian–German
Baseline 0.5807 0.5807 0.7620 0.7347 0.5902
Baseline 2 0.4227 0.0060 0.0776 0.0119 0.1360
First run 0.4222 0.2838 0.4640 0.4421 0.3453
Second run 0.4915 0.2472 0.4006 0.3964 0.3499

Table 12: Clustering results: Latin script

Rand Jaccard Fowlkes-Mallows F1 F5
Greek–Russian
Baseline 0.5578 0.5578 0.7468 0.7161 0.5674
Baseline 2 0.4440 0.0034 0.0584 0.0068 0.0817
First run 0.5787 0.3811 0.5672 0.5519 0.4549
Second run 0.7536 0.3883 0.5899 0.4494 0.7914
English–Greek
Baseline 0.9179 0.9179 0.9580 0.9571 0.9208
Baseline 2 0.0946 0.0136 0.1167 0.0269 0.2643
First run 0.4244 0.2482 0.4015 0.3977 0.4553
Second run 0.3705 0.0855 0.1784 0.1576 0.2777
English–Spanish–Arabic
Baseline 0.3354 0.3354 0.5791 0.5023 0.3442
Baseline 2 0.6682 0.0109 0.1044 0.0215 0.2227
First run 0.8016 0.5650 0.7400 0.7221 0.6008
Second run 0.7226 0.2860 0.4495 0.4448 0.5130
English–Chinese
Baseline 0.8474 0.8474 0.9205 0.9174 0.8524
Baseline 2 0.1595 0.0082 0.0909 0.0164 0.1781
First run 0.5480 0.3356 0.5087 0.5025 0.5866
Second run 0.5138 0.2584 0.4361 0.4107 0.5957
Ukrainian–Russian
Baseline 0.4950 0.4950 0.7035 0.6622 0.5048
Baseline 2 0.5060 0.0022 0.0472 0.0044 0.0550
First run 0.5867 0.1953 0.3268 0.3267 0.3305
Second run 0.5934 0.1154 0.2178 0.2070 0.2907

Table 13: Clustering results: Mixed script

Rand Jaccard Fowlkes-Mallows F1 F5
Pali 1
Baseline 0.3131 0.3131 0.5595 0.4768 0.3216
Baseline 2 0.6906 0.0118 0.1089 0.0234 0.2379
First run 0.4674 0.2540 0.4898 0.4051 0.2666
Second run 0.7168 0.2547 0.4118 0.4060 0.3516
Pali 2
Baseline 0.3589 0.3589 0.5991 0.5283 0.3680
Baseline 2 0.6495 0.0238 0.1543 0.0465 0.3880
First run 0.6738 0.3026 0.4777 0.4646 0.3825
Second run 0.6646 0.1865 0.3147 0.3144 0.3021
Pali 3
Baseline 0.4947 0.4947 0.7033 0.6619 0.5045
Baseline 2 0.5075 0.0045 0.0676 0.0091 0.1067
First run 0.5686 0.0746 0.2002 0.1389 0.0831
Second run 0.7534 0.0911 0.1962 0.1670 0.1125
Pali 4
Baseline 0.4000 0.4000 0.6324 0.5714 0.4094
Baseline 2 0.6000 0.0000 n/a n/a n/a
First run 0.5333 0.3000 0.5477 0.4615 0.3083
Second run 0.3000 0.3000 0.5477 0.4615 0.3083
Pali 5
Baseline 0.5800 0.5800 0.7615 0.7341 0.5895
Baseline 2 0.4236 0.0063 0.0798 0.0126 0.1430
First run 0.5294 0.2472 0.4111 0.3965 0.5242
Second run 0.4666 0.1214 0.2524 0.2166 0.4117

Table 14: Clustering results: Pali data

Rand Jaccard Fowlkes-Mallows F1 F5
Twitter 1
Baseline 0.4615 0.4615 0.6793 0.6315 0.4712
Baseline 2 0.5384 0.0000 n/a n/a n/a
First run 0.8681 0.7142 0.8451 0.8333 0.7222
Second run 0.8461 0.6000 0.7745 0.7499 0.9750
Twitter 2
Baseline 0.5555 0.5555 0.7453 0.7142 0.5652
Baseline 2 0.4444 0.0000 n/a n/a n/a
First run 0.4575 0.3941 0.5655 0.5654 0.5573
Second run 0.4967 0.3888 0.5615 0.5600 0.6012
Twitter 3
Baseline 0.6583 0.6583 0.8113 0.7939 0.6670
Baseline 2 0.3416 0.0000 n/a n/a n/a
First run 0.4571 0.3595 0.5525 0.5289 0.7215
Second run 0.3523 0.2093 0.3997 0.3461 0.6428
Twitter 4
Baseline 0.8750 0.8750 0.9354 0.9333 0.8792
Baseline 2 0.1250 0.0000 n/a n/a n/a
First run 0.9019 0.8584 0.9265 0.9238 0.8631
Second run 0.6250 0.5000 0.6789 0.6666 0.8080
Twitter 5
Baseline 0.4285 0.4285 0.6546 0.6000 0.4382
Baseline 2 0.5714 0.0000 n/a n/a n/a
First run 0.7142 0.4666 0.6831 0.6363 0.4764
Second run 0.5714 0.3076 0.4780 0.4705 0.4046

Table 15: Clustering results: Twitter data

5.4 Language model induction
In addition to highlighting results that outperform the baseline values, the following
tables have been color coded. Results that outperform the clustering algorithm are
indicated in red and results that outperform both the clustering algorithm and the
n-gram language model are indicated in blue.[7]

Rand Jaccard Fowlkes-Mallows F1 F5
German–English
Baseline 0.9259 0.9259 0.9622 0.9615 0.9285
Baseline 2 0.0000 0.0740 n/a n/a n/a
Inducted 0.6837 0.6574 0.7988 0.7932 0.8896
German–Finnish–Turkish
Baseline 0.3312 0.3312 0.5755 0.4976 0.3400
Baseline 2 0.6721 0.0103 0.1015 0.0204 0.2132
Inducted 0.6438 0.1771 0.3057 0.3009 0.2588
English–French
Baseline 0.7038 0.7038 0.8389 0.8261 0.7119
Baseline 2 0.3064 0.0145 0.1207 0.0287 0.2777
Inducted 0.6171 0.2835 0.4427 0.4418 0.4692
English–Transliterated Greek
Baseline 0.8809 0.8809 0.9385 0.9385 0.8850
Baseline 2 0.1269 0.0090 0.0949 0.0178 0.1911
Inducted 0.4436 0.2398 0.4277 0.3868 0.6382
Italian–German
Baseline 0.5807 0.5807 0.7620 0.7347 0.5902
Baseline 2 0.4227 0.0060 0.0776 0.0119 0.1360
Inducted 0.5658 0.1536 0.2871 0.2664 0.4065

Table 16: Induction results: Latin script


[7] Results that outperform only the n-gram language model would have been indicated
in green, but there is no score that outperforms only the n-gram language model.

Rand Jaccard Fowlkes-Mallows F1 F5
Greek–Russian
Baseline 0.5578 0.5578 0.7468 0.7161 0.5674
Baseline 2 0.4440 0.0034 0.0584 0.0068 0.0817
Inducted 0.7142 0.4222 0.5940 0.5937 0.6125
English–Greek
Baseline 0.9179 0.9179 0.9580 0.9571 0.9208
Baseline 2 0.0946 0.0136 0.1167 0.0269 0.2643
Inducted 0.4769 0.3266 0.5089 0.4924 0.6423
English–Spanish–Arabic
Baseline 0.3354 0.3354 0.5791 0.5023 0.3442
Baseline 2 0.6682 0.0109 0.1044 0.0215 0.2227
Inducted 0.7783 0.5677 0.7534 0.7242 0.5773
English–Chinese
Baseline 0.8474 0.8474 0.9205 0.9174 0.8524
Baseline 2 0.1595 0.0082 0.0909 0.0164 0.1781
Inducted 0.5657 0.3343 0.5258 0.5011 0.6953
Ukrainian–Russian
Baseline 0.4950 0.4950 0.7035 0.6622 0.5048
Baseline 2 0.5060 0.0022 0.0472 0.0044 0.0550
Inducted 0.6289 0.1000 0.1935 0.1818 0.2659

Table 17: Induction results: Mixed script

Rand Jaccard Fowlkes-Mallows F1 F5
Pali 1
Baseline 0.3131 0.3131 0.5595 0.4768 0.3216
Baseline 2 0.6906 0.0118 0.1089 0.0234 0.2379
Inducted 0.7856 0.1683 0.2898 0.2882 0.3188
Pali 2
Baseline 0.3589 0.3589 0.5991 0.5283 0.3680
Baseline 2 0.6495 0.0238 0.1543 0.0465 0.3880
Inducted 0.8148 0.5000 0.6686 0.6666 0.7176
Pali 3
Baseline 0.4947 0.4947 0.7033 0.6619 0.5045
Baseline 2 0.5075 0.0045 0.0676 0.0091 0.1067
Inducted 0.8492 0.0569 0.1083 0.1078 0.1186
Pali 4
Baseline 0.4000 0.4000 0.6324 0.5714 0.4094
Baseline 2 0.6000 0.0000 n/a n/a n/a
Inducted 0.6000 0.0000 0.0000 n/a n/a
Pali 5
Baseline 0.5800 0.5800 0.7615 0.7341 0.5895
Baseline 2 0.4236 0.0063 0.0798 0.0126 0.1430
Inducted 0.4033 0.2082 0.3504 0.3446 0.4134

Table 18: Induction results: Pali data

Rand Jaccard Fowlkes-Mallows F1 F5
Twitter 1
Baseline 0.4615 0.4615 0.6793 0.6315 0.4712
Baseline 2 0.5384 0.0000 n/a n/a n/a
Inducted 0.6282 0.3695 0.5515 0.5396 0.4533
Twitter 2
Baseline 0.5555 0.5555 0.7453 0.7142 0.5652
Baseline 2 0.4444 0.0000 n/a n/a n/a
Inducted 0.7719 0.6020 0.7687 0.7515 0.9325
Twitter 3
Baseline 0.6583 0.6583 0.8113 0.7939 0.6670
Baseline 2 0.3416 0.0000 n/a n/a n/a
Inducted 0.5916 0.3000 0.5236 0.4615 0.8185
Twitter 4
Baseline 0.8750 0.8750 0.9354 0.9333 0.8792
Baseline 2 0.1250 0.0000 n/a n/a n/a
Inducted 0.5250 0.3736 0.5615 0.5439 0.7055
Twitter 5
Baseline 0.4285 0.4285 0.6546 0.6000 0.4382
Baseline 2 0.5714 0.0000 n/a n/a n/a
Inducted 1.0000 1.0000 1.0000 1.0000 1.0000

Table 19: Induction results: Twitter data

6 Discussion
The work by Seldin et al. (2001) is similar to the work presented here. They propose
an unsupervised language (and protein sequence) segmentation approach that yields
accurate segmentations. While their work looks promising, it also has its drawbacks.
Their method requires longer monolingual text fragments and a sizable amount of text.
Furthermore, they disallow switching language models after each word. This
presumption will fail to detect single-word inclusions and structures as shown in figure 22,
where the language alternates after each word.

w1 w2 w3 w4 w5 w6 w7 …

Figure 22: Alternating language structure

While this structure looks very artificial, such a structure is found, for instance,
in the fifth Pali dictionary text, in the passage “Pacati, [Ved. pacati, Igd. *peqǔō, Av.
pac-;”. In this case, ‘red’ corresponds to Pali, ‘blue’ to (abbreviations in) English and
‘green’ to reconstructed Indo-European.

6.1 N-Gram language models


The trained n-gram language model approach works well on the Latin script data,
managing to single out the German inclusion from the English–German text (even though
it is classified as “other” instead of German).
For German–Finnish–Turkish, English–French, English–Transliterated Greek and
Italian–German, the separation of the main languages involved is good, although there
appear to be some problems when words contain non-word characters such as quotes
or parentheses.
Some puzzling misclassifications happen in the English–Transliterated Greek case:
agápe is considered English and éros is considered Transliterated Amharic.
In the Italian–German text, the Italian language leads to a rather important Spanish
cluster due to the relatedness of the two Romance languages.
On the mixed script data set, the results are more diverse. Greek–Russian,
English–Spanish–Arabic and Ukrainian–Russian are segmented well, with
English–Spanish–Arabic having Spanish split into Spanish, French and Italian due to the
relatedness of the languages.
In contrast, the segmentation of English–Greek did not work well at all. Of the two
Greek words ἀγάπη and ἔρως, ἀγάπη was considered French and ἔρως was considered
Russian. It must be noted, though, that these words bear polytonic diacritics, whereas
the model was trained on monotonic Greek.

Also, the segmentation of English–Chinese did not work well. This is probably
due to the way the model was trained. Chinese script is written without whitespace
characters between words, and the correct segmentation of a text written in Chinese
requires in-depth knowledge of the language. Some words are written with only one
character, but others are composed of two or more characters, with the meaning often
being non-compositional; the meaning of a two-character word is different from the
sum of the meaning of the two characters. Sometimes, more than one segmentation
would be possible and the context decides on which segmentation is correct. In other
cases, more than one segmentation might be correct. This problem occurs with all
scripts that are written without whitespace.
As with the simplified assumption in the tokenization of whitespace-scripts, where
I consider a word to be a character sequence delineated by whitespace, I have treated
each character as a word. Adapting the method to Chinese and similar scripts would
have been possible, but would have introduced the need for large amounts of external
linguistic knowledge. Indeed, every possible non-whitespace-script would have to be
considered, and each of the tokenizers would be language dependent, i.e. a tokenizer
for Chinese would not work on Korean or Japanese.
The supervised approach did not work well on the Pali dictionary data. While
English words could be isolated somewhat successfully, the rest of the data proved
difficult to segment. As an example, let us look at the first Pali text. The English
cluster contains almost only English words, but not all of them; the “other” cluster
contains mainly marked up words, and the rest is seemingly haphazardly distributed among
the other models.

Pali 1: abbha

• (AR) ., 134., 289.

• (DE) Miln), imber, dark), Miln

• (EL) (=, (abbhaŋ

• (EN) water, mountain, of, free, (used, or, like, referred, (also, A, is, cloudy,
clouds, later, a, froth, 1, summit, thundering, by, mass, Pv, Oir, obscure, scum,
that, water]., thick, As, from, It, is, at, as, the, in, clouds, things, also

• (ES) (dense, f., sense, expl, rajo

• (FI) 239., rain;, Lat., Vin, perhaps, SnA

• (FR) cloud, Dh, adj., point, cloud, Dhs, A), rain, VvA, DhsA, list

• (IT) \”dark, &, ambha, 3, 1, 317, J, sunshine, cp., abhra, [Vedic, (megho

• (PL) 487, =, S, 295, <br, moon–, 249

• (RU) 348, 53

• (TR) viz., ambu, Vv

• (TrAM) 687, PvA, (°sama, 101, (nīl°, (cp., 64;, (nt.), 581, m., Sn, 1064;

• (TrEL) , Gr., Sk., Idg., to, pabbata, nt.

• (UK) 12)., 273, 617, 348)., 250;, 251)., 382).

• (other) <b> –saŋvilāpa </b>, <b> –mutta </b>, <smallcaps> vi. </smallcaps>,
(mahiyā, <smallcaps> iv. </smallcaps>, cloud\”;, <b> Rāhu </b>, <b> abbhā
</b>, <b> abbhaŋ, <superscript> 9 </superscript>, marajo </b>, abbhāmutta,
valāhaka);, <smallcaps> i. </smallcaps>, <b> abbhāmatta </b>, valāhaka–sikhara,
<superscript> s. </superscript>, <smallcaps> ii. </smallcaps>, <b> dhū-, storm–
cloud, /><b> –kūṭa </b>, thunder–cloud);, <at> a)fro\\s </at>, <b> –paṭala </b>,
<at>o)/mbros</at>, nīla–megha, <superscript>1</superscript>, *m̊bhro, \”dull\”;,
acchādesi);, mahikā</b>, <b> –ghana </b>
On the Twier data, the supervised approach achieved passable results. While
the numbers look great, the actual segmentations do not. For Twier 1, too many
clusters were generated, for Twier 2 and 3, the recognition of French words worked
somewhat, also recognizing English words as French and French words as English.
For Twier 4, the Polish inclusion was isolated but recognized as “other”, together
with “strawberries”. e recognition of transliterated Amharic worked satisfactorily,
yielding ‘naw’ to the Polish model.
As the number of language models increases, so does the risk of misclassification.
As can be seen, there is already a considerable amount of misclassification with only
15 language models. For example, in our data, the English preposition ‘to’ is often
erroneously classified as ‘transliterated Greek’. The Greek particle το ‘to’ can be
either the neuter singular accusative or nominative definite article ‘the’, the
masculine singular accusative or nominative definite article ‘the’ or the 3rd person
neuter singular nominative/accusative weak pronoun ‘it’, and as such is rather frequent
in the language. Since its transliteration coincides with the English preposition, the
transliterated Greek language model in particular tends to claim English ‘to’.
A quick corpus study using the Corpus of Modern Greek⁸ and the Corpus of
Contemporary American English⁹ reveals that the frequency per million words for the
Greek particle το is 22666, while the English preposition ‘to’ has a frequency per
million words of 25193. Their relative frequencies are very close together, and it might
just have happened that the training data used in this work contained more Greek ‘to’s
than English ‘to’s, leading to this misclassification.

⁸ [Link]
⁹ [Link]
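To make the “it might just have happened” argument more concrete, the following back-of-the-envelope sketch compares the expected number of occurrences of both words in two equally sized samples. The sample size of 10,000 tokens and the binomial model are assumptions made purely for illustration; they are not the training set sizes used in this work.

```python
import math

FREQ_PER_MILLION = {"el_to": 22666, "en_to": 25193}  # figures quoted above
SAMPLE_SIZE = 10_000  # hypothetical training sample size (assumption)


def expected_and_sd(per_million: int, n: int) -> tuple[float, float]:
    """Expected count and binomial standard deviation in a sample of n tokens."""
    p = per_million / 1_000_000
    return n * p, math.sqrt(n * p * (1 - p))


el_mean, el_sd = expected_and_sd(FREQ_PER_MILLION["el_to"], SAMPLE_SIZE)
en_mean, en_sd = expected_and_sd(FREQ_PER_MILLION["en_to"], SAMPLE_SIZE)

# Gap between the expected counts, in units of the combined standard deviation.
gap_in_sd = (en_mean - el_mean) / math.sqrt(el_sd**2 + en_sd**2)

print(f"Greek 'to':   {el_mean:.1f} +/- {el_sd:.1f}")
print(f"English 'to': {en_mean:.1f} +/- {en_sd:.1f}")
print(f"gap: {gap_in_sd:.2f} standard deviations")
```

Under these assumptions the gap stays well below two standard deviations, so a pair of samples in which Greek ‘to’ outnumbers English ‘to’ is entirely plausible.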
Other reasons for misclassification include relatedness of the modeled languages
as in the case of Germanic or Romance language families. Also, the text types used
for training and the text types used for testing play an important role, as well as the
amount of training data.
For n-gram language models, the quality of the model depends on the texts
used for training and the texts used in evaluation. It is probable that a different
training set would have yielded different results. This is also the problem with the
supervised approach: language data are needed for training, and the trained models
reflect the training data to some extent.
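The dependence on the training data can be illustrated with a deliberately tiny word-level classifier. The sketch below is a simplification and not the model used in this work: it uses character bigrams with add-one smoothing, the class and function names are hypothetical, and the two training lists are invented toy data.

```python
import math
from collections import Counter

BOS, EOS = "^", "$"  # word boundary symbols (assumption)


def bigrams(word: str) -> list[tuple[str, str]]:
    padded = BOS + word + EOS
    return list(zip(padded, padded[1:]))


class BigramModel:
    """Character bigram model with add-one smoothing (simplification)."""

    def __init__(self, words: list[str], alphabet: set[str]):
        self.pair_counts = Counter(bg for w in words for bg in bigrams(w))
        self.context_counts = Counter(a for w in words for a, _ in bigrams(w))
        self.vocab_size = len(alphabet)

    def log_prob(self, word: str) -> float:
        return sum(
            math.log((self.pair_counts[(a, b)] + 1)
                     / (self.context_counts[a] + self.vocab_size))
            for a, b in bigrams(word)
        )


# Toy training data (invented for illustration, not the thesis corpora).
training = {
    "en": ["the", "that", "this", "then", "there"],
    "de": ["der", "die", "das", "und", "nicht"],
}
alphabet = {ch for ws in training.values() for w in ws for ch in w} | {BOS, EOS}
models = {lang: BigramModel(ws, alphabet) for lang, ws in training.items()}


def classify(word: str) -> str:
    """Assign the word to the model that gives it the highest probability."""
    return max(models, key=lambda lang: models[lang].log_prob(word))


print(classify("the"), classify("der"))
```

Every decision of such a classifier is a direct function of the counts collected from the training lists, so changing the training texts changes the classification, which is the effect described above.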

6.2 Textcat
Textcat works well on monolingual texts. However, it fails on multilingual texts and
does not work well on short fragments of text, such as single words. Many of the words
are tagged as unknown, and where a language has been identified, the guess is often
incorrect. Hence, Textcat cannot be used for language segmentation purposes.
Indeed, Textcat fails to exceed the baseline values except in two cases: ‘Twitter
3’ and ‘Twitter 4’ yield better values than the baseline. However, upon closer
inspection, it is clear that the numerical index values do not give a reliable picture of
the quality of the clustering.
While the clustering of ‘Twitter 3’ is not nonsensical, it is not very good either,
failing to extract the French insertion ‘breuvages’. The Rand Index is also only
slightly better than the baseline value. It seems that the outstanding score for
‘Twitter 4’ is achieved because both the clustering by Textcat and the gold standard
have the same number of clusters.
Tables 20 and 21 show the clusterings side by side. Clearly, Textcat performed
poorly despite the high numerical index values. A closer inspection of all the Textcat
results shows that Textcat performs poorly at the task of language segmentation; often,
a word cannot be assigned a language and thus is added to the cluster of ‘unknown’
language words. For the words where a language has been identified, it is most often
not the correct language. While language identification is not necessary for the
task of language segmentation, it helps to understand why Textcat failed at the task.

Cluster   Textcat                           Gold standard
1         ∅                                 breuvages
2         #bilingualism                     #FWWC2015, #bilingualism
3         Food, and, breuvages, in,         Food, and, in, Edmonton, are,
          Edmonton, are, ready, to, go,     ready, to, go, just, waiting,
          just, waiting, for, the, fans,    for, the, fans
          #FWWC2015

Table 20: ‘Twitter 3’: Textcat versus Gold clustering

Cluster   Textcat                           Gold standard
1         strawberries                      żubrówka
2         my, dad, comes, back, from,       my, dad, comes, back, from,
          poland, with, two, crates, of,    poland, with, two, crates,
          żubrówka, and, adidas,            of, strawberries, and, adidas,
          jackets, omg                      jackets, omg

Table 21: ‘Twitter 4’: Textcat versus Gold clustering

6.3 Clustering
The clustering results are more difficult to interpret. Often, the first distinction made
seems to be based on case, i.e. words that begin with a capital letter versus words that
consist of lowercase letters only. The second run on the ‘mixed script: English – Greek’
data shows that the first cluster from the first run has been separated into one cluster
of words that begin with a capital letter and two clusters of words that do not begin
with a capital letter.

English–Greek: First run: First cluster

• “intimate, “without, Although, Aquinas, Christians, Corinthians, Socrates,
Symposium, Testament, Whether, affection, ancient, another.”, appreciation, aspires,
attraction, attraction.”, becomes, benevolence., biblical, brotherly, chapter,”,
charity;, children, children., contemplation, content, continues, contributes,
definition:, described, existence;, explained, express, feeling, feelings, finding,
further, holding, initially, inspired, knowledge, marriage., necessary, non-corporeal,
passage, passion.”, philosophers, physical, platonic, refined, relationships, returned,
self-benefit)., sensually, spiritual, subject, suggesting, through, throughout,
transcendence., unconditional, understanding, without, youthful

English–Greek: Second run: Splitting of first cluster

• affection, ancient, another.”, aspires, becomes, biblical, chapter,”, charity;,
children, children., content, definition:, feeling, feelings, finding, holding,
marriage., necessary, passage, passion.”, platonic, refined, returned, subject,
through, without

• Although, Aquinas, Christians, Corinthians, Socrates, Symposium, Testament,
Whether

• “intimate, appreciation, attraction, attraction.”, benevolence., brotherly,
contemplation, continues, contributes, described, existence;, explained, express,
further, initially, inspired, knowledge, non-corporeal, philosophers, physical,
relationships, self-benefit)., sensually, spiritual, suggesting, throughout,
transcendence., unconditional, understanding, youthful

Another important distinction seems to be the length of words. Indeed, the results
often show clusters that are clearly based on the length of the contained words. The
first run on the ‘latin script: German – Italian’ data shows that short words have been
singled out into the first cluster.

Italian–German: First run: First cluster

• (il, E, So, a, ad, da, di, e, es, ha, i, il, in, la, le, lo, ma, ne, se, si, un, va, zu

The clustering works well when the scripts involved are dissimilar, as in the case
of the English–Chinese text, where the Chinese characters were isolated after the first
run, and in the English–Spanish–Arabic example, where the Arabic part was
completely isolated in the first run.
The closer the scripts are, the less clear-cut the results become. For Greek–
Russian, the results are acceptable, with one mixed cluster. However, the number of
clusters is too high for the number of languages involved, and the separation is only
achieved after two consecutive clusterings.
The clustering of closer scripts, such as Ukrainian–Russian, does not work well. The
clusters, with the exception of the cluster containing the datum ‘9—13’, are all impure,
consisting of both Ukrainian and Russian words. The second run also fails to improve
the clustering.
Finally, clustering of Latin-based scripts does not perform well unless diacritics are
involved and the diacritics form the most salient distinction. Words containing letters
with diacritics are then generally separated from words containing no diacritics, as in
the German–Finnish–Turkish example. The first run generates a cluster for numbers,
two clusters with diacritics and one cluster without diacritics.

Probably for this reason, the clustering of Transliterated Greek–English and Greek–
English worked surprisingly well. In both cases, the first run managed to separate
the (transliterated) Greek parts from the English words. However, unaccented Greek
words such as Agape, erotas or eros were clustered with English.

English–Transliterated Greek: First run: Transliterated Greek cluster


• agápe, philía, storgē., éros

English–Greek: First run: Greek cluster


• (ἀγάπη, (ἔρως, Agápe, agápē), Éros, érōs), –
The problem is that when there are other salient distinguishing features besides
diacritics, the results are worse, as can be seen on the Pali data.

Pali: abhijjhitar: Second run


• abhijjhita, abhijjhātar, covets, function], med., one, who, °itar), °itar, °ātar).

• (T., A, M

• =, l., v.

• <smallcaps> i. </smallcaps>, <smallcaps> v. </smallcaps>, ag., fr., in

• 265, 287

• [n.
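The surface cues that seem to drive the clusterings discussed in this section (case, length, diacritics and script) can be made explicit with a small feature extractor. This is an illustrative sketch only: the function `surface_features` and its four features are assumptions made for the example and are not the feature set used for the clustering experiments.

```python
import unicodedata


def surface_features(token: str) -> dict:
    """Illustrative surface cues; not the feature set used in this work."""
    decomposed = unicodedata.normalize("NFD", token)
    letters = [ch for ch in token if ch.isalpha()]
    # First word of the Unicode character name, e.g. LATIN, GREEK, CYRILLIC.
    scripts = {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in letters}
    return {
        "capitalized": bool(letters) and letters[0].isupper(),
        "length": len(token),
        "has_diacritics": any(unicodedata.combining(ch) for ch in decomposed),
        "scripts": scripts,
    }


for tok in ["Socrates", "di", "agápe", "Agape", "ἀγάπη"]:
    print(tok, surface_features(tok))
```

On such features, the unaccented ‘Agape’ is indistinguishable from a capitalized English word, which is consistent with the unaccented Greek words being clustered with English, while ‘agápe’ and ‘ἀγάπη’ stand out through their diacritics or their script.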
In some cases, the clustering fails at the task of language segmentation, as in the
case of the various English–French texts and the English–German example with the
German inclusion. We can thus say that the surface structure or morphology, or in
other words the basis from which we can extract features, is not sufficient to deduce
relevant information about ‘language’.
When there are more than two languages that are to be separated, the cluster-
ing also does not work well. Indeed, the most dissimilar objects are separated first.
In the case of English–Spanish–Arabic, the Arabic part is separated first, as well as
words with diacritics, while English and Spanish words without diacritics are thrown
together. Subsequent runs show no improvement of the clustering concerning the
separation of English and Spanish.
In the case of German–Finnish–Turkish, the clustering algorithm seems to cluster
out Turkish first, followed by Finnish. The results are, however, much less clear-cut
than for English–Spanish–Arabic.

6.4 Language model induction
The language model induction does not seem to work very well on the Latin script data.
There are almost only impure clusters, containing more than one language. However,
the approach consistently outperforms the clustering approach when we look at the
F5 score. For the English–French data set, it even outperforms the n-gram language
model approach. Indeed, the French words are relatively well separated from the
English text, with the exception of ‘sucré’, which is still thrown together with
English words.

Latin script: English–French

• both, “soft”, in, English, although, their, is, is, the, opposite, of, “rough”, or, is,
the, opposite, of, sweet, only, for, wines, (otherwise, is

• mou, :, mou, but

• doux,

• Doux, (rugueux), Doux

• while

• “hard”., used).,

• translate, as, meaning, very, different., ”coarse”, can, also, mean, almost,sucré,
In contrast, the approach works well on the mixed script data. Indeed, we achieve
a good separation of the languages by script. However, when there are also Latin
based scripts, we encounter the same problems as mentioned above with rather modest
results. For example, for the English–Greek text, the approach separates out the Greek
character words but it fails to separate transliterated Greek and English. Also, for the
English–Spanish–Arabic text, Arabic is separated out, but English and Spanish are not
separated well.
One interesting observation can be made in the case of the English–Chinese text.
The Chinese characters have been isolated, but the Pinyin transcription is thrown
together with the Chinese characters. Based on the prior observations, this is rather
unexpected. This raises the question of whether Pinyin ought to be clustered out, or
clustered together with English or Chinese.
Again, the language model induction approach outperforms the clustering approach,
and also the n-gram language model approach in the case of the English–Greek text.
On the larger Pali dictionary entries, the language model induction approach yields
acceptable results. On the shorter Pali dictionary entries, the language model induction
approach yields good results.

The rather low performance must be attributed to the data. Indeed, the Pali dictionary
data contain various problematic characters, such as ‘comma/dot and whitespace’
encoded as a single character. On such characters, whitespace tokenization fails,
yielding big chunks of nonsense tokens. For example, the fourth Pali dictionary entry
was split into five chunks. While it might not be displayed as such, all commas and all
dots are in fact not followed by whitespace; the whitespace is part of the character,¹⁰
hence whitespace tokenization fails.

Pali: gūhanā: Chunks

• Gūhanā,(f.)

• [abstr.fr.gūhati]=gūhanā

• (q.v.)

• Pug.19.Cp.pari°.(Page

• 253)
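The failure, and one conceivable preprocessing step against it, can be sketched as follows. The normalization was not applied in this work; the function name `normalize_punctuation` is hypothetical, and the `raw` string is a reconstruction of the end of the entry that assumes every dot in it is a fullwidth full stop.

```python
# Fullwidth punctuation (U+FF0C, U+FF0E) mapped to ASCII plus an explicit space.
FULLWIDTH_TO_ASCII = str.maketrans({"\uFF0C": ", ", "\uFF0E": ". "})


def normalize_punctuation(text: str) -> str:
    """Make the whitespace that is 'built into' fullwidth punctuation explicit."""
    return text.translate(FULLWIDTH_TO_ASCII)


# Hypothetical reconstruction of the last two chunks shown above.
raw = "Pug\uFF0E19\uFF0ECp\uFF0Epari°\uFF0E(Page 253)"

print(raw.split())                         # whitespace tokenization as is
print(normalize_punctuation(raw).split())  # after normalization
```

Without normalization, the string falls into the same two chunks as in the list above; with the explicit space, whitespace tokenization yields six tokens.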

Furthermore, the data contain markup, abbreviations, references, typing mistakes
and signs such as <-> that are difficult to assign to a language.
On the Twitter data, the language model induction approach works rather well,
although the separation is not always perfect: on the first text, for example, the
Greek cluster still contains some English words.

Twitter 1: English–Greek

• BUSINESS, EXCELLENCE.

• Μόλις, ψήφισα, αυτή, τη, λύση, Internet, of, στο, διαγωνισμό

• Things, IT

For the third and fourth text, the approach manages to single out the other-language
inclusions, but not exclusively. Both times, there is one additional item in the cluster
(‘breuvages, go’ for the third text and ‘back, żubrówka’ for the fourth).

¹⁰ The comma has the Unicode codepoint U+FF0C (FULLWIDTH COMMA) and the dot has the
Unicode codepoint U+FF0E (FULLWIDTH FULL STOP).

Twitter 3: French–English

• #FWWC2015

• breuvages, go

• Food, Edmonton, to, for, the

• in, waiting, #bilingualism

• and, are, ready, just, fans

Twitter 4: English–Polish

• comes, from, with, two, crates, of, strawberries, jackets, omg

• my, dad, poland, and, adidas

• back, żubrówka

The approach exceeded expectations on the second and fifth Twitter text. On the
second text, the ‘French’ cluster contains not only the French words ‘Demain’ and
‘par’, but also the French way of notating time, ‘18h’.

Twitter 2: French–English

• Keynote, “The, collective, of, science-publish, or, perish;, it, all, that, counts?”

• Demain, 18h, par

• #dhiha6, David

• @dhiparis, dynamics, is

On the fih text, an almost perfect result was achieved, with only one additional
subdivision of the ‘English’ cluster.

Twitter 5: Transliterated Amharic–English

• (coffee

• bread). is, our

• Buna, dabo, naw

It seems that, with the chosen parameter set, the language model induction approach
does not work very well on longer texts, especially on longer texts in Latin-based
scripts; still, the approach outperforms the clustering approach and achieves scores
in the vicinity of the scores achieved with the supervised trained n-gram language
model approach. On mixed script texts, the approach consistently outperforms the
clustering approach, and it also reaches scores in the vicinity of the scores achieved
with the supervised trained n-gram language model approach.
Moreover, on short texts, the approach works rather well. We succeed in outper-
forming the supervised trained n-gram language model approach on a number of texts,
and we achieve scores close to the scores achieved with the supervised trained n-gram
language model approach.
Although the language model induction approach tends to generate too many clus-
ters, it also generally succeeds at separating the languages involved.

6.5 Scores
Of the scores I used for evaluation purposes, it seems that a combination of a high
Rand Index and a high F5 score indicates a good language segmentation. A high F5
score alone is not conclusive. For example, the clustering algorithm achieves an F5
score of 0.7215 on ‘Twitter 3’. This score looks good, but the Rand Index is only
0.4571, and the segmentation is not good.

Twitter 3: Cluster analysis


• Edmonton, Food
• go, in, to
• and, are, breuvages, fans, for, just, ready, the, waiting
Similarly, a high Rand Index score alone is not conclusive. For example, the
clustering algorithm achieves a Rand Index score of 0.6738 on the ‘Pali 2’ text, but
the F5 score is only 0.3825 and the clustering is not good.

Pali 2: Cluster analysis


• abhijjhita, abhijjhātar, covets, function], med., one, who, °itar), °itar, °ātar).
• (T., <smallcaps>i.</smallcaps>, <smallcaps>v.</smallcaps>, =, A, M, ag., fr., in,
l., v.
• 265, 287
• [n.

7 Conclusion
In this thesis, I have asked whether unsupervised approaches to language
segmentation perform better on short and difficult texts than supervised approaches
by overcoming some of the difficulties associated with supervised approaches,
such as the need for (enough and adequate)¹¹ training data, the language-specificity of
the language model, or the inflexibility of trained language models when it comes to
spelling variation and abbreviations, unless the training data also contain spelling
variation and abbreviations.
I have given an overview of related work, presenting supervised approaches that
have been used in monolingual language identification and the improvement of such
approaches through unsupervised approaches such as clustering.
Unfortunately, the body of literature covering the topic of language segmentation
is sparse. The work by Yin et al. (2007) and the work by Seldin et al. (2001) are closest
in topic to this thesis. However, Yin et al. (2007) concern themselves with spoken
language, which requires a different approach than dealing with written language. As I
concentrated on written language, their work was not conducive to this thesis.
In contrast, Seldin et al. (2001) present a work that looks promising. They present
a system that finds language borders in a text with great accuracy using unsupervised
algorithms. However, they restrict their algorithm in such a way that switching
language models after each word is disallowed. Thus, they are unable to detect
single-word inclusions and cannot handle situations where the language switches every
word, as has been shown to occur in the test data used in section 4.
Another major drawback of the approach is that it also needs longer fragments of
monolingual text and an overall longer text. Hence, their approach would not work
well on short texts, if at all.
Next, I have presented the theoretical foundations of a supervised n-gram language
model approach and an unsupervised clustering approach. Finally, I have introduced a
weakly supervised n-gram language model inducing approach devised by myself. All
of these approaches can be used for language segmentation. In order to test how well
the different approaches perform on different text types, I have performed experiments.
Section 4 presents the experiments made. I have first compiled a small corpus
of texts ranging from longer texts with clearly separated languages to one-sentence
Twitter messages containing foreign language inclusions. I have also included a set
of dictionary entries from the Pali dictionary by the Pali Text Society. Indeed, these
entries contain a lot of different languages and abbreviations, and (unfortunately) are
not consistently formatted.
I have then presented my implementations of the supervised and weakly supervised
approaches and the choice of the unsupervised clustering algorithms. Then, I
have presented the results of their application to the data.

¹¹ The question of what is to be considered ‘enough’ or ‘adequate’ is another point of
contention; the data always influence the resulting models.
It can be said that the supervised approach works reasonably well. The drawback
is that the approach needs training data to train the models on. The problems of the
training data and its influence on the models have been raised more than once.
The supervised approach failed for non-whitespace scripts. The models would have
to be adapted for non-whitespace scripts, introducing more complexity. Also, the
training and test texts would have to be split in meaningful ways, introducing the
need for a vast array of language-specific text splitters if the approach is to work on
a wide range of languages.
The unsupervised approach generally succeeded in separating languages by script
when different scripts were involved. Other than that, it seems that the chosen
morphological features, or possibly morphological features in general, are insufficient
for the algorithm to separate languages effectively.
The weakly supervised approach worked well on short texts and on difficult short
texts, but less well on long texts, while still outperforming the clustering approach
on long texts. The approach consistently outperforms the clustering approach and
reaches scores in the vicinity of the scores achieved by the supervised approach, even
surpassing the supervised approach in some cases. These results are promising, but
more thorough investigations have to be undertaken.
In conclusion, it can be said that some unsupervised (or weakly supervised)
approaches can perform better than supervised approaches on the task of language
segmentation on difficult and short texts. The presented weakly supervised approach
not only outperforms the unsupervised clustering approach, it also achieves scores
comparable to the scores achieved with the supervised approach.
Future work could concentrate on reducing the number of generated clusters,
ideally getting down to one cluster per language; it would also be conceivable to
prevent overly frequent language model switching by taking a word’s context into
account. Finally, the parameters could conceivably be adapted automatically. With
the recent increase in interest in multilingual text processing, the emergence
and evolution of the texts themselves will influence the direction of future work in
this area.
“Il est venu le temps des cathédrales
le monde est entré
dans un nouveau millénaire

L’homme a voulu monter vers les étoiles


écrire son histoire
dans le verre ou dans la pierre”

— Gringoire

References
Abeel, T., de Peer, Y. V., and Saeys, Y. (2009). Java-ML: A Machine Learning Library.
Journal of Machine Learning Research, pages 931–934.

Achtert, E., Kriegel, H., Schubert, E., and Zimek, A. (2013). Interactive data mining
with 3D-parallel-coordinate-trees. In Proceedings of the ACM SIGMOD International
Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 22-27,
2013, pages 1009–1012.

Alex, B. (2005). An unsupervised system for identifying English inclusions in German
text. In Proceedings of the 43rd Annual Meeting of the Association for Computational
Linguistics (ACL 2005), Student Research Workshop, pages 133–138. Association for
Computational Linguistics.

Alex, B. (2006). Integrating language knowledge resources to extend the English inclu-
sion classifier to a new language. In Proceedings of the 5th International Conference
on Language Resources and Evaluation (LREC). European Language Resources Asso-
ciation.

Alex, B. (2007). Automatic detection of English inclusions in mixed-lingual data with an
application to parsing. PhD thesis, University of Edinburgh.

Alex, B., Dubey, A., and Keller, F. (2007). Using Foreign Inclusion Detection to Improve
Parsing Performance. In EMNLP-CoNLL, pages 151–160.

Alex, B. and Onysko, A. (2010). Zum Erkennen von Anglizismen im Deutschen: der
Vergleich von einer automatisierten mit einer manuellen Erhebung. In Scherer, C.
and Holler, A., editors, Strategien der Integration und Isolation nicht-nativer Einheiten
und Strukturen, pages 223–239. de Gruyter.

Begleiter, R., El-Yaniv, R., and Yona, G. (2004). On prediction using variable order
Markov models. Journal of Artificial Intelligence Research, pages 385–421.

Biemann, C. (2006). Chinese whispers: an efficient graph clustering algorithm and
its application to natural language processing problems. In Proceedings of the first
workshop on graph based methods for natural language processing, pages 73–80.
Association for Computational Linguistics.

Brants, S., Dipper, S., Hansen, S., Lezius, W., and Smith, G. (2002). The TIGER treebank.
In Proceedings of the workshop on treebanks and linguistic theories, volume 168.

Bürgisser, P., Clausen, M., and Shokrollahi, M. A. (1997). Algebraic complexity theory,
volume 315. Springer.

Carter, D. (1994). Improving language models by clustering training sentences. In
Proceedings of the fourth conference on Applied natural language processing, pages
59–64. Association for Computational Linguistics.

Cavnar, W. B. and Trenkle, J. M. (1994). N-gram-based text categorization. In
Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information
Retrieval, pages 161–175.

Chen, S. F. and Goodman, J. (1996). An empirical study of smoothing techniques for
language modeling. In Proceedings of the 34th annual meeting on Association for
Computational Linguistics, pages 310–318. Association for Computational Linguistics.

Dreyfuss, E., Goodfellow, I., and Baumstarck, P. (2007). Clustering Methods for Improv-
ing Language Models.

Dubes, R. C. (1987). How many clusters are best?-an experiment. Pattern Recognition,
20(6):645–663.

Dunning, T. (1994). Statistical Identification of Language. Computing Research
Laboratory, New Mexico State University.

Gale, W. and Sampson, G. (1995). Good-turing smoothing without tears. Journal of
Quantitative Linguistics, 2(3):217–237.

Gao, J., Goodman, J., Miao, J., et al. (2001). The use of clustering techniques for language
modeling–application to Asian languages. International Journal of Computational
Linguistics and Chinese Language Processing, 6(1):27–60.

Goldberg, D. (1991). What every computer scientist should know about floating-point
arithmetic. ACM Computing Surveys (CSUR), 23(1):5–48.

Goodman, J. and Gao, J. (2000). Language model size reduction by pruning and clus-
tering. In INTERSPEECH, pages 110–113.

Goodman, J. T. (2001). A bit of progress in language modeling. Computer Speech and
Language, 15(4):403–434.

Grefenstette, G. (1995). Comparing two language identification schemes. In Proceedings
of the 3rd International conference on Statistical Analysis of Textual Data. JADT 1995.

Grünwald, P. D. (2007). The minimum description length principle. MIT press.

Guthrie, D., Allison, B., Liu, W., Guthrie, L., and Wilks, Y. (2006). A closer look at
skip-gram modelling. In Proceedings of the 5th international Conference on Language
Resources and Evaluation (LREC-2006), pages 1–4.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. (2009).
The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11.

Jain, A. K., Murty, M. N., and Flynn, P. J. (1999). Data clustering: a review. ACM
computing surveys (CSUR), 31(3):264–323.

Jain, N. and Bhat, R. A. (2014). Language Identification in Code-Switching Scenario. In
Proceedings of the Conference on Empirical Methods on Natural Language Processing,
pages 87–93.

Jurafsky, D. and Martin, J. H. (2000). Speech and language processing. An Introduction
to Natural Language Processing, Computational Linguistics, and Speech Recognition.
Pearson Education India, 2nd edition.

Katz, S. (1987). Estimation of probabilities from sparse data for the language model
component of a speech recognizer. Acoustics, Speech and Signal Processing, IEEE
Transactions on, 35(3):400–401.

King, B. and Abney, S. P. (2013). Labeling the Languages of Words in Mixed-Language
Documents using Weakly Supervised Methods. In Proceedings of the Conference of
the North American Chapter of the Association for Computational Linguistics – Human
Language Technologies, pages 1110–1119.

Liu, H. and Cong, J. (2013). Language clustering with word co-occurrence networks
based on parallel texts. Chinese Science Bulletin, 58(10):1139–1144.

Logan, B. et al. (2000). Mel frequency cepstral coefficients for music modeling. In
Proceedings of the 1st International Symposium on Music Information Retrieval (ISMIR).

Lui, M., Lau, J. H., and Baldwin, T. (2014). Automatic detection and language identifi-
cation of multilingual documents. Transactions of the Association for Computational
Linguistics, 2:27–40.

Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to information re-
trieval, volume 1. Cambridge University Press.

Manning, C. D. and Schütze, H. (1999). Foundations of statistical natural language
processing. MIT press.

Marsland, S. (2003). Novelty detection in learning systems. Neural computing surveys,
3(2):157–195.

Mendizabal, I., Carandell, J., and Horowitz, D. (2014). TweetSafa: Tweet language
identification. TweetLID @ SEPLN.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word
representations in vector space. In Proceedings of the International Conference on
Learning Representations (ICLR) 2013.

Ney, H., Essen, U., and Kneser, R. (1994). On structuring probabilistic dependences in
stochastic language modelling. Computer Speech & Language, 8(1):1–38.

Pelleg, D. and Moore, A. W. (2000). X-means: Extending K-means with Efficient Es-
timation of the Number of Clusters. In Proceedings of the Seventeenth International
Conference on Machine Learning (ICML 2000), pages 727–734.

Pereira, F., Tishby, N., and Lee, L. (1993). Distributional clustering of english words. In
Proceedings of the 31st annual meeting on Association for Computational Linguistics,
pages 183–190. Association for Computational Linguistics.

Porta, J. (2014). Twier Language Identification using Rational Kernels and its potential
application to Sociolinguistics. TweetLID @ SEPLN.

Ravi, S., Vassilivitskii, S., and Rastogi, V. (2014). Parallel Algorithms for Unsupervised
Tagging. Transactions of the Association for Computational Linguistics, 2:105–118.

Ron, D., Singer, Y., and Tishby, N. (1996). e power of amnesia: Learning probabilistic
automata with variable memory length. Machine learning, 25(2-3):117–149.

Schölkopf, B., Williamson, R. C., Smola, A. J., Shawe-Taylor, J., and Pla, J. C. (1999).
Support vector method for novelty detection. In Advances in Neural Information
Processing Systems (NIPS), volume 12, pages 582–588.

Seldin, Y., Bejerano, G., and Tishby, N. (2001). Unsupervised sequence segmentation
by a mixture of switching variable memory Markov sources. In Proceedings of the
Eighteenth International Conference on Machine Learning (ICML), pages 513–520.

Solorio, T., Blair, E., Maharjan, S., Bethard, S., Diab, M., Ghoneim, M., Hawwari, A.,
Al-Ghamdi, F., Hirschberg, J., Chang, A., et al. (2014). Overview for the First Shared Task
on Language Identification in Code-Switched Data. In Proceedings of the Conference
on Empirical Methods on Natural Language Processing, pages 62–72.

Taylor, D. (2015). Graphing the distribution of English letters towards the beginning,
middle or end of words. [Link]
[Link].

The Unicode Consortium (2014). The Unicode Standard. [Link]
standard/[Link]. [Online; accessed 21-July-2015].

Uszkoreit, J. and Brants, T. (2008). Distributed word clustering for large scale class-
based language modeling in machine translation. In Proceedings of the 46th Annual
Meeting of the Association for Computational Linguistics, pages 755–762.

Wagner, S. and Wagner, D. (2007). Comparing clusterings: an overview. Universität
Karlsruhe, Fakultät für Informatik Karlsruhe.

Yamaguchi, H. and Tanaka-Ishii, K. (2012). Text segmentation by language using minimum
description length. In Proceedings of the 50th Annual Meeting of the Association
for Computational Linguistics, pages 969–978. Association for Computational Linguistics.

Yin, B., Ambikairajah, E., and Chen, F. (2007). Hierarchical language identification
based on automatic language clustering. In INTERSPEECH, pages 178–181.

Yuan, L. (2006). Language model based on word clustering. In Proceedings of the 20th
Pacific Asia Conference on Language, Information and Computation, pages 394–397.

Zubiaga, A., San Vicente, I., Gamallo, P., Pichel, J. R., Alegria, I., Aranberri, N., Ezeiza,
A., and Fresno, V. (2014). Overview of TweetLID: Tweet language identification at
SEPLN 2014. TweetLID @ SEPLN.

8 Appendix
8.1 Development data
8.1.1 Latin script data
Karl Marx anses som en af de fire klassiske sociologer. Marx er epokegørende for den
historiske videnskab. Og Marx spillede en vigtig rolle for den samtidige og efterfølgende
arbejderbevægelse.
1891, nach einer Tuberkuloseerkrankung Hopes, eröffnete das Ehepaar ein mod-
ernes Lungensanatorium in Nordrach im Schwarzwald, das sie bis 1893 gemeinsam
führten. 1895 wurde die Ehe geschieden.
Sources:
hps://[Link]/wiki/Karl_Marx
hps://[Link]/wiki/Hope_Bridges_Adams_Lehmann

8.1.2 Mixed script data


Capitalism is an economic system and a mode of production in which trade, industries,
and the means of production are largely or entirely privately owned. Private firms and
proprietorships usually operate in order to generate profit, but may operate as private
nonprofit organizations.
‫ ولودیا‬.‫او را ولودیا خطاب می​کردند که مخفف ولادمیر است نام اصلی او ولادمیر ایلیچ اولیانوف بود ولی در دنیا به اسم لنین مشهور شد‬
‫ در یک خانواده مرفه در سیمبریسک‬،‫ یعنی یک سال قبل از کمون پاریس‬۱۸۷۰ ‫سومین فرزند از شش فرزند خانواده اولیانوف بود که در سال‬
‫ پدرش یک‬.‫در ساحل رود ولگا که در ٓان زمان شهرکی بیش نبود ولی بعدها به صورت شهر بزرگی به نام اولیاء نوفسک در ٓامد متولد گردید‬
‫خرده بورژوای لیبرال و معلم ریاضی و مادرش دختر یک پزشک المانی بود وبه همین جهت لنین در تمام مدت عمر به المانیها و طرز تفکر‬
‫ ولودیا در دبیرستان شاگرد خوبی بود و قوه استدلال درخشانی داشت ولی در‬.‫المانی که مارکس مولود ٓان بود به دیده اغماض می​نگریست‬
.‫عین حال بچه​ای موذی بود‬
Sources:
hps://[Link]/wiki/Capitalism
hps://[Link]/wiki/‫ولادیمیر_لنین‬

8.1.3 Twitter data


Twitter 1 »Fallo ergo sum«: On being wrong.
Source:
Roland Hieber (daniel_bohrer). “»Fallo ergo sum«: On being wrong.”. 26 July 2015,
16:47. Tweet.

Twitter 2 Music for Airports > le piano en libre-accès dans l’aéroport Charles-de-
Gaulles
Source:
Yannick Rochat (yrochat). “Music for Airports > le piano en libre-accès dans l’aéroport
Charles-de-Gaulles”. 26 July 2015, 18:12. Tweet.

8.1.4 Pali dictionary data


All entries have been taken from the Pali Text Society’s Pali-English dictionary (T. W.
Rhys Davids, William Stede, editors, The Pali Text Society’s Pali–English dictionary.
Chipstead: Pali Text Society, 1921–5. 8 parts [738 pp.].)

Hambho Hambho,(indecl.)[haṁ+bho] a particle expressing surprise or haughtiness
J.I,184,494.See also ambho.(Page 729)

Ussada Ussada,[most likely to ud + syad;see ussanna]:this word is beset with
difficulties,the phrase sa-ussada is applied in all kinds of meanings,evidently the
result of an original application & meaning having become obliterated.sa° is taken
as *sapta (seven)as well as *sava (being),ussada as prominence,protuberance,
fulness,arrogance.The meanings may be tabulated as follows:(1)prominence(cp.
Sk.utsedha),used in characterisation of the Nirayas,as “projecting,prominent
hells”,ussadanirayā (but see also below 4)J.I,174;IV,3,422 (pallaṅkaṁ,
v.l.caturassạṁ,with four corners);V,266.– adj.prominent A.13 (tej-
ussadehi ariyamaggadhammehi,or as below 4?).– 2.protuberance,bump,swelling
J.IV,188;also in phrase saussada having 7 protuberances,a qualification of the
Mahāpurisa D.III,151 (viz.on both hands,feet,shoulders,and on his back).
– 3.rubbing in,anointing,ointment;adj.anointed with (-°),in candan° J.III,
139;IV,60;Th.1,267;Vv 537;DhA.I,28;VvA.237.– 4.a crowd adj.full
of (-°)in phrase saussada crowded with (human beings)D.I,87 (cp.DA.I,
245:aneka-satta-samākiṇṇa;but in same sense BSk.sapt-otsada Divy 620,621);Pv
IV.18 (of Niraya = full of beings,expld.by sattehi ussanna uparûpari nicita PvA.
221.– 5.qualification,characteristic,mark,attribute,in catussada “having the
four qualifications (of a good village)”J.IV,309 (viz.plenty of people,corn,
wood and water C.).The phrase is evidently shaped after D.I,87 (under 4).As
“preponderant quality,characteristic”we find ussada used at Vism.103 (cf.Asl.
267)in combns.lobh°,dos°,moh°,alobh° etc.(quoted from the“Ussadakittana”) ,
and similarly at VvA.19 in Dhammapāla’s definition of manussa(lobh’ādīhi alobh’
ādīhi sahitassa manassa ussannatāya manussā),viz.sattā manussa-jātikā tesu lobh’
‹-› ādayo alobh’ādayo ca ussadā.– 6.(metaph. )self-elevation,arrogance,conceit,
haughtiness Vin.I,3;Sn.515,624 (an° = taṇhā-ussada-abhāvena SnA 467),783

(expld.by Nd1 72 under formula saussada;i.e.showing 7 bad qualities,viz.rāga,
dosa,moha etc.) ,855.– See also ussādana,ussādeti etc.(Page 157)

8.2 Test data


8.2.1 Latin script data
English - German e German word Nabelschau means ”navel-gazing” or ”staring
at your navel”. But in this case, it doesn’t refer to anyone else’s belly buon – just your
own.
Source:
Glass, Nicole (2015): ”German Missions in the United States - Word of the Week”.
[Link].

English - Fren doux, mou : both translate as ”so” in English, although their mean-
ing is very different. Doux is the opposite of ”rough” or ”coarse” (rugueux), while mou
is the opposite of ”hard”. Doux can also mean sweet, but almost only for wines (oth-
erwise sucré is used).
Source:
Maciamo, (2015): ”French words and nuances that don’t exist in English”. Eupedia.

English - Transliterated Greek e Greek language distinguishes at least four dif-


ferent ways as to how the word love is used. Ancient Greek has four distinct words for
love: agápe, éros, philía, and storgē. However, as with other languages, it has been his-
torically difficult to separate the meanings of these words when used outside of their
respective contexts. Nonetheless, the senses in which these words were generally used
are as follows.
Source:
hps://[Link]/wiki/Greek_words_for_love

Italian - German Milano ne custodisce l’esempio più struggente: quel Cenacolo
che il vinciano affrescò con amore, cura e rivoluzionaria psicologia (il Giuda non vie-
ne privato dell’aureola, ma si condanna da solo, con la consapevolezza del peccato)
cominciò subito ad autodistruggersi, con un cancro che solo un lunghissimo restauro
ha di recente arginato.
Kaum eine Woche vergeht, in der es keine neue Studie, Umfrage oder Warnung
zum ema Fachkräemangel in Deutschland gibt.
Certo, lo faceva per definire le idee, ma anche perché consapevole che le intuizioni
sono periture, che la vita stessa va catturata in qualche modo.

Dabei mehren sich letzter Zeit auch Stimmen, die Entwarnung geben. So kam
jüngst eine Studie des Stifterverbands für die Deutsche Wissenschaft zu dem Ergebnis,
dass ”ein allgemeiner Fachkräftemangel in den MINT-Berufen eher nicht mehr” drohe.
Come anche i riccioli del Battista richiamano il movimento delle acque, moto che
poi Leonardo studierà più approfonditamente a Venezia, nelle ricerche sui bacini in
chiave di difesa anti-Turchi. E si vada alla bellissima Annunciazione, con un occhio
attento alle ali dell’angelo: la delicatezza delle punte all’insù che cosa sono se non
il barbaglio di un sogno che lo ossessionava da anni, ovvero quello di volare?
Ist das seit Jahren angemahnte Szenario vom drohenden Fachkräftemangel bei Ingenieuren
und Naturwissenschaftlern also nur ein Mythos?
Source:
Stalinski, Sandra (2015): ”Ingenieure: Mythos Fachkräftemangel?”. [Link].
Scorranese, Roberta (2015): ”Nelle grandi opere il racconto sofferto della natura mor-
tale”. [Link].

German - Finnish - Turkish Der Sommer ist die wärmste der vier Jahreszeiten in der
gemäßigten und arktischen Klimazone. Je nachdem, ob er gerade auf der Nord- oder
Südhalbkugel herrscht, spricht man vom Nord- oder Südsommer. Der Nordsommer
findet gleichzeitig mit dem Südwinter statt.
Kesä eli suvi on vuodenaika kevään ja syksyn välissä. Kesä on vuodenajoista lämpimin,
koska maapallo on silloin kallistunut niin, että aurinko säteilee maan pinnalle
jyrkemmässä kulmassa kuin muina vuodenaikoina. Pohjoisella pallonpuoliskolla kesä-
kuukausiksi lasketaan tavallisesti kesä-. heinä- ja elokuu, eteläisellä pallonpuoliskolla
joulu-, tammi- ja helmikuu.
Yaz, en sıcak mevsimdir. Kuzey Yarım Küre’de en uzun günler yazda gerçekleşir.
Dünya ısıyı depo ettiği için en sıcak günler genellikle yaklaşık iki ay sonra ortaya
çıkar. Sıcak günler Kuzey Yarım Küre’de 21 Haziran ile 22 Eylül arasında, Güney Yarım
Küre’de ise 22 Aralık ile 21 Mart arasındadır.
Source:
hps://fi.[Link]/wiki/Kesä
hps://[Link]/wiki/Sommer
hps://[Link]/wiki/Yaz

8.2.2 Mixed script data


Greek - Russian Η ελληνική γλώσσα είναι μία από τις ινδοευρωπαϊκές γλώσσες.
Αποτελεί το μοναδικό μέλος ενός ανεξάρτητου κλάδου της ινδοευρωπαϊκής οικο-
γένειας γλωσσών. Ανήκει επίσης στον βαλκανικό γλωσσικό δεσμό. Στην ελληνική
γλώσσα, έχουμε γραπτά κείμενα από τον 15ο αιώνα π.Χ. μέχρι σήμερα.
На греческом языке на всех этапах его существования была создана богатей-
шая литература. В Римской империи знание греческого языка считалось обяза-

тельным для всякого образованного человека. В латинском языке присутствует
большое количество греческих заимствований, а в греческом —значительное
количество латинских и романских слов. В новое время древнегреческий язык
стал (наряду с латинским) источником создания новых научных и технических
терминов (так называемая международная лексика). В русский язык греческие
слова проникали в основном двумя путями —через международную лексику и
через церковнославянский язык.
Source:
hps://[Link]/wiki/Ελληνική_γλώσσα
hps://[Link]/wiki/Греческий_язык

English - Greek - Transliterated Greek Agápe (ἀγάπη agápē) means ”love: esp.
brotherly love, charity; the love of God for man and of man for God.” Agape is used
in the biblical passage known as the ”love chapter,” 1 Corinthians 13, and is described
there and throughout the New Testament as brotherly love, affection, good will, love,
and benevolence. Whether the love given is returned or not, the person continues to
love (even without any self-benefit). Agape is also used in ancient texts to denote feel-
ings for one’s children and the feelings for a spouse, and it was also used to refer to
a love feast. It can also be described as the feeling of being content or holding one in
high regard. Agape is used by Christians to express the unconditional love of God for
his children. is type of love was further explained by omas Aquinas as ”to will
the good of another.”
Éros (ἔρως érōs) means ”love, mostly of the sexual passion.” The Modern Greek
word ”erotas” means ”intimate love.” It can also apply to dating relationships as well as
marriage. Plato refined his own definition: Although eros is initially felt for a person,
with contemplation it becomes an appreciation of the beauty within that person, or
even becomes appreciation of beauty itself. Plato does not talk of physical attraction as
a necessary part of love, hence the use of the word platonic to mean, ”without physical
attraction.”
In the Symposium, the most famous ancient work on the subject, Plato has Socrates
argue that eros helps the soul recall knowledge of beauty, and contributes to an under-
standing of spiritual truth, the ideal ”Form” of youthful beauty that leads us humans
to feel erotic desire – thus suggesting that even that sensually based love aspires to
the non-corporeal, spiritual plane of existence; that is, finding its truth, just like find-
ing any truth, leads to transcendence. Lovers and philosophers are all inspired to seek
truth through the means of eros.
Source:
hps://[Link]/wiki/Greek_words_for_love

English - Spanish - Arabic A black ribbon is a symbol of remembrance or mourn-
ing. Wearing or displaying a black ribbon has been used for POW/MIA remembrance,
mourning tragedies or as a political statement.
El crespón negro o lazo negro es un símbolo utilizado por personas, estados, so-
ciedades y organizaciones, representando un sentimiento político-social en señal de
duelo.
‫الرمز يعني الرسم الذي يعبر عن شيء معين وعموما فٔان العلامة ينبغي ٔان تنقل رسالتها بنظرة واحدة دون الحاجة لاية كلمات و من‬
‫المعروف ٔان قدماء المصريين والٔاغريق ٔاستخدموا العلامات ولكن ٔاكثر من استخدم العلامات هم‬
Source:

hps://[Link]/?title=Lazo_negro
hps://[Link]/wiki/Black_ribbon
hps://[Link]/wiki/‫رمز‬

English - Chinese - (Pinyin) The Chinese word for ”crisis” (simplified Chinese: 危机;
traditional Chinese: 危機; pinyin: wēijī) is frequently invoked in Western
motivational speaking because the word is composed of two Chinese characters that
can represent ”danger” and ”opportunity”. Some linguists have criticized this usage
because the component pronounced jī (simplified Chinese: 机; traditional Chinese:
機) has other meanings besides ”opportunity”. In Chinese tradition, certain numbers
are believed by some to be auspicious (吉利) or inauspicious (不利) based on the
Chinese word that the number name sounds similar to. The numbers 0, 6, 8, and 9 are
believed to have auspicious meanings because their names sound similar to words
that have positive meanings.
Source:
hps://[Link]/w/[Link]?title=Chinese_word_for_”crisis”

Ukrainian - Russian Віддавна на території України існували держави скіфів,
сарматів, готів та інших народів, але відправним пунктом української
державності й культури вважається Київська Русь 9—13 століття.
На юге омывается водами Чёрного и Азовского морей. Имеет сухопутную
границу с Россией, Белоруссией, Польшей, Словакией, Венгрией, Румынией и
Молдавией.
Source:
hps://[Link]/wiki/Україна
[Link], (2015): ”Страны - Safari Tour”.

8.2.3 Twitter data
Tweet 1: Greek – English Μόλις ψήφισα αυτή τη λύση Internet of Things, στο
διαγωνισμό BUSINESS IT EXCELLENCE.
Source:
GaloTyri. ”Μόλις ψήφισα αυτή τη λύση Internet of Things, στο διαγωνισμό
BUSINESS IT EXCELLENCE.”. 19 June 2015, 12:06. Tweet

Tweet 2: English – French Demain #dhiha6 Keynote 18h @dhiparis “The collective
dynamics of science-publish or perish; is it all that counts?” par David @chavalarias
Source:
Claudine Moulin (ClaudineMoulin). ”Demain #dhiha6 Keynote 18h @dhiparis ”The
collective dynamics of science-publish or perish; is it all that counts?” par David
@chavalarias”. 10 June 2015, 17:35. Tweet.

Tweet 3: English – French Food and breuvages in Edmonton are ready to go, just
waiting for the fans #FWWC2015 #bilingualism
Source:
HBS (HBS_Tweets). ”Food and breuvages in Edmonton are ready to go, just waiting
for the fans #FWWC2015 #bilingualism”. 6 June 2015, 23:29. Tweet.

Tweet 4: English – Polish my dad comes back from poland with two crates of
strawberries, żubrówka and adidas jackets omg
Source:
katarzyne (wifeyriddim). ”my dad comes back from poland with two crates of
strawberries, żubrówka and adidas jackets omg”. 8 June 2015, 08:49. Tweet.

Tweet 5: Transliterated Amharic – English Buna dabo naw (coffee is our bread).
Source:
TheCodeswitcher. ”Buna dabo naw (coffee is our bread).”. 9 June 2015, 02:12. Tweet.

8.2.4 Pali dictionary data


All entries have been taken from the Pali Text Society’s Pali-English dictionary (T. W.
Rhys Davids, William Stede, editors, The Pali Text Society’s Pali–English dictionary.
Chipstead: Pali Text Society, 1921–5. 8 parts [738 pp.].)

abbha (nt.) [Vedic abhra nt. & later Sk. abhra m. ”dark cloud”; Idg. *m̊bhro, cp. Gr.
<at>a)fro\\s</at> scum, froth, Lat. imber rain; also Sk. ambha water, Gr.
<at>o)/mbros</at> rain, Oir ambu water]. A (dense & dark) cloud, a cloudy mass A
<smallcaps>ii.</smallcaps> 53 = Vin <smallcaps>ii.</smallcaps> 295 = Miln 273 in

list of to things that obscure moon– & sunshine, viz. <b>abbhaŋ mahikā</b> (mahiyā
A) <b>dhū- marajo</b> (megho Miln), <b>Rāhu</b> . This list is referred to at SnA
487 & VvA 134. S <smallcaps>i.</smallcaps> 101 (°sama pabbata a mountain like a
thunder–cloud); J <smallcaps>vi.</smallcaps> 581 (abbhaŋ rajo acchādesi); Pv
<smallcaps>iv.</smallcaps> 3 <superscript>9</superscript> (nīl° = nīla–megha PvA
251). As f. <b>abbhā</b> at Dhs 617 & DhsA 317 (used in sense of adj. ”dull”; DhsA
expl <superscript>s.</superscript> by valāhaka); perhaps also in <b>abbhāmatta</b>
. <br /><b>–kūṭa</b> the point or summit of a storm–cloud Th 1, 1064; J
<smallcaps>vi.</smallcaps> 249, 250; Vv 1 <superscript>1</superscript> (=
valāhaka–sikhara VvA 12). <b>–ghana</b> a mass of clouds, a thick cloud It 64; Sn
348 (cp. SnA 348). <b>–paṭala</b> a mass of clouds DhsA 239. <b>–mutta</b> free
from clouds Sn 687 (also as abbhāmutta Dh 382). <b>–saŋvilāpa</b> thundering S
<smallcaps>iv.</smallcaps> 289.

abhijjhitar [n. ag. fr. abhijjhita in med. function] one who covets M
<smallcaps>i.</smallcaps> 287 (T. abhijjhātar, v. l. °itar) = A
<smallcaps>v.</smallcaps> 265 (T. °itar, v. l. °ātar).

ajja Ajja,& Ajjā (adv.)[Vedic adya & adyā,a + dyā,a° being base of demonstr.
pron. (see a3)and dyā an old Loc. of dyaus (see diva) ,thus “on this day”]
to-day,now Sn.75,153,158,970,998;Dh.326;J.I,279;III,425 (read bahutaṁ
ajjā;not with Kern,Toev. s. v. as “food”);Pv.I,117 (= idāni PvA.59) ;PvA.6,
23;Mhvs 15,64. ‹-› Freq. in phrase ajjatagge (= ajjato + agge(?)or ajja-tagge,
see agga3)from this day onward,henceforth Vin.I,18;D.I,85;DA.I,235.
–kālaṁ (adv.)this morning [Link],180;–divasa the present day Mhvs 32,23.
(Page 10)

gūhanā Gūhanā,(f.)[abstr.fr.gūhati]=gūhanā (q.v.)Pug.19.Cp.


pari°.(Page 253)

pacati Pacati,[Ved.pacati,Idg.*peqǔō,Av.pac-;Obulg.peka to fry,roast,


Lith,kepū bake,Gr.pέssw cook,pέpwn ripe] to cook,boil,roast Vin.IV,264;
fig.torment in purgatory (trs.and intrs.):Niraye pacitvā after roasting in N.S.
II,225,PvA.10,14.– ppr.pacanto tormenting,Gen.pacato (+Caus.
pācayato)D.I,52 (expld at DA.I,159,where read pacato for paccato,by pare
daṇḍena pīḷentassa).– pp.pakka (q.v.) .‹-› Caus.pacāpeti & pāceti (q.v.
).
– Pass.paccati to be roasted or tormented (q.v. ).(Page 382)

8.3 Results
8.3.1 N-Gram Language Models
For the n-gram language model approach, the identified language is indicated in
parentheses. e language abbreviations are:

Abbreviation Language

AR Arabic
DE German
EL Greek
EN English
ES Spanish
FI Finnish
FR French
IT Italian
PL Polish
RU Russian
UK Ukrainian
TR Turkish
TrAM Transliterated Amharic
TrEL Transliterated Greek
ZH Chinese
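
The listing format used for the results in this section (one bullet per identified language, followed by the words assigned to it) can be reproduced with a few lines of code. The following is a minimal sketch, assuming that per-word predictions are available as (word, label) pairs; the function names group_by_language and format_groups and the sample predictions are hypothetical and are not taken from the thesis implementation.

```python
from collections import defaultdict


def group_by_language(predictions):
    """Collect words under the label predicted for them, keeping input order."""
    groups = defaultdict(list)
    for word, label in predictions:
        groups[label].append(word)
    return dict(groups)


def format_groups(groups):
    """Render one bullet per label; labels sorted case-insensitively, 'other' last."""
    ordered = sorted(groups, key=lambda label: (label == "other", label.lower()))
    return "\n".join(
        "• ({}) {}".format(label, ", ".join(groups[label])) for label in ordered
    )


# Hypothetical predictions for illustration only.
sample = [("Nabelschau", "other"), ("means", "FR"), ("word", "EN"), ("this", "EN")]
print(format_groups(group_by_language(sample)))
```

Sorting the labels case-insensitively and placing "other" last is an assumption chosen to match the order in which the labels appear in the listings.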

Data: Latin script: German – English

• (EN) own., belly, refer, button, But, it, or, your, at, in, ”staring, anyone, doesn’t,
else’s, word, this

• (FI) –

• (FR) case, just, means, navel”.

• (TrAM) e

• (TrEL) to, German

• (other) Nabelschau, ”navel-gazing”

Data: Latin script: German – Finnish – Turkish
• (DE) ob, oder, Sommer, und, Nord-, arktischen, der, Der, dem, gemäßigten, mit,
er, Südsommer., spricht, Jahreszeiten, Südwinter, herrscht, wärmste, vom, die,
sta., nachdem, auf
• (EN) ist, Nordsommer, Mart, in
• (ES) en, depo
• (FI) joulu-, kevään, suvi, on, eli, vuodenajoista, syksyn, koska, kesä-., kuin, Pohjoisella,
man, helmikuu., tammi-, lämpimin, heinä-, niin, maapallo, maan, pinnalle,
Kesä, säteilee, tavallisesti, vuodenaika, kallistunut, lasketaan, muina, ettiği,
jyrkemmässä, elokuu, välissä., että, eteläisellä, silloin, ja, kulmassa
• (FR) vier, Je
• (PL) aurinko
• (RU) 22, 21
• (TR) yaklaşık, ortaya, genellikle, Eylül, Sıcak, çıkar., Yaz, sonra, arasında, Kuzey,
Güney, Aralık, gerade, ısıyı, gerçekleşir., Küre’de, günler, için, findet, mevsimdir.,
arasındadır., Haziran, iki, yazda, uzun, ise, ay, sıcak, ile, Yarım, Dünya
• (TrAM) Der
• (other) Klimazone., gleichzeitig,kesäkuukausiksi, vuodenaikoina., pallonpuolis-
kolla,Südhalbkugel

Data: Latin script: English – French


• (EL) ”coarse”
• (EN) but, both, for, while, wines, almost, sweet, of, although, only, is, ”rough”,
used)., or, as, meaning, the, in, translate, ”hard”., their, English, also, different.,
very
• (ES) can
• (FI) mean
• (FR) opposite, Doux, doux, sucré, :
• (RU) ”so”
• (TrEL) mou
• (other) (otherwise, (rugueux)

Data: Latin script: English – Transliterated Greek

• (EN) for, meanings, least, used, been, distinct, love, of, were, are, when, agápe,
these, how, and, Greek, word, used., outside, ways, different, other, follows.,
words, respective, generally, However, is, with, it, at, as, historically, the, in,
which, their

• (ES) has, separate

• (FR) language, senses, Ancient, languages, difficult, four

• (IT) contexts.

• (TrAM) éros, The, love:

• (TrEL) to, storgē., philía

• (other) Nonetheless, distinguishes

Data: Latin script: Italian – German

• (DE) drohe., geben., allgemeiner, Studie, jüngst, für, Ergebnis, keine, kam, drohenden,
oder, und, letzter, neue, Mythos?, Deutschland, Ist, sich, der, vergeht,
studierà, Dabei, Studie, den, dem, auch, Entwarnung, dass, nur, eher, nicht, gibt.,
Umfrage, Woche, eine, Kaum, Jahren, bei, mehren, Stimmen, Deutsche, das, zum,
mehr”, angemahnte, ”ein, Zeit, ein, So, vom, zu, die, seit, Warnung, Wissenschaft

• (EL) affrescò

• (EN) moto, attento, a, in, ad, also

• (ES) custodisce, cura, subito, Certo, Giuda, lo, del, difesa, con, definire, restauro,
se, modo., la, arginato., recente, vada, movimento, Leonardo, Szenario, quel,
cominciò

• (FI) va, si, Battista, Thema

• (FR) l’esempio, non, des, acque, perché, un, es, le, sui, condanna

• (IT) solo, faceva, catturata, chiave, peccato), periture, (il, delicatezza, cancro, privato,
bellissima, anni, bacini, ovvero, delle, sogno, di, barbaglio, ma, qualche, e,
amore, ricerche, Come, per, richiamano, ne, intuizioni, punte, occhio, struggente:,
nelle, vita, riccioli, solo, che, volare?, sono, alla, alle, anche, Cenacolo, quello,
cosa, ali, viene, il, psicologia, vinciano, Venezia

• (PL) i

• (TR) ha, più, da

• (TrAM) Milano, E

• (TrEL) poi, idee, stessa

• (other) MINT-Berufen, Fachkräftemangel, dell’ angelo:, consapevole, anti-Turchi.,
Annunciazione, lunghissimo, consapevolezza, ossessionava, dell’ aureola, approfonditamente,
autodistruggersi, rivoluzionaria, Stifterverbands, all’insù, Naturwissenschaftlern,
Ingenieuren

Data: Mixed script: Greek – Russian

• (EL) κείμενα, βαλκανικό, από, το, αιώνα, Αποτελεί, ελληνική, μία, επίσης, στον,
γλωσσικό, γλωσσών., είναι, Στην, έχουμε, μέλος, ανεξάρτητου, τις, γλώσσες.,
15ο, Ανήκει, γραπτά, π.Χ., σήμερα., γλώσσα, γλώσσα, κλάδου, οικογένειας, τον,
της, δεσμό., μέχρι, μοναδικό, ενός

• (RU) слов., с, богатейшая, образованного, человека., этапах, значительное,
знание, научных, лексика)., называемая, технических, источником, стал,
латинских, существования, слова, греческом, всех, —, В, романских, но-
вых, Римской, и, проникали, в, греческие, терминов, присутствует, грече-
ских, новое, русский, империи, латинском, литература., создана, создания,
путями, основном, язык., язык, (так, его, количество, считалось, обязатель-
ным, время, двумя, была, греческого, большое, языке, языка

• (TrAM) Η

• (UK) лексику, (наряду, через, всякого, а, На, для, на

• (other) ινδοευρωπαϊκές, ινδοευρωπαϊκής,латинским), международную, международная,
церковнославянский, заимствований, древнегреческий

Data: Mixed script: English – Greek

• (DE) Symposium, Modern, being, felt

• (EL) ”Form”

• (EN) sensually, platonic, for, holding, existence;, refined, its, explained, attraction,
of, (even, are, spiritual, given, refer, Agape, beauty, or, attraction.”, like,
without, not, further, will, own, love, knowledge, will, one’s, most, use, express,
is, another.”, The, leads, truth, suggesting, dating, relationships, inspired,
”love, mostly, hence, definition:, regard., appreciation, a, ideal, us, helps,
seek, Agápe, plane, recall, feeling, within, returned, chapter,”, based, described,
apply, physical, Although, good, by, used, love, God.”, children., his, any, char-
ity;, Socrates, be, work, throughout, and, that, Greek, even, word, agápē), love.”,
known, biblical, feelings, does, famous, In, subject, becomes, one, understand-
ing, children, ”love, through, beauty, well, It, was, initially, feast., finding, itself.,
13, all, ”without, feel, with, is, it, thus, New, as, the, brotherly, in, is, an, there,
God, youthful, necessary, high, Lovers, also, Whether
• (ES) person, Aquinas, esp., continues, has, Thomas, truth, can, erotic, sexual,
desire
• (FI) on, –, man, mean
• (FR) (ἀγάπη, spouse, not, ancient, marriage., soul, person, content, Christians,
Testament, Éros, just, part, type, passage, means, humans, passion.”, aspires, con-
templation, contributes, argue, affection
• (IT) texts, 1, ”intimate, Plato, ”to
• (RU) (ἔρως
• (TR) talk
• (TrAM) érōs), ”love:
• (TrEL) ”erotas”, denote, eros., to, eros
• (other) non-corporeal, Corinthians, self-benefit)., benevolence., unconditional,
philosophers, transcendence.

Data: Mixed script: English – Spanish – Arabic


• (AR) من, كلمات, واحدة, قدماء, و, ٔاستخدموا, ينبغي, العلامة, العلامات, لاية, ٔاكثر, دون, فٔان, الذي, الرسم, المصريين, رسالتها, والٔاغريق, تنقل, وعموما, معين, الحاجة, عن, بنظرة, المعروف, استخدم, يعبر, يعني, ٔان, شيء, هم, ولكن, الرمز

• (EN) for, used, been, displaying, of, ribbon, black, or, mourning., statement.,
tragedies, is, political, a, Wearing, as, mourning
• (ES) por, has, crespón, sociedades, personas, sentimiento, representando, esta-
dos, de, El, señal, lazo, símbolo, en, utilizado, y

• (FR) remembrance, remembrance, un, es
• (IT) negro, duelo., POW/MIA
• (TrAM)
• (TrEL) symbol, o
• (other) político-social, organizaciones

Data: Mixed script: English – Chinese


• (DE) 机;, Chinese:, Western
• (EL) 機)
• (EN) Some, for, meanings, by, of, are, 8, positive, speaking, be, composed, or,
meanings., tradition, number, and, that, sound, linguists, word, some, this, other,
In, have, invoked, criticized, 6, because, The, believed, words, numbers, sounds,
frequently, is, pronounced, besides, traditional, the, in, represent, two, motiva-
tional, usage, their, based
• (ES) 危機;, has, 危机;, can, Chinese, ”crisis”, similar
• (FI) on
• (FR) (吉利), component, ”danger”, characters, (不利), certain, jī
• (PL) pinyin:
• (RU) 0, 9, wēijī)
• (TrEL) to, to., names, name
• (other) inauspicious, ”opportunity”., (simplified, auspicious

Data: Mixed script: Ukrainian – Russian


• (RU) Польшей, Румынией, Венгрией, юге, границу, с, омывается, Имеет, 9
—13, Молдавией., Азовского, водами, Россией, Чёрного, Русь, и, пунктом,
Словакией
• (TrAM) й
• (UK) держави, скіфів, України, народів, На, державності, вважається, від-
правним, території, української, готів, культури, але, сарматів, існували,
століття., Київська, на, Віддавна, інших, та, морей.
• (other) сухопутную, Белоруссией

Data: Pali: abbha

• (AR) ., 134., 289.

• (DE) Miln), imber, dark), Miln

• (EL) (=, (abbhaŋ

• (EN) water, mountain, of, free, (used, or, like, referred, (also, A, is, cloudy,
clouds, later, a, froth, 1, summit, thundering, by, mass, Pv, Oir, obscure, scum,
that, water]., thick, As, from, It, is, at, as, the, in, clouds, things, also

• (ES) (dense, f., sense, expl, rajo

• (FI) 239., rain;, Lat., Vin, perhaps, SnA

• (FR) cloud, Dh, adj., point, cloud, Dhs, A), rain, VvA, DhsA, list

• (IT) \”dark, &, ambha, 3, 1, 317, J, sunshine, cp., abhra, [Vedic, (megho

• (PL) 487, =, S, 295, <br, moon–, 249

• (RU) 348, 53

• (TR) viz., ambu, Vv

• (TrAM) 687, PvA, (°sama, 101, (nīl°, (cp., 64;, (nt.), 581, m., Sn, 1064;

• (TrEL) , Gr., Sk., Idg., to, pabbata, nt.

• (UK) 12)., 273, 617, 348)., 250;, 251)., 382).

• (other) <b> –saŋvilāpa </b>, <b> –mutta </b>, <smallcaps> vi. </smallcaps>,
(mahiyā, <smallcaps> iv. </smallcaps>, cloud\”;, <b> Rāhu </b>, <b> abbhā
</b>, <b> abbhaŋ, <superscript> 9 </superscript>, marajo </b>, abbhāmutta,
valāhaka);, <smallcaps> i. </smallcaps>, <b> abbhāmatta </b>, valāhaka–sikhara,
<superscript> s. </superscript>, <smallcaps> ii. </smallcaps>, <b> dhū-, storm–
cloud, /><b> –kūṭa </b>, thunder–cloud);, <at>a)fro\\s</at>, <b>–paṭala</b>,
<at>o)/mbros</at>, nīla–megha, <superscript>1</superscript>, *m̊bhro, \”dull\”;,
acchādesi);, mahikā</b>, <b> –ghana </b>

Data: Pali: abhijjhitar
• (DE) v.
• (EN) A, one, in, who, covets, med., function]
• (IT) ag., M, fr.
• (PL) 287, =
• (RU) 265
• (TrAM) l., [n.
• (TrEL) (T.
• (other) <smallcaps> v. </smallcaps>, abhijjhātar, abhijjhita, °ātar)., <smallcaps>
i. </smallcaps>, °itar, °itar)

Data: Pali: ajja


• (DE) (see, v., being, Ajjā
• (EN) of, or, and, not, present, Freq., day, this, “on, from, adyā,a, with, as, the,
morning, in, day”], an
• (ES) bahutaṁ,
• (FI) 32,23., ajjato
• (FR) Loc., dyaus, 15,64., dyā, pron.
• (IT) [Vedic, Mhvs, &, –divasa
• (PL) (=, +, demonstr., s.
• (RU) III,425, agge(?)
• (TR) old, adya, 10), idāni
• (TrAM) ‹-›
• (TrEL) phrase, base
• (UK) a3)
• (other) onward,henceforth, ajjā;, DA.I,235., (adv.), J.I,279;, D.I,85;,
ajja-tagge,see, Sn.75,153,158,970,998;, [Link],180;, PvA.6,23;, –kālaṁ,
diva) ,thus, PvA.59);, agga3), Kern,Toev., Pv.I,117, Dh.326;, ajjatagge,
(read, (Page, Vin.I,18;, dyā,a°, Ajja,&, to-day,now, “food” );

Data: Pali: gūhanā

• (ES) 253)

• (other) [abstr.fr.gūhati]=gūhanā, Pug.19.Cp.pari°.(Page, (q.v.),
Gūhanā, (f.)

Data: Pali: pacati

• (EL) 382)

• (EN) for, aer, roasting, read, roasted, be, or, at, tormented, in

• (FR) pare, D.I,52

• (IT) &, pacato, purgatory

• (TrAM) pāceti, ripe]

• (TrEL) to, daṇḍena

• (other) bake,
Gr.pέssw,(+Caus.pācayato),(q.v.) .(Page, DA.I, 159,
where,
Caus.pacāpeti, intrs.):Niraye, pacitvā, Pass.paccati,(trs.and, tormenting,
Gen.pacato, pīḷentassa).–, fig.torment, cook,pέpwn, Pacati,[Ved.pacati,
Idg.*peqǔō,Av.pac-;, paccato,by, ppr.pacanto, cook,boil,roast, fry,
roast,Lith,kepū, (q.v.) .–, (expld, Vin.IV,264;, Obulg.peka, pp.
pakka, (q.v.).‹-›, N.S.II,225,PvA.10,14.–

Data: Twier 1 (Greek–English)

• (DE) Internet

• (EL) στο, τη, αυτή, διαγωνισμό, λύση, ψήφισα

• (EN) of, IT, Things

• (ES) BUSINESS

• (TrAM) Μόλις

• (other) EXCELLENCE.

Data: Twier 2 (French–English)
• (EN) David, ”e, is, it, perish;, or, collective, Demain, counts?”, that, of, dynam-
ics, all
• (FI) 18h
• (FR) par, Keynote
• (other) #dhiha6, @dhiparis, science-publish

Data: Twier 3 (French–English)


• (EN) for, Food, waiting, the, in, ready, and, are
• (ES) go
• (FI) Edmonton
• (FR) just, breuvages, fans
• (TrEL) to
• (other) #bilingualism, #FWWC2015

Data: Twier 4 (English–Polish)


• (EN) with, back, from, comes, crates, and, poland, two, of, jackets
• (ES) dad, adidas
• (TrAM) my
• (TrEL) omg
• (other) żubrówka, strawberries

Data: Twier 5 (Transliterated Amharic–English)


• (EN) is, bread).
• (FR) our
• (IT) (coffee
• (PL) naw
• (TrAM) Buna, dabo

8.3.2 Textcat
For Textcat, the identified language is indicated in parentheses. As Textcat returns
unknown for many words, I list only the non-unknown categories to save space and
write rest to indicate that all other words of the text have been classified as
unknown. The language abbreviations are:

Abbreviation Language

DA Danish
DE German
EL Greek
EN English
ES Spanish
FI Finnish
FR French
HU Hungarian
ID Indonesian
IT Italian
LT Lithuanian
LV Latvian
NL Dutch
PT Portuguese
RU Russian
TH ai
ZH Chinese
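
The listing convention described above can be sketched in a few lines of Python. This is only an illustration of the output format, not the code used to produce the results; the function name `summarize` and the assumption of exactly one label per word are mine.

```python
from collections import defaultdict

def summarize(words, labels, unknown="unknown"):
    """Group words by label and collapse the unknown group to 'rest'.

    Hypothetical helper: assumes labels[i] is the category assigned to words[i].
    """
    groups = defaultdict(list)
    for word, label in zip(words, labels):
        groups[label].append(word)
    lines = []
    # List every non-unknown category with its words.
    for label in sorted(l for l in groups if l != unknown):
        lines.append(f"• ({label}) " + ", ".join(groups[label]))
    # All remaining words are abbreviated as "rest".
    if unknown in groups:
        lines.append(f"• ({unknown}) rest")
    return lines

example = summarize(
    ["navel-gazing", "Nabelschau", "the", "word"],
    ["HU", "ZH", "unknown", "unknown"],
)
```

Printing `example` line by line gives one bullet per identified language followed by a single "rest" bullet, which is the layout used in the listings below.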

Data: Latin script: German – English

• (HU) “navel-gazing”

• (ZH) Nabelschau

• (unknown) rest

Data: Latin script: German – Finnish – Turkish

• (DA) Südsommer., genellikle,

• (DE) Jahreszeiten, arktischen,

• (FI) vuodenajoista, kallistunut, tavallisesti,

• (ZH) gemäßigten, Klimazone., Südhalbkugel, Nordsommer, gleichzeitig, vuodenaika,
jyrkemmässä, vuodenaikoina., Pohjoisella, pallonpuoliskolla, kesäkuukausiksi,
eteläisellä, mevsimdir., gerçekleşir., arasındadır.,

• (unknown) rest

Data: Latin script: English – French

• (HU) different.,

• (ZH) (rugueux),(otherwise,

• (unknown) rest

Data: Latin script: English – Transliterated Greek

• (EN) historically, respective,

• (LT) languages,

• (ZH) distinguishes, Nonetheless,

• (unknown) rest

Data: Latin script: Italian – German

• (DE) allgemeiner, angemahnte,

• (ES) delicatezza,

• (HU) bellissima,

• (IT) dell’aureola, consapevole, richiamano, anti-Turchi., ossessionava,

• (NL) Ingenieuren,

• (PT) approfonditamente,

• (ZH) custodisce, struggente:, rivoluzionaria, psicologia, consapevolezza, autodistruggersi,
lunghissimo, Fachkräftemangel, Deutschland, intuizioni, Entwarnung, Stifterverbands,
Wissenschaft, MINT-Berufen, Annunciazione, dell’angelo:, Naturwissenschaftlern,

• (unknown) rest

Data: Mixed script: Greek – Russian

• (EL) ανεξάρτητου, οικογένειας,

• (RU) существования, богатейшая, литература., греческого, обязательным,
образованного, присутствует, количество, заимствований, значительное,
источником, технических, называемая, международная,

• (TH) латинским),

• (ZH) ινδοευρωπαϊκές, ινδοευρωπαϊκής, древнегреческий, международную,
церковнославянский,

• (unknown) rest

Data: Mixed script: English – Greek

• (DA) definition:, understanding,

• (EN) affection, unconditional, suggesting,

• (FR) relationships, contemplation, appreciation, attraction, attraction.”, transcendence.,

• (HU) benevolence., self-benefit).,

• (IT) non-corporeal,

• (PT) contributes,

• (ZH) Corinthians, throughout, Christians, Symposium, existence;, philosophers,

• (unknown) rest

Data: Mixed script: English – Spanish – Arabic

• (ES) sociedades, organizaciones, sentimiento, político-social,

• (FR) remembrance, remembrance, statement.,

• (ID) displaying,

• (PT) representando,

• (unknown) rest

Data: Mixed script: English – Chinese

• (EN) traditional, motivational, pronounced, tradition„

• (FR) characters,

• (ZH) simplified, frequently, ”opportunity”., criticized, auspicious, inauspicious,

• (unknown) rest

Data: Mixed script: Ukrainian – Russian

• (RU) державності, Словакией, Молдавией.,

• (TH) вважається,

• (ZH) відправним, української, сухопутную, Белоруссией,

• (unknown) rest

Data: Pali: abbha

• (DA) storm–cloud, thundering,

• (HU) marajo</b>, nīla–megha, valāhaka–sikhara,

• (ZH)
<at> a)fro\\</at>, <at> o)/mbros </at>, <smallcaps> ii. </smallcaps>, mahikā</b>,
<b> Rāhu </b>, <smallcaps> i. </smallcaps>, thunder–cloud);, <smallcaps> vi.
</smallcaps>, acchādesi);, <smallcaps> iv. </smallcaps>, <superscript> 9 </su-
perscript>, <b> abbhā </b>, <superscript> s. </superscript>, valāhaka);, <b>
abbhāmaa </b>, /><b> –kūṭa </b>, <superscript> 1 </superscript>, <b> –ghana
</b>, <b> –paṭala </b>, <b> –mua </b>, abbhāmua, <b> –saŋvilāpa </b>

• (unknown) rest

Data: Pali: abhijjhitar

• (ZH) abhijjhita, <smallcaps> i. </smallcaps>, abhijjhātar, <smallcaps> v. </smallcaps>,

• (unknown) rest

Data: Pali: ajja

• (ZH) diva),thus, to-day,now, Sn.75,153,158,970,998;, Kern,Toev.,
ajja-tagge,see, onward,henceforth,

• (unknown) rest

Data: Pali: gūhanā

• (ZH) Gūhanā,(f.), [abstr.fr.gūhati]hanā, Pug.19.Cp.pari°.(Page,

• (unknown) rest

Data: Pali: pacati

• (ZH) fig.torment, Pacati,[Ved.pacati,Idg.*peqǔō,Av.pac-;, Obulg.peka,


fry,roast,Lith,kepū, bake,Gr.pέssw, cook,pέpwn, cook,boil,roast, Vin.
IV,264;, intrs.):Niraye, N.S.II,225,PvA.10,14.–, ppr.pacanto,
tormenting,Gen.pacato,(+Caus.pācayato), DA.I,159,where, paccato,by,
pīḷentassa)
.–, (q.v.).‹-›, Caus.pacāpeti, Pass.paccati, (q.v.).(Page,

• (unknown) rest

Data: Twier 1 (Greek–English)

• (ZH) διαγωνισμό, EXCELLENCE.,

• (unknown) rest

Data: Twier 2 (French–English)

• (IT) collective,

• (ZH) science-publish,

• (unknown) rest

Data: Twier 3 (French–English)

• (ZH) #bilingualism,

• (unknown) rest

Data: Twier 4 (English–Polish)

• (LV) strawberries,

• (unknown) rest

Data: Twier 5 (Transliterated Amharic–English)

• (unknown) rest

8.3.3 Clustering
Clustering the different data sets produced the following clusters. The second run
takes the clusters from the first run as input and may subdivide each of them into
two or more clusters.
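
The two-run scheme can be pictured as applying a clustering step once to the whole text and then once more to every resulting cluster. The following sketch shows only that control flow, under assumptions of mine: `refine` and `by_length` are hypothetical names, and `by_length` is a toy stand-in that splits on word length, not the clustering algorithm used for the results below.

```python
def refine(clusters, cluster_fn):
    """Second run: re-cluster every first-run cluster separately.

    cluster_fn is any function mapping a list of words to a list of clusters.
    A cluster that cannot be split further is kept unchanged.
    """
    refined = []
    for cluster in clusters:
        parts = [part for part in cluster_fn(cluster) if part]
        refined.extend(parts if parts else [cluster])
    return refined

def by_length(words, cutoff=3):
    """Toy clustering function (assumption): short words vs. long words."""
    short = [w for w in words if len(w) <= cutoff]
    long_ = [w for w in words if len(w) > cutoff]
    return [short, long_]

first_run = [["at", "German", "in"], ["it", "or", "to"]]
second_run = refine(first_run, by_length)
```

Because each first-run cluster is processed on its own, the second run can only split clusters and never merges words from different first-run clusters, which matches the listings in this section.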

Data: Latin script: German – English

First run

• “navel-gazing”, doesn’t, else’s

• “staring, But, German, Nabelschau, anyone, belly, buon, case, just, means,
navel”., own., refer, this, word, your

• at, in, it, or, to

• –, The

Second run

• doesn’t, else’s

• “navel-gazing”

• “staring, But, German, Nabelschau, belly, case, means, navel”., refer, this

• anyone, buon, just, own., word, your

• it, or, to

• at, in

• –, The

Data: Latin script: German – Finnish – Turkish

First run

• Dünya, Güney, Küre’de, Südhalbkugel, Südsommer., Südwinter, Sıcak, arasında,


gemäßigten, günler, için, kesäkuukausiksi, lämpimin, säteilee, sıcak, wärmste,
çıkar., Der

• Aralık, Eylül, Kesä, Yarım, arasındadır., eteläisellä, eiği, eä, gerçekleşir., heinä-,
jyrkemmässä, kesä-., kevään, välissä., yaklaşık, ısıyı

• 21, 22

• Der, Haziran, Jahreszeiten, Je, Klimazone., Kuzey, Mart, Nord-, Nordsommer,


Pohjoisella, Sommer, Yaz, arktischen, auf, aurinko, ay, dem, depo, der, die, eli,
elokuu, en, er, findet, genellikle, gerade, gleichzeitig, helmikuu., herrscht, iki,
ile, in, ise, ist, ja, joulu-, kallistunut, koska, kuin, kulmassa, lasketaan, maan,
maapallo, man, mevsimdir., mit, muina, nachdem, niin, ob, oder, on, ortaya, pal-
lonpuoliskolla, pinnalle, silloin, sonra, spricht, sta., suvi, syksyn, tammi-, taval-
lisesti, und, uzun, vier, vom, vuodenaika, vuodenaikoina., vuodenajoista, yazda

Second run

• Südhalbkugel, Südsommer., Südwinter, arasında, gemäßigten, kesäkuukausiksi,


lämpimin, säteilee, wärmste

• Dünya, Güney, Küre’de, Sıcak, günler, için, sıcak, çıkar., Der

• arasındadır., eteläisellä, eiği, eä, gerçekleşir., heinä-, jyrkemmässä, kesä-.,
kevään, välissä., yaklaşık, ısıyı

• Aralık, Eylül, Yarım

• Kesä

• 22

• 21

• Der, Haziran, Jahreszeiten, Klimazone., Kuzey, Mart, Nord-, Nordsommer, Po-


hjoisella, Sommer, Yaz,

• arktischen, auf, aurinko, dem, depo, der, die, eli, elokuu, findet, genellikle, gerade,
gleichzeitig, helmikuu., herrscht, iki, ile, ise, ist, joulu-, kallistunut, koska, kuin,
kulmassa, lasketaan, maan, maapallo, man, mevsimdir., mit, muina, nachdem,
niin, oder, ortaya, pallonpuoliskolla, pinnalle, silloin, sonra, spricht, sta., suvi,
syksyn, tammi-, tavallisesti, und, uzun, vier, vom, vuodenaika, vuodenaikoina.,
vuodenajoista, yazda

• Je, ay, en, er, in, ja, ob, on

Data: Latin script: English – French

First run

• ”coarse”, ”hard”., ”rough”, ”so”, (otherwise, (rugueux), Doux, English, almost,


also, although, both, but, can, different., doux, for, mean, meaning, mou, only,
opposite, sucré, sweet, the, their, translate, used)., very, while, wines

• is, or

• as, in, of

Second run

• Doux, English,

• “coarse”, (otherwise, (rugueux), almost, although, different., meaning, opposite,


translate

• “hard”., ”rough”, ”so”, also, both, but, can, doux, for, mean, mou, only, sucré,
sweet, the, their, used)., very, while, wines

• or

• is

• in

• of

• as

Data: Latin script: English – Transliterated Greek

First run

• The

• agápe, philía, storgē., éros,

• Ancient, However, Nonetheless, contexts., different, difficult, distinct, distin-


guishes, follows., generally, historically, language, languages, meanings, outside,
respective, senses, separate, which, words

• Greek, and, are, as, at, been, for, four, has, how, in, is, it, least, love, love:, of,
other, the, their, these, to, used, used., ways, were, when, with, word

Second run

• The

• philía, storgē.

• agápe, éros,

• Ancient, However, Nonetheless, contexts., different, difficult, distinct, distin-


guishes, follows., generally, historically, meanings, respective

• words

• language, languages, outside, senses, separate, which

• and, are, as, at, been, for, four, has, how, in, is, it, least, love, love:, of, other, the,
their, these, to, used, used., ways, were, when, with, word

• Greek

Data: Latin script: German – Italian

First run

• (il, E, So, a, ad, da, di, e, es, ha, i, il, in, la, le, lo, ma, ne, se, si, un, va, zu

• “ein , Annunciazione, Baista, Cenacolo, Certo, Come, Dabei, Deutsche, Deutsch-
land, Entwarnung, Ergebnis, Giuda, Ingenieuren, Ist, Jahren, Kaum, Leonardo,
MINT-Berufen, Mythos?, Naturwissenschalern, Stierverbands, Stimmen, Stu-
die, Studie, Szenario, ema, Umfrage, Venezia, Warnung, Wissenscha, Woche,
Zeit, acque, ali, alla, alle, allgemeiner, also, amore, anche, angemahnte, anni,
anti-Turchi., approfonditamente, arginato., aento, auch, autodistruggersi, baci-
ni, barbaglio, bei, bellissima, cancro, caurata, che, chiave, con, condanna, consa-
pevole, consapevolezza, cosa, cura, custodisce, das, dass, definire, del, delicatezza,
delle, dem, den, der, des, die, difesa, drohe., drohenden, eher, ein, eine, faceva,
geben., gibt., idee, intuizioni, kam, keine, letzter, lunghissimo, mehr”, mehren,
modo., moto, movimento, nelle, neue, nicht, non, nur, occhio, oder, ossessiona-
va, ovvero, peccato), per, periture, poi, privato, psicologia, punte, qualche, quel,
quello, recente, restauro, riccioli, ricerche, richiamano, rivoluzionaria, seit, sich,
sogno, solo, solo, sono, stessa, struggente:, subito, sui, und, vada, vergeht, viene,
vinciano, vita, volare?, vom, zum

• all’insù, dell’angelo:, dell’aureola, l’esempio, Milano

• Fachkräemangel, affrescò, cominciò, ür, jüngst, perché, più, studierà

Second run

• a, e, i

• E

• So

• (il, ad, da, di, es, ha, il, in, la, le, lo, ma, ne, se, si, un, va, zu

• Annunciazione, Baista, Cenacolo, Certo, Come, Dabei, Deutsche, Deutschland,


Entwarnung, Ergebnis, Giuda, Ingenieuren, Ist, Jahren, Kaum, Leonardo, MINT-
Berufen, Mythos?, Naturwissenschalern, Stierverbands, Stimmen, Studie, Stu-
die, Szenario, ema, Umfrage, Venezia, Warnung, Wissenscha, Woche, Zeit

• “ein, acque, ali, alla, alle, allgemeiner, also, amore, anche, angemahnte, an-
ni, anti-Turchi., approfonditamente, arginato., aento, auch, autodistruggersi,
bacini, barbaglio, bei, bellissima, cancro, caurata, che, chiave, con, condanna,
consapevole, consapevolezza, cosa, cura, custodisce, das, dass, definire, del, de-
licatezza, delle, dem, den, der, des, die, difesa, drohe., drohenden, eher, ein,
eine, faceva, geben., gibt., idee, intuizioni, kam, keine, letzter, lunghissimo,
mehr”, mehren, modo., moto, movimento, nelle, neue, nicht, non, nur, oc-
chio, oder, ossessionava, ovvero, peccato), per, periture, poi, privato, psicologia,

punte, qualche, quel, quello, recente, restauro, riccioli, ricerche, richiamano, ri-
voluzionaria, seit, sich, sogno, solo, solo, sono, stessa, struggente:, subito, sui,
und, vada, vergeht, viene, vinciano, vita, volare?, vom, zum

• all’insù, dell’angelo:, dell’aureola, l’esempio, Milano

• Fachkräemangel

• affrescò, cominciò, jüngst, perché, studierà

• für

• più

Data: Mixed script: Greek – Russian

First run

• 15ο, —, Η

• το, В, На, а, в, и, на, с

• (наряду, (так, γλωσσών., γλώσσα, γλώσσες., δεσμό., π.Χ., σήμερα., заимство-


ваний, латинским), лексика)., литература., слов., человека., язык.

• Ανήκει, Αποτελεί, Στην, έχουμε, αιώνα, ανεξάρτητου, από, βαλκανικό, γλωσ-


σικό, γλώσσα, γραπτά, είναι, ελληνική, ενός, επίσης, ινδοευρωπαϊκές, ινδο-
ευρωπαϊκής, κείμενα, κλάδου, μέλος, μέχρι, μία, μοναδικό, οικογένειας, στον,
της, τις, τον, Римской, богатейшая, большое, была, время, всех, всякого,
греческие, греческих, греческого, греческом, двумя, для, древнегреческий,
его, знание, значительное, империи, источником, количество, латинских,
латинском, лексику, международная, международную, называемая, науч-
ных, новое, новых, образованного, обязательным, основном, присутству-
ет, проникали, путями, романских, русский, слова, создана, создания, стал,
существования, считалось, терминов, технических, церковнославянский,
через, этапах, язык, языка, языке

Second run

• 15ο

• —

• Η

• а, в, и, с

• В

• το, На, на

• (наряду, (так

• γλωσσών., γλώσσα, γλώσσες., δεσμό., π.Χ., σήμερα., заимствований, латин-


ским), лексика)., литература., слов., человека., язык.

• έχουμε, αιώνα, ανεξάρτητου, από, βαλκανικό, γλωσσικό, γλώσσα, γραπτά, εί-


ναι, ελληνική, ενός, επίσης, ινδοευρωπαϊκές, ινδοευρωπαϊκής, κείμενα, κλάδου,
μέλος, μέχρι, μία, μοναδικό, οικογένειας, στον, της, τις, τον

• Ανήκει, Αποτελεί, Στην

• богатейшая, греческие, греческих, греческого, греческом, древнегреческий,


значительное, источником, количество, латинских, латинском, междуна-
родная, международную, называемая, образованного, обязательным, ос-
новном, присутствует, проникали, романских, создания, существования,
считалось, терминов, технических, церковнославянский

• Римской, большое, была, время, всех, всякого, двумя, для, его, знание, им-
перии, лексику, научных, новое, новых, путями, русский, слова, создана,
стал, через, этапах, язык, языка, языке

Data: Mixed script: English – Greek

First run

• “intimate, “without, Although, Aquinas, Christians, Corinthians, Socrates, Symposium,
Testament, Whether, affection, ancient, another.”, appreciation, aspires,
attraction, attraction.”, becomes, benevolence., biblical, brotherly, chapter,”, charity;,
children, children., contemplation, content, continues, contributes, definition:,
described, existence;, explained, express, feeling, feelings, finding, further,
holding, initially, inspired, knowledge, marriage., necessary, non-corporeal, passage,
passion.”, philosophers, physical, platonic, refined, relationships, returned,
self-benefit)., sensually, spiritual, subject, suggesting, through, throughout, transcendence.,
unconditional, understanding, without, youthful

• (ἀγάπη, (ἔρως, Agápe, agápē), Éros, érōs), –

• “Form”, “erotas”, “love, “love, “love:, (even, Agape, Greek, Lovers, Modern, Plato,
is, omas, also, apply, argue, based, beauty, beauty, being, dating, denote,
desire, does, eros, eros., erotic, even, famous, feast., feel, felt, given, good, helps,
hence, high, humans, ideal, itself., just, known, leads, like, love, love, love.”, mean,
means, most, mostly, one’s, part, person, person, plane, recall, refer, regard., seek,
sexual, soul, spouse, talk, texts, that, there, thus, truth, truth, type, used, well,
will, will, with, within, word, work
• “to, 1, 13, God, God.”, In, It, New, The, a, all, an, and, any, are, as, be, by, can, esp.,
for, has, his, in, is, is, it, its, man, not, not, of, on, one, or, own, the, to, us, use,
was

Second run
• affection, ancient, another.”, aspires, becomes, biblical, chapter,”, charity;, children,
children., content, definition:, feeling, feelings, finding, holding, marriage.,
necessary, passage, passion.”, platonic, refined, returned, subject, through, without
• Although, Aquinas, Christians, Corinthians, Socrates, Symposium, Testament,
Whether
• “intimate, appreciation, attraction, attraction.”, benevolence., brotherly, contemplation,
continues, contributes, described, existence;, explained, express, further,
initially, inspired, knowledge, non-corporeal, philosophers, physical, relationships,
self-benefit)., sensually, spiritual, suggesting, throughout, transcendence.,
unconditional, understanding, youthful
• Agápe, agápē), Éros, érōs)
• (ἀγάπη, (ἔρως
• –
• “erotas”, beauty, beauty, dating, denote, desire, erotic, famous, humans, itself.,
mostly, person, person, recall, regard., sexual, spouse, within
• “Form”, Agape, Greek, Lovers, Modern, Plato, This, Thomas, based, being, feast.,
hence, ideal, leads, means, plane, refer, there
• apply, felt, helps, high, just, known, most, part, talk, texts, that, thus, truth, truth,
type, well, will, will, with, word, work
• “love, “love, “love:, (even, also, argue, does, eros, eros., even, feel, given, good,
like, love, love, love.”, mean, one’s, seek, soul, used

• 1, 13, In, It

• “to, a, an, as, be, by, in, is, is, it, of, on, or, to, us

• God, God.”, New, The, all, and, any, esp., its, own, the

• are, can, for, has, his, man, not, not, one, use, was

Data: Mixed script: English – Spanish – Arabic

First run
• El, POW/MIA, Wearing, a, as, been, black, de, displaying, duelo., en, es, estados,
for, has, is, lazo, mourning, mourning., negro, o, of, or, organizaciones, personas,
political, por, remembrance, remembrance, representando, ribbon, sentimiento,
sociedades, statement., symbol, tragedies, un, used, utilizado, y

• crespón, político-social, señal, símbolo

• A

• ،‫ رسالتها‬،‫ دون‬،‫ تنقل‬،‫ بنظرة‬،‫ المعروف‬،‫ المصريين‬،‫ العلامة‬،‫ العلامات‬،‫ الرمز‬،‫ الرسم‬،‫ الذي‬،‫ الحاجة‬،‫ استخدم‬،‫ ٔان‬،‫ ٔاكثر‬،‫ٔاستخدموا‬
‫ ينبغي‬،‫ يعني‬،‫ يعبر‬،‫ ولكن‬،‫ وعموما‬،‫ والٔاغريق‬،‫ واحدة‬،‫ و‬،‫ هم‬،‫ من‬،‫ معين‬،‫ لاية‬،‫ كلمات‬،‫ قدماء‬،‫ فٔان‬،‫ عن‬،‫شيء‬

Second run
• a, o, y

• El, as, de, en, es, is, of, or, un

• Wearing, been, black, displaying, duelo., estados, for, has, lazo, mourning, mourn-
ing., negro, organizaciones, personas, political, por, remembrance, remembrance,
representando, ribbon, sentimiento, sociedades, statement., symbol, tragedies,
used, utilizado

• POW/MIA

• político-social, símbolo

• crespón, señal

• A

• ‫ وعموما‬،‫ والٔاغريق‬،‫ رسالتها‬،‫ المعروف‬،‫ المصريين‬،‫ العلامة‬،‫ العلامات‬،‫ الحاجة‬،‫ استخدم‬،‫ٔاستخدموا‬

• ‫ ينبغي‬،‫ يعني‬،‫ يعبر‬،‫ ولكن‬،‫ واحدة‬،‫ معين‬،‫ لاية‬،‫ كلمات‬،‫ قدماء‬،‫ فٔان‬،‫ شيء‬،‫ دون‬،‫ تنقل‬،‫ بنظرة‬،‫ الرمز‬،‫ الرسم‬،‫ الذي‬،‫ٔاكثر‬

• ‫ هم‬،‫ من‬،‫ عن‬،‫ٔان‬

• ‫و‬

Data: Mixed script: English – Chinese

First run

• “crisis”, “danger”, “opportunity”., (simplified, Chinese, Chinese:, Western, aus-


picious, because, believed, besides, certain, characters, component, composed,
criticized, frequently, inauspicious, invoked, linguists, meanings, meanings., mo-
tivational, number, numbers, pinyin:, positive, pronounced, represent, similar,
sounds, speaking, tradition, traditional, wēijī)

• (不利), (吉利), 危机;, 危機;, 机;, 機)

• 0, 6, 8, 9

• In, Some, The, and, are, based, be, by, can, for, has, have, in, is, jī, name, names,
of, on, or, other, some, sound, that, the, their, this, to, to., two, usage, word, words

Second run

• Chinese, Chinese:

• Western

• “crisis”, “danger”, “opportunity”., (simplified, auspicious, because, believed, be-


sides, certain, characters, component, composed, criticized, frequently, inaus-
picious, invoked, linguists, meanings, meanings., motivational, number, num-
bers, pinyin:, positive, pronounced, represent, similar, sounds, speaking, tradi-
tion, traditional, wēijī)

• (不利), (吉利)

• 危机;, 危機;

• 机;, 機)

• 6, 8, 9

• 0,

• Some, e, and, are, based, can, for, has, have, name, names, other, some, sound,
that, the, their, this, two, usage, word, words

• In, be, by, in, is, of, on, or, to, to.

• jī

Data: Mixed script: Ukrainian – Russian

First run

• 9—13

• Белоруссией, Венгрией, Молдавией., Польшей, Россией, Словакией, мо-


рей., народів, сарматів, скіфів, століття.

• Азовского, Віддавна, Київська, Румынией, України, , Чёрного, вважаєть-


ся, відправним, , границу, держави, державності, , культури, омывается,
пунктом, сухопутную, території, української, існували,

• Имеет, На, Русь, але, водами, готів, и, й, на, с, та, юге, інших

Second run

• 9—13

• морей., народів, сарматів, скіфів, століття.

• Белоруссией, Венгрией, Молдавией., Польшей, Россией, Словакией,

• Азовского, Віддавна, Київська, Румынией, України, Чёрного, границу,


держави, культури, пунктом, існували

• вважається, відправним, державності, омывается, сухопутную, території,


української

• и, й, с

• На, на, та

• але, водами, готів, юге, інших

• Имеет, Русь

Data: Pali: abbha

First run
• (also, (cp., (dense, (megho, (used, (°sama, 1, 1, 101, 1064;, 12)., 134., 239., 249,
250;, 251)., 273, 289., 295, 3, 317, 348, 348)., 382)., 487, 53, 581, 617, 64;, 687, <at>
a)fro\\s </at>, <at> o)/mbros </at>, <smallcaps> i. </smallcaps>, <smallcaps>
ii. </smallcaps>, <smallcaps> iv. </smallcaps>, <smallcaps> vi. </smallcaps>,
<superscript> 1 </superscript>, <superscript> 9 </superscript>, <superscript>
s. </superscript>, A, A), As, Dh, Dhs, DhsA, Gr., Idg., It, J, Lat., Miln, Miln),
Oir, Pv, PvA, S, Sk., Sn, SnA, , is, Vin, Vv, VvA, [Vedic, a, abhra, adj., also,
ambha, ambu, as, at, by, cloud, cloud, cloud\”;, clouds, clouds, cloudy, cp., dark),
expl, f., free, from, froth, imber, in, is, later, like, list, m., marajo</b>, mass,
moon–, mountain, nt., obscure, of, or, pabbata, perhaps, point, rain, rain;, rajo,
referred, scum, sense, storm–cloud, summit, sunshine, that, the, thick, things,
thunder–cloud);, thundering, to, viz., water, water].
• &, (=, <b>–ghana</b>, <b>–mua</b>, <br, =, \”dark, \”dull\”;
• (abbhaŋ, (mahiyā, (nīl°, <b> –saŋvilāpa </b>, <b> Rāhu </b>, <b> abbhā </b>,
<b> abbhāmaa </b>, abbhāmua, acchādesi);, mahikā </b>, nīla–megha, valā-
haka);, valāhaka– sikhara
• *m̊bhrocite /><b>–kūṭa</b>, <b>–paṭala</b>, <b>abbhaŋ, <b>dhū-, (nt.)

Second run
• (cp., Dhs, DhsA, Idg., Lat., Miln, Miln), Oir, PvA, SnA, is, Vin, VvA, [Vedic, as,
at, by, cp., in, is, nt., of, or, to
• (also, (dense, (megho, (used, (°sama, <at> a)fro\\s </at>, <at> o)/mbros </at>,
<smallcaps> ii. </smallcaps>, <smallcaps> iv. </smallcaps>, <smallcaps> vi.
</smallcaps>, abhra, adj., also, ambha, ambu, cloud, cloud, cloud\”;, clouds, clouds,
cloudy, dark), expl, free, from, froth, imber, later, like, list, marajo </b>, mass,
moon–, mountain, obscure, pabbata, perhaps, point, rain, rain;, rajo, referred,
scum, sense, storm– cloud, summit, sunshine, that, the, thick, things, thunder–
cloud);, thundering, viz., water, water].
• 1, 1, 101, 1064;, 12)., 134., 239., 249, 250;, 251)., 273, 289., 295, 3, 317, 348, 348).,
382)., 487, 53, 581, 617, 64;, 687, <superscript> 1 </superscript>, <superscript> 9
</superscript>
• <smallcaps> i. </smallcaps>, <superscript> s. </superscript>, A, A), As, Dh, Gr.,
It, J, Pv, S, Sk., Sn, , Vv, a, f., m.

• <b> –ghana </b>, <b> –mua </b>, <br, \”dark, \”dull\”;

• &, (=, =

• (abbhaŋ, (mahiyā, (nīl°, <b> Rāhu </b>, <b> abbhā </b>, nīla–megha

• <b> –saŋvilāpa </b>, <b> abbhāmaa </b>, abbhāmua, acchādesi);, mahikā


</b>, valāhaka);, valāhaka–sikhara

• *m̊bhro, /><b> –kūṭa </b>, <b> –paṭala </b>, <b> abbhaŋ, <b> dhū-

• (nt.)

Data: Pali: abhijjhitar

First run

• abhijjhita, abhijjhātar, covets, function], med., one, who, °itar), °itar, °ātar).

• (T., <smallcaps> i. </smallcaps>, <smallcaps> v. </smallcaps>, =, A, M, ag., fr.,


in, l., v.

• 265, 287

• [n.

Second run

• abhijjhita, abhijjhātar, covets, function], med., one, who, °itar), °itar, °ātar).

• (T., A, M

• =, l., v.

• <smallcaps> i. </smallcaps>, <smallcaps> v. </smallcaps>, ag., fr., in

• 265, 287

• [n.

Data: Pali: ajja

First run

• –divasa, Freq., Loc., [Vedic, adya, ajjatagge, ajjato, an, and, as, base, being, day,
demonstr., dyaus, from, in, morning, not, of, old, or, phrase, present, pron., the,
this, with

• &, +, Mhvs, s., v.

• –kālaṁ, 10), 15,64., 32,23., Ajjā, D.I,85;, DA.I,235., Dh.326;, III,425, J.I,
279;, [Link],180;, Kern,Toev., Pv.I,117, PvA.59) ;, PvA.6,23;, Sn.75,153,
158,970,998;, Vin.I,18;, a3), adyā,a, agga3), agge(?), ajja-tagge,see,
ajjā;, bahutaṁ, day”], diva),thus, dyā, dyā,a°, idāni, onward,henceforth,
to-day,now, “food”);, “on, ‹-›, Ajja,&, (=, (Page, (adv.), (read, (see

Second run

• an, as, in, of, or

• Freq., Loc., [Vedic

• –divasa, adya, ajjatagge, ajjato, and, base, being, day, demonstr., dyaus, from,
morning, not, old, phrase, present, pron., the, this, with

• &

• +

• Mhvs

• s., v.

•“on, ‹-›, Ajja,&, (=, (Page, (adv.), (read, (see

• –kālaṁ, 10), 15,64., 32,23., Ajjā, D.I,85;, DA.I,235., Dh.326;, III,425,


J.I,279;, [Link],180;, Kern,Toev., Pv.I,117, PvA.6,23;, Sn.75,153,158,
970,998;, Vin.I,18;, a3), agga3), ajja-tagge,see, ajjā;, bahutaṁ, day”
], diva),thus, dyā, idāni, onward,henceforth, to-day,now

• PvA.59);, adyā,a, agge(?), dyā,a°, “food”);

Data: Pali: gūhanā

First run

• 253), Pug.19.Cp.pari°.(Page, [abstr.fr.gūhati]=gūhanā, Gūhanā,(f.)


, (q.v.)

Second run

• 253), Pug.19.Cp.pari°.(Page, [abstr.fr.gūhati]=gūhanā, Gūhanā,(f.)


, (q.v.)

Data: Pali: pacati

First run

• 382), Caus.pacāpeti, DA.I,159,where, Obulg.peka, Pass.paccati, Vin.


IV,264;, bake,Gr.pέssw, cook,boil,roast, cook,pέpwn, daṇḍena, fig.
torment, fry,roast,Lith,kepū, intrs.):Niraye, paccato,by, ppr.pacanto,
pp.pakka, pīḷentassa).–, tormenting,Gen.pacato

• D.I,52, N.S.II,225,PvA.10,14.–, Pacati,[Ved.pacati,Idg.*peqǔō,


Av.pac-;, (+Caus.pācayato), (expld, (q.v.).–, (q.v. ).‹-›, (q.
v.).(Page, (trs.and

• aer, at, be, for, in, or, pacato, pare, purgatory, read, ripe], roasted, roasting, to,
tormented

• &, pacitvā, pāceti

Second run

• Caus.pacāpeti, DA.I,159,where, Obulg.peka, Pass.paccati, Vin.IV,264;


, bake,Gr.pέssw, cook,boil,roast, cook,pέpwn, daṇḍena, fig.torment, fry,
roast,Lith,kepū, intrs.):Niraye, paccato,by, ppr.pacanto, pīḷentassa)
.–,
tormenting,Gen.pacato

• 382), pp.pakka

• D.I,52, N.S.II,225,PvA.10,14.–, (q.v.


).–

• Pacati,[Ved.pacati,Idg.*peqǔō,Av.pac-;,(+Caus.pācayato),(expld,
(q.v. ).‹-›, (q.v.).(Page, (trs.and

• for, pacato, pare, read, ripe]

• aer, purgatory, roasted, roasting, tormented

• or, to

• at, be, in

• &

• pacitvā, pāceti

Data: Twier 1 (Greek–English)

First run

• αυτή, διαγωνισμό, λύση, στο, τη, ψήφισα, Μόλις

• BUSINESS, EXCELLENCE., IT, Internet, Things, of

Second run

• Μόλις

• αυτή, διαγωνισμό, λύση, στο, τη, ψήφισα

• IT, of

• Internet, ings,

• BUSINESS, EXCELLENCE.

Data: Twier 2 (French–English)

First run

• “The, 18h, @dhiparis, David, Demain, Keynote, all, collective, counts?”, dynamics,
par, perish;, science-publish, that

• is, it, of, or

Second run

• “The, @dhiparis, David, Demain, Keynote, all, collective, counts?”, dynamics,
par, perish;, science-publish, that

• 18h

• is, it, or

• of

Data: Twier 3 (French–English)

First run

• Edmonton, Food

• go, in, to

• and, are, breuvages, fans, for, just, ready, the, waiting

Second run

• Edmonton, Food

• to

• go, in

• for, just

• and, are, breuvages, fans, ready, the, waiting

Data: Twier 4 (English–Polish)

First run

• żubrówka, my

• adidas, and, back, comes, crates, dad, from, jackets, of, omg, poland, strawberries,
two, with

Second run

• żubrówka, my

• adidas, comes, dad, of

• and, back, crates, from, jackets, omg, poland, strawberries, two, with

Data: Twier 5 (Transliterated Amharic–English)

First run

• Buna

• (coffee, bread)., dabo, is, naw, our

Second run

• Buna

• our

• (coffee, bread)., dabo, is, naw

8.3.4 Language Model Induction


For all language model induction tasks, the threshold value was set to t = 0.02
and the silver threshold value to s = 0.1. The other parameters were set to
“maximum iteration count” i = 4, “maximum random iteration count” j = 2,
and “merge mode ADD”.
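
For reference, the parameter values stated above can be collected in one place. The sketch below is only a convenient way to record them; the class and field names are my own labels for the symbols t, s, i and j, and only the merge mode named in the text (ADD) is represented.

```python
from dataclasses import dataclass
from enum import Enum

class MergeMode(Enum):
    # Only the mode used for these experiments is listed here.
    ADD = "ADD"

@dataclass(frozen=True)
class InductionSettings:
    """Hypothetical container for the settings used in this section."""
    threshold: float = 0.02            # t
    silver_threshold: float = 0.1      # s
    max_iterations: int = 4            # i, "maximum iteration count"
    max_random_iterations: int = 2     # j, "maximum random iteration count"
    merge_mode: MergeMode = MergeMode.ADD

SETTINGS = InductionSettings()
```

Freezing the dataclass keeps the values fixed across all data sets, in line with the statement that the same settings were used for every task.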

Data: Latin script: German–English

• The, German, word, Nabelschau, means, or, “staring, at, your, But, in, this, it,
doesn’t, refer, to, anyone, else’s, buon, just, your, own.,

• –

• “navel-gazing”, navel”., case, belly

Data: Latin script: German–Finnish–Turkish

• die, in, und, Klimazone., Je, ob, auf, Südhalbkugel, vom, eli, on, vuodenaika, ja, on,
vuodenajoista, koska, maapallo, on, silloin, kallistunut, aurinko, maan, pinnalle,
kulmassa, muina, vuodenaikoina., Pohjoisella, pallonpuoliskolla, lasketaan, ta-
vallisesti, ja, elokuu, eteläisellä, pallonpuoliskolla, joulu-, ja, helmikuu., en, sı-
cak, en, yazda, Dünya, depo, en, sıcak, yaklaşık, ay, sonra, ortaya, Sıcak, Haziran,
Eylül, ise, Aralık, arasındadır.

• Der, ist, wärmste, der, vier, Jahreszeiten, der, arktischen, nachdem, er, der, Nord-,
oder, herrscht, spricht, Nord-, oder, Der, findet, mit, Südwinter, sta., suvi, läm-
pimin, niin, eä, säteilee, heinä-, Yaz, mevsimdir., Küre’de, Küre’de, 21, 22, ara-
sında, Küre’de, 22, 21, Mart

• gemäßigten, gerade, gleichzeitig, kuin, Kuzey, uzun, günler, gerçekleşir., eiği,


için, günler, genellikle, iki, günler, Kuzey, ile, ile

• Sommer, man, Südsommer., Nordsommer, dem, Kesä, kevään, syksyn, välissä.,


Kesä, jyrkemmässä, kesäkuukausiksi, kesä-., tammi-, Yarım, ısıyı, çıkar., Yarım,
Güney, Yarım

Data: Latin script: English–French

• both, “so”, in, English, although, their, is, is, the, opposite, of, “rough”, or, is,
the, opposite, of, sweet, only, for, wines, (otherwise, is

• mou, :, mou, but

• doux,

• Doux, (rugueux), Doux

• while

• “hard”., used).,

• translate, as, meaning, very, different., ”coarse”, can, also, mean, almost,sucré,

Data: Latin script: English–Transliterated Greek

• at, least, ways, as, to, is, has, philía, and, storgē., as, has, historically, difficult, to,
which, generally, as

• The, language, distinguishes, different, the, Ancient, distinct, with, languages, it,
been, separate, the, meanings, these, used, outside, their, respective, the, senses,
in, these, used

• Greek, how, word, Greek, agápe, éros, However, other, when, were, are

• four, love, used., four, words, for, love:, of, words, of, contexts., Nonetheless,
words, follows.

Data: Latin script: Italian–German

• affrescò, privato, Studie, definire, periture,Stierverbands, Wissenscha,studierà,


difesa, ovvero, Szenario, Naturwissenschalern

• dell’aureola, da, del, di, der, zum, modo., dem, den, drohe., Come, vom

• custodisce, quel, es, oder, per, le, idee, stessa, des, dass, delle, E, se, Ist, das, seit

• più, Cenacolo, vinciano, rivoluzionaria, Giuda, condanna, con, peccato), comin-


ciò, con, cancro, faceva, intuizioni, vita, va, Dabei, Ergebnis, in, i, riccioli, poi,
più, bacini, in, Annunciazione, con, ali, la, cosa, barbaglio, anni, bei,

• ne, struggente:, che, amore, e, non, viene, ma, consapevolezza, ad, che, ha, re-
cente, Kaum, eine, Woche, vergeht, keine, neue, Umfrage, Warnung, ema,
Fachkräemangel, Deutschland, Certo, ma, anche, consapevole, che, qualche,
mehren, letzter, Zeit, Stimmen, Entwarnung, geben., kam, jüngst, eine, Deutsche,
”ein, allgemeiner, Fachkräemangel, eher, mehr”, anche, Baista, che, Leonardo,
approfonditamente, a, Venezia, nelle, vada, alla, aento, alle, dell’angelo:, deli-
catezza, punte, che, non, che, volare?, Jahren, angemahnte, drohenden, Fachkräe-
mangel, Ingenieuren, ein

• Milano, l’esempio, psicologia, (il, subito, autodistruggersi, solo, lunghissimo,


So, il, movimento, moto, sui, si, bellissima, occhio, all’insù, sono, sogno, lo,
ossessionava, quello, und, also, Mythos?

• un, ür, MINT-Berufen

• cura, restauro, arginato., gibt., perché, caurata, sich, auch, zu, nicht, richia-
mano, acque, ricerche, chiave, anti-Turchi., nur

Data: Mixed script: Greek–Russian

• ελληνική, γλώσσα, είναι, μία, από, τις, ινδοευρωπαϊκές, γλώσσες., Αποτελεί, το,
μοναδικό, μέλος, ενός, ανεξάρτητου, κλάδου, της, ινδοευρωπαϊκής, οικογένειας,
γλωσσών., Ανήκει, επίσης, στον, βαλκανικό, γλωσσικό, δεσμό., Στην, ελληνική,
γλώσσα, έχουμε, γραπτά, κείμενα, από, τον, 15ο, αιώνα, μέχρι, σήμερα.

• На, греческом, на, всех, его, существования, была, создана, богатейшая,
греческого, обязательным, всякого, образованного, большое, заимствований,
а, в, греческом, новое, время, (наряду, новых, научных, терминов, называемая,
международная, слова, в, основном, двумя, через

• Η, π.Χ.,языке, этапах, литература., В, Римской, империи, знание, языка,
считалось, для, человека., В, латинском, языке, присутствует, количество,
греческих, —, значительное, количество, латинских, и, романских, слов., В,
древнегреческий, язык, стал, с, латинским), источником, создания, и, технических,
(так, лексика)., В, русский, язык, греческие, проникали, путями,
—, международную, лексику, и, церковнославянский, язык.

Data: Mixed script: English–Greek

• is, biblical, is, will, is, without, self-benefit)., is, feelings, feelings, it, be, feeling,
being, high, is, by, his, is, by, will, mostly, sexual, ”intimate, well, refined, his,
definition:, is, initially, felt, with, it, beauty, within, beauty, itself., use, ”without,
helps, soul, beauty, spiritual, youthful, beauty, feel, suggesting, sensually,
spiritual, finding, its, like, finding, all, seek

• (ἀγάπη, (ἔρως

• Agápe, ”love:, brotherly, love, love, of, God, for, of, for, in, known, ”love, 1, 13,
throughout, New, brotherly, love, affection, good, love, love, given, or, not, person,
continues, love, (even, in, for, one’s, for, spouse, refer, love, of, content, or,
holding, one, in, unconditional, love, of, God, for, of, love, ”to, good, of, Éros,
”love, of, The, Modern, Greek, word, love.”, own, Although, eros, for, person,
contemplation, becomes, of, person, or, even, becomes, of, not, of, of, love, of,
word, mean, In, Symposium, work, on, subject, eros, knowledge, of, of, ”Form”,
of, erotic, –, even, love, non-corporeal, of, is, Lovers, philosophers, through, of,

• agápē), means, esp., charity;, the, man, and, man, God.”, Agape, used, the, passage,
as, the, chapter,”, Corinthians, and, described, there, and, the, Testament,
as, and, benevolence., Whether, the, returned, the, to, any, Agape, also, used,
ancient, texts, to, denote, children, and, the, a, and, was, also, used, to, to, a,
feast., It, can, also, described, as, the, regard., Agape, used, Christians, to, express,
the, children., type, was, further, explained, Thomas, Aquinas, as, the, another.”,
érōs), means, the, passion.”, ”erotas”, means, It, can, also, apply, to, dating, relationships,
as, as, marriage., Plato, a, an, appreciation, the, that, appreciation,
Plato, does, talk, physical, attraction, as, a, necessary, part, hence, the, the, platonic,
to, physical, attraction.”, the, the, most, famous, ancient, the, Plato, has,
Socrates, argue, that, the, recall, and, contributes, to, an, understanding, truth,
the, ideal, that, leads, us, humans, to, desire, thus, that, that, based, aspires, to,
the, plane, existence;, that, truth, just, any, truth, leads, to, transcendence., and,
are, inspired, to, truth, the, means, eros.

Data: Mixed script: English–Spanish–Arabic


• ،‫ دون‬،‫ واحدة‬،‫ بنظرة‬،‫ رسالتها‬،‫ تنقل‬،‫ ٔان‬،‫ ينبغي‬،‫ العلامة‬،‫ فٔان‬،‫ وعموما‬،‫ معين‬،‫ شيء‬،‫ عن‬،‫ يعبر‬،‫ الذي‬،‫ الرسم‬،‫ يعني‬،‫الرمز‬
،‫ استخدم‬،‫ من‬،‫ ٔاكثر‬،‫ ولكن‬،‫ العلامات‬،‫ ٔاستخدموا‬،‫ والٔاغريق‬،‫ المصريين‬،‫ قدماء‬،‫ ٔان‬،‫ المعروف‬،‫ من‬،‫ و‬،‫ كلمات‬،‫ لاية‬،‫الحاجة‬
‫ هم‬،‫العلامات‬
• ribbon, symbol, mourning., ribbon, mourning, El, un, y, un, en
• black, is, a, of, remembrance, or, Wearing, or, displaying, a, black, has, been, used,
for, remembrance, tragedies, or, as, a, political, statement., crespón, negro, o, lazo,
negro, es, símbolo, utilizado, por, personas, estados, sociedades, organizaciones,
representando, sentimiento, político-social, señal, de, duelo.
• A, POW/MIA

Data: Mixed script: English–Chinese


• The, Chinese, (simplified, traditional, Chinese:, invoked, motivational, speaking,
because, the, composed, characters, that, represent, linguists, have, criticized,
this, usage, because, the, component, (simplified, Chinese:, traditional, Chinese:,
has, other, besides, Chinese, certain, some, be, based, the, Chinese, that, the, The,
numbers, believed, have, because, their, similar, words, that, have, positive
• (不利)
• Western, can, and, Some, meanings, In, are, number, name, and, are, meanings,
names, meanings.
• 0, 6, 8, 9
• ”crisis”, is, auspicious, inauspicious, sounds, sound
• for, pinyin:, frequently, in, word, of, two, ”danger”, ”opportunity”., pronounced,
tradition, by, or, on, word, to., to
• 危机;, 危機;, wēijī), jī, 机;, 機), (吉利)

Data: Mixed script: Ukrainian–Russian

• й, Русь, морей., Россией, Белоруссией, Польшей, Словакией, Венгрией, Румынией

• але, 9—13, юге, Имеет, Молдавией.

• існували, інших

• Чёрного, Азовского, границу

• культури

• території, України, пунктом, української, и, сухопутную, и

• Віддавна, на, держави, скіфів, сарматів, готів, народів, відправним, державності,
На, водами

• та, вважається, Київська, століття., омывается, с

Data: Pali: abbha

• (nt.), nt., Sk., \”dark, Idg., cp., Gr., Lat., Sk., water, Gr., water]., dark), at, SnA, S,
at, It, Sn, (cp., SnA, Sn, S

• &, A, A), ., J, 251)., 1, 1064;, 249, 250;, 12)., 64;, 348)., 382).

• viz., 134., 101, 581, f., 289.

• 53, 295, 273, 487, 3, 617, 317, 348, 239., 687

• cloud\”;, also, cloud, cloudy, <smallcaps> ii. </smallcaps>, =, list, is, <smallcaps>
i. </smallcaps>, (°sama, <smallcaps> vi. </smallcaps>, (abbhaŋ, <smallcaps> iv.
</smallcaps>, (nīl°, As, Dhs, DhsA, (used, (=, clouds, cloud, (also, as

• m., adj.

• abhra, (mahiyā, VvA, acchādesi);, Pv, PvA, \”dull\”;, valāhaka);, Vv, valāhaka–sikhara

• <at>a)fro\\s</at>, froth, of, <superscript> 9 </superscript>, <superscript> s. </superscript>,
<superscript> 1 </superscript>

• later, scum, rain;, ambha, rain, a, Miln, (megho, Miln), nīla–megha, sense, expl,
, Dh

• *m̊bhro, <at>o)/mbros</at>, ambu, mass, to, obscure, moon–, <b>abbhaŋ, mahikā
</b>, <b>dhū-, marajo</b>, <b>Rāhu</b>, pabbata, rajo, <b>abbhā</b>, by,
perhaps, <b>abbhāmatta</b>, <br, /><b>–kūṭa</b>, or, summit, storm–cloud,
<b>–ghana</b>, <b>–paṭala</b>, mass, <b>–mutta</b>, from, abbhāmutta, <b>
–saŋvilāpa</b>,

• [Vedic, imber, Oir, (dense, Vin, in, things, that, sunshine, is, referred, mountain,
like, thunder–cloud);, the, point, thick, free, thundering

Data: Pali: abhijjhitar

• <smallcaps>i.</smallcaps>, v., l., <smallcaps>v.</smallcaps>

• abhijjhita, abhijjhātar, °itar), °itar, °ātar).,

• [n., ag., fr., med., M, 287, (T., =, A, 265

• in, function], one, who, covets

Data: Pali: ajja

• Ajja,&, Ajjā, (adv.), base, a3), diva),thus, Dh.326;, ajjā;, v., PvA.59) ;,
PvA.6,23;, phrase, ajjatagge, ajjato, agge(?), ajja-tagge,see, agga3),(adv.)
, the, 32,23., (Page

• ‹-›, –kālaṁ

• [Vedic, &, +, being, (see, (see, (read, as, (=, Mhvs, (=, +, Mhvs

• of, of, “on, “food”);

• and, an, old, not

• adya, adyā,a, dyā,a°, dyā, dyaus, day”], to-day,now, bahutaṁ, with, day,
–divasa, day

• demonstr., pron., Loc., this, Kern,Toev., s., Freq., or, from, this, onward,henceforth,
this, morning, present

• Sn.75,153,158,970,998;, J.I,279;, III,425, Pv.I,117, idāni, 15,64., in,
Vin.I,18;, D.I,85;, DA.I,235., [Link],180;, 10)

Data: Pali: gūhanā
• Pug.19.Cp.pari°.(Page
• Gūhanā,(f.), [abstr.fr.gūhati]=gūhanā
• 253),(q.v.)

Data: Pali: pacati


• Vin.IV,264;, N.S.II,225,PvA.10,14.–, D.I,52
• DA.I,159,where, 382)
• in
• at, &
• cook,pέpwn, cook,boil,roast
• Pacati,[Ved.pacati,Idg.*peqǔō,Av.pac-;, Obulg.peka, to, fry,roast,
Lith,kepū, ripe], to, fig.torment, purgatory,(trs.and, pacitvā, after, roasting,
ppr.pacanto, tormenting,Gen.pacato,(+Caus.pācayato), read, pacato, for,
paccato,by, pare, pp.pakka, Caus.pacāpeti, pāceti, Pass.paccati, to, roasted,
or, tormented
• bake,Gr.pέssw, intrs.):Niraye, (expld, daṇḍena, pīḷentassa).–, (q.v.
).
‹-›, (q.v.
).–, be, (q.v.).(Page

Normalized data
• pacati, peka, pέssw, pέpwn, pacitvā, ppr., pacanto, Gen., pacato, (+Caus., pācayato),
pacato, paccato, pare, pīḷentassa)., pp., pakka, Caus., pacāpeti, Pass., paccati
• *peqǔō, bake
• pac-;, 264;, 52, &, 382)
• 10,14.–, 159, –, <->, –
• fry, Niraye, I, I, by
• Av., Obulg., Gr., (trs., D., DA., (q.v.)., (q.v.)., (q.v.).
• [Ved., to, roast, kepū, cook, ripe], to, cook, roast, torment, purgatory, and, after,
roasting, tormenting, (expld, at, where, read, for, daṇḍena, pāceti, to, be, roasted,
or, tormented, (Page
• Pacati, Idg., Lith, boil, [Link], fig., in, intrs.):, in, [Link],225,PvA.

Data: Twitter 1 (Greek–English)

• BUSINESS, EXCELLENCE.

• Μόλις, ψήφισα, αυτή, τη, λύση, Internet, of, στο, διαγωνισμό

• Things, IT

Data: Twitter 2 (French–English)

• Keynote, “The, collective, of, science-publish, or, perish;, it, all, that, counts?”

• Demain, 18h, par

• #dhiha6, David

• @dhiparis, dynamics, is

Data: Twier 3 (French–English)

• #FWWC2015

• breuvages, go,

• Food, Edmonton, to, for, the

• in, waiting, #bilingualism

• and, are, ready, just, fans

Data: Twier 4 (English–Polish)

• comes, from, with, two, crates, of, strawberries, jackets, omg

• my, dad, poland, and, adidas

• back, żubrówka

Data: Twier 5 (Transliterated Amharic–English)

• (coffee

• bread). is, our

• Buna, dabo, naw
