Language Segmentation
Author: David A
Supervisors: Prof. Dr. Caroline S, Dr. Sven N
Abstract
Language segmentation consists in finding the boundaries where one language ends
and another language begins in a text written in more than one language. This is
important for all natural language processing tasks.
The problem can be solved by training language models on language data. However,
in the case of low- or no-resource languages, this is problematic. I therefore
investigate whether unsupervised methods perform better than supervised methods
when it is difficult or impossible to train supervised approaches. A special focus
is given to difficult texts, i.e. texts that are rather short (one sentence),
containing abbreviations, low-resource languages and non-standard language.
I compare three approaches: supervised n-gram language models, unsupervised
clustering and weakly supervised n-gram language model induction. I devised the
weakly supervised approach in order to deal with difficult text specifically. In
order to test the approach, I compiled a small corpus of different text types,
ranging from one-sentence texts to texts of about 300 words.
The weakly supervised language model induction approach works well on short and
difficult texts, outperforming the clustering algorithm and reaching scores in the
vicinity of the supervised approach. The results look promising, but there is room
for improvement and a more thorough investigation should be undertaken.
Acknowledgements
My thanks go to Professor Caroline Sporleder for sharing her knowledge with me, for
her inspiring ideas and for agreeing to supervise my Bachelor’s and Master’s Thesis
despite her busy schedule. It was also thanks to the topic she suggested for my
Bachelor’s Thesis that I met Jürgen Knauth and later was able to get a research
assistant position at the SeNeReKo project, collaborating closely with Jürgen.
Which brings me to the next person on the list. I would like to thank Jürgen Knauth
for the wonderful collaboration, for his patience, for his contagious enthusiasm, and
all the interesting conversations in passing that always lasted longer than intended.
I would like to thank Stephan Faber for his insightful comments when I couldn’t
see the wood for the trees, for his patience and optimism, for pushing me to go further
and to persevere.
I would also like to thank Julian Vaudroz for accompanying me throughout the
degree program. We both didn’t know what we were in for when we started, but we
persevered and it paid off. It wouldn’t have been the same without you.
Finally, I would like to thank all the people that volunteered to proofread my thesis
and all the people that helped me during the writing of this thesis. Unfortunately, I
cannot list everyone. You know who you are!
List of Figures
1 Out-of-place metric
2 Simple text illustration
3 Initial model creation
4 Initial model evaluation
5 Model update
6 Evaluation
7 New model creation
8 Multiple model evaluation
9 Updating relevant model
10 Multiple model evaluation 2
11 New model creation 2
12 Problematic text sample
13 Finding the most similar models
14 Merging most similar models
15 Word-Model assignment
16 Clustering preprocessor
17 WEKA: Cluster visualization
18 ELKI: Cluster visualization
19 Language model: Distribution 1
20 Language model: Distribution 2
21 Language model: Distribution 3
22 Alternating language structure
List of Tables
1 Training data: Size
2 Unambiguous encoding: distances
3 Simplified encoding: distances
4 N-Gram language model results: Latin script
5 N-Gram language model results: Mixed script
6 N-Gram language model results: Pali data
7 N-Gram language model results: Twitter data
8 Textcat results: Latin script
9 Textcat results: Mixed script
10 Textcat results: Pali data
11 Textcat results: Twitter data
12 Clustering results: Latin script
13 Clustering results: Mixed script
14 Clustering results: Pali data
15 Clustering results: Twitter data
16 Induction results: Latin script
17 Induction results: Mixed script
18 Induction results: Pali data
19 Induction results: Twitter data
20 ‘Twitter 3’: Textcat versus Gold clustering
21 ‘Twitter 4’: Textcat versus Gold clustering
List of Algorithms
1 N-gram numerical encoding
2 Model induction
3 Initial model creation
4 Max model and max score
5 Model merger
6 Distributional similarity calculation
Contents
1 Introduction
2 Related work
2.1 N-Grams and rank order statistics
2.2 N-Grams and maximum likelihood estimator
2.3 Trigrams and short words
2.4 N-Grams and clustering
2.5 Inclusion detection
2.6 Clustering and speech
2.7 Monolingual training data
2.8 Predictive suffix trees
3 Theory
3.1 Supervised language model
3.1.1 N-Gram models
3.1.2 Formal definition
3.1.3 Smoothing
3.2 Unsupervised clustering
3.3 Weakly supervised language model induction
4 Experimental setup
4.1 Data
4.2 Supervised language model
4.2.1 Implementation
4.2.2 Training phase
4.2.3 Application of the approach
4.2.4 Textcat and language segmentation
4.3 Unsupervised clustering
4.3.1 Preprocessing
4.3.2 Defining features
4.3.3 Mapping features to a common scale
4.3.4 The problem of unambiguous encoding
4.3.5 The clusterer
4.3.6 Evaluating clusterings
4.4 Weakly supervised language model induction
4.4.1 Distributional similarity
4.4.2 Evaluating results
4.4.3 Estimating the parameters
5 Results
5.1 N-Gram language model
5.2 Textcat
5.3 Clustering
5.4 Language model induction
6 Discussion
6.1 N-Gram language models
6.2 Textcat
6.3 Clustering
6.4 Language model induction
6.5 Scores
7 Conclusion
8 Appendix
8.1 Development data
8.1.1 Latin script data
8.1.2 Mixed script data
8.1.3 Twitter data
8.1.4 Pali dictionary data
8.2 Test data
8.2.1 Latin script data
8.2.2 Mixed script data
8.2.3 Twitter data
8.2.4 Pali dictionary data
8.3 Results
8.3.1 N-Gram Language Models
8.3.2 Textcat
8.3.3 Clustering
8.3.4 Language Model Induction
1 Introduction
Language segmentation and identification are important for all natural language pro-
cessing operations that are language-specific, such as taggers, parsers or machine
translation (Jain and Bhat, 2014; Zubiaga et al., 2014). Indeed, using “traditional” mono-
lingual natural language processing components on mixed language data leads to mis-
erable results (Jain and Bhat, 2014). Even if the results are not terrible, language identi-
fication and segmentation can improve the overall results. For example, by identifying
foreign language inclusions in an otherwise monolingual text, parser accuracy can be
increased (Alex et al., 2007).
One important point that has to be borne in mind is the difference between lan-
guage identification and language segmentation. Language identification is concerned
with recognizing the language at hand. It is possible to use language identification
for language segmentation. Indeed, by identifying the languages in a text, the seg-
mentation is implicitly obtained. Language segmentation on the other hand is only
concerned with identifying language boundaries. No claims about the languages in-
volved are made.
After giving an overview of related work and the different approaches that can be
taken for language segmentation, I will present the theory behind supervised methods
as well as unsupervised methods. Finally, I will introduce a weakly supervised method
for language segmentation that I developed.
After the theoretical part, I will present experiments done with the different
approaches, comparing their effectiveness on the task of language segmentation on
different text types. A special focus will be given to difficult text types, such as
short texts, texts containing under-resourced languages or texts containing a lot of
abbreviations or other non-standard features.
A big advantage of unsupervised methods is language independence. If the approach
used does not rely on language-specific details, the approach is more flexible
as no language resources have to be adapted for the method to work on other
languages. These advantages might be especially useful for under-resourced languages.
When there is no or insufficient data available to train a supervised language model,
an unsupervised approach might yield better results.
Another advantage is that unsupervised methods do not require prior training.
They are not dependent on training data and thus cannot be skewed by the data.
Indeed, supervised approaches that are trained on data are qualitatively tied to their
training data; different training data will, in all probability, yield different models.
This thesis aims at answering the question whether unsupervised language
segmentation approaches work better on difficult text types than supervised language
segmentation approaches.
2 Related work
2.1 N-Grams and rank order statistics
Cavnar and Trenkle (1994) use an n-gram language model for language identification
purposes. Their program ‘Textcat’ is intended to classify documents by language. The
system calculates n-grams for 1 ≤ n ≤ 5 from training data and orders the n-grams
by descending frequency, i.e. from the most frequent n-grams to the least frequent
n-grams. The numerical frequency data is then discarded and remains only implicitly
present in the rank order.
During training, the program calculates an n-gram profile consisting of these
n-gram lists for each category (i.e. language to classify).
New data is classified by first calculating the n-gram profile of the document and
then comparing this profile to the existing category profiles. The category with the
lowest difference score is taken as the category for the document.
The score they use for classification is called the out-of-place metric. For each
n-gram in the document n-gram profile, the corresponding n-gram in the category
profile is looked up and the absolute difference of ranks is taken as score. The sum
is calculated over all n-grams. More formally, the out-of-place metric $m_{oop}$ is
calculated as:
$$m_{oop} = \sum_{i=1}^{n} |r(x_i, d) - r(x_i, c)| \tag{1}$$
with $n$ the number of n-grams in the document profile, $x_i$ the $i$-th n-gram,
$r(x_i, d)$ the rank of the $i$-th n-gram in the document profile and $r(x_i, c)$ the
rank of the $i$-th n-gram in the category profile.
Figure 1 illustrates the out-of-place metric.
Figure 1: Out-of-place metric
In figure 1, the document profile has ‘ER’ as most frequent n-gram, at rank 1,
followed by ‘ING’ at rank 2, etc. The category profile does not contain the n-gram
‘ER’; in that case, an arbitrary fixed maximum value is assigned. The category profile
contains the n-gram ‘ING’ at rank 2, the same rank as in the document profile; the
difference is 0. The category profile contains the n-gram ‘AT’ at rank 1, while in the
document profile, it occurs at rank 3. The absolute difference is 2. The out-of-place
metric consists of the sum of all scores thus calculated.
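The rank comparison described above can be sketched in a few lines of Python. This is a minimal illustration and not the Textcat implementation: the function names, the `top_k` cut-off of 400 and the default penalty for missing n-grams are assumptions made for the example, and the profile is built over the raw string rather than over padded tokens.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=400):
    """Return the top_k most frequent character n-grams (1 <= n <= max_n),
    ordered from most to least frequent."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Sort by descending frequency; ties are broken alphabetically so that
    # the profile is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, cat_profile, max_penalty=None):
    """Sum of absolute rank differences between two profiles.

    N-grams missing from the category profile receive a fixed penalty
    (assumed here to default to the length of the category profile)."""
    if max_penalty is None:
        max_penalty = len(cat_profile)
    cat_rank = {gram: rank for rank, gram in enumerate(cat_profile)}
    score = 0
    for rank, gram in enumerate(doc_profile):
        if gram in cat_rank:
            score += abs(rank - cat_rank[gram])
        else:
            score += max_penalty
    return score
```

A document would then be assigned to the category whose profile yields the lowest `out_of_place` score.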
Cavnar and Trenkle (1994) collected 3713 Usenet texts with a cultural theme in
different languages. They filtered out non-monolingual texts and texts that had no
useful content for language classification. In the end, they had 3478 articles ranging
from a single line of text to 50 KB of text.
Their results indicated that length had no significant impact on the classification,
contrary to what they had expected. Also, they found that training the system with 400
n-grams yielded the best result with a precision of 99.8%.
They also showed that their approach could be used for subject classification of
texts in the same language with reasonable precision. This finding indicates that
language and domain are linked to a certain degree.
$$P(w_i \mid w_1, \ldots, w_{i-1}) = \frac{C(w_1, \ldots, w_i) + 1}{C(w_1, \ldots, w_{i-1}) + |V|} \tag{2}$$
with $C(w_1, \ldots, w_i)$ the number of times the n-gram $w_1, \ldots, w_i$ occurred,
$C(w_1, \ldots, w_{i-1})$ the number of times the (n − 1)-gram $w_1, \ldots, w_{i-1}$
occurred and $|V|$ the size of the vocabulary.
For a string $S$, the string is decomposed into n-grams and the log probability
$l_k$ is calculated as:
$$l_k = \sum_{w_1, \ldots, w_k \in S} C(w_1, \ldots, w_k) \log P(w_k \mid w_1, \ldots, w_{k-1}) \tag{3}$$
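Equations (2) and (3) can be illustrated with a short Python sketch over character n-grams. The function names and the choice of characters as units are assumptions made for this example; the sketch only mirrors the two formulas and is not the original implementation.

```python
import math
from collections import Counter


def train_counts(text, n):
    """Count character n-grams and (n-1)-grams in a training string."""
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    prefixes = Counter(text[i:i + n - 1] for i in range(len(text) - n + 2))
    return ngrams, prefixes


def add_one_prob(gram, ngrams, prefixes, vocab_size):
    """Equation (2): add-one smoothed probability of the last unit of gram."""
    return (ngrams[gram] + 1) / (prefixes[gram[:-1]] + vocab_size)


def log_prob(string, n, ngrams, prefixes, vocab_size):
    """Equation (3): count-weighted sum of log probabilities of the n-grams in string."""
    observed = Counter(string[i:i + n] for i in range(len(string) - n + 1))
    return sum(
        count * math.log(add_one_prob(gram, ngrams, prefixes, vocab_size))
        for gram, count in observed.items()
    )
```

To classify a string, one model per language would be trained with `train_counts`, and the language whose model yields the highest `log_prob` would be chosen.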
The results indicate that bigram models perform better for shorter strings and less
training data, while trigram models work better for longer strings and more training
data.
Dunning (1994) criticizes Cavnar and Trenkle (1994) for saying that their system
would be insensitive to the length of the string to be classified, as the shortest text they
classified was about 50 words. The system implemented by Dunning (1994) can classify
strings of 10 characters in length “moderately well”, while strings of 50 characters or
more are classified “very well”. Accuracies given vary from 92% for 20 bytes of training
data to 99.9% for 500 bytes of text.
$$P(w_i \mid c_i) = \frac{C(w_i)}{C(c_i)} \tag{5}$$
with $C(w_i)$ the count of the word $w_i$ and $C(c_i)$ the count of the cluster $c_i$.
The probability of a cluster given the preceding clusters is calculated using the
Maximum Likelihood Estimator
$$P(c_i \mid c_{i-2} c_{i-1}) = \frac{C(c_{i-2} c_{i-1} c_i)}{C(c_{i-2} c_{i-1})} \tag{6}$$
Gao et al. (2001) derive from this three ways of using clusters to augment language
models: predictive clustering (7), conditional clustering (8) and combined clustering
(9).
$$P(w_i \mid w_{i-2} w_{i-1}) = P(c_i \mid w_{i-2} w_{i-1}) \times P(w_i \mid w_{i-2} w_{i-1} c_i) \tag{7}$$
$$P(w_i \mid w_{i-2} w_{i-1}) = P(c_i \mid c_{i-2} c_{i-1}) \times P(w_i \mid c_{i-2} c_{i-1} c_i) \tag{9}$$
Similarly, Dreyfuss et al. (2007) use clustering to cluster words by their context in
order to improve trigram language models. In addition to Gao et al. (2001), they also
use information about the subject-verb and verb-object relations of the sentence.
They show that their model, using clustering, subject-verb information, verb-object
information, and the Porter stemmer, outperforms a traditional trigram model.
Carter (1994) clusters training sentences (i.e. the corpus) into subcorpora of similar
sentences and calculates separate language model parameters for each subcorpus in
order to capture contextual information. In contrast to other works, Carter (1994)
clusters sentences instead of single words (compare Pereira et al. (1993) and Ney et al.
(1994)). Carter (1994) shows that the subdivision into smaller clusters increases the
accuracy of bigram language models, but not trigram models.
by using inclusion detection when parsing a German text with a parser trained on the
TIGER corpus (Brants et al., 2002).
in discriminating between English and non-English text. They found that a
Conditional Random Field model augmented with Generalized Expectation criteria worked
best, yielding accuracies of 88% with as little as 10 words used for training.
Lui et al. (2014) consider the task as a multi-label classification task. They represent
a document as an n-gram distribution of byte sequences in a bag-of-words manner.
They report F-scores of 0.957 and 0.959. They note that similar languages will pose
problems when trying to identify a language, and solve this problem by identifying a
set of languages that most probably are correct instead of a single language.
One problem that these approaches all have is that they need to know the languages
that will occur in the test data (King and Abney, 2013; Lui et al., 2014).
synthetic multilingual text that switches language approximately every two sentences.
One important point that they note is that “too short segments do not enable reliable
discrimination between different models”. Therefore, they disallow switching models
after every word.
They report very good results on the language segmentation task (and on the
protein segmentation task). After 2000-3000 iterations of the Blahut-Arimoto algorithm,
the correct number of languages is identified and the segmentation is accurate up to a
few letters.
3 Theory
3.1 Supervised language model
3.1.1 N-Gram models
Among supervised language models, n-gram models are very popular (Gao et al., 2001).
An n-gram is a slice from the original string (Cavnar and Trenkle, 1994). These slices
can be contiguous or not. Non-contiguous n-grams are also called skip-grams (Guthrie
et al., 2006). In skip-grams, an additional parameter k indicates the maximum distance
that is allowed between units. In this parlance, contiguous n-grams can be regarded
as 0-skip-n-grams (Guthrie et al., 2006).
The following example demonstrates the difference between (traditional) n-grams
and skip-grams. Given the following sentence:
This is a sample sentence.
We can construct, for example, the following word k-skip-n-grams:
(0-skip-)2-grams: This is, is a, a sample, sample sentence
2-skip-2-grams: This is, This a, This sample, is a, is sample, is sentence, a sample,
a sentence, sample sentence
(0-skip-)3-grams: This is a, is a sample, a sample sentence
2-skip-3-grams: This is a, This is sample, This is sentence, This a sample, This a
sentence, This sample sentence, is a sample, is a sentence, is sample sentence, a
sample sentence
The results for 2-skip-2-grams do not include the skip-gram “This sentence”, as
the distance in words between these two words is 3, higher than the allowed k of 2. As
can be seen from this example, the number of skip-grams is more than two times higher
than the number of contiguous n-grams, and this trend continues the more skips are
allowed (Guthrie et al., 2006). Skip-grams thus mitigate the data sparseness problem
that contiguous n-grams incur as n increases.
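The example above can be reproduced with a small Python sketch. It assumes the definition in which at most k tokens are skipped in total within one skip-gram; the function name `skip_grams` is chosen for illustration only.

```python
from itertools import combinations


def skip_grams(tokens, n, k):
    """Return all k-skip-n-grams of a token list.

    A k-skip-n-gram consists of n tokens in their original order such that
    at most k tokens are skipped in total between the first and the last one.
    """
    grams = []
    for idx in combinations(range(len(tokens)), n):
        skipped = idx[-1] - idx[0] + 1 - n
        if skipped <= k:
            grams.append(tuple(tokens[i] for i in idx))
    return grams


tokens = "This is a sample sentence".split()
bigrams = skip_grams(tokens, 2, 0)         # contiguous 2-grams
skip_bigrams = skip_grams(tokens, 2, 2)    # 2-skip-2-grams
```

Enumerating all index combinations is only practical for short inputs such as single sentences; for longer texts, a sliding window of width n + k would be the more economical choice.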
Instead of using words as units for n-gram decomposition, we can also choose
characters. Each word is then decomposed into sequences of n characters. For example,
the word
model
can be decomposed into the 2-grams: mo, od, de, el. Often, the word to decompose
is padded with start and end tags in order to improve the model (Cavnar and Trenkle,
1994). If we pad the word with <w> and </w>, the 2-gram decomposition yields:
<w>m, mo, od, de, el, l</w>. The use of padding allows the model to capture details
about character distribution with regard to the start and end of words (Cavnar and
Trenkle, 1994). For example, in English the letter ‘y’ occurs more often at the end of
words than at the beginning of words, while the letter ‘w’ occurs mainly at the
beginning of words (Taylor, 2015). A non-padding model cannot capture this distinction,
while a padding model can.
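The character decomposition with and without padding can be sketched as follows. The function name is hypothetical, and the start and end tags are treated as single symbols, as in the example above.

```python
def char_ngrams(word, n, pad=True, start="<w>", end="</w>"):
    """Decompose a word into character n-grams.

    With pad=True, the start and end tags are added as single symbols
    before the word is sliced into n-grams.
    """
    symbols = list(word)
    if pad:
        symbols = [start] + symbols + [end]
    return ["".join(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]


unpadded = char_ngrams("model", 2, pad=False)  # mo, od, de, el
padded = char_ngrams("model", 2)               # <w>m, mo, od, de, el, l</w>
```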
One advantage of n-gram models is that the decomposition of a string into smaller
units reduces the impact of typing errors (Cavnar and Trenkle, 1994). Indeed, a typ-
ing error only affects a limited number of units (Cavnar and Trenkle, 1994). Due to
this property, n-gram models have been shown to be able to deal well with noisy text
(Cavnar and Trenkle, 1994).
$$P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \frac{C(w_{i-n+1}, \ldots, w_i)}{C(w_{i-n+1}, \ldots, w_{i-1})} \tag{11}$$
where $C(w_{i-n+1}, \ldots, w_i)$ represents the number of times the n-gram sequence
$w_{i-n+1}, \ldots, w_i$ occurred in the training corpus and
$C(w_{i-n+1}, \ldots, w_{i-1})$ represents the number of times the (n − 1)-gram sequence
$w_{i-n+1}, \ldots, w_{i-1}$ was seen in the training corpus.
3.1.3 Smoothing
The problem with MLE is that sequences not seen during training will have a
probability of zero. In order to avoid this problem, different smoothing techniques can
be used (Chen and Goodman, 1996). The simplest smoothing technique is additive
(Laplace) smoothing (Chen and Goodman, 1996). Let V be the vocabulary size (i.e. the
total number of unique words in the test corpus). The smoothed probability
$P_{Laplace}$ becomes:
$$P_{Laplace}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \frac{C(w_{i-n+1}, \ldots, w_i) + \lambda}{C(w_{i-n+1}, \ldots, w_{i-1}) + \lambda V} \tag{12}$$
with λ the smoothing factor. If we choose λ = 1, we speak of “add one” smoothing
(Jurafsky and Martin, 2000). In practice, λ < 1 is often chosen (Manning and Schütze,
1999).
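The effect of the smoothing factor can be illustrated with a minimal Python sketch of equations (11) and (12). The function names are chosen for this example; setting `lam=0.0` reduces the estimate to the unsmoothed MLE, and returning 0.0 for an unseen history in that case is a convention assumed here.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count all contiguous n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def additive_prob(ngram, ngram_counts, history_counts, vocab_size, lam=1.0):
    """Additively smoothed probability of the last token of ngram given its history.

    lam=1.0 corresponds to add-one smoothing, lam=0.0 to the unsmoothed MLE.
    """
    numerator = ngram_counts[ngram] + lam
    denominator = history_counts[ngram[:-1]] + lam * vocab_size
    if denominator == 0:
        # Unseen history without smoothing: the MLE is undefined; 0.0 is
        # returned by convention in this sketch.
        return 0.0
    return numerator / denominator


tokens = "a b a b a c".split()
bigrams = count_ngrams(tokens, 2)
unigrams = count_ngrams(tokens, 1)
p_unseen = additive_prob(("c", "a"), bigrams, unigrams, vocab_size=3, lam=0.5)
```

The bigram (c, a) never occurs in the toy data, so its MLE is zero, whereas the smoothed estimate remains positive.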
An important estimation is the Good-Turing estimation (Chen and Goodman, 1996).
While not directly a smoothing method, it estimates the frequency of a given
observation with
$$c^* = (c + 1) \frac{N_{c+1}}{N_c} \tag{13}$$
where $c$ is the number of times the observation was made, $N_c$ is the number of
observations that were made exactly $c$ times and $N_{c+1}$ the number of observations
that were made exactly $c + 1$ times. Thus, instead of using the actual count $c$, the
count is taken to be $c^*$ (Chen and Goodman, 1996).
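As a small worked illustration of equation (13), the adjusted counts can be computed from a table of raw counts as follows. The function name is hypothetical, and leaving c* undefined where no observation occurs c + 1 times is a simplification of this sketch; practical implementations usually smooth the frequency-of-frequency values first.

```python
from collections import Counter


def good_turing_counts(counts):
    """Map each observed count c to c* = (c + 1) * N_{c+1} / N_c.

    Counts for which N_{c+1} is zero are omitted, since the plain estimate
    is not usable there.
    """
    freq_of_freq = Counter(counts.values())
    adjusted = {}
    for c, n_c in freq_of_freq.items():
        n_next = freq_of_freq.get(c + 1, 0)
        if n_next > 0:
            adjusted[c] = (c + 1) * n_next / n_c
    return adjusted


# Three n-grams seen once, two seen twice, one seen three times.
toy_counts = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3}
adjusted = good_turing_counts(toy_counts)
```

For the toy data, N_1 = 3, N_2 = 2 and N_3 = 1, so a raw count of 1 is adjusted to 2 · 2/3 and a raw count of 2 to 3 · 1/2.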
Another way to avoid assigning probabilities of zero to unseen sequences is by
using back-off models. There are linear and non-linear back-off models. In non-linear
back-off models, if the original n-gram probability falls below a certain threshold value,
the probability is estimated by the next lowest n-gram model. Katz’s back-off model
(Katz, 1987) for instance calculates the probability $P_{bo}$ using the formula:
$$P_{bo}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \begin{cases} d_{w_{i-n+1}, \ldots, w_i} \dfrac{C(w_{i-n+1}, \ldots, w_i)}{C(w_{i-n+1}, \ldots, w_{i-1})} & \text{if } C(w_{i-n+1}, \ldots, w_i) > k \\ \alpha_{w_{i-n+1}, \ldots, w_{i-1}} P_{bo}(w_i \mid w_{i-n+2}, \ldots, w_{i-1}) & \text{otherwise} \end{cases} \tag{14}$$
with $d$ and $\alpha$ as smoothing parameters. The parameter $k$ is often chosen as
$k = 0$. This means that if the probability given a high-order n-gram model is zero, we
back off to the next lowest model. For trigram models, the formula becomes:
$$P_{bo}(w_i \mid w_{i-2}, w_{i-1}) = \begin{cases} P(w_i \mid w_{i-2}, w_{i-1}) & \text{if } C(w_{i-2}, w_{i-1}, w_i) > 0 \\ \alpha_1 P(w_i \mid w_{i-1}) & \text{if } C(w_{i-2}, w_{i-1}, w_i) = 0 \text{ and } C(w_{i-1}, w_i) > 0 \\ \alpha_2 P(w_i) & \text{otherwise} \end{cases} \tag{15}$$
In contrast, linear back-off models use an interpolated probability estimate by
combining multiple probability estimates and weighting each estimate. The probability
$P_{LI}$ for a trigram model is:
$$P_{LI}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_3 P(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_1 P(w_i) \tag{16}$$
with $\sum_i \lambda_i = 1$.
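The interpolation in equation (16) can be sketched in Python as follows. The function names and the weights (0.1, 0.3, 0.6) are arbitrary values chosen for illustration; in practice, the weights would be estimated on held-out data. The component probabilities are plain MLE estimates, with 0.0 used whenever the history was not observed.

```python
from collections import Counter


def train(tokens):
    """Collect unigram, bigram and trigram counts from a token list."""
    uni = Counter((t,) for t in tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri


def interpolated_prob(prev2, prev1, word, uni, bi, tri, lambdas=(0.1, 0.3, 0.6)):
    """Linear interpolation of unigram, bigram and trigram MLE estimates.

    lambdas = (lambda_1, lambda_2, lambda_3) must sum to 1.
    """
    l1, l2, l3 = lambdas
    total = sum(uni.values())
    p1 = uni[(word,)] / total if total else 0.0
    p2 = bi[(prev1, word)] / uni[(prev1,)] if uni[(prev1,)] else 0.0
    p3 = tri[(prev2, prev1, word)] / bi[(prev2, prev1)] if bi[(prev2, prev1)] else 0.0
    return l3 * p3 + l2 * p2 + l1 * p1


uni, bi, tri = train("a b c a b d".split())
# The trigram (d, a, b) was never seen, but the lower-order estimates
# still contribute probability mass.
p = interpolated_prob("d", "a", "b", uni, bi, tri)
```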
Clustering can be partitional or hierarchical (Yin et al., 2007). Partitional clustering
divides the initial objects into separate groups in one step, whereas hierarchical clus-
tering builds a hierarchy of objects by first grouping the most similar objects together
and then clustering the next level hierarchy with regard to the existing clusters (Yin
et al., 2007).
The clustering algorithm uses a distance metric to measure the distance between the
feature vectors of objects (Biemann, 2006). The distance metric defines the similarity
of objects based on the feature space in which the objects are represented (Jain et al.,
1999). There are different metrics available. A frequently chosen measure is the cosine
similarity, which compares two vectors by the cosine of the angle between them
(Biemann, 2006).
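For illustration, the cosine similarity of two feature vectors can be computed as in the following sketch. Returning 0.0 for a zero vector and deriving a distance as one minus the similarity are conventions assumed here, not prescribed by the sources cited above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Distance derived from the cosine similarity (0 for parallel vectors)."""
    return 1.0 - cosine_similarity(u, v)
```

Because only the angle matters, two vectors that differ merely in length, such as (1, 2) and (2, 4), are treated as identical.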
In order for a clustering algorithm to work, features that represent the object to be
clustered have to be defined (Jain et al., 1999). Features can be quantitative (e.g. word
length) or qualitative (e.g. word starts with a capital letter) (Jain et al., 1999).
Most clustering algorithms, e.g. k-means, need the number of clusters to generate
as input (Jain et al., 1999). The question of how best to choose this key number has
been addressed in depth by Dubes (1987).
Clustering can be soft or hard. In hard clustering, an object can belong to one
class only, while in soft clustering, an object can belong to one or more classes,
sometimes with different probabilities (Jain et al., 1999).
Figure 2: Simple text illustration
We take the first word and create a language model m1 from that word.
Figure 3: Initial model creation
We then evaluate the second word using the first language model. If the language
model score is high enough, we update the language model with the second word.
Figure 4: Initial model evaluation
Figure 5: Model update
If the score is below a certain threshold, the existing language model does not model
the word well enough and a new model is created.
Figure 6: Evaluation
Figure 7: New model creation
When there is more than one language model, each word is evaluated by every
language model, and the highest scoring model is updated, or a new model is created
if no language model models the word well enough.
Figure 8: Multiple model evaluation
Figure 9: Updating relevant model
Figure 10: Multiple model evaluation 2
Figure 11: New model creation 2
The last example shows that it is not necessarily the case that exactly one language
model is created per language; it often is the case that several language models are
created for one language.
At the beginning, the models are not very reliable, as they are based on only a few
words, but the more text is analyzed, the more reliable the models become.
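The generation procedure walked through above can be summarized in a short Python sketch. The scoring function (the share of a word's padded character bigrams already known to a model), the padding symbols and the threshold of 0.5 are placeholders assumed for illustration; they are not the scoring function or parameter values used in the experiments.

```python
from collections import Counter


def char_bigrams(word):
    """Character bigrams of a word padded with start and end markers."""
    padded = "^" + word + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def score(model, word):
    """Placeholder score: share of the word's bigrams already in the model."""
    grams = char_bigrams(word)
    return sum(1 for g in grams if model[g] > 0) / len(grams)


def forward_generation(words, threshold=0.5):
    """Create or update bigram models while reading the words left to right."""
    models = []
    for word in words:
        best, best_score = None, 0.0
        for model in models:
            s = score(model, word)
            if s > best_score:
                best, best_score = model, s
        if best is not None and best_score >= threshold:
            best.update(char_bigrams(word))               # update the best model
        else:
            models.append(Counter(char_bigrams(word)))    # create a new model
    return models


models = forward_generation(["abab", "baba", "xyxy"])
```

In this toy input, the second word shares enough bigrams with the first model to update it, while the third word shares none and triggers the creation of a second model.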
However, the approach is problematic in that the text structure itself influences
the language models created. If the text starts with a foreign language inclusion, as
illustrated in figure 12, the initial model might be too frail to recognize the following
words as being a different language, updating the first model with the second and third
word and so on. Thus, the approach would fail to recognize the foreign language
inclusion.
Figure 12: Problematic text sample
If we were to start from the end of the text and work towards the beginning, the
probability of having a relatively robust language model for the ‘blue’ language would
be high, and so, it would theoretically be easier to recognize the first word as not being
‘blue’.
Therefore, one induction step involves one forward generation and one backwards
generation. This yields two sets, the set of models from the forward generation
$F = \{f_1, f_2, \ldots, f_n\}$ and the set from the backwards generation
$B = \{b_1, b_2, \ldots, b_m\}$.
Then, from the two sets of models, the most similar models are selected. For this,
every model from F is compared to every model from B, as figure 13 shows. The most
similar models are then merged, as illustrated in figure 14. Indeed, if both the forward
and backwards generation yielded a similar language model, it is probable that the
model is correct.
Since even forward and backwards generation together cannot guarantee ideal results,
there is also the option to run the generation from a random position. This random
induction picks a random position in the text and runs one induction step from that
position, meaning one forward and one backwards generation. Finally, the most similar
models are merged as for the general generation.
Figure 13: Finding the most similar models
Figure 14: Merging most similar models
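The selection and merging of the most similar forward and backwards models can be sketched as follows. Models are represented as n-gram count tables, and cosine similarity over these counts is used as a stand-in similarity measure; both choices, as well as the function names, are assumptions of this sketch rather than the measure used in the experiments.

```python
import math
from collections import Counter


def similarity(a, b):
    """Cosine similarity of two n-gram count models (stand-in measure)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def merge_most_similar(forward_models, backward_models):
    """Compare every forward model with every backwards model and merge the
    most similar pair by adding their counts. Both lists must be non-empty."""
    best_f, best_b = max(
        ((f, b) for f in forward_models for b in backward_models),
        key=lambda pair: similarity(*pair),
    )
    return best_f + best_b


F = [Counter(ab=2, ba=1), Counter(xy=3)]
B = [Counter(xy=1, yx=1), Counter(ab=1)]
merged = merge_most_similar(F, B)
```

The pairwise comparison is quadratic in the number of models, which is unproblematic for the small model sets that arise from short texts.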
is only yields one probable language model, therefore the induction is repeated
with the difference that all probable models are taken into consideration as well. For
each word, if a probable model models the word well enough, no new model is created,
otherwise a new model is created.
At the end of the induction loop, the set of probable models P is examined. As long
as there are two models that have a similarity score below a certain threshold, the two
most similar models are merged.
Finally, after the language models have been induced, another pass is made over
the text and each word is assigned to the language model which yields the highest
score for that word, resulting in a word-to-model assignment as illustrated in figure
15.
[Figure 15: Word-to-model assignment of the words w1 to w10 to the models m1 to m6]
• Forward/Backwards threshold: Threshold for forward/backwards merging
These parameters can be adapted, in the hope that some parameter configurations
will work better on certain data sets than other configurations. Since the approach
has parameters that have to be learned from a development set, the approach is said to
be weakly supervised; the development set is not used to train any language specifics,
only for the estimation of the parameters of the approach.
4 Experimental setup
In this chapter I present experiments done using the approaches delineated in the
previous section in order to find out whether there are approaches that work better on
certain types of text.
The central hypothesis is that unsupervised language segmentation approaches are
more successful on difficult data. Difficult data is data for which there is not enough
training material to train a language model, or data which contains a lot of non-standard
language such as abbreviations.
First, I present the data used to test the language segmentation systems and elabo-
rate on the different aspects that had to be considered for the data compilation.
I then present two supervised language segmentation experiments using n-gram
language models and Textcat.
For unsupervised language segmentation, I will first present experiments using
clustering algorithms before presenting experiments using language model induction.
4.1 Data
In order to test the different language segmentation approaches, I compiled different
sets of test data. As I want to focus on short texts, most texts from the test corpus are
rather small, sometimes consisting of only one sentence. However, in order to test the
general applicability of the approach, the test corpus also contains larger text samples.
The test corpus can be subdivided into different sub-corpora:
• Latin-based: Texts consisting of languages using Latin-based scripts, such as
German, English, Finnish or Italian
• Mixed script: Texts consisting of languages using Latin-based scripts and lan-
guages using non-Latin-based scripts
• Pali dictionary data: Unstructured texts containing many different language in-
clusions such as Vedic Sanskrit, Sanskrit, Indogermanic reconstructions, Old Bul-
garian, Lithuanian, Greek, Latin, Old Irish, many abbreviations and references
to text passages
As every outcome has to be manually checked, the test corpus is rather small. Every
category consists of five texts. Each text consists of two or three languages with the
exception of the Pali dictionary data, which often contains inclusions from many different
languages in the etymological explanations.
For each text, I also created a gold standard version with the expected clusters. In
some cases it is not clear how to cluster certain objects. In that case, I use a clustering
that makes sense to me, but this need not mean that it is the correct or only possible
clustering.
For the parameter estimation of the language model induction approach, I also
compiled a set of development data. All texts can be found in the appendix under 8.1
and 8.2.
Language     Size in MB
Amharic               9
Arabic              747
Chinese            1005
English            2097
Finnish             570
French             2097
German             2097
Greek               464
Italian            2097
Polish             2097
Russian            2097
Spanish            2097
Turkish             386
Ukrainian          1456
and there are typically only small inclusions of different languages in a dictionary en-
try; as such, it would not have made sense to train a language model just to recognize
a single word. Another reason for not using the Pali dictionary data languages is that
sometimes it is not possible to find data for a language, e.g. Old Bulgarian or recon-
structed Indogermanic. In some cases, it would have been conceivable to train models
on similar languages, but again, the effort of training a model is disproportionately
high compared to the (uncertain) result of recognizing a single inclusion. Instead, an
additional catch-all language model is used to capture words that do not seem to belong
to a trained model.
The training data consists of Wikipedia dumps from the months June and July 2015;
a dump is a copy of the whole encyclopedia for a given language. Due to the difference
in size of the Wikipedia of the different languages, I chose the full dump for languages
with less than 3 GB of compressed data and, for all other languages, limited the amount
of data to a maximum of 3 GB of compressed data.
The Wikipedia data was processed using the Wikipedia Extractor³ version 2.8 in
order to extract the textual content from the article pages. Indeed, the Wikipedia pages
are written using the MediaWiki Markup Language⁴. While this markup is useful for
meta-data annotation and cross-referencing, the encoded information is superfluous
for language model training and has to be removed before training a model on the
data. Table 1 shows the size of the training data per language after text extraction.
³ [Link]
⁴ [Link]
As the test data only contains transliterated Amharic text, the Wikipedia data, written
in the Ge’ez script, had to be transliterated. The text was transliterated according
to the EAE transliteration scheme by the Encyclopaedia Aethiopica.
As the test data contains transliterated Greek, the Greek data was used once as-is
and once transliterated according to the ELOT (Hellenic Organization for Standardiza-
tion) transliteration scheme for Modern monotonic Greek.
It should be borne in mind that the training data influences the quality and accuracy
of the model. Furthermore, a model might work well on certain text types and less well
on other text types. It is not possible to train a perfect, universal model.
to be integrated into bigger programs and provides an application programming inter-
face (API) that allows the provided algorithms to be accessed in a programmatic way,
i.e. from inside a program.
4.3.1 Preprocessing
However, in order for the clustering algorithms to work, the document to segment has
to be preprocessed in a number of ways, as shown in figure 16.
[Figure 16: Preprocessing flowchart. Read input; tokenize input; while there is a token: normalize it, remove tags and, if the result is not empty, extract features]
First of all, the document has to be read in by the program. This step is
straightforward.
The document then has to be tokenized. Tokenization is not trivial and depends on
the definition of a ‘word’. For this task I have used a whitespace tokenizer that defines
a word as a continuous sequence of character literals separated by one or more whitespace
characters. While it can be objected that tokenization fails for scripts that don’t use
whitespace to separate words, such as Chinese, this is not too big a concern. Indeed,
if a continuous block of Chinese characters is treated as one word, it is likely to
be clustered separately due to the difference in ‘word’ length and the different character
set. If, however, a document contains two scripts that do not separate words by
whitespace, the approach fails completely. It is beyond the scope of this thesis, and possibly
of any thesis, to implement a universal tokenizer that works regardless of language
without prior knowledge about the languages at hand.
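The whitespace tokenization described above can be illustrated with a minimal Python sketch. It is not the implementation used for the experiments; the function name `whitespace_tokenize` is a hypothetical helper chosen for this example.

```python
def whitespace_tokenize(text: str) -> list[str]:
    """Split text into tokens at runs of one or more whitespace characters."""
    # str.split() without arguments splits on any run of whitespace
    # and never returns empty strings.
    return text.split()


# Punctuation stays attached to the token; it is only removed later,
# during normalization.
print(whitespace_tokenize("Pacati, [Ved. pacati,  Av.\tpac-;"))

# A block of Chinese characters without whitespace remains a single token.
print(whitespace_tokenize("I like 北京烤鸭 very much"))
```

As the second call shows, a script without whitespace yields one long token, which is the behaviour the paragraph above relies on.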
Each token is then normalized. Normalization of a non-Latin-based input (e.g. Ara-
bic or Cyrillic script) returns the input without modification. Otherwise, the following
modifications are made, if applicable:
• remove punctuation
1. word length: the length of the word in characters
14. basic latin letter ratio: number of latin letters divided by the word length
15. max consonant cluster: the longest consonant cluster size in characters
18. directionality: what directionality does the first character of the word have?
20. general type: what is the general type of the first character of the word?
The last two features are based on the Java Character class. This class provides
methods to check for specific implementation-based properties of characters.
While most features are rather self-explanatory, a few require further explanation.
For the n-grams, the number of n-grams is restricted so as to keep the resulting vectors
the same size. This is important because the clustering algorithm considers one
data column as one feature, and having vectors of different length would disrupt this
precondition. Implementing the comparison of vectors of different lengths, or rather
of vectors containing vectors as features, would have been possible, but rather
time-consuming. If a word is too short to generate the required number of n-grams, only
the possible n-grams are generated and all other positions filled with 0.
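The fixed-size n-gram features can be sketched as follows in Python. This is an illustration under assumptions: the function name `ngram_features` and its parameters are hypothetical, the choice to keep the first `count` n-grams of a word is an assumption (the text only states that the number is restricted), and the numerical encoding is the code-point sum of Algorithm 1.

```python
def ngram_features(word: str, n: int, count: int) -> list[int]:
    """Encode up to `count` character n-grams of `word`, padded with 0."""
    # All n-grams of the word; empty if the word is shorter than n.
    ngrams = [word[i:i + n] for i in range(len(word) - n + 1)]
    # Assumption: keep the first `count` n-grams.
    ngrams = ngrams[:count]
    # Numerical encoding as in Algorithm 1: sum of the code points.
    encoded = [sum(ord(ch) for ch in gram) for gram in ngrams]
    # Fill the remaining positions with 0 so every vector has `count` entries.
    return encoded + [0] * (count - len(encoded))


print(ngram_features("man", 2, 4))           # two bigrams, two padding zeros
print(ngram_features("segmentation", 2, 4))  # truncated to four bigrams
```

Every word thus contributes exactly `count` columns per n-gram order, which is the precondition the clustering algorithm needs.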
The ‘latin’ features check whether the word consists only of the basic latin letters
A-Z and a-z (‘basic’) while the ‘extended’ feature also covers letters derived from the
latin letters (e.g. ë, ç, ṃ, ñ).
Non-words are defined as anything not consisting of letters, such as punctuation
marks or digits.
Directionality indicates in which direction a character should be written. While the
actual list is much more exhaustive, this property basically indicates whether the character
is written from left to right or from right to left.⁵
BMP stands for Basic Multilingual Plane and refers to an encoding unit known as a
plane, which consists of 2^16 = 65536 codepoints (i.e. encoding slots for characters)
(The Unicode Consortium, 2014). The BMP is the first plane, covering the codepoints
U+0000 to U+FFFF (The Unicode Consortium, 2014). While it is not important to understand
the technical details fully, it is interesting to note that most characters are
covered by the BMP, including Chinese, Japanese and Korean characters (The Unicode
Consortium, 2014). The next plane, called Supplementary Multilingual Plane or Plane
1, contains historic scripts such as Egyptian hieroglyphs and cuneiform scripts, but also
musical notation, game symbols and various other scripts and symbols (The Unicode
Consortium, 2014). There are 17 planes in total (The Unicode Consortium, 2014).
The last feature in the list, General Type, is also an implementation-related property.
Type can be, for example⁵, END_PUNCTUATION, LETTER_NUMBER or
MATH_SYMBOL. These constants are represented as numbers internally, which are
taken as feature for the clustering algorithm.
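For readers without a Java environment, comparable character properties can be inspected with Python's standard `unicodedata` module. This is only an analogue and an assumption of this sketch: Python reports Unicode bidirectional classes and general categories as short strings (e.g. 'L', 'AL', 'Pe'), whereas the Java Character class exposes numerical constants, so the values are not interchangeable. The function name `first_char_features` and the BMP membership check are illustrative.

```python
import unicodedata


def first_char_features(word: str) -> dict:
    """Unicode properties of the first character of a non-empty word."""
    ch = word[0]
    return {
        # Bidirectional class, e.g. 'L' (left to right), 'R' or 'AL' (right to left).
        "directionality": unicodedata.bidirectional(ch),
        # General category, e.g. 'Ll' (lowercase letter), 'Pe' (closing punctuation).
        "general_category": unicodedata.category(ch),
        # The BMP covers the code points U+0000 to U+FFFF.
        "in_bmp": ord(ch) <= 0xFFFF,
    }


print(first_char_features("word"))
print(first_char_features("بيت"))         # Arabic, written right to left
print(first_char_features("\U00012000"))  # cuneiform sign, outside the BMP
```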
• Java specific features (18,20) take the underlying numerical value as feature
⁵ The full list can be found in the documentation of the Java Character class:
http://[Link]/javase/7/docs/api/java/lang/[Link]
Algorithm 1 N-gram numerical encoding
1: function (word)
2:     sum ← 0
3:     for character in word do
4:         value ← code-point of character
5:         sum ← sum + value
6:     end for
7:     return sum
8: end function
While algorithm 1 does not encode n-grams in an unambiguous way (“en” and “ne”
are both encoded as 211), it provides a sufficiently good encoding.
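Algorithm 1 translates directly into a few lines of Python. The function name `encode_ngram` is chosen for this illustration only.

```python
def encode_ngram(ngram: str) -> int:
    """Encode an n-gram as the sum of the code points of its characters."""
    total = 0
    for character in ngram:
        total += ord(character)  # ord() returns the Unicode code point
    return total


# The encoding ignores character order, hence the ambiguity noted above.
print(encode_ngram("en"), encode_ngram("ne"))
```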
\[ x \geq |A| \tag{20} \]
\[ y \geq |A|^2 \tag{21} \]
If we take for example the English alphabet with 26 lowercase and 26 uppercase
letters, not counting punctuation, digits and other characters, it has to be true that
x ≥ 52 and y ≥ 2704. The problem is that we cannot know in advance what size
the alphabet will be. If we have English and German texts, the size can be estimated
around 60. However, if we have English, Russian and Arabic text, the size drastically
increases. We could choose any two very big numbers, but if we want to guarantee
our encoding to be unambiguous, we run the risk of ending up with numbers too big
to be represented efficiently.
In this encoding scheme, distance is skewed: changes to the first character result
in linear distance. ‘man’ and ‘nan’ have a distance of 1, because ‘m’ and ‘n’ have a
distance of 1. ‘man’ and ‘lan’ have a distance of 2, etc. Changes to the second character
are multiplied by x. ‘man’ and ‘men’ have a distance of x ∗ (distance(a, e)) = 4 ∗ x.
Changes to the third character are scaled by y. For any sufficiently big x and y, the
distances are too skewed to be used for automatic cluster analysis. Let us consider the
following example with only two characters for simplicity. For this example, let us
assume x = 1373.
      na    ma    ne    me
na     0     1  5492  5491
ma     1     0  5493  5492
ne  5492  5493     0     1
me  5491  5492     1     0

    na  ma  ne  me
na   0   1   4   3
ma   1   0   5   4
ne   4   5   0   1
me   3   4   1   0
While this encoding is not unambiguous, it is considered sufficiently good for our
purposes.
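The two distance tables can be recomputed with a short Python sketch. It assumes that the positional scheme weights the first character by 1, the second by x and the third by y, as the distance discussion above suggests, and that the second table corresponds to the code-point sum of Algorithm 1. The function names are hypothetical.

```python
def encode_positional(ngram: str, weights: tuple) -> int:
    """Weighted sum of code points; weights (1, x, y) as assumed above."""
    return sum(w * ord(ch) for w, ch in zip(weights, ngram))


def encode_sum(ngram: str) -> int:
    """Unweighted sum of code points (Algorithm 1)."""
    return sum(ord(ch) for ch in ngram)


def distance_table(words: list, encode) -> list:
    """Absolute difference of the encodings for every pair of words."""
    return [[abs(encode(a) - encode(b)) for b in words] for a in words]


words = ["na", "ma", "ne", "me"]
x = 1373

for row in distance_table(words, lambda w: encode_positional(w, (1, x))):
    print(row)
for row in distance_table(words, encode_sum):
    print(row)
```

With the positional weights, a change in the second character dominates every distance, whereas the plain sum keeps all four bigrams close together.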
4.3.5 The clusterer
Most clustering algorithms such as k-means need to be passed the number of clusters to
generate. As we want to work as flexibly as possible, I ignored all algorithms that need
the number of clusters before clustering. In contrast, the x-means algorithm (Pelleg
and Moore, 2000) estimates the number of clusters to generate itself. This algorithm
has been chosen to perform the language clustering tasks.
While WEKA and ELKI offer a graphical user interface and various graphical rep-
resentations of the results, the output is not easily interpretable. Indeed, we can get a
visualization of a clustering operation as shown in figures 17 (WEKA) and 18 (ELKI).
However, all data points have to be manually checked by either clicking each point
in order to get additional information about that data point (WEKA) or by hovering
over the data points after having selected the Object Label Tooltip option (ELKI). Figure
18 shows the information for the lowest orange rectangle data point in the ELKI
visualization.
Figure 18: ELKI: Cluster visualization
Therefore, I have decided to embed the x-means clustering algorithm into a custom
framework. Originally part of the WEKA algorithms, the x-means algorithm has been
integrated into a Java program via the JavaML library. The framework takes an input
file, constructs the aforementioned feature vectors from the input, performs normalization,
passes the calculated feature vectors to the clustering algorithm and displays
the results in a text-based, easily interpretable manner.
Preliminary analyses have shown that the first clustering result often is not
discriminating enough. Hence, I perform a first clustering analysis, followed by a second
clustering analysis on the clusters obtained from the first analysis.
pairs (Wagner and Wagner, 2007).
Let us consider the clustering C = {C1, ..., Ck}. C is a set of non-empty disjoint
clusters C1, ..., Ck. Let us consider the reference clustering C′ = {C′1, ..., C′l}. We
define the following sets.
• S11: set of pairs that are in the same cluster in C and in the same cluster in C′
• S00: set of pairs that are in different clusters in C and in different clusters in C′
• S10: set of pairs that are in the same cluster in C and in different clusters in C′
• S01: set of pairs that are in different clusters in C and in the same cluster in C′
Let nij = |Sij|, with i, j ∈ {0, 1}, be the size of a given set Sij.
The Rand Index is defined as
\[ RI = \frac{n_{11} + n_{00}}{n_{11} + n_{10} + n_{01} + n_{00}} \tag{22} \]
The Rand Index measures the accuracy of the clustering given a reference partition
(Wagner and Wagner, 2007). However, it is criticized for being highly dependent on
the number of clusters (Wagner and Wagner, 2007).
The Jaccard Index measures the similarity of sets. It is similar to the Rand Index,
but it disregards S00, the set of pairs that are clustered into different clusters in C and
C′ (Wagner and Wagner, 2007). It is calculated as
\[ J = \frac{n_{11}}{n_{11} + n_{10} + n_{01}} \tag{23} \]
The Fowlkes-Mallows Index is the geometric mean of precision and recall over pairs.
It is calculated as
\[ FM = \frac{n_{11}}{\sqrt{(n_{11} + n_{10})(n_{11} + n_{01})}} \tag{24} \]
The Fowlkes-Mallows Index has the undesired property of yielding high values
when the number of clusters is small (Wagner and Wagner, 2007).
Finally, I will indicate the F-Score. According to Manning et al. (2008), in the context
of clustering evaluation the F(β) score is defined as
\[ F(\beta) = \frac{(\beta^2 + 1) \cdot P \cdot R}{\beta^2 P + R} \tag{25} \]
with precision P and recall R defined as
\[ P = \frac{n_{11}}{n_{11} + n_{10}} \tag{26} \]
\[ R = \frac{n_{11}}{n_{11} + n_{01}} \tag{27} \]
By varying β, it is possible to give more weight to either precision (β < 1) or recall
(β > 1) (Manning et al., 2008). As I value recall higher than precision, I will indicate F1
(β = 1) and F5 (β = 5) scores. Indeed, I want to penalize the algorithm for clustering
together pairs that are separate in the gold standard while not penalizing the algorithm
for splitting pairs that are together in the gold standard.
All measures of similarity lie in the interval [0, 1], with 0 being most dissimilar and 1
being identical. As there is no ultimate measure and all measures of similarity have their
drawbacks (Wagner and Wagner, 2007), all measures will be indicated in the results
section.
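The pair-counting measures of equations (22) to (27) can be sketched in Python as follows. This is an illustration of the formulas as stated, not the evaluation code used for the results; the function names are hypothetical, clusterings are assumed to be given as one cluster label per word, and `None` stands in for ‘n/a’ whenever a division by zero would occur.

```python
from itertools import combinations
from math import sqrt


def pair_counts(predicted: list, gold: list) -> tuple:
    """Return (n11, n10, n01, n00) over all pairs of items."""
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(range(len(predicted)), 2):
        same_pred = predicted[i] == predicted[j]
        same_gold = gold[i] == gold[j]
        if same_pred and same_gold:
            n11 += 1
        elif same_pred:
            n10 += 1
        elif same_gold:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def safe_div(numerator, denominator):
    """Division that yields None instead of raising on a zero denominator."""
    return numerator / denominator if denominator else None


def scores(predicted: list, gold: list, beta: float = 1.0) -> dict:
    n11, n10, n01, n00 = pair_counts(predicted, gold)
    p = safe_div(n11, n11 + n10)  # equation (26)
    r = safe_div(n11, n11 + n01)  # equation (27)
    if p is None or r is None:
        f = None
    else:
        f = safe_div((beta ** 2 + 1) * p * r, beta ** 2 * p + r)  # equation (25)
    return {
        "rand": safe_div(n11 + n00, n11 + n10 + n01 + n00),                 # (22)
        "jaccard": safe_div(n11, n11 + n10 + n01),                          # (23)
        "fowlkes_mallows": safe_div(n11, sqrt((n11 + n10) * (n11 + n01))),  # (24)
        "f": f,
    }


gold = ["pali", "pali", "pali", "english", "english"]
print(scores(["a", "a", "a", "a", "a"], gold))  # everything in one cluster
print(scores(["a", "b", "c", "d", "e"], gold))  # every word in its own cluster
```

The second call shows how undefined values arise: with every word in its own cluster, n11 and n10 are both zero, so precision and the measures derived from it cannot be computed.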
First of all, an initial language model is created. For each word, the maximum model
and maximum score is calculated. These values correspond to the language model that
yielded the highest probability for the word in question, and the associated probability.
If the score falls below a threshold t (i.e. none of the existing language models model
the word well enough), a new language model is created on the basis of the word and
added to the list of language models. Otherwise, the top scoring language model is
updated with the word in question.
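One forward generation can be sketched in Python as follows. The sketch makes several simplifying assumptions that are not taken from this thesis: a language model is reduced to a `Counter` of character unigrams (the actual models also contain bi- and trigrams), the hypothetical `score` function is the mean relative frequency of the word's characters in the model, and the text is assumed to be a non-empty list of words.

```python
from collections import Counter


def score(model: Counter, word: str) -> float:
    """Hypothetical score: mean relative frequency of the word's characters."""
    total = sum(model.values())
    if total == 0 or not word:
        return 0.0
    return sum(model[ch] / total for ch in word) / len(word)


def max_model(models: list, word: str) -> tuple:
    """Return the best-scoring model for the word and its score."""
    best = max(models, key=lambda m: score(m, word))
    return best, score(best, word)


def induce(words: list, t: float) -> list:
    """One generation over `words`; t is the threshold for creating a model."""
    models = [Counter(words[0])]          # initial model from the first word
    for word in words[1:]:
        best, best_score = max_model(models, word)
        if best_score < t:
            models.append(Counter(word))  # no model fits: create a new one
        else:
            best.update(word)             # extend the top-scoring model
    return models


forward = induce(["aaa", "aab", "xyz", "xxy"], t=0.2)
print(len(forward), forward)
```

Under the same assumptions, a backwards generation corresponds to calling `induce(list(reversed(words)), t)`.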
As the text structure itself influences the quality of the induced models, the language
model induction is run i times (i ≥ 1), with one iteration consisting of two
induction steps, once forward and once backward, and j times from a random position
(j ≥ 0). The initial model creation thus either picks the first word of the text (as shown
in algorithm 3 line 2), or the last word of the text, or a random word.
Algorithm 4 returns both the max model and the max score wrapped as a custom
object. The individual values can then be read as necessary.
After the models have been induced, the most similar models are merged based
on distributional similarity. Distributional similarity is calculated as explained below.
This merging step only merges one model from the forward induction group with one
model from the backward induction group. The resulting model is added to the set of
probable (“silver”) models.
Merging is performed according to algorithm 5. The merging algorithm only retains
the common set of unigrams from both models, and all resulting bi- and trigrams,
excluding any bi- and trigrams that contain characters that occur only in one of the
models. The values for the resulting language model are calculated according to one
of four different merge modes.
The merge modes are:
Algorithm 5 Model merger
1: function (model1, model2, mode)
2:     merged ← ∅
3:     for unigram u1 in model1.unigrams do
4:         for unigram u2 in model2.unigrams do
5:             if u1 = u2 then
6:                 v1 ← f(u1)                  ▷ f(u1) is the frequency of u1
7:                 v2 ← f(u2)
8:                 value ← mode(v1, v2)
9:                 unigram ← u1                ▷ or u2, since both are equal
10:                merged ← (unigram, value)
11:            else
12:                exclude ← u1
13:                exclude ← u2
14:            end if
15:        end for
16:    end for
17:    for all bigrams b in model1 and model2 do
18:        if not exclude contains any char in b then
19:            v1 ← f(b, model1) or 0          ▷ frequency of b in model1
20:                                            ▷ or 0 if it does not exist
21:            v2 ← f(b, model2) or 0
22:            value ← mode(v1, v2)
23:            merged ← (b, value)
24:        end if
25:    end for
26:    for all trigrams t in model1 and model2 do
27:        if not exclude contains any char in t then
28:            v1 ← f(t, model1) or 0
29:            v2 ← f(t, model2) or 0
30:            value ← mode(v1, v2)
31:            merged ← (t, value)
32:        end if
33:    end for
34:    return merged
35: end function
• MAX: use the maximum value (max(v1 , v2 ))
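The merging procedure can be sketched in Python as follows. The sketch follows the prose description (only unigrams common to both models are retained, and bi- and trigrams containing a character that occurs in only one model are excluded). The model representation, a dictionary mapping the n-gram order to a `Counter`, and the helper `build_model` are assumptions made for this illustration; `mode` defaults to the MAX merge mode.

```python
from collections import Counter


def build_model(text: str) -> dict:
    """Hypothetical helper: character uni-, bi- and trigram frequencies."""
    return {
        n: Counter(text[i:i + n] for i in range(len(text) - n + 1))
        for n in (1, 2, 3)
    }


def merge_models(model1: dict, model2: dict, mode=max) -> dict:
    """Merge two models; `mode` combines the two frequencies of an n-gram."""
    chars1, chars2 = set(model1[1]), set(model2[1])
    common = chars1 & chars2   # unigrams present in both models
    exclude = chars1 ^ chars2  # characters present in only one model
    merged = {1: Counter(), 2: Counter(), 3: Counter()}
    for u in common:
        merged[1][u] = mode(model1[1][u], model2[1][u])
    for n in (2, 3):
        for gram in set(model1[n]) | set(model2[n]):
            if not any(ch in exclude for ch in gram):
                # A Counter returns 0 for n-grams missing in one of the models.
                merged[n][gram] = mode(model1[n][gram], model2[n][gram])
    return merged


merged = merge_models(build_model("abab"), build_model("abc"))
print(merged)
```

Any other function of two frequencies can be passed as `mode` in the same way.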
If the random iteration count j > 0, a random word is chosen and the induction
is run once forward and once backward starting from this position. Then, the most
similar models from each set are merged and added to the set of probable models.
It should be noted that setting the parameter j > 0 will make the algorithm
non-deterministic.
The model induction is then repeated while the iteration count i has not been
reached or until no more models are induced, with the difference that for each word,
each probable model is first consulted. If any of the probable models yields a score
higher than the threshold value t, it is assumed that the word is already well repre-
sented by one of the probable models and no models are induced for this word. If the
score falls below the threshold value t, induction is run as described.
At the end of the induction loop, all probable models are checked against each other.
While there are two models that have a similarity below the silver threshold value s,
the two models are merged and added to the set of very probable (“gold”) models.
If the set of probable models is not empty after this merging step, all remaining
probable models are added to the set of very probable models.
In the second stage, the text is segmented according to the induced “gold” models.
For each word, the language model with the highest probability for the word is chosen
as that word’s hypothetical language model.
9: q ← |v1 − v2| / (v1 + v2)
with f(c) returning the frequency of the character c. The number 2 in (2 − q) in line
10 can be explained as follows: q expresses the dissimilarity of the models with regard
to a unigram distribution with 0 ≤ q ≤ 1, hence (1 − q) expresses the similarity. To
this, we add 1, as we increase similarity by 1 due to the match; we augment the simple
increase of 1 by the similarity of the distribution.
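The contribution of matching unigrams to the similarity score can be sketched in Python. Only the computation of q and the increment (2 − q) are taken from the text; that unmatched characters contribute nothing, that no normalization is applied and that f returns raw counts are assumptions of this sketch, as is the function name `unigram_similarity`.

```python
from collections import Counter


def unigram_similarity(model1: Counter, model2: Counter) -> float:
    """Sum of (2 - q) over all characters that occur in both models."""
    similarity = 0.0
    for ch in model1.keys() & model2.keys():
        v1, v2 = model1[ch], model2[ch]   # f(c) in each model, both > 0
        q = abs(v1 - v2) / (v1 + v2)      # dissimilarity of the frequencies
        similarity += 2 - q               # 1 for the match plus (1 - q)
    return similarity


print(unigram_similarity(Counter("aab"), Counter("abc")))
print(unigram_similarity(Counter("ab"), Counter("xy")))
```

Each shared character thus adds between 1 and 2 to the score, depending on how closely its frequencies agree.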
4.4.2 Evaluating results
The results of this approach can be interpreted as clusters, where each language model
represents one cluster core and all words assigned to that model make up that cluster.
Evaluation will hence be analogous to the evaluation of the clustering approach.
5 Results
‘Baseline’ indicates the measurement where all words have been put into one cluster,
measured against the gold standard. For ‘Baseline 2’, every word has been put into
its own cluster and this clustering is evaluated against the gold standard. The column
‘F1’ stands for the F1 score and the ‘F5’ column stands for the F5 score.
If any of the ‘runs’ yields a higher score than any of the baseline values, the maximum
score is indicated in bold. If a field contains ‘n/a’, this means that the value
could not be calculated (most often because a division by zero would have
occurred).
Rand Jaccard Fowlkes-Mallows F1 F5
Greek–Russian
Baseline 0.5578 0.5578 0.7468 0.7161 0.5674
Baseline 2 0.4440 0.0034 0.0584 0.0068 0.0817
NGLM 0.7597 0.5108 0.6762 0.6762 0.6694
English–Greek
Baseline 0.9179 0.9179 0.9580 0.9571 0.9208
Baseline 2 0.0946 0.0136 0.1167 0.0269 0.2643
NGLM 0.5665 0.3867 0.5586 0.5577 0.5877
English–Spanish–Arabic
Baseline 0.3354 0.3354 0.5791 0.5023 0.3442
Baseline 2 0.6682 0.0109 0.1044 0.0215 0.2227
NGLM 0.9204 0.7489 0.8573 0.8564 0.8936
English–Chinese
Baseline 0.8474 0.8474 0.9205 0.9174 0.8524
Baseline 2 0.1595 0.0082 0.0909 0.0164 0.1781
NGLM 0.6573 0.4476 0.6259 0.6184 0.7208
Ukrainian–Russian
Baseline 0.4950 0.4950 0.7035 0.6622 0.5048
Baseline 2 0.5060 0.0022 0.0472 0.0044 0.0550
NGLM 0.6755 0.3857 0.5644 0.5567 0.4831
Rand Jaccard Fowlkes-Mallows F1 F5
Pali 1
Baseline 0.3131 0.3131 0.5595 0.4768 0.3216
Baseline 2 0.6906 0.0118 0.1089 0.0234 0.2379
NGLM 0.8153 0.2069 0.3434 0.3429 0.3608
Pali 2
Baseline 0.3589 0.3589 0.5991 0.5283 0.3680
Baseline 2 0.6495 0.0238 0.1543 0.0465 0.3880
NGLM 0.7173 0.1958 0.3336 0.3275 0.3971
Pali 3
Baseline 0.4947 0.4947 0.7033 0.6619 0.5045
Baseline 2 0.5075 0.0045 0.0676 0.0091 0.1067
NGLM 0.7874 0.0816 0.1692 0.1508 0.1064
Pali 4
Baseline 0.4000 0.4000 0.6324 0.5714 0.4094
Baseline 2 0.6000 0.0000 n/a n/a n/a
NGLM 0.3000 0.1250 0.2357 0.2222 0.1699
Pali 5
Baseline 0.5800 0.5800 0.7615 0.7341 0.5895
Baseline 2 0.4236 0.0063 0.0798 0.0126 0.1430
NGLM 0.4777 0.2496 0.4065 0.3995 0.4816
Rand Jaccard Fowlkes-Mallows F1 F5
Twitter 1
Baseline 0.4615 0.4615 0.6793 0.6315 0.4712
Baseline 2 0.5384 0.0000 n/a n/a n/a
NGLM 0.8589 0.5925 0.7542 0.7441 0.8757
Twitter 2
Baseline 0.5555 0.5555 0.7453 0.7142 0.5652
Baseline 2 0.4444 0.0000 n/a n/a n/a
NGLM 0.7485 0.6090 0.7591 0.7570 0.8121
Twitter 3
Baseline 0.6583 0.6583 0.8113 0.7939 0.6670
Baseline 2 0.3416 0.0000 n/a n/a n/a
NGLM 0.6750 0.4347 0.6479 0.6060 0.8996
Twitter 4
Baseline 0.8750 0.8750 0.9354 0.9333 0.8792
Baseline 2 0.1250 0.0000 n/a n/a n/a
NGLM 0.7250 0.5822 0.7597 0.7360 0.9545
Twitter 5
Baseline 0.4285 0.4285 0.6546 0.6000 0.4382
Baseline 2 0.5714 0.0000 n/a n/a n/a
NGLM 0.6666 0.1250 0.2672 0.2222 0.4561
5.2 Textcat
Rand Jaccard Fowlkes-Mallows F1 F5
Greek–Russian
Baseline 0.5578 0.5578 0.7468 0.7161 0.5674
Baseline 2 0.4440 0.0034 0.0584 0.0068 0.0817
Textcat 0.4468 0.2971 0.4769 0.4581 0.3644
English–Greek
Baseline 0.9179 0.9179 0.9580 0.9571 0.9208
Baseline 2 0.0946 0.0136 0.1167 0.0269 0.2643
Textcat 0.5357 0.4933 0.6730 0.6607 0.5619
English–Spanish–Arabic
Baseline 0.3354 0.3354 0.5791 0.5023 0.3442
Baseline 2 0.6682 0.0109 0.1044 0.0215 0.2227
Textcat 0.3956 0.2832 0.5042 0.4414 0.3052
English–Chinese
Baseline 0.8474 0.8474 0.9205 0.9174 0.8524
Baseline 2 0.1595 0.0082 0.0909 0.0164 0.1781
Textcat 0.5018 0.4468 0.6251 0.6177 0.5408
Ukrainian–Russian
Baseline 0.4950 0.4950 0.7035 0.6622 0.5048
Baseline 2 0.5060 0.0022 0.0472 0.0044 0.0550
Textcat 0.3787 0.2625 0.4472 0.4159 0.3105
Rand Jaccard Fowlkes-Mallows F1 F5
Pali 1
Baseline 0.3131 0.3131 0.5595 0.4768 0.3216
Baseline 2 0.6906 0.0118 0.1089 0.0234 0.2379
Textcat 0.4531 0.2508 0.4849 0.4011 0.2641
Pali 2
Baseline 0.3589 0.3589 0.5991 0.5283 0.3680
Baseline 2 0.6495 0.0238 0.1543 0.0465 0.3880
Textcat 0.4307 0.2745 0.5088 0.4307 0.2888
Pali 3
Baseline 0.4947 0.4947 0.7033 0.6619 0.5045
Baseline 2 0.5075 0.0045 0.0676 0.0091 0.1067
Textcat 0.2032 0.0704 0.2502 0.1315 0.0736
Pali 4
Baseline 0.4000 0.4000 0.6324 0.5714 0.4094
Baseline 2 0.6000 0.0000 n/a n/a n/a
Textcat 0.5000 0.1666 0.2886 0.2857 0.2524
Pali 5
Baseline 0.5800 0.5800 0.7615 0.7341 0.5895
Baseline 2 0.4236 0.0063 0.0798 0.0126 0.1430
Textcat 0.5090 0.3458 0.5141 0.5140 0.5236
Rand Jaccard Fowlkes-Mallows F1 F5
Twitter 1
Baseline 0.4615 0.4615 0.6793 0.6315 0.4712
Baseline 2 0.5384 0.0000 n/a n/a n/a
Textcat 0.3736 0.2597 0.4460 0.4123 0.3049
Twitter 2
Baseline 0.5555 0.5555 0.7453 0.7142 0.5652
Baseline 2 0.4444 0.0000 n/a n/a n/a
Textcat 0.4678 0.4347 0.6158 0.6060 0.5207
Twitter 3
Baseline 0.6583 0.6583 0.8113 0.7939 0.6670
Baseline 2 0.3416 0.0000 n/a n/a n/a
Textcat 0.6838 0.6446 0.8011 0.7839 0.6586
Twitter 4
Baseline 0.8750 0.8750 0.9354 0.9333 0.8792
Baseline 2 0.1250 0.0000 n/a n/a n/a
Textcat 0.8833 0.8666 0.9309 0.9285 0.8711
Twitter 5
Baseline 0.4285 0.4285 0.6546 0.6000 0.4382
Baseline 2 0.5714 0.0000 n/a n/a n/a
Textcat 0.3333 0.3333 0.5773 0.5000 0.3421
5.3 Clustering
The first run indicates the value after one clustering step, and the second run indicates
the value after applying the clustering algorithm to the results of the first run.
Rand Jaccard Fowlkes-Mallows F1 F5
Greek–Russian
Baseline 0.5578 0.5578 0.7468 0.7161 0.5674
Baseline 2 0.4440 0.0034 0.0584 0.0068 0.0817
First run 0.5787 0.3811 0.5672 0.5519 0.4549
Second run 0.7536 0.3883 0.5899 0.4494 0.7914
English–Greek
Baseline 0.9179 0.9179 0.9580 0.9571 0.9208
Baseline 2 0.0946 0.0136 0.1167 0.0269 0.2643
First run 0.4244 0.2482 0.4015 0.3977 0.4553
Second run 0.3705 0.0855 0.1784 0.1576 0.2777
English–Spanish–Arabic
Baseline 0.3354 0.3354 0.5791 0.5023 0.3442
Baseline 2 0.6682 0.0109 0.1044 0.0215 0.2227
First run 0.8016 0.5650 0.7400 0.7221 0.6008
Second run 0.7226 0.2860 0.4495 0.4448 0.5130
English–Chinese
Baseline 0.8474 0.8474 0.9205 0.9174 0.8524
Baseline 2 0.1595 0.0082 0.0909 0.0164 0.1781
First run 0.5480 0.3356 0.5087 0.5025 0.5866
Second run 0.5138 0.2584 0.4361 0.4107 0.5957
Ukrainian–Russian
Baseline 0.4950 0.4950 0.7035 0.6622 0.5048
Baseline 2 0.5060 0.0022 0.0472 0.0044 0.0550
First run 0.5867 0.1953 0.3268 0.3267 0.3305
Second run 0.5934 0.1154 0.2178 0.2070 0.2907
Rand Jaccard Fowlkes-Mallows F1 F5
Pali 1
Baseline 0.3131 0.3131 0.5595 0.4768 0.3216
Baseline 2 0.6906 0.0118 0.1089 0.0234 0.2379
First run 0.4674 0.2540 0.4898 0.4051 0.2666
Second run 0.7168 0.2547 0.4118 0.4060 0.3516
Pali 2
Baseline 0.3589 0.3589 0.5991 0.5283 0.3680
Baseline 2 0.6495 0.0238 0.1543 0.0465 0.3880
First run 0.6738 0.3026 0.4777 0.4646 0.3825
Second run 0.6646 0.1865 0.3147 0.3144 0.3021
Pali 3
Baseline 0.4947 0.4947 0.7033 0.6619 0.5045
Baseline 2 0.5075 0.0045 0.0676 0.0091 0.1067
First run 0.5686 0.0746 0.2002 0.1389 0.0831
Second run 0.7534 0.0911 0.1962 0.1670 0.1125
Pali 4
Baseline 0.4000 0.4000 0.6324 0.5714 0.4094
Baseline 2 0.6000 0.0000 n/a n/a n/a
First run 0.5333 0.3000 0.5477 0.4615 0.3083
Second run 0.3000 0.3000 0.5477 0.4615 0.3083
Pali 5
Baseline 0.5800 0.5800 0.7615 0.7341 0.5895
Baseline 2 0.4236 0.0063 0.0798 0.0126 0.1430
First run 0.5294 0.2472 0.4111 0.3965 0.5242
Second run 0.4666 0.1214 0.2524 0.2166 0.4117
Rand Jaccard Fowlkes-Mallows F1 F5
Twitter 1
Baseline 0.4615 0.4615 0.6793 0.6315 0.4712
Baseline 2 0.5384 0.0000 n/a n/a n/a
First run 0.8681 0.7142 0.8451 0.8333 0.7222
Second run 0.8461 0.6000 0.7745 0.7499 0.9750
Twitter 2
Baseline 0.5555 0.5555 0.7453 0.7142 0.5652
Baseline 2 0.4444 0.0000 n/a n/a n/a
First run 0.4575 0.3941 0.5655 0.5654 0.5573
Second run 0.4967 0.3888 0.5615 0.5600 0.6012
Twitter 3
Baseline 0.6583 0.6583 0.8113 0.7939 0.6670
Baseline 2 0.3416 0.0000 n/a n/a n/a
First run 0.4571 0.3595 0.5525 0.5289 0.7215
Second run 0.3523 0.2093 0.3997 0.3461 0.6428
Twitter 4
Baseline 0.8750 0.8750 0.9354 0.9333 0.8792
Baseline 2 0.1250 0.0000 n/a n/a n/a
First run 0.9019 0.8584 0.9265 0.9238 0.8631
Second run 0.6250 0.5000 0.6789 0.6666 0.8080
Twitter 5
Baseline 0.4285 0.4285 0.6546 0.6000 0.4382
Baseline 2 0.5714 0.0000 n/a n/a n/a
First run 0.7142 0.4666 0.6831 0.6363 0.4764
Second run 0.5714 0.3076 0.4780 0.4705 0.4046
5.4 Language model induction
In addition to highlighting results that outperform the baseline values, the following
tables have been color coded. Results that outperform the clustering algorithm are
indicated in red and results that outperform both the clustering algorithm and the n-
gram language model are indicated in blue.7
Rand Jaccard Fowlkes-Mallows F1 F5
Greek–Russian
Baseline 0.5578 0.5578 0.7468 0.7161 0.5674
Baseline 2 0.4440 0.0034 0.0584 0.0068 0.0817
Inducted 0.7142 0.4222 0.5940 0.5937 0.6125
English–Greek
Baseline 0.9179 0.9179 0.9580 0.9571 0.9208
Baseline 2 0.0946 0.0136 0.1167 0.0269 0.2643
Inducted 0.4769 0.3266 0.5089 0.4924 0.6423
English–Spanish–Arabic
Baseline 0.3354 0.3354 0.5791 0.5023 0.3442
Baseline 2 0.6682 0.0109 0.1044 0.0215 0.2227
Inducted 0.7783 0.5677 0.7534 0.7242 0.5773
English–Chinese
Baseline 0.8474 0.8474 0.9205 0.9174 0.8524
Baseline 2 0.1595 0.0082 0.0909 0.0164 0.1781
Inducted 0.5657 0.3343 0.5258 0.5011 0.6953
Ukrainian–Russian
Baseline 0.4950 0.4950 0.7035 0.6622 0.5048
Baseline 2 0.5060 0.0022 0.0472 0.0044 0.0550
Inducted 0.6289 0.1000 0.1935 0.1818 0.2659
Rand Jaccard Fowlkes-Mallows F1 F5
Pali 1
Baseline 0.3131 0.3131 0.5595 0.4768 0.3216
Baseline 2 0.6906 0.0118 0.1089 0.0234 0.2379
Inducted 0.7856 0.1683 0.2898 0.2882 0.3188
Pali 2
Baseline 0.3589 0.3589 0.5991 0.5283 0.3680
Baseline 2 0.6495 0.0238 0.1543 0.0465 0.3880
Inducted 0.8148 0.5000 0.6686 0.6666 0.7176
Pali 3
Baseline 0.4947 0.4947 0.7033 0.6619 0.5045
Baseline 2 0.5075 0.0045 0.0676 0.0091 0.1067
Inducted 0.8492 0.0569 0.1083 0.1078 0.1186
Pali 4
Baseline 0.4000 0.4000 0.6324 0.5714 0.4094
Baseline 2 0.6000 0.0000 n/a n/a n/a
Inducted 0.6000 0.0000 0.0000 n/a n/a
Pali 5
Baseline 0.5800 0.5800 0.7615 0.7341 0.5895
Baseline 2 0.4236 0.0063 0.0798 0.0126 0.1430
Inducted 0.4033 0.2082 0.3504 0.3446 0.4134
Rand Jaccard Fowlkes-Mallows F1 F5
Twitter 1
Baseline 0.4615 0.4615 0.6793 0.6315 0.4712
Baseline 2 0.5384 0.0000 n/a n/a n/a
Inducted 0.6282 0.3695 0.5515 0.5396 0.4533
Twitter 2
Baseline 0.5555 0.5555 0.7453 0.7142 0.5652
Baseline 2 0.4444 0.0000 n/a n/a n/a
Inducted 0.7719 0.6020 0.7687 0.7515 0.9325
Twitter 3
Baseline 0.6583 0.6583 0.8113 0.7939 0.6670
Baseline 2 0.3416 0.0000 n/a n/a n/a
Inducted 0.5916 0.3000 0.5236 0.4615 0.8185
Twitter 4
Baseline 0.8750 0.8750 0.9354 0.9333 0.8792
Baseline 2 0.1250 0.0000 n/a n/a n/a
Inducted 0.5250 0.3736 0.5615 0.5439 0.7055
Twitter 5
Baseline 0.4285 0.4285 0.6546 0.6000 0.4382
Baseline 2 0.5714 0.0000 n/a n/a n/a
Inducted 1.0000 1.0000 1.0000 1.0000 1.0000
6 Discussion
The work by Seldin et al. (2001) is similar to the work presented here. They propose
an unsupervised language (and protein sequence) segmentation approach that yields
accurate segmentations. While their work looks promising, it also has its drawbacks.
Their method requires longer monolingual text fragments and a sizable amount of text.
Furthermore, they disallow switching language models after each word. This restriction
means that single-word inclusions go undetected, as do structures such as the one
shown in figure 22, where the language alternates after each word.
w1 w2 w3 w4 w5 w6 w7 …
While this structure looks very artificial, such a structure is found, for instance,
in the fifth Pali dictionary text, in the passage “Pacati, [Ved. pacati, Igd. *peqǔō, Av.
pac-;”. In this case, ‘red’ corresponds to Pali, ‘blue’ to (abbreviations in) English and
‘green’ to reconstructed Indo-European.
Also, the segmentation of English–Chinese did not work well. This is probably
due to the way the model was trained. Chinese script is written without whitespace
characters between words, and the correct segmentation of a text written in Chinese
requires in-depth knowledge of the language. Some words are written with only one
character, but others are composed of two or more characters, with the meaning often
being non-compositional; the meaning of a two-character word is different from the
sum of the meanings of the two characters. Sometimes, more than one segmentation
would be possible and the context decides which segmentation is correct. In other
cases, more than one segmentation might be correct. This problem occurs with all
scripts that are written without whitespace.
As with the simplified assumption in the tokenization of whitespace-scripts, where
I consider a word to be a character sequence delineated by whitespace, I have treated
each character as a word. Adapting the method to Chinese and similar scripts would
have been possible, but would have introduced the need for large amounts of external
linguistic knowledge. Indeed, every possible non-whitespace-script would have to be
considered, and each of the tokenizers would be language dependent, i.e. a tokenizer
for Chinese would not work on Korean or Japanese.
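The two tokenization assumptions described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the function names `is_cjk_ideograph` and `tokenize` are hypothetical, and detecting Chinese characters through their Unicode character name is an assumption made for the example.

```python
import unicodedata

def is_cjk_ideograph(ch):
    # Assumption: a character counts as Chinese if its Unicode name starts
    # with "CJK UNIFIED IDEOGRAPH"; real data may need further ranges.
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")

def tokenize(text):
    tokens = []
    for chunk in text.split():          # whitespace-delimited "words"
        current = ""
        for ch in chunk:
            if is_cjk_ideograph(ch):
                if current:             # flush any pending non-Chinese run
                    tokens.append(current)
                    current = ""
                tokens.append(ch)       # each ideograph is its own token
            else:
                current += ch
        if current:
            tokens.append(current)
    return tokens

print(tokenize("I love 中文 today"))
```

No word segmentation of the Chinese part is attempted here; every ideograph becomes a token, which is exactly the simplification discussed above.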
The supervised approach did not work well on the Pali dictionary data. While
English words could be isolated somewhat successfully, the rest of the data proved
difficult to segment. As an example, let us look at the first Pali text. The English
cluster consists almost exclusively of English words, though it does not capture all of
them; the “other” cluster contains mainly marked-up words, and the rest is seemingly
haphazardly distributed among the other models.
Pali 1: abbha
• (EN) water, mountain, of, free, (used, or, like, referred, (also, A, is, cloudy,
clouds, later, a, froth, 1, summit, thundering, by, mass, Pv, Oir, obscure, scum,
that, water]., thick, As, from, It, is, at, as, the, in, clouds, things, also
• (FR) cloud, Dh, adj., point, cloud, Dhs, A), rain, VvA, DhsA, list
• (IT) \”dark, &, ambha, 3, 1, 317, J, sunshine, cp., abhra, [Vedic, (megho
• (PL) 487, =, S, 295, <br, moon–, 249
• (RU) 348, 53
• (TrAM) 687, PvA, (°sama, 101, (nīl°, (cp., 64;, (nt.), 581, m., Sn, 1064;
• (other) <b> –saŋvilāpa </b>, <b> –mutta </b>, <smallcaps> vi. </smallcaps>,
(mahiyā, <smallcaps> iv. </smallcaps>, cloud\”;, <b> Rāhu </b>, <b> abbhā
</b>, <b> abbhaŋ, <superscript> 9 </superscript>, marajo </b>, abbhāmutta,
valāhaka);, <smallcaps> i. </smallcaps>, <b> abbhāmatta </b>, valāhaka–sikhara,
<superscript> s. </superscript>, <smallcaps> ii. </smallcaps>, <b> dhū-, storm–
cloud, /><b> –kūṭa </b>, thunder–cloud);, <at> a)fro\\s </at>, <b> –paṭala </b>,
<at>o)/mbros</at>, nīla–megha, <superscript>1</superscript>, *m̊bhro, \”dull\”;,
acchādesi);, mahikā</b>, <b> –ghana </b>
On the Twitter data, the supervised approach achieved passable results. While
the numbers look great, the actual segmentations do not. For Twitter 1, too many
clusters were generated; for Twitter 2 and 3, the recognition of French words worked
to some extent, but English words were also recognized as French and French words
as English. For Twitter 4, the Polish inclusion was isolated but recognized as “other”,
together with “strawberries”. The recognition of transliterated Amharic worked
satisfactorily, although ‘naw’ was assigned to the Polish model.
As the number of language models increases, so does the risk of misclassification.
As can be seen, there is already considerable misclassification with only 15 language
models. For example, in our data, the English preposition ‘to’ is often erroneously
classified as ‘transliterated Greek’. The Greek particle το ‘to’ can be either the neuter
singular accusative or nominative definite article ‘the’, the masculine singular accusative
or nominative definite article ‘the’ or the 3rd person neuter singular nominative/accusative
weak pronoun ‘it’, and as such is rather frequent in the language. This is
especially problematic for the transliterated Greek language model, because
transliterated το and the English preposition ‘to’ share the same spelling.
A quick corpus study using the Corpus of Modern Greek⁸ and the Corpus of
Contemporary American English⁹ reveals that the frequency per million words for the
Greek particle το is 22666, while the English preposition ‘to’ has a frequency per
million words of 25193. Their relative frequencies are very close together, and it might
just have happened that the training data used in this work contained more Greek ‘to’s
than English ‘to’s, leading to this misclassification.
⁸ [Link]
⁹ [Link]
Other reasons for misclassification include relatedness of the modeled languages
as in the case of Germanic or Romance language families. Also, the text types used
for training and the text types used for testing play an important role, as well as the
amount of training data.
For n-gram language models, the quality of the model is dependent on the texts
used for training and the texts used in evaluation. It is probable that a different training
set would have yielded different results. This is also the problem with the supervised
approach; it is necessary to have language data for training and the trained models
reflect the training data to some extent.
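The dependence on training data can be made concrete with a toy character bigram model. The sketch below is not the thesis implementation: the function names, the add-one smoothing, the assumed alphabet size and the tiny word lists are all hypothetical choices made for illustration. It shows how the decision for the string ‘to’ flips once one training set contains many more ‘to’ tokens than the other.

```python
import math
from collections import Counter

ALPHABET_SIZE = 100  # assumed size of the character inventory, used for smoothing

def train_bigram_model(words):
    """Count character bigrams over words padded with start/end markers."""
    bigrams, contexts = Counter(), Counter()
    for word in words:
        padded = "^" + word.lower() + "$"
        for a, b in zip(padded, padded[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts

def log_prob(word, model):
    """Add-one smoothed log-probability of a word under a bigram model."""
    bigrams, contexts = model
    padded = "^" + word.lower() + "$"
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + ALPHABET_SIZE))
        for a, b in zip(padded, padded[1:])
    )

def classify(word, models):
    """Return the name of the model that assigns the highest score."""
    return max(models, key=lambda name: log_prob(word, models[name]))

# Toy training sets (hypothetical, not the thesis data).
english = ["the", "that", "water", "cloud", "to"]
greek_translit = ["kai", "agape", "eros", "logos"]

models = {
    "en": train_bigram_model(english),
    "el-translit": train_bigram_model(greek_translit),
}
print(classify("to", models))  # prints "en"

# Retrain the transliterated Greek model with many 'to' tokens.
models["el-translit"] = train_bigram_model(greek_translit + ["to"] * 20)
print(classify("to", models))  # prints "el-translit"
```

Nothing about the word itself changes between the two calls; only the composition of the training data does.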
6.2 Textcat
Textcat works well on monolingual texts. However, it fails on multilingual texts and
does not work well on short fragments of text, such as single words. Many of the words
are tagged as unknown, and if a language has been identified, the language guess is
often incorrect. Hence, Textcat cannot be used for language segmentation purposes.
Indeed, Textcat fails to exceed the baseline values except for two cases: ‘Twitter
3’ and ‘Twitter 4’ yield better values than the baseline values. However, upon closer
inspection, it is clear that the numerical index values do not give a reliable picture of
the quality of the clustering.
Indeed, while the clustering of ‘Twitter 3’ is not nonsensical, it is not very good,
failing to extract the French insertion ‘breuvages’. The Rand Index also only shows a
slightly better value than the baseline values. It seems that the outstanding score for
‘Twitter 4’ is achieved because both the clustering by Textcat and the gold standard
have the same number of clusters.
Tables 20 and 21 show the clusterings side by side. Clearly, Textcat performed
poorly despite the high numerical index values. A closer inspection of all the Textcat
results shows that Textcat performs poorly at the task of language segmentation; often,
a word cannot be assigned a language and thus is added to the cluster of ‘unknown’
language words. For the words where a language has been identified, it most often
is not the correct language. While language identification is not necessary for the
task of language segmentation, it helps to understand why Textcat failed at the task of
language segmentation.
Cluster 1
  Textcat: ∅
  Gold standard: breuvages
Cluster 2
  Textcat: #bilingualism
  Gold standard: #FWWC2015, #bilingualism
Cluster 3
  Textcat: Food, and, breuvages, in, Edmonton, are, ready, to, go, just, waiting, for, the, fans, #FWWC2015
  Gold standard: Food, and, in, Edmonton, are, ready, to, go, just, waiting, for, the, fans
6.3 Clustering
The clustering results are more difficult to interpret. Often, the first distinction made
seems to be based on case, i.e. words that begin with a capital letter versus words that
are all lowercase letters. The second run on the ‘mixed script: English – Greek’ data
shows that the first cluster from the first run has been separated into a cluster with
words that begin with a capital letter and two clusters with words that do not begin
with a capital letter.
English–Greek: Second run: Splitting of first cluster
Another important distinction seems to be the length of words. Indeed, the results
often show clusters that clearly are based on the length of the contained words. The
first run on the ‘Latin script: German – Italian’ data shows that short words have been
singled out into the first cluster.
• (il, E, So, a, ad, da, di, e, es, ha, i, il, in, la, le, lo, ma, ne, se, si, un, va, zu
The clustering works well when the scripts involved are dissimilar, as in the case
of the English–Chinese text, where the Chinese characters were isolated after the first
run, and also the English–Spanish–Arabic example, where the Arabic part was
completely isolated in the first run.
The closer the scripts become, the less clear-cut the results are. For Greek–
Russian, the results are acceptable, with one mixed cluster. However, the number of
clusters is too high for the number of languages involved and the separation is only
achieved after two consecutive clusterings.
The clustering of closer scripts, such as Ukrainian–Russian, does not work well. The
clusters, with the exception of the cluster containing the datum ‘9—13’, are all impure,
consisting of Ukrainian and Russian words. The second run also fails at improving the
clustering.
Finally, clustering of Latin-based scripts does not perform well unless diacritics are
involved and the diacritics form the most salient distinction. Words containing letters
with diacritics are then generally separated from words containing no diacritics, as in
the German–Finnish–Turkish example. The first run generates a cluster for numbers,
two clusters with diacritics and one cluster without diacritics.
Probably for this reason, the clustering of Transliterated Greek–English and Greek–
English worked surprisingly well. In both cases, the first run managed to separate
the (transliterated) Greek parts from the English words. However, unaccented Greek
words such as Agape, erotas or eros were clustered with English.
• (T., A, M
• =, l., v.
• 265, 287
• [n.
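The surface distinctions that the clustering appears to pick up (capitalization, word length, diacritics and script) can be sketched as a small feature extractor. This is an illustrative assumption, not the feature set actually used in the experiments: the function names are hypothetical, and the script label is simply taken from the first word of each character's Unicode name.

```python
import unicodedata

def has_diacritic(word):
    # A letter carries a diacritic if its canonical decomposition (NFD)
    # contains a combining mark.
    decomposed = unicodedata.normalize("NFD", word)
    return any(unicodedata.combining(ch) for ch in decomposed)

def scripts(word):
    # Rough script label: first word of the Unicode character name
    # (e.g. LATIN, GREEK, CYRILLIC, ARABIC, CJK), alphabetic characters only.
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in word if ch.isalpha()}

def surface_features(word):
    return {
        "starts_upper": word[:1].isupper(),
        "length": len(word),
        "has_diacritic": has_diacritic(word),
        "scripts": scripts(word),
    }

for token in ["Edmonton", "sucré", "Agape", "αγάπη", "il"]:
    print(token, surface_features(token))
```

With features of this kind, ‘Agape’ and ‘eros’ are indistinguishable from English words, which is consistent with the observation that unaccented transliterated Greek ended up in the English cluster.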
In some cases, the clustering fails at the task of language segmentation, as in the
case of the various English–French texts and the English–German example with the
German inclusion. We can thus say that the surface structure or morphology, or in
other words the basis from which we can extract features, is not sufficient to deduce
relevant information about ‘language’.
When there are more than two languages that are to be separated, the cluster-
ing also does not work well. Indeed, the most dissimilar objects are separated first.
In the case of English–Spanish–Arabic, the Arabic part is separated first, as well as
words with diacritics, while English and Spanish words without diacritics are thrown
together. Subsequent runs show no improvement of the clustering concerning the
separation of English and Spanish.
In the case of German–Finnish–Turkish, the clustering algorithm seems to separate
out Turkish first, followed by Finnish. The results are, however, much less clear-cut
than for English–Spanish–Arabic.
6.4 Language model induction
The language model induction does not seem to work very well on the Latin script data.
There are almost only impure clusters, containing more than one language. However,
the approach consistently outperforms the clustering approach when we look at the
F5 score. For the English–French data set, the induction approach even outperforms
the n-gram language model approach. Indeed, the French words are relatively well
separated from the English text, with the exception of ‘sucré’, which is still thrown
together with English words.
• doux,
• while
• “hard”., used).,
• translate, as, meaning, very, different., ”coarse”, can, also, mean, almost,sucré,
In contrast, the approach works well on the mixed script data. Indeed, we achieve
a good separation of the languages by script. However, when there are also Latin
based scripts, we encounter the same problems as mentioned above with rather modest
results. For example, for the English–Greek text, the approach separates out the Greek
character words but it fails to separate transliterated Greek and English. Also, for the
English–Spanish–Arabic text, Arabic is separated out, but English and Spanish are not
separated well.
One interesting observation can be made in the case of the English–Chinese text.
The Chinese characters have been isolated, but the Pinyin transcription is thrown
together with the Chinese characters. Based on the prior observations, this is rather
unexpected. This raises the question of whether Pinyin ought to be clustered out, or
clustered together with English or Chinese.
Again, the language model induction approach outperforms the clustering approach,
and also the n-gram language model approach in the case of the English–Greek text.
On the larger Pali dictionary entries, the language model induction approach yields
acceptable results. On the shorter Pali dictionary entries, the language model induction
approach yields good results.
The rather low performance must be attributed to the data. Indeed, the Pali dictionary
data contain various problematic characters, such as a comma or dot combined with
whitespace in a single character. On such characters, whitespace tokenization fails,
yielding big chunks of nonsense tokens. For example, the fourth Pali dictionary entry
was split into five chunks (although it might not be displayed as such, none of the
commas and dots are in fact followed by whitespace; the whitespace is part of the
character,¹⁰ hence whitespace tokenization fails).
• Gūhanā,(f.)
• [abstr.fr.gūhati]=gūhanā
•(q.v.)
• Pug.19.Cp.pari°.(Page
• 253)
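The failure above, and one possible way around it, can be sketched as follows. This is a hypothetical preprocessing step that was not part of the experiments: it assumes that Unicode NFKC normalization is acceptable for the data and that a space may be inserted after a comma or full stop whenever a word character or opening bracket follows directly. The function names are invented for the example.

```python
import re
import unicodedata

def whitespace_tokens(text):
    # Plain whitespace tokenization, as used in the experiments.
    return text.split()

def normalized_tokens(text):
    # NFKC maps fullwidth forms such as U+FF0C and U+FF0E to ASCII "," and ".".
    text = unicodedata.normalize("NFKC", text)
    # Assumption: add a space after "," or "." when a word character or an
    # opening bracket follows, so that whitespace splitting can take effect.
    text = re.sub(r"([,.])(?=[\w(\[])", r"\1 ", text)
    return text.split()

entry = "G\u016bhan\u0101\uff0c(f.)"   # "Gūhanā" + FULLWIDTH COMMA + "(f.)"
print(whitespace_tokens(entry))         # a single chunk
print(normalized_tokens(entry))         # two tokens
```

The rule is deliberately crude: it would also split abbreviations such as ‘i.e.’ and numbers written with a decimal comma, so it only illustrates the direction a fix could take.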
Twitter 1: English–Greek
• BUSINESS, EXCELLENCE.
• Things, IT
For the third and fourth text, the approach manages to single out the other-language
inclusions, but not exclusively. Both times, there is one additional item in the cluster
(the relevant clusters are marked in red).
¹⁰ The comma has the Unicode code point U+FF0C (FULLWIDTH COMMA) and the dot has the
Unicode code point U+FF0E (FULLWIDTH FULL STOP).
Twitter 3: French–English
• #FWWC2015
• breuvages, go
Twitter 4: English–Polish
• back, żubrówka
The approach exceeded expectations on the second and fifth Twitter text. On the
second text, the ‘French’ cluster does not only contain the French words ‘Demain’ and
‘par’, but also the French way of notating time ‘18h’.
Twitter 2: French–English
• Keynote, “The, collective, of, science-publish, or, perish;, it, all, that, counts?”
• #dhiha6, David
• @dhiparis, dynamics, is
On the fifth text, an almost perfect result was achieved, with only one additional
subdivision of the ‘English’ cluster.
• (coffee
It seems that the language model approach does not work very well on longer texts,
especially on longer texts in Latin-based scripts, with the chosen parameter set; still,
the approach outperforms the clustering approach and achieves scores in the vicinity
of the scores achieved with the supervised trained n-gram language model approach.
On mixed script texts, the approach consistently outperforms the clustering approach
and we also reach scores in the vicinity of the scores achieved with the supervised
trained n-gram language model approach.
Moreover, on short texts, the approach works rather well. We succeed in outper-
forming the supervised trained n-gram language model approach on a number of texts,
and we achieve scores close to the scores achieved with the supervised trained n-gram
language model approach.
Although the language model induction approach tends to generate too many clus-
ters, it also generally succeeds at separating the languages involved.
6.5 Scores
Of the scores I used for evaluation purposes, it seems that a combination of a high
Rand Index and a high F5 score indicates a good language segmentation. A high F5
score alone is not conclusive. For example, the clustering algorithm achieves an F5
score of 0.7215 on ‘Twitter 3’. This score looks good, but the Rand Index score is only
0.4571, and the segmentation is not good.
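To make the relation between these indices concrete, the following sketch computes them from pair counts. It is a reconstruction under stated assumptions, not the evaluation code used for the tables: the function names are hypothetical, and the gold labelling (15 English tokens plus one Polish inclusion) was chosen only because it reproduces the ‘Twitter 4’ baseline rows. The `f_beta` line places β² on the recall term of the denominator because that matches the reported F5 values for that row; the more widespread definition places β² on the precision term, so this is an inference from the numbers rather than a restatement of the definition.

```python
from itertools import combinations
from math import sqrt

def pair_counts(gold, pred):
    """Count token pairs by whether gold and predicted clusterings agree."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(gold)), 2):
        same_gold = gold[i] == gold[j]
        same_pred = pred[i] == pred[j]
        if same_gold and same_pred:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_gold:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def scores(gold, pred, beta=5.0):
    # Assumes at least two tokens; undefined values are returned as None.
    tp, fp, fn, tn = pair_counts(gold, pred)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    defined = precision is not None and recall is not None
    result = {
        "rand": (tp + tn) / (tp + fp + fn + tn),
        "jaccard": tp / (tp + fp + fn) if tp + fp + fn else None,
        "fowlkes_mallows": sqrt(precision * recall) if defined else None,
        "f1": None,
        "f_beta": None,
    }
    if defined and precision + recall > 0:
        b2 = beta ** 2
        result["f1"] = 2 * precision * recall / (precision + recall)
        # Weighting inferred from the reported values (see lead-in).
        result["f_beta"] = (1 + b2) * precision * recall / (precision + b2 * recall)
    return result

gold = ["en"] * 15 + ["pl"]              # assumed gold standard
print(scores(gold, [0] * 16))            # everything in one cluster
print(scores(gold, list(range(16))))     # every token in its own cluster
```

With everything in one cluster, no pair is ever separated, so the Rand Index and the Jaccard index coincide with pairwise precision. This is why the first baseline shows identical values in those two columns.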
7 Conclusion
In this thesis, I have asked the question of whether unsupervised approaches to
language segmentation perform better on short and difficult texts than supervised
approaches by overcoming some of the difficulties associated with supervised approaches,
such as the need for (enough and adequate)¹¹ training data, the language-specificity of
the language model or the inflexibility of trained language models when it comes to
spelling variation and abbreviations, unless the training data also contained spelling
variation and abbreviations.
I have given an overview of related work, presenting supervised approaches that
have been used in monolingual language identification and the improvement of such
approaches through unsupervised approaches such as clustering.
Unfortunately, the body of literature covering the topic of language segmentation
is sparse. The work by Yin et al. (2007) and the work by Seldin et al. (2001) are closest
in topic to this thesis. However, Yin et al. (2007) concern themselves with spoken
language, which requires a different approach than dealing with written language. As I
concentrated on written language, their work was not conducive to this thesis.
In contrast, Seldin et al. (2001) present a work that looks promising. They present
a system that finds language borders in a text with great accuracy using unsupervised
algorithms. However, they restrict their algorithm in such a way that switching
language models after each word is disallowed. Thus, they are unable to detect
single-word inclusions and cannot handle situations where the language switches every word,
as has been shown to occur in the test data used in section 4.
Another major drawback of the approach is that it also needs longer fragments of
monolingual text and an overall longer text. Hence, their approach would not work
well on short texts, if at all.
Next, I have presented the theoretical foundations of a supervised n-gram language
model approach and an unsupervised clustering approach. Finally, I have introduced a
weakly supervised n-gram language model inducing approach devised by myself. All
of these approaches can be used for language segmentation. In order to test how well
the different approaches perform on different text types, I have performed experiments.
Section 4 presents the experiments made. I have first compiled a small corpus
of texts ranging from longer texts with clearly separated languages to one-sentence
Twitter messages containing foreign language inclusions. I have also included a set
of dictionary entries from the Pali dictionary by the Pali Text Society. Indeed, these
entries contain a lot of different languages and abbreviations, and (unfortunately) are
not consistently formatted.
I have then presented my implementations of the supervised and weakly supervised
approaches and the choice of the unsupervised clustering algorithms. Then, I
have presented the results of their application to the data.
¹¹ The question of what is to be considered ‘enough’ or ‘adequate’ is another point of
contention; the data always influences the resulting models.
It can be said that the supervised approach works reasonably well. The drawback
is that the approach needs training data to train the models on. The problems of the
training data and its influence on the models have been raised more than once.
The supervised approach failed for non-whitespace scripts. The models would have
to be adapted for non-whitespace scripts, introducing more complexity. Also, the
training and test texts would have to be split in meaningful ways, introducing the
need for a vast array of language-specific text splitters if the approach is to work on
a wide range of languages.
The unsupervised approach generally succeeded in separating languages by script
when different scripts were involved. Other than that, it seems that the chosen
morphological features, or possibly morphological features in general, are insufficient for
the algorithm to separate languages effectively.
The weakly supervised approach worked well on short texts and on difficult short
texts, but less well on long texts, while still outperforming the clustering approach
on long texts. The approach consistently outperforms the clustering approach and
reaches scores in the vicinity of the scores achieved by the supervised approach, even
surpassing the supervised approach in some cases. These results are promising, but
more thorough investigations have to be undertaken.
In conclusion, it can be said that some unsupervised (or weakly supervised)
approaches can perform better on the task of language segmentation on difficult and
short texts. The presented weakly supervised approach does not only outperform the
unsupervised clustering approach, it also achieves scores comparable to the scores
achieved with the supervised approach.
Future work could concentrate on the reduction of the number of generated clusters,
ideally getting down to one cluster per language; it would also be conceivable to
prevent overly frequent language model switching by taking a word’s context into
account. Finally, the parameters could conceivably be adapted automatically. With
the recent increase in interest in multilingual text processing, the emergence
and evolution of the texts themselves will influence the direction of work in that
area.
“Il est venu le temps des cathédrales
le monde est entré
dans un nouveau millénaire
— Gringoire
References
Abeel, T., de Peer, Y. V., and Saeys, Y. (2009). Java-ML: A Machine Learning Library.
Journal of Machine Learning Research, pages 931–934.
Achtert, E., Kriegel, H., Schubert, E., and Zimek, A. (2013). Interactive data mining
with 3D-parallel-coordinate-trees. In Proceedings of the ACM SIGMOD International
Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 22-27,
2013, pages 1009–1012.
Alex, B. (2006). Integrating language knowledge resources to extend the English inclu-
sion classifier to a new language. In Proceedings of the 5th International Conference
on Language Resources and Evaluation (LREC). European Language Resources Asso-
ciation.
Alex, B., Dubey, A., and Keller, F. (2007). Using Foreign Inclusion Detection to Improve
Parsing Performance. In EMNLP-CoNLL, pages 151–160.
Alex, B. and Onysko, A. (2010). Zum Erkennen von Anglizismen im Deutschen: der
Vergleich von einer automatisierten mit einer manuellen Erhebung. In Scherer, C.
and Holler, A., editors, Strategien der Integration und Isolation nicht-nativer Einheiten
und Strukturen, pages 223–239. de Gruyter.
Begleiter, R., El-Yaniv, R., and Yona, G. (2004). On prediction using variable order
Markov models. Journal of Artificial Intelligence Research, pages 385–421.
Brants, S., Dipper, S., Hansen, S., Lezius, W., and Smith, G. (2002). The TIGER treebank.
In Proceedings of the workshop on treebanks and linguistic theories, volume 168.
Bürgisser, P., Clausen, M., and Shokrollahi, M. A. (1997). Algebraic complexity theory,
volume 315. Springer.
Carter, D. (1994). Improving language models by clustering training sentences. In
Proceedings of the fourth conference on Applied natural language processing, pages
59–64. Association for Computational Linguistics.
Dreyfuss, E., Goodfellow, I., and Baumstarck, P. (2007). Clustering Methods for Improv-
ing Language Models.
Dubes, R. C. (1987). How many clusters are best?-an experiment. Pattern Recognition,
20(6):645–663.
Gao, J., Goodman, J., Miao, J., et al. (2001). The use of clustering techniques for language
modeling–application to Asian languages. International Journal of Computational
Linguistics and Chinese Language Processing, 6(1):27–60.
Goldberg, D. (1991). What every computer scientist should know about floating-point
arithmetic. ACM Computing Surveys (CSUR), 23(1):5–48.
Goodman, J. and Gao, J. (2000). Language model size reduction by pruning and clus-
tering. In INTERSPEECH, pages 110–113.
Guthrie, D., Allison, B., Liu, W., Guthrie, L., and Wilks, Y. (2006). A closer look at
skip-gram modelling. In Proceedings of the 5th international Conference on Language
Resources and Evaluation (LREC-2006), pages 1–4.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. (2009).
The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11.
Jain, A. K., Murty, M. N., and Flynn, P. J. (1999). Data clustering: a review. ACM
computing surveys (CSUR), 31(3):264–323.
Katz, S. (1987). Estimation of probabilities from sparse data for the language model
component of a speech recognizer. Acoustics, Speech and Signal Processing, IEEE
Transactions on, 35(3):400–401.
Liu, H. and Cong, J. (2013). Language clustering with word co-occurrence networks
based on parallel texts. Chinese Science Bulletin, 58(10):1139–1144.
Logan, B. et al. (2000). Mel frequency cepstral coefficients for music modeling. In
Proceedings of the 1st International Symposium on Music Information Retrieval (ISMIR).
Lui, M., Lau, J. H., and Baldwin, T. (2014). Automatic detection and language identifi-
cation of multilingual documents. Transactions of the Association for Computational
Linguistics, 2:27–40.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to information re-
trieval, volume 1. Cambridge University Press.
Mendizabal, I., Carandell, J., and Horowitz, D. (2014). TweetSafa: Tweet language
identification. TweetLID @ SEPLN.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word
representations in vector space. In Proceedings of the International Conference on
Learning Representations (ICLR) 2013.
Ney, H., Essen, U., and Kneser, R. (1994). On structuring probabilistic dependences in
stochastic language modelling. Computer Speech & Language, 8(1):1–38.
Pelleg, D. and Moore, A. W. (2000). X-means: Extending K-means with Efficient Es-
timation of the Number of Clusters. In Proceedings of the Seventeenth International
Conference on Machine Learning (ICML 2000), pages 727–734.
Pereira, F., Tishby, N., and Lee, L. (1993). Distributional clustering of english words. In
Proceedings of the 31st annual meeting on Association for Computational Linguistics,
pages 183–190. Association for Computational Linguistics.
Porta, J. (2014). Twitter Language Identification using Rational Kernels and its potential
application to Sociolinguistics. TweetLID @ SEPLN.
Ravi, S., Vassilvitskii, S., and Rastogi, V. (2014). Parallel Algorithms for Unsupervised
Tagging. Transactions of the Association for Computational Linguistics, 2:105–118.
Ron, D., Singer, Y., and Tishby, N. (1996). The power of amnesia: Learning probabilistic
automata with variable memory length. Machine learning, 25(2-3):117–149.
Schölkopf, B., Williamson, R. C., Smola, A. J., Shawe-Taylor, J., and Platt, J. C. (1999).
Support vector method for novelty detection. In Advances in Neural Information
Processing Systems (NIPS), volume 12, pages 582–588.
Seldin, Y., Bejerano, G., and Tishby, N. (2001). Unsupervised sequence segmentation
by a mixture of switching variable memory Markov sources. In Proceedings of the
Seventeenth International Conference on Machine Learning (ICML), pages 513–520.
Solorio, T., Blair, E., Maharjan, S., Bethard, S., Diab, M., Ghoneim, M., Hawwari, A.,
Al-Ghamdi, F., Hirschberg, J., Chang, A., et al. (2014). Overview for the First Shared Task
on Language Identification in Code-Switched Data. In Proceedings of the Conference
on Empirical Methods on Natural Language Processing, pages 62–72.
Taylor, D. (2015). Graphing the distribution of English letters towards the
beginning, middle or end of words. [Link]
[Link].
Uszkoreit, J. and Brants, T. (2008). Distributed word clustering for large scale class-
based language modeling in machine translation. In Proceedings of the 46th Annual
Meeting of the Association for Computational Linguistics, pages 755–762.
Yin, B., Ambikairajah, E., and Chen, F. (2007). Hierarchical language identification
based on automatic language clustering. In INTERSPEECH, pages 178–181.
Yuan, L. (2006). Language model based on word clustering. In Proceedings of the 20th
Pacific Asia Conference on Language, Information and Computation, pages 394–397.
Zubiaga, A., San Vicente, I., Gamallo, P., Pichel, J. R., Alegria, I., Aranberri, N., Ezeiza,
A., and Fresno, V. (2014). Overview of TweetLID: Tweet language identification at
SEPLN 2014. TweetLID @ SEPLN.
8 Appendix
8.1 Development data
8.1.1 Latin script data
Karl Marx anses som en af de fire klassiske sociologer. Marx er epokegørende for den
historiske videnskab. Og Marx spillede en vigtig rolle for den samtidige og efterfølgende
arbejderbevægelse.
1891, nach einer Tuberkuloseerkrankung Hopes, eröffnete das Ehepaar ein mod-
ernes Lungensanatorium in Nordrach im Schwarzwald, das sie bis 1893 gemeinsam
führten. 1895 wurde die Ehe geschieden.
Sources:
https://[Link]/wiki/Karl_Marx
https://[Link]/wiki/Hope_Bridges_Adams_Lehmann
Twitter 2 Music for Airports > le piano en libre-accès dans l’aéroport Charles-de-
Gaulles
Source:
Yannick Rochat (yrochat). “Music for Airports > le piano en libre-accès dans l’aéroport
Charles-de-Gaulles”. 26 July 2015, 18:12. Tweet.
(expld.by Nd1 72 under formula saussada;i.e.showing 7 bad qualities,viz.rāga,
dosa,moha etc.) ,855.– See also ussādana,ussādeti etc.(Page 157)
English - French doux, mou : both translate as ”soft” in English, although their meaning
is very different. Doux is the opposite of ”rough” or ”coarse” (rugueux), while mou
is the opposite of ”hard”. Doux can also mean sweet, but almost only for wines (oth-
erwise sucré is used).
Source:
Maciamo, (2015): ”French words and nuances that don’t exist in English”. Eupedia.
Dabei mehren sich letzter Zeit auch Stimmen, die Entwarnung geben. So kam
jüngst eine Studie des Stifterverbands für die Deutsche Wissenschaft zu dem Ergebnis,
dass ”ein allgemeiner Fachkräftemangel in den MINT-Berufen eher nicht mehr” drohe.
Come anche i riccioli del Battista richiamano il movimento delle acque, moto che
poi Leonardo studierà più approfonditamente a Venezia, nelle ricerche sui bacini in
chiave di difesa anti-Turchi. E si vada alla bellissima Annunciazione, con un occhio
attento alle ali dell’angelo: la delicatezza delle punte all’insù che cosa sono se non
il barbaglio di un sogno che lo ossessionava da anni, ovvero quello di volare?
Ist das seit Jahren angemahnte Szenario vom drohenden Fachkräftemangel bei Ingenieuren
und Naturwissenschaftlern also nur ein Mythos?
Source:
Stalinski, Sandra (2015): ”Ingenieure: Mythos Fachkräftemangel?”. [Link].
Scorranese, Roberta (2015): ”Nelle grandi opere il racconto sofferto della natura mor-
tale”. [Link].
German - Finnish - Turkish Der Sommer ist die wärmste der vier Jahreszeiten in der
gemäßigten und arktischen Klimazone. Je nachdem, ob er gerade auf der Nord- oder
Südhalbkugel herrscht, spricht man vom Nord- oder Südsommer. Der Nordsommer
findet gleichzeitig mit dem Südwinter statt.
Kesä eli suvi on vuodenaika kevään ja syksyn välissä. Kesä on vuodenajoista lämpimin,
koska maapallo on silloin kallistunut niin, että aurinko säteilee maan pinnalle
jyrkemmässä kulmassa kuin muina vuodenaikoina. Pohjoisella pallonpuoliskolla kesä-
kuukausiksi lasketaan tavallisesti kesä-. heinä- ja elokuu, eteläisellä pallonpuoliskolla
joulu-, tammi- ja helmikuu.
Yaz, en sıcak mevsimdir. Kuzey Yarım Küre’de en uzun günler yazda gerçekleşir.
Dünya ısıyı depo ettiği için en sıcak günler genellikle yaklaşık iki ay sonra ortaya
çıkar. Sıcak günler Kuzey Yarım Küre’de 21 Haziran ile 22 Eylül arasında, Güney Yarım
Küre’de ise 22 Aralık ile 21 Mart arasındadır.
Source:
https://fi.[Link]/wiki/Kesä
https://[Link]/wiki/Sommer
https://[Link]/wiki/Yaz
тельным для всякого образованного человека. В латинском языке присутствует
большое количество греческих заимствований, а в греческом —значительное
количество латинских и романских слов. В новое время древнегреческий язык
стал (наряду с латинским) источником создания новых научных и технических
терминов (так называемая международная лексика). В русский язык греческие
слова проникали в основном двумя путями —через международную лексику и
через церковнославянский язык.
Source:
https://[Link]/wiki/Ελληνική_γλώσσα
https://[Link]/wiki/Греческий_язык
English - Greek - Transliterated Greek Agápe (ἀγάπη agápē) means ”love: esp.
brotherly love, charity; the love of God for man and of man for God.” Agape is used
in the biblical passage known as the ”love chapter,” 1 Corinthians 13, and is described
there and throughout the New Testament as brotherly love, affection, good will, love,
and benevolence. Whether the love given is returned or not, the person continues to
love (even without any self-benefit). Agape is also used in ancient texts to denote feel-
ings for one’s children and the feelings for a spouse, and it was also used to refer to
a love feast. It can also be described as the feeling of being content or holding one in
high regard. Agape is used by Christians to express the unconditional love of God for
his children. This type of love was further explained by Thomas Aquinas as ”to will
the good of another.”
Éros (ἔρως érōs) means ”love, mostly of the sexual passion.” The Modern Greek
word ”erotas” means ”intimate love.” It can also apply to dating relationships as well as
marriage. Plato refined his own definition: Although eros is initially felt for a person,
with contemplation it becomes an appreciation of the beauty within that person, or
even becomes appreciation of beauty itself. Plato does not talk of physical attraction as
a necessary part of love, hence the use of the word platonic to mean, ”without physical
attraction.”
In the Symposium, the most famous ancient work on the subject, Plato has Socrates
argue that eros helps the soul recall knowledge of beauty, and contributes to an under-
standing of spiritual truth, the ideal ”Form” of youthful beauty that leads us humans
to feel erotic desire – thus suggesting that even that sensually based love aspires to
the non-corporeal, spiritual plane of existence; that is, finding its truth, just like find-
ing any truth, leads to transcendence. Lovers and philosophers are all inspired to seek
truth through the means of eros.
Source:
https://[Link]/wiki/Greek_words_for_love
English - Spanish - Arabic A black ribbon is a symbol of remembrance or mourn-
ing. Wearing or displaying a black ribbon has been used for POW/MIA remembrance,
mourning tragedies or as a political statement.
El crespón negro o lazo negro es un símbolo utilizado por personas, estados, so-
ciedades y organizaciones, representando un sentimiento político-social en señal de
duelo.
الرمز يعني الرسم الذي يعبر عن شيء معين وعموما فٔان العلامة ينبغي ٔان تنقل رسالتها بنظرة واحدة دون الحاجة لاية كلمات و من
المعروف ٔان قدماء المصريين والٔاغريق ٔاستخدموا العلامات ولكن ٔاكثر من استخدم العلامات هم
Source:
https://[Link]/?title=Lazo_negro
https://[Link]/wiki/Black_ribbon
https://[Link]/wiki/رمز
8.2.3 Twitter data
Tweet 1: Greek – English Μόλις ψήφισα αυτή τη λύση Internet of Things, στο
διαγωνισμό BUSINESS IT EXCELLENCE.
Source:
GaloTyri. ”Μόλις ψήφισα αυτή τη λύση Internet of Things, στο διαγωνισμό
BUSINESS IT EXCELLENCE.”. 19 June 2015, 12:06. Tweet
Tweet 2: English – French Demain #dhiha6 Keynote 18h @dhiparis “The collective
dynamics of science-publish or perish; is it all that counts?” par David @chavalarias
Source:
Claudine Moulin (ClaudineMoulin). ”Demain #dhiha6 Keynote 18h @dhiparis ”The
collective dynamics of science-publish or perish; is it all that counts?” par David
@chavalarias”. 10 June 2015, 17:35. Tweet.
Tweet 3: English – French Food and breuvages in Edmonton are ready to go, just
waiting for the fans #FWWC2015 #bilingualism
Source:
HBS (HBS_Tweets). ”Food and breuvages in Edmonton are ready to go, just waiting
for the fans #FWWC2015 #bilingualism”. 6 June 2015, 23:29. Tweet.
Tweet 4: English – Polish my dad comes back from poland with two crates of
strawberries, żubrówka and adidas jackets omg
Source:
katarzyne (wifeyriddim). ”my dad comes back from poland with two crates of
strawberries, żubrówka and adidas jackets omg”. 8 June 2015, 08:49. Tweet.
Tweet 5: Transliterated Amharic – English Buna dabo naw (coffee is our bread).
Source:
eCodeswitcher. ”Buna dabo naw (coffee is our bread).”. 9 June 2015, 02:12. Tweet.
abbha (nt.) [Vedic abhra nt. & later Sk. abhra m. ”dark cloud”; Idg. *m̊bhro, cp. Gr.
<at>a)fro\\s</at> scum, froth, Lat. imber rain; also Sk. ambha water, Gr.
<at>o)/mbros</at> rain, Oir ambu water]. A (dense & dark) cloud, a cloudy mass A
<smallcaps>ii.</smallcaps> 53 = Vin <smallcaps>ii.</smallcaps> 295 = Miln 273 in
list of to things that obscure moon– & sunshine, viz. <b>abbhaŋ mahikā</b> (mahiyā
A) <b>dhū- marajo</b> (megho Miln), <b>Rāhu</b> . This list is referred to at SnA
487 & VvA 134. S <smallcaps>i.</smallcaps> 101 (°sama pabbata a mountain like a
thunder–cloud); J <smallcaps>vi.</smallcaps> 581 (abbhaŋ rajo acchādesi); Pv
<smallcaps>iv.</smallcaps> 3 <superscript>9</superscript> (nīl° = nīla–megha PvA
251). As f. <b>abbhā</b> at Dhs 617 & DhsA 317 (used in sense of adj. ”dull”; DhsA
expl <superscript>s.</superscript> by valāhaka); perhaps also in <b>abbhāmaa</b>
. <br /><b>–kūṭa</b> the point or summit of a storm–cloud 1, 1064; J
<smallcaps>vi.</smallcaps> 249, 250; Vv 1 <superscript>1</superscript> (=
valāhaka–sikhara VvA 12). <b>–ghana</b> a mass of clouds, a thick cloud It 64; Sn
348 (cp. SnA 348). <b>–paṭala</b> a mass of clouds DhsA 239. <b>–mua</b> free
from clouds Sn 687 (also as abbhāmua Dh 382). <b>–saŋvilāpa</b> thundering S
<smallcaps>iv.</smallcaps> 289.
abhijjhitar [n. ag. fr. abhijjhita in med. function] one who covets M
<smallcaps>i.</smallcaps> 287 (T. abhijjhātar, v. l. °itar) = A
<smallcaps>v.</smallcaps> 265 (T. °itar, v. l. °ātar).
ajja Ajja,& Ajjā (adv.)[Vedic adya & adyā,a + dyā,a° being base of demonstr.
pron. (see a3)and dyā an old Loc. of dyaus (see diva) ,thus “on this day”]
to-day,now Sn.75,153,158,970,998;Dh.326;J.I,279;III,425 (read bahutaṁ
ajjā;not with Kern,Toev. s. v. as “food”);Pv.I,117 (= idāni PvA.59) ;PvA.6,
23;Mhvs 15,64. ‹-› Freq. in phrase ajjatagge (= ajjato + agge(?)or ajja-tagge,
see agga3)from this day onward,henceforth Vin.I,18;D.I,85;DA.I,235.
–kālaṁ (adv.)this morning [Link],180;–divasa the present day Mhvs 32,23.
(Page 10)
8.3 Results
8.3.1 N-Gram Language Models
For the n-gram language model approach, the identified language is indicated in
parentheses. The language abbreviations are:
Abbreviation Language
AR Arabic
DE German
EL Greek
EN English
ES Spanish
FI Finnish
FR French
IT Italian
PL Polish
RU Russian
UK Ukrainian
TR Turkish
TrAM Transliterated Amharic
TrEL Transliterated Greek
ZH Chinese
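The per-language listings in this section can be produced from word-level predictions along the following lines. This is only a minimal sketch: the function names and the example predictions are illustrative assumptions, not the code or data used for the experiments.

```python
from collections import defaultdict


def group_by_language(tagged_tokens):
    """Group (token, label) pairs by label, keeping tokens in first-seen order."""
    groups = defaultdict(list)
    for token, label in tagged_tokens:
        groups[label].append(token)
    return dict(groups)


def format_listing(groups):
    """Render one bullet per label; labels sorted, with 'other' placed last."""
    labels = sorted(groups, key=lambda lab: (lab == "other", lab))
    return "\n".join(f"• ({lab}) {', '.join(groups[lab])}" for lab in labels)


# Hypothetical word-level predictions (illustration only).
example = [
    ("Der", "DE"),
    ("Sommer", "DE"),
    ("Kesä", "FI"),
    ("Yaz", "TR"),
    ("Klimazone.", "other"),
]
listing = format_listing(group_by_language(example))
print(listing)
```

Tokens that could not be assigned to any language end up under the label other, which is why that bullet is sorted to the end.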
• (EN) own., belly, refer, buon, But, it, or, your, at, in, ”staring, anyone, doesn’t,
else’s, word, this
• (FI) –
• (TrAM) e
Data: Latin script: German – Finnish – Turkish
• (DE) ob, oder, Sommer, und, Nord-, arktischen, der, Der, dem, gemäßigten, mit,
er, Südsommer., spricht, Jahreszeiten, Südwinter, herrscht, wärmste, vom, die,
sta., nachdem, auf
• (EN) ist, Nordsommer, Mart, in
• (ES) en, depo
• (FI) joulu-, kevään, suvi, on, eli, vuodenajoista, syksyn, koska, kesä-., kuin, Po-
hjoisella, man, helmikuu., tammi-, lämpimin, heinä-, niin, maapallo, maan, pin-
nalle, Kesä, säteilee, tavallisesti, vuodenaika, kallistunut, lasketaan, muina, eiği,
jyrkemmässä, elokuu, välissä., eä, eteläisellä, silloin, ja, kulmassa
• (FR) vier, Je
• (PL) aurinko
• (RU) 22, 21
• (TR) yaklaşık, ortaya, genellikle, Eylül, Sıcak, çıkar., Yaz, sonra, arasında, Kuzey,
Güney, Aralık, gerade, ısıyı, gerçekleşir., Küre’de, günler, için, findet, mevsimdir.,
arasındadır., Haziran, iki, yazda, uzun, ise, ay, sıcak, ile, Yarım, Dünya
• (TrAM) Der
• (other) Klimazone., gleichzeitig,kesäkuukausiksi, vuodenaikoina., pallonpuolis-
kolla,Südhalbkugel
Data: Latin script: English – Transliterated Greek
• (EN) for, meanings, least, used, been, distinct, love, of, were, are, when, agápe,
these, how, and, Greek, word, used., outside, ways, different, other, follows.,
words, respective, generally, However, is, with, it, at, as, historically, the, in,
which, their
• (IT) contexts.
• (DE) drohe., geben., allgemeiner, Studie, jüngst, ür, Ergebnis, keine, kam, dro-
henden, oder, und, letzter, neue, Mythos?, Deutschland, Ist, sich, der, vergeht,
studierà, Dabei, Studie, den, dem, auch, Entwarnung, dass, nur, eher, nicht, gibt.,
Umfrage, Woche, eine, Kaum, Jahren, bei, mehren, Stimmen, Deutsche, das, zum,
mehr”, angemahnte, ”ein, Zeit, ein, So, vom, zu, die, seit, Warnung, Wissenscha
• (EL) affrescò
• (ES) custodisce, cura, subito, Certo, Giuda, lo, del, difesa, con, definire, restauro,
se, modo., la, arginato., recente, vada, movimento, Leonardo, Szenario, quel,
cominciò
• (FR) l’esempio, non, des, acque, perché, un, es, le, sui, condanna
• (IT) solo, faceva, caurata, chiave, peccato), periture, (il, delicatezza, cancro, pri-
vato, bellissima, anni, bacini, ovvero, delle, sogno, di, barbaglio, ma, qualche, e,
amore, ricerche, Come, per, richiamano, ne, intuizioni, punte, occhio, struggente:,
nelle, vita, riccioli, solo, che, volare?, sono, alla, alle, anche, Cenacolo, quello,
cosa, ali, viene, il, psicologia, vinciano, Venezia
• (PL) i
• (TrAM) Milano, E
• (EL) κείμενα, βαλκανικό, από, το, αιώνα, Αποτελεί, ελληνική, μία, επίσης, στον,
γλωσσικό, γλωσσών., είναι, Στην, έχουμε, μέλος, ανεξάρτητου, τις, γλώσσες.,
15ο, Ανήκει, γραπτά, π.Χ., σήμερα., γλώσσα, γλώσσα, κλάδου, οικογένειας, τον,
της, δεσμό., μέχρι, μοναδικό, ενός
• (TrAM) Η
• (EL) ”Form”
• (EN) sensually, platonic, for, holding, existence;, refined, its, explained, arac-
tion, of, (even, are, spiritual, given, refer, Agape, beauty, or, araction.”, like,
without, not, further, will, own, love, knowledge, will, one’s, most, use, ex-
press, is, another.”, e, leads, truth, suggesting, dating, relationships, in-
spired, ”love, mostly, hence, definition:, regard., appreciation, a, ideal, us, helps,
seek, Agápe, plane, recall, feeling, within, returned, chapter,”, based, described,
apply, physical, Although, good, by, used, love, God.”, children., his, any, char-
ity;, Socrates, be, work, throughout, and, that, Greek, even, word, agápē), love.”,
known, biblical, feelings, does, famous, In, subject, becomes, one, understand-
ing, children, ”love, through, beauty, well, It, was, initially, feast., finding, itself.,
13, all, ”without, feel, with, is, it, thus, New, as, the, brotherly, in, is, an, there,
God, youthful, necessary, high, Lovers, also, Whether
• (ES) person, Aquinas, esp., continues, has, omas, truth, can, erotic, sexual,
desire
• (FI) on, –, man, mean
• (FR) (ἀγάπη, spouse, not, ancient, marriage., soul, person, content, Christians,
Testament, Éros, just, part, type, passage, means, humans, passion.”, aspires, con-
templation, contributes, argue, affection
• (IT) texts, 1, ”intimate, Plato, ”to
• (RU) (ἔρως
• (TR) talk
• (TrAM) érōs), ”love:
• (TrEL) ”erotas”, denote, eros., to, eros
• (other) non-corporeal, Corinthians, self-benefit)., benevolence., unconditional,
philosophers, transcendence.
• (EN) for, used, been, displaying, of, ribbon, black, or, mourning., statement.,
tragedies, is, political, a, Wearing, as, mourning
• (ES) por, has, crespón, sociedades, personas, sentimiento, representando, esta-
dos, de, El, señal, lazo, símbolo, en, utilizado, y
• (FR) remembrance, remembrance, un, es
• (IT) negro, duelo., POW/MIA
• (TrAM)
• (TrEL) symbol, o
• (other) político-social, organizaciones
Data: Pali: abbha
• (EN) water, mountain, of, free, (used, or, like, referred, (also, A, is, cloudy,
clouds, later, a, froth, 1, summit, thundering, by, mass, Pv, Oir, obscure, scum,
that, water]., thick, As, from, It, is, at, as, the, in, clouds, things, also
• (FR) cloud, Dh, adj., point, cloud, Dhs, A), rain, VvA, DhsA, list
• (IT) \”dark, &, ambha, 3, 1, 317, J, sunshine, cp., abhra, [Vedic, (megho
• (RU) 348, 53
• (TrAM) 687, PvA, (°sama, 101, (nīl°, (cp., 64;, (nt.), 581, m., Sn, 1064;
• (other) <b> –saŋvilāpa </b>, <b> –mua </b>, <smallcaps> vi. </smallcaps>,
(mahiyā, <smallcaps> iv. </smallcaps>, cloud\”;, <b> Rāhu </b>, <b> abbhā
</b>, <b> abbhaŋ, <superscript> 9 </superscript>, marajo </b>, abbhāmua,
valāhaka);, <smallcaps> i. </smallcaps>, <b> abbhāmaa </b>, valāhaka–sikhara,
<superscript> s. </superscript>, <smallcaps> ii. </smallcaps>, <b> dhū-, storm–
cloud, /><b> –kūṭa </b>, thunder–cloud);, <at>a)fro\\s</at>, <b>–paṭala</b>,
<at>o)/mbros</at>, nīla–megha, <superscript>1</superscript>, *m̊bhro, \”dull\”;,
acchādesi);, mahikā</b>, <b> –ghana </b>
Data: Pali: abhijjhitar
• (DE) v.
• (EN) A, one, in, who, covets, med., function]
• (IT) ag., M, fr.
• (PL) 287, =
• (RU) 265
• (TrAM) l., [n.
• (TrEL) (T.
• (other) <smallcaps> v. </smallcaps>, abhijjhātar, abhijjhita, °ātar)., <smallcaps>
i. </smallcaps>, °itar, °itar)
Data: Pali: gūhanā
• (ES) 253)
• (EL) 382)
• (EN) for, aer, roasting, read, roasted, be, or, at, tormented, in
• (other) bake,
Gr.pέssw,(+Caus.pācayato),(q.v.) .(Page, DA.I, 159,
where,
Caus.pacāpeti, intrs.):Niraye, pacitvā, Pass.paccati,(trs.and, tormenting,
Gen.pacato, pīḷentassa).–, fig.torment, cook,pέpwn, Pacati,[Ved.pacati,
Idg.*peqǔō,Av.pac-;, paccato,by, ppr.pacanto, cook,boil,roast, fry,
roast,Lith,kepū, (q.v.) .–, (expld, Vin.IV,264;, Obulg.peka, pp.
pakka, (q.v.).‹-›, N.S.II,225,PvA.10,14.–
• (DE) Internet
• (ES) BUSINESS
• (TrAM) Μόλις
• (other) EXCELLENCE.
Data: Twitter 2 (French–English)
• (EN) David, ”e, is, it, perish;, or, collective, Demain, counts?”, that, of, dynam-
ics, all
• (FI) 18h
• (FR) par, Keynote
• (other) #dhiha6, @dhiparis, science-publish
8.3.2 Textcat
For Textcat, the identified language is indicated in parentheses. As Textcat returns
unknown for many words, I merely indicate the non-unknown categories to save
space and write rest to indicate that all other words of the text have been classified as
unknown. The language abbreviations are:
Abbreviation Language
DA Danish
DE German
EL Greek
EN English
ES Spanish
FI Finnish
FR French
HU Hungarian
ID Indonesian
IT Italian
LT Lithuanian
LV Latvian
NL Dutch
PT Portuguese
RU Russian
TH Thai
ZH Chinese
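The space-saving convention described above (list only the non-unknown categories, then a single rest entry) can be sketched as follows. The function name and the example labels are assumptions for illustration; the sketch does not call Textcat itself and only post-processes (token, label) pairs.

```python
def summarize_textcat(tagged_tokens, unknown_label="unknown"):
    """List tokens per known label and collapse all unknown tokens into 'rest'."""
    known = {}
    has_unknown = False
    for token, label in tagged_tokens:
        if label == unknown_label:
            has_unknown = True
        else:
            known.setdefault(label, []).append(token)
    lines = [f"• ({lab}) {', '.join(known[lab])}" for lab in sorted(known)]
    if has_unknown:
        # All remaining words of the text were classified as unknown.
        lines.append(f"• ({unknown_label}) rest")
    return lines


# Illustrative input; the 'unknown' tokens here are hypothetical.
example = [
    ("word", "unknown"),
    ("“navel-gazing”", "HU"),
    ("Nabelschau", "ZH"),
    ("means", "unknown"),
]
summary = summarize_textcat(example)
print("\n".join(summary))
```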
• (HU) “navel-gazing”
• (ZH) Nabelschau
• (unknown) rest
• (ZH) gemäßigten, Klimazone., Südhalbkugel, Nordsommer, gleichzeitig, vuoden-
aika, jyrkemmässä, vuodenaikoina., Pohjoisella, pallonpuoliskolla, kesäkuukausik-
si, eteläisellä, mevsimdir., gerçekleşir., arasındadır.,
• (unknown) rest
• (HU) different.,
• (ZH) (rugueux),(otherwise,
• (unknown) rest
• (LT) languages,
• (unknown) rest
• (ES) delicatezza,
• (HU) bellissima,
• (NL) Ingenieuren,
• (PT) approfonditamente,
• (unknown) rest
Data: Mixed script: Greek – Russian
• (TH) латинским),
• (unknown) rest
• (IT) non-corporeal,
• (PT) contributes,
• (unknown) rest
• (ID) displaying,
• (PT) representando,
• (unknown) rest
Data: Mixed script: English – Chinese
• (FR) characters,
• (unknown) rest
• (TH) вважається,
• (unknown) rest
• (ZH)
<at> a)fro\\</at>, <at> o)/mbros </at>, <smallcaps> ii. </smallcaps>, mahikā</b>,
<b> Rāhu </b>, <smallcaps> i. </smallcaps>, thunder–cloud);, <smallcaps> vi.
</smallcaps>, acchādesi);, <smallcaps> iv. </smallcaps>, <superscript> 9 </su-
perscript>, <b> abbhā </b>, <superscript> s. </superscript>, valāhaka);, <b>
abbhāmaa </b>, /><b> –kūṭa </b>, <superscript> 1 </superscript>, <b> –ghana
</b>, <b> –paṭala </b>, <b> –mua </b>, abbhāmua, <b> –saŋvilāpa </b>
• (unknown) rest
• (unknown) rest
Data: Pali: ajja
• (unknown) rest
• (unknown) rest
• (unknown) rest
• (unknown) rest
• (IT) collective,
• (ZH) science-publish,
• (unknown) rest
• (ZH) #bilingualism,
• (unknown) rest
Data: Twitter 4 (English–Polish)
• (LV) strawberries,
• (unknown) rest
• (unknown) rest
8.3.3 Clustering
Clustering the different data sets produced the following clusters. The second run
uses the clusters from the first run and possibly subdivides each cluster into two or
more clusters.
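The relation between the two runs can be pictured with the sketch below: the second run takes each cluster of the first run and either keeps it or splits it further. The splitting criterion shown here (token length) is a toy assumption chosen only to make the example runnable; it is not the clustering algorithm used for the results that follow.

```python
def second_run(first_run_clusters, split):
    """Apply `split` to every first-run cluster and collect the sub-clusters."""
    refined = []
    for cluster in first_run_clusters:
        parts = [part for part in split(cluster) if part]
        # Keep the cluster unchanged if the splitter yields nothing usable.
        refined.extend(parts if parts else [cluster])
    return refined


def split_by_length(cluster):
    """Toy splitter (assumption): separate very short tokens from longer ones."""
    short = [tok for tok in cluster if len(tok) <= 2]
    longer = [tok for tok in cluster if len(tok) > 2]
    return [short, longer]


# Hypothetical first-run clusters.
first_run = [["is", "or", "Doux", "English"], ["21", "22"]]
refined = second_run(first_run, split_by_length)
print(refined)
```

Because every second-run cluster is a subset of exactly one first-run cluster, the second run can only refine the first partition, never merge clusters across it.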
First run
• “staring, But, German, Nabelschau, anyone, belly, buon, case, just, means,
navel”., own., refer, this, word, your
• –, e
Second run
• doesn’t, else’s
• “navel-gazing”
• “staring, But, German, Nabelschau, belly, case, means, navel”., refer, this
• it, or, to
• at, in
• –, e
Data: Latin script: German – Finnish – Turkish
First run
• Aralık, Eylül, Kesä, Yarım, arasındadır., eteläisellä, eiği, eä, gerçekleşir., heinä-
, jyrkemmässä, kesä-., kevään, välissä., yaklaşık, ısıyı
• 21, 22
Second run
• Kesä
• 22
• 21
• arktischen, auf, aurinko, dem, depo, der, die, eli, elokuu, findet, genellikle, gerade,
gleichzeitig, helmikuu., herrscht, iki, ile, ise, ist, joulu-, kallistunut, koska, kuin,
kulmassa, lasketaan, maan, maapallo, man, mevsimdir., mit, muina, nachdem,
niin, oder, ortaya, pallonpuoliskolla, pinnalle, silloin, sonra, spricht, sta., suvi,
syksyn, tammi-, tavallisesti, und, uzun, vier, vom, vuodenaika, vuodenaikoina.,
vuodenajoista, yazda
First run
• is, or
• as, in, of
Second run
• Doux, English,
• “hard”., ”rough”, ”so”, also, both, but, can, doux, for, mean, mou, only, sucré,
sweet, the, their, used)., very, while, wines
• or
• is
• in
• of
• as
Data: Latin script: English – Transliterated Greek
First run
• e
• Greek, and, are, as, at, been, for, four, has, how, in, is, it, least, love, love:, of,
other, the, their, these, to, used, used., ways, were, when, with, word
Second run
• e
• philía, storgē.
• agápe, éros,
• words
• and, are, as, at, been, for, four, has, how, in, is, it, least, love, love:, of, other, the,
their, these, to, used, used., ways, were, when, with, word
• Greek
First run
• (il, E, So, a, ad, da, di, e, es, ha, i, il, in, la, le, lo, ma, ne, se, si, un, va, zu
• “ein , Annunciazione, Baista, Cenacolo, Certo, Come, Dabei, Deutsche, Deutsch-
land, Entwarnung, Ergebnis, Giuda, Ingenieuren, Ist, Jahren, Kaum, Leonardo,
MINT-Berufen, Mythos?, Naturwissenschalern, Stierverbands, Stimmen, Stu-
die, Studie, Szenario, ema, Umfrage, Venezia, Warnung, Wissenscha, Woche,
Zeit, acque, ali, alla, alle, allgemeiner, also, amore, anche, angemahnte, anni,
anti-Turchi., approfonditamente, arginato., aento, auch, autodistruggersi, baci-
ni, barbaglio, bei, bellissima, cancro, caurata, che, chiave, con, condanna, consa-
pevole, consapevolezza, cosa, cura, custodisce, das, dass, definire, del, delicatezza,
delle, dem, den, der, des, die, difesa, drohe., drohenden, eher, ein, eine, faceva,
geben., gibt., idee, intuizioni, kam, keine, letzter, lunghissimo, mehr”, mehren,
modo., moto, movimento, nelle, neue, nicht, non, nur, occhio, oder, ossessiona-
va, ovvero, peccato), per, periture, poi, privato, psicologia, punte, qualche, quel,
quello, recente, restauro, riccioli, ricerche, richiamano, rivoluzionaria, seit, sich,
sogno, solo, solo, sono, stessa, struggente:, subito, sui, und, vada, vergeht, viene,
vinciano, vita, volare?, vom, zum
Second run
• a, e, i
• E
• So
• (il, ad, da, di, es, ha, il, in, la, le, lo, ma, ne, se, si, un, va, zu
• “ein, acque, ali, alla, alle, allgemeiner, also, amore, anche, angemahnte, an-
ni, anti-Turchi., approfonditamente, arginato., aento, auch, autodistruggersi,
bacini, barbaglio, bei, bellissima, cancro, caurata, che, chiave, con, condanna,
consapevole, consapevolezza, cosa, cura, custodisce, das, dass, definire, del, de-
licatezza, delle, dem, den, der, des, die, difesa, drohe., drohenden, eher, ein,
eine, faceva, geben., gibt., idee, intuizioni, kam, keine, letzter, lunghissimo,
mehr”, mehren, modo., moto, movimento, nelle, neue, nicht, non, nur, oc-
chio, oder, ossessionava, ovvero, peccato), per, periture, poi, privato, psicologia,
punte, qualche, quel, quello, recente, restauro, riccioli, ricerche, richiamano, ri-
voluzionaria, seit, sich, sogno, solo, solo, sono, stessa, struggente:, subito, sui,
und, vada, vergeht, viene, vinciano, vita, volare?, vom, zum
• Fachkräemangel
• ür
• più
First run
• 15ο, —, Η
Second run
• 15ο
• —
• Η
• а, в, и, с
• В
• το, На, на
• (наряду, (так
• Римской, большое, была, время, всех, всякого, двумя, для, его, знание, им-
перии, лексику, научных, новое, новых, путями, русский, слова, создана,
стал, через, этапах, язык, языка, языке
First run
• “Form”, “erotas”, “love, “love, “love:, (even, Agape, Greek, Lovers, Modern, Plato,
is, omas, also, apply, argue, based, beauty, beauty, being, dating, denote,
desire, does, eros, eros., erotic, even, famous, feast., feel, felt, given, good, helps,
hence, high, humans, ideal, itself., just, known, leads, like, love, love, love.”, mean,
means, most, mostly, one’s, part, person, person, plane, recall, refer, regard., seek,
sexual, soul, spouse, talk, texts, that, there, thus, truth, truth, type, used, well,
will, will, with, within, word, work
• “to, 1, 13, God, God.”, In, It, New, e, a, all, an, and, any, are, as, be, by, can, esp.,
for, has, his, in, is, is, it, its, man, not, not, of, on, one, or, own, the, to, us, use,
was
Second run
• affection, ancient, another.”, aspires, becomes, biblical, chapter,”, charity;, chil-
dren, children., content, definition:, feeling, feelings, finding, holding, marriage.,
necessary, passage, passion.”, platonic, refined, returned, subject, through, with-
out
• Although, Aquinas, Christians, Corinthians, Socrates, Symposium, Testament,
Whether
• “intimate, appreciation, araction, araction.”, benevolence., brotherly, contem-
plation, continues, contributes, described, existence;, explained, express, further,
initially, inspired, knowledge, non-corporeal, philosophers, physical, relation-
ships, self-benefit)., sensually, spiritual, suggesting, throughout, transcendence.,
unconditional, understanding, youthful
• Agápe, agápē), Éros, érōs)
• (ἀγάπη, (ἔρως
• –
• “erotas”, beauty, beauty, dating, denote, desire, erotic, famous, humans, itself.,
mostly, person, person, recall, regard., sexual, spouse, within
• “Form”, Agape, Greek, Lovers, Modern, Plato, is, omas, based, being, feast.,
hence, ideal, leads, means, plane, refer, there
• apply, felt, helps, high, just, known, most, part, talk, texts, that, thus, truth, truth,
type, well, will, will, with, word, work
• “love, “love, “love:, (even, also, argue, does, eros, eros., even, feel, given, good,
like, love, love, love.”, mean, one’s, seek, soul, used
• 1, 13, In, It
• “to, a, an, as, be, by, in, is, is, it, of, on, or, to, us
• God, God.”, New, e, all, and, any, esp., its, own, the
• are, can, for, has, his, man, not, not, one, use, was
First run
• El, POW/MIA, Wearing, a, as, been, black, de, displaying, duelo., en, es, estados,
for, has, is, lazo, mourning, mourning., negro, o, of, or, organizaciones, personas,
political, por, remembrance, remembrance, representando, ribbon, sentimiento,
sociedades, statement., symbol, tragedies, un, used, utilizado, y
• A
• ، رسالتها، دون، تنقل، بنظرة، المعروف، المصريين، العلامة، العلامات، الرمز، الرسم، الذي، الحاجة، استخدم، ٔان، ٔاكثر،ٔاستخدموا
ينبغي، يعني، يعبر، ولكن، وعموما، والٔاغريق، واحدة، و، هم، من، معين، لاية، كلمات، قدماء، فٔان، عن،شيء
Second run
• a, o, y
• Wearing, been, black, displaying, duelo., estados, for, has, lazo, mourning, mourn-
ing., negro, organizaciones, personas, political, por, remembrance, remembrance,
representando, ribbon, sentimiento, sociedades, statement., symbol, tragedies,
used, utilizado
• POW/MIA
• político-social, símbolo
• crespón, señal
• A
• ينبغي، يعني، يعبر، ولكن، واحدة، معين، لاية، كلمات، قدماء، فٔان، شيء، دون، تنقل، بنظرة، الرمز، الرسم، الذي،ٔاكثر
• هم، من، عن،ٔان
• و
First run
• 0, 6, 8, 9
• In, Some, e, and, are, based, be, by, can, for, has, have, in, is, jī, name, names,
of, on, or, other, some, sound, that, the, their, this, to, to., two, usage, word, words
Second run
• Chinese, Chinese:
• Western
• (不利), (吉利)
• 危机;, 危機;
• 机;, 機)
• 6, 8, 9
• 0,
• Some, e, and, are, based, can, for, has, have, name, names, other, some, sound,
that, the, their, this, two, usage, word, words
• In, be, by, in, is, of, on, or, to, to.
• jī
First run
• 9—13
• Имеет, На, Русь, але, водами, готів, и, й, на, с, та, юге, інших
Second run
• 9—13
• и, й, с
• На, на, та
• Имеет, Русь
Data: Pali: abbha
First run
• (also, (cp., (dense, (megho, (used, (°sama, 1, 1, 101, 1064;, 12)., 134., 239., 249,
250;, 251)., 273, 289., 295, 3, 317, 348, 348)., 382)., 487, 53, 581, 617, 64;, 687, <at>
a)fro\\s </at>, <at> o)/mbros </at>, <smallcaps> i. </smallcaps>, <smallcaps>
ii. </smallcaps>, <smallcaps> iv. </smallcaps>, <smallcaps> vi. </smallcaps>,
<superscript> 1 </superscript>, <superscript> 9 </superscript>, <superscript>
s. </superscript>, A, A), As, Dh, Dhs, DhsA, Gr., Idg., It, J, Lat., Miln, Miln),
Oir, Pv, PvA, S, Sk., Sn, SnA, , is, Vin, Vv, VvA, [Vedic, a, abhra, adj., also,
ambha, ambu, as, at, by, cloud, cloud, cloud\”;, clouds, clouds, cloudy, cp., dark),
expl, f., free, from, froth, imber, in, is, later, like, list, m., marajo</b>, mass,
moon–, mountain, nt., obscure, of, or, pabbata, perhaps, point, rain, rain;, rajo,
referred, scum, sense, storm–cloud, summit, sunshine, that, the, thick, things,
thunder–cloud);, thundering, to, viz., water, water].
• &, (=, <b>–ghana</b>, <b>–mua</b>, <br, =, \”dark, \”dull\”;
• (abbhaŋ, (mahiyā, (nīl°, <b> –saŋvilāpa </b>, <b> Rāhu </b>, <b> abbhā </b>,
<b> abbhāmaa </b>, abbhāmua, acchādesi);, mahikā </b>, nīla–megha, valā-
haka);, valāhaka– sikhara
• *m̊bhrocite /><b>–kūṭa</b>, <b>–paṭala</b>, <b>abbhaŋ, <b>dhū-, (nt.)
Second run
• (cp., Dhs, DhsA, Idg., Lat., Miln, Miln), Oir, PvA, SnA, is, Vin, VvA, [Vedic, as,
at, by, cp., in, is, nt., of, or, to
• (also, (dense, (megho, (used, (°sama, <at> a)fro\\s </at>, <at> o)/mbros </at>,
<smallcaps> ii. </smallcaps>, <smallcaps> iv. </smallcaps>, <smallcaps> vi.
</smallcaps>, abhra, adj., also, ambha, ambu, cloud, cloud, cloud\”;, clouds, clouds,
cloudy, dark), expl, free, from, froth, imber, later, like, list, marajo </b>, mass,
moon–, mountain, obscure, pabbata, perhaps, point, rain, rain;, rajo, referred,
scum, sense, storm– cloud, summit, sunshine, that, the, thick, things, thunder–
cloud);, thundering, viz., water, water].
• 1, 1, 101, 1064;, 12)., 134., 239., 249, 250;, 251)., 273, 289., 295, 3, 317, 348, 348).,
382)., 487, 53, 581, 617, 64;, 687, <superscript> 1 </superscript>, <superscript> 9
</superscript>
• <smallcaps> i. </smallcaps>, <superscript> s. </superscript>, A, A), As, Dh, Gr.,
It, J, Pv, S, Sk., Sn, , Vv, a, f., m.
• <b> –ghana </b>, <b> –mua </b>, <br, \”dark, \”dull\”;
• &, (=, =
• (abbhaŋ, (mahiyā, (nīl°, <b> Rāhu </b>, <b> abbhā </b>, nīla–megha
• *m̊bhro, /><b> –kūṭa </b>, <b> –paṭala </b>, <b> abbhaŋ, <b> dhū-
• (nt.)
First run
• abhijjhita, abhijjhātar, covets, function], med., one, who, °itar), °itar, °ātar).
• 265, 287
• [n.
Second run
• abhijjhita, abhijjhātar, covets, function], med., one, who, °itar), °itar, °ātar).
• (T., A, M
• =, l., v.
• 265, 287
• [n.
Data: Pali: ajja
First run
• –divasa, Freq., Loc., [Vedic, adya, ajjatagge, ajjato, an, and, as, base, being, day,
demonstr., dyaus, from, in, morning, not, of, old, or, phrase, present, pron., the,
this, with
• –kālaṁ, 10), 15,64., 32,23., Ajjā, D.I,85;, DA.I,235., Dh.326;, III,425, J.I,
279;, [Link],180;, Kern,Toev., Pv.I,117, PvA.59) ;, PvA.6,23;, Sn.75,153,
158,970,998;, Vin.I,18;, a3), adyā,a, agga3), agge(?), ajja-tagge,see,
ajjā;, bahutaṁ, day”], diva),thus, dyā, dyā,a°, idāni, onward,henceforth,
to-day,now, “food”);, “on, ‹-›, Ajja,&, (=, (Page, (adv.), (read, (see
Second run
• –divasa, adya, ajjatagge, ajjato, and, base, being, day, demonstr., dyaus, from,
morning, not, old, phrase, present, pron., the, this, with
• &
• +
• Mhvs
• s., v.
Data: Pali: gūhanā
First run
Second run
First run
• aer, at, be, for, in, or, pacato, pare, purgatory, read, ripe], roasted, roasting, to,
tormented
Second run
• 382), pp.pakka
• Pacati,[Ved.pacati,Idg.*peqǔō,Av.pac-;,(+Caus.pācayato),(expld,
(q.v. ).‹-›, (q.v.).(Page, (trs.and
• for, pacato, pare, read, ripe]
• or, to
• at, be, in
• &
• pacitvā, pāceti
First run
Second run
• Μόλις
• IT, of
• Internet, ings,
• BUSINESS, EXCELLENCE.
First run
• “e, 18h, @dhiparis, David, Demain, Keynote, all, collective, counts?”, dynam-
ics, par, perish;, science-publish, that
Second run
• 18h
• is, it, or
• of
First run
• Edmonton, Food
• go, in, to
Second run
• Edmonton, Food
• to
• go, in
• for, just
First run
• żubrówka, my
• adidas, and, back, comes, crates, dad, from, jackets, of, omg, poland, strawberries,
two, with
Second run
• żubrówka, my
• and, back, crates, from, jackets, omg, poland, strawberries, two, with
First run
• Buna
Second run
• Buna
• our
• e, German, word, Nabelschau, means, or, “staring, at, your, But, in, this, it,
doesn’t, refer, to, anyone, else’s, buon, just, your, own.,
• –
Data: Latin script: German–Finnish–Turkish
• die, in, und, Klimazone., Je, ob, auf, Südhalbkugel, vom, eli, on, vuodenaika, ja, on,
vuodenajoista, koska, maapallo, on, silloin, kallistunut, aurinko, maan, pinnalle,
kulmassa, muina, vuodenaikoina., Pohjoisella, pallonpuoliskolla, lasketaan, ta-
vallisesti, ja, elokuu, eteläisellä, pallonpuoliskolla, joulu-, ja, helmikuu., en, sı-
cak, en, yazda, Dünya, depo, en, sıcak, yaklaşık, ay, sonra, ortaya, Sıcak, Haziran,
Eylül, ise, Aralık, arasındadır.
• Der, ist, wärmste, der, vier, Jahreszeiten, der, arktischen, nachdem, er, der, Nord-,
oder, herrscht, spricht, Nord-, oder, Der, findet, mit, Südwinter, sta., suvi, läm-
pimin, niin, eä, säteilee, heinä-, Yaz, mevsimdir., Küre’de, Küre’de, 21, 22, ara-
sında, Küre’de, 22, 21, Mart
• both, “so”, in, English, although, their, is, is, the, opposite, of, “rough”, or, is,
the, opposite, of, sweet, only, for, wines, (otherwise, is
• doux,
• while
• “hard”., used).,
• translate, as, meaning, very, different., ”coarse”, can, also, mean, almost,sucré,
• at, least, ways, as, to, is, has, philía, and, storgē., as, has, historically, difficult, to,
which, generally, as
• e, language, distinguishes, different, the, Ancient, distinct, with, languages, it,
been, separate, the, meanings, these, used, outside, their, respective, the, senses,
in, these, used
• Greek, how, word, Greek, agápe, éros, However, other, when, were, are
• four, love, used., four, words, for, love:, of, words, of, contexts., Nonetheless,
words, follows.
• dell’aureola, da, del, di, der, zum, modo., dem, den, drohe., Come, vom
• custodisce, quel, es, oder, per, le, idee, stessa, des, dass, delle, E, se, Ist, das, seit
• ne, struggente:, che, amore, e, non, viene, ma, consapevolezza, ad, che, ha, re-
cente, Kaum, eine, Woche, vergeht, keine, neue, Umfrage, Warnung, ema,
Fachkräemangel, Deutschland, Certo, ma, anche, consapevole, che, qualche,
mehren, letzter, Zeit, Stimmen, Entwarnung, geben., kam, jüngst, eine, Deutsche,
”ein, allgemeiner, Fachkräemangel, eher, mehr”, anche, Baista, che, Leonardo,
approfonditamente, a, Venezia, nelle, vada, alla, aento, alle, dell’angelo:, deli-
catezza, punte, che, non, che, volare?, Jahren, angemahnte, drohenden, Fachkräe-
mangel, Ingenieuren, ein
• cura, restauro, arginato., gibt., perché, caurata, sich, auch, zu, nicht, richia-
mano, acque, ricerche, chiave, anti-Turchi., nur
Data: Mixed script: Greek–Russian
• ελληνική, γλώσσα, είναι, μία, από, τις, ινδοευρωπαϊκές, γλώσσες., Αποτελεί, το,
μοναδικό, μέλος, ενός, ανεξάρτητου, κλάδου, της, ινδοευρωπαϊκής, οικογένειας,
γλωσσών., Ανήκει, επίσης, στον, βαλκανικό, γλωσσικό, δεσμό., Στην, ελληνική,
γλώσσα, έχουμε, γραπτά, κείμενα, από, τον, 15ο, αιώνα, μέχρι, σήμερα.
• is, biblical, is, will, is, without, self-benefit)., is, feelings, feelings, it, be, feeling,
being, high, is, by, his, is, by, will, mostly, sexual, ”intimate, well, refined, his,
definition:, is, initially, felt, with, it, beauty, within, beauty, itself., use, ”without,
helps, soul, beauty, spiritual, youthful, beauty, feel, suggesting, sensually,
spiritual, finding, its, like, finding, all, seek
• (ἀγάπη, (ἔρως
• Agápe, ”love:, brotherly, love, love, of, God, for, of, for, in, known, ”love, 1, 13,
throughout, New, brotherly, love, affection, good, love, love, given, or, not,
person, continues, love, (even, in, for, one’s, for, spouse, refer, love, of, content, or,
holding, one, in, unconditional, love, of, God, for, of, love, ”to, good, of, Éros,
”love, of, The, Modern, Greek, word, love.”, own, Although, eros, for, person,
contemplation, becomes, of, person, or, even, becomes, of, not, of, of, love, of,
word, mean, In, Symposium, work, on, subject, eros, knowledge, of, of, ”Form”,
of, erotic, –, even, love, non-corporeal, of, is, Lovers, philosophers, through, of,
• agápē), means, esp., charity;, the, man, and, man, God.”, Agape, used, the,
passage, as, the, chapter,”, Corinthians, and, described, there, and, the, Testament,
as, and, benevolence., Whether, the, returned, the, to, any, Agape, also, used,
ancient, texts, to, denote, children, and, the, a, and, was, also, used, to, to, a,
feast., It, can, also, described, as, the, regard., Agape, used, Christians, to, express,
the, children., type, was, further, explained, Thomas, Aquinas, as, the, another.”,
érōs), means, the, passion.”, ”erotas”, means, It, can, also, apply, to, dating,
relationships, as, as, marriage., Plato, a, an, appreciation, the, that, appreciation,
Plato, does, talk, physical, attraction, as, a, necessary, part, hence, the, the,
platonic, to, physical, attraction.”, the, the, most, famous, ancient, the, Plato, has,
Socrates, argue, that, the, recall, and, contributes, to, an, understanding, truth,
the, ideal, that, leads, us, humans, to, desire, thus, that, that, based, aspires, to,
the, plane, existence;, that, truth, just, any, truth, leads, to, transcendence., and,
are, inspired, to, truth, the, means, eros.
Data: Mixed script: Ukrainian–Russian
• існували, інших
• культури
• (nt.), nt., Sk., \”dark, Idg., cp., Gr., Lat., Sk., water, Gr., water]., dark), at, SnA, S,
at, It, Sn, (cp., SnA, Sn, S
• &, A, A), ., J, 251)., 1, 1064;, 249, 250;, 12)., 64;, 348)., 382).
• cloud\”;, also, cloud, cloudy, <smallcaps> ii. </smallcaps>, =, list, is, <smallcaps>
i. </smallcaps>, (°sama, <smallcaps> vi. </smallcaps>, (abbhaŋ, <smallcaps> iv.
</smallcaps>, (nīl°, As, Dhs, DhsA, (used, (=, clouds, cloud, (also, as
• m., adj.
• abhra, (mahiyā, VvA, acchādesi);, Pv, PvA, \”dull\”;, valāhaka);, Vv, valāhaka–
sikhara
• later, scum, rain;, ambha, rain, a, Miln, (megho, Miln), nīla–megha, sense, expl,
, Dh
• *m̊bhro, <at>o)/mbros</at>, ambu, mass, to, obscure, moon–, <b>abbhaŋ, mahikā
</b>, <b>dhū-, marajo</b>, <b>Rāhu</b>, pabbata, rajo, <b>abbhā</b>, by,
perhaps, <b>abbhāmatta</b>, <br, /><b>–kūṭa</b>, or, summit, storm–cloud,
<b>–ghana</b>, <b>–paṭala</b>, mass, <b>–mutta</b>, from, abbhāmutta, <b>
–saŋvilāpa</b>,
• [Vedic, imber, Oir, (dense, Vin, in, things, that, sunshine, is, referred,
mountain, like, thunder–cloud);, the, point, thick, free, thundering
• Ajja,&, Ajjā, (adv.), base, a3), diva),thus, Dh.326;, ajjā;, v., PvA.59) ;,
PvA.6,23;, phrase, ajjatagge, ajjato, agge(?), ajja-tagge,see, agga3),(adv.)
, the, 32,23., (Page
• ‹-›, –kālaṁ
• [Vedic, &, +, being, (see, (see, (read, as, (=, Mhvs, (=, +, Mhvs
• adya, adyā,a, dyā,a°, dyā, dyaus, day”], to-day,now, bahutaṁ, with, day,
–divasa, day
• demonstr., pron., Loc., this, Kern,Toev., s., Freq., or, from, this, onward,henceforth,
this, morning, present
Data: Pali: gūhanā
• Pug.19.Cp.pari°.(Page
• Gūhanā,(f.), [abstr.fr.gūhati]=gūhanā
• 253),(q.v.)
Normalized data
• pacati, peka, pέssw, pέpwn, pacitvā, ppr., pacanto, Gen., pacato, (+Caus.,
pācayato), pacato, paccato, pare, pīḷentassa)., pp., pakka, Caus., pacāpeti, Pass., paccati
• *peqǔō, bake
• pac-;, 264;, 52, &, 382)
• 10,14.–, 159, –, <->, –
• fry, Niraye, I, I, by
• Av., Obulg., Gr., (trs., D., DA., (q.v.)., (q.v.)., (q.v.).
• [Ved., to, roast, kepū, cook, ripe], to, cook, roast, torment, purgatory, and, after,
roasting, tormenting, (expld, at, where, read, for, daṇḍena, pāceti, to, be, roasted,
or, tormented, (Page
• Pacati, Idg., Lith, boil, [Link], fig., in, intrs.):, in, [Link],225,PvA.
Data: Twitter 1 (Greek–English)
• BUSINESS, EXCELLENCE.
• Things, IT
• Keynote, “The, collective, of, science-publish, or, perish;, it, all, that, counts?”
• #dhiha6, David
• @dhiparis, dynamics, is
• #FWWC2015
• breuvages, go,
• back, żubrówka
• (coffee