Language detection
Each time a post is created, the body content is analyzed to determine which language(s) it contains.
The result is stored as JSON in the body_language field of the Comments table.
As a post can contain multiple languages, the result is an array.
Each item in the array contains the following values:
Language code
Confidence score
Is reliable - true/false
If the language cannot be determined (e.g. a post containing only pictures), the array is left empty.
The result is something like this:
A post with several languages will have its body_language field set to something like:
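As a purely hypothetical illustration of the structure described above, the Python sketch below parses a made-up body_language value. The keys confidence and isReliable are named on this page; the key name code for the language code, the helper name parse_body_language, and every sample value are assumptions, so check them against a real row of the Comments table before relying on them.

```python
import json

# Hypothetical stored value: the "code" key and all numbers are assumed
# for illustration and are not taken from a real post.
raw = (
    '[{"code": "en", "confidence": 120, "isReliable": true},'
    ' {"code": "fr", "confidence": 35, "isReliable": false}]'
)


def parse_body_language(value):
    """Return the detected languages as a list of dicts.

    An empty or missing value yields an empty list, matching the case
    where the language could not be determined.
    """
    if not value:
        return []
    items = json.loads(value)
    return items if isinstance(items, list) else []


languages = parse_body_language(raw)
codes = [item["code"] for item in languages]
```

Returning an empty list for a missing value keeps the calling code uniform: a post with pictures only and a post with no stored result are handled the same way.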
The confidence value is related to how much text the post contains: the more text is analyzed, the better the language analysis and the higher the confidence value. Confidence is not a ratio and can be higher than 100.
If the post contains words in different languages, isReliable is set to true on the most probable language, even if its confidence value is lower.
If there is only one language and isReliable is set to false, this indicates that the confidence is too low.
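The rules above suggest one way to pick a single language for a post: prefer entries flagged isReliable, and fall back to the highest confidence otherwise. The Python sketch below follows that reading. The function name most_probable_language and the key name code are assumptions, not part of the documented format.

```python
def most_probable_language(items):
    """Return (code, reliable) for the most probable language.

    `items` is the parsed body_language array. Returns None when the
    array is empty, i.e. no language could be determined.
    """
    if not items:
        return None
    # Entries flagged as reliable win, even when their confidence is lower.
    reliable = [item for item in items if item.get("isReliable")]
    candidates = reliable or items
    best = max(candidates, key=lambda item: item.get("confidence", 0))
    # "code" is an assumed key name for the language code.
    return best["code"], bool(best.get("isReliable"))


# Hypothetical sample: French is flagged reliable despite a lower confidence.
sample = [
    {"code": "en", "confidence": 80, "isReliable": False},
    {"code": "fr", "confidence": 40, "isReliable": True},
]
primary = most_probable_language(sample)
```

A caller can treat a result whose second element is False as a low-confidence guess, which is the single-language case described above.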
Be aware that the language detector works with probabilities and is sometimes inaccurate on very short texts. The same happens when the languages used in a post share similar words.
The language detector can also be tricked when the content of a post contains a lot of "technical noise" such as pictures, source code, or edit tags.
| Code | Language | Code | Language | Code | Language | Code | Language |
|------|----------|------|----------|------|----------|------|----------|
| aa | Afar | ab | Abkhazian | af | Afrikaans | ak | Akan |
| am | Amharic | ar | Arabic | as | Assamese | ay | Aymara |
| az | Azerbaijani | ba | Bashkir | be | Belarusian | bg | Bulgarian |
| bh | Bihari | bi | Bislama | bn | Bengali | bo | Tibetan |
| br | Breton | bs | Bosnian | bug | Buginese | ca | Catalan |
| ceb | Cebuano | chr | Cherokee | co | Corsican | crs | Seselwa |
| cs | Czech | cy | Welsh | da | Danish | de | German |
| dv | Dhivehi | dz | Dzongkha | egy | Egyptian | el | Greek |
| en | English | eo | Esperanto | es | Spanish | et | Estonian |
| eu | Basque | fa | Persian | fi | Finnish | fj | Fijian |
| fo | Faroese | fr | French | fy | Frisian | ga | Irish |
| gd | Scots_Gaelic | gl | Galician | gn | Guarani | got | Gothic |
| gu | Gujarati | gv | Manx | ha | Hausa | haw | Hawaiian |
| hi | Hindi | hmn | Hmong | hr | Croatian | ht | Haitian Creole |
| hu | Hungarian | hy | Armenian | ia | Interlingua | id | Indonesian |
| ie | Interlingue | ig | Igbo | ik | Inupiak | is | Icelandic |
| it | Italian | iu | Inuktitut | iw | Hebrew | ja | Japanese |
| jw | Javanese | ka | Georgian | kha | Khasi | kk | Kazakh |
| kl | Greenlandic | km | Khmer | kn | Kannada | ko | Korean |
| ks | Kashmiri | ku | Kurdish | ky | Kyrgyz | la | Latin |
| lb | Luxembourgish | lg | Ganda | lif | Limbu | ln | Lingala |
| lo | Laothian | lt | Lithuanian | lv | Latvian | mfe | Mauritian Creole |
| mg | Malagasy | mi | Maori | mk | Macedonian | ml | Malayalam |
| mn | Mongolian | mr | Marathi | ms | Malay | mt | Maltese |
| my | Burmese | na | Nauru | ne | Nepali | nl | Dutch |
| no | Norwegian | nr | Ndebele | nso | Pedi | ny | Nyanja |
| oc | Occitan | om | Oromo | or | Oriya | pa | Punjabi |
| pl | Polish | ps | Pashto | pt | Portuguese | qu | Quechua |
| rm | Rhaeto Romance | rn | Rundi | ro | Romanian | ru | Russian |
| rw | Kinyarwanda | sa | Sanskrit | sco | Scots | sd | Sindhi |
| sg | Sango | si | Sinhalese | sk | Slovak | sl | Slovenian |
| sm | Samoan | sn | Shona | so | Somali | sq | Albanian |
| sr | Serbian | ss | Siswant | st | Sesotho | su | Sundanese |
| sv | Swedish | sw | Swahili | syr | Syriac | ta | Tamil |
| te | Telugu | tg | Tajik | th | Thai | ti | Tigrinya |
| tk | Turkmen | tl | Tagalog | tlh | Klingon | tn | Tswana |
| to | Tonga | tr | Turkish | ts | Tsonga | tt | Tatar |
| ug | Uighur | uk | Ukrainian | ur | Urdu | uz | Uzbek |
| ve | Venda | vi | Vietnamese | vo | Volapuk | war | Waray Philippines |
| wo | Wolof | xh | Xhosa | yi | Yiddish | yo | Yoruba |
| za | Zhuang | zh | Chinese Simplified | zh-hant | Chinese Traditional | zu | Zulu |