Language detection

Each time a post is created, its body content is analyzed to determine which language(s) it contains. The result is stored in JSON format in the body_language field of the Comments table. As a post can contain multiple languages, the result is an array.

Each item in the array contains the following values:

  • Language code

  • Confidence score

  • Is reliable - true/false

If the language cannot be determined (for example, a post containing only pictures), the array is left empty.

A post containing several languages will have its body_language field set to something like this:

[{"language":"es","isReliable":true,"confidence":8.22},
{"language":"en","isReliable":false,"confidence":5.12}]
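As an illustration, a stored value like the one above could be decoded as follows. This is a minimal sketch: the DetectedLanguage class and the parse_body_language function are hypothetical names chosen for this example, and only the JSON keys (language, isReliable, confidence) come from the format described here.

```python
import json
from dataclasses import dataclass


@dataclass
class DetectedLanguage:
    """One item of the body_language array (hypothetical helper type)."""
    code: str
    is_reliable: bool
    confidence: float


def parse_body_language(raw):
    """Decode a body_language value into a list of DetectedLanguage items.

    An empty or missing value yields an empty list, which matches the
    case where the language could not be determined.
    """
    if not raw:
        return []
    return [
        DetectedLanguage(item["language"], item["isReliable"], item["confidence"])
        for item in json.loads(raw)
    ]


raw = (
    '[{"language":"es","isReliable":true,"confidence":8.22},'
    '{"language":"en","isReliable":false,"confidence":5.12}]'
)
languages = parse_body_language(raw)
for lang in languages:
    print(lang.code, lang.is_reliable, lang.confidence)
```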

The confidence value is related to how much text the post contains: the more text is analyzed, the better the language analysis and the higher the confidence value. Confidence is not a ratio and can be higher than 100.

If the post contains words in different languages, isReliable is set to true on the most probable language, even if its confidence value is lower than that of another entry.

If only one language is detected and isReliable is set to false, the confidence is too low for the result to be trusted.
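The rules above suggest one possible way to pick a single language for a post. The sketch below is an assumption, not the product's own logic: primary_language is a hypothetical helper, and returning None when no entry is reliable is a policy chosen for this example.

```python
import json


def primary_language(raw):
    """Return the most probable language code, or None if there is none.

    isReliable is checked before confidence, because a reliable entry
    may carry a lower confidence value than an unreliable one.
    """
    entries = json.loads(raw) if raw else []
    reliable = [entry for entry in entries if entry["isReliable"]]
    if not reliable:
        # Empty array, or confidence too low: no trustworthy answer (assumed policy).
        return None
    # If several entries are flagged reliable, fall back to the highest confidence.
    return max(reliable, key=lambda entry: entry["confidence"])["language"]


mixed = (
    '[{"language":"es","isReliable":false,"confidence":9.0},'
    '{"language":"en","isReliable":true,"confidence":5.12}]'
)
print(primary_language(mixed))
```

Here the entry flagged as reliable is chosen even though its confidence value is the lower of the two.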

Be aware that the language detector works with probabilities and is sometimes inaccurate on very short texts. The same happens when the different languages used in a post share similar words.

The language detector can also be misled when a post contains a lot of "technical noise" such as pictures, source code, or edit tags.

code    language            code    language            code    language            code    language
aa      Afar                ab      Abkhazian           af      Afrikaans           ak      Akan
am      Amharic             ar      Arabic              as      Assamese            ay      Aymara
az      Azerbaijani         ba      Bashkir             be      Belarusian          bg      Bulgarian
bh      Bihari              bi      Bislama             bn      Bengali             bo      Tibetan
br      Breton              bs      Bosnian             bug     Buginese            ca      Catalan
ceb     Cebuano             chr     Cherokee            co      Corsican            crs     Seselwa
cs      Czech               cy      Welsh               da      Danish              de      German
dv      Dhivehi             dz      Dzongkha            egy     Egyptian            el      Greek
en      English             eo      Esperanto           es      Spanish             et      Estonian
eu      Basque              fa      Persian             fi      Finnish             fj      Fijian
fo      Faroese             fr      French              fy      Frisian             ga      Irish
gd      Scots_Gaelic        gl      Galician            gn      Guarani             got     Gothic
gu      Gujarati            gv      Manx                ha      Hausa               haw     Hawaiian
hi      Hindi               hmn     Hmong               hr      Croatian            ht      Haitian Creole
hu      Hungarian           hy      Armenian            ia      Interlingua         id      Indonesian
ie      Interlingue         ig      Igbo                ik      Inupiak             is      Icelandic
it      Italian             iu      Inuktitut           iw      Hebrew              ja      Japanese
jw      Javanese            ka      Georgian            kha     Khasi               kk      Kazakh
kl      Greenlandic         km      Khmer               kn      Kannada             ko      Korean
ks      Kashmiri            ku      Kurdish             ky      Kyrgyz              la      Latin
lb      Luxembourgish       lg      Ganda               lif     Limbu               ln      Lingala
lo      Laothian            lt      Lithuanian          lv      Latvian             mfe     Mauritian Creole
mg      Malagasy            mi      Maori               mk      Macedonian          ml      Malayalam
mn      Mongolian           mr      Marathi             ms      Malay               mt      Maltese
my      Burmese             na      Nauru               ne      Nepali              nl      Dutch
no      Norwegian           nr      Ndebele             nso     Pedi                ny      Nyanja
oc      Occitan             om      Oromo               or      Oriya               pa      Punjabi
pl      Polish              ps      Pashto              pt      Portuguese          qu      Quechua
rm      Rhaeto Romance      rn      Rundi               ro      Romanian            ru      Russian
rw      Kinyarwanda         sa      Sanskrit            sco     Scots               sd      Sindhi
sg      Sango               si      Sinhalese           sk      Slovak              sl      Slovenian
sm      Samoan              sn      Shona               so      Somali              sq      Albanian
sr      Serbian             ss      Siswant             st      Sesotho             su      Sundanese
sv      Swedish             sw      Swahili             syr     Syriac              ta      Tamil
te      Telugu              tg      Tajik               th      Thai                ti      Tigrinya
tk      Turkmen             tl      Tagalog             tlh     Klingon             tn      Tswana
to      Tonga               tr      Turkish             ts      Tsonga              tt      Tatar
ug      Uighur              uk      Ukrainian           ur      Urdu                uz      Uzbek
ve      Venda               vi      Vietnamese          vo      Volapuk             war     Waray Philippines
wo      Wolof               xh      Xhosa               yi      Yiddish             yo      Yoruba
za      Zhuang              zh      Chinese Simplified  zh-hant Chinese Traditional zu      Zulu