Language detection

Each time a post is created, its body content is analyzed to determine which language(s) it contains. The result is stored in JSON format in the body_language field of the Comments table. As a post can contain multiple languages, the result is an array.

Each item in the array contains the following values:

Language code (language)
Confidence score (confidence)
Is reliable, true/false (isReliable)

If the language cannot be determined (for example, a post containing only pictures), the array is left empty.

A post with several languages will have its body_language field set to something like:

[{"language":"es","isReliable":true,"confidence":8.22},
 {"language":"en","isReliable":false,"confidence":5.12}]
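A value like the one above can be decoded with any JSON parser. The following is a minimal Python sketch, assuming the body_language value has already been read from the Comments table as a string. The helper name parse_body_language and the DetectedLanguage type are illustrative and not part of HiveSQL; treating a missing or empty value as an empty array is also an assumption.

```python
import json
from typing import List, Optional, TypedDict


class DetectedLanguage(TypedDict):
    """One item of the body_language array (hypothetical helper type)."""
    language: str
    isReliable: bool
    confidence: float


def parse_body_language(raw: Optional[str]) -> List[DetectedLanguage]:
    """Decode a body_language value into a list of detected languages.

    Assumption: a missing (None) or empty value is treated like an
    empty array, i.e. "no language could be determined".
    """
    if not raw:
        return []
    return json.loads(raw)


# The sample value from the text above.
sample = (
    '[{"language":"es","isReliable":true,"confidence":8.22},'
    ' {"language":"en","isReliable":false,"confidence":5.12}]'
)

entries = parse_body_language(sample)
codes = [entry["language"] for entry in entries]
print(codes)
```

Each decoded item is a plain dictionary, so the three values are available as entry["language"], entry["isReliable"] and entry["confidence"].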

The confidence value is related to how much text the post contains: the more text is analyzed, the better the language analysis and the higher the confidence value. Confidence is not a ratio and can be higher than 100.

If the post contains words in different languages, isReliable is set to true on the most probable language, even if its confidence value is lower than that of another language in the array.

If there is only one language and isReliable is set to false, this indicates confidence is too low.
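The rules above for reading isReliable can be sketched as a small Python helper. This is only an illustration under stated assumptions: the function name most_probable_language is hypothetical, the input is the already decoded array, and picking the highest confidence when several entries are flagged reliable is an assumption the text does not cover.

```python
from typing import List, Optional


def most_probable_language(entries: List[dict]) -> Optional[str]:
    """Return the language code flagged as reliable, or None.

    None covers both an empty array (language could not be determined)
    and the case where no entry is reliable (confidence too low).
    """
    reliable = [entry for entry in entries if entry.get("isReliable")]
    if not reliable:
        return None
    # Assumption: if several entries are reliable, prefer the highest confidence.
    best = max(reliable, key=lambda entry: entry["confidence"])
    return best["language"]


# The reliable entry wins even though its confidence is lower.
mixed = [
    {"language": "es", "isReliable": True, "confidence": 3.0},
    {"language": "en", "isReliable": False, "confidence": 5.12},
]

# A single language with isReliable false: confidence is too low.
uncertain = [{"language": "fr", "isReliable": False, "confidence": 0.4}]

print(most_probable_language(mixed))
print(most_probable_language(uncertain))
print(most_probable_language([]))
```

Returning None rather than falling back to the highest-confidence entry keeps unreliable detections from being mistaken for a firm result; callers that prefer a best guess can apply their own fallback.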

Be aware that the language detector works with probabilities and is sometimes inaccurate on very short texts. The same happens when the different languages used in a post have similar words.

The language detector can also be misled when a post contains a lot of "technical noise" such as pictures, source code, or edit tags.

language (code)

Afar
Abkhazian
Afrikaans
Akan
Amharic
Arabic
Assamese
Aymara
Azerbaijani
Bashkir
Belarusian
Bulgarian
Bihari
Bislama
Bengali
Tibetan
Breton
Bosnian
Buginese (bug)
Catalan
Cebuano (ceb)
Cherokee (chr)
Corsican
Seselwa (crs)
Czech
Welsh
Danish
German
Dhivehi
Dzongkha
Egyptian (egy)
Greek
English
Esperanto
Spanish
Estonian
Basque
Persian
Finnish
Fijian
Faroese
French
Frisian
Irish
Scots_Gaelic
Galician
Guarani
Gothic (got)
Gujarati
Manx
Hausa
Hawaiian (haw)
Hindi
Hmong (hmn)
Croatian
Haitian Creole
Hungarian
Armenian
Interlingua
Indonesian
Interlingue
Igbo
Inupiak
Icelandic
Italian
Inuktitut
Hebrew
Japanese
Javanese
Georgian
Khasi (kha)
Kazakh
Greenlandic
Khmer
Kannada
Korean
Kashmiri
Kurdish
Kyrgyz
Latin
Luxembourgish
Ganda
Limbu (lif)
Lingala
Laothian
Lithuanian
Latvian
Mauritian Creole (mfe)
Malagasy
Maori
Macedonian
Malayalam
Mongolian
Marathi
Malay
Maltese
Burmese
Nauru
Nepali
Dutch
Norwegian
Ndebele
Pedi (nso)
Nyanja
Occitan
Oromo
Oriya
Punjabi
Polish
Pashto
Portuguese
Quechua
Rhaeto Romance
Rundi
Romanian
Russian
Kinyarwanda
Sanskrit
Scots (sco)
Sindhi
Sango
Sinhalese
Slovak
Slovenian
Samoan
Shona
Somali
Albanian
Serbian
Siswant
Sesotho
Sundanese
Swedish
Swahili
Syriac (syr)
Tamil
Telugu
Tajik
Thai
Tigrinya
Turkmen
Tagalog
Klingon (tlh)
Tswana
Tonga
Turkish
Tsonga
Tatar
Uighur
Ukrainian
Urdu
Uzbek
Venda
Vietnamese
Volapuk
Waray Philippines (war)
Wolof
Xhosa
Yiddish
Yoruba
Zhuang
Chinese Simplified
Chinese Traditional (zh-hant)
Zulu

Last updated 4 years ago