Each time a post is created, the body content is analyzed to determine which language(s) it contains. The result is stored in JSON format in the field body_language in the table Comments. As a post can contain multiple languages, the result is an array.
Each item in the array contains the following values:
- Language code
- Confidence score
- Is reliable - true/false
If the language cannot be determined (e.g., a post containing only pictures), the array is left empty.
The result is something like this:
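As a minimal sketch, a stored value could be decoded in Python as follows. It assumes each item uses the keys language, confidence, and isReliable; the key name language is a guess, since this page names only the other two. The sample value and the helper parse_body_language are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class DetectedLanguage:
    language: str       # language code, e.g. "en" (key name is an assumption)
    confidence: float   # not a ratio; may be higher than 100
    is_reliable: bool   # taken from the "isReliable" key


def parse_body_language(raw: str) -> list[DetectedLanguage]:
    """Decode a body_language JSON string into a list of entries."""
    return [
        DetectedLanguage(item["language"], item["confidence"], item["isReliable"])
        for item in json.loads(raw)
    ]


# Hypothetical stored value for a post written in a single language.
sample = '[{"language": "en", "confidence": 120, "isReliable": true}]'
entries = parse_body_language(sample)
print(entries[0].language, entries[0].confidence, entries[0].is_reliable)

# A post whose language could not be determined is stored as an empty array.
print(parse_body_language("[]"))
```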
A post with several languages will have its body_language field set to something like
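For illustration only, the snippet below builds a made-up multi-language value and lists the language codes it contains. The key name language and all values are assumptions, not output copied from the detector.

```python
import json

# Hypothetical body_language value for a post mixing French and English.
raw = """[
  {"language": "fr", "confidence": 85, "isReliable": true},
  {"language": "en", "confidence": 40, "isReliable": false}
]"""

detected = json.loads(raw)

# One array item per detected language.
codes = [item["language"] for item in detected]
print(codes)
```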
The confidence value is related to how much text the post contains: the more text analyzed, the better the language analysis and the higher the confidence value. Confidence is not a ratio and can be higher than 100.
If the post contains words in different languages, isReliable will be set to true to identify the most probable language, even if its confidence value is lower.
If there is only one language and isReliable is set to false, this indicates that confidence is too low.
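The two rules above could be applied by a consumer roughly as follows. This is a sketch under assumptions: the key name language, the sample values, and the helper primary_language are hypothetical, and returning None for an unreliable result is one possible design choice, not documented behaviour.

```python
from typing import Optional


def primary_language(entries: list[dict]) -> Optional[str]:
    """Return the code of the entry flagged isReliable, or None if there is none."""
    for entry in entries:
        if entry.get("isReliable"):
            return entry["language"]
    return None


# Mixed post: the reliable entry wins even though its confidence is lower.
mixed = [
    {"language": "en", "confidence": 60, "isReliable": False},
    {"language": "de", "confidence": 45, "isReliable": True},
]

# Single language flagged unreliable: confidence is too low to trust.
too_short = [{"language": "es", "confidence": 3, "isReliable": False}]

print(primary_language(mixed))      # de
print(primary_language(too_short))  # None
print(primary_language([]))         # None
```

Selecting on isReliable rather than on the highest confidence follows the text above, which says the flag marks the most probable language even when its confidence is lower.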
Be aware that the language detector works with probabilities and is sometimes inaccurate on very short texts. The same happens when the different languages used in a post share similar words.
The language detector can also be tricked when the content of a post contains a lot of "technical noise" such as pictures, source code, edit tags, etc.