Alex on Nostr:
Language detection is surprisingly difficult. The neural networks get basic things wrong. They will say that Korean text is actually Chinese, even though you can obviously see with your eyes that it's not.
After multiple libraries failed this basic test, I did something brave and implemented a naive regex solution in #Ditto. It does a first pass on the text before passing it on to the neural network.
안녕하세요 ("hello")
For example, if ALL the characters are in Korean script, it must be Korean. Even if it's a nonsensical sequence of Korean characters, it cannot be any language other than Korean, because Hangul is used exclusively for Korean.
There are only a few languages where this is possible: Korean, Greek, and Hebrew.
Again, this is only possible if ALL characters in the text match a target language, so simply using "π" in a text does not make it Greek. So, currently this check is very narrow.
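To make that concrete, here's a minimal TypeScript sketch of the kind of first pass I mean, using Unicode script property escapes. The function name and language codes are illustrative rather than Ditto's actual code, and it assumes punctuation, emojis, and URLs have already been stripped:

```ts
// Scripts that (per the reasoning above) map to exactly one language.
// Whitespace is allowed through; every other character must match.
const EXCLUSIVE_SCRIPTS: [RegExp, string][] = [
  [/^[\p{Script=Hangul}\s]+$/u, 'ko'],  // Korean
  [/^[\p{Script=Greek}\s]+$/u, 'el'],   // Greek
  [/^[\p{Script=Hebrew}\s]+$/u, 'he'],  // Hebrew
];

// Returns a language code if ALL characters belong to one exclusive
// script, otherwise undefined (fall through to the neural network).
function detectByScript(text: string): string | undefined {
  for (const [regex, lang] of EXCLUSIVE_SCRIPTS) {
    if (regex.test(text)) return lang;
  }
  return undefined;
}
```

So `detectByScript('안녕하세요')` returns `'ko'` without ever touching the model.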
Notes about other languages:
Chinese: it's not possible to do a regex-only solution for Chinese, since Han script is also part of Japanese.
Japanese: we *can* definitively detect Japanese, as long as the text contains at least one Hiragana or Katakana character in addition to 0 or more Han characters. So at least *some* Japanese text can be unambiguously detected just by a regex (see the sketch after these notes).
Russian: Cyrillic text is used by a handful of languages besides Russian. BUT, if the text is entirely Cyrillic, that at least narrows down the *possible* languages it could be.
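Here's a sketch of those two checks in the same style. Again, the helper shape and the Cyrillic language list are illustrative, not Ditto's actual code:

```ts
// At least one kana character, and nothing but kana/Han/whitespace:
// that combination can only be Japanese.
const HAS_KANA = /[\p{Script=Hiragana}\p{Script=Katakana}]/u;
const KANA_OR_HAN =
  /^[\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Han}\s]+$/u;

// Entirely Cyrillic: not decisive on its own, but it shrinks the
// candidate set to a handful of languages.
const ALL_CYRILLIC = /^[\p{Script=Cyrillic}\s]+$/u;

function narrowByScript(text: string): string[] | undefined {
  if (HAS_KANA.test(text) && KANA_OR_HAN.test(text)) return ['ja'];
  if (ALL_CYRILLIC.test(text)) {
    return ['ru', 'uk', 'bg', 'sr', 'mk', 'be']; // illustrative list
  }
  return undefined;
}
```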
Next steps:
To optimize this, the regexes will narrow down possible languages of a text before passing it to the neural network.
For example, if a text is entirely Han, we would restrict the model to deciding only between Chinese and Japanese (sketched below). If it's entirely Cyrillic, we'd do the same thing, but with the six or so Cyrillic languages.
We could also try matching, say, 90% of the text to a specific script instead of 100%, to catch outliers like the occasional English word in Japanese text. We are already stripping punctuation, emojis, and URLs before passing text to the model.
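A rough sketch of how that coverage check could look. The 90% threshold and the Han example come straight from the ideas above, while the function shapes are hypothetical:

```ts
// Fraction of non-whitespace characters belonging to a given script,
// counted by code point so astral Han characters are handled correctly.
function scriptCoverage(text: string, script: RegExp): number {
  const chars = [...text].filter((c) => !/\s/u.test(c));
  if (chars.length === 0) return 0;
  return chars.filter((c) => script.test(c)).length / chars.length;
}

const HAN = /\p{Script=Han}/u;

// If ~90% of the text is Han, the model only has to decide between
// Chinese and Japanese; otherwise leave the candidate set open.
function candidateLanguages(text: string): string[] | undefined {
  if (scriptCoverage(text, HAN) >= 0.9) return ['zh', 'ja'];
  return undefined;
}
```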
Finally, this is all so we can use a lightweight, embedded solution for language detection, instead of calling out to some proprietary API or even a giant self-hosted solution. In that case, I believe a layered solution will always be needed. We have to do these naive checks to put "guardrails" on the model, so its guesses can't stray outside of common sense. Swapping in a better model can improve things, but these naive checks will still hold.
![](https://media.ditto.pub/d75ed063c5c40446ee982a71d51c7a988823ead11bbe32e659dc2fc3c9054f59.png)