Marcs Notes

Language Identification

June 10, 2025

Given a text, determine the natural language it is written in.

Method

Build a character trigram model $P(c_i \mid c_{i-2:i-1}, \ell)$ for every language $\ell$.
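One way to estimate such a model is to count character trigrams in a training text for each language. The sketch below is only an illustration: the padding symbol `PAD`, the function names `train_trigram_model` and `trigram_prob`, and the add-one smoothing are assumptions, not part of the note.

```python
from collections import Counter

PAD = "\x02"  # assumed start-of-text padding symbol (not from the note)

def train_trigram_model(text):
    """Count character trigrams and their two-character contexts."""
    padded = PAD * 2 + text
    trigrams = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    contexts = Counter(padded[i:i + 2] for i in range(len(padded) - 2))
    alphabet = set(text)
    return trigrams, contexts, alphabet

def trigram_prob(model, context, char):
    """Add-one smoothed estimate of P(char | context, language)."""
    trigrams, contexts, alphabet = model
    vocab = len(alphabet) + 1  # one extra slot for characters never seen in training
    return (trigrams[context + char] + 1) / (contexts[context] + vocab)

# Toy training text; a real model would need far more data per language.
model_en = train_trigram_model("the cat sat on the mat")
print(trigram_prob(model_en, "th", "e"))  # seen continuation
print(trigram_prob(model_en, "th", "x"))  # unseen continuation, smaller but non-zero
```

Smoothing is used here so that a trigram missing from the training text does not get probability zero, which would otherwise wipe out the whole product over a text.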

Apply Bayes' rule and the Markov property to get the most likely language:

$$\begin{aligned}
\ell^* &= \underset{\ell}{\operatorname{argmax}}\; P(\ell \mid \mathbf{c}_{1:N}) \\
&= \underset{\ell}{\operatorname{argmax}}\; P(\ell) \cdot P(\mathbf{c}_{1:N} \mid \ell) \\
&= \underset{\ell}{\operatorname{argmax}}\; P(\ell) \cdot \prod_{i=1}^{N} P(c_i \mid c_{i-2:i-1}, \ell)
\end{aligned}$$

The denominator $P(\mathbf{c}_{1:N})$ from Bayes' rule is dropped in the second step because it does not depend on $\ell$.

The [[Prior Probability]] $P(\ell)$ of the language is not a critical factor.

**Intuition**

When $\ell$ is not the correct language, the probabilities of the character sequences $c_{i-2:i}$ will be low. For the correct language they will be much higher.
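The argmax above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the helper names (`train`, `log_likelihood`, `identify`), the padding symbol `PAD`, the add-one smoothing with an assumed alphabet size of 256, and the two toy corpora are all made up for the example. Probabilities are summed as logarithms because a product of many small factors would underflow, and a uniform prior is used when none is given.

```python
import math
from collections import Counter

PAD = "\x02"  # assumed start-of-text padding symbol

def train(text):
    """Count character trigrams and their two-character contexts."""
    padded = PAD * 2 + text
    tri = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    ctx = Counter(padded[i:i + 2] for i in range(len(padded) - 2))
    return tri, ctx

def log_likelihood(model, text, vocab_size=256):
    """Sum of log P(c_i | c_{i-2:i-1}, language) with add-one smoothing."""
    tri, ctx = model
    padded = PAD * 2 + text
    total = 0.0
    for i in range(2, len(padded)):
        context, char = padded[i - 2:i], padded[i]
        p = (tri[context + char] + 1) / (ctx[context] + vocab_size)
        total += math.log(p)
    return total

def identify(models, text, priors=None):
    """Return the language maximizing log P(language) + log P(text | language)."""
    def score(lang):
        prior = priors[lang] if priors else 1.0 / len(models)
        return math.log(prior) + log_likelihood(models[lang], text)
    return max(models, key=score)

# Toy corpora for illustration only.
corpora = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze sitzt auf der matte",
}
models = {lang: train(text) for lang, text in corpora.items()}
print(identify(models, "the dog and the fox"))
print(identify(models, "der hund und die katze"))
```

On this toy data the English model assigns much higher trigram probabilities to the first query and the German model to the second, which matches the intuition: under the wrong language almost every trigram falls back to the small smoothed probability.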
