Text_LanguageDetect

Detects the language of a given piece of text.

The package attempts to detect the language of a text sample by comparing its ranked trigram (3-gram) frequencies against a table of trigram frequencies for known languages.

It implements a variant of the technique proposed by Cavnar & Trenkle (1994) in "N-Gram-Based Text Categorization".
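To make the idea of comparing ranked trigrams concrete, here is a minimal, self-contained sketch of the "out-of-place" rank comparison that Cavnar & Trenkle describe. It is not the package's actual implementation: the function names trigramRanks() and outOfPlaceDistance(), the cutoff of 300 ranks, and the tiny reference texts are all assumptions made for illustration. Real reference profiles would be built from much larger corpora.

```php
<?php
// Illustrative sketch only; it does not reproduce the package's internals.

/** Map each trigram in $text to its frequency rank (0 = most frequent). */
function trigramRanks(string $text, int $maxRanks = 300): array
{
    // ASCII-only lowercasing; runs of non-letters collapse to one space.
    $clean = ' ' . trim(preg_replace('/[^\p{L}]+/u', ' ', strtolower($text))) . ' ';
    $chars = preg_split('//u', $clean, -1, PREG_SPLIT_NO_EMPTY);

    $counts = [];
    for ($i = 0, $last = count($chars) - 3; $i <= $last; $i++) {
        $gram = $chars[$i] . $chars[$i + 1] . $chars[$i + 2];
        $counts[$gram] = ($counts[$gram] ?? 0) + 1;
    }

    arsort($counts); // most frequent trigram first
    return array_flip(array_slice(array_keys($counts), 0, $maxRanks));
}

/** Sum of rank differences; trigrams missing from $reference cost $maxPenalty. */
function outOfPlaceDistance(array $sample, array $reference, int $maxPenalty = 300): int
{
    $distance = 0;
    foreach ($sample as $gram => $rank) {
        $distance += isset($reference[$gram])
            ? abs($rank - $reference[$gram])
            : $maxPenalty;
    }
    return $distance;
}

// Toy reference profiles (hypothetical data, far too small for real use).
$references = [
    'english' => trigramRanks('the quick brown fox jumps over the lazy dog and then the other dog'),
    'german'  => trigramRanks('das ist ein kurzer text in deutscher sprache und das ist noch ein satz'),
];

$sample = trigramRanks('Das ist ein Text in deutscher Sprache');

$distances = [];
foreach ($references as $language => $profile) {
    $distances[$language] = outOfPlaceDistance($sample, $profile);
}
asort($distances); // smallest distance = closest profile
print_r($distances);
```

In this sketch a smaller distance means a closer match, so the first key after asort() is the best guess. The package itself reports scores rather than raw distances, so the numbers printed here are not comparable to the output of detect().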

Detecting the language

First, you may want a list of the supported languages. Call getLanguages() on a Text_LanguageDetect object; it returns an array of language names as strings, e.g. array('albanian', 'arabic', 'azeri').

To detect the language of a piece of text, call detect() on the Text_LanguageDetect object. It takes the text as its first parameter and an optional $limit as its second, which caps the number of candidate languages returned. The method returns an array with language names as keys and their scores as values, sorted with the most likely language first. If no language is detected, an empty array is returned.

To get only the most likely language, use detectSimple(), which returns the language name as a string, or null if none was detected.
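Because detectSimple() may return null, callers usually need a fallback. The following is a small sketch of one way to handle that. The helper languageOrFallback() and the 'unknown' default are hypothetical and not part of the package; the helper only assumes that the object passed in exposes detectSimple($text), and it leaves out the PEAR error checks used in the example below.

```php
<?php
/**
 * Hypothetical helper (not part of Text_LanguageDetect).
 *
 * Returns the language name reported by detectSimple(), or $fallback
 * when detectSimple() returns null because nothing was detected.
 */
function languageOrFallback($detector, $text, $fallback = 'unknown')
{
    return $detector->detectSimple($text) ?? $fallback;
}
```

With the package loaded as in the example below, a call such as languageOrFallback($l, $text) would yield either a language name or the string 'unknown'.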

Example

<?php
require_once 'Text/LanguageDetect.php';
$l = new Text_LanguageDetect();

echo "Supported languages:\n";
$langs = $l->getLanguages();
if (PEAR::isError($langs)) {
    die($langs->getMessage());
}
sort($langs);
echo implode(', ', $langs) . "\n\n";

$text = <<<EOD
Hallo! Das ist ein Text in deutscher Sprache.
Mal sehen, ob die Klasse erkennt, welche Sprache das hier ist.
EOD;

$result = $l->detect($text, 4);
if (PEAR::isError($result)) {
    echo $result->getMessage(), "\n";
} else {
    print_r($result);
}
?>

The example above produces output like the following:

Supported languages:
albanian, arabic, azeri, bengali, bulgarian, cebuano, croatian,
czech, danish, dutch, english, estonian, farsi, finnish, french,
german, hausa, hawaiian, hindi, hungarian, icelandic, indonesian,
italian, kazakh, kyrgyz, latin, latvian, lithuanian, macedonian,
mongolian, nepali, norwegian, pashto, pidgin, polish, portuguese,
romanian, russian, serbian, slovak, slovene, somali, spanish,
swahili, swedish, tagalog, turkish, ukrainian, urdu, uzbek,
vietnamese, welsh

Array
(
    [german] => 0.407037037037
    [dutch] => 0.288065843621
    [english] => 0.283333333333
    [danish] => 0.234526748971
)
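
The scores of neighbouring candidates can be close, as dutch and english are here. One possible way to use the array is to accept the top language only when it leads the runner-up by some margin. The sketch below does this with the scores copied from the output above. The helper confidentTopLanguage() and the 0.05 margin are assumptions for illustration, not something the package provides or recommends.

```php
<?php
// Scores copied from the sample output above.
$result = [
    'german'  => 0.407037037037,
    'dutch'   => 0.288065843621,
    'english' => 0.283333333333,
    'danish'  => 0.234526748971,
];

/**
 * Hypothetical helper: returns the highest-scoring language, or null when
 * the array is empty or the lead over the runner-up is below $minMargin.
 */
function confidentTopLanguage(array $scores, float $minMargin = 0.05): ?string
{
    if ($scores === []) {
        return null;
    }

    arsort($scores); // defensive: make sure the highest score comes first
    $languages = array_keys($scores);
    $values    = array_values($scores);

    if (count($values) > 1 && $values[0] - $values[1] < $minMargin) {
        return null; // top two candidates are too close to call
    }
    return $languages[0];
}

$best = confidentTopLanguage($result);
echo ($best ?? 'uncertain'), "\n";
```

Here german leads dutch by roughly 0.12, so the default margin accepts it; a stricter margin such as 0.2 would make the helper return null.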