Detects the language of a given piece of text.
The package attempts to detect the language of a text sample by comparing the ranked trigram (3-gram) frequencies of the sample against a table of trigram frequencies for known languages.
It implements a version of the technique originally proposed by Cavnar & Trenkle (1994) in "N-Gram-Based Text Categorization".
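The ranking idea can be illustrated with a small, self-contained sketch. It is not the package's own code: the function names trigramProfile(), outOfPlaceDistance() and rankLanguages() are hypothetical, the two reference profiles are toy data built from a single sentence each, and the "out-of-place" distance is the measure described in the Cavnar & Trenkle paper, which may differ in detail from what Text_LanguageDetect computes internally.

```php
<?php
// Illustrative sketch only; NOT the implementation used by Text_LanguageDetect.

// Build a ranked trigram profile: trigram => rank (0 = most frequent).
function trigramProfile(string $text, int $maxSize = 300): array
{
    // strtolower() only folds ASCII letters, which is enough for this sketch.
    // Every run of non-letters is collapsed into a single space.
    $clean = ' ' . trim(preg_replace('/[^\p{L}]+/u', ' ', strtolower($text))) . ' ';
    $chars = preg_split('//u', $clean, -1, PREG_SPLIT_NO_EMPTY);

    $counts = [];
    for ($i = 0, $n = count($chars); $i + 2 < $n; $i++) {
        $gram = $chars[$i] . $chars[$i + 1] . $chars[$i + 2];
        $counts[$gram] = ($counts[$gram] ?? 0) + 1;
    }

    $grams = array_keys($counts);
    usort($grams, function ($a, $b) use ($counts) {
        // Higher count first; ties are broken alphabetically for a stable ranking.
        return $counts[$b] <=> $counts[$a] ?: strcmp($a, $b);
    });

    return array_flip(array_slice($grams, 0, $maxSize));
}

// Sum of rank differences; a trigram missing from the reference profile
// costs the maximum penalty (the size of the reference profile).
function outOfPlaceDistance(array $sample, array $reference): int
{
    $penalty  = count($reference);
    $distance = 0;
    foreach ($sample as $gram => $rank) {
        $distance += isset($reference[$gram])
            ? abs($rank - $reference[$gram])
            : $penalty;
    }
    return $distance;
}

// Returns language => distance, with the closest profile first.
function rankLanguages(string $text, array $profiles): array
{
    $sample    = trigramProfile($text);
    $distances = [];
    foreach ($profiles as $language => $profile) {
        $distances[$language] = outOfPlaceDistance($sample, $profile);
    }
    asort($distances);
    return $distances;
}

// Toy reference profiles (assumption: real profiles are trained on far more text).
$profiles = [
    'english' => trigramProfile('the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun'),
    'german'  => trigramProfile('der schnelle braune fuchs springt über den faulen hund und dann schläft der hund in der sonne'),
];

$ranking = rankLanguages('the dog and the fox', $profiles);
echo array_keys($ranking)[0], "\n"; // english
```

Note that this sketch ranks by distance (smaller is better), whereas detect() reports a score where the best match comes first with the highest value.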
First, you might want to get a list of the supported languages. It can be retrieved by calling getLanguages() on a Text_LanguageDetect object. The method returns an array of strings naming the languages, e.g. array('albanian', 'arabic', 'azeri').
To actually detect the language of a piece of text, use the detect() method of the Text_LanguageDetect object. It takes the text as its first parameter and an optional $limit as its second parameter, which sets the maximum number of (likely) languages to return. The method returns a sorted array with the languages as keys and their scores as values. If no language is detected, an empty array is returned.
To get only the most likely language, use detectSimple(), which directly returns the name of the language as a string, or null if none was detected.
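The shape of these return values can be explored without the package installed. The sketch below works on a plain array shaped like the return value of detect() (language => score, best first); pickLanguage() and its $minMargin parameter are hypothetical helpers, not part of Text_LanguageDetect. With the default margin of 0.0 it mirrors the contract described above (best language, or null for an empty result); a larger margin additionally rejects results where the runner-up is too close.

```php
<?php
// Hypothetical helper, NOT part of Text_LanguageDetect. It only inspects an
// array shaped like the return value of detect(): language => score.
function pickLanguage(array $scores, float $minMargin = 0.0): ?string
{
    if (count($scores) === 0) {
        return null; // nothing was detected
    }

    arsort($scores); // defensive: make sure the highest score comes first
    $languages = array_keys($scores);
    $values    = array_values($scores);

    // With a single candidate there is no runner-up to compare against.
    if (count($values) > 1 && ($values[0] - $values[1]) < $minMargin) {
        return null; // the two best candidates are too close to call
    }

    return $languages[0];
}

// Scores copied from the sample output on this page.
$scores = [
    'german'  => 0.407037037037,
    'dutch'   => 0.288065843621,
    'english' => 0.283333333333,
    'danish'  => 0.234526748971,
];

echo pickLanguage($scores), "\n";     // german
var_dump(pickLanguage($scores, 0.2)); // NULL: german leads dutch by only about 0.119
```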
Note: For reliable detection, the input text should be at least a few sentences long.
<?php
require_once 'Text/LanguageDetect.php';

$l = new Text_LanguageDetect();

// List all languages the detector knows about
echo "Supported languages:\n";
$langs = $l->getLanguages();
if (PEAR::isError($langs)) {
    die($langs->getMessage());
}
sort($langs);
echo implode(', ', $langs) . "\n\n";

// Sample text to classify (German)
$text = <<<EOD
Hallo! Das ist ein Text in deutscher Sprache.
Mal sehen, ob die Klasse erkennt, welche Sprache das hier ist.
EOD;

// Ask for at most the four most likely languages
$result = $l->detect($text, 4);
if (PEAR::isError($result)) {
    echo $result->getMessage(), "\n";
} else {
    print_r($result);
}
?>
The above example would give the following output:
Supported languages:
albanian, arabic, azeri, bengali, bulgarian, cebuano, croatian, czech, danish, dutch, english, estonian, farsi, finnish, french, german, hausa, hawaiian, hindi, hungarian, icelandic, indonesian, italian, kazakh, kyrgyz, latin, latvian, lithuanian, macedonian, mongolian, nepali, norwegian, pashto, pidgin, polish, portuguese, romanian, russian, serbian, slovak, slovene, somali, spanish, swahili, swedish, tagalog, turkish, ukrainian, urdu, uzbek, vietnamese, welsh

Array
(
    [german] => 0.407037037037
    [dutch] => 0.288065843621
    [english] => 0.283333333333
    [danish] => 0.234526748971
)