Language Identifier
Also known as: LangID, Language Guesser, Language Detector
Language Identifier detects the language used in a piece of
text (e.g. document, email, web page). It uses statistical NLP
(Natural Language Processing) to learn about languages and to
identify them.
Business Value / Benefits
- Makes it possible to automatically detect a language - no human input required
- Essential for correct indexing of multi-lingual documents - each language requires different handling
Do You Need It?
How do you determine if Language Identifier is for you?
- You need to handle content in multiple languages
- Your content does not come with language metadata - nothing indicates whether a given piece of content is in English or Chinese
- You currently treat content in all languages the same way (though you may know this is suboptimal)
- Your searches don't seem to retrieve all the content you think they should
Integration
Language Identifier exposes a simple Java API. Given a piece of
text it returns a list of languages ordered by confidence score.
It seamlessly integrates with Lucene and Solr, but is not tied
to search and can be used in applications that have nothing to
do with search. It also runs as a REST/Web service, thus
allowing integration with any software component that can invoke
it over HTTP.
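As a rough illustration of how calling code might consume a list of languages ordered by confidence score, here is a minimal Java sketch. The `LanguageGuess` type, the `rank` and `pickLanguage` helpers, and the 0.5 threshold are all hypothetical names and values chosen for this example; they are not taken from the actual Language Identifier API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RankedGuessesSketch {

    /** Hypothetical result type: a language code plus a confidence score. */
    static final class LanguageGuess {
        final String language;
        final double confidence;

        LanguageGuess(String language, double confidence) {
            this.language = language;
            this.confidence = confidence;
        }
    }

    /** Returns a copy of the guesses sorted by descending confidence. */
    static List<LanguageGuess> rank(List<LanguageGuess> guesses) {
        List<LanguageGuess> sorted = new ArrayList<>(guesses);
        sorted.sort(Comparator.comparingDouble((LanguageGuess g) -> g.confidence).reversed());
        return sorted;
    }

    /** Picks the top language, or the fallback when nothing is confident enough. */
    static String pickLanguage(List<LanguageGuess> ranked, double minConfidence, String fallback) {
        if (ranked.isEmpty() || ranked.get(0).confidence < minConfidence) {
            return fallback;
        }
        return ranked.get(0).language;
    }

    public static void main(String[] args) {
        // Stand-in data for what an identifier might return; the scores are made up.
        List<LanguageGuess> ranked = rank(List.of(
                new LanguageGuess("es", 0.15),
                new LanguageGuess("fr", 0.80),
                new LanguageGuess("it", 0.05)));
        System.out.println(pickLanguage(ranked, 0.5, "unknown"));
    }
}
```

Falling back to a default when the top score is low is one possible design choice: it lets an indexing pipeline route uncertain content to generic handling instead of applying the wrong language-specific processing.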
FAQ
Q: Which languages can Language Identifier recognize?
A: It can detect any language it has been
trained on, regardless of character set, encoding,
etc. Training it on a new language is straightforward, for example
with Wikipedia dumps or any other custom corpus.
Q: How accurate is the Language Identifier?
A: Accuracy depends on the quality and size of the training set.
Q: How does one integrate or use Language Identifier?
A: Via very simple Java or REST APIs.
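To make the answers about training and accuracy more concrete, the sketch below shows one common statistical approach to language identification: character trigram profiles built from a training corpus. It is an illustration only, and nothing here is claimed to be the method or API that Language Identifier actually uses. The class name `NgramLangIdSketch`, its methods, and the smoothing constant for unseen trigrams are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class NgramLangIdSketch {

    // Assumed penalty probability for trigrams never seen in training.
    private static final double UNSEEN = 1e-6;

    // language -> (trigram -> relative frequency in that language's corpus)
    private final Map<String, Map<String, Double>> profiles = new HashMap<>();

    /** Builds a trigram frequency profile for one language from sample text. */
    void train(String language, String corpus) {
        String text = normalize(corpus);
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (int i = 0; i + 3 <= text.length(); i++) {
            counts.merge(text.substring(i, i + 3), 1, Integer::sum);
            total++;
        }
        Map<String, Double> profile = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            profile.put(e.getKey(), e.getValue() / (double) total);
        }
        profiles.put(language, profile);
    }

    /** Returns the trained language whose profile gives the input the highest log-likelihood. */
    String identify(String input) {
        String text = normalize(input);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Double>> p : profiles.entrySet()) {
            double score = 0.0;
            for (int i = 0; i + 3 <= text.length(); i++) {
                score += Math.log(p.getValue().getOrDefault(text.substring(i, i + 3), UNSEEN));
            }
            if (score > bestScore) {
                bestScore = score;
                best = p.getKey();
            }
        }
        return best; // null if no language has been trained
    }

    /** Lowercases, collapses whitespace, and pads with spaces so word boundaries form trigrams. */
    static String normalize(String s) {
        return " " + s.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim() + " ";
    }

    public static void main(String[] args) {
        NgramLangIdSketch id = new NgramLangIdSketch();
        // Tiny toy corpora; a real model would be trained on far more text.
        id.train("en", "the quick brown fox jumps over the lazy dog and the cat sat on the mat");
        id.train("de", "der schnelle braune fuchs springt über den faulen hund und die katze sitzt auf der matte");
        System.out.println(id.identify("the dog and the cat"));
        System.out.println(id.identify("der hund und die katze"));
    }
}
```

With corpora this small the profiles are very sparse, which is why the size and quality of the training set matter: larger, cleaner corpora give more reliable trigram statistics, especially for short inputs and closely related languages.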