sematext


Products :: Key Phrase Extractor

Also known as Concept Extractor, Collocation Extractor, and SIP Extractor

Description

Key Phrase Extractor is a toolkit for extracting key terms and phrases from text. It is designed to be used in two main modes:
  1. Extractor of common (frequently occurring) phrases. These phrases are known as Collocations.

    When used in this mode, the Key Phrase Extractor identifies key phrases in the input text. For example, if Key Phrase Extractor were to analyze the content of Lucene in Action, it would find terms like "Lucene" and "search", as well as phrases such as "inverted index", "information retrieval", "query parser", and so on.

  2. Extractor of phrases that distinguish one set of documents from another (the two sets are also known as the background and foreground corpora). These phrases are known as Statistically Improbable Phrases, or SIPs.

    When used in this mode, the Key Phrase Extractor finds the key phrases that differentiate two document sets. For example, when given news articles from yesterday and news articles from today, the Key Phrase Extractor will identify the key terms and phrases that make today's news different from yesterday's. These may be names of people, such as "Steve Jobs" or "Warren Buffett", or phrases such as "Swine Flu" or "Somali Pirates", thus identifying people and concepts that have more mentions today than they had yesterday.
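
The two modes can be illustrated with a minimal sketch. This is not the Key Phrase Extractor's own code or API: the class and method names below (PhraseModesSketch, countBigrams, scoreAgainstBackground) are hypothetical, the sketch only considers two-word phrases, and the scoring (foreground rate divided by an add-one-smoothed background rate) is a simple stand-in for whatever statistics the product actually uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not the Key Phrase Extractor API.
class PhraseModesSketch {

    // Lowercase the text and split it on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Mode 1 (collocations): count every pair of adjacent words in each document.
    // Sentence boundaries are ignored to keep the sketch short.
    static Map<String, Integer> countBigrams(List<String> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : docs) {
            List<String> tokens = tokenize(doc);
            for (int i = 0; i + 1 < tokens.size(); i++) {
                counts.merge(tokens.get(i) + " " + tokens.get(i + 1), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Sum of all counts in a frequency map.
    static double total(Map<String, Integer> counts) {
        double sum = 0;
        for (int c : counts.values()) {
            sum += c;
        }
        return sum;
    }

    // Mode 2 (SIP-style): score each foreground bigram by how much more often it
    // occurs in the foreground than in the background. The background rate uses
    // add-one smoothing so unseen phrases do not cause a division by zero.
    static Map<String, Double> scoreAgainstBackground(List<String> foreground,
                                                      List<String> background) {
        Map<String, Integer> fg = countBigrams(foreground);
        Map<String, Integer> bg = countBigrams(background);
        double fgTotal = total(fg);
        double bgTotal = total(bg);
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : fg.entrySet()) {
            double fgRate = e.getValue() / fgTotal;
            double bgRate = (bg.getOrDefault(e.getKey(), 0) + 1) / (bgTotal + 1);
            scores.put(e.getKey(), fgRate / bgRate);
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> today = List.of("Swine flu spreads. Swine flu cases rise.",
                                     "Markets react to swine flu");
        List<String> yesterday = List.of("Markets react to earnings",
                                         "Markets react to rates");
        // Print the three highest-scoring phrases for today's articles.
        scoreAgainstBackground(today, yesterday).entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(3)
                .forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
    }
}
```

In this sketch a phrase that is frequent today but absent yesterday ("swine flu") scores higher than one that is common in both sets ("markets react"), which is the behavior the second mode describes.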

Business Value / Benefits

  • Extracts key concepts from content
  • Extracts the key concepts that differentiate one set of content from another
  • Identifies key terms and phrases useful for describing main concepts from a larger piece of text
  • Finds key terms and phrases that enhance search results by providing additional navigational metadata

Integration

Key Phrase Extractor exposes a simple Java API. Given a piece of text, it returns a list of phrases ordered by their computed score. The API can filter the returned phrases, and the toolkit includes several useful filters. The API is simple and extensible, so you can write and plug in your own filters, too.
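
To give a feel for what a pluggable filter could look like, here is a minimal sketch. The real API is not documented on this page, so every name below (PhraseFilterSketch, ScoredPhrase, minScore, minWords, rank) is an assumption made for illustration, and filters are modeled with the standard java.util.function.Predicate interface rather than with any Key Phrase Extractor type.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only; not the Key Phrase Extractor API.
class PhraseFilterSketch {

    // A phrase together with its computed score (assumed shape).
    static final class ScoredPhrase {
        final String text;
        final double score;

        ScoredPhrase(String text, double score) {
            this.text = text;
            this.score = score;
        }
    }

    // Example filter: keep phrases whose score is at least the threshold.
    static Predicate<ScoredPhrase> minScore(double threshold) {
        return p -> p.score >= threshold;
    }

    // Example filter: keep phrases made of at least n whitespace-separated words.
    static Predicate<ScoredPhrase> minWords(int n) {
        return p -> p.text.trim().split("\\s+").length >= n;
    }

    // Apply the filter, then order the surviving phrases by descending score.
    static List<ScoredPhrase> rank(List<ScoredPhrase> candidates,
                                   Predicate<ScoredPhrase> filter) {
        List<ScoredPhrase> kept = new ArrayList<>();
        for (ScoredPhrase p : candidates) {
            if (filter.test(p)) {
                kept.add(p);
            }
        }
        kept.sort(Comparator.comparingDouble((ScoredPhrase p) -> p.score).reversed());
        return kept;
    }

    public static void main(String[] args) {
        List<ScoredPhrase> candidates = List.of(
                new ScoredPhrase("search", 0.9),
                new ScoredPhrase("inverted index", 0.7),
                new ScoredPhrase("query parser", 0.8),
                new ScoredPhrase("the index", 0.1));
        // Combine two filters: multi-word phrases with a score of at least 0.5.
        for (ScoredPhrase p : rank(candidates, minWords(2).and(minScore(0.5)))) {
            System.out.println(p.text + "\t" + p.score);
        }
    }
}
```

Modeling filters as predicates means a custom filter is a single lambda, and filters can be combined with and, or, and negate without extra plumbing.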

FAQ

None - ask us!