sematext


Technology

We are heavily involved in, and are active developers of, several excellent open-source search products: Lucene, Solr, Nutch (LSN), and Mahout. We know Lucene, Solr, and Nutch inside-out and rely on them for core indexing and search functionality.

The LSN trio is a mature set of products developed over the years under the Apache Software Foundation umbrella. Lucene, Solr, and Nutch are used by giants such as AOL, Apple, Comcast, SalesForce, and a number of other companies, some of which you can see on the Sematext client list. Altogether, the Lucene family of products sees over 5,000 downloads every day. That is nearly 2 million downloads a year!

Our own products integrate seamlessly with Lucene, Solr, and Nutch, but are also designed to be search-provider agnostic whenever possible. That makes them usable with Endeca, FAST, Google Search Appliance, Autonomy, Attivio, Vivisimo, or any other commercial/enterprise search solution. Our products are built on top of core search and are designed to enhance the overall search experience through features such as query spellchecking, related searches, and search auto-completion.

Lucene

Lucene is a high-performance, scalable Information Retrieval (IR) library. Information retrieval refers to the process of searching for documents, information within documents, or metadata about documents. Lucene lets you add searching capabilities to your applications. It is a mature, free, open-source project implemented in Java; it’s a project in the Apache Software Foundation, licensed under the liberal Apache Software License. Lucene is currently, and has been for quite a few years, the most popular free IR library. Sematext's founder has been a Lucene developer for 10+ years and is the co-author of Lucene in Action (1st and 2nd ed.), the best-selling Lucene book.
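To make the idea of an IR library concrete, here is a minimal sketch of an inverted index, the basic data structure behind full-text search. It is a toy illustration in Python, not Lucene's API or implementation; the function names and the whitespace tokenizer are assumptions made for this example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return sorted ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Lucene is a search library",
    2: "Solr is a search server built on Lucene",
    3: "Hadoop is a distributed computing framework",
}
index = build_index(docs)
print(search(index, "search lucene"))  # prints [1, 2]
```

A real library adds analysis (stemming, stop words), relevance scoring, and compact on-disk index structures on top of this basic term-to-documents mapping.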

Solr

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g. Word, PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat or Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required. Sematext's founder has been a Solr developer for 4+ years.
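As a small illustration of what "easy to use from virtually any programming language" means in practice, the sketch below builds a Solr query URL in Python using only the standard library. The `q`, `wt`, and `rows` parameters and the `/select` handler are standard Solr conventions; the base URL is a hypothetical local installation, and the helper name `solr_select_url` is made up for this example. The code only constructs the URL and does not contact a server.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, query, rows=10):
    """Build a Solr /select URL that asks for a JSON response."""
    params = {"q": query, "wt": "json", "rows": rows}
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local Solr instance; adjust host, port, and path to your setup.
url = solr_select_url("http://localhost:8983/solr", "title:lucene", rows=5)
print(url)
# prints http://localhost:8983/solr/select?q=title%3Alucene&wt=json&rows=5
```

Any HTTP client in any language can then fetch that URL and parse the JSON response.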

Elastic Search

Elastic Search is a cluster- and cloud-aware, high-performance, open source search server. It features cluster node auto-discovery, index sharding and replication, distributed search, cluster monitoring, faceting, filtering, highlighting, etc.

It is written in Java and runs as a standalone full-text search server. Elastic Search uses the Lucene Java search library at its core for full-text indexing and search, and has a REST-like HTTP and JSON API that makes it easy to use from virtually any programming language. It has an extensible architecture and a growing list of pluggable modules.
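The JSON-over-HTTP style can be sketched as follows. This Python example only assembles a request path and body for a simple `query_string` search against the `_search` endpoint; the index name `articles` and the helper `build_search_request` are hypothetical, and nothing is sent over the network.

```python
import json


def build_search_request(index_name, query_text, size=10):
    """Return the (path, JSON body) pair for a simple query_string search."""
    path = f"/{index_name}/_search"
    body = {"query": {"query_string": {"query": query_text}}, "size": size}
    return path, json.dumps(body, sort_keys=True)


# "articles" is a hypothetical index name used only for illustration.
path, body = build_search_request("articles", "lucene AND solr", size=5)
print(path)
print(body)
```

The resulting body would be sent to the server's HTTP port with any HTTP client, which is what makes the API language-neutral.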

Mahout

Mahout is a scalable Machine Learning Java library. It contains parallelizable implementations of a number of Machine Learning algorithms for Classification / Categorization, Clustering, Recommendations, Pattern Mining, Regression, Dimension Reduction, Evolutionary Algorithms, etc. Because it makes use of Hadoop and can run on many machines in parallel, it is suitable for very large data sets.

Hadoop / HDFS / MapReduce

Hadoop is an open-source solution for reliable, scalable, distributed computing. HDFS (Hadoop Distributed File System) is a core Hadoop sub-system that provides high throughput access to application data. Hadoop's MapReduce implementation is a framework suitable for distributed processing of large data sets on compute clusters and HDFS. Sematext employs 2 Certified Hadoop Developers.
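The MapReduce programming model itself is simple enough to sketch in a few lines. The following single-process Python example walks through the classic word count: a map step emits key/value pairs, a shuffle step groups them by key, and a reduce step aggregates each group. It does not use Hadoop's API; all function names are chosen for this illustration, and a real cluster would run the map and reduce steps on many machines in parallel.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["search with lucene", "search with solr"])
print(counts)  # prints {'search': 2, 'with': 2, 'lucene': 1, 'solr': 1}
```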

HBase

HBase is a scalable, distributed, column-oriented database that supports structured data storage for large data sets. It works well with MapReduce, allowing developers to process data stored in HBase with MapReduce-based jobs. HBase is modeled after Google BigTable.

Voldemort

Voldemort is a high-performance distributed Key-Value Store that includes data partitioning, replication, rebalancing, graceful failure handling, pluggable storage engines, etc. Voldemort was developed at LinkedIn, where it is used in high-data and high-request volume environments.
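Data partitioning and replication in a key-value store can be sketched roughly as below. This is a generic illustration of hashing keys onto a ring of partitions and picking distinct nodes as replicas; it is not Voldemort's actual code or client API, and the node names, partition layout, and function names are all hypothetical.

```python
import hashlib


def partition_for(key, num_partitions):
    """Hash a key to a partition id in a stable, process-independent way."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def replica_nodes(key, partition_owner, replication_factor):
    """Walk the partition ring from the key's partition, collecting distinct nodes."""
    num_partitions = len(partition_owner)
    start = partition_for(key, num_partitions)
    nodes = []
    for offset in range(num_partitions):
        node = partition_owner[(start + offset) % num_partitions]
        if node not in nodes:
            nodes.append(node)
        if len(nodes) == replication_factor:
            break
    return nodes


# Hypothetical layout: 6 partitions spread over 3 nodes.
partition_owner = ["node-a", "node-b", "node-c", "node-a", "node-b", "node-c"]
print(replica_nodes("user:42", partition_owner, replication_factor=2))
```

Because the key-to-partition mapping is fixed, rebalancing in a scheme like this amounts to reassigning whole partitions to different nodes rather than rehashing every key.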

Cassandra

Cassandra is a highly scalable, column-oriented distributed database originally developed by Facebook and later donated to the Apache Software Foundation. Cassandra is a Google BigTable and Amazon Dynamo hybrid featuring data partitioning, replication, rebalancing, tunable eventual consistency settings, elastic run-time cluster expansion, data durability, fault tolerance, etc. We've gone through the Cassandra training course given by one of the Cassandra developers.
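The "tunable" part of eventual consistency is commonly explained with the quorum rule from Dynamo-style systems: with N replicas, a read that contacts R replicas is guaranteed to overlap a write acknowledged by W replicas whenever R + W > N. The Python sketch below encodes just that arithmetic; it is not Cassandra's API, and the function names are made up for this example.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def read_sees_latest_write(read_replicas, write_replicas, replication_factor):
    """True when every read set must overlap every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


n = 3  # assumed replication factor for the example
print(quorum(n))                                        # prints 2
print(read_sees_latest_write(quorum(n), quorum(n), n))  # prints True
print(read_sees_latest_write(1, 1, n))                  # prints False
```

Lower R and W values trade that overlap guarantee for lower latency and higher availability, which is the knob the consistency settings expose.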

Nutch

Nutch is open source web-search software. It builds on Lucene Java and Solr, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc. It is scalable and runs on top of Hadoop (i.e. it uses Hadoop's MapReduce and HDFS).

Other

Our experience and expertise don't end with search technologies. The following are some of the other technologies we use regularly:
  • BerkeleyDB (aka BDB)
  • Droids
  • Tika
  • MySQL
  • PostgreSQL
  • ...