Monday, May 17, 2010

Search implementation strategies

In this blog post, I will briefly discuss two approaches to providing search functionality in a JEE application. The focus is on how content is offered to a core search engine like Lucene, which is responsible for indexing, storing and retrieving. Lucene is a library for building search engines; it does not provide a way to retrieve or accept content from sources for indexing. Third-party vendors offer complete search engine products based on Lucene to fill this gap. How the indexing itself works is outside the scope of this blog post. Interesting sites powered by Lucene can be found here.

The two approaches I'm going to discuss:

  • Web crawler/spider
  • Enterprise search

The web crawler is an automated bot that starts by indexing a web page supplied by a user or read from an internal list. Every hyperlink found on that page is scheduled for indexing as well. In this way, the web crawler hops from page to page, indexing the content after every visit. The web crawler conforms to a pull model, as it initiates the request for the content to be indexed.
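
To make the pull model concrete, here is a minimal sketch of the crawl loop in Java. It is not how any particular product does it: the class name `CrawlerSketch`, the in-memory `web` map that stands in for real HTTP requests, and the naive regex-based link extraction are all assumptions made for illustration. A real crawler would fetch pages over HTTP, use a proper HTML parser, and hand each page to the indexer instead of just recording its URL.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pull-model crawler. The "web" is an in-memory map of
// URL -> HTML, so the sketch runs without any network access.
class CrawlerSketch {

    // Naive href extraction (assumption: links are written as href="...").
    private static final Pattern LINK = Pattern.compile("href=\"([^\"]+)\"");

    // Returns the URLs in the order in which they were visited and "indexed".
    static List<String> crawl(String seed, Map<String, String> web) {
        List<String> indexed = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(seed);
        seen.add(seed);

        while (!queue.isEmpty()) {
            String url = queue.poll();
            String html = web.get(url);   // the crawler pulls the content itself
            if (html == null) {
                continue;                 // dead link: nothing to index
            }
            indexed.add(url);             // stand-in for passing the page to the indexer

            Matcher matcher = LINK.matcher(html);
            while (matcher.find()) {
                String link = matcher.group(1);
                if (seen.add(link)) {     // true only for links not seen before
                    queue.add(link);      // schedule the newly found hyperlink
                }
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        Map<String, String> web = Map.of(
            "/home", "<a href=\"/news\">News</a> <a href=\"/about\">About</a>",
            "/news", "<a href=\"/home\">Home</a>",
            "/about", "No links here",
            "/orphan", "Nothing links to this page");
        System.out.println(crawl("/home", web));
    }
}
```

Starting from `/home`, the sketch reaches `/news` and `/about` but never visits `/orphan`, because no hyperlink points to it.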

Enterprise search conforms to a push model: it is an API or service that waits for clients to provide content for indexing. A typical scenario is a CMS that calls the search engine's web service and passes the content as a parameter or request body. In this case, the client initiates the indexing process.
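
The push model can be sketched in Java as well. Everything in this example is hypothetical: the class name `PushIndexSketch`, its methods, and the in-memory map that stands in for a real index such as Lucene. The search itself is a naive substring match. The only point is to show who initiates indexing: the client calls the service whenever content is added, changed or removed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical push-model search service. A sorted in-memory map stands in
// for the real index; a product would delegate to a search library instead.
class PushIndexSketch {

    private final Map<String, String> documents = new TreeMap<>();

    // Called by the client (for example a CMS) when content is created or updated.
    void index(String id, String content) {
        documents.put(id, content);
    }

    // Called by the client when content is deleted.
    void remove(String id) {
        documents.remove(id);
    }

    // Naive case-insensitive substring search; returns matching ids in sorted order.
    List<String> search(String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : documents.entrySet()) {
            if (entry.getValue().toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        PushIndexSketch service = new PushIndexSketch();
        service.index("doc-1", "Lucene basics");
        service.index("doc-2", "Internal memo about Lucene"); // non-public content works too
        System.out.println(service.search("lucene"));
        service.remove("doc-2");                               // removal is visible immediately
        System.out.println(service.search("lucene"));
    }
}
```

Because the client decides when to call `index` and `remove`, changes show up in search results right away, and content that no hyperlink points to can be indexed just as easily.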

Advantages of web crawling:

  • No modification needed in existing application
  • Easy to implement search on multiple websites

Disadvantages of web crawling:

  • Pages not reachable by hyperlinks cannot be indexed
  • New content is only visible in the search engine after a crawler visit
  • Crawler traffic puts additional load on the crawled site
  • Only public HTML pages can be indexed by default

Advantages of enterprise search:

  • Finer control of search results (authorization, metadata, keywords)
  • New content explicitly added/removed/updated by application
  • Support for all kinds of content (even non-public content or database content)
  • Minimal performance impact

Disadvantages of enterprise search:

  • Integration code needed in every application to facilitate search functionality

There is no single best search strategy. The choice between the two approaches depends heavily on the context and requirements of the system.

