The two approaches I'm going to discuss:
- Web crawler/spider
- Enterprise search
The web crawler is an automated bot that starts by indexing a webpage supplied by a user or read from an internal list. Every hyperlink found on the page is then scheduled for indexing as well. In this way, the web crawler hops from page to page, indexing the content of every page it visits. The web crawler follows a pull model: it initiates the requests for the content it indexes.
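To make the pull model concrete, here is a minimal crawler sketch in Python using only the standard library. The seed URL, the index() callback and the page limit are illustrative assumptions, not part of any particular search engine.

```python
# A minimal pull-model crawler sketch (standard library only).
# The index() callback is a placeholder for whatever hands content to the search engine.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, index, max_pages=100):
    """Pull model: the crawler itself fetches pages, starting from a seed URL."""
    queue = deque([seed_url])
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # unreachable pages are simply skipped
        index(url, html)  # hand the fetched content over for indexing
        parser = LinkExtractor()
        parser.feed(html)
        # schedule every hyperlink found on the page for a later visit
        for link in parser.links:
            queue.append(urljoin(url, link))
```

Note how the disadvantages listed below follow directly from this loop: only pages reachable through hyperlinks ever enter the queue, and new content is only picked up on the next visit.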
Enterprise search, in contrast, follows a push model: it is an API or service that waits for clients to submit content for indexing. A typical scenario is a CMS that calls the search engine's web service and passes the content as the payload of the request. In this case the client initiates the indexing process.
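A minimal sketch of the push side, assuming a hypothetical HTTP indexing endpoint that accepts JSON documents; the URL and field names are made up for illustration and will differ per search engine.

```python
# A minimal push-model sketch: the application submits a document to a
# hypothetical search service over HTTP. The endpoint URL and the JSON
# field names are assumptions, not a real search engine API.
import json
from urllib.request import Request, urlopen

SEARCH_ENDPOINT = "http://search.example.com/index"  # hypothetical service


def push_document(doc_id, title, body, keywords=None):
    """Push model: the client initiates indexing by sending the content."""
    payload = {
        "id": doc_id,
        "title": title,
        "body": body,
        "keywords": keywords or [],
    }
    request = Request(
        SEARCH_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request) as response:
        return response.status  # e.g. 200 when the document was accepted


# Example: a CMS pushes an article right after it is published.
# push_document("article-42", "New release", "Full article text...", ["release"])
```

Because the application decides what to push and when, it can also attach metadata and authorization information, which is exactly where the advantages listed below come from.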
Advantages of web crawling:
- No modification needed in existing application
- Easy to implement search on multiple websites
Disadvantages of web crawling:
- Pages not reachable by hyperlinks cannot be indexed
- New content only visible in search engine after a crawler visit
- Crawler traffic adds load to the website, which can impact performance
- Only publicly reachable HTML pages can be indexed by default
Advantages of enterprise search:
- Finer control of search results (authorization, meta-data, keywords)
- New content explicitly added/removed/updated by application
- Support for all kinds of content (even non-public content or database content)
- Minimal performance impact
Disadvantages of enterprise search:
- Integration code needed in every application to facilitate search functionality
There is no single best search strategy. Choosing between the two approaches depends heavily on the context and the requirements of the system.