Improving Search & Query with Lucene

28 Aug, 2013

Most software applications, whatever the platform, include some form of search that queries data from one or more data sources and serves the results to the consuming application. Search can mean a search box in the user interface or an internal data query.

Applications backed by large databases often run into query performance problems, most commonly because of the cost of combining data structures (e.g. table joins) and retrieving data (e.g. queries). This can remain true even after introducing clusters, indexes, materialized views, and similar measures at the database level. Query execution tends to slow down as records and indexes grow.

Lucene

Apache Lucene is a high-performance, full-text search engine library written in Java, suitable for nearly any application that requires full-text search. More details are available on the Apache Lucene website.

It can even support “Did you mean?” functionality, like Google Search, which suggests alternatives for misspelled or unrecognized words.
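
To make the "Did you mean?" idea concrete, here is a minimal sketch in Python. It does not use Lucene or its spell-checking module; it swaps in the standard library's `difflib` similarity matching instead, and the function name `did_you_mean`, the sample vocabulary, and the 0.7 cutoff are all assumptions chosen for illustration.

```python
import difflib

def did_you_mean(word, vocabulary, max_suggestions=3):
    """Suggest known terms that look similar to an unrecognized word.

    Illustrative only: uses difflib's similarity ratio, not Lucene's
    spell checker. The 0.7 cutoff is an arbitrary assumption.
    """
    if word in vocabulary:
        return []  # the word is known, nothing to suggest
    return difflib.get_close_matches(
        word, vocabulary, n=max_suggestions, cutoff=0.7
    )

# Hypothetical vocabulary, e.g. the set of terms present in an index
vocabulary = ["lucene", "index", "search", "document"]

print(did_you_mean("lucine", vocabulary))  # prints ['lucene']
```

In a real application the vocabulary would come from the terms already present in the index, so suggestions are always words that can actually be found.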

Why is Lucene faster?

Lucene is very fast at searching because it uses an inverted index. Normally, a data source structures data as objects or records, which in turn have fields and values. A search proceeds from the top down, i.e. it scans for objects whose fields have matching values and returns those objects.

With an inverted index, Lucene works the other way round: it maps each term (value) that occurs in the indexed fields to the documents that contain it. When a search is run, Lucene first looks up the query terms in the index and then returns the documents (objects) whose fields contain those values, without scanning every document.

For text search, this index structure and its search algorithms typically make Lucene much faster than querying a database directly.
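
The inverted index idea can be sketched in a few lines of Python. This is a toy model under simple assumptions (whitespace tokenization, AND semantics, everything in memory) and not Lucene's actual implementation; the function names and sample documents are hypothetical.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; real analyzers do much more."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every query term (AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    # One dictionary lookup per term, then intersect the posting sets.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Hypothetical sample documents
docs = {
    1: "Alice works in Finance",
    2: "Bob works in Engineering",
    3: "Carol leads Engineering",
}
index = build_inverted_index(docs)

print(sorted(search(index, "works engineering")))  # prints [2]
```

The search never scans the documents themselves: it does one lookup per query term and intersects the resulting id sets, which is why the cost depends on the number of matching terms rather than on the total number of records.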

Sharing index files

Because creating a Lucene index is time-consuming, the index can be built on one machine and distributed to the onsite and offshore development teams. Each team then just places the index files in a configured location on the filesystem and uses them.

Architecture

[Figure: Lucene architecture diagram]

Lucene Indexing

Developers can introduce a Lucene layer that indexes all the relevant data from the database as documents in a Lucene index. This is done programmatically: query the database once, create mapped objects for the records, and convert those objects into Lucene documents. Each document can carry a unique ID as its key. For example, we can combine the details of the Department and Employee tables into one object and index one document per employee, with the employee ID or code as the key.

Lucene can also index and search the content of files such as Word, PDF, HTML, and plain text, once their text has been extracted (for example with a parser library such as Apache Tika). The initial indexing is an expensive process, so take care to run it during off-peak hours.
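
As a rough illustration of the Department and Employee example, the sketch below joins the two tables once and flattens each row into a document keyed by employee ID. It uses Python with an in-memory SQLite database rather than Lucene itself; the table layout, column names, and the `load_employee_documents` helper are assumptions made for the example.

```python
import sqlite3

def load_employee_documents(conn):
    """Join employee and department rows into one flat document per employee."""
    rows = conn.execute(
        """
        SELECT e.id, e.name, e.title, d.name
        FROM employee AS e
        JOIN department AS d ON d.id = e.department_id
        ORDER BY e.id
        """
    )
    # The employee id is the document key, as described in the text.
    return {
        emp_id: {"id": emp_id, "name": name, "title": title, "department": dept}
        for emp_id, name, title, dept in rows
    }

# Hypothetical schema and sample data
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employee (
        id INTEGER PRIMARY KEY, name TEXT, title TEXT, department_id INTEGER
    );
    INSERT INTO department VALUES (10, 'Finance'), (20, 'Engineering');
    INSERT INTO employee VALUES
        (1, 'Alice', 'Analyst', 10),
        (2, 'Bob', 'Developer', 20);
    """
)

documents = load_employee_documents(conn)
print(documents[2]["department"])  # prints Engineering
```

The join is paid once, at indexing time. Each resulting document would then be handed to the indexing layer, so that later searches no longer need to join the tables.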

Lucene search

Normally, developers would implement application search or query functionality as a database query that returns records. With Lucene, the application first queries the Lucene index programmatically, which quickly retrieves the documents matching the search criteria. The required data can then be read from the returned documents and displayed to the user.

Alternatively, retrieve only the ID or a specific field (e.g. employee code) from the returned documents and use those values to query the database. This is very useful because Lucene can search millions of documents faster than a database can search millions of records. The idea is to search Lucene first, get the subset of matching records, and use that subset to query the database.

We have had very good experience with the combined performance of filtering on the Lucene index before querying the database. Given this proven track record, we have incorporated it into all our existing and new software solutions. We have logged search performance more than 150% faster than database search alone.
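
The "search the index first, then query the database" pattern might look roughly like the following Python sketch. A plain dictionary stands in for the Lucene index and an in-memory SQLite database stands in for the real one; the helper names, the schema, and the sample data are all hypothetical.

```python
import sqlite3

def search_index(index, term):
    """Phase 1: look the term up in the index and return matching employee ids."""
    return sorted(index.get(term.lower(), set()))

def fetch_employees(conn, employee_ids):
    """Phase 2: fetch only the matching rows from the database by primary key."""
    if not employee_ids:
        return []  # nothing matched, so skip the database entirely
    placeholders = ",".join("?" for _ in employee_ids)
    sql = (
        "SELECT id, name, salary FROM employee "
        f"WHERE id IN ({placeholders}) ORDER BY id"
    )
    return conn.execute(sql, employee_ids).fetchall()

# Hypothetical database
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "Alice", 5000), (2, "Bob", 6000), (3, "Carol", 7000)],
)

# Hypothetical pre-built index: term -> employee ids
index = {"engineering": {2, 3}, "finance": {1}}

matching_ids = search_index(index, "Engineering")
rows = fetch_employees(conn, matching_ids)
print(rows)  # prints [(2, 'Bob', 6000), (3, 'Carol', 7000)]
```

The database only ever sees a primary-key lookup on a small set of IDs, which is usually far cheaper than a text search or join over the full tables.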

Update index

The Lucene index needs to be updated whenever data changes in the database. This can be handled asynchronously by background threads that re-index only the modified records. Lucene can keep serving searches from a snapshot of the index while re-indexing is in progress.
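
One way to picture incremental re-indexing with an undisturbed snapshot is the toy Python class below. It is a sketch of the general idea (copy, apply only the changes, then swap in the new version) and not how Lucene implements it internally; the `SnapshotIndex` class and its methods are hypothetical names.

```python
import threading

class SnapshotIndex:
    """Toy index: readers always see a complete snapshot, never a partial update."""

    def __init__(self, documents):
        self._lock = threading.Lock()
        self._snapshot = dict(documents)  # doc_id -> document

    def search(self, predicate):
        snapshot = self._snapshot  # grab the current snapshot once
        return sorted(
            doc_id for doc_id, doc in snapshot.items() if predicate(doc)
        )

    def reindex(self, changed, deleted_ids=()):
        """Apply only the modified records, then publish a new snapshot."""
        with self._lock:  # one writer at a time
            updated = dict(self._snapshot)  # copy; old snapshot stays readable
            updated.update(changed)
            for doc_id in deleted_ids:
                updated.pop(doc_id, None)
            self._snapshot = updated  # swap the reference in one step

index = SnapshotIndex({1: {"dept": "Finance"}, 2: {"dept": "Engineering"}})
before = index.search(lambda doc: doc["dept"] == "Engineering")

# Re-index only the changed records on a background thread.
worker = threading.Thread(
    target=index.reindex,
    args=({1: {"dept": "Engineering"}, 3: {"dept": "Engineering"}},),
    kwargs={"deleted_ids": [2]},
)
worker.start()
worker.join()

after = index.search(lambda doc: doc["dept"] == "Engineering")
print(before, after)  # prints [2] [1, 3]
```

Copying the whole dictionary on every update is only reasonable for a toy; the point is that a search started before the swap keeps reading the old, consistent version.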

Workflow

[Figure: Lucene workflow diagram]
