Adding Lucene Search to your Applications (part 2)

Searching the Lucene Index

In the previous post we created a Lucene index. Now let’s get to the real juice: searching the index.


Since we already downloaded Apache Lucene, there is no need to download it again; we can continue with the files from the previous post. The procedure is identical, except that instead of the file IndexFiles.java we will now use SearchFiles.java. Since both Java sources (and their compiled .class files) live in the same package, org.apache.lucene.demo, the commands have to name the class by its fully qualified name, unless you want to move SearchFiles.java into a different package (not recommended). Assuming the Lucene jars are on your classpath, type the command as follows:

java org.apache.lucene.demo.SearchFiles

And if you haven’t created the index yet, create it as follows:

java org.apache.lucene.demo.IndexFiles

Once you have everything working, it’s time to learn something about Lucene’s query syntax. You can find it at https://lucene.apache.org/core/2_9_4/queryparsersyntax.html

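Try querying your index with the Boolean operators AND, OR and NOT, and with proximity searches. Note that Lucene’s query parser has no NEAR operator; proximity is written as a quoted phrase followed by a tilde and a maximum distance, for example "cat dog"~5. Also try nesting Boolean queries with parentheses, like (cat OR dog) AND NOT food or "cat dog"~5 AND food. (Replace cat, dog and food with your own terms.)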
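To get a feeling for what a nested Boolean query such as (cat OR dog) AND NOT food selects, it can help to think of each term as a set of document ids and of the operators as set operations. The following is only a toy model in plain Java, not Lucene's implementation; the class name BooleanQuerySketch and the document ids are made up for illustration.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Toy model: each term maps to the set of document ids that contain it.
class BooleanQuerySketch {

    // a AND b: documents that appear in both sets.
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // a OR b: documents that appear in at least one set.
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // a AND NOT b: documents in a that do not appear in b.
    static Set<Integer> andNot(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical postings for three terms.
        Set<Integer> cat = new TreeSet<>(Arrays.asList(1, 2, 5));
        Set<Integer> dog = new TreeSet<>(Arrays.asList(2, 3));
        Set<Integer> food = new TreeSet<>(Arrays.asList(3, 5));

        // (cat OR dog) AND NOT food
        System.out.println(andNot(or(cat, dog), food)); // prints [1, 2]
    }
}
```

The parentheses matter: cat OR (dog AND NOT food) would keep document 5 as well, because the NOT would then apply only to the dog documents.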

Now it’s time to learn about scoring. Basically it can be summarized as follows: Lucene knows a few things about its index, among them the term frequency (tf) and the document frequency (df), from which the inverse document frequency (idf) is derived. Underneath all of this sits a map from terms to documents. Say you search for the term “dog”: Lucene knows in which documents that term appears. So dog -> {4, 8, 12} would mean that this term appears in documents 4, 8 and 12.

The document frequency (df) is the number of documents in which the term appears; for dog -> {4, 8, 12} that is 3. Notice how the map and the count differ from each other. The map is actually the inverted index: during indexing, Lucene collects each term and the documents it’s found in, then inverts this to get a map of all terms and their documents. Don’t confuse this with the inverse document frequency (idf), which in this simplified picture is just the reciprocal of the number of documents for a particular term, so 1/df.
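The difference between the inverted index and the document frequency can be sketched in a few lines of plain Java. This is a minimal illustration under simple assumptions (whitespace tokenization, lowercasing, a hypothetical class InvertedIndexSketch and made-up documents); Lucene's real index structures are far more elaborate.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {

    // Builds a map from each term to the sorted ids of the documents containing it.
    static Map<String, SortedSet<Integer>> build(Map<Integer, String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            for (String term : doc.getValue().toLowerCase().split("\\s+")) {
                if (term.isEmpty()) {
                    continue; // skip the empty token produced by leading whitespace
                }
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    // The document frequency is just the size of a term's document set.
    static int documentFrequency(Map<String, SortedSet<Integer>> index, String term) {
        SortedSet<Integer> postings = index.get(term);
        return postings == null ? 0 : postings.size();
    }

    public static void main(String[] args) {
        // Made-up documents, keyed by document id.
        Map<Integer, String> docs = new TreeMap<>();
        docs.put(4, "the dog barks");
        docs.put(8, "dog eats dog food");
        docs.put(12, "a cat and a dog");
        docs.put(15, "cat food");

        Map<String, SortedSet<Integer>> index = build(docs);
        System.out.println(index.get("dog"));                // prints [4, 8, 12]
        System.out.println(documentFrequency(index, "dog")); // prints 3
    }
}
```

Document 8 contains “dog” twice, but it still counts only once towards the document frequency; how often a term occurs inside one document is what tf measures.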

To summarize: tf tells how many times a term appears in a single document, and df tells how many documents contain the term. Where tf says something about the importance of the document, df says something about the importance of the term.

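In this simplified picture, the score is the tf divided by the df. To be more specific, the score is tf times the reciprocal of the df, which is mathematically equivalent to tf/df. Why the detour through the reciprocal (1/df)? It is simply the conventional way to write it: a product of a per-document factor and a per-term factor. So just say score = tf * idf. Keep in mind that Lucene’s actual scoring formula dampens both factors (its classic similarity takes a square root of tf and uses a logarithm in idf) and adds further factors, so real scores will not equal tf/df exactly.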
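The simplified formula score = tf * (1/df) can be tried out with a small sketch in plain Java. It implements only the simplification described here, not Lucene's actual similarity; the class name TfIdfSketch, the whitespace tokenization and the sample documents are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

class TfIdfSketch {

    // tf: how many times the (lowercase) term occurs in one document.
    static int termFrequency(String term, String doc) {
        int count = 0;
        for (String token : doc.toLowerCase().split("\\s+")) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // df: how many documents contain the term at least once.
    static int documentFrequency(String term, List<String> docs) {
        int count = 0;
        for (String doc : docs) {
            if (termFrequency(term, doc) > 0) {
                count++;
            }
        }
        return count;
    }

    // Simplified score: tf * idf with idf = 1/df (0 if the term occurs nowhere).
    static double score(String term, String doc, List<String> docs) {
        int df = documentFrequency(term, docs);
        if (df == 0) {
            return 0.0;
        }
        double idf = 1.0 / df;
        return termFrequency(term, doc) * idf;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "the dog barks",
                "dog eats dog food",
                "a cat and a dog",
                "cat food");

        // Print the simplified score of the term "dog" for every document.
        for (String doc : docs) {
            System.out.println(score("dog", doc, docs) + "  " + doc);
        }
    }
}
```

With these documents, “dog eats dog food” gets the highest score for “dog” because the term occurs twice in it, while “cat food” scores 0 because it does not contain the term at all.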

How does this make sense?

  • tf – can be seen as a measure of the document’s importance: the higher this value, the more important the document is for the term.
  • df – says something about the importance of the term: the higher this value, the more common (and so the less distinctive) the term is.

So to compensate tf for df it makes sense to divide tf by df. Of course this is somewhat simplified. In reality things are more complicated, for example when we use phrases instead of single-word terms. For this I refer to my article about the Lucene Term Vector.
