Apache Solr magic: mlt


I already introduced you to Solr in a different article. This time I will dig further into Solr and one of its magical features: the More Like This functionality.

We have all seen this, since most modern web search systems have it too.

One time, however, I was completely surprised by the Solr implementation.

What happened?

I was indexing a wiki garden for a large Dutch governmental organization. This wiki garden was a collection of wikis used by the organization to document just about everything you can think of: policies, ideas, project documentation, as well as meeting minutes and discussion forums. All in all, the wiki garden contained a few hundred thousand documents. Not really Big Data yet, but it came close.

I crawled and indexed the whole garden with Apache Nutch 1.1, which can push its index to Apache Solr. After I finished building a nice-looking interface, we decided to add a More Like This (mlt) link to every single search result in order to find similar documents.
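To give an idea of what such a link can look like, here is a minimal sketch that builds a request for Solr's MoreLikeThis component. The parameters `mlt`, `mlt.fl`, `mlt.mintf`, `mlt.mindf` and `mlt.count` are standard Solr parameters. The base URL, the `content` field name, the document id and the threshold values are assumptions for illustration, not the settings of the original project.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, doc_id, fields=("content",), count=5):
    """Build a Solr select URL that also asks for documents similar to doc_id."""
    params = {
        "q": f'id:"{doc_id}"',       # the search result we want neighbours for
        "mlt": "true",               # switch on the MoreLikeThis component
        "mlt.fl": ",".join(fields),  # fields used to judge similarity
        "mlt.mintf": 2,              # ignore terms occurring fewer than 2 times in the source document
        "mlt.mindf": 5,              # ignore terms occurring in fewer than 5 documents
        "mlt.count": count,          # number of similar documents to return
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host, core name and document id.
url = build_mlt_url("http://localhost:8983/solr/wiki", "page-42")
print(url)
```

Each search result can then carry its own link, generated from that result's id.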

Document similarity

When term vectors are stored in the Lucene index, the vector-space model of information retrieval lets us compute how close two documents are and express that as a single number. Strictly speaking this 'distance' is a similarity score: it ranges from 0 (nothing in common) to 1 (exactly equal, or the same document).
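The idea behind the vector-space model can be sketched in a few lines of Python. This is a simplified illustration using raw term counts and cosine similarity; it is not Lucene's actual scoring, which also weights terms by how rare they are across the index. The helper names and sample texts are made up for the example.

```python
import math
from collections import Counter


def term_vector(text):
    """Turn a text into a bag-of-words vector: term -> number of occurrences."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors, between 0 and 1 for counts."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


doc_a = term_vector("minutes of the project meeting")
doc_b = term_vector("report of the project meeting")
doc_c = term_vector("holiday schedule")

print(cosine_similarity(doc_a, doc_b))  # high: four of the five terms are shared
print(cosine_similarity(doc_a, doc_c))  # zero: no terms in common
```

Two documents that share most of their vocabulary score close to 1, while documents without a single common term score 0.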

The extremes 0 and 1 are not of interest. What interested us were scores between 0.5 and 1, so we investigated those. What we found was pretty awesome, especially in the range between 0.9 and 1. These documents were often written by different authors describing the same subject, for example a report of the same meeting or a review of the same scientific article!

Practical applications of document similarity

Here are a few I can think of:

  • Plagiarism (fraud) detection.
  • In bibliographic research (to find similar research papers).
  • In biochemical research, to find genetic similarities in nucleotide sequences. (In fact this is already done using BLAST. To see BLAST at work, visit NCBI.)
