We have all seen this, since most modern web search systems have it too.
One time, however, I was completely surprised by the Solr implementation.
I was indexing a Wiki Garden for a large Dutch governmental organization. This Wiki Garden was a collection of wikis used by the organization to document just about everything you can think of: policies, ideas, project documentation, as well as meeting minutes and discussion forums. All in all, the Wiki Garden contained a few hundred thousand documents. Not really Big Data yet, but it came close.
I crawled and indexed the whole garden with Apache Nutch 1.1, which by default stores its index in Apache Solr. After I finished building a nice-looking interface, we decided to add a More Like This (mlt) link to every single search result in order to find similar documents.
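To give an idea of what such a link can look like, here is a minimal sketch in Python that builds a More Like This request URL from Solr's standard `mlt`, `mlt.fl`, `mlt.count`, `mlt.mintf` and `mlt.mindf` parameters. The base URL, the `id` and `content` field names, and the helper name `mlt_link` are assumptions for illustration, not what we actually ran in that project.

```python
from urllib.parse import urlencode


def mlt_link(base_url, doc_id, fields=("content",), count=5):
    """Build a Solr select URL that asks for documents similar to doc_id.

    Assumes the unique key field is called `id` and that the fields listed
    in `fields` exist in the schema (hypothetical names).
    """
    params = {
        "q": f'id:"{doc_id}"',       # look up the document we clicked on
        "mlt": "true",               # switch on the More Like This component
        "mlt.fl": ",".join(fields),  # fields used to find similar documents
        "mlt.count": count,          # number of similar documents to return
        "mlt.mintf": 1,              # minimum term frequency in the source doc
        "mlt.mindf": 1,              # minimum document frequency in the index
    }
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical Solr core and document id.
    print(mlt_link("http://localhost:8983/solr/wiki",
                   "http://wiki.example.org/Meeting_2010_03"))
```

The low `mlt.mintf` and `mlt.mindf` values are only there to make the sketch return something on a tiny test index; on a real index you would probably tune them.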
When Lucene term vectors are properly set up for an index, the vector space model of information retrieval allows us to compute the ‘distance’ between two documents and express it as a number. This ‘distance’, which is really a similarity score, ranges from 0 (nothing in common) to 1 (exactly equal, or the same document).
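The idea behind that number can be illustrated with plain cosine similarity over term counts. This is a minimal sketch in Python, not Lucene's actual scoring (which also weights terms, among other things); the whitespace tokenizer and the function names `term_vector` and `cosine_similarity` are my own assumptions for the example.

```python
import math
from collections import Counter


def term_vector(text):
    """Turn a text into a raw term-frequency vector (lowercased, split on whitespace)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors.

    With non-negative counts the result lies between 0 (no shared terms)
    and 1 (same term distribution).
    """
    shared_terms = a.keys() & b.keys()
    dot = sum(a[t] * b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc1 = term_vector("minutes of the project meeting on search")
    doc2 = term_vector("report of the project meeting about search")
    print(cosine_similarity(doc1, doc2))
```

Two documents that share many terms in similar proportions end up close to 1, while documents without any term in common score 0.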
The scores 0 and 1 themselves are not of interest. What interested us were the scores from 0.5 to 0.9, so we investigated those. What we found was pretty awesome, especially in the range between 0.9 and 1. These documents were often written by different authors describing the same subject (for example, a report of the same meeting or a review of the same scientific article)!
Practical applications of document similarity
Here are a few I can think of: