# What is the term vector in Lucene

The vector is one of the most overloaded words in math, programming, and many other fields.

The vector is a very useful and powerful mental model. It's used extensively in math, but you will encounter it even more in programming and software engineering.

To understand vectors, geometry is the most intuitive starting point. But you should not stop there: a vector is much more than a geometric object.

Consider any number: you can think of it as a one-dimensional vector. In geometry, it is a point on a one-dimensional coordinate axis. To illustrate it, draw an arrow from the origin to the point; this arrow is a vector.

Consider two numbers: they define a point on a plane, and again there is an arrow vector.

Then come three-dimensional spaces, where a vector contains three numbers; this form is heavily used in 3D graphics programming.

And so on. As a data structure, a vector is represented as an array.

Consider a map:

```
{
:name "Bob"
:age 20
:floor "10"
}
```

This is a typical domain object in software. If you treat every key as an axis and every value as a point on that axis, then the object is a vector.

Thus, data modeling is a process of vectorizing real-world objects. Most of these vectors end up in a database: relational, key-value, document, or columnar. Whenever you want to describe something in a computer, you have to vectorize it.

A programming language like C++ uses the word vector for a particular data structure, but the concept of a vector is much more abstract than that. The C++ vector is basically a dynamic array: you don't have to specify its size, and it grows automatically as you keep adding items. This has little to do with the mathematical concept of a vector.

The vectorized object can be a document, so we can manipulate documents as if they were geometric objects in a multidimensional space, for example by calculating the angle between two vectors. This technique is used extensively in text search and information retrieval.

The object can also be an image or a sound; there are AI and deep learning algorithms that use vectors to recognize objects or speech. In essence, it's a similarity calculation problem.

Actually, the vector model is a fundamental way of human thinking, sometimes an unconscious one. When we make decisions we consider multiple aspects of an event, and many computer programs simply try to simulate this process. In deep learning, everything is vectorized into so-called thought vectors or word vectors, and then complex geometric transformations are performed on those vectors.

In Lucene's Javadoc, a term vector is defined as follows: “A term vector is a list of the document's terms and their number of occurrences in that document.” This indicates that each document has one term vector, which is a list.

### What is a term?

A term is the basic searchable unit in Lucene. In the analysis and indexing phase, text is broken into a token stream, and each element of the stream becomes a term. In the query phase, the query is first parsed into terms, which are then used to search the Lucene index.

A term is a pair: the first element is a string naming the field the term belongs to, and the second is a string holding the literal text of the term. The text can be an English word, a URL, an email address, or anything else generated by your analyzer at index time. The following code creates a term:

```
Term t = new Term("field", "TermText");
```

A term is always associated with a field. Technically, when you search a Lucene index, you are searching terms. This is different from grep-like search, which actually searches characters: whether it is a plain text search or a regular expression search, its smallest unit of search is a character. In a typical Lucene search, the user supplies a field and keywords (terms), and Lucene returns the documents whose fields contain those keywords.

Many important concepts in Lucene relate to terms, such as term frequency, term dictionary, and term vector, so it's very important to understand them clearly.
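To make the pair nature of a term concrete, here is a minimal sketch in plain Java. SimpleTerm is a hypothetical stand-in, not Lucene's actual Term class; it only illustrates that the same text under different field names gives two distinct terms:

```java
import java.util.Objects;

// Hypothetical, simplified stand-in for Lucene's Term: a (field, text) pair.
class SimpleTerm {
    final String field;
    final String text;

    SimpleTerm(String field, String text) {
        this.field = field;
        this.text = text;
    }

    // Two terms are equal only if BOTH the field name and the text match.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SimpleTerm)) return false;
        SimpleTerm t = (SimpleTerm) o;
        return field.equals(t.field) && text.equals(t.text);
    }

    @Override
    public int hashCode() {
        return Objects.hash(field, text);
    }

    public static void main(String[] args) {
        SimpleTerm a = new SimpleTerm("title", "fox");
        SimpleTerm b = new SimpleTerm("body", "fox");
        // Same text, different fields: two different terms.
        System.out.println(a.equals(b)); // prints "false"
    }
}
```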

### The confusing definition of the term vector

If term vectors are enabled for a field of a document, all terms in that field are added to the document's term vector. I think the description in the Javadoc is a little confusing, since every term belongs to exactly one field (the same term text in a different field makes a different term), and the number of occurrences of a term in the document is always the same as the number of occurrences of the term's text in its field. The Javadoc makes it sound as if terms belong directly to the document, as if a document were just a collection of terms. In fact, terms are first grouped by field, and only then belong to the document.

Also, notice that term vectors are enabled or disabled at the field level; a document may contain some fields with term vectors enabled and others with them disabled. Here is how you retrieve a term vector; you need a doc id and a field name:

```
Terms terms = idxReader.getTermVector(0, "title");
```

Here is another definition I think is better: for each field in each document, the term vector (sometimes called a document vector) may be enabled or disabled. If it is enabled, all terms in that field are added to the document's term vector list, and the list contains each term along with other information about it: its frequency, positions, and offsets.

### Index options and term vector

In Lucene, you add a document to the index; the document consists of fields, just like a database table row consists of columns. For each field, you can set various options to control how Lucene handles it when indexing the document.

There are three field options in Lucene: indexing, storing, and term vectors. Indexing and storing are easy to understand. The index option controls how the field is indexed, and thus how it can be searched. Is it broken into tokens that are searched individually? Or is it searched as a whole value, for example an ID string? The storing option decides whether to store the field's actual data in the index. Term vectors, like stored fields, are stored data, but they hold different information, generated during indexing. The index is a map from terms to documents; a term vector is also a map, but from terms to their frequency, position, and offset information within the document they belong to.

```
term -> (frequency, position, offset)
```
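As a rough illustration of this mapping, here is a small sketch in plain Java (a hypothetical helper, not Lucene's internal format) that builds, for one field of one document, a map from each term's text to the list of positions where it occurs; the frequency is just the size of that list:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TermVectorSketch {
    // Build a map: term text -> list of token positions in this one field.
    // The term's frequency is simply the size of its position list.
    static Map<String, List<Integer>> termVector(String fieldText) {
        Map<String, List<Integer>> vector = new TreeMap<>();
        String[] tokens = fieldText.split("\\s+"); // stand-in for a real analyzer
        for (int pos = 0; pos < tokens.length; pos++) {
            vector.computeIfAbsent(tokens[pos], k -> new ArrayList<>()).add(pos);
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> tv = termVector("quick fox brown fox");
        // "fox" occurs twice, at positions 1 and 3
        System.out.println(tv); // prints "{brown=[2], fox=[1, 3], quick=[0]}"
    }
}
```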

In a document's term vector, for each term we know the following information: the document id, the field name, the text of the term, its frequency, positions, and offsets. With this information computed, we can do a lot of interesting things at search time.

The indexed option means the inverted index information will be computed and stored, but that information only tells you which documents or fields contain a particular term. In other words, it only lets you do simple matching; this can be just a map lookup, where the key is a term and the value is a document id. To get more useful search results, we need the more detailed information provided by term vectors.

Sometimes we need the “uninverted” index: given a document, find all its terms and their position information. The index tells us which documents matched; the term vector tells us how and where they matched. You can think of a term vector as a miniature inverted index for just one document.

A classic example is search result highlighting. Suppose the user typed some query string and Lucene found some documents; how are we going to present the results to the user? Most search engines show the title, the URL, and a digest, with the searched keywords highlighted in them. The title and URL can be stored in the document, so they are not a problem, but the digest cannot be stored: for the same document, different queries generate different digests. To generate a digest we need to know where the search terms occur in the document, so we can select the pieces of text around them and optionally highlight the terms. The term vector contains all the information necessary to do this. For highlighting a blog post with Lucene 6.0.0 and a Gradle build, see How to do Lucene search highlight example.

The term vector is truly the last inch between the user and their search target: the exact spots where the searched terms occur in the matched document.
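Here is a toy sketch of the idea in plain Java (a hypothetical helper; a real implementation would read the start and end character offsets from the term vector instead of re-scanning the text, and would use Lucene's highlighter module):

```java
class HighlightSketch {
    // Wrap every occurrence of the term in <b>...</b>.
    // This toy version re-scans the text with indexOf; with a term vector,
    // the character offsets of each occurrence are already known.
    static String highlight(String text, String term) {
        StringBuilder out = new StringBuilder();
        int from = 0;
        int at;
        while ((at = text.indexOf(term, from)) >= 0) {
            out.append(text, from, at)            // text before the hit
               .append("<b>").append(term).append("</b>");
            from = at + term.length();
        }
        out.append(text.substring(from));         // remaining tail
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("quick fox brown fox", "fox"));
        // prints "quick <b>fox</b> brown <b>fox</b>"
    }
}
```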

Another interesting thing we can do with term vectors is finding documents similar to a given document, for example the “related posts” feature of a blog: a list of links to other posts similar to the current entry. With term vector information we can calculate how similar two documents are with a simple formula.

The term vector also plays an important role when scoring matching documents in the vector space model.

As a micro inverted index over a single document, the term vector answers queries such as: for a given search term, how many times does it occur in this document, and where does it show up? Or simply: frequencies and positions.

The term vector is generated during analysis. When the analyzer generates tokens, it also provides position and offset information. You can specify how much of this information to store in term vectors:

• TermVector.YES: store only the number of occurrences.
• TermVector.WITH_POSITIONS: store the number of occurrences and the positions of terms, but no offsets.
• TermVector.WITH_OFFSETS: store the number of occurrences and the offsets of terms, but no positions.
• TermVector.WITH_POSITIONS_OFFSETS: store the number of occurrences plus the positions and offsets of terms.
• TermVector.NO: don't store any term vector information.

If that information is not stored, you can still compute it on the fly at search time.

Here is a code example that creates a field with term vectors enabled:

```
Document doc = new Document();
doc.add(new Field("title", "quick fox brown fox",
        Field.Store.YES,
        Field.Index.ANALYZED,
        Field.TermVector.WITH_POSITIONS_OFFSETS));
```

### You can always generate term vectors dynamically

The only thing you need is for the text of the field to be stored. At query time, you can always generate term vectors as necessary. For example, to highlight text fragments, you use the getAnyTokenStream method of the TokenSources class to get a token stream: it uses the term vectors if they were computed at index time, and performs the analysis again if not.

### Term vector and similarity measurement

But why call it a vector at all, given that a vector is a mathematical and geometric concept? The reason is that the design of the term vector data structure in Lucene rests on a solid mathematical and IR foundation called the vector space model. Suppose a document contains only two terms; we can picture its term vector as a vector in two-dimensional space:

```
[ term1:3, term2:4 ]
```

Each number is the term's frequency.

If you write term1 as x and term2 as y:

```
[x:3, y:4]
```

This can be read as a vector in a two-dimensional coordinate system.

If the document contains 3 terms, the vector is three-dimensional, and so on. Every document can be represented as a vector in a common vector space, and the query is also a document, so it too is a vector in this space.

Each term represents an axis, the coordinate on that axis is the term frequency, and all the terms across all documents generate a space with that many axes. It sounds complicated, but it's no different from normal 2D or 3D space; the only difference is the number of axes.

A document is then simply a point in this space. Draw an arrow from the origin to the point and you get a term vector. The query you type in the search box, after analysis, is also a document, so it also has a position in this space.

The interesting thing is the angle between two vectors.

The angle provides a measure of similarity (cosine similarity) between two documents, which is very useful when you want to present related documents to users. So this is the second use of term vectors: finding all documents similar to a matched document. We simply build an n-dimensional space, represent each document as a vector in it, and look for the vectors that form a small angle with the current document's vector.
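A minimal sketch of that calculation in plain Java (a hypothetical helper; real scoring typically also weighs terms, e.g. with tf-idf, rather than using raw frequencies):

```java
import java.util.HashMap;
import java.util.Map;

class CosineSketch {
    // Cosine similarity between two term-frequency vectors:
    // dot(a, b) / (|a| * |b|). 1.0 means same direction, 0.0 means orthogonal.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = new HashMap<>();
        d1.put("term1", 3);
        d1.put("term2", 4);
        Map<String, Integer> d2 = new HashMap<>();
        d2.put("term1", 6);
        d2.put("term2", 8);
        // Proportional frequencies point in the same direction: cosine is 1.0
        System.out.println(cosine(d1, d2)); // prints "1.0"
    }
}
```

Note that two documents with proportional term frequencies have cosine 1.0 even if one is much longer than the other; the angle ignores vector length.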

### An alternative way to get term vector

Storing term vector information may consume a lot of disk space. An alternative is to reanalyze the document and compute the term vector information on the fly; if the documents are small, this may be the better solution. The analysis is the same as the one performed at indexing time. It adds overhead to search performance but saves disk space.

### Example code in Clojure displaying term vector information

To make this visible, I experimented in a Clojure REPL with Lucene 4.x; the code is listed below:

```
(def rd (org.apache.lucene.store.RAMDirectory.))
(def iw (org.apache.lucene.index.IndexWriter.
          rd
          (org.apache.lucene.index.IndexWriterConfig.
            org.apache.lucene.util.Version/LUCENE_47
            (org.apache.lucene.analysis.miscellaneous.LimitTokenCountAnalyzer.
              (org.apache.lucene.analysis.core.WhitespaceAnalyzer.) 1000))))

(defn create-doc [title content]
  (let [doc (org.apache.lucene.document.Document.)]
    (.add doc (org.apache.lucene.document.Field. "title" title
                org.apache.lucene.document.Field$Store/YES
                org.apache.lucene.document.Field$Index/ANALYZED
                org.apache.lucene.document.Field$TermVector/WITH_POSITIONS_OFFSETS))
    (.add doc (org.apache.lucene.document.Field. "body" content
                org.apache.lucene.document.Field$Store/YES
                org.apache.lucene.document.Field$Index/ANALYZED
                org.apache.lucene.document.Field$TermVector/WITH_POSITIONS_OFFSETS))
    doc))

(def doc1 (create-doc "quick fox brown fox" "quick fox run faster"))
(def doc2 (create-doc "lucene in action" "lucene runs like fox"))

;; the documents must actually be added before closing the writer
(.addDocument iw doc1)
(.addDocument iw doc2)
(.close iw)

(def ir (org.apache.lucene.index.DirectoryReader/open rd))
(def is (org.apache.lucene.search.IndexSearcher. ir))
(def tops (.search is (org.apache.lucene.search.TermQuery.
                        (org.apache.lucene.index.Term. "title" "fox")) 10))
(def scoreTops (. tops scoreDocs))
(.toString (first scoreTops))
(. (first scoreTops) doc)

;; term vector of the "title" field of the first document (doc id 0)
(def terms (.getTermVector ir 0 "title"))

(let [termsEnum (.iterator terms nil)]
  (loop [text (.next termsEnum)]
    (when text
      (println "freq: " (.totalTermFreq termsEnum) " text: " (.utf8ToString text))
      (recur (.next termsEnum)))))
```

The output looks like

```
freq:  1  text:  brown
freq:  2  text:  fox
freq:  1  text:  quick
```

You can easily convert the code to Java if you want, but experimenting in the REPL is much more fun.

What did we do here? First, we create a RAMDirectory rd to store our index and an IndexWriter iw. Then a function creates a document with two fields: title and body.

Then we write the documents to the index.

Then we create an IndexReader object and call its getTermVector() method; this call returns the term vector of the title field of the first document. The final let form loops through the terms and prints each term's text and its frequency.

From the code we can see that each field has its own term vector; the whole document has a map of term vectors, where the key is a field name and the value is that field's term vector.

Java version:

```
package com.makble.lucenetest;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class TestTermVector {
    public static Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
    public static IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
    public static RAMDirectory ramDirectory = new RAMDirectory();
    public static IndexWriter indexWriter;

    public static void main(String[] args) {
        Document doc = new Document();
        doc.add(new Field("title", "quick fox brown fox",
                Field.Store.YES,
                Field.Index.ANALYZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        doc.add(new Field("body", "quick fox run faster",
                Field.Store.YES,
                Field.Index.ANALYZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));

        try {
            indexWriter = new IndexWriter(ramDirectory, config);
            indexWriter.addDocument(doc); // the document must be added before closing
            indexWriter.close();
        } catch (IOException e) {
            e.printStackTrace();
        }

        try {
            IndexReader reader = DirectoryReader.open(ramDirectory);
            // The returned Terms instance acts like a single-document
            // inverted index (the doc id of the first document is 0).
            Terms terms = reader.getTermVector(0, "title");
            TermsEnum termsEnum = terms.iterator(null);
            BytesRef bytesRef = termsEnum.next();
            while (bytesRef != null) {
                System.out.println("BytesRef: " + bytesRef.utf8ToString());
                System.out.println("docFreq: " + termsEnum.docFreq());
                System.out.println("totalTermFreq: " + termsEnum.totalTermFreq());
                bytesRef = termsEnum.next();
            }
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```

The Gradle build file:

```
apply plugin: 'java'
apply plugin: 'eclipse'

ext.luceneVersion = "4.0.0"

sourceCompatibility = 1.5
version = '1.0'
jar {
    manifest {
        attributes 'Implementation-Title': 'Gradle Quickstart', 'Implementation-Version': version
    }
}

repositories {
    mavenCentral()
}

dependencies {
    compile group: 'commons-collections', name: 'commons-collections', version: '3.2'
    testCompile group: 'junit', name: 'junit', version: '4.+'
    compile "org.apache.lucene:lucene-core:${luceneVersion}"
    compile "org.apache.lucene:lucene-analyzers-common:${luceneVersion}"
    compile "org.apache.lucene:lucene-queryparser:${luceneVersion}"
}

test {
    systemProperties 'property': 'value'
}

repositories {
    flatDir {
        dirs 'repos'
    }
}
```