Thursday, April 30, 2015

Lucene scoring explained

Several good books already explain what Lucene scoring really means and how it is calculated.

Yet, I really liked the explanation by Kelvin Tan, who made it concise enough for people already familiar with the topic who just need a quick reminder... or for people who don't want to read an entire chapter dedicated to Lucene scoring!

The factors involved in Lucene's scoring algorithm are as follows:
1. tf = term frequency in document = measure of how often a term appears in the document
2. idf = inverse document frequency = measure of how often the term appears across the index
3. coord = number of terms in the query that were found in the document
4. lengthNorm = measure of the importance of a term according to the total number of terms in the field
5. queryNorm = normalization factor so that queries can be compared
6. boost (index) = boost of the field at index-time
7. boost (query) = boost of the field at query-time

The implementation, implication and rationale of factors 1, 2, 3 and 4 in DefaultSimilarity.java, which is what you get if you don't explicitly specify a similarity, are:
note: the implication of these factors should be read as, "Everything else being equal, … "

1. tf
Implementation: sqrt(freq)
Implication: the more frequently a term occurs in a document, the greater its score
Rationale: documents which contain more occurrences of a term are generally more relevant

2. idf
Implementation: log(numDocs/(docFreq+1)) + 1
Implication: the greater the occurrence of a term in different documents, the lower its score
Rationale: common terms are less important than uncommon ones

3. coord
Implementation: overlap / maxOverlap
Implication: of the terms in the query, a document that contains more of them will have a higher score
Rationale: self-explanatory

4. lengthNorm
Implementation: 1/sqrt(numTerms)
Implication: a term matched in a field with fewer terms has a higher score
Rationale: a term in a field with fewer terms is more important than one in a field with more

queryNorm is not related to the relevance of the document, but rather tries to make scores between different queries comparable. It is implemented as 1/sqrt(sumOfSquaredWeights).
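
To make the formulas above concrete, here is a tiny toy calculation (the numbers are made up for illustration and are not from the original article), plugging them straight into the DefaultSimilarity implementations listed for factors 1 to 4:

// toy numbers: the term appears 4 times in a 25-term field,
// in 9 out of 1000 documents in the index, and 2 of the 3 query terms match this document
double tf = Math.sqrt(4);                     // 2.0
double idf = Math.log(1000.0 / (9 + 1)) + 1;  // ln(100) + 1 ≈ 5.61
double coord = 2.0 / 3.0;                     // ≈ 0.67
double lengthNorm = 1 / Math.sqrt(25);        // 0.2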
So, roughly speaking (quoting Mark Harwood from the mailing list):
* Documents containing *all* the search terms are good
* Matches on rare words are better than for common words
* Long documents are not as good as short ones
* Documents which mention the search terms many times are good

The mathematical definition of the scoring can be found at https://lucene.apache.org/core/2_9_4/api/all/org/apache/lucene/search/Similarity.html
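
For reference, that page ties these pieces together roughly as follows (my paraphrase of the javadoc, not a quote from it):

score(q,d) = coord(q,d) * queryNorm(q) * sum over terms t in q of ( tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d) )

where norm(t,d) bundles the index-time boosts together with lengthNorm.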

Hint: look at NutchSimilarity in Nutch to see an example of how web pages can be scored for relevance

Customizing scoring

It's easy to customize the scoring algorithm. Just subclass DefaultSimilarity and override the method you want to customize.
For example, if you want to ignore how common a term is across the index:

Similarity sim = new DefaultSimilarity() {
  // ignore how rare a term is: every term gets the same idf weight
  @Override
  public float idf(int docFreq, int numDocs) {
    return 1;
  }
};

and if you think that, for the title field, more terms are better:

Similarity sim = new DefaultSimilarity() {
  // reward longer title fields instead of penalizing them
  @Override
  public float lengthNorm(String fieldName, int numTerms) {
    if (fieldName.equals("title")) return (float) (0.1 * Math.log(numTerms));
    else return super.lengthNorm(fieldName, numTerms);
  }
};
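
To actually use the custom similarity, Lucene has to be told about it both at index time (lengthNorm and index-time boosts get baked into the norms when documents are written) and at query time. A minimal sketch against the 2.x/3.x-era API this post refers to; the writer and searcher variables are assumed to exist already and are not from the original post:

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;

// `writer` and `searcher` are an IndexWriter / IndexSearcher opened elsewhere
writer.setSimilarity(sim);    // index time: influences the norms (lengthNorm, index-time boost) that get stored
searcher.setSimilarity(sim);  // query time: influences tf, idf, coord and queryNorm when scoring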

Tuesday, April 14, 2015

Logstash: Processing files from beginning

It has been a while since I posted, so I figured I would break the rust and write a small post.

I was recently involved in a project leveraging ELK (Elasticsearch, Logstash, Kibana). During the setup process, I had to experiment with different things and used a file with some test data. Well, after the first successful try, my file was read in and Logstash "stopped" reading it...

What was wrong? Simple.

Logstash keeps track of the files it has read and its position in each of them, so if you have a file with 5 lines, Logstash will read that file and record that it has processed 5 lines of it. The next time it sees this file, it checks the number of lines in it against what it has already processed; if no new lines were added, the file is simply ignored. This is done so that the same files won't be rescanned from the beginning every time Logstash sees them. That is good for production, but how can you rescan the file while testing? Simple!

First of all, in your file input block, add

start_position => "beginning" 
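
For context, a bare-bones file input block with that option would look something like this (the path is just a placeholder for your test data):

input {
  file {
    path => "/path/to/test-data.log"   # placeholder, point this at your test file
    start_position => "beginning"      # read new files from the start instead of only tailing them
  }
}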

If you still experience problems, run

rm ~/.sincedb*

before running Logstash, in addition to start_position => "beginning". This command deletes Logstash's sincedb files, i.e. the pointer data described above. Obviously, this should be your last resort. :)