Sunday, August 2, 2015

ElasticSearch... is awesome. You just have to know what you are doing ;)

I had been using Solr for many years... well, relatively. I started when Solr 4.0 first came out and had hands-on, production-level experience all the way through Solr 4.8.

Recently, I was put in charge of a project that had to use ElasticSearch as its search engine. I was happy to jump on the opportunity and finally get my hands on it. I had looked at ElasticSearch before, but at that point it was still immature and I decided to pass, since I felt it was not ready to support my applications. I did keep my eye on it, though, and saw how it was using Solr's shortcomings to come up with a better architecture and address every "I wish Solr would do <insert here>".

So when I was asked to architect a robust ElasticSearch API that would provide an easy-to-use, scalable solution for a series of projects, I was excited.

Working with ElasticSearch, I was intrigued by how configurable it can be... i.e. I was able to specify the type of each field, how it should be analyzed, and how it would be broken down and used during the various stages of processing.
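To give a rough (and purely illustrative) idea of what that configuration looks like, here is a minimal sketch that creates an index with a mapping using Python and the requests library. The index, type, and field names are made up, and the mapping syntax differs slightly between ElasticSearch versions (e.g. "string" vs "text" field types), so treat this as a shape rather than copy-paste code.

import requests

# Hypothetical index/type/field names, just to illustrate declaring
# a field's type and which analyzer it should use.
mapping = {
    "mappings": {
        "article": {
            "properties": {
                "title": {
                    "type": "string",        # "text" in newer ElasticSearch versions
                    "analyzer": "standard"   # swap in your own analyzer here
                }
            }
        }
    }
}

response = requests.put("http://localhost:9200/articles", json=mapping)
print(response.json())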

A very good and detailed article can be found here: https://www.found.no/foundation/text-analysis-part-1/
Spend some time with this one.

And don't be afraid to experiment. The Analyze API is there for you to fine-tune your schema to exactly what you want: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
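For example, here is a rough sketch of calling the Analyze API with Python's requests library to see how a given analyzer tokenizes a piece of text. The exact request format has changed a bit across versions (older releases took query-string parameters), so double-check against the docs for the version you run.

import requests

# Ask ElasticSearch how the "standard" analyzer would tokenize a sample string.
body = {"analyzer": "standard", "text": "Fine-tuning analyzers in ElasticSearch"}
response = requests.post("http://localhost:9200/_analyze", json=body)

for token in response.json().get("tokens", []):
    print(token["token"])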

I also found this to be a very simple yet comprehensive tutorial on analyzers: https://www.elastic.co/guide/en/elasticsearch/guide/current/analysis-intro.html#_built_in_analyzers

One of the challenges that arose late in the development cycle was the ability to provide accurate searches when dealing with special characters. That was a great opportunity for me to fine-tune the settings. I greatly enjoyed reading the following article: http://www.fullscale.co/blog/2013/03/04/preserving_specific_characters_during_tokenizing_in_elasticsearch.html

Based on what I saw on stackoverflow.com and other sites, one of the issues new ElasticSearch users face is using the _all field during searches. By default, the '_all' field combines all fields of your documents into a single field, which is analyzed by the default analyzer. You need to search individual fields (not _all) for the field-specific analyzer to be applied. In other words, if you have a fine-tuned schema with proper analyzers, none of it will be applied when you are not explicitly searching against the field that has it... So be careful there.
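To make the difference visible, here is an illustrative sketch (index and field names are made up) that runs the same match query against _all and against a specific field; only the second one goes through that field's own analyzer.

import requests

# Same search term, two targets: _all uses the default analyzer,
# "title" uses whatever analyzer you mapped onto that field.
queries = {
    "_all":  {"query": {"match": {"_all": "foo-bar"}}},
    "title": {"query": {"match": {"title": "foo-bar"}}},
}

for name, query in queries.items():
    response = requests.post("http://localhost:9200/articles/_search", json=query)
    print(name, response.json().get("hits", {}).get("total"))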

Also, going back to searches with special characters: I was faced with the question of which tokenizers are best for searching words with hyphens in them. As a start, I chose the word_delimiter filter and specified that it preserve the original token. The complete settings list can be found here: http://www.elasticsearch.org/guide/reference/index-modules/analysis/word-delimiter-tokenfilter.html
I ended up using generate_word_parts = true, generate_number_parts = true, catenate_all = true, preserve_original = true, split_on_numerics = false.
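Put together as index settings, that roughly looks like the sketch below. The index, filter, and analyzer names are placeholders; the word_delimiter options are the ones listed above, and I pair the filter with the whitespace tokenizer, since the standard tokenizer would typically split on the hyphens before the filter ever sees them.

import requests

settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_word_delimiter": {
                    "type": "word_delimiter",
                    "generate_word_parts": True,
                    "generate_number_parts": True,
                    "catenate_all": True,
                    "preserve_original": True,
                    "split_on_numerics": False
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    # whitespace tokenizer keeps hyphenated terms intact
                    # so the word_delimiter filter can do the splitting
                    "tokenizer": "whitespace",
                    "filter": ["my_word_delimiter", "lowercase"]
                }
            }
        }
    }
}

response = requests.put("http://localhost:9200/my_index", json=settings)
print(response.json())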

In my case it was the best fit based on my requirements... Your settings might end up being different.

Hope this helps, and as always, don't be afraid to reach out and ask for help!