Sunday, December 20, 2015

Accessing wired Windows printer from your Mac

Before my PC ceased to be, I had an old wired printer that did the job... the only problem was that I already had a few Apple laptops and I wanted to be able to print from them. (The problem is easily solved by buying a wireless printer that supports AirPrint, which even lets you print from your iPhone!)

Anyway, if you have a wired Windows printer and don't want to upgrade just yet, here is what you need to do:

On Windows PC
1. Establish a user account on your PC. This was one thing I had to do to make everything work the way it should. It is as easy as opening your Control Panel and clicking Add user in the Users menu. For more tricks see this: http://www.howtogeek.com/howto/10325/manage-user-accounts-in-windows-home-server/

2. Now onto the actual setup... Select Start -> Devices and Printers. Right-click on the printer that you want to share, and either pick Share, or pick Properties and then the Sharing tab. Make sure that the Share check box is selected and note down the name of the printer.

3. Open a command prompt and use the ipconfig command to find your PC's IP address.

To summarize this part: you have the IP address and the name of the printer to connect to, and you have the credentials that you created in step 1.

On Mac
1. Open System Preferences and locate the Printers & Scanners icon. Click!
2. Select + under Printers to add a new printer, i.e. your wired Windows printer.
3. Right-click on the toolbar and select Customize Toolbar, then add Advanced.
4. Click on Advanced. For Type, select Windows printer via spoolss.
5. For URL, provide the IP and printer name that you have... so the link looks like smb://192.168.1.13/printer_name (replace any spaces in the printer name with %20).
6. Under Choose a driver or Printer Model, pick your printer type (I did not see my exact model so I picked the closest HP model instead. Worked!).
7. At this point your PC printer will be connected to your Mac. (If you prefer the Terminal, there is a command-line alternative below.)
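
If you prefer the Terminal over System Preferences, CUPS can do the same thing in one command. This is only a sketch with made-up values: the queue name, IP, share name and PPD path are placeholders for your own.

 lpadmin -p WindowsPrinter -E \
   -v smb://192.168.1.13/printer_name \
   -P /Library/Printers/PPDs/Contents/Resources/YourModel.ppd.gz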

Testing
Select something to print on your Mac... in my case, the first time it tried to print it asked for the PC username/password that we created previously; after that it was stored in my Keychain and was never an issue again.

That's it! Hope it helps and once again, let me know if you have further questions, etc.

Sunday, August 2, 2015

ElasticSearch... is awesome. You just have to know what you are doing ;)

I had been using Solr for many years... well, relatively. I started when Solr 4.0 had just come out and had hands-on, production-level experience all the way to Solr 4.8.

Recently, I was put in charge of a project that had to use ElasticSearch as its search engine. I was happy to jump on this opportunity and get my hands on it. I had previously looked at ElasticSearch, but at that point it was very immature and I decided to pass on it, since I felt it was not ready to support my applications. Still, I kept my eye on it, and I saw how it was using Solr's shortcomings to come up with a better architecture and address every "I wish Solr would do <insert here>".

So when I was asked to architect a robust ElasticSearch API to provide an easy to use, scalable solution for a series of projects, I was excited.

From working with ElasticSearch, I was intrigued by how configurable it can be, i.e. I was able to specify the type of each field, how it should be analyzed, broken down and used during the various stages of processing.

A very good and detailed article can be found here: https://www.found.no/foundation/text-analysis-part-1/
Spend some time with this one.

and don't be afraid to experiment. The Analyze API is there for you to fine-tune your schema to exactly what you want: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
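
For instance, with the Java client you can feed some text through an analyzer and look at the tokens it produces. A rough sketch against the 1.x-era client I was using (the text and analyzer name are just examples; adjust to your own setup):

 AnalyzeResponse response = client.admin().indices()
     .prepareAnalyze("Wi-Fi 802.11n routers")   // text you want to inspect
     .setAnalyzer("standard")                   // or the name of your custom analyzer
     .execute().actionGet();

 for (AnalyzeResponse.AnalyzeToken token : response.getTokens()) {
     System.out.println(token.getTerm());       // exactly what would end up in the index
 }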

I also found this to be a very simple and yet comprehensive tutorial on analyzers: https://www.elastic.co/guide/en/elasticsearch/guide/current/analysis-intro.html#_built_in_analyzers

One of the challenges that arose late in the development cycle was the ability to provide accurate searches when dealing with special characters. That was a great opportunity for me to fine-tune the settings. I greatly enjoyed reading the following article: http://www.fullscale.co/blog/2013/03/04/preserving_specific_characters_during_tokenizing_in_elasticsearch.html

Based on what I saw on stackoverflow.com and other sites, one of the issues that new users of ElasticSearch face is the _all field during searches. By default, the _all field combines all the fields in your documents into a single field, which is analyzed by the default analyzer. You need to search individual fields (not _all) for the field-specific analyzer to be applied. In other words, even if you have a finely tuned schema with proper analyzers, none of it will be applied when you are not explicitly searching against the field that has it... So be careful there.
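
To make that concrete, here is a hedged Java-client sketch (the index and field names are made up): the first query goes through _all and only gets the default analyzer, while the second one actually uses whatever analyzer you configured for the field.

 // goes against _all, so only the default analyzer is applied
 SearchResponse allFields = client.prepareSearch("articles")
     .setQuery(QueryBuilders.matchQuery("_all", "foo-bar"))
     .execute().actionGet();

 // goes against "title", so the analyzer configured for "title" is applied
 SearchResponse titleOnly = client.prepareSearch("articles")
     .setQuery(QueryBuilders.matchQuery("title", "foo-bar"))
     .execute().actionGet();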

Also, going back to special-character searches: I was faced with the question of what kind of tokenizers are best for searching words with hyphens in them. As a start I chose to use the word_delimiter filter and specified that it preserve the original token. The complete settings list can be found here: http://www.elasticsearch.org/guide/reference/index-modules/analysis/word-delimiter-tokenfilter.html
I ended up using generate_word_parts = true, generate_number_parts = true, catenate_all = true, preserve_original = true, split_on_numerics = false.

In my case it was the best fit based on my requirements... Your settings might end up being different.
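
For reference, here is roughly how settings like those can be wired into a custom analyzer (a sketch, not my exact schema; the filter and analyzer names are made up). I pair it with the whitespace tokenizer because the standard tokenizer would already split on the hyphens before the filter ever sees them:

 {
   "settings": {
     "analysis": {
       "filter": {
         "my_word_delimiter": {
           "type": "word_delimiter",
           "generate_word_parts": true,
           "generate_number_parts": true,
           "catenate_all": true,
           "preserve_original": true,
           "split_on_numerics": false
         }
       },
       "analyzer": {
         "my_text_analyzer": {
           "type": "custom",
           "tokenizer": "whitespace",
           "filter": ["lowercase", "my_word_delimiter"]
         }
       }
     }
   }
 }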

Hope it helps, and as always, don't be afraid to reach out and ask for help!

Sunday, July 5, 2015

Machine Learning... Taking first steps

This is going to be my first post (of many, I hope!!!) where I discuss my recent projects dealing with machine learning (ML) and what I learned from them. I hope to benefit people who are trying to understand ML topics and to make notes for my own future reference.

First of all, why do we need ML in the first place? Well, ML can be broken into two categories: "Supervised Learning" (SL) and "Unsupervised Learning" (UL).

SL, in my opinion, is more widely used and has more practical applications than UL. Let's consider a few examples. SL can be further broken down into "Classifiers" and "Regression". As the name suggests, classifiers try to predict the type or class of an outcome based on historical data. For example, given a number of emails, some of which were classified as spam and some of which were flagged as not spam, we can build a classifier that will try to automatically filter incoming mail as spam or not spam for us. You can see this in your Outlook or Gmail! (Hmm, I feel that I need to cover spam filtering in a bit more detail in the future. Stay tuned.)

Classifiers try to predict a more or less yes/no answer. Is it spam or not? Is it a car or not? Is it a fraudulent transaction or not? Regression, on the other hand, tries to predict a specific value based on known parameters. One example is house cost prediction: given parameters of the house such as square footage, number of bedrooms and bathrooms, time of year and location, one can try to predict its cost at a certain time in the future.

UL - since the outputs are unknown, you cannot really train your algorithms to predict them. One of the classical uses of UL is "Clustering". For example, you take a few specific properties of a subject and plot them on a "map" to see if clusters form, which helps you determine the relationships within your input data. For example, if you have hundreds of articles and want to know which ones are related to each other, you can read all of them and make your decision OR you can use ML! In a nutshell, you feed in your articles, where they are broken down into individual words and indexed (see Lucene for more information). Then cosine similarity (more on it in one of my future posts!!!) is used to determine how "close" one document is to another. Based on that, clusters are formed where related articles are grouped together and unrelated ones are separated... This way you can easily tell whether two articles are related or totally different.
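
Since I mentioned cosine similarity, here is the idea in a few lines of Java (a toy sketch over plain term-frequency vectors; real engines work with sparse, weighted vectors):

 // cosine similarity between two term-frequency vectors of the same length
 public static double cosineSimilarity(double[] a, double[] b) {
     double dot = 0, normA = 0, normB = 0;
     for (int i = 0; i < a.length; i++) {
         dot += a[i] * b[i];     // how much the two documents "overlap"
         normA += a[i] * a[i];
         normB += b[i] * b[i];
     }
     return dot / (Math.sqrt(normA) * Math.sqrt(normB));  // 1.0 = same direction, 0 = unrelated
 }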

Other examples of ML include:
- Targeted retention - It would cost companies a lot of money to offer all of their customers an incentive to stay. Instead, companies try to determine which of their customers are most likely to leave and offer them a great deal in order to retain them.
- Recommendation Engines - Ever bought anything on Amazon? Then I am sure you saw a list of suggested recommendations :)
- Sentiment Analysis - Is it good or bad? A quick example: using Natural Language Processing (NLP), we can extract and process text and check whether it contains a lot of "positive" words such as 'good', 'awesome', etc. (see the tiny sketch below)
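
Here is that tiny sketch for the sentiment example (a toy illustration only; the review text and word list are made up, and a real system would handle negation, weights, misspellings, etc.):

 // count "positive" words in a piece of text - a very naive sentiment signal
 Set<String> positiveWords = new HashSet<>(Arrays.asList("good", "awesome", "great", "love"));
 String review = "The camera is awesome and the battery life is good";
 long positiveHits = Arrays.stream(review.toLowerCase().split("\\W+"))
         .filter(positiveWords::contains)
         .count();   // 2 in this case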

and that's us just scratching the surface of the endless possibilities of ML!

Saturday, July 4, 2015

ElasticSearch Java API to find aliases given an index

Recently I posted on stackoverflow about how to find aliases for a given index using the Java API, so I thought I would link to it from here as well.

http://stackoverflow.com/questions/31170105/elasticseach-java-api-to-find-aliases-given-index

How to find aliases for a given index in ElasticSearch using Java?
Using the REST API it is pretty easy:
https://www.elastic.co/guide/en/elasticsearch/reference/1.x/indices-aliases.html#alias-retrieving

While working with ElasticSearch, I ran into an issue where I needed to get a list of aliases based on provided index.

While getting a list of aliases is pretty straightforward:
 client.admin().cluster()
    .prepareState().execute()
    .actionGet().getState()
    .getMetaData().aliases();

I struggled to find an easy way to get the aliases for a given index without having to iterate through everything first.
My first implementation looked something like this:
    // pull all alias metadata out of the cluster state
    ImmutableOpenMap<String, ImmutableOpenMap<String, AliasMetaData>> aliases = client.admin().cluster()
        .prepareState().execute()
        .actionGet().getState()
        .getMetaData().aliases();

    // walk every alias and look up which index/alias pair it belongs to
    for (ObjectCursor<String> key : aliases.keys()) {
        ImmutableOpenMap<String, AliasMetaData> indexToAliasesMap = client.admin().cluster()
          .state(Requests.clusterStateRequest())
          .actionGet().getState()
          .getMetaData().aliases().get(key.value);

        if (indexToAliasesMap != null && !indexToAliasesMap.isEmpty()) {
            String index = indexToAliasesMap.keys().iterator().next().value;
            String alias = indexToAliasesMap.values().iterator().next().value.alias();
        }
    }
I did not like it... and after poking around, I got an idea of how to do it more efficiently by looking at RestGetIndicesAliasesAction (package org.elasticsearch.rest.action.admin.indices.alias.get)
This is what I ended up with:
    // request only the metadata for the index we care about
    ClusterStateRequest clusterStateRequest = Requests.clusterStateRequest()
            .routingTable(false)
            .nodes(false)
            .indices("your_index_name_goes_here");

    // the keys of the filtered aliases map are the alias names for that index
    ObjectLookupContainer<String> setAliases = client
            .admin().cluster().state(clusterStateRequest)
            .actionGet().getState().getMetaData()
            .aliases().keys();
You will be able to find aliases for the index that you specified in setAliases
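
Iterating over them is then straightforward, for example:

 for (ObjectCursor<String> alias : setAliases) {
     System.out.println(alias.value);
 }
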
Hope it helps someone!

Thursday, April 30, 2015

Lucene scoring explained

Several good books already explain what Lucene scoring really means and how it is calculated.

Yet, I really liked the explanation by Kelvin Tan, which is concise enough for people who are already familiar with it and just need a quick reminder... or for people who don't want to read an entire chapter dedicated to Lucene scoring!

The factors involved in Lucene's scoring algorithm are as follows:
1. tf = term frequency in document = measure of how often a term appears in the document
2. idf = inverse document frequency = measure of how often the term appears across the index
3. coord = number of terms in the query that were found in the document
4. lengthNorm = measure of the importance of a term according to the total number of terms in the field
5. queryNorm = normalization factor so that queries can be compared
6. boost (index) = boost of the field at index-time
7. boost (query) = boost of the field at query-time

The implementation, implication and rationale of factors 1, 2, 3 and 4 in DefaultSimilarity.java, which is what you get if you don't explicitly specify a similarity, are:
note: the implication of these factors should be read as, "Everything else being equal, … "

1. tf
Implementation: sqrt(freq)
Implication: the more frequent a term occurs in a document, the greater its score
Rationale: documents which contain more of a term are generally more relevant

2. idf
Implementation: log(numDocs/(docFreq+1)) + 1
Implication: the greater the occurrence of a term in different documents, the lower its score
Rationale: common terms are less important than uncommon ones

3. coord
Implementation: overlap / maxOverlap
Implication: of the terms in the query, a document that contains more of them will have a higher score
Rationale: self-explanatory

4. lengthNorm
Implementation: 1/sqrt(numTerms)
Implication: a term matched in a field with fewer terms has a higher score
Rationale: a term in a field with fewer terms is more important than one in a field with more
queryNorm is not related to the relevance of the document, but rather tries to make scores between different queries comparable. It is implemented as 1/sqrt(sumOfSquaredWeights).
So, roughly speaking (quoting Mark Harwood from the mailing list):
* Documents containing *all* the search terms are good
* Matches on rare words are better than for common words
* Long documents are not as good as short ones
* Documents which mention the search terms many times are good

The mathematical definition of the scoring can be found at https://lucene.apache.org/core/2_9_4/api/all/org/apache/lucene/search/Similarity.html
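
For quick reference, the practical formula behind DefaultSimilarity combines the factors above roughly like this (t ranges over the query terms found in document d, and norm(t,d) folds the index-time boosts and lengthNorm together):

 score(q,d) = coord(q,d) * queryNorm(q) * SUM over t in q of ( tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d) )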

Hint: look at NutchSimilarity in Nutch to see an example of how web pages can be scored for relevance

Customizing scoring

It's easy to customize the scoring algorithm. Just subclass DefaultSimilarity and override the method you want to customize.
For example, if you want to ignore how commonly a term appears across the index:

Similarity sim = new DefaultSimilarity() {
  // treat every term as equally rare
  public float idf(int docFreq, int numDocs) {
    return 1;
  }
};

and if you think that, for the title field, more terms is better:

Similarity sim = new DefaultSimilarity() {
  public float lengthNorm(String field, int numTerms) {
    // reward longer titles instead of penalizing them
    if (field.equals("title")) return (float) (0.1 * Math.log(numTerms));
    else return super.lengthNorm(field, numTerms);
  }
};
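
For the custom similarity to actually take effect, it has to be plugged in on both sides; with the older Lucene API that these snippets assume, that is roughly:

 indexWriter.setSimilarity(sim);    // so norms are computed with it at index time
 indexSearcher.setSimilarity(sim);  // so it is used when scoring queries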

Tuesday, April 14, 2015

Logstash: Processing files from beginning

It has been a while since I posted, so I figured I would break the rust and write a small post.

I was recently involved in a project leveraging ELK (ElasticSearch, Logstash, Kibana). During the setup process I had to experiment with different things, and I used a file with some test data. Well, after the first successful try, my file was read in and Logstash "stopped" reading it...

What was wrong? Simple.

Logstash keeps track of the files it has read and its position in each file, so if you have a file with 5 lines, Logstash reads that file and marks that it has read 5 lines of it. The next time it sees this file, it checks the number of lines in it against what it has already processed. If no new lines were added, the file is simply ignored. This is done so that the same files won't be rescanned from the beginning each time Logstash sees them. That is good for production, but how can you rescan a file while testing? Simple!

First of all, in your input block add

start_position => "beginning" 

and if you still experience problems, run

rm ~/.sincedb*

before running Logstash, in addition to start_position => "beginning". This command deletes Logstash's pointer data. Obviously this should be your last resort. :)
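
For reference, a minimal file input for testing could look something like this (the path is made up; pointing sincedb_path at /dev/null is another common testing trick, so Logstash never remembers the file):

 input {
   file {
     path => "/tmp/test-data.log"
     start_position => "beginning"
     sincedb_path => "/dev/null"
   }
 }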

Tuesday, January 6, 2015

Why you should not open 42.zip.... or any zip file that you can't trust

42.zip (42,374 bytes zipped)

The file contains 16 zipped files, each of which again contains 16 zipped files, which again contain 16 zipped files, which again contain 16 zipped files,
which again contain 16 zipped files, each of which contains 1 file with a size of 4.3GB.

So, if you extract all files, you will most likely run out of space :-)

16 x 4,294,967,295         = 68,719,476,720 (68 GB)
16 x 68,719,476,720        = 1,099,511,627,520 (1 TB)
16 x 1,099,511,627,520     = 17,592,186,040,320 (17 TB)
16 x 17,592,186,040,320    = 281,474,976,645,120 (281 TB)
16 x 281,474,976,645,120   = 4,503,599,626,321,920 (4.5 PB)
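
If you ever write code that extracts archives you did not produce yourself, the simplest defence is to cap the total number of bytes you are willing to write out. A minimal Java sketch (the 100 MB limit is arbitrary, and a real implementation should also validate entry names against path traversal):

 import java.io.*;
 import java.util.zip.*;

 public class SafeUnzip {
     private static final long MAX_TOTAL_BYTES = 100L * 1024 * 1024; // arbitrary cap

     public static void extract(File zipFile, File destDir) throws IOException {
         long total = 0;
         byte[] buf = new byte[8192];
         try (ZipInputStream zin = new ZipInputStream(new FileInputStream(zipFile))) {
             ZipEntry entry;
             while ((entry = zin.getNextEntry()) != null) {
                 if (entry.isDirectory()) continue;
                 File out = new File(destDir, entry.getName());
                 out.getParentFile().mkdirs();
                 try (OutputStream os = new FileOutputStream(out)) {
                     int n;
                     while ((n = zin.read(buf)) > 0) {
                         total += n;                    // count what we actually inflate
                         if (total > MAX_TOTAL_BYTES) {
                             throw new IOException("Archive inflates past the limit - possible zip bomb");
                         }
                         os.write(buf, 0, n);
                     }
                 }
             }
         }
     }
 }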

Friday, January 2, 2015

Notes from Lucas Carlson talk

6 Bad practices:
    1. Synchronicity
        3-tier systems are like synchronous information factory lines
        If one area chokes, the rest chokes
    2. Dependency on Filesystems
        State data (sessions, uploaded files, etc) stored on hard drives is very hard to scale
    3. Heavy Server-Side Processing
        Generating all the HTML server-side made sense when client hardware was slow
        No REST/JSON/Javascript
        All the information from a request needs to be compiled before returning any data
    4. Expecting Services to Always be Available
        Designing for success is failure
        Cloud infrastructure has higher failure rates than dedicated hardware
        Disaster recovery can be slow and prone to errors
    5. Moving to the Cloud without a Plan
        Cloud migration is often thought of as simply a cost issue, not a technical one
        Higher failure rates in cloud infrastructure will break fragile applications
        Migrations without a good plan can cost a lot of unexpected time and money
    6. Lack of Redundancy
        All single points of failure are terrible monsters (DNS, Load Balancers, Network, etc)
        Not only are they choke points, they can take down an otherwise robust system
        All your eggs in one basket
       
6 Good practices:
    1. Asynchronous Processes
        Small decoupled apps
        Communicate through a queue, REST APIs, or a message bus
        Each one should do one thing very well: simple and specialized
    2. Distributed Object Storage
        Memcached, Redis, MongoDB, CouchDB, etc
        Use instead of filesystems in legacy web applications (sessions, file uploads, etc)
        Consider replacing or caching the largest and fastest growing relational database tables with object storage
    3. Micro-Services
        Leverage increased CPU capacity on browsers with client-side Javascript (AngularJS, Ember, Backbone)
        Simple and specialized REST APIs
            Java: Spring, Spark, Jersey
            .Net: WCF, OpenRasta
            Ruby: Sinatra
            Python: Flask, Bottle
            PHP: Slim
        Bonus: Power your mobile apps
    4. Architecting for Failure
        Think about anti-fragile upfront
        Pro-actively stress your system and study how it fails (not just load testing, think of Netflix's chaos monkey)
        Make all failures an opportunity to eliminate bottlenecks, increase redundancy.
    5. Use cloud migration as an Opportunity to modernize architecture
        Don't half-do it
        Not all applications will do well in cloud environment
        Automation is vital in cloud environments where infrastructure isn't reliable
    6. Redundancy everywhere
        Audit every area of your application for redundancy
        2x or 3x redundancy is not enough
        Google's rule of thumb is for 5x redundancy
        Be like the Hydra: kill one head and grow two in its spot

Netflix: Chaos Monkey - randomly breaks servers throughout the system and throughout the day to see how your application would react.
                            Don't expect it to work
                           
3 Conclusions
    1. Legacy web applications are fragile
    2. Think about anti-fragility (not scalability) up front
    3. Micro-services are anti-fragile future:
        Lightweight distributed
        Share-nothing systems built with APIs
       
Conclusion
    1. Automation is dominating the future landscape of web app architecture
    2. Micro-services are the future
        - Lightweight distributed
        - Share-nothing systems built with APIs
    3. Docker and Linux Containers are the new way to package and distribute your applications