Cloud. Big Data. Analytics... and so on: July 2015

Sunday, July 5, 2015

Machine Learning... Taking first steps

This is going to be my first post (of many, I hope!!!) where I discuss my recent machine learning (ML) projects and what I learned from them. I hope to help people who are trying to understand ML topics, and to keep notes for my own future reference.

First of all, why do we need ML in the first place? Well, ML can be broken into two categories: "Supervised Learning" (SL) and "Unsupervised Learning" (UL).

SL, in my opinion, is more widely used and has more practical applications than UL. Let's consider a few examples. SL can be further broken down into "Classifiers" and "Regression". As the name suggests, classifiers try to predict the type or class of an outcome based on historical data. For example, given a number of emails, some of which were labeled as spam and some as not spam, we can build a classifier that automatically filters incoming mail as spam or not spam for us. You can see this in your Outlook or Gmail! (Hmm, I feel that I need to cover spam filtering in a bit more detail in the future. Stay tuned.)
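To make the spam example a bit more concrete, here is a minimal sketch of one common kind of text classifier, a naive Bayes model with add-one smoothing. This is only an illustration under my own assumptions: the class name, the tiny training set, and the tokenizer are all made up for this post, and it is certainly not how Outlook or Gmail actually filter mail.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: multinomial naive Bayes for spam vs. not spam.
class NaiveBayesSpamSketch {
    private final Map<String, Integer> spamCounts = new HashMap<>();
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int spamDocs, hamDocs, spamWords, hamWords;

    // Lowercases the text and splits it on anything that is not a letter.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Records the word counts of one labeled email.
    void train(String text, boolean isSpam) {
        if (isSpam) spamDocs++; else hamDocs++;
        for (String token : tokenize(text)) {
            vocabulary.add(token);
            if (isSpam) {
                spamCounts.merge(token, 1, Integer::sum);
                spamWords++;
            } else {
                hamCounts.merge(token, 1, Integer::sum);
                hamWords++;
            }
        }
    }

    // Compares log-probabilities of the two classes; add-one smoothing
    // keeps unseen words from zeroing out a class.
    boolean isSpam(String text) {
        if (spamDocs + hamDocs == 0) {
            throw new IllegalStateException("train the classifier first");
        }
        double total = spamDocs + hamDocs + 2.0;
        double logSpam = Math.log((spamDocs + 1.0) / total);
        double logHam = Math.log((hamDocs + 1.0) / total);
        int v = vocabulary.size();
        for (String token : tokenize(text)) {
            logSpam += Math.log((spamCounts.getOrDefault(token, 0) + 1.0) / (spamWords + v));
            logHam += Math.log((hamCounts.getOrDefault(token, 0) + 1.0) / (hamWords + v));
        }
        return logSpam > logHam;
    }

    public static void main(String[] args) {
        NaiveBayesSpamSketch classifier = new NaiveBayesSpamSketch();
        classifier.train("win free money now", true);
        classifier.train("free prize claim now", true);
        classifier.train("meeting at noon tomorrow", false);
        classifier.train("lunch tomorrow with the team", false);
        System.out.println(classifier.isSpam("free money prize"));
        System.out.println(classifier.isSpam("team meeting tomorrow"));
    }
}
```

The "historical data" here is just four hand-written emails; a real filter would learn from many thousands of labeled messages.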

Classifiers try to predict a more or less yes/no answer. Is it spam or not? Is it a car or not? Is it a fraudulent transaction or not? Regression, on the other hand, tries to predict a numeric value based on known parameters. One example is house price prediction: given parameters of the house such as square footage, number of bedrooms and bathrooms, the time of year, and its location, one can try to predict its price at a certain time in the future.
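As a rough sketch of the regression idea, the snippet below fits a straight line (ordinary least squares) to predict price from square footage alone. The class name and the numbers are invented for illustration, and using a single feature is my simplification; a real model would also use bedrooms, location, time of year, and so on.

```java
import java.util.Locale;

// Hypothetical sketch: simple linear regression with one feature.
class HousePriceRegression {
    final double slope;
    final double intercept;

    HousePriceRegression(double slope, double intercept) {
        this.slope = slope;
        this.intercept = intercept;
    }

    // Fits price = intercept + slope * sqft by minimizing squared error.
    static HousePriceRegression fit(double[] sqft, double[] price) {
        if (sqft.length != price.length || sqft.length < 2) {
            throw new IllegalArgumentException("need at least two matching samples");
        }
        int n = sqft.length;
        double meanX = 0;
        double meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += sqft[i];
            meanY += price[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0; // co-variation of size and price
        double sxx = 0; // variation of size
        for (int i = 0; i < n; i++) {
            double dx = sqft[i] - meanX;
            sxy += dx * (price[i] - meanY);
            sxx += dx * dx;
        }
        if (sxx == 0) {
            throw new IllegalArgumentException("all sizes are identical");
        }
        double slope = sxy / sxx;
        return new HousePriceRegression(slope, meanY - slope * meanX);
    }

    double predict(double sqft) {
        return intercept + slope * sqft;
    }

    public static void main(String[] args) {
        // Made-up training data: size in sq ft and sale price.
        double[] sqft = {1000, 1500, 2000, 2500};
        double[] price = {200000, 300000, 400000, 500000};
        HousePriceRegression model = fit(sqft, price);
        System.out.printf(Locale.ROOT, "Predicted price for 1800 sq ft: %.0f%n", model.predict(1800));
    }
}
```

Note the contrast with a classifier: the output is a number on a continuous scale rather than a yes/no label.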

UL: since the outputs are unknown, you cannot really train your algorithm to predict them, because you don't know what the right answer is. One of the classic uses of UL is "Clustering". For example, you can take a few specific properties of each subject and plot them on a "map" to see whether clusters form, which helps you determine relationships in your input data. For example, if you have hundreds of articles and want to know which ones are related to each other, you can read all of them and make your decision OR you can use ML! In a nutshell, you would feed in your articles, which would be broken down into individual words and indexed (see Lucene for more information). Then cosine similarity (more on that in one of my future posts!!!) would be used to determine how "close" the documents are to each other. Based on that, clusters would be formed where related articles are grouped together and unrelated ones are separated... This way you could easily tell if two articles are related or totally different.
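Here is a small sketch of the cosine similarity step, assuming each document is represented by raw word counts. That representation is my simplification (Lucene's own indexing and scoring are more involved), and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: cosine similarity between two texts using word counts.
class CosineSimilaritySketch {

    // Breaks a text into lowercase words and counts each one.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Length of the word-count vector.
    static double norm(Map<String, Integer> vector) {
        double sum = 0;
        for (int count : vector.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    // Dot product divided by the product of the lengths;
    // returns 0 when either text has no words.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            Integer other = b.get(entry.getKey());
            if (other != null) {
                dot += (double) entry.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    static double similarity(String first, String second) {
        return cosine(termFrequencies(first), termFrequencies(second));
    }

    public static void main(String[] args) {
        System.out.println(similarity("cat dog", "cat bird"));
        System.out.println(similarity("apples oranges", "cars trucks"));
    }
}
```

A score near 1 means the two texts use the same words in similar proportions, and a score of 0 means they share no words at all. A clustering step would then group documents whose pairwise scores are high.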

Other examples of ML include:
- Targeted retention - It would cost companies a lot of money to offer all of their customers an incentive to stay. Instead, companies try to determine which of their customers are most likely to leave and offer them a great deal in order to retain them.
- Recommendation Engines - Ever bought anything on Amazon? Then I am sure you saw a list of suggested recommendations :)
- Sentiment Analysis - Is it good or bad? A quick example: using Natural Language Processing (NLP), we can extract and process text to determine if it contains a lot of "positive" words such as 'good', 'awesome', etc.

And that's just us scratching the surface of the endless possibilities of ML!
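To illustrate the sentiment analysis item from the list above, here is a very naive sketch that simply counts "positive" and "negative" words. Only 'good' and 'awesome' come from the example above; the rest of the word lists, the class name, and the scoring rule are my own assumptions, and real NLP pipelines do far more than this.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: word-list based sentiment scoring.
class SentimentSketch {
    // Assumed word lists, chosen only for illustration.
    static final Set<String> POSITIVE =
            new HashSet<>(Arrays.asList("good", "awesome", "great", "love"));
    static final Set<String> NEGATIVE =
            new HashSet<>(Arrays.asList("bad", "awful", "terrible", "hate"));

    // +1 for every positive word, -1 for every negative word.
    static int score(String text) {
        int score = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (POSITIVE.contains(token)) {
                score++;
            } else if (NEGATIVE.contains(token)) {
                score--;
            }
        }
        return score;
    }

    static String label(String text) {
        int score = score(text);
        if (score > 0) {
            return "positive";
        }
        return score < 0 ? "negative" : "neutral";
    }

    public static void main(String[] args) {
        System.out.println(label("This phone is awesome, really good!"));
        System.out.println(label("Awful battery, bad screen"));
    }
}
```

Counting words like this ignores negation ("not good") and sarcasm, which is exactly where proper NLP techniques come in.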

Saturday, July 4, 2015

ElasticSearch Java API to find aliases for a given index

Recently I posted on Stack Overflow about how to find aliases for a given index using the Java API, so I thought I would link to it from here as well.

http://stackoverflow.com/questions/31170105/elasticseach-java-api-to-find-aliases-given-index

How to find aliases for a given index in ElasticSearch using Java?
Using the REST API, it is pretty easy:
https://www.elastic.co/guide/en/elasticsearch/reference/1.x/indices-aliases.html#alias-retrieving
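For comparison, here is a rough sketch of calling that REST endpoint from plain Java. It assumes the `/{index}/_alias/*` form described in the 1.x documentation linked above; the class name, the base URL, and the index name are placeholders of mine, and the response is returned as raw JSON without any parsing.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: retrieve the aliases of one index over REST.
class AliasRestSketch {

    // Builds the alias-retrieval URL, e.g. http://localhost:9200/my_index/_alias/*
    static String aliasUrl(String baseUrl, String index) {
        String base = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        return base + "/" + index + "/_alias/*";
    }

    // Sends the GET request and returns the response body as a string.
    static String fetchAliasesJson(String baseUrl, String index) throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) new URL(aliasUrl(baseUrl, index)).openConnection();
        connection.setRequestMethod("GET");
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line);
            }
            return body.toString();
        } finally {
            connection.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumes a node listening on localhost:9200 and an index named my_index.
        System.out.println(fetchAliasesJson("http://localhost:9200", "my_index"));
    }
}
```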

While working with ElasticSearch, I ran into an issue where I needed to get a list of aliases based on a provided index.

While getting a list of aliases is pretty straightforward:

 client.admin().cluster()
    .prepareState().execute()
    .actionGet().getState()
    .getMetaData().aliases();

I struggled to find an easy way to get aliases for a given index without having to iterate through everything first.
My first implementation looked something like this:

    ImmutableOpenMap<String, ImmutableOpenMap<String, AliasMetaData>> aliases = client.admin().cluster()
        .prepareState().execute()
        .actionGet().getState()
        .getMetaData().aliases();

    // aliases is keyed by alias name; each value maps index name -> alias metadata
    for (ObjectCursor<String> key : aliases.keys()) {
        // Note: this fetches the whole cluster state again on every iteration
        ImmutableOpenMap<String, AliasMetaData> indexToAliasesMap = client.admin().cluster()
          .state(Requests.clusterStateRequest())
          .actionGet().getState()
          .getMetaData().aliases().get(key.value);

        if (indexToAliasesMap != null && !indexToAliasesMap.isEmpty()) {
            String index = indexToAliasesMap.keys().iterator().next().value;
            String alias = indexToAliasesMap.values().iterator().next().value.alias();
        }
    }

I did not like it... and after poking around, I got an idea of how to do it more efficiently by looking at RestGetIndicesAliasesAction (package org.elasticsearch.rest.action.admin.indices.alias.get).
This is what I ended up with:

    ClusterStateRequest clusterStateRequest = Requests.clusterStateRequest()
            .routingTable(false)
            .nodes(false)
            .indices("your_index_name_goes_here");

    ObjectLookupContainer<String> setAliases= client
            .admin().cluster().state(clusterStateRequest)
            .actionGet().getState().getMetaData()
            .aliases().keys();

You will find the aliases for the index you specified in setAliases.
Hope it helps someone!