Tuesday, March 3, 2015

Functional Programming - Preserving Type Safety

Overview

Functional programming is a development paradigm centered around the use of functions. It's not just a matter of using functions, but about the perspective and thought process you exercise when faced with a programming problem. In other words, it is a framework for the mind and your code.
The goal of FP is to abstract operations on data with functions in order to maximize reusability and reduce mutation of state and side effects in your application. It so happens that by doing this, your code is much more scalable, readable, and testable.
Functional programming demands that functions have 2 important qualities:
  1. Be first-class citizens
  2. Be of high order
Functions are first-class when they can be created and assigned to a variable. Functions are high-order when they can be used in combination with other functions whether it is passed in as an argument or returned from a function. Let's expand this notion further. When a function is equivalent to a value, it follows a principle of substitutability, which means that a function call's return value can be replaced and inlined into an expression without altering its result. Consider a quick JavaScript example:


   function areaSquare(a) {
        return a * a;
   }

   function volumeCube(areaOfSquare, a) {
        return areaOfSquare * a;
   }

   a = 2;
   area = areaSquare(a);         // -> 4
   volume = volumeCube(area, a); // -> 8
The following expressions are equivalent:

   volume = a * a * a; 

   volume = areaSquare(a) * a; 

   volume = volumeCube(areaSquare(a)); 

Why JavaScript?

Simply put, omnipresence. JavaScript is everywhere. For many years, the language of the web, and now it's also breaking into the server market. Most importantly, JavaScript is a functional language; it actually supports writing in both functional and object oriented styles. This, I argue, should be the only way in which we should be using JavaScript.

Furthermore, JavaScript is a non-statically typed language. Though types are enforced, it is done at runtime and not at compile time. In fact you don't include type information in your code. Considering the invariant that referentially transparent functions need to be predictable and return the same value on same input when called, it follows naturally that functions need to be consistent in the type of value or object they return. Referentially transparent functions have no side effects and depend only on the input provided.

Functions like:

Date.now();

and

   var counter = 0;                    // global
   function incrementCounterTo(num) {
 return counter += num;    
   }

are not considered pure due to side effects that break referential transparency.

Preserving Type Safety


In order to talk about the issue of type safety, consider this code:

   var months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul",
                 "Aug", "Sep", "Oct", "Nov", "Dec"];
   function getMonthName(mo) {
      if (months[mo - 1] !== undefined) {
         return months[mo - 1];
      } 
      else {
         throw new Error("Invalid Month Error!");
      }
   }

It's clear that for every valid month number, this function will always return same month name, which is string. However, for other inputs (mo > 12) this function throws an exception, so type is not preserved across input values in the domain of this function. Exceptions causes the stack to unwind and set external program flags for termination, which cause side effects to occur. Calling this code looks like the following:

    try {
        var m = getMonthName(13);

        // do work with m
    }
    catch (e) {
        return e.message;
    }  

Also, this function violates referential transparency, for you can't under any circumstances inline an exception statement as part of a bigger expression, as it will alter the entire meaning of the expression.

Let's look at a more functional approach that will solve all of these issues.

Optional Data Type

The solution is based on a core functional programming concept called monads. In particular, we will be dealing with the Optional monad, also called the Maybe monad in some functional programming languages. Here is the code for the Optional JavaScript object (some details omitted for brevity):


var Optional = (function () {

    // private constructor
    function Option(val) {
        var _value = val || null;

        // public methods

        this.get = function () {
            if (!this.isPresent()) {
                throw 'NoSuchElementError';
            }
            return _value;
        };
        
        this.map = function (fn) {
            if (!this.isPresent()) {
                return Optional.empty();
            }
            if (typeof fn == 'function') {
                return Optional.ofNullable(fn.call(_value));
            }
            return Optional.ofNullable(_value[fn].call(_value));
        };

        this.getOrElse = function (other) {
            return _value !== null ? _value : other;
        };
    }

    return {

        empty: function () {
            return Object.freeze(new Option());
        },
        
        of: function (val) {
            if (val === null) {
                throw 'NoSuchElementError';
            }
            return Object.freeze(new Option(val));
        },
        
        ofNullable: function (val) {
            var inst = val !== null ? this.of(val) : this.empty();
            return Object.freeze(inst);
        }
    }
})();

There are many ways of doing this, I like the notion of immutability in objects, which is why I choose to freeze objects as they are created. Using Optional we keep referential transparency as well as preserve type safety in our functions. This has many benefits:
  1. Create a clear contract for our functions
  2. Prevent our callers from having to place boilerplate exception handling code into every call
  3. Provide a more succinct and fluent API experience
Let's look at how we can improve our sample problem:

   function findMonthName(months, mo) {
        if (months[mo - 1] !== undefined) {
            return Optional.of(months[mo - 1]);
        }
        else {
            return Optional.empty();
        }
    }

This allows you to be more idiomatic about how you handle failures. Your calling code looks like:

findMonthName(months, 13).getOrElse('No month found');

As you can see this function keeps its contract by returning an Optional object to encapsulate success and failure states.

Here is another use case for this functional pattern. Consider the following simple objects:

    function Address(country) {
        var _country = country;

        this.getCountry = function() {
            return _country;
        }
    }

    function Person(addr) {
        var _address = addr;

        this.getAddress = function() {
            return _address;
        }
    }

    function Customer(person) {
        var _person = person;

        this.getPerson = function() {
            return _person;
        }
    }

It would be nice to be able to access the address of a customer as:
customer.getPerson().getAddress().getCountry()
And expect it all to work all the time.....Right! You will probably want to guard against this access. In imperative and object-oriented programming, you will be writing code like this:

    var p = customer.getPerson();
    if(p !== null) {
        var a = p.getAddress();
        if(a !== null) {
            var c = a.getCountry();
            return c;
        }
        return null;
    }
    return null;

Can you spot any issues with this code? For starters, the incessant null checking that occurs only to return null again because there is no other sensible value. So, the caller of this function also must perform another round of null checks.

Let's now see a more scalable and idiomatic approach, the functional way:

   customer.getPerson()
            .map('getAddress')
            .map('getCountry')
            .getOrElse('Unknown');    

If any value is missing in the chain of calls, this will short-circuit and return 'Unknown'. We do this to avoid all of the null checks. All we need to do is tweak our API based on Optional:

   function Customer(person) {
        var _person = person;

        this.getPerson = function() {
            return Optional.ofNullable(_person);
        }
    }

And that's it!
Functional programming is a very important and powerful way of programming. Experts in the field have realized that applications written in a functional style tend to be more expressive, readable, and correct. Referentially transparent functions can have their result values substituted inline in place of the actual function call. Finally, using Optional can enhance your code in many ways by making it shorter to write as well as fluent.

Tuesday, January 6, 2015

Secrets of a Functional Ninja

Overview

JavaScript is a general-purpose, single-threaded, functional programming language for the web. Douglas Crockford in his post, JavaScript: The Most Misunderstood Programming Language, mentions that JavaScript has more in common with functional languages like Lisp or Scheme, than many other popular languages such as C or Java. This notion of "functional" is possible due its very powerful Function object, which is the core of Functional Programming and JavaScript alike.

To become a JavaScript expert, all you need to understand well is the interplay between the core pillars of the language, the rest falls into place:



In this post, I will talk about some of the nice features of the JavaScript Function object. In addition, I will talk about one of the most misunderstood, and yet powerful, features of JavaScript functions, closures.

The Function

According to John Resig in his book Secrets of a JavaScript Ninja, the most important skill that separates average JavaScript programmers from great ones lies in the understanding of functions --the very foundation of the language is based on this notion. Let's look at some of the nice properties of JavaScript functions.

Functions are first-class objects

Functions in JS can be treated just like any other object: they can be assigned to variables, returned from functions, and passed as arguments into other functions. Essentially, a function is a value. The only characteristic that separates a function from an object is that the latter can be invoked, via the ( ) operator.

Functions are JavaScript's "unit of work" (comparable to what classes are in Object Oriented  languages) and the main abstraction layer. All code that gets executed should run within the context of a function. As a result you can safely assume that functions can be nested within functions. Consider the browser's onLoad() function to be equivalent to JavaScript's main method. All code should begin within this function, creating something called the the global context --the window object.


<body onload="main();">

    ...
    <script type="text/javascript>"
      function main() {

          // application code starts here
      }
   </script>   
</body>

Aside from being "callable",  functions, like objects, can have properties and you can dynamically assign properties to it. Properties are equivalent to instance data. Let's look in detail at 2 important function properties: name and length.

Functions have a  name property; omitting it will create an anonymous function. This property is set to the name of the function in the function signature at the time of declaration. Anonymous functions are used typically as assigned to a variable (for later execution), used as immediately invoking functions, or passed in as function arguments in the form of a callback, like the setTimeout() API. Callbacks are very useful when handling asynchronous events: user input, browser events,  intervals, etc.

Another important function property is length which represents the number of named (formal) parameters with which the function was declared. length is used a lot in framework or library code to perform introspection on the function called in order to provide some nice functionality.

console.log(function(a,b) {}.length);  // returns 2

Functions can be invoked in 4 different ways:
  1. As a standalone function. This is the most straight-forward call. It occurs when the ( ) operator is applied to the function name or immediately invoked applied to a un-named function object.
  2. As a method (enabling object-oriented programming). It occurs when the ( ) operator is applied to a property of an object. 
  3. As a constructor (bringing a new object to life). It occurs when preceding the function invocation with the keyword new. In the absence of any explicit return value, the new object is returned. It is best practice to name these functions with a starting capital letter. 
  4. By calling the apply(...) and call(...) methods. Used when the intent is to force the function context to be a certain object. In these calls, the first argument is the owning object. Passing null has the interesting effect of setting the global context (namely, window)  to be the owning object.   

// 1
function myFunc() {
  // do something
}
myFunc();


// 2
var obj = {
   myFunc: function () {
      // do something
   }
};
obj.myFunc();

// 3
function Person(fname, lname) {
   this.firstname = fname;
   this.lastname = lname;
}
var person = new Person();


//4
Array.prototype.slice.call(null, [1,2,3,4.....n]);



All function object invocations (whichever of the 4 categories above) are syntactic sugar for the implicitly created Function() constructor. 

var func = new Function([arg1, [arg2, ..[argN]]], functionBody);

Where functionBody is a string containing all of the statements comprising the function body.

Variadic Functions

At function invocation, function arguments are mapped one-to-one to function parameters. However, there are cases where a user will supply more arguments to a function than the list of parameters it declares to accept.

Unlike statically types languages, JavaScript has no overloaded functions; instead, it handles parameter passing in the following manner:
  1. If more actual arguments are supplied than there are formal parameters, the extra arguments are not given explicit names. JavaScript provides another mechanism to retrieve them implicitly via the arguments object. Oddly so, the arguments object is not a JavaScript array, it is actually an array-like object with a length property and overloaded indexing operator. 
  2. If there are more formal parameters than there are actual arguments supplied to the function call, the extra parameters will be set to undefined
In addition to the implicit arguments object, all JavaScript functions (because they are first-class objects) are passed an implicit reference to this, similar to classes in OO languages.  The this reference is included in the a function's context. Function context is not straight-forward to understand, but it is JavaScript's most important feature, especially when combined with closures. I will explain this more clearly in the next sections to come.

If you come from OO languages, like Java, the this refers to the object within which an instance method is contained. In other words, it is a factor of how and where the function is declared. In JavaScript, it is a factor of how the function is invoked, based on the 4 invocation methods listed previously.

Scope

A function is scoped within the function that created it. If a function or variable is declared outside of a function scope, it will have global scope, which basically just translates into a dynamic property being added to the window object. Also, variables created without the keyword var will automatically be assigned to the global scope. By the way this can be very disastrous as it pollutes the global namespace. I just follow a simple rule, do not modify the global namespace.

To avoid namespace pollution, good JavaScript programmers write their code under some kind of function or object scope. Also, variables in the global scope become victims to "global change;" in other words, they can be changed in any place in the program, causing a debugging nightmare. The JQuery library and many others follow a pattern  called Immediately Invoked Function Expressions (IIFE), similar to:

(function() {

   // declare all variables and functions here   

}).call(this);

Function scope (similar to dynamic scoping explained below), and hence the reference to this, varies depending on the manner with which the function is being invoked, similar to before:
  1. As a standalone function. The function is in global scope, the window object owns it 
  2. As a method. The function becomes a property of said object and this refers to it. 
  3. As a constructor. The this implicitly reference is passed to the function and references the newly created object. All properties of the new object are in reachable via this
  4. By calling the apply(...) and call(...) methods. Both methods accept the owning object, which this will refer to. Essentially, the developers are in control of what the context will be.  
Throughout this post, I have identified several types of scoping rules: 

Global Scope

This is the simplest form of scoping. We covered this in detail in this post. Basically, objects created at the top level whether they are functions or variables have global scope. Also, objects created anywhere without the keyword var take on global scope. The owning object is the window object.

Lexical Scope

In this type, the visibility of a variable is based on its textual position. 


someVar = "Global";

function func() {
   var someVar = "Middle";

   return function(e) {
      var someVar = "Inner";
      alert(someVar); // will print "Inner"
   }
}
Lexical scope starts binding from the inner most scope and works outward. This is the "tightest" scoping mechanism available. JavaScript partially implements lexical scoping, but more appropriately, it implements function scoping, explained later. Languages with lexical scoping suffer from a downside called "Variable Shadowing". As you can see in the code snippet above, the variable someVar was shadowed with different layers all the way down to the innermost scope. In larger code segments, this can prove very hard to trace. 

Dynamic Scope

We covered this in detail throughout this post. The this reference depends on how a function is called: object, function, or global. Using call( )  or apply( ), developers can pass a dynamic reference to act as the context object, or pass null to indicate it belongs to the global context.

One common technique for supplying a scratch-pad to work on for the variable "this" is to pass an empty object into the call method, as such: 

someFunction.call({}, 10000);

Function Scope

Very similar to dynamic scoping. All bindings are constrained by the area bounding it, namely a function. This  is JavaScript's main scoping mechism.

function strangeIdentity() {

   for(var i=0; i<n; i++);  // notice semicolon

   return i;
}
strangeIdentity(138);  // returns 138

JavaScript groups all variable declarations to the top of the function, in a process called hoisting. JavaScript hoists all variable declarations within a function. This is somewhat confusing when compared with languages with block scope.  As good JavaScript developers, it is preferred that variables declared within functions (and within any scope) are done at the beginning.  Let's take a closer look.

Block Scope

Present in most other programming languages. Not implemented in JavaScript. As far as scope is concerned, JavaScript will ignore the curly braces in some situations in favor of function scope.  

function scopeTest() {
   for(var i = 0; i <= 5; i++) {
      var insideFor = i;
   }
   alert(insideFor);
}

In the following code, we would expect the script to fail indicating that the variable insideFor had not been declared. This is not the case, under hoisting this variable declaration was "moved up" to the top of the function and is thus visible within the entire function scope. The variable insideFor is hoisted to the top and given a value of undefined until the initialization happens within the for-loop.

To sum up, scoping can be present in the following ways:
  1. The value of the this binding, in the case of an object context or function context.
  2. The execution context defined by the value of the this binding, for cases when using apply( )  or call( )
  3. The “lifetime” of a variable. Variables exist from the moment they are declared until the function ends. Variables in the global context are accessible anywhere in the script.
  4. Variable value resolution scheme or lexical binding. Depending on where a variable appears, it's scope is bound to the function that contains it. 

Closures

A widely written about and debated topic in the JavaScript community. Closures have been a feature of functional programming languages such as Haskell for quite some time. Using them properly can have immense effects on your code.

Simply put, a closure is a scope that is created upon a function declaration allowing it access to any variables declared outside of the function --or within the enclosing (hence the name) scope.

Obviously, global variables will be included in the mix since as they are explicitly added as properties of the window object. The idea that drives developers crazy and at the same time why this is so powerful is that functions can be called at any point in time, after their declaration. So, access to that information when the function is called is preserved even after that scope has gone away.

Languages that support closures store the state of any accessible variable definitions together with the function declaration at runtime, in the functions stack. Imagine creating a "safety bubble" around all variables present, including function parameters, at the point of function declaration. This bubble will not be garbage collected until the reference to the function is gone.

function createScaleFunction(FACTOR) {
   return function(v) {
      return _.map(v, function(n) {
        return (n* FACTOR);
      });
   };
}


var scale10 = createScaleFunction(10);
scale10([1,2,3]);   //=> [10, 20, 30]

The FACTOR variable in this case is "remembered" or stored as part of the definition which gets returned to the user for later execution.  Notice that when we invoke scale10 the function is has already gone out of scope, which is impressive.  As shown before, another important use of closures is together with IIEFs to create private instance variables. In this code:


(function() {
   var _myVar = {

   };

}).call(this);

The variable _myVar will only be visible within the context or "bubble" of the immediately execution function. 

So, what's up with JavaScript?

Now we take a look at a somewhat obscure and also misunderstood JavaScript keyword, with.  The with statement is used to bound or put all properties of a JS object within the current scope, allowing us to access them without having to quality the owning object.

Support for this keyword is likely to be deprecated in new versions of JavaScript since ECMAScript 5 already prohibits its use under strict mode; however, this statement has been very controversial an debated among JavaScript experts, and you are likely to run into it in written code.

Let's take a quick look:

var someFunc = function () { return "ABC"};

var myObj = {
   someProp = 4,
   someFunc = function () { return "XYZ"};
};
with(myObj) {
   // bound the scope of myObj into the curly braces
   var xyz = someFunc();  // Who owns variable xyz? [window]
   alert(xyz);  // Prints XYZ
}

The following code above will print out "ABC " since we used the with statement to create a scope bound to object myObj. Some uses of with can be:

  1. Shortening references to an object in a deep hierarchy making code syntactically more legible (arguably harder to follow, according to Douglas Crockford)
  2. Exposing properties as top-level to a template

While reading an object's properties in an unqualified manner might look nice at a glance and make code shorter, writing to properties is very ambiguous and not consistent with what you might think. If you are within an object's scope bound by with, you would expect that assigning to or creating a new variable within this scope will implicitly create it under myObj. In other words, variable abc would belong to myObj. This is not the case. It will actually create it in the global context, which is consistent with the scoping rules mentioned previously. Only functions create new scopes, and a with block is no exception.

There is a better alternative to using with, you can use immediate functions as shown below to wrap a block of code and set the proper object scope.

Conclusion

The most powerful and yet complex concept to understand about JavaScript is that functions can be called in many different ways. Function context and scope, including the this reference,  depend on the manner in which function are invoked. While this is very powerful, it can be quite confusing.

In this post we took a brief tour of the JavaScript Function object as well as some inherent characteristics such as scope and closures. It is important to remember that closures should not be overused as they drag with it necessary memory to hold references in a function's outer scope.

Finally we looked briefly at the soon-to-be-deprecated with keyword. This can be used to achieve very powerful things at the expense of compromising readability.

Now that we have covered functions, objects, and scopes in detail stay tuned for my next blog post on Functional Javascript.

Resources

  1. Resig, John and Bibeault, Bear. Secrets of a JavaScript Ninja. Manning Publications
  2. http://javascript.crockford.com/javascript.html
  3. http://yuiblog.com/blog/2006/04/11/with-statement-considered-harmful/

Friday, December 5, 2014

Dart: A Best of Breed

I have playing with Dart quite a bit recently, and I think that the set of features it has makes it a great platform to develop small to large enterprise class applications. I have written code in languages such as Java, PHP, Python, and JavaScript, and I can see that Dart has borrowed many of the cool features of each into one great development platform. In this post I will highlight some of the core features of Dart in comparison to other languages I have used.

Get ready for this mouthful: Dart is an open source, class-based, optionally typed, structured, programming language for rich browser-based applications. Its familiar C-like syntax and scalable programming constructs, make it ideal for building single-page web applications.... ok....

Similar to other platforms such as Google Web Toolkit or CoffeeScript, you can run Dart by compiling to native JavaScript code. This will make adoption of this platform a bit easier for browser vendors other than Google. However, unlike GWT, which compiles for different targeted browsers, Dart is designed to compile your application to a single JS file targeting only the modern browsers.  Dart is optionally typed (also called documentary types), which means you can use var to assign any type of variable: String, Object, int, etc. You can use the static types as well. Documentary types improve tool integration features such as: code completion, compiler validations, and code-level documentation, which makes developer's lives easier. Darts type system can be considered a hybrid between Java's strong, static typing and JavaScript pure dynamic typing. If no type information is provided on a variable, dynamic is assumed.

In addition, Dart has a single-threaded concurrency model based on a concept called Isolates for parallel execution of code. Isolates are great because they don't have a shared-memory model, unlike Java Threads. When compiled to JavaScript, Isolates translate to HTML5 Web Workers. This stateless concurrency model is based on a message-passing middleware used to communicate from one isolate to another. In fact, the Dart runtime starts out (from its main method) within the context of an isolate; essentially, every script containing a "main" method runs in its own isolate.

Dart code can spawn a new Isolate, like Java does a for a new Thread. Isolates are different from threads in that they cannot share memory. They are ideal for creating a plugin architecture since they can be used to load external code dynamically in very much the same way to Python and PHP, but because it run in its own Isolate it will run in its own protected memory space.

On a much smaller scale, Dart also supports the development of Server-Side applications using the Dart Virtual Machine. On the server, Dart features nice I/O libraries for manipulating files and sockets. Most of the Dart libraries work on the server, except for dart:html which is in charge of DOM manipulation.

Dartium, the Dart developer edition of Chromium, is Google's open source version of Chrome. Dartium executes Dart code natively in the browser through an embedded Dart VM. For browsers that do not support Dartium, which is any other browser other than Chrome, there are a set of tools in the Dart ecosystem that might be useful:
  1. dart2js: Compiles all dart files into a single javascript file. 
  2. pub: a package manager much like Maven for Java or Composer for PHP. You can publish your libraries in Github.
  3. dartdoc: generated formatted HTML documentation (like javadoc). dartdoc comments start with "///" instead of "/**"
Like Java, Dart has single inheritance, multiple interfaces model; all classes inherit from Object. You use the keywords extends and implements, respectively. This is similar to Java's and other's inheritance model and is light years better than JavaScrip's prototypal inheritance. 
You can have public and private types and private fields are indicated via the "_" underscore. Other languages such as Python work this way as well. The latter, applies also for methods, functions, and even classes.
The "this" keyword is much more simple to understand as it refers to the instance of the class itself (like in Java) and not to the owner of the class (like in JavaScript).
You can split an application into modules or libraries --or packages. Each library is contained within its own .dart file. Libraries are defined using the "library" keyword --analogous to "package". This is a much better improvement over JavaScript since it suffers a lot from namespace pollution and variable collisions. In Dart, you can prefix (or qualify) libraries in order avoid class collisions when importing 2 libraries. For instance:

import "../my/lib/dart" as mine;

mine.commonFunction("do something"); // this will be unique
Now let's spend some time with a very powerful feature in Dart, functions. Like JavaScript, functions are first-class objects in Dart. Top-Level functions allow developers to pass functions as arguments to other functions, assign functions to variables, and dynamically call functions by name. In addition, function arguments (both objects and primitives) are passed by reference.

Dart provides a cool feature called Factory Constructors that allows class designers control the concrete implementation class for a given interface. This is really useful when you intend to have one implementation of an interface, but also to hide the underlying implementation of a class. In Java, you would have to implement it via a Factory Method pattern. In fact, String and int are implemented this way, via factory constructors. For instance:


abstract class IButton {
  
  click();

  factory IButton() {
     return new GenericButton();
  }
}

class GenericButton implements IButton {
  
   click() {
      return "Button has been clicked!";
   }

}

class CloseButton implements IButton {
  ...
}
Another cool thing about constructor methods in Dart are the initializer parameters:

class MyClass {

   var color;
   
   MyClass(color) {
      this.color = color;
   }
}

// you can also use

   MyClass(this.color) {

   }

Dart has several notations to express functions: long notatin and and short notation (similar to a Java8 lambda expression). Here is an example using short-hand syntax for getters and setters:


class MyClass {

   var _color;  // private property

   Color get color => _color;
  
   set color(Color c) => this._color = c;
   
}   

All functions in Dart return a value. Short-hand functions always return the result of evaluating the expression. Default is null. Long notation will employ the use of the "return" keyword just like any other language. You may also qualify a function as returning void. 

Dart supports closures, just like in Javascript. Closures close -over (or wrap) any local variables defined at the time of a function declaration.

Finally, Unit Testing Dart applications is easy with the built-in unittest library. Creating a test is nothing more than declaring a test function and using the Expect class to define assertions in your code. For instance:


test ("Test name", () {
   
   // anonymous function containing test body
   Expect.isNotNull(someObj);

});

Resources

  1. Dart In Action.
  2. dartlang.org

Monday, November 24, 2014

Apache Mahout: Clustering

Overview

People tend to naturally and unconsciously group things together. For instance, honey and sugar are "sweet".  When we taste items exhibiting a similar taste, we immediately describe it as "sweet."

This activity that we as humans do so quickly and instinctively is called clustering, or grouping.

The components involved in a Mahout clustering solution contain:

  1. An algorithm implementation used to perform grouping
  2. A notion of similarity and dissimilarity 
  3. A stopping condition representing when groups can no longer be formed
Circles are a good way to visualize clustering:



The center of the circle is defined as the centroid, or mean (average) of a cluster. This is the point whose coordinates are the average of the xy coordinates in the cluster. The closer items are to a cluster's center, the more similar they are.

Clustering is all about finding the similarities between 2 points or 2 items in the xy plane. Mahout contains a few clustering algorithms such as: k-means, fuzzy k-means, and canopy. In this case, "k" refers to the number of clusters to form. In the example above, "k" has been set to 3.

A Mahout clustering algorithm involves the following steps:


  1. Create a SequenceFile with the input vectors (these are boolean points in the xy space) 
  2. Create a SequenceFile with the initial cluster centers. 
  3. Pick similarity measure to use, such as: EuclidianDistanceMeasure 
  4. Define A convergenceThreshold indicating when to stop 
  5. Define number of iterations to perform 
  6. Create the Vector implementation used in input files 
Cluster centers (in step 2) are sometimes estimated or guessed; but with non-trivial data, this can be very challenging. Even if the estimated centers are way off, the k-means algorithm will adjust the centers at each iteration by computing the average center, or centroid in each cluster.

Getting perfect clusters is a science in it of itself as there are many parameters to tweak in the algorithm.

Other distance measures available in Mahout are the: SquaredEuclidianDistanceMeasure, ManhattanDistanceMeasure, CosineDistanceMeasure, TanimotoDistanceMeasure, among others.

Resources

  1. Mahout in Action. O'Reilly

Monday, November 17, 2014

Apache Mahout: An Introduction

Overview

The web grows more and more each day. Social networking has created a vehicle for users to voice their opinion about pretty much anything in the world. All this public data can be used to drastically benefit a company's business. However,  companies struggle to keep up. Manual approaches to creating reports and analysis become overwhelming. As a result, businesses have resorted to using recommender engines to reduce the level of noise and make sensible statements out of all the data they receive. Good examples of this are: Amazon, Netflix, Youtube, eHarmony, among others.

Mahout is a Java open source machine learning (or collective intelligence) library from Apache born out of the popular text search engine Lucene. Conceived in 2008, and close to its 1.0 release, it was  designed with scalability and extensibility in mind. Being an Apache product, it features nice integration with the popular Map-Reduce implementation Apache Hadoop.

Mahout provides a framework for implementing three basic things:
  1. Recommendations (Netflix, Amazon, dating sites, social networks)
  2. Clustering (Google News) 
  3. Classification (spam detection)
Recommendations are the most widely used technique today. People tend to have relatively similar preferences. Based on a user's preference for a certain object, and other similar user's preferences, a relationship between a user and an object can be established. Clustering techniques help identify structure and hierarchy among a large collection of data. Finally, classification techniques can be used to decide how much an object is or isn't part of some category. For instance, Google decides whether an contains a person's face in it. It can also be used to classify patterns as usual or unusual.

Implementation

At its core, Mahout uses primitive-based vector data sets:  [userId, itemId, preference value]. Data is only numeric for storage efficiency and performance to avoid Java's object overhead, which adds 140% more space needed to store. Since this is an introduction, in this post I will only focus on recommendations, which is the reason most people are attracted to Mahout.

Recommendations and Collaborative Filtering

Recommenders algorithms are data-intensive in nature. These techniques require no knowledge of the actual attributes of the items themselves. Using mathematical principles, recommenders produce a recommendation solely based on the relationship between users and items.

Mahout thrives on boolean data composed of a [userId, itemId]. If available, a preference value can added (much like a rating) indicating the strength of the user's preference to this item.



All algorithms relate a user to an item, but there are some differences:

User-Based Recommenders

These algorithms recommend based upon the notion of similarities among users. The following pseudo code describes the user-based algorithm process at a high level:


for every other user w
   compute a similarity s between u and v
   retain top users, ranked by similarity, as a neighborhood n

for every item i that some user in n has a preference for,
    but that u has no preference for yet
  for every other user v in n that has a preference for i
    compute a similarity s between u and v
    incorporate v's preference for i, weighted by s, into a running average
return the top items, ranked by weighted average


Explaining all possible user-based algorithms is outside of the scope of this post. In general, every used-based recommender in Mahout involves the inter-play of the following components:

  • Data model, implemented via DataModel
  • User-User similarity metric, implemented via a UserSimilarity
  • User neighborhood definition, implemented via UserNeighborhood
  • Recommender Engine, implemented via a Recommender

These algorithms can become slower as the number of users in the system increases. 

Item-Based Recommender

A cousin of the User-based recommender, Item-based algorithms are derived from how similar items are to other items. Similar to the components above, item-based algorithms contain the following components:
  • Data model, implemented via DataModel
  • Item-Item similarity metric, implemented via a ItemSimilarity
  • User neighborhood definition, implemented via UserNeighborhood
  • Recommender engine, implemented via a Recommender
A pseudo view of this algorithm will look like the following:


for every item i that u has no preference for yet
  for every item j that u has a preference for
    compute a similarity s between i and j
    add u's preference for j, weighted by s, into a running average
return the top items, ranked by weighted average

Some item-item similarities are more fixed which are good candidate for pre-computation in order to return results quicker. For instance, it's likely that two CD albums from bands such as as Metallica and Guns & Roses will continue to be similar to each other this year and the next. However, caching and pre-computation can be memory intensive (such is the case with Slope-One Recommender).

These algorithms can become slower as the number of items in the system increases. 

Evaluating Results

It is recommended to use an evaluator to score your algorithm. Using an evaluator you can split the dataset as training data and test data using a threshold value. Using list of computations and math, it will determine how "good" your recommendations are.  The smaller the score, the better the recommendation is. I highly recommend using an evaluator to experiment and tweak your recommendation engine.

Conclusion

We took a brief look at two canonical recommendation techniques: user-based and item-based. The runtime of a user based algorithm goes up as the number of users increases. The runtime of a item based algorithms goes up as the number of items increases.

Depending on your needs, you can pick an item based recommender if you have small number of items for an infinite amount of users; the converse could be used as criteria to pick a user-based approach.

Due to performance requirements and efficiency, you might need to tweak some JVM settings. For instance, the book recommends setting


-Xmx=2048  -XX:+UseParallelGC -XX:+UseParallelOldGC


I highly recommend taking a look at this framework. The 1.0 release is coming up shortly and it seems to be getting a lot of traction in the community. For large scale data projects, I highly recommended implementing Mahout on top of a Hadoop installation.

Resources

  1. Mahout in Action. O'Reilly
  2. https://www.youtube.com/watch?v=zvfKH9Yb0s0