Friday, December 5, 2014

Dart: A Best of Breed

I have been playing with Dart quite a bit recently, and I think its feature set makes it a great platform for developing anything from small applications to large, enterprise-class ones. I have written code in languages such as Java, PHP, Python, and JavaScript, and I can see that Dart has borrowed many of the cool features of each and combined them into one great development platform. In this post I will highlight some of the core features of Dart in comparison to other languages I have used.

Get ready for this mouthful: Dart is an open source, class-based, optionally typed, structured programming language for rich browser-based applications. Its familiar C-like syntax and scalable programming constructs make it ideal for building single-page web applications.... ok....

Similar to other platforms such as Google Web Toolkit or CoffeeScript, you can run Dart by compiling it to JavaScript. This makes adoption easier, since Dart applications also run on browsers from vendors other than Google. However, unlike GWT, which compiles separately for each targeted browser, Dart is designed to compile your application to a single JS file targeting only modern browsers.

Dart is optionally typed (its types are also called documentary types), which means you can declare a variable of any type (String, Object, int, etc.) with var, or you can use static type annotations instead. Documentary types improve tooling: code completion, compiler validation, and code-level documentation, all of which make developers' lives easier. Dart's type system can be considered a hybrid between Java's strong, static typing and JavaScript's purely dynamic typing. If no type information is provided for a variable, dynamic is assumed.
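
To make the optional typing concrete, here is a minimal sketch that mixes untyped and typed declarations. The names (greeting, describe, and so on) are made up for illustration and are not part of any Dart library.

```dart
// No type annotation: the declaration works without one.
var greeting = 'Hello';

// Type annotations: these document intent and help the tools.
String name = 'Dart';
int year = 2014;

// dynamic explicitly opts out of static checking for this variable.
dynamic anything = 'a string';

// Annotated parameters and return type.
String describe(String who, int when) => '$greeting, $who $when';

void main() {
  anything = 42; // allowed: a dynamic variable can hold any object
  print(describe(name, year)); // prints "Hello, Dart 2014"
  print(anything); // prints "42"
}
```

The annotated version costs a few extra keystrokes, but an editor can now complete method names on name and flag a call such as describe(year, name).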

In addition, Dart has a single-threaded concurrency model based on a concept called Isolates for parallel execution of code. Isolates are great because they don't have a shared-memory model, unlike Java Threads. When compiled to JavaScript, Isolates translate to HTML5 Web Workers. This stateless concurrency model is based on a message-passing middleware used to communicate from one isolate to another. In fact, the Dart runtime starts out (from its main method) within the context of an isolate; essentially, every script containing a "main" method runs in its own isolate.

Dart code can spawn a new Isolate, much like Java code starts a new Thread. Isolates are different from threads in that they cannot share memory. They are ideal for creating a plugin architecture, since they can be used to load external code dynamically in very much the same way as in Python and PHP; but because that code runs in its own Isolate, it runs in its own protected memory space.
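
As a rough sketch of the message-passing model, the following spawns one isolate and waits for a single message from it. It assumes the Isolate.spawn and ReceivePort APIs from dart:isolate; the function names worker and askWorker are hypothetical.

```dart
import 'dart:async';
import 'dart:isolate';

// Entry point of the new isolate. It shares no memory with the
// spawning isolate, so the only way to hand back a result is to
// send a message through the port it was given.
void worker(SendPort replyTo) {
  replyTo.send('hello from a separate isolate');
}

// Spawns the worker and completes with the first message it sends.
Future askWorker() {
  var inbox = new ReceivePort();
  var reply = inbox.first; // start listening before spawning
  return Isolate.spawn(worker, inbox.sendPort).then((_) => reply);
}

void main() {
  askWorker().then((message) => print(message));
}
```

Taking only the first message also closes the receiving port, which lets the program exit once the reply has arrived.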

On a much smaller scale, Dart also supports the development of Server-Side applications using the Dart Virtual Machine. On the server, Dart features nice I/O libraries for manipulating files and sockets. Most of the Dart libraries work on the server, except for dart:html which is in charge of DOM manipulation.

Dartium is the Dart developer edition of Chromium, the open source project behind Google Chrome. Dartium executes Dart code natively in the browser through an embedded Dart VM. For browsers that do not embed the Dart VM, which is every browser other than Dartium, there is a set of tools in the Dart ecosystem that might be useful:
  1. dart2js: compiles all Dart files into a single JavaScript file.
  2. pub: a package manager much like Maven for Java or Composer for PHP. You can publish your libraries on GitHub.
  3. dartdoc: generates formatted HTML documentation (like javadoc). dartdoc comments start with "///" instead of "/**"

Like Java, Dart has a single inheritance, multiple interfaces model; all classes inherit from Object. You use the keywords extends and implements, respectively. This is similar to the inheritance model of Java and other languages, and is light years better than JavaScript's prototypal inheritance.

Types and members are public by default and become private to their library when their name starts with an underscore ("_"). Python uses a similar convention. This applies to fields, methods, functions, and even classes.

The "this" keyword is much simpler to understand, as it always refers to the instance of the class itself (like in Java) and not to whatever object happens to own the function at call time (like in JavaScript).

You can split an application into modules, also called libraries or packages. Each library is contained within its own .dart file. Libraries are defined using the "library" keyword, which is analogous to "package". This is a big improvement over JavaScript, which suffers a lot from namespace pollution and variable collisions. In Dart, you can prefix (or qualify) libraries in order to avoid class collisions when importing 2 libraries. For instance:

import "../my/lib/dart" as mine;

mine.commonFunction("do something"); // this will be unique

Now let's spend some time with a very powerful feature in Dart: functions. Like in JavaScript, functions are first-class objects in Dart. This allows developers to pass functions as arguments to other functions, assign functions to variables, and return them from other functions. In addition, every value in Dart (numbers and strings included) is an object, so function arguments are always passed as references to objects rather than as copies.
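
A small sketch of what first-class functions look like in practice. The function names twice and applyToAll are invented for this example.

```dart
// An ordinary top-level function.
int twice(int x) => x * 2;

// Takes another function as its second argument and applies it
// to every element of the list.
List<int> applyToAll(List<int> values, int transform(int value)) {
  return values.map(transform).toList();
}

void main() {
  var f = twice; // a function assigned to a variable
  print(applyToAll([1, 2, 3], f)); // prints "[2, 4, 6]"
  print(applyToAll([1, 2, 3], (x) => x + 1)); // prints "[2, 3, 4]"
}
```

The second call passes an anonymous function directly, which is the style you will see most often with collection methods such as map and forEach.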

Dart provides a cool feature called Factory Constructors that allows class designers to control the concrete implementation class returned for a given interface. This is really useful when you intend to have one default implementation of an interface, and also when you want to hide the underlying implementation of a class. In Java, you would have to implement this via the Factory Method pattern. In fact, String and int are implemented this way, via factory constructors. For instance:


abstract class IButton {
  
  click();

  factory IButton() {
     return new GenericButton();
  }
}

class GenericButton implements IButton {
  
   click() {
      return "Button has been clicked!";
   }

}

class CloseButton implements IButton {
  ...
}

Another cool thing about constructors in Dart is the initializer parameter, which assigns a constructor argument directly to a field:

class MyClass {

   var color;
   
   MyClass(color) {
      this.color = color;
   }
}

// you can also use

   MyClass(this.color) {

   }

Dart has two notations to express functions: long notation and short notation (similar to a Java 8 lambda expression). Here is an example using short-hand syntax for getters and setters:


class MyClass {

   var _color;  // private property

   Color get color => _color;
  
   set color(Color c) => this._color = c;
   
}   

All functions in Dart return a value. Short-hand functions always return the result of evaluating their expression. Long notation uses the "return" keyword just like any other language; if a function finishes without returning anything, its result is null. You may also declare a function as returning void.

Dart supports closures, just like JavaScript. Closures close over (or wrap) any local variables that are in scope at the time of a function declaration.
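
Here is a minimal sketch of a closure. The makeCounter function is a made-up example: the function it returns keeps access to the local variable count even after makeCounter itself has returned.

```dart
// Returns a function that remembers its own private count.
Function makeCounter() {
  var count = 0;
  return () {
    count = count + 1;
    return count;
  };
}

void main() {
  var next = makeCounter();
  print(next()); // prints "1"
  print(next()); // prints "2"

  // Every call to makeCounter creates a fresh, independent count.
  var other = makeCounter();
  print(other()); // prints "1"
}
```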

Finally, unit testing Dart applications is easy with the unittest library. Creating a test is nothing more than declaring a test function and using the expect( ) function together with a matcher to define assertions in your code (older examples use the static methods of an Expect class instead). For instance:


test("Test name", () {

   // anonymous function containing test body
   expect(someObj, isNotNull);

});

Resources

  1. Dart In Action.
  2. dartlang.org

Monday, November 24, 2014

Apache Mahout: Clustering

Overview

People tend to naturally and unconsciously group things together. For instance, honey and sugar are "sweet". When we taste items exhibiting a similar taste, we immediately describe them as "sweet."

This activity that we as humans do so quickly and instinctively is called clustering, or grouping.

A Mahout clustering solution involves the following components:

  1. An algorithm implementation used to perform grouping
  2. A notion of similarity and dissimilarity 
  3. A stopping condition representing when groups can no longer be formed

Circles are a good way to visualize clustering:



The center of the circle is defined as the centroid, or mean (average) of a cluster. This is the point whose coordinates are the average of the xy coordinates in the cluster. The closer items are to a cluster's center, the more similar they are.

Clustering is all about finding the similarities between 2 points or 2 items in the xy plane. Mahout contains a few clustering algorithms such as: k-means, fuzzy k-means, and canopy. In this case, "k" refers to the number of clusters to form. In the example above, "k" has been set to 3.

A Mahout clustering algorithm involves the following steps:


  1. Create a SequenceFile with the input vectors (these are boolean points in the xy space) 
  2. Create a SequenceFile with the initial cluster centers. 
  3. Pick the similarity measure to use, such as: EuclideanDistanceMeasure 
  4. Define a convergenceThreshold indicating when to stop 
  5. Define the number of iterations to perform 
  6. Create the Vector implementation used in input files 

Cluster centers (in step 2) are sometimes estimated or guessed; but with non-trivial data, this can be very challenging. Even if the estimated centers are way off, the k-means algorithm will adjust the centers at each iteration by computing the average center, or centroid, of each cluster.
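
To show how the centers get adjusted at each iteration, here is a small k-means sketch over points in the xy plane. This is not Mahout's API: it is a plain, in-memory illustration written in Dart (to match the other examples on this blog), and every name in it (distance, nearest, kMeans) is hypothetical. It uses Euclidean distance and stops when no center moves more than a threshold or when the iteration limit is reached.

```dart
import 'dart:math';

// Euclidean distance between two points stored as [x, y].
double distance(List<double> a, List<double> b) {
  var dx = a[0] - b[0];
  var dy = a[1] - b[1];
  return sqrt(dx * dx + dy * dy);
}

// Index of the center that is closest to the given point.
int nearest(List<double> point, List<List<double>> centers) {
  var best = 0;
  for (var i = 1; i < centers.length; i++) {
    if (distance(point, centers[i]) < distance(point, centers[best])) {
      best = i;
    }
  }
  return best;
}

List<List<double>> kMeans(List<List<double>> points,
    List<List<double>> initialCenters, int maxIterations, double threshold) {
  var centers = initialCenters;
  for (var iteration = 0; iteration < maxIterations; iteration++) {
    // sums[c] holds [sum of x, sum of y, number of points] for cluster c.
    var sums = centers.map((c) => [0.0, 0.0, 0.0]).toList();
    for (var p in points) {
      var s = sums[nearest(p, centers)];
      s[0] += p[0];
      s[1] += p[1];
      s[2] += 1;
    }
    var moved = 0.0;
    var updated = <List<double>>[];
    for (var c = 0; c < centers.length; c++) {
      var s = sums[c];
      // The new center is the centroid (mean) of the cluster's points;
      // an empty cluster keeps its previous center.
      var next = s[2] == 0 ? centers[c] : [s[0] / s[2], s[1] / s[2]];
      moved = max(moved, distance(centers[c], next));
      updated.add(next);
    }
    centers = updated;
    if (moved < threshold) break; // converged
  }
  return centers;
}

void main() {
  var points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]];
  var guesses = [[1.0, 1.0], [8.0, 1.0]]; // deliberately a bit off
  print(kMeans(points, guesses, 10, 0.01));
}
```

With these four points and k = 2, the guessed centers settle at (0, 1) and (10, 1) after the first adjustment, and the second iteration stops the loop because nothing moves anymore.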

Getting perfect clusters is a science in and of itself, as there are many parameters to tweak in the algorithm.

Other distance measures available in Mahout include: SquaredEuclideanDistanceMeasure, ManhattanDistanceMeasure, CosineDistanceMeasure, and TanimotoDistanceMeasure, among others.

Resources

  1. Mahout in Action. O'Reilly

Monday, November 17, 2014

Apache Mahout: An Introduction

Overview

The web grows more and more each day. Social networking has created a vehicle for users to voice their opinion about pretty much anything in the world. All this public data can be used to drastically benefit a company's business. However, companies struggle to keep up. Manual approaches to creating reports and analysis become overwhelming. As a result, businesses have resorted to using recommender engines to reduce the level of noise and make sensible statements out of all the data they receive. Good examples of this are: Amazon, Netflix, YouTube, and eHarmony, among others.

Mahout is an open source machine learning (or collective intelligence) library from Apache, written in Java and born out of the popular text search engine Lucene. Conceived in 2008, and close to its 1.0 release, it was designed with scalability and extensibility in mind. Being an Apache product, it features nice integration with the popular Map-Reduce implementation Apache Hadoop.

Mahout provides a framework for implementing three basic things:
  1. Recommendations (Netflix, Amazon, dating sites, social networks)
  2. Clustering (Google News) 
  3. Classification (spam detection)

Recommendations are the most widely used technique today. People tend to have relatively similar preferences. Based on a user's preference for a certain object, and the preferences of other similar users, a relationship between a user and an object can be established. Clustering techniques help identify structure and hierarchy among a large collection of data. Finally, classification techniques can be used to decide how much an object is or isn't part of some category. For instance, Google decides whether an image contains a person's face. Classification can also be used to label patterns as usual or unusual.

Implementation

At its core, Mahout uses primitive-based vector data sets of the form [userId, itemId, preference value]. Data is kept purely numeric for storage efficiency and performance: this avoids Java's object overhead, which adds 140% to the space needed for storage. Since this is an introduction, in this post I will only focus on recommendations, which is the reason most people are attracted to Mahout.

Recommendations and Collaborative Filtering

Recommender algorithms are data-intensive in nature. These techniques require no knowledge of the actual attributes of the items themselves. Using mathematical principles, recommenders produce a recommendation solely based on the relationship between users and items.

Mahout thrives on boolean data composed of [userId, itemId] pairs. If available, a preference value (much like a rating) can be added, indicating the strength of the user's preference for this item.



All algorithms relate a user to an item, but there are some differences:

User-Based Recommenders

These algorithms recommend based upon the notion of similarities among users. The following pseudo code describes the user-based algorithm process at a high level:


for every other user w
   compute a similarity s between u and w
retain the top users, ranked by similarity, as a neighborhood n

for every item i that some user in n has a preference for,
    but that u has no preference for yet
  for every other user v in n that has a preference for i
    compute a similarity s between u and v
    incorporate v's preference for i, weighted by s, into a running average
return the top items, ranked by weighted average


Explaining all possible user-based algorithms is outside of the scope of this post. In general, every user-based recommender in Mahout involves the interplay of the following components:

  • Data model, implemented via DataModel
  • User-User similarity metric, implemented via a UserSimilarity
  • User neighborhood definition, implemented via UserNeighborhood
  • Recommender Engine, implemented via a Recommender

These algorithms can become slower as the number of users in the system increases. 
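
The user-based pseudo code above can be sketched in a few lines. This is not Mahout's API: it is an in-memory illustration written in Dart to match the other examples on this blog, and all names (similarity, Scored, recommend) are hypothetical. The similarity measure used here, 1 / (1 + Euclidean distance over the items both users rated), is just an assumption chosen for brevity.

```dart
import 'dart:math';

// Assumed similarity: 1 / (1 + distance) over co-rated items,
// or 0 when the two users have no item in common.
double similarity(Map<String, double> a, Map<String, double> b) {
  var sum = 0.0;
  var shared = 0;
  a.forEach((item, value) {
    var other = b[item];
    if (other != null) {
      sum += (value - other) * (value - other);
      shared++;
    }
  });
  return shared == 0 ? 0.0 : 1 / (1 + sqrt(sum));
}

class Scored {
  final String item;
  final double score;
  Scored(this.item, this.score);
}

List<String> recommend(Map<String, double> u,
    List<Map<String, double>> others, int neighborhoodSize, int howMany) {
  // Neighborhood n: the other users most similar to u.
  var neighbors = others.toList();
  neighbors.sort((a, b) => similarity(u, b).compareTo(similarity(u, a)));
  neighbors = neighbors.take(neighborhoodSize).toList();

  // Running weighted average for every item u has no preference for.
  var totals = <String, double>{};
  var weights = <String, double>{};
  for (var v in neighbors) {
    var s = similarity(u, v);
    v.forEach((item, pref) {
      if (u.containsKey(item)) return; // u already rated this item
      totals[item] = (totals[item] ?? 0.0) + pref * s;
      weights[item] = (weights[item] ?? 0.0) + s;
    });
  }

  // Rank the candidate items by their weighted average.
  var ranked = <Scored>[];
  totals.forEach((item, total) {
    var w = weights[item];
    if (w != null && w > 0) ranked.add(new Scored(item, total / w));
  });
  ranked.sort((a, b) => b.score.compareTo(a.score));
  return ranked.take(howMany).map((r) => r.item).toList();
}

void main() {
  var u = {'A': 5.0, 'B': 3.0};
  var others = [
    {'A': 5.0, 'B': 3.0, 'C': 4.0},
    {'A': 1.0, 'B': 5.0, 'D': 5.0},
    {'A': 4.0, 'C': 2.0, 'D': 1.0}
  ];
  print(recommend(u, others, 2, 2)); // prints "[C, D]"
}
```

Note that the similarity has to be computed against every other user before the neighborhood can be picked, which is exactly why this approach slows down as the number of users grows.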

Item-Based Recommender

A cousin of the user-based recommender, item-based algorithms are derived from how similar items are to other items. Item-based algorithms involve components similar to the ones above, minus the neighborhood, since similarities are computed between items rather than between users:
  • Data model, implemented via DataModel
  • Item-Item similarity metric, implemented via an ItemSimilarity
  • Recommender engine, implemented via a Recommender

A pseudo view of this algorithm will look like the following:


for every item i that u has no preference for yet
  for every item j that u has a preference for
    compute a similarity s between i and j
    add u's preference for j, weighted by s, into a running average
return the top items, ranked by weighted average

Some item-item similarities are fairly stable over time, which makes them good candidates for pre-computation in order to return results quicker. For instance, it's likely that two albums from bands such as Metallica and Guns N' Roses will continue to be similar to each other this year and the next. However, caching and pre-computation can be memory intensive (such is the case with the Slope-One Recommender).

These algorithms can become slower as the number of items in the system increases. 
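
Here is a sketch of the item-based pseudo code, assuming the item-item similarities have already been pre-computed into a lookup table. As before, this is an in-memory illustration in Dart rather than Mahout's API, and the names (lookup, estimatePreferences) and the similarity values in the example are invented.

```dart
// Similarity between items i and j from a pre-computed table,
// or 0 when the table has no entry for the pair.
double lookup(Map<String, Map<String, double>> table, String i, String j) {
  var row = table[i];
  if (row == null) return 0.0;
  var s = row[j];
  return s == null ? 0.0 : s;
}

// Estimates u's preference for every item u has not rated yet.
Map<String, double> estimatePreferences(Map<String, double> u,
    List<String> allItems, Map<String, Map<String, double>> similarities) {
  var estimates = <String, double>{};
  for (var i in allItems) {
    if (u.containsKey(i)) continue; // u already has a preference for i
    var total = 0.0;
    var weight = 0.0;
    u.forEach((j, pref) {
      var s = lookup(similarities, i, j);
      total += pref * s; // u's preference for j, weighted by s
      weight += s;
    });
    if (weight > 0) estimates[i] = total / weight;
  }
  return estimates;
}

void main() {
  var u = {'Metallica': 5.0, 'Mozart': 1.0};
  var items = ['Metallica', 'Mozart', 'Guns N Roses', 'Bach'];
  var similarities = {
    'Guns N Roses': {'Metallica': 0.9, 'Mozart': 0.1},
    'Bach': {'Metallica': 0.1, 'Mozart': 0.9}
  };
  print(estimatePreferences(u, items, similarities));
}
```

In this made-up data set the estimate for Guns N Roses comes out around 4.6 and the one for Bach around 1.4, so sorting the estimates in descending order gives the ranked list of recommendations.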

Evaluating Results

It is a good idea to use an evaluator to score your algorithm. An evaluator splits the dataset into training data and test data according to a threshold value. It builds the recommender from the training data, estimates the preferences that were held back as test data, and measures how far the estimates are from the actual values. The smaller the score, the better the recommendations. I highly recommend using an evaluator to experiment with and tweak your recommendation engine.
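
One common way to compute such a score is the average absolute difference between estimated and actual preferences. The sketch below illustrates only that arithmetic; it is written in Dart for consistency with the rest of this blog, it is not Mahout code, and the function name is hypothetical.

```dart
// Average absolute difference between estimated and actual
// preference values; 0.0 would mean perfect estimates.
double averageAbsoluteDifference(List<double> estimated, List<double> actual) {
  var total = 0.0;
  for (var i = 0; i < actual.length; i++) {
    total += (estimated[i] - actual[i]).abs();
  }
  return total / actual.length;
}

void main() {
  var estimated = [4.0, 3.0, 5.0]; // produced from the training data
  var actual = [5.0, 3.0, 3.0]; // held back as test data
  print(averageAbsoluteDifference(estimated, actual)); // (1 + 0 + 2) / 3
}
```

A score of 1.0 here means the estimates were off by one rating point on average.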

Conclusion

We took a brief look at two canonical recommendation techniques: user-based and item-based. The runtime of a user-based algorithm goes up as the number of users increases. The runtime of an item-based algorithm goes up as the number of items increases.

Depending on your needs, you can pick an item-based recommender if you have a relatively small number of items and a very large number of users; the converse could be used as a criterion for picking a user-based approach.

To meet performance and efficiency requirements, you might need to tweak some JVM settings. For instance, the book recommends setting:


-Xmx2048m  -XX:+UseParallelGC -XX:+UseParallelOldGC


I highly recommend taking a look at this framework. The 1.0 release is coming up shortly and it seems to be getting a lot of traction in the community. For large scale data projects, I highly recommend running Mahout on top of a Hadoop installation.

Resources

  1. Mahout in Action. O'Reilly
  2. https://www.youtube.com/watch?v=zvfKH9Yb0s0

Wednesday, October 29, 2014

Streaming into the Future with Dart

Overview

Dart is a scalable language built primarily for the web. It contains a rich set of libraries, an emerging community, and its own Eclipse-based development environment that make development straightforward, from very small scripts to fully-featured applications. However, you can also write purely server side or standalone applications with Dart. In this post, I will write about the asynchronous nature of Dart as represented in its concept of Streams and Futures.

Streams

Part of the dart:async library, Streams provide an asynchronous flow of data, which can range from something as small as single mouse clicks to large sequential chunks of a big file; Streams are also used to handle user-generated events.

Since Dart is a single-threaded programming language, the use of Streams is essential for handling all types of events and performing file I/O.
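
As a minimal sketch of the idea, the following creates a stream by hand with a StreamController from dart:async and listens to it. The clickStream function and the event strings are hypothetical stand-ins for real mouse events.

```dart
import 'dart:async';

// Emits the given events on a new stream and then closes it.
Stream<String> clickStream(List<String> events) {
  var controller = new StreamController<String>();
  for (var e in events) {
    controller.add(e); // buffered until somebody listens
  }
  controller.close();
  return controller.stream;
}

void main() {
  clickStream(['click 1', 'click 2']).listen(
      (event) => print('received: $event'),
      onDone: () => print('no more events'));
}
```

The listener is called once per event, in order, and onDone fires after the stream has been closed.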

Futures

In very simple terms, a Future represents a value that is to be delivered in the future, asynchronously; in other words, Futures are non-blocking calls that allow the application to compute values without halting the execution of the script. This is one of the pillars of scalability. Futures are implemented in the dart:async library and used extensively throughout the dart:io library.

Futures not only return values ("future succeeds") but also errors ("future fails"). In order to handle a value or an error, receivers of a Future can register callback handlers that will be called once the value or error is available, respectively. In some ways, this is very similar to the onload( ) and onerror( ) functions in the JavaScript XMLHttpRequest object.

The result of registering a pair of callbacks is a new Future called the "successor." The two most important methods are: then( ) and catchError( ). The former is called when the calling Future, or "predecessor", completes with a value, and the latter handles any error it emits.

If no error callback has been registered when the Future throws an error, the error will bubble up to the global error handler. This is similar to a catch-all handler. The successor that handles a value is registered with the then( ) method, and the one that handles an error with catchError( ).
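
Here is a small sketch of registering both kinds of callbacks. The parsePort function is a made-up example; it relies on the fact that a Future created from a computation fails with whatever error that computation throws.

```dart
import 'dart:async';

// Completes with the parsed number, or fails with the error
// thrown by int.parse when the text is not a number.
Future<int> parsePort(String text) {
  return new Future(() => int.parse(text));
}

void main() {
  // The predecessor succeeds, so the then( ) callback runs.
  parsePort('8080')
      .then((port) => print('port: $port'))
      .catchError((e) => print('failed: $e'));

  // The predecessor fails, so then( ) is skipped and the error
  // is delivered to catchError( ).
  parsePort('not a number')
      .then((port) => print('port: $port'))
      .catchError((e) => print('failed: $e'));
}
```

Without the catchError( ) call in the second chain, the parse error would go unhandled and end up in the global error handler.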

Because every pair of callbacks registered on a Future produces a successor Future of its own, you have two options depending on which errors you want to handle: you can use sequential handlers or parallel ones. I think this makes the code a bit hard to follow sometimes, since you need to rely on "purposeful" indentation and use of braces. If a Future never completes, then no callback will ever be called.

Futures have a nice static method called Future.forEach( ), which you can use to invoke an asynchronous operation on each element of an Iterable or Collection. This can be used for modifying multiple files or broadcasting a message to multiple sockets.
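
A quick sketch of Future.forEach( ) in the spirit of the broadcasting use case. The send and broadcast functions are hypothetical; send merely pretends to deliver a message asynchronously.

```dart
import 'dart:async';

// Stand-in for an asynchronous operation such as a socket write.
Future<String> send(String recipient) {
  return new Future(() => 'sent to $recipient');
}

// Runs send( ) for each recipient, one after the other, and
// completes with a log of everything that was sent.
Future<List<String>> broadcast(List<String> recipients) {
  var log = <String>[];
  return Future.forEach(recipients, (String r) {
    return send(r).then((line) {
      log.add(line);
    });
  }).then((_) => log);
}

void main() {
  broadcast(['alice', 'bob']).then((log) => print(log));
}
```

Future.forEach( ) waits for the Future returned for one element before moving on to the next, so the log entries appear in the same order as the recipients.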

Here is an example of a simple use of futures and streams: a server and client.

/**
 * Dart Client
 * Author: Luis Atencio
 */
import 'dart:io';

void main() {

    Socket.connect(InternetAddress.LOOPBACK_IP_V4, 49633).
      then((Socket sock) {
        // Connect standard in to the socket, but only once the
        // connection is established; before that there is no
        // socket to write to.
        stdin.listen((data) =>
            sock.write(
              new String.fromCharCodes(data).trim() + '\n'));
      }).
      catchError((e) => print(e.toString()));

}


/**
 * Dart Server
 * Author: Luis Atencio
 */
import 'dart:io';
import 'dart:convert';

void main() {
  // Create a server socket and bind it to port 49633 on the loopback address
  ServerSocket.bind(InternetAddress.LOOPBACK_IP_V4, 49633).
    then((ServerSocket server) {
      server.listen(listenForRequests);
    }).
  catchError((e) => print(e.toString()));
}


void listenForRequests(Socket client) {
  
  print('Connection from '
      '${client.remoteAddress.address}:${client.remotePort}');
    
    BytesBuilder builder = new BytesBuilder();
    client.listen((List<int> buffer) {
        builder.add(buffer);
        String str = UTF8.decode(builder.takeBytes());
        print('String from client:  ' + str);
    }, onDone: () {      
      client.close();
    });    
}

Resources

  1. https://www.dartlang.org/
  2. https://www.dartlang.org/docs/tutorials/httpserver/
  3. http://www.w3schools.com/xml/xml_http.asp
  4. https://www.dartlang.org/docs/tutorials/futures/

Tuesday, September 23, 2014

The Dart VM

Overview

It's hard for me to imagine that one day JavaScript will be replaced by a better platform. It's one of my favorite languages to develop in, and one that has certainly withstood the test of time. Dart is an attempt to do just that.

Dart is a pure Object Oriented (OO), optionally typed language with influences from other languages such as: Erlang, Java, Smalltalk, and JavaScript.

The Dart VM is a standalone tool that can be used to run Dart programs from the CLI as well as in a browser. Unlike the Java VM, the Dart VM is not bytecode based; rather, it works directly on Dart source code. Since there is no compilation step, it is considered a Language VM. The downside of this approach is that, unlike with the Java VM, you cannot run other languages on the Dart VM. The reason it was designed this way is to match (or compete with, I should say) the ease with which JavaScript applications are loaded and run. The main goal is to keep the "edit-refresh-view" cycle that JS developers love intact. Only when testing in other browsers will you ever have to compile to JavaScript (unless the platform takes off and all major browser vendors adopt the Dart VM).

Dart runs in 2 modes:

  1. Checked: assignments are dynamically checked and certain violations raised as runtime exceptions. In addition, assert statements used to check boolean conditions are turned on in checked mode. This is the recommended mode during development. 
  2. Production: assert statements are ignored and static type annotations have no effect.

Dartium

Dartium is a special Chrome-based browser made up of Chromium and the Dart VM. In Dartium, you do not have to compile code to JavaScript until you are ready to test in other browsers. Dartium can be installed as a standalone browser or bundled as part of the Dart Editor. The executable is the Chromium executable.

At the time of this writing, it is recommended NOT to use Dartium as your primary browser, since it might have stability and security issues. In order to execute a Dart file, you can embed it in your HTML using the <script> tag with the application/dart type and a reference to your .dart file.

Resources

  1. https://www.dartlang.org/tools/dartium/
  2. https://www.dartlang.org/docs/dart-up-and-running/contents/ch04-tools-dartium.html