Welcome Developers


My name is Luis Atencio. Currently, I am Lead Software Developer at Citrix Systems. Programming languages, design patterns, and techniques are my passion. If you are interested in programming or just technology in general, please follow my blog!

Thursday, June 26, 2014

Notes on Continuous Integration: Scripting Your Build Process

Continuing the series on Continuous Delivery, which follows from my last post here, in this post I will talk about the importance of using a build tool to give structure to your build process and workflow.

Overview

Application systems become exponentially more complex as several factors increase: the size of the code base, the size of the development team, the number of third-party artifacts, and the number of external systems to integrate with, such as web services, databases, queues, and caches.

Building the artifacts of such an application is a multi-step task that will quickly become a real headache to carry out manually. Automating your build process via scripts or tools is essential for medium- to large-sized projects. Developers, QA, and Operations (and emerging DevOps) teams must work in unison to accomplish this. Continuous Delivery is the responsibility of the whole team.

Build tools typically follow a certain paradigm: they carry out ordered tasks that belong to different phases or goals. These tasks can be anything from source code checkout, compilation, static code analysis, and test execution to deployment scripts.




From the image above, your build script is a pipeline composed of a sequence of actions that must be executed predictably, in the right order, and that together constitute your application's backbone. A deployment pipeline will have different stages which might execute scripts to check out code, compile, run tests, perform code analysis, version artifacts, deploy, etc. Performing these steps manually is error prone and inefficient.

In the Java world there are many tools you can use: Ant, Maven, or Gradle, to name a few. In the PHP world it's pretty much Phing's game. 

In this post I will focus on Maven, since this is the tool I have the most experience with and have used successfully on many different types of projects. Maven provides a rich, declarative, extensible XML-driven domain model for building applications. With its convention-over-configuration paradigm, Maven can accomplish virtually any build task via its rich plugin ecosystem.

Maven can be downloaded here: http://maven.apache.org/download.cgi

Maven is a tool developed in Java, primarily for Java projects, and its power lies in its dependency management. Large Java deployments are typically multi-module in nature, and having a tool to manage all of your third-party as well as in-house modules and their transitive dependencies can save you a lot of pain. Transitive dependencies are easy to understand: if project A depends on artifact B, and B depends on artifact C, then by the transitive property project A depends on C. Maven handles all of this for you as well.
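
As a minimal sketch (the artifact coordinates are made up for illustration), declaring a single dependency in your pom.xml is enough for Maven to resolve that artifact plus everything it transitively depends on:

<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.mycompany</groupId>
  <artifactId>project-a</artifactId>
  <version>1.0.0</version>

  <dependencies>
    <!-- Hypothetical artifact B; Maven also pulls in B's own dependencies (e.g., artifact C) -->
    <dependency>
      <groupId>com.mycompany</groupId>
      <artifactId>artifact-b</artifactId>
      <version>2.1.0</version>
    </dependency>
  </dependencies>
</project>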

Every tool has its downsides, and Maven has its share of them. Although it does not occur too frequently, the biggest downside is that Maven will tend to update its core plugins without warning, making your builds somewhat unpredictable. Because Maven's core is really small by design, it relies on plugins to become a full-fledged build system, and a subset of these plugins might get updated on the fly at any time. Another downside is that Maven's language is an external DSL written in XML, called a POM file, which means that in order to extend it you must write custom extension plugins. The good part is that Maven's plugin ecosystem is enormous; there are plugins for every task you will perform on a typical Java enterprise project. As a result, nearly every vendor or tool provider has a Maven plugin available that integrates your tools and products together.

On the other hand, a build tool called Rake, the Ruby solution to Make, solves this problem by providing a native API for Ruby. Ruby is a very expressive language well suited for writing internal DSLs (I discuss internal DSLs here). Instead of writing XML elements, your build script becomes Ruby code, which gives you all of the power of a general-purpose programming environment: a debugger, code completion, code refactoring and modularization, class augmentation, etc.

A word of advice: Maven uses a feature called SNAPSHOTs. Snapshots are equivalent to Composer's "-dev" functionality for PHP applications. Maven will always check for a newer build of the same artifact version you are using and download it if one is available. I recommend using snapshots only during development and only for in-house artifacts. If you are using a third-party artifact, you will want to download it only once--avoid using SNAPSHOT versions for these. Third-party vendors can change the contents of a SNAPSHOT artifact at any time, which makes your build very unpredictable. Maven provides configuration to avoid snapshot downloads from third-party repositories.
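
As an illustrative sketch (all coordinates and URLs here are hypothetical), an in-house module under active development can be referenced as a SNAPSHOT while third-party libraries stay pinned to released versions, and a repository definition can disable snapshot downloads entirely:

<dependencies>
  <!-- In-house artifact still in development: Maven keeps checking for newer builds -->
  <dependency>
    <groupId>com.mycompany</groupId>
    <artifactId>billing-core</artifactId>
    <version>1.3.0-SNAPSHOT</version>
  </dependency>
  <!-- Third-party artifact pinned to a released version: downloaded once -->
  <dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.1</version>
  </dependency>
</dependencies>

<repositories>
  <repository>
    <id>third-party</id>
    <url>http://repo.mycompany.com/third-party</url>
    <snapshots>
      <!-- never download SNAPSHOT versions from this repository -->
      <enabled>false</enabled>
    </snapshots>
  </repository>
</repositories>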

Maven is used to drive the assembly of your application artifact, whether it is an EAR, WAR, or JAR file. Maven begins executing at the commit stage in the diagram above. Once the artifact has been built, it proceeds to deploy it to an artifact repository for later consumption. The following is a high-level diagram illustrating all of the phases of the lifecycle. You can attach plugins to perform tasks at every step.

Principles and Practices

Below I will discuss some important principles and practices to follow when you are creating your build scripts. Maven will help make this process straightforward and consistent. For more information on creating a deployment pipeline, you can take a look at this post: Deployment Pipeline. This will become important for architecting your scripts.

Create a Script for Each Stage

One way of organizing your builds is to write a script for each stage in your deployment process. This keeps your scripts clean and focused on the particular tasks of each stage. If you need to share information among the scripts, Maven lets you define a parent script (parent POM) that your individual scripts can inherit from. In Maven, the idea of writing scripts translates into configuring plugins. The plugin ecosystem is huge, and you can perform tasks such as compiling code, running tests, assembling code, copying resources, creating manifest files, versioning, source code checkout, code minification, etc.
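
A minimal sketch of this inheritance, with made-up coordinates: the parent POM holds the shared configuration, and each stage-specific script simply declares it as its parent:

<!-- parent/pom.xml: configuration shared by all build scripts -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.mycompany</groupId>
  <artifactId>build-parent</artifactId>
  <version>1.0.0</version>
  <packaging>pom</packaging>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
</project>

<!-- commit-stage/pom.xml: inherits everything defined above -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>com.mycompany</groupId>
    <artifactId>build-parent</artifactId>
    <version>1.0.0</version>
  </parent>
  <artifactId>commit-stage</artifactId>
</project>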

Use the Same Script to Deploy to Every Environment

Scripts used to deploy to development machines should be exactly the same as those used for QA, staging, and production environments. This ensures that your build process is tested thoroughly every step of the way. To achieve this you must externalize (or extract) configuration information from the scripts so that you can configure each environment in the same way; both the scripts and the configuration belong in source control. Maven can accomplish this with the Templates and Filters pattern.
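
One possible sketch of this, assuming hypothetical property names: keep per-environment values in profiles (or externalized property files) and let Maven's resource filtering substitute them into a shared template at build time:

<build>
  <resources>
    <!-- app.properties contains placeholders such as db.url=${db.url} -->
    <resource>
      <directory>src/main/resources</directory>
      <filtering>true</filtering>
    </resource>
  </resources>
</build>

<profiles>
  <profile>
    <id>qa</id>
    <properties>
      <db.url>jdbc:mysql://qa-db:3306/app</db.url>
    </properties>
  </profile>
  <profile>
    <id>production</id>
    <properties>
      <db.url>jdbc:mysql://prod-db:3306/app</db.url>
    </properties>
  </profile>
</profiles>

The same script then deploys everywhere; only the activated profile (for example, mvn package -P qa) changes.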

Oftentimes, the setup on a developer's machine is nowhere near the same as a production environment. For instance, you might have queueing systems, messaging systems, email servers, or databases that are configured far differently in production than in a local environment. In these cases, you need to look for simplified versions of these dependencies. Research tools such as in-memory databases, in-memory queues, mock e-mail servers, etc. This investment is well worth your time. If your application depends on components built in-house, it is essential that all build environments have access to them. With Maven, you can set up an external Nexus server. Nexus is an artifact repository (or package proxy) in charge of storing artifacts developed in-house. It can also act as a proxy for external repositories so that you can pull artifacts (JAR files) from the open source community--Apache, Spring, and Google Code, to name a few. This makes all components accessible to all build environments.
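
As a rough sketch (the repository URLs are invented), the build declares where to publish its artifacts, and running mvn deploy at the end of the build pushes the versioned artifact to the in-house repository where every other environment can fetch it:

<distributionManagement>
  <repository>
    <id>releases</id>
    <url>http://nexus.mycompany.com/content/repositories/releases</url>
  </repository>
  <snapshotRepository>
    <id>snapshots</id>
    <url>http://nexus.mycompany.com/content/repositories/snapshots</url>
  </snapshotRepository>
</distributionManagement>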

Packaging Systems

Whenever possible, use packaging systems for any artifacts related to the operating system and your application. I highly recommend that all your environments be Unix or Linux based. Package managers depend on the platform you are using: Debian and Ubuntu use the Debian package format (dpkg/APT), while RedHat, CentOS, and Fedora use RPM packages, typically managed through Yum.

Language platforms also have package managers; every major language has one: PHP has PEAR or Composer, Java uses Maven, Python uses pip, Perl has CPAN, Ruby uses RubyGems, and so on. I recommend treating your application as a package as well: as part of your build process, deploy the different versions of your application artifacts into the artifact repository. Using package managers makes your deployment process much easier to script--all of the installations become a set of commands against your package manager tool. If you require special installations of commercial software for which no packages are available, you will have to manage this exception as a manual step.

Make your builds idempotent

This is a nice way of saying: ensure that your build process leaves the build environment in the exact same state it was in before building. This is especially true for test artifacts like databases. If you need to set up databases to perform integration tests for your application, always make sure to remove any test artifacts created as part of the build. I've experienced this many times: developers tend to write unit tests that write data against databases (a terrible idea) and fail to properly clean up that data, which can cause subsequent builds to fail.

Script your deployments

The best way to script your deployment process is to use a Continuous Integration (CI) tool. There are many players in this field, including Jenkins, Hudson, and Bamboo. Some better than others, these tools provide an interface to set up different build plans that are composed of multiple stages. Depending on your deployment pipeline and the needs of your system, you should script each stage accordingly. These CI tools have a plugin architecture that allows you to run tools such as code analysis and unit testing, among others.

At work we have had a really positive experience using Bamboo. Bamboo is commercially available, but you can install Jenkins, which is free to use. If you don't have access to these tools for some reason, at the very least you should write your own scripts that perform this orchestration for you.

In later posts I will provide more details about our implemented deployment pipeline, scripts, and our use of Bamboo CI. 

Resources

  1. Humble, Jez and Farley, David. Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Addison-Wesley. 2011

Monday, January 13, 2014

Internal Domain Specific Languages

Overview

In this post I will discuss the use of DSLs and how they are an important tool in a programmer's toolbox. The ideas expressed in this blog come from personal experience and the theory from Martin Fowler's book "Domain Specific Languages."

If you think DSLs are not important, try validating a proper URL or email address without the use of regular expressions… I thought so… let's continue.

DSLs are used more and more in modern software development projects. They are necessary for systems in domains such as healthcare, insurance, and science, since they specialize in addressing the needs of modeling very specific domains. They have been widely adopted in technology as well, supporting designs for build systems, web frameworks, graphing utilities, testing frameworks, and much more.

I will start out with a basic definition of a DSL:

"A Domain Specific Language (DSL) is a computer programming language of limited expressiveness focused on a particular domain."

This definition encompasses some important aspects that are worth exploring further. Let's break it down:

1) DSLs are "computer programming languages," which means the language will need to be compiled, run, parsed, or interpreted one way or another by a computer. This does not imply the language must be Turing complete; preferably, it should not be--it should not be used for general-purpose coding.

2) DSLs are of "limited expressiveness," so as to not be confused with a general purpose programming language. They belong in the category of Ubiquitous Language or "mini-language" suitable for a very narrow purpose. They are not designed to build complete systems, yet you can combine DSLs to build many aspects of one.

3) DSLs are "focused on a particular domain," which means the language constructs used must support your domain and your domain only. For instance, unless you are writing a DSL to describe a mathematical application or solution, your DSL will not have syntax to support mathematical statements and functions.

Benefits of a DSL

DSLs have many benefits to enhance a system's architecture and design:
  1. Enhance communication with Subject Matter Experts (SME) or domain experts. Using a DSL can increase collaboration and involvement from SMEs as they can read and understand what is being coded. Perhaps they can begin to contribute to that part of the code as well. 
  2. Improve productivity by enhancing the readability of code and reducing the likelihood of bugs.
  3. Move your system architecture toward a more declarative model, which allows you to use the Semantic Model design pattern.
  4. Abstract poorly written APIs. Using an Adapter (Wrapper) pattern, the DSL creates a layer of abstraction that hides away all of the ugly nuances of poorly written code--like a veneer on top of the API.

Types of DSLs


   External

An External DSL is represented in a separate language from the main programming language it works with. Although external DSLs are not General Purpose Languages (GPLs), they have a formal grammar.

It may use a custom syntax or it may follow the syntax of another language, such as XML.
Examples of external DSLs include CSS, SQL, Awk, regular expressions, XML files for things like Hibernate, OSWorkflow, and Struts configuration, and Maven build files, among others.

XML languages are very commonplace in a software system. An XML file that abides by a schema and describes the stages of a state machine or the structure of a tree are good candidates to be DSLs. On the other hand, a simple XML file to store key/value pairs of configuration data probably would not be considered one.

Similarly, configuration data in .properties files, even though it has limited expressiveness and is focused on the domain, is not regarded as a DSL. A file with only assignment expressions lacks the fluency needed to pass for a DSL. YAML files, on the other hand, are much richer and more structured and can act as DSLs.

Implemented in a language outside of the host, external DSLs require you to develop a parser or interpreter for them. This is the reason most people like to express their domains in XML: parsing and reading XML are so widely supported. This translation can happen at build time or at runtime depending on the project. The resulting code from the parser is then executed. Typically, external DSLs will be translated or compiled into the source code language of the system itself and executed with it. You can use tools such as ANTLR as a good starting point to begin writing external DSLs.

One of the best examples of an external DSL I can think of is Graphviz (Graph Visualization Software). It is an open source package of visualization tools driven by a language called DOT. It is typically used to generate artifacts like topological graphs, UML diagrams, etc. For example, you can define graphs with nodes and edges in a DOT script. A parser and generator interprets the DOT script and generates a graph from it. Here's an example:

graph {
   a -- b;
   b -- c;
   a -- c;
   d -- c;
   e -- c;
   e -- a;   
   }

I have used it to draw inter-class dependency and inter-package dependency graphs. It's pretty simple to use. 
Furthermore, what counts as an external DSL and what does not is somewhat blurry. The boundary condition in an external DSL is the programming language itself. Languages can have a domain focus, for instance Matlab and R; the former is used mainly to express mathematical constructs and the latter is a statistics package, yet both can be considered full-fledged general-purpose languages. Even though they are focused on a particular domain, they are still very much general purpose and may not be regarded as DSLs, since they are not narrow in scope and syntax.

Within the External DSL umbrella, there is a category called fragmentary DSLs. Take a look at regular expressions and SQL as examples. Regex and SQL syntax are limited in expressiveness and very domain oriented; their syntax differs from the host language they are embedded in, and they are often seen scattered throughout the code instead of living in their own package or file.

   Internal

An Internal DSL (also known as embedded DSL or Fluent Interface) is a language represented within the syntax of a general-purpose language you are working with.

Basically, it's a stylized use of that language for a domain-specific purpose. You use a subset of the language's features in a particular style to handle one small aspect of the overall system. Some people probably already use them without realizing it.

Examples include code written in Lisp and Ruby for the most part, but internal DSLs are seen more and more in Java and PHP as well. Ruby on Rails, for instance, is a collection of DSLs. I do not have any first-hand experience with Rails, but I have heard at conferences and from colleagues that programming in Rails has a different feel than programming in straight Ruby. In Java or PHP, what you would typically see is the use of the Fluent Interface pattern to write in a stylized manner suitable for a DSL.

The boundary in an internal DSL lies in the nature of the language itself. A common way of documenting or describing an API is by listing all of its methods. With properly descriptive method names, a developer can understand what the purpose of a class is. Methods in a DSL, by contrast, only make sense in the context of a much larger expression that is being built to populate a Semantic Model--it is like stitching sentences together to tell a story. This is a snippet from a test automation DSL I am designing that populates a form and submits it:

getFormElement(By.name("form")).
    set(By.name("username"), "test user").
    set(By.name("password"), "test password").submit();
As mentioned above, one important part of building an internal DSL is to have a good Semantic Model or Domain Model for your application. Having a well-defined model allows you to develop and test the application separately from the DSL. The DSL merely acts as a mechanism for expressing how the model is to be configured and used.

Implementation of an Internal DSL

Because the topic of Domain Specific Languages is so broad, I decided to focus on implementing Internal DSLs.

Even though Java is probably not the ideal language for this, you can still achieve acceptable DSL-style coding with it. I'll start out with an example from a framework I use a lot called JMock:

mockContext.checking(new Expectations() {{

   atLeast(2).of(mockDao).getData(ID);
       will(returnValue(testObj));
  
   oneOf(mockDao).processData(testObj);
       will(returnValue(Status.SUCCESS));
    
   ...  
    
   allowing(mockDao).performCleanUp();

}});

Do not be too concerned with the syntax, but this is very typical of DSL writing in Java. As I mentioned before, there are other languages that lend themselves to expressing DSLs more concisely. In Java, there is a level of verbosity you must deal with before you can start to implement proper sentence-like code.

The important things to notice in this example are the style of coding, the indentation and placement of syntactic constructs, and the way methods are named and called. There are several ways to do this, all very similar in concept, which I will talk about in the next section.

Earlier we defined an Internal DSL as a stylized version of the host language used to increase the readability of your code. You are not obligated to use the host language itself, though. With the advent of JVM languages, embedded JRuby code within Java can also be considered an internal DSL. It's common to see polyglot applications that use several JVM languages together. The criterion is that your internal DSL be written in an executable language and parsed by executing it within that language.

Internal DSLs are much simpler to implement since you do not have to build a grammar and parser in order to make them executable. The parsing and generation of the syntax tree are already done by your host language's compiler or interpreter. However, an internal DSL is much more limited in expressiveness than its external counterpart, since you are bound by the host language's syntax rules.

Moreover, as I mentioned before, internal DSLs are typically implemented using fluent expressions that wrap the Semantic Model and provide a fluent facade (or language) on top of it. This is very beneficial: it is not necessary to understand how the model works in depth in order to write code that uses it; people who are not programmers can make sense of it as well; and, finally, it allows your model and language to evolve independently.

Techniques to write fluent code

As you all know, semantic models and traditional APIs are implemented using methods (I am assuming we're in OO land here, of course). Fluent expressions are built on them as well; the difference is in how you write and combine them.

In introductory programming courses, we are taught that functions should always describe actions, for example getItem( ), fetchRecord( … ), etc. In fluent interfaces, we use functions as building blocks to describe an overarching model; you are not executing actions in this case, you are describing the data and stitching together an expression. This is often known as Method Chaining and can be seen in popular frameworks like jQuery.

Here's an example in Java:

car()
   .transmission(Transmission.Manual)
       .speeds(6)
   .color(Color.Black)
   .make("Nissan")
        .model("370 Z")
   .wheels()
        .size(19)
   .end();


Another alternative that looks similar to this is Function Sequence, where every line is a separate function statement (ending in ";") indented appropriately for readability. This approach can be tricky because your functions are tightly coupled to each other (which breaks the whole "unit of work" paradigm) and you would have to ensure that they resolve properly, since the order in which you invoke them matters. There are techniques to do this. Also, your functions (and data) would have to be global in scope, which might or might not be a good idea. Method Chaining is better in this respect, as all of the functions and data are contained within the scope of the expression object. This gives rise to another approach called Object Scoping, in which you place the DSL script in a subclass of an Expression Builder, so that all of the functions and data are localized within the scope of the superclass (no globals). Combining a fluent expression with object scoping, you can use Function Sequence or Method Chaining within an object scope, as is the case with the JMock example I showed you earlier.
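
A minimal sketch of Object Scoping in Java (all class and method names here are invented for illustration): the DSL script lives in a subclass of an expression builder, so the builder's methods and state are in scope without any receiver or globals:

// Hypothetical expression builder: holds the DSL vocabulary and the state being built
abstract class CarBuilder {
    protected StringBuilder spec = new StringBuilder();

    protected void transmission(String type) { spec.append("transmission=").append(type).append(";"); }
    protected void color(String color)       { spec.append("color=").append(color).append(";"); }

    // subclasses write the DSL "script" here
    protected abstract void build();

    public String toSpec() { build(); return spec.toString(); }
}

// The DSL script itself: a function sequence written inside the builder's scope
class BlackManualCar extends CarBuilder {
    @Override
    protected void build() {
        transmission("manual");   // no receiver needed--inherited from CarBuilder
        color("black");
    }
}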

In languages like JavaScript or Ruby, you can use Literal Maps (or associative arrays) to build objects. JavaScript objects in JSON form are very common and have lots of support when it comes to parsing them. You can have something like the following to build the same car as above.

{car: {
   transmission: {
      "type": "Manual",
      "speed": 6  
   },
   color: "black",
   make: {
       "name" : "Nissan",
       "model": "370 Z" 
   },
   wheels: {
       "size": 19
   },   
  }
}

Even if your host language is Java or PHP, this is still considered an internal DSL since you are using the syntactic structure of a host language to define your DSL. 

Examples of combining Object Scoping with Method Chaining show that these patterns are not mutually exclusive; it all depends on what your needs are. In all of them, though, it is recommended to use strict indentation rules to convey the structure of the logical syntax tree of the DSL; otherwise it will be very confusing.

If you build objects with mandatory nested elements, perhaps Nested Function works well. In statically typed languages such as Java, IDEs can be a great tool to hint at what each argument contains. Here's a nested function example:

car(
   transmission(Transmission.Manual, 6),
   color(Color.Black),
   make("Nissan", 
        model("Z 370")),
   wheels(
        size(19))
   );

This will allow you to write your DSL without needing to use context variables. There are a lot of other benefits to using this approach. The nested functions actually reveal and enforce something very valuable to the understanding of the expression you are constructing: hierarchy. Since the functions are nested within each other (e.g., the transmission( ) function is called within the car( ) function), you are effectively building the syntax tree of the DSL. In addition, since method evaluation order executes method arguments first, you are basically building your model from the smallest pieces up to the big ones.

There are drawbacks to Nested Function:
  1. Because of method evaluation order, you are basically building your model backwards.
  2. It can be very intuitive to build objects this way if you are thinking in terms of a syntax tree, yet very confusing if you see it as a sequence of commands.
  3. You cannot easily build big structures, because you would need functions with many arguments, which are hard to test and use.
  4. The arguments are identified by position instead of name. If you have a language that supports named arguments, that's great. Otherwise, if you have a function like wheelSize( ) with two arguments, radius and depth, both might be doubles, and you would have to rely on documentation and IDE tooling to prevent you from building the wrong thing.

In sum, for objects with a simple structure, Method Chaining is appropriate. If you are specifying a collection of objects of the same type, then either a Literal List or even a hybrid of Method Chaining + Function Sequence works. For a collection of different objects with subelements, I would recommend a Literal Map structure.

In internal DSLs, closures play an important role, and the lack of them can limit the expressiveness of a language. In languages that do not support closures, such as Java (lambdas are still experimental at the time of this writing), writing Nested Function DSLs can be quite noisy and verbose.

Another form of DSL writing is Literal Extension, a form of language enhancement. Some languages like JavaScript and Ruby support this by allowing you to extend language literals. Java does not support this, as its built-in types are closed by design. The danger here is that it enhances types at a global level instead of within the limited context of your DSL, which is undesirable. This is a feature you typically will not need, but it is very cool to have when you do, since it gives you the impression of customizing the host language.

In JavaScript you can do some cool things with Literal Extension:

Number.prototype.tiresOn = function (car) { /* . . . */ };
Number.prototype.dollars = function ( )   { /* . . . */ };


thisCar = new Car();
  // wrap numeric literals in parentheses so "4." is not read as a decimal point
  with((4).tiresOn(thisCar));
  worth((50000).dollars());

For the test automation framework that I am designing, I decided to use a hybrid of Function Sequence + Method Chaining + Object Scoping. Here's a more complete example:

public class SignInTest extends Tests.Simple {

  @Test
  public void testSignIn() {

   use(Browser.FIREFOX);

   beginAt("http://www.mycompany.com/signin");

   
   getFormElement(By.name("form")).
      set(By.name("username"), "test user").
      set(By.name("password"), "test password").submit();
   
   waitFor(10).until(pageTitle.startsWith("My Homepage"));

   assertTitleEquals("My Homepage | Welcome");    
  }
}

Like many other frameworks of its kind, this DSL is designed to wrap the Selenium test API, which can be daunting to use for someone not very familiar with programming in Java. You would have to use loops, iterators, anonymous inner classes, and other constructs to write the same test. This DSL abstracts all of that away to make the task a little simpler.

You can see the use of Function Sequence very clearly, since my expressions are not all completely tied together. I used Method Chaining to handle related operations, such as filling and submitting a form, all in one expression. Moreover, by extending the class Tests.Simple I was able to expose protected-level state variables such as pageTitle. In addition, this allows engineers to use API methods such as use(Browser) and beginAt(String) without declaring or creating an object. This is Object Scoping using inherited variables and methods.

A quick word about Code Generation

Writing a DSL to populate a semantic or object model and then directly executing it is referred to as "interpretation." Typically in this approach, all of the steps happen at runtime as a single process, similarly to how a PHP or Python interpreter works. You write your code, populate your data, and then you execute it. Internal DSLs support this out of the box because they live within the host language the system is written in.

The counterpart of interpretation is "compilation." In this approach, we parse some program text and produce an intermediate executable output. In the DSL world, compilation is often referred to as "code generation." This is the norm in external DSLs. The model (typically packaged as a library) and the parser need to be compiled together. We run the parser to generate the code that is to be run. Lastly, the model and the generated code are run together to populate the model and produce your desired result. This makes your build process much more complex.

Which approach you take depends on what type of DSL your application needs. The higher the degree of communication needed between developers and SMEs, the greater the tendency toward an external DSL should be.

Conclusion

As we saw in this post, DSLs are really useful for bridging the gap between a developer's and a subject matter expert's understanding of system requirements.

A good deal of momentum for internal DSLs comes from Ruby and the advent of Rails, since Ruby's syntax lends itself to writing in this style. I like the metaphor of bending the host language to express your domain: basically, you are twisting the style rules we have been taught and have seen for so many years in order to gain fluency and expressiveness.

On the other hand, external DSLs have many applications in industry. They are used to close the gap between subject matter experts and application developers, and also to close the gap between developers and designers; CSS is a really good example of this. You often find that people who write CSS say they do not consider themselves programmers. Another example in the industry is the BPEL language, an orchestration language for defining business processes backed by web services.

Choosing between an internal and an external DSL will depend on the nature of the problem you are tackling and how you want to communicate with the domain experts. Do not try to solve every problem with DSLs; they are powerful because they are narrow in focus. DSLs are usually built incrementally as the domain and your understanding of the problem evolve.

DSLs are not a paradigm shift in the way we reason about software. Domain specific languages have been around for many years. You can think of HTML as the DSL of web page layout.

Finally, do not confuse the benefits provided by the model with the benefits provided by the DSL; they are separate concerns. The idea of an internal DSL is to improve readability and to hide the internal workings of the application code.


Resources

  1. Fowler, Martin and Parsons, Rebecca. Domain-Specific Languages. Addison-Wesley. 2011
  2. http://jmock.org/oopsla2006.pdf
  3. http://javieracero.com/blog/internal-vs-external-dsl
  4. http://en.wikipedia.org/wiki/Business_Process_Execution_Language
  5. http://www.infoq.com/news/2007/05/jruby-dsl

Thursday, October 31, 2013

Exploring Python

Overview

I've been playing with Python recently and have come to appreciate a lot of its features, some of which I wanted to share with all of you, as I did with my development team. This is not a beginner's tutorial on Python, but merely an informal account of my own impressions as I compare Python to some of the other languages I am more familiar with.

Python is a free-to-use, general-purpose programming language. It was designed to be a sort of middleware language between the shell and the system, initially intended for DevOps engineers and system administrators, so it's no surprise that it is available out of the box in all Linux distributions. However, it has evolved to be much more than that, to the point where entire enterprise-level applications are built on top of it.

Python runs as interpreted byte code (an idea somewhat similar to the Java JVM) and also runs on Windows and Mac. There are also language ports that let you mix Python with other platforms, such as Jython (running on the JVM) and IronPython (running on the .NET CLR).

Overall, Python is a dynamically typed language with elegant and concise syntax, a powerful data structure library, garbage collection, and a big community of developers supporting it. It was ranked 8th in the TIOBE index of popular programming languages. It offers the best of both worlds: you can use it for proof-of-concept applications (fast prototyping) or to build entire systems. It also has ample support from PaaS cloud providers such as Heroku and Google App Engine.

How's that for a summary...

Scoping

Python modules are very similar to namespaces in other languages. Like all Python code, they execute top to bottom from the moment they are imported or read in. At that point, the module's functions, classes, and variables are in scope.

One thing that distinguishes Python from other programming languages is that you can conditionally include Python modules, meaning you can actually wrap imports in if-else statements; this is typical of scripting languages. In addition, you can choose to read in the entire module or just pieces of it, for instance:


import logging
or 
from logging import *
will read in Python's built-in logging module and all its artifacts. More specifically, I can type:
from logging import Logger

which will just bring the Logger class into the scope of your script. 

Functions also create scope. The function scope starts when the function runs and gets destroyed when the function returns (unless you use generators, more on this soon). Recursive function invocations will create their own namespaces. 

Variables in Python are implicitly global, meaning that if you assign a value to a variable at the top level of your script, that variable is accessible (global scope) from that moment on--unless the variable is assigned a value within a function or class, in which case it is local to that function or class. There is a caveat to this: Python supports the keyword global. To modify a global variable inside a function (rather than accidentally shadowing it), you prepend the variable with the keyword global. If you do this, you have access to the globally declared variable instead of defining a new local variable inside the function. Abusing this keyword, however, can make code really hard to follow, so I would only do it in edge cases.
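
A minimal sketch of the difference (the variable and function names are made up):

counter = 0  # implicitly global: visible from here on

def increment_local():
    counter = 1          # assignment creates a new local variable; the global is untouched

def increment_global():
    global counter       # refer to the module-level variable instead
    counter += 1

increment_local()
increment_global()
print(counter)           # prints 1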

Functions

Functions are objects in Python, very similar to JavaScript in this respect. This is really nice if you like to code in a functional programming style: functions can be passed to other functions, returned from functions, and assigned to variables. As in any other language, a function is just a group of code with tight local scope that performs a repeatable task. You define your own with the keyword def and a label; that label acts as a reference (alias) to the function object.

Additionally, Python supports the concept of an anonymous function, in this case called a lambda. Lambdas don't have a return statement but instead return the value of a single expression. Like named functions, they can be assigned and passed around in the exact same way. I would not use lambdas to replace functions; they have different purposes. A lambda expression complements a function very well in cases where you need to perform a very specific operation on a set of values, e.g., compute prime numbers, increment, decrement, etc. Typically, you will see lambda functions used in conjunction with Python's built-in functions filter, map, and reduce.

# keep x if it is the current divisor i itself, or if it is not divisible by i;
# after the loop, nums holds 1 and the primes below 100
nums = range(1, 100)
for i in range(2, 8):
   nums = filter(lambda x: x == i or x % i, nums)
Another interesting feature of Python is the ability to create functions that accept a variable number of arguments and keyword arguments. You will find some of this in languages like PHP and JavaScript. In Python, you can specify that a function accepts a variable number of positional arguments by using the *args function parameter, which binds to a tuple of arguments of any length. In addition, you can also accept a set of keyword arguments by using the **kwargs parameter. The names "args" and "kwargs" are just conventions; it's the "*" and "**" that tell the language to behave this way. A function will look like this:


def myFunc(self, *args, **kwargs):
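
A small sketch of how these bind (the function name is made up):

def describe(*args, **kwargs):
    # args is a tuple of positional arguments, kwargs a dict of keyword arguments
    print(args)     # (1, 2, 3)
    print(kwargs)   # {'color': 'red'}

describe(1, 2, 3, color='red')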


Decorators

This is one of my favorite language features in Python. The things you would need to do in other programming languages to get something like this are complicated and laborious. As the name suggests, decorators allow you to wrap or "decorate" function invocations. Let me provide an example of a function tracer:

import logging

# configure the logger
FORMAT = 'WARN %(message)s'
logging.basicConfig(format=FORMAT)
logger = logging.getLogger('decorators_2')


count = 0  # global definition outside of function

def trace(myfunc):
 '''
  Function tracer
 '''
 def inner_func(*args, **kwargs):
  global count
  logger.warning('Trace ' + str(count) + ': entering...')
  myfunc(*args, **kwargs)   # invokes original function
  logger.warning('Trace: leaving\n') 
  count += 1   
 return inner_func    

@trace
def some_func_1():
    print('some_func_1')
    
@trace    
def some_func_2():
    print('some_func_2')

def run():
 some_func_1()
 some_func_2() 
 some_func_1()



This snippet of code shows how I can provide logging as the function is entering and before it returns. For things such as debugging or tracing this can be very useful.

You can provide multiple levels of decoration and wrapping, though it can actually get pretty hard to trace and debug. Basically, when a function is decorated, Python passes it to your decorator function, and the wrapper the decorator returns is what gets invoked on every call. To do something like this in Java, you would probably have to use AspectJ and set up a tracing aspect with all of your pointcuts defined. Then you would have to re-compile the code using the AspectJ compiler; definitely something to think about twice before implementing it.

Classes and Inheritance

Python is an object-oriented language, which means it has support for polymorphism, inheritance, encapsulation, and abstraction.

Polymorphism

Since Python is a dynamically typed language, we sometimes take polymorphism for granted. In other words, we don't care as much about type-checking here as we do in other languages. In Python, aside from doing some reflection work or tests, there is no real need to check for types in production-level code.

You will rely on duck typing for polymorphic behavior. This mechanism is different from languages such as Java, where polymorphic behavior (called inclusion polymorphism) can only happen through method invocation on classes belonging to the same inheritance path or related by some common super class. Also, you don't have to worry about parametric polymorphism and concepts such as variance, covariance, and bounded quantification, so understanding polymorphism in Python is much easier.


class AlbacotRanger(object):
    def quack(self):
        print "Quack like an Albacot Ranger Duck!"

class AnconaDuck(object):
    def quack(self):
        print "Quack like an Ancona Duck!"

# module-level function: it only cares that the argument has a quack() method
def quackAsADuck(typeOfDuck):
    typeOfDuck.quack()

alba = AlbacotRanger()
ancona = AnconaDuck()
quackAsADuck(alba)
quackAsADuck(ancona)

In the example above, you can see that the classes AlbacotRanger and AnconaDuck do not share a common super class that defines the method quack( ). In Java, this would not be allowed. In Python, the interpreter resolves the call at runtime and only cares that a method quack( ) exists at that moment. Also remember that in Python you can actually remove method and variable definitions from an instance at runtime, so checking for this stuff up front is pointless. Under duck typing: if it quacks like a duck and acts like a duck, it is a duck.

In addition, if both classes were to share a parent class, say Duck, that defines a method quack( ), then you should expect this to work as well, in a manner similar to Java.

Inheritance

Python supports inheritance just like many other general-purpose programming languages do. Its support for multiple inheritance, however, goes beyond implementing multiple interfaces: you can actually extend multiple concrete classes--pretty insane. This topic can get pretty intense. Multiple inheritance is not a trivial problem to solve; you can think of it as recursive member resolution starting from the child and working its way up the inheritance tree. In other words, a method or member variable is looked up in the derived class first and, if not found, the search recurses up the base classes until reaching the root of all classes, object. For all intents and purposes, think of every method in Python as being virtual.
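
A minimal sketch of extending two concrete classes (the class names are invented):

class Engine(object):
    def start(self):
        print('engine started')

class Radio(object):
    def play(self):
        print('radio playing')

# Car inherits concrete behavior from both base classes
class Car(Engine, Radio):
    pass

c = Car()
c.start()   # resolved on Engine
c.play()    # resolved on Radio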

In order to keep up with the times, support for classes changed after Python 2.2 in a way that more closely resembles other OO languages out there. These are called "new-style classes." In old-style classes, method resolution is done very simply: a depth-first, left-to-right class scan; in new-style classes, method resolution is a bit trickier. This was done to account for multiple inheritance and to support cooperative calls to super( ). super( ) is only available in new-style classes.

In new-style classes, method resolution is done using the C3 linearization, or Method Resolution Order (MRO), algorithm as pioneered by the Dylan language. Being a true multiple inheritance language, Python deals with the inheritance diamond issue by dynamically linearizing the search order so that left-to-right ordering can occur. For extensive details on this, you should read the Dylan paper (resource below).

Behind the scenes, Python stores this order in a class attribute called __mro__, which you are not supposed to mess with unless you know what you are doing. This order can change to support dynamic reordering of classes. A call to super( ) basically returns a proxy to a parent class, which is useful for calling base methods that have been overridden in derived classes. super( ) takes a second argument to qualify the instance you are referring to; it can be a class or an object. To properly design your classes for cooperative calls to super( ), visit the article called "Python's super( ) considered super!" in the resources section.

Word of caution: old-style classes will eventually be deprecated, so you should never use them in production-level code. Stick to new-style classes.


Encapsulation

One of the core principles of designing good APIs is to avoid exposing unnecessary internal state. Unfortunately, the notion of making things "private" as in other languages does not exist in Python. But because encapsulation is such a common and recommended practice, Python has limited support for it through name mangling.

If you follow the practice of prefixing variable and method names with at least two underscores, then Python will textually replace that name to include the class name. For instance, the variable __foo will be replaced with _classname__foo; this will kind of "hide" access to that variable and avoid intra-class collisions with other identifiers. This happens irrespective of the syntactic position of the identifier, so long as it occurs within the class.
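
A small sketch of what the mangling looks like (the class and attribute names are made up):

class Account(object):
    def __init__(self):
        self.__balance = 100       # stored internally as _Account__balance

acct = Account()
print(acct._Account__balance)      # 100: the mangled name is still reachable
# print(acct.__balance)            # would raise an AttributeError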

Another part of encapsulation is the ability to make state read-only. Since there is no "final" property concept in Python, you can implement read-only behavior or copy-on-access by using decorators from the abc module. The next section expands on this.


Abstraction

Abstract classes are implemented significantly differently in Python from other programming languages. They are not native, yet supported via the abc module, which stands for Abstract Base Classes.

Classes become abstract when they declare a field called __metaclass__ = abc.ABCMeta. By doing this, Python enhances the class and adds extra metadata and functionality.

As you would expect, ABCs can be subclassed directly (by a regular inheritance statement), or you can use the register( ) method of abc.ABCMeta to declare unrelated concrete classes as subclasses of your ABC, effectively making them "virtual subclasses" of your ABC--this is pretty unique to Python. If you then perform the issubclass( ) test, it comes out positive.

The difference between using register( ) and normal inheritance is that the registering ABC will not factor into the Method Resolution Order (MRO) of the registered classes, so calls to super( ) referring to a method in your ABC are not possible.

So, in a way, an Abstract Base Class is like a template that enhances the derived class. We are used to classifying inheritance relationships semantically as IS-A, but here that is not necessarily the case; any class can be registered with your ABC--a "virtual" IS-A.

Examples of abstract classes are present in the collections module and the numbers module. 

Let's take a look at a short example:



from abc import *

class MyIterable(object):
    def __getitem__(self, index):
        pass
    def __len__(self):
        pass
    def get_iterator(self):
        return iter(self)

class BaseIterable:
    __metaclass__ = ABCMeta

    @abstractmethod
    def __iter__(self):
        while False:
            yield None

    def get_iterator(self):
        return self.__iter__()

def run():
    BaseIterable.register(MyIterable)
    print issubclass(MyIterable, BaseIterable) and "Is Subclass" or "Not a subclass"

if __name__ == '__main__':
    run()


In the example above, I created a MyIterable class and registered it as a virtual subclass of the BaseIterable abstract base class (the naming preserves IS-A). MyIterable is now recognized as a subclass of BaseIterable and, therefore, the issubclass( ) test passes.

Furthermore, ABCs can also declare abstract methods and abstract properties. The @abc.abstractmethod decorator can be used to mark methods as abstract, meaning they must be overridden by concrete classes. If your ABC declares at least one abstract method or property, it cannot be instantiated directly. Even though abstract methods may contain implementation code, a class derived from an ABC cannot be instantiated unless all abstract methods and properties have been overridden; otherwise, you will get a TypeError. You can always invoke the base class method by calling super( ). Finally, the @abc.abstractmethod decorator only affects classes derived via regular inheritance; "virtual subclasses" created via register( ) are not affected by it.

Unlike in other programming languages, you can also define abstract properties by using the @abc.abstractproperty decorator. In its long form, this decorator takes functions that define get, set, and delete behavior for a property. As with @abc.abstractmethod, using this decorator requires your class to use the ABCMeta metaclass. Abstract properties are less common than abstract methods, so perhaps an example will provide a better explanation. With this you can easily create read-only properties, as such:


import abc

class Base(object):
    __metaclass__ = abc.ABCMeta

    def value_getter(self):
        return 'Should never see this'

    def value_setter(self, newvalue):
        return

    # long form: pass the getter and setter functions explicitly
    value = abc.abstractproperty(value_getter, value_setter)

class Impl(Base):

    @property
    def value(self):
        return self.__x   # read-only: no setter is defined

Using the long form allows you to pass in the getter and setter functions.

Generators

Generators are a powerful tool for creating iterators. By using the yield keyword, you can create (and return) data piecewise as you produce results, thereby generating a new iterator. In this example I am processing a list of names and generating an iterator over the names that start with a given letter:

users = ['Luis', 'Camilo', 'Marta', 'Lucia', 'Natasha', 'Dave', 'John', 'Mitch', 'Lalo']

def findNameStartsWith(users, letter):
    for (index, user) in enumerate(users):
        if user[0] == letter:
            yield (index, user)


def run():
    # find the matching users and process them lazily
    global users
    matches = findNameStartsWith(users, 'L')

    for match in matches:
        print 'Found: ' + str(match)

Basically, a generator is any function that uses yield. This keyword lets you "return" a value from a function yet preserve the state of the currently executing function at that point in time, so that execution can continue from there on subsequent calls.

Last note

This post basically summarizes my impressions of using Python as compared to other, more popular programming languages. Even though it was designed to be a middleware language, Python is super powerful and offers many good features that make it a compelling language for the enterprise. It has support for the desktop and the web, where it's most popular. Python on the web can be implemented in multiple ways: you can use an Apache extension like mod_python to run Python within an Apache process; however, the norm nowadays is to use a WSGI-compatible web framework. WSGI is a standard interface between Python web applications and web servers. Its biggest proponent is the Django framework, which in my opinion is as brilliant for Python as Ruby on Rails is for the Ruby community.


Resources

  1. python.org
  2. http://en.wikipedia.org/wiki/Polymorphism_(computer_science)
  3. http://rhettinger.wordpress.com/2011/05/26/super-considered-super/
  4. http://www.youtube.com/watch?v=E_kZDvwofHY
  5. http://www.youtube.com/watch?v=23s9Wc3aWGY
  6. http://tech.blog.aknin.name/2010/04/02/pythons-innards-introduction
  7. https://developers.google.com/appengine/docs/python
  8. http://docs.python.org/2/tutorial/classes.html
  9. http://www.python.org/download/releases/2.3/mro/
  10. http://docs.python.org/2/library/functions.html#super

Thursday, September 19, 2013

Standard PHP Library (SPL)

Overview

The Standard PHP Library (SPL) is a collection of Classes and Interfaces created to solve common programming problems. Particularly, those of traversing aggregate and recursive data structures such as XML trees, arrays, lists, database result sets, directories, etc. All of this is done via the use of Iterators. This library is analogous to the prominent java.util.* Java API.

In my experience, most PHP developers have failed to realize the power and structure that SPL can bring to their code. It is structured around the Decorator (also known as Wrapper) pattern throughout the library, and when used properly it can make your code very readable, structured, and extensible.

This library has been built into PHP since 5.0.0. It does not require any separate installation or special configuration. So, let's jump right in:

Before we dig into any library class, let's briefly talk about the interfaces that support all of this and form the core of the entire library:
  • Countable: classes implementing this interface can be passed into the count( ) function.
  • Traversable: marker interface to detect whether a class can be used in a foreach( ) construct.
  • Iterator: classes implementing this can be iterated internally. If you wanted to extend SPL with your own custom iterator, you could implement this interface.
  • ArrayAccess: provides access to object state as an array.
  • Serializable: used to serialize and unserialize objects. Classes implementing this interface will no longer support the __wakeup( ) and __sleep( ) functions; the serialize( ) and unserialize( ) functions are used instead.

Iterators

SPL provides standard iterators to be used with Arrays and Objects built on the interfaces I mentioned above. Typically, these work when called within the context of a foreach( ) statement. Let's start with a simple ArrayIterator example:


class SkipMapIterator extends ArrayIterator { 
 
 private $func; // map function func(x)
        private $skip; // skip factor

 public function __construct($iter, $skip, $func) { 
  parent::__construct($iter); 
                $this->skip = $skip;
  $this->func = $func; 
 } 

 public function current() { 
  $x = parent::current(); 
                $pos = parent::key();
                if($pos % $this->skip == 0) {
                   return call_user_func($this->func, $x); 
                } 
  return $x;
 } 
} 


$numbers = array(1,2,3,4,5,6,7,8,9);

$squareIter = new SkipMapIterator($numbers, 2, 
          function($x) {
  return pow($x,2);
   });

echo '[';
 foreach($squareIter as $key=>$num) {
  echo $num;
  if($key < $squareIter->count() -  1){
   echo ',';
  }
 }
echo ']';

In the example above, I subclassed the ArrayIterator class to create a SkipMapIterator that takes a function and applies it to the elements of the list, skipping every "$skip" elements. As the list is iterated with a skip factor of 2, the function is applied to every other element.

The output is the following:

[1, 2, 9, 4, 25, 6, 49, 8, 81]

It skips over every other element. This operation is very similar to the array_walk( ) function but gives you a lot more control. A very trivial example indeed, but it proves the concept. Imagine having a list of file names that must be updated together: loop over the file names array, and the function performs the same update on every file.

Let's take a look at another example:

class RecurringEventPlanner  { 
 
 private $duration; // The duration of the recurring event
 private $day;      // Day to plan event
 private $name;     // Name of the event
 
 private static $DAYS  = array('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday');
  
 public function __construct() { 
  $this->name = 'Unknown Event';  
  $this->day = 'Sunday';    // Default to day 'Sunday'
  $this->duration = 7;   // 1 week
 } 
 
 public function withName($name) {
  $this->name = $name;
  return $this;
 }
 
 public function withDuration($duration) {
  $this->duration = $duration;
  return $this;
 }

 public function onDay($day) {
  $this->day = $day;
  return $this;
 }
 

 public function printEvent() { 
  
  $weekly = new InfiniteIterator(new ArrayIterator(self::$DAYS));
  $event = new LimitIterator($weekly, 0, $this->duration);
  foreach ($event as $day) {
   print 'On '. $day. ': ';
   if($day == $this->day) {
    print '******'. $this->name. '******'; 
   }
   else {
    print 'Nothing happens';
   }
   print "\n";
  }
 } 
} 

$sundayMass = new RecurringEventPlanner();
$sundayMass->withName('Sunday Mass at 9:00 am');
$sundayMass->withDuration(14); // 2 weeks
$sundayMass->printEvent();

As I mentioned at the beginning of this post, one of the core principles behind the SPL APIs is the use of the Decorator (Wrapper) pattern. If you take a look at the printEvent( ) function above, I wrap iterators within other iterators, enhancing the functionality with every wrapping class. I began with a simple ArrayIterator to loop over the days of the week. I wrapped this iterator with an InfiniteIterator that allows me to endlessly iterate over the days of the week. Of course, I can't just iterate forever, so I wrapped that iterator further with a LimitIterator so that the user can bound the duration of the event. The RecurringEventPlanner class will print a weekly schedule with the event for a given number of days (in our example, two weeks).

Data Structures

Iterators are just one part of the SPL library; let's take a look at some of the data structures it provides. On a daily basis you are most likely to reach for a stack, a queue, or a heap, so let's focus on those.

Implemented on top of a doubly linked list, SplStack is a data structure with Last-In, First-Out (LIFO) behavior and the typical push( ) and pop( ) functions. Notice that pushing items does not position the stack's internal iteration pointer for you; to print all of the contents of the stack, you must rewind this pointer first.

$stack = new SplStack();
$stack->push('First In, Last Out');
$stack->push('Second In');
$stack->push('Last In, First Out');

$stack->rewind();

while($stack->valid())
{
    echo $stack->current(), PHP_EOL;
    $stack->next();
}

The stack is a very useful programming structure. Alongside SplStack is SplQueue, which provides First-In, First-Out (FIFO) behavior (the code is basically the same). Much more interesting and powerful are the SplHeap structures. Heaps are tree structures that keep their nodes ordered according to a heap property, and they come in SplMaxHeap (maximum key at the top) and SplMinHeap (minimum key at the top) flavors. I won't spend time on the internals of these data structures; there is extensive material on the subject out there and a lot of math involved. Suffice it to say, they are ready to use in this library so that you don't have to implement them yourself.
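As a quick, minimal sketch of the simpler structures: SplQueue exposes enqueue( )/dequeue( ) for FIFO behavior, and SplMinHeap keeps the smallest key at the top and yields its elements in ascending order as you iterate.

$queue = new SplQueue();
$queue->enqueue('first');
$queue->enqueue('second');
echo $queue->dequeue(), PHP_EOL; // first

$heap = new SplMinHeap();
$heap->insert(5);
$heap->insert(1);
$heap->insert(3);

// iterating a heap extracts its elements, smallest first
foreach ($heap as $value) {
    echo $value, PHP_EOL; // 1, 3, 5
}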

Let's pick a more complex structure from the list: SplPriorityQueue. A priority queue is implemented internally using a max heap and exposes a compare($priority1, $priority2) function that is used to order the keys within the underlying heap. By default, if no comparison function is provided, it uses the natural ordering of the keys: strings are compared lexicographically, numbers numerically, and so on. When you use something more complex as the priority, you need to provide your own comparison function. Let's look at an example: suppose you are implementing a print queue that routes jobs to printers by priority:


class PriorityPrintQueue extends SplPriorityQueue
{
    private $prioriTable;

    public function __construct() {
        $this->setExtractFlags(SplPriorityQueue::EXTR_BOTH);
        $this->prioriTable = array(
            '/10\..*/' => '1',
            '/20\..*/' => '2',
            '/30\..*/' => '3',
            '/40\..*/' => '4'
        );
    }

    public function compare($priority1, $priority2)
    {
        if ($priority1 === $priority2) {
            return 0;
        }
        return $priority1 < $priority2 ? -1 : 1;
    }

    public function add($printer, $host) {

        $priority = 0;

        // search for the correct priority depending on host
        foreach ($this->prioriTable as $h => $p) {
            if (preg_match($h, $host) > 0) {
                $priority = $p;
                break;
            }
        }
        $this->insert($printer, $priority);
    }

    public function sendJob($jobName) {
        $this->top();
        $printer = $this->extract();
        echo 'Sending job to printer: ' . $printer['data'] . "\n";

        // route job...
    }
}

$printQ = new PriorityPrintQueue(); 

$printQ->add('Apple Laser Printer',   '40.344.23.233'); 
$printQ->add('HP Scanner Photosmart', '10.20.30.40'); 
$printQ->add('Logitech All in One'  , '30.50.62.77'); 
$printQ->add('Caselogic Printer',     '20.234.900.765'); 

$printQ->sendJob('Job 1');
$printQ->sendJob('Job 2');
$printQ->sendJob('Job 3');
$printQ->sendJob('Job 4');

Running this code will print the following:

Sending job to printer: Apple Laser Printer
Sending job to printer: Logitech All in One
Sending job to printer: Caselogic Printer
Sending job to printer: HP Scanner Photosmart

As you might expect, the first job is routed to the Apple Laser Printer because its host IP is ranked highest in the PriorityPrintQueue priority table. In our compare( ) function, we defined the higher number to mean higher priority. We could certainly flip that if we wanted to.
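For instance, a hypothetical LowNumberFirstQueue subclass (not part of the example above, just a sketch) could flip the comparison so that jobs are routed to the lowest-numbered hosts first:

// lower priority numbers now win
class LowNumberFirstQueue extends PriorityPrintQueue
{
    public function compare($priority1, $priority2)
    {
        if ($priority1 === $priority2) {
            return 0;
        }
        return $priority1 > $priority2 ? -1 : 1;
    }
}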

Autoloading

Autoloading is a native PHP mechanism: when a class is referenced that has not yet been defined, PHP invokes the __autoload( ) function to try to load it. This is really useful when writing object-oriented code, where you typically have one class per PHP file.

You can define this function yourself and provide your own autoloading logic; PHP will look at your script and invoke your autoload function whenever it needs to load a class. Relying on a single global __autoload( ) is discouraged and considered bad practice, to the point where it might be deprecated in future releases. However, done properly, autoloading is really useful, as it can eliminate the clutter of many require* and include* statements at the beginning of your scripts.

In PHP 5, SPL provides a more flexible autoloading mechanism, spl_autoload_register( ), which lets you take advantage of autoloading without breaking your entire application. This practice is followed heavily by tools such as Composer, a PHP dependency management tool that, if configured to do so, automatically provides the scaffolding necessary to autoload your classes into your application without you having to explicitly require/include all of the files.

// init.php

// register the default spl_autoload( ) implementation; do not throw on failure
spl_autoload_register(null, false);

// file extensions the default autoloader will try
spl_autoload_extensions('.php, .class.php, .lib.php');

// custom class loader that looks in the src/ folder for the
// class file of any class being instantiated
spl_autoload_register(function($class) {
    $filename = $class . '.class.php';
    $file = 'src/' . $filename;
    if (!file_exists($file)) {
        return false;
    }
    require_once $file;
});


-------------------------------------------------------------------------------
// main.php

require_once('init.php');

// instantiate MyClass without having to include the class file in this script
$hello = new MyClass();

// say hello!
$hello->sayHello();


In the example code above, I registered a custom class loader function via the SPL autoloading mechanism. This function is called internally every time a class needs to be loaded, without my having to explicitly include every single class file I need. The init.php script manages the location of all your classes in one central place; once it is set up, application developers just need to worry about using the classes.

The power of autoloading is not just in cutting down on the need to include files. You can also use it to enforce coding standards and best practices. For instance, I can use the spl_autoload_extensions( ) function to require that all class files end in ".class.php", and I can enforce class naming conventions in a style similar to PSR-0. For more information on PHP coding standards, you can read more here.
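As a minimal sketch of that idea (assuming classes live under a src/ directory and follow a PSR-0 style naming scheme where Foo_Bar or Foo\Bar maps to src/Foo/Bar.php), the loader itself enforces the convention: classes that don't follow it simply never load.

spl_autoload_register(function($class) {
    // turn namespace separators and underscores into directory separators
    $path = str_replace(array('\\', '_'), DIRECTORY_SEPARATOR, ltrim($class, '\\'));
    $file = 'src/' . $path . '.php';
    if (file_exists($file)) {
        require_once $file;
    }
});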

Conclusion


With the advent of the SPL libraries in PHP 5, you can make your code a lot more readable, structured, and extensible; all in all, more object oriented, which is a nice paradigm to follow. In this post we started by taking a look at the iterator classes and interfaces and how you can extend them to create very powerful iterators for your application; the need to iterate is a given in any application. We also discussed some of the more advanced data structures the library offers, such as stacks, queues, and heaps, which you would otherwise need to write yourself. Finally, we briefly covered the PHP-native concept of autoloading and how it can improve the readability and maintainability of your application.

Hope this blog helps you in your endeavor to write more Object Oriented PHP code!

Stay tuned!

Resources

  1. http://www.php.net/manual/en/book.spl.php
  2. http://php.net/manual/en/language.oop5.autoload.php
  3. http://www.phpro.org/tutorials/SPL-Autoload.html
  4. https://github.com/php-fig/fig-standards/blob/master/accepted/PSR-0.md

Tuesday, August 20, 2013

Notes On Continuous Delivery - Anatomy of a Deployment Pipeline

Overview

Hi developers, and welcome to another blog post on the topic of Continuous Delivery! This post continues our previous discussion on CD here.

A solid CI strategy can help minimize waste and downtime in two common areas:

1.  QA waiting around for "good" builds, and
2.  software being released weeks after it was "done."

As developers, we must get into the mindset of writing production-ready code, running CI on production-like systems, and leveraging cross-functional teams, so we can optimize every step in the build lifecycle.

Cross-functional teams means bringing operations, QA, and development together to craft a solid Deployment Pipeline for your team or organization, with the eventual goal of being able to push a build into production at the click of a button. Making deployment that easy and painless creates an enormously fast feedback loop.

Once you have created and implemented a Deployment Pipeline for your application, take a step back and think about the abstractions you can extract from the exercise. You may be able to find patterns of software delivery that could be used as a template for all applications within your organization.

What is a Deployment Pipeline (DP)?

I like the definition proposed by the authors of "Continuous Delivery" (see resources below):
... At an abstract level ... it is an automated manifestation of your process ... [as it moves] from version control... into the hands of your users...
A Deployment Pipeline has roots in Continuous Integration and Release Management tools. Use these tools together with a cross-functional team to move code from version control, every step of the way, into the hands of the users, and you have a successful DP.

Alternatively, consider another level of abstraction:
... an automated manifestation of your process ... [as it moves] from the mind of the customer... into the hands of your users... 
The latter is a very powerful concept, "from concept to cash," and one you can use to justify, to product owners or investors, the time needed to build this.
A typical Deployment Pipeline contains the following stages:

  1. The commit stage asserts that the system works at a technical level (source): compilation, unit tests, and code analysis. Basically, the code is complete at this point and all developers have given it the thumbs up.
  2. The automated acceptance test stage asserts that the system works at a functional and non-functional level, and that its behavior meets customer expectations.
  3. Manual or exploratory testing asserts that the system is usable, performs well, and provides value. It also exercises very specific features and paths not covered by the automation. At this point QA has given it the thumbs up.
  4. Finally, the release stage delivers the system to users. You can deliver the software to everyone at once, or as Canary Releases to a subset of users. I believe the Chrome project works this way.
This is, at a high level, how I think of the DP for my team:




This might seem a little intimidating and it should be. I will briefly describe each step in the process at a high level, and explain more details in subsequent posts.

The process starts with a developer checking code into VC as a result of completing a story or task (with a proper definition of "done"). I won't get too deep into the details of what your version control policy should be; however, I am going to try to implement something similar to the following using Git workflows (more on this later):

http://nvie.com/posts/a-successful-git-branching-model/

The CI system (Jenkins, Bamboo, etc.) responds to this change and instantiates a new pipeline. The pipeline works from the bottom up: it attempts to check out and compile the code (if necessary), executes all application tests such as unit tests and integration tests, and performs code analysis.

Code analysis metrics might include things like:
  • Cyclomatic Complexity (CC)
  • Test Coverage
  • Code Duplication
  • Afferent and Efferent coupling
  • Style
There are a number of tools, such as JDepend, PHPDepend, Checkstyle, and Findbugs, that can help measure these metrics. An open source tool like Sonar, or a commercial one such as Clover, can aggregate all of this data into a nice dashboard.



Passing this commit stage is very important so that developers can continue working on other tasks. Once the commit stage succeeds, the main artifacts (WAR, EAR, JAR, PHAR, etc.) and dependent (child) artifacts are created and stored in the artifact repository for later use. Repositories (such as Nexus or Artifactory) make artifacts available on demand. This is very important because this same artifact is the one that eventually gets pushed through the pipeline -- only build your binaries once. If you don't need to compile code (like in PHP), then your artifacts are simply the source files. There are other things you might want to accomplish as you push your artifact through to the Acceptance Stage, such as preparing any databases and external systems if needed.

At this stage, you start executing your long-running acceptance tests. Some CI servers let you execute tests in parallel -- these are good candidates for parallelization. During this stage, QA can also perform any smoke tests and exploratory tests.

Afterwards, the pipeline splits into staging environments for different purposes and audiences. Again, it really depends on your needs: perhaps you accomplished everything you needed during the Acceptance Test stage, your business owners trust the process, and the application can be deployed directly into Production. Sometimes the business owners themselves, as users, will want to review the application before releasing it. For this we can provision some sort of staging or preview environment, which I call UAT.

In addition to this, whether you want to make the entire process automated is entirely up to you. I would recommend push-button transitions between stages; unless you have extremely thorough automated test coverage, you will want QA to conduct some exploratory testing.

Most likely your application will have to interact with external systems: databases, authentication, payment processing, etc. If you have development and QA environments for these, great; otherwise, create test doubles. Use mock APIs for unit tests, and use or create your own fake services for the other stages.
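As a minimal PHP sketch of a test double (the PaymentGateway interface and its methods are hypothetical, purely to illustrate the idea), the fake implementation records what it was asked to do instead of calling the real external system:

interface PaymentGateway {
    public function charge($amount);
}

class FakePaymentGateway implements PaymentGateway {
    public $charges = array();

    public function charge($amount) {
        $this->charges[] = $amount; // no external call, just record the request
        return true;
    }
}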

Your deployment process should be repeatable, so deploy the same way to every environment. This ensures that the build is tested thoroughly and effectively. It can be very complicated to achieve; if nothing else, the IP addresses of the different systems will be different. This speaks to the Configuration Management issue we have discussed before. How you supply deploy-time configuration to your system is entirely up to you: you may decide to store it in a database, LDAP or Active Directory services, a remote NFS share, Puppet, Chef, etc. As a corollary, make sure all your environments are AS SIMILAR AS POSSIBLE TO PRODUCTION. This includes network topology, firewalls, OS configuration, application stack, and data.

Once the application is deployed, run a quick automated SMOKE TEST script to sanity check your app. This can be as simple as testing whether the app is up and serving 200 OK responses. In addition, your smoke test can also check that any services your application depends on are up and running.
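A minimal PHP smoke test sketch might look like the following (the URL is a placeholder for your deployed environment); the non-zero exit code lets the CI server mark the stage as failed:

$url = 'http://localhost/';   // replace with the environment under test
$headers = @get_headers($url);

if ($headers === false || strpos($headers[0], '200') === false) {
    fwrite(STDERR, "SMOKE TEST FAILED: $url is not serving 200 OK\n");
    exit(1);
}
echo "SMOKE TEST PASSED: $url is up\n";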

Typical agile development teams range from 5 to 10 people, so expect multiple builds to be pushed through the pipeline at once. The orchestrating tool (Bamboo or Jenkins) should handle this efficiently. In other words, the CI system will only move the latest version of the code to the next stage once that stage has completed its previous run. As you progress through the pipeline, the steps take longer and longer. The first steps (code checkout, unit test run, automated acceptance test run) are all automated and quick. The orchestration waits to move a build forward until the pipeline frees up at that stage; if there are multiple commits, only the latest one at that moment gets put through. This works in similar fashion to a processor's instruction pipeline. The last steps (smoke tests, capacity tests, etc.) will probably be manually triggered by a QA engineer and take much longer.

If any stage in the pipeline fails, the entire team is responsible for it, and fixing it should become the highest priority above all else. Remember, in CD you release code through the pipeline, so nothing can be more important; it represents the health of your application. The purpose of failed tests is not to burden developers but to weed out builds that are unfit for production.


Preparing a release

There is business risk associated with every production release, no doubt. You don't want to delay the introduction of new value and capabilities for your users, especially if your application is subject to a very competitive market, such as mobile apps. However, a comprehensive back-out or roll-back plan needs to be in place, and it needs to be understood by the entire team; in my opinion, this is probably the hardest part of CD. The two main reasons releases are feared are: 1) the introduction of hard-to-find, high-impact problems, and 2) the fact that you are committed -- actual users have been impacted. So, you want to make sure you:
  • Have a release plan created and maintained by everybody involved
  • Mitigate errors by leveraging as much automation as you can
  • Rehearse the process often
  • Have the ability to rollback if things don't go according to plan
  • Have a strategy for migrating configuration and production data as part of the upgrade
In order to have successful releases, you must be in full control of your target environments. Often this is a challenge, especially when you are dealing with managed cloud infrastructures (PaaS), where your capacity to exercise operational tasks (extracting data, configuration, network topology, etc.) can be very limited. In these environments, you might not be able to run custom scripts; you might need to code maintenance web hooks into your applications. Whatever the case may be, account for this as part of the release, and create your back-out plans accordingly.

If you are in full control of your different environments, look to automate the creation and provisioning of your target instances. Set this up first in your testing environment, hash out all of the nuances, and look for ways to overcome the issues. Exercise and rehearse this process, perhaps many times a day; this will make your build and your process sturdy and reliable. Once you are satisfied with the setup, abstract it and reuse it as a template for all your staging, QA, and production instances. Share this template (build scripts, binaries, etc.) for everyone to see and maintain.

Implementing a deployment pipeline

A deployment pipeline takes time to build and mature. Start out by sketching what you envision your organization's delivery process to be. Ideally, implement this in new projects before any work starts -- sprint zero. Every organization's deployment process will be different, but the tasks at hand are consistent across all of them. In general you want to:
  1. Model your value stream
  2. Automate the build and deployment
  3. Automate unit tests
  4. Automate acceptance tests
  5. Automate release
The first step is to map out the path from check-in to release. Involve everyone on the team. If the people in charge of releases are on a different team, approach them and jot down all of the steps needed to release your application. One good idea is to look for similar projects within your organization: how are they doing their releases? Are they looking to move to a CD delivery strategy? If so, try to standardize on a common solution.

Once you have sketched out a solution, try to model it in your CI system of choice. I highly recommend either Jenkins or Bamboo, but pick the one that suits your organization best. You will be creating the different steps or stages of your build: compilation (if applicable), unit tests, acceptance tests, and deployment. Ideally, your deploy phase would prompt you for a version number or something similar.

Configure your CI tool to automate the build: version control checkout, compilation (if needed), testing, and creation of the deployable artifact. This artifact can then be pushed to any machine and configured to work. In Configuration Management we talked about keeping configuration separate from the application source, not hard coded.
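A minimal sketch of that idea in PHP (the APP_DB_* variable names are made up): read deploy-time settings from the environment so the same artifact can be promoted unchanged to every environment.

// config.php -- resolved at deploy/run time, not baked into the artifact
$config = array(
    'db_host' => getenv('APP_DB_HOST') ?: 'localhost',
    'db_name' => getenv('APP_DB_NAME') ?: 'app_dev',
);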

Next, configure the deployment step. To start off, you can just deploy locally. Ideally, you will want to push to different machines for the different target environments: QA, UAT, Production, etc. All the environments should be "production-like," and their maintenance and provisioning automated so they can be replicated. The only thing that changes is the configuration.

After your deployment has been automated, the next step is to be able to perform push-button deployments. Configure your CI server so that you can choose the build you want and have it make its way through the pipeline and into your target environment. In Bamboo, for instance, you can integrate JIRA issue tracking and actually push builds from JIRA.

Conclusion

A Deployment Pipeline takes its inspiration from a processor's instruction pipeline in that different stages can execute in parallel. Developers don't just stop working as soon as QA is ready to test: as artifacts are pushed through the pipeline, stages that free up can start working on the next build, simultaneously.

The steps described in the last section should be part of any typical value stream. As the project gets more complex, your pipeline should evolve with it -- it's a living system. Remember, your build process is the spinal cord of your application; it holds everything together.

If your application is made up of multiple modules or components, then consider having mini-pipelines to build each one, and one big pipeline to aggregate all of the artifacts.

Continuous Delivery is an iterative approach with constant optimization loops. If you see that certain acceptance tests fail frequently, focus on those: fix them and add more if need be to assure the stability of that part of the system. Work on this pipeline as part of the development process; it is not a side task, and it involves everyone on the team. Take it one step at a time, especially if you are not familiar with the technology involved or are still deciding which tools to use. Spend a few weeks understanding and carefully crafting each step.

Even if you don't need to release software many times per day or per sprint, having a reliable deployment pipeline will be a nice asset to your organization.

Resources

  1. http://www.sonarqube.org/
  2. https://www.google.com/intl/en/chrome/browser/canary.html
  3. DZone Refcardz. Continuous Delivery: Patterns and Antipatterns in the Software Lifecycle