Saturday, October 17, 2015

You need a corporate framework

If you work on a big enough software development team you probably agree that consistency in code and development practices is very important.  Consistency is what saves you time when joining a new project or reviewing somebody else's code, what saves the ops team time when they deploy a new module and have to figure out how to monitor it, and what saves the analytics team time when they have to understand and use the logs & metrics generated by a new component.

To get a certain degree of consistency (and quality at the same time) it is very common these days to have coding guidelines, technical plans, training plans, code reviews...  All those practices are important and help a lot, but in my opinion they don't solve some of the most important problems, and in addition they depend a lot on human responsibility (bad, very bad, you shouldn't trust any human).

So let's try to figure out some of the problems we have today.  Are any of these familiar to you?
  • You start a new project and you don't know what folders to create (should I create doc and test folders?), how to name things (is it test or tests, src or lib?), whether to use jasmine or mocha for testing, whether the design of this component goes in a wiki page, a gdocs or a .txt in a folder, where to put configuration, whether to mention the third party licenses somewhere...
  • Each component logs different things, with different names and in a different format.  Do all your components log every request and response? Do they use WARN and ERROR consistently?  Do you always use the same format for logging?  I've seen teams using as many logging libraries as components.  The cost of not having good, consistent logging can easily make a company waste hundreds of thousands of dollars very quickly.
  • Half of the components don't have a health or monitoring endpoint, or if they have one the amount of information shown or its format is totally inconsistent.   One service exposes the average response time, another the P99, another only counters...  This makes it hard (if not impossible) to monitor components, so in the end nobody pays attention to them until a customer complains.
  • My retry strategy sucks.   Do you always retry when you make requests to third party components (very common with the popularization of "microservices" architectures)?  Do all your components do the same number of retries?  Is the timeout before retrying always the same?  Do you retry against a different server instance?  (See the retry sketch after this list.)
  • The configuration of each component is different.    One uses XML, another JSON, another environment variables.   In some components it can be changed on the fly while in others it can't.  In some components the config lives in git, in others in chef recipes, in others in external configuration servers.
  • Do you have any service registration and service discovery solution?  Or are some services registered in a database, others in a config file, others in the load balancer configuration file?
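
To make the retry bullet concrete, this is a minimal sketch (node.js, all names made up) of the kind of retry helper a corporate framework could standardize so that every component retries the same way:

// Hypothetical retry helper with exponential backoff. makeRequest takes a
// node-style callback; the initial attempt plus `retries` retries are made.
function requestWithRetries(makeRequest, options, callback) {
  var retries = options.retries || 3;
  var delayMs = options.delayMs || 100;

  function attempt(remaining, delay) {
    makeRequest(function(err, response) {
      if (!err) return callback(null, response);
      if (remaining === 0) return callback(err);
      // Double the delay between attempts (exponential backoff).
      setTimeout(function() {
        attempt(remaining - 1, delay * 2);
      }, delay);
    });
  }

  attempt(retries, delayMs);
}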

Use the force, Luke!

What you need is a corporate framework and a corporate project template.

You don't even need to create your own framework.  The best example of this kind of framework I know is Finagle from Twitter, and other companies like Tumblr, Pinterest or Foursquare are reusing it.

Finagle enforces a design for building Scala services (based on Futures), provides a TwitterServer class that automatically exposes a stats endpoint and reads configuration properties from command line arguments, includes support for distributed logging, provides lots of clients (MySQL, HTTP, Redis...) that expose a consistent API and automatically generate logs and statistics, and integrates with zookeeper for seamless registration and discovery of services.    If you don't know it I highly recommend taking a look.

I tried to implement my own framework some months ago (https://github.com/ggarber/snap).  It is very rudimentary (the maturity level of a hackathon project) but I'm using it in production to test if it is really helpful, and even at its current level of immaturity I found it very useful (I don't need to care much about consistency anymore and it also saved me time).

The other piece I think is mandatory is a project template.   It saves you from having to make decisions and should have a reasonable number of tools integrated to automatically run tests, review styles, initiate a pull request... and maybe even deploy.

This project template can be an Eclipse plugin, a yeoman generator or something else, but if you don't have one I don't understand why :)  As an example for node.js projects I like this one created by a friend: https://github.com/luma/generator-tok

Hopefully I convinced you of how important it is to have a corporate framework and project template that you use for all your components.    Feedback is more than welcome.     And contributors for the snap framework (https://github.com/ggarber/snap) even more! :)

Saturday, July 4, 2015

HTTP/2 explained in 5 minutes

After reading about and playing with HTTP/2 for some days, this is a summary of my understanding at a very high level.

HTTP/2 is all about reducing the latency of accessing web applications.  It maintains the semantics (GET, POST... methods, headers, content) and URL schemes of existing HTTP, and it is based on the improvements proposed by Google as part of its SPDY protocol, which is finally replaced by HTTP/2.

The three most important changes introduced in HTTP/2 in my opinion are:
1) Reduced overhead of HTTP headers by using binary fields and header compression (HPACK).
2) Ability to use a single TCP connection for multiple HTTP requests without head-of-line blocking at the HTTP level (responses can be sent in a different order than the requests were made).
3) Support for pushing content from server to client without a previous request (for example, the server can send the images a browser will need as soon as it receives the request for the HTML file referencing them).

The most controversial "feature" of HTTP/2 was making TLS mandatory.  In the end the requirement was relaxed, but some browsers (Firefox) plan to make it mandatory anyway.

Most of the relevant browsers (at least Chrome and Firefox, and some versions of IE) already include support for HTTP/2, as do the most popular open source servers (nginx and Apache).  So you should be able to take advantage of the new version of the protocol right now.

HTTP/2 support is negotiated using the same HTTP Upgrade mechanism used for websockets (or ALPN when running over TLS) and should be transparent for users and for elements in the middle (proxies).

Application developers should benefit for free, without any change in their apps, but they can get even more benefits with some extra changes:
* Some tricks in use today, like spriting or inlining resources, are not needed anymore, so the build/deploy pipeline can be simplified.
* Server push could be automatic in some cases, but in general it will require developers to declare the resources to be pushed for each request.  This feature requires support in the web framework being used.

I made a hello world test pushing the javascript referenced in an HTML page automatically and it improved the latency as expected.  I plan to repeat the test with a real page with tens/hundreds of referenced js/css/img files and publish the results.
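
For illustration, here is a minimal sketch of that kind of server push test using Node's http2 module (an assumption on my part: your runtime ships that module; file names and paths are placeholders):

// Serve index.html over HTTP/2 and push the app.js it references.
var http2 = require('http2');
var fs = require('fs');

var server = http2.createSecureServer({
  key: fs.readFileSync('server.key'),   // placeholder certificate files
  cert: fs.readFileSync('server.crt')
});

server.on('stream', function(stream, headers) {
  if (headers[':path'] === '/') {
    // Push app.js before the browser even asks for it.
    stream.pushStream({ ':path': '/app.js' }, function(err, pushStream) {
      if (err) return;
      pushStream.respond({ ':status': 200, 'content-type': 'application/javascript' });
      pushStream.end(fs.readFileSync('app.js'));
    });
    // Then answer the original request with the HTML referencing app.js.
    stream.respond({ ':status': 200, 'content-type': 'text/html' });
    stream.end(fs.readFileSync('index.html'));
  }
});

server.listen(8443);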


Monday, December 8, 2014

Static type checking for Javascript (TypeScript vs Flow)

I've never been a big fan of Javascript for large applications (nothing beyond proxies and simple services), and that is partially because in my experience the lack of static typing ends up making it very easy to make mistakes and very difficult to refactor code.

Because of that I was very excited when I discovered TypeScript some months ago (disclaimer: I'm not a JS expert), and I was very curious about the differences between TypeScript and Flow when a colleague pointed me to the latter today.   So I tried to play find-the-seven-differences, but I'm lazy and I stopped after finding one.

Apart from cosmetic differences and tools availability, both TypeScript and Flow support type definitions based on annotations, type inference, and class/module support based on EcmaScript 6 syntax.    The relevant difference I found after reading/playing with them (for half an hour) is that, because of the way they implement type inference, Flow can track type changes of a variable after its initial declaration, making it more appropriate for legacy Javascript code where adding annotations may not be possible.

This is some code I used to play with them, with some inline comments:

var s = "hello";
s.length  // Both TS and Flow know that s is a string and they check they have a length method


var s: string = null;
s = "hello";
s.length  // Both TS and Flow know that s is a string and they check they have a length method

var s = null;
s = "hello";

s.length // TS doesn't know that this is a string but Flow knows and can check it has a length method

Sunday, October 26, 2014

Service discovery and getting started with etcd

After playing with some Twitter open source components recently (mostly finagle) I became very interested in the concept of service discovery as a way to implement load balancing and failure recovery in the interconnection between the internal services of your infrastructure.   This is especially critical if you have a microservices architecture.

Basically, the idea of service discovery solutions is to have a shared repository with an updated list of the existing instances of service A, plus mechanisms to retrieve, update and subscribe to that list, allowing other components to distribute requests to service A in an automated and reliable way.


The traditional solution is Zookeeper (based on a Paxos-like consensus algorithm, with code open sourced by Yahoo and maintained as part of the Hadoop project), but other alternatives have appeared recently and look very promising.  This post summarized the available alternatives very well.

One of the most interesting solutions is etcd (simpler than Zookeeper, implemented in Go and supported by the CoreOS project).  In this post I explain how to do some basic testing with it.

etcd is a simple key/value store with support for key expiration and watching keys, which makes it ideal for service discovery.   You can think of it as a redis server but distributed (with consistency and partition tolerance) and with a simple HTTP interface supporting GET, SET, DEL and LIST operations.

Installation

The first step is to install etcd and the command line tool etcdctl.
You can easily download and install them from here, or if you are using a Mac you can just "brew install etcd etcdctl".

Registering a service instance

When a new service instance in your infrastructure starts, it should register itself in etcd by sending a SET request with all the information that you want to store for that instance.

In this example we store the hostname and port of the service instance and we use a key schema like /services/SERVICE/DATACENTER/INSTANCE_ID.   In addition we set a TTL (60 seconds in the code below, refreshed every 10 seconds) to make sure the information expires if it stops being refreshed because the instance is no longer available.

var path = require('path'),
    uuid = require('node-uuid'),
    Etcd = require('node-etcd');

var etcd = new Etcd(),
    p = path.join('/', 'services', 'service_a', 'datacenter_x', uuid.v4());

function register() {
  // Register this instance under its unique key with a 60 second TTL,
  // so the entry expires automatically if we stop refreshing it.
  etcd.set(p,
    JSON.stringify({
      hostname: '127.0.0.1',
      port: '3000'
    }), {
      ttl: 60
    });

  console.log('Registered with etcd as ' + p);
}

// Refresh the registration every 10 seconds, well within the TTL.
setInterval(register, 10000);
register();

Discovering service instances

When a service in your infrastructure needs to use another service, it has to send a GET request to retrieve all the available instances and subscribe (WATCH) to receive notifications when nodes go down or new nodes come up.

var path = require('path'),
    Etcd = require('node-etcd');

var etcd = new Etcd();
var p = path.join('/', 'services', 'service_a', 'datacenter_x');

var instances = {};

// Keep the local list of instances in sync with etcd events:
// 'set' adds or updates an instance, 'expire'/'delete' removes one.
function processData(data) {
  if (data.action === 'set') {
    instances[data.node.key] = data.node.value;
  } else if (data.action === 'expire' || data.action === 'delete') {
    delete instances[data.node.key];
  }
  console.log(instances);
}

var watcher = etcd.watcher(p, null, { recursive: true });
watcher.on('change', processData);

// Initial snapshot: list all instances currently registered under the prefix.
etcd.get(p, { recursive: true }, function(err, res) {
  res.node.nodes.forEach(function(node) {
    instances[node.key] = node.value;
  });
  console.log(instances);
});
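
With the instances map populated, distributing requests becomes simple.  As a toy example (this part is my own illustration, not one of the scripts above), you could pick a random instance per request:

// Toy load balancing on top of the instances map:
// pick a random registered instance for each outgoing request.
function pickInstance() {
  var keys = Object.keys(instances);
  if (keys.length === 0) return null; // no instance currently available
  var key = keys[Math.floor(Math.random() * keys.length)];
  return JSON.parse(instances[key]); // { hostname, port } as stored at registration
}

var instance = pickInstance();
if (instance) {
  console.log('Sending request to ' + instance.hostname + ':' + instance.port);
}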


Conclusions

Service discovery solutions are becoming a central piece of a lot of server infrastructures because of their increasing complexity, especially with the rise of microservices-style architectures.  etcd is a very simple solution that you can understand, deploy and start using in less than an hour, and it looks more actively maintained and future proof than zookeeper.

I tend to think that if Redis gets a good clustering solution soon it could replace specialized service discovery/configuration solutions in some cases (but I'm far from an expert in this domain).

The other thing I found missing is good frameworks making use of these technologies, integrated with connection pool management, load balancing strategies, failure detection, retries...    Kind of what finagle does for Twitter; maybe that can be my next project :)

Thursday, March 20, 2014

Actor Model

The more I write concurrent applications the more I hate them.    Typically you end up with code full of locks, queues, threads and threadpools where it ranges from difficult to impossible to know whether it is correct or only appears to work.

Because of that I decided to do a little research on the Actor Model, which is apparently powering languages like Erlang, making it a very good solution for highly concurrent communication platforms (like Facebook Chat or WhatsApp).

These are the slides I prepared.  There is not much explanation on them, so feel free to ask me any questions, and give it a try!   The Actor Model is fun and will simplify your life whether you use a framework for it or just keep the concept in mind in your future designs.
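
To give a flavor of the concept beyond the slides, here is a toy sketch of an actor in JavaScript (my own illustration, not code from any framework): all state is owned by the actor and messages are processed one at a time from a mailbox, so no locks are needed.

// Toy actor: a mailbox plus a handler that processes one message at a time.
function createActor(initialState, handler) {
  var state = initialState;
  var mailbox = [];
  var processing = false;

  function processNext() {
    if (mailbox.length === 0) {
      processing = false;
      return;
    }
    state = handler(state, mailbox.shift());
    setImmediate(processNext); // yield to the event loop between messages
  }

  return {
    send: function(message) {
      mailbox.push(message);
      if (!processing) {
        processing = true;
        setImmediate(processNext);
      }
    }
  };
}

// Usage: a counter whose state can only be changed by sending it messages.
var counter = createActor(0, function(count, msg) {
  if (msg === 'increment') return count + 1;
  if (msg === 'print') console.log('count = ' + count);
  return count;
});
counter.send('increment');
counter.send('print'); // eventually prints "count = 1"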

Wednesday, February 12, 2014

Scientific way of estimating the cost of a feature in your project

I'm a fan of estimations as long as they are not used to try to figure out when a feature will be done.    I like estimations and I think they are critical when they are used to decide which features should be done and which ones shouldn't.

So, if they are so important, what is the best way to make estimations?   I'm going to share my secret formula, based on things I have read and on personal experience across my professional career (where I have to admit that my estimations are now completely different than they were 15 years ago).

There are two key concepts that we need to understand before digging into the actual formula:
  • A feature working doesn't mean the feature is complete or ready.   Instrumentation, thread safety, unit tests, error handling, documentation, automation, unexpected problems, bug fixing...  most of the time these take much more time than the implementation of the basic functionality.
  • Once you write something you usually have to maintain it, and not break it, forever.   Making sure that new features, refactors or any minor change don't break existing code is a really big deal in any project with enough complexity.

Based on those key concepts we can split the cost of a feature into 3 buckets:
  • Cost to have something working (the usual engineer's initial estimation): X
  • Cost to have something ready to be shipped: Y
  • Cost to keep it working for the life of the product: Z
For a total cost of adding a feature to a product of X + Y + Z.

And now is when the scientific part is applied.   Based on my experience and the thousands (well, maybe 3 or 4) of articles I have read, I think the Pareto Principle has a perfect application in this case.

In any project, the cost of implementing the basic functionality (X) is the 20%, versus the 80% for implementing the rest of the functionality needed to ship the product (Y).   So Y = 4 * X.

The Circular Estimation Conjecture: You should always multiply your estimates by pi.
I've seen a similar estimation of X + Y = PI * X, which is a bit optimistic in my opinion.  I recommend reading the visual demonstration of what is called the circular estimation conjecture.

For the second part (the maintenance cost Z) we can apply the same Pareto Principle to get Z = 4 * (X + Y) = 4 * (X + 4 * X) = 20 * X.

With all those numbers in place the conclusion is easy.   The total cost of having a feature in a product is X + Y + Z = X + 4 * X + 20 * X = 25 * X.

Take your initial guess (or ask any engineer) to get X; the cost you should use to decide whether the feature is worth implementing is exactly 25 * X.  For example, if the initial guess is 2 days, budget 50 days over the life of the product.

As a corollary and final demonstration of the theorem: I thought this post was going to take me 5 minutes to write and it took me 25, and I suspect I will have to spend more than an hour discussing it with other people.

Tuesday, January 21, 2014

Writing sequential test scripts with node

Today I was trying to create a node.js script to test an HTTP service, but the test required multiple steps.   I gave it a try using the async module to "simplify" the code, and that's the ugly code I came up with.

I'm not an expert in js/node, feel free to comment if I'm doing something wrong, I'm more than happy to learn.

(inflightSession and create are two helper functions that I have)

Test using node + jasmine: 

it("should accept valid sessionId", function(done) {
      async.waterfall([
          inflightSession,

          function(sessionId, callback) {
            create({ 'sessionId':sessionId }, callback);
          }
      ], function(error, response) {
          expect(response.statusCode).to.equal(200);
          done()
    });
  });


Same test using python + unittest:

def create_ok_test():
    session_id = inflight_session()
    response = create({ 'sessionId': session_id })

    assert_equals(200, response.status_code)


Same test using node ES6 generators (yield keyword):

it("should accept valid sessionId", function*() {
      var sessionId = yield inflightSession();

      var response = yield create({ 'sessionId': sessionId });
      expect(response.statusCode).to.equal(200);
  });


Honestly, the Python code is way more readable than the existing node code, and still better even when compared with the new node generators.    Anyway, generators definitely look like a promising way forward for the node community.   Some comments:

ES6 generators are available under a flag in node 0.11 and are supposed to be included in 0.12.

yield is a common keyword in other languages (e.g. Python, C#) to exit from a function while keeping the state of that function, so that you can resume its execution later.

function* is the syntax to define a generator function (a function using yield inside).

You need a runner supporting those generator functions (in this example jasmine needs to add support for it), basically calling generator.next() and waiting for the result (which should be a promise or a similar object) before calling generator.next() again.
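
A minimal sketch of such a runner could look like this (assuming every yielded value is a promise; error handling omitted):

// Drive a generator function: resume it with the resolved value of each
// promise it yields, until the generator is done.
function run(genFn) {
  var gen = genFn();

  function step(value) {
    var result = gen.next(value);
    if (result.done) {
      return Promise.resolve(result.value);
    }
    // Wait for the yielded promise, then feed its value back into the generator.
    return Promise.resolve(result.value).then(step);
  }

  return step(undefined);
}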

UPDATE: As I'm somehow forced to use node, I ended up creating a helper function and my tests now look like this (a sketch of the helper is at the end of the post):

itx("should accept valid sessionId", inflightSession, function(sessionId, done) {
      create({ 'sessionId':sessionId }, function(error, response) {
          expect(response.statusCode).to.equal(200);
          done()
      });

});
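
The itx helper itself is not shown here; this is a minimal sketch of how it could be implemented (my reconstruction, assuming the setup function takes a node-style callback):

// Hypothetical itx helper: run a setup function first, then pass its result
// to the test body together with the done callback.
function itx(name, setup, testFn) {
  it(name, function(done) {
    setup(function(error, result) {
      if (error) return done(error);
      testFn(result, done);
    });
  });
}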