Scalable Performance: using ZeroMQ in your next application

If you are like me, you will try to abstract communication between threads or sub-processes within your application so it works the same while running on single CPU core or a thousand CPU cores. One of the easiest ways to achieve this is through a queue system such as RabbitMQ, ActiveMQ, or Gearman but I suggest using ZeroMQ.

Despite having “MQ” in the name, ZeroMQ is not your typical queue system. Instead it’s a socket library that carries messages over TCP, IPC, inproc and multicast allowing you to transmit messages in the same manner between threads, processes, or networks.

IPC is usually used for communication between processes running on the same computer and uses UNIX domain sockets in the background. Unfortunately it’s only available on UNIX and GNU/Linux platforms at the moment.

Inproc passes messages in memory so no I/O threads are involved. Downside of using inproc transport is that you are forced to use the same context instance.

Multicast (PGM) can only be used with PUB/SUB sockets and is used to reliably deliver same message to multiple clients over IP network. PGM is rate limited by default which means that there is a potential performance penalty if used over a loopback interface.

TCP transport provides a generic unicast messaging that can be used over IP network. It will work in the same manner as IPC if used over a loopback interface.

ZeroMQ sockets will not work with regular, non-ZeroMQ sockets. This means that you can not build an HTTP server (for example) on top of ZeroMQ and expect it to work with browsers that do not support ZeroMQ (I don’t know of any browsers that support ZeroMQ, although I strongly believe that they should).

ZeroMQ provides multiple messaging patterns: request/reply, publish/subscribe, pipeline and exclusive pair. Each pattern provides it’s own routing strategy and directionality.

Request/reply provides a bidirectional communication with load-balanced outgoing message routing and last peer incoming message routing. In this pattern a ZMQ_REQ socket (client) can send a message to a ZMQ_REP socket(s) (service/server) that it’s connected to and receive a response from it. If a client socket is connected to multiple service sockets, messages originated on a client socket will be sent in a load-balanced fashion. Service sockets on the other hand will receive a multi-part message (array/tuple) that will contain client socket’s unique identifier as well as a message that client socket has sent.

Publish/subscribe is used for a unidirectional commication with fan-out outgoing message routing. Similar to ZMQ_REP sockets, ZMQ_PUB sockets act as a service and ZMQ_SUB sockets act as clients. Service socket send multi-part messages consisting of a topic and a message. Client sockets subscribe to specific topics that they want to receive messages about.

Pipeline provides a reliable unidirectional communication with load-balanced outgoing message routing. ZMQ_PUSH sockets are used on service side and ZMQ_PULL sockets are used on client (node) side. This pattern is used to reliably distribute messages from a service to multiple nodes that are connected to it. If there are no connected nodes, service will block until at least one node connects to it. Unlike publish/subscribe pattern, a message is only delivered to a single node instead of all nodes connected to a service.

Exclusive pair provides bidirectional communication between two ZMQ_PAIR sockets. ZMQ_PAIR sockets are designed for inter-thread communication across the inproc transport and do not implement functionality such as auto-reconnection. ZMQ_PAIR sockets are considered experimental and may have other missing or broken aspects.

Fixing corrupted QCOW2 disk images

Yes, corruptions happen. Here is what I do to fix corrupted QCOW2 (or any other VM disk image format) disk images:

qemu-nbd --connect=/dev/nbd0 disk.qcow2
fdisk -l /dev/nbd0
fsck /dev/nbd0p1

Review of 5 most popular layer 7 load balancers

Several months ago we’ve performed a set of tests on some of the most popular load balancers: HAProxy, Barracuda 440, Zeus, Enterprise R16 and Nginx proxy. We ended up picking Zeus and HAProxy as two load balancers that we will be supporting within Zombie Cloud Platform.


HAProxy is probably one of the most commonly used load balancers. It’s free, open source, well maintained and extremely lightweight. It is designed to run on a single or dual core systems with minimal amount of RAM. Our tests have showed that you can achieve 1,800-2,500 requests per second (200 concurrent connections) on a virtual server allocated a single core of Intel Xeon X3440 CPU and 1GB DDR3 RAM. In most tests CPU usage did not go past 20% and HAProxy process has never used more then 20% of RAM (that’s 200MB!).

Biggest drawback is that there is no direct API or web interface to perform proxy management so you will have to write your own scripts to manage HAProxy’s configuration.

Zeus Load Balancer

Zeus Load Balancer offers a high performance load balancer with lots of features. Some of the best features include:

  • SOAP APIs and a web-based management interfaces
  • Content caching
  • SSL acceleration
  • Built-in clustering and high availability
  • Traffic management rules

It comes in two flavors: installable package that will work with most Linux distributions and a pre-configured virtual machine image. We’ve opted for installable package and setup process took us roughly 10-15 minutes from beginning to end.

Due to overhead of running management interface and other bundled services Zeus required a little bit more CPU and RAM. With that said, we’ve tested several server configurations and were able to get 2,200-2,800 requests per second (200 concurrent connections) with just a single core of Intel Xeon X3440 and 2GB of DDR3 RAM allocated to the virtual machine. During most tests CPU usage did not exceed 60% and RAM utilization never went past 1.2GB.

Zeus offers a free developer license that will allow you to run a single load balancer (no high availability) with a bandwidth cap of 1 megabit per second. Paid licenses are available on either monthly or perpetual basis. Enterprise R16 Enterprise R16 is a hardware based solution. It’s Layer 7 load balancing is based on HAProxy so performance is very similar. Features include:

  • Web-based management interface
  • SSL acceleration
  • Built-in high availability

Web-based interface isn’t anything fancy but it gets the job done.

After slightly tweaking some kernel settings we were able to achieve 2,000-2,300 requests per second (200 concurrent connections). Hardware was based on generic SuperMicro server with Intel Atom 330 CPU and 2GB DDR2 RAM. also offers other load balancer models at extremely competitive prices and their customer service is superb. If you are on a tight budget and don’t need all the bells and whistles - definitely give them a try.

Barracuda 440

Model 440 is Barracuda’s higher end hardware load balancer model. Hardware is definitely refined and looks really good on the rack. Performance is definitely there too - we were able to achieve 2,200-2,400 requests per second. Some of the features include:

  • Intrusion prevention
  • Built-in high availability
  • SSL offloading
  • Web-based and API management interfaces
  • Content caching

The only two drawbacks are price and web-based management interface. This unit will cost you slightly over $3,500 for hardware and $672/year for optional (required for the first year) update subscription. In my personal opinion - it is a little too expensive comparing to other load balancers.

Nginx Proxy

Nginx web server has support for simple load balancing through HTTP Upstrem module. It’s one of the easiest load balancing solutions you can find and usual setup will not take more then 20 minutes. Like HAProxy, Nginx has no management interface and you will have to write your own scripts to provision configuration changes.

Performance is mediocre. We’ve tried multiple different server configurations and best results were 1,200-1,800 requests per second with an average of 40-50% CPU utilization and about 40% of RAM utilization on a similar virtual machine that we’ve used for HAProxy testing - 1 Intel Xeon X3440 CPU core and 1GB of DDR3 RAM.

Planning development of highly scalable web applications

Several weeks ago I gave a presentation which was a 2-hour intro into building highly scalable web applications. This post is the first follow-up to that presentation. Planning is the most crucial stage in developing a highly scalable web applications. Without proper planning you will not be able to build anything stable nor scalable.

Performance, stability, low cost - pick two

As with most things in life - you can’t get everything. General rule is that anything fast and free will most probably have stability issues; stable and free will not perform well; and fast and stable will cost you money - whether up front in licensing fees or in system administrator’s or developer’s time.

Allocate a budget

Whether you are trying to develop a blog that needs to handle millions of users (e.g.: TechCrunch) or a new ground breaking online shopping application - you will need to pay for servers, server management, software licenses, development, etc.

What have others done?

Spend some time doing research on what technologies other companies are using. Chances are that someone else in the world has already tried solving same or similar problem that you are facing and most companies tend to have a separate technical resource (blog, project tracking software, etc) where they post interesting findings. Don’t be afraid to get in touch with techies in IT companies - most of the time they are friendly and will try to help you out.

Fully virtualized or hybrid environment?

Bare metal setups are out of question if you are trying to achieve any kind of scalability. In a perfect world your application will scale horizontally without any human intervention and virtualization is the only way to truly achieve this but in certain cases you will still need to depend on some bare metal servers. Those servers usually run software that is not optimized for virtual storage and depend a lot on hard drive speeds (e.g.: busy databases, distributed storage). This is usually not an issue if you run your own virtualization servers where you control hardware but it does create a problem if your application runs on external cloud platform and certain parts need to meet security specifications.

What 3rd party services are available?

There are plenty of hosted applications available that you can use to reduce time spent on management and development. By using such services you offload headaches of trying to scale your application to someone else. On the other hand this presents a new problem - what happens if service you utilize within your application goes down? It’s up to you to figure out what services you want to rely on and how to deal with outages.

Distributed applications

Your application might be scalable but it does not mean that it will be able to cope with outages. Outages happen every day - networks are still fairly unreliable, power back-ups fail, servers crash for no apparent reason. In order to reduce downtime you should plan to have instances of your application running from multiple datacenters and it’s always a good measure to have back-ups going back at least 30 days.

Fixed. theme by Andrew McCarthy