Down the Rabbit Hole with AmazonMQ - Why I Use Self-Managed RabbitMQ
Having spent considerable time down the AmazonMQ Rabbit Hole, I wanted to share what I learned running RabbitMQ on AmazonMQ, and why I worked with my team to set up self-managed RabbitMQ clusters instead.
The Horror Story
“But it’s no use now,” thought poor Alice, “to pretend to be two people! Why, there’s hardly enough of me left to make one respectable person!”
Our AmazonMQ cluster would suddenly, and randomly, encounter a memory spike on one node, which resulted in cascading node failures. Even with an enterprise support plan it took over 7 hours to recover from, because it required specialist intervention to prevent critical data loss. This happened twice…
Serious Anti-Patterns
According to RabbitMQ best practices, in a production environment:
- High Availability policies should not replicate every queue across every node in the cluster; each queue should be mirrored to only a subset of nodes (a policy sketch follows this list).
- Traffic should always be routed directly to the queue owner to prevent internal re-routing of messages.
- Queue lengths should be kept short.
- A proxy should be used to manage connection re-use.
- Queue ownership should be spread across nodes; one node should not own every queue.
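As a concrete illustration of the first point, here is a minimal sketch of a policy that mirrors each queue to exactly two nodes instead of all of them. It uses Python's requests library against the RabbitMQ management HTTP API, assuming the management plugin is reachable on localhost:15672 with default credentials and the default vhost; the queue-name pattern is a placeholder, and note that classic queue mirroring is deprecated in recent RabbitMQ releases in favour of quorum queues.

```python
import requests

# Assumptions: management plugin on localhost:15672, guest/guest credentials,
# default "/" vhost (URL-encoded as %2F). All names here are placeholders.
MGMT = "http://localhost:15672/api"
AUTH = ("guest", "guest")

# Mirror each matching classic queue to exactly 2 nodes (owner + 1 replica),
# instead of letting every node in the cluster hold a copy of every queue.
policy = {
    "pattern": "^backend\\.",          # hypothetical queue-name prefix
    "definition": {
        "ha-mode": "exactly",
        "ha-params": 2,
        "ha-sync-mode": "automatic",
    },
    "apply-to": "queues",
    "priority": 10,
}

resp = requests.put(f"{MGMT}/policies/%2F/limited-mirroring", json=policy, auth=AUTH)
resp.raise_for_status()
print("Policy applied:", resp.status_code)
```

On AmazonMQ you don't get to keep a policy like this, as we're about to see.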
Hope you got your tea ready...
But what actually happens?
- The load balancer does not act as an AMQP proxy; it can actually multiply connections, causing excessive load that blocks access to your cluster.
- The load balancer routes inbound traffic to any available node, even if that node is not the queue owner. This means that if you transfer 1 KB of data into the cluster and it lands on a replica node, it will be internally redirected to the queue owner node, so that 1 KB has turned into 2 KB.
- AmazonMQ will continually override your HA policy to treat every node in the cluster as an HA replica. This means that for every 1 KB transferred into the cluster, N KB are transferred around the cluster, where N is the number of cluster nodes. Add the internal redirection to the queue owner and you're sitting at (N+1) KB; in a 3-node setup that's 4 KB of data for 1 message, not including consumption of the message. If you receive 1,000,000 messages per day, that's a staggering 4,000,000 KB of transfer, 3,000,000 KB of which is wastage (see the back-of-the-envelope sketch after this list).
- AmazonMQ gives you roughly 70% of the resources you pay for, and its memory-waterfall alarms trigger at lower thresholds.
- AmazonMQ is upwards of 6 times more expensive than a comparable self-managed RabbitMQ setup.
- AmazonMQ does not rebalance queue ownership when you add a new node, and RabbitMQ isn't smart enough to do it automatically either.
- There is no direct access to a cluster node's disk partition… This means you cannot copy the queue data binaries required for replaying messages on a critically failed cluster.
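To put the amplification in the HA-policy point above into numbers, here is the back-of-the-envelope calculation, using the same illustrative 1 KB message size and 1,000,000 messages per day.

```python
# Back-of-the-envelope traffic amplification for a 3-node AmazonMQ cluster
# where every node is forced to hold a replica of every queue.
NODES = 3                     # cluster size
MSG_SIZE_KB = 1               # illustrative message size from the example above
MESSAGES_PER_DAY = 1_000_000

# One hop from the load balancer's chosen node to the queue owner,
# plus one copy kept on each of the N nodes under the forced all-node HA policy.
per_message_kb = MSG_SIZE_KB * (NODES + 1)

daily_total_kb = per_message_kb * MESSAGES_PER_DAY
daily_waste_kb = (per_message_kb - MSG_SIZE_KB) * MESSAGES_PER_DAY

print(f"{per_message_kb} KB moved per {MSG_SIZE_KB} KB message")
print(f"{daily_total_kb:,} KB/day total, {daily_waste_kb:,} KB/day of it overhead")
```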
Highly Prone to Cascading Failures
In my experience working with RabbitMQ, and with Amazon's AmazonMQ implementation of it, I have been blessed by the Gods of Cascading Failures on countless occasions.
The most significant pattern I detected, one which allows early intervention to prevent a cascading failure of the MQ cluster, is this:
- Memory on one node begins to climb rapidly, while the other nodes’ memory remains within tolerance. Let’s refer to this node as “Node P0”, short for “Problematic Node” (a monitoring sketch for catching this step follows this list).
- Memory on Node P0 inexplicably reaches the threshold that triggers the memory-waterfall alarm.
- Node P0 begins refusing requests and essentially locks up.
- Because a memory waterfall alarm is triggered, you can no longer reboot the cluster or specific nodes within the AWS Console.
- The load balancer detects that Node P0 is not accepting requests and begins routing requests to the next node in line; let’s call this “Node P1”.
- Node P1’s memory begins to inexplicably climb rapidly until it reaches memory-waterfall.
- Node P1 begins rejecting traffic.
- The load balancer detects that both Node P0 and Node P1 are rejecting traffic, so it routes the traffic to the next node in line; you guessed it… we’ll call this other node “Node P2”, how utterly imaginative.
- Node P2’s memory climbs until it hits memory-waterfall, and now there are no more nodes to serve traffic.
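The good news is that the first step in that sequence is easy to watch for yourself. Below is a minimal polling sketch against the management API's /api/nodes endpoint (again assuming localhost:15672 and default credentials); it flags any node whose memory climbs past a chosen fraction of its configured limit, well before the memory-waterfall alarm fires.

```python
import time
import requests

# Assumptions: management plugin on localhost:15672, guest/guest credentials.
MGMT = "http://localhost:15672/api"
AUTH = ("guest", "guest")
WARN_FRACTION = 0.6   # warn well before RabbitMQ's own memory alarm threshold

def check_nodes():
    nodes = requests.get(f"{MGMT}/nodes", auth=AUTH).json()
    for node in nodes:
        mem_used = node.get("mem_used", 0)
        mem_limit = node.get("mem_limit", 1)
        fraction = mem_used / mem_limit
        if node.get("mem_alarm"):
            print(f"ALARM  {node['name']}: memory alarm already raised")
        elif fraction >= WARN_FRACTION:
            print(f"WARN   {node['name']}: {fraction:.0%} of memory limit")
        else:
            print(f"OK     {node['name']}: {fraction:.0%} of memory limit")

if __name__ == "__main__":
    while True:
        check_nodes()
        time.sleep(30)
```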
With self-managed RabbitMQ, what intervention strategies could we consider?
- Rebalance queue owners across each node when we detect Node P0’s memory begins climbing. This will likely still result in Node P0 failing, but we can sacrifice Node P0 to prevent the cascading failure.
- Preempt a cascading failure by shovelling massive queues into backlog queues, then trickle-feeding the backlog queues into the primary queue using a rate-limited Python consume-and-publish technique (see the sketch after this list).
- Clear a non-critical queue to reduce cluster load.
- Kill off all connections to the RabbitMQ cluster.
- Reboot Node P0 as soon as it hits memory-waterfall and rebalance queues across Node P1 and Node P2.
- Mitigate data loss by connecting to a functional node and downloading the raw queue data from the HDD, then trickle-feed replaying it using Python after a full cluster purge and reboot.
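The trickle-feed mentioned above is nothing exotic. Here is a minimal sketch of the rate-limited consume-and-publish loop using pika; the queue names and rate are placeholders, and in a real incident you would want batching and error handling around this.

```python
import time
import pika

# Assumptions: plain AMQP access on localhost:5672, default credentials.
# "orders.backlog" and "orders" are placeholder queue names.
SOURCE_QUEUE = "orders.backlog"
TARGET_QUEUE = "orders"
MESSAGES_PER_SECOND = 50   # tune to whatever the cluster can absorb

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

while True:
    # Pull one message at a time from the backlog queue.
    method, properties, body = channel.basic_get(SOURCE_QUEUE, auto_ack=False)
    if method is None:
        break  # backlog drained

    # Republish to the primary queue via the default exchange,
    # then ack the backlog copy only once the publish has gone out.
    channel.basic_publish(exchange="", routing_key=TARGET_QUEUE,
                          body=body, properties=properties)
    channel.basic_ack(method.delivery_tag)

    time.sleep(1.0 / MESSAGES_PER_SECOND)   # crude rate limit

connection.close()
```

The point of the sleep is simply to keep the republish rate below what the struggling cluster can absorb, rather than dumping the whole backlog back onto it at once.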
The Conclusion
In conclusion, AmazonMQ might be okay for some people with minimal traffic, at which point it’s likely overkill anyway and you should just use SQS instead… But for any serious application, AmazonMQ falls far short of being production-ready, sustainable, or maintainable.
When you factor in the costs, the lack of data recovery, the lack of cluster recovery, and, in my experience, the lack of support (even when paying thousands for premium business-tier support), AmazonMQ running RabbitMQ is not viable in my opinion.
Is the problem RabbitMQ? Not entirely…
Is the issue AmazonMQ? Again, not entirely…
Then what is the issue? Default configurations and haphazardly slapped-together, one-size-fits-all “solutions” are the enemy of scale and fault tolerance.
Footnote from the Author: Surely you didn't expect a post like this not to be littered with Alice in Wonderland puns...