Engineering for Message Velocity: Building Communication Systems That Scale With Your Growth

Introduction

It is sometimes said that “advertising is the lifeblood of business.” In the era of digital marketing, a case can be made for extending this to “user communication is the lifeblood of business.” It might sound a little cliché, but every e-commerce business relies heavily on reaching its clients. Many are even willing to pay for a communication channel by offering discounts and gifts.

It usually begins with a handful of email templates, a simple notifications service, and maybe basic SMS functionality for essential communications. Fast-forward a few successful marketing campaigns and an influx of new users, and suddenly, those “simple” solutions barely work. Messages arrive late, fail, or slow down the whole product. While high-level managers understand the importance of delivering timely messages, an idea covered thoroughly in our previous article, they may find the actual engineering solutions to be a bit of a mystery.

This article addresses that gap. We’ll identify what makes a modern messaging architecture perform well, how to benchmark your current system, and which strategies keep you ready even when exponential growth suddenly arrives. If you want to future-proof your platform’s communication layer, this guide is for you.

Anatomy of High-Performance Messaging Architecture

There is no single solution that would cater to the needs of every company sending out thousands, if not millions, of messages. However, there are multiple options to consider, all with their own unique quirks. Understanding these choices is the first step in designing a system that can cater to your increasing volume and complexity. Let’s start with this first dilemma:

Message Brokers vs. Direct Delivery

Message brokers (e.g., RabbitMQ, Kafka, AWS SQS) act as intermediaries, queueing messages between senders and recipients. The sender (publisher) posts messages to a queue, and one or more consumers (subscribers) process them asynchronously. This approach offers fault tolerance, load balancing, and decoupled services. However, it has more moving parts and requires proper configuration to avoid bottlenecks.

On the other hand, in the direct delivery approach, messages are sent immediately from one service to another without an intermediary. It is simpler to set up for small scale and offers real-time communication. However, it rapidly becomes brittle and less fault-tolerant as volumes grow.
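
To make the difference concrete, here is a minimal in-process sketch. Python’s standard `queue.Queue` stands in for a real broker such as RabbitMQ or SQS, and `send_email`, `direct_delivery`, and `brokered_delivery` are hypothetical names invented for this illustration, not part of any library.

```python
import queue
import threading


def send_email(address: str) -> str:
    """Hypothetical delivery function; a real one would call an email gateway."""
    return f"sent:{address}"


def direct_delivery(addresses: list[str]) -> list[str]:
    # Direct delivery: the caller performs every send itself, one after another.
    return [send_email(address) for address in addresses]


def brokered_delivery(addresses: list[str], workers: int = 2) -> list[str]:
    # Brokered delivery: the publisher only enqueues; consumers drain the queue.
    broker: queue.Queue = queue.Queue()
    results: list[str] = []
    lock = threading.Lock()

    def consumer() -> None:
        while True:
            address = broker.get()
            if address is None:  # sentinel: no more work for this consumer
                broker.task_done()
                return
            outcome = send_email(address)
            with lock:
                results.append(outcome)
            broker.task_done()

    threads = [threading.Thread(target=consumer) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for address in addresses:
        broker.put(address)
    for _ in threads:
        broker.put(None)  # one sentinel per consumer
    broker.join()
    for thread in threads:
        thread.join()
    return results
```

In the brokered variant, the number of consumers can be raised without touching the publisher, which is the decoupling the broker buys you. The cost is the extra machinery visible even in this toy version.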

Asynchronous vs. Synchronous Processing

In the asynchronous model, the sender passes a message (often via a queue) and continues other tasks without waiting for a response. This approach offers high throughput and better fault tolerance, and it avoids having a single point of failure block the entire chain.

The opposite of this is the synchronous model. Here, the sender waits for confirmation that the message was received. This works for real-time interactions where immediate feedback is vital, such as chat applications. It is a dramatically bad choice for high-volume scenarios (say, an email marketing campaign), where you can’t afford to delay thousands of emails because there is a problem with a single address in the recipient list.
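
The campaign scenario can be sketched with Python’s `asyncio`. This is a minimal illustration under stated assumptions: `send_email` and `send_campaign` are hypothetical names, and the `asyncio.sleep(0)` call merely stands in for real network I/O.

```python
import asyncio


async def send_email(address: str) -> str:
    """Hypothetical sender; raises for a malformed address."""
    await asyncio.sleep(0)  # stands in for network I/O
    if "@" not in address:
        raise ValueError(f"invalid address: {address}")
    return address


async def send_campaign(addresses: list[str]) -> tuple[list[str], list[str]]:
    # return_exceptions=True keeps one failing send from aborting the others.
    outcomes = await asyncio.gather(
        *(send_email(address) for address in addresses),
        return_exceptions=True,
    )
    delivered = [a for a, o in zip(addresses, outcomes) if not isinstance(o, Exception)]
    failed = [a for a, o in zip(addresses, outcomes) if isinstance(o, Exception)]
    return delivered, failed
```

Calling `asyncio.run(send_campaign(["a@x.io", "broken", "b@x.io"]))` delivers to the two valid addresses and reports `"broken"` as failed, instead of stopping at the bad entry.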

Microservices vs. Monolithic Architecture

Microservices architecture organizes communication-related features, such as email delivery, push notifications, and analytics pipelines, into separate, independent services. Each service can scale on its own and operate without directly impacting others. This approach enables rapid iteration and ensures that a failure in one microservice won't crash the entire application. However, it introduces operational complexity and demands careful planning.

In contrast, a monolithic architecture consolidates all communication functionalities into a single service or codebase. This makes initial development and deployment simpler. However, scaling individual components becomes challenging, and a single performance bottleneck can potentially bring down the entire system. For example, a problem with sending millions of emails will also prevent thousands of SMS and push notifications from being sent, even though nothing is wrong with those parts of the system.

Push vs. Pull Delivery Models for Notifications

Push-based models actively send notifications directly to clients (examples include APNs for iOS and FCM for Android). This approach offers real-time message delivery with minimal overhead on the client side but requires stable client-server connections and sophisticated load balancing to manage large traffic spikes.

On the other hand, pull-based models rely on clients periodically checking the server for new messages. This method is often simpler to implement and can reduce server load when events are short-lived or infrequent. However, it can result in message delivery delays, and frequent polling increases network traffic.

Benchmarking Your Current System's Limitations

Before jumping to any conclusions and investing in solutions, you need data to assess your current system. In other words, you need to know what is failing before you can fix it. A handful of technical signals can tell you whether you’re nearing or have already reached your system’s breaking point. Those include:

Message Queue Depth Trends

Tracking message queue depth involves monitoring the number of unprocessed messages remaining in the queue. A growing queue typically indicates that consumers aren’t keeping up with incoming messages. That usually boils down to system bottlenecks or capacity issues. You can monitor this metric using tools provided by popular message brokers like RabbitMQ, Kafka, and AWS SQS or through custom dashboards. Remember to set your alerts properly. If you kick off the queue with thousands of messages, a large initial depth is to be expected. The depth only matters once the queue is being processed: it should then shrink, not grow.
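
One way to encode that alerting rule is sketched below. The function name, the window of five samples, and the minimum depth of 100 are assumptions chosen for illustration; tune them to your own traffic.

```python
def queue_is_backing_up(
    depth_samples: list[int], window: int = 5, min_depth: int = 100
) -> bool:
    """Return True when depth stayed flat or grew across the last `window` samples.

    A large but shrinking queue (e.g. right after a campaign kick-off) does not
    trigger the alert; only a queue that fails to drain does.
    """
    if len(depth_samples) < window:
        return False  # not enough data to call it a trend
    recent = depth_samples[-window:]
    if recent[-1] < min_depth:
        return False  # too small to matter
    return all(later >= earlier for earlier, later in zip(recent, recent[1:]))
```

With these defaults, samples such as `[10000, 9000, 8000, 7000, 6000]` are treated as a healthy, draining queue, while `[200, 300, 450, 600, 900]` raises the flag.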

Processing Time per Message

Processing time per message refers to the average and worst-case duration from the moment a message enters the queue until it’s fully processed. Lengthy processing times may suggest insufficient resources or inefficient code execution paths. Measure processing durations by adding timestamps as the message enters and leaves the queue, and log these times using tools such as Datadog, Prometheus, or Kibana. Interpret the numbers with care: what you are looking for is gradual growth over time that eventually hits the maximum processing duration you are willing to accept.
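
A minimal sketch of the timestamping idea follows, using only the standard library. `TimedMessage`, `processing_time`, and `summarize` are hypothetical names, and the 95th percentile is an assumed choice of “near worst case” statistic; in production you would export these values to your monitoring tool rather than compute them by hand.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TimedMessage:
    body: str
    # Stamped when the message is created, i.e. when it enters the queue.
    enqueued_at: float = field(default_factory=time.monotonic)


def processing_time(message: TimedMessage) -> float:
    """Seconds elapsed since the message entered the queue."""
    return time.monotonic() - message.enqueued_at


def summarize(durations: list[float]) -> dict[str, float]:
    """Average, nearest-rank 95th percentile, and worst case (non-empty input)."""
    ordered = sorted(durations)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return {
        "avg": sum(ordered) / len(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }
```

Tracking the average alongside the tail values matters because a handful of very slow messages can hide behind a healthy-looking mean.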

Error Rate Correlation with Volume

Monitoring the error rate as your message volume increases involves tracking errors like dropped messages or delivery failures in relation to traffic spikes. A significant increase in errors that coincides with a surge in user activity usually points to underlying scalability problems. You can analyze this correlation by overlaying error logs with usage metrics in centralized logging or Application Performance Monitoring (APM) tools like Sentry or New Relic. Watch out for false positives, though: the error rate may also grow due to, say, a Gmail outage or a provider purging dead or bot accounts. Consider automating the cleanup so that addresses that fail consistently over a few days are removed from your database.
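
If you want a number rather than an overlaid chart, a Pearson correlation between volume and error rate per time bucket is one simple option. The sketch below is an illustration only: `error_rates` and `pearson` are hypothetical helper names, and the choice of Pearson correlation is our assumption rather than something the tools above prescribe.

```python
import math


def error_rates(sent: list[int], failed: list[int]) -> list[float]:
    """Failure ratio per time bucket (0.0 for buckets with no traffic)."""
    return [f / s if s else 0.0 for s, f in zip(sent, failed)]


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient; 0.0 when either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```

A coefficient close to 1 means the failure ratio climbs together with volume, which hints at a scalability limit. A value near zero suggests the errors have another cause, such as the provider-side issues mentioned above.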

Database Connection Saturation

This situation occurs when the maximum number of database connections is reached, causing slowed queries and queued requests. Because messaging systems often interact closely with databases containing user or content information, increased messaging traffic can overwhelm database resources and make your product barely usable, if at all. To monitor this effectively, track connection pool usage and query latency, especially during peak times such as high-traffic campaigns or periods of increased user activity. If saturation occurs often, consider creating a replica of the tables that your messaging system uses and pointing that system at the copy. This way, messages can be sent without affecting the product’s performance.

Building a System Ready For Scale

Once you understand the root causes of your messaging constraints, you can consider different architectural and coding patterns designed to handle high volumes reliably. On top of the few solutions provided in the previous section, please assess the following options:

Event-Driven Architecture

In this type of architecture, system components publish events, such as "Order Placed" or "User Signed Up," to a central event bus. Consumers subscribe to specific event types and trigger appropriate actions, like sending a welcome email. This design promotes loose coupling among components, scales naturally by adding consumers for events experiencing heavy traffic, and simplifies the introduction of new features without rewriting existing functionality. In other words, this type of system design allows you to add new types of messaging easily without affecting existing processes.
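
The publish/subscribe mechanics can be sketched in a few lines. This is an in-memory toy with a hypothetical `EventBus` class; a production event bus would be a broker or streaming platform, but the shape of the interaction is the same.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class EventBus:
    """In-memory event bus: producers publish, consumers subscribe by event type."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # Producers know nothing about who is listening, only the event type.
        for handler in self._handlers.get(event_type, []):
            handler(payload)


# Usage: adding a new kind of message means adding a subscriber, nothing else.
outbox: list[str] = []
bus = EventBus()
bus.subscribe("user_signed_up", lambda event: outbox.append(f"welcome:{event['email']}"))
bus.publish("user_signed_up", {"email": "a@x.io"})
bus.publish("order_placed", {"order_id": 1})  # no subscribers yet, so nothing happens
```

Note that the code publishing `user_signed_up` never changes when a second subscriber, say an SMS sender, is added later.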

Stream Processing for High-Volume Communications

Stream processing uses solutions like Apache Kafka Streams or AWS Kinesis to manage event streams in real time. These platforms allow you to process messages before directing them to their intended recipients. This is highly beneficial for handling time-sensitive notifications, conducting real-time data analytics, or dynamically segmenting user groups (e.g., for sending SMS notifications to all U.S.-based users who completed a purchase within the past hour). In other words, you can prioritize your messages and make sure that the time-sensitive ones are dispatched at the right time rather than held up in a queue. You wouldn’t want your multifactor authentication code to take longer to arrive than the time your user has to enter it, right?
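
The prioritization idea can be illustrated without a streaming platform at all. The sketch below uses a plain in-process priority queue, not Kafka Streams or Kinesis; the `PRIORITY` table and the `PriorityDispatcher` class are hypothetical and exist only to show time-sensitive messages overtaking bulk ones.

```python
import heapq
import itertools

# Hypothetical priority table: lower number means dispatched sooner.
PRIORITY = {"mfa_code": 0, "order_update": 1, "marketing": 2}


class PriorityDispatcher:
    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str, str]] = []
        # The counter breaks ties so messages of equal priority stay in FIFO order.
        self._counter = itertools.count()

    def submit(self, kind: str, body: str) -> None:
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._counter), kind, body))

    def drain(self) -> list[tuple[str, str]]:
        """Pop every pending message, most urgent first."""
        dispatched = []
        while self._heap:
            _, _, kind, body = heapq.heappop(self._heap)
            dispatched.append((kind, body))
        return dispatched
```

Even if a login code is submitted after a batch of marketing messages, `drain()` hands it out first.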

Sharding Strategies for Message Storage

Sharding is essential because large message tables can slow queries, inflate backup sizes, and complicate data migrations. Effective sharding involves segmenting messages based on criteria such as user region, time range, or other identifiers, and placing each segment into its own database or partition. The guiding principle is to keep each shard small enough to be manageable yet capable of handling localized surges in traffic. Sharding by region can also be a legal requirement if the regions you operate in demand that their users’ data be stored on servers located within that region.
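
A shard-routing function might look like the sketch below. The combination of region plus a hash of the user ID, the `messages_<region>_<bucket>` naming scheme, and the default of four shards per region are all assumptions made for this example.

```python
import hashlib


def shard_for(user_id: str, region: str, shards_per_region: int = 4) -> str:
    """Pick a shard name from the user's region and a stable hash of their ID."""
    # A cryptographic hash is used only for its stable, even distribution;
    # Python's built-in hash() is randomized per process and would not be stable.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % shards_per_region
    return f"messages_{region.lower()}_{bucket}"
```

Keeping the region in the shard name makes it straightforward to host, for example, every `messages_eu_*` shard on servers inside the EU. Keep in mind that changing `shards_per_region` later moves users between shards, so plan the count (or a rebalancing strategy) up front.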

Circuit Breaker Patterns for External Dependencies

This pattern monitors the responsiveness and performance of third-party services, such as email gateways or SMS providers. If a monitored service becomes unresponsive or slows significantly, the circuit breaker "opens," redirecting requests to backup solutions or temporarily queuing them until the primary service recovers. Implementing this pattern is crucial as it prevents system-wide failures caused by a single external API outage or slowdown. Ideally, you want the backup solution to be a different provider altogether to increase the chances of keeping your service going even when your first choice is completely down.
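
Here is a compact sketch of the pattern. The `CircuitBreaker` class, its thresholds, and the injectable `clock` parameter are illustrative assumptions; mature libraries exist for this, and the version below only counts consecutive failures rather than measuring slowdowns.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def _is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # After the cool-down, let a trial call through ("half-open" state).
        return self._clock() - self._opened_at < self.reset_after

    def call(
        self,
        primary: Callable[[str], str],
        fallback: Callable[[str], str],
        message: str,
    ) -> str:
        if self._is_open():
            return fallback(message)  # skip the failing provider entirely
        try:
            result = primary(message)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            return fallback(message)
        # Success: close the circuit and forget past failures.
        self._failures = 0
        self._opened_at = None
        return result
```

Here `primary` and `fallback` would wrap two different providers, for instance two SMS gateways. While the circuit is open, the primary is not called at all, which spares both your workers and the struggling provider.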

Build vs. Buy Decision Framework

Until now, we have been providing insights and advice on how to diagnose and improve your own messaging system. However, there is a simpler option that requires far less in-house technical expertise: you can “throw money at the problem.” In this scenario, you use an external tool or provider to run your messaging for you, or to manage the solution you have already built. The question then becomes: Should you? Let’s explore this possibility:

Complete Outsourcing

To start with, let’s consider whether you need to build your own setup in the first place. Third-party messaging services are beneficial when you want to offload the complexities of email deliverability, push notification routing, or SMS management. These services are quick to implement, typically come with service-level agreements (SLAs), and require minimal in-house effort.

However, costs can quickly escalate at high message volumes, and customization options may be limited. Examples of reliable third-party messaging platforms include AWS SNS, Twilio, and SendGrid, which efficiently handle high-volume, multi-channel messaging scenarios.

Bespoke Solution

On the other hand, there is always the option of developing a custom-built solution. This is ideal when you require maximum control over user data handling, performance optimization, or regulatory compliance. Tailored architecture allows precise oversight over performance characteristics, can lead to significant cost savings at very large scales, and provides maximum privacy of the entire codebase.

However, developing custom solutions demands specialized engineering expertise and robust DevOps capabilities. For example, a large marketplace requiring specialized compliance or high throughput might develop a custom solution based on Apache Kafka. You also end up rebuilding, to your own specification, much of what existing products already offer, which can mean a long time to launch.

(You’ll find a more comprehensive overview of the buy vs build dilemma in this article.)

Say you have already had your system for years, and it’s getting rusty. What then?

Improving the Existing Messaging System

This improvement, commonly known as refactoring, should be considered when existing infrastructure operates well at moderate scale but begins to show rising error rates or latency spikes during peak loads. If your team is highly familiar with the current codebase and the problematic areas are clearly defined, having that team refactor it will usually be more cost-effective than a complete rebuild. For instance, they can replace synchronous service calls with asynchronous message queuing in a monolithic architecture, which can remove performance bottlenecks without migrating entirely to microservices.

Of course, any task that can be done in-house can be outsourced, but in this instance, the time needed to onboard an external team is likely to be as large as the time your existing team would need to simply make the update themselves.

All things considered, it doesn’t have to be either in-house or outsourced. You can always work with a mix that works for you:

Hybrid Approaches

A hybrid setup is suitable when you have unique requirements in specific areas, such as real-time push notifications, while standard communication channels like email can rely on third-party services.

This approach balances control with convenience, allowing your engineering resources to focus primarily on differentiated features. However, managing multiple providers and integrating them seamlessly introduces additional complexity and coordination challenges.

We can’t help but quote Woody Allen and say: “Whatever works,” as long as it gets all your messages delivered on time, legally, and without crippling your product.

Conclusion

Shifting from basic messaging to a truly scalable, high-velocity communication system is about making strategic engineering decisions. For many businesses, these choices revolve around embracing event-driven patterns, queue-based or streaming architectures, and robust strategies for sharding and error handling. As you move from a legacy design to a more modern, resilient platform, open and ongoing collaboration between business stakeholders and engineering leaders is crucial.

When you know where your limits are, how to measure them, and what architectural approaches best mitigate those pain points, you’re free to pursue growth without fear that each new campaign or user surge will bring your messaging to a crawl. Ask your team the right questions:

  • Do we know our message processing ceiling?
  • Are we prepared if the volume suddenly doubles?
  • Have we adopted asynchronous patterns, or are we still locking threads with synchronous calls?

Answering those questions sets the stage for a messaging architecture that not only keeps pace with growth but accelerates it, allowing you to reap the full rewards of timely, relevant customer communication.

Of course, if you’d like to get this done by experts, Appunite is here to help! Contact us today, and let’s get working together to make your messaging system best fit your current and future needs.