Six Rules of Thumb for Scaling Software Architecture

Updated: Feb 1

In the ever-evolving landscape of software architecture, the pursuit of scalable systems is both an art and a science. As digital demands grow and user bases expand, the ability to handle increasing workloads becomes paramount. While the industry giants showcase engineering marvels, not every system starts with scalability as a primary focus. Features and usability often take precedence in the initial stages, only to find scalability challenges looming as success leads to a surge in requests and data volumes.

In this exploration, we delve into six essential rules of thumb for software architects navigating the intricate path of scaling architecture. These principles, shaped by practical experiences and lessons learned, offer invaluable guidance. Whether you are in the midst of architecting a new system or steering an existing one toward scalability, these rules provide a compass to navigate the complexities and challenges of scaling software architecture.

For most business and government systems, scalability isn't the primary focus initially. Emphasis lies on adding features and improving usability. When success leads to increased demand, the challenge arises to adapt the system. Architects, the designers of the system, step in to guide this evolution, often requiring changes to the core structure of the system.

To aid architects in this journey, six essential rules serve as a roadmap. These rules provide crucial insights for crafting systems that can efficiently handle growing demands and data as they scale.

Saving Cost while Scaling up

In system scaling, a fundamental principle emerges: cost and scalability are tightly connected. The ability to add processing resources to accommodate heavier loads is at the core of scaling systems. A prevalent strategy involves deploying multiple instances of stateless server resources, coupled with a load balancer for efficient request distribution (refer to Figure 1).

In cloud platforms like Amazon Web Services (AWS), the elemental costs are two-fold:

  1. Cost of Virtual Machine Deployments:

  • Directly linked to each instance of server deployment.

  2. Cost of Load Balancer:

  • Determined by the influx of new and active requests, alongside the volume of processed data.

As the demand for processing intensifies, so does the need for additional virtual machines, leading to escalated costs. Concurrently, load balancer expenses surge in tandem with request loads and data sizes.
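The two cost components can be put into a rough back-of-the-envelope model. The sketch below is illustrative only: the function name, the parameters, and every price in it are made-up placeholders, not actual AWS rates.

```python
def hourly_cost(instances, vm_rate, lb_base_rate, lb_units, lb_unit_rate):
    """Estimated hourly cost: virtual machine charges plus load balancer charges."""
    vm_cost = instances * vm_rate                      # grows with each deployed instance
    lb_cost = lb_base_rate + lb_units * lb_unit_rate   # grows with requests and data volume
    return vm_cost + lb_cost


# Placeholder prices, chosen only to show how both terms grow with load.
light = hourly_cost(instances=2, vm_rate=0.10, lb_base_rate=0.02,
                    lb_units=1, lb_unit_rate=0.01)
heavy = hourly_cost(instances=20, vm_rate=0.10, lb_base_rate=0.02,
                    lb_units=10, lb_unit_rate=0.01)
print(f"light load: ${light:.2f}/hour, heavy load: ${heavy:.2f}/hour")
```

Even in this simplified model, a tenfold increase in load raises both terms at once, which is why scaling decisions and the deployment bill cannot be considered separately.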

Figure 1: A Simple Load Balancing Example

Therefore, the synergy between cost and scale is undeniable. Design decisions regarding scalability inevitably wield a profound impact on deployment costs. Neglecting this correlation might result in unwelcome surprises, akin to prominent companies grappling with unexpectedly substantial deployment bills.

Optimizing Design for Cost Reduction

In the face of this intricate relationship, how can one architect a system to mitigate costs?

Two key strategies emerge:

Elastic Load Balancer Implementation: Employ an elastic load balancing setup that dynamically adjusts the number of server instances based on the instantaneous request load (on AWS, this typically means pairing the load balancer with an auto-scaling group). During periods of light traffic, you pay only for the minimum number of server instances. As demand surges, new instances are launched automatically, keeping capacity in step with the escalating request volumes.

Optimizing Server Instance Capacity: Enhance the capacity of each server instance by fine-tuning deployment parameters (e.g., threads, connections, heap size). Default platform settings seldom align optimally with your workload. Thoughtfully adjusting these parameters can yield substantial performance improvements, effectively achieving more work with the same resources — a pivotal tenet in the pursuit of scalability.
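As a rough illustration of how the two strategies interact, the sketch below computes how many instances an elastic setup would need for a given request rate. The function name, the minimum and maximum bounds, and the per-instance capacities are assumptions chosen for the example; a real auto-scaling policy would work from measured metrics.

```python
import math


def instances_needed(request_rate, per_instance_capacity,
                     min_instances=2, max_instances=50):
    """Instances required to serve request_rate (requests/second), within bounds."""
    required = math.ceil(request_rate / per_instance_capacity)
    return max(min_instances, min(required, max_instances))


# Default settings: assume each instance handles 100 requests/second.
print(instances_needed(50, 100))     # quiet period: stays at the minimum
print(instances_needed(3000, 100))   # surge: scale out
# After tuning threads, connections and heap size: assume 150 requests/second.
print(instances_needed(3000, 150))   # same surge, fewer instances to pay for
```

Under these assumed numbers, tuning each instance cuts the surge deployment from 30 instances to 20, which is the "more work with the same resources" effect described above.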

Finding System Weakness

Your system is facing a bottleneck somewhere, though it might not be immediately obvious. Scaling a system means increasing its capacity to handle more work. For instance, in the example mentioned earlier, we increased the system's ability to process requests by adding more server instances. However, most software systems consist of various interconnected parts, often called microservices. When you boost capacity in one area, you may overwhelm other parts of the system.

Consider our load-balanced example where all server instances connect to the same shared database. As we add more servers, the demand on the database grows, eventually reaching a point where the database can't keep up. Now, the database becomes a bottleneck, and merely adding more server capacity won't help. To continue scaling, the focus should shift to enhancing the database's capacity. This might involve optimizing queries, adding more computing resources like CPUs and memory, or exploring solutions like database replication or sharding.
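A minimal sketch of this effect, assuming every request results in one database query and using made-up capacity figures: overall throughput is capped by the slower of the two tiers, so adding servers stops helping once the database limit is reached.

```python
def system_throughput(server_count, per_server_rps, database_rps):
    """Sustainable requests/second: limited by the slowest tier."""
    server_tier_rps = server_count * per_server_rps
    return min(server_tier_rps, database_rps)


# Assumed capacities: 500 requests/second per server, 3000 for the shared database.
for servers in (2, 4, 8, 16):
    print(servers, "servers ->",
          system_throughput(servers, per_server_rps=500, database_rps=3000),
          "requests/second")
```

With these figures, going from 8 to 16 servers adds cost but no throughput, because the database is already saturated.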

Figure 2: Increasing Server Capacity creates a Bottleneck at the Database

Any shared resource in your system has the potential to become a bottleneck, restricting overall capacity. As you boost capacity in different parts of your system, it's crucial to examine downstream areas to prevent overwhelming the system with requests unexpectedly. Ignoring this can lead to a cascading failure, as discussed in the next rule, ultimately causing the entire system to crash.

Databases, message queues, long-latency network connections, thread and connection pools, and shared microservices are common culprits for bottlenecks. Systems with high traffic loads quickly reveal these vulnerable elements. The key is to guard against sudden crashes when bottlenecks are exposed and to be able to add capacity swiftly to keep the system running smoothly.

Slow Services are Worse than Failed Services

In the realm of system operations, slow services pose a greater threat than failed services. In an ideally functioning system, designed for stability, communication latencies between microservices and databases should remain consistently low under normal load conditions (refer to Figure 3).

Figure 3: Low latencies under normal load

However, as client loads surpass operational profiles, latencies between microservices start to rise. Initially, this increase may be gradual and may not significantly impact overall system operations, especially during short-lived load surges. Yet, if the incoming request load consistently exceeds the capacity of the downstream service (Service B), pending requests accumulate in the requesting microservice (Service A). This is depicted in Figure 4.
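The backlog arithmetic can be sketched with a simple second-by-second model. The function name and the rates are assumptions for illustration, and the model ignores timeouts and retries, which in practice tend to make matters worse.

```python
def pending_requests(arrival_rps, service_rps, seconds):
    """Requests waiting in Service A after `seconds` of constant load."""
    backlog = 0
    for _ in range(seconds):
        # Each second, new requests arrive and Service B completes what it can.
        backlog = max(0, backlog + arrival_rps - service_rps)
    return backlog


# Assume Service B can complete 1000 requests/second.
print(pending_requests(arrival_rps=900, service_rps=1000, seconds=60))
print(pending_requests(arrival_rps=1100, service_rps=1000, seconds=60))
```

Below capacity the backlog stays at zero; at just 10% over capacity, 6000 requests are already waiting after one minute, each holding memory, threads or connections in the caller.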

Figure 4: Increased load causes longer latencies and requests to back up

In such scenarios, the situation can deteriorate rapidly. When one service becomes overwhelmed, experiencing thrashing or resource exhaustion, the requesting services become unresponsive, leading to a cascading failure. This occurs when a slow microservice causes requests to accumulate along the processing path until the entire system abruptly fails.

This emphasizes why slow services are considered more problematic than unavailable ones. In the case of a failed or temporarily partitioned service, an exception is received immediately, allowing for informed decision-making (e.g., retrying or reporting an error). Conversely, gradually overwhelmed services behave seemingly correctly but with prolonged latencies. This exposes potential bottlenecks in all dependent services, eventually leading to a catastrophic failure.

To mitigate such risks, architectural patterns like Circuit Breakers and Bulkheads act as safeguards against cascading failures. Circuit breakers allow throttling or shedding of request loads if latencies to a service surpass a specified threshold. Bulkheads shield an entire microservice from failing if one of its downstream dependencies encounters issues. Incorporating these patterns contributes to the construction of both resilient and highly scalable architecture.
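To make the circuit breaker idea concrete, here is a minimal single-threaded sketch. It is not a production implementation: the class name, the thresholds and the simplified half-open behaviour are all assumptions for illustration, and a real system would more likely rely on an established resilience library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    def __init__(self, latency_threshold, failure_limit=3,
                 reset_after=30.0, clock=time.monotonic):
        self.latency_threshold = latency_threshold  # seconds considered "too slow"
        self.failure_limit = failure_limit          # consecutive bad calls before opening
        self.reset_after = reset_after              # cool-down before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Shed load immediately instead of queueing behind a slow service.
                raise CircuitOpenError("circuit open; request shed")
            # Cool-down elapsed: allow one trial call; a single bad result reopens.
            self.opened_at = None
            self.failures = self.failure_limit - 1
        start = self.clock()
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        if self.clock() - start > self.latency_threshold:
            self._record_failure()   # slow responses count as failures
        else:
            self.failures = 0
        return result

    def _record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_limit:
            self.opened_at = self.clock()
```

A caller would wrap each downstream request, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile` function, and treat `CircuitOpenError` as an immediate failure. The point of the design is that slow responses are converted into fast, explicit errors, which is exactly the property that makes failed services easier to handle than slow ones.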

Scaling Data Tier

Scaling the data tier, where databases reside, poses unique challenges in system development. Databases serve as the core repository for essential business data, including transactional databases (holding critical data like customer profiles and account balances) and operational data sources (temporary data for data warehouses, such as user session lengths and page view counts).

Transactional databases demand correctness, consistency, and availability, making them crucial for the system's integrity. In contrast, operational data is often transient, allowing for out-of-band storage in logs or message queues, as it doesn't need to be 100% complete and can be aggregated over time.

As the system's request processing layer scales, shared transactional databases can swiftly become a bottleneck due to increased query loads. Initial steps involve query optimization and adding memory to boost database engine performance. However, when these measures prove insufficient, more radical changes become necessary.

Modifying the data organization in the data tier can be challenging. Schema changes in relational databases may require data reloads, causing downtime for write operations. NoSQL, schemaless databases mitigate the need for reloads but necessitate changes in query-level code to accommodate altered data organization.

Further scaling often involves database distribution, either through a leader-follower model with read-only replicas or a leaderless approach. When data is partitioned across nodes, choosing the right partition key is critical, as it determines how data is distributed. Changing the partition key typically requires rebuilding the database. Distributing and partitioning data across multiple nodes can be administratively complex, leading to the preference for managed cloud-based alternatives like AWS DynamoDB or Google Firestore.
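The influence of the partition key can be seen in a small sketch of hash-based partitioning. The record layout, the key names and the four-node cluster are hypothetical, and real databases use their own partitioning schemes; the example only shows why a low-cardinality key concentrates data on a few nodes.

```python
import hashlib
from collections import Counter


def node_for(partition_key, node_count):
    """Map a partition key to a node index using a stable hash."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


# Hypothetical data set: 1000 orders, 90% of them from a single country.
orders = [{"order_id": f"order-{i}", "country": "US" if i % 10 else "NZ"}
          for i in range(1000)]

by_order = Counter(node_for(o["order_id"], 4) for o in orders)
by_country = Counter(node_for(o["country"], 4) for o in orders)
print("partitioned by order_id:", sorted(by_order.items()))
print("partitioned by country: ", sorted(by_country.items()))
```

Partitioning by the unique order identifier spreads records across all four nodes, whereas partitioning by country leaves at least 900 of the 1000 records on one node, turning that node into the bottleneck.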

The key takeaway is clear: altering logical and physical data models to scale query processing capabilities is seldom a seamless process. It's a challenge best tackled infrequently to minimize disruptions.

Cache Everything for a Smoother System

To alleviate the strain on your database, a smart strategy is to minimize direct access whenever possible. The first form of caching is the database's own: most database engines will make use of as much on-node cache memory as you give them. It's a straightforward yet effective solution, though the extra memory can be costly.

But why bother querying the database if it's unnecessary? For data that undergoes frequent reads and changes infrequently, you can optimize your processing logic by checking a distributed cache, like a Memcached server, before resorting to a database query. While this introduces a remote call, fetching data from a cache over a fast network is far less resource-intensive than querying the database instance.

Implementing a caching layer involves tweaking your processing logic to check for cached data. If the desired information isn't in the cache, your code must then query the database, load the results into the cache, and return them to the requester. Deciding when to remove or invalidate cached results depends on your application's tolerance for potentially serving outdated data to clients.
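The read path described above is often called the cache-aside pattern. Below is a minimal sketch in which an in-process dictionary stands in for a distributed cache such as Memcached; the class name, the time-to-live policy and the `query_database` stub are assumptions for illustration.

```python
import time


class CacheAside:
    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self.load_from_db = load_from_db   # function that performs the real query
        self.ttl = ttl_seconds             # how long a cached result may be served
        self.clock = clock
        self.entries = {}                  # key -> (value, expiry_time)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry[0]                # cache hit: no database access
        value = self.load_from_db(key)     # miss or expired: query the database
        self.entries[key] = (value, self.clock() + self.ttl)
        return value

    def invalidate(self, key):
        """Drop a cached result, e.g. after the underlying row is updated."""
        self.entries.pop(key, None)


db_reads = []

def query_database(key):
    # Stand-in for a real database query; records each access.
    db_reads.append(key)
    return {"id": key, "name": "example"}

profiles = CacheAside(query_database, ttl_seconds=30.0)
profiles.get("user-1")
profiles.get("user-1")
print(len(db_reads))  # only the first call reached the database
```

The time-to-live value encodes the application's tolerance for stale data: a longer value offloads more reads from the database, while a shorter value, or explicit invalidation on writes, keeps results fresher.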

A well-designed caching scheme can be a game-changer in scaling a system. By handling a substantial portion of read requests through the cache, you reduce the burden on your databases, sparing you from intricate and cumbersome data tier modifications (as discussed in the previous rule). This not only streamlines operations but also creates room for accommodating a growing number of requests.

Monitoring Matters in Scalable Systems

As teams grapple with increasing workloads, the challenge of realistic load testing becomes apparent. Testing the scalability of a deployment, especially when envisioning a tenfold increase in database size, involves significant effort. Generating representative data and creating realistic workloads for testing, whether focusing on reads or both reads and writes, demands meticulous preparation. Typically, load tests are executed using specialized tools.

However, this approach is complex, often falling short of accurately reflecting real-world scenarios, and consequently, it's seldom undertaken.

An alternative solution is monitoring. At its core, system monitoring ensures the operational status of infrastructure components. Alerts are triggered if critical resources like memory or disk space are running low, or if remote calls are failing. This proactive approach enables remedial actions before severe issues arise.

Basic monitoring, while essential, becomes insufficient as the system scales. Understanding the intricate relationships between application behaviors becomes paramount. For instance, monitoring how database writes perform under increasing concurrent write requests or detecting microservices' circuit breakers tripping due to growing downstream latencies are crucial insights. Cloud platforms offer monitoring frameworks, such as AWS CloudWatch, while solutions like Splunk provide comprehensive log aggregation.

Observability, a broader term encapsulating monitoring and performance analysis, is vital. Deep insights into performance necessitate custom metrics aligned with specific application behavior. Integrate these metrics into your microservices to inject them into the monitoring framework, ensuring observability.

Two key considerations emerge.

  1. First, generating custom metrics tailored to your application's behavior is essential for in-depth performance insights. Carefully design these metrics and integrate them into your monitoring setup for comprehensive observations.

  2. Second, monitoring is an ongoing necessity and cost in your system. It should be active at all times. The data collected serves as a guide for experiments and efforts when tuning performance and scaling your system. Adopting a data-driven approach in system evolution ensures time is invested in modifying and enhancing fundamental components that underpin your performance and scaling requirements.
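As one possible shape for such custom metrics, the sketch below records call counts, error counts and latencies for any function it decorates. The `Metrics` class, the metric names and the `write_order` stub are hypothetical; in a real deployment the collected values would be forwarded to whichever monitoring framework is in use rather than kept in memory.

```python
import time
from collections import defaultdict
from functools import wraps


class Metrics:
    def __init__(self):
        self.counters = defaultdict(int)
        self.latencies = defaultdict(list)

    def timed(self, name):
        """Decorator that records calls, errors and latency under `name`."""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                try:
                    return func(*args, **kwargs)
                except Exception:
                    self.counters[f"{name}.errors"] += 1
                    raise
                finally:
                    self.counters[f"{name}.calls"] += 1
                    self.latencies[name].append(time.perf_counter() - start)
            return wrapper
        return decorator

    def snapshot(self, name):
        samples = self.latencies[name]
        return {
            "calls": self.counters[f"{name}.calls"],
            "errors": self.counters[f"{name}.errors"],
            "mean_latency": sum(samples) / len(samples) if samples else 0.0,
            "max_latency": max(samples) if samples else 0.0,
        }


metrics = Metrics()

@metrics.timed("db.write")
def write_order(order):
    # Stand-in for a real database write.
    return True

for i in range(5):
    write_order({"id": i})
print(metrics.snapshot("db.write"))
```

Tracking a metric such as database write latency against the number of concurrent writers is the kind of application-specific signal that basic infrastructure monitoring does not provide.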


Scaling software architecture is a journey fraught with challenges and complexities, yet it is an inevitable phase in the life of a successful system. The six rules of thumb presented here encapsulate foundational knowledge that every software architect should wield in their arsenal. From understanding the interconnectedness of services to the critical role of monitoring and the strategic use of caching, these rules offer practical insights to overcome scalability hurdles.

