Hi Friends,
Here is something I wish someone had told me earlier in my development journey.
The code is not the hard part.
The hard part is the decisions you make before you write a single line. The architecture. The tradeoffs. The questions you ask about how your system will behave when things go wrong, when traffic spikes, when a database node goes down at 2am and your users are still trying to use your application.
System design is one of those topics that feels abstract until the day it is not. And by then you are usually already dealing with the consequences of not thinking about it early enough.
I have been diving deep into this and I want to share what I learned in a way that actually makes sense. Not textbook definitions. Real concepts you can carry into every project you build from here on out.
Here is what is inside this post:
What good system design actually means and the three pillars that hold it up
The three things every system does with data and why each one requires a different mindset
The CAP theorem and why it is the most important tradeoff you will ever make as a developer
Availability numbers that will change how you think about uptime
Speed and how to measure whether your system is actually performing
How to think through architecture before you write any code
And please be a kind human and subscribe to support the blog!
What Is Good System Design
Good system design is not about finding the perfect solution. It is about finding the best solution for your specific use case.
That distinction matters more than it sounds. Every architectural decision is a tradeoff. Every choice you make optimizes for something and sacrifices something else. The job of a system designer is not to avoid tradeoffs. It is to make informed ones.
Three principles sit at the foundation of every well-designed system:
Scalability is your system’s ability to handle growth. More users. More data. More requests. A scalable system grows without breaking.
Maintainability is how easy it is to change your system over time. Code that works today but cannot be understood or modified six months from now is a liability, not an asset.
Reliability is whether your system does what it is supposed to do consistently. Especially when things go wrong. And things always go wrong eventually.
Planning for failure is not pessimism. It is engineering.
The Three Things Every System Does With Data
Every application, regardless of how complex it is, does exactly three things with data. Understanding each one changes how you design for it.
Moving Data
Data moves constantly. Between your frontend and backend. Between services. Between your application and external APIs. The question is not whether data moves but how seamlessly it does. Optimizing for speed and security in data movement is not an afterthought. It is a design decision you make upfront.
Storing Data
Storing data is not just picking a database. It is understanding your access patterns before you design your schema. How will this data be read? How often? By how many users simultaneously? What needs to be indexed? What needs to be backed up and how quickly does it need to be recoverable? Secure and readily available are not the same thing and sometimes they are in tension with each other.
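To make that concrete, here is a toy sketch in Python. The orders data and the field names are invented for the example; the point is that the access pattern, "show a user their orders," dictates the index, which is the same reasoning you would apply when choosing a database index on a column.

```python
# Toy data, invented for the example.
orders = [
    {"id": 1, "user_id": "alice", "total": 40},
    {"id": 2, "user_id": "bob", "total": 15},
    {"id": 3, "user_id": "alice", "total": 22},
]

# Access pattern: "show a user their orders" -> index by user_id,
# the in-memory equivalent of a database index on that column.
orders_by_user = {}
for order in orders:
    orders_by_user.setdefault(order["user_id"], []).append(order)

def get_orders(user_id):
    # O(1) lookup instead of scanning every order on every request.
    return orders_by_user.get(user_id, [])

print([o["id"] for o in get_orders("alice")])  # -> [1, 3]
```

If the dominant access pattern were instead "find all orders over a certain total," you would index differently. Schema follows access pattern, not the other way around.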
Transforming Data
Raw data is rarely useful on its own. Transformation is the process of taking what your system collects and turning it into something meaningful. Aggregations. Reports. AI outputs. Recommendations. The way you design your transformation layer determines how useful your application actually is to the people using it.
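As a small illustration, here is a transformation layer in miniature: raw events in, a meaningful aggregate out. The click events below are made up for the example.

```python
from collections import Counter
from datetime import date

# Hypothetical raw click events collected by the system.
events = [
    {"day": date(2024, 5, 1), "page": "/home"},
    {"day": date(2024, 5, 1), "page": "/pricing"},
    {"day": date(2024, 5, 2), "page": "/home"},
]

# Transformation: raw events -> page views per day, the kind of
# aggregate a report or dashboard would actually display.
views_per_day = Counter(e["day"] for e in events)

print(views_per_day[date(2024, 5, 1)])  # -> 2
```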
The CAP Theorem: The Most Important Tradeoff in Distributed Systems

Formulated by computer scientist Eric Brewer, the CAP theorem is one of the most important concepts in system design and also one of the most misunderstood.
It states that in any distributed system you can only guarantee two of these three properties at the same time:
Consistency means every node in your distributed system has the same data at the same time. When you read data you always get the most recent version.
Availability means the system is always responsive to requests. Every request gets a response even if that response is not the most current data.
Partition Tolerance means the system continues functioning even when network issues occur and nodes cannot communicate with each other.
You can only pick two. And since network partitions are a fact of life in any distributed system, the practical choice is almost always between consistency and availability when a partition happens.
A banking system needs consistency and partition tolerance. Every transaction must reflect the true state of an account. If the system is temporarily unavailable while it ensures consistency that is an acceptable tradeoff. Transactions taking longer to process is better than a customer seeing an incorrect balance.
A social media feed might prioritize availability and partition tolerance. If you see a post that is a few seconds behind the absolute latest state of the platform that is acceptable. The feed being unavailable is not.
This is why I said good system design is about the best solution for your use case not the perfect solution. The CAP theorem makes it mathematically clear that perfect is not an option. Informed tradeoffs are.
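Here is a deliberately tiny simulation of that tradeoff. Nothing below is a real replication protocol; it is a sketch of how a two-replica store might behave during a partition, depending on whether it favors consistency (CP) or availability (AP).

```python
class Replica:
    def __init__(self):
        self.data = {}

class TinyStore:
    """Toy two-replica store. Writes land on replica a and are copied
    to replica b; reads are served from b. A 'partition' means the two
    replicas can no longer talk to each other."""

    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False

    def write(self, key, value):
        self.a.data[key] = value
        if self.partitioned:
            if self.mode == "CP":
                # Consistency first: refuse the write rather than let
                # the replicas diverge. The system is unavailable.
                raise RuntimeError("unavailable during partition")
            # AP: accept the write; replica b is now stale.
        else:
            self.b.data[key] = value

    def read(self, key):
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable during partition")
        return self.b.data.get(key)

store = TinyStore("AP")
store.write("balance", 100)
store.partitioned = True
store.write("balance", 50)     # accepted, but replica b never sees it
print(store.read("balance"))   # -> 100 (available, but stale)
```

The AP store answers every request but can serve an old balance; a CP version of the same store would refuse to answer until the partition heals. That is the social-feed-versus-bank distinction in about thirty lines.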
Availability: The Numbers That Actually Matter
When we talk about availability we are asking one question: is our system up and running when our users need it?
The answer is almost always expressed as a percentage. And the difference between those percentages is larger than most people realize.
99.9% availability sounds excellent. It means 8.76 hours of downtime per year.
99.99% availability means 52 minutes of downtime per year.
99.999% availability, what the industry calls five nines, means roughly 5 minutes of downtime per year.
For a personal blog 99.9% is probably fine. For a payment processing system, a healthcare platform, or any service where users depend on uptime to do their jobs, the difference between 99.9% and 99.999% is the difference between acceptable and unacceptable.
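The arithmetic behind those numbers is simple enough to fit in a few lines:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def downtime_per_year(availability_pct):
    """Hours of allowed downtime per year at a given availability."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

print(round(downtime_per_year(99.9), 2))         # -> 8.76 hours
print(round(downtime_per_year(99.99) * 60, 1))   # -> 52.6 minutes
print(round(downtime_per_year(99.999) * 60, 1))  # -> 5.3 minutes
```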
This is where SLOs and SLAs come into the conversation.
An SLO is a Service Level Objective. An internal target your team commits to. Web request success rates, response time percentages, error rate thresholds. These are the standards you hold yourself to.
An SLA is a Service Level Agreement. A contract with your customers. If you breach an SLA there are consequences. Refunds. Credits. Damaged trust. Lost contracts.
Understanding the difference between what you target and what you promise is fundamental to building systems people can actually rely on.
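One practical way teams use an SLO is as an error budget: the number of failures the target leaves room for before the objective is breached. A quick sketch, with the request count invented for the example:

```python
def error_budget(slo_pct, total_requests):
    """How many failed requests a success-rate SLO leaves room for.
    slo_pct is a target like 99.9 (percent of requests that succeed)."""
    return round(total_requests * (1 - slo_pct / 100))

# With a 99.9% success SLO and 10 million requests in a month,
# the team can "spend" up to 10,000 failures before breaching it.
print(error_budget(99.9, 10_000_000))  # -> 10000
```

Framed this way, the SLO stops being an abstract percentage and becomes a concrete number you can burn down, which is exactly how you decide whether a risky deploy is affordable this month.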
Predicting the Unexpected
Availability does not happen by accident. It is engineered through redundancy and graceful degradation.
Redundancy means designing your system so that if one component fails, another can take over. Duplicate database nodes. Multiple servers. Failover mechanisms. If a single point of failure exists in your architecture, it will eventually fail. Redundancy removes single points of failure.
Graceful degradation means designing your system to fail partially rather than completely. If your recommendation engine goes down, your core product still works. If your search service is unavailable, users can still browse. The most critical functionality stays intact even when peripheral systems fail.
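In code, graceful degradation often looks like a fallback wrapped around a non-critical call. This sketch fakes a flaky recommendation service with a random failure; the function and item names are made up for the example:

```python
import random

def fetch_recommendations(user_id):
    # Stand-in for a network call to a separate recommendation
    # service; it fails randomly here to simulate an outage.
    if random.random() < 0.5:
        raise ConnectionError("recommendation service unreachable")
    return ["item-42", "item-7"]

DEFAULT_RECOMMENDATIONS = ["bestseller-1", "bestseller-2"]

def homepage(user_id):
    """The core page keeps working even when a peripheral service is down."""
    try:
        recs = fetch_recommendations(user_id)
    except ConnectionError:
        # Degrade gracefully: serve a static fallback list
        # instead of failing the whole page.
        recs = DEFAULT_RECOMMENDATIONS
    return {"user": user_id, "recommendations": recs}

print(homepage("alice"))
```

The user either gets personalized recommendations or a generic list; what they never get is an error page because a peripheral system hiccuped.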
This is the combination of reliability, fault tolerance, and redundancy working together. You cannot predict every failure. You can design a system that survives the ones you did not predict.
Speed: Measuring What Actually Matters
Speed in system design is not one number. It is two distinct measurements that mean completely different things.
Throughput measures how much your system can handle in a given period of time. For servers, this is measured in RPS, requests per second. For databases, it is QPS, queries per second. Higher throughput means better performance under load. This is the measurement that tells you whether your system can handle scale.
Latency measures how long it takes to handle a single request. This is the measurement that tells you how fast your system feels to an individual user. A system can have high throughput and high latency. Or low throughput and low latency. They are related, but they are not the same thing, and optimizing for one does not automatically optimize for the other.
Data throughput, measured in bytes per second, tells you how efficiently data is moving through your system. All three of these measurements together give you a complete picture of your system’s performance.
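Here is a small sketch that computes both from a toy request log (the timestamps and latencies below are invented). Notice how a system can show healthy throughput while a single slow request dominates the tail latency:

```python
# Toy request log: (timestamp_seconds, latency_ms) pairs.
requests = [(0.1, 12), (0.4, 15), (0.9, 250), (1.2, 11), (1.8, 14), (1.9, 13)]

latencies = sorted(ms for _, ms in requests)

def percentile(sorted_values, p):
    # Nearest-rank percentile: rough, but fine for a dashboard sketch.
    index = min(len(sorted_values) - 1, int(len(sorted_values) * p / 100))
    return sorted_values[index]

duration = requests[-1][0] - requests[0][0]
print(f"throughput: {len(requests) / duration:.1f} RPS")  # -> 3.3 RPS
print(f"p50 latency: {percentile(latencies, 50)} ms")     # -> 14 ms
print(f"p99 latency: {percentile(latencies, 99)} ms")     # -> 250 ms
```

The median user sees a snappy 14 ms, the throughput looks fine, and yet the tail tells you one in a hundred requests feels broken. This is why latency is usually reported as percentiles, not averages.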
How to Think Through Architecture Before You Write Code
This is the part most developers skip. And it is the part that causes the most pain later.
Before you write any code, ask yourself these questions:
Who is using this system, and how many of them will there be at peak?
What is the most critical piece of functionality, and what happens if it fails?
What are the access patterns for my data, and how does that shape my schema?
Where are the single points of failure, and how do I design around them?
What is an acceptable SLO for this system, and what would a breach of it look like?
These questions do not have universal answers. They have answers that are specific to your use case, your users, and your constraints. The CAP theorem tells you that you cannot optimize for everything. These questions help you decide what actually matters for the system you are building.
Getting the design right before you write the code is not about being slow. It is about not having to refactor a production system because of decisions that seemed fine at the time and turned out not to be.
Designing a system without thinking it through leads to three things: vulnerabilities, inefficiencies, and refactoring at the worst possible time. The time you spend on architecture before you build is never wasted. It is always cheaper than fixing it later.
What I Am Thinking About Next
Here is what I am excited to dig into next: vector databases, distributed systems, caching layers, and message queues. There is a whole world of infrastructure decisions that sit between the code we write and the experience our users have. I plan to keep writing about it.
If system design is something you want to understand better, subscribe and come along for the journey.
Let’s Build It Beautifully,
Fab