Interview Prep: Classic System Design
- Jenya's AI Interview Prep Notes
- Sell Me This Pen
- The Coding Grind
- The ML Grind
- Classic System Design
- Just Chatting
- Keep At It
This note is more focused on classic infrastructure roles. Not every company will run interviews like this, most modern AI Infra roles include ML System interview loops, which differ significantly and align more closely with the topics covered in The ML Grind section.
First, source a good list of questions, you have a few options available on the web:
- Paid ones, I used bytebytego, neetcode also has a dedicated section to systems design, there are many more out there, find the one that suits your needs.
- You can also do a deepsearch in a chatbot to give you a list of question.
The issues with many of the solutions provided for System Design questions is that they are too basic and may work for entry/mid level positions, but they are not in-depth enough for more senior roles.
Those sites are still valuable for sourcing questions, but instead of just reading them, use a chatbot for mock interviews.
Use the prompt below:
**System Design Mock Interview Format**
Let's run a system design mock interview in a realistic way:
* You'll ask me to design a system without giving unnecessary details.
* I will ask only the clarifying questions I need.
* You may ask follow-up or leading questions based on my answers.
* I will verbalize my reasoning process as we go, and we'll iterate.
* A large part of a real system design interview is the candidate asking questions, so please don't overshare, let me extract the requirements.
Below is a question we will use as a reference. Assume I haven't read it; use it as a guide, not a script.
COPY-PASTE here the entire question & solution from the sources above (bytebytego/neetcode/etc)
Think of this as a real interview. Walk through your solution step by step. You can use the voice feature in chatbots to reason out loud as you work through the problem. When you're done, say that you'd like to end the interview, and ask to explain how you should have approached to solve the problem.
Know the numbers
Scales
You should get comfortable working with large numbers. The easiest method I've found is to use scientific notation. Here are a few examples:
1e6 is 1,000,0001e6 × 1e3 = 1e(6 + 3) = 1e91e9 ÷ 1e3 = 1e(9 – 3) = 1e6
While you work through the chatbot-guided mock interviews, always do the math by hand.
| Power of 10 | Symbol | Name | Power of 2 | Time |
|---|---|---|---|---|
| 10¹⁵ | P | peta | 2⁵⁰ | 31.7 million yrs |
| 10¹² | T | tera | 2⁴⁰ | 31.7 thousand yrs |
| 10⁹ | G | giga | 2³⁰ | 31.7 yrs |
| 10⁶ | M | mega | 2²⁰ | 11.6 days |
| 10³ | K | kilo | 2¹⁰ | 16 min 40 sec |
| 10⁰ | unit | |||
| 10⁻³ | m | milli | ||
| 10⁻⁶ | µ | micro | ||
| 10⁻⁹ | n | nano |
Just a few more random numbers
- 128 bit = 16 bytes → IPv6, UUIDs
- 26 chars = English alphabet
Clock sync drift
- NTP ≈ 10 ms
- GPS ≈ 100 ns
- Atomic ≈ 0.1 µs
HW latencies
| Latency | Component | Notes |
|---|---|---|
| 0.250 ns | 1 cycle @ 4 GHz | |
| 0.333 ns | 1 cycle @ 3 GHz | |
| 1 ns | L1 Cache | |
| 4 ns | L2 Cache | |
| 100 ns | DRAM | DRAM access 13 ns, memory controller 40 ns, CPU cache miss handling 25 ns, bus traversal 15 ns. Assume p99 around x1.2 or the same, there is not as much jitter in the latency. |
| 65 µs | NVMe | ~3x for p99 |
| 10 ms | HDD | ~10x for p99 |
Network
The numbers here are rough latency and bandwidth estimates, it's an extension of Jeff Dean's classic 2010 latency sheet, but adapted to modern AWS network tiers. It doesn't matter whether the company uses AWS specifically, the patterns and relative orders of magnitude are universal.
Why do we care about latencies across the network hierarchy?
Because SLAs around latency and availability directly constrain workload placement and distribution strategies. Your placement across failure domains is dictated by your ability to meet those SLAs.
For example:
- Deploying everything in a single AZ (Availability Zone) caps you at ~99.5% availability, which is insufficient if your target is 99.9% or higher.
- To meet higher availability requirements, you must distribute across AZs.
- Once you go multi-AZ, latency often becomes the dominant constraint. You typically want user requests to be routed to the closest region to avoid extra network latency, this is when you start thinking about cross-region deployments.
- If you can satisfy both availability and latency SLAs within one region, the system is dramatically simpler than one stretched across regions, coasts, or continents.
This is a cheat sheet I created for quick reference and rehearsal. I keep all of this on a single index card, it may look like a lot, but it's actually very manageable once you learn a few memory tricks I describe below.
Start every interview with a question, with SLAs requirements for the service that needs to be built, if they don't care, suggest a reasonable SLA yourself.
I used the following notation for latencies between two hosts:
- iAZ - in zone, two hosts are located within the same zone
- xAZ - cross-zone, two hosts are located in different AZs within a region
- xREG - cross-region
- xCOA - cross-coast (NY - SF)
- xCON - cross-continent (NY - LON)
Availability SLAs
| Availability | Downtime / month |
|---|---|
| 99.5% | 3.65 h |
| 99.95% | 21.6 min |
| 99.99% | 4.4 min |
Network Latencies
The latency and link columns are easy to memorize. For GByte/s column, just remember the value for 10 Gbit/s, and you can derive the rest even if you're not great at arithmetic (halve it, multiply by 5×, 10×, etc.).
The numbers below are expressed as Round-Trip Time (RTT), the time it takes to send a request and receive a response. Keep in mind that actual latency may be higher when you include the TCP handshake, TLS negotiation, or other protocol overheads. In practice, these estimates are still accurate for back-of-the-envelope calculations, especially when using long-lived connections that amortize those setup costs.
The estimates below represent p50 latencies. Should you also memorize p99 numbers, since many SLAs are defined using p99 (e.g., "99th percentile of requests must be under 200 ms")? Not really. For rough reasoning, you can approximate p99 as 2 × p50. It's not scientifically precise, but it's good enough for back-of-the-envelope calculations. For example, a 0.4 ms p50 latency would translate to ~0.8 ms p99.
| Link Type | <1KB | Gbit/s Link | GByte/s |
|---|---|---|---|
| iAZ | 0.4ms | 25 | 3.125 |
| iAZ | 0.4ms | 50 | 6.25 |
| iAZ | 0.4ms | 100 | 12.5 |
| ×AZ | 1.2ms | 10 | 1.25 |
| ×REG | 20ms | 5 | 0.625 |
| ×COA | 65ms | 5 | 0.625 |
| ×CON | 80ms | 5 | 0.625 |
While the latencies above are enough for most cases, the actual time to transfer a blob of data depends on more than just RTT. It's influenced by optimizations in the network stack, such as pipelining, where multiple packets are sent back-to-back. How many packets can be pushed at once varies and involves some networking math, including congestion control and TCP slow-start.
Back in the day I had to learn those details as I was preparing for Meta’s infra interviews. These days, most infra engineers aren't expected to go deep on networking unless the role specifically calls for it.
Here's how to think about the numbers: the RTT table applies to payloads under ~1.46 KB because that fits in a single packet (standard MTU). Sending 1 KB means you send the packet and get an ACK - one full round trip. For larger transfers, you can stream many packets in parallel, but eventually you'll hit the slowest link in the path, which creates backpressure via congestion control (e.g. by dropping packets).
The key idea is this: depending on bandwidth and latency, transfer times shift from being latency-bound to bandwidth-bound. For example, sending 1 KB vs 100 KB isn't 100× slower, because many packets can be sent at once. But with 10 MB → 100 MB → 1 GB transfers, time scales roughly 10×, so the transition from latency-bound to bandwidth-bound happens around 1–10 MB. That’s what the color coding highlights.
You mostly need to memorize the latency-bound part (on the left) the rest can be derived with:
Transfer time ~ RTT + payload ÷ bandwidth.
| Link | 100 KB | 1 MB | 10 MB | 100 MB | 1 GB |
|---|---|---|---|---|---|
| iAZ | 0.43 ms | 0.7 ms | 3.6 ms | 33 ms | 330 ms |
| xAZ | 1.23 ms | 2.0 ms | 9.2 ms | 81 ms | 820 ms |
| xREG | 20.2 ms | 22 ms | 96 ms | 180 ms | 1640 ms |
| xCOA | 65.2 ms | 67 ms | 98 ms | 225 ms | 1685 ms |
| xCON | 80.2 ms | 82 ms | 96 ms | 240 ms | 1700 ms |
Storage
This section covers the cost of persisting data. When paired with the previous latency breakdown, it ties into the trade-offs between performance and consistency.
Broadly, systems fall into two categories: those that provide strong consistency, and those that optimize for latency (See PACELC below for more details).
- Caches: RTT + RAM – While other overheads exist in the critical path, cache access latency is dominated by network round-trip time and memory access.
- DB Async (×AZ): RTT + RAM + Disk – To persist data, it must be flushed to disk. Depending on hardware (e.g. NVMe vs HDD), either disk or network may dominate. On modern NVMe, RTT dominates due to asynchronous replication – the write is committed only to the leader, and replication to followers happens after the client receives the ACK.
Classic synchronous replication with 2-Phase or 3-Phase Commit is less common today due to its impact on availability. However, it's still worth considering:
- DB Sync (iAZ): 2×RTT + 2×RAM + 2×Disk - The client waits until the data is committed on two hosts, which includes memory and disk writes for both.
To achieve strong consistency and high availability, Paxos/Raft based systems (etcd, zookeeper) provide a reference architecture:
- Write: 2×RTT + 3×RAM if the write lands on the leader. Add an extra RTT if the write is forwarded from a follower.
- Read: 2×RTT + 1×RAM assuming a read from the leader.
Distributed Blob Stores
Two architectural approaches:
- Fat Client: Handles quorum logic client-side.
- Metadata Proxy: Centralizes coordination and request routing.
The core difference lies in where the decision is made about which hosts to contact to perform read/write operations. Each model comes with trade-offs:
- With a proxy, you must ensure the metadata service is scalable, highly available, and consistent, or it becomes a single point of failure.
- With a fat client, there's ~no central bottleneck, but clients must implement replica selection, often using consistent hashing with virtual nodes – which can be tricky to get right and maintain.
Latency behavior is governed by the quorum formula:
R + W > N
(Number of replicas read from) + (Number of replicas written with ACK) must exceed total replicas N.
Examples for N = 3
- R(1) + W(3) > 3 → Write must be acknowledged by all 3 replicas. Any read returns a consistent result.
- R(3) + W(1) > 3 → Write is acknowledged after one replica, but reads must query all 3 and choose the freshest.
- R(2) + W(2) > 3 → A balance: both reads and writes wait for 2 out of 3, guaranteeing overlap on at least one up-to-date replica.
This model lets you trade off read vs write latency and availability. For example:
- Waiting for ACKs from all 3 replicas increases p99 write latency.
- Writing to all 3 but ACKing after 1 reduces write latency, but puts more burden on read path consistency.
PACELC Theorem
The PACELC theorem extends the CAP theorem. It states that:
- If a network partition occurs, the system must choose between availability and consistency.
Since parts of the system can't communicate, we either shut everything down to preserve strong consistency, or continue operating and risk serving inconsistent data.
- If there's no partition, the trade-off shifts to latency vs consistency.
For example, in database replication:
- Waiting for acknowledgments from replicas before committing ensures strong consistency, but increases latency.
- Acknowledging the transaction immediately (before replication) reduces latency, but can lead to stale reads from replicas.
A short cheatsheet below demonstrates why it’s called PACELC:
- If Partition → Availability || Consistency
- Else → Latency || Consistency
There is a lot more to explore about storages, however, this section should be sufficient for non-storage focused roles. If you are curious to learn more, I would highly encourage you to read Designing Data-Intensive Applications.
Monitoring & Debugging
L.U.S.E.R. is an extension of the USE method for performance analyses to include additional metrics required for distributed systems monitoring and debugging.
- Latency
- Measure stalls, p50, p90, p99
- SLO - 99.9% requests are under 100ms
- Hangs - latency requires the start and end of an operation, for hangs there could be no end.
- Utilization
- % of resource not idle
- Saturation
- Backlog
- Queue sizes
- Memory, CPU, IO pressures
- Errors
- User - the error has happened because of the user error (e.g. 404 page not found)
- System - something went wrong and the server side is at fault (a DB is timing out)
- Client - ensure the errors are tracked from the client perspective.
- Rate
- Volume of incoming requests, you would use it to correlate if your service has gotten slower due to increased load on your system.