
73% of Engineers Fail This 5-Minute System Design Question. Here's Exactly Why.
The question is "design a basic chat system." Not design Twitter. Not design a distributed database. A chat system. And 73% of software engineers are failing it in FAANG interviews. I've been thinking about why this keeps happening, and the answer is useful whether or not you're actively interviewing, because it's really about how engineers think under pressure, not just what they know.
What actually goes wrong
The moment most candidates hear the prompt, they jump straight into implementation mode. Database schema. WebSocket setup. API endpoints. Moving fast, looking busy, heading straight off a cliff.
The problem isn't that those decisions are wrong. It's that they're making them before establishing what they're actually building. "Design a chat system" is massively underspecified. The right architecture for 200 concurrent users is genuinely different from the right architecture for 10 million. A single WebSocket server works fine at small scale. At large scale you need message queues, load balancers, and a horizontal scaling strategy that changes the entire design. If you skip the requirements step, you might spend your entire five minutes designing the wrong thing, correctly.
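To make the scale gap concrete, here is a rough back-of-envelope sketch. The per-server connection limit and the headroom figure are assumptions picked for illustration, not benchmarks, and `servers_needed` is a hypothetical helper name.

```python
def servers_needed(concurrent_users, conns_per_server=50_000, headroom_percent=70):
    """Estimate how many WebSocket servers a number of open connections needs.

    conns_per_server and headroom_percent are illustrative assumptions;
    real limits depend on memory, message rate, and the runtime you use.
    """
    # Only plan to use part of each server so spikes and failover have room.
    usable = conns_per_server * headroom_percent // 100
    # Ceiling division, with a floor of one server.
    return max(1, -(-concurrent_users // usable))


for users in (200, 10_000_000):
    print(users, "concurrent users ->", servers_needed(users), "server(s)")
```

With these made-up numbers, 200 users fit on one server and 10 million need 286. The exact figure matters less than the jump from "one box" to "a fleet that needs routing and shared state."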
The components that almost everyone misses
Candidates typically design the obvious parts: client, server, database. What they consistently skip are the parts that make real-time messaging actually work in production:
- Connection management: users disconnect constantly. How does your system detect this, clean up state, and avoid ghost connections piling up on your server?
- User presence tracking: online/offline status sounds simple. It's a distributed systems problem. Presence changes constantly across multiple servers and needs to be propagated to the right clients without flooding the network.
- Message delivery confirmation: a message leaving your server is not the same as a message being received. Without an acknowledgement layer you have no way of knowing your system is actually delivering anything.
- Offline message delivery: what happens to messages sent to a user who is currently offline? You need a queue, a retention policy, and a delivery mechanism on reconnect. This is a non-trivial problem that doesn't exist in demos but dominates production engineering.
These aren't obscure requirements. They're the reason chat is an interesting design problem. Missing them signals you're thinking about components, not systems.
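The four concerns above can be sketched in one toy, in-memory class. This is a minimal illustration under stated assumptions, not how any particular chat product does it: `ChatHub`, `Message`, the 30-second heartbeat timeout, and the `outbox` list that stands in for real socket writes are all hypothetical, and a production system would spread this state across servers and apply a retention policy to the queue.

```python
import time
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Message:
    msg_id: int
    sender: str
    recipient: str
    body: str


class ChatHub:
    """Single-process sketch; every name here is hypothetical."""

    def __init__(self, heartbeat_timeout=30.0, clock=time.monotonic):
        self.heartbeat_timeout = heartbeat_timeout
        self.clock = clock
        self.last_seen = {}                      # user -> time of last heartbeat
        self.offline_queue = defaultdict(deque)  # user -> messages awaiting reconnect
        self.unacked = {}                        # msg_id -> pushed but not yet acknowledged
        self.outbox = defaultdict(list)          # stand-in for writing to a real socket

    def is_online(self, user):
        """Presence is derived from heartbeats rather than stored as a flag."""
        seen = self.last_seen.get(user)
        return seen is not None and self.clock() - seen <= self.heartbeat_timeout

    def heartbeat(self, user):
        """Record a heartbeat; on (re)connect, flush whatever was queued."""
        was_online = self.is_online(user)
        self.last_seen[user] = self.clock()
        if not was_online:
            self._requeue_unacked(user)
            while self.offline_queue[user]:
                self._push(self.offline_queue[user].popleft())

    def reap_ghosts(self):
        """Drop connections whose heartbeats stopped; return the users removed."""
        now = self.clock()
        ghosts = [u for u, t in self.last_seen.items()
                  if now - t > self.heartbeat_timeout]
        for user in ghosts:
            del self.last_seen[user]
            self._requeue_unacked(user)
        return ghosts

    def send(self, msg):
        """Push to online users, queue for offline ones."""
        if self.is_online(msg.recipient):
            self._push(msg)
        else:
            self.offline_queue[msg.recipient].append(msg)

    def ack(self, msg_id):
        """Client confirms receipt; only now does the message count as delivered."""
        return self.unacked.pop(msg_id, None) is not None

    def _push(self, msg):
        self.outbox[msg.recipient].append(msg)
        self.unacked[msg.msg_id] = msg

    def _requeue_unacked(self, user):
        """Put pushed-but-unconfirmed messages back at the front of the queue."""
        pending = [m for m in self.unacked.values() if m.recipient == user]
        for m in pending:
            del self.unacked[m.msg_id]
        self.offline_queue[user].extendleft(reversed(pending))
```

A message only leaves `unacked` when the client calls `ack`, so a connection that dies mid-delivery results in a retry on reconnect rather than a silent loss. The price is that clients may see the same message twice and need to deduplicate by `msg_id`.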
The structure that passes
Candidates who consistently pass this question follow a deliberate structure rather than reacting to the prompt:
Minute 1: requirements and scale
Before touching anything, ask:
- One-on-one or group chat?
- Scale: hundreds of users or millions?
- Does message history persist? For how long?
- What does real-time mean here: sub-second or best effort?
These questions aren't stalling. They determine whether the next four minutes are spent on the right design.
Minutes 2-3: high-level architecture
Sketch all the major components before going deep on any of them:

            Client Apps
                 |
                 v
    WebSocket Server (Chat Server)
         |                  |
         v                  v
     Message DB     Connection Manager
         |                  |
         v                  v
    Message Queue      Auth Service
  (offline delivery)

The goal here is coverage, not depth. Show the interviewer you know what a complete system looks like before you start optimising pieces of it.
Minutes 4-5: the interesting problems
Now go deep on the genuinely hard parts:
- Message ordering across a distributed system
- Scaling WebSocket connections across multiple servers without losing session state
- Trade-offs in storage strategy for message history (append-only log vs traditional DB)
- How your message queue handles guaranteed (at-least-once) delivery vs at-most-once
This is where technical depth lands, but only because the interviewer can already see the full system it sits inside.
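As one example of going deep, here is a small sketch of the ordering problem combined with at-least-once delivery. It assumes each conversation is sequenced by a single owner (for instance, one partition or one server per conversation), which is a design choice rather than a given, and `Sequencer` and `OrderedInbox` are hypothetical names.

```python
from collections import defaultdict
from itertools import count


class Sequencer:
    """Server side: hands out an increasing number per conversation.

    Assumes one sequencer owns each conversation; coordinating several
    writers for the same conversation is a harder problem not shown here.
    """

    def __init__(self):
        self._counters = defaultdict(lambda: count(1))

    def stamp(self, conversation_id, body):
        return (next(self._counters[conversation_id]), body)


class OrderedInbox:
    """Client side, one per conversation: releases messages in sequence
    order and drops duplicates caused by at-least-once retries."""

    def __init__(self):
        self.next_seq = 1
        self.pending = {}    # seq -> body, held until the gap before it fills
        self.delivered = []  # what the UI would actually show, in order

    def receive(self, seq, body):
        if seq < self.next_seq or seq in self.pending:
            return  # already shown or already buffered: a redelivery
        self.pending[seq] = body
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


sequencer = Sequencer()
first = sequencer.stamp("room-1", "hello")
second = sequencer.stamp("room-1", "anyone there?")

inbox = OrderedInbox()
inbox.receive(*second)  # arrives early, so it is buffered
inbox.receive(*first)   # fills the gap; both are released in order
inbox.receive(*first)   # retry of an old message, ignored
print(inbox.delivered)
```

The trade-off to talk through: at-most-once delivery would not need the duplicate check but can lose messages, while at-least-once keeps messages and pushes the deduplication work onto the receiver.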
Why this specific question keeps appearing
Every senior engineer has inherited a system that solved the obvious problem and ignored the hard ones. Connection management that was never properly designed. Message delivery that was assumed rather than guaranteed. Presence state bolted on later that never quite worked right. These are the scars of systems built by engineers who jumped to implementation without thinking through the full problem space first.
The chat system question tests exactly that tendency. The surface-level simplicity creates a clear filter between engineers who react and engineers who think. And that distinction, between someone who builds systems with those scars and someone who thinks through the hard parts before writing code, is worth more to an interviewer than any specific technical knowledge.
The actual takeaway
The 27% who pass aren't necessarily more knowledgeable. They've practiced suppressing the instinct to start building immediately until the deliberate pause becomes instinctive. That's a trainable skill. And it's useful way beyond interview rooms: it's the same instinct that makes engineers ask "what problem are we actually solving" before opening a code editor.
Happy to discuss in the comments. I'm particularly curious whether people find the requirements gathering step feels unnatural under time pressure or whether it becomes automatic with practice.
Source: Dev.to


