System Design Fight Club – Over 50 System Design Interview Question Solutions

LOGGING SERVICE

RESOURCES

problem requirements context:
- https://leetcode.com/discuss/interview-question/system-design/622704/Design-a-system-to-store-and-retrieve-logs-for-all-of-eBay
- https://leetcode.com/discuss/interview-question/system-design/292216/Design-Error-Logging-service/283697
- https://leetcode.com/discuss/interview-question/system-design/1753420/design-a-log-aggregation-service-to-get-top-n-errors-within-a-given-time-range
from the SDFC discord’s thread about metrics: “applications like Prometheus do pull-based” (Thanks, @BOSS!)
- https://prometheus.io/blog/2016/07/23/pull-does-not-scale-or-does-it/
there’s a trick to doing streaming logs of kubernetes (Thanks, @neakor!)

REQUIREMENTS

Design a system to store and retrieve logs for all of eBay (quoted from one of the leetcode threads above)

VARIATIONS:

“real-time” requirement for viewing the logs

APPROACHES: 1) push-based 2) pull-based 3) sync vs. async 4) continuous 5) ElasticSearch vs. Data Warehouse, such as dremel/presto/trino on iceberg/parquet files

synchronous approach
continuous, push-based, async w/ kubernetes
data warehouse

push-based vs pull-based trade-offs:

who gets the retry logic
who has to hold the network connection open
the collection backend service will have to be aware of all daemon agents

push-based: daemon agent is responsible for recording offset pull-based: more ambiguous, possibly the collection service back-end

CHECKLIST:

LIVE DISCUSSION QUESTIONS:

Have you found it useful to go through the references in Alex Xu - Vol. 1 & Vol. 2, DDIA and Understanding Distributed Systems books -– Alex Xu’s, yes DDIA, I haven’t gotten that far yet

hello, how long do you think it should take a student to get to a good enough level in system design -– Define “good enough”?

He yubi, How one can get practical hands on also along with this theoritical study, any action plan for that to begin with ? -–

Designing Distributed Systems by Brendan Burns
https://fly.io/dist-sys/ (Thanks, @Loading)

If you’re reading Brendan Burns, I recommend reading all his books. Coz they share similar concepts

Kubernetes Patterns, Kubernetes Up & Running

This is a meta SD question.

What protocol Daemon service will be use to communicate the collection service (HTTP ? ), it will be sync or async call ?

what is being transferred from the daemon to the collection service?

What happen when a log from a service is very large? if we have 100k character in a log, that would be 100kb/log?

grpc streaming for push?

local logs at node persist in disk like WAL(append to file sequentially)? What are the query patterns Logging Service supports apart from searching( doesn’t it have to support key range based query)? -– yes, the .log file should basically be a WAL

Do we expect log entry to be very big structured data? -– So, the .log file is unstructured ElasticSearch should be structured .parquest is structured

do we really need log file ? can we not directly upload to Collection service ? -– “continuous” means more overhead spent on network traffic batched makes one round trip for a grouping of log statements

SaaS product to hire freelance service and provide freelance service

In the pull based model , the collection backend service will have to be aware of all daemon agents, not sure if that’s too efficient

especially if we have a prod service that frequently auto scales

should we not batch multiple log line to honour rate limits in collection service backend? -– batched definitely does cut down on the overall number of round trips occurring over your network

For synchronous, maybe add a log buffer in memory. -– I think this would immediately make it “async”

how do we maintain state ? who is Responsible for maintaining the checkpoints of which point a log file is read ? -— push-based: daemon agent is responsible for recording offset pull-based: more ambiguous, possibly the collection service back-end

imo, push won’t scale well. the logging service would need to scale with the sum of all services. pull would decouple producing from consuming. this is the same choice prometheus and Kafka made

how do we handle back pressure also ?