AD CLICK ECOSYSTEM
RESOURCES
- Khang Pham’s book
- Alex Xu’s ML book
- Ad click aggregator from Alex Xu Volume II
- My previous coverage of ad click prediction and ad click aggregator
REQUIREMENTS
- ad click auction + impression placement
- ad click aggregation
APPROACHES:
1) Impression serving / "placement" ID
   a. "impression ID" = Ad ID + User ID + de-dupe key
   b. Ad ID + de-dupe key
2) One big ML model vs. a model for each ad from the "Model Store" (sketch below)
   a. Model(Ad Features, User Features) -> click probability
   b. Ad ID -> Model(User Features) -> click probability
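A minimal Python sketch of both approaches; make_impression_id, Model, and the feature dicts are hypothetical names for illustration, not anything from the resources above:

```python
import hashlib
import uuid
from typing import Protocol

# Approach 1a: impression ID = Ad ID + User ID + de-dupe key.
def make_impression_id(ad_id: str, user_id: str) -> str:
    dedupe_key = uuid.uuid4().hex  # random key keeps repeat impressions distinct
    return hashlib.sha256(f"{ad_id}:{user_id}:{dedupe_key}".encode()).hexdigest()

# Approach 2: one big model vs. one model per ad.
class Model(Protocol):
    def predict(self, features: dict) -> float: ...

def score_one_big_model(model: Model, ad_features: dict, user_features: dict) -> float:
    # 2a: Model(Ad Features, User Features) -> click probability
    return model.predict({**ad_features, **user_features})

def score_per_ad_model(model_store: dict, ad_id: str, user_features: dict) -> float:
    # 2b: Ad ID -> Model(User Features) -> click probability,
    # where the per-ad model is fetched from the "Model Store"
    return model_store[ad_id].predict(user_features)
```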
CHECKLIST:
- load balancers
- CDNs?
- secondary indices?
- DB schemas
- partitioning keys, secondary indices
- NALSD numbers (return to this at end)
LIVE DISCUSSION QUESTIONS:
Draft files: https://imgur.com/a/TTq48vH
Curious about how to vary the data flow based on privacy regulation (EU user = GDPR vs. California user = CCPA vs. rest of the USA, for example), and how to utilize browser fingerprinting for tracking in addition to (or in lieu of) cookies.
Hi, did we also discuss the QPS and other load for this problem? If not, any estimates? I assume this would be read-heavy; are we talking planet scale here, 1e7-something? -- 10M+ TPS of reads, 100k+ TPS of clicks
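Rough NALSD arithmetic from those figures (the ~100-byte click record is an assumed size):

```python
# Back-of-envelope from the 100k+ clicks/sec figure above; the
# ~100-byte click record (IDs + timestamps + a few attributes) is assumed.
clicks_per_sec = 1e5
seconds_per_day = 86_400
bytes_per_click = 100

clicks_per_day = clicks_per_sec * seconds_per_day        # ~8.6e9 clicks/day
raw_gb_per_day = clicks_per_day * bytes_per_click / 1e9  # ~864 GB/day before compression
```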
Is it normal to need two network calls: one for the Ad ID and one for the actual S3 URL?
Would this be a stateful design? Just curious how the design would change at a high level if we wanted to go stateless.
Alex Xu talked about event timestamp vs. processing timestamp. I think in this design we don't have to care about watermarks if we store the event timestamp, since aggregation is not real-time and we aren't using stream windowing.
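A tiny sketch of that point, with hypothetical field names: persist the client-side event time on each click, and a later batch job can bucket by it with no streaming watermark.

```python
import time
from dataclasses import dataclass

@dataclass
class ClickEvent:
    impression_id: str
    event_time: float       # when the click happened on the client
    processing_time: float  # when our pipeline ingested it

def ingest(impression_id: str, event_time: float) -> ClickEvent:
    # Because event_time is persisted, late-arriving clicks still land in
    # the correct aggregation bucket when a batch job re-reads the data;
    # watermarks only matter when aggregating over live streaming windows.
    return ClickEvent(impression_id, event_time, time.time())
```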
Just wanted to ask in general: here, would we do the partitioning by User ID or by Ad ID? If we partition by Ad ID, fetching which (types of) users clicked a given ad stays on one partition,
but if we want to know all the ads a user clicked on, that would be a scatter-gather, right?
And if we partition by User ID, then getting the top ads / areas of interest across all users would be a scatter-gather again.
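A toy illustration of the trade-off; the partition count and hash choice are illustrative assumptions:

```python
import hashlib

NUM_PARTITIONS = 16  # illustrative

def partition_for(key: str) -> int:
    # stable hash so the same key always routes to the same partition
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

def partitions_for_users_who_clicked(ad_id: str) -> list[int]:
    # Partitioned by Ad ID: all clicks for this ad live on one partition.
    return [partition_for(ad_id)]

def partitions_for_ads_clicked_by(user_id: str) -> list[int]:
    # Same layout, queried by user: the user's clicks may be anywhere,
    # so we fan out to every partition (scatter-gather).
    return list(range(NUM_PARTITIONS))
```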
Will adding secondary indexes (local or global) work? (PS: I just added this point out of curiosity, to know if that would really work.) -- Data warehouses tend not to have secondary indices, because you tend to do full table scans.
“columnar storage”
Dremel whitepaper:
- “in-situ” (“in-place”) compute: you bring the compute to the storage
- columnar storage for nested, document-based records
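A toy contrast of row vs. columnar layout for the same click table (made-up data):

```python
# Row-oriented: each record stored together.
rows = [
    {"ad_id": "a1", "user_id": "u1", "clicked": 1},
    {"ad_id": "a2", "user_id": "u2", "clicked": 0},
]

# Column-oriented: each column stored contiguously, so a full-scan
# aggregate like sum(clicked) reads one packed array, not whole rows.
columns = {
    "ad_id":   ["a1", "a2"],
    "user_id": ["u1", "u2"],
    "clicked": [1, 0],
}
total_clicks = sum(columns["clicked"])  # scans only the "clicked" column
```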
I missed the live session; how do you guys get notified when it is scheduled? -- There should be a little bell icon.
It is usually the same time, but you can join the Discord for that -- it's at 10:30pm PST every weekend.
Can we use a Redis sorted set for the top-k metric here?
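A minimal redis-py sketch, assuming a local Redis and a hypothetical key name "ad:clicks"; this gives exact counts, but memory grows with the number of distinct ads:

```python
import redis

r = redis.Redis()  # assumes a local Redis instance

def record_click(ad_id: str) -> None:
    r.zincrby("ad:clicks", 1, ad_id)  # bump this ad's score by 1

def top_k(k: int):
    # highest scores first, returned with their counts
    return r.zrevrange("ad:clicks", 0, k - 1, withscores=True)
```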
For top-k we can use a count-min sketch.
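A self-contained sketch: the count-min sketch keeps approximate counts in fixed memory (it only ever over-counts), and a candidate set plus heapq yields the top-k; the width/depth here are illustrative.

```python
import hashlib
import heapq

class CountMinSketch:
    def __init__(self, width: int = 2048, depth: int = 5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _indices(self, item: str):
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row, idx in self._indices(item):
            self.table[row][idx] += count

    def estimate(self, item: str) -> int:
        # min across rows bounds the over-count from hash collisions
        return min(self.table[row][idx] for row, idx in self._indices(item))

def top_k(cms: CountMinSketch, candidates: set[str], k: int) -> list[tuple[int, str]]:
    # In a real pipeline the candidate set would be a bounded heap
    # maintained alongside the sketch; a plain set keeps this short.
    return heapq.nlargest(k, ((cms.estimate(ad), ad) for ad in candidates))
```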