System Design Fight Club

Over 50 System Design Interview Question Solutions

Hyperparameter Tuning

RESOURCES

  • Job Scheduler is a bit of a similar problem

PROMPT

How would you go about designing the infrastructure to support running training jobs with different hyperparameter combinations and returning the highest-scoring combination in an engineering organization?

For the purpose of this problem, we can consider "training a model" as a black box (sketched as a function after the input/output lists below):

Inputs:
A set of hyperparameter candidate values to be tested
Model specification
Training + validation data

Output:
An F1 score, represented as a float between 0 and 1
The "best" hyperparameter values that produced that F1 score

What are some choices for components of the system? How do they work together? What trade-offs could you consider? What failure modes are and aren't handled?

You can assume the following:

Training a single model takes multiple hours on a single machine
Machines running training jobs require special hardware like a GPU
There is a quota on the maximum number of GPUs you can provision at one time
Training and validation data is on the order of tens of GBs
You have access to a cloud provider like AWS or GCP
There is a way to monitor the progress and performance of a training run.

Inputs to the system (a sample request payload is sketched after this list):
List of Hyperparameters, e.g. [a, b], and their candidate values e.g. [(a=1,b=x), (a=1, b=y), …]
You're passed all combinations that we want tested
(i.e. no need to explode these into all possible values)
Specification of model
Training + validation data
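
As a concrete shape for that request, something like the following (every field name and URI here is an assumption for illustration):

```python
# Hypothetical job-submission payload; URIs and field names are made up.
job_request = {
    "model_spec": "s3://models/specs/my-model.yaml",
    "train_data": "s3://datasets/my-task/train/",         # tens of GBs
    "validation_data": "s3://datasets/my-task/validation/",
    "combinations": [            # already expanded by the caller
        {"a": 1, "b": "x"},
        {"a": 1, "b": "y"},
    ],
}
```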

Expected Output:
List of hyperparameter tuples with their associated scores
Sorted descending by score
This can be made available in any way that makes sense, through a dashboard, as a file, in the output of a job, etc.
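
Producing that ordering from the per-job results is the easy part; a self-contained sketch with made-up scores:

```python
# Results as (hyperparams, f1_score) pairs; the numbers are invented.
results = [({"a": 1, "b": "x"}, 0.81), ({"a": 1, "b": "y"}, 0.87)]

# Expected output: sorted descending by score.
leaderboard = sorted(results, key=lambda r: r[1], reverse=True)
for rank, (hparams, score) in enumerate(leaderboard, start=1):
    print(rank, hparams, score)
```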

So the expected flow is that a user interacts with the system by providing the inputs (list of hyperparameters, specification of model, training + validation data) and gets back the sorted scores.

Can someone come up with a scalable system design for this?

REQUIREMENTS

  • You’re passed all combinations that we want tested
  • Associated F1 score for a set of hyperparameters
  • Training a single model takes multiple hours on a single machine
  • Machines running training jobs require special hardware like a GPU
  • There is a quota on the maximum number of GPUs you can provision at one time
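
The GPU quota is the constraint that shapes the whole design: every combination can be enqueued immediately, but at most quota-many training jobs may run at once. A minimal single-process sketch of that scheduling policy, with threads standing in for GPU workers; a real system would use a durable queue (e.g. SQS) and one worker VM per GPU, and all names and numbers below are assumptions:

```python
import queue
import random
import threading

GPU_QUOTA = 4  # assumed quota; in reality this comes from your cloud account

def run_training(combo):
    """Stand-in for the multi-hour black-box training job."""
    return random.random()  # fake F1 score

def worker(jobs, results):
    # Each worker models one GPU slot: pull a job, run it, record the score.
    while True:
        try:
            combo = jobs.get_nowait()
        except queue.Empty:
            return  # queue drained; release the GPU slot
        results.append((combo, run_training(combo)))

jobs, results = queue.Queue(), []
for combo in [{"a": 1, "b": "x"}, {"a": 1, "b": "y"}, {"a": 2, "b": "x"}]:
    jobs.put(combo)

# At most GPU_QUOTA jobs run concurrently, regardless of queue depth.
workers = [threading.Thread(target=worker, args=(jobs, results))
           for _ in range(GPU_QUOTA)]
for t in workers: t.start()
for t in workers: t.join()
print(sorted(results, key=lambda r: r[1], reverse=True))
```

In a real deployment, a durable queue's redelivery (e.g. SQS visibility timeouts) covers the main failure mode: a worker dying mid-job just means the job gets retried.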

CHECKLIST:

  • load balancers
  • CDNs?
  • DB schemas
  • partitioning keys, secondary indices
  • NALSD numbers (return to this at the end; back-of-the-envelope sketch below)
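
For the NALSD numbers, a back-of-the-envelope sketch (every input below is assumed purely for illustration; plug in your real quota and run time):

```python
import math

n_combos = 100       # hyperparameter combinations to test (assumed)
hours_per_run = 4    # "multiple hours" per training job (assumed)
gpu_quota = 20       # max GPUs provisioned at one time (assumed)
data_gb = 30         # training + validation data, tens of GBs (assumed)

waves = math.ceil(n_combos / gpu_quota)      # 5 waves of jobs
wall_clock_hours = waves * hours_per_run     # ~20 hours end to end
copy_gb_per_wave = data_gb * gpu_quota       # 600 GB shipped to workers per wave
metadata_rows = n_combos                     # one row per job: tiny
print(waves, wall_clock_hours, copy_gb_per_wave, metadata_rows)
```

The metadata is hundreds of rows, not millions, which foreshadows the relational-vs-NoSQL question at the end.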

LIVE DISCUSSION QUESTIONS:

https://scikit-learn.org/stable/modules/grid_search.html

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
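
For context on those links: on a single machine this entire problem is what GridSearchCV already does; the exercise is essentially a distributed version of it. Standard scikit-learn usage on a toy dataset:

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_iris(return_X_y=True)

# Candidate values; GridSearchCV tries every combination, like our system.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

search = GridSearchCV(svm.SVC(), param_grid, scoring="f1_macro", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)  # best combo and its score
```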

  • search autocomplete for amazon.com’s search bar
  • item purchases (similar item recommendation)

Can we use Kubeflow for running these experiments? (Kubeflow’s Katib component is built for exactly this kind of hyperparameter search.)

Looking at the schema for the metadata store: do we need a NoSQL solution, or can we use a relational DB here? We might settle this with a back-of-the-envelope calculation. Thoughts? — a relational DB actually works just fine here :) (schema sketch below)
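
To make the relational answer concrete, a minimal schema sketch (table and column names are assumptions; sqlite3 stands in for whatever relational store you would actually run):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # any relational DB is fine at this scale
conn.executescript("""
CREATE TABLE experiments (
    experiment_id INTEGER PRIMARY KEY,
    model_spec    TEXT NOT NULL,   -- URI of the model specification
    data_uri      TEXT NOT NULL,   -- URI of training + validation data
    created_at    TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE jobs (
    job_id        INTEGER PRIMARY KEY,
    experiment_id INTEGER REFERENCES experiments(experiment_id),
    hyperparams   TEXT NOT NULL,   -- JSON-encoded combination
    status        TEXT NOT NULL,   -- queued | running | done | failed
    f1_score      REAL             -- NULL until the job finishes
);
-- Secondary index for the hot query: scores for one experiment, best first.
CREATE INDEX idx_jobs_score ON jobs(experiment_id, f1_score DESC);
""")
```

With one row per training job, the back-of-the-envelope numbers above put this at hundreds of rows per experiment — far below the scale where NoSQL would buy you anything.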