Machine Learning Primer

Problem Framing

ML systems can be divided into two broad categories: predictive ML and generative AI. At a high level, ML problem framing consists of two distinct steps:

  • Determining whether ML is the right approach for solving a problem.
    • The non-ML solution is the benchmark you’ll use to determine whether ML is a good use case.
  • Framing the problem in ML terms, and determining which features have predictive power.

Prediction

  • Regression models are unaware of product-defined thresholds.
    • If your app’s behavior changes significantly because of small differences in a regression model’s predictions, you should consider implementing a classification model instead.
  • Predict the final decision if possible
    • Hiding the app’s behavior from the model can cause your app to produce the wrong behavior.
  • Understand the problem’s constraints
    • Dynamic thresholds: regression
    • Fixed thresholds: classification
  • Proxy labels substitute for labels that aren’t in the dataset. Note that every proxy label has potential drawbacks.
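A minimal sketch of the fixed-threshold point above, using a hypothetical 4.0-star rule: if the product decision is fixed, bake the threshold into the label and train a classifier on the final decision, rather than thresholding a regression model's output downstream.

```python
# Hypothetical app rule: surface items rated >= 4.0 stars.
ratings = [3.2, 4.7, 4.0, 2.9]

# Regression framing: predict the raw rating; the app applies the threshold later.
# Classification framing: encode the final decision directly in the label.
labels = [1 if r >= 4.0 else 0 for r in ratings]
print(labels)  # [0, 1, 1, 0]
```

With the decision encoded in the label, small regression errors near the threshold can no longer flip the app's behavior unexpectedly.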

Generation

  • Distillation: train a smaller model to mimic a larger model’s behavior
  • Fine-tuning or parameter-efficient tuning
  • Prompt engineering

Feature Engineering

One hot encoding

  • One hot encoding is used for categorical features that have low-to-medium cardinality.
  • Problems:
    • Expensive computation and high memory consumption are major problems.
    • One hot encoding is not suitable for Natural Language Processing tasks, where vocabularies are very large.
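A minimal pure-Python sketch of one hot encoding (the color vocabulary is a made-up example). The vector's dimension equals the vocabulary size, which is why high-cardinality features blow up memory:

```python
# Hypothetical categorical vocabulary.
vocab = ["red", "green", "blue"]

def one_hot(value, vocab):
    # Exactly one position is 1, the rest are 0; the vector length
    # equals the vocabulary size.
    return [1 if v == value else 0 for v in vocab]

print(one_hot("green", vocab))  # [0, 1, 0]
```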

Feature hashing
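A sketch of the hashing trick, assuming a small bucket count for illustration: each category is hashed into a fixed number of buckets, so memory is bounded regardless of cardinality, at the cost of collisions.

```python
import hashlib

NUM_BUCKETS = 16  # assumed bucket count; collisions trade accuracy for memory

def hash_bucket(value: str, num_buckets: int = NUM_BUCKETS) -> int:
    # Use a stable hash so the same category always lands in the same bucket
    # across processes and runs.
    digest = hashlib.md5(value.encode()).hexdigest()
    return int(digest, 16) % num_buckets
```

Unlike one hot encoding, no vocabulary needs to be stored or maintained, which also makes unseen categories at inference time a non-issue.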

Cross feature

A crossed feature (conjunction) combines two categorical variables of cardinality c1 and c2, yielding up to c1 × c2 distinct values.

  • A crossed feature is usually combined with the hashing trick to reduce its high dimensionality.
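A minimal sketch of a crossed feature combined with the hashing trick; the bucket count and key format are assumptions for illustration.

```python
import hashlib

NUM_BUCKETS = 1000  # assumed bucket count for the crossed feature

def crossed_feature(a: str, b: str, num_buckets: int = NUM_BUCKETS) -> int:
    # The raw conjunction has up to c1 * c2 possible values; hashing caps
    # the dimensionality at num_buckets (at the cost of collisions).
    key = f"{a}_x_{b}"
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_buckets
```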

Embedding

The purpose of embedding is to capture the semantic meaning of features.

  • A common rule of thumb for the embedding dimension is $d = \sqrt[4]{D}$, where $D$ is the number of categories.
  • Embedding features are usually pre-computed and stored in key/value storage to reduce inference latency.
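The rule of thumb above is a one-liner; for example, a 10,000-category feature would get roughly a 10-dimensional embedding:

```python
def embedding_dim(num_categories: int) -> int:
    # Rule of thumb: d = fourth root of the number of categories,
    # rounded to the nearest integer.
    return round(num_categories ** 0.25)

print(embedding_dim(10000))  # 10
```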

Numeric features

  • Normalization
  • Standardization
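The two transformations above, sketched in pure Python on toy data: min-max normalization rescales values to [0, 1], while standardization centers them to zero mean and unit variance.

```python
xs = [1.0, 2.0, 3.0, 4.0]

# Min-max normalization: rescale to [0, 1].
lo, hi = min(xs), max(xs)
normalized = [(x - lo) / (hi - lo) for x in xs]

# Standardization: zero mean, unit variance.
mean = sum(xs) / len(xs)
std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
standardized = [(x - mean) / std for x in xs]
```

Normalization is sensitive to outliers (they compress the rest of the range), so standardization is often preferred when the feature distribution has long tails.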

Training Pipeline

  • Data partitioning
    • Parquet and ORC files are usually partitioned by time, which speeds up queries.
  • Handle imbalance class distribution
    • Use class weights in loss function
    • Use naive resampling
    • Use synthetic resampling
      • Synthetic Minority Oversampling Technique (SMOTE)
        • Randomly pick a point from the minority class.
        • Compute the k-nearest neighbors for that point.
        • Add synthetic points between the chosen point and its neighbors.
  • Choose the right loss function
  • Retraining requirements
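The SMOTE steps above can be sketched in one dimension (seed, k, and the toy minority set are assumptions; real SMOTE works on full feature vectors):

```python
import random

def smote_point(minority, k=2, rng=random.Random(0)):
    # 1) Randomly pick a point from the minority class.
    p = rng.choice(minority)
    # 2) Find its k nearest neighbors within the minority class.
    neighbors = sorted((x for x in minority if x != p),
                       key=lambda x: abs(x - p))[:k]
    # 3) Interpolate a synthetic point between the point and one neighbor.
    n = rng.choice(neighbors)
    lam = rng.random()  # interpolation factor in [0, 1]
    return p + lam * (n - p)

minority = [1.0, 1.2, 1.5, 2.0]
synthetic = smote_point(minority)
```

Because the synthetic point lies on the segment between two real minority points, it always stays inside the minority region rather than duplicating an existing sample, which is what distinguishes SMOTE from naive oversampling.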

Inference

Inference is the process of using a trained machine learning model to make predictions.

  • Imbalanced Workload
    • An aggregator service routes requests from the upstream process to a worker pool; it can pick workers in the following ways:
      • By workload
      • Round robin
      • By request parameter
    • Serving logic and multiple models for a business-driven system.
  • Non-stationary problem
  • Exploration vs. exploitation
    • One common technique is Thompson Sampling, where at each time t we choose which action to take by sampling from the posterior distribution of each action’s reward.
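A minimal Beta-Bernoulli Thompson Sampling sketch (two hypothetical arms, binary rewards, seeded RNG are all assumptions): keep success/failure counts per action, sample from each action's Beta posterior, and take the argmax.

```python
import random

rng = random.Random(0)
# Per-arm [alpha, beta] parameters of a Beta posterior,
# starting from a uniform Beta(1, 1) prior.
counts = {"a": [1, 1], "b": [1, 1]}

def choose_action():
    # Sample one plausible reward rate per arm, then exploit the best sample.
    samples = {arm: rng.betavariate(a, b) for arm, (a, b) in counts.items()}
    return max(samples, key=samples.get)

def update(arm, reward):
    # Success increments alpha, failure increments beta.
    counts[arm][0 if reward else 1] += 1

arm = choose_action()
update(arm, reward=1)
```

Sampling from the posterior (instead of always taking the current best estimate) is what balances exploration against exploitation: uncertain arms still get chosen occasionally.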

Metrics

  • Offline metrics
    • Metrics like MAE or R2 to measure the goodness of the offline fit.
  • Online metrics
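As a concrete example of an offline metric, MAE is just the mean absolute difference between labels and predictions (toy values below are made up):

```python
def mae(y_true, y_pred):
    # Mean absolute error: average of |label - prediction|.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

print(mae([3.0, 5.0, 2.0], [2.5, 5.0, 3.0]))  # 0.5
```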
Author

Jie Sun

Posted on

2024-01-18

Updated on

2024-04-26
