Machine Learning Primer
Problem Framing
ML systems can be divided into two broad categories: predictive ML and generative AI. At a high level, ML problem framing consists of two distinct steps:
- Determining whether ML is the right approach for solving a problem.
  - The non-ML solution is the benchmark you’ll use to decide whether ML is a good use case.

- Framing the problem in ML terms, and determining which features have predictive power.
 
Prediction
- Regression models are unaware of product-defined thresholds.
  - If your app’s behavior changes significantly because of small differences in a regression model’s predictions, consider implementing a classification model instead.

- Predict the final decision if possible.
  - Hiding the app’s behavior from the model can cause your app to produce the wrong behavior.

- Understand the problem’s constraints (see the sketch after this list):
  - Dynamic thresholds: regression (apply the threshold downstream, so it can change without retraining).
  - Fixed thresholds: classification (bake the threshold into the labels).

- Proxy labels substitute for labels that aren’t in the dataset. Note that every proxy label has potential drawbacks.
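A minimal sketch of the two framings, using a hypothetical delivery-time example (the feature matrix, the 6-hour threshold, and the scikit-learn models are illustrative assumptions, not from the original notes):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical task: decide whether a delivery is "late" (hours > threshold).
rng = np.random.default_rng(0)
X = rng.random((1000, 5))                                   # placeholder features
delivery_hours = X @ np.array([2.0, 5.0, 1.0, 0.5, 3.0]) + rng.normal(size=1000)

# Dynamic threshold: keep a regression model and apply the product-defined
# threshold downstream, so the threshold can change without retraining.
reg = LinearRegression().fit(X, delivery_hours)
threshold_hours = 6.0                                       # product decision
is_late_pred = reg.predict(X) > threshold_hours

# Fixed threshold: bake the threshold into the labels and train a classifier
# that predicts the final decision directly.
labels = (delivery_hours > 6.0).astype(int)
clf = LogisticRegression(max_iter=1000).fit(X, labels)
is_late_prob = clf.predict_proba(X)[:, 1]
```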
 
Generation
- Distillation: create a smaller model that mimics a larger one (see the sketch after this list).
- Fine-tuning or parameter-efficient tuning.
- Prompt engineering.
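One common formulation of the distillation objective blends a soft-target loss against the teacher’s temperature-scaled outputs with a hard-label loss. A minimal PyTorch sketch (the temperature and mixing weight are illustrative choices, not prescribed by the original notes):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Knowledge-distillation loss: soft-target KL term plus hard-label term."""
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients to account for the temperature
    # Hard targets: standard cross-entropy on ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```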
 
Feature Engineering
One hot encoding
- One hot encoding is used for categorical features that have medium cardinality.
- Problems:
  - Expensive computation and high memory consumption, especially at high cardinality.
  - One hot encoding is not suitable for Natural Language Processing tasks, where vocabularies are very large.
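A tiny sketch of one hot encoding with pandas (the toy `device` column is a made-up example):

```python
import pandas as pd

# Hypothetical toy data, just to illustrate the encoding.
df = pd.DataFrame({"device": ["ios", "android", "web", "ios"]})

# get_dummies produces one binary column per category; the column count
# grows linearly with cardinality, which is why one hot encoding becomes
# expensive for high-cardinality features.
one_hot = pd.get_dummies(df, columns=["device"])
print(one_hot)
```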
 
 
Feature hashing
Cross feature
A crossed feature (conjunction) of two categorical variables with cardinalities c1 and c2 has cardinality c1 × c2.
- Crossed features are usually combined with the hashing trick to reduce the resulting high dimensionality (see the sketch below).
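A minimal sketch of hashing a crossed feature with scikit-learn’s `FeatureHasher` (the `country x device` cross and the 32-bucket size are illustrative assumptions):

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical crossed feature: "country x device". The raw cross has
# cardinality |countries| * |devices|; hashing maps it into a fixed
# number of buckets instead.
samples = [
    {"country": "US", "device": "ios"},
    {"country": "BR", "device": "android"},
]
crossed = [[f"{s['country']}_x_{s['device']}"] for s in samples]

hasher = FeatureHasher(n_features=32, input_type="string")
X = hasher.transform(crossed)   # sparse matrix of shape (2, 32)
print(X.toarray().nonzero())
```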
 
Embedding
The purpose of embedding is to capture the semantic meaning of features.
- A common rule of thumb for the embedding dimension is $d = \sqrt[4]{D}$, where $D$ is the number of categories.
- Embedding features are usually pre-computed and stored in key/value storage to reduce inference latency.
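A sketch of the rule of thumb together with a lookup layer, here using PyTorch’s `nn.Embedding` (the category count is hypothetical):

```python
import torch
import torch.nn as nn

# Rule-of-thumb sizing: d ≈ D ** 0.25 (D is hypothetical here).
D = 10_000                       # number of categories, e.g. item ids
d = max(1, round(D ** 0.25))     # -> 10 dimensions

embedding = nn.Embedding(num_embeddings=D, embedding_dim=d)

# Look up embeddings for a batch of category ids.
ids = torch.tensor([3, 42, 9999])
vectors = embedding(ids)         # shape: (3, d)
print(vectors.shape)
```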
 
Numeric features
- Normalization (min-max scaling to [0, 1]).
- Standardization (z-score: subtract the mean, divide by the standard deviation).
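A quick sketch of both transforms on a made-up numeric feature:

```python
import numpy as np

x = np.array([2.0, 5.0, 9.0, 14.0])     # hypothetical numeric feature

# Normalization (min-max scaling): rescale to [0, 1].
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): zero mean, unit variance.
x_std = (x - x.mean()) / x.std()

print(x_norm, x_std)
```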
 
Training Pipeline
- Data partitioning
  - Parquet and ORC files are usually partitioned by time to speed up query performance.
 
- Handle imbalanced class distribution (see the sketch after this list):
  - Use class weights in the loss function.
  - Use naive resampling (random over- or under-sampling).
  - Use synthetic resampling, e.g. the Synthetic Minority Oversampling Technique (SMOTE):
    - Randomly pick a point from the minority class.
    - Compute the k-nearest neighbors for that point.
    - Add synthetic points between the chosen point and its neighbors.

- Choose the right loss function.
- Retraining requirements.
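A minimal sketch of two of the options above on made-up imbalanced data: class weights via scikit-learn’s `class_weight="balanced"` and naive random oversampling (SMOTE itself is available in the separate imbalanced-learn package):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical imbalanced dataset: roughly 5% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (rng.random(2000) < 0.05).astype(int)

# Option 1: class weights in the loss function.
# "balanced" reweights each class inversely to its frequency.
clf_weighted = LogisticRegression(class_weight="balanced").fit(X, y)

# Option 2: naive resampling - randomly oversample the minority class
# until the two classes have equal counts.
minority_idx = np.where(y == 1)[0]
extra = rng.choice(minority_idx, size=len(y) - 2 * len(minority_idx), replace=True)
X_res = np.vstack([X, X[extra]])
y_res = np.concatenate([y, y[extra]])
clf_resampled = LogisticRegression().fit(X_res, y_res)
```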
 
Inference
Inference is the process of using a trained machine learning model to make predictions.
- Imbalanced Workload
  - An upstream process sends requests to an Aggregator Service, which dispatches them to a worker pool. The Aggregator Service can pick workers in the following ways (see the sketches after this list):
    - Workload (route to the least-loaded worker)
    - Round Robin
    - Request parameter

- Serving logic and multiple models for business-driven systems.

- Non-stationary problem
  - Update or retrain models to achieve sustained performance. One common algorithm is Bayesian Logistic Regression, whose posterior can be updated incrementally as new data arrives.

- Exploration vs. exploitation
  - One common technique is Thompson Sampling: at each time t, choose an action by sampling each action’s expected reward from its posterior and taking the best sample (see the sketch below).
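A minimal sketch of Thompson Sampling for a Bernoulli bandit with Beta posteriors (the three actions and their true reward rates are made up):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical Bernoulli bandit: 3 actions with unknown success rates.
true_rates = [0.05, 0.10, 0.07]
successes = np.ones(3)   # Beta(1, 1) prior per action
failures = np.ones(3)

for t in range(10_000):
    # Sample a plausible reward rate for each action from its posterior,
    # then exploit the best sample; the sampling itself provides exploration.
    samples = rng.beta(successes, failures)
    action = int(np.argmax(samples))

    reward = rng.random() < true_rates[action]
    successes[action] += reward
    failures[action] += 1 - reward

print("posterior means:", successes / (successes + failures))
```

And a toy sketch of the three worker-selection strategies listed under Imbalanced Workload above (worker names, loads, and the routing key are hypothetical):

```python
import itertools
from typing import Dict

workers = ["worker-a", "worker-b", "worker-c"]   # hypothetical pool

# Round Robin: cycle through workers regardless of load.
round_robin = itertools.cycle(workers)

# Workload: send the request to the least-loaded worker.
current_load: Dict[str, int] = {"worker-a": 3, "worker-b": 1, "worker-c": 5}

def pick_least_loaded() -> str:
    return min(current_load, key=current_load.get)

# Request parameter: route on a field of the request, e.g. hash a user id
# so the same user always hits the same worker (useful for caching).
def pick_by_request(user_id: str) -> str:
    return workers[hash(user_id) % len(workers)]

print(next(round_robin), pick_least_loaded(), pick_by_request("user-123"))
```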
 
 
Metrics
- Offline metrics
  - Metrics like MAE or R² measure the goodness of the offline fit (see the sketch after this list).

- Online metrics
  - Expose the model to a specific percentage of real traffic.
  - Allocate traffic to different models in production.
  - A/B testing.
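A quick sketch of computing the offline metrics mentioned above with scikit-learn (the predictions are made up):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical regression predictions vs. ground truth.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.9, 6.5])

print("MAE:", mean_absolute_error(y_true, y_pred))
print("R^2:", r2_score(y_true, y_pred))
```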
 
 