Machine Learning Primer
Problem Framing
ML systems can be divided into two broad categories: predictive ML and generative AI. At a high level, ML problem framing consists of two distinct steps:
- Determining whether ML is the right approach for solving a problem.
  - The non-ML solution is the benchmark you’ll use to determine whether ML is a good use case.
- Framing the problem in ML terms, and determining which features have predictive power.
Prediction
- Regression models are unaware of product-defined thresholds.
- If your app’s behavior changes significantly because of small differences in a regression model’s predictions, you should consider implementing a classification model instead.
- Predict the final decision if possible
- Hiding the app’s behavior from the model can cause your app to produce the wrong behavior.
- Understand the problem’s constraints
- Dynamic thresholds: regression
- Fixed thresholds: classification
- Proxy labels substitute for labels that aren’t in the dataset. Note that every proxy label has potential problems.
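As a rough illustration of the fixed-threshold case above, the sketch below frames a thresholded quantity as a binary classification problem so the model predicts the final decision directly. The threshold value, feature set, and data are made up for illustration, not taken from the original post.

```python
# Minimal sketch: predict the final decision (threshold crossed or not)
# instead of regressing the raw quantity. All values here are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

FIXED_THRESHOLD_MINUTES = 30  # hypothetical product rule: "flag deliveries over 30 minutes"

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                      # e.g. distance, traffic, hour, load
delivery_minutes = 25 + 5 * X[:, 0] + rng.normal(scale=5, size=1000)

# Label is the decision itself: will the fixed threshold be crossed?
y = (delivery_minutes > FIXED_THRESHOLD_MINUTES).astype(int)

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X[:3])[:, 1])               # probability of crossing the threshold
```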
Generation
- Distillation: train a smaller student model to reproduce the behavior of a larger teacher model (see the sketch after this list).
- Fine-tuning or parameter-efficient tuning
- Prompt engineering
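As a hedged sketch of the distillation idea, the loss below mixes a temperature-softened teacher/student KL term with ordinary cross-entropy on the true labels. The temperature `T` and weight `alpha` are illustrative choices, not values from the original post.

```python
# Minimal knowledge-distillation loss sketch (teacher -> smaller student).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits for a 10-class problem.
student = torch.randn(8, 10)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student, teacher, labels))
```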
Feature Engineering
One-hot encoding
- One-hot encoding is used for categorical features that have medium cardinality.
- Problems:
- Expensive computation and high memory consumption are major problems.
- One-hot encoding is not suitable for Natural Language Processing tasks, where vocabulary cardinality is very high.
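A minimal one-hot encoding sketch with scikit-learn; the "device" values are made-up examples.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

devices = np.array([["ios"], ["android"], ["web"], ["ios"]])
encoder = OneHotEncoder(handle_unknown="ignore")   # unseen categories map to all zeros
X = encoder.fit_transform(devices)                 # one column per category
print(encoder.categories_)
print(X.toarray())                                 # width grows with cardinality
```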
Feature hashing
Cross feature
A crossed feature (conjunction) of two categorical variables with cardinalities $c_1$ and $c_2$ has cardinality $c_1 \times c_2$.
- Crossed features are usually combined with the hashing trick to keep the resulting dimensionality manageable (see the sketch below).
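A minimal sketch of crossing two categorical features and hashing the result with scikit-learn's FeatureHasher, so the $c_1 \times c_2$ combinations land in a fixed number of buckets. The column names and bucket count are illustrative.

```python
from sklearn.feature_extraction import FeatureHasher

countries = ["US", "US", "FR"]
devices = ["ios", "web", "ios"]
# Build the crossed (conjunction) feature as a single string token per row.
crossed = [[f"country_x_device={c}_{d}"] for c, d in zip(countries, devices)]

hasher = FeatureHasher(n_features=32, input_type="string")
X = hasher.transform(crossed)        # fixed width no matter how large c1 * c2 gets
print(X.shape)                       # (3, 32)
```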
Embedding
The purpose of embeddings is to capture the semantic meaning of features.
- A common rule of thumb for the embedding dimension is $d = \sqrt[4]{D}$, where $D$ is the number of categories.
- Embedding features are usually pre-computed and stored in key/value storage to reduce inference latency.
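A minimal sketch of the rule of thumb above with a PyTorch embedding layer; the category count is a made-up example.

```python
import torch
import torch.nn as nn

D = 10_000                              # number of categories, e.g. item ids
d = max(1, round(D ** 0.25))            # d = D^(1/4) rule of thumb gives d = 10 here
embedding = nn.Embedding(num_embeddings=D, embedding_dim=d)

item_ids = torch.tensor([3, 42, 9_999])
vectors = embedding(item_ids)           # (3, d) dense vectors learned during training
print(vectors.shape)
```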
Numeric features
- Normalization (min-max scaling to the range [0, 1])
- Standardization (rescaling to zero mean and unit variance)
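A minimal sketch contrasting the two with scikit-learn scalers; the values are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [5.0], [10.0], [100.0]])

normalized = MinMaxScaler().fit_transform(x)      # min-max scaling to [0, 1]
standardized = StandardScaler().fit_transform(x)  # zero mean, unit variance
print(normalized.ravel())
print(standardized.ravel())
```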
Training Pipeline
- Data partitioning
- Parquet and ORC files are usually partitioned by time to speed up queries (see the Parquet sketch after this list).
- Handle imbalanced class distribution (see the imbalance sketch after this list)
- Use class weights in the loss function
- Use naive resampling
- Use synthetic resampling, e.g. the Synthetic Minority Oversampling Technique (SMOTE), which works as follows:
- Randomly pick a point from the minority class
- Compute the k nearest neighbors of that point within the minority class
- Add synthetic points between the chosen point and its neighbors
- Choose the right loss function
- Retraining requirements
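The Parquet partitioning point above, as a minimal sketch with pandas (pyarrow backend); the path and column names are hypothetical.

```python
# Minimal sketch: writing a Parquet dataset partitioned by date so query
# engines can prune partitions. Paths and columns are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "event_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "user_id": [1, 2, 3],
    "label": [0, 1, 0],
})

# Creates event_date=2024-01-01/..., event_date=2024-01-02/... directories;
# engines such as Spark, Hive, or DuckDB can then skip irrelevant partitions.
df.to_parquet("training_data/", partition_cols=["event_date"])
```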
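And the imbalance-handling bullets, as a minimal sketch assuming scikit-learn plus imbalanced-learn; the synthetic dataset and parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

# Synthetic, heavily imbalanced dataset (~5% positives).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Option 1: class weights in the loss function (no change to the data).
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Option 2: SMOTE -- synthesize minority points between a sample and its
# k nearest minority-class neighbors, as described in the bullets above.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(y.mean(), y_res.mean())   # minority share before vs. after resampling
```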
Inference
Inference is the process of using a trained machine learning model to make predictions.
- Imbalanced Workload
- Requests flow from an upstream process through an Aggregator Service to a worker pool; the Aggregator Service can pick workers in the following ways (see the routing sketch after this list):
- Workload
- Round Robin
- Request parameter
- Serving logic and multiple models for business-driven systems.
- Non-stationary problem
- Update or retrain models to achieve sustained performance. One common algorithm is Bayesian Logistic Regression.
- Exploration vs. exploitation
- One common technique is Thompson Sampling, where at each time step t we decide which action to take based on the expected reward (sketched after this list).
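A toy sketch of the worker-selection strategies listed above (workload-based and round robin); the Worker type and its queue_depth field are hypothetical.

```python
import itertools
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    queue_depth: int                        # proxy for current workload

workers = [Worker("w0", 3), Worker("w1", 0), Worker("w2", 7)]

# Round robin: cycle through workers regardless of load.
round_robin = itertools.cycle(workers)
print(next(round_robin).name, next(round_robin).name)

# Workload-based: route the request to the least-loaded worker.
least_loaded = min(workers, key=lambda w: w.queue_depth)
print(least_loaded.name)

# Request-parameter-based routing could instead hash a request field
# (e.g. user id) to a worker index.
```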
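And a minimal Thompson Sampling sketch with Bernoulli rewards and Beta posteriors; the action set and true reward rates are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = np.array([0.05, 0.12, 0.09])   # unknown to the policy
alpha = np.ones(len(true_rates))            # Beta prior: 1 + observed successes
beta = np.ones(len(true_rates))             # Beta prior: 1 + observed failures

for t in range(5000):
    samples = rng.beta(alpha, beta)         # sample a plausible rate per action
    action = int(np.argmax(samples))        # act greedily on the sampled rates
    reward = rng.random() < true_rates[action]
    alpha[action] += reward                 # posterior update for the chosen action
    beta[action] += 1 - reward

print(alpha / (alpha + beta))               # posterior means approach true_rates
```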
Metrics
- Offline metrics
- Metrics like MAE or R² measure the goodness of the offline fit (see the metrics sketch after this list).
- Online metrics
- Expose the model to a specific percentage of real traffic.
- Allocate traffic to different models in production.
- A/B testing (examples)
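The offline-metrics bullet above, as a minimal sketch with scikit-learn; the labels and predictions are made up.

```python
from sklearn.metrics import mean_absolute_error, r2_score

y_true = [3.0, 5.0, 7.5, 10.0]   # held-out labels (made up)
y_pred = [2.8, 5.4, 7.0, 9.1]    # model predictions (made up)

print(mean_absolute_error(y_true, y_pred))   # MAE: average absolute error
print(r2_score(y_true, y_pred))              # R^2: fraction of variance explained
```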