Machine Learning Primer

Problem Framing

ML systems can be divided into two broad categories: predictive ML and generative AI. At a high level, ML problem framing consists of two distinct steps:

  • Determining whether ML is the right approach for solving a problem.
    • The non-ML solution is the benchmark you’ll use to determine whether ML is a good use case.
  • Framing the problem in ML terms, and determining which features have predictive power.

Prediction

  • Regression models are unaware of product-defined thresholds.
    • If your app’s behavior changes significantly because of small differences in a regression model’s predictions, you should consider implementing a classification model instead.
  • Predict the final decision if possible
    • Hiding the app’s behavior from the model can cause your app to produce the wrong behavior.
  • Understand the problem’s constraints
    • Dynamic thresholds: regression
    • Fixed thresholds: classification
  • Proxy labels substitute for labels that aren’t in the dataset. Note that every proxy label has potential drawbacks.
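A minimal sketch of the fixed-threshold point above, using a hypothetical 4.0-star rule: if the product decision is fixed, bake the threshold into the label and train a classifier on the final decision, rather than thresholding a regression model's output downstream.

```python
# Hypothetical app rule: surface items rated >= 4.0 stars.
ratings = [3.2, 4.7, 4.0, 2.9]

# Regression framing: predict the raw rating; the app applies the threshold later.
# Classification framing: encode the final decision directly in the label.
labels = [1 if r >= 4.0 else 0 for r in ratings]
print(labels)  # [0, 1, 1, 0]
```

With the decision encoded in the label, small regression errors near the threshold can no longer flip the app's behavior unexpectedly.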

Generation

  • Distillation: train a smaller model to mimic a larger model’s behavior
  • Fine-tuning or parameter-efficient tuning
  • Prompt engineering

Feature Engineering

One hot encoding

  • One hot encoding is used for categorical features that have low-to-medium cardinality.
  • Problems:
    • Expensive computation and high memory consumption are major problems.
    • One hot encoding is not suitable for Natural Language Processing tasks, where vocabularies are very large.
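A minimal pure-Python sketch of one hot encoding (the color vocabulary is a made-up example). The vector's dimension equals the vocabulary size, which is why high-cardinality features blow up memory:

```python
# Hypothetical categorical vocabulary.
vocab = ["red", "green", "blue"]

def one_hot(value, vocab):
    # Exactly one position is 1, the rest are 0; the vector length
    # equals the vocabulary size.
    return [1 if v == value else 0 for v in vocab]

print(one_hot("green", vocab))  # [0, 1, 0]
```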

Feature hashing
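A sketch of the hashing trick, assuming a small bucket count for illustration: each category is hashed into a fixed number of buckets, so memory is bounded regardless of cardinality, at the cost of collisions.

```python
import hashlib

NUM_BUCKETS = 16  # assumed bucket count; collisions trade accuracy for memory

def hash_bucket(value: str, num_buckets: int = NUM_BUCKETS) -> int:
    # Use a stable hash so the same category always lands in the same bucket
    # across processes and runs.
    digest = hashlib.md5(value.encode()).hexdigest()
    return int(digest, 16) % num_buckets
```

Unlike one hot encoding, no vocabulary needs to be stored or maintained, which also makes unseen categories at inference time a non-issue.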

Cross feature

A crossed feature (conjunction) combines two categorical variables of cardinality c1 and c2, yielding up to c1 × c2 distinct values.

  • A crossed feature is usually combined with the hashing trick to reduce its high dimensionality.
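A minimal sketch of a crossed feature combined with the hashing trick; the bucket count and key format are assumptions for illustration.

```python
import hashlib

NUM_BUCKETS = 1000  # assumed bucket count for the crossed feature

def crossed_feature(a: str, b: str, num_buckets: int = NUM_BUCKETS) -> int:
    # The raw conjunction has up to c1 * c2 possible values; hashing caps
    # the dimensionality at num_buckets (at the cost of collisions).
    key = f"{a}_x_{b}"
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_buckets
```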

Embedding

The purpose of embedding is to capture the semantic meaning of features.

  • A common rule of thumb for the embedding dimension is $d = \sqrt[4]{D}$, where $D$ is the number of categories.
  • Embedding features are usually pre-computed and stored in key/value storage to reduce inference latency.
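The rule of thumb above is a one-liner; for example, a 10,000-category feature would get roughly a 10-dimensional embedding:

```python
def embedding_dim(num_categories: int) -> int:
    # Rule of thumb: d = fourth root of the number of categories,
    # rounded to the nearest integer.
    return round(num_categories ** 0.25)

print(embedding_dim(10000))  # 10
```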

Numeric features

  • Normalization
  • Standardization
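The two transformations above, sketched in pure Python on toy data: min-max normalization rescales values to [0, 1], while standardization centers them to zero mean and unit variance.

```python
xs = [1.0, 2.0, 3.0, 4.0]

# Min-max normalization: rescale to [0, 1].
lo, hi = min(xs), max(xs)
normalized = [(x - lo) / (hi - lo) for x in xs]

# Standardization: zero mean, unit variance.
mean = sum(xs) / len(xs)
std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
standardized = [(x - mean) / std for x in xs]
```

Normalization is sensitive to outliers (they compress the rest of the range), so standardization is often preferred when the feature distribution has long tails.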

Training Pipeline

  • Data partitioning
    • Parquet and ORC files are usually partitioned by time, which speeds up queries.
  • Handle imbalance class distribution
    • Use class weights in loss function
    • Use naive resampling
    • Use synthetic resampling
      • Synthetic Minority Oversampling Technique (SMOTE)
        • Randomly pick a point from the minority class.
        • Compute the k-nearest neighbors for that point.
        • Add synthetic points between the chosen point and its neighbors.
  • Choose the right loss function
  • Retraining requirements
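The SMOTE steps above can be sketched in one dimension (seed, k, and the toy minority set are assumptions; real SMOTE works on full feature vectors):

```python
import random

def smote_point(minority, k=2, rng=random.Random(0)):
    # 1) Randomly pick a point from the minority class.
    p = rng.choice(minority)
    # 2) Find its k nearest neighbors within the minority class.
    neighbors = sorted((x for x in minority if x != p),
                       key=lambda x: abs(x - p))[:k]
    # 3) Interpolate a synthetic point between the point and one neighbor.
    n = rng.choice(neighbors)
    lam = rng.random()  # interpolation factor in [0, 1]
    return p + lam * (n - p)

minority = [1.0, 1.2, 1.5, 2.0]
synthetic = smote_point(minority)
```

Because the synthetic point lies on the segment between two real minority points, it always stays inside the minority region rather than duplicating an existing sample, which is what distinguishes SMOTE from naive oversampling.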

Inference

Inference is the process of using a trained machine learning model to make predictions.

  • Imbalanced Workload
    • An aggregator service routes requests from the upstream process to a worker pool; it can pick workers in the following ways:
      • By workload
      • Round robin
      • By request parameter
    • Serving logic and multiple models for a business-driven system.
  • Non-stationary problem
  • Exploration vs. exploitation
    • One common technique is Thompson Sampling, where at each time t we choose which action to take by sampling from the posterior distribution of each action’s reward.
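A minimal Beta-Bernoulli Thompson Sampling sketch (two hypothetical arms, binary rewards, seeded RNG are all assumptions): keep success/failure counts per action, sample from each action's Beta posterior, and take the argmax.

```python
import random

rng = random.Random(0)
# Per-arm [alpha, beta] parameters of a Beta posterior,
# starting from a uniform Beta(1, 1) prior.
counts = {"a": [1, 1], "b": [1, 1]}

def choose_action():
    # Sample one plausible reward rate per arm, then exploit the best sample.
    samples = {arm: rng.betavariate(a, b) for arm, (a, b) in counts.items()}
    return max(samples, key=samples.get)

def update(arm, reward):
    # Success increments alpha, failure increments beta.
    counts[arm][0 if reward else 1] += 1

arm = choose_action()
update(arm, reward=1)
```

Sampling from the posterior (instead of always taking the current best estimate) is what balances exploration against exploitation: uncertain arms still get chosen occasionally.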

Metrics

  • Offline metrics
    • Metrics like MAE or R2 to measure the goodness of the offline fit.
  • Online metrics
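As a concrete example of an offline metric, MAE is just the mean absolute difference between labels and predictions (toy values below are made up):

```python
def mae(y_true, y_pred):
    # Mean absolute error: average of |label - prediction|.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

print(mae([3.0, 5.0, 2.0], [2.5, 5.0, 3.0]))  # 0.5
```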
Author

Jie Sun

Posted on

2024-01-18

Updated on

2024-04-26
