Posted 2025-08-31 | Updated 2025-09-13 | Tech / System-Design / ML | 5 minutes read (About 811 words)

Personalized feed

Problem Statement

Design a personalized feed of posts, news, and other content for a social media platform such as Facebook, Instagram, or LinkedIn, with the goal of maintaining user engagement.

Clarifications or Assumptions

assume the motivation for a personalized feed is to keep users engaged with the platform
refreshed activity consists of both unseen posts and posts with unseen comments
a post contains textual content, images, video, or any combination
the system should place the most engaging content at the top of timelines
Is there a specific type of engagement we are optimizing for?
- I assume users can click, like, share, comment, hide, block another user, and send connection requests
- for example, liking a post is more valuable than only clicking it
- our system should consider major reactions when ranking posts
the system should display the ranked posts in less than 200 milliseconds (ms).
about 3B users in total, of which about 2B are daily active users (DAU) who check their feeds twice a day

Frame as an ML business objective

Business Objective

The objective of the system is to increase user engagement.

ML Objective

Maximize a weighted score based on both implicit and explicit reactions

implicit reactions (dwell time, clicks, etc.) and explicit reactions (likes, shares, hides, etc.)
input: a user; output: a ranked list of unseen posts or posts with unseen comments sorted by engagement score.

We use pointwise Learning to Rank (LTR): the model predicts an engagement score for each post independently, and the feed is sorted by that score.

Data Engineering

User
Posts
User-post interactions
Friendship

Feature Engineering

Post features
- textual content
- images or videos
- reactions, converted into numeric values
- hashtags
  - Tokenization
    - hashtags often contain multiple words with no separators, so segment them with an algorithm such as Viterbi
  - Tokens to IDs
    - Hashtags evolve quickly
  - Vectorization
    - simple text representation methods such as TF-IDF or word2vec
    - Transformer-based models are useful when the context of the data is essential; context matters little for hashtags, so faster and lighter text representation methods are preferred here
- post’s age
  - Users tend to engage with newer content
  - Bucketize into categories + one-hot encoding
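The hashtag tokenization step in the list above can be sketched as a Viterbi-style dynamic program that picks the most probable split of a hashtag into words. This is a minimal sketch, not the author's implementation: the `WORD_PROBS` table, the `segment_hashtag` name, and the per-character penalty for unknown words are all assumptions made for illustration. A real system would estimate word probabilities from a large corpus.

```python
import math

# Hypothetical unigram probabilities (assumption); estimate these from a corpus in practice.
WORD_PROBS = {
    "life": 0.02, "is": 0.05, "good": 0.02,
    "go": 0.03, "od": 0.0001, "lifeis": 0.00001,
}

def segment_hashtag(tag, word_probs=WORD_PROBS, max_word_len=20, unknown_logp=-20.0):
    """Split a hashtag into its most probable word sequence."""
    text = tag.lstrip("#").lower()
    n = len(text)
    # best[i] = (best log-probability of text[:i], start index of its last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in word_probs:
                logp = math.log(word_probs[word])
            else:
                # Unknown chunks stay possible but are penalized per character.
                logp = unknown_logp * len(word)
            score = best[start][0] + logp
            if score > best[end][0]:
                best[end] = (score, start)
    # Backtrack from the end of the string to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]

print(segment_hashtag("#lifeisgood"))
```

With the toy table above, the split into "life", "is", "good" scores higher than alternatives such as "lifeis" + "good", because the product of the three common-word probabilities exceeds the probability of the rare fused token.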
User features
- Demographics: age, gender, country, etc
- Contextual information: device, time of the day, etc
- User-post historical interactions
- Being mentioned in the post
User-author affinities
- Like/click/comment/share rate
- Length of friendship
- Close friends and family
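As an illustration of the user-author affinity features above, the sketch below computes per-action rates from an interaction log. The log format (tuples of user, author, action), the `"impression"` action name, and the definition of a rate as "actions divided by impressions of that author's posts" are assumptions for this example, not something the notes specify.

```python
from collections import Counter

def affinity_rates(interactions, user_id, author_id,
                   actions=("click", "like", "comment", "share")):
    """Return per-action rates for one (user, author) pair.

    interactions: iterable of (user_id, author_id, action) tuples, where the
    hypothetical action "impression" means the author's post was shown to the user.
    """
    counts = Counter(
        action for u, a, action in interactions
        if u == user_id and a == author_id
    )
    impressions = counts["impression"]
    return {
        f"{act}_rate": (counts[act] / impressions if impressions else 0.0)
        for act in actions
    }

log = [
    ("u1", "a1", "impression"), ("u1", "a1", "click"),
    ("u1", "a1", "impression"), ("u1", "a1", "click"), ("u1", "a1", "like"),
    ("u1", "a1", "impression"), ("u1", "a1", "impression"),
    ("u2", "a1", "impression"),
]
print(affinity_rates(log, "u1", "a1"))
```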

Modeling

Candidate generation

Cached embeddings to improve efficiency
- pre-compute item embeddings offline and store them in a vector index (FAISS/Milvus/Elastic KNN) to support online approximate nearest neighbor (ANN) search
During online retrieval, the user tower generates the user embedding on the fly

Rule of thumb: If your score can be written as sim(u(user_only), v(item_only)), it’s retrieval-safe and ANN-friendly.

the retrieval model can’t consume user-item interaction features (which are usually the most powerful): once it consumes them, the user and item embeddings are no longer independent of each other, so they can’t be cached.
non-decomposable, pairwise features φ(user, item) are pushed to the ranking stage
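The retrieval flow above (cached item embeddings, a user embedding computed at request time, and a score of the form sim(u, v)) can be sketched as follows. This is a toy illustration under stated assumptions: `user_tower` is a stand-in for a trained model, the embeddings are made-up 3-dimensional vectors, and an exhaustive scan replaces a real ANN index such as FAISS.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Offline step: item-tower outputs, pre-computed and cached (toy values, assumption).
ITEM_INDEX = {
    "post_a": normalize([0.9, 0.1, 0.0]),
    "post_b": normalize([0.1, 0.9, 0.2]),
    "post_c": normalize([0.7, 0.3, 0.1]),
}

def user_tower(user_features):
    """Stand-in for the trained user tower: it sees user-only features."""
    return normalize(user_features)

def retrieve(user_features, k=2, index=ITEM_INDEX):
    """Online step: embed the user, then return the k most similar items."""
    u = user_tower(user_features)
    scored = sorted(index.items(), key=lambda kv: dot(u, kv[1]), reverse=True)
    return [post_id for post_id, _ in scored[:k]]

print(retrieve([1.0, 0.0, 0.0], k=2))
```

Because the item vectors never depend on the user, they can be indexed once and reused across requests; a feature such as "has this user liked this author before" would break that property, which is why it belongs in the ranking stage.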

Lightweight ranker

learning objective is different from the retrieval stage:
- Retrieval is trained for recall/coverage; pre-rank is distilled for “what the heavy model would keep”
- the training label is PSelect = {media in the top K results ranked by the second stage}
the only reason we use a two-tower NN again is its cacheability and the scale/latency/throughput tradeoffs
this approach can be viewed as distilling knowledge from a bigger second-stage model into a smaller, more lightweight first-stage model
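One way to read the PSelect definition above is as a labeling rule for the lightweight ranker's training data: a candidate is a positive example if the heavy ranker would keep it in its top K. The sketch below assumes heavy-ranker scores are already available per candidate; the function name `build_preselect_labels` is hypothetical.

```python
def build_preselect_labels(candidates, heavy_scores, k):
    """Label each candidate 1 if the heavy ranker places it in its top k, else 0."""
    ranked = sorted(candidates, key=lambda c: heavy_scores[c], reverse=True)
    selected = set(ranked[:k])
    return {c: int(c in selected) for c in candidates}

candidates = ["a", "b", "c", "d"]
heavy_scores = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 0.3}  # toy second-stage scores
print(build_preselect_labels(candidates, heavy_scores, k=2))
```

The lightweight model is then trained as a binary classifier on these labels, so it learns to approximate the heavy model's selection rather than raw engagement.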

Heavy ranker

a multi-task, multi-label (MTML) neural network model that predicts the probability of each reaction.
Value Model (VM): Expected Value = W_click * P(click) + W_like * P(like) - W_see_less * P(see less) + ...
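The value model above is a weighted sum of predicted reaction probabilities. A minimal sketch, assuming hypothetical weights and folding the minus sign of negative reactions into the weight itself (so `see_less` has a negative weight rather than being subtracted):

```python
# Hypothetical weights (assumption); in practice they are tuned to business goals.
WEIGHTS = {"click": 1.0, "like": 3.0, "see_less": -5.0}

def expected_value(probs, weights=WEIGHTS):
    """Combine per-reaction probabilities from the MTML model into one score."""
    return sum(weights[action] * probs.get(action, 0.0) for action in weights)

def rank_posts(predictions):
    """predictions: {post_id: {action: probability}}; returns ids, best first."""
    return sorted(predictions, key=lambda pid: expected_value(predictions[pid]),
                  reverse=True)

predictions = {
    "p1": {"click": 0.5, "like": 0.10, "see_less": 0.02},
    "p2": {"click": 0.6, "like": 0.05, "see_less": 0.10},
}
print(rank_posts(predictions))
```

Here `p2` has the higher click probability but ranks lower, because its higher "see less" probability carries a large negative weight.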

For both stages we choose to use neural networks

it’s important to be able to adapt to changing trends in users’ behavior very quickly. Neural networks allow us to do this by utilizing continual online training, meaning we can re-train (fine-tune) our models every hour as soon as we have new data.
a lot of important features are categorical in nature, and neural networks provide a natural way of handling categorical data by learning embeddings

Re-ranker

filter out or down-rank some items based on integrity-related scores
reorder items based on rules (for example, to avoid showing several posts from the same author in a row)
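A minimal sketch of the two re-ranking steps above. The notes do not name specific rules, so both the `integrity_risk` field with its threshold and the "no two consecutive posts from the same author" rule are assumptions chosen for illustration. The sketch only filters; down-ranking would instead reduce the score of risky items.

```python
def rerank(ranked_posts, risk_threshold=0.8):
    """ranked_posts: list of dicts with "id", "author", "integrity_risk",
    already sorted by ranking score (best first)."""
    # Step 1: drop items whose (hypothetical) integrity risk is too high.
    remaining = [p for p in ranked_posts if p["integrity_risk"] < risk_threshold]
    # Step 2: greedily avoid consecutive posts from the same author.
    result = []
    while remaining:
        last_author = result[-1]["author"] if result else None
        # Take the highest-ranked remaining post by a different author;
        # fall back to the top remaining post if every one shares the author.
        pick = next((p for p in remaining if p["author"] != last_author),
                    remaining[0])
        remaining.remove(pick)
        result.append(pick)
    return result

posts = [
    {"id": 1, "author": "alice", "integrity_risk": 0.1},
    {"id": 2, "author": "alice", "integrity_risk": 0.2},
    {"id": 3, "author": "bob", "integrity_risk": 0.9},
    {"id": 4, "author": "carol", "integrity_risk": 0.1},
    {"id": 5, "author": "alice", "integrity_risk": 0.0},
]
print([p["id"] for p in rerank(posts)])
```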

Loss function

The overall loss is computed by combining task-specific losses
use a binary cross-entropy loss for each binary classification task
a regression loss such as MAE, MSE, or Huber loss for the regression task (dwell-time prediction).
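The combination of task-specific losses described above can be sketched in plain Python for a single training example. The notes say only that the losses are combined; the weighted sum, the task names, and the default weight of 1.0 are assumptions, and Huber is picked here as one of the regression losses the notes list.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """BCE for one example; the prediction is clipped to avoid log(0)."""
    p = min(max(p_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    err = abs(y_true - y_pred)
    return 0.5 * err * err if err <= delta else delta * (err - 0.5 * delta)

def multitask_loss(labels, preds, task_weights=None):
    """labels/preds: dicts keyed by task. "dwell_time" is the regression task;
    every other key is treated as a binary classification task."""
    task_weights = task_weights or {}
    total = 0.0
    for task, y in labels.items():
        if task == "dwell_time":
            loss = huber(y, preds[task])
        else:
            loss = binary_cross_entropy(y, preds[task])
        total += task_weights.get(task, 1.0) * loss
    return total

labels = {"click": 1, "like": 0, "dwell_time": 3.0}
preds = {"click": 0.5, "like": 0.5, "dwell_time": 1.0}
print(multitask_loss(labels, preds))
```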

Improving the DNN architecture for passive users: these users rarely react to posts, so the current multi-task DNN model predicts very low probabilities for all of their reactions. Adding tasks for implicit reactions (dwell-time and skip) gives the model a signal for them.

Evaluation

Offline metrics

ROC-AUC for overall performance
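For reference, ROC-AUC equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that definition. It is quadratic in the number of examples and meant only for illustration; libraries typically use a sort-based method.

```python
def roc_auc(labels, scores):
    """Pairwise ROC-AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy example: per-task labels (e.g. "liked or not") and predicted probabilities.
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))
```

In a multi-task setup, this would be computed separately for each binary task (click, like, and so on).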

Online metrics

Click-through rate (CTR)
Reaction rate
Total time spent
User satisfaction rate found in a user survey

Serving

References ⭐

Personalized feed

https://janofsun.github.io/2025/08/31/Personalized-feed/

Author

Jie Sun

