Feed Based System
Problem Statement
Design a Twitter with 500 million daily active users feed system that will show the most relevant tweets for a user based on their social graph.
Metrics
- Topline is that it’s scientific as well as a business-driven decision. Overall user engagements: comments, likes, and retweets. Business-driven decision: time spent on Twitter.
 - Weighted engagement + Normalized Score (divided by the total number of active users). A higher score equates to higher user engagement.
 
Architecture
Tweet Selection
- New Tweets + Unseen Tweets: To fetch a mix of newly generated Tweets along with a portion of unseen Tweets from the cache.
- Edge case: User returning after a while
- The pool of Tweets keeps on increasing so a limit needs to be imposed on the number of Tweets.
 
 
 - Edge case: User returning after a while
 - Network Tweets + interest/popularity-based Tweets
- Two-dimensional scheme: selecting network Tweets + potentially engaging Tweets.
 - Benefits
- Helpful for Bootstrap problems
 - To increase the discoverability on the platform and help grow the user’s network.
 
 
 
Ranking
- Logistic regression
- Pros and Cons:
- it is fast to train
 - the major limitation of linear models is that it assumes linearly exists between the input features and prediction.
 
 - Approaches
- Train a single classifier for overall engagement.
 - Train seperate models for each engagement action based on production needs (i.e. P(like), P(comments), P(tweet)).
 
 
 - Pros and Cons:
 - MART (multiple additive regression trees)
- Tree-based models don’t require a large amount of data as they are able to generalize well quickly.
- Train a single model for the overall engagement.
 - Train specialized predicators to predict different kinds of engagement.
 - Overall engagement + individual predictor of each engagement action -> to build one common preictor
- P(engagement) and share its output as input into all of your predictors.
 
 
 
 - Tree-based models don’t require a large amount of data as they are able to generalize well quickly.
 - Deep Learning
- Seperate neural networks
 - Multi-task neural networks
 
 
- Stacking models and online learning
- Training tree-based models and neural networks to generate features that we will utilize in a linear model (logical regression)
- Trees generate features by using the triggering of leaf nodes (result in a boolean feature).
 - Plug-in the output of the last hidden layer as features into the logistic regression models.
 - Logistic regression: this helps the model to re-learn the weight of all tree leaves as well.
 
 - Main advantage: online learning
- Using real-time online learning with logistic regression so that we can utilize sparse features to learn the interactions.
 
 
 - Training tree-based models and neural networks to generate features that we will utilize in a linear model (logical regression)
 
Training Data Generation
- Training data generation through online user engagement
 - Balancing positive and negative samples
- Randomly downsample
 
 - Train test split
- We are building models with the intent to forecast the future.
 
 
Feature Engineering
The machine learning model is required to predict user engagement on user A’s Twitter feed.
Dense features
- User-author features
- User-author historical interactions
- author_liked_posts_3months
 - author_liked_posts_count_1year
 
 - User-author similarity
- common_followees
 - topic_similarity
 - tweet_content_embedding_similarity
 - social_embedding_similarity
 
 
 - User-author historical interactions
 - Author features
- Author’s degree of influence
- is_verified
 - author_social_rank
 - author_num_followers
 - follower_to_following_ratio
 
 - Historical trend of interactions on the author’s Tweets
- author_engagement_rate_3months
 - author_topic_engagement_rate_3months
 
 
 - Author’s degree of influence
 - User-Tweet features
- topic_similarity
 - embedding_similarity
 
 - Tweet features
- Features based on Tweet’s content
- Tweet_length
 - Tweet_recency
 - is_image_video
 - is_URL
 
 - Features based on Tweet’s interaction
- num_total_interactions
- caveat: We need to apply a simple time decay model to weight the latest interaction more than the ones that happened some time ago.
 
 - Seperate features for different engagements
 
 - num_total_interactions
 
 - Features based on Tweet’s content
 - Context-based features
- day_of_week
 - time_of_day
 - current_user_location
 - lastest_k_tag_interactions
 - approaching_holiday
 
 
- User-author features
 Sparse features
- unigrams/bigrams of a Tweet
 - user_id
 - tweets_id
 
Diversity
- Introducing the repetition penalty as adding negative weights
 - To bring the Tweet with repetition three steps in the sorted list.
 
Feed Based System