Automated Detection of Claims in TikTok Content

Project Overview

This project aims to develop a predictive model that classifies whether a TikTok video presents a claim or expresses an opinion. The goal is to help TikTok's moderation team streamline the review of flagged content by prioritizing user reports more effectively. Using a dataset of TikTok content, machine learning models were trained, evaluated, and tested on their ability to accurately classify videos as claim-based or opinion-based.

Business Problem

TikTok relies on user reports to identify potentially problematic content (misinformation, unverified claims, inappropriate material, etc.). With millions of users creating vast amounts of content every day, TikTok faces a significant challenge in processing all of the reports it receives.

The content moderation team needs to assess whether flagged content contains factual claims or simply expresses opinions. Classifying content into these two categories can help TikTok reduce the workload on moderators, allowing them to prioritize claim-based content that may require quicker and/or more thorough reviews. This will not only reduce response time but also improve the user experience by handling harmful or misleading content more efficiently.

Data

The dataset collected includes several variables that can help identify claim- or opinion-based content:

  • Video id: Identifying number assigned to the video upon publication on TikTok
  • Video duration: Length of the published video, in seconds
  • Video transcription text: Transcribed text of the words spoken in the video
  • Verified status: Whether the TikTok user who published the video is "verified" or "not verified"
  • Author ban status: Whether the TikTok user who published the video is "active", "under review" or "banned"
  • Video view count: The total number of views a video has
  • Video like count: The total number of likes a video has
  • Video share count: The total number of times a video has been shared
  • Video download count: The total number of times a video has been downloaded
  • Video comment count: The total number of comments a video has
  • Claim status (Target): Whether the video presents a "claim" or an "opinion"

Modelling

# Data manipulation
import numpy as np
import pandas as pd

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Data preprocessing
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Data modeling
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Data evaluation
import sklearn.metrics as metrics
from xgboost import plot_importance
data = pd.read_csv("tiktok_dataset.csv")

EDA

data.head()
# claim_status video_id video_duration_sec video_transcription_text verified_status author_ban_status video_view_count video_like_count video_share_count video_download_count video_comment_count
0 1 claim 7017666017 59 someone shared with me that drone deliveries a... not verified under review 343296.0 19425.0 241.0 1.0 0.0
1 2 claim 4014381136 32 someone shared with me that there are more mic... not verified active 140877.0 77355.0 19034.0 1161.0 684.0
2 3 claim 9859838091 31 someone shared with me that american industria... not verified active 902185.0 97690.0 2858.0 833.0 329.0
3 4 claim 1866847991 25 someone shared with me that the metro of st. p... not verified active 437506.0 239954.0 34812.0 1234.0 584.0
4 5 claim 7105231098 19 someone shared with me that the number of busi... not verified active 56167.0 34987.0 4110.0 547.0 152.0
data.shape
(19382, 12)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB
data.describe()
# video_id video_duration_sec video_view_count video_like_count video_share_count video_download_count video_comment_count
count 19382.000000 1.938200e+04 19382.000000 19084.000000 19084.000000 19084.000000 19084.000000 19084.000000
mean 9691.500000 5.627454e+09 32.421732 254708.558688 84304.636030 16735.248323 1049.429627 349.312146
std 5595.245794 2.536440e+09 16.229967 322893.280814 133420.546814 32036.174350 2004.299894 799.638865
min 1.000000 1.234959e+09 5.000000 20.000000 0.000000 0.000000 0.000000 0.000000
25% 4846.250000 3.430417e+09 18.000000 4942.500000 810.750000 115.000000 7.000000 1.000000
50% 9691.500000 5.618664e+09 32.000000 9954.500000 3403.500000 717.000000 46.000000 9.000000
75% 14536.750000 7.843960e+09 47.000000 504327.000000 125020.000000 18222.000000 1156.250000 292.000000
max 19382.000000 9.999873e+09 60.000000 999817.000000 657830.000000 256130.000000 14994.000000 9599.000000

Checking for missing values

data.isna().sum()
#                             0
claim_status                298
video_id                      0
video_duration_sec            0
video_transcription_text    298
verified_status               0
author_ban_status             0
video_view_count            298
video_like_count            298
video_share_count           298
video_download_count        298
video_comment_count         298
dtype: int64

There are at least 298 rows with missing data

data[data.isna().any(axis=1)].head()
# claim_status video_id video_duration_sec video_transcription_text verified_status author_ban_status video_view_count video_like_count video_share_count video_download_count video_comment_count
19084 19085 NaN 4380513697 39 NaN not verified active NaN NaN NaN NaN NaN
19085 19086 NaN 8352130892 60 NaN not verified active NaN NaN NaN NaN NaN
19086 19087 NaN 4443076562 25 NaN not verified active NaN NaN NaN NaN NaN
19087 19088 NaN 8328300333 7 NaN not verified active NaN NaN NaN NaN NaN
19088 19089 NaN 3968729520 8 NaN not verified active NaN NaN NaN NaN NaN

It appears that the observations with missing data only carry information on the video's id, duration and author.

These observations are missing the most important information (the transcription text, the engagement counts and the target itself) and should therefore be removed.

data.dropna(inplace=True)

Checking for duplicates

data.duplicated().sum()
0

There isn't any duplicated data

Dropping uninformative columns

data.drop(columns=['#','video_id'],inplace=True)

Checking for outliers

for column in data.select_dtypes(include=['int64','float64']).columns:
    plt.figure(figsize=(5,1))
    sns.boxplot(data[column],orient='h')
    plt.xlabel(column)
    plt.show()

(Figure: boxplots of video_duration_sec, video_view_count, video_like_count, video_share_count, video_download_count and video_comment_count)

Video like, share, comment and download counts are all heavily right-skewed: most videos have modest engagement, while the rare few that go viral reach counts orders of magnitude higher.
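
A quick way to quantify this skew (an illustrative sketch; positive values well above zero confirm the long right tail):

# Quantifying the right skew of the engagement counts
engagement_cols = ['video_like_count', 'video_share_count',
                   'video_download_count', 'video_comment_count']
print(data[engagement_cols].skew())  # values far above 0 indicate a long right tail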

Checking the class balance of the target variable

data.claim_status.value_counts(normalize=True)
claim_status
claim      0.503458
opinion    0.496542
Name: proportion, dtype: float64

Feature transformation

data.claim_status.unique()
array(['claim', 'opinion'], dtype=object)
data.verified_status.unique()
array(['not verified', 'verified'], dtype=object)
data.author_ban_status.unique()
array(['under review', 'active', 'banned'], dtype=object)

The three categorical features to handle are claim_status, verified_status, and author_ban_status:

Two of them are binary

  • claim_status: opinion , claim
  • verified_status: not verified , verified

While the last is ordinal in nature

  • author_ban_status: banned , under review , active
# Encoding the binary categorical features as 0/1
data.claim_status = data.claim_status.replace({'opinion': 0, 'claim': 1})
data.verified_status = data.verified_status.replace({'not verified': 0, 'verified': 1})

# Label encoding the ordinal feature
data.author_ban_status = data.author_ban_status.replace({'banned':0,'under review': 1, 'active': 2})
data.head()
claim_status video_duration_sec video_transcription_text verified_status author_ban_status video_view_count video_like_count video_share_count video_download_count video_comment_count text_len
0 1 59 someone shared with me that drone deliveries a... 0 1 343296.0 19425.0 241.0 1.0 0.0 97
1 1 32 someone shared with me that there are more mic... 0 2 140877.0 77355.0 19034.0 1161.0 684.0 107
2 1 31 someone shared with me that american industria... 0 2 902185.0 97690.0 2858.0 833.0 329.0 137
3 1 25 someone shared with me that the metro of st. p... 0 2 437506.0 239954.0 34812.0 1234.0 584.0 131
4 1 19 someone shared with me that the number of busi... 0 2 56167.0 34987.0 4110.0 547.0 152.0 128

Feature extraction

# Creating a feature with the character length of each transcription
data['text_len'] = data.video_transcription_text.apply(len)

Text length of videos with claims versus opinions

data[['claim_status','text_len']].groupby(['claim_status']).mean().reset_index()
claim_status text_len
0 claim 95.376978
1 opinion 82.722562
sns.histplot(x=data.text_len,hue=data.claim_status)
plt.show()

(Figure: histogram of transcription text length, split by claim status)

Tokenizing the text column

# Tokenizing text into 2-grams and 3-grams
c_vec = CountVectorizer(ngram_range=(2, 3),
                            max_features=15,
                            stop_words='english')
c_vec
# Fitting the vectorizer to the transcription text
count_data = c_vec.fit_transform(data['video_transcription_text']).toarray()
count_data
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
# Creating a dataframe with counts of the 15 most frequent n-grams
count_df = pd.DataFrame(data=count_data, columns=c_vec.get_feature_names_out())

count_df
colleague discovered colleague learned colleague read discussion board friend learned friend read internet forum learned media learned news media claim news claim point view read media social media willing wager
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
19079 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
19080 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
19081 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
19082 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
19083 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

19084 rows × 15 columns

# Updating the original dataframe
data = pd.concat([data.drop(columns='video_transcription_text'),count_df],axis=1)

data.head()
claim_status video_duration_sec verified_status author_ban_status video_view_count video_like_count video_share_count video_download_count video_comment_count text_len ... friend read internet forum learned media learned news media claim news claim point view read media social media willing wager
0 1 59 0 1 343296.0 19425.0 241.0 1.0 0.0 97 ... 0 0 0 0 0 0 0 0 0 0
1 1 32 0 2 140877.0 77355.0 19034.0 1161.0 684.0 107 ... 0 0 0 0 0 0 0 0 0 0
2 1 31 0 2 902185.0 97690.0 2858.0 833.0 329.0 137 ... 0 0 0 0 0 0 0 0 0 0
3 1 25 0 2 437506.0 239954.0 34812.0 1234.0 584.0 131 ... 0 0 0 0 0 0 0 0 0 0
4 1 19 0 2 56167.0 34987.0 4110.0 547.0 152.0 128 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 25 columns

Creating the ML models

x=data.drop(columns='claim_status')
y=data.claim_status
# Creating 3 subsets: train (60%), validation (20%), test (20%)

x_tr,x_test,y_tr,y_test = train_test_split(x,y,random_state=0,test_size=0.2)
x_train,x_val,y_train,y_val=train_test_split(x_tr,y_tr,random_state=0,test_size=0.25)
# Making sure all dataframes are as expected
for subset in ['train','test','val']:
    print('x'+subset+':',globals()['x_'+subset].shape)
    print('y'+subset+':',globals()['y_'+subset].shape)
xtrain: (11450, 24)
ytrain: (11450,)
xtest: (3817, 24)
ytest: (3817,)
xval: (3817, 24)
yval: (3817,)

Random Forest

The evaluation metric chosen is recall, as it emphasizes minimizing false negatives: the TikTok team wants to catch as many claim videos as possible, since a claim that slips through unreviewed is more costly than an opinion that gets flagged for review.
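
For reference, recall is the share of actual claims the model catches. A minimal, self-contained sketch of how it is computed with scikit-learn (the labels below are made up for illustration and are not taken from the dataset):

from sklearn.metrics import recall_score

# Toy example: recall = TP / (TP + FN)
y_true_demo = [1, 1, 1, 0, 0]   # three actual claims, two opinions
y_pred_demo = [1, 1, 0, 0, 0]   # one claim is missed (a false negative)
print(recall_score(y_true_demo, y_pred_demo))  # 2 / (2 + 1) ≈ 0.67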

# Instantiating the random forest model
rf=RandomForestClassifier(random_state=0)

# Selecting hyperparameters to tune
cv_params = {
    'max_depth': [5, 7, None],
    'max_features': [0.3, 0.6],
    'max_samples': [0.7],
    'min_samples_leaf': [1,2],
    'min_samples_split': [2,3],
    'n_estimators': [75,100,200],
}

# Selecting scores
scoring=('precision','accuracy','recall','f1')

# Instantiating the Cross-Validation classifier
clf=GridSearchCV(rf,param_grid=cv_params,scoring=scoring,cv=5,refit='recall')
%%time
clf.fit(x_train,y_train)
CPU times: total: 9min 28s
Wall time: 9min 49s
print(clf.best_params_)
print(clf.best_score_)
{'max_depth': None, 'max_features': 0.6, 'max_samples': 0.7, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 75}
0.9948228253467271

Gradient Boosting

xgb = XGBClassifier(objective='binary:logistic', random_state=0)

xgb_params ={'max_depth': [4,8,12],
             'min_child_weight': [3, 5],
             'learning_rate': [0.01, 0.1],
             'n_estimators': [300, 500]
             }

clf_xgb = GridSearchCV(xgb,param_grid=xgb_params,scoring=scoring,cv=5,refit='recall')
%%time
clf_xgb.fit(x_train, y_train)
CPU times: total: 2min 6s
Wall time: 39.6 s
print(clf_xgb.best_params_)
print(clf_xgb.best_score_)
{'learning_rate': 0.1, 'max_depth': 12, 'min_child_weight': 3, 'n_estimators': 300}
0.989645054622456

Evaluating the models

Random Forest:

y_rf_pred=clf.best_estimator_.predict(x_val)
# Calculating the confusion matrix for the Random Forest
cm_rf = metrics.confusion_matrix(y_val,y_rf_pred)

# Display of confusion matrix
metrics.ConfusionMatrixDisplay(confusion_matrix=cm_rf).plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x293e4aa0210>

(Figure: Random Forest confusion matrix on the validation set)

target_labels = ['opinion', 'claim']
print(metrics.classification_report(y_val, y_rf_pred, target_names=target_labels))
              precision    recall  f1-score   support

     opinion       1.00      1.00      1.00      1892
       claim       1.00      1.00      1.00      1925

    accuracy                           1.00      3817
   macro avg       1.00      1.00      1.00      3817
weighted avg       1.00      1.00      1.00      3817

XGBoost:

y_xgb_pred=clf_xgb.best_estimator_.predict(x_val)
# Calculating the confusion matrix for XGBoost
cm_xgb = metrics.confusion_matrix(y_val,y_xgb_pred)

# Display of confusion matrix
metrics.ConfusionMatrixDisplay(confusion_matrix=cm_xgb).plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x293e235e790>

(Figure: XGBoost confusion matrix on the validation set)

print(metrics.classification_report(y_val, y_xgb_pred, target_names=target_labels))
              precision    recall  f1-score   support

     opinion       0.99      1.00      0.99      1892
       claim       1.00      0.99      0.99      1925

    accuracy                           0.99      3817
   macro avg       0.99      0.99      0.99      3817
weighted avg       0.99      0.99      0.99      3817

Both models achieve near-perfect results, but the Random Forest ekes out a win with a marginally higher recall score on the validation set.
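
To put the two validation recalls side by side (a small sketch reusing the predictions computed above):

# Comparing validation recall of the two tuned models
rf_recall = metrics.recall_score(y_val, y_rf_pred)
xgb_recall = metrics.recall_score(y_val, y_xgb_pred)
print(f'Random Forest recall: {rf_recall:.4f}')
print(f'XGBoost recall:       {xgb_recall:.4f}')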

Testing

Using the best model on the held-out test set to estimate how it would perform on future, never-before-seen data

y_pred = clf.best_estimator_.predict(x_test)
# Calculating the confusion matrix for the Random Forest
cm_rf = metrics.confusion_matrix(y_test,y_pred)

# Display of confusion matrix
metrics.ConfusionMatrixDisplay(confusion_matrix=cm_rf).plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x293e2390a10>

(Figure: Random Forest confusion matrix on the test set)
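
For completeness, the same classification report used for the validation set could also be printed for the test set (a sketch mirroring the earlier evaluation):

print(metrics.classification_report(y_test, y_pred, target_names=target_labels))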

Feature importances of the best model

importances = clf.best_estimator_.feature_importances_

rf_importances = pd.Series(importances, index=x_test.columns).sort_values(ascending=False)[:10]

sns.barplot(x=rf_importances.index,y=rf_importances)
plt.xticks(rotation=45,ha='right')
plt.show()

(Figure: bar plot of the 10 most important features)

The most predictive features are all related to the video's popularity. This is not unexpected, as the exploratory analysis pointed to the same conclusion.

Conclusion

The models demonstrate the potential of machine learning to automate the detection of claims in TikTok videos, with the best-performing model, a Random Forest, providing near-perfect classification of claims.

Recommendations:

  • Implement the predictive model: The model can be integrated into the moderation pipeline to automatically flag claim-based content for faster review (see the sketch after these recommendations).

  • Combine predictive models with human moderators: While the model performs well, the automated system should still be complemented by human reviewers, especially for more situational cases.
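
As a rough illustration of the first recommendation (hypothetical integration code with made-up file and function names, not part of this project), the tuned model could be persisted and reused inside the moderation pipeline:

import joblib

# Persist the tuned Random Forest from the grid search (hypothetical file name)
joblib.dump(clf.best_estimator_, 'claim_classifier.joblib')

# Later, inside the moderation pipeline (sketch)
model = joblib.load('claim_classifier.joblib')

def flag_for_priority_review(video_features):
    # video_features: a DataFrame with the same columns used in training
    # Predicted label 1 means 'claim' under the encoding used above
    return model.predict(video_features) == 1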

Future Steps

  • Expand the dataset to include multiple languages so the model can cover the platform's full range of content.

  • Continue to improve the model using feedback from moderators, covering more subjective cases and improving performance over time.
