Enhanced Farsi News Classification On Large Dataset

Introduction

In this task, we work with a large (2 GB) dataset in which one column contains the text of news articles and another column indicates the category (topic) of the news. The goal of this project is to develop a neural network model that classifies news articles into their respective categories. The pipeline preprocesses the text with tokenization and TF-IDF feature extraction before training.

Initial Processing:

The first step is to clean and refine the category labels. A review of the labels shows that some of them are synonymous or belong to the same broader category. These labels are merged into a smaller set of main categories to improve classification accuracy, and any rows with null values are identified and removed. The category distributions before and after the merge are shown in the notebook.
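A minimal sketch of this cleanup step with pandas; the file name, the column names (`text`, `category`), and the label mapping below are illustrative assumptions, not the repository's actual values:

```python
import pandas as pd

# Load the raw dataset; the file name and column names are assumptions.
df = pd.read_csv("farsi_news.csv")

# Illustrative mapping of synonymous / overlapping labels to main categories.
# The real mapping follows the label review described above.
label_map = {
    "sport-news": "sports",
    "football": "sports",
    "world-economy": "economy",
}
df["category"] = df["category"].replace(label_map)

# Remove rows with missing text or category.
df = df.dropna(subset=["text", "category"])
```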

Train/Test Split:

To optimize the model and reduce overfitting, the dataset is split into training and testing sets. Before splitting, the data is balanced so that each category contributes an equal number of samples. The split allocates 80% of the data to training and 20% to testing.
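Continuing from the cleaned DataFrame above, a sketch of balancing the classes and splitting 80/20; the per-class downsampling strategy is an assumption:

```python
from sklearn.model_selection import train_test_split

# Downsample every category to the size of the smallest one so the classes are balanced.
min_count = df["category"].value_counts().min()
balanced = (
    df.groupby("category", group_keys=False)
      .apply(lambda g: g.sample(min_count, random_state=42))
)

# 80/20 split, stratified so both sets keep the same category proportions.
train_df, test_df = train_test_split(
    balanced, test_size=0.2, stratify=balanced["category"], random_state=42
)
```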

Tokenization and Feature Extraction:

Tokenization breaks the text into words (tokens), and the TF-IDF (Term Frequency-Inverse Document Frequency) method is used for feature extraction. TF-IDF weights a word by its importance within a document relative to the whole collection. Parameters such as ngram_range and max_features are tuned to control the tokenization process and the number of features extracted. It is important to note that fitting TF-IDF after the train/test split (on the training set only) avoids leaking test data into the features; the resulting scores are somewhat more pessimistic, but they better reflect real-life use.
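A sketch of this step with scikit-learn's TfidfVectorizer; the parameter values shown are illustrative rather than the tuned ones:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit on the training texts only, then transform the test texts,
# so no information from the test set leaks into the features.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=50_000)
X_train = vectorizer.fit_transform(train_df["text"])
X_test = vectorizer.transform(test_df["text"])
```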

Artificial Neural Network (ANN):

The neural network architecture is designed for this task. Although Recurrent Neural Networks (RNNs) are commonly used for natural language processing, their complexity can lead to overfitting; the architecture here therefore includes dropout layers, which deactivate a fraction of the neurons during training to reduce overfitting. The final output layer uses a softmax activation function to classify the input into one of the predefined categories. Adam is used as the optimizer and categorical cross-entropy as the loss function.
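A minimal Keras sketch of such an architecture over the TF-IDF features; the layer sizes and dropout rates are assumptions, not the exact configuration used in the notebook (`y_train_onehot` is an assumed name for the one-hot encoded labels):

```python
from tensorflow.keras import layers, models

num_features = X_train.shape[1]          # TF-IDF vocabulary size
num_classes = y_train_onehot.shape[1]    # number of merged categories

# Feed-forward network over TF-IDF features with dropout for regularization.
model = models.Sequential([
    layers.Input(shape=(num_features,)),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),                 # randomly deactivate half the units during training
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(num_classes, activation="softmax"),  # one probability per category
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```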

Model Evaluation:

The model is evaluated using the following methods:

Cross-Validation: A common method in machine learning to assess how the model generalizes to an independent dataset.
Confusion Matrix: A tool to visualize the performance of the classification model by showing the correct and incorrect predictions.
K-Fold Cross-Validation: The dataset is split into k parts; each part is used once as the test set while the rest are used for training, giving a more robust estimate of the model's performance (see the sketch after this list).
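A sketch of the k-fold evaluation and confusion matrix. It assumes the TF-IDF features from above, a hypothetical `build_model()` helper that recreates the ANN, and assumed label variables (`y_train_int` / `y_test_int` for integer class indices, `y_train_onehot` for one-hot labels):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix

# 5-fold cross-validation: each fold is used once as the validation set.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_accuracies = []
for train_idx, val_idx in skf.split(X_train, y_train_int):
    fold_model = build_model()  # hypothetical factory for the ANN sketched above
    # .toarray() converts the sparse TF-IDF rows to dense input for simplicity.
    fold_model.fit(X_train[train_idx].toarray(), y_train_onehot[train_idx],
                   epochs=5, batch_size=256, verbose=0)
    _, acc = fold_model.evaluate(X_train[val_idx].toarray(),
                                 y_train_onehot[val_idx], verbose=0)
    fold_accuracies.append(acc)
print("mean CV accuracy:", np.mean(fold_accuracies))

# Confusion matrix of the final model on the held-out test set.
y_pred = np.argmax(model.predict(X_test.toarray()), axis=1)
print(confusion_matrix(y_test_int, y_pred))
```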

User Interface Implementation:

The user interface is created with the Gradio library, which makes it easy to wrap the model in an interactive environment such as Google Colab, where users can enter new text and receive classification predictions from the model. The last cell in the notebook contains the implementation.
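A minimal Gradio sketch of such a cell, reusing the fitted `vectorizer` and `model` from above; `categories` is an assumed list of label names in the model's output order:

```python
import gradio as gr

def classify_news(text):
    """Return per-category probabilities for one piece of Farsi news text."""
    features = vectorizer.transform([text]).toarray()
    probs = model.predict(features)[0]
    return {cat: float(p) for cat, p in zip(categories, probs)}

demo = gr.Interface(
    fn=classify_news,
    inputs=gr.Textbox(lines=5, label="News text"),
    outputs=gr.Label(num_top_classes=3),
    title="Farsi News Classifier",
)
demo.launch()  # in Colab this renders an inline interface
```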

Note that the dataset used for this project is not included in the repository; if you need it, email me at [email protected].
