In this project, we work with a large (2 GB) dataset in which one column contains the text of news articles and another indicates the category or topic of the news. The goal is to develop a neural network model that classifies news articles into their respective categories. The preprocessing pipeline consists of tokenization and feature extraction, described below.
The first step is to refine the category labels. A review of the labels shows that some are synonymous or belong to the same broader category; these are merged to improve classification accuracy, leaving a smaller set of main categories. Any null values in the dataset are also identified and removed. (The category counts before and after the merge are shown in the notebook output.)
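A minimal sketch of this step, assuming the data lives in a pandas DataFrame with `text` and `category` columns; the category names in the mapping and the file name are hypothetical:

```python
import pandas as pd

# Hypothetical synonym merges; the real mapping depends on the labels
# actually present in the dataset.
merge_map = {
    "WELLNESS": "HEALTH",
    "HEALTHY LIVING": "HEALTH",
    "WORLD NEWS": "POLITICS",
}

df = pd.read_csv("news.csv")  # assumed file name
df["category"] = df["category"].replace(merge_map)
df = df.dropna(subset=["text", "category"])  # remove rows with null text or label
```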
To reduce overfitting and keep the evaluation honest, the data is first balanced so that each category contributes an equal number of samples, and then split into training and testing sets, with 80% of the data used for training and 20% for testing.
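A sketch of the balancing and the 80/20 split, assuming the cleaned DataFrame `df` from the previous step; downsampling every class to the size of the smallest one is one common way to balance, though the notebook may do it differently:

```python
from sklearn.model_selection import train_test_split

# Downsample each category to the size of the smallest one.
min_count = df["category"].value_counts().min()
balanced = (
    df.groupby("category", group_keys=False)
      .apply(lambda g: g.sample(min_count, random_state=42))
)

# 80/20 split, stratified so both sets keep the balanced class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    balanced["text"], balanced["category"],
    test_size=0.2, stratify=balanced["category"], random_state=42,
)
```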
Tokenization is performed to break the text into words or tokens, and the TF-IDF (Term Frequency-Inverse Document Frequency) method is used for feature extraction. This method weighs the importance of a word within a document relative to the whole collection of documents. Parameters such as ngram_range and max_features are tuned to control the tokenization process and the number of features extracted. It is worth noting that fitting the TF-IDF vectorizer only on the training set, after the split, yields somewhat more pessimistic results, which better reflects real-life use, where the model encounters vocabulary it has never seen.
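The extraction step might look like the following; the `ngram_range` and `max_features` values shown here are placeholder settings rather than the tuned ones:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the vectorizer on the training text only, so test-set scores
# reflect how the model handles genuinely unseen vocabulary.
vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),   # unigrams and bigrams; tuned in the notebook
    max_features=20000,   # assumed cap on vocabulary size
)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)  # transform only, no refit
```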
The neural network architecture is designed for this task. Recurrent Neural Networks (RNNs) are commonly used for natural language processing, but the complexity of such networks can lead to overfitting. To mitigate this, the architecture includes dropout layers, which deactivate a fraction of the neurons during training. The final output layer uses a softmax activation function to classify the input into one of the predefined categories. The model is trained with the Adam optimizer and categorical cross-entropy loss.
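A minimal Keras sketch of such a network over the TF-IDF features; the layer sizes and dropout rate are illustrative, not the exact values from the notebook:

```python
from tensorflow.keras import layers, models

num_classes = y_train.nunique()

# Feed-forward classifier over the TF-IDF features, with dropout
# between the dense layers to curb overfitting.
model = models.Sequential([
    layers.Input(shape=(X_train_tfidf.shape[1],)),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),   # deactivates half the units each training step
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(num_classes, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",  # expects one-hot encoded labels
    metrics=["accuracy"],
)
```

Training would feed the TF-IDF matrix (densified with `.toarray()` if the sparse form is not accepted) together with one-hot labels, e.g. produced by `tf.keras.utils.to_categorical`.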
The model is evaluated using the following methods:
Cross-Validation: A common method in machine learning to assess how the model generalizes to an independent dataset.
Confusion Matrix: A tool to visualize the performance of the classification model by showing the correct and incorrect predictions.
K-Folds Cross-Validation: The dataset is split into k parts; each part serves once as the test set while the remaining parts form the training set. This gives a more robust estimate of the model's performance (a sketch follows this list).
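A sketch of the k-fold evaluation with a confusion matrix aggregated over the held-out folds, assuming the balanced DataFrame from earlier and a hypothetical `build_model` helper that returns a freshly compiled copy of the network above:

```python
import tensorflow as tf
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

texts = balanced["text"].to_numpy()
le = LabelEncoder()
y = le.fit_transform(balanced["category"])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
all_true, all_pred = [], []

for train_idx, test_idx in skf.split(texts, y):
    # Refit the vectorizer inside each fold to avoid leaking test vocabulary.
    vec = TfidfVectorizer(ngram_range=(1, 2), max_features=20000)
    X_tr = vec.fit_transform(texts[train_idx]).toarray()
    X_te = vec.transform(texts[test_idx]).toarray()

    model = build_model(X_tr.shape[1])  # hypothetical factory for the network above
    y_tr = tf.keras.utils.to_categorical(y[train_idx], num_classes=len(le.classes_))
    model.fit(X_tr, y_tr, epochs=5, batch_size=64, verbose=0)

    all_true.extend(y[test_idx])
    all_pred.extend(model.predict(X_te, verbose=0).argmax(axis=1))

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(all_true, all_pred))
```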
A user interface is created with the Gradio library, which makes it easy to run the model in an interactive environment such as Google Colab: users can enter new text and receive the model's classification prediction. The last cell of the notebook contains the implementation.
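A minimal Gradio sketch, assuming the fitted `vectorizer`, the trained `model`, and the label encoder `le` from the earlier steps are in scope:

```python
import gradio as gr

def classify(text):
    # Vectorize the input the same way the training data was vectorized.
    features = vectorizer.transform([text]).toarray()
    probs = model.predict(features, verbose=0)[0]
    return le.classes_[probs.argmax()]

demo = gr.Interface(fn=classify, inputs="text", outputs="label")
demo.launch()  # in Colab this renders the UI inline
```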