This repository contains a final project realized for the Natural Language Processing course of the Master's degree in Artificial Intelligence, University of Bologna.
This project aims to produce good abstractive summaries of podcast transcripts from the Spotify Podcast Dataset. This task was originally proposed in the context of the TREC Podcast Track 2020, where the objective was to provide a short text summary that a user might read when deciding whether to listen to a podcast. The summary should accurately convey the content of the podcast, be human-readable, and be short enough to be quickly read on a smartphone screen.
The Spotify Podcast Dataset is the first large-scale set of podcasts with transcripts to be released publicly, comprising over 100,000 episodes, each with raw audio, a transcript, and metadata. The transcripts were produced with Google Cloud Platform's Speech-to-Text API.
While no ground truth summaries are provided in the dataset, the episode descriptions written by the podcast creators serve as proxies for summaries, and are used for training supervised models.
More information on how to access the dataset can be found in the podcasts-no-audio-13GB folder.
In our solution, an extractive module selects salient chunks from the transcript, which then serve as input to an abstractive summarizer. The latter uses a BART model, which employs an encoder-decoder architecture. Extensive pre-processing is performed on the creator-provided descriptions, selecting a subset of the corpus suitable for supervised training. The figure below summarizes the steps involved in our method. For a better understanding of our proposed solution, take a look at the notebook and the report.
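To give an intuition of what an extractive selection step can look like, the sketch below scores sentences by the document-level frequency of their words and keeps the top-scoring ones in their original order. This is a minimal illustration under our own assumptions (the function name and the frequency heuristic are ours), not the exact module implemented in this project.

```python
import re
from collections import Counter


def select_salient_chunks(transcript: str, max_sentences: int = 5) -> str:
    """Pick the most salient sentences of a transcript (toy heuristic).

    Sentences are scored by the average document-level frequency of their
    words, so sentences about the transcript's dominant topics rank higher.
    Selected sentences are re-emitted in their original order, ready to be
    fed to an abstractive summarizer such as BART.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", transcript) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", transcript.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    # Rank sentences by score, keep the best ones, restore original order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)
```

A real extractive module would typically use stronger signals (e.g. sentence embeddings or position features), but the overall shape — score, rank, select, preserve order — stays the same.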
The bart-large-cnn model has been fine-tuned for 3 epochs on the filtered transcripts. The final model, which we call bart-large-finetuned-filtered-spotify-podcast-summ, has been uploaded to the Hugging Face Hub 🤗. It can be used for summarization as follows:
```python
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="gmurro/bart-large-finetuned-filtered-spotify-podcast-summ",
    tokenizer="gmurro/bart-large-finetuned-filtered-spotify-podcast-summ",
)

summary = summarizer(podcast_transcript, min_length=39, max_length=250)
print(summary[0]["summary_text"])
```
Alternatively, you can run the summarization script, passing a transcript file as an argument:

```shell
python compute_summary.py transcript_example.txt
```
BERTScore has been chosen as the semantic metric to evaluate results on the test set. As shown in the table below, our model outperforms the bart-large-cnn baseline:
| Model | Precision | Recall | F1 Score |
|---|---|---|---|
| bart-large-cnn | 0.8103 | 0.7941 | 0.8018 |
| bart-large-finetuned | 0.8401 | 0.8093 | 0.8240 |
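For intuition, BERTScore greedily matches each token embedding in the candidate summary to its most similar token in the reference (and vice versa) and averages the cosine similarities. The sketch below reproduces that computation with placeholder vectors in place of real BERT embeddings; it is illustrative only, not the evaluation code used for the table above.

```python
import numpy as np


def bertscore(cand: np.ndarray, ref: np.ndarray) -> tuple[float, float, float]:
    """Toy BERTScore from token embeddings: returns (precision, recall, F1).

    cand: (n_cand, dim) candidate token embeddings
    ref:  (n_ref, dim) reference token embeddings
    """
    # L2-normalize rows so dot products become cosine similarities
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    sim = cand @ ref.T                      # (n_cand, n_ref) cosine matrix
    precision = sim.max(axis=1).mean()      # best reference match per candidate token
    recall = sim.max(axis=0).mean()         # best candidate match per reference token
    f1 = 2 * precision * recall / (precision + recall)
    return float(precision), float(recall), float(f1)
```

In practice the embeddings come from a pretrained BERT-family model (e.g. via the bert-score package), and raw scores are often rescaled against a baseline.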
This is an example of the prediction made by the fine-tuned model:
CREATOR-PROVIDED DESCRIPTION:
In this episode, I talk about how we have to give up perfection in order to grow in our relationship with God.
It's not about perfection, it's about growing as we walk on the path to perfection.
GENERATED SUMMARY:
In this episode I talk about the idea of Perfection and how it has the ability to steal all of our joy in this life — if we let it.
I go into detail about a revelation I had after walking away from my coaching career and how badly I need Jesus.
- Transformers 4.19.4
- TensorFlow 2.9.1
- Datasets 2.3.1
- Tokenizers 0.12.1
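Assuming a standard pip environment, the pinned versions above can be installed with (the command is our suggestion, not taken from the repo):

```shell
pip install transformers==4.19.4 tensorflow==2.9.1 datasets==2.3.1 tokenizers==0.12.1
```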
We use Git for versioning.
| Reg No. | Name | Surname | Email | Username |
|---|---|---|---|---|
| 1005271 | Giuseppe | Boezio | [email protected] | giuseppeboezio |
| 983806 | Simone | Montali | [email protected] | montali |
| 997317 | Giuseppe | Murro | [email protected] | gmurro |
This project is licensed under the MIT License - see the LICENSE file for details.