news-processor

This project processes a dataset of Uruguayan newspapers and trains BERT-like models from scratch on that dataset. It is hosted by the Natural Language Processing Group at the University of the Republic, Uruguay (http://www.fing.edu.uy/inco/grupos/pln/).

Requirements

  • Python

Content

The project is organized into the following directories:

  • parser_data: folder containing all the source code and datasets needed to parse the original news dataset.
  • train_model: folder containing the source code to train the desired model.
  • eval tasks: folder with the evaluation tasks used to assess the trained model.

Manage the dataset

To train the model you need the cleaned dataset in the expected format. There are two ways to obtain it.

A - Download the dataset.

Download the file from this link. The link downloads a file called all_together.txt, which contains the dataset we have already processed with the code in parser_data. Please verify that the file is not corrupted:

  • size: run in bash:
    $ du -h all_together.txt # expected result: 1.2 GB
  • md5: run in bash:
    $ md5sum all_together.txt # expected result: 62e5e1f28bb9634563eeddb9daf8c098
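
Alternatively, here is a minimal Python sketch of the same integrity check, using only the standard library; the filename and expected hash are the ones listed above:

import hashlib

EXPECTED_MD5 = "62e5e1f28bb9634563eeddb9daf8c098"

md5 = hashlib.md5()
with open("all_together.txt", "rb") as f:
    # Hash the file in 1 MiB chunks to avoid loading 1.2 GB into memory.
    for chunk in iter(lambda: f.read(1 << 20), b""):
        md5.update(chunk)

print("OK" if md5.hexdigest() == EXPECTED_MD5 else "MD5 mismatch: re-download the file")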

B - Run the parsers

  1. Move to parser_data and follow the instructions there. This generates cleaned versions of the original Uruguayan news datasets.
  2. Split the news in every dataset into sentences. You can use the sentence-splitter repository for this; the splitter can be run once on all_news.txt (see the sketch after this list).
  3. The expected output is a single file in which news articles are separated by an empty line and each article has its sentences on separate lines.
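
As a reference, the following is a minimal Python sketch of step 2. It assumes the PyPI sentence-splitter package (pip install sentence-splitter) rather than the exact splitter repository linked above, assumes all_news.txt holds one article per blank-line-separated block, and assumes Spanish ("es") as the language; adjust these to match your setup.

from sentence_splitter import SentenceSplitter

splitter = SentenceSplitter(language="es")  # assumed language: Spanish

with open("all_news.txt", encoding="utf-8") as src, \
     open("all_together.txt", "w", encoding="utf-8") as dst:
    # Assumption: articles in all_news.txt are separated by blank lines.
    # Note: this reads the whole input into memory; fine as a sketch.
    for article in src.read().split("\n\n"):
        article = article.strip()
        if not article:
            continue
        # One sentence per line within each article.
        for sentence in splitter.split(article.replace("\n", " ")):
            dst.write(sentence.strip() + "\n")
        # Empty line between articles, as the expected format requires.
        dst.write("\n")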
