This program is still very much a work in progress that is being used and developed at the same time. As such, the code is still of rather poor quality, with many untranslated comments and a lot of 'hacky' solutions to the problems that have come up. Normally I would not open this repository to the public yet, but after I posted about the project on reddit I received multiple requests to share the source code. I have decided to do so, but please keep in mind that the code here is in no way representative of the final version that I plan to release sometime this fall. (The code that is here will also require a lot of rewriting to use it for other websites or purposes.)
I have to write two master's theses, and personal circumstances have caused me to fall behind on my studies and have left me with very little time to spare. I need to start the writing process as soon as possible, yet I also have to do my research. This entails researching the colonial archives of the Dutch parliament from 1815 until 1955, which contain thousands of documents. The website of this archive is a bit of a mess: the search function does not work very well and the documents are served as scans saved as PDFs. A text option is available, but it is the result of a computer scanning the PDFs and is of rather poor quality, full of spelling errors.
Researching this archive would take me months. It's like finding a needle in a very large and very flawed haystack.
Enter THESIS_LIFEBOAT. This program scrapes the archive website, downloads thousands of documents (taking care not to overload the servers by taking ample pauses between downloads), scans the text within these documents (doing a much better job of this than whatever method the archive itself used) and stores the text, title, page number and other information in a database.
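The download loop boils down to fetching each file and then sleeping before the next request. Here is a minimal sketch of that idea; the URLs, folder name and pause length are placeholders, not the project's actual values:

```python
import time
from pathlib import Path

import requests

# Hypothetical list of document URLs collected from the archive's search pages.
DOCUMENT_URLS = [
    "https://example-archive.nl/document/0001.pdf",
    "https://example-archive.nl/document/0002.pdf",
]

DOWNLOAD_DIR = Path("downloads")
DOWNLOAD_DIR.mkdir(exist_ok=True)

PAUSE_SECONDS = 30  # ample pause between downloads to avoid overloading the server

for url in DOCUMENT_URLS:
    target = DOWNLOAD_DIR / url.rsplit("/", 1)[-1]
    if target.exists():
        continue  # skip documents we already have
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    target.write_bytes(response.content)
    time.sleep(PAUSE_SECONDS)
```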
The downloaded documents are then analysed for occurrences of certain keywords from my thesis. If a keyword is found, the text surrounding it is stored in the database as a 'mini-summary' and a 'number of occurrences' counter is updated. The program also scans the titles of the documents to find the publication dates. Future versions will include graphs to measure word occurrence over time, google-ngram-style, and a Django-based GUI to view the documents.
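Inferring a publication date from a title can be as simple as looking for a plausible year in the 1815-1955 range the archive covers. This is a rough sketch of that assumption, not the project's actual parsing code:

```python
import re
from typing import Optional


def publication_year_from_title(title: str) -> Optional[int]:
    """Guess the publication year from a document title.

    Assumes the title contains a four-digit year somewhere in the
    1815-1955 range covered by the archive; returns None otherwise.
    """
    for candidate in re.findall(r"\b(1[89]\d{2})\b", title):
        year = int(candidate)
        if 1815 <= year <= 1955:
            return year
    return None


print(publication_year_from_title("Koloniaal verslag 1923, bijlage C"))  # 1923
```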
This allows me not only to find the documents that are most relevant to my thesis, but also to find relevant portions of otherwise irrelevant documents. The program works: I have it running on two separate computers, 24 hours a day, while I sleep, and I have already analysed over 2100 documents with it (in just two days!).
The program downloads the files
Then analyzes the files
And adds them to a database
It saves not just the title, but also the number of pages, the search terms used, the number of times these terms were found, the publication date (inferred from the title) and the full document text. This way you can quickly find the most relevant documents.
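A database row per document might look roughly like the following. The table and column names here are illustrative guesses based on the description above, not the project's real schema:

```python
import sqlite3

conn = sqlite3.connect("thesis_lifeboat.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS documents (
        id INTEGER PRIMARY KEY,
        title TEXT,
        page_count INTEGER,
        search_terms TEXT,
        term_hits INTEGER,
        publication_year INTEGER,
        full_text TEXT
    )
    """
)
conn.execute(
    "INSERT INTO documents "
    "(title, page_count, search_terms, term_hits, publication_year, full_text) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("Koloniaal verslag 1923", 412, "onderwijs", 17, 1923, "..."),
)
conn.commit()
```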
The program extracts the text surrounding each occurrence of every search term (capitalizing the terms themselves for easier readability), allowing you to quickly find the relevant portions of each document or decide whether the document is relevant to your studies at all. The full document text is available as well.
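Conceptually, the 'mini-summary' step is a windowed search around each keyword hit, with the keyword upper-cased in the output. A small sketch of that idea, with a made-up example sentence:

```python
import re


def keyword_snippets(text: str, keyword: str, window: int = 150) -> list[str]:
    """Return the text surrounding each occurrence of a keyword.

    The keyword itself is upper-cased in the snippet for readability.
    """
    snippets = []
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        start = max(match.start() - window, 0)
        end = min(match.end() + window, len(text))
        snippet = text[start:match.start()] + keyword.upper() + text[match.end():end]
        snippets.append(snippet.replace("\n", " ").strip())
    return snippets


example = "De regering besteedde weinig aandacht aan het onderwijs in de kolonie."
print(keyword_snippets(example, "onderwijs", window=20))
```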
Yes! The first prototype has just been uploaded; here are some screenshots.
The GUI is essentially a table, ranked and sorted by the 'normalized number of mentions' variable. From here the user can click on any article in the database to open its 'document overview' page. On that page the user can view the article information, the relevant portions, the entire document text and a link to the file itself. The formatting of the actual text still needs some work, though, to mitigate the 'wall of text' syndrome that is currently going on.
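In Django terms, the table and overview pages come down to one queryset ordered by the ranking field and one detail lookup. The model and field names below are hypothetical, purely to illustrate the shape of the views:

```python
# views.py -- model and field names are illustrative, not the project's actual schema
from django.shortcuts import get_object_or_404, render

from .models import Document


def document_table(request):
    # Table ranked by the normalized-mentions score, highest first
    documents = Document.objects.order_by("-normalized_mentions")
    return render(request, "documents/table.html", {"documents": documents})


def document_overview(request, pk):
    # 'Document overview' page: metadata, relevant portions, full text, file link
    document = get_object_or_404(Document, pk=pk)
    return render(request, "documents/overview.html", {"document": document})
```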
Loading the database is currently anything but a seamless experience, requiring the use of 'row_id_generator.py' before loading the file with the 'python3 manage.py inspectdb > models.py' command. So it's still a prototype, but it's already usable.
Yes. I have it running as we speak and have worked through thousands of documents already.
Yes. The downloading and fetching process is based on the website of the archive, so you should replace those parts with your own code. The analyzer and summarizer should work on any PDF that you throw at them. I have created a separate script that contains just the analyzer.
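Running an analysis over an arbitrary PDF of your own can look roughly like this. The sketch uses pdfminer.six for text extraction and a hard-coded keyword list purely as an example; the project's own analyzer may use a different extraction method (for instance OCR) and reads its keywords from elsewhere:

```python
from pdfminer.high_level import extract_text  # pip install pdfminer.six

KEYWORDS = ["onderwijs", "arbeid"]  # replace with your own thesis keywords


def analyse_pdf(path: str) -> dict[str, int]:
    """Count how often each keyword occurs in a single PDF."""
    text = extract_text(path).lower()
    return {keyword: text.count(keyword.lower()) for keyword in KEYWORDS}


print(analyse_pdf("some_document.pdf"))
```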