Number of text corpuses #125
Comments
Hi @Glorifier85, thank you for the positive feedback, we are very happy that the application is useful for people. Topic modeling is a technique that works well with a large number of documents. I think it makes no theoretical or practical sense to topic model fewer than 10 documents (but 10 is actually more or less arbitrarily chosen). Please refer, e.g., to Tang et al.
The length of the documents also plays an important role. Maybe you should consider segmenting the documents of your small corpus – topic modeling works quite well even with tweets (i.e. 280 characters; see, e.g., Ordun et al.).
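For illustration, here is a minimal sketch of what paragraph-level segmentation could look like in Python. The folder name, file layout, and function are hypothetical and not part of the application; it simply assumes plain-text files with blank lines between paragraphs:

```python
# Minimal sketch: split each long document into paragraph-level
# "documents" so the topic model sees more, shorter texts.
# Assumes plain-text files with blank lines between paragraphs;
# the "corpus" folder name is only an example.
from pathlib import Path

def split_into_paragraphs(text):
    """Return non-empty paragraphs, using blank lines as separators."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

segments = {}
for path in Path("corpus").glob("*.txt"):
    paragraphs = split_into_paragraphs(path.read_text(encoding="utf-8"))
    for i, paragraph in enumerate(paragraphs):
        segments[f"{path.stem}_{i:04d}"] = paragraph
```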
Hi @severinsimmler, thanks for your response, much appreciated! Speaking of document length, is there an optimal length in terms of word count? Törnberg (2016), see link below, for example mentions splitting documents into chunks of 1,000 words. Is this something you can confirm? Many thanks!
Most natural language processing algorithms are designed to extract information from an extensive data set. In general, one could say the more the better. Always. But I think this is also a question of methodology. If I only have two documents, why do I need a quantitative method? I could evaluate the texts with qualitative methods (e.g. close reading) and probably gain more valuable insights.
Yes, 1000 words per document is a good starting point. I don't know how your texts are structured, but you could also segment by paragraph or chapter.
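As a rough illustration, a minimal sketch of such a fixed-size segmentation in Python; the chunk size of 1,000 words is just the starting point mentioned above, and the file name is a placeholder:

```python
# Minimal sketch: split a text into consecutive chunks of roughly
# 1,000 words each. The chunk size is a tunable parameter, not a rule.
def chunk_by_words(text, words_per_chunk=1000):
    """Split text into consecutive chunks of at most words_per_chunk tokens."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]

# "document.txt" is a placeholder for one of your plain-text documents.
with open("document.txt", encoding="utf-8") as f:
    chunks = chunk_by_words(f.read())
```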
Thanks again! To your knowledge, is there a maximum number of words per document that should not be exceeded, like a hard cap? I am planning to model social media comments from news outlets over a certain period of time (3-5 years), so I might end up with >500k words per document. Thanks!
Hi there,
first off, great application! Intuitive and easy to use - exactly what I needed. The question I have is: is there a reason why the minimum number of texts to be chosen is ten? I am sure there is, but can we change that somehow? What if I just wanted to tokenize and compare two corpuses?
Thanks!
Glorifier