Number of text corpuses #125
Comments
Hi @Glorifier85, thank you for the positive feedback, we are very happy that the application is useful for people. Topic modeling is a technique that works well with a large number of documents. I think it makes no theoretical or practical sense to topic model fewer than 10 documents (but 10 is actually more or less arbitrarily chosen). Please refer, e.g., to Tang et al.
The length of the documents also plays an important role. Maybe you should consider segmenting the documents of your small corpus – topic modeling works quite well even with tweets (i.e. 280 characters; see, e.g., Ordun et al.).
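For illustration, here is a minimal sketch of what paragraph-level segmentation could look like in Python. The folder name, file layout, and function are hypothetical and not part of the application; it simply assumes plain-text files with blank lines between paragraphs:

```python
# Minimal sketch: split each long document into paragraph-level
# "documents" so the topic model sees more, shorter texts.
# Assumes plain-text files with blank lines between paragraphs;
# the "corpus" folder name is only an example.
from pathlib import Path

def split_into_paragraphs(text):
    """Return non-empty paragraphs, using blank lines as separators."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

segments = {}
for path in Path("corpus").glob("*.txt"):
    paragraphs = split_into_paragraphs(path.read_text(encoding="utf-8"))
    for i, paragraph in enumerate(paragraphs):
        segments[f"{path.stem}_{i:04d}"] = paragraph
```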
Hi @severinsimmler, thanks for your response, much appreciated! Speaking of document length, is there an optimal length in terms of word count? Törnberg (2016), see link below, for example mentions splitting documents into chunks of 1,000 words. Is this something you can confirm? Many thanks!
Most natural language processing algorithms are designed to extract information from an extensive data set. In general, one could say the more the better. Always. But I think this is also a question of methodology. If I only have two documents, why do I need a quantitative method? I could evaluate the texts with qualitative methods (e.g. close reading) and probably gain more valuable insights.
Yes, 1000 words per document is a good starting point. I don't know how your texts are structured, but you could also segment by paragraph or chapter.
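As a rough illustration, a minimal sketch of such a fixed-size segmentation in Python; the chunk size of 1,000 words is just the starting point mentioned above, and the file name is a placeholder:

```python
# Minimal sketch: split a text into consecutive chunks of roughly
# 1,000 words each. The chunk size is a tunable parameter, not a rule.
def chunk_by_words(text, words_per_chunk=1000):
    """Split text into consecutive chunks of at most words_per_chunk tokens."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]

# "document.txt" is a placeholder for one of your plain-text documents.
with open("document.txt", encoding="utf-8") as f:
    chunks = chunk_by_words(f.read())
```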
Thanks again! To your knowledge, is there a maximum number of words per document that should not be exceeded, like a hard cap? I am planning to model social media comments from news outlets over a certain period of time (3-5 years), so I might end up with >500k words per document. Thanks!
Hi there,
first off, great application! Intuitive and easy to use - exactly what I needed. The question I have is: is there a reason why the minimum number of texts to be chosen is ten? I am sure there is, but can we change that somehow? What if I just wanted to tokenize and compare two corpuses?
Thanks!
Glorifier