Internet Archive's crawl of tweet stream: compare against "official" crawl #6
Here is the list of tweets from my crawl last year. I used Python tweepy. Because of GitHub's file size limit, I split the original list into 4 files. There are in total 39,623,506 unique tweet ids, sorted.
FWIW: Downloading 43,956,390 tweets "officially" from the Twitter API will take just under 20 days (if you use the statuses/lookup endpoint with both app and user tokens). If you have an app with 3 or 4 authenticated users (let's say your co-authors), you can use those extra tokens to spread out the calls and finish in about a week.
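The timing estimate above can be reproduced with a little arithmetic. This is a minimal sketch, assuming the 2015-era v1.1 limits for statuses/lookup: 100 IDs per request, and per 15-minute window roughly 180 requests with a user token plus 60 with an app token (these per-window figures are my assumption; check the current API docs before relying on them).

```python
# Sketch: estimate wall-clock time to hydrate a large tweet-ID list via
# statuses/lookup. Rate limits below are assumptions (2015-era v1.1):
# 100 IDs per request; ~180 requests/15 min per user token, ~60 per app token.
import math

IDS_PER_REQUEST = 100
WINDOW_MINUTES = 15

def hydration_days(n_ids, requests_per_window):
    """Days needed to hydrate n_ids at the given per-window request budget."""
    requests = math.ceil(n_ids / IDS_PER_REQUEST)
    windows = math.ceil(requests / requests_per_window)
    return windows * WINDOW_MINUTES / (60 * 24)

# One app token plus one user token (60 + 180 requests per window):
print(round(hydration_days(43_956_390, 60 + 180), 1))   # ~19 days
# App token plus four user tokens (60 + 4 * 180):
print(round(hydration_days(43_956_390, 60 + 4 * 180), 1))  # ~6 days
```

Under these assumed limits, one app + one user token works out to just under 20 days, and adding a few co-author tokens brings it down to about a week, matching the comment above.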
twarc might be useful if you want to download the tweets using the official Twitter API; fetching tweets by id is called "hydrating".
I have compared Jimmy's id list with the tweets of the Internet Archive. Jimmy's list seems to contain tweets that are not within the evaluation period. For example, the first tweet id in the list, "622918845364219905", was published at "Sun Jul 19 00:00:01 +0000 2015". Therefore, I looked at the tweets from the Internet Archive and used the tweet ids of the first and last tweets posted within the evaluation period as boundaries to filter Jimmy's list. After filtering, 40,260,362 tweets remain in Jimmy's list. The difference between this new list and the tweet archive is not significant (about 0.2%). It is worth noting that a small number of relevant tweets (132 out of 6,187) are not in the tweet archive. As suggested by Jimmy, it seems that you can use the Internet Archive tweets as training data.
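The boundary filtering described above works because tweet ids (snowflake ids) increase with time, so a sorted id list can be trimmed to the evaluation window with a binary search on the first and last in-window ids. A minimal sketch (the id values here are made up for illustration, not real tweet ids):

```python
# Sketch: trim a sorted tweet-id list to [first_id, last_id] using binary
# search. Relies on snowflake ids being chronologically ordered.
import bisect

def filter_by_boundaries(sorted_ids, first_id, last_id):
    """Return the slice of sorted_ids within [first_id, last_id], inclusive."""
    lo = bisect.bisect_left(sorted_ids, first_id)
    hi = bisect.bisect_right(sorted_ids, last_id)
    return sorted_ids[lo:hi]

ids = [100, 205, 310, 420, 530, 640]  # stand-ins for sorted tweet ids
print(filter_by_boundaries(ids, 205, 530))  # [205, 310, 420, 530]
```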
I'd like to draw everyone's attention to this: it means that people who did not participate in the TREC Microblog track last year can still get the tweet data (e.g., for training).
The Internet Archive appears to have a crawl of tweets around the 2015 evaluation period:
https://archive.org/details/archiveteam-twitter-stream-2015-07
The list of tweetids from last year's crawl is here:
https://cs.uwaterloo.ca/~jimmylin/TREC2015-tweetids.txt.bz2
This is the "official" crawl in the sense that it was the one used for constructing the pools, etc.
Note that the file is 200 MB; it contains 43,956,390 tweetids, sorted. This is the union of two separate crawls using the tools in twittertools (on top of twitter4j).
Can someone compare the Internet Archive crawl with the official tweetids? If overlap is good, then we have a way for getting training data to people who didn't participate last year... :)