CopyCat is an open-source resource that makes deduplication in TREC-style experimental setups more attainable. It provides (1) a compilation of near-duplicate documents within the ClueWeb09, the ClueWeb12, and two Common Crawl snapshots, and (2) a software library enabling the deduplication of arbitrary document sets.
- Getting Started
- Inclusion and Exclusion lists of near-duplicates for the ClueWebs and two Common Crawl snapshots
- Deduplication of Run and Qrel files
- Deduplication of Large Webcrawls with the CopyCat Spark Pipeline
- Documentation on Document-Preprocessing (Used in all parts of the software)
- Showcases
- Tutorial: ClueWeb09 (Category B) Index without Near-Duplicates
- Tutorial: ClueWeb12 (Category B13) Index without Near-Duplicates
- Development Environment
The CopyCat software comes with a CLI for the deduplication of run and qrel files and a Spark library for the deduplication of complete web crawls. A Docker image with support for Jupyter notebooks (plus common data science libraries) is also provided.
To start a bash shell with CopyCat installed and your local directory mounted, run:
docker run --rm -ti -v ${PWD}:/home/jovyan webis/chatnoir-copycat:1.0-jupyter bash
In this shell, CopyCat is installed. E.g., run `copy-cat --help` to show the help, which prints:
usage: CopyCat: Deduplication of run files and qrels.
[-h] --input INPUT --output OUTPUT
[--similarities {url,s3,cosine(3+5-grams),cosine(8-grams),cosine(1-grams),simhash(1-grams),simhash(3+5-grams),md5,text-profile} [{url,s3,cosine(3+5-grams),cosine(8-grams),cosine(1-grams),simhash(1-grams),simhash(3+5-grams),md5,text-profile} ...]]
--documents {ChatNoirMapfiles,AnseriniIndex} [--anseriniIndex ANSERINIINDEX] [--retrieveDocId RETRIEVEDOCID] [--ranks RANKS]
[--s3Threshold S3THRESHOLD] [--threads THREADS] [--runFile {true,false}] [--keepStopwords {true,false}]
[--contentExtraction {Anserini,Boilerpipe,Jericho,No}] [--stemmer {porter,krovetz,null}] [--stopwords STOPWORDS]
named arguments:
-h, --help show this help message and exit
--input INPUT The run file or qrel file that should be deduplicated.
--output OUTPUT The result of the deduplication in jsonl format.
--similarities {url,s3,cosine(3+5-grams),cosine(8-grams),cosine(1-grams),simhash(1-grams),simhash(3+5-grams),md5,text-profile} [{url,s3,cosine(3+5-grams),cosine(8-grams),cosine(1-grams),simhash(1-grams),simhash(3+5-grams),md5,text-profile} ...]
Calculate all passed similarities.
--documents {ChatNoirMapfiles,AnseriniIndex}
Use the passed DocumentResolver to load the documents. E.g. AnseriniIndex loads documents by accessing a local
anserini-index.
--anseriniIndex ANSERINIINDEX
When using AnseriniIndex as resolver for documents, we use the specified index.
--retrieveDocId RETRIEVEDOCID
Retrieve a single document from and print it to the console. This is useful to check the preprocessing on a few
example documents.
--ranks RANKS Include documents up to the specified rank in the deduplication.
--s3Threshold S3THRESHOLD
Report only near-duplicate pairs with s3 scores on word 8-grams above the specified threshold.
--threads THREADS
--runFile {true,false}
Is the specified a run file (pass true), or a qrels file (pass false)
--keepStopwords {true,false}
Switch: keep stopwords or remove them.
--contentExtraction {Anserini,Boilerpipe,Jericho,No}
The name of the content extraction. (Use 'Anserini' for Anserini's default HTML to plain text transformation, or
'No' in case documents are already transformed (e.g., because they come from an anserini index)
--stemmer {porter,krovetz,null}
The name of the stemmer (passed to Lucene with Anserini).
--stopwords STOPWORDS The list of stopwords is read from this file. When keepStopwords is false, and stopwords = null, then Anserinis
default is used.
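For example, a hypothetical invocation (file names and the index path are placeholders) that deduplicates a run file against a local Anserini index, using only flags documented in the help output above, could look like this:

```
copy-cat --input my-run.txt --output my-run-deduplicated.jsonl \
  --documents AnseriniIndex --anseriniIndex ./my-anserini-index \
  --similarities url s3 --runFile true --threads 4
```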
To start a jupyter notebook with CopyCat installed and your local directory mounted, run:
docker run --rm -ti -v ${PWD}:/home/jovyan -p 8888:8888 webis/chatnoir-copycat:1.0-jupyter
Now you can point your browser to localhost:8888 to access the notebook.
CopyCat provides a Docker image to support various deduplication experiments on standard IR run and qrel files for various test collections. The following example shows how to use the Docker image to deduplicate run/qrel files:
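A run file in the current (mounted) directory can be deduplicated non-interactively in one shot; the file names and index path below are placeholders, and the flags are those documented in the help output above:

```
docker run --rm -ti -v ${PWD}:/home/jovyan webis/chatnoir-copycat:1.0-jupyter \
  copy-cat --input my-run.txt --output my-run-deduplicated.jsonl \
           --documents AnseriniIndex --anseriniIndex ./my-anserini-index \
           --similarities url s3 --runFile true
```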
This repository contains the two showcases from the paper. As a general introduction, see the Getting Started section.
- Deduplication of run/qrel files with the copycat-cli (copycat-on-clueweb.ipynb, open in nbviewer)
- Transfer of relevance labels
Please install the following requirements:
- Java 8 (our Hadoop cluster runs Hadoop 2.8, hence we compile for compatibility with Java 8)
- Maven
- Project Lombok, added to your IDE (used to remove a bit of boilerplate code)
- ApprovalTests (https://approvaltests.com/), especially the diff tools; used in unit tests
- Docker
After installing the requirements, you can compile CopyCat and run all unit tests with:
make install
To detect near-duplicates in large web crawls, CopyCat runs a pipeline of four steps: (1) calculation of a SimHash fingerprint for each document, (2) selection of one representative document if multiple documents have identical fingerprints, (3) partitioning of the remaining fingerprints, and (4) calculation of the Hamming distances between all fingerprints within a partition.
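To illustrate steps (1) and (4), here is a minimal, self-contained SimHash sketch (hypothetical code for illustration only; CopyCat's actual implementation, preprocessing, and feature sets live in the library): each term votes on the 64 bit positions of its hash, and two documents are near-duplicate candidates if their fingerprints differ in at most k bits.

```java
// Minimal SimHash sketch (hypothetical, for illustration only).
public class SimHashSketch {

    // Build a 64-bit SimHash fingerprint from a document's terms.
    static long fingerprint(String[] terms) {
        int[] v = new int[64];
        for (String term : terms) {
            long h = hash64(term);
            for (int bit = 0; bit < 64; bit++) {
                // Each term votes +1/-1 on every bit position.
                v[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fp = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (v[bit] > 0) fp |= 1L << bit;
        }
        return fp;
    }

    // Simple 64-bit string hash (FNV-1a); any good hash works here.
    static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Near-duplicate candidates differ in at most k fingerprint bits.
    static boolean nearDuplicate(long fp1, long fp2, int k) {
        return Long.bitCount(fp1 ^ fp2) <= k;
    }

    public static void main(String[] args) {
        long a = fingerprint("the quick brown fox jumps".split(" "));
        long b = fingerprint("the quick brown fox jumped".split(" "));
        System.out.println(nearDuplicate(a, b, 3)); // prints true
    }
}
```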
- Create Document Representations
  - Run `./copycat-spark/src/main/bash/new-document-representation-spark.sh` to produce document representations.
  - The Makefile contains specific examples. E.g., execute `make common-crawl-small-sample-document-representations` to create document representations for a small sample of the Common Crawl to test your environment.
  - The following commands produce the document representations used in the paper:
    - `make clueweb09-document-representations`
    - `make clueweb12-document-representations`
    - `make common-crawl15-document-representations`
    - `make common-crawl17-document-representations`
- Create SimHash Deduplication Jobs
  - This covers the selection of one representative document (if multiple documents have identical fingerprints) and the partitioning of the remaining fingerprints into deduplication jobs.
  - After the creation of the document representations, run `./copycat-spark/src/main/bash/create-deduplication-candidates.sh` to produce: (1) a list of removed documents (e.g., documents that are too short with respect to your configuration), (2) exact duplicates, and (3) near-duplicate tasks.
  - The near-duplicate tasks are the blocks that will be fully all-pairs deduplicated in step 3 of the pipeline (Run Deduplication Jobs); a sketch of this partitioning follows after this list.
  - The following command produces the deduplication jobs used in the paper:
    - `make create-deduplication-candidates-cw09`
- Run Deduplication Jobs
  - After the creation of the deduplication jobs, run `./copycat-spark/src/main/bash/deduplicate.sh` to produce the pairs of near-duplicates.
  - The following commands produce the near-duplicate pairs used in the paper:
    - `make deduplicate-cw09`
    - `make deduplicate-cw09-cw12-cc15`
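A common blocking scheme for SimHash partitioning (sketched below as an assumption; CopyCat's actual partitioning may differ) splits each 64-bit fingerprint into k+1 blocks: by the pigeonhole principle, two fingerprints within Hamming distance k must agree exactly on at least one block, so only documents sharing a block key need the all-pairs comparison.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of pigeonhole blocking for SimHash:
// fingerprints within Hamming distance k agree on >= 1 of k+1 blocks.
public class FingerprintBlocking {

    // Split a 64-bit fingerprint into k+1 keyed blocks. Grouping
    // fingerprints by equal block keys yields the deduplication jobs
    // whose members are then compared all-pairs.
    static List<String> blockKeys(long fingerprint, int k) {
        int numBlocks = k + 1;
        int bitsPerBlock = 64 / numBlocks; // simplification: assume 64 % (k+1) == 0
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < numBlocks; i++) {
            long mask = (bitsPerBlock == 64) ? -1L : ((1L << bitsPerBlock) - 1);
            long block = (fingerprint >>> (i * bitsPerBlock)) & mask;
            // Prefix with the block index so block 0 and block 1 never collide.
            keys.add(i + ":" + Long.toHexString(block));
        }
        return keys;
    }

    public static void main(String[] args) {
        long fp = 0x0123456789abcdefL;
        // With k = 3, the fingerprint is split into 4 blocks of 16 bits.
        blockKeys(fp, 3).forEach(System.out::println);
    }
}
```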