GitHub - commoncrawl/ml-opt-out-experiments: A series of experiments into ML opt–out protocols

What is this?

PySpark jobs for investigating the prevalence of ML opt–out protocols, written by Alex Xue as part of the blog post "A Further Look Into the Prevalence of Various ML Opt–Out Protocols".

How Do I Run It?

Requires sparkcc.py from commoncrawl/cc-pyspark.

Setup is the same as for cc-pyspark. Make sure you have an ./input directory containing the input listing that the commands below refer to (./input/test_warc.txt).
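Before submitting a job, it can help to confirm that the input listing is in place. The following is a minimal sketch, not part of this repository: the helper name `read_input_listing` is hypothetical, and it assumes the listing is a plain text file with one input path per line, as in the cc-pyspark setup.

```python
from pathlib import Path


def read_input_listing(listing_path):
    """Return the non-empty lines of an input listing file.

    Hypothetical helper. Assumes the listing holds one input path
    per line; blank lines are ignored.
    """
    path = Path(listing_path)
    if not path.parent.is_dir():
        raise FileNotFoundError(f"input directory not found: {path.parent}")
    if not path.is_file():
        raise FileNotFoundError(f"input listing not found: {path}")
    # Strip whitespace and drop empty lines.
    entries = [line.strip() for line in path.read_text().splitlines() if line.strip()]
    if not entries:
        raise ValueError(f"input listing is empty: {path}")
    return entries
```

Calling `read_input_listing("./input/test_warc.txt")` from the repository root would then either return the listed paths or fail early with a clear message, before Spark is started.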

To run the jobs:

$SPARK_HOME/bin/spark-submit job_name.py \
    --num_output_partitions 1 --log_level WARN \
    ./input/test_warc.txt output_file_name

To run html_metatag_count.py, which has a different output schema, also pass --tuple_key_schema True:

$SPARK_HOME/bin/spark-submit ./html_metatag_count.py \
    --num_output_partitions 1 --log_level WARN --tuple_key_schema True \
    ./input/test_warc.txt output_file_name
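The two invocations above differ only in the job script and the extra `--tuple_key_schema True` flag. As a rough sketch of how they could be generated for every job in the repository, the snippet below builds the argument lists in Python. The function `build_submit_command`, the `JOBS` table, and the output names are illustrative assumptions, not part of the repository. The flags are copied from the commands above.

```python
import os
import shlex


def build_submit_command(job_script, input_listing, output_name, extra_args=()):
    """Return the spark-submit argument list for one job (hypothetical helper)."""
    spark_home = os.environ.get("SPARK_HOME", "")
    # Fall back to a spark-submit found on PATH if SPARK_HOME is not set.
    spark_submit = (
        os.path.join(spark_home, "bin", "spark-submit") if spark_home else "spark-submit"
    )
    return [
        spark_submit,
        job_script,
        "--num_output_partitions", "1",
        "--log_level", "WARN",
        *extra_args,
        input_listing,
        output_name,
    ]


# Extra flags per job: only html_metatag_count.py needs the tuple key schema.
JOBS = {
    "tdm_header_count.py": (),
    "user_agent_count.py": (),
    "html_metatag_count.py": ("--tuple_key_schema", "True"),
}

if __name__ == "__main__":
    for script, extra in JOBS.items():
        # Output names are placeholders derived from the script name.
        output_name = "output_" + script[: -len(".py")]
        cmd = build_submit_command(script, "./input/test_warc.txt", output_name, extra)
        print(shlex.join(cmd))
```

The script only prints the commands. To run one, pass the list to `subprocess.run(cmd, check=True)`.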

Files in this repository:

README.md
html_metatag_count.py
tdm_header_count.py
user_agent_count.py
