CauseNet Source Code for Analysis & Extraction

This source code forms the basis for our CIKM 2020 paper CauseNet: Towards a Causality Graph Extracted from the Web. The code is divided into two components: one for analyzing the graph and one for extracting the graph from the web. The final graph can be downloaded from causenet.org. If you use the code, please cite it as follows:

@inproceedings{heindorf2020causenet,
  author    = {Stefan Heindorf and
               Yan Scholten and
               Henning Wachsmuth and
               Axel-Cyrille Ngonga Ngomo and
               Martin Potthast},
  title     = {CauseNet: Towards a Causality Graph Extracted from the Web},
  booktitle = {{CIKM}},
  publisher = {{ACM}},
  year      = {2020}
}

Overview

Project structure

We assume the following project structure:

CIKM-20/
├── java
│    ├── bootstrapping
│    └── extraction
├── notebooks
│   ├── 01-concept-spotting
│   │   ├── 01-texts-training.ipynb
│   │   ├── 02-texts-spotting-wikipedia.ipynb
│   │   ├── 03-texts-spotting-clueweb.ipynb
│   │   ├── 04-infoboxes-training.ipynb
│   │   ├── 05-infoboxes-spotting.ipynb
│   │   ├── 06-lists-training.ipynb
│   │   └── 07-lists-spotting.ipynb
│   ├── 02-graph-construction
│   │   └── 01-graph-construction.ipynb
│   ├── 03-graph-analysis
│   │   ├── 01-knowledge-bases-overview.ipynb
│   │   └── 02-graph-statistics.ipynb
│   └── 04-graph-evaluation
│       ├── 01-graph-evaluation-precision.ipynb
│       ├── 02-qa-corpus-construction.ipynb
│       └── 03-graph-evaluation-recall.ipynb
└── data/
    ├── bootstrapping
    │   ├── 0-instances
    │   ├── 0-patterns
    │   ├── 1-instances
    │   ├── 1-patterns
    │   ├── 2-instances
    │   ├── 2-patterns
    │   └── seeds.csv
    ├── question-answering/
    ├── causality-graphs/
    │   ├── extraction
    │   │   ├── clueweb
    │   │   └── wikipedia
    │   ├── spotting
    │   │   ├── clueweb
    │   │   └── wikipedia
    │   ├── integration
    │   ├── causenet-full.jsonl.bz2
    │   ├── causenet-precision.jsonl.bz2
    │   └── causenet-sample.json
    ├── categorization
    ├── random
    ├── concept-spotting
    │   ├── infoboxes
    │   ├── lists
    │   └── texts
    ├── flair-models
    │   ├── infoboxes
    │   ├── lists/
    │   └── texts/
    ├── lucene-index/
    └── external
        ├── extraction-sources
        │   ├── clueweb12
        │   └── wikipedia
        ├── knowledge-bases
        │   ├── conceptnet-assertions-5.6.0.csv
        │   ├── freebase-rdf-latest.gz
        │   └── wikidata-20181001-all.json.bz2
        ├── msmarco
        ├── nltk
        ├── stop-word-lists
        ├── spacy
        └── stanfordnlp
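
Before running any component, it can help to confirm that the expected folders are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_paths` and the selection of paths are assumptions made for illustration, with the paths copied from the tree above.

```python
from pathlib import Path

# A subset of the paths from the project tree above; extend as needed.
EXPECTED_PATHS = [
    "java/bootstrapping",
    "java/extraction",
    "notebooks/01-concept-spotting",
    "notebooks/02-graph-construction",
    "notebooks/03-graph-analysis",
    "notebooks/04-graph-evaluation",
    "data/bootstrapping/seeds.csv",
    "data/lucene-index",
    "data/concept-spotting",
    "data/flair-models",
    "data/external/knowledge-bases",
]


def missing_paths(root, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # "CIKM-20" is the root folder name used in the tree; adjust to your checkout.
    for rel in missing_paths("CIKM-20"):
        print(f"missing: {rel}")
```

An empty result means that every listed path exists; it does not verify the contents of the folders.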

Prerequisites

We recommend Miniconda for easy installation on many platforms.

  1. Create new environment:
    conda env create -f environment.yml
  2. Activate environment:
    conda activate cikm20-causenet
  3. Install Kernel:
    python -m ipykernel install --user --name cikm20-causenet --display-name cikm20-causenet
  4. Start Jupyter:
    jupyter notebook

CauseNet: Analysis

The code was tested with Python 3.7.3, under Linux 4.9.0-8-amd64 with 16 cores and 256 GB RAM.

Overview of causal relations in knowledge bases

This notebook produces the overview of causal relations in knowledge bases reported in Table 1 of the paper.

Required Input Data

Execution

Execute the following notebook:

notebooks/03-graph-analysis/
└── 01-knowledge-bases-overview.ipynb

CauseNet: Graph Analysis

Required Input Data

Execution

Execute the following notebook:

notebooks/03-graph-analysis/
└── 02-graph-statistics.ipynb

CauseNet: Graph Evaluation

Required Software

  • DBpedia Spotlight

Required Input Data

Execution

Execute the following notebooks:

notebooks/04-graph-evaluation/
├── 01-graph-evaluation-precision.ipynb
├── 02-qa-corpus-construction.ipynb
└── 03-graph-evaluation-recall.ipynb

Computed Output Data

02-qa-corpus-construction.ipynb will extract simple causal questions from MSMARCO:

question-answering/
├── causality-qa-training.json
└── causality-qa-validation.json

CauseNet: Graph Extraction

The graph extraction is structured as follows:

  1. Bootstrapping Component (Java):
    • generates linguistic patterns from Wikipedia sentences using a bootstrapping approach
  2. Extraction Component (Java):
  3. Causal Concept Spotting (Python):
    • training sequence taggers for sentences, infoboxes and lists
    • spotting causal concepts in extractions of previous step
  4. Graph construction (Python):
    • final construction and reconciliation steps

The code was tested with Java 8 and Python 3.7.3, under Linux 4.9.0-8-amd64 with 16 cores and 256 GB RAM.

Bootstrapping Component

Required Input Data

  1. Bootstrapping seeds:
    data/bootstrapping/seeds.csv
  2. Lucene index with preprocessed Wikipedia sentences:
    data/lucene-index/

Execution

  1. Compile:
    mvn package -f ./java/bootstrapping/pom.xml
  2. Execute:
    ./scripts/bootstrapping.sh

Computed Output Data

The bootstrapping component will compute the following files:

data/bootstrapping/
├── 0-instances
├── 0-patterns
├── 1-instances
├── 1-patterns
├── 2-instances
└── 2-patterns

The following components will use the patterns after the second iteration: data/bootstrapping/2-patterns.
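
If the number of bootstrapping iterations differs in your run, the patterns file of the last iteration can be located programmatically. This is a hedged sketch: the helper `latest_patterns` is hypothetical and only relies on the `<n>-patterns` naming shown above.

```python
import re
from pathlib import Path


def latest_patterns(bootstrapping_dir):
    """Return the '<n>-patterns' entry with the highest iteration number n, or None."""
    best = None
    for entry in Path(bootstrapping_dir).iterdir():
        match = re.fullmatch(r"(\d+)-patterns", entry.name)
        # Compare iteration numbers numerically so that 10 sorts after 2.
        if match and (best is None or int(match.group(1)) > best[0]):
            best = (int(match.group(1)), entry)
    return None if best is None else best[1]
```

With the output listed above, `latest_patterns("data/bootstrapping")` would point to `data/bootstrapping/2-patterns`.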

Extraction Component: Wikipedia

Input Data

Execution

  1. Compile:
    mvn package -f ./java/extraction/pom.xml
  2. Execute:
    ./scripts/extraction-wikipedia.sh

Computed Output Data

  • Causal relations extracted from texts, infoboxes and lists:
    data/causality-graphs/extraction/
    └── wikipedia
        └── wikipedia-extraction.tsv
    

Extraction Component: ClueWeb12

We provide code to parse one ClueWeb12 file. To parse the entire ClueWeb12 corpus, you can integrate this code into your cluster software.

Input Data

Execution

  1. Compile:
    mvn package -f ./java/extraction/pom.xml
  2. Execute:
    ./scripts/extraction-clueweb12.sh

Computed Output Data

  • Causal relations extracted from webpage texts:
    data/causality-graphs/extraction/
    └── clueweb12
        └── clueweb12-extraction.tsv
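
A quick sanity check of either extraction file (Wikipedia or ClueWeb12) is to count its rows and the number of columns per row. The sketch below makes no assumption about what the columns mean, since the column layout is not documented here; the helper name `tsv_shape` and the choice to treat quote characters as plain data are assumptions.

```python
import csv
from collections import Counter


def tsv_shape(path):
    """Return (row_count, Counter mapping column count -> number of rows)."""
    widths = Counter()
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE: quote characters inside extracted text are kept as data (assumption).
        for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            widths[len(row)] += 1
    return sum(widths.values()), widths
```

For example, `tsv_shape("data/causality-graphs/extraction/clueweb12/clueweb12-extraction.tsv")` returns the total number of rows together with the distribution of column counts; more than one distinct column count may indicate malformed lines.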
    

Causal Concept Spotting

Models were trained on an NVIDIA GeForce GTX 1080 Ti (11 GB). To reproduce the results, we recommend using a similar GPU architecture. If you do not want to retrain the models, you can use our models: data/flair-models/

Required Software

No manual steps required. The correct versions will be automatically installed if you use the provided environment.yml.
For completeness:

Required Input Data

  • Concept Spotting datasets:
    data/concept-spotting/: This folder contains the manually annotated training and evaluation data for concept spotting.
  • Output data of the extraction components:
    data/causality-graphs/extraction/
    ├── clueweb12
    │   └── clueweb12-extraction.tsv
    └── wikipedia
        └── wikipedia-extraction.tsv
    

Execution

Execute the following notebooks:

notebooks/01-concept-spotting/
├── 01-texts-training.ipynb
├── 02-texts-spotting-wikipedia.ipynb
├── 03-texts-spotting-clueweb.ipynb
├── 04-infoboxes-training.ipynb
├── 05-infoboxes-spotting.ipynb
├── 06-lists-training.ipynb
└── 07-lists-spotting.ipynb
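
The notebooks can also be run non-interactively, in the order listed. The following sketch builds `jupyter nbconvert` commands and only prints them by default. It is an illustration rather than a script shipped with the repository: the helper names are hypothetical, the directory name follows the project tree (`notebooks/01-concept-spotting`), and the kernel name is the one installed in the prerequisites. Note that the training notebooks expect a GPU.

```python
import subprocess
from pathlib import Path

NOTEBOOK_DIR = Path("notebooks/01-concept-spotting")  # name as in the project tree
NOTEBOOKS = [
    "01-texts-training.ipynb",
    "02-texts-spotting-wikipedia.ipynb",
    "03-texts-spotting-clueweb.ipynb",
    "04-infoboxes-training.ipynb",
    "05-infoboxes-spotting.ipynb",
    "06-lists-training.ipynb",
    "07-lists-spotting.ipynb",
]


def nbconvert_commands(notebook_dir=NOTEBOOK_DIR, notebooks=NOTEBOOKS,
                       kernel="cikm20-causenet"):
    """Build one 'jupyter nbconvert --execute' command per notebook, in order."""
    return [
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace",
         f"--ExecutePreprocessor.kernel_name={kernel}",
         str(Path(notebook_dir) / name)]
        for name in notebooks
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only if dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops at the first failing notebook.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all(nbconvert_commands())  # dry run: prints the commands only
```

Calling `run_all(nbconvert_commands(), dry_run=False)` would execute the notebooks and overwrite them in place with their outputs.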

Computed Output Data

  • Flair models for sequence labeling:
    data/flair-models/
  • Separate causality graphs:
    data/causality-graphs/spotting/
    ├── clueweb12
    │   └── clueweb-graph.json
    └── wikipedia
        ├── infobox-graph.json
        ├── list-graph.json
        └── text-graph.json
    

Graph Construction

Required Input Data

data/causality-graphs/spotting/
├── clueweb12
│   └── clueweb-graph.json
└── wikipedia
    ├── infobox-graph.json
    ├── list-graph.json
    └── text-graph.json

Execution

Execute the following notebook:

notebooks/02-graph-construction/
└── 01-graph-construction.ipynb

Computed Output Data

data/causality-graphs/integration/
└── causenet-full.jsonl.bz2
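
The resulting file is bz2-compressed JSON Lines, so it can be streamed without decompressing it to disk first. A minimal sketch, assuming one JSON object per line; the helper name `iter_jsonl_bz2` is hypothetical and the code makes no assumption about the fields of each record.

```python
import bz2
import json


def iter_jsonl_bz2(path):
    """Yield one parsed JSON object per non-empty line of a bz2-compressed JSONL file."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, `sum(1 for _ in iter_jsonl_bz2("data/causality-graphs/integration/causenet-full.jsonl.bz2"))` counts the records, and `next(iter_jsonl_bz2(...))` shows the structure of the first one.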

Contact

For questions and feedback, please contact:

Stefan Heindorf, Paderborn University
Yan Scholten, Technical University of Munich
Henning Wachsmuth, Paderborn University
Axel-Cyrille Ngonga Ngomo, Paderborn University
Martin Potthast, Leipzig University

License

The code by Stefan Heindorf, Yan Scholten, Henning Wachsmuth, Axel-Cyrille Ngonga Ngomo, and Martin Potthast is licensed under the MIT license.