This repository contains the source code for our paper CodeGraphSMOTE - Data Augmentation for Vulnerability Discovery
- Python
- PyTorch
- PyTorch Geometric
- NetworkX
- imbalanced-learn
- gensim
- tokenizers
- pandas
- dash (for the transformer reconstruction demo)
- cpg-to-dot
Training data for the various datasets can be obtained at:
For each fix commit, the version of a method before the commit is extracted as vulnerable and the version after it as non-vulnerable, as described in Devign and ReVeal. Afterwards, the resulting C code is processed using Fraunhofer-CPG. A single file per method, containing the CPG in the Graphviz DOT language, needs to be placed in the cache folders of this directory (alternatively, the paths in params/dataset_params.py can be changed). The processed CPG files can be created conveniently using cpg-to-dot.
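The labeling rule described above can be sketched as follows. This is a minimal illustration, not the preprocessing code of this repository: `LabeledMethod` and `label_methods` are hypothetical names, and the inputs are assumed to be dictionaries mapping a method name to its C source before and after a single fix commit.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LabeledMethod:
    """One training sample: a method body and its label (hypothetical type)."""
    name: str
    source: str
    label: int  # 1 = vulnerable (pre-fix), 0 = non-vulnerable (post-fix)


def label_methods(before_fix: Dict[str, str],
                  after_fix: Dict[str, str]) -> List[LabeledMethod]:
    """Label pre-fix method bodies as vulnerable and post-fix bodies as non-vulnerable.

    Assumption: both dictionaries only contain methods touched by the fix commit.
    """
    samples = [LabeledMethod(name, src, 1) for name, src in sorted(before_fix.items())]
    samples += [LabeledMethod(name, src, 0) for name, src in sorted(after_fix.items())]
    return samples
```

Each resulting sample would then be converted to a CPG and written to its own DOT file, as described above.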
notebooks/
analyze_cwe.ipynb
Visualization of the distances between CWE clusters. Used for the right-hand side of figure 4.
degree_vis.ipynb
Notebook containing the code for visualizations of the average degree against the number of nodes. Used for figures 2b and 2c.
params/
Hyperparameters of training, models and datasets as well as paths to the data and various other configuration
scripts/
cpg_reconstruction/
demo.py
Interactive demonstration of the reconstruction of code from a CPG using the trained transformer
demo2.py
Interactive demonstration of the interpolation between two code samples and reconstruction using the transformer; used for figure 3
train.py
Training of the CPG reconstruction transformer
cwe_distances.py
Generates the data needed for notebooks/analyze_cwe.ipynb
degree_per_node.py
Generates figure 2a
draw_cwes.py
Generates left-hand side of figure 4
plot_percentages.py
Used for creating figure 5
quick_preprocess.py
Parallelized version of data preprocessing. Use this before training on any dataset
view_results.py
Creates textual summaries of the cross-validation experiments. Used for table 1
cv_classifier.py
Used to generate the results on the full dataset with cross-validation for table 1
cv_subsampling_drop.py
Used to generate the subsampled results shown as "Node-Dropping" in figure 5
cv_subsampling_sard.py
Used to generate the subsampled results shown as "SARD" in figure 5
cv_subsampling_smote.py
Used to generate the subsampled results shown as "CodeGraphSMOTE" in figure 5
cv_subsampling.py
Used to generate the subsampled results shown as "Downsampled" in figure 5
train_vgae.py
Training of the VGAE model used for CodeGraphSMOTE
All implementations of models, training and data processing are in experiments/. All other files are utility files to ease implementation of the scripts.
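As background for the CodeGraphSMOTE scripts listed above, the interpolation step of classic SMOTE can be sketched in a few lines. This is a generic illustration on plain float vectors, not the implementation in experiments/. The function name `smote_sample` and its parameters are hypothetical, and how the repository combines such interpolation with the VGAE is not shown here.

```python
import math
import random
from typing import List, Sequence


def smote_sample(minority: Sequence[Sequence[float]], k: int,
                 rng: random.Random) -> List[float]:
    """Create one synthetic sample between a minority point and one of its k nearest neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    if k < 1:
        raise ValueError("k must be at least 1")

    # Pick a random base sample from the minority class.
    i = rng.randrange(len(minority))
    base = minority[i]

    # Rank all other minority samples by Euclidean distance to the base.
    others = [v for j, v in enumerate(minority) if j != i]
    others.sort(key=lambda v: math.dist(base, v))

    # Choose one of the k nearest neighbors and interpolate towards it.
    neighbor = rng.choice(others[:k])
    lam = rng.random()  # interpolation factor in [0, 1)
    return [b + lam * (n - b) for b, n in zip(base, neighbor)]
```

A call such as `smote_sample(vectors, k=5, rng=random.Random(0))` returns a point on the line segment between two existing minority samples; passing a seeded `random.Random` keeps the result reproducible.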
Create an issue in this repository if you find a bug or have questions about the content.
For additional support, ask a question in SAP Community.
If you wish to contribute code, offer fixes or improvements, please send a pull request. Due to legal reasons, contributors will be asked to accept a DCO when they create the first pull request to this project. This happens in an automated fashion during the submission process. SAP uses the standard DCO text of the Linux Foundation.
Copyright (c) 2023 SAP SE or an SAP affiliate company. All rights reserved. This project is licensed under the Apache Software License, version 2.0 except as noted otherwise in the LICENSE file.