In this study, we empirically evaluate existing deep learning based vulnerability detection techniques on real-world vulnerabilities. We test the feasibility of these techniques on two datasets:
- Part of the Devign dataset (often referred to as the FFMpeg+Qemu dataset in this project).
- Vulnerabilities we collected from the Chrome and Debian issue trackers (often referred to as the Chrome+Debian or Verum dataset in this project).
To download the data:
cd data;
bash get_data.sh
To download (some of) the pretrained models:
cd models;
bash get_models.sh
Some of the tools in this study can be applied to new datasets. To do so, we use Joern to parse the C code. First, build Joern:
cd code-slicer/joern;
bash build.sh
Once the build succeeds, go to the folder where you want to run your experiment, create a folder named raw_code, and place every function in a separate C file. We follow the file naming convention <name>_<VUL>.c, where <VUL> is the vulnerability label of the function (0 for benign, 1 for vulnerable).
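The naming convention above can be sketched in Python as follows. This is a minimal illustration and not part of the repository: the helper names `write_function_file` and `label_from_filename` are hypothetical, and the sketch assumes that the source of each function is already available as a string.

```python
import re
from pathlib import Path

# Hypothetical helpers illustrating the <name>_<VUL>.c convention;
# they are not part of this repository.
_FILE_PATTERN = re.compile(r"^(?P<name>.+)_(?P<vul>[01])\.c$")


def write_function_file(raw_code_dir, name, source, vulnerable):
    """Write one C function to raw_code_dir/<name>_<VUL>.c and return the path."""
    out_dir = Path(raw_code_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # <VUL> is 1 for vulnerable functions and 0 for benign ones.
    path = out_dir / f"{name}_{int(bool(vulnerable))}.c"
    path.write_text(source)
    return path


def label_from_filename(filename):
    """Return (name, label) parsed from a <name>_<VUL>.c file name."""
    match = _FILE_PATTERN.match(Path(filename).name)
    if match is None:
        raise ValueError(f"{filename!r} does not follow <name>_<VUL>.c")
    return match.group("name"), int(match.group("vul"))
```

Parsing the label back from the file name is a cheap way to check that every file in raw_code follows the convention before running the later steps.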
- Extract the slices from the parsed code. Modify data_processing/extract_slices.ipynb to extract the slices. This will generate a file <data_name>_full_data_with_slices.json in your data directory.
- Run data_processing/create_ggnn_data.py to format the data into the different formats.
- Update data_processing/full_data_prep_script.ipynb to prepare the input for the GGNN.
- Clone our implementation of Devign from here.
- Use the following parameters: "node_features" as --node_tag, "graph" as --graph_tag, and "targets" as --label_tag.
- Use the --save_after_ggnn flag to save the data after it has been processed by the GGNN.
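As a rough sketch, the flags listed above can be assembled into an argument list in Python. The helper `devign_data_args` is hypothetical; only the flag names and tag values come from the list above, and the script that consumes them is deliberately left unspecified.

```python
def devign_data_args(save_after_ggnn=True):
    """Build the tag-related command-line arguments described above.

    Hypothetical helper: the flag names and values mirror the README,
    but the entry-point script they are passed to is not assumed here.
    """
    args = [
        "--node_tag", "node_features",
        "--graph_tag", "graph",
        "--label_tag", "targets",
    ]
    if save_after_ggnn:
        # Save the data after it has been processed by the GGNN.
        args.append("--save_after_ggnn")
    return args
```

The returned list can be appended to whatever command launches the Devign implementation, for example through `subprocess.run`.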
The running APIs are exposed by this file. Modify the parameters to fit your needs.
To try ReVeal on the Chrome+Debian (Verum) dataset:
cd Vuld_SySe/representation_learning;
bash run_verum.sh
To try ReVeal on the Devign dataset:
cd Vuld_SySe/representation_learning;
bash run_devign.sh
We include different scripts for running other models (i.e. VulDeePecker, SySeVR, Draper) under the scripts/ and real_data_scripts/ folders.
We use several components from state-of-the-art research. Please cite them accordingly to give due attribution and credit to the authors.
- If you use the Code-Slicer portion of this repository, please cite the following:
@inproceedings{yamaguchi2014modeling,
title={Modeling and discovering vulnerabilities with code property graphs},
author={Yamaguchi, Fabian and Golde, Nico and Arp, Daniel and Rieck, Konrad},
booktitle={2014 IEEE Symposium on Security and Privacy},
pages={590--604},
year={2014},
organization={IEEE}
}
- If you use Devign, please cite:
@inproceedings{zhou2019devign,
title={Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks},
author={Zhou, Yaqin and Liu, Shangqing and Siow, Jingkai and Du, Xiaoning and Liu, Yang},
booktitle={Advances in Neural Information Processing Systems},
pages={10197--10207},
year={2019}
}
- If you refer to the empirical findings reported in the paper, please cite our pre-print as:
@article{chakraborty2020deep,
title={Deep Learning based Vulnerability Detection: Are We There Yet?},
author={Chakraborty, Saikat and Krishna, Rahul and Ding, Yangruibo and Ray, Baishakhi},
journal={arXiv preprint arXiv:2009.07235},
year={2020}
}