This project showcases some possible uses of the attention signal of neural models of code. In particular, we focus on generative GPT-like models, so we extract the attention from the transformer units. Note that GPT-like models are decoder only, thus we get masked attention (i.e., each token attends only to the previous tokens) and the attention matrix is triangular.
We collect ground truth eye tracking data from developers while they explore code, and we compare the attention signal of the neural model with this ground truth data.
We study the models and developers performing the sense-making task: given a self-contained source code file, the task is to answer a single question (e.g. complexity related, parameter related, etc.) listed at the end of the file, starting from the prompt `# Answer:`.
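As a quick illustration of what this attention signal looks like (this is only a minimal sketch under our own assumptions, not the extraction code used by the repository; the model, prompt, and layer/head averaging below are example choices), the masked attention of a decoder-only HuggingFace model can be read and collapsed into one value per token roughly as follows:

```python
# Minimal sketch (not the repository's extraction code): read the masked
# self-attention of a decoder-only model and collapse it to one value per token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Salesforce/codegen-350M-mono"  # example model mentioned in this README
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)

code = "def add(a, b):\n    return a + b\n# Answer:"
inputs = tokenizer(code, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer, each of shape
# [batch, heads, seq_len, seq_len]; causal masking makes each matrix lower triangular.
att = torch.stack(outputs.attentions)      # [layers, batch, heads, seq, seq]
att = att.mean(dim=(0, 2)).squeeze(0)      # average over layers and heads -> [seq, seq]
per_token = att.sum(dim=0)                 # one possible aggregation: attention received by each token
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(list(zip(tokens, per_token.tolist())))
```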
- **post-processed eye tracking data and ready-to-plot experimental data**: contains the post-processed data derived from the raw data, including the ground truth visual attention vectors, the interaction matrix, and the annotation on the answer correctness (a conceptual sketch of how a visual attention vector can be built follows this list). Available here.
- **raw eye tracking data**: contains the raw data from 25 developers over 92 valid code exploration sessions divided in two batches. Available here. Place the content at the path `data/eye_tracking_studies/dataset_human_sessions`.
- **sensemaking code snippets**: list of files used for the eye tracking study and to be processed by the neural models. Available here in the repo.
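Purely to make the term "visual attention vector" concrete (the actual post-processing is done by the `eye_tracking` package and its configs; the column names and values below are hypothetical), one can think of it as the total fixation time accumulated on each token:

```python
# Conceptual sketch only: the real post-processing is done by the eye_tracking
# package and its configs; column names and values here are hypothetical.
import pandas as pd

# hypothetical fixation log: one row per fixation, mapped to the token it landed on
fixations = pd.DataFrame({
    "token_index": [0, 0, 3, 5, 3],
    "duration_ms": [180, 220, 310, 150, 90],
})

n_tokens = 8  # number of tokens in the snippet (hypothetical)
# ground truth visual attention vector: total fixation time per token
attention_vector = (
    fixations.groupby("token_index")["duration_ms"].sum()
    .reindex(range(n_tokens), fill_value=0)
)
print(attention_vector.tolist())  # [400, 0, 0, 400, 0, 150, 0, 0]
```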
Follow these steps to prepare the repository to reproduce the plots in the paper:
- Set up a virtual environment in the current folder: `pip install virtualenv`, then `virtualenv venv`.
- Activate the virtual environment: `source venv/bin/activate`.
- Install all the dependencies: `pip install -r config/dependencies/venv_requirements.txt`.
This level of reproduction lets you recreate the plots in the paper without running the experiments from scratch.
- Download the `experimental_data` and place its content in the `data/experimental_data` folder. The data included in the `experimental_data` folder are:
  - The folder `cmp_v06` contains the comparison CSV files generated for all the models; check the config files in `config/comparisons/cmp_v06*` for the exact configuration used for the different models.
  - The folder `eye_v10` contains the metadata on how much time has been spent on each token (using the tokens of the different models); for more details check the config files in `config/eye_v10*` for the exact configuration used for the different models.
Note: to reproduce these analyses from scratch, use the readme and the `.sh` scripts in the main folder of the repository.
- Run the notebook `notebooks/47_Empirical_Study_Results.ipynb` to reproduce the plots in the paper.
This level of reproduction lets you recreate all the experimental data, from the attention extraction to the comparison with the data in the eye tracking dataset.
Preliminaries:
- Note: you need the `screen` command to run the experiments; you can get it via `sudo apt-get install screen`.
- Download the human data from here, unzip it, and place the content at the path `data/eye_tracking_studies/dataset_human_sessions`.
- Run the script `1_DOWNLOAD_MODELS.sh` to download the models used in the experiments. Insert the model name you want to study and its HuggingFace identifier (e.g., `Salesforce/codegen-350M-mono`). Note that this pipeline works only with HuggingFace models. Then insert a folder where you want to download your model locally (e.g., `data/models`); the models will be downloaded in the `data/models` folder. This step will generate a config file with the details of your experiment in the folder `config/automatic`, with the name `download_{timestamp}.yaml`.
- Run the script `2_QUERY_MODEL.sh` to query the model with the code snippets from the sensemaking task, followed by the questions. First you have to decide which configuration to use among those in the folder `config/template/experiments`: for demo purposes we suggest `exp_vXX_test.yaml`, whereas to be consistent with the paper use `exp_v10_rebuttal.yaml`. When prompted, choose a short name for your model (e.g., `codegen350mono`). This will generate a config file with the details of your experiment in the folder `config/automatic`, with the name `{template_name}_{modelname}_{date}.yaml`, and output the attention signal of the model and its derived metrics in the folder `data/model_output/exp_vXX_test/codegen350mono`.
- Run the script `3_COMPARE_WITH_HUMAN.sh` to compare the attention signal of the model with the ground truth data (a conceptual sketch of this comparison follows this list). First you have to decide which configuration to use among those in the folder `config/template/comparisons`: for demo purposes we suggest `cmp_vXX_test.yaml`, whereas to be consistent with the paper use `cmp_v10_rebuttal.yaml`. When prompted, choose a short name for your model (e.g., `codegen350mono`), using the same name as in the previous step. It will then ask which configuration to use to post-process the eye tracking data from the humans: for demo purposes we suggest `eye_vXX_test.yaml`, whereas to be consistent with the paper use `eye_v10_rebuttal.yaml`. This will generate a config file with the details of your experiment in the folder `config/automatic`, with the name `{template_name}_{modelname}_{date}.yaml`, and output the comparisons in the folders `data/eye_tracking_attention/eye_vXX` and `data/comparisons/cmp_vXX`.
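As a rough illustration of what such a comparison computes (this is only a sketch under our own assumptions; the actual metrics are defined by the `cmp_*` configs, and the numbers below are made up), one can correlate the per-token attention of the model with the per-token fixation time of the developers:

```python
# Conceptual sketch only (the actual metrics are defined by the cmp_* configs):
# correlate a model-derived attention vector with a human visual attention vector
# defined over the same tokens, e.g. with Spearman rank correlation.
import numpy as np
from scipy.stats import spearmanr

model_attention = np.array([0.9, 0.1, 0.4, 0.7, 0.2])  # attention received per token (model)
human_attention = np.array([800, 50, 300, 600, 120])    # fixation time in ms per token (human)

rho, p_value = spearmanr(model_attention, human_attention)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```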
This repository contains the following subfolders:
- `attwizard`: all the code and scripts (`attwizard.script`) used to manipulate, analyze, and visualize the attention signal. Note that this package also includes tools to post-process data from the HRR and to compare data from humans and models.
- `eye_tracking`: the code and scripts used to post-process the eye tracking data collected during the eye tracking study.
- `config`: the configuration files used in the experiments.
- `data`: the data collected in the experiments.
- `notebooks`: the notebooks used to design and prototype the experiments.
- (optional) If you want to store your experiments in Azure container storage, you need to set up blobfuse and mount the container. You will then rely on a `fuse_connection.cfg` file stored in the root of the repo. It will contain the following:
accountName yourStorageAccountName
accountKey yourStorageAccountKey
containerName yourContainerName
To link the local folder to an Azure container storage, you can run the script `PREPARE_CONTAINER.sh` and then follow the instructions in the terminal.
- Check the `attwizard/script` folder for the scripts used to manipulate, analyze, and visualize the attention signal.