TLDR: Raw Git data extracted via libgit2 for a couple of open source projects. Each directory contains sample CSV files and a README with the download links (Parquet and JSON formats).
We intend to publish further data (a lot, actually) with additional metrics and dimensions.
Please feel free to request datasets for other repositories and/or projects in the issues!
Column | Type | Description |
---|---|---|
id | string | The commit's SHA |
delay | int64 | Seconds elapsed between the creation and last application of the commit (rebases can cause negative values) |
age | int64 | Shortest interval between the commit and its parents |
ismerge | bool | Whether the commit has two or more parents (or is a squash) |
squashof | int64 | Whether it is a squash and merge commit (currently parsed from commit message) |
author_name | string | The author's name |
author_email | string | The author's email address |
committer_name | string | The committer's name |
committer_email | string | The committer's email address |
author_time | datetime | The author signature's timestamp |
committer_time | datetime | The committer signature's timestamp |
loc_d | int64 | Number of lines deleted in this commit |
loc_i | int64 | Number of lines inserted in this commit |
comp_d | int64 | Whitespace complexity deleted in this commit |
comp_i | int64 | Whitespace complexity inserted in this commit |
nfiles | int64 | Number of files (patches) affected by this commit |
message | string | The (nice and shiny and fixless :) commit message |
ndiffs | int64 | Number of diffs and parent commits |
author_email_dedup | string | Author's deduplicated email address |
author_name_dedup | string | Author's deduplicated name |
committer_email_dedup | string | Committer's deduplicated email address |
committer_name_dedup | string | Committer's deduplicated name |
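
As a rough illustration of how these columns can be combined, the sketch below sums inserted and deleted lines per deduplicated author and leaves merge commits out. The rows are invented for the example (they are not taken from the dataset), and the helper name `lines_per_author` is hypothetical.

```python
import pandas as pd


def lines_per_author(commits: pd.DataFrame) -> pd.DataFrame:
    """Sum inserted/deleted lines per deduplicated author, skipping merges."""
    non_merge = commits[~commits["ismerge"]]
    totals = non_merge.groupby("author_email_dedup")[["loc_i", "loc_d"]].sum()
    totals["net"] = totals["loc_i"] - totals["loc_d"]
    return totals.sort_values("net", ascending=False)


# Invented sample rows that only mimic the column layout described above.
sample = pd.DataFrame({
    "id": ["a1", "b2", "c3", "d4"],
    "ismerge": [False, False, True, False],
    "author_email_dedup": [
        "ann@example.com", "bob@example.com",
        "ann@example.com", "ann@example.com",
    ],
    "loc_i": [10, 5, 100, 7],
    "loc_d": [2, 5, 50, 1],
})

summary = lines_per_author(sample)
print(summary)
```

The same function should work on the frame returned by `pd.read_parquet('./commits.parquet')`, assuming the published file carries the columns listed in the table.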
These are the individual files touched by commits. Patches are generated by diffing two revisions.
Column | Type | Description |
---|---|---|
id | string | The commit the patch belongs to |
parent_id | string | The parent commit which the diff was generated against |
oldpath | string | The file's old path before applying the patch |
newpath | string | The file's new path after applying the patch |
ismerge | bool | Whether the commit has two or more parents (or is a squash) |
status | string | What kind of modification happened with the file (added / deleted / modified / etc) |
author_time | datetime | The author signature's timestamp |
oldsize | int64 | The file's size in bytes before applying the patch |
newsize | int64 | The file's size in bytes after applying the patch |
language | string | Programming language of this patch |
langtype | string | Language type given by GitHub's Linguist |
skipped | string | Whether the patch generation has been skipped or NOT (otherwise the reason) |
istest | bool | Whether the file is a test file or not |
loc_d | int64 | Number of lines deleted in this patch |
loc_i | int64 | Number of lines inserted in this patch |
comp_d | int64 | Whitespace complexity deleted in this patch |
comp_i | int64 | Whitespace complexity inserted in this patch |
loc_d_std | float32 | Deleted number of lines deviation in the hunks |
loc_i_std | float32 | Inserted number of lines deviation in the hunks |
comp_d_std | float32 | Deleted complexity deviation in the hunks |
comp_i_std | float32 | Inserted complexity deviation in the hunks |
nhunks | int64 | Number of hunks in this patch |
nblames | int64 | Number of unique commits this patch has churned lines from |
blame_loc | int64 | Number of lines this patch has churned (deleted) |
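
Since patches are per-file rows keyed by the commit `id`, they can be rolled up to commit level. The sketch below does that on a few invented rows. It drops merge commits first, because a merge is diffed against each parent and would otherwise be counted once per `parent_id`. Whether the rolled-up sums match the commit-level `loc_i`, `loc_d` and `nfiles` columns exactly is an assumption, not something stated here.

```python
import pandas as pd

# Invented sample rows that only mimic the patch columns described above.
patches = pd.DataFrame({
    "id": ["a1", "a1", "b2", "m9", "m9"],
    "parent_id": ["p0", "p0", "a1", "a1", "b2"],
    "newpath": ["src/odb.c", "tests/odb.c", "README.md", "src/oid.c", "src/oid.c"],
    "ismerge": [False, False, False, True, True],
    "loc_i": [30, 20, 4, 10, 10],
    "loc_d": [5, 0, 1, 10, 10],
})

# Keep non-merge patches only, then aggregate per commit.
per_commit = (
    patches[~patches["ismerge"]]
    .groupby("id")
    .agg(
        loc_i=("loc_i", "sum"),
        loc_d=("loc_d", "sum"),
        nfiles=("newpath", "count"),
    )
)
print(per_commit)
```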
Contains blame segments for patches, with an example from libgit2's GitHub page.
Column | Type | Description |
---|---|---|
id | string | The commit's SHA |
author_email | string | The author's email address |
author_time | datetime | The author signature's timestamp |
ismerge | bool | Whether the commit has two or more parents (or is a squash) |
newpath | string | The file's new path after applying the patch |
istest | bool | Whether the file is a test file or not |
blame_id | string | The commit's SHA where this commit has churned lines from |
loc_d | int64 | Number of churned lines |
language | string | Programming language of the affected file |
blame_author_email | string | The churned author's email address |
blame_author_time | datetime | The author signature's timestamp of the churned commit |
blame_ismerge | bool | Whether the churned commit was a merge commit or not |
author_email_dedup | string | Author's deduplicated email address |
author_name_dedup | string | Author's deduplicated name |
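
One possible use of the blame rows is a "who churns whose lines" matrix. The sketch below builds it from a few invented rows with `pivot_table`, and also counts self-churn (an author deleting their own lines). It is only one way to read these columns, and the sample values are made up.

```python
import pandas as pd

# Invented sample rows that only mimic the blame columns described above.
blames = pd.DataFrame({
    "id": ["c3", "c3", "d4"],
    "author_email": ["ann@example.com", "ann@example.com", "bob@example.com"],
    "blame_id": ["a1", "b2", "a1"],
    "blame_author_email": ["ann@example.com", "bob@example.com", "ann@example.com"],
    "loc_d": [3, 7, 2],
})

# Rows: the author doing the churning; columns: the author whose lines were churned.
churn = blames.pivot_table(
    index="author_email",
    columns="blame_author_email",
    values="loc_d",
    aggfunc="sum",
    fill_value=0,
)

# Lines an author removed from their own earlier commits.
is_self = blames["author_email"] == blames["blame_author_email"]
self_churn = blames.loc[is_self, "loc_d"].sum()

print(churn)
print(self_churn)
```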
Both lightweight and annotated tags.
Column | Type | Description |
---|---|---|
id | string | The tag's SHA |
name | string | The tag's name (in case of annotated) |
message | string | The tag's message (in case of annotated) |
type | int64 | Git object type (mostly commit) |
author_time | datetime | The timestamp of the tag's creation |
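
As a small example of working with the tag timestamps, the sketch below orders tags by `author_time` and computes the number of days since the previous tag. The tag names and dates are invented, and it assumes every row has a usable `name`, which may not hold for lightweight tags.

```python
import pandas as pd

# Invented sample rows that only mimic the tag columns described above.
tags = pd.DataFrame({
    "name": ["v0.1.0", "v0.3.0", "v0.2.0"],
    "author_time": pd.to_datetime(["2018-01-01", "2018-03-02", "2018-01-31"]),
})

# Sort chronologically, then take the difference between consecutive timestamps.
ordered = tags.sort_values("author_time").reset_index(drop=True)
ordered["days_since_previous"] = ordered["author_time"].diff().dt.days

print(ordered)
```

The first row has no predecessor, so its `days_since_previous` value is missing (NaN).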
More detailed documentation is on the way.
Reading Parquet files requires PyArrow
conda install -y pandas pyarrow
Fetching package metadata ...............
Solving package specifications: .
Package plan for installation in environment /Users/krisz/.pyenv/versions/miniconda3-latest/envs/ml3:
The following packages will be UPDATED:
pandas: 0.21.0-py36_0 conda-forge --> 0.22.0-py36_0 conda-forge
pandas-0.22.0- 100% |################################| Time: 0:00:04 2.30 MB/s
Currently there are two formats available: Parquet and JSON.
wget https://s3.amazonaws.com/gitential-com-datasets/libgit2-libgit2/commits.parquet
wget https://s3.amazonaws.com/gitential-com-datasets/libgit2-libgit2/commits.json.gz
--2018-01-15 11:09:00-- https://s3.amazonaws.com/gitential-com-datasets/libgit2-libgit2/commits.parquet
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.100.149
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.100.149|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1483345 (1.4M) [binary/octet-stream]
Saving to: ‘commits.parquet’
commits.parquet 100%[===================>] 1.41M 616KB/s in 2.4s
2018-01-15 11:09:03 (616 KB/s) - ‘commits.parquet’ saved [1483345/1483345]
--2018-01-15 11:09:04-- https://s3.amazonaws.com/gitential-com-datasets/libgit2-libgit2/commits.json.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.100.149
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.100.149|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1642853 (1.6M) [application/json]
Saving to: ‘commits.json.gz’
commits.json.gz 100%[===================>] 1.57M 903KB/s in 1.8s
2018-01-15 11:09:06 (903 KB/s) - ‘commits.json.gz’ saved [1642853/1642853]
import pandas as pd
commits = pd.read_parquet('./commits.parquet')
commits[['id', 'message']].head()
id | message |
---|---|
c15648cbd059b92c177586ab1701a167222c7681 | Initial draft of libgit2\n\nSigned-off-by: Sha... |
44181c23ea6c39d51a4b481dc59ecf2cc3967e76 | Mark git_oid parameters const when they should... |
46d8b885bd65158e8cb53266ba4b627b5991bce8 | Rename git_odb_sread to just git_odb_read\n\nM... |
171aaf21d9f7582270c390962f61d3d2613c4d59 | Hide GIT_{BEGIN,END}_DECL from doxygen as its ... |
b51eb250ed0cbda59d3108d04569fab9413909fd | Cleanup git_odb documentation formatting\n\nSi... |
import pandas as pd
commits = pd.read_json('./commits.json.gz', compression='infer')
commits[['id', 'message']].head()
id | message |
---|---|
c15648cbd059b92c177586ab1701a167222c7681 | Initial draft of libgit2\n\nSigned-off-by: Sha... |
44181c23ea6c39d51a4b481dc59ecf2cc3967e76 | Mark git_oid parameters const when they should... |
46d8b885bd65158e8cb53266ba4b627b5991bce8 | Rename git_odb_sread to just git_odb_read\n\nM... |
171aaf21d9f7582270c390962f61d3d2613c4d59 | Hide GIT_{BEGIN,END}_DECL from doxygen as its ... |
b51eb250ed0cbda59d3108d04569fab9413909fd | Cleanup git_odb documentation formatting\n\nSi... |
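
The messages shown above include the full body (for example the `Signed-off-by` trailers). If only the subject line is needed, a sketch like the following may help. It assumes the `message` column holds real newline characters (displayed as `\n` in the output above), and the sample rows and the `subject` column name are invented for the example.

```python
import pandas as pd

# Invented sample rows; the real frame would come from read_parquet/read_json.
sample_commits = pd.DataFrame({
    "id": ["a1", "b2"],
    "message": [
        "Initial draft\n\nSigned-off-by: Someone <someone@example.com>",
        "Single line message",
    ],
})

# Keep everything before the first newline as the subject line.
sample_commits["subject"] = sample_commits["message"].str.split("\n", n=1).str[0]

print(sample_commits[["id", "subject"]])
```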