CircleCI scheduled nightly pa11y scan

Caley Woods edited this page Jul 18, 2024 · 8 revisions

Accessibility scanning on 18f.gsa.gov

We perform automatic accessibility testing with pa11y in two ways:

  1. a full scan of the entire site, run nightly
  2. smaller targeted scans for each push, build, or pull request (PR), based on what files were changed

Targeted scans take less than 5 minutes, and the full nightly scan continues to check accessibility site-wide. Previously, full scans on every build would take over 25 minutes on CircleCI.

Problem

An 18F team member identified that pa11y runs were too slow, and created issue #3752.

We investigated and determined that pa11y runs took about 25 minutes, because every build and PR would check all files, and there were over a thousand URLs being checked. This length of time impeded contributions and updates to the site.

Solution details

Targeted scans

For each push, build, and pull request (PR), we scan only the files that need to be scanned.

Here's a rough sketch of the algorithm:

  • If a post or page has changed, it should be scanned
  • If a post or page's layout has changed, the post or page should be scanned
  • If files that affect many pages are changed, such as layouts or partials, collect a random sample of files across the entire site to scan — 3 from every collection, and all the pages that live in the _site/ root folder.

The targeted scan writes the list of files to be scanned to a file named pa11y_targets. The CI job uses this file to focus the build or PR scan and keep scan times down.
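The selection rules above can be sketched roughly as follows. This is an illustrative Python sketch, not the Jekyll plugin itself: the function names, the `SITE_WIDE_PREFIXES` tuple (including `_layouts/`), and the shape of the inputs are all assumptions made for the example. It also folds the "layout changed" rule into the site-wide sampling branch instead of mapping layouts to the pages that use them.

```python
import random

# Assumed prefixes for files that affect many pages (hypothetical list).
SITE_WIDE_PREFIXES = ("_layouts/", "_includes/", "_sass/", "assets/")
SAMPLE_SIZE = 3


def select_targets(changed, collections, root_pages, rng=None):
    """Return the sorted list of files a targeted scan should cover.

    changed     -- paths reported as changed
    collections -- mapping of collection name to a list of its documents
    root_pages  -- pages that live in the site root
    """
    rng = rng or random.Random()
    targets = set()

    # A change to a shared file triggers a sample from every collection,
    # plus every page in the site root.
    if any(path.startswith(SITE_WIDE_PREFIXES) for path in changed):
        for docs in collections.values():
            targets.update(rng.sample(docs, min(SAMPLE_SIZE, len(docs))))
        targets.update(root_pages)

    # A changed post or page is always scanned directly.
    known = set(root_pages) | {d for docs in collections.values() for d in docs}
    targets.update(path for path in changed if path in known)
    return sorted(targets)


def write_targets(targets, path="pa11y_targets"):
    """Write one target per line; an empty list produces an empty file."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(targets))
```

Passing a seeded `random.Random` as `rng` makes the sample reproducible, which is handy when checking the selection by hand.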

Note: This is a naive algorithm, but it's good enough, especially with the nightly scan as a backup. For instance, if a partial or layout was changed, we'd ideally scan only the pages that use the changed layout or partial, and perhaps only a sample of those.

Full scan

We run a full scan nightly, around 5am Eastern Time. The scan creates a GitHub issue in the repository if there are any errors.

Technical implementation

Tip

Read this if you're a developer or trying to understand the details of how this scanning strategy is implemented.

Targeted scans

We use Jekyll hooks (:documents and :pages) to determine which files have changed according to git. Changes to files within assets/, _includes, and _sass cause the plugin to sample 3 files (or fewer if there aren't 3 to scan) from the blog and from each collection listed in the Jekyll config, plus 3 of the blog archive pages. Once the plugin determines which files have changed and should be scanned, it writes them to a file named pa11y_targets, which the CI environment uses to tell pa11y what to scan.

CircleCI

Once the Jekyll build completes, a shell script run from the CircleCI config checks for the existence of pa11y_targets. If the file is not found, the pa11y scan is skipped. If pa11y_targets exists, its contents are base64 encoded and stored in an environment variable ($PA11Y_TARGETS) through CircleCI's $BASH_ENV, which lets some state persist between job steps that are otherwise run "fresh". When it's time to run the pa11y scan, we base64 decode the contents of $PA11Y_TARGETS and pass that file list to pa11y. This is where the reduction in pa11y scan times on most pull requests comes from.
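The encode and decode round trip can be illustrated with a short sketch. The real pipeline does this in shell; the Python below only shows the idea, and the function names are hypothetical. Base64 is presumably used because it turns a multi-line file list into a single line that is safe to place in an `export` statement.

```python
import base64


def encode_targets(text):
    """Collapse a multi-line target list into one ASCII-safe line."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def export_line(encoded, name="PA11Y_TARGETS"):
    """Build the line a step would append to the file that $BASH_ENV points at."""
    return f'export {name}="{encoded}"'


def decode_targets(encoded):
    """Recover the list of targets in a later step, skipping blank lines."""
    text = base64.b64decode(encoded).decode("utf-8")
    return [line for line in text.splitlines() if line.strip()]
```

A later step would read the variable back from its environment and hand the decoded list to the scanner.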

Full scan (nightly)

A full pa11y scan is performed against main every morning at 4:58am ET. The CircleCI config reads pipeline.trigger_source to decide whether to do a full scan. If pipeline.trigger_source is schedule, a full pa11y scan using the sitemap is run, which takes around 25 minutes as of June 2024. During the full scan, pa11y reports each URL it scans on stdout and reports any errors on stderr. A tee command duplicates the stderr output into a file named pa11y-errors for sending to GitHub. When pa11y finishes without errors it exits with a status code of 0; if there are errors it exits with a status code of 2. That status code is used to conditionally call the GitHub API to report the pa11y errors.
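The control flow described here (pick a scan type from the trigger, mirror stderr into a file, and report only on exit status 2) can be sketched as below. This is a Python illustration of the shell logic, not the project's script. The function names are hypothetical, and the `"schedule"` value and the status code 2 are taken from the text above rather than verified against the CircleCI config or pa11y.

```python
import subprocess
import sys

PA11Y_ERROR_EXIT = 2  # status the text above attributes to a scan with errors


def scan_mode(trigger_source):
    """Map pipeline.trigger_source to a scan type, per the values named above."""
    return "full" if trigger_source == "schedule" else "targeted"


def run_scan(command, error_log="pa11y-errors"):
    """Run command, copying its stderr to both our stderr and error_log.

    Returns the command's exit status. stdout is left untouched, so the
    per-URL progress output still reaches the CI log.
    """
    proc = subprocess.Popen(command, stderr=subprocess.PIPE, text=True)
    with open(error_log, "w", encoding="utf-8") as log:
        for line in proc.stderr:
            sys.stderr.write(line)  # keep errors visible in the job output
            log.write(line)         # and keep a copy for the GitHub report
    return proc.wait()


def should_report(exit_code):
    """Only an error exit from the scanner triggers the GitHub API call."""
    return exit_code == PA11Y_ERROR_EXIT
```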

GitHub API

At the end of the nightly pa11y scan, if errors are detected, a POST is made using cURL to https://api.github.com/repos/18f/18f.gsa.gov/dispatches. The POST sends a JSON body to GitHub. The JSON is built with the jq utility and includes the base64 encoded contents of the pa11y-errors file in the client_payload key. This POST triggers a GitHub action that creates a new issue.
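For readers who want to see the shape of that request, here is a minimal Python sketch that builds it without sending it. The pipeline itself uses cURL and jq. The `event_type` value and the `errors` key inside `client_payload` are placeholders invented for this example, since the page does not state the real names.

```python
import base64
import json
import urllib.request

DISPATCH_URL = "https://api.github.com/repos/18f/18f.gsa.gov/dispatches"


def build_dispatch_request(errors, token, event_type="pa11y-errors"):
    """Build (but do not send) a repository_dispatch POST request.

    errors     -- raw text of the pa11y-errors file
    token      -- personal access token with contents:write on the repo
    event_type -- hypothetical event name; the workflow must listen for it
    """
    payload = {
        "event_type": event_type,
        # "errors" is an assumed key name inside client_payload.
        "client_payload": {
            "errors": base64.b64encode(errors.encode("utf-8")).decode("ascii"),
        },
    }
    return urllib.request.Request(
        DISPATCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the payload easy to inspect without a real token.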

GitHub API Authentication Token (Personal Access Token)

The authentication token used to call GitHub is stored as a CircleCI environment variable named GITHUB_TOKEN. This token is a fine-grained personal access token created by Caley Woods ([email protected]) that has contents:write access granted only to the 18f/18f.gsa.gov repository. The resource owner is the 18F organization, but the token has no permissions on the 18F GitHub organization itself.

Token compromise remediation

Reach out to (insert principal engineer who owns the token here, temporarily it's Caley Woods [email protected]) as well as the GitHub admins in #admins-github on Slack to have the token revoked. If an incident has taken place where the token was used maliciously, follow the security incident portion of the TTS handbook.

Token creation, lifetime, renewal, and approval

The token is generated with the maximum lifetime of one year. To create a new token:

  1. Visit the personal access tokens area of your GitHub settings and click "Generate new token".
  2. Select "Custom" from the Expiration field, then use the date selector to set the date one year into the future.
  3. Under "Resource owner", select 18F, then write a brief justification describing what this token does and why it's needed. The token has to be approved by the GitHub admins before it can be used.
  4. Under "Repository access", select "Only select repositories" and pick 18F/18f.gsa.gov from the dropdown.
  5. Under "Permissions", click "Repository permission" to expand the section, scroll down to "Contents", and set its access level to Read and write. Write access to Contents is required; without it, the GitHub API returns an error saying that the token does not have access to the repo.
  6. Scroll down to the "Overview" area at the bottom of the page and verify that the token has read and write access to Contents, as well as read access to Metadata, which GitHub applies automatically. The token should have zero organization permissions.

After the token is created, copy its value from the GitHub UI and replace the GITHUB_TOKEN environment variable for the 18f.gsa.gov project within CircleCI. As long as the personal access token has been approved, no other changes are required. A token that is still awaiting approval shows a "Pending" badge in the tokens list.

Note: GitHub does allow you to regenerate a personal access token and extend its duration, but when we tested this, a "bad credentials" error occurred after the token had expired and was regenerated. For this reason, we recommend creating a new personal access token and updating CircleCI with it when the old token is approaching its expiration date.

GitHub action

The new-issue GitHub action is triggered by the repository_dispatch event, which is fired by the call to the /dispatches endpoint described in the GitHub API step above. The action receives the base64 encoded error output from pa11y, decodes it, and creates a new GitHub issue with the pa11y error output in the issue body.
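The receiving side can be sketched in the same spirit. The snippet below is a Python illustration of the decode-and-create step, not the workflow's actual code. The `errors` key and the issue title are assumptions made for the example, while `title` and `body` are the fields GitHub's create-issue endpoint accepts.

```python
import base64


def issue_from_dispatch(client_payload, key="errors"):
    """Turn a repository_dispatch client_payload into create-issue fields.

    key is the assumed name under which the base64 encoded pa11y output
    was sent; the title text is a placeholder.
    """
    report = base64.b64decode(client_payload[key]).decode("utf-8")
    return {
        "title": "Nightly pa11y scan found errors",
        "body": report,
    }
```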

Miscellaneous

The other value of pipeline.trigger_source is webhook. These are events pushed to CircleCI from GitHub when commits are made to a branch with a pull request. This trigger causes the Jekyll changed-files logic to be followed.