🔥Let’s Do DevOps: Shard GitHub Actions Workload over Many Concurrent Builders

This blog series focuses on presenting complex DevOps projects as simple and approachable via plain language and lots of pictures. You can…

Apr 24, 2023

This blog series focuses on presenting complex DevOps projects as simple and approachable via plain language and lots of pictures. You can do it!

Hey all!

I’ve recently been building a tool I’m calling the GitHub Cop. It’s a tool that processes all 1.5k+ repos in my GitHub Org and sets all the settings, branch protections, auto-link rules, permissions, etc. That’s a lot of changes each night, and on every repo sequentially.

To do that, I’ve built several tools, including an API token circuit breaker to avoid running out of API tokens when doing tens of thousands of actions in a short time, and how to paginate API calls when processing thousands of attributes, but none of those processes help speed up the single-threaded processing of thousands of repos.

What would help is having several threads working concurrently. Thankfully, GitHub Actions makes this pretty easy to do! I did have to teach bash some arithmetic, but I was able to make it work.

Let’s talk about theory, then how I implemented this change in my custom bash program, and the limitations I faced and how I solved them.

GitHub Actions Theory: Matrix

Sharding (a word not recognized by my computer’s dictionary), is the practice of splitting a single workload into several parts, and then assigning each those work “shard” or fragments to one of several builders. It’s usually assumed that the workload would then be processed concurrently. It’s a method of reducing the total time a workload runs by splitting it across several runners.

Because this is a relatively foundational topic of computing, GitHub Actions makes this relatively easy using the matrix strategy. In math, a “matrix” means a rectangular array or table where numbers are stored and often combined in a predictable way.

In a GitHub Actions context, a matrix is values that are combined and handed off to the tasks you list on the job. If you have a single matrix value like attribute: [foo, bar], the job will be run twice, once where attribute is set to foo and once where attribute is set to bar.

You can take it further by having several variables, like the following example (taken from the GitHub Actions page on matrix strategy). This set of two values are combined in every pattern, which will kick off 6 jobs.

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters

Show hidden characters

	jobs:
	example_matrix:
	strategy:
	matrix:
	version: [10, 12, 14]
	os: [ubuntu-latest, windows-latest]

view raw matrix_example.yaml hosted with ❤ by GitHub

The jobs will be kicked off, one with each of these sets of matrix values:

{version: 10, os: ubuntu-latest}
{version: 10, os: windows-latest}
{version: 12, os: ubuntu-latest}
{version: 12, os: windows-latest}
{version: 14, os: ubuntu-latest}
{version: 14, os: windows-latest}

Update Action to Support Sharding

Sounds simple enough, right? Okay, maybe it sounds like a whole heap of nonsense. Let’s implement it together, and see if it makes more sense in practice. It sure clicked for me when I went and did it!

So for my GitHubCop before any sharding, the Action was really, really simple. On a nightly schedule, the job repo_cop runs, which downloads the code for this repo (line 10), and then runs a script from the repo called repoCop.sh (line 14) and injects a Secret, which is the GitHub PAT with admin rights, on line 13. Super duper simple, right?

Show hidden characters

	name: GitHub Repo Cop
	on:
	# Run nightly at 5a UTC / 11p CT
	schedule:
	- cron: "0 5 * * *"
	jobs:
	repo_cop:
	runs-on: ubuntu-latest
	steps:
	- uses: actions/checkout@v2
	- name: GitHub Repo Cop
	env:
	GITHUB_TOKEN: ${{ secrets.MY_GITHUB_TOKEN }}
	run: ./repoCop.sh

view raw repo_cop_no_sharding.yaml hosted with ❤ by GitHub

And we could update this to run as a matrix, however, there are some steps that don’t need to be run multiple times. The first thing the GitHub RepoCop does is get a list of all the Repos in the Org. You could have every single runner do this, but there’s just no reason to do that, so let’s separate that code.

I’m skipping over how I did that in my codebase — really if what you’ve built is all function-based, just move any functions for “prep the work list” to a separate file or command so it can be run first on a single builder. For my workload, that’s just “get all the repos”. Done.

I initially was going to use GitHub Actions Outputs to pass the variable for which Repos to work to my fleet of builders, but then I ran into this snippet:

Outputs are Unicode strings, and can be a maximum of 1 MB. The total of all outputs in a workflow run can be a maximum of 50 MB.

Given that I want to construct a list or array, and the output would be a unicode string, I decided to pass. I’m pretty sure that it would work? But I decided it’s safer and would lead to less headaches to write the contents of the variable to a file, and pass that binary file to the builders. To do that, two changes.

Change 1: In my application code, I write all the repos I want to process to a file called all_repos. That’s super duper easy:

Show hidden characters

	# Write Org repos to file
	echo "$ALL_REPOS" > all_repos

view raw write_repos_to_file.sh hosted with ❤ by GitHub

Change 2: I need to update my Action to have step 1 be only the “get all the repos” stuff, and then I need to stash the binary file for the next stage of builders. Here’s my new action file.

You can see on line 14, we’ve updated the script we’re calling to be one that just gets all the repos (and writes a file called all_repos to the root of the working directory). And then we have a new step (line 17) that uploads the file as an artifact. We use a computed name on line 20 for the github.run_id — a unique numerical identifier for the workflow run. That’ll make sure that this same artifact isn’t downloaded by a future run, and isn’t over-written either, in case we want to go back and check its contents. This value exists in the GitHub Context.

Show hidden characters

	name: GitHub Repo Cop
	on:
	# Run nightly at 5a UTC / 11p CT
	schedule:
	- cron: "0 5 * * *"
	jobs:
	repo_cop:
	runs-on: ubuntu-latest
	steps:
	- uses: actions/checkout@v2
	- name: GitHub Repo Cop
	env:
	GITHUB_TOKEN: ${{ secrets.MY_GITHUB_TOKEN }}
	run: ./repoCop_getRepos.sh
	# Action "Outputs" are unicode strings, so can't be a "list", or at least would be more complex to handle the data structure
	# Rather than deal with that, going to upload the binary list file, and will read in next step
	- name: Upload repos list
	uses: actions/upload-artifact@v3
	with:
	name: all-repos-${{ github.run_id }}
	path: all_repos

view raw repo_cop_get_all_repos.yaml hosted with ❤ by GitHub

But that’s only half of the story — we’ve now updated our Action from doing the work single-threaded, to only creating a list of repos to work and not doing the work. However, we’ve now eliminated all the single-threaded tasks and can start on the multi-threaded tasks.

Here’s the sharded version of the second step in our Action. Since we’re now ready to shard, we use a matrix strategy (line 5), with fail-fast disabled, so that if one of the shards has a catastrophic failure we keep going. And on line 8, we define a shard value of 1/2 or 2/2. The work “shard” is meaningless here — it’s a name we created for a variable that’ll be passed to our builders.

In order for our builders to decide what parts of the work to build, they need to understand two things — first, how many segments of work exist — that’s the denominator, or the value below the line in a fraction. If there are 2 segments of work, the number below the line is 2. Imagine a pie split into 2 pieces.

Second, the builder needs to know which segment it should be working. If there are two pieces of pie, should it work the first piece, or the second? That’s the numerator, or the number above the line in a fraction. Given that’s how I (a human, hi!) am understanding the issue, I decided to use that same metaphor to pass to the builders doing the work — so on line 8, we pass either shard = 1/2 or shard = 2/2 to the builders. Those 2 jobs are run concurrently.

Show hidden characters

	repo_cop:
	runs-on: ubuntu-latest
	timeout-minutes: 720 # 12 hours - Default is 360 minutes / 6 hours
	needs: repo_cop_get_repos # Require the repos list to be generated first
	strategy:
	fail-fast: false # Keep going even if one shard fails
	matrix:
	shard: [1/2, 2/2]

view raw repo_cop_sharded_1_2.yaml hosted with ❤ by GitHub

The steps in the second action start the same — we download the code (we have to since we’re running a script from our repo) on line 2, and then we deviate. Since the list of repos we’ll be running is stored as an artifact, both our concurrent builders go and fetch that artifact. We use the same github.run_id attribute as we used to store the artifact, so it’s for sure the same version of that file.

Then on line 9, we call our application code — a bash file called repoCop.sh, in this case, and we are again passing our Secret as an environment value, but we’re also doing something interesting — on line 12, we pass in an environment value SHARD as the value of matrix.shard. The GitHub Action will set that to 1/2 on the first builder and 2/2 on the second builder.

Show hidden characters

	steps:
	- uses: actions/checkout@v2
	- name: Download repos list
	id: download-repos
	uses: actions/download-artifact@v3
	with:
	name: all-repos-${{ github.run_id }}
	- name: GitHub Repo Cop
	run: ./repoCop.sh
	env:
	GITHUB_TOKEN: ${{ secrets.MY_GITHUB_TOKEN }}
	SHARD: ${{ matrix.shard }}

view raw github_cop_repos_action_2_2.yml hosted with ❤ by GitHub

And that’s it! At least for the Action. The application code still has to be taught to understand what’s going on with that value.

Some tools will have a native understanding of sharding. For instance, yarn is a browser testing tool, and it can front several back-end testing tools. If it’s fronting jest, another testing tool, it natively understands what sharding is, and you can pass it a shard value and a job total value, and it’ll do it. However, my bash script is written by me, and doesn’t yet understand that as a command line argument. So let’s make it do so.

run: yarn test --shard=${{ matrix.shard }}/${{ strategy.job-total }} --coverage

Theory: Updating Application to Support Sharding

Updating the Action to support sharding is only half the battle. It doesn’t inherently split the application workload in two — it only injects some variables into the workload. Your application code must also be updated in order to understand what those variables instruct it to do — what parts of the workload it should work on.

The way my application is written is a big while loop, which operates on a list of repositories. Here’s a simplified version. Basically, we are serially processing one repo at a time from the var ALL_REPOS, which is a return delimited list of the repositories we should inspect and update.

Show hidden characters

	while IFS=$'\n' read -r GH_REPO; do
	(Action stuff)
	done <<< "$ALL_REPOS"

view raw action_simple.sh hosted with ❤ by GitHub

It follows, then, that if we want builderA to do the first 50% of the list, and builderB to do the second 50%, we could break the ALL_REPOS list into 2 parts. We could probably do that as part of the Action as a separate step (and in retrospect, that would have been cleaner!), however I put that logic right into my application.

Let’s walk through the steps I follow for bash to understand fractional workloads, and how it’s implemented.

Practice: Updating Application to Support Sharding

At first I built this logic to split the list down a binary perfect 50%, but it turns out that often cuts the middle entry in half, whoops. So rather than that, let’s cut the line by a fractional 50% of the total. To do that, we need to know the total, so let’s count on line 5.

Show hidden characters

	# Read the list of repos into a var
	ALL_REPOS=$(cat all_repos)

	# Count lines in the list of repos
	ALL_REPOS_LENGTH=$(echo "$ALL_REPOS" \| awk 'NF' \| wc -l \| xargs)

view raw all_repos_length.sh hosted with ❤ by GitHub

Next we need to read in the SHARD value that we’re passing to our builders. Remember, we’re using SHARD values like “1/2” and “2/2”.

The Shard number we should be working is above the line — either the first half or the second half, line 2. and the total count of shards is the number below the line, line 5.

Show hidden characters

	# Shard number
	SHARD_NUMBER=$(echo "$SHARD" \| cut -d "/" -f 1)

	# Number of shards
	SHARD_COUNT=$(echo "$SHARD" \| cut -d "/" -f 2)

view raw shard_fraction_read.sh hosted with ❤ by GitHub

Since we now know we’re either working the first 1/2 or the second 1/2, we count up how many lines of the list that would equate to. For example, if we have 10 repo lines, the first half would be 1–5, and the second half would be 6–10, line 3.

On line 4 we handle an outlier — if there are 11 lines, bash match gets a little wonky — it doesn’t natively do floating point math, so half of 11 if 5.5, but bash would understand it as “5”. That would mean we’d miss 1 of our lines, that’s not great! So we add a number to our half. That will mean each builder will work 6 repos, with the middle repo worked twice. For my idempotent workflow, who cares. If your workload isn’t idempotent (you should fix it!), this is a more serious issue and you should solve it better.

Then on line 8, we use a native bash command called split that’s intended to split binary files, but can operate on lists. We tell it to put n number of lines per list, and read the all_repos file we’ve read into disk on this Action, and to output 2 lists, 1 that is the first half, and one that is the second half. They’re named like all_repos_00 and all_repos_01.

Show hidden characters

	# Number of lines per shard, add 1 to round up fractional shards
	# Lines aren't duplicated, last shard is slightly short of even with others, which is fine
	LINES_PER_SHARD=$((ALL_REPOS_LENGTH / SHARD_COUNT))
	LINES_PER_SHARD=$((LINES_PER_SHARD + 1))

	# Shard the list based on the shard fraction
	# This outputs files like all_repos_00, all_repos_01, all_repos_02, etc
	split -l $LINES_PER_SHARD -d all_repos all_repos_

view raw lines_per_shard_and_split.sh hosted with ❤ by GitHub

Since the split tool starts counting at 0, we subtract 1 from our “shard number”, the part of the pie we’re going to work , on line 2. Then on line 5 we read into memory the ALL_REPOS var from the specific file for this builder that it should work.

And finally, on line 9, we count all the repos we should work.

Show hidden characters

	# Select shard to work - minus one since `split` starts counting at 0
	SHARD_TO_WORK_ON=$(("$SHARD_NUMBER" - 1))

	# Read in the ALL_REPOS variable based on created shard file
	ALL_REPOS=$(cat all_repos_0${SHARD_TO_WORK_ON})

	# Count the repos
	ALL_REPOS_COUNT=$(echo "$ALL_REPOS" \| awk 'NF' \| wc -l \| xargs)

view raw shard_to_work.sh hosted with ❤ by GitHub

And boom, we’re done.

Summary

For this article, we’ve rewritten a single-threaded, monolithic Action to support multi-threading across Shards, and then updated our bash application code to do the same.

With both of those changes, we’re able to shard our workload across n runners. In this case, I’ve found that any more than 2 runners causes “secondary rate-limit” errors to pop up, so I decided to stick with two. It cuts the run of this huge Action down by about 40%, which is pretty cool.

I do still have a major issue — I’m using a PAT which only has 5k API action tokens per hour, and if I rewrite this Action to be an App, it can both be event-driven (new repo created, immediately update it), and I get 3x as many tokens (and more can be arranged by GitHub!). That’s probably what I’ll have for you next time.

Good luck out there!
kyler

Let's Do DevOps

Discussion about this post