🔥Let's Do DevOps: Inventory Dependency Licenses across all Repos in GitHub Org🔥
aka, SBOM-arama
This blog series focuses on presenting complex DevOps projects as simple and approachable via plain language and lots of pictures. You can do it!
Hey all!
I was asked by our Legal and Compliance teams to investigate whether all the dependencies in our software were licensed properly for us to use. And I confidently answered: I have absolutely no idea.
As with any enterprise, we have a process for reviewing dependencies to make sure their licenses and functionality meet our requirements, but that process has been enforced only intermittently in the past. And as big enterprises gobble up small companies via acquisition, we have to do our best to bootstrap all our processes onto the new product and team.
Validating that every dependency those teams have ever integrated into their tools is legally available for us to use can fall by the wayside. However, we're doing our best to find the problem dependencies so we can replace them with tools we are licensed to use!
First step: we need to find all of them. And GitHub is super helpful here - their Dependabot service already finds all the dependencies of a codebase by analyzing the package manager files in a repo, and GitHub exposes that same information via a per-repo SBOM API endpoint. So the data is there, we just have to go get it.
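To make the shape of that data concrete, here's a minimal sketch of hitting the SBOM endpoint for a single repo. YOUR_ORG and YOUR_REPO are placeholders, the token comes from an environment variable, and the exact fields returned can vary by repo:

```
# Minimal sketch: fetch one repo's SBOM and print each package's license
# YOUR_ORG / YOUR_REPO are placeholders - substitute your own values
import os
import requests

response = requests.get(
    url="https://api.github.com/repos/YOUR_ORG/YOUR_REPO/dependency-graph/sbom",
    headers={
        "Accept": "application/vnd.github.v3+json",
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"
    }
)

for package in response.json()['sbom']['packages']:
    # licenseConcluded may be missing when GitHub can't determine the license
    print(package['name'], package.get('licenseConcluded', 'Unknown'))
```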
I wrote a python script that finds all the Repos in an Org, iterates over each one to download its SBOM (via the REST API), and aggregates all the data into a single CSV. When run as a GitHub Action, that CSV is stored as an artifact that can be downloaded and used for other automation, or just analyzed with Excel/Google Sheets to find all the copyleft software. Here’s what a run looks like:
And you end up with a spreadsheet that looks like this:
In the Org I ran this in, we scanned 1.5k repos and found 53k dependencies, including a few dozen dependencies with licenses we shouldn’t be using, so we’re working our way through them.
What will you find in your own org? Here’s a link to the Action to test it out now if you want.
Let’s talk about how this Action is built, as well as how it works!
Imports and Validation
Let’s talk python. I normally use Bash (it’s just so quick to test stuff out!) but I’m trying to branch out into other languages. And given how much json we’re working with here, I’m pretty happy to not need to break out any jq. While jq is amazing, intuitive it is not.
So let’s do some imports - we need `requests` to talk to the http json API, we need `os` to read environment variables (like the GitHub Org and the PAT we’ll use for authentication), and we need `json` to read and filter the json API responses.
We also need `csv` to write out our CSV file of each repo’s dependencies and licenses, and we need `time` to implement our waiting functionality if our API wallet runs dry. We’ll talk about each of these in depth as we go.
```
# Imports
import requests
import os
import json
import csv
import time

###
## Introduction
###
print("##################################################")
print("Finding all repos' SBOMs and storing in CSV")
print("The CSV will be stored as an artifact in the GitHub Actions run")
print("##################################################")
```
Let’s Walk(through) This Python
We’re not going to walk through the file in the order the text appears, top to bottom, but rather logically - in the order the functions are called and what they do. That’ll help us follow the logic (I promise, follow me here).
First we check to make sure our variables are set. The function takes no inputs, and its only response is True if everything is present. We require `GITHUB_TOKEN` (which is the PAT) in order to make authenticated read calls to GitHub’s API, and we require `GITHUB_ORG` to know which GitHub Organization to target for reading. If all goes well, we return True on line 11.
And down in our program, we show how we call it. Basically, if the response of this function is True (which confirms those values are populated in environment variables), we read them and assign them to variables. This should never fail because we just confirmed they’re present, so there is no error handling.
```
# Function
def initial_var_validation():
    # Get env variable GITHUB_TOKEN, and if not present, exit
    if 'GITHUB_TOKEN' not in os.environ:
        print("GITHUB_TOKEN not found in environment variables")
        exit(1)
    # Get env variable GITHUB_ORG, and if not present, exit
    if 'GITHUB_ORG' not in os.environ:
        print("GITHUB_ORG not found in environment variables")
        exit(1)
    return True

# Calling it
if initial_var_validation() == True:
    GITHUB_ORG = os.environ['GITHUB_ORG']
    GITHUB_TOKEN = os.environ['GITHUB_TOKEN']
```
Next up is a helper function for checking the API token wallet. GitHub provides authenticated users 5k tokens per hour to make API calls. Reads are usually 1 token (but are sometimes a bit more for large responses). If we run out of API tokens, our script can fail in weird ways.
To avoid that, we’ll call this function several times before doing API operations to make sure our wallet has a sufficient budget to continue. If not, it sleeps for 60 seconds and checks again, and loops forever until we have tokens in our wallet again.
On line 3 we enter a forever loop. In programming 101 they say to never do this, but it’s useful here - we want to loop until we break. On line 4, we send a GET request to api.github.com/rate_limit (reference doc). On line 12 we check the http response code, and if it’s anything other than an http/200 (happy response), we exit.
We convert our response to a json package, and then read .rate.remaining to find our remaining API wallet token budget. Then we check to see if we have less than 100 tokens. If we do, we sleep for 60 seconds and check again. If we have 100 or more tokens, we break from this forever loop and go back to our real program.
```
# Check if hitting API rate-limiting
def hold_until_rate_limit_success():
    while True:
        response = requests.get(
            url="https://api.github.com/rate_limit",
            headers={
                "Accept": "application/vnd.github.v3+json",
                "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"
            }
        )

        if response.status_code != 200:
            print("Error fetching rate limit info")
            exit(1)
        rate_limit_info = response.json()
        remaining = rate_limit_info['rate']['remaining']
        if remaining < 100:
            print("ℹ️ We have less than 100 GitHub API rate-limit tokens left, sleeping for 1 minute and checking again")
            time.sleep(60)
        else:
            break
```
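For reference, the part of the /rate_limit response this function actually reads looks roughly like this - the values are illustrative, and the real payload includes several other resource categories:

```
# Abridged, illustrative example of the /rate_limit response body we parse
rate_limit_info = {
    "rate": {
        "limit": 5000,       # tokens per hour for an authenticated user
        "remaining": 4873,   # what we compare against our 100-token threshold
        "reset": 1717000000  # epoch time when the wallet refills
    }
}
print(rate_limit_info['rate']['remaining'])
```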
Next up we build the http headers. We end up using these a lot, so it makes sense to abstract them out. We establish a dictionary called `headers` on line 4, set the Accept header to GitHub’s json media type, and then set our PAT as a bearer token for Authorization on line 6. Then we return the headers and store them as… well, `headers`. Hey, names are hard.
```
# Define function
def build_headers():
    # Create headers for sbom request
    headers = {
        "Accept": "application/vnd.github.v3+json",
        "Authorization": "Bearer "+GITHUB_TOKEN,
    }
    return headers

# Build headers
headers = build_headers()
```
Next up, we are ready to start learning about the Organization. First of all, how many repos are we talking? This is relevant because we can only ask for 100 Repos at a time, and it’s not clear how many pages of Repos there are, so we have to do ✨math✨.
Line 4 - check our API token wallet.
Line 7 - Reach out to get the GitHub Org global info, making sure to send our authentication and other relevant headers on line 9. We check our response code on 13 to make sure we get a happy http/200 response.
On lines 18 and 19, we read the Private repo count (line 18) and the Public repo count (line 19), then add them together (line 20).
Functions can only return one thing, but that thing can be a complex object, so that’s exactly what we do. We build a dictionary that contains all that information and return it to the caller. But what’s the caller?
```
# Find count of all repos in org and store as var
def get_repo_count():
    # Get API token wallet
    hold_until_rate_limit_success()

    # Find how many repos exist in the Org
    org_info = requests.get(
        url="https://api.github.com/orgs/"+GITHUB_ORG,
        headers=headers
    )

    # Check response code, and if not 200, exit
    if org_info.status_code != 200:
        print("Error fetching org info")
        exit(1)

    # Store info
    PRIVATE_REPO_COUNT=org_info.json()['owned_private_repos']
    PUBLIC_REPO_COUNT=org_info.json()['public_repos']
    TOTAL_REPO_COUNT=PRIVATE_REPO_COUNT+PUBLIC_REPO_COUNT

    # Build dict of info and return
    d = dict()
    d['PRIVATE_REPO_COUNT'] = PRIVATE_REPO_COUNT
    d['PUBLIC_REPO_COUNT'] = PUBLIC_REPO_COUNT
    d['TOTAL_REPO_COUNT'] = TOTAL_REPO_COUNT
    return d
```
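Before we look at the real caller, here’s what the returned dictionary looks like on its own - the counts are made up, roughly matching the ~1.5k repos mentioned above:

```
# Illustrative example of the dict get_repo_count() returns - counts are invented
repo_count_info = get_repo_count()
print(repo_count_info)
# {'PRIVATE_REPO_COUNT': 1350, 'PUBLIC_REPO_COUNT': 150, 'TOTAL_REPO_COUNT': 1500}
```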
The caller is another function! We want to get all the repo names so that we can read their SBOMs, and now we know how many there are thanks to the above function. Woot. Okay, let’s do it.
This is a little intense, so let’s move fast.
Line 5 - check to make sure our API token wallet is full by calling the function we defined earlier.
Line 7 - call the above function to get a repo count.
Line 10 - establish a list to put all the repo names in.
Lines 16 and 17 - Set that we can fetch 100 repos per page (the max), then start iterating through the pages we’ll require, which is the total repo count divided by the per-page limit (+2 because integer division rounds down and range()’s end is exclusive, so this guarantees we request enough pages - see the worked example after the code below).
Line 21 - Get all the repos on whatever page we’re on, then on line 26 make sure the response code worked.
Line 32 - Iterate over every repo in the response, and check to see if the top-level keys “archived”, “disabled”, or “is_template” are set. If yes, skip it. If not, add the “name” attribute to the list!
Line 40 - Once the iteration has run over all pages and all repos, we kick back all the repo names to the caller, and on line 43 we see the caller. It stores them in the “repo_names” variable as a list of repo names.
```
# Define function
def get_all_repo_names():

    # Check API token wallet
    hold_until_rate_limit_success()

    repo_count_info = get_repo_count()

    # Can get 100 repos at a time, so need to loop over all repos
    repos = []

    # Announce
    print()
    print("Fetching all repos")

    per_page=100
    for i in range(1, repo_count_info["TOTAL_REPO_COUNT"]//100+2):
        print("Fetching repos page "+str(i))

        # Fetch all repos
        response = requests.get(
            url="https://api.github.com/orgs/"+GITHUB_ORG+"/repos?per_page="+str(per_page)+"&page="+str(i),
            headers=headers
        )

        # Check response code, and if not 200, exit
        if response.status_code != 200:
            print("Error fetching repos")
            exit(1)

        # Iterate over response, find all repos
        for repo in response.json():
            # If not archived, disabled, or template, append
            if repo["archived"] == False and repo["disabled"] == False and repo["is_template"] == False:
                repos.append(repo["name"])

    # Announce
    print()

    return repos

# Get all repo information
repo_names = get_all_repo_names()
```
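To see why the `//100+2` bound is enough, here’s a quick worked example of the page math, using an illustrative repo count close to the ~1.5k mentioned earlier:

```
# Worked example of the pagination math, assuming 1,503 repos (illustrative)
total_repo_count = 1503
per_page = 100

# 1503 // 100 = 15, so range(1, 15 + 2) = range(1, 17) requests pages 1 through 16
pages = list(range(1, total_repo_count // per_page + 2))
print(pages[0], pages[-1], len(pages))  # 1 16 16

# 16 pages * 100 repos per page = 1,600 slots, comfortably covering 1,503 repos;
# any trailing page just comes back shorter (or empty) and the loop handles that fine
```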
Next up we initialize our CSV by creating the file and writing the headers.
On line 3, we open a file called (github_org_name)_repo_dependency_licensing.csv in write mode (not “a” for append, so we’ll overwrite anything that already exists with that name - which doesn’t matter in a GitHub Action) as “file”.
On line 5, we create our writer and link it to that file.
On line 8, we create our CSV headers - org, repo, dependency_name, and license.
On line 9, we write that information to the CSV file.
And on line 12, we call the function, no inputs or outputs.
```
# Define function
def initialize_csv():
    with open(GITHUB_ORG+"_repo_dependency_licensing.csv", 'w', newline='') as file:
        # Initialize writer
        writer = csv.writer(file)

        # Write header
        field = ["org", "repo", "dependency_name", "license"]
        writer.writerow(field)

# Call function
initialize_csv()
```
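Once the appends from the next section start filling it in, the CSV looks something like this - the rows below are invented purely to show the shape:

```
org,repo,dependency_name,license
example-org,billing-service,requests,Apache-2.0
example-org,billing-service,left-pad,Unknown
example-org,marketing-site,readline,GPL-3.0
```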
Next up we’re going to start iterating over our repos and getting their dependencies. First let’s talk about the function to do that. It doesn’t tackle all the repos at once - it’s built to get a single Repo’s SBOM.
On line 2 we define our function and receive some info - we’re looking for the repo name (repo), the iteration index (index), and the count of all repos (repo_count).
On line 8, we store the URL we’ll send the request to, and on line 11 we do the request for a single repo (whose name is passed into this function). On line 17 we check our response code. It’s a little more complex here than in earlier functions - we establish some error handling to print the error message if we fail to retrieve the SBOM for any repo, so folks can diagnose it. This error handling also lets our script continue without globally failing if there’s an issue with a single repo.
This is getting long, so let’s split this function into two snippets, on to the next one!
```
# Define function
def get_repo_dependencies(repo, index, repo_count):

    # Check rate limit
    hold_until_rate_limit_success()

    # URL
    url = "https://api.github.com/repos/"+GITHUB_ORG+"/"+repo+"/dependency-graph/sbom"

    # Fetch sbom
    response = requests.get(
        url=url,
        headers=headers
    )

    # Check response code; on failure, print the error and return
    if response.status_code == 200:
        # Print green check box
        print("✅ Successfully fetched SBOM for repo", repo, "("+str(index)+"/"+str(repo_count)+")")
    else:
        print("❌ Error fetching SBOM for repo", repo, "("+str(index)+"/"+str(repo_count)+")")
        # Print error message
        print("Error message:", response.json()['message'])
        return
```
Okay, so we’ve got our response, and validated the http response code is happy to get here. Let’s read the SBOM and write the dependencies (and their licenses) to the CSV!
On line 2, let’s read through the SBOM json response at the sbom.packages path to get ALL the dependencies. There are a lot on most repos (our largest has 2.2k!).
On line 4 we check to see if the .licenseConcluded key is present. If not, the SBOM API isn’t sure what the package’s license is, so we mark it as Unknown.
On line 10, we check to see if the string “GPL” exists in the license string. GPL is the strongest copyleft license, and for a for-profit company it can spell trouble, so we print a little warning that it’s been found. More on copyleft licenses here.
Then on line 14 we open up our CSV again, this time in “a” mode for Append, so we just add info to the file. On line 19 we establish our field to write - our github Org, then the repo name, then the package name (since we’re iterating over each package one by one), and then the license for that package - and on line 20 we write the row.
```
    # Parse response by looping over sbom.packages to get all names and license types
    for package in response.json()['sbom']['packages']:
        # If license key not present, set to unknown
        if 'licenseConcluded' not in package:
            license = "Unknown"
        else:
            license = package['licenseConcluded']

        # If license contains string GPL, print out repo name
        if "GPL" in license.upper():
            print("- ⬅️ Copyleft licensed tool found:", package['name'], "with license:", license)

        # Write to CSV
        with open(GITHUB_ORG+"_repo_dependency_licensing.csv", 'a', newline='') as file:
            # Initialize writer
            writer = csv.writer(file)

            # Write data
            field = [GITHUB_ORG, repo, package['name'], license]
            writer.writerow(field)
```
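For context, a single entry in sbom.packages looks roughly like this - abridged and illustrative, since GitHub’s SPDX output includes more fields, and licenseConcluded is simply absent when the license can’t be determined:

```
# Abridged, illustrative example of one entry in sbom.packages
package = {
    "name": "npm:left-pad",
    "versionInfo": "1.3.0",
    "licenseConcluded": "MIT"  # missing entirely when GitHub can't determine it
}
```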
Finally, we call the get_repo_dependencies() function. It works on a single repo, so we loop over every repo (using an index, established on line 3 and incremented on line 6).
```
# Get dependencies for each repo, write to CSV
repo_count = len(repo_names)
index=1
for repo in repo_names:
    dependencies = get_repo_dependencies(repo, index, repo_count)
    index+=1
```
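Once the run finishes you don’t have to reach for Excel - a few lines of python can summarize the CSV too. A minimal sketch, assuming the script above has already produced the CSV in the working directory (substitute your own org name in the filename):

```
# Minimal sketch: summarize the generated CSV by license type
import csv
from collections import Counter

licenses = Counter()
with open("example-org_repo_dependency_licensing.csv", newline='') as file:  # substitute your org name
    for row in csv.DictReader(file):
        licenses[row['license']] += 1

# Print the most common licenses, flagging anything copyleft-looking
for license_name, count in licenses.most_common():
    flag = "⬅️ review me" if "GPL" in license_name.upper() else ""
    print(f"{count:6d}  {license_name} {flag}")
```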
Put it in a GitHub Action!
Okay, we have a functional python script that can read all the info we need and write it to a file! At least for everyone that has python3 installed, a virtual env set up, and all the required dependencies installed. Which means it’s kind of hard to use, right?
So let’s make it super easy to use - let’s put it in a GitHub Action so folks can call it easily and use it even if they don’t understand it.
This is a “composite” GitHub Action, and can be listed in the GitHub Marketplace (which I’ve done, here is the link).
On lines 1-2 we establish some basic stuff, like the name and description of this Action.
Then on line 4 we set up the inputs we’ll need to receive to operate - we only need two: the GitHub Org name and the GitHub Token (PAT). Both are set to required, as our script can’t succeed without them.
```
name: Org-Wide License Analyze
description: 'This github action is used to analyze the license of every repository in a github org'

inputs:
  GITHUB_ORG_NAME:
    description: 'The name of the github org to analyze'
    required: true
    type: string
  GITHUB_TOKEN:
    description: 'The github token to use for the analysis, needs all repo: permissions and org:read'
    required: true
    type: string
```
Next up we start building steps for our Action to run.
Line 4 - check out the repo, in case any files are required (they probably aren’t? Maybe this should be removed).
Line 6 - set up python 3.12 on the runner.
Line 10 - Install any dependencies we’ll require, using pip and referencing the requirements.txt file in the Action’s repo.
```
runs:
  using: "composite"
  steps:
    - uses: actions/checkout@v4

    - uses: actions/setup-python@v5
      with:
        python-version: '3.12'

    - name: Install dependencies
      shell: bash
      run: |
        python3 -m pip install --upgrade pip
        pip install -r ${{ github.action_path }}/requirements.txt
```
Next up we copy the python script to our working directory on line 4.
Then on line 6 we run our python3 script, setting the GitHub Org name and the GitHub Token as environment variables for it to read.
Then on line 13 we run an upload/artifact step that’ll collect our CSV and store it as an Artifact on this GitHub Action run for humans (or automation) to download and further process.
```
    - name: Copy the python script to the working directory
      shell: bash
      run: |
        cp ${{ github.action_path }}/org_get_all_repos_sbom.py .

    - name: Run License Analysis
      shell: bash
      run: python3 org_get_all_repos_sbom.py
      env:
        GITHUB_ORG: ${{ inputs.GITHUB_ORG_NAME }}
        GITHUB_TOKEN: ${{ inputs.GITHUB_TOKEN }}

    - name: Upload License Overview as Artifact
      uses: actions/upload-artifact@v4
      with:
        name: org-license-csv
        path: '*.csv'

branding:
  icon: 'arrow-down-circle'
  color: 'gray-dark'
```
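Wiring this into a consuming repo is just a small workflow. Here’s a rough sketch of what a caller might look like - the `uses:` path and the secret name are placeholders (grab the real action path from the Marketplace listing below), and the only required inputs are the two defined above:

```
# Illustrative caller workflow - the uses: path and secret name are placeholders
name: license-inventory
on:
  workflow_dispatch:

jobs:
  sbom-arama:
    runs-on: ubuntu-latest
    steps:
      - uses: your-github-user/org-wide-license-analyze@v1   # substitute the real action path
        with:
          GITHUB_ORG_NAME: your-org-name
          GITHUB_TOKEN: ${{ secrets.ORG_READ_PAT }}
```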
Links
Here’s a link to the code repo:
Here’s a link to the marketplace entry where you can use this yourself:
https://github.com/marketplace/actions/org-wide-license-analyze
Summary
In this write-up we talked about what an SBOM is, and how it’s useful to know the licensing of the dependencies our teams are using. We talked a little about how GitHub provides this information per repo, but doesn’t provide it for the entire Organization (at publication time, at least!).
We walked through how we can use python to fetch this information, how we’ll pass secure information to python, and some error handling that we’ve built in to handle edge cases. We also embedded the python script in a GitHub Action and listed it on the Marketplace so folks can easily consume this tool without needing to install or run anything locally on their system.
Hope it was an illuminating read. Thanks, y’all.
Good luck out there.
kyler