How a Small Team Shrank a Microsoft Monorepo by 94%
A Microsoft monorepo ballooned to 150GB. The culprit wasn't code, but a Git hashing bug from 2006 that treated same-named files in different directories as the same file when computing diffs.
#1 about 2 minutes
The scale of Microsoft's monorepo problem
A monorepo with 20 million lines of code grew from a manageable 2GB to an unworkable 150GB, prompting an investigation into what was driving the growth.
#2 about 3 minutes
How automated changelog tooling bloated the repository
The versioning tool Beachball generated thousands of changelog files, causing a separate versioning branch to swell to 130GB.
#3 about 9 minutes
Discovering a Git hashing algorithm bug from 2006
A Git expert found that an old hashing algorithm only used the last 16 characters of a filename, causing collisions that prevented proper diffing of changelog files.
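To make the collision concrete, here is a minimal sketch of a key that only looks at the tail of a path. It is not Git's actual hash function: `suffix_key` is a hypothetical stand-in that keeps the last 16 characters, and the two package paths are invented for illustration.

```python
def suffix_key(path: str, width: int = 16) -> str:
    """Simplified stand-in for a name hash that only sees the end of a path.

    Assumption: Git's real hash does bit arithmetic on the characters;
    here we simply keep the last `width` characters to show the effect.
    """
    return path[-width:]


# Hypothetical changelog files in two unrelated packages.
button_log = "packages/ui-button/CHANGELOG.json"
icon_log = "packages/ui-icon/CHANGELOG.json"

# Different files, same key: the directory that distinguishes them
# falls outside the 16-character window.
print(suffix_key(button_log))
print(suffix_key(icon_log))
print(suffix_key(button_log) == suffix_key(icon_log))
```

With thousands of changelog files sharing the same file name, a key like this puts them all in one bucket, so a file's best diff partner (its own previous version) is easily missed.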
#4 about 4 minutes
Implementing the new path walk algorithm to fix Git
The solution was a new "Path Walk" algorithm for `git push` and `git repack` that uses the full file path to avoid hash collisions and ensure correct diffing.
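The idea of grouping by full path can be sketched as follows. This is an illustration under stated assumptions, not the Path Walk implementation: `group_by`, the object list, and the version labels are all hypothetical, and the 16-character suffix key is a simplified stand-in for the old hash.

```python
from collections import defaultdict


def group_by(objects, key):
    """Bucket (path, version) pairs by key(path); each bucket is one set of diff candidates."""
    groups = defaultdict(list)
    for path, version in objects:
        groups[key(path)].append((path, version))
    return dict(groups)


# Hypothetical history: two versions each of two unrelated changelog files.
objects = [
    ("packages/ui-button/CHANGELOG.json", "v1"),
    ("packages/ui-icon/CHANGELOG.json", "v1"),
    ("packages/ui-button/CHANGELOG.json", "v2"),
    ("packages/ui-icon/CHANGELOG.json", "v2"),
]

# Keying on the path tail mixes both files into a single bucket.
by_suffix = group_by(objects, lambda path: path[-16:])

# Keying on the full path keeps each file's versions together.
by_full_path = group_by(objects, lambda path: path)

print(len(by_suffix), "bucket(s) when keyed on the last 16 characters")
print(len(by_full_path), "bucket(s) when keyed on the full path")
```

Keyed on the full path, every bucket holds successive versions of one file, which is the situation where diffing is most effective.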
#5 about 2 minutes
Applying the fix with new Git config and repack commands
Developers can enable the new algorithm for pushes via a `git config` setting (`pack.usePathWalk`) and shrink local clones using the `git repack --path-walk` command.
#6 about 2 minutes
Using the new `git survey` command to find large files
A new built-in command, `git survey`, was created to help developers identify large files, blobs, and binaries in their repository history.
#7 about 3 minutes
Best practices for managing large repositories
Beyond the specific fix, general best practices like not checking in binaries and avoiding thousands of files in a single folder are crucial for repository health.
#8 about 6 minutes
The broader impact on the open source community
The new algorithm has shown significant size reductions for other large monorepos like Chromium, and the fix is being upstreamed to benefit the entire Git community.