B2: Git & GitHub Essentials
Version control for economists — because ‘final_v2_REAL_final.do’ is not a system
Learning Objectives
By the end of this module, you should be able to:
- Explain why version control matters for reproducible economics research
- Describe Git’s mental model — snapshots, staging, and commits — and relate it to your existing workflow
- Execute the core Git workflow: clone, status, add, commit, push, pull
- Read a diff and a commit history to understand what changed and when
- Recover from common mistakes without panic
The Problem You Already Have
You’ve done this. Maybe not today, but you’ve done it:
analysis.do
analysis_v2.do
analysis_v2_final.do
analysis_v2_final_REAL.do
analysis_v2_final_REAL_oct12.do
analysis_v2_final_REAL_oct12_after_meeting.do
Or worse — you renamed the file, overwrote something, and now you’re not sure which version produced the tables in your paper. Your co-author emailed you analysis_FINAL.do and you’re not sure if it’s newer than yours. You both edited the same file last week and now you have to manually compare 400 lines of Stata code.
This is not a personal failing. This is a systems problem. And it has a systems solution.
Think of your current file management as a missing data problem. Without version control, you’ve lost the full history of your analysis pipeline. You have the final dataset but no audit trail — no way to verify which transformations were applied, in what order, by whom. Git is the metadata you wish you’d been collecting all along.
Why economists should especially care
- Reproducibility: Journals increasingly require replication packages. Version control makes this trivial instead of painful.
- Collaboration: Research is co-authored. Git solves the “who has the latest version?” problem.
- Audit trail: If a referee asks “when did you add that control variable?”, you can answer precisely — with a timestamp and a note explaining why.
- Mistake recovery: Deleted a critical chunk of code? Changed something that broke your results? Git lets you go back.
Version control is not a programming tool that economists happen to use. It is a research integrity tool that every empirical researcher needs — just like a lab notebook in the sciences.
The Mental Model: Snapshots, Not Track Changes
If you’ve used “Track Changes” in Word, forget that model. Git works differently.
Git takes snapshots of your entire project at specific points in time. Each snapshot (called a commit) is a complete picture of every file in your project at that moment. You decide when to take each snapshot, and you attach a message describing what changed and why.
Think of it like this:
| Concept | Analogy |
|---|---|
| Repository (repo) | A project folder that Git is tracking |
| Commit | A checkpoint save — a snapshot of your project at a moment in time |
| Commit message | A lab notebook entry — “Added state fixed effects to main specification” |
| Staging area | A holding zone where you decide which changes go into the next snapshot |
| Remote (GitHub) | A backup copy of your project that lives online and can be shared |
A commit is like a dated entry in a research log. You wouldn’t write “changed stuff” in a lab notebook — you’d write “re-ran regressions with clustered standard errors at the state level per referee suggestion.” Same idea here. Every commit is a documented, recoverable checkpoint.
Why the staging area matters
The staging area is Git’s most confusing concept for beginners, and also one of its most useful features. Here’s why it exists:
Say you’ve been working for an hour. You fixed a bug in your data cleaning code AND started experimenting with a new specification. These are two separate changes with two separate purposes. The staging area lets you commit them separately, with separate messages, even though you made both changes in the same work session.
You choose what goes into each snapshot. This keeps your history clean and meaningful.
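To make the scenario above concrete, here is a minimal sketch of one work session split into two commits. It assumes a throwaway repository created with mktemp; the file names, file contents, and the "Demo User" identity are placeholders, not part of any real project.

```shell
# Throwaway repository; path, file names, and identity are placeholders.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

# Starting point: two scripts already committed.
mkdir analysis
echo "* data cleaning" > analysis/01_clean_data.do
echo "* regressions" > analysis/02_regressions.do
git add analysis
git commit -q -m "Initial commit: cleaning and regression scripts"

# One work session, two unrelated edits.
echo "replace income_clean = . if income < 0" >> analysis/01_clean_data.do
echo "* experimental: state fixed effects" >> analysis/02_regressions.do

# Snapshot 1: stage and commit only the bug fix.
git add analysis/01_clean_data.do
git commit -q -m "Fix negative income handling in cleaning script"

# Snapshot 2: stage and commit only the new specification.
git add analysis/02_regressions.do
git commit -q -m "Start state fixed effects specification"

# Three commits in total, each touching only what its message describes.
git log --oneline
```

Because each commit contains one logical change, either one can later be inspected or undone without disturbing the other.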
Setting Up Git
Before using Git, you need to tell it who you are. Open your terminal and run:
git config --global user.name "Your Name"
git config --global user.email "your.email@uvm.edu"
This tags your commits with your identity, which matters when you’re collaborating and need to know who changed what.
You only need to do this once per computer. Git remembers these settings. If you’re working on a shared computing cluster, you may need to do it once per account.
To verify your setup:
git config --list
You should see your name and email in the output.
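If the full listing is long, you can ask for one value at a time with --get. The sketch below is an illustration under assumptions: it uses a scratch repository and repository-level settings (no --global) so that running it does not overwrite your real configuration. The name and email are the same placeholders used above.

```shell
# Scratch repository so the demo does not overwrite your global settings.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Without --global, a setting applies to this repository only.
git config user.name "Your Name"
git config user.email "your.email@uvm.edu"

# --get prints a single value instead of the full --list output.
name="$(git config --get user.name)"
email="$(git config --get user.email)"
echo "$name <$email>"   # prints: Your Name <your.email@uvm.edu>
```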
The Core Workflow
Here’s the workflow you’ll use 95% of the time. We’ll walk through each step.
clone → [edit files] → status → add → commit → push
              ↑                                 │
              └──────────── pull ←──────────────┘
Step 1: Clone — Get a copy of a project
git clone downloads a repository from GitHub to your computer.
git clone https://github.com/username/my-econ-project.git
This creates a folder called my-econ-project/ with all the files and their full history.
Cloning into 'my-econ-project'...
remote: Enumerating objects: 47, done.
remote: Counting objects: 100% (47/47), done.
remote: Compressing objects: 100% (31/31), done.
Receiving objects: 100% (47/47), 12.03 KiB | 6.02 MiB/s, done.
You only clone once. After that, you use pull to get updates.
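You can rehearse cloning without a GitHub account. The sketch below assumes a small local repository (the hypothetical source-project) standing in for the GitHub URL; git clone accepts a local path the same way it accepts an https address.

```shell
work="$(mktemp -d)"
cd "$work"

# Stand-in for the project on GitHub: a small repository with one commit.
git init -q source-project
git -C source-project config user.name "Demo User"
git -C source-project config user.email "demo@example.com"
echo "* data cleaning" > source-project/01_clean_data.do
git -C source-project add 01_clean_data.do
git -C source-project commit -q -m "Initial commit: data cleaning pipeline"

# Clone it, the same way you would with an https://github.com/... URL.
git clone -q source-project my-econ-project

# The clone has the files and the full history.
ls my-econ-project
git -C my-econ-project log --oneline

# "origin" records where the project was cloned from; pull and push use it.
git -C my-econ-project remote -v
```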
Step 2: Check status — What has changed?
After editing files, git status tells you what’s different since the last commit.
git status
On branch main
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
modified: analysis/01_clean_data.do
Untracked files:
(use "git add <file>..." to include in what will be committed)
analysis/02_regressions.do
no changes added to commit (use "git add" to add)
This tells you:
- 01_clean_data.do was modified (Git knows about it, you changed it)
- 02_regressions.do is new (Git hasn’t seen it before, so it is “untracked”)
- Nothing is staged yet (nothing is ready to be committed)
Run git status often. It’s free, it never changes anything, and it’s the fastest way to orient yourself. Think of it as looking at your desk before you start working — just seeing what’s there.
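Once the long output feels familiar, the --short flag gives the same information in one line per file. A minimal sketch, assuming a scratch repository that reproduces the situation above (one modified tracked file, one new file):

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

mkdir analysis
echo "* data cleaning" > analysis/01_clean_data.do
git add analysis/01_clean_data.do
git commit -q -m "Initial commit: data cleaning pipeline"

# Edit a tracked file and create a new, untracked one.
echo "gen income_clean = income" >> analysis/01_clean_data.do
echo "* regressions" > analysis/02_regressions.do

# Compact status: " M" = modified but not staged, "??" = untracked.
short_status="$(git status --short)"
echo "$short_status"
# prints:
#  M analysis/01_clean_data.do
# ?? analysis/02_regressions.do
```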
Step 3: Add — Stage your changes
git add moves changes into the staging area — the “ready to commit” zone.
# Stage a specific file
git add analysis/01_clean_data.do
# Stage multiple specific files
git add analysis/01_clean_data.do analysis/02_regressions.do
# Stage all changes (use cautiously)
git add .
After staging, git status looks different:
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
modified: analysis/01_clean_data.do
new file: analysis/02_regressions.do
The files are now in the staging area, ready for a snapshot.
git add .
git add . stages everything. This includes data files, log files, temporary files, and anything else in the folder. For economics projects, you almost never want to commit your datasets to Git (they’re too large and should live elsewhere). Stage specific files by name until you’re comfortable with .gitignore files, which tell Git what to ignore.
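For a first look at how a .gitignore file changes what git add . picks up, here is a small sketch. The file names (survey.dta, 01_clean_data.log) and the two patterns are illustrative assumptions; a real project would choose patterns to match its own data and output files.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q

# A script worth tracking, plus a dataset and a log that are not.
echo "* data cleaning" > 01_clean_data.do
echo "placeholder for a large dataset" > survey.dta
echo "placeholder for Stata output" > 01_clean_data.log

# Each line of .gitignore is a pattern for files Git should skip.
printf '%s\n' '*.dta' '*.log' > .gitignore

# With the ignore rules in place, "git add ." leaves the data and log alone.
git add .
staged="$(git diff --staged --name-only)"
echo "$staged"
# prints:
# .gitignore
# 01_clean_data.do
```

The .gitignore file itself is staged, which is usually what you want: committing it shares the ignore rules with your co-authors.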
Step 4: Commit — Take the snapshot
git commit creates the snapshot and attaches your message.
git commit -m "Add state fixed effects to main regression specification"
[main a3b8f2d] Add state fixed effects to main regression specification
2 files changed, 34 insertions(+), 12 deletions(-)
create mode 100644 analysis/02_regressions.do
The commit now exists in your local history. It hasn’t gone anywhere yet.
What makes a good commit message?
Your commit message is the “why” behind the change. Future you (and your co-authors) will read these.
| Bad | Good |
|---|---|
| “updated file” | “Fix missing value handling in income variable” |
| “changes” | “Add robustness check: drop top 1% of earnings” |
| “final version” | “Revise tables for R&R — add county fixed effects” |
| “stuff” | “Clean ACS data: harmonize occupation codes across years” |
Rules of thumb:
- Start with a verb (Add, Fix, Update, Remove, Refactor)
- Be specific enough that you could find this commit later
- Keep it under ~72 characters for the first line
- If you need more detail, leave a blank line and write a longer description
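One way to write a subject line plus a longer description from the command line is to pass -m twice: Git joins the two with a blank line. The sketch below assumes a scratch repository, and the body text is an invented illustration of the kind of detail you might record.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "* regressions" > 02_regressions.do
git add 02_regressions.do

# First -m is the subject line; the second -m becomes the body,
# separated from the subject by a blank line.
git commit -q \
  -m "Add state fixed effects to main specification" \
  -m "Referee asked whether results survive state-level controls. Standard errors are clustered by state."

# %s prints the subject, %b prints the body.
subject="$(git log -1 --format=%s)"
body="$(git log -1 --format=%b)"
echo "$subject"
echo "$body"
```

Running git commit without -m opens your text editor instead, which is more comfortable for longer descriptions.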
Step 5: Push — Send your commits to GitHub
git push uploads your local commits to the remote repository on GitHub.
git push
Enumerating objects: 7, done.
Counting objects: 100% (7/7), done.
Delta compression using up to 8 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 512 bytes | 512.00 KiB/s, done.
Total 4 (delta 2), reused 0 (delta 0)
To https://github.com/username/my-econ-project.git
f4a21c8..a3b8f2d main -> main
Now your changes are backed up online and visible to collaborators.
Step 6: Pull — Get your collaborator’s changes
git pull downloads and integrates changes that others have pushed.
git pull
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
Updating f4a21c8..b7c9e31
Fast-forward
analysis/03_tables.do | 28 +++++++++++++++
1 file changed, 28 insertions(+)
create mode 100644 analysis/03_tables.do
Your co-author pushed a new file. Now you have it.
Think of push and pull like submitting and downloading files from a shared Dropbox — except Git tracks exactly what changed, when, and by whom. And if two people change the same file, Git will alert you instead of silently overwriting someone’s work.
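The push and pull round trip can be rehearsed entirely on one machine. This sketch rests on several assumptions: a local bare repository (remote.git) stands in for GitHub, and two folders named you and coauthor stand in for two people’s computers. All names are placeholders.

```shell
work="$(mktemp -d)"
cd "$work"

# Stand-in for GitHub: a bare repository (history only, no working files).
git init -q --bare remote.git

# You clone it (cloning an empty repository prints a harmless warning).
git clone -q remote.git you 2>/dev/null
git -C you config user.name "You"
git -C you config user.email "you@example.com"

# You commit the first script and push it.
echo "* data cleaning" > you/01_clean_data.do
git -C you add 01_clean_data.do
git -C you commit -q -m "Add data cleaning script"
git -C you push -q -u origin HEAD   # -u links your branch to the remote one

# Your co-author clones, adds a tables script, and pushes.
git clone -q remote.git coauthor
git -C coauthor config user.name "Coauthor"
git -C coauthor config user.email "coauthor@example.com"
echo "* tables" > coauthor/03_tables.do
git -C coauthor add 03_tables.do
git -C coauthor commit -q -m "Add tables script"
git -C coauthor push -q

# You pull, and the new file appears in your copy.
git -C you pull -q
ls you
git -C you log --oneline
```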
Reading a Diff: What Changed?
A diff shows the line-by-line differences between two versions of a file. Learning to read diffs is like learning to read regression output — it looks cryptic at first, then it becomes second nature.
git diff
diff --git a/analysis/01_clean_data.do b/analysis/01_clean_data.do
index 3a2b1c0..8f4e2d1 100644
--- a/analysis/01_clean_data.do
+++ b/analysis/01_clean_data.do
@@ -23,4 +23,5 @@
* Clean income variable
gen income_clean = income
replace income_clean = . if income < 0
-replace income_clean = . if income > 500000
+replace income_clean = . if income > 1000000
+label var income_clean "Annual income (cleaned, USD)"
Here’s how to read this:
- Lines starting with - were removed (the old version)
- Lines starting with + were added (the new version)
- Lines with no prefix are context: unchanged lines shown for reference
In this example, someone raised the upper income cutoff from $500k to $1M and added a variable label. Clear, traceable, reviewable.
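One detail that often surprises newcomers: plain git diff compares your working files with the staging area, so it goes quiet once a change is staged. The sketch below assumes a scratch repository and a one-line stand-in for the cleaning script; git diff --staged shows what the next commit will contain.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "replace income_clean = . if income > 500000" > 01_clean_data.do
git add 01_clean_data.do
git commit -q -m "Add income cleaning rule"

# Change the cutoff, as in the example above.
echo "replace income_clean = . if income > 1000000" > 01_clean_data.do

# Before staging: plain "git diff" shows the edit.
git diff

# After staging: plain "git diff" is empty, and --staged shows the edit.
git add 01_clean_data.do
unstaged="$(git diff --name-only)"
staged="$(git diff --staged --name-only)"
echo "unstaged: ${unstaged:-none}"   # prints: unstaged: none
echo "staged: $staged"               # prints: staged: 01_clean_data.do
```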
Reading commit history
git log --oneline
a3b8f2d Add state fixed effects to main regression specification
f4a21c8 Fix missing value handling in income variable
c1d9e2f Initial commit: data cleaning pipeline
Each line is a commit. Read from bottom to top — that’s the chronological order of your project’s development. The string on the left (e.g., a3b8f2d) is the commit’s unique ID. You can use it to inspect or revert to any past snapshot.
For more detail:
git log
This shows full commit messages, authors, and timestamps: your project’s complete lab notebook.
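Two variations on git log are handy for the referee-style question “when did this change?”. The sketch assumes a scratch repository that recreates the three-commit history shown above; the format string is one possible choice, not a required one.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "* data cleaning" > 01_clean_data.do
git add 01_clean_data.do
git commit -q -m "Initial commit: data cleaning pipeline"

echo "replace income_clean = . if income < 0" >> 01_clean_data.do
git add 01_clean_data.do
git commit -q -m "Fix missing value handling in income variable"

echo "* regressions" > 02_regressions.do
git add 02_regressions.do
git commit -q -m "Add state fixed effects to main regression specification"

# One line per commit, newest first: short ID, date, author, subject.
git log --date=short --format="%h  %ad  %an  %s"

# History of a single file ("--" separates options from file paths).
file_history="$(git log --oneline -- 01_clean_data.do)"
echo "$file_history"
```

The second command lists only the two commits that touched 01_clean_data.do and skips the one that added the regression script.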
GitHub: The Online Layer
Git is the version control system that runs on your computer. GitHub is a website that hosts Git repositories online. They are related but separate things.
| Feature | Git (local) | GitHub (online) |
|---|---|---|
| Where it lives | Your computer | github.com |
| What it does | Tracks changes, manages history | Hosts repos, enables collaboration |
| Works offline? | Yes | No |
| Required? | Yes (for version control) | No (but very useful) |
What GitHub gives you
- Backup: Your code exists somewhere other than your laptop
- Collaboration: Co-authors can push and pull changes
- Visibility: Browse code, history, and changes in a web browser
- README display: GitHub renders your project’s README file as a homepage — useful for replication packages
- Issue tracking: A lightweight task list for your project (“TODO: add robustness checks”, “BUG: inflation adjustment wrong for 2019”)
Exercise: Your First Git Workflow
This exercise walks you through the full workflow. You’ll need a terminal open and Git installed. (If you haven’t installed Git, see the setup note at the end of this module.)
Part 1: Clone and explore
- Clone a sample repository (your instructor will provide the URL, or use any public repo):
git clone https://github.com/username/econ-project.git
cd econ-project
- Look around:
git status
git log --oneline
Notice the clean history. Every change is documented.
Part 2: Create and commit a file
- Create a simple analysis script. Open a text editor and create a file called my_analysis.do (or my_analysis.R) in the project folder:
* My first version-controlled analysis
* Author: [Your Name]
* Date: [Today's Date]
display "Hello, version control!"
- Check what Git sees:
git status
You should see your new file listed as “untracked.”
- Stage and commit:
git add my_analysis.do
git commit -m "Add initial analysis script with header"
- Verify:
git log --oneline
Your commit is now at the top of the history.
Part 3: Make a change and commit again
- Edit my_analysis.do and add a line of code (anything):
* Calculate summary statistics
sysuse auto, clear
summarize price mpg weight
- See the diff:
git diff
You should see your new lines marked with +.
- Stage and commit with a descriptive message:
git add my_analysis.do
git commit -m "Add summary statistics for auto dataset"
- Check your history:
git log --oneline
You now have two commits of your own, each a recoverable checkpoint.
Part 4: Push (if you have a remote)
If your instructor has set up a remote repository for you:
git push
If not, that’s fine: everything you’ve done is tracked locally.
You created a file, committed it with a clear message, made a change, and committed that separately. You now have two snapshots of your project. You can go back to either one at any time. This is the entire core workflow — everything else is refinement.
When Things Go Wrong
Things will go wrong. That’s fine. Here are the most common situations and the simplest fixes.
“I forgot to pull before I started working”
git pull
If your changes don’t overlap with your collaborator’s, Git merges automatically. If they do overlap, Git will tell you there’s a conflict and ask you to resolve it. (Conflict resolution is beyond this module; for now, the fix is to communicate with your co-author about who’s editing which file.)
“I committed the wrong file”
If you committed a data file or a log file by accident and haven’t pushed yet:
git reset --soft HEAD~1
This undoes the last commit but keeps your changes staged. You can then unstage the file you don’t want and recommit:
git restore --staged big_dataset.dta
git commit -m "Add analysis script (without data file)"
“I want to see what a file looked like before”
git log --oneline my_analysis.do
git show a3b8f2d:my_analysis.do
This prints the file as it existed at that commit, without changing anything in your current version.
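Here is a small, self-contained sketch of looking up an old version. It assumes a scratch repository with two commits of a placeholder my_analysis.do; in your own project you would copy the commit ID from git log instead of capturing it in a variable.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo 'display "version 1"' > my_analysis.do
git add my_analysis.do
git commit -q -m "Add initial analysis script"
first_id="$(git rev-parse --short HEAD)"   # short ID of the first commit

echo 'display "version 2"' > my_analysis.do
git add my_analysis.do
git commit -q -m "Update greeting"

# Print the file as of the first commit; the working copy is untouched.
old_version="$(git show "$first_id:my_analysis.do")"
echo "old: $old_version"          # prints: old: display "version 1"
echo "now: $(cat my_analysis.do)" # prints: now: display "version 2"
```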
“I want to undo changes I haven’t committed yet”
If you’ve been editing a file and want to go back to the last committed version:
git restore my_analysis.do
git restore throws away your uncommitted changes. They’re gone. This is why committing frequently is a good idea: committed changes can always be recovered. Uncommitted changes cannot.
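Because this command is destructive, it is worth trying once where nothing is at stake. A minimal sketch, assuming a scratch repository and a placeholder script (git restore needs Git 2.23 or newer):

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "summarize price mpg weight" > my_analysis.do
git add my_analysis.do
git commit -q -m "Add summary statistics"

# An edit you regret, not yet committed.
echo "drop _all" >> my_analysis.do
git status --short     # prints:  M my_analysis.do

# Discard it: the file returns to the last committed version.
git restore my_analysis.do
cat my_analysis.do     # prints: summarize price mpg weight
```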
“I have no idea what’s going on”
git status
Always start here. Read the output carefully: Git usually tells you exactly what’s happening and suggests what to do next.
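Returning to the “I committed the wrong file” recipe from earlier in this section, here is the whole sequence rehearsed in a throwaway repository, with git status as the final check. The file names and contents are placeholders; the key assumption is that the bad commit has not been pushed yet.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "Replication files" > README.md
git add README.md
git commit -q -m "Initial commit"

# The mistake: "git add ." sweeps the dataset into the commit.
echo "* analysis" > my_analysis.do
echo "placeholder for a large dataset" > big_dataset.dta
git add .
git commit -q -m "Add analysis script"

# Undo the commit but keep both files staged.
git reset --soft HEAD~1

# Unstage the dataset, then commit again without it.
git restore --staged big_dataset.dta
git commit -q -m "Add analysis script (without data file)"

# The dataset is still on disk, just no longer part of the history.
git log --oneline
git status --short     # prints: ?? big_dataset.dta
```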
What We’re Not Covering (And Why)
Git is a deep tool. We’ve covered the core workflow that handles the vast majority of day-to-day research work. Here’s what we’re deliberately leaving out:
| Topic | What it is | Why we’re skipping it |
|---|---|---|
| Branches | Parallel versions of your project | Essential for teams, but adds complexity; covered in an advanced module |
| Merge conflicts | When two people edit the same lines | Requires branches to explain properly |
| Pull requests | A way to propose and review changes | A GitHub collaboration feature; builds on branches |
| Rebasing | Rewriting commit history | Advanced; easy to lose work if done wrong |
| .gitignore | Telling Git which files to skip | Useful but mechanical; easy to look up when you need it |
These are all real and useful. But the workflow in this module — clone, status, add, commit, push, pull — will carry you through your first year of version-controlled research. Add the other tools as you need them.
Discussion Questions
Think about a research project you’ve worked on (or are working on). What would a good set of commit messages look like for that project? How would having that history change how you work?
Many economics journals now require replication packages. How does version control change the process of creating a replication package — from something you build at the end to something you build as you go?
A colleague says “I use Dropbox, so I already have version control.” What’s right about this? What’s missing compared to Git?
What kinds of files should go in a Git repository, and what kinds should not? Think about a typical empirical economics project with .do files, .dta files, .csv files, .log files, .tex files, and PDFs.
Key Takeaways
- Version control solves the “final_v2_REAL_final” problem. Git keeps every version you commit, for every tracked file. You never need to rename files to save old versions.
- The core workflow is five commands. status, add, commit, push, pull: that’s 95% of what you need day-to-day.
- Commit messages are your research log. Write them for future you. Be specific about what changed and why.
- Start now, not later. The best time to put a project under version control is at the beginning. The second best time is today.
This module works best with a live coding demo. Walk through the full workflow in real time — create a repo, make changes, commit, show the log. Students should follow along on their own machines.
Setup note: Students need Git installed before class. On macOS, Git ships with Xcode Command Line Tools (xcode-select --install). On Windows, install Git for Windows. Consider sending setup instructions a week before this module.
Pairing: This module pairs naturally with B1: Terminal Basics. Students who are comfortable in the terminal will find Git much less intimidating. If your students haven’t done B1, budget an extra 15 minutes for terminal orientation.
Common sticking points: (1) The staging area concept — use the “packing a suitcase” analogy (you choose what goes in before you zip it shut). (2) The difference between Git and GitHub — draw it on the board. (3) Authentication with GitHub — consider having students set up SSH keys or a personal access token before class.
Assessment idea: Have students submit a link to a GitHub repository with at least 5 meaningful commits showing the progression of an analysis. Grade the commit messages and the logical flow of the project, not just the final code.