Public Policy Analytics
2026-01-19
Problem: AI tools make it easy to complete code without understanding
Solution:
GitHub: Industry standard for version control and collaboration
Quarto: Modern approach to reproducible research and documentation
R: Powerful for spatial analysis and policy-focused statistics
These aren’t just “class tools” - they’re career tools:
Git: Version control system that tracks changes in files
Think of it as a complete "save history" for every file in your project
GitHub: Cloud hosting for Git repositories
Repository (repo): Folder containing your project files
Commit: Snapshot of your work at a point in time
Push: Send your changes to GitHub cloud
Pull: Get latest changes from GitHub cloud
Clone: Local copy of your repo
Your workflow each week:
This becomes second nature by mid-semester!
Quarto: Publishing system that combines:
Narrative text (Markdown)
Code (R and other languages)
Output (tables, figures, maps)
Into professional documents (HTML, PDF, Word, slides)
Reproducible research: anyone can re-run your code and rebuild the exact same document
Career relevance: reproducible reports are increasingly the standard in data-driven policy work
If you know R Markdown: Quarto will feel familiar - it's the next generation of the same idea
YAML header:
R code chunk:
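A minimal example of what these two pieces look like in a `.qmd` file (the title, author, and chunk contents here are placeholders):

````markdown
---
title: "Lab 0: Getting Started"
author: "Your Name"
format: html
---

## Analysis

```{r}
summary(cars)
```
````

The YAML header (between the `---` lines) controls document metadata and output format; code chunks (` ```{r} `) run R and embed the results in the rendered document.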
**Bold text**
*Italic text*
***Bold and italic***
`code text`
~~Strikethrough~~
Use headers to organize your analysis sections.
Essential for professional portfolios:
Always work within projects for:
"data/cars.csv" works for everyone
Professional standard - employers expect this!
We’ll work in projects all semester - builds good habits!
Step 1: Clone your GitHub repository - You can clone in GitHub Desktop or through the terminal in RStudio - GitHub Desktop is easier at first, but the terminal is faster once you learn how
Step 2: Open in RStudio
Open the .Rproj file
Organized structure from day one:
Professional habit:
Be consistent and descriptive:
Best practices for this course:
library(tidyverse)  # read_csv(), write_csv(), %>%, filter()
library(janitor)    # clean_names()
library(here)       # project-root-relative file paths

# Raw data (never edit these!)
raw_census <- read_csv(here("data", "raw", "acs_2022_philadelphia.csv"))
# Process and save cleaned versions
clean_census <- raw_census %>%
  clean_names() %>%
  filter(!is.na(median_income))
write_csv(clean_census, here("data", "processed", "acs_2022_clean.csv"))
# Use processed data in analysis
analysis_data <- read_csv(here("data", "processed", "acs_2022_clean.csv"))
Critical settings to change RIGHT NOW:
Tools → Global Options → General → Workspace:
Uncheck "Restore .RData into workspace at startup"
Set "Save workspace to .RData on exit" to "Never"
Without these changes: stale objects from old sessions reload silently and hide bugs
With these settings: every session starts clean, so your scripts must contain everything they need
Keep your environment clean:
Collection of packages designed for data science:
Tidyverse uses “tibbles” - enhanced data frames:
Smarter printing:
We’ll see the difference with our car data…
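To preview that difference, you can convert a plain data frame to a tibble and print both; a quick sketch with made-up numbers (we'll use the real car data in class):

```r
library(tibble)

# A plain data frame with many rows
df <- data.frame(price   = seq(10000, 40000, length.out = 100),
                 mileage = sample(5000:90000, 100))

tb <- as_tibble(df)

print(df)  # floods the console with all 100 rows
print(tb)  # shows the first 10 rows plus dimensions and column types
```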
We’ll use these constantly:
select() - choose columns
filter() - choose rows
mutate() - create new variables
summarize() - calculate statistics
group_by() - operate on groups
All dplyr functions follow the same pattern:
This consistency makes dplyr predictable and easy to learn!
# Rule 1: Data frame first
select(car_data, Manufacturer, Price)
filter(car_data, Price > 20000)
mutate(car_data, price_k = Price / 1000)
# Rule 2: Column names without quotes
select(car_data, Manufacturer, Model, Price) # Not "Manufacturer"
filter(car_data, Year >= 2020, Mileage < 50000)
# Rule 3: Always returns a new data frame
new_data <- select(car_data, Manufacturer, Price)
# car_data is unchanged, new_data contains selected columns
Pipes (%>%) are the magic of dplyr:
# The power of pipes - read as "then"
car_summary <- data %>%
  filter(`Year of manufacture` >= 2020) %>%        # Recent models only
  select(Manufacturer, Model, Price, Mileage) %>%  # Key variables
  mutate(price_k = Price / 1000) %>%               # Convert to thousands
  filter(Mileage < 50000) %>%                      # Low mileage cars
  group_by(Manufacturer) %>%                       # Group by brand
  summarize(                                       # Calculate statistics
    avg_price = mean(price_k, na.rm = TRUE),
    count = n()
  )
What is %>%?
Without pipes (nested functions):
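The nested version of the same pipeline looks like this; a sketch using a small made-up `data` tibble (hypothetical values) so it runs on its own. Notice you have to read it from the inside out:

```r
library(dplyr)

# Small stand-in for the course's car dataset (hypothetical values)
data <- tibble(
  Manufacturer = c("Toyota", "Honda", "Ford"),
  Model = c("Camry", "Civic", "F-150"),
  Price = c(25000, 22000, 35000),
  Mileage = c(30000, 45000, 10000),
  `Year of manufacture` = c(2021, 2022, 2019)
)

# Same five steps as the piped version, but nested
car_summary <- summarize(
  group_by(
    filter(
      mutate(
        select(
          filter(data, `Year of manufacture` >= 2020),
          Manufacturer, Model, Price, Mileage
        ),
        price_k = Price / 1000
      ),
      Mileage < 50000
    ),
    Manufacturer
  ),
  avg_price = mean(price_k, na.rm = TRUE),
  count = n()
)
```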
Alternative: create many intermediate objects
# Clutters your environment
recent_cars <- filter(data, `Year of manufacture` >= 2020)
key_vars <- select(recent_cars, Manufacturer, Model, Price, Mileage)
price_thousands <- mutate(key_vars, price_k = Price / 1000)
low_mileage <- filter(price_thousands, Mileage < 50000)
grouped_cars <- group_by(low_mileage, Manufacturer)
car_summary <- summarize(grouped_cars,
                         avg_price = mean(price_k, na.rm = TRUE),
                         count = n())
Problems: Lots of temporary objects, hard to follow the logic
Readable: Follow the logical flow of analysis
Efficient: No temporary objects cluttering environment
Debuggable: Easy to run line-by-line
Professional: Industry standard for data science
Read as:
“Take car_data, then filter for cars over $15,000, then select manufacturer and price columns, then group by manufacturer, then calculate average price”
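That sentence maps one-to-one onto code; a sketch using hypothetical car data:

```r
library(dplyr)

# Hypothetical car data (invented values)
car_data <- tibble(
  Manufacturer = c("Toyota", "Toyota", "Honda", "Ford"),
  Price = c(25000, 14000, 22000, 35000)
)

avg_by_maker <- car_data %>%           # Take car_data, then...
  filter(Price > 15000) %>%            # keep cars over $15,000, then...
  select(Manufacturer, Price) %>%      # keep those two columns, then...
  group_by(Manufacturer) %>%           # group by manufacturer, then...
  summarize(avg_price = mean(Price))   # calculate average price
```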
These functions work as a team:
group_by() - sets up grouping for subsequent operations
summarize() - collapses rows into summary statistics
Key insight: group_by() prepares the data, doesn't transform it yet
# Without grouping - one row of results
car_data %>%
  summarize(
    avg_price = mean(Price, na.rm = TRUE),
    total_cars = n()
  )
# Result: 1 row with overall averages
# With grouping - one row per group
car_data %>%
  group_by(Manufacturer) %>%
  summarize(
    avg_price = mean(Price, na.rm = TRUE),
    total_cars = n()
  )
# Result: 5 rows (one per manufacturer)
Original data (imagine this):
Manufacturer Price Mileage
Toyota 25000 30000
Toyota 28000 15000
Honda 22000 45000
Honda 30000 20000
Ford 35000 10000
After group_by(Manufacturer) %>% summarize(…):
Manufacturer avg_price total_cars
Toyota 26500 2
Honda 26000 2
Ford 35000 1
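You can reproduce this worked example directly (note that summarize() returns groups in alphabetical order, so Ford comes first in the output):

```r
library(dplyr)

# The five example rows from the table above
cars <- tibble(
  Manufacturer = c("Toyota", "Toyota", "Honda", "Honda", "Ford"),
  Price = c(25000, 28000, 22000, 30000, 35000),
  Mileage = c(30000, 15000, 45000, 20000, 10000)
)

summary_tbl <- cars %>%
  group_by(Manufacturer) %>%
  summarize(avg_price = mean(Price), total_cars = n())

summary_tbl
# Ford 35000 (1 car), Honda 26000 (2 cars), Toyota 26500 (2 cars)
```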
Essential summary functions:
car_data %>%
  group_by(Manufacturer) %>%
  summarize(
    count = n(),                              # Number of rows
    avg_price = mean(Price, na.rm = TRUE),    # Average
    med_price = median(Price, na.rm = TRUE),  # Median
    min_price = min(Price, na.rm = TRUE),     # Minimum
    max_price = max(Price, na.rm = TRUE),     # Maximum
    std_dev = sd(Price, na.rm = TRUE)         # Standard deviation
  )
Perfect for policy questions like:
Pattern: group_by(category) %>% summarize(metric)
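The same pattern applied to a policy-style question ("how does income vary by neighborhood?"); the column names, neighborhoods, and values here are invented for illustration:

```r
library(dplyr)

# Hypothetical census-tract data (invented values)
tracts <- tibble(
  neighborhood  = c("Center City", "Center City",
                    "North Philadelphia", "North Philadelphia"),
  median_income = c(78000, 82000, 34000, 31000)
)

income_by_hood <- tracts %>%
  group_by(neighborhood) %>%                         # category
  summarize(
    avg_income = mean(median_income, na.rm = TRUE),  # metric
    n_tracts   = n()
  )
```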
Tuesday Class: - New concepts and methods - Hands-on coding practice - Lab work with TA support
During Week: - Complete portfolio assignments - Weekly notes and reflection - Office hours for help
Focus on understanding, not perfect code:
Your GitHub portfolio will include:
By end of today:
Next: Portfolio setup + Lab 0
Remember: This is a learning process - ask for help when you need it!
[Switch to live demonstration of GitHub workflow]