6 Working with Large Datasets
When your data doesn’t fit in memory
7 The Problem
Most social science workflows assume your data fits in memory. You load a CSV into R or pandas, run your analysis, and move on. This works until it doesn’t — and the threshold arrives faster than you’d expect. A 10GB administrative dataset will crash R on a laptop with 16GB of RAM. Four such datasets, each covering a different outcome or data source, and you’re managing 40GB+ of raw files that no standard tool can open.
This is increasingly common. Government administrative data (tax records, mortgage filings, Medicare claims, customs transactions), web-scraped corpora, satellite imagery extracts, and digitized historical records routinely exceed what a dataframe can hold. If your research question requires linking across multiple large sources, the problem compounds.
The traditional solutions are painful: buy more RAM, rent a cloud server, learn SQL, or spend weeks writing custom scripts to process files in chunks. Each requires specialized knowledge that most social scientists don’t have and shouldn’t need to acquire from scratch.
This is where Claude Code changes the equation. The data engineering knowledge hasn’t gotten easier — but you no longer need to acquire it yourself. You can describe what you need in plain English and let Claude handle the implementation. What previously required weeks of Stack Overflow research and trial-and-error can be done in hours (Goldsmith-Pinkham, Markus Academy, 2026).
This tutorial covers the workflow for getting large datasets under control: converting to efficient formats, building a queryable database, and making the whole thing self-documenting so Claude (and your collaborators) can work with it across sessions.
8 The Tool Stack
Two tools solve the large-data problem for most social science use cases:
Parquet is a file format. Where CSV stores data as plain text (every number is a string of characters), parquet stores data in a compressed binary format organized by columns. The practical difference: a 10GB CSV typically compresses to 1–2GB in parquet, depending on data type — categorical and string-heavy datasets (common in administrative data) compress the most, while numeric-heavy datasets see smaller gains. Reading a single column from a parquet file doesn’t require loading the entire file — the format supports lazy reading, where only the requested columns enter memory.
DuckDB is a database engine that runs inside your R or Python session — no server, no installation beyond a package, no configuration. It reads parquet files directly and executes SQL queries on datasets far larger than RAM. Think of it as SQLite for analytical workloads.
| Tool | What It Replaces | Why It’s Better |
|---|---|---|
| Parquet | CSV files | 5–15x smaller, column-selective reads, typed |
| DuckDB | Loading full datasets into R/Python | Queries 100GB+ without loading into memory |
You don’t need to learn either in depth. Claude Code knows both well. What you need is the workflow: when to use them and how to ask Claude to set them up.
Installing. In R: install.packages(c("arrow", "duckdb")). In Python: pip install pyarrow duckdb. Claude will handle this if you ask, but you can also install them in advance.
9 Step 1: Convert Raw Files to Parquet
Start with your raw data — typically CSVs downloaded from a government agency, data archive, or API. Don’t delete the originals. Convert them to parquet and work from the parquet versions going forward.
Prompt:
“I have [N] CSV files in data/raw/ totaling about [X] GB. Convert them all to parquet format and save to data/parquet/. Keep the original CSVs untouched. Report the size reduction.”
Claude will write a script that reads each CSV (in chunks if needed), converts to parquet with appropriate type inference, and saves the compressed version. Expect 5–15x compression depending on data type, with string-heavy administrative datasets at the high end.
Leading zeros disappear silently. When converting CSV to parquet, type inference can strip leading zeros from columns that look numeric but aren’t. FIPS county codes (06037 becomes 6037), ZIP codes (01234 becomes 1234), and similar identifiers are the most common victims. This breaks joins — "06037" won’t match "6037". Tell Claude to treat these columns as strings explicitly: “Convert to parquet but keep county_fips and zip_code as string columns, not integers.” Check identifier columns after conversion.
After conversion, your data/ folder looks like this:
```
data/
├── raw/           ← original CSVs (keep, but rarely touch again)
├── parquet/       ← compressed versions (your working files)
├── clean/         ← analysis-ready datasets
└── intermediate/  ← figure data, temp files
```
File format surprises. Government data files are not always what they claim to be. A file with a .csv extension might be pipe-delimited. A “fixed-width” file might have inconsistent column positions across years. Headers might appear on the second row, or not at all. If Claude’s first conversion attempt fails, have it inspect the first few lines of the raw file before retrying — the format documentation is often wrong.
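The inspection itself is trivial; a stdlib-only Python helper (the function name is ours):

```python
from itertools import islice

def peek(path, n=5):
    """Return up to the first n lines of a file, with decoding errors replaced,
    so you can see the actual delimiter and header placement."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return [line.rstrip("\n") for line in islice(f, n)]
```

Running peek() on a misbehaving ".csv" often reveals pipes, tabs, or a stray title row immediately.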
For large downloads, ask for resume capability. If you’re downloading multi-GB files from a government server, include “with resume capability” in your prompt. Claude will write a download script that can pick up where it left off if the connection drops, rather than starting over. This saves hours on unreliable connections.
Scripts, not ad-hoc commands. Claude sometimes takes shortcuts with small files — downloading them via one-off curl commands in the session rather than writing a reusable script. This is convenient but not reproducible: if you need to rebuild the pipeline from scratch, those ad-hoc commands are gone. Goldsmith-Pinkham flags this explicitly (Markus Academy, 2026): every step in the pipeline — including downloads — should be a script you can re-run, not a command you typed once in a session. If you see Claude downloading files interactively, ask it to write the download into a script instead.
10 Step 2: Harmonize Across Sources
If you’re working with multiple datasets — or the same dataset across years where the format changed — you need harmonization before building a database. This means aligning column names, coding schemes, and identifiers across sources.
This is where Claude Code earns its keep. Schema harmonization is tedious, error-prone, and exactly the kind of work Claude handles well: read two file headers, identify which columns correspond, build a crosswalk, and apply it. The judgment calls (is respondent_id the same as lei? did the coding of loan_purpose change in 2018?) require your input, but Claude does the detective work.
Prompt:
“I have data from [source] covering [years]. The format changed in [year]: column names, coding schemes, and identifiers all differ. Read the codebook PDFs in data/docs/ for both periods. Compare the schemas from a pre-change and post-change file. Build a crosswalk mapping old column names and codes to new ones. Flag anything ambiguous for me to decide.”
The key principle: Claude should surface ambiguities, not resolve them silently. If a coding change is ambiguous (did code 31 in the old format map to code 3 or code 2 in the new format?), you need to see it and decide. Add a CLAUDE.md rule if this matters for your project:
```
## Data Harmonization
- Flag ambiguous crosswalk mappings — do not resolve silently
- Test harmonization on sample years before processing the full dataset
```

If the data provider published codebooks or technical documentation describing format changes, give those to Claude alongside the data files. Claude can read PDFs — a codebook that explains why column names changed in 2018 produces a better crosswalk than schema comparison alone.
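Once the ambiguities are resolved, applying a crosswalk is mechanical. A pandas sketch under assumed names (resp_id, purp, and the code mappings are hypothetical); note that unmapped codes raise an error rather than passing through silently:

```python
import pandas as pd

# Hypothetical crosswalk: old column names -> new names, old codes -> new codes
COLUMN_MAP = {"resp_id": "respondent_id", "purp": "loan_purpose"}
CODE_MAP = {"loan_purpose": {"31": "3", "32": "3", "10": "1"}}

def harmonize(df: pd.DataFrame) -> pd.DataFrame:
    """Rename old-format columns and recode values using the crosswalk."""
    df = df.rename(columns=COLUMN_MAP)
    for col, mapping in CODE_MAP.items():
        unmapped = set(df[col].dropna().astype(str)) - set(mapping)
        if unmapped:
            # Surface ambiguities for a human decision; never resolve silently
            raise ValueError(f"Unmapped {col} codes: {sorted(unmapped)}")
        df[col] = df[col].astype(str).map(mapping)
    return df
```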
11 Step 3: Build a DuckDB Database
Once your parquet files are harmonized, load them into a single DuckDB database. This gives you one file that contains all your data, queryable with SQL.
Prompt:
“Build a DuckDB database at data/project_name.duckdb. Load all the harmonized parquet files into a single table called main_data. Also create any aggregated tables I’ll need for analysis — county-year panels, provider-level summaries, etc. Tell me the final row counts.”
Claude will write a build script that reads the parquet files, inserts them into DuckDB, creates indexes on your key identifiers (county FIPS codes, year, provider ID), and builds the aggregations you’ll need for analysis.
The result: one .duckdb file, typically a few GB, containing everything. You query it from R or Python without loading it into memory:
```r
library(duckdb)
con <- dbConnect(duckdb(), "data/project_name.duckdb")
county_panel <- dbGetQuery(con, "
  SELECT county_fips, year, SUM(amount) AS total_volume, COUNT(*) AS n_obs
  FROM main_data
  WHERE year BETWEEN 2010 AND 2020
  GROUP BY county_fips, year
")
```

The query returns only the aggregated result — a small dataframe — even though the underlying table has hundreds of millions of rows.
One writer at a time. DuckDB allows only one write connection to a database file at a time. If you have an R session connected to the database and try to run a second script that also writes to it, the second connection will fail. Close your connection (dbDisconnect(con)) before running build or update scripts, or open analysis connections as read-only: dbConnect(duckdb(), "data/project_name.duckdb", read_only = TRUE).
You don’t have to write SQL. The examples above use SQL because it’s the most direct way to query DuckDB. But if you’re more comfortable with dplyr, the dbplyr package lets you write dplyr code that gets translated to SQL behind the scenes — tbl(con, "main_data") |> filter(year >= 2010) |> group_by(county_fips) |> summarize(n = n()) works without writing a line of SQL. Claude can also translate between the two: “Rewrite this SQL query as dplyr code” or vice versa.
For very large data (50GB+), partition your parquet files. Instead of one parquet file per source, split by year (or another key variable): data/parquet/year=2010/data.parquet, data/parquet/year=2011/data.parquet, etc. DuckDB reads partitioned parquet natively and only loads the partitions your query touches. Ask Claude to “convert to partitioned parquet, partitioned by year” — it knows the convention.
12 Step 4: Add a Metadata Table
This is the step that makes the whole setup work across sessions and collaborators. It comes from Goldsmith-Pinkham’s “context engineering for data” concept: embed the documentation inside the database itself.
Prompt:
“Create a metadata table in the DuckDB database. For every column in every table, record: column name, description, data type, valid values (or range), and which years it’s available. I want to be able to run SELECT * FROM metadata and get full documentation.”
Here’s what a metadata table looks like in practice:
| table | column | type | description | valid_values | years |
|---|---|---|---|---|---|
| main_data | county_fips | VARCHAR | 5-digit county FIPS code | 01001–56045 | 2007–2024 |
| main_data | loan_amount | DOUBLE | Loan amount in thousands USD | > 0 | 2007–2024 |
| main_data | action_taken | INTEGER | Loan disposition | 1=originated, 2=approved, 3=denied, … | 2007–2024 |
| main_data | lender_type | VARCHAR | Lender classification | bank, credit_union, fintech, non_bank_other | 2007–2024 |
| county_year | hhi | DOUBLE | Herfindahl-Hirschman Index | 0–10000 | 2007–2024 |
Five rows, but a collaborator (or Claude) immediately knows what every column means, what values are valid, and which years are covered.
The metadata table solves a specific problem with Claude Code: session amnesia applied to data. Claude forgets your schema between sessions. Without a metadata table, you either re-explain your variables every session or hope Claude infers them correctly from column names. With one, you start every session with:
Read the metadata table from data/project_name.duckdb and familiarize yourself with the schema.
Claude runs SELECT * FROM metadata, gets full documentation of every variable, and is immediately oriented. No re-explanation needed. This also works for collaborators — a co-author who has never seen the dataset can query the metadata table and understand what they’re working with.
| Without Metadata Table | With Metadata Table |
|---|---|
| Re-explain schema every session | SELECT * FROM metadata |
| Co-authors email you asking “what is column X?” | They query the database |
| Claude guesses variable meanings from names | Claude reads exact definitions |
| Documentation lives in a separate file that drifts | Documentation lives with the data |
This is not just a DuckDB trick. The principle — documentation that lives with the artifact, not in a separate file — applies broadly. You can add metadata tables to SQLite databases, write data dictionaries as CSV files alongside your data, or embed variable labels in Stata .dta files. The specific implementation matters less than the habit: make your data self-documenting.
13 Step 5: Verify the Pipeline
Large data pipelines have a specific failure mode: they run without errors but produce wrong results. A join that silently drops 10% of observations. A type conversion that turns missing values into zeros. A harmonization that maps two distinct categories to the same code.
Before using the database for analysis, run basic checks:
Prompt:
“Verify the DuckDB database. Check: (1) row counts per year match expectations, (2) no unexpected missing values in key variables, (3) identifier coverage — are the same units present across years? (4) spot-check 5 random observations against the raw CSV to confirm values match. Report any discrepancies.”
This is the cross-language replication principle from Tutorial 1 applied to data engineering: don’t trust the pipeline because it ran — verify the output against the input.
14 The Full Workflow
Putting it together, the sequence for each new large dataset:
| Step | What | Prompt Pattern |
|---|---|---|
| 1 | Convert to parquet | “Convert CSVs to parquet, report compression” |
| 2 | Harmonize schemas | “Compare formats, build crosswalk, flag ambiguities” |
| 3 | Build DuckDB | “Load parquet into DuckDB, create aggregations” |
| 4 | Add metadata | “Create metadata table with full variable documentation” |
| 5 | Verify | “Check row counts, missing values, spot-check against raw” |
For a single dataset, this takes one Claude Code session. For multiple datasets that need linking, budget one session per dataset plus one session for the linking and verification.
Use the research-plan-execute pattern (Tutorial 2). For a complex multi-dataset pipeline, start with a research session: “I have four datasets [describe them]. I need to link them by [identifier]. What would this take — is it feasible? What might be missing?” Save Claude’s assessment to a plan file. Then start a fresh session to execute from the plan.
15 Working with the Database in Analysis Sessions
Once the database is built, your daily analysis workflow changes. Instead of loading data into R at the start of every script, you query the database for exactly what you need.
Session startup with a database:
Read the metadata table from data/project_name.duckdb.
Read CLAUDE.md and README.md.
State the goals of this session.
Querying in R scripts:
```r
library(duckdb)
con <- dbConnect(duckdb(), "data/project_name.duckdb")

# Step 1: Pull analysis sample
df <- dbGetQuery(con, "
  SELECT county_fips, year, outcome, treatment, pop, area_sqkm
  FROM main_data
  WHERE year BETWEEN 2005 AND 2015
    AND pop > 1000
")

# Step 2: Run regression
library(fixest)
model <- feols(outcome ~ treatment | county_fips + year, data = df)
```

The query extracts only the rows and columns you need. The resulting dataframe fits comfortably in memory even if the full dataset doesn’t.
Git and large files. DuckDB database files are typically 1–10GB — far too large for git. Add *.duckdb and data/parquet/ to your .gitignore. Track the scripts that build the database (they’re small text files), not the database itself. Anyone with the raw data and your build scripts can reproduce the database exactly. If you need to share the database with a co-author, use a file transfer service or shared drive — not GitHub.
Keep all modifications in the build script. During analysis, you’ll be tempted to add derived columns or new aggregation tables to the database interactively — “add a log_income column to main_data.” Those changes live only in the database file, which is not version-controlled. If you rebuild from your scripts, they’re lost. The rule: every modification to the database should be a line in build_db.py (or a separate analysis_views.sql file that runs after the build). If Claude adds a column during a session, ask it to add the same step to the build script before you move on.
16 When You Don’t Need This
Not every dataset warrants a database pipeline. The decision is straightforward:
| Dataset Size | Approach |
|---|---|
| Under 1 GB | Load directly into R/Python. No pipeline needed. |
| 1–10 GB | Parquet conversion helps. DuckDB optional — depends on how often you query. |
| 10 GB+ | Full pipeline (parquet + DuckDB + metadata). The investment pays for itself. |
| Multiple sources to link | Full pipeline regardless of individual file sizes. |
If your data fits in memory and you’re not linking across sources, a CSV and a well-commented R script are fine. Don’t build infrastructure you don’t need.
17 Further Reading
| Resource | Link |
|---|---|
| Goldsmith-Pinkham, “Large Datasets and Structured Databases” (worked example with HMDA data) | A Causal Affair |
| DuckDB documentation | duckdb.org |
| Apache Parquet format | parquet.apache.org |
| Arrow R package (parquet I/O) | arrow.apache.org/docs/r |
| Polars (alternative to DuckDB for lazy evaluation in Python) | pola.rs |