IELE756-2026-1: Teaching ecological inference on Chilean census + health microdata

May 9, 2026 in data-science, stats, teaching

In 2002, about 2% of Chile’s residents were foreign-born. In 2024, it was over 8%. That is one of the fastest immigration transitions in Latin America, and it sets up the questions driving the 2026 edition of IELE756 at UDD: where do immigrants settle, what work do they find, and do their health outcomes diverge from those of Chilean-born residents?

The data

We use the 2024 Census microdata: 19M+ persona records linked to households and dwellings, stored as parquet in three relational tables joined via id_vivienda / id_hogar / id_persona. The migration variables are right there: place of residence 5 years ago (p24_lug_resid5), place of birth (p25_lug_nacimiento), arrival period (p26_llegada_periodo), nationality (p27_nacionalidad). Then two health datasets:

ENO (Enfermedades de Notificación Obligatoria, MINSAL DEIS), 2007 to 2024, ~333k records, semicolon-delimited CSV. The Desconocido nationality category has to be reported and excluded from rate denominators, which is the kind of data hygiene students do not learn from textbook examples.
GRD (Grupos Relacionados de Diagnóstico, MINSAL DEIS), 2019 to 2024, ~5M hospital discharges, pipe-delimited yearly zips, ICD-10 coded against CIE-10.xlsx for diagnostic chapter grouping.
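
As a rough illustration of the Desconocido handling, here is a minimal sketch that reads semicolon-delimited rows and tallies unknown-nationality cases separately, so they can be reported but kept out of nationality-specific rates. The column names (codigo_comuna, anio, enfermedad, nacionalidad), the nationality labels, and the sample rows are assumptions made for the example, not the actual ENO schema.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for a few ENO rows; real column names and labels may differ.
SAMPLE_ENO = """codigo_comuna;anio;enfermedad;nacionalidad
13201;2023;Tuberculosis;Chilena
13201;2023;Tuberculosis;Extranjera
13201;2023;Tuberculosis;Desconocido
13119;2023;Tuberculosis;Chilena
13119;2023;Tuberculosis;Desconocido
"""


def count_cases(handle, unknown_label="Desconocido"):
    """Count cases per (comuna, nationality); tally unknowns per comuna separately."""
    known = Counter()
    unknown = Counter()
    for row in csv.DictReader(handle, delimiter=";"):
        comuna = row["codigo_comuna"]
        nationality = row["nacionalidad"].strip()
        if nationality == unknown_label:
            unknown[comuna] += 1
        else:
            known[(comuna, nationality)] += 1
    return known, unknown


known, unknown = count_cases(io.StringIO(SAMPLE_ENO))

# Share of all notifications with unknown nationality: report this alongside any rate.
n_known = sum(known.values())
n_unknown = sum(unknown.values())
share_unknown = n_unknown / (n_known + n_unknown)
```

Keeping the unknown tally as its own counter, rather than silently dropping those rows, makes it easy to state how much of the data the rates leave out.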

The three datasets contain different people. A census respondent is not the same individual as an ENO notification or a hospital discharge. Linking them is therefore necessarily ecological: aggregate to the comuna level, join on codigo_comuna, fit Poisson or Negative Binomial regression with a population offset, interpret incidence rate ratios, and then sit with the ecological fallacy explicitly.
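
To make the offset idea concrete, here is a minimal sketch of a Poisson regression with a log-population offset and a single covariate, fitted by Newton-Raphson in plain Python. The function name fit_poisson_offset, the covariate (percentage foreign-born), and the toy numbers are all assumptions for illustration. In coursework a library GLM routine would normally do this, and the sketch does not cover the Negative Binomial case, which adds a dispersion parameter.

```python
import math


def fit_poisson_offset(cases, x, pop, max_iter=50, tol=1e-10):
    """Fit log E[cases] = log(pop) + b0 + b1 * x by Newton-Raphson."""
    b0 = math.log(sum(cases) / sum(pop))  # start at the pooled log rate
    b1 = 0.0
    for _ in range(max_iter):
        mu = [p * math.exp(b0 + b1 * xi) for p, xi in zip(pop, x)]
        # Score vector (gradient of the Poisson log-likelihood).
        g0 = sum(c - m for c, m in zip(cases, mu))
        g1 = sum((c - m) * xi for c, m, xi in zip(cases, mu, x))
        # Fisher information (negative Hessian), a symmetric 2x2 matrix.
        h00 = sum(mu)
        h01 = sum(m * xi for m, xi in zip(mu, x))
        h11 = sum(m * xi * xi for m, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system in closed form.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Hypothetical comuna-level aggregates (not real data).
pop = [100_000, 200_000, 150_000, 50_000]
pct_foreign = [5.0, 10.0, 20.0, 30.0]
cases = [11, 24, 22, 9]

b0, b1 = fit_poisson_offset(cases, pct_foreign, pop)
irr_per_point = math.exp(b1)  # incidence rate ratio per percentage point
```

The offset enters with a fixed coefficient of 1, which is what turns a model of counts into a model of rates. The resulting IRR describes comunas, not individuals, which is exactly where the ecological fallacy comes in.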

Class as bin-packing

There are 21 pairs. Each is assigned 1 to 3 comunas (mostly RM) totalling roughly 330k residents. The exact range is 324,087 to 568,106 with a mean of 356,479, pulled upward by the two largest comunas; allocation is a small Python script that sorts comunas largest-first and packs each one greedily into the bin that currently has the smallest total. Puente Alto (568,106) and Maipú (521,627) end up alone; the long tail (Alhué, 5,765; San Pedro, 9,522; María Pinto, 12,572) gets folded into mixed groups.
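
The largest-first greedy idea can be sketched in a few lines. This is not the course's actual script: the function name assign_comunas is made up, the example uses only the five comunas quoted above with 3 bins instead of 21, and it does not enforce the 1-to-3 comunas-per-pair limit.

```python
import heapq


def assign_comunas(populations, n_bins):
    """Largest-first greedy packing: each comuna goes to the currently lightest bin."""
    heap = [(0, i) for i in range(n_bins)]  # (current total, bin index)
    heapq.heapify(heap)
    bins = [[] for _ in range(n_bins)]
    ordered = sorted(populations.items(), key=lambda kv: kv[1], reverse=True)
    for name, pop in ordered:
        total, i = heapq.heappop(heap)  # lightest bin so far
        bins[i].append(name)
        heapq.heappush(heap, (total + pop, i))
    totals = [sum(populations[name] for name in members) for members in bins]
    return bins, totals


# Populations as quoted in the post; n_bins=3 is only for the illustration.
populations = {
    "Puente Alto": 568_106,
    "Maipú": 521_627,
    "Alhué": 5_765,
    "San Pedro": 9_522,
    "María Pinto": 12_572,
}

bins, totals = assign_comunas(populations, n_bins=3)
# bins   -> [["Puente Alto"], ["Maipú"], ["María Pinto", "San Pedro", "Alhué"]]
# totals -> [568106, 521627, 27859]
```

Even at this toy scale the behaviour described above shows up: the two giants each occupy a bin alone, and the small comunas pile into whatever bin is lightest.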

Pipeline shape over the trimester: Tarea 0 (shallow contact, all three datasets) -> Tarea 1 (deep on Census, demographic + migration profile) -> Tarea 2 (deep on ENO + GRD, rates by nationality) -> Tarea 3 (merged at the comuna level, count regression fitted, ecological fallacy critiqued explicitly).

Natural vs. artificial intelligence

The final project is 25 of 70 points, split as: GitHub repo (10), 8 to 10 minute video (7), and a handwritten in-person memo (8 points, individual, 45 minutes, no AI / laptops / notes). Tareas and the async final components allow AI with a disclosure paragraph in the README. The memo does not, and is graded per student even though the project is paired.

The point of that split: if you cannot defend your repo on paper, the repo’s score does not save you. It is the only integrity lever in a course where everything else is plausibly AI-augmented, and it changes how teams work weeks before the memo even happens: pairs quiz each other from the prompt bank, on paper, no tools open.

The final asks for one anomaly, defended. Things that qualify: a coefficient that flips sign between Poisson and Negative Binomial with material consequences for interpretation; a comuna whose ENO rate is more than two SDs from the regional mean for a specific disease, adjusted for population; a choropleth whose spatial pattern contradicts the underlying demographic correlation. Things that do not: “foreign-born residents have higher TB rates” (documented for years; not a surprise).
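
One of the qualifying checks, the two-SD rule, can be sketched as follows. The choices here are assumptions rather than course requirements: the "regional mean" is taken as the unweighted mean of comuna rates, rates are per 100,000 residents, and the comuna labels and counts are invented.

```python
from statistics import mean, stdev


def flag_outlier_comunas(cases, pop, threshold=2.0, per=100_000):
    """Return z-scores for comunas whose rate is more than `threshold` SDs from the mean."""
    rates = {c: per * cases[c] / pop[c] for c in cases}
    m = mean(rates.values())    # unweighted mean of comuna rates
    s = stdev(rates.values())   # sample standard deviation
    return {c: (r - m) / s for c, r in rates.items() if abs(r - m) > threshold * s}


# Hypothetical counts for one disease across eight comunas (not real data).
cases = {"A": 10, "B": 22, "C": 9, "D": 15, "E": 11, "F": 5, "G": 18, "H": 40}
pop = {
    "A": 100_000, "B": 200_000, "C": 100_000, "D": 150_000,
    "E": 100_000, "F": 50_000, "G": 200_000, "H": 100_000,
}

flagged = flag_outlier_comunas(cases, pop)
```

A flag like this is only the starting point of an anomaly. Small comunas produce unstable rates, so the defence on paper still has to argue why the deviation is not just noise.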

Materials

Syllabus, week-by-week schedule, dataset notes, group/comuna assignments: leoferres.blog/teaching/iele756/2026-1/.
Walkthrough scripts (load and explore each of the three datasets), week-0 slide deck, and the bin-packing comuna assigner are linked from the Resources page.
Previous editions: Spring 2025 (environmental exposures + NCD burden).

Important note on interpretation

This is an important and sensitive topic. A central part of the course is making clear to students that, although the datasets are real, the work is an academic and pedagogical exercise. The results should not be interpreted as definitive empirical claims about immigrant health, disease burden, or public policy.

The analyses are designed to teach data integration, ecological inference, count regression, uncertainty, and the limits of aggregate data. Any substantive conclusions should be treated as provisional and illustrative. Robust empirical claims in this area require dedicated study designs, domain expertise, institutional review, and interpretation by public health and migration specialists.

If you teach a similar course and want to compare notes on how ecological-inference pedagogy plays out in practice, write to me.

Tags: health, human-mobility, migration, santiago-de-chile

Leo's Blog
