In 2002, about 2% of Chile’s residents were foreign-born. In 2024, it was over 8%. That is one of the fastest immigration transitions in Latin America, and it raises the questions driving the 2026 edition of IELE756 at UDD: where do immigrants settle, what work do they find, and do their health outcomes diverge from those of Chilean-born residents?

The data

We use the 2024 Census microdata: 19M+ persona records linked to households and dwellings, stored as Parquet in three relational tables joined via id_vivienda / id_hogar / id_persona. The migration variables are right there: place of residence 5 years ago (p24_lug_resid5), place of birth (p25_lug_nacimiento), arrival period (p26_llegada_periodo), nationality (p27_nacionalidad). Then there are two health datasets:

  • ENO (Enfermedades de Notificación Obligatoria, MINSAL DEIS), 2007 to 2024, ~333k records, semicolon-delimited CSV. The Desconocido nationality category has to be reported and excluded from rate denominators, which is the kind of data hygiene students do not learn from textbook examples.
  • GRD (Grupos Relacionados de Diagnóstico, MINSAL DEIS), 2019 to 2024, ~5M hospital discharges, pipe-delimited yearly zips, ICD-10 coded against CIE-10.xlsx for diagnostic chapter grouping.
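The Desconocido step for ENO can be sketched with the standard library alone. This is a minimal illustration, not the course's walkthrough script: the column names (comuna, anio, nacionalidad, enfermedad) and the four sample rows are hypothetical, and only the semicolon delimiter and the Desconocido label come from the description above.

```python
import csv
import io
from collections import Counter

# Hypothetical ENO extract: column names and rows are illustrative only.
SAMPLE = """comuna;anio;nacionalidad;enfermedad
Maipú;2023;Chilena;Tuberculosis
Maipú;2023;Extranjera;Tuberculosis
Maipú;2023;Desconocido;Tuberculosis
Puente Alto;2023;Chilena;Tuberculosis
"""

def count_by_nationality(text):
    """Return (counts for known nationalities, number of Desconocido rows)."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    known = Counter()
    unknown = 0
    for row in reader:
        nationality = row["nacionalidad"].strip()
        if nationality == "Desconocido":
            unknown += 1  # reported separately, kept out of the rate calculation
        else:
            known[nationality] += 1
    return known, unknown

known, unknown = count_by_nationality(SAMPLE)
share_unknown = unknown / (unknown + sum(known.values()))
print(dict(known), unknown, share_unknown)
```

Returning the unknown count alongside the known ones keeps the exclusion visible, so the share of dropped records can be reported next to any rate.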

The three datasets contain different people. A census respondent is not the same individual as an ENO notification or a hospital discharge. Linking them is therefore necessarily ecological: aggregate to the comuna level, join on codigo_comuna, fit Poisson or Negative Binomial regression with a population offset, interpret incidence rate ratios, and then sit with the ecological fallacy explicitly.
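To make the offset and the incidence rate ratio concrete, here is a small stdlib-only sketch of a Poisson regression with a population offset, fitted by Newton-Raphson. It is an assumption-laden toy: the four comunas, their counts, and the binary covariate x are invented, and in class one would more likely reach for a GLM routine in a statistics library rather than hand-rolling the fit.

```python
import math

def fit_poisson_offset(y, x, pop, iters=50, tol=1e-10):
    """Newton-Raphson fit of log(E[y]) = log(pop) + b0 + b1 * x."""
    b0 = math.log(sum(y) / sum(pop))  # start at the overall rate
    b1 = 0.0
    for _ in range(iters):
        mu = [p * math.exp(b0 + b1 * xi) for p, xi in zip(pop, x)]
        # Score vector (gradient of the Poisson log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        # 2x2 Fisher information, inverted by hand.
        h00 = sum(mu)
        h01 = sum(mi * xi for mi, xi in zip(mu, x))
        h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Hypothetical comuna-level data; x = 1 marks a "high foreign-born share" comuna.
pop = [100_000, 200_000, 150_000, 50_000]
x = [0, 0, 1, 1]
y = [20, 40, 60, 20]

b0, b1 = fit_poisson_offset(y, x, pop)
irr = math.exp(b1)  # incidence rate ratio for x = 1 versus x = 0
print(round(irr, 3))
```

With a single binary covariate the fitted IRR is just the ratio of the two pooled rates (80 per 200,000 against 60 per 300,000, so 2.0). That makes it a handy sanity check, and a reminder that the ratio compares comunas, not individuals.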

Class as bin-packing

There are 21 pairs. Each is assigned 1 to 3 comunas (mostly RM) totalling ~330k residents. The exact range is 324,087 to 568,106 with a mean of 356,479; allocation is a small Python script that sorts comunas largest-first and greedily packs each one into whichever bin currently has the smallest total. Puente Alto (568,106) and Maipú (521,627) end up alone; the long tail (Alhué, 5,765; San Pedro, 9,522; María Pinto, 12,572) gets folded into mixed groups.
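The greedy rule is short enough to sketch. This is not the actual assigner linked from the Resources page, just a minimal version of largest-first packing. The five named comunas use the populations quoted above; "Comuna A" and "Comuna B" are hypothetical fillers, and three bins stand in for the real 21.

```python
import heapq

def pack_comunas(populations, n_bins):
    """Greedy largest-first packing: each comuna goes to the currently lightest bin."""
    # Heap entries are (running_total, bin_index), so the lightest bin pops first.
    heap = [(0, i) for i in range(n_bins)]
    heapq.heapify(heap)
    bins = [[] for _ in range(n_bins)]
    ordered = sorted(populations.items(), key=lambda kv: kv[1], reverse=True)
    for name, pop in ordered:
        total, i = heapq.heappop(heap)
        bins[i].append(name)
        heapq.heappush(heap, (total + pop, i))
    totals = [0] * n_bins
    for total, i in heap:
        totals[i] = total
    return bins, totals

populations = {
    "Puente Alto": 568_106,
    "Maipú": 521_627,
    "Comuna A": 300_000,  # hypothetical
    "Comuna B": 150_000,  # hypothetical
    "María Pinto": 12_572,
    "San Pedro": 9_522,
    "Alhué": 5_765,
}

bins, totals = pack_comunas(populations, n_bins=3)
for members, total in zip(bins, totals):
    print(total, members)
```

In this toy run the two large comunas stay alone and everything else lands in the third bin, which is the same qualitative behaviour described above. The real script presumably also enforces the 1-to-3 comunas limit, which this sketch does not.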

Pipeline shape over the trimester: Tarea 0 (shallow contact, all three datasets) -> Tarea 1 (deep on Census, demographic + migration profile) -> Tarea 2 (deep on ENO + GRD, rates by nationality) -> Tarea 3 (merged at the comuna level, count regression fitted, ecological fallacy critiqued explicitly).

Natural vs. artificial intelligence

The final project is 25 of 70 points, split as: GitHub repo (10), 8 to 10 minute video (7), and a handwritten in-person memo (8 points, individual, 45 minutes, no AI / laptops / notes). Tareas and the async final components allow AI with a disclosure paragraph in the README. The memo does not, and is graded per student even though the project is paired.

The point of that split: if you cannot defend your repo on paper, the repo’s score does not save you. It is the only integrity lever in a course where everything else is plausibly AI-augmented, and it changes how teams work weeks before the memo even happens: pairs end up quizzing each other from the prompt bank, on paper, with no tools open.

The final asks for one anomaly, defended. Things that qualify: a coefficient that flips sign between Poisson and Negative Binomial with material consequences for interpretation; a comuna whose ENO rate is more than two SDs from the regional mean for a specific disease, adjusted for population; a choropleth whose spatial pattern contradicts the underlying demographic correlation. Things that do not: “foreign-born residents have higher TB rates” (documented for years; not a surprise).
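The two-SD criterion is easy to state and easy to get subtly wrong, so here is one possible reading as a sketch. The assumptions are mine: rates are per 100,000 residents, the "regional mean" is the unweighted mean of comuna rates, and the SD is the sample SD. The eight comunas and their counts are invented.

```python
import statistics

def flag_outliers(cases, population, threshold=2.0):
    """Return {comuna: z-score} for rates more than `threshold` SDs from the mean."""
    rates = {c: 100_000 * cases[c] / population[c] for c in cases}
    mean = statistics.mean(rates.values())
    sd = statistics.stdev(rates.values())  # sample SD across comunas
    return {
        c: (r - mean) / sd
        for c, r in rates.items()
        if abs(r - mean) > threshold * sd
    }

# Hypothetical counts for one disease in eight comunas.
cases = {"A": 18, "B": 10, "C": 11, "D": 30, "E": 9, "F": 22, "G": 5, "H": 40}
population = {
    "A": 200_000, "B": 100_000, "C": 100_000, "D": 300_000,
    "E": 100_000, "F": 200_000, "G": 50_000, "H": 100_000,
}

flagged = flag_outliers(cases, population)
print(flagged)
```

Only comuna H is flagged here. Note that with few comunas a single extreme value inflates the SD it is measured against, so the threshold becomes hard to cross; that limitation is itself something a memo could be asked to defend.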

Materials

  • Syllabus, week-by-week schedule, dataset notes, group/comuna assignments: leoferres.blog/teaching/iele756/2026-1/.
  • Walkthrough scripts (load and explore each of the three datasets), week-0 slide deck, and the bin-packing comuna assigner are linked from the Resources page.
  • Previous editions: Spring 2025 (environmental exposures + NCD burden).

Important note on interpretation

This is an important and sensitive topic. A central part of the course is making clear to students that, although the datasets are real, the work is an academic and pedagogical exercise. The results should not be interpreted as definitive empirical claims about immigrant health, disease burden, or public policy.

The analyses are designed to teach data integration, ecological inference, count regression, uncertainty, and the limits of aggregate data. Any substantive conclusions should be treated as provisional and illustrative. Robust empirical claims in this area require dedicated study designs, domain expertise, institutional review, and interpretation by public health and migration specialists.

If you teach a similar course and want to compare notes on how ecological-inference pedagogy plays out in practice, write to me.


The literature has treated cell towers as isotropic light bulbs since 2008. They’re really not (that’s why sometimes you don’t have a mobile phone signal!).


Left: Voronoi tessellation of 1,292 BTS across Santiago (R13), ~6 km window over the downtown core. Each tower owns every point closer to it than to any other mast, with 360° coverage assumed.

Right: the same towers, drawn as directional sector wedges built from the azimuth field that has always been in the catalog. Most masts carry three antennas radiating ~120° beams; a few carry six.

Dropping the isotropic assumption pulls the median effective radius from 504 m to 375 m region-wide, and from 245 m to 181 m inside this window. Sub-kilometre spatial resolution from a column already in the data.

The black gaps in the right panel are not missing values. They are the map being honest about where no antenna is aiming.
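A point-in-wedge test shows how the gaps arise. This is a minimal sketch, not the code behind the figure. It assumes azimuth is given in degrees clockwise from north, and the 500 m reach and the two-antenna mast are hypothetical values chosen for illustration.

```python
import math

def in_sector(dx, dy, azimuth_deg, beamwidth_deg=120.0, radius_m=500.0):
    """True if the point (dx metres east, dy metres north of the mast) is in the wedge."""
    if math.hypot(dx, dy) > radius_m:
        return False
    # Compass bearing of the point: 0 = north, increasing clockwise.
    bearing = math.degrees(math.atan2(dx, dy)) % 360.0
    # Smallest angular difference between bearing and beam centre, in [0, 180].
    diff = abs((bearing - azimuth_deg + 180.0) % 360.0 - 180.0)
    return diff <= beamwidth_deg / 2.0

def covered(dx, dy, azimuths):
    """True if any antenna on the mast aims at the point."""
    return any(in_sector(dx, dy, a) for a in azimuths)

# Hypothetical mast with only two antennas, aimed north and east-southeast.
azimuths = [0.0, 120.0]
print(covered(0, 100, azimuths))      # due north of the mast
print(covered(-100, -100, azimuths))  # southwest of the mast
```

The first point falls inside the north-facing beam; the second sits at bearing 225°, outside both wedges, so no antenna on this mast is aiming at it even though an isotropic Voronoi cell would claim it.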

I’m working on a series of blog posts and “spinoffs” of this paper:

Ferres, L., & Elejalde, E. (2026). Systematic biases in mobile phone mobility data from heterogeneous tower density. Zenodo. https://zenodo.org/records/19484460

Next, I’ll try to explain the statistical methods we used, and how we can correct the values.


New preprint and accompanying software release: “Systematic biases in mobile phone mobility data from heterogeneous tower density,” with Erick Elejalde (L3S, Leibniz Universität Hannover).

Mobile phone records (CDRs and the higher-resolution XDRs) are now standard inputs for human mobility, epidemic modelling, and disaster response, but the spatial distribution of cell towers introduces measurement biases that are rarely quantified. Towers cluster in cities and thin out in rural areas. The result is a spatially structured detection floor: short rural trips never cross a sector boundary and are invisible, rural users get misattributed to oversized Voronoi cells, and origin-destination matrices end up artificially urban-centric. The biases are correlated with the very variable (urbanicity) that researchers most often want to study.

We characterise the problem and propose a six-step correction pipeline:

1. Sector polygons inferred from antenna azimuth and height, replacing the standard tower-point Voronoi tessellation
2. Detection-floor modelling at the per-site level
3. Dasymetric redistribution of census population onto an H3 hexagonal grid
4. OD construction with intra-site sector-crossing recovery
5. Inverse probability weighting with a tower-density-aware inclusion probability
6. Fay-Herriot small area smoothing toward a gravity prior
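As an illustration of step 5 only, here is the generic inverse-probability-weighting idea in a few lines. It is not the mobilens API: the function name, the probability floor, and the flows are all hypothetical, and how the inclusion probabilities are derived from tower density is left out entirely.

```python
def ipw_flows(observed, inclusion_prob, min_prob=0.05):
    """Divide each observed OD count by its inclusion probability."""
    weighted = {}
    for od, count in observed.items():
        # Floor the probability so no single weight exceeds 1 / min_prob (assumed safeguard).
        p = max(inclusion_prob[od], min_prob)
        weighted[od] = count / p
    return weighted

# Hypothetical flows and inclusion probabilities.
observed = {("rural_a", "urban_core"): 300, ("urban_core", "urban_b"): 8000}
inclusion_prob = {("rural_a", "urban_core"): 0.5, ("urban_core", "urban_b"): 0.8}

weighted = ipw_flows(observed, inclusion_prob)
print(weighted)
```

Flows that are less likely to be detected receive larger weights, which is the mechanism by which sparsely covered areas get uplifted.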

Applied to the Región Metropolitana de Santiago using a 63,832-antenna catalog, the 2024 Chilean census, and 6.5 weeks of XDR data:

– Sector polygons give a 3.0x gain in effective spatial resolution over tower-point Voronoi (median 299 m versus 894 m)
– The 50% detection threshold ranges from 16 m in the urban core to 2,542 m at the most isolated site
– Intra-site sector crossings recover roughly 100 million short-distance trips (median displacement 429 m) that are invisible at the tower level
– IPW uplifts rural comuna flows by 50 to 73%, while the urban core is slightly downweighted
– Fay-Herriot shrinkage weights vary from about 0.7 in the urban core to under 0.1 at the periphery, mirroring the tower-density gradient
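For readers unfamiliar with Fay-Herriot shrinkage, the textbook form of the weight is the model variance divided by the sum of model and sampling variance. The sketch below uses that standard formula and assumes the weight is the one placed on the direct estimate; the variances and flow values are hypothetical, picked only so the weights land near the range quoted above, and none of this is taken from the mobilens code.

```python
def fay_herriot_blend(direct, prior, sampling_var, model_var):
    """Return (weight on the direct estimate, shrunken estimate)."""
    gamma = model_var / (model_var + sampling_var)
    return gamma, gamma * direct + (1.0 - gamma) * prior

# Hypothetical values: a dense urban site (low sampling variance)
# and an isolated peripheral site (high sampling variance).
g_urban, est_urban = fay_herriot_blend(1000.0, 900.0, sampling_var=0.4, model_var=1.0)
g_rural, est_rural = fay_herriot_blend(50.0, 120.0, sampling_var=12.0, model_var=1.0)
print(round(g_urban, 3), round(est_urban, 1))
print(round(g_rural, 3), round(est_rural, 1))
```

Where the direct estimate is noisy, the weight drops and the estimate is pulled toward the prior, which matches the reported gradient from urban core to periphery.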

The pipeline is implemented in mobilens, an MIT-licensed Python library that is operator- and country-agnostic. The minimum inputs are a tower catalog with azimuth and height, a census population layer at any administrative level, and a study area boundary polygon. Steps 1 to 3 (the spatial characterisation of the bias) can be carried out without any XDR data, which makes the library useful even where records are unavailable.

– Code: https://github.com/leoferres/mobilens
– Preprint: https://zenodo.org/records/19484460

Substantive feedback before journal submission is welcome!
