Data
The primary dataset is the Fichier des prénoms published by INSEE (2025 edition), covering all 87 million births registered in France from 1900 to 2024. The dataset contains 711,069 rows, each representing a unique combination of (name, gender, year), with counts rounded to the nearest multiple of 5 for privacy. After excluding the rare-names placeholder, 48,516 unique names remain.
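As a rough illustration of the filtering step, the sketch below counts distinct names after dropping the rare-names placeholder. The `Row` layout, the gender coding, the placeholder label `_PRENOMS_RARES`, and the counts are all assumptions made for the example; the actual column names and label should be checked against the published file.

```python
from typing import Iterable, NamedTuple, Set


class Row(NamedTuple):
    """One (name, gender, year) combination; the layout is hypothetical."""
    name: str
    gender: int  # assumed coding: 1 = male, 2 = female
    year: int
    count: int   # births, already rounded to a multiple of 5


# Assumed label for the rare-names placeholder; verify against the real file.
RARE_PLACEHOLDER = "_PRENOMS_RARES"


def unique_names(rows: Iterable[Row], placeholder: str = RARE_PLACEHOLDER) -> Set[str]:
    """Return the distinct names, ignoring the rare-names placeholder."""
    return {row.name for row in rows if row.name != placeholder}


# Toy rows with made-up counts, only to show the shape of the data.
rows = [
    Row("MARIE", 2, 1900, 100),
    Row("MARIE", 2, 1901, 105),
    Row("JEAN", 1, 1900, 50),
    Row(RARE_PLACEHOLDER, 1, 1900, 20),
]
names = unique_names(rows)
print(sorted(names))
```

Applied to the full file, the same filter would be expected to yield the 48,516 unique names reported above.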
20 Etymological Categories
Each name is assigned to its earliest known etymological origin, not its contemporary cultural association. "Marie" is classified as Hebrew (from Miriam) despite being quintessentially French. "Louis" is Germanic (from Frankish Chlodwig) rather than French. Our categories describe the history of words, not the history of the cultures that use them today.
| Category | Description | Examples |
|---|---|---|
| Hebrew / Biblical | Biblical/Hebrew etymology | Marie, Jean, David, Michel, Anne |
| Latin | Latin/Roman etymology | Pierre, Paul, Maxime, Victor, Dominique |
| Germanic | Frankish/Germanic roots | Louis, Henri, Richard, Charles, Francoise |
| Greek | Greek etymology | Philippe, Alexandre, Sophie, Catherine, Nicolas |
| Arabic | Arabic language etymology | Mohamed, Fatima, Karim, Yasmine, Rayan |
| Celtic | Breton/Celtic/Gaelic roots | Alain, Brigitte, Arthur, Nolwenn, Tristan |
| Anglo-Saxon | English language names | Kevin, Dylan, Brandon, Jennifer, Audrey |
| French | Native French formations | Manon, Colette, Gaston, Garance |
| African | Sub-Saharan African etymology | Mamadou, Aminata, Ousmane, Fatoumata |
| Nordic | Scandinavian/Norse roots | Eric, Ingrid, Astrid, Oscar, Nils |
| Slavic | Slavic language etymology | Nadia, Ivan, Katia, Sacha, Mila |
| Italian | Distinctly Italian forms | Giovanni, Salvatore, Concetta, Enzo, Giulia |
| Spanish | Distinctly Spanish forms | Carmen, Dolores, Pilar, Jade, Lola |
| Berber | Amazigh/Berber roots | Kenza, Massinissa, Idir, Jugurtha |
| Turkish | Turkish etymology | Elif, Emre, Ayse, Ayla |
| Persian | Iranian etymology | Cyrus, Darius, Soraya, Roxane, Gaspard |
| Asian | East/South/Southeast Asian | Linh, Mei, Ravi, Kenzo, Tao |
| Basque | Basque etymology | Iker, Eneko, Amaia, Xavier |
| Portuguese | Portuguese forms | Joaquim, Conceicao, Rui, Nuno |
| Other | Unclassifiable, invented, or mixed | (various) |
Classification
Classification was performed using Anthropic's Claude Haiku 4.5 as an automated onomastic classifier. The model was prompted with the complete taxonomy, classification rules, and examples, then asked to classify names in batches of 200 with an associated confidence score (0.0-1.0). Processing parameters: 243 batches, 10 parallel requests, temperature 0. Total processing time: 8 minutes. Cost: under $3.
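The batch count follows directly from the figures above: 48,516 names split into groups of 200 gives 243 batches, the last one partial. The sketch below shows only this splitting step; `make_batches` and the stand-in name list are hypothetical, and the model call itself is deliberately left out.

```python
import math
from typing import Iterator, List, Sequence


def make_batches(names: Sequence[str], batch_size: int = 200) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size names."""
    for start in range(0, len(names), batch_size):
        yield names[start:start + batch_size]


# Stand-in names; the real list would come from the dataset.
names: List[str] = [f"name_{i}" for i in range(48_516)]
batches = list(make_batches(names))

print(len(batches))       # number of batches
print(len(batches[-1]))   # size of the final, partial batch
```

With 10 requests in flight at a time, these 243 batches correspond to roughly 25 rounds of requests.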
Validation. 500 names were reclassified independently by Claude Opus for cross-validation. Agreement was 76% overall (Cohen's kappa: 0.74), rising above 90% for Arabic (94%), African (92%), Berber and Basque (100% each). Disagreements concentrated on boundaries between linguistically adjacent European categories (Latin/Greek, Germanic/French), which do not affect the European/extra-European divide. External validation against reference onomastic sources (Dauzat, 1951; Tanet & Horde, 2000) yielded 87% agreement (95% CI: 82-91%).
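For readers unfamiliar with Cohen's kappa, the sketch below computes it from two lists of labels using the standard definition, (observed agreement - chance agreement) / (1 - chance agreement). The label lists are toy data, not the validation sample, and `cohens_kappa` is a hypothetical helper rather than the code used for the study.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label
    return (p_o - p_e) / (1 - p_e)


# Toy labels for six names (illustrative only).
rater_1 = ["Latin", "Latin", "Greek", "Greek", "Arabic", "Arabic"]
rater_2 = ["Latin", "Greek", "Greek", "Greek", "Arabic", "Arabic"]
kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 2))  # 0.75
```

With many categories the chance-agreement term is small, which is why kappa can sit close to the raw agreement rate.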
Projections
Forward projections to 2050 use Monte Carlo simulation (10,000 trajectories, random seed 42). For each origin, the most recent 5-year rolling slope is evolved as a random walk: $s_{t+1} = s_t + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ and $\sigma$ is calibrated on 1990-2024 slope volatility. Values are clamped to [0, 100] at each step.
The bands show the 50% interval (25th-75th percentiles, darker shading) and 90% interval (5th-95th percentiles, lighter shading). Categories are projected independently: no constraint ensures shares sum to 100%. These projections assume future volatility comparable to 1990-2024. They do not capture potential structural breaks (migration policy, fertility changes, economic shocks). The intervals measure statistical uncertainty, not scenario uncertainty.
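The procedure can be sketched for a single category as follows. This is one reading of the description, not the original code: it assumes the slope is expressed in percentage points per year, that each year's share is the previous share plus the current slope, and that clamping applies to the share. The starting slope and sigma below are invented for illustration; only the 10,000 trajectories, the seed, the 2024 to 2050 horizon, and the percentile bands come from the text.

```python
import random
from typing import Dict, List


def clamp(x: float, lo: float = 0.0, hi: float = 100.0) -> float:
    """Keep a share within [lo, hi] percent."""
    return max(lo, min(hi, x))


def percentile(sorted_values: List[float], q: float) -> float:
    """Linear-interpolation percentile of an already sorted list (q in [0, 100])."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def project(share0: float, slope0: float, sigma: float, n_years: int,
            n_paths: int = 10_000, seed: int = 42) -> Dict[int, float]:
    """Simulate final-year shares and return the 5/25/50/75/95th percentiles."""
    rng = random.Random(seed)
    finals: List[float] = []
    for _ in range(n_paths):
        share, slope = share0, slope0
        for _ in range(n_years):
            slope += rng.gauss(0.0, sigma)   # random walk on the slope
            share = clamp(share + slope)     # assumed update rule, clamped each step
        finals.append(share)
    finals.sort()
    return {q: percentile(finals, q) for q in (5, 25, 50, 75, 95)}


# Illustrative inputs: 15.7% in 2024, with a made-up slope and volatility.
bands = project(share0=15.7, slope0=0.3, sigma=0.05, n_years=26)
print(bands)
```

Because the noise accumulates in the slope rather than in the share itself, the bands widen faster than under a plain random walk on the share, which is worth keeping in mind when reading the 2050 intervals.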
Four Historical Phases
Phase 1: Traditional Dominance (1900-1945). Hebrew, Latin, and Germanic origins collectively account for 80-85% of births. Hebrew names peak at 40% in 1946 (driven by Marie, Jean, Joseph), while Germanic names begin their secular decline from 28%.
Phase 2: Post-War Recomposition (1945-1975). Hebrew names collapse from 40% to 18% as traditional Catholic naming practices weaken. Latin names rise to 37% in 1966 (Pierre, Paul, Michel). Greek names surge from 10% to 29% by 1975 (Philippe, Catherine, Sophie, Alexandre), the most rapid single-origin expansion in the dataset.
Phase 3: Diversification (1975-2000). All four traditional origins decline simultaneously. Arabic names grow from 3% to 5%. Anglo-Saxon names peak at 3.7% in 1991 (Kevin, Dylan, Brandon). Celtic names rise to 7% by 2007. The Shannon diversity index increases sharply.
Phase 4: New Equilibrium? (2000-2024). Arabic names accelerate from 6% to 15.7%, becoming the fifth-largest origin group. Traditional European origins continue declining but at reduced rates. The diversity index continues to increase. The combined share of the four traditional European categories (Hebrew + Latin + Germanic + Greek) falls from 85% in 1945 to 51% in 2024.
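The Shannon diversity index mentioned in Phases 3 and 4 can be computed from the category shares of a given year. The sketch below assumes the usual definition, H = -Σ p·ln(p), with the natural logarithm; the text does not state which base or normalization was used. The two distributions are toy values, not figures from the dataset.

```python
import math
from typing import Mapping


def shannon_index(shares: Mapping[str, float]) -> float:
    """Shannon entropy (natural log) of a set of non-negative shares."""
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("shares must sum to a positive value")
    h = 0.0
    for value in shares.values():
        if value > 0:  # zero shares contribute nothing
            p = value / total
            h -= p * math.log(p)
    return h


# Toy distributions: one concentrated, one evenly spread.
concentrated = {"Hebrew": 97.0, "Latin": 1.0, "Germanic": 1.0, "Greek": 1.0}
even = {"Hebrew": 25.0, "Latin": 25.0, "Germanic": 25.0, "Greek": 25.0}

h_concentrated = shannon_index(concentrated)
h_even = shannon_index(even)
print(h_concentrated < h_even)  # True
```

The index is highest when births are spread evenly across categories, so a rising value reflects both new origins gaining share and dominant ones losing it.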
What the Data Cannot Tell Us
The etymological origin of a name is not the ethnic, religious, or national origin of the individual bearing it. Three distinct mechanisms connect names to demographics: cultural continuity (families choosing names from their ancestral tradition), cultural fashion (families adopting names perceived as attractive regardless of background), and assimilation (immigrant families adopting majority-culture names).
Critically, assimilation means the measured 15.7% share of Arabic-origin names likely underestimates the share of births in families with North African heritage. This naming data does not measure immigration rates, population composition, religious affiliation, or demographic "replacement." It documents how the etymological spectrum of names given to newborns in France has shifted over 125 years.