


Florida Creative Class Replication — Digital Atlas#

Handoff plan for Claude Code. Market-parametric: runs on NYC or Singapore via config switch. Phases gated by validation checks.


1. Objective#

Empirically test Richard Florida's Creative Class thesis at sub-metro resolution using Digital Atlas feature vectors, and produce results that are either (a) the first clean replication at this granularity, or (b) evidence that the thesis fails when amenity and workforce signals are measured directly rather than through coarse proxies.

Florida's original work ran at MSA level using occupation-code shares and index proxies (Bohemian Index, Gay Index, Milken Tech-Pole). We run at census tract (NYC, N=2,304) or subzone (SGP, N=332) using observed place-level signals, typed feature vectors, and graph-aware neighbor controls.


2. Testable claims (target: prove or disprove each)#

| # | Claim | Primary test | Outcome |
|---|---|---|---|
| H1 | 3T's (Talent, Technology, Tolerance) jointly predict creative-class concentration | Cross-sectional regression of creative-class share on 3T indices | Coefficient signs + magnitude |
| H2 | The 3T's are empirically separable, not a single latent factor | Factor analysis + VIF + orthogonalization test | 3 factors ≥ 10% variance each, or collapse to 1 |
| H3 | Creative-class concentration precedes employment/wage growth (causal direction) | Panel FE + IV (NYC only; SGP has no panel) | Sign of lagged creative share on Δwage |
| H4 | "Quality of place" amenity signature is a stronger predictor than labor-market factors | Compare R² of amenity-only vs. labor-only vs. combined models | Amenity R² >> labor R² |
| H5 | Creative concentration is spatially clustered intra-metro, not uniformly distributed | Moran's I + LISA hotspots on creative share | Moran's I > 0.3, p<0.01 |
| H6 | (NYC only) Creative class redistributes intra-metro post-pandemic (2019→2023) | Δ creative share × tract type (CBD vs. neighborhood) | Significant interaction |
| H7 | Creative concentration capitalizes into rent (inequality claim) | Regression of rent on creative share, controlling for amenities and access | Positive residual capitalization |
| H8 | Neighbor absorption matters: a tract's apparent creative gap is partially explained by adjacent supply | Gap model with/without spatial lag of 3T | Spatial lag adds material explained variance |

H8 is ours, not Florida's — it's the contribution that goes beyond replication.


3. Repository layout#

/repos/florida_replication/
├── config/
│   ├── market_nyc.yaml
│   └── market_sgp.yaml
├── src/
│   ├── data/              # source loaders per market
│   ├── features/          # 3T index construction
│   ├── models/            # regressions, panel, spatial
│   ├── validate/          # gate checks between phases
│   └── viz/               # maps, coefficient plots
├── notebooks/             # exploratory only; production is src/
├── artifacts/
│   ├── indices/           # *_3t_indices.parquet
│   ├── models/            # *.pkl
│   └── figures/
├── reports/
│   └── florida_replication_report.md  # final deliverable
└── run.py                 # entrypoint: python run.py --market nyc --phases 1,2,3

4. Market configuration#

4.1 NYC config (config/market_nyc.yaml)#

market: nyc
spatial_unit: census_tract
n_units: 2304
crs: EPSG:2263
sources:
  places: /data/nyc/places_full.parquet        # ~250K rows
  tract_features: /data/nyc/tract_features_v3.parquet  # existing region vectors
  tract_geoms: /data/nyc/tracts_2020.geojson
  acs_panel: /data/nyc/acs_5yr_2009_2023.parquet
  rodx: /data/nyc/mega_graph.pkl               # existing mega graph
  tract_adjacency: /data/nyc/tract_queen_contig.parquet
  housing_transactions: /data/nyc/rpad_sales_panel.parquet
outcomes:
  - median_hh_income
  - bachelor_plus_share
  - median_gross_rent
  - employment_total
panel_available: true

4.2 Singapore config (config/market_sgp.yaml)#

market: sgp
spatial_unit: subzone
n_units: 332
crs: EPSG:3414
sources:
  places: /data/sgp/places_overture_plus_local.parquet  # ~67K rows
  subzone_features: /data/sgp/subzone_features_202d.parquet  # 19-step pipeline output
  subzone_geoms: /data/sgp/mp19_subzones.geojson
  singstat_census: /data/sgp/singstat_2020_2025.parquet   # PA-level, dasymetric to subzone
  housing_transactions: /data/sgp/hdb_private_txn_panel.parquet
outcomes:
  - median_hh_income      # from Census, dasymetric-allocated
  - professional_share    # proxy for creative class via industry
  - private_psf           # price per sqft, private residential
panel_available: false   # 2 census points only, cross-sectional primary

5. Phase plan#

Phase 0 — Environment setup and sanity checks (0.5 day)#

  • Load market config, verify all source files exist and have expected schema.
  • Spatial sanity: n_units == len(geoms) == len(features), no orphan tracts/subzones.
  • Feature vector integrity: for SGP confirm 202-dim, for NYC confirm current production dimensionality. Log any NaN or zero-variance columns.
  • Produce artifacts/00_data_inventory.md listing N units, N places, feature columns, date range.

Gate: all sources loadable, no schema mismatches, feature vector dimension matches config. Stop if not.
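A minimal sketch of the spatial sanity check above, assuming the unit ids have already been read from the geometry and feature files into plain lists. The function name `check_unit_alignment` and the list-of-problems return shape are illustrative, not part of the existing codebase.

```python
def check_unit_alignment(n_units, geom_ids, feature_ids):
    """Return a list of human-readable problems; an empty list means the check passes."""
    problems = []
    geom_set, feat_set = set(geom_ids), set(feature_ids)

    # Duplicate ids would silently inflate len() and hide a count mismatch.
    if len(geom_ids) != len(geom_set):
        problems.append("duplicate unit ids in geometries")
    if len(feature_ids) != len(feat_set):
        problems.append("duplicate unit ids in features")

    # n_units == len(geoms) == len(features)
    if not (n_units == len(geom_set) == len(feat_set)):
        problems.append(
            f"count mismatch: config={n_units}, "
            f"geoms={len(geom_set)}, features={len(feat_set)}"
        )

    # Orphans: units present in one source but not the other.
    orphans = geom_set ^ feat_set
    if orphans:
        problems.append(f"{len(orphans)} orphan unit ids, e.g. {sorted(orphans)[:5]}")
    return problems
```

An empty list lets the run proceed; anything else should stop it at the Phase 0 gate.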


Phase 1 — Operationalize the 3T's at our spatial unit (2 days)#

We construct three indices per tract/subzone: T_talent, T_tech, T_tolerance. Each is a composite z-score of several observed measures. No proxy is imported from Florida verbatim — we use our richer data and document each mapping explicitly.

1.1 Talent index#

| Component | NYC source | SGP source | Weight |
|---|---|---|---|
| Bachelor-plus share | ACS B15003 | SingStat (PA→subzone dasymetric) | 0.4 |
| Graduate-degree share | ACS B15003 | SingStat | 0.3 |
| Creative-occupation share | ACS C24010 (SOC-based, see §5.1.3) | Proxy via tech-firm + design-firm workplace density | 0.3 |

5.1.3 — SOC → Creative Class mapping (NYC). Use Florida's occupation set: SOC 11 (Management), 13 (Business/Financial Ops), 15 (Computer/Math), 17 (Architecture/Engineering), 19 (Life/Physical/Social Sci), 21 (Community/Social Svc), 23 (Legal), 25 (Education), 27 (Arts/Design/Entertainment/Media), 29 (Healthcare Practitioners). Store mapping as src/features/soc_creative_class.py with the exact SOC codes — reviewer-auditable.
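One possible shape for src/features/soc_creative_class.py, using exactly the SOC major groups listed above. The helper names and the assumption that employment counts arrive keyed by SOC code are illustrative; ACS C24010 reports occupation groups, so a crosswalk step would sit in front of this.

```python
# SOC major groups treated as creative class, as listed in this plan.
CREATIVE_SOC_MAJOR_GROUPS = {
    "11": "Management",
    "13": "Business/Financial Ops",
    "15": "Computer/Math",
    "17": "Architecture/Engineering",
    "19": "Life/Physical/Social Sci",
    "21": "Community/Social Svc",
    "23": "Legal",
    "25": "Education",
    "27": "Arts/Design/Entertainment/Media",
    "29": "Healthcare Practitioners",
}


def is_creative(soc_code):
    """True if the code's two-digit major group is in the creative set."""
    return soc_code.strip()[:2] in CREATIVE_SOC_MAJOR_GROUPS


def creative_share(employment_by_soc):
    """Share of employment in creative occupations; NaN when the unit has no workers."""
    total = sum(employment_by_soc.values())
    if total == 0:
        return float("nan")
    creative = sum(n for code, n in employment_by_soc.items() if is_creative(code))
    return creative / total
```

Keeping the mapping as a literal dict means a reviewer can audit the definition by reading one file.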

SGP proxy. Without occupation-at-subzone data, derive a "creative workplace intensity" score from place-type density: software/tech offices, design studios, architecture firms, research institutions, media companies, universities. Weight by employment-capacity estimates where available. Document this as a proxy, not a direct measure.

1.2 Technology index#

| Component | NYC source | SGP source | Weight |
|---|---|---|---|
| Tech firm density (per km²) | Places with NAICS 5112/5415/5417 | Overture + local: software, R&D, tech services | 0.35 |
| Coworking space count | Places typed as coworking | Same | 0.15 |
| R&D institution presence | Universities + research centers | NUS/NTU/SMU campuses + A*STAR, research parks | 0.25 |
| Patent density (if available) | USPTO by zip→tract allocation | IPOS if obtainable, else drop | 0.25 |

If patent data isn't available in SGP, renormalize weights and flag in report.
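The renormalization step can be sketched as follows. The weights are the ones in the table above; the component keys and function names are placeholders chosen for this example.

```python
TECH_WEIGHTS = {
    "tech_firm_density": 0.35,
    "coworking_count": 0.15,
    "rd_presence": 0.25,
    "patent_density": 0.25,
}


def renormalize(weights, available):
    """Drop unavailable components and rescale the remaining weights to sum to 1."""
    kept = {name: w for name, w in weights.items() if name in available}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no weighted components available")
    return {name: w / total for name, w in kept.items()}


def weighted_index(z_scores, weights):
    """Weighted sum of already z-scored components for one unit."""
    return sum(weights[name] * z_scores[name] for name in weights)


# SGP without patent data: the three remaining weights are divided by 0.75.
sgp_weights = renormalize(
    TECH_WEIGHTS, {"tech_firm_density", "coworking_count", "rd_presence"}
)
```

Whatever weights are actually used per market should be written into artifacts/01_3t_construction.md so the two markets stay comparable.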

1.3 Tolerance index#

This is where we deliberately depart from Florida's proxies. His Gay Index and Bohemian Index were the best he could do with 2000-era data. We measure tolerance-adjacent urban signals directly:

| Component | Measure | Rationale |
|---|---|---|
| Cuisine diversity | Shannon entropy over cuisine categories in restaurants | Proxy for cultural openness |
| Venue-type heterogeneity | Shannon entropy over all place categories | Mixed-use authenticity |
| Third-place density | Indie coffee + bars + bookstores + galleries + music venues, per km² | Florida's "street-level culture" |
| Indie-to-chain ratio | Fraction of retail/F&B that are non-chain | Authenticity signal |

For SGP, add market-specific adjustments: hawker centres, void-deck commercial, and kopitiams count as third places and get mapped into the indie side of the indie/chain ratio (per your existing taxonomy work).
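The two entropy components can be computed with a few lines. This sketch assumes each place carries a single category label and uses the natural log; both are choices made for the example, not something the plan fixes.

```python
import math
from collections import Counter


def shannon_entropy(categories):
    """Shannon entropy (natural log) of a list of category labels.

    Returns 0.0 for an empty list or a single category.
    """
    counts = Counter(categories)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

Entropy grows with the number of categories present, so a unit with very few places will score low almost by construction; that is one reason to report Tolerance residualized on place count.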

Chain detection: use brand-name frequency threshold (≥10 locations in market = chain). Store src/features/chain_detector.py.
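A hedged sketch of what src/features/chain_detector.py might contain. Brand matching here is a simple lowercase and whitespace normalization, which is an assumption; real place names will likely need fuzzier matching.

```python
from collections import Counter


def normalize_brand(name):
    """Lowercase and collapse whitespace so 'Starbucks ' and 'starbucks' match."""
    return " ".join(name.lower().split())


def chain_brands(place_names, threshold=10):
    """Brands with at least `threshold` locations in the market count as chains."""
    counts = Counter(normalize_brand(n) for n in place_names if n)
    return {brand for brand, c in counts.items() if c >= threshold}


def indie_ratio(place_names, threshold=10):
    """Fraction of named places that do not belong to a detected chain."""
    names = [normalize_brand(n) for n in place_names if n]
    if not names:
        return float("nan")
    chains = chain_brands(names, threshold)
    return sum(1 for n in names if n not in chains) / len(names)
```

Because the threshold is a parameter, rerunning `indie_ratio` with different values gives a direct sensitivity check. Note that chain membership should be decided on the whole market's place list, then the ratio computed per tract or subzone.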

1.4 Composite Creativity Index#

CI = (T_talent + T_tech + T_tolerance) / 3 after z-scoring each component. Also retain the three separately for H2 (separability test).
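The composite can be sketched directly from the formula. This version uses the population standard deviation and assumes complete data (units with NaN would be filtered out beforehand); both are assumptions of the example.

```python
import statistics


def zscore(values):
    """Standardize a list of floats to mean 0 and (population) sd 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("zero-variance column cannot be z-scored")
    return [(v - mean) / sd for v in values]


def creativity_index(talent, tech, tolerance):
    """CI = (z(T_talent) + z(T_tech) + z(T_tolerance)) / 3, one value per unit."""
    z_talent, z_tech, z_tol = zscore(talent), zscore(tech), zscore(tolerance)
    return [(a + b + c) / 3 for a, b, c in zip(z_talent, z_tech, z_tol)]
```

The equal-weight average is itself not unit-variance unless the three indices are perfectly correlated, so CI may need a final z-score if it is compared against the individual T's on the same scale.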

Deliverable for Phase 1#
  • artifacts/indices/{market}_3t_indices.parquet with columns: unit_id, T_talent, T_tech, T_tolerance, CI, plus all component measures.
  • artifacts/01_3t_construction.md documenting every measure, source, and weight choice.

Gate: Indices computed for ≥99% of units (allow NaN for tracts/subzones with N_places=0 or population=0). Distribution of each index is roughly normal after z-scoring (z-scoring only recenters and rescales, so a heavily skewed index needs its components transformed, e.g. logged, before compositing). Correlation matrix of 3T components produced and inspected. If any pair has r>0.9, flag as collinearity issue.
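The coverage and collinearity parts of this gate could be checked along these lines. The function name, the dict-of-lists input, and the use of |r| rather than signed r are assumptions of this sketch.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Plain Pearson correlation for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def phase1_gate(indices, min_coverage=0.99, max_abs_r=0.9):
    """indices: dict of name -> list of floats, NaN where an index is undefined."""
    names = list(indices)
    n_units = len(indices[names[0]])
    # A unit counts as covered only if every index is defined for it.
    complete = [
        i for i in range(n_units)
        if not any(math.isnan(indices[k][i]) for k in names)
    ]
    coverage = len(complete) / n_units
    flagged = []
    for a, b in combinations(names, 2):
        r = pearson([indices[a][i] for i in complete],
                    [indices[b][i] for i in complete])
        if abs(r) > max_abs_r:
            flagged.append((a, b, round(r, 3)))
    return {
        "coverage": coverage,
        "coverage_ok": coverage >= min_coverage,
        "collinear_pairs": flagged,
    }
```

The shape check on each distribution is left to visual inspection here; a skewness statistic could be added to the returned dict if an automated threshold is wanted.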


Phase 2 — Cross-sectional test (H1, H2, H4, H5) (2 days)#

Both markets. Unit of observation: tract (NYC) or subzone (SGP).

2.1 H1 — 3T → creative-class share#

Baseline OLS:

creative_share_i = β0 + β1·T_talent_i + β2·T_tech_i + β3·T_tolerance_i + X_i·γ + ε_i

Where X_i includes controls: population density, distance to CBD, transit accessibility (isochrone-based), total place count.

Then spatial specification:

creative_share_i = β0 + β·T_i + ρ·W·creative_share_i + X_i·γ + ε_i

Queen-contiguity weights matrix W. Use pysal.model.spreg.ML_Lag. Compare to OLS — if ρ is significant and large, spatial dependence is real and OLS estimates are biased.

H8 sidecar: also fit with spatial lag of T's (SLX) — W·T_talent, W·T_tech, W·T_tolerance — and test whether neighbor 3T's explain own creative share beyond own 3T's. This is the neighbor absorption test at the 3T level.
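The W·T terms for the SLX specification are neighbor averages under a row-standardized weights matrix. The sketch below assumes row standardization and an adjacency dict built from the queen-contiguity table; neither detail is fixed by the plan.

```python
def spatial_lag(values, neighbors):
    """Row-standardized spatial lag: the mean of each unit's neighbors' values.

    values:    dict of unit_id -> float (e.g. T_talent per tract)
    neighbors: dict of unit_id -> list of adjacent unit_ids (queen contiguity)
    Units with no usable neighbors (islands) get NaN.
    """
    lag = {}
    for unit, nbrs in neighbors.items():
        vals = [values[n] for n in nbrs if n in values]
        lag[unit] = sum(vals) / len(vals) if vals else float("nan")
    return lag


# Toy three-unit chain a - b - c.
t_talent = {"a": 1.0, "b": 3.0, "c": 5.0}
queen = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
w_t_talent = spatial_lag(t_talent, queen)
```

The lagged columns then enter the regression as ordinary regressors next to the unit's own T's. Islands need an explicit decision (drop, or fall back to nearest neighbors), which matters for both markets.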

2.2 H2 — Are the 3T's separable?#
  • Exploratory factor analysis on the underlying component measures (not the indices themselves).
  • Report eigenvalues, scree plot, loadings.
  • Decision rule: if top factor explains >70% of variance and no second factor has eigenvalue >1, Florida's three-factor structure is rejected at this resolution.
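The decision rule can be written down as a small function over the eigenvalues. It assumes the eigenvalues come from the correlation matrix of the component measures, so that variance share is each eigenvalue over their sum; the function name is illustrative.

```python
def three_factor_structure_rejected(eigenvalues, top_share=0.70, min_eigenvalue=1.0):
    """Apply the H2 decision rule.

    Rejects the three-factor structure when the top factor explains more than
    `top_share` of variance AND no second factor has eigenvalue > `min_eigenvalue`.
    """
    ev = sorted(eigenvalues, reverse=True)
    top_explains = ev[0] / sum(ev)
    second_passes = len(ev) > 1 and ev[1] > min_eigenvalue
    return top_explains > top_share and not second_passes
```

For example, eigenvalues of 8.0, 0.9, 0.6, 0.5 would trigger rejection (the first factor carries 80% and the second is below 1), while 4.0, 3.0, 2.0, 1.0 would not.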
2.3 H4 — Amenity vs. labor predictive power#

Three nested models for the same outcome (creative share or median income):

  • M_A: amenity-only features (place density, diversity, third-place count, walkability)
  • M_L: labor-only features (education shares, occupation mix for NYC / industry mix for SGP)
  • M_C: combined

Compare adj-R² and cross-validated R² (5-fold spatial CV — hold out contiguous blocks, not random rows).
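One simple way to get contiguous hold-out blocks is to bin unit centroids into a square grid and assign whole grid cells to folds. This is a sketch of that idea, not necessarily the blocking scheme the project will settle on; `block_size` is in CRS units (feet for EPSG:2263, metres for EPSG:3414).

```python
def spatial_block_folds(centroids, block_size, n_folds=5):
    """Assign each unit to a CV fold so that all units in a grid cell share a fold.

    centroids: dict of unit_id -> (x, y) in projected coordinates
    Returns:   dict of unit_id -> fold index in range(n_folds)
    """
    blocks = {}
    for unit_id, (x, y) in centroids.items():
        cell = (int(x // block_size), int(y // block_size))
        blocks.setdefault(cell, []).append(unit_id)

    folds = {}
    # Round-robin over sorted cells keeps fold sizes roughly balanced.
    for i, cell in enumerate(sorted(blocks)):
        for unit_id in blocks[cell]:
            folds[unit_id] = i % n_folds
    return folds
```

Random row-level folds would leak information through spatial autocorrelation, since a held-out tract's neighbors would sit in the training set; block assignment reduces that leakage at the cost of noisier fold sizes.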

2.4 H5 — Spatial clustering#
  • Global Moran's I on creative share.
  • LISA (Local Indicators of Spatial Association) to identify HH (high-high) and LL clusters.
  • Map top-decile creative tracts; check if they form contiguous corridors or scatter.
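For reference, global Moran's I with binary contiguity weights reduces to a short computation. The sketch below gives only the statistic; the permutation-based p-value and the LISA step would in practice come from a spatial statistics library such as PySAL's esda.

```python
def morans_i(values, neighbors):
    """Global Moran's I with binary weights (w_ij = 1 if j is a neighbor of i).

    values:    dict of unit_id -> float (e.g. creative share)
    neighbors: dict of unit_id -> list of adjacent unit_ids
    """
    units = list(values)
    n = len(units)
    mean = sum(values.values()) / n
    dev = {u: values[u] - mean for u in units}

    cross = 0.0   # sum over i, j of w_ij * dev_i * dev_j
    s0 = 0        # sum of all weights
    for u in units:
        for v in neighbors.get(u, []):
            if v in dev:
                cross += dev[u] * dev[v]
                s0 += 1
    if s0 == 0:
        raise ValueError("no neighbor links: Moran's I is undefined")

    denom = sum(d * d for d in dev.values())
    return (n / s0) * (cross / denom)
```

On a four-unit chain with values 1, 2, 3, 4 this gives 1/3 (a smooth gradient), and with alternating values 1, -1, 1, -1 it gives -1 (perfect checkerboard).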
Deliverable for Phase 2#
  • artifacts/models/phase2_ols.pkl, phase2_spreg.pkl, phase2_slx.pkl.
  • artifacts/figures/02_moran_lisa_{market}.html (interactive).
  • Section 2 of final report written.

Gate: H1 coefficients have expected signs (all three positive, significant at p<0.05). If any is zero or negative with high confidence, pause and diagnose before continuing — this would be a genuinely novel finding that deserves investigation, not a bug to paper over.


Phase 3 — Causal test (H3, H6) — NYC only (3 days)#

SGP has no usable panel. Skip this phase if panel_available: false.

3.1 Panel construction#

ACS 5-year estimates 2009-2013, 2014-2018, 2019-2023 — three periods per tract, N=2304×3 observations. Harmonize tract boundaries across 2010/2020 census revisions using relationship files.

3.2 Fixed effects model#
Δy_i,t = β·creative_share_i,t-1 + γ·X_i,t-1 + α_i + τ_t + ε_i,t

Where y ∈ {log median income, log employment, log rent}, tract FE α_i, period FE τ_t.

Coefficient β is the association of lagged creative concentration with subsequent growth, net of time-invariant tract characteristics. Still not causal — tract FE handles unobserved fixed heterogeneity but not time-varying confounders.

3.3 Instrumental variable strategy#

Use historical university presence (pre-1950 founding dates) as IV for current Talent index. Rationale: legacy university locations are plausibly exogenous to post-2000 gentrification dynamics but predict current educated-worker concentration. Check first-stage F-statistic (≥10 for strong instrument).
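As a back-of-envelope check on instrument strength, the first-stage F for a single instrument with no controls can be computed from the R² of regressing the endogenous variable on the instrument. This is a simplification: the actual first stage would include the controls and fixed effects, and its F would come from the IV estimation package. The toy data are made up for illustration.

```python
def first_stage_f(instrument, endogenous):
    """F-statistic of a one-regressor first stage: F = R^2 (n - 2) / (1 - R^2)."""
    n = len(instrument)
    mz = sum(instrument) / n
    mx = sum(endogenous) / n
    szz = sum((z - mz) ** 2 for z in instrument)
    sxx = sum((x - mx) ** 2 for x in endogenous)
    szx = sum((z - mz) * (x - mx) for z, x in zip(instrument, endogenous))
    r2 = szx ** 2 / (szz * sxx)
    return r2 * (n - 2) / (1 - r2)


# Illustrative only: 1 = tract near a pre-1950 university, 0 = not.
legacy_univ = [0, 0, 0, 0, 1, 1, 1, 1]
talent = [1.0, 1.2, 0.8, 1.1, 2.0, 2.3, 1.9, 2.1]
f_stat = first_stage_f(legacy_univ, talent)
```

With one instrument, this F equals the squared t-statistic on the instrument, which is why the conventional threshold of 10 is often described as a t of roughly 3.2.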

Alternative IV: historical transit station placement (subway stations operational pre-1940) as IV for current accessibility, which feeds into 3T components.

Report OLS, FE, and IV estimates side by side. If IV and FE agree, confidence in causal interpretation rises. If they disagree materially, Florida's directional claim is weakened.

3.4 H6 — Post-pandemic redistribution#

Difference-in-differences: pre-period 2015-2019, post-period 2021-2023.

Δcreative_share_i,t = α + β1·CBD_i + β2·Post_t + β3·CBD_i×Post_t + X_i·γ + ε_i,t

CBD dummy = tracts within the Manhattan CBD polygon. Coefficient β3 tests whether CBD tracts saw differential loss of creative share post-pandemic. Run parallel specs for near-CBD and outer-borough tracts.
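Without controls, the interaction coefficient β3 in a 2×2 design equals a difference of differences in cell means, which makes for a quick sanity check before fitting the full regression. The numbers below are invented purely to show the arithmetic.

```python
def did_interaction(rows):
    """β3 from the saturated 2x2 model, computed from cell means.

    rows: iterable of (is_cbd: bool, is_post: bool, outcome: float)
    """
    sums, counts = {}, {}
    for is_cbd, is_post, y in rows:
        key = (bool(is_cbd), bool(is_post))
        sums[key] = sums.get(key, 0.0) + y
        counts[key] = counts.get(key, 0) + 1

    def cell_mean(cbd, post):
        return sums[(cbd, post)] / counts[(cbd, post)]

    cbd_change = cell_mean(True, True) - cell_mean(True, False)
    other_change = cell_mean(False, True) - cell_mean(False, False)
    return cbd_change - other_change


# Made-up outcomes: CBD falls 0.40 -> 0.34, other tracts rise 0.20 -> 0.22.
toy = [(True, False, 0.40), (True, True, 0.34),
       (False, False, 0.20), (False, True, 0.22)]
beta3 = did_interaction(toy)
```

Here `beta3` is -0.08: the CBD group lost 0.06 while the comparison group gained 0.02. The regression version adds controls and standard errors but should land near this value if the controls are roughly balanced.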

Deliverable for Phase 3#
  • artifacts/models/phase3_fe.pkl, phase3_iv.pkl, phase3_did.pkl.
  • Coefficient plot comparing OLS / FE / IV estimates.
  • Section 3 of final report.

Gate: IV first-stage F ≥ 10. If not, drop IV and report FE only with explicit caveat about residual endogeneity.


Phase 4 — Quality of place signature (H4 deep dive) (1.5 days)#

Both markets. What does a high-CI tract look like at street level? This is where DA's resolution advantage over Florida is most visible.

4.1 Signature extraction#

For top-decile CI tracts in each market:

  • Mean feature vector (202-dim in SGP, production-dim in NYC)
  • Rank feature dimensions by standardized difference from market median
  • Identify top 20 distinguishing features

Report as a table: feature name, market median, top-decile mean, z-diff.
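A sketch of the signature extraction, assuming "standardized difference" means (top-decile mean minus market median) divided by the market-wide standard deviation of that feature. That scaling, the function name, and the column-dict input are assumptions of this example.

```python
import math
import statistics


def signature(features, ci, top_k=20):
    """Rank features by how far top-decile CI units sit from the market median.

    features: dict of feature_name -> list of values, one per unit
    ci:       list of Creativity Index values, same unit order
    Returns a list of (feature, market_median, top_decile_mean, z_diff).
    """
    n_top = max(1, math.ceil(len(ci) / 10))
    order = sorted(range(len(ci)), key=lambda i: ci[i], reverse=True)
    top = order[:n_top]

    rows = []
    for name, column in features.items():
        sd = statistics.pstdev(column)
        if sd == 0:
            continue  # zero-variance features cannot distinguish anything
        median = statistics.median(column)
        top_mean = statistics.fmean([column[i] for i in top])
        rows.append((name, median, top_mean, (top_mean - median) / sd))

    rows.sort(key=lambda r: abs(r[3]), reverse=True)
    return rows[:top_k]
```

Sorting by absolute z-diff keeps features that are distinctively low in creative tracts, not only those that are high.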

4.2 Cross-market comparison#

Align NYC and SGP feature dimensions where possible (both encode: place density, cuisine diversity, transit access, third-place count, etc.). Compare top-decile signatures.

Key question: do creative tracts in NYC and SGP look the same in feature space, modulo market normalization? If yes — Florida's universal claim holds. If no — document what's market-specific and what's universal.

4.3 Taxonomy-aware check (SGP)#

Explicitly test whether hawker centres and void-deck commercial appear in the top-decile signature. If they do, this validates your taxonomy work. If they don't, reconsider — either the taxonomy overweights these or Florida's creative-class framing doesn't fit Singapore.

Deliverable for Phase 4#
  • artifacts/figures/04_signature_table_{market}.html.
  • artifacts/figures/04_cross_market_signature.html (only after both markets run).
  • Section 4 of final report.

Phase 5 — Rent capitalization (H7) (1 day)#

Both markets, data-permitting.

Hedonic regression on housing transactions:

log(price_per_sqft) = β·CI_tract + controls + ε

Controls: building age, unit size, floor, transit access, school quality proxy (schools within 500m).

Test whether the CI coefficient survives the full control set. If yes, creative concentration capitalizes into rent — the inequality channel Florida 2022 emphasizes.

Additional decomposition: replace CI in the hedonic regression with its three T components entered separately. Which of Talent / Tech / Tolerance drives most of the rent premium?

Deliverable for Phase 5#
  • artifacts/models/phase5_hedonic.pkl.
  • Section 5 of final report.

Phase 6 — Cross-market synthesis (1 day)#

Run only after both NYC and SGP have completed Phases 1-2 (and 4-5 where data allows). Write final synthesis section comparing:

  • Coefficient magnitudes on 3T → creative share (do they agree in sign and rough magnitude?)
  • Factor structure (is the 3T decomposition stable across markets?)
  • Feature signatures of top-decile tracts (universal or market-specific?)
  • Rent capitalization (same channel or different?)
Deliverable for Phase 6#
  • reports/florida_replication_report.md — full final report, 15-25 pages, with embedded figures.
  • reports/florida_replication_summary.md — 2-page executive summary.

6. Key modeling choices — explicit record#

Maintained as docs/design_decisions.md, updated as the run progresses:

  1. Spatial unit — tracts for NYC, subzones for SGP. Rationale: matches existing DA feature granularity, N large enough for regression, small enough to capture intra-metro variation.
  2. Weights matrix — queen contiguity. Alternative: k-nearest-neighbors (k=6). Run both, report sensitivity in appendix.
  3. Creative class definition — Florida's SOC set for NYC; workplace-density proxy for SGP. Both documented and auditable.
  4. Tolerance measurement — deliberately replaces Bohemian/Gay Index with diversity entropy + third-place density + indie/chain ratio. We should expect our measure to correlate with Florida's but not be identical — if it differs radically from his MSA-level results, the measurement choice is the likely reason.
  5. Causal identification — FE + IV for NYC. SGP is cross-sectional only; no causal claims made from SGP results.
  6. Neighbor absorption (H8) — spatial lag of T's included as standard; not an optional extension.
  7. Chain detection — threshold at ≥10 locations in-market. Review threshold sensitivity at 5/10/20.

7. Risks and mitigations#

| Risk | Likelihood | Mitigation |
|---|---|---|
| SGP workforce proxy is too noisy to replicate Talent index | Med | Run H1 in NYC first; if proxy works (correlates with income), trust it for SGP cross-sectional only |
| Spatial autocorrelation dominates everything (ρ → 1) | Low-med | Compare OLS, SAR, SLX; report all three; if ρ > 0.6 write explicit caveat |
| IV first-stage is weak in NYC | Med | Have two IV candidates ready (pre-1950 universities and pre-1940 subway stations); if both fail, report FE only |
| Tolerance measure correlates ~perfectly with total place count | Med | Always compute and report residualized Tolerance net of place density |
| Top-decile tracts are just "dense Manhattan" or "CBD Singapore" | High | Explicitly control for distance-to-CBD and transit access in every spec |
| Post-pandemic DiD has parallel-trends violation | Med | Plot pre-period trends by tract type; if non-parallel, switch to synthetic control |

8. Out of scope for v1 (parking lot)#

  • Bangalore replication — needs BDA zoning data first (PDF-only issue documented elsewhere).
  • Place-level individual creative-class modeling — v2 with true joint place-region architecture.
  • Causal mechanism decomposition (amenity-driven vs. job-driven migration) — requires IRS migration flows, not in current stack.
  • LLM-extracted creative-class signal from venue descriptions / reviews — promising but adds scope.

9. Execution order and time budget#

| Phase | Days | Market dependency |
|---|---|---|
| 0 | 0.5 | per market |
| 1 | 2 | per market |
| 2 | 2 | per market |
| 3 | 3 | NYC only |
| 4 | 1.5 | per market |
| 5 | 1 | per market |
| 6 | 1 | after both markets |

Recommended run order: NYC full (Phases 0-5) → SGP Phases 0-2, 4-5 → Phase 6 synthesis. Total ~18 person-days for both markets (NYC 10, SGP 7, synthesis 1).


10. Entry point#

# Full NYC run
python run.py --market nyc --phases 0,1,2,3,4,5

# SGP cross-sectional run
python run.py --market sgp --phases 0,1,2,4,5

# Synthesis after both
python run.py --synthesis

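Each phase writes a gate-check file. run.py refuses to start a phase if the gate for the preceding phase in that market's sequence did not pass (for SGP, Phase 4 checks the Phase 2 gate, since Phase 3 is skipped), unless --force is passed.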
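The gating logic might look like the sketch below. The gate file naming scheme, the JSON fields, and the interpretation of "previous phase" as the preceding entry in the market's own phase list (so that SGP can go from Phase 2 to Phase 4) are all assumptions of this example, not documented behavior of run.py.

```python
import json
from pathlib import Path


def gate_path(artifacts_dir, market, phase):
    # Hypothetical naming scheme; the plan does not specify one.
    return Path(artifacts_dir) / f"gate_{market}_phase{phase}.json"


def write_gate(artifacts_dir, market, phase, passed, notes=""):
    """Record the outcome of a phase's gate check."""
    path = gate_path(artifacts_dir, market, phase)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"phase": phase, "passed": bool(passed), "notes": notes}))


def gate_passed(artifacts_dir, market, phase):
    """True only if a gate file exists and records passed == true."""
    path = gate_path(artifacts_dir, market, phase)
    return path.exists() and json.loads(path.read_text()).get("passed") is True


def may_start(phase, market_phases, artifacts_dir, market, force=False):
    """Allow a phase if forced, if it is first, or if the preceding gate passed."""
    idx = market_phases.index(phase)
    if force or idx == 0:
        return True
    return gate_passed(artifacts_dir, market, market_phases[idx - 1])
```

A missing gate file is treated the same as a failed gate, so a crashed phase cannot be silently skipped over.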


11. Success criteria for the full project#

We call this a success (regardless of which direction the results go) if:

  1. All eight hypotheses have been tested with clean methodology and published coefficients + confidence intervals.
  2. Cross-market comparison exists — we can say something substantive about whether the 3T framework generalizes.
  3. At least one finding materially extends or contradicts Florida — i.e., we have earned the right to write this up, not just repeat him.
  4. Report is written in a way that a Florida-sympathetic urbanist and a Glaeser-sympathetic economist could both read and say "this is fair."

Null or negative results are fine. We are doing science, not advocacy.