# Florida Creative Class Replication — Digital Atlas

**Handoff plan for Claude Code.** Market-parametric: runs on NYC or Singapore via config switch. Phases gated by validation checks.

---

## 1. Objective

Empirically test Richard Florida's Creative Class thesis at **sub-metro resolution** using Digital Atlas feature vectors, and produce results that are either (a) the first clean replication at this granularity, or (b) evidence that the thesis fails when amenity and workforce signals are measured directly rather than through coarse proxies.

Florida's original work ran at MSA level using occupation-code shares and index proxies (Bohemian Index, Gay Index, Milken Tech-Pole). We run at **census tract (NYC, N=2,304)** or **subzone (SGP, N=332)** using observed place-level signals, typed feature vectors, and graph-aware neighbor controls.

---

## 2. Testable claims (target: prove or disprove each)

| # | Claim | Primary test | Outcome |
|---|-------|--------------|---------|
| H1 | 3T's (Talent, Technology, Tolerance) jointly predict creative-class concentration | Cross-sectional regression of creative-class share on 3T indices | Coefficient signs + magnitude |
| H2 | The 3T's are empirically separable, not a single latent factor | Factor analysis + VIF + orthogonalization test | 3 factors ≥ 10% variance each, or collapse to 1 |
| H3 | Creative-class concentration precedes employment/wage growth (causal direction) | Panel FE + IV (NYC only — SGP has no panel) | Sign of lagged creative share on Δwage |
| H4 | "Quality of place" amenity signature is a stronger predictor than labor-market factors | Compare R² of amenity-only vs. labor-only vs. combined models | Amenity R² >> labor R² |
| H5 | Creative concentration is spatially clustered intra-metro, not uniformly distributed | Moran's I + LISA hotspots on creative share | Moran's I > 0.3, p<0.01 |
| H6 | (NYC only) Creative class redistributes intra-metro post-pandemic (2019→2023) | Δ creative share × tract type (CBD vs. neighborhood) | Significant interaction |
| H7 | Creative concentration capitalizes into rent (inequality claim) | Regression of rent on creative share, controlling for amenities and access | Positive residual capitalization |
| H8 | Neighbor absorption matters: a tract's apparent creative gap is partially explained by adjacent supply | Gap model with/without spatial lag of 3T | Spatial lag adds material explained variance |

**H8 is ours, not Florida's** — it's the contribution that goes beyond replication.

---

## 3. Repository layout

```
/repos/florida_replication/
├── config/
│   ├── market_nyc.yaml
│   └── market_sgp.yaml
├── src/
│   ├── data/              # source loaders per market
│   ├── features/          # 3T index construction
│   ├── models/            # regressions, panel, spatial
│   ├── validate/          # gate checks between phases
│   └── viz/               # maps, coefficient plots
├── notebooks/             # exploratory only; production is src/
├── artifacts/
│   ├── indices/           # *_3t_indices.parquet
│   ├── models/            # *.pkl
│   └── figures/
├── reports/
│   └── florida_replication_report.md  # final deliverable
└── run.py                 # entrypoint: python run.py --market nyc --phases 1,2,3
```

---

## 4. Market configuration

### 4.1 NYC config (`config/market_nyc.yaml`)

```yaml
market: nyc
spatial_unit: census_tract
n_units: 2304
crs: EPSG:2263
sources:
  places: /data/nyc/places_full.parquet        # ~250K rows
  tract_features: /data/nyc/tract_features_v3.parquet  # existing region vectors
  tract_geoms: /data/nyc/tracts_2020.geojson
  acs_panel: /data/nyc/acs_5yr_2009_2023.parquet
  rodx: /data/nyc/mega_graph.pkl               # existing mega graph
  tract_adjacency: /data/nyc/tract_queen_contig.parquet
  housing_transactions: /data/nyc/rpad_sales_panel.parquet
outcomes:
  - median_hh_income
  - bachelor_plus_share
  - median_gross_rent
  - employment_total
panel_available: true
```

### 4.2 Singapore config (`config/market_sgp.yaml`)

```yaml
market: sgp
spatial_unit: subzone
n_units: 332
crs: EPSG:3414
sources:
  places: /data/sgp/places_overture_plus_local.parquet  # ~67K rows
  subzone_features: /data/sgp/subzone_features_202d.parquet  # 19-step pipeline output
  subzone_geoms: /data/sgp/mp19_subzones.geojson
  singstat_census: /data/sgp/singstat_2020_2025.parquet   # PA-level, dasymetric to subzone
  housing_transactions: /data/sgp/hdb_private_txn_panel.parquet
outcomes:
  - median_hh_income      # from Census, dasymetric-allocated
  - professional_share    # proxy for creative class via industry
  - private_psf           # price per sqft, private residential
panel_available: false   # 2 census points only, cross-sectional primary
```

---

## 5. Phase plan

### Phase 0 — Environment setup and sanity checks (0.5 day)

- Load market config, verify all source files exist and have expected schema.
- Spatial sanity: `n_units == len(geoms) == len(features)`, no orphan tracts/subzones.
- Feature vector integrity: for SGP confirm 202-dim, for NYC confirm current production dimensionality. Log any NaN or zero-variance columns.
- Produce `artifacts/00_data_inventory.md` listing N units, N places, feature columns, date range.

**Gate:** all sources loadable, no schema mismatches, feature vector dimension matches config. Stop if not.
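
A minimal sketch of the spatial sanity check, assuming unit IDs have already been read from the geometry and feature files into plain lists. The function name `check_units` and its return shape are illustrative and not part of the existing codebase.

```python
def check_units(n_units, geom_ids, feature_ids):
    """Return a list of human-readable problems; an empty list means the check passes."""
    problems = []
    geom_set, feat_set = set(geom_ids), set(feature_ids)

    # Duplicated IDs could make the raw lengths agree with n_units by accident.
    if len(geom_set) != len(geom_ids):
        problems.append("duplicate unit IDs in geometries")
    if len(feat_set) != len(feature_ids):
        problems.append("duplicate unit IDs in features")

    # n_units == len(geoms) == len(features)
    if not (n_units == len(geom_set) == len(feat_set)):
        problems.append(
            f"count mismatch: config={n_units}, "
            f"geoms={len(geom_set)}, features={len(feat_set)}"
        )

    # Orphans: units present on one side only.
    for uid in sorted(geom_set - feat_set):
        problems.append(f"geometry without features: {uid}")
    for uid in sorted(feat_set - geom_set):
        problems.append(f"features without geometry: {uid}")
    return problems
```

Returning every problem at once, rather than stopping at the first, makes the Phase 0 inventory easier to write.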

---

### Phase 1 — Operationalize the 3T's at our spatial unit (2 days)

We construct three indices per tract/subzone: `T_talent`, `T_tech`, `T_tolerance`. Each is a composite z-score of several observed measures. **No proxy is imported from Florida verbatim — we use our richer data and document each mapping explicitly.**

#### 1.1 Talent index

| Component | NYC source | SGP source | Weight |
|-----------|-----------|-----------|--------|
| Bachelor-plus share | ACS B15003 | SingStat (PA→subzone dasymetric) | 0.4 |
| Graduate-degree share | ACS B15003 | SingStat | 0.3 |
| Creative-occupation share | ACS C24010 (SOC-based, see §5.1.3) | Proxy via tech-firm + design-firm workplace density | 0.3 |

**5.1.3 — SOC → Creative Class mapping (NYC).** Use Florida's occupation set: SOC 11 (Management), 13 (Business/Financial Ops), 15 (Computer/Math), 17 (Architecture/Engineering), 19 (Life/Physical/Social Sci), 21 (Community/Social Svc), 23 (Legal), 25 (Education), 27 (Arts/Design/Entertainment/Media), 29 (Healthcare Practitioners). Store mapping as `src/features/soc_creative_class.py` with the exact SOC codes — reviewer-auditable.
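
One possible shape for `src/features/soc_creative_class.py`, sketched under the assumption that occupation counts arrive as a dict keyed by SOC code. The major groups are exactly those listed above; the helper names `is_creative` and `creative_share` and the input shape are hypothetical.

```python
# SOC major groups listed in this plan as the creative-class set (2-digit prefix -> label).
CREATIVE_SOC_MAJOR_GROUPS = {
    "11": "Management",
    "13": "Business/Financial Ops",
    "15": "Computer/Math",
    "17": "Architecture/Engineering",
    "19": "Life/Physical/Social Sci",
    "21": "Community/Social Svc",
    "23": "Legal",
    "25": "Education",
    "27": "Arts/Design/Entertainment/Media",
    "29": "Healthcare Practitioners",
}


def is_creative(soc_code):
    """True if a SOC code such as '15-1252' falls in a listed major group."""
    return str(soc_code).strip()[:2] in CREATIVE_SOC_MAJOR_GROUPS


def creative_share(employment_by_soc):
    """Share of employed workers in creative-class groups for one tract.

    employment_by_soc: dict SOC code -> employed count (hypothetical input shape).
    Returns None when the tract has no employment, so callers can keep it as NaN.
    """
    total = sum(employment_by_soc.values())
    if total == 0:
        return None
    creative = sum(n for code, n in employment_by_soc.items() if is_creative(code))
    return creative / total
```

Keeping the mapping as a plain module-level dict means a reviewer can audit the set by reading one literal.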

**SGP proxy.** Without occupation-at-subzone data, derive a "creative workplace intensity" score from place-type density: software/tech offices, design studios, architecture firms, research institutions, media companies, universities. Weight by employment-capacity estimates where available. Document this as a proxy, not a direct measure.

#### 1.2 Technology index

| Component | NYC source | SGP source | Weight |
|-----------|-----------|-----------|--------|
| Tech firm density (per km²) | Places with NAICS 5112/5415/5417 | Overture + local: software, R&D, tech services | 0.35 |
| Coworking space count | Places typed as coworking | Same | 0.15 |
| R&D institution presence | Universities + research centers | NUS/NTU/SMU campuses + A*STAR, research parks | 0.25 |
| Patent density (if available) | USPTO by zip→tract allocation | IPOS if obtainable, else drop | 0.25 |

If patent data isn't available in SGP, renormalize weights and flag in report.
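
The renormalization is simple arithmetic: drop the missing component and rescale the remaining weights to sum to 1. A short sketch, with component keys invented here for illustration:

```python
def renormalize(weights, available):
    """Drop components not in `available` and rescale the rest to sum to 1."""
    kept = {k: w for k, w in weights.items() if k in available}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("no weighted components available")
    return {k: w / total for k, w in kept.items()}


# Weights from the Technology index table; the key names are hypothetical.
TECH_WEIGHTS = {
    "tech_firm_density": 0.35,
    "coworking_count": 0.15,
    "rd_presence": 0.25,
    "patent_density": 0.25,
}

# SGP without patent data: 0.35, 0.15, 0.25 are each divided by 0.75.
sgp_weights = renormalize(
    TECH_WEIGHTS, available={"tech_firm_density", "coworking_count", "rd_presence"}
)
```

Relative weights among the surviving components are preserved, so tech firm density still counts 0.35/0.25 times as much as R&D presence.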

#### 1.3 Tolerance index

**This is where we deliberately depart from Florida's proxies.** His Gay Index and Bohemian Index were the best he could do with 2000-era data. We measure tolerance-adjacent urban signals directly:

| Component | Measure | Rationale |
|-----------|---------|-----------|
| Cuisine diversity | Shannon entropy over cuisine categories in restaurants | Proxy for cultural openness |
| Venue-type heterogeneity | Shannon entropy over all place categories | Mixed-use authenticity |
| Third-place density | Indie coffee + bars + bookstores + galleries + music venues, per km² | Florida's "street-level culture" |
| Indie-to-chain ratio | Fraction of retail/F&B that are non-chain | Authenticity signal |

For SGP, add market-specific adjustments: hawker centres, void-deck commercial, and kopitiams count as third places and get mapped into the indie side of the indie/chain ratio (per your existing taxonomy work).
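
Both diversity components in the table use the same quantity, Shannon entropy `H = -Σ p_k · ln(p_k)` over category shares. A minimal sketch, assuming each unit's places are available as a list of category labels (the function name is illustrative):

```python
import math
from collections import Counter


def shannon_entropy(categories):
    """Shannon entropy (natural log) of a list of category labels; 0.0 for empty input."""
    counts = Counter(categories)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

Entropy is 0 when every place shares one category and reaches `ln(K)` when `K` categories are equally represented, so it tends to rise with the number of places in a unit.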

Chain detection: use brand-name frequency threshold (≥10 locations in market = chain). Store `src/features/chain_detector.py`.
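
A hedged sketch of what `src/features/chain_detector.py` could contain. The brand normalization here (lowercase, alphanumerics only) is an assumption made for illustration, and the SGP rule that maps hawker centres and kopitiams to the indie side is not handled.

```python
from collections import Counter


def normalize_brand(name):
    """Crude brand key: lowercase, alphanumerics only (an assumption, not a production rule)."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def detect_chains(place_names, threshold=10):
    """Return the set of brand keys with at least `threshold` locations in the market."""
    counts = Counter(normalize_brand(n) for n in place_names)
    return {brand for brand, c in counts.items() if c >= threshold}


def indie_ratio(place_names, chains):
    """Fraction of places whose brand key is not in `chains`; None if there are no places."""
    if not place_names:
        return None
    indie = sum(1 for n in place_names if normalize_brand(n) not in chains)
    return indie / len(place_names)
```

`detect_chains` should run once over the whole market and `indie_ratio` per unit, since a brand with ten locations citywide may appear only once in any given tract.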

#### 1.4 Composite Creativity Index

`CI = (T_talent + T_tech + T_tolerance) / 3` after z-scoring each component. Also retain the three separately for H2 (separability test).
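
The composite can be sketched as follows, assuming each index is a list aligned by unit with `None` for units that could not be scored. Population standard deviation is used here as an assumption; the plan does not specify sample versus population.

```python
import statistics


def zscore(values):
    """Z-score a list, ignoring None when computing mean and std; None stays None."""
    observed = [v for v in values if v is not None]
    mu = statistics.fmean(observed)
    sd = statistics.pstdev(observed)
    if sd == 0:
        raise ValueError("zero-variance column cannot be z-scored")
    return [None if v is None else (v - mu) / sd for v in values]


def composite_index(talent, tech, tolerance):
    """CI = mean of the three z-scored indices; None if any component is missing."""
    cols = [zscore(talent), zscore(tech), zscore(tolerance)]
    return [None if None in row else sum(row) / 3 for row in zip(*cols)]
```

Propagating `None` instead of averaging over the available components keeps CI comparable across units, at the cost of slightly lower coverage.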

#### Deliverable for Phase 1

- `artifacts/indices/{market}_3t_indices.parquet` with columns: `unit_id, T_talent, T_tech, T_tolerance, CI`, plus all component measures.
- `artifacts/01_3t_construction.md` documenting every measure, source, and weight choice.

**Gate:** Indices computed for ≥99% of units (allow NaN for tracts/subzones with N_places=0 or population=0). Distribution of each index is roughly normal post-z-score. Correlation matrix of 3T components produced and inspected — if any pair has r>0.9, flag as collinearity issue.

---

### Phase 2 — Cross-sectional test (H1, H2, H4, H5) (2 days)

Both markets. Unit of observation: tract (NYC) or subzone (SGP).

#### 2.1 H1 — 3T → creative-class share

Baseline OLS:
```
creative_share_i = β0 + β1·T_talent_i + β2·T_tech_i + β3·T_tolerance_i + X_i·γ + ε_i
```
Where `X_i` includes controls: population density, distance to CBD, transit accessibility (isochrone-based), total place count.

Then spatial specification:
```
creative_share_i = β0 + β·T_i + ρ·W·creative_share_i + X_i·γ + ε_i
```
Queen-contiguity weights matrix `W`. Use `pysal.model.spreg.ML_Lag`. Compare to OLS — if ρ is significant and large, spatial dependence is real and OLS estimates are biased.

**H8 sidecar:** also fit with spatial lag of T's (SLX) — `W·T_talent`, `W·T_tech`, `W·T_tolerance` — and test whether neighbor 3T's explain own creative share beyond own 3T's. This is the neighbor absorption test at the 3T level.
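
With row-standardized queen weights, the lagged term `W·T` for a unit is simply the mean of that index over its neighbors. A small sketch, assuming adjacency is available as a dict of neighbor lists (the function name and input shapes are illustrative):

```python
def spatial_lag(values, neighbors):
    """Row-standardized spatial lag: mean of each unit's neighbors' values.

    values: dict unit_id -> float
    neighbors: dict unit_id -> iterable of adjacent unit_ids (queen contiguity)
    Units with no neighbors (islands) get None rather than 0, so they are not
    mistaken for units surrounded by low values.
    """
    lag = {}
    for uid in values:
        nbr_vals = [values[n] for n in neighbors.get(uid, ()) if n in values]
        lag[uid] = sum(nbr_vals) / len(nbr_vals) if nbr_vals else None
    return lag
```

The resulting `W·T_talent`, `W·T_tech`, and `W·T_tolerance` columns enter the regression as ordinary covariates alongside the own-unit indices.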

#### 2.2 H2 — Are the 3T's separable?

- Exploratory factor analysis on the underlying component measures (not the indices themselves).
- Report eigenvalues, scree plot, loadings.
- Decision rule: if top factor explains >70% of variance and no second factor has eigenvalue >1, Florida's three-factor structure is rejected at this resolution.
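
The decision rule can be checked mechanically from the eigenvalues of the correlation matrix. The sketch below assumes NumPy is available and uses a PCA-style eigen-decomposition as a stand-in; a full exploratory factor analysis would also estimate loadings and uniquenesses. The function name and return keys are hypothetical.

```python
import numpy as np


def factor_structure_check(X, top_share_cut=0.70, eigen_cut=1.0):
    """Apply the decision rule to a (units x measures) array of component measures."""
    corr = np.corrcoef(X, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]  # descending order
    share = eigvals / eigvals.sum()
    # Single latent factor: top factor > 70% of variance and no second eigenvalue > 1.
    single_factor = bool(share[0] > top_share_cut and eigvals[1] <= eigen_cut)
    return {
        "eigenvalues": eigvals,
        "variance_share": share,
        "single_factor": single_factor,
    }
```

`single_factor=True` corresponds to rejecting the three-factor structure at this resolution; the eigenvalues also feed the scree plot directly.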

#### 2.3 H4 — Amenity vs. labor predictive power

Three models for the same outcome (creative share or median income); M_A and M_L are each nested in M_C:
- M_A: amenity-only features (place density, diversity, third-place count, walkability)
- M_L: labor-only features (education shares, occupation mix for NYC / industry mix for SGP)
- M_C: combined

Compare adj-R² and cross-validated R² (5-fold spatial CV — hold out contiguous blocks, not random rows).
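
One simple way to get contiguous hold-out blocks is to overlay a square grid on unit centroids and assign whole grid cells to folds. This is a sketch of that idea only; the grid approach, the `block_size` parameter, and the function name are assumptions, and other blocking schemes (for example by borough or planning area) would work equally well.

```python
def spatial_block_folds(centroids, block_size, n_folds=5):
    """Assign each unit to a CV fold by the grid block containing its centroid.

    centroids: dict unit_id -> (x, y) in projected units (e.g. the market CRS)
    block_size: side length of a square block, same units as the coordinates
    All units in one block land in the same fold, so a held-out fold is made of
    whole blocks rather than randomly scattered rows.
    """
    block_of = {
        uid: (int(x // block_size), int(y // block_size))
        for uid, (x, y) in centroids.items()
    }
    # Deterministic block order; blocks are dealt round-robin to folds.
    blocks = sorted(set(block_of.values()))
    fold_of_block = {b: i % n_folds for i, b in enumerate(blocks)}
    return {uid: fold_of_block[b] for uid, b in block_of.items()}
```

Block size trades off leakage against fold balance: blocks much smaller than the range of spatial autocorrelation behave almost like random row splits.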

#### 2.4 H5 — Spatial clustering

- Global Moran's I on creative share.
- LISA (Local Indicators of Spatial Association) to identify HH (high-high) and LL (low-low) clusters.
- Map top-decile creative tracts; check if they form contiguous corridors or scatter.
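
For reference, global Moran's I is `I = (n / S0) · Σ_ij w_ij z_i z_j / Σ_i z_i²`, where `z` is the deviation from the mean and `S0` is the sum of all weights. A dependency-free sketch with binary contiguity weights follows; it gives the statistic only, without the permutation-based p-value that the H5 threshold requires.

```python
def morans_i(values, neighbors):
    """Global Moran's I with binary contiguity weights.

    values: dict unit_id -> float
    neighbors: dict unit_id -> iterable of adjacent unit_ids (assumed symmetric)
    """
    n = len(values)
    mean = sum(values.values()) / n
    dev = {uid: v - mean for uid, v in values.items()}
    denom = sum(d * d for d in dev.values())
    cross = 0.0  # sum over i, j of w_ij * z_i * z_j
    s0 = 0       # sum of all weights
    for i, nbrs in neighbors.items():
        for j in nbrs:
            if i in dev and j in dev:
                cross += dev[i] * dev[j]
                s0 += 1
    if s0 == 0 or denom == 0:
        raise ValueError("Moran's I undefined: no links or zero variance")
    return (n / s0) * (cross / denom)
```

On four units in a line with values 1, 2, 3, 4 this yields 1/3, and a strictly alternating pattern yields -1, which is a quick way to confirm the sign convention.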

#### Deliverable for Phase 2

- `artifacts/models/phase2_ols.pkl`, `phase2_spreg.pkl`, `phase2_slx.pkl`.
- `artifacts/figures/02_moran_lisa_{market}.html` (interactive).
- Section 2 of final report written.

**Gate:** H1 coefficients have expected signs (all three positive, significant at p<0.05). If any coefficient is a tightly estimated zero or significantly negative, pause and diagnose before continuing. That would be a genuinely novel finding that deserves investigation, not a bug to paper over.

---

### Phase 3 — Causal test (H3, H6) — **NYC only** (3 days)

SGP has no usable panel. Skip this phase if `panel_available: false`.

#### 3.1 Panel construction

ACS 5-year estimates 2009-2013, 2014-2018, 2019-2023 — three periods per tract, N=2304×3 observations. Harmonize tract boundaries across 2010/2020 census revisions using relationship files.

#### 3.2 Fixed effects model

```
Δy_i,t = β·creative_share_i,t-1 + γ·X_i,t-1 + α_i + τ_t + ε_i,t
```
Where `y` ∈ {log median income, log employment, log rent}, tract FE `α_i`, period FE `τ_t`.

Coefficient β is the association of lagged creative concentration with subsequent growth, net of time-invariant tract characteristics. Still not causal — tract FE handles unobserved fixed heterogeneity but not time-varying confounders.

#### 3.3 Instrumental variable strategy

Use **historical university presence (pre-1950 founding dates)** as IV for current Talent index. Rationale: legacy university locations are plausibly exogenous to post-2000 gentrification dynamics but predict current educated-worker concentration. Check first-stage F-statistic (≥10 for strong instrument).
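
To make the first-stage check concrete: with a single instrument and no other covariates, the first-stage F equals the squared t-statistic on the instrument's slope. The sketch below covers only that bivariate case; a real first stage would also include the exogenous controls, and the function name is illustrative.

```python
def first_stage_f(instrument, endogenous):
    """F-statistic for one instrument in a bivariate first stage with intercept."""
    n = len(instrument)
    if n < 3 or n != len(endogenous):
        raise ValueError("need at least 3 paired observations")
    zbar = sum(instrument) / n
    xbar = sum(endogenous) / n
    szz = sum((z - zbar) ** 2 for z in instrument)
    szx = sum((z - zbar) * (x - xbar) for z, x in zip(instrument, endogenous))
    slope = szx / szz
    intercept = xbar - slope * zbar
    rss = sum((x - intercept - slope * z) ** 2 for z, x in zip(instrument, endogenous))
    sigma2 = rss / (n - 2)  # residual variance
    # F = slope^2 / Var(slope), with Var(slope) = sigma2 / szz
    return slope ** 2 / (sigma2 / szz)
```

An instrument that barely shifts the Talent index produces an F near zero, which is the case the ≥10 rule of thumb is meant to catch.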

Alternative IV: **historical transit station placement** (subway stations operational pre-1940) as IV for current accessibility, which feeds into 3T components.

Report OLS, FE, and IV estimates side by side. If IV and FE agree, confidence in causal interpretation rises. If they disagree materially, Florida's directional claim is weakened.

#### 3.4 H6 — Post-pandemic redistribution

Difference-in-differences: pre-period 2015-2019, post-period 2021-2023.
```
Δcreative_share_i,t = α + β1·CBD_i + β2·Post_t + β3·CBD_i×Post_t + X_i,t·γ + ε_i,t
```
CBD dummy = tracts within the Manhattan CBD polygon. Coefficient β3 tests whether CBD tracts saw differential loss of creative share post-pandemic. Run parallel specs for near-CBD and outer-borough tracts.
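
Without covariates, the interaction coefficient β3 is exactly the difference of the two pre/post differences in cell means. A minimal sketch of that 2×2 calculation, assuming rows of `(cbd, post, outcome)` with 0/1 coding (a hypothetical input shape):

```python
def did_estimate(rows):
    """2x2 difference-in-differences from cell means.

    rows: iterable of (cbd, post, outcome) with cbd and post coded 0/1.
    Returns (CBD post - CBD pre) - (non-CBD post - non-CBD pre).
    """
    sums, counts = {}, {}
    for cbd, post, y in rows:
        key = (int(cbd), int(post))
        sums[key] = sums.get(key, 0.0) + y
        counts[key] = counts.get(key, 0) + 1
    missing = [k for k in [(0, 0), (0, 1), (1, 0), (1, 1)] if k not in counts]
    if missing:
        raise ValueError(f"empty cells: {missing}")
    mean = {k: sums[k] / counts[k] for k in counts}
    return (mean[(1, 1)] - mean[(1, 0)]) - (mean[(0, 1)] - mean[(0, 0)])
```

This is useful as a cross-check on the regression output: the covariate-free fit should reproduce this number.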

#### Deliverable for Phase 3

- `artifacts/models/phase3_fe.pkl`, `phase3_iv.pkl`, `phase3_did.pkl`.
- Coefficient plot comparing OLS / FE / IV estimates.
- Section 3 of final report.

**Gate:** IV first-stage F > 10. If not, drop IV and report FE only with explicit caveat about residual endogeneity.

---

### Phase 4 — Quality of place signature (H4 deep dive) (1.5 days)

Both markets. What does a high-CI tract *look like* at street level? This is where DA's resolution advantage over Florida is most visible.

#### 4.1 Signature extraction

For top-decile CI tracts in each market:
- Mean feature vector (202-dim in SGP, production-dim in NYC)
- Rank feature dimensions by standardized difference from market median
- Identify top 20 distinguishing features

Report as a table: feature name, market median, top-decile mean, z-diff.
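
A sketch of the ranking step, assuming features are held as a dict of per-unit columns. Here `z_diff` is taken to mean (top-decile mean − market median) / market standard deviation, which is one reasonable reading of "standardized difference"; the function name and input shape are hypothetical.

```python
import statistics


def signature_table(features, top_ids, k=20):
    """Rank features by how far the top-decile mean sits from the market median.

    features: dict feature_name -> dict unit_id -> value (all units in the market)
    top_ids: set of unit_ids in the top CI decile
    Returns up to k rows of (feature name, market median, top-decile mean, z_diff).
    """
    rows = []
    for name, col in features.items():
        vals = list(col.values())
        sd = statistics.pstdev(vals)
        if sd == 0:
            continue  # zero-variance features cannot distinguish anything
        median = statistics.median(vals)
        top_mean = statistics.fmean([col[u] for u in top_ids])
        rows.append((name, median, top_mean, (top_mean - median) / sd))
    rows.sort(key=lambda r: abs(r[3]), reverse=True)
    return rows[:k]
```

Sorting by absolute `z_diff` keeps features that are distinctively low in creative tracts, not only those that are high.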

#### 4.2 Cross-market comparison

Align NYC and SGP feature dimensions where possible (both encode: place density, cuisine diversity, transit access, third-place count, etc.). Compare top-decile signatures.

**Key question:** do creative tracts in NYC and SGP look the same in feature space, modulo market normalization? If yes — Florida's universal claim holds. If no — document what's market-specific and what's universal.

#### 4.3 Taxonomy-aware check (SGP)

Explicitly test whether hawker centres and void-deck commercial appear in the top-decile signature. If they do, this validates your taxonomy work. If they don't, reconsider — either the taxonomy overweights these or Florida's creative-class framing doesn't fit Singapore.

#### Deliverable for Phase 4

- `artifacts/figures/04_signature_table_{market}.html`.
- `artifacts/figures/04_cross_market_signature.html` (only after both markets run).
- Section 4 of final report.

---

### Phase 5 — Rent capitalization (H7) (1 day)

Both markets, data-permitting.

Hedonic regression on housing transactions:
```
log(price_per_sqft) = β·CI_tract + controls + ε
```
Controls: building age, unit size, floor, transit access, school quality proxy (schools within 500m).

Test whether the CI coefficient survives the full control set. If yes, creative concentration capitalizes into housing prices, used here as the transaction-level stand-in for rent. This is the inequality channel Florida 2022 emphasizes.

Additional decomposition: replace CI in the hedonic with its three T components entered separately. Which of Talent / Tech / Tolerance drives most of the rent premium?

#### Deliverable for Phase 5

- `artifacts/models/phase5_hedonic.pkl`.
- Section 5 of final report.

---

### Phase 6 — Cross-market synthesis (1 day)

Run only after both NYC and SGP have completed Phases 1-2 (and 4-5 where data allows). Write final synthesis section comparing:

- Coefficient magnitudes on 3T → creative share (do they agree in sign and rough magnitude?)
- Factor structure (is the 3T decomposition stable across markets?)
- Feature signatures of top-decile tracts (universal or market-specific?)
- Rent capitalization (same channel or different?)

#### Deliverable for Phase 6

- `reports/florida_replication_report.md` — full final report, 15-25 pages, with embedded figures.
- `reports/florida_replication_summary.md` — 2-page executive summary.

---

## 6. Key modeling choices — explicit record

Maintained as `docs/design_decisions.md`, updated as the run progresses:

1. **Spatial unit** — tracts for NYC, subzones for SGP. Rationale: matches existing DA feature granularity, N large enough for regression, small enough to capture intra-metro variation.
2. **Weights matrix** — queen contiguity. Alternative: k-nearest-neighbors (k=6). Run both, report sensitivity in appendix.
3. **Creative class definition** — Florida's SOC set for NYC; workplace-density proxy for SGP. Both documented and auditable.
4. **Tolerance measurement** — deliberately replaces Bohemian/Gay Index with diversity entropy + third-place density + indie/chain ratio. We should expect our measure to correlate with Florida's but not be identical — if it differs radically from his MSA-level results, the measurement choice is the likely reason.
5. **Causal identification** — FE + IV for NYC. SGP is cross-sectional only; no causal claims made from SGP results.
6. **Neighbor absorption (H8)** — spatial lag of T's included as standard; not an optional extension.
7. **Chain detection** — threshold at ≥10 locations in-market. Review threshold sensitivity at 5/10/20.

---

## 7. Risks and mitigations

| Risk | Likelihood | Mitigation |
|------|-----------|------------|
| SGP workforce proxy is too noisy to replicate Talent index | Med | Run H1 in NYC first; if proxy works (correlates with income), trust it for SGP cross-sectional only |
| Spatial autocorrelation dominates everything (ρ → 1) | Low-med | Compare OLS, SAR, SLX; report all three; if ρ > 0.6 write explicit caveat |
| IV first-stage is weak in NYC | Med | Have two IV candidates ready (pre-1950 universities and pre-1940 subway stations); if both fail, report FE only |
| Tolerance measure correlates ~perfectly with total place count | Med | Always compute and report residualized Tolerance net of place density |
| Top-decile tracts are just "dense Manhattan" or "CBD Singapore" | High | Explicitly control for distance-to-CBD and transit access in every spec |
| Post-pandemic DiD has parallel-trends violation | Med | Plot pre-period trends by tract type; if non-parallel, switch to synthetic control |

---

## 8. Out of scope for v1 (parking lot)

- Bangalore replication — needs BDA zoning data first (PDF-only issue documented elsewhere).
- Place-level individual creative-class modeling — v2 with true joint place-region architecture.
- Causal mechanism decomposition (amenity-driven vs. job-driven migration) — requires IRS migration flows, not in current stack.
- LLM-extracted creative-class signal from venue descriptions / reviews — promising but adds scope.

---

## 9. Execution order and time budget

| Phase | Days | Market dependency |
|-------|------|-------------------|
| 0 | 0.5 | per market |
| 1 | 2 | per market |
| 2 | 2 | per market |
| 3 | 3 | NYC only |
| 4 | 1.5 | per market |
| 5 | 1 | per market |
| 6 | 1 | after both markets |

**Recommended run order:** NYC full (Phases 0-5) → SGP Phases 0-2, 4-5 → Phase 6 synthesis. Total ~18 person-days for both markets (10 for NYC, 7 for SGP, 1 for synthesis).

---

## 10. Entry point

```bash
# Full NYC run
python run.py --market nyc --phases 0,1,2,3,4,5

# SGP cross-sectional run
python run.py --market sgp --phases 0,1,2,4,5

# Synthesis after both
python run.py --synthesis
```

Each phase writes a gate-check file. `run.py` refuses to start a phase if the gate of the preceding phase in the requested sequence didn't pass (for SGP, Phase 4 follows Phase 2), unless `--force` is passed.
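
The gating rule might look like the sketch below. It assumes the prerequisite is the previous phase in the requested list rather than literally phase N-1, so that an SGP run skipping Phase 3 is not blocked at Phase 4. The function name and the in-memory `passed` set are illustrative; the real `run.py` would read gate-check files.

```python
def can_start(phase, requested, passed, force=False):
    """Decide whether `phase` may start.

    requested: ordered list of phases for this run, e.g. [0, 1, 2, 4, 5]
    passed: set of phases whose gate-check file reports a pass
    """
    if force:
        return True
    idx = requested.index(phase)
    if idx == 0:
        return True  # the first requested phase has no prerequisite
    return requested[idx - 1] in passed
```

For example, `can_start(4, [0, 1, 2, 4, 5], passed={0, 1, 2})` returns `True`, while Phase 5 in the same run stays blocked until Phase 4 reports a pass.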

---

## 11. Success criteria for the full project

We call this a success (regardless of which direction the results go) if:

1. All eight hypotheses have been tested with clean methodology and published coefficients and confidence intervals.
2. Cross-market comparison exists — we can say something substantive about whether the 3T framework generalizes.
3. At least one finding materially extends or contradicts Florida — i.e., we have earned the right to write this up, not just repeat him.
4. Report is written in a way that a Florida-sympathetic urbanist and a Glaeser-sympathetic economist could both read and say "this is fair."

Null or negative results are fine. We are doing science, not advocacy.
