About This Project
How we analyzed 227 million Medicaid billing records — and what we found.
The Story Behind the Data
February 13, 2026
The HHS DOGE team open-sourced what they called “the largest Medicaid dataset in department history” — aggregated, provider-level claims data covering every billing code from 2018 through 2024. The announcement received over 50 million views.
DOGE specifically noted the data could detect the large-scale autism diagnosis fraud in Minnesota — a scheme where providers billed for autism therapy never delivered, resulting in $100M+ in fraudulent claims.
We built OpenMedicaid to make this data permanently accessible — not as a one-time blog post, but as a searchable resource where journalists, researchers, policymakers, and citizens can explore how $1.09 trillion was distributed.
Data Source
All data comes from the HHS Open Data Platform — Medicaid Provider Spending dataset.
Records
227M
Total Payments
$1.09T
Providers
617,503
Procedure Codes
10,881
Codes Benchmarked
9,578
Date Range
2018–2024
Code-Specific Fraud Detection (Primary)
Our primary analysis uses code-specific benchmarks: we compare each provider's cost per claim against the national median for that exact procedure code. We compute percentile distributions (p10–p99) for 9,578 HCPCS codes to identify providers billing significantly above the benchmark for each code.
Cost Outlier
Billing over 3× the national median for specific procedure codes.
Threshold: Billing >3× the national median cost/claim for a specific HCPCS code
Example: Provider bills $296/claim for G9005 when the national median is $47 (6.3×)
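The cost-outlier check can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual pipeline: the row layout (provider, HCPCS code, total paid, claim count), the function name cost_outliers, and the choice to take the median over provider-level cost-per-claim values are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import median

def cost_outliers(rows, multiple=3.0):
    """Flag (provider, code) pairs whose cost per claim exceeds
    `multiple` times the median cost per claim for that code.

    rows: iterable of (provider_id, hcpcs_code, total_paid, claim_count);
    this layout is an assumption made for illustration.
    """
    # Sum payments and claims per provider/code so monthly rows collapse to one figure.
    totals = defaultdict(lambda: [0.0, 0])
    for provider, code, paid, claims in rows:
        totals[(provider, code)][0] += paid
        totals[(provider, code)][1] += claims

    # Cost per claim, grouped by procedure code.
    per_code = defaultdict(dict)
    for (provider, code), (paid, claims) in totals.items():
        if claims > 0:
            per_code[code][provider] = paid / claims

    flags = []
    for code, costs in per_code.items():
        benchmark = median(costs.values())
        for provider, cost in costs.items():
            if benchmark > 0 and cost > multiple * benchmark:
                flags.append((provider, code, round(cost / benchmark, 1)))
    return flags

# Hypothetical records; only the $296 and $47 per-claim figures come from the example above.
rows = [
    ("A", "G9005", 4000.0, 100),   # $40 per claim
    ("B", "G9005", 4500.0, 100),   # $45 per claim
    ("C", "G9005", 4700.0, 100),   # $47 per claim (the median)
    ("D", "G9005", 5000.0, 100),   # $50 per claim
    ("E", "G9005", 29600.0, 100),  # $296 per claim
]
print(cost_outliers(rows))
```

With these made-up rows, only provider E crosses the 3× line, at roughly 6.3 times the median.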
Billing Swing
Year-over-year billing changed by more than 200%, with an absolute change of more than $1M.
Threshold: >200% year-over-year change AND >$1M absolute change
Example: Provider went from $34.6M to $107M in one year (209% increase)
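As a rough sketch of how the two conditions combine, the function below takes two annual totals and applies both thresholds. The function name and signature are hypothetical. Because a decrease can never exceed 100%, only increases can trip the 200% condition.

```python
def billing_swing(prev_year, curr_year, pct_threshold=200.0, abs_threshold=1_000_000):
    """Return (flagged, pct_change) for two annual billing totals in dollars.

    Both conditions must hold: the percent change exceeds `pct_threshold`
    AND the dollar change exceeds `abs_threshold`.
    """
    if prev_year <= 0:
        return False, None  # percent change is undefined without a baseline
    change = curr_year - prev_year
    pct = change / prev_year * 100
    flagged = pct > pct_threshold and change > abs_threshold
    return flagged, round(pct)

# Figures from the example above: $34.6M to $107M.
print(billing_swing(34_600_000, 107_000_000))  # (True, 209)

# A small provider can triple its billing without meeting the $1M condition.
print(billing_swing(100_000, 400_000))         # (False, 300)
```

Requiring both conditions keeps small providers with noisy totals from dominating the list.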
New Entrant
Started billing recently but already receiving millions in Medicaid payments.
Threshold: First appeared 2022+ and already billing >$5M total
Example: Health home LLC appeared Sep 2022 and already billed $239M
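A minimal sketch of the new-entrant rule, assuming monthly rows of (provider, "YYYY-MM", amount paid). The row layout, function name, and sample providers are hypothetical.

```python
def new_entrants(rows, first_year=2022, min_total=5_000_000):
    """Return {provider: (first_month, total_paid)} for providers whose first
    billing month falls in `first_year` or later and whose total exceeds `min_total`.

    rows: iterable of (provider_id, "YYYY-MM", paid); an assumed layout.
    """
    first_seen, totals = {}, {}
    for provider, month, paid in rows:
        # "YYYY-MM" strings sort chronologically, so min() finds the earliest month.
        first_seen[provider] = min(month, first_seen.get(provider, month))
        totals[provider] = totals.get(provider, 0.0) + paid
    return {
        p: (first_seen[p], totals[p])
        for p in totals
        if int(first_seen[p][:4]) >= first_year and totals[p] > min_total
    }

# Hypothetical rows for three providers.
rows = [
    ("OLD", "2019-03", 4_000_000.0),
    ("OLD", "2023-01", 4_000_000.0),    # large, but first seen in 2019
    ("NEW", "2022-09", 3_000_000.0),
    ("NEW", "2023-02", 3_500_000.0),    # first seen 2022, $6.5M total
    ("SMALL", "2023-05", 200_000.0),    # recent, but below the $5M line
]
print(new_entrants(rows))
```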
Rate Outlier
Billing above the 90th percentile across multiple procedure codes simultaneously.
Threshold: Billing above p90 for 2+ procedure codes simultaneously
Example: Above 90th percentile for both T2022 and G0506 simultaneously
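The multi-code rule can be sketched as below. The input shape (a dict of per-code cost-per-claim values), the function name, and the use of the "inclusive" quantile method are assumptions; the source does not say how the p90 cut point is interpolated.

```python
from collections import defaultdict
from statistics import quantiles

def rate_outliers(cost_per_claim, min_codes=2):
    """Return {provider: [codes]} for providers above the 90th percentile
    on at least `min_codes` procedure codes.

    cost_per_claim: {hcpcs_code: {provider_id: cost per claim}}; an assumed layout.
    """
    hits = defaultdict(list)
    for code, costs in cost_per_claim.items():
        if len(costs) < 2:
            continue  # quantiles() needs at least two data points
        # With n=10 the last cut point is the 90th percentile.
        p90 = quantiles(costs.values(), n=10, method="inclusive")[-1]
        for provider, cost in costs.items():
            if cost > p90:
                hits[provider].append(code)
    return {p: sorted(codes) for p, codes in hits.items() if len(codes) >= min_codes}

# Hypothetical peers billing $10 to $100 per claim on every code.
base = {f"P{i}": 10.0 * i for i in range(1, 11)}
data = {
    "T2022": {**base, "X": 500.0},
    "G0506": {**base, "X": 450.0},
    "T2016": {**base, "Y": 400.0},  # Y is high on one code only
}
print(rate_outliers(data))
```

Provider X is above p90 on two codes and is flagged. Provider Y is above p90 on a single code and is not.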
Legacy Fraud Tests (Supplementary)
These 9 additional tests from our earlier analysis remain active. Providers flagged by these tests are included in the combined watchlist.
Unusually High Spending
This provider's total payments are far above what is typical for their specialty.
Threshold: 3+ standard deviations above mean total spending
Example: Residential care agency billing $15,000/day when median is $300/day
High Cost Per Claim
Average payment per claim is much higher than peers billing the same procedures.
Threshold: 3x+ the median cost per claim for same procedure
Example: Chicago EMS at $1,611/ambulance trip vs. $163 median nationally
High Claims Per Patient
Filing an unusually high number of claims per beneficiary compared to peers.
Threshold: Claims-per-beneficiary ratio far above peers
Example: Home health agency filing 26 claims/patient when peers average 4
Spending Spike
Experienced a dramatic increase in billing over a short period.
Threshold: Month-over-month increase of 500%+
Example: $50K/month to $34.6M/month overnight (692x increase)
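A sketch of a month-over-month scan, assuming a provider's monthly totals are available as an ordered list. The function name is hypothetical. An increase of 500% means the new month is at least six times the previous one, so the 692x jump in the example corresponds to a 69,100% increase.

```python
def spending_spikes(monthly_totals, pct_threshold=500.0):
    """Return (month_index, pct_increase) for every month whose total rose by
    at least `pct_threshold` percent over the previous month."""
    spikes = []
    for i in range(1, len(monthly_totals)):
        prev, curr = monthly_totals[i - 1], monthly_totals[i]
        if prev <= 0:
            continue  # no baseline to compare against
        pct = (curr - prev) / prev * 100
        if pct >= pct_threshold:
            spikes.append((i, round(pct)))
    return spikes

# Hypothetical series built around the $50K -> $34.6M jump from the example above.
series = [48_000, 50_000, 34_600_000, 33_900_000]
print(spending_spikes(series))  # [(2, 69100)]
```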
Explosive Growth
Billing increased over 500% year-over-year — far beyond normal growth patterns.
Threshold: >500% year-over-year growth
Example: Provider growing from $370K to $19.2M in one year (5,106%)
Instant Volume
New provider billing over $1M in their first year of Medicaid participation.
Threshold: New provider billing >$1M in first year
Example: New provider billing $63M in their first year
Single-Code
Billing almost exclusively for 1-2 procedure codes despite high total volume.
Threshold: Only 1-2 codes billed at high volume
Example: Provider billing $1B almost entirely for a single code
Consistent Billing
Monthly billing amounts show almost no natural variation (CV < 0.1).
Threshold: Coefficient of variation < 0.1 across months
Example: Monthly billing almost identical for 78 consecutive months
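The coefficient of variation is the standard deviation divided by the mean. The sketch below uses the population standard deviation and requires at least 12 months of history. Both choices, along with the function names, are assumptions for illustration rather than the project's documented settings.

```python
from statistics import mean, pstdev

def coefficient_of_variation(monthly_totals):
    """Standard deviation divided by the mean; None when the mean is zero."""
    avg = mean(monthly_totals)
    if avg == 0:
        return None
    return pstdev(monthly_totals) / avg

def is_consistent(monthly_totals, threshold=0.1, min_months=12):
    """True when billing varies less than `threshold` over enough months."""
    if len(monthly_totals) < min_months:
        return False  # too little history to call the pattern suspicious
    cv = coefficient_of_variation(monthly_totals)
    return cv is not None and cv < threshold

# Hypothetical monthly totals.
steady = [100_000, 100_400, 99_700, 100_100] * 3   # CV = 0.0025
varied = [60_000, 140_000, 95_000, 180_000] * 3    # CV well above 0.1
print(is_consistent(steady), is_consistent(varied))  # True False
```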
Beneficiary Stuffing
Filing over 100 claims per beneficiary — far exceeding any normal treatment pattern.
Threshold: >100 claims per beneficiary
Example: 235 claims per patient — impossible in normal practice
OIG Exclusion List Cross-Reference
We cross-referenced our flagged providers against the HHS OIG's LEIE — 82,715 excluded providers.
Result: Zero matches
None of our flagged providers appear on the OIG exclusion list, which suggests our analysis surfaces suspicious activity that has not already led to an exclusion.
The Minnesota Autism Fraud Context
DOGE referenced their dataset's ability to detect “large-scale autism diagnosis fraud in Minnesota.” This refers to providers billing for autism therapy (EIDBI services) that was never delivered, a scheme involving $100M+ in fraudulent claims and multiple federal indictments in 2023–2024.
- Sudden spike in autism therapy providers
- Unrealistic billing volumes using procedure codes H2019 and T1024
- Beneficiaries enrolled in multiple providers simultaneously
- Pattern matches our explosive growth and beneficiary stuffing tests
Important Caveats
1. Statistical flags are not proof of fraud.
Our tests identify unusual patterns that may warrant investigation. Many flagged providers have legitimate reasons for unusual billing.
2. Government entities may legitimately bill high.
State agencies, county health departments, and cities often serve as fiscal agents for large populations. Their aggregate billing is high by design.
3. Home care management programs are special cases.
Organizations like Public Partnerships LLC and Consumer Direct manage billing on behalf of thousands of individual caregivers in self-directed care programs. High aggregate billing is inherent to their model, though self-directed care is a fraud-prone category.
4. Per diem codes should account for daily rates.
Codes like T2016 (residential habilitation) cover an entire day of care. High per-diem rates may reflect bundled services. Dividing by ~30 days brings some values closer to expected daily rates.
5. Specialty drugs have legitimately high costs.
J-codes for injectable drugs reflect actual drug prices, not provider markup.
6. This data is aggregated, not claims-level.
We see billing totals per provider per procedure per month, not individual claims.
7. Some anomalies reflect state-specific policies.
States set their own reimbursement rates, eligibility rules, and covered services.
8. We don't make medical judgments.
We cannot evaluate whether services were medically necessary, appropriately coded, or properly authorized.
Frequently Asked Questions
Where does this data come from?
All data comes from the HHS Open Data Platform — the Medicaid Provider Spending dataset. It contains aggregated, provider-level claims data covering every billing code from 2018 through 2024, totaling 227 million records. The data was released publicly by HHS on February 13, 2026.
What does 'flagged' mean?
A 'flagged' provider has been identified by one or more of our 13 statistical fraud detection tests or our ML fraud similarity model as having billing patterns that are unusual compared to peers. Statistical flags and ML scores are combined into a unified risk system with tiers: Critical, High, Elevated, and ML Flag. Being flagged is not proof of fraud — it means the billing patterns warrant further investigation.
Is this proof of fraud?
No. Statistical flags indicate unusual patterns, not proof of wrongdoing. Many flagged providers have legitimate reasons for their billing patterns — government agencies serve large populations, home care management programs bill on behalf of thousands of caregivers, and specialty drugs have inherently high costs. Our analysis surfaces patterns that may warrant investigation by qualified auditors.
Why are hospitals and government entities flagged?
Large institutions — hospitals, county health departments, state agencies — often bill at higher aggregate rates due to overhead costs, specialized services, and the large populations they serve. Our statistical tests flag unusual patterns regardless of entity type. Being flagged means the billing pattern is unusual, not that it is fraudulent. Government entities in particular often serve as fiscal agents for entire state programs.
How accurate is the ML model?
Our random forest ML model has an AUC of 0.7762, meaning it ranks a randomly chosen fraud case above a randomly chosen legitimate provider 77.6% of the time. The model was trained on 514 OIG-excluded providers treated as known fraud cases. While useful for identifying patterns similar to known fraud, it is one signal among many, not a definitive fraud detector.
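The ranking interpretation of AUC can be checked directly by counting pairs. The sketch below is a generic illustration with made-up scores, not output from the project's model, and the function name is hypothetical.

```python
def pairwise_auc(fraud_scores, legit_scores):
    """Fraction of (fraud, legitimate) pairs in which the fraud case
    receives the higher score; ties count as half a win."""
    wins = 0.0
    for f in fraud_scores:
        for g in legit_scores:
            if f > g:
                wins += 1.0
            elif f == g:
                wins += 0.5
    return wins / (len(fraud_scores) * len(legit_scores))

# Made-up model scores for three known fraud cases and four legitimate providers.
fraud = [0.9, 0.8, 0.4]
legit = [0.7, 0.5, 0.3, 0.2]
print(round(pairwise_auc(fraud, legit), 3))  # 0.833: 10 of 12 pairs ranked correctly
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a value near 0.78 sits between the two.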
What advanced detection methods do you use?
Beyond the 13 core statistical tests, we apply five advanced techniques: billing velocity analysis (flagging providers filing 50+ claims per working day), Benford’s Law analysis (testing whether claim amounts follow expected leading-digit distributions), CUSUM change point detection (identifying the exact month billing behavior shifted 3x or more), billing pattern similarity (cosine similarity between providers’ HCPCS distributions to find coordinated billing), and HCPCS concentration analysis (Herfindahl index flagging providers billing >$1M on just 1–3 codes). See our full methodology for details.
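Three of these techniques reduce to short formulas, sketched below under stated assumptions: the function names, the dict-based inputs, and the sample amounts are hypothetical, and the code shows the standard textbook definitions rather than the project's implementation. Billing velocity and CUSUM are not covered here.

```python
import math

def benford_expected(digit):
    """Share of amounts expected to start with `digit` (1-9) under Benford's Law."""
    return math.log10(1 + 1 / digit)

def herfindahl(spend_by_code):
    """Sum of squared spending shares; 1.0 means all spending is on one code."""
    total = sum(spend_by_code.values())
    if total == 0:
        return 0.0
    return sum((v / total) ** 2 for v in spend_by_code.values())

def cosine_similarity(a, b):
    """Cosine similarity between two {hcpcs_code: amount} billing mixes."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# About 30.1% of naturally occurring amounts are expected to start with a 1.
print(round(benford_expected(1), 3))                        # 0.301

# Hypothetical provider with 90% of spending on one code.
print(round(herfindahl({"H2019": 900, "T1024": 100}), 2))   # 0.82

# Two hypothetical providers with the same code mix at different scales.
print(cosine_similarity({"H2019": 3, "T1024": 4}, {"H2019": 6, "T1024": 8}))  # 1.0
```

A similarity of 1.0 means two providers split their billing across codes in identical proportions, regardless of total volume.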
Can I download the data?
Yes. The watchlist page includes a CSV export button that downloads all filtered results. For the raw underlying data, visit the HHS Open Data Platform (opendata.hhs.gov) where the original 227M-record Medicaid Provider Spending dataset is publicly available.
How do I report suspected fraud?
If you suspect Medicaid fraud, you can report it to the HHS Office of Inspector General (OIG) at 1-800-HHS-TIPS (1-800-447-8477) or online at oig.hhs.gov. You can also contact your state’s Medicaid Fraud Control Unit (MFCU). Whistleblower protections exist under the False Claims Act for those who report fraud.
How often is this updated?
The underlying HHS data covers 2018–2024. We update our analysis when HHS releases new data. The current analysis was published in February 2026 based on the initial public data release.
Project Timeline
February 13, 2026
HHS Data Release
HHS DOGE open-sources 227 million aggregated Medicaid billing records covering 2018–2024 — the largest Medicaid dataset in department history.
February 14–15, 2026
Analysis & Fraud Detection
Built 13 statistical fraud tests including 4 code-specific smart tests with national benchmarks across 9,578 HCPCS codes. Trained random forest ML model on 514 OIG-excluded providers (AUC: 0.77).
February 16, 2026
Site Launch
OpenMedicaid goes live with 12,800+ static pages covering 1,889 providers, 10,881 procedures, and 49 states. Data journalism articles published.
Built By
OpenMedicaid is a project of TheDataProject.ai, building data-driven transparency tools from public records.
If you're a journalist, researcher, or policymaker interested in this data, get in touch.
Follow Us
Follow TheDataProject for updates on OpenMedicaid and our other data journalism projects.