Explainable Machine Learning Outperforms Linear Regression for Predicting County-Level Lung Cancer Mortality Rates in the United States

In a recent study, explainable machine learning (EML) models outperformed linear regression at predicting county-level lung cancer mortality rates in the United States.

The researchers applied three models: Random Forest, Gradient Boosting Regression, and Linear Regression. Model performance was evaluated using R-squared and root mean squared error (RMSE).
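For readers unfamiliar with the two metrics, the following is a minimal sketch of how R-squared and RMSE are typically computed from observed and predicted values. The function names and the sample numbers are illustrative assumptions, not code or data from the study.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    mse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Made-up illustrative values, NOT data from the study.
observed = [40.0, 50.0, 60.0, 70.0]
predicted = [45.0, 50.0, 55.0, 70.0]

print("R-squared:", r_squared(observed, predicted))
print("RMSE:", rmse(observed, predicted))
```

An R-squared reported as a percentage, such as 41.9%, corresponds to 0.419 on this scale: the share of variance in the observed rates that the model's predictions account for.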

The study's results were striking:
* The Random Forest model achieved an R-squared value of 41.9% and an RMSE of 12.8.
* Linear regression performed worse than the other two models on these metrics.

The results were analyzed using Shapley Additive Explanations (SHAP) to determine each variable's importance and the direction of its effect on the prediction. Smoking rate was identified as the most important predictor, followed by median home value and the percentage of Hispanic ethnic population.
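To illustrate the idea behind SHAP, the sketch below computes exact Shapley values for a tiny hypothetical model by enumerating every feature coalition. It is a simplification under stated assumptions: the model `toy_model`, its coefficients, and the input values are invented, features are deliberately left unnamed, and "absent" features are set to a single baseline value, whereas SHAP libraries average over background data and use faster approximations.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of each feature for one prediction.

    Features outside a coalition take their baseline value
    (a simplifying assumption made for this sketch).
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    """Hypothetical model with one interaction term (not the study's model)."""
    return 10.0 + 2.0 * z[0] - 1.0 * z[1] + 0.5 * z[0] * z[2]


x = [3.0, 4.0, 2.0]         # instance to explain (made-up values)
baseline = [1.0, 1.0, 0.0]  # reference point (made-up values)

phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The magnitude of each value indicates how much that feature moved the prediction away from the baseline, and the sign indicates the direction. The values always sum to the difference between the prediction for `x` and the prediction for `baseline`.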

Furthermore, Getis-Ord (Gi*) hotspot analysis revealed significant clusters of elevated lung cancer mortality in the mid-eastern counties of the United States. Overall, the Random Forest model gave the strongest predictive performance, and the analysis underscored the roles of smoking prevalence, housing values, and the percentage of Hispanic ethnic population.
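As a rough illustration of the hotspot statistic mentioned above, the sketch below computes Getis-Ord Gi* z-scores with binary neighbor weights. The ten "areas" arranged in a line, their values, and the adjacency rule are all assumptions made for the example; a real county-level analysis would use geographic contiguity or distance weights and would normally correct for multiple testing.

```python
import math


def getis_ord_gi_star(values, neighbors):
    """Gi* z-score per unit, using binary weights that include the unit itself."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    z_scores = []
    for i in range(n):
        members = set(neighbors[i]) | {i}  # the "star": unit i counts as its own neighbor
        w_sum = len(members)               # binary weights: sum(w) == sum(w^2)
        local_sum = sum(values[j] for j in members)
        numerator = local_sum - mean * w_sum
        denominator = s * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        z_scores.append(numerator / denominator)
    return z_scores


# Made-up rates for ten hypothetical areas laid out in a line, NOT study data.
rates = [10.0] * 7 + [50.0] * 3
n_units = len(rates)
# Each area neighbors the areas immediately before and after it.
adjacency = [[j for j in (i - 1, i + 1) if 0 <= j < n_units] for i in range(n_units)]

z = getis_ord_gi_star(rates, adjacency)
for i, score in enumerate(z):
    label = "hotspot" if score > 1.96 else ""
    print(i, round(score, 3), label)
```

A large positive z-score means a unit and its neighbors together have values well above the overall mean, which is what marks a cluster as a hotspot rather than an isolated high value.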

These findings offer actionable insights for designing targeted interventions, promoting screening, and addressing health disparities in the regions most affected by lung cancer in the United States.