Background

Obesity, which has a complex etiology, is a major health issue in the United States. Physical activity is usually considered a contributing factor. We therefore conducted a study with two objectives: 1) to predict adult weight from physical activity and basic demographic status using machine learning techniques, and 2) to assess the performance and predictive power of a set of popular machine learning algorithms.

Methods

Data from the National Health and Nutrition Examination Survey (NHANES, 2003 to 2006) were used. Nine popular, state-of-the-art classification algorithms (Naïve Bayes, RBF, KNN, Classification via Regression, Random Subspace, Decision Table, Multi Objective, Random Forest, and J48) were implemented, evaluated, and compared with estimates from a traditional logistic regression model. We analyzed the accuracy, sensitivity, and specificity of each method with and without physical activity data.
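As an illustration of the three evaluation measures named above, the following minimal sketch computes accuracy, sensitivity, and specificity from true and predicted binary labels. The function name `binary_metrics`, the label coding (1 = overweight, 0 = not overweight), and the toy labels are assumptions made for this example; they are not taken from the study's implementation or from NHANES.

```python
from typing import Dict, Sequence


def binary_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Dict[str, float]:
    """Return accuracy, sensitivity, and specificity for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Tally the four cells of the confusion matrix.
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1

    def ratio(num: int, den: int) -> float:
        # Return NaN instead of raising when a class is absent.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),  # all correct / all samples
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
    }


# Hypothetical toy labels (1 = overweight, 0 = not overweight), not NHANES data.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Running the same function on predictions from models trained with and without the physical activity variables would give the kind of side-by-side comparison described in this section.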

Results

Of all methods analyzed, the Random Forest classifier achieved the highest accuracy (70.08%) and sensitivity (72.3%) when using physical activity and basic demographic status to identify overweight samples. The Random Subspace classifier had the highest specificity (57.3%), with an above-average accuracy of 70.01%. Intriguingly, the logistic regression model had a very low specificity of 3.4% despite an accuracy of 69.44%. Additionally, all methods showed similar accuracy, sensitivity, and specificity with or without physical activity data. The highest accuracy using only demographic status (69.34%) was obtained by the Multi Objective classifier.
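A very low specificity can coexist with a reasonable accuracy because accuracy is a prevalence-weighted average of sensitivity and specificity. The sketch below shows the identity with a deliberately extreme, hypothetical case: a classifier that labels every sample as overweight. The 70% prevalence is an assumed round number for illustration only; the source does not report the class balance or the sensitivity of the logistic regression model.

```python
def accuracy_from_rates(
    prevalence: float, sensitivity: float, specificity: float
) -> float:
    """Accuracy as a prevalence-weighted mix of sensitivity and specificity."""
    return prevalence * sensitivity + (1.0 - prevalence) * specificity


# Hypothetical case: every sample is predicted positive, so sensitivity is 1
# and specificity is 0. The 0.70 prevalence is an assumption, not a study value.
always_positive = accuracy_from_rates(
    prevalence=0.70, sensitivity=1.0, specificity=0.0
)
print(always_positive)
```

In this extreme case the accuracy simply equals the prevalence of the positive class, which is why accuracy alone can mask a model that rarely identifies negatives.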

Conclusions

Physical activity was not a necessary predictor of obesity: adding it to basic demographic status did not materially change classifier performance. Machine learning-based methods do not always outperform traditional logistic regression.