Insights & Innovators: Using Machine Learning to Develop Disease Predictability Models

Q&A with Chao Li, PhD

Q: Dr. Li, can you give some background on your study?

A: Recently, our team conducted a study to replicate a previous study using a machine learning approach to predict Hidradenitis Suppurativa (HS) diagnosis. HS is a condition that causes small, painful lumps to form under the skin, usually in areas where your skin rubs together, such as the armpits, groin, buttocks, and breasts.

Q: What data sources did you use and how was the study designed?

A: We used two commercial databases as well as Medicare and Medicaid databases. For the study analytics we used Panalgo’s IHD Data Science module.

We took a three-year baseline period prior to the index date (the first HS or control diagnosis) and required at least two HS diagnoses during the follow-up to confirm that they were HS patients. We also required the patients to have 36 months of continuous enrollment for medical and pharmacy coverage prior to the index date and six months post-index to ensure the integrity of the database. Finally, we ensured there were no other immune conditions, cancer, or cancer-related medications in the pre-index period.
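Editor's sketch: the inclusion criteria above can be expressed as a small eligibility check. This is a minimal illustration in plain Python, not the IHD implementation. The function names, the argument layout, and the choice to count the index diagnosis toward the two required HS diagnoses are all assumptions made for the example.

```python
from datetime import date

# Thresholds taken from the criteria described in the interview.
BASELINE_MONTHS = 36
POST_INDEX_MONTHS = 6
MIN_HS_DIAGNOSES = 2


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months; the day is clamped to 28 so the result is always valid."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, min(d.day, 28))


def is_eligible_case(index_date, enroll_start, enroll_end, hs_dx_dates, has_exclusion):
    """Apply the case-cohort criteria to one patient (hypothetical helper).

    has_exclusion flags other immune conditions, cancer, or cancer-related
    medications in the pre-index period.
    """
    baseline_start = add_months(index_date, -BASELINE_MONTHS)
    post_index_end = add_months(index_date, POST_INDEX_MONTHS)
    continuously_enrolled = enroll_start <= baseline_start and enroll_end >= post_index_end
    # Assumption: diagnoses on or after the index date count toward confirmation.
    confirmed = sum(d >= index_date for d in hs_dx_dates) >= MIN_HS_DIAGNOSES
    return continuously_enrolled and confirmed and not has_exclusion
```

A patient enrolled from January 2017 through December 2020 with an index date of March 15, 2020 and a second HS diagnosis in May 2020 would pass; the same patient enrolled only from January 2018 would fail the 36-month requirement.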

The control cohorts included patients with ≥1 ICD-9/10 diagnosis claim indicating abscess or cellulitis and without an HS diagnosis during the entire study period. We also did random matching, so the control cohort's index date matched the index date of the study group.

Q: After designing the study, how was it conducted?

A: First we constructed the project in the core platform, IHD Analytics. We then divided the machine learning project using the IHD Data Science module. We split the data into two parts – 20% became the testing set and the other 80% became the training set. For the training set we did some data preprocessing such as feature selection and we were then able to construct and validate the machine learning model.
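Editor's sketch: an 80/20 split like the one described above can be illustrated in a few lines of plain Python. The function name, the fixed seed, and the stand-in patient IDs are assumptions for the example; the interview does not describe how the IHD Data Science module performs the split internally.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and hold out a fraction as the testing set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


patients = list(range(1000))  # stand-in patient IDs, not study data
train, test = train_test_split(patients)
print(len(train), len(test))  # 800 200
```

Fixing the seed keeps the split reproducible, which matters when several candidate algorithms are compared on the same testing set.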

We then used the testing set to evaluate the developed model using performance measurements like accuracy, precision, and area under the curve (AUC). Following this evaluation, we were able to easily select the best algorithm to develop the final prediction model.
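Editor's sketch: the three measurements named above can be computed directly from true labels and model outputs. This is a plain-Python illustration with hypothetical function names; it assumes binary labels (1 = case, 0 = control) and computes AUC as the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as half.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred):
    """Share of predicted cases that are true cases."""
    predicted_pos = sum(y_pred)
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    return true_pos / predicted_pos if predicted_pos else 0.0


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of case and control scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and scores for illustration only, not study data.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [int(s >= 0.5) for s in scores]
```

Accuracy and precision depend on the 0.5 threshold chosen here, while AUC summarizes ranking quality across all thresholds, which is one reason it is often used to compare candidate models.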

Q: How did you ensure a balanced training data set?

A: Since we had a sample size of over 17,000 HS patients in our case cohort, we were able to do random sampling from our control cohort to develop a 1:1 ratio of case vs. control to prepare a balanced training data set.
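Editor's sketch: a 1:1 case-to-control ratio can be reached by randomly sampling, without replacement, as many controls as there are cases. The function name, the seed, and the made-up IDs below are assumptions for illustration and do not reflect how the study's platform performs the sampling.

```python
import random


def balance_one_to_one(cases, controls, seed=0):
    """Return (patient, label) pairs with equal numbers of cases (1) and controls (0)."""
    if len(controls) < len(cases):
        raise ValueError("need at least as many controls as cases")
    sampled = random.Random(seed).sample(controls, k=len(cases))
    return [(p, 1) for p in cases] + [(p, 0) for p in sampled]


# Made-up identifiers, not study data.
cases = [f"hs_{i}" for i in range(10)]
controls = [f"ctrl_{i}" for i in range(50)]
balanced = balance_one_to_one(cases, controls)
print(len(balanced))  # 20
```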

Q: Which algorithms and features did you consider?

A: The nice thing about the Data Science module is that we could select from a number of algorithm candidates to see which worked best. We chose a few including Lasso Regression, extreme Gradient Boosting (XGBoost), Random Forest, Neural Networks, Support Vector Machines, and Naïve Bayes.

We looked at several features including demographic characteristics, diagnoses, procedures, drug classes, provider specialties, insurance payor type, and other healthcare resource utilization data. The IHD Data Science module made it easy to compare the case cohort to the control cohort to select the most significantly different features. That gave us the ability to choose the ones we were most interested in before we began the model training.
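Editor's sketch: one common way to find the features that differ most between two cohorts is to rank them by standardized difference. The interview does not say which statistic the IHD Data Science module uses, so the statistic, the function names, and the feature names below are all assumptions chosen for illustration. The example handles binary (0/1) features only.

```python
from math import sqrt


def standardized_difference(p_case, p_control):
    """Absolute standardized difference between two proportions."""
    pooled = sqrt((p_case * (1 - p_case) + p_control * (1 - p_control)) / 2)
    return abs(p_case - p_control) / pooled if pooled else 0.0


def rank_features(case_rows, control_rows):
    """Sort binary features from most to least different between cohorts."""
    scores = {}
    for feature in case_rows[0]:
        p_case = sum(row[feature] for row in case_rows) / len(case_rows)
        p_control = sum(row[feature] for row in control_rows) / len(control_rows)
        scores[feature] = standardized_difference(p_case, p_control)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical feature names and toy rows, not study data.
case_rows = [
    {"derm_visit": 1, "flu_shot": 1},
    {"derm_visit": 1, "flu_shot": 0},
    {"derm_visit": 1, "flu_shot": 1},
    {"derm_visit": 0, "flu_shot": 0},
]
control_rows = [
    {"derm_visit": 0, "flu_shot": 1},
    {"derm_visit": 0, "flu_shot": 0},
    {"derm_visit": 1, "flu_shot": 1},
    {"derm_visit": 0, "flu_shot": 0},
]
ranked = rank_features(case_rows, control_rows)
```

In this toy data the hypothetical `derm_visit` flag is far more common among cases than controls, so it ranks first, while `flu_shot` is equally common in both cohorts and scores zero.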

Q: You mentioned comparing and ultimately selecting the best model for your study needs. How did IHD enable you to do this?

A: IHD provided an easy visualization of the evaluation metrics or feature scaling for each algorithm candidate. We used AUC as the evaluation metric to train 50 models at the same time.

After we finished the model training, we could use AUC to compare all the models we selected for further model development. Based on this comparison, we selected XGBoost as our final prediction model since it outperformed the other models with the highest accuracy, recall, precision, F1 score, and AUC. IHD also provided a plot to help us select the best model by comparing the training AUC to the validation AUC.
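Editor's sketch: comparing training AUC with validation AUC helps flag models that fit the training data well but generalize poorly. The selection rule below (discard models whose training AUC exceeds validation AUC by more than a chosen gap, then take the highest validation AUC) is an assumption for illustration, as are the function name, the 0.05 threshold, and the placeholder numbers. None of these values come from the study.

```python
def select_best_model(auc_by_model, max_gap=0.05):
    """Pick the model with the highest validation AUC among those not overfitting.

    auc_by_model maps a model name to a (training AUC, validation AUC) pair.
    """
    candidates = {
        name: valid
        for name, (train, valid) in auc_by_model.items()
        if train - valid <= max_gap
    }
    if not candidates:
        raise ValueError("every model exceeds the allowed train/validation gap")
    return max(candidates, key=candidates.get)


# Placeholder values for three unnamed models, not results from the study.
aucs = {
    "model_a": (0.99, 0.80),  # large gap: likely overfitting
    "model_b": (0.90, 0.88),
    "model_c": (0.85, 0.84),
}
best = select_best_model(aucs)
print(best)  # model_b
```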

We were also able to test our model easily to make sure it held up. IHD provides an evaluation tool so we could see that the model was high in AUC, average precision, accuracy, recall, and specificity – all important metrics for evaluating model performance.

Q: Did you conduct any validation with external sources?

A: Yes. We compared the Medicaid population with the tested population and obtained similar evaluation metrics. We used precision, recall, and ROC curves to evaluate. We also used populations from another commercial database to test and validate the model performance. Since the validation data sources were outside of IHD, the IHD support team helped us develop steps to conduct the comparison and shared resources from their Knowledge Center to guide us. This allowed us to generate the results using the same features as our testing set.

Q: How did the results compare to the previous study?

A: With the IHD readout we were able to easily compare our results to those of the previous study, which matched up well in precision, recall, and accuracy.

Q: What are your main takeaways from the project? Are there any lessons learned?

A: We learned that machine learning methods could be used to predict the probability of an HS diagnosis and distinguish it from cutaneous abscess and cellulitis (the most common mimics of HS), and that XGBoost had the best prediction performance with the lowest number of features.

IHD makes machine learning accessible and makes it easy to conduct studies based on the most widely used models like XGBoost, random forest, and decision tree. There was almost no need for coding, and we received real-time support from the IHD experts as questions arose.

The platform is also really efficient and fast. It took us only a week to get familiar with all the functions of the IHD Data Science module. After the core project was developed, the machine learning project was straightforward. The preliminary results were generated in a month and the entire project was completed in two months, which is incredibly fast for this type of study.


Chao Li is currently a data scientist on the HEOR RWE Analytics team at AbbVie who supports the dermatology therapeutic areas. Prior to joining AbbVie, Chao obtained his PhD in Pharmaceutical Sciences with an emphasis on Health Outcomes Research and Policy from Auburn University. During his career journey, Chao has conducted multiple projects to apply pharmacoepidemiology and machine learning methods for exploring various areas such as drug utilization, post-marketing drug safety and population prediction modeling.

Dr. Li presented his recent research at our annual IHD User Conference in April, and we recently spoke with him about the ways in which machine learning can be used to help predict disease, replicate prior studies, and conduct external validation for machine learning model performance.