Healthcare Provider Fraud Detection And Analysis — Machine learning

A supplementary write-up on solving a binary classification problem based on Kaggle's Healthcare Provider Fraud Detection dataset, done as part of a self-case study.

Rohan kumar soni
13 min read · Sep 28, 2020

Healthcare fraud is an organised crime in which providers, physicians, and beneficiaries act together to submit fraudulent claims. Provider fraud is one of the biggest problems facing Medicare. It adversely affects insurance companies not only in terms of money but also in the handling of genuine claim settlements. A further challenge is that the cost of determining or predicting potential fraud in a submitted claim is very high: if it is done incorrectly it irritates legitimate customers and delays claims adjudication.

Below are some key points that may help to raise suspicion on the claim submitted for adjudication:

  • Charging excessive prices for a treatment or medicine in a health centre;
  • Unusually high number of invoices for a particular insuree in a short time frame;
  • Claims where the amount claimed is greater than the invoice amount the insurance company will pay;
  • Billing for services that were not provided;
  • Duplicate submission of a claim for the same service;
  • Misrepresenting the service provided;
  • Charging for a more complex or expensive service than was actually provided;
  • Billing for a covered service when the service actually provided was not covered.

In this blog we will try to come up with a solution to predict potentially fraudulent providers based on the claims filed by them. We will also discover which variables are most important for detecting the behaviour of potentially fraudulent providers, so that such claims can be flagged and investigated in depth.

https://cdn-gcp.marutitech.com/wp-media/2017/06/Fraud-detection-process.jpg

Table of Contents

  1. Business Problem
  2. Cost function
  3. Model performance Evaluation Metric
  4. Application of Machine learning Algorithms to our problem
  5. Source of Data
  6. Exploratory Data Analysis
  7. Data preparation
  8. Machine Learning Models
  9. Future work
  10. LinkedIn and GitHub Repository
  11. Reference

1. Business Problem

  1. Predict the probability that a provider is potentially fraudulent and flag them accordingly.
  2. Discover important variables that might be helpful in flagging providers as potentially fraudulent.

Business constraints:

  • Cost of misclassification is very high.
  • No strict latency requirements.
  • Model interpretability is very important.

2. Cost function

Log loss : Log loss is a popular cost function used in machine learning for optimising classification algorithms. It can be applied directly to binary classification problems and extended to multi-class problems. In simple terms, log loss is the average negative log-likelihood of the true labels (represented as 1 or 0) under the predicted class probabilities; it penalises confident predictions that turn out to be wrong far more heavily than uncertain ones.

code snippet to calculate log loss
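The original snippet is not reproduced here; a minimal sketch using scikit-learn's log_loss could look like this:

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predicted probabilities for the positive class
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.10, 0.85, 0.60, 0.30, 0.95])

# log_loss returns the average negative log-likelihood of the true labels
print("Log loss:", log_loss(y_true, y_prob))
```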

3. Model performance Evaluation Metric

  1. Binary Confusion Matrix : A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known.
code snippet to generate confusion matrix
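A minimal sketch of generating and plotting a confusion matrix with scikit-learn and seaborn (the labels below are made up):

```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions, purely to illustrate the call
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred)        # rows = true class, columns = predicted class
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.show()
```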

2. Precision, recall and F1 score

  • Precision : Of all the points the model predicted to be positive, what fraction is actually positive?
  • Recall : Of all the points that actually belong to class 1, how many of them did the model detect as class 1?
  • F1 Score : The harmonic mean of precision and recall, F1 = 2 · (Precision · Recall) / (Precision + Recall).
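A quick illustration of the three metrics with scikit-learn (again on made-up labels):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up ground truth and predictions, purely to illustrate the calls
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
```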

3. AUC Score : AUC stands for “Area under the ROC Curve.” That is, AUC measures the entire two-dimensional area underneath the ROC curve.

AUC (Area under the ROC Curve)
  • AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values.
  • AUC is classification-threshold-invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.
code snippet to calculate AUC Score
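A minimal sketch with scikit-learn, again on made-up data:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and predicted probabilities for the positive class
y_true = [0, 1, 1, 0, 1, 0, 1]
y_prob = [0.20, 0.90, 0.65, 0.40, 0.80, 0.10, 0.55]

print("AUC:", roc_auc_score(y_true, y_prob))

fpr, tpr, _ = roc_curve(y_true, y_prob)      # points on the ROC curve
plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
```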

4. Application of Machine learning Algorithms to our problem

This is a binary classification problem: given a set of features, we need to predict whether the corresponding provider should be flagged as potentially fraudulent or not.

5. Source of Data

All the data for this case study is available on Kaggle. Please visit the following link to download the dataset :

Let us go through some of the important columns in the dataset :

  1. Inpatient and outpatient data:
  • ClaimID : a unique id for each claim submitted.
  • Bene_ID : unique id of the beneficiary registered for the insurance scheme.
  • AttendingPhysician : id of the physician who attended the patient.
  • OperatingPhysician : id of the physician who operated on the patient.
  • ClmDiagnosisCode : diagnosis codes recorded by providers for the patients.
  • ClmProcedureCode : codes of the procedures the patients underwent.
  • Provider : unique id of the healthcare provider.
  • InscClaimAmtReimbursed : Total amount paid to the claimant after settlement.
  • ClaimStartDt / ClaimEndDt : the dates on which the claim was submitted and settled, respectively.
  • AdmissionDt / DischargeDt : the dates on which the patient was admitted to and discharged from hospital. Applicable only to inpatient data.

2. Beneficiaries data :

  • Bene_ID : contains Unique Id of the beneficiaries registered for the insurance scheme.
  • DOB : Date of birth of the beneficiaries registered for the insurance scheme.
  • Gender : Gender of the beneficiaries registered for the insurance scheme.
  • State/County : state and county codes of the registered members.
  • Premedical conditions : columns such as RenalDiseaseIndicator, ChronicCond_Depression, ChronicCond_Diabetes etc. indicate whether the member has any prior medical condition.

3. Target data:

  • Provider : a unique id for each healthcare provider.
  • Target : This column has two values, Yes and No, indicating whether the corresponding provider is flagged as potential fraud.

6. Exploratory Data Analysis

  1. Inpatient /Outpatient data — Procedure code
Most recommended procedures for inpatient and outpatient data.

Conclusion:

  • Around 6.6% of inpatients have undergone procedure code 4019.
  • Procedure codes 4019, 9904, 2724, 8154 and 66 are the top 5 procedure codes for inpatient data.
  • Around 7.5% of outpatients have undergone procedure code 9904.
  • Procedure codes 9904, 3722, 4516, 2724 and 66 are the top 5 procedure codes for outpatient data.

2. Inpatient /Outpatient data — Diagnosis code

Most recommended Diagnosis for Inpatient and outpatient data.

Conclusion:

  • Around 4.5% of inpatients have diagnosis code 4019.
  • Diagnosis codes 4019, 2724, 25000, 41401 and 4280 are the top 5 diagnosis codes for inpatient data.
  • Around 4.8% of outpatients have diagnosis code 4019.
  • Diagnosis codes 4019, 25000, 2724, V5869 and 401 are the top 5 diagnosis codes for outpatient data.

3. Inpatient /Outpatient data — Attending Physician

Inpatient /Outpatient data — Attending Physician

Conclusion:

  • Most inpatients are attended by physician PHY422134.
  • Around 1% of the inpatients are attended by physician PHY422134.
  • Most outpatients are attended by physician PHY330576.
  • Around 0.48% of the outpatients are attended by physician PHY330576.

4. Inpatient /Outpatient data — Operating Physician

Inpatient /Outpatient data — Operating Physician

Conclusion:

  • For inpatient data, physician PHY429430 performs the most operations.
  • Around 1% of the inpatients are operated on by physician PHY429430.
  • For outpatient data, physician PHY330576 performs the most operations.
  • Around 0.48% of the outpatients are operated on by physician PHY330576.

5. Inpatient /Outpatient data —ClaimAmtReimbursed

Inpatient /Outpatient data — ClaimAmtReimbursed

Conclusion:

  • The distribution of the amount paid as claim reimbursement looks log-normal.
  • Almost all reimbursed amounts lie between 0 and 20000.
  • In very few cases is more than 20000 paid as claim reimbursement.
  • Outpatient : the 99.9th percentile value is 3500, and the plot indicates this column has some outliers.

6. Calculating Money lost in Fraud — Inpatient/outpatient data

Money lost in Fraud — Inpatient/outpatient data

Conclusion:

  • The total money lost to fraudulent encounters is around 290 million per year. This is a huge loss!

7. Inpatient /Outpatient data — State

Inpatient /Outpatient data — State

Conclusion:

  • States coded 5, 10, 33 and 45 have the most fraudulent encounters for inpatient data.
  • States coded 5, 33, 10 and 39 have the most fraudulent encounters for outpatient data.

8. Inpatient /Outpatient data — County

Inpatient /Outpatient data — County

Conclusion:

  • Counties coded 10, 200, 160 and 60 have the most fraudulent encounters for inpatient data.
  • Counties coded 200, 470, 400 and 590 have the most fraudulent encounters for outpatient data.

7. Data preparation

We have four csv files for this case study, so in order to arrive at a solution we need to merge them together. The inpatient and outpatient files have almost identical columns, so we can merge them on those common columns; we can then merge the beneficiary data on BeneID (the unique id of each enrolled beneficiary). Lastly we merge the target data file on the Provider column. We now have a compiled dataset to work with. Great!
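A minimal merge sketch in pandas; the file names below are assumptions based on the Kaggle dataset, with BeneID and Provider taken as the join keys:

```python
import pandas as pd

# File names are assumptions based on the Kaggle dataset description
inpatient   = pd.read_csv("Train_Inpatientdata.csv")
outpatient  = pd.read_csv("Train_Outpatientdata.csv")
beneficiary = pd.read_csv("Train_Beneficiarydata.csv")
target      = pd.read_csv("Train.csv")

# Combine inpatient and outpatient claims on their shared columns
common_cols = [c for c in outpatient.columns if c in inpatient.columns]
claims = pd.merge(outpatient, inpatient, on=common_cols, how="outer")

# Attach beneficiary details and the provider-level fraud label
dataset = (claims.merge(beneficiary, on="BeneID", how="inner")
                 .merge(target, on="Provider", how="inner"))
```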

Next we need to take care of the null values and do some feature engineering.

Most of the DOD (date of death) column in the dataset is empty, which means those patients are still alive. Hence we add a column ‘Is_Dead’ to indicate whether the patient is alive or not and remove the DOD column.
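A two-line sketch, assuming the merged dataframe is called dataset and the date-of-death column is named DOD:

```python
# 1 if a date of death is recorded, 0 otherwise; then drop the raw DOD column
dataset["Is_Dead"] = dataset["DOD"].notna().astype(int)
dataset = dataset.drop(columns=["DOD"])
```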

An important thing to note here: if we carefully explore every aspect of the dataset, we can come up with additional useful features. Adding those feature vectors might prove very useful in solving the task at hand. Some of them are listed below :

  • Age of the patients at the time they requested claim settlement.
  • Number of days a patient was admitted to hospital (applicable only to inpatient data).
  • Top 5 suspicious procedure and diagnosis codes.
  • Number of unique physicians for each patient.
  • Number of different physicians who attend a patient.
  • Number of patients who were attended by only one physician.
  • Number of times one physician had multiple roles while attending a patient.
  • Total insurance claim amount paid to a beneficiary.
  • Mean amount paid to the beneficiary per claim request.
  1. Let’s explore what the distribution of patients’ age looks like; a sketch for deriving the age column is shown after this list.
  • We don’t see any difference in the age distribution that would let us flag potential fraud.
  • But we see an increasing trend in potential fraud cases for patients aged over 65; most of the patients who applied for claims also fall in this age range.
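A small sketch for deriving the age feature, assuming DOB and ClaimStartDt columns in the merged dataframe:

```python
import pandas as pd

# Column names are assumptions; age is taken at the claim start date
dataset["DOB"] = pd.to_datetime(dataset["DOB"])
dataset["ClaimStartDt"] = pd.to_datetime(dataset["ClaimStartDt"])
dataset["Age"] = (dataset["ClaimStartDt"] - dataset["DOB"]).dt.days // 365
```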

2. Add a column for the number of days a patient was admitted to hospital.

For all inpatient records we have the date of admission and the date of discharge, so we can build a feature holding the number of days a patient spent in hospital.

We then drop the columns for date of admission and date of discharge.
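A sketch of this step, assuming the AdmissionDt and DischargeDt columns described earlier:

```python
import pandas as pd

# Length of stay in days for inpatient records (column names assumed)
dataset["AdmissionDt"] = pd.to_datetime(dataset["AdmissionDt"])
dataset["DischargeDt"] = pd.to_datetime(dataset["DischargeDt"])
dataset["AdmitDays"] = (dataset["DischargeDt"] - dataset["AdmissionDt"]).dt.days + 1

# The raw date columns are no longer needed
dataset = dataset.drop(columns=["AdmissionDt", "DischargeDt"])
```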

3. Add Top 5 suspicious procedure and diagnosis codes

From EDA we observed that there are some procedure and diagnosis codes that fraudulent providers tend to recommend most often. Hence we take the top 5 such procedure and diagnosis codes as feature vectors.

Top 5 suspicious procedure and diagnosis code

Below is the code snippet to use these suspicious codes as feature vector —
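The original snippet is not reproduced here; a sketch of the idea follows, using the codes from the EDA above and assuming the claim files carry numbered ClmProcedureCode_* / ClmDiagnosisCode_* columns:

```python
# Codes taken from the EDA charts above; exact dtype handling may differ in practice
suspicious_proc = ["4019", "9904", "2724", "8154", "66"]
suspicious_diag = ["4019", "2724", "25000", "41401", "4280"]

proc_cols = [c for c in dataset.columns if c.startswith("ClmProcedureCode")]
diag_cols = [c for c in dataset.columns if c.startswith("ClmDiagnosisCode")]

# One binary indicator column per suspicious code
for code in suspicious_proc:
    dataset["Proc_" + code] = dataset[proc_cols].astype(str).eq(code).any(axis=1).astype(int)
for code in suspicious_diag:
    dataset["Diag_" + code] = dataset[diag_cols].astype(str).eq(code).any(axis=1).astype(int)
```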

4. Number of unique physicians for each patient

5. Number of different physicians who attend a patient

6. Number of patients who were attended by only one physician

7. Number of times one physician had multiple roles while attending a patient

8. Total/Mean insurance claim amount paid to a beneficiary

Another observation from the dataset is that we have multiple claim settlement requests for a single beneficiary. Hence we include feature vectors holding the total insurance claim amount per beneficiary and the mean amount paid per claim request.
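A groupby sketch for these two features, assuming BeneID and InscClaimAmtReimbursed as column names:

```python
# Total and mean reimbursed claim amount per beneficiary (column names assumed)
per_bene = (dataset.groupby("BeneID")["InscClaimAmtReimbursed"]
                   .agg(TotalClaimAmt="sum", MeanClaimAmt="mean")
                   .reset_index())
dataset = dataset.merge(per_bene, on="BeneID", how="left")
```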

Let us also plot a bar graph to observe the trend in fraud cases in relation to the categorical column — Race.

Fraud cases vs Race

Conclusion:

  • It is notable that most fraudulent cases involve patients belonging to a particular race, the one labelled as 1.

Finally, plot a feature correlation heatmap to understand the impact of each feature on the target column.

The heatmap above indicates that there are some features that do not add much value in determining the target, so we may need to remove those columns.

At the end, our dataset looks like this :

Now it’s time to normalise the numerical columns. The code snippet below can be very handy.
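A handy sketch using scikit-learn’s StandardScaler; the target column name (‘Target’) is an assumption, and in practice the scaler should be fit on the training split only:

```python
from sklearn.preprocessing import StandardScaler

# Standardise numeric feature columns (fit on the training split only to avoid leakage)
feature_cols = dataset.select_dtypes(include="number").columns.drop("Target", errors="ignore")
scaler = StandardScaler()
dataset[feature_cols] = scaler.fit_transform(dataset[feature_cols])
```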

8. Machine Learning Models

We have now reached the part where we are well equipped with the dataset to solve our problem. We will experiment with different approaches to develop a first-cut model and then choose the best-performing model to reach our objective.

  • Approach 1 : Train different classification models using all the feature vectors.
  • Approach 2 : Retrain those classification models using only the important features.
  • Approach 3 : Implement a custom ensemble model.

But before we move ahead we need to build some utility functions to help us analyse and compare model performance.

Below is the code snippet to do so —
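A sketch of one such utility, assuming a fitted scikit-learn style classifier with predict and predict_proba:

```python
from sklearn.metrics import (confusion_matrix, log_loss, precision_score,
                             recall_score, f1_score, roc_auc_score)

def performance_summary(model, X, y_true):
    """Print the metrics used throughout this study for a fitted classifier."""
    y_prob = model.predict_proba(X)[:, 1]
    y_pred = model.predict(X)
    print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
    print("Log loss :", log_loss(y_true, y_prob))
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall   :", recall_score(y_true, y_pred))
    print("F1 score :", f1_score(y_true, y_pred))
    print("AUC      :", roc_auc_score(y_true, y_prob))
```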

Approach 1 : Train classification models using all the feature vectors.

We split our dataset into train and test sets. The train set is used for model training and the reserved test set for model evaluation. We use 33% of the data for testing and the remaining 67% for training. Note that we use stratified sampling so that both splits preserve the class proportions.
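A sketch of the stratified split, assuming X holds the feature columns and y the binary fraud label:

```python
from sklearn.model_selection import train_test_split

# 67/33 stratified split; random_state is arbitrary
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=42)
```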

For each classification model we train, it is very important to do hyperparameter tuning. This can be done easily using GridSearchCV from the scikit-learn library.
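For example, a tuning sketch for the XGBoost model; the grid itself is illustrative, not the one used in the study:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {"n_estimators": [100, 300],
              "max_depth": [3, 5, 7],
              "learning_rate": [0.05, 0.1]}

grid = GridSearchCV(XGBClassifier(), param_grid, scoring="roc_auc", cv=3, n_jobs=-1)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
```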

As our first-cut model we trained a linear classifier, logistic regression, and then extended the experiment to XGBoost. The graph below compares their performance.

Model performance Summary

XGBoost clearly has the best performance score among them.

Approach 2: Retrain classification models using important feature only.

One of the major advantages of tree-based algorithms is that they use an information-gain criterion for splitting the data into branches: at each split the column with the maximum information gain is chosen, and such columns are marked as important features. The scikit-learn style implementations of these algorithms return the feature importances in just one line of code.
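A sketch of pulling those importances from the fitted XGBoost model; the cut-off of 20 features is an arbitrary illustration:

```python
import numpy as np

importances = grid.best_estimator_.feature_importances_
top_idx = np.argsort(importances)[::-1][:20]      # keep the 20 highest-scoring features
important_features = X_train.columns[top_idx]
print(important_features.tolist())
```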

Important features returned by the XGBoost model

Hence we now retrain all the above models using only these features. The graph below compares their performance.

Model performance Summary using important features only.

In the second approach XGBoost again performs very well. But there’s a catch! The XGBoost performance score in our first approach is better than in the second. Hence we will keep all features for the final model.

Approach 3: Implement custom ensemble model.

In our final approach, we now implement the custom ensemble model.

Steps to be followed :

  • Step 1 : Split the dataset into two parts, train and test (80:20).
  • Step 2 : Split the 80% training data further into two halves, D1 and D2.
  • Step 3 : From D1, sample with replacement to create k datasets d1, d2, d3, …, dk and train one base model on each. To further randomise these k models we also use column sampling.
  • Step 4 : Pass the data points in D2 through each of these k models and record the predictions. At the end, each data point in D2 has k predictions.
  • Step 5 : Use these k predictions to build a new dataset; since we already have the target values for D2, we train a meta-model on these k predictions.
  • Step 6 : For model evaluation we use the 20% of the data kept aside as the test set. We pass the test set through each of the base models to get k predictions, build a new dataset from them, and feed it to the meta-model to obtain the final prediction. Using this final prediction and the test-set targets, we compute the model’s performance score. A code sketch of this scheme is shown right after this list.
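Below is a minimal sketch of this stacking scheme, assuming pandas inputs X and y. The decision-tree base model, k = 70 and the logistic-regression meta-model match the choices described next; max_depth and the column-sampling fraction are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

def train_custom_ensemble(X, y, k=70, col_frac=0.7, seed=42):
    """Sketch of the stacking scheme above; hyperparameters are assumptions."""
    rng = np.random.RandomState(seed)
    # Step 2: split the training data into D1 (base models) and D2 (meta-model)
    X_d1, X_d2, y_d1, y_d2 = train_test_split(X, y, test_size=0.5,
                                              stratify=y, random_state=seed)
    base_models, col_subsets = [], []
    for _ in range(k):
        # Step 3: row sampling with replacement plus column sampling
        rows = rng.choice(len(X_d1), size=len(X_d1), replace=True)
        cols = rng.choice(X.shape[1], size=int(col_frac * X.shape[1]), replace=False)
        tree = DecisionTreeClassifier(max_depth=5, random_state=seed)
        tree.fit(X_d1.iloc[rows, cols], y_d1.iloc[rows])
        base_models.append(tree)
        col_subsets.append(cols)
    # Steps 4-5: base-model predictions on D2 become the meta-model's training data
    meta_X = np.column_stack([m.predict_proba(X_d2.iloc[:, c])[:, 1]
                              for m, c in zip(base_models, col_subsets)])
    meta_model = LogisticRegression(max_iter=1000).fit(meta_X, y_d2)
    return base_models, col_subsets, meta_model
```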

In this case study we use a decision tree as our base model and select k = 70 after tuning the number of base models.

For the meta-model we experiment with different classification models.

From the graph above, it is clear that all the candidate meta-models have very similar performance scores. Hence we go with the simple linear classifier, logistic regression, as our meta-model.

Performance Summary for our final Model

Below is a code snippet for the complete pipeline that takes a raw csv file as input and returns the predicted values.
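The original snippet is not reproduced here; a condensed sketch follows, where prepare_features() is a hypothetical helper standing in for the merging, cleaning, feature-engineering and scaling steps described above:

```python
import numpy as np
import pandas as pd

def final_predict(raw_csv_path, base_models, col_subsets, meta_model):
    """Sketch of an end-to-end scoring pipeline; prepare_features() is hypothetical."""
    raw = pd.read_csv(raw_csv_path)
    X = prepare_features(raw)                      # merge, clean, engineer, scale
    meta_X = np.column_stack([m.predict_proba(X.iloc[:, c])[:, 1]
                              for m, c in zip(base_models, col_subsets)])
    return meta_model.predict(meta_X)
```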

Output :

Expected output Example

Below is a code snippet for the complete pipeline that takes test data as input and returns a complete model performance summary.
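Again only a condensed sketch, reusing the final_predict() sketch from above on a labelled test file:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

def evaluate_pipeline(test_csv_path, y_true, base_models, col_subsets, meta_model):
    """Sketch: score a labelled test file and report the metrics used in this study."""
    y_pred = final_predict(test_csv_path, base_models, col_subsets, meta_model)
    print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall   :", recall_score(y_true, y_pred))
    print("F1 score :", f1_score(y_true, y_pred))
```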

Output :

9. Future work

We can extend this work one step further by employing a deep multilayer perceptron (a deep learning technique), with ReLU activations in the hidden layers and a sigmoid/softmax activation in the final layer.

10. LinkedIn and GitHub Repository

GitHub : https://github.com/rohansoni634

LinkedIn : https://bit.ly/3aSDxPn

11. Reference

https://journalofbigdata.springeropen.com/articles/10.1186/s40537-018-0138-3#Sec13

https://www.kaggle.com/roshankhatri03/kernel83ef294a68

https://www.sciencedirect.com/science/article/pii/S1877042812036099/pdf?md5=41d8dab8c2c4c83ea6d975f4fad31f00&pid=1-s2.0-S1877042812036099-main.pdf

https://www.aaai.org/ocs/index.php/FLAIRS/FLAIRS18/paper/download/17617/16814
