Mod 3 Project: Binomal Classification
Using binomial classification to predict COVID-19 infection on a large dataset (>618K samples) with extreme imbalance and minority class (.13% of samples) as target.
The final iteration is a manually tuned random forsest classifier with >95% accuracy and >64% recall that uses biological, behavioral and environmental data collected to predict those who would test postive for COVID-19.
Main Files:
-
student.ipynb - Code, details and visualizations
-
Covid19Preso.pptx - A non-technical presentation with findings
-
Blog post URL: https://andiosika.github.io/imbalanced_data
Project Links Within Main student.ipynb File:
| Link | Description |
|---|---|
| Background | Details around the subject, datasource and objective |
| Features and Descriptions | Details on each feature collected in the dataset |
| Preprocessing | Steps taken to prepare data for modeling and evaluation |
| Main Dataset | The dataset in it's final form used for the predictive modeling results described in the Conclusion section |
| Modeling | Various iterations of predictive classification modeling including Decision Trees, Random Forest and XGBoost |
| Best Model | Random Forest Classification Model including Visualizations Confusion Matrix, ROC Curve, Feature Importance by Rank, Correlations |
| Conclusion | Summation of outcomes from modeling |
Background:
Coronavirus disease (COVID-19) is an infectious disease. It was discovered in late 2019 and early 2020 and originated from Wuhan, China. It escalated into a global pandemic.
According to the World Health Organization, most people infected with the COVID-19 virus will experience mild to moderate respiratory illness and recover without requiring special treatment. Older people, and those with underlying medical problems like cardiovascular disease, diabetes, chronic respiratory disease, and cancer are more likely to develop serious illness.
The best way to prevent and slow down transmission is be well informed about the COVID-19 virus, the disease it causes and how it spreads. In response, much data has been collected in various ways to further inform ways to slow the spread.
The dataset used in this evaluation was created by a project created by a UK based platform-solutions company called Nexoid. At the start of the pandemic, Nexoid noted that there was a lack of large datasets required to predict the spread and mortality rates related to COVID-19. They took it upon themselves to create and share this dataset as an effort to better understand these factors. It is a not-for-profit project with the goal of providing researchers and governments the data needed to help understand and fight COVID-19. It is a sample provided by self-reporting of over 618,000 individuals and collects biological, behavioral, and environmental factors as well as their COVID-19 status.
The data is collected here:
https://www.covid19survivalcalculator.com/ . In exchange for the data, a risk of infection and mortality are returned to the user based on Nexoid's model which is not publicly sharded, yet recorded in this dataset post-hoc. These values are reflected in the columns risk_infection and risk_mortality.
The questionaire used to collect data has since undergone several versions and several features collected during this sample are no longer being tracked. Data for this observation was collected between March 27 - April 10 of 2020, and only a very small rate (.13%) of respondents reported testing postive for COVID-19. It should be noted that at this time there was a shortage of tests available in the United States and latency in recieving results was up to two weeks.
The intention of this classification project is to identify primary contributing factors for contracting COVID-19.
##Importing dataset
import pandas as pd
df = pd.read_csv("master_dataset4.csv")
pd.set_option('display.max_columns', 0)
df.head().dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
Features and Descriptions:
There are 43 features on which data was collected around biometetrics, behavior and enviromnent.
Details are below:
| Feature | Description |
|---|---|
| survey_date | The date the survey was submitted |
| region | |
| country | The country collected from IP address long, lat |
| ip_latitude | ip latitude of device at time of survey |
| ip_longitude | ip longitude of device at time of survey |
| ip_accuracy | -n/a |
| sex | Self reported sex |
| age | Self reported age based on birthdate |
| height | Height in cm |
| weight | Weight in kg |
| bmi | Body Mass Index as calculated from self-reported height and weight |
| blood_type | Blood type |
| smoking | reported smoking/vapeing habits (never, do, 1-5x, 6-20x, 20+, quit<5yrs, quit>5yrs, quit>10yrs |
| alcohol | reported days of alcohol consuption in last 14 days |
| cannabis | reported days of cannabis consumpiton in last 28 days |
| amphetamines | reported days of amphetamine consumpiton in last 28 days |
| cocaine | reported days of cocaine consumpiton in last 28 days |
| lsd | reported days of lsd consumpiton in last 28 days |
| mdma | reported days of mdma(ecstacy) consumpiton in last 28 days |
| contacts_count | reported contacts in the last week (1-20 and 20+) |
| house_count | how many people live in the subjects dwelling |
| text_working | work/school travel behaviors (0-5 never did, always did, have stopped, critical only, still do) |
| rate_government_action | scale of attitude that government is taking covid-19 seriously (disagree, neutral, agree) |
| rate_reducing_risk_single | scale of self-assesment to reduce risk(social distancing, hand washing) (disagree, neutral, agree) |
| rate_reducing_risk_house | scale of assessesed co-habitators risk reduction (social distancing, hand washing)(disagree, neutral, agree) |
| rate_reducing_mask | scale of how often a mask is worn outside dwelling 1-5 rarely, sometimes, usually) |
| covid19_positive | A binomial value o=no, 1=yes to the question "Do you have?" |
| covid19_symptoms | A binomial value o=no, 1=yes to the question "Do you have?" |
| covid19_contact | A binomial value 0=no, 1=yes to the question "Have you been in contact with someone who has tested positive?" |
| asthma | A binomial value 0=no, 1=yes to the question "Do you have?" |
| kidney_disease | A binomial value 0=no, 1=yes to the question "Do you have?" |
| compromised_immune | A binomial value 0=no, 1=yes to the question "Do you have?" |
| heart_disease | A binomial value 0=no, 1=yes to the question "Do you have?" |
| lung_disease | A binomial value 0=no, 1=yes to the question "Do you have?" |
| diabetes | A binomial value 0=no, 1=yes to the question "Do you have?" |
| hiv_positive | A binomial value 0=no, 1=yes to the question "Do you have?" |
| hypertension | A binomial value 0=no, 1=yes to the question "Do you have?" |
| other_chronic | A binomial value 0=no, 1=yes to the question "Do you have?" |
| prescription_medication | Reported prescription medications |
| opinion_infection | No information is given about this feature, no longer collecting data on this, it is theorized that it had to do with if the subject believed they had the infection. |
| opinion_mortality | No information is given about this feature, no longer collecting data on this, it is theorized that it had to do with if the subject believed they could die from the infection. |
| risk_infection | calc'd risk for infection (based on their models) |
| risk_mortality | calc'd risk for mortality (based on their models) |
Inspecting the dataset:
Software Package Installs:
# Package Installs
import matplotlib.pyplot as plt
import seaborn as sns
from pandas_profiling import ProfileReport
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier
import functions as fn
import importlib
from sklearn.metrics import accuracy_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSamplerC:\Users\aosika\AppData\Local\Continuum\anaconda3\envs\learn-env\lib\site-packages\sklearn\externals\six.py:31: FutureWarning: The module is deprecated in version 0.21 and will be removed in version 0.23 since we've dropped support for Python 2.7. Please rely on the official version of six (https://pypi.org/project/six/).
"(https://pypi.org/project/six/).", FutureWarning)
2020-05-07 09:49:31.767234-07:00
[i] Timer started at05/07/20 - 09:49 AM
[i] Timer ended at 05/07/20 - 09:49 AM
- Total time = 0:00:00
C:\Users\aosika\AppData\Local\Continuum\anaconda3\envs\learn-env\lib\site-packages\sklearn\utils\deprecation.py:144: FutureWarning: The sklearn.neighbors.base module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.neighbors. Anything that cannot be imported from sklearn.neighbors is now part of the private API.
warnings.warn(message, FutureWarning)
This set of data contains just over 619K entries and has 43 columns of both numeric and categorical data. Because of the size of this dataset, pandas profiling was used to inform potential considerations for dataset selection and develop a strategy to manage preprocessing of a set this size.
Data Background Observation:
The data was provided by subjects from 173 countries. It is noted that 87% of the data comes from the US. The next top provider of data is Canada ~5% , followed by the United Kingdom ~2.3%:
countriesdf.head(5).plot(kind='bar', color='r')
plt.title('US Represents 87% of Data:')Text(0.5, 1.0, 'US Represents 87% of Data:')
df['covid19_positive'].value_counts()0 618134
1 893
Name: covid19_positive, dtype: int64
Target Class is highly imbalanced:
Out of the nearly 618,134 samples, 893 tested positive for COVID-19, or .0014%
This is an approximate ratio of 1:1000
Inspecting correlations:
df_cor = pd.DataFrame(df.corr()['covid19_positive'].sort_values(ascending=False))
df_cor.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
| covid19_positive | |
|---|---|
| covid19_positive | 1.000000 |
| risk_infection | 0.198632 |
| covid19_symptoms | 0.089861 |
| opinion_infection | 0.054837 |
| covid19_contact | 0.050774 |
| risk_mortality | 0.014074 |
| mdma | 0.012152 |
| heart_disease | 0.007975 |
| weight | 0.007503 |
| lsd | 0.007137 |
| height | 0.006999 |
| cocaine | 0.006833 |
| rate_reducing_mask | 0.006201 |
| ip_longitude | 0.006122 |
| diabetes | 0.005700 |
| kidney_disease | 0.004725 |
| other_chronic | 0.004638 |
| compromised_immune | 0.004308 |
| bmi | 0.004280 |
| hypertension | 0.004055 |
| hiv_positive | 0.003993 |
| contacts_count | 0.003741 |
| ip_latitude | 0.003448 |
| lung_disease | 0.003296 |
| amphetamines | 0.002425 |
| asthma | 0.001956 |
| house_count | -0.001151 |
| ip_accuracy | -0.001347 |
| opinion_mortality | -0.002450 |
| alcohol | -0.004070 |
| cannabis | -0.004418 |
| rate_government_action | -0.005191 |
| rate_reducing_risk_house | -0.010192 |
| rate_reducing_risk_single | -0.013982 |
df.corr()['covid19_positive'].sort_values(ascending=False).plot(kind='barh', figsize=(12,12))<matplotlib.axes._subplots.AxesSubplot at 0x2278095ccc0>
df.corr().style.format("{:.2}").background_gradient(cmap=plt.get_cmap('coolwarm'), axis=1) Raw Data Inspection Observations:
Most of the data collected ~ 87% comes from the United states with Canada 5% and UK 2.5% next. The rest of the countries reporting are even smaller in terms of contribution size. A very small percentage: .0014% tested positive for COVID-19 in this sample. There are no direct correlations and the most highly correlated features of the unprocessed data are:
| Feature: | Correlation: |
|---|---|
| covid19_positive | 1.000000 |
| risk_infection | 0.198632 |
| covid19_symptoms | 0.089861 |
| opinion_infection | 0.054837 |
| covid19_contact | 0.050774 |
| risk_mortality | 0.014074 |
| mdma | 0.012152 |
| heart_disease | 0.007975 |
| weight | 0.007503 |
| lsd | 0.007137 |
| height | 0.006999 |
Preprocessing:
This section outlines steps taken to prepare the data for analysis. The first step was to address missing/null values.
Initial visual inspection of null values indicates that region and prescription medication are sparsely populated. Since region was ~90% missing, it was dropped. Prescription medication had 57K values and details are included in this section.
The opinion_infections and opinion_mortality columns are also a little 'light' in terms of responses and have the same number of responses. This null rate of ~16% was imputed with the median values for each respective field.
Null values in columns that contain <5% null values were dropped.
Other than those outlined above, there doesn't seem to be be any other apparent patterns for incomplete data. (See below).
Null or Missing Data:
import missingno
missingno.matrix(df)<matplotlib.axes._subplots.AxesSubplot at 0x22780ff8b38>
Aditional inspection shows that there are quite a few columns with less than 5% null values. Since this dataset is so large, it seems reasonable to remove these. Details follow:
nulls = pd.DataFrame(df.isna().sum()/len(df)*100)
nulls = pd.DataFrame(nulls.reset_index())
nulls.columns=['variable', '%_Null']
nulls.sort_values(by='%_Null', ascending=False, inplace=True)
nulls.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
| variable | %_Null | |
|---|---|---|
| 1 | region | 93.167342 |
| 38 | prescription_medication | 68.800876 |
| 40 | opinion_mortality | 17.445604 |
| 39 | opinion_infection | 17.445604 |
| 16 | cocaine | 4.705611 |
| 15 | amphetamines | 4.430825 |
| 17 | lsd | 4.089644 |
| 18 | mdma | 3.513255 |
| 14 | cannabis | 2.017198 |
| 21 | text_working | 0.683654 |
| 19 | contacts_count | 0.683654 |
| 25 | rate_reducing_mask | 0.299341 |
| 12 | smoking | 0.299341 |
| 13 | alcohol | 0.299341 |
| 41 | risk_infection | 0.012439 |
| 42 | risk_mortality | 0.012439 |
| 2 | country | 0.002746 |
| 5 | ip_accuracy | 0.000162 |
| 11 | blood_type | 0.000000 |
| 30 | kidney_disease | 0.000000 |
| 3 | ip_latitude | 0.000000 |
| 4 | ip_longitude | 0.000000 |
| 37 | other_chronic | 0.000000 |
| 36 | hypertension | 0.000000 |
| 35 | hiv_positive | 0.000000 |
| 34 | diabetes | 0.000000 |
| 33 | lung_disease | 0.000000 |
| 32 | heart_disease | 0.000000 |
| 31 | compromised_immune | 0.000000 |
| 29 | asthma | 0.000000 |
| 10 | bmi | 0.000000 |
| 28 | covid19_contact | 0.000000 |
| 27 | covid19_symptoms | 0.000000 |
| 26 | covid19_positive | 0.000000 |
| 6 | sex | 0.000000 |
| 24 | rate_reducing_risk_house | 0.000000 |
| 23 | rate_reducing_risk_single | 0.000000 |
| 22 | rate_government_action | 0.000000 |
| 7 | age | 0.000000 |
| 20 | house_count | 0.000000 |
| 8 | height | 0.000000 |
| 9 | weight | 0.000000 |
| 0 | survey_date | 0.000000 |
Main Dataset:
Columns dropped:
These columns were dropped in prior processing:
- Date While the date the data was collected could have a bearing on whether or not someone tested postivie, it would not provide insight to biological, behavioral or geographical indicators.
- Region This was a feature that substantially lacked data in the inital collection with 93% of the values missing.
In addition the following columns were dropped with rationale below:
- ip_accuracy - This feature measures the accuracy of the IP location and is used in the data collection process rather than for predicting a medical condition.
- risk_infection - This is a value calculated post-hoc, based on the data collected from this dataset
- risk_mortality - This is a value calculated post-hoc, based on the data collected from this dataset
- prescription_medication - This column contains text-strings and has over 57K values. A column was added called taking_prescription_medication to capture if an individual is taking prescribed medicine. It's proposed to deal with this column separately if it's indicated to be a factor separately since this is computationally expensive.
Train/Test Split:
from sklearn.model_selection import train_test_splity = df2['covid19_positive'].copy()
X = df2.drop('covid19_positive', axis=1).copy()y.value_counts()0 574419
1 791
Name: covid19_positive, dtype: int64
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.25, stratify=y, random_state=123)len(X_test)143803
len(y_test)143803
len(y_train)431407
len(X_train)431407
print(df['covid19_positive'].value_counts(normalize=True))
coviddf= pd.DataFrame(df['covid19_positive'].value_counts(normalize=True)*100)
coviddf.plot(kind='bar', color='r')
plt.title('Covid19 Positive Rates')0 0.998625
1 0.001375
Name: covid19_positive, dtype: float64
Text(0.5, 1.0, 'Covid19 Positive Rates')
Inspecting training set for imbalance
y_train.value_counts()0 430814
1 593
Name: covid19_positive, dtype: int64
y_test.value_counts()0 143605
1 198
Name: covid19_positive, dtype: int64
Modeling:
First Attempt: Using 'vanilla' Decision Tree and SMOTE to address imbalances
Several attempts were made implementing different attempts to deal with the extreme imbalance since the tartget class was also the minority class. Testing demonstrated that tuning the model's weight class returned increased performance over random oversampling or undersampling. Various models were tested including Decision Tree, Random Forest and XGBoost. When GridSearch was applied, modeling performed poorly when compared to manual tuning. In some cases, the size of the dataset proved to be too computationally expensive to implement GridSearch and testing began to run over 20 hours. Because manual tuning proved to be more efficient, GridSearch was abandoned. The final iteration was a manually tuned random forsest classifier with >90% accuracy and >64% recall.
BEST MODEL: Manually tuned Random Forest
time = fn.Timer()
time.start()
rf_clf8 = RandomForestClassifier(criterion='gini', max_depth=2, max_features=.45, class_weight='balanced',n_estimators=80, random_state=111)
rf_clf8.fit(X_train, y_train)
time.stop()yh8=rf_clf8.predict(X_test)mean_rf_cv_score = np.mean(cross_val_score(rf_clf8, X_train, y_train, cv=3))
print(f"Mean Cross Validation Score for Random Forest Classifier: {mean_rf_cv_score :.2%}")fn.evaluate_model(X_test, y_test, yh8, X_train, y_train, rf_clf8)Observations on manually tuned random forest:
The overtraining data issues were addressed via several iterations of tuning the modeling. This model has an overall accuracy average of .96 and weighted recall of .96 this iteration has yieled the highest true positive rate and is highest rated in terms of overall performance. Mean Cross Validation Score revealed a result of 95.96%.
The area under the curve demonstrates the reliability of the model is 87.1% which is a 34% increase over the baseline of .53 introduced in the inital model.
The most important factors are listed below:
fn.df_import(rf_clf8,X_train,n=10)fn.plot_importance(rf_clf8,X_train)Decision Tree visualizations from Random Forest Model:
dot_data1 = export_graphviz(rf_clf8.estimators_[3], out_file=None,
feature_names=X_train.columns,
class_names=np.unique(y).astype('str'),
filled=True, rounded=True, special_characters=True)
# Draw graph
graph1 = graph_from_dot_data(dot_data1)
# Show graph
Image(graph1.create_png())Attempting Randomized Search:
WARNING: The following 7 input lines tast take 94.5 mins to run and have been commented out.
The model was clearly overtrained and performed poorly. Observations are recorded below:
from sklearn.model_selection import RandomizedSearchCV# stop
# time = fn.Timer()
# time.start()
# rf_clfb = RandomForestClassifier(class_weight='balanced', random_state=111)
# ## Set up param grid
# param_grid = {'criterion':['gini','entropy'],
# 'max_depth':[7,8, 10,15],
# 'max_features':[.2, .3, .45],
# 'n_estimators' :[75,100,125, 150]}
# ## Instantiate GridSearchCV
# rgrid_clfb = RandomizedSearchCV(rf_clfb, param_grid, n_jobs=-1, verbose=1, cv=skf)
# time.stop()#rgrid_clfb.fit(X_train, y_train)yhtrgrid = rgrid_clfb.predict(X_test)rgrid_clfb.best_params_rf_clfb1 = RandomForestClassifier(criterion = 'gini', n_estimators=100, max_features=.2,
max_depth=15, class_weight='balanced', random_state=111)
time = fn.Timer()
time.start()
rf_clfb1.fit(X_train, y_train)
time.stop()# hytb1 = rf_clfb1.predict(X_test)#fn.evaluate_model(X_test, y_test, hytb1, X_train_res, y_train_res, rf_clfb1)Observations:
Validates that a manually tuned Random Forest model performed best. Depite the AUC remaining relatively high(86.1), the true positive rate is extremely poor at .12.
| precision | recall | f1-score | support |
|---|---|---|---|
| 0 | 0.999 | 0.998 | 0.999 |
| 1 | 0.085 | 0.116 | 0.098 |
| accuracy | 0.997 | 0.997 | 0.997 |
| macro avg | 0.542 | 0.557 | 0.548 |
| weighted avg | 0.998 | 0.997 | 0.997 |
Training Accuracy : 0.9988614576127981
Test Accuracy : 0.9970584758315195
Conclusion:
This dataset was a random sample of over 618,000 individuals reporting biological, behavioral, and environmental factors as well as their COVID-19 status. The medium used to collect the data is operated by a UK - based an open-source platform that's main focus is data analytics and is non-medical in nature. The questionaire used to collect data has undergone several versions and several features collected during this sample are no longer being tracked. A very small rate (.013%) reported testing positive providing a hyper-imbalanced dataset. It should be noted that at this time there was a shortage of tests available in the United States as well as time taken to get results was up to two weeks in latency.
Using data collected over a 15 day span (March 27 - April 10, 2020), a predictive model was developed to identify top factors in contracting COVID-19.
The factors that rated highest in predicting contraction of COVID-19 were derived using a Random Forest Classification model that yeilded an overall accuracy average of .97 and weighted recall of .97. A Receiver Operator Characteristic (ROC) curve demonstrates a diagnostic ability of this binary classifier to be 84.9%. This model ran at a higher sensitivy rate of 96% than it's specificity rate of 64%.
The most important factors are listed below. The + or - signs indicate whether the correlations associated with these factors were positive or negative:
| Factor | Description | Importance | Correlation |
|---|---|---|---|
| opinion_infection | Individual believed they contracted COVID-19 | 0.529735 | + |
| covid19_symptoms | Individual exhibited COVID-19 symptops | 0.285415 | + |
| covid19_contact | Individual came in contact with another who was COVID-19+ | 0.122519 | + |
| rate_reducing_risk_house | Househould practiced social distancing and hygiene | 0.0136553 | - |
| omwasnull | Individual did not respond if they believed they could die from COVID-19 | 0.0127974 | + |
| rate_reducing_risk_single | Individual practiced social distancing and hygiene | 0.0108048 | - |
| oiwasnull | Individual did not respond if they believed they could die from COVID-19 | 0.00731367 | + |
| sex_male | Indivdiual was male | 0.00641232 | + |
| sex_female | Individual was female | 0.00354491 | - |
| bmi | Body Mass Index (kg/m** 2) | 0.00247806 | + |
| taking prescription medication | The individual was taking prescription medication | 0.001504 | + |
It's common sense that having symptoms and coming into contact with someone infected would be factors in contracting a contagious disease, and are proven as such since they classified as top factors. More information is needed on 'opinion_infection' as this is also a top factor. It is believed that this feature indicates the individual is infected with COVID-19. However there is no background and data is no longer being collected on this datapoint.
The fields 'oiwasnull' and 'omwasnull' is associated with this factor - and reflects that respondents did not fill this value in. Possible suggested reasons for this are 1) lack of attention to detail by the respondant, 2) stigma or other psychological rationale associated with being COVID-19+ 3) Survey stopped collecting data.
This model also illustrates that behavioral modifications such as taking precautionary measures by practicing social distancing and hygiene individually and as a household as important factors in predicting disease contraction. It should be noted that collective action ranked higher than individual action and the more these behaviors increased, classifing the individual as COVID-19+ decreased. In addition, these two factors were the two most negatively correlated factors with testing positive.
Sex ranked in the top 10 as important factors as well. Men ranked higher than women and this is accentuated by the fact that men had a positive correlation with becoming infected and women had a negative correlation.
Both BMI and taking prescriptions even relatively small in comparison were identified in terms of feature importance.
There are many possible reasons why increased BMI could be associated with an increased succeptibility. This study supports those who have a bmi greater than 35 are at higher risk. Future work could be done to further investigate the BMI feature interpret if this dataset aligns with this model.
Those taking prescription medications could suggest a state poor health, but claiming this rationale is somewhat presumptive and would obviously need further investigation. The data provided did include quantities and labels for each prescription medication, however since the nature of 57000 unique values in the original dataset, it was tracked to evaluate feature importance. Future work could be done in this area while taking into consideration the ranking of it's relatively low feature importance.
Recommendations:
- If an individual has symptoms associated with COVID-19, or think they could be infected, testing is recommended for confirmation.
- Practice social distancing and hygiene individually and as a household
*This is especially true for males and those with high BMI.- Avoid contact with those known to be infected.
?4. If taking prescription medication, it is recommended to discuss additional risk for infection with a physician.
Appendix:
This section contains code for some of the visualizations and supporting information used in the non-technial presentation.
importlib.reload(fn)#getting all necessary feature imporance values from best model
corrs = pd.Series(rf_clf8.feature_importances_, index=X_train.columns, name='importance')
x = corrs.sort_values(ascending=False).head(11)
x#df with all features and corresponding correlation values
df_cor = pd.DataFrame(df2.corr()['covid19_positive'].sort_values(ascending=False))
df_cor#pulling all correlations for each of the important features for best model:
y = df_cor.loc[list(x.index),:]
yVizualizations for non-technical presentation:
import sklearn.metrics as metrics
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import confusion_matrix
font = {'family': 'serif',
'color': 'darkred',
'weight': 'normal',
'size': 16,
}
labels = ['COVID-19 -','COVID-19 +']
cnf_matrix = metrics.plot_confusion_matrix(rf_clf8,X_test,y_test,cmap='Reds',
normalize='true',display_labels=labels)
plt.title('Random Forest Classifier Prediction Rates', fontdict=font)
pos_map = {'0' : 'Covid19 Negative',
'1': 'Covid19 Posistive'}
coviddf['covid19_positive'] = coviddf['covid19_positive'].map(pos_map)font = {'family': 'serif',
'color': 'darkred',
'weight': 'normal',
'size': 18,
}
font1 = {'family': 'serif',
'color': 'black',
'weight': 'normal',
'size': 12,
}
print(df['covid19_positive'].value_counts(normalize=True))
lables = ['Negative', 'Positive']
coviddf= pd.DataFrame(df['covid19_positive'].value_counts(normalize=True)*100)
coviddf.plot(kind='bar', color='darkred')
plt.title('COVID-19 Rates', fontdict=font)
plt.xlabel('Negative ::: Positive', fontdict=font1)
plt.ylabel('Percent', fontdict=font1)
fig= plt.figure()
df_import = pd.Series(rf_clf8.feature_importances_, index=X_train.columns, name='importance')
df_import.sort_values().tail(11).plot(kind='barh',color='red', figsize=(7,5))
plt.title('Top 10 Feature Importance for the Contraction of COVID-19')
import pyplotdf['covid19_positive'].value_counts()#!pip install Counter
from collections import Counter
# summarize class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label
data = {"x":[], "y":[], "label":[]}
for label, coord in counter.items():
data["x"].append(coord[0])
data["y"].append(coord[1])
label["label"].append(label)
plt.figure(figsize=(10,8))
plt.title('Scatter Plot', fontsize=20)
plt.xlabel('x', fontsize=15)
plt.ylabel('y', fontsize=15)
plt.scatter(data["x"], data["y"], marker = 'o')
# add labels
for label, x, y in zip(data["label"], data["x"], data["y"]):
plt.annotate(label, xy = (x, y))counter.items()# featimpt = ['covid19_symptoms','opinion_infection', 'covid19_contact','rate_reducing_risk_house',
# 'taking_prescription_medication','text_working_travel critical','rate_reducing_risk_single',
# 'smoking_never', 'rate_reducing_mask', 'oiwasnull']featimptimportlib.reload(fn)df_cor.loc['smoking':]df_cor.head(60)df_cor.tail(50)def _plot_classification_report(y_true, y_pred_class):
import sklearn.metrics as metrics
report = metrics.classification_report(y_true, y_pred_class, output_dict=True)
report_df = pd.DataFrame(report).transpose().round(4)
fig, ax = plt.subplots()
ax.axis('off')
ax.axis('tight')
ax.table(cellText=report_df.values,
colLabels=report_df.columns,
rowLabels=report_df.index,
loc='center',
bbox=[0.2, 0.2, 0.8, 0.8])
fig.tight_layout()
return fig #alternative code for feature importance:
#df_import_tree = pd.Series(tree.feature_importances_, index=X_train.columns, name='importance').head(20)
# df_import_tree.sort_values().plot(kind='barh', figsize=(15,12))!pip install mictools!pip install ppscore



