Categorizing tumor types using over thirty independent descriptors from medical reports by
selecting the best classification algorithm among Gaussian Naive Bayes, Decision Tree, and Random Forest.
Achieved 96.26% accuracy.
Description of Data
- Cancer Data Set features are computed from a digitized image of a
fine needle aspirate (FNA) of a breast mass
- Ten real-valued features are computed for each cell nucleus
- Data set records the mean, standard error, and “worst” (largest) value of each of these ten features, giving 30 feature columns
- Data set also contains a patient ID number and the diagnosis (M = malignant, B = benign)
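The column layout described above can be sketched in a few lines. This is only an illustration: it assumes the data is the Wisconsin Diagnostic Breast Cancer set, whose ten per-nucleus measurements are listed below, and the column names (`id`, `diagnosis`, `area_mean`, and so on) are hypothetical naming choices, not taken from the project files.

```python
# Ten per-nucleus measurements (assumed: Wisconsin Diagnostic Breast Cancer layout).
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]

# Each measurement is summarized three ways per image.
STATISTICS = ["mean", "se", "worst"]

# Hypothetical column names: ID, diagnosis, then 10 x 3 = 30 feature columns.
columns = ["id", "diagnosis"] + [
    f"{feature}_{stat}" for stat in STATISTICS for feature in BASE_FEATURES
]

print(len(columns))  # 32 columns in total
```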
Creating a Model
Created several models to test which gives the highest accuracy score with the narrowed feature set:
- Decision Tree
- Random Forest
- Multiple Logistic Regression
- Gaussian Naive Bayes
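A comparison of this kind might look like the following sketch. It is not the project's actual code: it assumes scikit-learn, uses the copy of the breast cancer data bundled with it, picks the `mean` variant of the four narrowed features, and uses an 80/20 stratified split with a fixed seed. All of those choices are assumptions, so the scores it prints will not necessarily match the figures reported here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()

# Assumed narrowed feature set: the "mean" variant of each feature.
wanted = ["mean area", "mean texture", "mean smoothness", "mean concavity"]
feature_names = list(data.feature_names)
cols = [feature_names.index(name) for name in wanted]

X = data.data[:, cols]
y = data.target  # scikit-learn encodes 0 = malignant, 1 = benign

# Assumed split: 80/20, stratified by class, fixed seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    # Scaling helps the logistic regression solver converge.
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "Gaussian Naive Bayes": GaussianNB(),
}

# Fit each model and record its accuracy on the held-out data.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

best_name = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
print("Best:", best_name)
```

A single split gives a noisy ranking on a data set this small, so cross-validation would be a reasonable next step before settling on one model.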
Gaussian Model Benefits:
- Easy and fast to predict the class of a test data set.
- Performs well in multi-class prediction.
- When the assumption of independence holds, a Naive Bayes classifier can perform better than other models such as logistic regression, and it needs less training data.
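To make the independence assumption concrete, here is a minimal from-scratch sketch of Gaussian Naive Bayes. It is an illustration rather than the implementation used in this project: the function names `fit_gaussian_nb` and `predict_gaussian_nb` are hypothetical, and the toy data is made up. The key line is the inner loop, where each feature contributes its own Gaussian log-likelihood and the contributions are simply added, which is only valid if the features are independent within each class.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one sample."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        # Independence assumption: per-feature log-likelihoods are summed.
        for value, mean, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var)
            score -= (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy data with two features, for illustration only.
X_toy = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 4.0], [3.2, 3.8], [2.8, 4.2]]
y_toy = ["B", "B", "B", "M", "M", "M"]

toy_model = fit_gaussian_nb(X_toy, y_toy)
print(predict_gaussian_nb(toy_model, [1.1, 2.0]))  # B
print(predict_gaussian_nb(toy_model, [3.1, 4.1]))  # M
```

Because training only needs one mean and one variance per feature per class, fitting and prediction are both fast, which is the first benefit listed above.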
Validation Result
- Training Data
  - R2 value: 0.705126
  - Mean Absolute Error: 0.0689
- Testing Data
  - R2 value: 0.83474
  - Mean Absolute Error: 0.03738
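- Both measures improved on the test data: R2 moved closer to 1 and the MAE decreased. With 0/1 class labels the MAE equals the misclassification rate, so the test MAE of 0.03738 corresponds to 4 errors in 107 samples, consistent with the 96.26% accuracy.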
- Confusion Matrix:
Actual | Predicted False | Predicted True |
0 | 69 (True Negative) | 1 (False Positive) |
1 | 3 (False Negative) | 34 (True Positive) |
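The reported test figures can be cross-checked from the confusion matrix alone. The sketch below assumes the labels are coded 0/1 and that R2 and MAE were computed on the hard class predictions. The source does not state how they were computed, but under that assumption the values reproduce the reported 96.26% accuracy, 0.03738 MAE, and 0.83474 R2.

```python
# Counts taken from the confusion matrix above.
tn, fp, fn, tp = 69, 1, 3, 34
n = tn + fp + fn + tp  # 107 test samples

accuracy = (tp + tn) / n
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

# With 0/1 labels every misclassification has absolute error 1,
# so the mean absolute error is the misclassification rate.
mae = (fp + fn) / n

# R2 = 1 - SS_res / SS_tot, where SS_res counts the errors and
# SS_tot is the total squared deviation of the labels from their mean.
positives = tp + fn
ss_res = fp + fn
ss_tot = positives * (n - positives) / n
r2 = 1 - ss_res / ss_tot

print(f"accuracy  = {accuracy:.4f}")   # 0.9626
print(f"precision = {precision:.4f}")  # 0.9714
print(f"recall    = {recall:.4f}")     # 0.9189
print(f"MAE       = {mae:.5f}")        # 0.03738
print(f"R2        = {r2:.4f}")         # 0.8347
```

For a diagnosis task the recall is worth reading alongside accuracy, since the 3 false negatives are malignant cases the model labeled as benign.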
Gaussian Model Result Explanation:
- The model shows that the 4 features used (area, texture, smoothness, concavity) can predict the diagnosis, with the caveat that they are assumed to be independent of one another.
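One rough way to probe the independence caveat is to look at pairwise correlations between the chosen features. The sketch below is an assumption about how such a check could be done, not something the project reports: `pearson` and `correlated_pairs` are hypothetical helpers, the 0.8 threshold is arbitrary, and the toy values are made up and say nothing about the real data. Correlation also only captures linear dependence, so a low value does not prove independence.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlated_pairs(columns, threshold=0.8):
    """List feature pairs whose absolute correlation meets the threshold."""
    names = list(columns)
    flagged = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(columns[first], columns[second])
            if abs(r) >= threshold:
                flagged.append((first, second, round(r, 3)))
    return flagged


# Made-up values, for illustration only.
toy_columns = {
    "area": [1.0, 2.0, 3.0, 4.0, 5.0],
    "concavity": [2.1, 3.9, 6.2, 8.0, 9.8],
    "texture": [3.0, 1.0, 4.0, 1.0, 5.0],
}

flagged = correlated_pairs(toy_columns)
print(flagged)
```

Pairs flagged this way are candidates for dropping or combining before fitting a Naive Bayes model, since strongly correlated features are effectively counted twice.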