employee-statistics

Employee Statistics

The dataset stores the employee attributes with some earning more than 50K while others less than 50K annually. Attributes of the employee known to us are as follows:

age: Age of the employee
workclass: Workclass of the employee
education: Highest qualification of the employee
education-num: Total number of years spent in education
marital-status: Marital Status of the employee
occupation: Profession of the employee
relationship: Dependents of the employee
race: Ethnicity of the employee
sex: Gender of the employee
hours-per-week: Hours per week of the employee
native-country: Native country of the employee
hours-per-week: Salary: 50K or more

The classes are categorized into two types: “Yes” and “No” and not balanced (e.g. there are many more “No” than “Yes”). No missing values were found. Outliers are detected and removed. Additionally, we are using feature scaling techniques.

Goal

The objective is to classify the data using binomial classification. A list of machine learning models will be trained and tested to determine which will yield the best results:

Logistic Regression
MLP Classifier
Random Forest Classifier
AdaBoost Classifier

Exploratory Data Analysis

age_sex_count

As the age increases, so does the salary of the employee. We don’t see employees earning more than 50K in their early 20s

sex_workclass_distribution

Capacity comprises of more than 30% of the workforce are Female. Employees belonging to Private companies are much more than those of other workclasses

hours_per_week_for_occupation

In most occupations, employees work 40 hours a week except Armed Forces Employees working as ‘Exec-managerial’ or ‘Prof-specialty’ require more expertise and therefore earn more than others

Results

Best Model: MLP Classifier Best Score: 0.792

confusion_matrix

We can predict whether an employee will have a salary > 50K by the recall. If precision determines the accuracy of our model, then recall evaluates the model’s ability to serve the purpose intended. It appears our model can correctly predict the outcomes up to 53%. Type 1 errors are observed as assuming Salary_50K is “Yes” when actually that is false. Type 2 errors are observed as assuming Salary_50K is “No” when actually that is not true

roc_curve

A ROC curve of 0.69 is moderate level of accuracy in distinguishing between positive and negative classes. Classifiers that give curves closer to the top-left corner indicate a better performance of the model.