
Mitigating salary bias due to gender using AutoML fairness

Table of Contents

Introduction
Necessary Imports
Accessing the dataset
Model Building using AutoML
Check fairness of unmitigated model
- Analyse model fairness
- Choosing a Metric
Check fairness score of trained model
Mitigation using demographic parity ratio
DPR mitigation Analysis
Mitigation using Equalized Odds Ratio
Reducing the threshold for a successful mitigation
Conclusion
Data resources

Introduction

Bias is prevalent in most datasets, often introduced during data collection or by other factors. While preprocessing typically addresses problems such as missing data, corrupted records, and outliers, and includes steps such as feature engineering, bias in datasets is frequently overlooked. Consequently, models trained on biased data can produce biased predictions. To address this, we present a detailed methodology for detecting and mitigating gender bias in salary prediction as a specific case study. Removing bias is a complex process, and we leverage the capabilities of AutoML both to mitigate bias and to identify the best-performing fair models.

Necessary Imports

%matplotlib inline
import matplotlib.pyplot as plt

import pandas as pd

import arcgis
from arcgis.gis import GIS


from arcgis.learn import prepare_tabulardata, AutoML
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Connecting to ArcGIS

gis = GIS("home")

Accessing the dataset

The dataset comprises demographic and employment information for a diverse group of individuals in the United States, featuring variables such as age, education level, occupation, marital status, salary, and more. Our goal is to train a model that predicts whether an individual's salary is above or below 50k.

data_table = gis.content.get("9f56292f1bec417da75d577bbd131889")
data_table

salary

CSV by api_data_owner
Last Modified: July 17, 2024
0 comments, 0 views

# Download the CSV and save it to a local folder
data_path = data_table.get_data()

adult_income = pd.read_csv(data_path).drop(["Unnamed: 0"], axis=1)
adult_income.head()

	Age	Workclass	Fnlwgt	Education	Education-num	Marital-status	Occupation	Relationship	Race	Gender	Capital-gain	Hours-per-week	Native-country	Salary	annual_salary_$
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K	64375
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K	19304
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K	55493
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K	78591
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K	55388

The dataset consists of 32,561 records, with 21,790 males and 10,771 females. The age range is from 18 to 59 years old. The majority of the individuals are from the United States (93%), with a few from Puerto Rico, Jamaica, and Cuba. The most common education level is HS-grad (34%), followed by Some-college (20%), and Bachelors (15%). The majority of the individuals are married (63%), with a significant number being divorced (15%) or never-married (12%).

A basic analysis of the salary distribution by gender reveals an imbalance: 30.57% of males earn more than 50K, compared to only 10.95% of females. A model trained on this data may learn and reproduce the disparity, so further analysis and fairness mitigation are needed to understand and address it.
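The per-gender shares quoted above can be computed with a few lines of plain Python. The following is a minimal sketch on a tiny made-up sample (not the real dataset); the helper name `high_earner_share` is hypothetical, and the `strip()` calls assume that the category values carry leading spaces, as the training log further down suggests.

```python
from collections import Counter

def high_earner_share(records):
    """Return {gender: fraction of records whose salary is '>50K'}."""
    totals, positives = Counter(), Counter()
    for gender, salary in records:
        gender, salary = gender.strip(), salary.strip()
        totals[gender] += 1
        if salary == ">50K":
            positives[gender] += 1
    return {gender: positives[gender] / totals[gender] for gender in totals}

# Made-up sample, used only to illustrate the calculation
sample = [
    (" Male", " >50K"), (" Male", " <=50K"), (" Male", " >50K"),
    (" Female", " <=50K"), (" Female", " <=50K"), (" Female", " >50K"),
]
shares = high_earner_share(sample)
print(shares)
```

With the notebook's dataframe, the same helper could be fed with `zip(adult_income["Gender"], adult_income["Salary"])`.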

adult_income.columns

Index(['Age', 'Workclass', 'Fnlwgt', 'Education', 'Education-num',
       'Marital-status', 'Occupation', 'Relationship', 'Race', 'Gender',
       'Capital-gain', 'Capital-loss', 'Hours-per-week', 'Native-country',
       'Salary', 'annual_salary_$'],
      dtype='object')

Data processing consists of first splitting the dataset into a training dataset and a testing dataset as follows:

test_size = 0.20
train, test = train_test_split(adult_income, test_size = test_size, random_state=32, shuffle=True)

Model Building using AutoML

First, we will train a baseline model using AutoML, which will generate a fairness score for evaluation. This will be a classification model trained on relevant demographic explanatory features from the dataset to classify the salary of employees. Here Education-num, Capital-gain, Capital-loss and Hours-per-week are treated as continuous variables, and the rest as categorical. The target variable Salary has two classes and is therefore suitable for the current AutoML implementation of fairness mitigation, which can handle only binary classification.

Data Preparation

The data is prepared with the prepare_tabulardata method from the arcgis.learn module in the ArcGIS API for Python. This function takes either a non-spatial dataframe, a feature layer, or a spatial dataframe containing the dataset as input and returns a TabularDataObject that can be fed into the model. Here we are using a non-spatial dataframe.

The primary input parameters required for the tool are:

input_features : non-spatial dataframe containing the primary dataset
variable_predict : field name `Salary` as the y-variable to be predicted from the input dataframe
explanatory_variables : The selected list of explanatory variables.

explanatory_variables = [
    ('Age', True), ('Workclass', True), ('Education', True), 'Education-num',
    ('Marital-status', True), ('Occupation', True), ('Relationship', True),
    ('Race', True), ('Gender', True), 'Capital-gain', 'Capital-loss',
    'Hours-per-week', ('Native-country', True)
]

data = prepare_tabulardata(train, 'Salary', explanatory_variables=explanatory_variables)

Dataframe is not spatial, Rasters and distance layers will not work
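In the explanatory_variables list above, a (name, True) tuple marks a categorical field and a bare string marks a continuous one. The sketch below separates the two kinds so the choice can be checked at a glance; the helper name `split_explanatory_variables` is hypothetical and not part of arcgis.learn.

```python
def split_explanatory_variables(variables):
    """Split a prepare_tabulardata-style list into categorical and continuous names."""
    categorical, continuous = [], []
    for item in variables:
        if isinstance(item, tuple):
            name, is_categorical = item
            (categorical if is_categorical else continuous).append(name)
        else:
            # A bare string is treated as a continuous variable
            continuous.append(item)
    return categorical, continuous

explanatory_variables = [
    ('Age', True), ('Workclass', True), ('Education', True), 'Education-num',
    ('Marital-status', True), ('Occupation', True), ('Relationship', True),
    ('Race', True), ('Gender', True), 'Capital-gain', 'Capital-loss',
    'Hours-per-week', ('Native-country', True)
]

categorical, continuous = split_explanatory_variables(explanatory_variables)
print("categorical:", categorical)
print("continuous:", continuous)
```

Note that Age is passed as categorical in this notebook, which matches the statement that only four fields are treated as continuous.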

data.show_batch()

	Age	Capital-loss	Education	Education-num	Gender	Hours-per-week	Marital-status	Native-country	Occupation	Race	Relationship	Salary	Workclass
1174	47	0	HS-grad	9	Male	40	Married-civ-spouse	United-States	Other-service	White	Husband	<=50K	Private
5093	34	0	HS-grad	9	Male	40	Married-civ-spouse	United-States	Machine-op-inspct	White	Husband	<=50K	Private
11204	46	1977	Masters	14	Male	40	Married-civ-spouse	United-States	Tech-support	White	Husband	>50K	Private
12586	30	0	Some-college	10	Female	40	Divorced	United-States	Adm-clerical	White	Not-in-family	<=50K	Local-gov
29133	30	0	11th	7	Male	40	Married-spouse-absent	Mexico	Handlers-cleaners	Amer-Indian-Eskimo	Not-in-family	<=50K	Private

Model initialization

Here we will initialize the AutoML model by passing the prepared tabular data from above. We can also pass the mode of the model as Basic, Intermediate or Advanced. The default is Basic.

automl_classifier_plain = AutoML(data=data)

Model training

Finally, the model is ready for training. To train the model, we call its fit() method. Based on the mode of the model, it will train for the relevant epochs until it finds the best model. The training time depends on the mode chosen, with Basic being the fastest and Advanced the most time-consuming.

The model tries the available algorithms suited to tabular data, such as Decision Tree, Random Trees, Extra Trees, LightGBM, and XGBoost, along with model ensembling, to find the best model.

automl_classifier_plain.fit()

Neural Network algorithm was disabled because it doesn't support n_jobs parameter.
Linear algorithm was disabled.
AutoML directory: C:\Users\sup10432\AppData\Local\Temp\scratch\tmpfvj5jmu4
The task is binary_classification with evaluation metric logloss
AutoML will use algorithms: ['Decision Tree', 'Random Trees', 'Extra Trees', 'LightGBM', 'Xgboost']
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'ensemble']
* Step simple_algorithms will try to check up to 1 model
DecisionTreeAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
1_DecisionTree logloss 0.361375 trained in 6.29 seconds
* Step default_algorithms will try to check up to 4 models
LightgbmAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
Exception while producing SHAP explanations. pandas dtypes must be int, float or bool.
Fields with bad pandas dtypes: Workclass: object, Education: object, Marital-status: object, Occupation: object, Relationship: object, Race: object, Gender: object, Native-country: object
Continuing ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
2_Default_LightGBM logloss 0.27688 trained in 5.5 seconds
There was an error during 3_Default_Xgboost training.
Please check C:\Users\sup10432\AppData\Local\Temp\scratch\tmpfvj5jmu4\errors.md for details.
RandomForestAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
4_Default_RandomTrees logloss 0.338299 trained in 8.8 seconds
ExtraTreesAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
5_Default_ExtraTrees logloss 0.368012 trained in 8.38 seconds
* Step ensemble will try to check up to 1 model
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
Ensemble logloss 0.27688 trained in 3.35 seconds
AutoML fit time: 39.66 seconds
AutoML best model: 2_Default_LightGBM
All the evaluated models are saved in the path  C:\Users\sup10432\AppData\Local\Temp\scratch\tmpfvj5jmu4

Once trained, the model score is checked to understand the performance of the trained model.

automl_classifier_plain.score()

elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison

0.7439539347408829

Additional insights into model performance can be observed from the model report, which includes the AutoML leaderboard, performance metrics for each algorithm attempted, a boxplot depicting model performance, and Spearman correlation analysis.

automl_classifier_plain.report()

In case the report html is not rendered appropriately in the notebook, the same can be found in the path C:\Users\sup10432\AppData\Local\Temp\scratch\tmpfvj5jmu4\README.html

AutoML Leaderboard

Best model	name	model_type	metric_type	metric_value	train_time
	1_DecisionTree	Decision Tree	logloss	0.361375	7.05
the best	2_Default_LightGBM	LightGBM	logloss	0.27688	6.4
	4_Default_RandomTrees	Random Trees	logloss	0.338299	9.54
	5_Default_ExtraTrees	Extra Trees	logloss	0.368012	9.12
	Ensemble	Ensemble	logloss	0.27688	3.35

[Figure: AutoML Performance]

[Figure: AutoML Performance Boxplot]

[Figure: Spearman Correlation of Models]

	Score	Threshold
logloss	0.361375	nan
auc	0.85532	nan
f1	0.628192	0.301711
accuracy	0.84352	0.424859
precision	1	0.976923
recall	1	0.0479006
mcc	0.534284	0.424859

	score	threshold
logloss	0.361375	nan
auc	0.85532	nan
f1	0.605192	0.424859
accuracy	0.84352	0.424859
precision	0.779441	0.424859
recall	0.494617	0.424859
mcc	0.534284	0.424859

	Predicted as <=50K	Predicted as >50K
Labeled as <=50K	4,712	221
Labeled as >50K	798	781

	Score	Threshold
logloss	0.27688	nan
auc	0.928871	nan
f1	0.735925	0.410305
accuracy	0.875	0.5331
precision	1	0.998262
recall	1	0.000350082
mcc	0.649588	0.410305

	score	threshold
logloss	0.27688	nan
auc	0.928871	nan
f1	0.718339	0.5331
accuracy	0.875	0.5331
precision	0.791762	0.5331
recall	0.657378	0.5331
mcc	0.643465	0.5331

	Predicted as <=50K	Predicted as >50K
Labeled as <=50K	4,660	273
Labeled as >50K	541	1,038

	Score	Threshold
logloss	0.338299	nan
auc	0.890019	nan
f1	0.671846	0.32308
accuracy	0.851505	0.549898
precision	1	0.991478
recall	1	0.0262374
mcc	0.564379	0.32308

	score	threshold
logloss	0.338299	nan
auc	0.890019	nan
f1	0.631057	0.549898
accuracy	0.851505	0.549898
precision	0.793666	0.549898
recall	0.523749	0.549898
mcc	0.561319	0.549898

	Predicted as <=50K	Predicted as >50K
Labeled as <=50K	4,718	215
Labeled as >50K	752	827

	Score	Threshold
logloss	0.368012	nan
auc	0.888746	nan
f1	0.662191	0.353819
accuracy	0.846437	0.476483
precision	1	0.68739
recall	1	0.0135017
mcc	0.551224	0.454251

	score	threshold
logloss	0.368012	nan
auc	0.888746	nan
f1	0.632623	0.476483
accuracy	0.846437	0.476483
precision	0.753281	0.476483
recall	0.545282	0.476483
mcc	0.54992	0.476483

	Predicted as <=50K	Predicted as >50K
Labeled as <=50K	4,651	282
Labeled as >50K	718	861

Model	Weight
2_Default_LightGBM	1

	Score	Threshold
logloss	0.27688	nan
auc	0.928871	nan
f1	0.735925	0.410305
accuracy	0.875	0.5331
precision	1	0.998262
recall	1	0.000350082
mcc	0.649588	0.410305

	score	threshold
logloss	0.27688	nan
auc	0.928871	nan
f1	0.718339	0.5331
accuracy	0.875	0.5331
precision	0.791762	0.5331
recall	0.657378	0.5331
mcc	0.643465	0.5331

	Predicted as <=50K	Predicted as >50K
Labeled as <=50K	4,660	273
Labeled as >50K	541	1,038
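The threshold-based scores in the report follow directly from the confusion matrix. As a check, the sketch below recomputes them for the best model (2_Default_LightGBM) from the four counts shown above, treating >50K as the positive class; the function name `binary_metrics` is hypothetical, and the formulas are the standard definitions rather than code taken from AutoML.

```python
from math import sqrt

def binary_metrics(tn, fp, fn, tp):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }

# Counts from the 2_Default_LightGBM confusion matrix (positive class: >50K)
metrics = binary_metrics(tn=4660, fp=273, fn=541, tp=1038)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

The results agree with the second metrics table for this model (precision 0.791762, recall 0.657378, accuracy 0.875, f1 0.718339, mcc 0.643465).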

Check fairness of unmitigated model for gender

Before proceeding, we need to verify if the baseline model exhibits bias and determine if mitigation is necessary. Initially, the fairness score of the baseline AutoML model is assessed to identify any gender-related bias, its type, and magnitude.

%matplotlib inline
fairness_df = automl_classifier_plain.fairness_score(sensitive_feature='Gender', visualize=True)

The output above shows four metrics for measuring fairness in classification problems: equalized odds difference (EOD), demographic parity difference (DPD), equalized odds ratio (EOR), and demographic parity ratio (DPR). We discuss the interpretation of these metrics below. To learn more about the metrics, see how fairness works.

fairness_df[1]

{'equalized_odds_difference': (0.16,
  'The value of equalized_odds_difference is 0.16 which is less than minimum threshold 0.25. The ideal value of this metric is 0. Fairness for this metric is between 0 and 0.25.'),
 'demographic_parity_difference': (0.2,
  'The value of demographic_parity_difference is 0.2 which is less than minimum threshold 0.25. The ideal value of this metric is 0. Fairness for this metric is between 0 and 0.25.'),
 'equalized_odds_ratio': (0.18,
  'The value of equalized_odds_ratio is 0.18 which is less than minimum threshold 0.8. The ideal value of this metric is 1. Fairness for this metric is between 0.8 and 1.'),
 'demographic_parity_ratio': (0.29,
  'The value of demographic_parity_ratio is 0.29 which is less than minimum threshold 0.8. The ideal value of this metric is 1. Fairness for this metric is between 0.8 and 1.')}

The fairness score reveals that the prediction is biased: the equalized odds ratio (0.18) and the demographic parity ratio (0.29) both fall well below the minimum fairness threshold of 0.8, against an ideal value of 1. The two difference metrics, by contrast, stay within their threshold of 0.25.
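The pass/fail rule stated in the messages above can be written as a small check. This is a sketch under the thresholds quoted in the output (0.25 for difference metrics, 0.8 for ratio metrics); the function name `is_fair` and its parameters are hypothetical and not part of the arcgis.learn API.

```python
def is_fair(metric_name, value, diff_threshold=0.25, ratio_threshold=0.8):
    """Apply the thresholds quoted in the fairness_score messages."""
    if metric_name.endswith("_ratio"):
        # Ratio metrics: ideal value 1, fair between the threshold and 1
        return ratio_threshold <= value <= 1.0
    if metric_name.endswith("_difference"):
        # Difference metrics: ideal value 0, fair between 0 and the threshold
        return 0.0 <= value <= diff_threshold
    raise ValueError(f"Unknown metric type: {metric_name}")

# Values reported by fairness_score for the baseline model
reported = {
    "equalized_odds_difference": 0.16,
    "demographic_parity_difference": 0.2,
    "equalized_odds_ratio": 0.18,
    "demographic_parity_ratio": 0.29,
}
verdicts = {name: is_fair(name, value) for name, value in reported.items()}
print(verdicts)
```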

fairness_df[0]

	accuracy	false positive rate	false negative rate	selection rate	count
( Female,)	0.933798	0.013245	0.443396	0.080139	0.330518
( Male,)	0.858372	0.074387	0.283422	0.280963	0.669482
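The four fairness scores can be reproduced from the per-group rates in this table. The sketch below assumes the commonly used definitions (largest gap or smallest ratio across groups, taken over selection rate for demographic parity and over true and false positive rates for equalized odds); the source does not show AutoML's internal code, so treat this as an illustration that happens to match the reported values.

```python
# Per-group rates copied from the fairness_df[0] table above
rates = {
    "Female": {"fpr": 0.013245, "fnr": 0.443396, "selection_rate": 0.080139},
    "Male": {"fpr": 0.074387, "fnr": 0.283422, "selection_rate": 0.280963},
}

def spread(values):
    """Largest gap between groups."""
    return max(values) - min(values)

def ratio(values):
    """Smallest group value divided by the largest."""
    return min(values) / max(values)

selection = [group["selection_rate"] for group in rates.values()]
tpr = [1 - group["fnr"] for group in rates.values()]  # true positive rate
fpr = [group["fpr"] for group in rates.values()]

fairness = {
    "demographic_parity_difference": spread(selection),
    "demographic_parity_ratio": ratio(selection),
    # Equalized odds takes the worse of the TPR and FPR comparisons
    "equalized_odds_difference": max(spread(tpr), spread(fpr)),
    "equalized_odds_ratio": min(ratio(tpr), ratio(fpr)),
}
rounded = {name: round(value, 2) for name, value in fairness.items()}
print(rounded)
```

Rounded to two decimals, these match the values returned in fairness_df[1]. The low equalized odds ratio is driven by the false positive rates, not by the true positive rates.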

Analyse model fairness

In the fairness report above, the Equalized Odds Ratio (EOR) and Demographic Parity Ratio (DPR) are the two critical metrics that reveal significant unfairness in the prediction outcomes between different genders. These metrics should be the primary focus for mitigation efforts. Strategies such as algorithmic adjustments, feature selection, or targeted interventions may be needed to address the observed biases and improve fairness in salary predictions.

Choosing a Metric

If the primary concern is to ensure fairness in both false positives and false negatives, then Equalized Odds Ratio (EOR) would be the preferred metric for bias mitigation. Addressing disparities in both types of errors can lead to a more balanced and equitable outcome.

However, if the focus is solely on ensuring an equal distribution of positive outcomes between genders, then Demographic Parity Ratio (DPR) might be sufficient for mitigation efforts.

In the context of this example:

Equalized Odds Ratio (EOR):

EOR focuses on ensuring fairness in both false positives and false negatives between males and females. The reported value is consistent with comparing the true positive rate and the false positive rate across the groups and taking the smaller of the two ratios. Here the EOR of 0.18 comes from the false positive rates in the table above: 0.013 for females against 0.074 for males, so the female rate is about 18% of the male rate. The true positive rates (about 0.56 for females and 0.72 for males) give a ratio of about 0.78. Mitigating bias using EOR means adjusting the model to achieve more balanced error rates across genders, thereby reducing disparities in both types of prediction errors (false positives and false negatives).

Demographic Parity Ratio (DPR):

DPR primarily aims to ensure an equal distribution of positive outcomes (e.g. salary above 50k) between genders, regardless of prediction errors. In this example, DPR (0.29) indicates that the selection rate (the share of individuals predicted to earn more than 50k) for females is 29% of that for males. Mitigating bias using DPR involves adjusting the model to achieve parity in positive outcome rates across genders, without necessarily addressing disparities in prediction errors.

Following this diagnosis, we will now attempt to mitigate the demographic parity ratio bias caused by gender. First we will initialize the AutoML model with the fairness metric for bias mitigation.

Mitigation using demographic parity ratio

The first step for mitigation is to identify a sensitive feature in the data that is introducing the bias and to specify an appropriate fairness metric for the task type (classification or regression). To do this, we initialize the model with Gender as the sensitive variable and DPR as the metric. DPR defines the fairness metric to be optimized and adjusted to achieve demographic parity in positive outcomes (salary) between different gender groups. Other parameters that can be used are fairness_threshold and underprivileged_groups, but the default values are used here. Refer to the earlier link for more details.

automl_mitigation_dpr_obj = AutoML(data, sensitive_variables=['Gender'], fairness_metric='demographic_parity_ratio')

After creating the AutoML object by passing the data obtained from prepare_tabulardata together with the fairness parameters, we train the model by calling the fit method as shown below. After training, all of the models and their variants will be saved in a new folder.

automl_mitigation_dpr_obj.fit()

Neural Network algorithm was disabled because it doesn't support n_jobs parameter.
Linear algorithm was disabled.
AutoML directory: C:\Users\sup10432\AppData\Local\Temp\scratch\tmp__c4b209
The task is binary_classification with evaluation metric logloss
AutoML will use algorithms: ['Decision Tree', 'Random Trees', 'Extra Trees', 'LightGBM', 'Xgboost']
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'unfairness_mitigation', 'ensemble']
* Step simple_algorithms will try to check up to 1 model
DecisionTreeAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
1_DecisionTree logloss 0.361375 trained in 6.95 seconds
* Step default_algorithms will try to check up to 4 models
LightgbmAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
Exception while producing SHAP explanations. pandas dtypes must be int, float or bool.
Fields with bad pandas dtypes: Workclass: object, Education: object, Marital-status: object, Occupation: object, Relationship: object, Race: object, Gender: object, Native-country: object
Continuing ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
2_Default_LightGBM logloss 0.27688 trained in 5.77 seconds
There was an error during 3_Default_Xgboost training.
Please check C:\Users\sup10432\AppData\Local\Temp\scratch\tmp__c4b209\errors.md for details.
RandomForestAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
4_Default_RandomTrees logloss 0.338299 trained in 9.4 seconds
ExtraTreesAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
5_Default_ExtraTrees logloss 0.368012 trained in 8.57 seconds
* Step unfairness_mitigation will try to check up to 4 models
RandomForestAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
4_Default_RandomTrees_SampleWeigthing logloss 0.35729 trained in 9.24 seconds
LightgbmAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
Exception while producing SHAP explanations. pandas dtypes must be int, float or bool.
Fields with bad pandas dtypes: Workclass: object, Education: object, Marital-status: object, Occupation: object, Relationship: object, Race: object, Gender: object, Native-country: object
Continuing ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
2_Default_LightGBM_SampleWeigthing logloss 0.285305 trained in 5.4 seconds
ExtraTreesAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
5_Default_ExtraTrees_SampleWeigthing logloss 0.384304 trained in 8.8 seconds
DecisionTreeAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
1_DecisionTree_SampleWeigthing logloss 0.423913 trained in 6.93 seconds
* Step unfairness_mitigation_update_1 will try to check up to 4 models
ExtraTreesAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
5_Default_ExtraTrees_SampleWeigthing_Update_1 logloss 0.412036 trained in 14.03 seconds
LightgbmAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
Exception while producing SHAP explanations. pandas dtypes must be int, float or bool.
Fields with bad pandas dtypes: Workclass: object, Education: object, Marital-status: object, Occupation: object, Relationship: object, Race: object, Gender: object, Native-country: object
Continuing ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
2_Default_LightGBM_SampleWeigthing_Update_1 logloss 0.295114 trained in 5.58 seconds
DecisionTreeAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
1_DecisionTree_SampleWeigthing_Update_1 logloss 0.462531 trained in 6.81 seconds
RandomForestAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
4_Default_RandomTrees_SampleWeigthing_Update_1 logloss 0.377543 trained in 12.94 seconds
* Step unfairness_mitigation_update_2 will try to check up to 2 models
RandomForestAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
4_Default_RandomTrees_SampleWeigthing_Update_2 logloss 0.404245 trained in 9.3 seconds
LightgbmAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
Exception while producing SHAP explanations. pandas dtypes must be int, float or bool.
Fields with bad pandas dtypes: Workclass: object, Education: object, Marital-status: object, Occupation: object, Relationship: object, Race: object, Gender: object, Native-country: object
Continuing ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
2_Default_LightGBM_SampleWeigthing_Update_2 logloss 0.307829 trained in 5.91 seconds
* Step ensemble will try to check up to 1 model
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
Ensemble logloss 0.307829 trained in 3.64 seconds
AutoML fit time: 141.1 seconds
AutoML best model: 2_Default_LightGBM_SampleWeigthing_Update_2
All the evaluated models are saved in the path  C:\Users\sup10432\AppData\Local\Temp\scratch\tmp__c4b209

Once the model is trained, we can verify whether the bias has been mitigated by reviewing the model report and examining the demographic parity ratio of the best model. Internally, AutoML uses an approach called reweighing for bias mitigation. Reweighing is a preprocessing method that adjusts the weights of the training examples in each (group, label) combination to ensure fairness before classification.
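To make the idea of reweighing concrete, here is a minimal sketch of the classic scheme, where each (group, label) combination receives the weight P(group) * P(label) / P(group, label). It runs on made-up data, and the function name `reweighing_weights` is hypothetical; the exact weighting and update steps used inside AutoML may differ.

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Weight per (group, label) pair: P(group) * P(label) / P(group, label)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    pair_counts = Counter(zip(groups, labels))
    return {
        (g, y): (group_counts[g] / n) * (label_counts[y] / n) / (count / n)
        for (g, y), count in pair_counts.items()
    }

# Made-up sample: 6 males (3 above 50K) and 4 females (1 above 50K)
groups = ["Male"] * 6 + ["Female"] * 4
labels = [">50K"] * 3 + ["<=50K"] * 3 + [">50K"] * 1 + ["<=50K"] * 3
weights = reweighing_weights(groups, labels)

def weighted_positive_rate(group):
    """Share of weighted examples in a group that carry the positive label."""
    pairs = [(g, y) for g, y in zip(groups, labels) if g == group]
    total = sum(weights[pair] for pair in pairs)
    positive = sum(weights[pair] for pair in pairs if pair[1] == ">50K")
    return positive / total

print(weights)
print(weighted_positive_rate("Male"), weighted_positive_rate("Female"))
```

After weighting, both groups have the same weighted positive rate, which is what lets a classifier trained with these sample weights move toward demographic parity.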

automl_mitigation_dpr_obj.report()

In case the report html is not rendered appropriately in the notebook, the same can be found in the path C:\Users\sup10432\AppData\Local\Temp\scratch\tmp__c4b209\README.html

AutoML Leaderboard

Best model	name	model_type	metric_type	metric_value	train_time	fairness_metric	fairness_Gender	is_fair
	1_DecisionTree	Decision Tree	logloss	0.361375	7.75	demographic_parity_ratio	0.1344	False
	2_Default_LightGBM	LightGBM	logloss	0.27688	6.52	demographic_parity_ratio	0.3252	False
	4_Default_RandomTrees	Random Trees	logloss	0.338299	10.23	demographic_parity_ratio	0.332	False
	5_Default_ExtraTrees	Extra Trees	logloss	0.368012	9.44	demographic_parity_ratio	0.2844	False
	4_Default_RandomTrees_SampleWeigthing	Random Trees	logloss	0.35729	10.06	demographic_parity_ratio	0.3612	False
	2_Default_LightGBM_SampleWeigthing	LightGBM	logloss	0.285305	6.22	demographic_parity_ratio	0.5264	False
	5_Default_ExtraTrees_SampleWeigthing	Extra Trees	logloss	0.384304	9.67	demographic_parity_ratio	0.7682	False
	1_DecisionTree_SampleWeigthing	Decision Tree	logloss	0.423913	7.76	demographic_parity_ratio	0.4991	False
	5_Default_ExtraTrees_SampleWeigthing_Update_1	Extra Trees	logloss	0.412036	14.89	demographic_parity_ratio	0.9246	True
	2_Default_LightGBM_SampleWeigthing_Update_1	LightGBM	logloss	0.295114	6.28	demographic_parity_ratio	0.6955	False
	1_DecisionTree_SampleWeigthing_Update_1	Decision Tree	logloss	0.462531	7.61	demographic_parity_ratio	0.4962	False
	4_Default_RandomTrees_SampleWeigthing_Update_1	Random Trees	logloss	0.377543	13.83	demographic_parity_ratio	0.7167	False
	4_Default_RandomTrees_SampleWeigthing_Update_2	Random Trees	logloss	0.404245	10.21	demographic_parity_ratio	0.8917	True
the best	2_Default_LightGBM_SampleWeigthing_Update_2	LightGBM	logloss	0.307829	6.62	demographic_parity_ratio	0.8406	True
	Ensemble	Ensemble	logloss	0.307829	3.64	demographic_parity_ratio	0.8406	True
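The leaderboard can be read as a two-step selection: keep the models whose fairness value meets the threshold, then take the one with the lowest logloss. The sketch below applies that rule to the rows above. This is an assumption about how the best model is chosen, consistent with this leaderboard but not confirmed by the source, and the 0.8 threshold is the minimum quoted earlier for ratio metrics.

```python
# (name, logloss, demographic_parity_ratio) copied from the leaderboard above
leaderboard = [
    ("1_DecisionTree", 0.361375, 0.1344),
    ("2_Default_LightGBM", 0.27688, 0.3252),
    ("4_Default_RandomTrees", 0.338299, 0.332),
    ("5_Default_ExtraTrees", 0.368012, 0.2844),
    ("4_Default_RandomTrees_SampleWeigthing", 0.35729, 0.3612),
    ("2_Default_LightGBM_SampleWeigthing", 0.285305, 0.5264),
    ("5_Default_ExtraTrees_SampleWeigthing", 0.384304, 0.7682),
    ("1_DecisionTree_SampleWeigthing", 0.423913, 0.4991),
    ("5_Default_ExtraTrees_SampleWeigthing_Update_1", 0.412036, 0.9246),
    ("2_Default_LightGBM_SampleWeigthing_Update_1", 0.295114, 0.6955),
    ("1_DecisionTree_SampleWeigthing_Update_1", 0.462531, 0.4962),
    ("4_Default_RandomTrees_SampleWeigthing_Update_1", 0.377543, 0.7167),
    ("4_Default_RandomTrees_SampleWeigthing_Update_2", 0.404245, 0.8917),
    ("2_Default_LightGBM_SampleWeigthing_Update_2", 0.307829, 0.8406),
    ("Ensemble", 0.307829, 0.8406),
]

FAIRNESS_THRESHOLD = 0.8  # minimum ratio quoted in the fairness messages

# Step 1: keep only models that meet the fairness threshold
fair_models = [row for row in leaderboard if row[2] >= FAIRNESS_THRESHOLD]

# Step 2: among the fair models, pick the lowest logloss (first one wins ties)
best_name, best_logloss, best_dpr = min(fair_models, key=lambda row: row[1])
print(best_name, best_logloss, best_dpr)
```

The unmitigated LightGBM model still has the lowest logloss overall (0.27688), so the selected model trades a small amount of predictive performance for fairness.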

[Figure: AutoML Performance]

[Figure: AutoML Performance Boxplot]

[Figure: Performance vs fairness_Gender]

[Figure: Spearman Correlation of Models]

Model: 1_DecisionTree

	Score	Threshold
logloss	0.361375	nan
auc	0.85532	nan
f1	0.628192	0.301711
accuracy	0.84352	0.424859
precision	1	0.976923
recall	1	0.0479006
mcc	0.534284	0.424859

	score	threshold
logloss	0.361375	nan
auc	0.85532	nan
f1	0.605192	0.424859
accuracy	0.84352	0.424859
precision	0.779441	0.424859
recall	0.494617	0.424859
mcc	0.534284	0.424859

	Predicted as <=50K	Predicted as >50K
Labeled as <=50K	4,712	221
Labeled as >50K	798	781

	Samples	Accuracy	Selection Rate	True Positive Rate	False Negative Rate	False Positive Rate	True Negative Rate
Overall	6512	0.8435	0.1539	0.4946	0.5054	0.0448	0.9552
Male	4391	0.8076	0.2143	0.5356	0.4644	0.072	0.928
Female	2121	0.918	0.0288	0.2554	0.7446	0.0011	0.9989


Demographic Parity Difference	0.1855
Demographic Parity Ratio	0.1344
Equalized Odds Difference	0.2802
Equalized Odds Ratio	0.0153

Model: 1_DecisionTree_SampleWeigthing

	Score	Threshold
logloss	0.423913	nan
auc	0.75127	nan
f1	0.616901	0.552201
accuracy	0.832924	0.552201
precision	1	0.987721
recall	1	0.10155
mcc	0.517066	0.552201

	score	threshold
logloss	0.423913	nan
auc	0.75127	nan
f1	0.616901	0.552201
accuracy	0.832924	0.552201
precision	0.694687	0.552201
recall	0.554782	0.552201
mcc	0.517066	0.552201

	Predicted as <=50K	Predicted as >50K
Labeled as <=50K	4,548	385
Labeled as >50K	703	876

	Samples	Accuracy	Selection Rate	True Positive Rate	False Negative Rate	False Positive Rate	True Negative Rate
Overall	6512	0.8329	0.1936	0.5548	0.4452	0.078	0.922
Male	4391	0.7978	0.2314	0.5475	0.4525	0.0914	0.9086
Female	2121	0.9057	0.1155	0.5974	0.4026	0.0566	0.9434


Demographic Parity Difference	0.1159
Demographic Parity Ratio	0.4991
Equalized Odds Difference	0.0499
Equalized Odds Ratio	0.6193

Model: 1_DecisionTree_SampleWeigthing_Update_1

	Score	Threshold
logloss	0.462531	nan
auc	0.731943	nan
f1	0.555902	0.391417
accuracy	0.802058	0.697942
precision	0.984375	0.89267
recall	1	0.111521
mcc	0.397334	0.391417

	score	threshold
logloss	0.462531	nan
auc	0.731943	nan
f1	0.364085	0.697942
accuracy	0.802058	0.697942
precision	0.823661	0.697942
recall	0.233692	0.697942
mcc	0.368589	0.697942

	Predicted as <=50K	Predicted as >50K
Labeled as <=50K	4,854	79
Labeled as >50K	1,210	369

	Samples	Accuracy	Selection Rate	True Positive Rate	False Negative Rate	False Positive Rate	True Negative Rate
Overall	6512	0.8021	0.0688	0.2337	0.7663	0.016	0.984
Male	4391	0.7433	0.0517	0.1662	0.8338	0.001	0.999
Female	2121	0.9236	0.1042	0.6277	0.3723	0.0402	0.9598


Demographic Parity Difference	0.0525
Demographic Parity Ratio	0.4962
Equalized Odds Difference	0.4615
Equalized Odds Ratio	0.0249

Model: 2_Default_LightGBM

	Score	Threshold
logloss	0.27688	nan
auc	0.928871	nan
f1	0.735925	0.410305
accuracy	0.875	0.5331
precision	1	0.998262
recall	1	0.000350082
mcc	0.649588	0.410305

	score	threshold
logloss	0.27688	nan
auc	0.928871	nan
f1	0.718339	0.5331
accuracy	0.875	0.5331
precision	0.791762	0.5331
recall	0.657378	0.5331
mcc	0.643465	0.5331

	Predicted as <=50K	Predicted as >50K
Labeled as <=50K	4,660	273
Labeled as >50K	541	1,038

	Samples	Accuracy	Selection Rate	True Positive Rate	False Negative Rate	False Positive Rate	True Negative Rate
Overall	6512	0.875	0.2013	0.6574	0.3426	0.0553	0.9447
Male	4391	0.8445	0.258	0.6669	0.3331	0.0769	0.9231
Female	2121	0.9382	0.0839	0.6017	0.3983	0.0206	0.9794


Demographic Parity Difference	0.1741
Demographic Parity Ratio	0.3252
Equalized Odds Difference	0.0652
Equalized Odds Ratio	0.2679

Model: 2_Default_LightGBM_SampleWeigthing

	Score	Threshold
logloss	0.285305	nan
auc	0.923918	nan
f1	0.72439	0.354694
accuracy	0.872389	0.464436
precision	1	0.99742
recall	1	0.000503872
mcc	0.639276	0.464436

	score	threshold
logloss	0.285305	nan
auc	0.923918	nan
f1	0.718782	0.464436
accuracy	0.872389	0.464436
precision	0.771802	0.464436
recall	0.672578	0.464436
mcc	0.639276	0.464436

	Predicted as <=50K	Predicted as >50K
Labeled as <=50K	4,619	314
Labeled as >50K	517	1,062

	Samples	Accuracy	Selection Rate	True Positive Rate	False Negative Rate	False Positive Rate	True Negative Rate
Overall	6512	0.8724	0.2113	0.6726	0.3274	0.0637	0.9363
Male	4391	0.844	0.2498	0.6528	0.3472	0.0713	0.9287
Female	2121	0.9312	0.1315	0.7879	0.2121	0.0513	0.9487


Demographic Parity Difference	0.1183
Demographic Parity Ratio	0.5264
Equalized Odds Difference	0.1351
Equalized Odds Ratio	0.7195

Model: 2_Default_LightGBM_SampleWeigthing_Update_1

	Score	Threshold
logloss	0.295114	nan
auc	0.918899	nan
f1	0.71261	0.338787
accuracy	0.865786	0.56145
precision	1	0.997602
recall	1	0.000574183
mcc	0.618117	0.419065

	score	threshold
logloss	0.295114	nan
auc	0.918899	nan
f1	0.683333	0.56145
accuracy	0.865786	0.56145
precision	0.798476	0.56145
recall	0.597213	0.56145
mcc	0.610609	0.56145

	Predicted as <=50K	Predicted as >50K
Labeled as <=50K	4,695	238
Labeled as >50K	636	943

	Samples	Accuracy	Selection Rate	True Positive Rate	False Negative Rate	False Positive Rate	True Negative Rate
Overall	6512	0.8658	0.1814	0.5972	0.4028	0.0482	0.9518
Male	4391	0.8383	0.2013	0.5645	0.4355	0.0404	0.9596
Female	2121	0.9227	0.14	0.7879	0.2121	0.0608	0.9392

The model report shows that 2_Default_LightGBM_SampleWeigthing_Update_2 is the best trained model. Its demographic_parity_ratio is now 0.84, up from 0.29 and above the minimum threshold of 0.80, which indicates that the bias has been mitigated with respect to this metric. The model score is then checked to confirm that predictive performance remains consistent with the previous evaluation.
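The choice of the best model can be retraced from the leaderboard. The following is a minimal sketch, assuming that AutoML keeps only the models whose fairness value reaches the threshold and then picks the one with the lowest logloss; that selection rule is an assumption that matches the leaderboard above, not a documented guarantee. The Ensemble row is left out because it repeats the values of the best single model.

```python
# (name, logloss, demographic_parity_ratio) rows copied from the leaderboard above.
leaderboard = [
    ("1_DecisionTree", 0.361375, 0.1344),
    ("2_Default_LightGBM", 0.27688, 0.3252),
    ("4_Default_RandomTrees", 0.338299, 0.332),
    ("5_Default_ExtraTrees", 0.368012, 0.2844),
    ("4_Default_RandomTrees_SampleWeigthing", 0.35729, 0.3612),
    ("2_Default_LightGBM_SampleWeigthing", 0.285305, 0.5264),
    ("5_Default_ExtraTrees_SampleWeigthing", 0.384304, 0.7682),
    ("1_DecisionTree_SampleWeigthing", 0.423913, 0.4991),
    ("5_Default_ExtraTrees_SampleWeigthing_Update_1", 0.412036, 0.9246),
    ("2_Default_LightGBM_SampleWeigthing_Update_1", 0.295114, 0.6955),
    ("1_DecisionTree_SampleWeigthing_Update_1", 0.462531, 0.4962),
    ("4_Default_RandomTrees_SampleWeigthing_Update_1", 0.377543, 0.7167),
    ("4_Default_RandomTrees_SampleWeigthing_Update_2", 0.404245, 0.8917),
    ("2_Default_LightGBM_SampleWeigthing_Update_2", 0.307829, 0.8406),
]

FAIRNESS_THRESHOLD = 0.80  # minimum ratio stated in the text

# Step 1 (assumed rule): keep only models that reach the fairness threshold.
fair_models = [row for row in leaderboard if row[2] >= FAIRNESS_THRESHOLD]

# Step 2 (assumed rule): among the fair models, prefer the lowest logloss.
best_name, best_logloss, best_dpr = min(fair_models, key=lambda row: row[1])

# For contrast: the lowest logloss overall, ignoring fairness.
lowest_name, lowest_logloss, lowest_dpr = min(leaderboard, key=lambda row: row[1])

print("fair models:", [name for name, _, _ in fair_models])
print("best fair model:", best_name, best_logloss, best_dpr)
print("lowest logloss overall:", lowest_name, lowest_logloss, lowest_dpr)
```

The model with the lowest logloss overall (2_Default_LightGBM) does not reach the threshold, so the best fair model trades a small amount of logloss for a much higher demographic parity ratio.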

DPR mitigation Analysis

Model Performance Metrics Before and After Mitigation for female:

	Accuracy (Female)	False Positive Rate (Female)	False Negative Rate (Female)	Selection Rate (Female)	Count (Female)
Before Mitigation	0.933798	0.013245	0.443396	0.080139	0.330518
After Mitigation	0.8949	0.0979	0.1645	0.1782	0.2121

Model Performance Metrics Before and After Mitigation for male:

	Accuracy (Male)	False Positive Rate (Male)	False Negative Rate (Male)	Selection Rate (Male)	Count (Male)
Before Mitigation	0.858372	0.074387	0.283422	0.280963	0.669482
After Mitigation	0.8394	0.0473	0.4162	0.212	0.4391

Selection Rate:

Selection Rate can be defined as the proportion of samples from a specific sensitive group that were selected or predicted as positive by the model. For example, for the male group, a selection rate value of 0.2809 indicates that approximately 28.09 percent of male samples were predicted as positive outcomes by the model.

Before mitigation, the selection rate for females (0.0801) was significantly lower than for males (0.2809).

After mitigation, the selection rates have become more balanced, with males at 0.2120 and females at 0.1782. This indicates an improvement in demographic parity, ensuring more equitable selection between genders.
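The link between these selection rates and the demographic parity ratio in the leaderboard can be checked by hand. The following is a minimal sketch, assuming the ratio is the smallest group selection rate divided by the largest; the helper name `demographic_parity_ratio` is illustrative and is not part of arcgis.learn.

```python
def demographic_parity_ratio(selection_rates):
    """Smallest group selection rate divided by the largest one."""
    rates = list(selection_rates.values())
    return min(rates) / max(rates)


# Selection rates copied from the before/after tables above.
before = {"Female": 0.080139, "Male": 0.280963}
after = {"Female": 0.1782, "Male": 0.2120}

dpr_before = demographic_parity_ratio(before)
dpr_after = demographic_parity_ratio(after)

print(f"DPR before mitigation: {dpr_before:.4f}")
print(f"DPR after mitigation:  {dpr_after:.4f}")
```

The results are roughly 0.29 before and 0.84 after mitigation, which agrees with the values quoted for the best model.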

False Negative Rate:

Before mitigation, females had a much higher rate of being incorrectly classified as earning less than 50k (false negatives) at 0.4433, compared to males at 0.2834.

After mitigation, the rate of females being incorrectly classified as earning less than 50k (false negatives) significantly decreased to 0.1645, indicating an improvement in correctly identifying females earning above 50k.

However, the rate of males being incorrectly classified as earning less than 50k (false negatives) increased from 0.2834 to 0.4162. This indicates that while the mitigation process improved the false negative rate for females, it had an adverse effect on the false negative rate for males.
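As a reminder of how these rates are derived, the sketch below recomputes them from a confusion matrix. It uses the matrix reported for 1_DecisionTree in the model report above, with `>50K` treated as the positive class; the helper `binary_rates` is illustrative and not part of any library.

```python
def binary_rates(tn, fp, fn, tp):
    """Derive the per-group style rates used in the report from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "selection_rate": (tp + fp) / total,   # share predicted as positive
        "true_positive_rate": tp / (tp + fn),
        "false_negative_rate": fn / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "true_negative_rate": tn / (fp + tn),
    }


# Confusion matrix of 1_DecisionTree from the report:
# labeled <=50K: 4,712 predicted <=50K, 221 predicted >50K
# labeled >50K:    798 predicted <=50K, 781 predicted >50K
rates = binary_rates(tn=4712, fp=221, fn=798, tp=781)

for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```

The output reproduces the Overall row of the corresponding per-group table, for example a false negative rate of about 0.5054.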

Overall Accuracy:

Accuracy decreased for both groups after mitigation, from 0.8584 to 0.8394 for males and from 0.9338 to 0.8949 for females, giving an overall accuracy of 0.8575. The change is modest, which indicates that overall predictive performance was largely maintained.
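The overall figure of 0.8575 is consistent with a sample-weighted average of the per-group accuracies. The sketch below assumes the group sizes from the model report (4391 male and 2121 female samples) also apply to the post-mitigation evaluation.

```python
# (sample count, post-mitigation accuracy) per group.
# Counts are taken from the model report; accuracies from the tables above.
groups = {
    "Male": (4391, 0.8394),
    "Female": (2121, 0.8949),
}

total_samples = sum(count for count, _ in groups.values())

# Weight each group's accuracy by its share of the samples.
overall_accuracy = sum(count * acc for count, acc in groups.values()) / total_samples

print(f"samples: {total_samples}, overall accuracy: {overall_accuracy:.4f}")
```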

The mitigation strategy improved demographic parity by balancing the selection rates between males and females, but this came at the cost of increasing the false negative rate for males. This trade-off suggests that while aiming for fairness in selection rates, other metrics such as the false negative rate can be adversely affected.

The more balanced selection rates show progress towards demographic parity. However, the higher false negative rate for males is a concern, as more males earning above 50k are now incorrectly classified as earning less. Balancing fairness against performance metrics such as the false negative rate is crucial, and further adjustments or different mitigation techniques may be necessary to achieve a more equitable outcome without compromising accuracy.

automl_mitigation_dpr_obj.score()

elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison

0.7439539347408829

Finally, the mitigated model is used to make predictions on unseen data:

result_df = automl_mitigation_dpr_obj.predict(test, prediction_type="dataframe")
result_df.head(5)

	Age	Workclass	Fnlwgt	Education	Education-num	Marital-status	Occupation	Relationship	Race	Gender	Hours-per-week	Native-country	Salary	annual_salary_$	prediction_results	prediction_confidence
24507	57	Private	89182	HS-grad	9	Widowed	Adm-clerical	Not-in-family	White	Female	40	United-States	<=50K	90732	<=50K	0.834847
28351	33	Private	159548	Some-college	10	Divorced	Adm-clerical	Unmarried	Black	Female	38	United-States	<=50K	87710	<=50K	0.974263
717	19	State-gov	378418	HS-grad	9	Never-married	Tech-support	Own-child	White	Female	40	United-States	<=50K	64787	<=50K	0.998881
19417	44	Private	151985	Masters	14	Married-civ-spouse	Exec-managerial	Wife	White	Female	24	United-States	>50K	83582	>50K	0.939446
16746	23	Private	406641	Some-college	10	Never-married	Handlers-cleaners	Other-relative	White	Female	18	United-States	<=50K	86347	<=50K	0.998467

In the predicted dataframe, the prediction_results column contains the model's predictions. To validate these predictions, they are compared with the actual values in the Salary column. The resulting accuracy is high, and, more importantly, the predictions now come from a model that meets the demographic parity ratio threshold.

# Compare the true labels with the predictions of the mitigated model
accuracy = accuracy_score(result_df["Salary"], result_df["prediction_results"])
print(accuracy)

0.8628896054045755
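Accuracy alone does not show whether the predictions are balanced across genders. The following is a minimal sketch of a per-group check on a prediction table shaped like result_df. The rows are a small hypothetical sample and are not taken from the dataset, and the helper `selection_rate_by_group` is illustrative rather than part of arcgis.learn.

```python
from collections import defaultdict


def selection_rate_by_group(rows, group_key, pred_key, positive_label):
    """Share of rows predicted as positive_label within each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for row in rows:
        counts[row[group_key]][1] += 1
        # Labels may carry stray whitespace, so compare stripped values.
        if row[pred_key].strip() == positive_label:
            counts[row[group_key]][0] += 1
    return {group: pos / total for group, (pos, total) in counts.items()}


# Hypothetical rows for illustration only (NOT real predictions).
rows = [
    {"Gender": "Female", "prediction_results": "<=50K"},
    {"Gender": "Female", "prediction_results": ">50K"},
    {"Gender": "Female", "prediction_results": "<=50K"},
    {"Gender": "Female", "prediction_results": "<=50K"},
    {"Gender": "Male", "prediction_results": ">50K"},
    {"Gender": "Male", "prediction_results": "<=50K"},
    {"Gender": "Male", "prediction_results": ">50K"},
    {"Gender": "Male", "prediction_results": "<=50K"},
]

rates = selection_rate_by_group(rows, "Gender", "prediction_results", ">50K")
parity_ratio = min(rates.values()) / max(rates.values())

print(rates)
print("demographic parity ratio on this sample:", parity_ratio)
```

On the real prediction dataframe, the same quantities could be obtained by converting the rows to records or with a pandas group-by on the Gender column.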

Mitigation using Equalized Odds Ratio

To address some of the shortcomings of Demographic Parity Ratio (DPR), let's mitigate the model using Equalized Odds Ratio (EOR). EOR aims to balance fairness and performance metrics by considering both false positive and false negative outcomes.

The aim of the Equalized Odds fairness metric is to ensure that a machine learning model performs equally well across different demographic groups. Whereas demographic parity only requires that the rate of positive predictions is the same for the female and male groups, Equalized Odds requires that both the true positive rate and the false positive rate are equal across groups. This distinction is significant because a model may achieve demographic parity and still produce more false positive or false negative predictions for one group than for the other. Equalized Odds addresses this concern by requiring fairness in both error rates across all groups. Unlike demographic parity, Equalized Odds does not introduce the selection issue discussed earlier. In the present scenario, where salary is predicted for both genders, it is important that the model predicts the appropriate salary class equally well for both groups.
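The point that a model can satisfy demographic parity and still differ in error rates can be illustrated with the post-mitigation numbers from the DPR mitigation analysis. The sketch assumes the equalized odds ratio is the smaller of the between-group true positive rate ratio and false positive rate ratio, which is consistent with the values shown in the model report; the helper `group_ratio` is illustrative.

```python
def group_ratio(a, b):
    """Ratio of the smaller to the larger of two group rates (1.0 means equal)."""
    return min(a, b) / max(a, b)


# Rates of the DPR-mitigated best model, copied from the DPR mitigation analysis tables.
female = {"selection_rate": 0.1782, "fpr": 0.0979, "fnr": 0.1645}
male = {"selection_rate": 0.2120, "fpr": 0.0473, "fnr": 0.4162}

# Demographic parity only looks at selection rates.
dpr = group_ratio(female["selection_rate"], male["selection_rate"])

# Equalized odds looks at true positive rate (1 - FNR) and false positive rate.
tpr_ratio = group_ratio(1 - female["fnr"], 1 - male["fnr"])
fpr_ratio = group_ratio(female["fpr"], male["fpr"])
eor = min(tpr_ratio, fpr_ratio)

print(f"demographic parity ratio: {dpr:.4f}")
print(f"TPR ratio: {tpr_ratio:.4f}, FPR ratio: {fpr_ratio:.4f}")
print(f"equalized odds ratio:     {eor:.4f}")
```

Under these assumptions the model that passes the 0.80 threshold on demographic parity reaches an equalized odds ratio of only about 0.48, which motivates mitigating on EOR directly.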

automl_mitigation_eqr_obj = AutoML(
    data,
    sensitive_variables=["Gender"],
    fairness_metric="equalized_odds_ratio",
)

automl_mitigation_eqr_obj.fit()

Neural Network algorithm was disabled because it doesn't support n_jobs parameter.
Linear algorithm was disabled.
AutoML directory: C:\Users\sup10432\AppData\Local\Temp\scratch\tmppbdan7vj
The task is binary_classification with evaluation metric logloss
AutoML will use algorithms: ['Decision Tree', 'Random Trees', 'Extra Trees', 'LightGBM', 'Xgboost']
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'unfairness_mitigation', 'ensemble']
* Step simple_algorithms will try to check up to 1 model
DecisionTreeAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
1_DecisionTree logloss 0.361375 trained in 9.32 seconds
* Step default_algorithms will try to check up to 4 models
LightgbmAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
Exception while producing SHAP explanations. pandas dtypes must be int, float or bool.
Fields with bad pandas dtypes: Workclass: object, Education: object, Marital-status: object, Occupation: object, Relationship: object, Race: object, Gender: object, Native-country: object
Continuing ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
2_Default_LightGBM logloss 0.27688 trained in 6.03 seconds
There was an error during 3_Default_Xgboost training.
Please check C:\Users\sup10432\AppData\Local\Temp\scratch\tmppbdan7vj\errors.md for details.
RandomForestAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
4_Default_RandomTrees logloss 0.338299 trained in 10.02 seconds
ExtraTreesAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
5_Default_ExtraTrees logloss 0.368012 trained in 9.05 seconds
* Step unfairness_mitigation will try to check up to 4 models
LightgbmAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
Exception while producing SHAP explanations. pandas dtypes must be int, float or bool.
Fields with bad pandas dtypes: Workclass: object, Education: object, Marital-status: object, Occupation: object, Relationship: object, Race: object, Gender: object, Native-country: object
Continuing ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
2_Default_LightGBM_SampleWeigthing logloss 0.285305 trained in 5.86 seconds
RandomForestAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
4_Default_RandomTrees_SampleWeigthing logloss 0.35729 trained in 9.89 seconds
ExtraTreesAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
5_Default_ExtraTrees_SampleWeigthing logloss 0.384304 trained in 9.6 seconds
DecisionTreeAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
1_DecisionTree_SampleWeigthing logloss 0.423913 trained in 7.02 seconds
* Step unfairness_mitigation_update_1 will try to check up to 4 models
LightgbmAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
Exception while producing SHAP explanations. pandas dtypes must be int, float or bool.
Fields with bad pandas dtypes: Workclass: object, Education: object, Marital-status: object, Occupation: object, Relationship: object, Race: object, Gender: object, Native-country: object
Continuing ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
2_Default_LightGBM_SampleWeigthing_Update_1 logloss 0.295114 trained in 6.54 seconds
ExtraTreesAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
5_Default_ExtraTrees_SampleWeigthing_Update_1 logloss 0.412036 trained in 14.37 seconds
DecisionTreeAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
1_DecisionTree_SampleWeigthing_Update_1 logloss 0.462531 trained in 6.86 seconds
RandomForestAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
4_Default_RandomTrees_SampleWeigthing_Update_1 logloss 0.377543 trained in 13.15 seconds
* Step unfairness_mitigation_update_2 will try to check up to 1 model
RandomForestAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
4_Default_RandomTrees_SampleWeigthing_Update_2 logloss 0.404245 trained in 9.84 seconds
* Step ensemble will try to check up to 1 model
Ensemble not trained. Can't contruct fair ensemble.
AutoML fit time: 137.57 seconds
AutoML best model: 2_Default_LightGBM_SampleWeigthing
AutoML can't construct model that meets your fairness criteria.
What you can do?
1. Please include more samples that are not biased.
2. Please examine the most unfairly treated samples.
3. Please change fairness threshold.
All the evaluated models are saved in the path  C:\Users\sup10432\AppData\Local\Temp\scratch\tmppbdan7vj

Once training is complete, the model report can be reviewed to check whether the bias was mitigated, by examining the equalized odds ratio of the best-trained model.

automl_mitigation_eqr_obj.report()

In case the report html is not rendered appropriately in the notebook, the same can be found in the path C:\Users\sup10432\AppData\Local\Temp\scratch\tmppbdan7vj\README.html

AutoML Leaderboard

Best model	name	model_type	metric_type	metric_value	train_time	fairness_metric	fairness_Gender	is_fair
	1_DecisionTree	Decision Tree	logloss	0.361375	10.13	equalized_odds_ratio	0.0153	False
	2_Default_LightGBM	LightGBM	logloss	0.27688	6.85	equalized_odds_ratio	0.2679	False
	4_Default_RandomTrees	Random Trees	logloss	0.338299	10.94	equalized_odds_ratio	0.2314	False
	5_Default_ExtraTrees	Extra Trees	logloss	0.368012	9.97	equalized_odds_ratio	0.1706	False
the best	2_Default_LightGBM_SampleWeigthing	LightGBM	logloss	0.285305	6.56	equalized_odds_ratio	0.7195	False
	4_Default_RandomTrees_SampleWeigthing	Random Trees	logloss	0.35729	10.76	equalized_odds_ratio	0.3123	False
	5_Default_ExtraTrees_SampleWeigthing	Extra Trees	logloss	0.384304	10.51	equalized_odds_ratio	0.6825	False
	1_DecisionTree_SampleWeigthing	Decision Tree	logloss	0.423913	7.89	equalized_odds_ratio	0.6193	False
	2_Default_LightGBM_SampleWeigthing_Update_1	LightGBM	logloss	0.295114	7.33	equalized_odds_ratio	0.6645	False
	5_Default_ExtraTrees_SampleWeigthing_Update_1	Extra Trees	logloss	0.412036	15.37	equalized_odds_ratio	0.5769	False
	1_DecisionTree_SampleWeigthing_Update_1	Decision Tree	logloss	0.462531	7.68	equalized_odds_ratio	0.0249	False
	4_Default_RandomTrees_SampleWeigthing_Update_1	Random Trees	logloss	0.377543	14.03	equalized_odds_ratio	0.6816	False
	4_Default_RandomTrees_SampleWeigthing_Update_2	Random Trees	logloss	0.404245	10.65	equalized_odds_ratio	0.463	False

[Figures: AutoML Performance; AutoML Performance Boxplot; Performance vs fairness_Gender; Spearman Correlation of Models]

The model report shows that 2_Default_LightGBM_SampleWeigthing is the best model. However, its EOR of 0.72 remains below the threshold of 0.8, so AutoML was not able to construct a fair model despite the significant improvement from 0.18. Because this value is reasonably close to the threshold, the fairness_threshold parameter can be used to lower the EOR threshold to 0.70 so that the model is formally considered fair.

Reducing the threshold for a successful mitigation

Since AutoML could not find a fair model with an EOR threshold of 0.8, the marked improvement in EOR from 0.18 to 0.72 can be formalized by lowering the threshold to 0.70 through the fairness_threshold parameter and retraining the model.
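Before retraining, the effect of a lower threshold can be previewed with the equalized odds ratios from the leaderboard of the previous run. The sketch assumes a model is flagged as fair when its fairness value is greater than or equal to the threshold; the helper `fair_models` is illustrative and not part of arcgis.learn.

```python
# equalized_odds_ratio per model, copied from the leaderboard of the previous run.
eor_by_model = {
    "1_DecisionTree": 0.0153,
    "2_Default_LightGBM": 0.2679,
    "4_Default_RandomTrees": 0.2314,
    "5_Default_ExtraTrees": 0.1706,
    "2_Default_LightGBM_SampleWeigthing": 0.7195,
    "4_Default_RandomTrees_SampleWeigthing": 0.3123,
    "5_Default_ExtraTrees_SampleWeigthing": 0.6825,
    "1_DecisionTree_SampleWeigthing": 0.6193,
    "2_Default_LightGBM_SampleWeigthing_Update_1": 0.6645,
    "5_Default_ExtraTrees_SampleWeigthing_Update_1": 0.5769,
    "1_DecisionTree_SampleWeigthing_Update_1": 0.0249,
    "4_Default_RandomTrees_SampleWeigthing_Update_1": 0.6816,
    "4_Default_RandomTrees_SampleWeigthing_Update_2": 0.463,
}


def fair_models(scores, threshold):
    """Names of the models whose fairness value reaches the threshold (assumed rule)."""
    return sorted(name for name, value in scores.items() if value >= threshold)


print("threshold 0.80:", fair_models(eor_by_model, 0.80))
print("threshold 0.70:", fair_models(eor_by_model, 0.70))
```

No model qualifies at 0.80, while at 0.70 only 2_Default_LightGBM_SampleWeigthing does; the next-best candidates sit just below 0.70.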

automl_mitigation_eqr_obj = AutoML(
    data,
    sensitive_variables=["Gender"],
    fairness_metric="equalized_odds_ratio",
    fairness_threshold=0.70,
)

automl_mitigation_eqr_obj.fit()

Neural Network algorithm was disabled because it doesn't support n_jobs parameter.
Linear algorithm was disabled.
AutoML directory: C:\Users\sup10432\AppData\Local\Temp\scratch\tmpejag8d56
The task is binary_classification with evaluation metric logloss
AutoML will use algorithms: ['Decision Tree', 'Random Trees', 'Extra Trees', 'LightGBM', 'Xgboost']
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'unfairness_mitigation', 'ensemble']
* Step simple_algorithms will try to check up to 1 model
DecisionTreeAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
1_DecisionTree logloss 0.361375 trained in 7.8 seconds
* Step default_algorithms will try to check up to 4 models
LightgbmAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
Exception while producing SHAP explanations. pandas dtypes must be int, float or bool.
Fields with bad pandas dtypes: Workclass: object, Education: object, Marital-status: object, Occupation: object, Relationship: object, Race: object, Gender: object, Native-country: object
Continuing ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
2_Default_LightGBM logloss 0.27688 trained in 6.33 seconds
There was an error during 3_Default_Xgboost training.
Please check C:\Users\sup10432\AppData\Local\Temp\scratch\tmpejag8d56\errors.md for details.
RandomForestAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
4_Default_RandomTrees logloss 0.338299 trained in 10.26 seconds
ExtraTreesAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
5_Default_ExtraTrees logloss 0.368012 trained in 9.09 seconds
* Step unfairness_mitigation will try to check up to 4 models
LightgbmAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
Exception while producing SHAP explanations. pandas dtypes must be int, float or bool.
Fields with bad pandas dtypes: Workclass: object, Education: object, Marital-status: object, Occupation: object, Relationship: object, Race: object, Gender: object, Native-country: object
Continuing ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
2_Default_LightGBM_SampleWeigthing logloss 0.285305 trained in 5.99 seconds
RandomForestAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
4_Default_RandomTrees_SampleWeigthing logloss 0.35729 trained in 9.24 seconds
ExtraTreesAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
5_Default_ExtraTrees_SampleWeigthing logloss 0.384304 trained in 9.19 seconds
DecisionTreeAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
1_DecisionTree_SampleWeigthing logloss 0.423913 trained in 7.03 seconds
* Step unfairness_mitigation_update_1 will try to check up to 3 models
ExtraTreesAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
5_Default_ExtraTrees_SampleWeigthing_Update_1 logloss 0.412036 trained in 13.99 seconds
DecisionTreeAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
1_DecisionTree_SampleWeigthing_Update_1 logloss 0.462531 trained in 6.84 seconds
RandomForestAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
4_Default_RandomTrees_SampleWeigthing_Update_1 logloss 0.377543 trained in 13.35 seconds
* Step unfairness_mitigation_update_2 will try to check up to 1 model
RandomForestAlgorithm should either be a classifier to be used with response_method=predict_proba or the response_method should be 'predict'. Got a regressor with response_method=predict_proba instead.
Problem during computing permutation importance. Skipping ...
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
4_Default_RandomTrees_SampleWeigthing_Update_2 logloss 0.404245 trained in 9.61 seconds
* Step ensemble will try to check up to 1 model
y_true takes value in {' <=50K', ' >50K'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.
Ensemble logloss 0.285305 trained in 3.09 seconds
AutoML fit time: 131.81 seconds
AutoML best model: 2_Default_LightGBM_SampleWeigthing
All the evaluated models are saved in the path  C:\Users\sup10432\AppData\Local\Temp\scratch\tmpejag8d56

automl_mitigation_eqr_obj.report()

In case the report html is not rendered appropriately in the notebook, the same can be found in the path C:\Users\sup10432\AppData\Local\Temp\scratch\tmpejag8d56\README.html

AutoML Leaderboard

Best model	name	model_type	metric_type	metric_value	train_time	fairness_metric	fairness_Gender	is_fair
	1_DecisionTree	Decision Tree	logloss	0.361375	8.58	equalized_odds_ratio	0.0153	False
	2_Default_LightGBM	LightGBM	logloss	0.27688	7.21	equalized_odds_ratio	0.2679	False
	4_Default_RandomTrees	Random Trees	logloss	0.338299	11.17	equalized_odds_ratio	0.2314	False
	5_Default_ExtraTrees	Extra Trees	logloss	0.368012	9.95	equalized_odds_ratio	0.1706	False
the best	2_Default_LightGBM_SampleWeigthing	LightGBM	logloss	0.285305	6.76	equalized_odds_ratio	0.7195	True
	4_Default_RandomTrees_SampleWeigthing	Random Trees	logloss	0.35729	10.11	equalized_odds_ratio	0.3123	False
	5_Default_ExtraTrees_SampleWeigthing	Extra Trees	logloss	0.384304	10.02	equalized_odds_ratio	0.6825	False
	1_DecisionTree_SampleWeigthing	Decision Tree	logloss	0.423913	7.95	equalized_odds_ratio	0.6193	False
	5_Default_ExtraTrees_SampleWeigthing_Update_1	Extra Trees	logloss	0.412036	14.95	equalized_odds_ratio	0.5769	False
	1_DecisionTree_SampleWeigthing_Update_1	Decision Tree	logloss	0.462531	7.73	equalized_odds_ratio	0.0249	False
	4_Default_RandomTrees_SampleWeigthing_Update_1	Random Trees	logloss	0.377543	14.25	equalized_odds_ratio	0.6816	False
	4_Default_RandomTrees_SampleWeigthing_Update_2	Random Trees	logloss	0.404245	10.42	equalized_odds_ratio	0.463	False
	Ensemble	Ensemble	logloss	0.285305	3.09	equalized_odds_ratio	0.7195	True

AutoML Performance

AutoML Performance Boxplot

Performance vs fairness_Gender

Spearman Correlation of Models

EOR mitigation Analysis

Model Performance Metrics Before and After Equalized Odds Ratio Mitigation for female:

	Accuracy	False Positive Rate (FPR)	False Negative Rate (FNR)	Selection Rate
Before Mitigation	0.9337	0.0132	0.4433	0.0801
After Mitigation	0.9312	0.0513	0.2121	0.1315

Model Performance Metrics Before and After Equalized Odds Ratio Mitigation for male:

	Accuracy	False Positive Rate (FPR)	False Negative Rate (FNR)	Selection Rate
Before Mitigation	0.8584	0.0744	0.2834	0.2810
After Mitigation	0.8440	0.0713	0.3472	0.2498
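The four columns in these tables can be derived from true labels, predicted labels, and the sensitive attribute. The following is a minimal sketch of that calculation in plain Python, not the code used by the notebook; the function name `group_rates` and the toy labels are hypothetical and chosen only for illustration. In the notebook the positive class would be `' >50K'` rather than `1`.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups, positive=1):
    """Return accuracy, FPR, FNR and selection rate for each group."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        if pred == positive:
            key = "tp" if truth == positive else "fp"
        else:
            key = "fn" if truth == positive else "tn"
        counts[group][key] += 1

    rates = {}
    for group, c in counts.items():
        total = sum(c.values())
        negatives = c["fp"] + c["tn"]  # actual negatives in this group
        positives = c["tp"] + c["fn"]  # actual positives in this group
        rates[group] = {
            "accuracy": (c["tp"] + c["tn"]) / total,
            "fpr": c["fp"] / negatives if negatives else float("nan"),
            "fnr": c["fn"] / positives if positives else float("nan"),
            # Share of the group that receives the positive prediction.
            "selection_rate": (c["tp"] + c["fp"]) / total,
        }
    return rates


# Toy data (made up for illustration, not taken from the salary dataset).
y_true = [1, 0, 0, 1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]
groups = ["Female"] * 4 + ["Male"] * 4

rates = group_rates(y_true, y_pred, groups)
for group, values in rates.items():
    print(group, values)
```

Computing the rates per group, rather than over the whole test set, is what makes the female and male rows directly comparable.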

The model report now shows that the best model is fair. However, the comparison tables above show that the mitigation produced mixed results:

Improvements: Female FNR dropped from 0.4433 to 0.2121, so far fewer high-earning females are misclassified as low earners. Female selection rate (SR) rose from 0.0801 to 0.1315, bringing it closer to the male rate and giving females a fairer share of the positive predictions. Male FPR decreased slightly, from 0.0744 to 0.0713.

Drawbacks: The mitigation balanced some metrics across genders but shifted errors elsewhere. Female FPR rose from 0.0132 to 0.0513, male FNR rose from 0.2834 to 0.3472, and accuracy dropped slightly for both groups (0.9337 to 0.9312 for females, 0.8584 to 0.8440 for males). Further fine-tuning of the mitigation technique, or the addition of more data, might be necessary to achieve a more balanced and fair outcome across all metrics.

Conclusion

In this study, we explored the application of fairness metrics in machine learning, particularly focusing on the limitations and benefits of Demographic Parity Ratio (DPR) and Equalized Odds Ratio (EOR) for fairness assessment.

First, we performed an initial fairness assessment of the salary prediction model using the demographic dataset and a vanilla AutoML workflow. The initial model showed discrepancies in fairness metrics, particularly higher false positive rates for certain groups, as revealed by the Demographic Parity Ratio (DPR) and the Equalized Odds Ratio (EOR).

Subsequently, fairness mitigation was performed first with DPR and then with EOR. While DPR addressed some aspects of fairness, it fell short in balancing false positive and false negative rates across groups, leading to suboptimal fairness. Mitigating with the Equalized Odds Ratio metric then provided a more comprehensive fairness assessment by targeting equal false positive and true positive rates across all groups, thereby addressing the limitations observed with DPR.
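To make the difference between the two metrics concrete, the Demographic Parity Ratio can be computed from the selection rates in the EOR mitigation tables above. This is a minimal sketch assuming the usual definition (smallest group selection rate divided by the largest); the function name is hypothetical.

```python
def demographic_parity_ratio(selection_rates):
    """Smallest group selection rate divided by the largest one."""
    rates = list(selection_rates.values())
    return min(rates) / max(rates)


# Selection rates copied from the EOR mitigation tables above.
before = {"Female": 0.0801, "Male": 0.2810}
after = {"Female": 0.1315, "Male": 0.2498}

dpr_before = demographic_parity_ratio(before)
dpr_after = demographic_parity_ratio(after)
print(f"DPR before: {dpr_before:.3f}, after: {dpr_after:.3f}")
```

DPR compares only how often each group receives the positive prediction and ignores whether those predictions are correct, whereas EOR compares error rates. With these numbers DPR roughly doubles but stays well below 1, so a model can move toward equalized odds while selection rates remain unequal.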

Finally, lowering the fairness threshold allowed AutoML to construct a model that qualifies as fair. This also matters for obtaining an Ensemble model: if AutoML is not able to construct a fair model, no model ensemble is created.

Although some bias may still be present in the model, the mitigation workflow reduced it significantly. Continuous evaluation and refinement of the fairness workflow is therefore crucial for achieving more equitable machine learning models and unbiased decision-making processes.

Data resources

Dataset	Citation	Link
Census Income dataset	Extraction was done by Barry Becker from the 1994 Census database	https://archive.ics.uci.edu/dataset/20/census+income

                                                  ------End-----
