So what do you do if you don't have data labeled for the slices you want to investigate? We know that many datasets don't have ground truth labels for individual identity attributes.
If you find yourself in this position, some approaches are:
1. Identify whether there are attributes you do have that may give you some insight into performance across groups.
For example, geography, while not equivalent to ethnicity and race, may help you uncover disparate patterns in performance.
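As a minimal sketch, slicing an existing accuracy metric by a proxy attribute such as geography might look like the following (the column names `region`, `label`, and `pred`, and the data itself, are hypothetical):

```python
import pandas as pd

# Hypothetical evaluation results: true labels, model predictions,
# and a proxy attribute (region) for each example.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south", "east"],
    "label":  [1, 0, 1, 1, 0, 1],
    "pred":   [1, 0, 0, 1, 1, 1],
})

# Accuracy per region: large gaps between slices may hint at
# disparate performance, even without true identity labels.
per_region = (df["label"] == df["pred"]).groupby(df["region"]).mean()
overall = (df["label"] == df["pred"]).mean()
print(per_region)
print("overall:", overall)
```

Here a slice like `south` scoring well below the overall accuracy would be a prompt to investigate further, not proof of unfairness on its own.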
2. Identify whether there are representative public datasets that might map well to your problem.
You can find a range of diverse and inclusive datasets on the Google AI site.
Then leverage rules or classifiers to label your data with objective, surface-level attributes.
For example, you can label text according to whether or not there is an identity term in the sentence.
Keep in mind that classifiers have their own challenges and, if you're not careful, may introduce another layer of bias as well.
Be clear about what your classifier is actually classifying, and ensure high accuracy for classifiers labeling such attributes.
Always be aware that a classifier can easily pick up on proxies or stereotypes.
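A minimal rule-based labeler for the identity-term example above might look like this sketch (the term list is an illustrative assumption, not a vetted lexicon):

```python
import re

# Illustrative, incomplete term list -- a real list needs careful curation,
# since naive matching can itself encode proxies or stereotypes.
IDENTITY_TERMS = ["muslim", "jewish", "gay", "lesbian", "transgender"]

_pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, IDENTITY_TERMS)) + r")\b",
    re.IGNORECASE,
)

def has_identity_term(sentence: str) -> bool:
    """Label a sentence by whether it contains an identity term."""
    return bool(_pattern.search(sentence))

print(has_identity_term("She is a transgender activist."))  # True
print(has_identity_term("The weather is nice today."))      # False
```

Note what this rule is actually classifying: the presence of a term, not the identity of the author or subject. That distinction matters when you interpret slice metrics built on top of it.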
3. Find more representative data that is labeled.
Always make sure to evaluate on multiple, diverse datasets.
If your evaluation data is not adequately representative of your user base or of the types of data likely to be encountered, you may end up with deceptively good fairness metrics.
Be aware that high model performance on one dataset doesn't guarantee high performance on others.
Also keep in mind that subgroups aren't always the best way to classify individuals.
Humans are multidimensional and belong to more than one group, even within a single dimension.
Many subgroups have fuzzy boundaries which are constantly being redrawn.
Because there is a vast number of groups or slices that may be relevant to test, it is highly recommended that you slice and evaluate a diverse and wide range of slices, and then dive deeper where you spot opportunities for improvement.
It is also important to acknowledge that even if you see no concerns on the slices you have tested, that doesn't imply your product works for all users. Getting diverse user feedback is important to ensure that you continually identify new opportunities.
You can start by thinking through your particular use case and the different ways users may engage with your product or idea.
If you slice all of your existing performance metrics, you're off to a good start.
Then always evaluate your metrics across multiple thresholds in order to understand how the threshold can affect the performance for different groups.
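That threshold sweep can be sketched as follows, assuming you have arrays of ground-truth labels, model scores, and group ids (all names and data here are hypothetical):

```python
import numpy as np

def fpr_by_group(labels, scores, groups, threshold):
    """False positive rate per group at a given decision threshold."""
    labels, scores, groups = map(np.asarray, (labels, scores, groups))
    preds = scores >= threshold
    rates = {}
    for g in np.unique(groups):
        neg = (groups == g) & (labels == 0)  # actual negatives in group g
        rates[g] = preds[neg].mean() if neg.any() else float("nan")
    return rates

labels = [0, 0, 1, 0, 0, 1]
scores = [0.2, 0.7, 0.9, 0.4, 0.6, 0.8]
groups = ["a", "a", "a", "b", "b", "b"]

# The same model can look fair at one threshold and skewed at another.
for t in (0.3, 0.5, 0.7):
    print(t, fpr_by_group(labels, scores, groups, t))
```

In this toy data the two groups have equal false positive rates at 0.5 but diverge at 0.3 and 0.7, which is exactly why a single-threshold evaluation can hide a skew.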
4. Consider the critical fairness metrics for classification.
When thinking about a classification model, think about the effects of errors, that is, the differences between the actual "ground truth" label and the label from the model.
If some errors pose more opportunity or harm to your users, make sure you evaluate the rates of these errors across groups of users.
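For instance, if false negatives are the costlier error in your use case, you could compare false negative rates across groups; a sketch with hypothetical data:

```python
from collections import defaultdict

# Hypothetical (group, ground_truth, predicted) triples.
examples = [
    ("a", 1, 1), ("a", 1, 0), ("a", 0, 0),
    ("b", 1, 0), ("b", 1, 0), ("b", 1, 1), ("b", 0, 0),
]

pos = defaultdict(int)  # actual positives per group
fn = defaultdict(int)   # missed positives (false negatives) per group

for group, truth, pred in examples:
    if truth == 1:
        pos[group] += 1
        if pred == 0:
            fn[group] += 1

# False negative rate per group: a large gap can signal that the
# costlier error falls disproportionately on one group of users.
fnr = {g: fn[g] / pos[g] for g in pos}
print(fnr)
```

The same pattern applies to false positive rates when over-flagging is the harmful error; the point is to pick the error rate that maps to real harm and slice it.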
Several such metrics are available today in the Fairness Indicators beta.
You can access instructions to add your own metrics to Fairness Indicators, and you can also reach out to tfx@tensorflow.org if there are metrics you would like to see.
A gap in the metric between two groups can be a sign that your model may have unfair skews.
You should interpret your results according to your use case.
The first sign that you may be treating one set of users unfairly is when the metrics for that set of users differ significantly from your overall metrics. Make sure to account for confidence intervals when looking at these differences: when you have too few samples in a particular slice, the difference between metrics may not be meaningful.
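To gauge whether a gap on a small slice is meaningful, you can attach a confidence interval to each slice's rate. A sketch using the Wilson score interval, a standard choice for binomial proportions:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)

# A 70% accuracy estimate from 10 examples is far less certain
# than the same estimate from 1000 examples.
print(wilson_interval(7, 10))      # wide interval
print(wilson_interval(700, 1000))  # narrow interval
```

If the intervals for two slices overlap heavily, the observed gap may just be noise from a small sample rather than a real disparity.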
But achieving equality across groups on Fairness Indicators doesn't mean the model is fair.
Systems are highly complex, and achieving equality on one or even all of the provided metrics can't guarantee fairness.
Fairness evaluations should be run throughout the development process and post launch.
Just like improving your idea is an ongoing process and subject to adjustment, making your idea fair and equitable requires ongoing attention. As different aspects of the model change, such as training data, inputs from other models or the design itself, fairness metrics are likely to change too.
"Clearing the bar" once isn't enough to ensure that all of the interacting components have remained intact over time.
Additional defense against rare, targeted examples is also crucial, as these examples probably will not manifest in training or evaluation data.
Always remember that true integrity is when your actions are aligned with your stated values.