> But the model designers were aware that features could be correlated with demographic groups in a way that would make them proxies.
There's a huge problem with people trying to use umbrella usage to predict flooding. Some people are trying to develop a computer model that uses rainfall instead, but watchdog groups have raised concerns that rainfall may be used as a proxy for umbrella usage.
(It seems rather strange to expect a statistical model trained for accuracy to infer a shadow variable and indirect through it, making itself less accurate, simply because that variable is easy for humans to observe directly and then use as a lossy shortcut, or to promote alternate goals that aren't part of the labels being trained for.)
> These are two sets of unavoidable tradeoffs: focusing on one fairness definition can lead to worse outcomes on others. Similarly, focusing on one group can lead to worse performance for other groups. In evaluating its model, the city made a choice to focus on false positives and on reducing ethnicity/nationality based disparities. Precisely because the reweighting procedure made some gains in this direction, the model did worse on other dimensions.
Nice to see an investigation that's serious enough to acknowledge this.
tripletao 16 hours ago [-]
They correctly note the existence of a tradeoff, but I don't find their statement of it very clear. Ideally, a model would be fair in the senses that:
1. In aggregate over any nationality, people face the same probability of a false positive.
2. Two people who are identical except for their nationality face the same probability of a false positive.
In general, it's impossible to achieve both properties. If the output and at least one other input correlate with nationality, then a model that ignores nationality fails (1). We can add back nationality and reweight to fix that, but then it fails (2).
This tradeoff is most frequently discussed in the context of statistical models, since those make that explicit. It applies to any process for deciding though, including human decisions.
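To make the tension concrete, here is a small self-contained simulation (numpy only; every number is invented and nothing below comes from the Amsterdam model) in which fraud depends on a feature that happens to correlate with group membership. A single group-blind threshold gives the two groups different false positive rates, and equalizing those rates requires group-specific thresholds, so two applicants with identical features but different groups can get different decisions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
group = rng.integers(0, 2, n)                          # two hypothetical nationalities, 0 and 1
x = rng.normal(loc=group * 1.0, scale=1.0)             # a legitimate input that correlates with group
fraud = rng.random(n) < 1 / (1 + np.exp(-(x - 2.0)))   # fraud depends on x, not on group directly

def fpr(flag, mask):
    """False-positive rate: share of honest people in `mask` who get flagged."""
    honest = mask & ~fraud
    return flag[honest].mean()

# Group-blind rule: one threshold on x for everyone.
blind = x > 1.0
print("blind FPR  group 0:", round(fpr(blind, group == 0), 3),
      " group 1:", round(fpr(blind, group == 1), 3))   # noticeably different -> fails (1)

# Group-aware rule: per-group thresholds chosen so both groups get ~15% FPR.
thr = {g: np.quantile(x[(group == g) & ~fraud], 0.85) for g in (0, 1)}
aware = x > np.where(group == 0, thr[0], thr[1])
print("aware FPR  group 0:", round(fpr(aware, group == 0), 3),
      " group 1:", round(fpr(aware, group == 1), 3))   # roughly equal -> satisfies (1)
print("thresholds:", thr)  # differ by group, so identical x can be treated differently -> fails (2)
```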
londons_explore 16 hours ago [-]
> Two people who are identical except for their nationality face the same probability of a false positive
It would be immoral to disadvantage one nationality over another. But we also cannot disadvantage one age group over another. Or one gender over another. Or one hair colour over another. Or one brand of car over another.
So if we update this statement:
> Two people who are identical except for any set of properties face the same probability of a false positive.
With that new constraint, I don't believe it is possible to construct a model which outperforms a data-less coin flip.
drdaeman 10 hours ago [-]
I think you took too much of a jump, considering all properties the same, as if the only way to make the system fair is to make it entirely blind to the applicant.
We tend to distinguish between ascribed and achieved characteristics. It is considered to be unethical to discriminate upon things a person has no control over, such as their nationality, gender, age or natural hair color.
However, things like a car brand are entirely dependent on one's own actions, and if there's a meaningful, statistically significant correlation between owning a Maserati and fraudulently applying for welfare, I'm not entirely sure it would be unethical to consider such a factor.
And it also depends on what a false positive means for the person in question. Fairness (like most things social) is not binary, and while outright rejections can be very unfair, additional scrutiny can be less so, even though still not fair (causing delays and extra stress). If things are working normally, I believe there's a sort of unspoken social agreement (ever-changing, of course, as times and circumstances evolve) on what balance between fairness and abuse can be afforded.
belorn 4 hours ago [-]
Could we look at what kinds of achieved characteristics exist that do not act as a proxy for an ascribed characteristic? I have a really hard time finding any. Culture and values are highly intertwined with behavior, and the bigger the impact a behavior has on a person's life, the stronger a proxy that behavior seems to be.
To take a few examples: employment characteristics have a strong relationship with gender, generally creating more false positives for women. Similarly, academic success will create more false positives for men. Where a person chooses to live proxies heavily for socioeconomic factors, which in turn have gender as a major factor.
Welfare fraud itself also differs between men and women. The sums tend to be higher for men, while women make up the majority of welfare recipients. Women and men also tend to receive welfare at different times in their lives. It's even possible that car brand correlates with gender, which would then act as a proxy.
In terms of defining fairness, I do find it interesting that the analogue process gave men an advantage, while both the initial and the reweighted model are the opposite and give women an even bigger advantage. The change in bias against men created by using the detection algorithms is actually about the same size as the change in bias against non-Dutch nationals between the initial model and the reweighted one.
luckylion 9 hours ago [-]
> It is considered to be unethical to discriminate upon things a person has no control over, such as their nationality, gender, age or natural hair color.
Nationality and natural hair color I understand, but age and gender? A lot of behaviors are not evenly distributed. Riots after a football match? You're unlikely to find a lot of elderly women (and men, but especially women) involved. Someone is fattening a child? That elderly woman you've excluded for riots suddenly becomes a prime suspect.
> things like a car brand are entirely dependent on one's own actions
If you assume perfect free will, sure. But do you?
drdaeman 8 hours ago [-]
> A lot of behaviors are not evenly distributed.
That’s true. But the idea is that feeding it to a system as an input could be considered unethical, as one cannot control their age. Even though there’s a valid correlation.
> If you assume perfect free will, sure. But do you?
I don’t. If it matters, I’m actually currently persuaded that free will doesn’t exist. Which doesn’t change that if one buys a car, its make is typically their own decision. Whether that decision comes from free will or is entirely determined by antecedent causes doesn’t really matter for the purposes of fraud detection (or maybe I fail to see how it does).
I mean, we don’t need to care why people do things (at all, in general) - it matters for how we should act upon detection, but not for detection itself. And, as I understand it, we know we don’t want to put unfair pressure on groups defined by factors they cannot change, because when we did that it consistently contributed to various undesirable consequences. E.g. discrimination and stereotypes against women or men, or prejudice against younger or older people, didn’t do us any good.
luckylion 4 hours ago [-]
I get where you're coming from, but I very much doubt it's true RE car makes (and many similar things). There's a reason men and women have very distinct buying habits. E.g. men are ~4x more likely to buy a motorcycle. Individual decisions with that large a discrepancy between groups aren't individual decisions.
Can a young male really change their risk-tolerance or their innate drive to secure their place in the world (which will probably affect both their likelihood to buy sports cars and commit certain crimes)? I don't think we can pretend that everyone from toddler to granny is the same _and_ use any data to solve crimes / detect fraud.
In the end it comes down to where we draw the line between "person can't change this, so it's invalid to consider", "we don't believe it's linked, so it's invalid to consider" and "this is free will, so it's a valid signal", and I haven't seen a line that doesn't feel arbitrary ("I don't like that group, so their thing is free will, but I like this group, so their thing isn't") and is useful.
Borealid 10 hours ago [-]
I think the ethical desire is not to remove bias across all properties. Properties that result from an individual's conscious choices are allowed to be used as factors.
One can't change one's race, but changing marital status is possible.
Where it gets tricky is things like physical fitness or social groups...
like_any_other 3 hours ago [-]
> Ideally, a model would be fair in the senses that: 1. In aggregate over any nationality, people face the same probability of a false positive.
Why? We've been told time and time again that 'nations' don't really exist, they're just recent meaningless social constructs [1]. And 'races' exist even less [2]. So why is it any worse if a model is biased on nation or race, than on left-handedness or musical taste or what brand of car one drives? They're all equally meaningless, aren't they?
[1] https://www.reddit.com/r/AskHistorians/comments/18ubjpv/the_...
[2] https://www.scientificamerican.com/article/race-is-a-social-...
> 2. Two people who are identical except for their nationality face the same probability of a false positive.
That seems to fall afoul of the Base Rate Fallacy. E.g., consider two groups of 10,000 people and testing on A vs B. The first group has 9,999 A and 1 B, the second has 1 A and 9,999 B. Unless you make your test blatantly ineffective, you're going to have different false positive rates -- irrespective of the test's performance.
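For concreteness, here is that 9,999-vs-1 scenario worked through with a hypothetical test that is 99% accurate in both directions (the sensitivity and specificity are assumptions, not figures from the article). The per-person false positive rate is identical in both groups by construction; what changes with the base rate is the share of flags that are wrong, which is the usual shape of the base rate fallacy.

```python
# Hypothetical worked example of the 9,999-vs-1 scenario above.
# Assume a test for property B with 99% sensitivity and 99% specificity
# (both numbers made up; nothing here comes from the Amsterdam report).
sens, spec = 0.99, 0.99

def flags(n_a, n_b):
    """Expected flag counts for a population of A's (negatives) and B's (positives)."""
    true_pos = n_b * sens           # B's correctly flagged
    false_pos = n_a * (1 - spec)    # A's wrongly flagged
    return true_pos, false_pos

for name, (n_a, n_b) in {"group 1": (9_999, 1), "group 2": (1, 9_999)}.items():
    tp, fp = flags(n_a, n_b)
    print(name,
          "FPR among A's:", round(fp / n_a, 3),                 # identical in both groups
          "share of flags that are false:", round(fp / (tp + fp), 3))  # wildly different
```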
kurthr 16 hours ago [-]
This is a really key result. You can't effectively be "blind" to a parameter that is significantly correlated to multiple inputs and your output prediction. By using those inputs to minimize false positives you are not statistically blind, and you can't correct the statistics while being blind.
My suspicion is that in many situations you could build a detector/estimator which was fairly close to being blind without a significant total increase in false positives, but how much is too much?
I'm actually more concerned that where I live even accuracy has ceased to be the point.
thatguymike 17 hours ago [-]
Congrats Amsterdam: they funded a worthy and feasible project; put appropriate ethical guardrails in place; iterated scientifically; then didn’t deploy when they couldn’t achieve a result that satisfied their guardrails. We need more of this in the world.
tbrownaw 17 hours ago [-]
What were the error rates for the various groups with the old process? Was the new process that included the model actually worse for any group, or was it just uneven in how much better it was?
jaoane 15 hours ago [-]
[flagged]
nxobject 11 hours ago [-]
> because I don't even need to look at the data to know that some groups are more likely to commit fraud.
That is by definition prejudice: bias without evidence. Perhaps they want to avoid that.
jaoane 9 hours ago [-]
Thankfully, this project got evidence. Unfortunately, it was shelved.
GardenLetter27 5 hours ago [-]
> None of these features explicitly referred to an applicant’s gender or racial background, as well as other demographic characteristics protected by anti-discrimination law. But the model designers were aware that features could be correlated with demographic groups in a way that would make them proxies.
What's the problem with this? It isn't racism, it's literally just Bayes' Law.
Viliam1234 2 hours ago [-]
> It isn't racism, it's literally just Bayes' Law.
That may be logically correct, but the law is above logic. Sometimes applying Bayes' Law is legally considered racism.
https://en.wikipedia.org/wiki/Disparate_impact
Laws are artificial social constructs that represent an ideal, not a reflection of reality or nature. It's unwise to reject factual reality just because it meets the statutory definition of "racist". Disparate impact theory goes right up there with the politicians who declare the law to be above math itself in cryptography policy debates on the basis of a delusional rejection of reality.
This is why wealthy anti-racist progressives overwhelmingly still avoid decrepit, violence-racked urban slums. It's racist, but anti-racism is a luxury belief held by people who stay far away from the root cause of racism - aggregate behavioral differences that originate from culture, which is deeply intertwined with race, at least in the US.
Bayes' Law isn't wrong. Billions of humans across tens of thousands of years didn't have broken pattern recognition, we've just recently decided that the (very real) patterns everyone recognizes hurt the feelings of the genuinely disadvantaged.
crote 4 hours ago [-]
Let's say you are making a model to judge job applicants. You are aware that the training data is biased in favor of men, so you remove all explicit mentions of gender from their CVs and cover letters.
Upon evaluation, your model seems to accept everyone who mentions a "fraternity" and reject anyone who mentions a "sorority". Swapping out the words turns a strong reject into a strong accept, and vice versa.
But you removed any explicit mention of gender, so surely your model couldn't possibly be showing an anti-women bias, right?
alternatex 4 hours ago [-]
I've never had any implication of my gender other than my name in any CV over the past decade.
Who are these people who include gender-implicating data in a career-history doc? And if there are such CVs, they should be stripped of such data before processing.
The fraternity example is such a specific, 1-in-1,000 case.
triceratops 14 minutes ago [-]
> I've never had any implication of my gender other than my name in any CV
So you're not implying gender other than by implying gender? If humans can use names to classify people into genders, a model can do the same thing.
3abiton 17 hours ago [-]
> A more concerning limitation is that when the city re-ran parts of its analysis, it did not fully replicate its own data and results. For example, the city was unable to replicate its train and test split. Furthermore, the data related to the model after reweighting is not identical to what the city published in its bias report and although the results are substantively the same, the differences cannot be explained by mere rounding errors.
Very well written, but that last part is concerning and points to one thing: did they hire interns? How come they do not have systems? It just casts a big doubt on the whole experiment.
bananaquant 8 hours ago [-]
What nobody seems to talk about is that their resulting models are basically garbage. If you look at the last provided confusion matrix, their model is right in about 2/3 of cases when it makes a positive prediction. The actual positives are about 60%. So, any improvement is marginal at best and a far cry from the ~90% accuracy you would expect from a model in such a high-stakes scenario. They could have thrown half of the cases out at random and had about the same reduction in caseload without introducing any bias into the process.
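Back-of-the-envelope version of that point, using the approximate figures quoted above (about two-thirds precision vs. roughly 60% base rate among investigated cases; these are the commenter's readings of the report, not exact numbers from it):

```python
# Rough arithmetic only; the rates are approximations quoted in the comment above.
base_rate = 0.60         # share of investigated cases that turned out worthwhile
model_precision = 2 / 3  # share of model-flagged cases that turned out worthwhile

flagged = 1000                            # hypothetical number of flagged cases
hits_model = model_precision * flagged    # ~667 worthwhile investigations
hits_random = base_rate * flagged         # ~600 if the same number were picked at random
print(round(hits_model), round(hits_random), round(hits_model / hits_random, 2))  # lift ~1.11x
```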
xyzal 8 hours ago [-]
You can't tell a project will fail until you undertake it.
Amsterdam didn't deploy their models when they found their outcome is not satisfactory. I find it a perfectly fine result.
delusional 8 hours ago [-]
> What nobody seems to talk about is that their resulting models are basically garbage.
The post does talk about it when it briefly mentions that the goal of building the model (to decrease the number of cases investigated while increasing the rate of finding fraud) wasn't achieved. They don't say any more than that because that's not the point they are making.
Anyway, the project was shelved after a pilot. So your point is entirely false.
bananaquant 7 hours ago [-]
Good catch about the project being shelved. It is buried pretty deep in the document to the point of making it misleading:
> In late November 2023, the city announced that it would shelve the pilot.
I would agree that implications regarding the use of those models do not hold, but not the ones about their quality.
octo888 2 hours ago [-]
IMO the title would benefit from the word "welfare" before "fraud"
wongarsu 18 hours ago [-]
A big part of the difficulty of such an attempt is that we don't know the ground truth. A model is fair or unbiased if its performance is equally good for all groups. Meaning e.g. if 90% of cases of Arabs committing fraud are flagged as fraud, then 90% of cases of Danish people committing fraud should be flagged as fraud. The paper agrees on this.
The issue is that we don't know how many Danish commit fraud, and we don't know how many Arabs commit fraud, because we don't trust the old process to be unbiased. So how are we supposed to judge if the new model is unbiased? This seems fundamentally impossible without improving our ground truth in some way.
The project presented here instead tries to do some mental gymnastics to define a version of "fair" that doesn't require that better ground truth. They were able to evaluate their results on the false-positive rate by investigating the flagged cases, but they were completely in the dark about the false-negative rate.
In the end, the new model was just as biased, but in the other direction, and performance was simply worse:
> In addition to the reappearance of biases, the model’s performance in the pilot also deteriorated. Crucially, the model was meant to lead to fewer investigations and more rejections. What happened instead was mostly an increase in investigations, while the likelihood to find investigation-worthy applications barely changed in comparison to the analogue process. In late November 2023, the city announced that it would shelve the pilot.
golemiprague 18 hours ago [-]
[dead]
tomp 19 hours ago [-]
Key point:
The model is considered fair if its performance is equal across these groups.
One can immediately see why this is problematic by considering an equivalent example in less controversial (i.e. less emotionally charged) situations.
Should basketball performance be equal across racial, or sex groups? How about marathon performance?
It’s not unusual that relevant features are correlated with protected features. In the specific example above, being an immigrant is likely correlated with not knowing the local language, therefore being underemployed and hence more likely to apply for benefits.
atherton33 18 hours ago [-]
I think they're saying something more subtle.
In your basketball analogy, it's more like they have a model that predicts basketball performance, and they're saying that model should predict performance equally well across groups, not that the groups should themselves perform equally well.
tomp 18 hours ago [-]
You’re right, I misinterpreted it.
Jimmc414 16 hours ago [-]
Amsterdam reduced bias by one measure (False Positive Share) and bias increased by another measure (False Discovery Rate). This isn’t a failure of implementation; it’s a mathematical reality that you often can’t satisfy multiple fairness criteria simultaneously.
Training on past human decisions inevitably bakes in existing biases.
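A tiny worked example of why this happens when base rates differ, with invented numbers (not Amsterdam's): hold recall and the false positive rate identical for two groups, and the false discovery rate still comes out different.

```python
# Two hypothetical groups, same classifier behaviour (80% recall, 5% FPR),
# different underlying fraud rates. All numbers are invented for illustration.
def confusion(n, fraud_rate, recall=0.80, fpr=0.05):
    fraudsters = n * fraud_rate
    honest = n - fraudsters
    tp = recall * fraudsters   # fraudsters correctly flagged
    fp = fpr * honest          # honest people wrongly flagged
    return tp, fp

for name, fraud_rate in [("group A", 0.10), ("group B", 0.30)]:
    tp, fp = confusion(1000, fraud_rate)
    print(name, "false discovery rate:", round(fp / (tp + fp), 3))
# Equal recall and equal FPR for both groups, yet the FDR differs
# (~0.36 vs ~0.13): with different base rates you cannot pin down
# every fairness metric at once.
```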
ncruces 16 hours ago [-]
I have a growing feeling that the only way to be fair in these situations is to be completely random.
LorenPechtel 12 hours ago [-]
Why is there so much focus on "fair" even when reality isn't?
Not all misdeeds are equally likely to be detected. What matters is minimizing the false positives and false negatives. But it sounds like they don't even have a ground truth to compare against, making the whole thing an exercise in bureaucracy.
Fraterkes 9 hours ago [-]
Who says reality isn't fair? Isn't that up to us, the people inhabiting reality?
BonoboIO 19 hours ago [-]
The article talks a lot about fairness metrics but never mentions whether the system actually catches fraud.
Without figures for true positives, recall, or financial recoveries, its effectiveness remains completely in the dark.
In short: great for moral grandstanding in the comments section, but zero evidence that taxpayer money or investigative time was ever saved.
stefan_ 15 hours ago [-]
It also doesn't mention what numbers we're even talking about that would, given the expansive size of the Dutch government, make this a useful exercise at all.
TacticalCoder 17 hours ago [-]
[dead]
zeroCalories 18 hours ago [-]
Does anyone know what they mean by reweighing demographics? Are they penalizing incorrect classifications more heavily for those demographics, or making sure that each demographic is equally represented, or something else? Putting aside the model's degraded performance, I think it's fair to try and make sure the model is performing well for all demographics.
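One common scheme that goes by this name in the fairness literature (a sketch in the Kamiran & Calders style, not necessarily what Amsterdam actually did): give each training example a weight so that group membership and the fraud label look statistically independent in the weighted data, then train the classifier with those sample weights.

```python
# Hedged sketch of "reweighing" as sample weighting; not taken from the report.
from collections import Counter

def reweighing_weights(groups, labels):
    """Return one weight per example: P(group) * P(label) / P(group, label)."""
    n = len(labels)
    p_group = Counter(groups)
    p_label = Counter(labels)
    p_joint = Counter(zip(groups, labels))
    return [
        (p_group[g] / n) * (p_label[y] / n) / (p_joint[(g, y)] / n)
        for g, y in zip(groups, labels)
    ]

# Toy data: group "B" carries the fraud label more often in the training data,
# so its fraud examples get down-weighted and its non-fraud examples up-weighted
# (and vice versa for group "A").
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
print(reweighing_weights(groups, labels))
```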
londons_explore 16 hours ago [-]
In my view, we need to move the goalposts.
Fraud detection models will never be fair. Their job is to find fraud. They will never be perfect, and the mistaken cases will cause a perfectly honest citizen to be disadvantaged in some way.
It does not matter if that group is predominantly 'people with skin colour X' or 'people born on a Tuesday'.
What matters is that the disadvantage those people face is so small as to be irrelevant.
I propose a good starting point would be for each person investigated to be paid money to compensate them for the effort involved - whether or not they committed fraud.
WhyIsItAlwaysHN 7 hours ago [-]
Some groups will be more disadvantaged than others by being investigated. For example, for welfare, I expect fraudsters to have more money to support themselves or fewer people to support (unless the criteria for welfare are something unexpected).
So I'd say that there also needs to be more protections than just providing money.
Nevertheless the idea of giving money is still good imo, because it also incentivizes the fraud detection becoming more efficient, since mistakes now cost more. Unfortunately I have a feeling people might game that to get more money by triggering false investigations.
djohnston 19 hours ago [-]
[flagged]
ordu 17 hours ago [-]
The goal is to avoid penalizing people for their skin color, or for gender/sex/ethnicity/whatever. If some group has a higher rate of welfare fraud, a fair/unbiased system must keep false positives for that group at the same level as for the general population. Ideally there would be no false positives at all, because they are costly for the people who are wrongly flagged, but sadly real systems are not like that. So these false positives have to be spread over all groups in proportion to the groups' sizes.
Though the situation is more complex than that. What I described is named "False Positive Share" in the article (or at least I think so), but the article discusses other metrics too.
The problem is that the policy should make the world better, but if the policy penalizes some groups for law breaking, then it can push these groups to break the law even more. It is possible to create biases this way, and it is possible to do it accidentally. Or, rather, it is hard not to do it accidentally.
I'd recommend reading "Against Prediction"; it has a lot of examples of how this works. For example, biased false negatives are also bad: they make it easier for some groups to break the law.
jsemrau 16 hours ago [-]
>The goal is to avoid penalizing people for their skin color [...]
That's not correct. The goal is to identify and flag fraud cases.
If one group has a higher likelihood of committing fraud, then this will show up in the data. The solution should not be to change the data but to educate that group to change its behavior.
Please note that I have neither mentioned any specific group nor have a specific group in mind. However, an example of such a group that I have seen in my professional life would be 20-year-old female CEOs of construction companies (often connected to organized crime).
ordu 15 hours ago [-]
> The solution should not be to change the data but educate that group to change their behavior.
1. This is easier to say than to do.
2. In reality what you see is a correlation. If you try to educate all 20-year-old women not to become organized-crime-connected CEOs of construction companies, your efforts will be wasted on 99% of these people, because they are either not connected to organized crime or are not going to become CEOs. Moreover, your very efforts will lead to discrimination against 20-year-old women, if not through public perception of them, then because you've just made it harder for them to become CEOs.
> The goal is to identify and flag fraud cases.
Not quite. The goal is to reduce the number of fraud cases. Identifying and flagging is one method of achieving that goal. But policymakers have a lot of other goals, like avoiding discrimination or reducing the murder rate. By focusing on one goal, policymakers might undermine the others.
As a side (almost metaphysical) note: this is one of the reasons why techies are bad at social problems. Their math education taught them to ignore all irrelevant details when dealing with a problem, but society is a big, complex system where everything is connected, so in general you can't ignore anything, because everything is relevant. But the education has the upper hand, so techies tend to throw away as much complexity as is needed to make the problem solvable. They will never accept that they don't know how to solve a problem.
jsemrau 14 hours ago [-]
"your efforts will lead to a discrimination of 20 years old females"
I'd think that this is an extremely far-fetched example that fails at basic logic.
Just because a very specific scenario will be flagged does not mean that this scenario is generalized to all CEOs, all females, all 20 year olds.
drdaeman 9 hours ago [-]
I think their point is that a lot of us, upon hearing "group A tend to exhibit more of some negative trait X than some other group B" mentally start to associate A with X and this creates a social stigma - just because how our brains work.
I wish there'd be some way to phrase such statements in a nonjudgmental way, without introducing a perception bias...
drdaeman 9 hours ago [-]
> The goal is to reduce the amount of fraud cases.
I'm sorry, but I fail to see how you have reached this conclusion. Can you please elaborate? The way I understand it, a detection system cannot affect anything about its inputs - you need a feedback loop for that to happen. And I don't see anything like that within the scope of the project as covered by the article.
halostatue 15 hours ago [-]
In practice, investigations tend to find the results for which the investigation was started. At the beginning of the article, it was also suggested that such investigations in Amsterdam found no higher rate of actual fraud amongst the groups which were targeted more frequently via implicit bias by human reviewers.
In North America, we know that white people use hard drugs at a slightly higher rate than non-whites. However, the arrest and conviction rate of hard drug users is multiples higher for non-white people than whites. (I mention North America because similar data exist for both Canada and the USA, but the exact ratios and which groups are negatively impacted differ.)
Similarly, when it comes to accusations of welfare fraud, there is substantial bias in the investigations of non-whites and there are deep-seated racist stereotypes (thanks for that, Reagan) that don't hold up to scrutiny especially when the proportion of welfare recipients is slightly higher amongst whites than amongst non-whites[1].
So…saying that the goal is to avoid penalizing people for [innate characteristics] is more correct and a better use of time. The city of Amsterdam already knew that its fraud investigations were flawed.
[1] In the US based on 2022 data, https://www.census.gov/library/stories/2022/05/who-is-receiv... shows that excluding Medicaid/CHIP, the rate of welfare is higher for whites.
The better definition of equal performance would obviously be that the metrics for the detector (accuracy, false positive rate, etc.) would be the same for all groups.
I won't comment on why it's defined the way that it is.
Edit: it looks like they define several metrics, including ones like I mention above that consider performance and at least one based on what number or percentage is flagged in each group.
parpfish 18 hours ago [-]
Or that the error distributions are equal across groups. That way you could still detect that one group is committing fraud at a higher rate, but false positives/negatives occur at the same rate in each group
tbrownaw 17 hours ago [-]
There are multiple different ways to measure performance. If different groups have different rates of whatever you're predicting, it is not possible to have all of the different ways of measuring performance agree on whether your model is fair or not.
bsder 9 hours ago [-]
> Why would you assume that all groupings of people commit welfare fraud at the same rate?
Because the goal is NOT just wiping out fraud, but, instead, minimizing harm or possibly maximizing positive results.
Minimizing fraud is super easy--just don't give out any benefits. No fraud--problem solved.
That's not the final goal, though. As such, the ideal amount of fraud is somewhere above zero. We want to avoid falsely penalizing people who, practically by definition, probably don't have the resources to fight the false classification. And we want to minimize the amount of aid resources we spend policing said aid.
The goal is to find a balance. Is helping 100 people but carrying 1 fraudster a good tradeoff? Should it be 1000? Should it be 10? Well, that's a political discussion.
fluorinerocket 16 hours ago [-]
Oh no you really stepped in it now
BonoboIO 19 hours ago [-]
Yes it is. This is some ideal world thinking, that has nothing to do with reality and is easily falsifiable, but only if you want to see the real world.
throwawayqqq11 19 hours ago [-]
[flagged]
djohnston 19 hours ago [-]
No... the pre-determined bias in this story is obviously that all subgroups of people behave identically w.r.t. welfare applications, which the data itself did not support and a momentary consideration of socioeconomics would debunk. When they tried to kludge the weights to fit their predetermined bias, the model did so poorly on a pilot run that the city shut it down.
throwawayqqq11 18 hours ago [-]
Being flagged as potential fraud based on e.g. ethnicity is what you want to eliminate, so you have to start with the assumption of an even distribution.
From the article:
> Deciding which definition of fairness to optimize for is a question of values and context.
This optimization is the human feedback required to not have the model stagnate in a local optimum.
pessimizer 16 hours ago [-]
> Why would you assume that all groupings of people commit welfare fraud at the same rate?
What's the alternative? It's an unattainable statistic, the people who get away with crime. Instead, what ends up getting used is the fraud rates under the old system, or ad hoc rules of thumb based in bigoted anecdotes.
So instead you declare that you don't think that ethnicity is in and of itself a cause of fraud, even if there may be any number of characteristics that tend to indicate or motivate fraud that are seen more in one specified ethnicity than another (poverty, etc.), and even though we should expect that to lead to more fraud. We can choose to say that those characteristics lead to fraud, rather than the ethnicity, and put that out of scope.
Then we can say that this algorithm isn't meant to solve multiculturalism, it's meant hopefully not to exacerbate the problems with it. If one wants to get rid of weird immigrants, non-whites, or non-Christians, just do it, instead of automating a system to be bigoted.
Also, going after the marginal increase of rates of fraud through defining groups that represent a small portion of the whole is likely to be a waste of money. If 90% of people commit fraud at a 5% rate and 10% commit it at a 10% rate, where should you be spending your time?
dgfitz 16 hours ago [-]
One of these days, I’m still hopeful, we will figure out that behaviors are taught, usually by parents. Intentionally or accidentally, kids learn from what they see.
I don’t care what nationality you are, or what your skin color happens to be, the root cause is how kids are reared.
In my head it’s so simple.
drdaeman 9 hours ago [-]
Not just kids - it's about one's whole life, including the adulthood, to the very last moments. We tend to change a lot over the courses of our lives, constantly being affected by our surroundings.
dgfitz 16 hours ago [-]
Using the word “retarded” like that negates any other point you tried to make.
jaoane 15 hours ago [-]
[flagged]
djoldman 19 hours ago [-]
"Unbiased," and "fair" models are generally somewhat ironic.
It's generally straightforward to develop one if we don't care much about the performance metric:
If we want the output to match a population distribution, we just force it by taking the top predicted for each class and then filling up the class buckets.
For example, if we have 75% squares and 25% circles, but circles are predicted at a 10-1 rate, who cares, just take the top 3 squares predicted and the top 1 circle predicted until we fill the quota.
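A minimal sketch of that quota-filling procedure (names and numbers invented): rank items by score within each class and take them until each class's quota, set from its population share, is filled.

```python
# Hypothetical quota-filling selection, matching the squares/circles example above.
def select_with_quotas(items, scores, classes, k, population_share):
    """Pick k items so each class fills its quota, taking highest scores first."""
    quotas = {c: round(k * share) for c, share in population_share.items()}
    ranked = sorted(zip(scores, items, classes), reverse=True)  # highest score first
    picked = []
    for score, item, cls in ranked:
        if quotas.get(cls, 0) > 0:
            picked.append(item)
            quotas[cls] -= 1
    return picked

# 75% squares / 25% circles: flag 4 cases -> 3 squares and 1 circle,
# regardless of how the raw scores are distributed between the two classes.
items   = ["s1", "s2", "s3", "s4", "c1", "c2", "c3", "c4"]
classes = ["sq", "sq", "sq", "sq", "ci", "ci", "ci", "ci"]
scores  = [0.2, 0.3, 0.1, 0.4, 0.9, 0.8, 0.95, 0.7]
print(select_with_quotas(items, scores, classes, 4, {"sq": 0.75, "ci": 0.25}))
```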
wongarsu 18 hours ago [-]
So if I want to make a model to recommend inkjet printers then a quarter of all recommendations should be for HP printers? After all, a quarter of all sold printers are HP.
As you say, that would be a crappy model. But in my opinion that would also be hardly a fair or unbiased model. That would be a model unfairly biased in favor of HP, who barely sell anything worth recommending
djoldman 18 hours ago [-]
Yes, well there's the irony.
"Unbiased" and "fair" are quite overloaded here, to borrow a programming term.
I think it's one of those times where single words should expressly NOT be used to describe the intent.
The intent of this is to presume that the rate of the thing we are trying to detect is constant across subgroups. The definition of a "good" model therefore is one that approximates this.
I'm curious if their data matches that assumption. Do subgroups submit bad applications at the same rate?
It may be that they don't have the data and therefore can't answer that.
teekert 18 hours ago [-]
I know a cop; they do public searches for weapons or drugs. Our law dictates fairness, so every now and then they search an elderly couple. You know how this goes and what the results are.
Any model would be unfair, age-wise but also ethnically.
To be most effective the model would have to be unfair. It would suck to be a law-abiding young member of a specific ethnic minority.
But does it help to search elderly couples?
I’m genuinely curious what would be fair and effective here. You can’t be a Bayesian.
lostlogin 16 hours ago [-]
If this strategy was applied across policing, their metrics would likely improve markedly.
Eg, police shooting and brutality stats wouldn’t be tolerated for very long.
Scarblac 19 hours ago [-]
But that's a bias, if circles are actually more likely to be fraudulent.
djoldman 19 hours ago [-]
If the definition of "unbiased" and "fair" is that the model flags squares and circles at a rate or proportion equal to the population distribution of squares and circles, then the model is unbiased and fair.
As noted above, this doesn't do anything for performance.
talkingtab 17 hours ago [-]
Is this crazy or what? My takeaway is that the factors the city of Amsterdam is using to predict fraud are probably not actually predictors. For example, if you use the last digit of someone's phone number as a fraud predictor, you might discover there is a bias against low numbers. So you adjust your model to make it less likely that low numbers generate investigations. It is unlikely that your model will be any more fair after your adjustment.
One has to wonder if the study is more valid a predictor of the implementers' biases than that of the subjects.