How to Compute the Brier Score for more than Two Classes
$begingroup$
tl;dr
How do I correctly compute the Brier score for more than two classes? I got confusing results with different approaches. Details below.
As suggested to me in a comment to this question, I would like to evaluate the quality of a set of classifiers I trained with the Brier score. These classifiers are multiclass classifiers and the classes are imbalanced. The Brier score should be able to handle these conditions. However, I am not quite confident about how to apply the Brier score test. Say I have 10 data points and 5 classes:
One-hot vectors represent which class is present in a given item of data:

    import numpy as np

    targets = np.array([[0, 0, 0, 0, 1],
                        [0, 0, 0, 0, 1],
                        [0, 0, 0, 0, 1],
                        [0, 1, 0, 0, 0],
                        [0, 0, 0, 0, 1],
                        [0, 0, 1, 0, 0],
                        [1, 0, 0, 0, 0],
                        [0, 1, 0, 0, 0],
                        [1, 0, 0, 0, 0],
                        [1, 0, 0, 0, 0]])
Vectors of probabilities represent the outputs of my classifiers, assigning a probability to each class:

    probs = np.array([[0.14, 0.38, 0.4 , 0.04, 0.05],
                      [0.55, 0.05, 0.34, 0.04, 0.01],
                      [0.3 , 0.35, 0.18, 0.09, 0.08],
                      [0.23, 0.22, 0.04, 0.05, 0.46],
                      [0.  , 0.15, 0.47, 0.28, 0.09],
                      [0.23, 0.13, 0.34, 0.27, 0.03],
                      [0.32, 0.06, 0.59, 0.02, 0.01],
                      [0.01, 0.19, 0.01, 0.03, 0.75],
                      [0.27, 0.38, 0.03, 0.12, 0.2 ],
                      [0.17, 0.45, 0.11, 0.25, 0.01]])
These matrices are co-indexed: probs[i, j] is the predicted probability that data point i belongs to class j, and targets[i, j] is 1 if it actually does and 0 otherwise.
Now, according to Wikipedia the definition of the Brier Score for multiple classes is
$$\frac{1}{N} \sum_{t=1}^{N} \sum_{i=1}^{R} (f_{ti} - o_{ti})^2$$
When I program this in Python and run it on the above targets and probs matrices, I get a result of $1.0069$:

    >>> def brier_multi(targets, probs):
    ...     return np.mean(np.sum((probs - targets)**2, axis=1))
    ...
    >>> brier_multi(targets, probs)
    1.0068899999999998
But I am not sure if I interpreted the definition correctly.
For Python, the sklearn library provides sklearn.metrics.brier_score_loss. The documentation states:
The Brier score is appropriate for binary and categorical outcomes that can be structured as true or false
However, what the function actually does is pick one of the $n > 2$ classes (or take one passed as an argument), treat that class as class $1$, and treat all other classes as class $0$.
For example, if we choose class 3 (index 2) as the $1$ class and thus all other classes as class $0$, we get:

    >>> from sklearn.metrics import brier_score_loss
    >>> # get true classes by argmax over binary arrays
    >>> true_classes = np.argmax(targets, axis=1)
    >>> brier_score_loss(true_classes, probs[:,2], pos_label=2)
    0.13272999999999996
Alternatively:

    >>> brier_score_loss(targets[:,2], probs[:,2])
    0.13272999999999996
This is indeed the binary version of the Brier score, as can be shown by manually defining and running it:

    >>> def brier_bin(targets, probs):
    ...     return np.mean((targets - probs) ** 2)
    ...
    >>> brier_bin(targets[:,2], probs[:,2])
    0.13272999999999996
As you can see, this is the same result as with sklearn's brier_score_loss.
Wikipedia states about the binary version:
This formulation is mostly used for binary events (for example "rain"
or "no rain"). The above equation is a proper scoring rule only for
binary events;
So... Now I am confused and have the following questions:
1) If sklearn computes the multiclass Brier score as a one-vs-all binary score, is that the only correct way to compute the multiclass Brier score?
Which leads me to
2) If that is so, my brier_multi code must be based on a misconception. What is my misconception about the definition of the multiclass Brier score?
3) Maybe I am on the wrong track altogether. In that case, please explain to me how to compute the Brier score correctly.
classification scikit-learn model-evaluation scoring-rules
$endgroup$
edited Apr 17 at 11:11
lo tolmencre
asked Apr 17 at 10:42
lo tolmencre
1 Answer
$begingroup$
Wikipedia's version of the Brier score for multiple categories is correct. Compare the original publication by Brier (1950), or any number of academic publications, e.g. Czado et al. (2009) (equation (6), though you would need to do some simple arithmetic and drop a constant 1 to arrive at Brier's formulation).
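The "simple arithmetic" can be sketched as follows. This assumes the alternative formulation is written as $-2 f_{tx} + \sum_i f_{ti}^2$, where $x$ is the observed class of data point $t$; that is a common way of writing the quadratic score, but the notation here is an assumption and is not copied from the paper. With $f_{ti}$ the predicted probabilities and $o_{ti}$ the one-hot outcomes, so that $o_{tx} = 1$ and all other $o_{ti} = 0$:

$$\sum_{i=1}^{R} (f_{ti} - o_{ti})^2 = \sum_{i=1}^{R} f_{ti}^2 - 2 \sum_{i=1}^{R} f_{ti}\, o_{ti} + \sum_{i=1}^{R} o_{ti}^2 = \sum_{i=1}^{R} f_{ti}^2 - 2 f_{tx} + 1$$

The two forms therefore differ only by the constant $1$ per data point, which does not change how classifiers are ranked against each other.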
If sklearn calculates a binary "one against all" Brier score and averages over all choices of a focal class, then it can certainly do so. However, it is simply not the Brier score. Passing it off as such is misleading and wrong.
The misconception lies entirely with sklearn.
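To make the difference concrete, here is a minimal sketch using the data from the question. It compares the multiclass score with the binary "one against all" scores computed per class. The helper name brier_one_vs_rest is made up for this illustration and is not an sklearn function.

```python
import numpy as np

# Data copied from the question: one-hot targets and predicted probabilities.
targets = np.array([[0, 0, 0, 0, 1],
                    [0, 0, 0, 0, 1],
                    [0, 0, 0, 0, 1],
                    [0, 1, 0, 0, 0],
                    [0, 0, 0, 0, 1],
                    [0, 0, 1, 0, 0],
                    [1, 0, 0, 0, 0],
                    [0, 1, 0, 0, 0],
                    [1, 0, 0, 0, 0],
                    [1, 0, 0, 0, 0]])
probs = np.array([[0.14, 0.38, 0.40, 0.04, 0.05],
                  [0.55, 0.05, 0.34, 0.04, 0.01],
                  [0.30, 0.35, 0.18, 0.09, 0.08],
                  [0.23, 0.22, 0.04, 0.05, 0.46],
                  [0.00, 0.15, 0.47, 0.28, 0.09],
                  [0.23, 0.13, 0.34, 0.27, 0.03],
                  [0.32, 0.06, 0.59, 0.02, 0.01],
                  [0.01, 0.19, 0.01, 0.03, 0.75],
                  [0.27, 0.38, 0.03, 0.12, 0.20],
                  [0.17, 0.45, 0.11, 0.25, 0.01]])


def brier_multi(targets, probs):
    """Multiclass Brier score: squared errors summed over classes, averaged over data points."""
    return np.mean(np.sum((probs - targets) ** 2, axis=1))


def brier_one_vs_rest(targets, probs):
    """Binary Brier score of each class against all others (hypothetical helper)."""
    return np.mean((probs - targets) ** 2, axis=0)


per_class = brier_one_vs_rest(targets, probs)
print("class at index 2 vs rest:", per_class[2])
print("sum over classes:        ", per_class.sum())
print("mean over classes:       ", per_class.mean())
print("multiclass Brier score:  ", brier_multi(targets, probs))
```

On this data the one-vs-rest score for the class at index 2 is about 0.1327, matching the value in the question. Summing the per-class scores gives about 1.0069, the multiclass score, because the two sums in the definition can be swapped. Averaging them instead gives about 0.2014, which is the multiclass score divided by the number of classes. A single one-vs-rest score therefore only covers one term of the full sum.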
Just use your brier_multi, it's completely correct.
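If your true classes are stored as integer labels rather than one-hot rows, a small wrapper around the same computation might look like the sketch below. The function name brier_multi_from_labels and the toy inputs are assumptions made for illustration. The checks at the end confirm that, with the sum taken over all classes, the score runs from 0 (perfect) to 2 (all probability mass on a wrong class).

```python
import numpy as np


def brier_multi_from_labels(labels, probs):
    """Multiclass Brier score from integer class labels (hypothetical helper).

    labels: sequence of length N with class indices in 0..R-1
    probs:  array of shape (N, R) with predicted class probabilities
    """
    labels = np.asarray(labels)
    probs = np.asarray(probs, dtype=float)
    n_samples, n_classes = probs.shape
    # Build the one-hot target matrix: row t has a 1 in column labels[t].
    targets = np.zeros((n_samples, n_classes))
    targets[np.arange(n_samples), labels] = 1.0
    # Same computation as brier_multi in the question.
    return np.mean(np.sum((probs - targets) ** 2, axis=1))


labels = [0, 1, 2]

# Perfect forecasts: all mass on the true class.
perfect = np.eye(3)
# Worst forecasts: all mass on a wrong class.
worst = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])
# Uninformative forecasts: equal mass on every class.
uniform = np.full((3, 3), 1 / 3)

score_perfect = brier_multi_from_labels(labels, perfect)
score_worst = brier_multi_from_labels(labels, worst)
score_uniform = brier_multi_from_labels(labels, uniform)
print(score_perfect, score_worst, score_uniform)
```

For three classes the uniform forecast scores 2/3, since each row contributes $(2/3)^2 + 2 \cdot (1/3)^2$.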
$endgroup$
answered Apr 17 at 11:04
Stephan Kolassa