Philosophical question on logistic regression: why isn't the optimal threshold value trained?


13 votes
Usually in logistic regression, we fit a model and get some predictions on the training set. We then cross-validate on those training predictions (something like here) and decide the optimal threshold value based on something like the ROC curve.



Why don't we incorporate cross-validation of the threshold INTO the actual model, and train the whole thing end-to-end?
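For concreteness, a minimal sketch of the two-step procedure I mean (scikit-learn, synthetic data, and the Youden's-J rule for picking the cutoff are just illustrative choices):

    # Step 1: fit the probability model; step 2: pick a threshold afterwards.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    model = LogisticRegression().fit(X_tr, y_tr)      # step 1: probability model
    probs = model.predict_proba(X_tr)[:, 1]

    fpr, tpr, thresholds = roc_curve(y_tr, probs)     # step 2: threshold chosen separately
    best = thresholds[np.argmax(tpr - fpr)]           # e.g. maximize Youden's J = TPR - FPR
    print("chosen threshold:", best)
    print("test accuracy at that threshold:",
          np.mean((model.predict_proba(X_te)[:, 1] >= best) == y_te))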










Tags: logistic, cross-validation, optimization, roc, threshold






asked Apr 25 at 15:36 by StatsSorceress · edited Apr 25 at 17:05
5 Answers
18 votes
          A threshold isn't trained with the model because logistic regression isn't a classifier (cf., Why isn't Logistic Regression called Logistic Classification?). It is a model to estimate the parameter, $p$, that governs the behavior of the Bernoulli distribution. That is, you are assuming that the response distribution, conditional on the covariates, is Bernoulli, and so you want to estimate how the parameter that controls that variable changes as a function of the covariates. It is a direct probability model only. Of course, it can be used as a classifier subsequently, and sometimes is in certain contexts, but it is still a probability model.
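To make the distinction concrete, a minimal sketch (scikit-learn assumed, data synthetic): the fitted object's real output is an estimate of $p$; the class label is a separate thresholding step bolted on afterwards.

    # Logistic regression estimates P(Y=1 | x); classification is an extra, optional step.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=300, random_state=1)
    model = LogisticRegression().fit(X, y)

    p_hat = model.predict_proba(X)[:, 1]   # the model's output: estimates of the Bernoulli parameter p
    labels = (p_hat >= 0.5).astype(int)    # classification = a separate thresholding step (0.5 is a convention)

The model itself only ever produces p_hat; the labels line is the separate decision step the question asks about.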






answered Apr 25 at 15:43 by gung · edited Apr 26 at 11:10
Comments:

[+1] StatsSorceress (Apr 25 at 15:55): Okay, I understand that part of the theory (thank you for that eloquent explanation!), but why can't we incorporate the classification aspect into the model? That is, why can't we find p, then find the threshold, and train the whole thing end-to-end to minimize some loss?

[+4] gung (Apr 25 at 16:02): You certainly could (@Sycorax's answer speaks to that possibility). But because that isn't what LR itself is, but rather some ad hoc augmentation, you would need to code up the full optimization scheme yourself. Note, BTW, that Frank Harrell has pointed out that process will lead to what might be considered an inferior model by many standards.

[+1] StatsSorceress (Apr 25 at 16:29): Hmm. I read the accepted answer in the related question here, and I agree with it in theory, but sometimes in machine learning classification applications we don't care about the relative error types; we just care about "correct classification". In that case, could you train end-to-end as I describe?

[+4] gung (Apr 25 at 18:13): As I said, you very much can set up your own custom optimization that will train the model & select the threshold simultaneously. You just have to do it yourself, & the final model is likely to be poorer by most standards.

[+1] Wayne (Apr 26 at 12:10): @StatsSorceress "... sometimes in machine learning classification ...". There should be a big emphasis on sometimes. It's hard to imagine a project where accuracy is the correct answer. In my experience, it always involves precision and recall of a minority class.
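The "code up the full optimization scheme yourself" route from the comments might look like the sketch below: smooth the cutoff step so the objective is approximately differentiable, then fit the weights and the threshold jointly. All details here (data, smoothing constant k, optimizer choice) are illustrative assumptions, not a recommended procedure.

    # Illustrative only: jointly fit logistic weights and a decision threshold
    # by minimizing a smoothed misclassification rate.
    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import expit  # numerically stable sigmoid

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(float)
    X1 = np.column_stack([np.ones(len(X)), X])       # add an intercept column

    def smoothed_error(params, k=25.0):
        """Soft 0-1 loss; params = (weights..., threshold on the logit scale)."""
        w, t = params[:-1], params[-1]
        s = expit(k * (X1 @ w - t))                  # soft "classified positive" indicator
        return np.mean(y * (1 - s) + (1 - y) * s)    # penalize soft misclassifications

    res = minimize(smoothed_error, x0=np.zeros(X1.shape[1] + 1), method="Nelder-Mead")
    w_hat, t_hat = res.x[:-1], res.x[-1]
    print("training error:", np.mean((X1 @ w_hat > t_hat) != (y > 0.5)))

Note that the intercept and the threshold are redundant here (shifting both by the same amount leaves the loss unchanged), and the smoothed 0-1 loss is nearly flat over large regions; both hint at why the comments call the result an inferior model by most standards.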


















13 votes
          It's because the optimal threshold is not only a function of the true positive rate (TPR), the false positive rate (FPR), accuracy or whatever else. The other crucial ingredient is the cost and the payoff of correct and wrong decisions.



If your target is a common cold, your response to a positive test is to prescribe two aspirin, and the cost of leaving a true positive untreated is an unnecessary two days' worth of headaches, then your optimal decision (not classification!) threshold is quite different from the one you'd use if your target is some life-threatening disease and your decision is (a) some comparatively simple procedure like an appendectomy, or (b) a major intervention like months of chemotherapy! And note that although your target variable may be binary (sick/healthy), your decisions may take more values (send home with two aspirin / run more tests / admit to hospital and watch / operate immediately).



          Bottom line: if you know your cost structure and all the different decisions, you can certainly train a decision support system (DSS) directly, which includes a probabilistic classification or prediction. I would, however, strongly argue that discretizing predictions or classifications via thresholds is not the right way to go about this.



          See also my answer to the earlier "Classification probability threshold" thread. Or this answer of mine. Or that one.
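For the simplest two-action case, the cost structure pins the threshold down directly; a minimal sketch (the costs are invented): with cost $c_{FP}$ for acting on a false positive and $c_{FN}$ for missing a true positive, expected cost is minimized by acting whenever the predicted probability exceeds $c_{FP}/(c_{FP}+c_{FN})$.

    # Cost-based decision rule: act iff the expected cost of acting is lower.
    # Expected cost of acting     = c_fp * (1 - p)   (incurred if actually negative)
    # Expected cost of not acting = c_fn * p         (incurred if actually positive)
    # so act when p > c_fp / (c_fp + c_fn). The costs below are invented.
    def decision_threshold(c_fp, c_fn):
        return c_fp / (c_fp + c_fn)

    print(decision_threshold(c_fp=1.0, c_fn=1.0))    # symmetric costs  -> 0.50
    print(decision_threshold(c_fp=1.0, c_fn=99.0))   # a miss is 99x worse -> 0.01

This is exactly the aspirin-versus-chemotherapy contrast above: the threshold is a property of the costs and decisions, not of the probability model.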






answered Apr 25 at 16:08 by Stephan Kolassa
4 votes
            Philosophical concerns aside, this would cause computational difficulties.



The reason is that functions with continuous output are relatively easy to optimize: you look for the direction in which the function increases and move that way. If we alter our loss function to include the "cutoff" step, our output becomes discrete, and so does our loss. Now when we alter the parameters of our logistic function by "a little bit" and jointly alter the cutoff value by "a little bit", our loss returns an identical value, so there is no gradient to follow and optimization becomes difficult. It's not impossible, of course (there's a whole field of study in discrete optimization), but continuous optimization is by far the easier problem when you are optimizing many parameters. Conveniently, once the logistic model has been fit, finding the optimal cutoff, though still a discrete-output problem, involves only one variable, and we can just do a grid search (or some such), which is entirely viable in one variable.
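A minimal sketch of that final one-variable grid search (scikit-learn, synthetic data, and the accuracy criterion are illustrative assumptions):

    # After the continuous optimization (the model fit), the cutoff is a single
    # scalar, so an exhaustive grid search is cheap.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=400, random_state=2)
    p = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

    grid = np.linspace(0.01, 0.99, 99)               # 1-D grid of candidate cutoffs
    acc = [np.mean((p >= t) == y) for t in grid]     # discrete objective, but only one variable
    print("best cutoff:", grid[int(np.argmax(acc))], "accuracy:", max(acc))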






answered Apr 26 at 15:20 by Scott
3 votes
              Regardless of the underlying model, we can work out the sampling distributions of TPR and FPR at a threshold. This implies that we can characterize the variability in TPR and FPR at some threshold, and we can back into a desired error rate trade-off.



A ROC curve is a little deceptive: the only thing you control is the threshold, yet the plot displays TPR and FPR, which are functions of that threshold. Moreover, TPR and FPR are both statistics, so they are subject to the vagaries of random sampling. This implies that if you were to repeat the procedure (say, by cross-validation), you could come up with a different FPR and TPR at the same threshold value.



              However, if we can estimate the variability in the TPR and FPR, then repeating the ROC procedure is not necessary. We just pick a threshold such that the endpoints of a confidence interval (with some width) are acceptable. That is, pick the model so that the FPR is plausibly below some researcher-specified maximum, and/or the TPR is plausibly above some researcher-specified minimum. If your model can't attain your targets, you'll have to build a better model.



              Of course, what TPR and FPR values are tolerable in your usage will be context-dependent.



For more information, see ROC Curves for Continuous Data by Wojtek J. Krzanowski and David J. Hand.
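A bootstrap percentile interval is one simple way to estimate that variability at a fixed threshold; a minimal sketch (synthetic data; the bootstrap is an illustrative choice, not necessarily the book's method):

    # Bootstrap the sampling variability of TPR and FPR at one fixed threshold.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, random_state=3)
    p = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
    t = 0.5                                     # the threshold under study

    rng = np.random.default_rng(3)
    tprs, fprs = [], []
    for _ in range(1000):
        i = rng.integers(0, len(y), len(y))     # resample cases with replacement
        yp, yb = p[i] >= t, y[i].astype(bool)
        tprs.append(np.mean(yp[yb]))            # TPR = P(predict + | actual +)
        fprs.append(np.mean(yp[~yb]))           # FPR = P(predict + | actual -)

    print("TPR 95% CI:", np.percentile(tprs, [2.5, 97.5]))
    print("FPR 95% CI:", np.percentile(fprs, [2.5, 97.5]))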






answered by Sycorax
Comments:

StatsSorceress (Apr 25 at 15:51): This doesn't really answer my question, but it's a very nice description of ROC curves.

Sycorax (Apr 25 at 15:52): In what way does this not answer your question? What is your question, if not asking about how to choose a threshold for classification?

[+2] Sycorax (Apr 25 at 15:56): I'm not aware of any statistical procedure that works that way. Why is this square wheel a good idea? What problem does it solve?

[+1] Sycorax (Apr 25 at 16:29): "How do I choose a threshold in a way that reduces training time?" seems like a very different question from the one in your original post.

[+1] Sycorax (Apr 25 at 16:32): Regardless, I don't see how this saves time. Making an ROC curve is not the most expensive part of estimating a model, so moving threshold choice into the optimization step seems ad hoc and unnecessary.
-2 votes
Usually in biomedical research, we don't use a training set; we just apply logistic regression to the full dataset to see which predictors are significant risk factors for the outcome we're looking at, or to look at one predictor of interest while controlling for the effect of other possible predictors on the outcome.

I'm not sure quite what you mean by threshold values, but there are various parameters that one may seek to optimize: AUC, cutoff values for dichotomizing a continuous predictor variable, positive and negative predictive values, confidence intervals and p-values, false positive and false negative rates.

Logistic regression looks at a population of subjects and assesses the strength and causal direction of risk factors that contribute to the outcome of interest in that population. It's also possible to "run it in reverse," so to speak, and determine an individual's risk of the outcome given the risk factors that individual has. Logistic regression assigns each individual a risk of the outcome based on their individual risk factors, and by default the cutoff is 0.5. If a subject's probability of having the outcome (based on all the data and subjects in your model) is 0.5 or above, it predicts they will have the outcome; if below 0.5, it predicts they won't. But you can adjust this cutoff, for example to flag more individuals who might be at risk of the outcome, albeit at the price of more false positives being predicted by the model. You can adjust the cutoff to optimize screening decisions, for example to predict which individuals should be advised to have further medical follow-up, and to construct the positive predictive value, negative predictive value, and false negative and false positive rates for a screening test based on the logistic regression model.

You can develop the model on half your dataset and test it on the other half, but you don't really have to (doing so cuts your 'training' data in half and thus reduces the power to find significant predictors in the model). So yes, you can 'train the whole thing end to end'. Of course, in biomedical research you would want to validate the model on another population or another dataset before claiming your results generalize more widely. Another approach is bootstrapping: run your model on a subsample of your study population, replace those subjects back into the pool, and repeat with another sample, many times (typically 1000 times). If you get significant results a prescribed majority of the time (e.g. 95% of the time), the model can be deemed validated, at least on your own data. But again, the smaller the study population you run your model on, the less likely it is that some predictors will be statistically significant risk factors for the outcome. This is especially true for biomedical studies with limited numbers of participants.

Using half of your data to 'train' the model and then 'validating' it on the other half is an unnecessary burden. You don't do that for t-tests or linear regression, so why do it for logistic regression? The most it will do is let you say 'yes, it works'; if you use your full dataset, you determine that anyway. Breaking your data into smaller datasets runs the risk of failing to detect significant risk factors in the study population (or the validation population) when they are in fact present, due to small sample size, having too many predictors for your study size, and the possibility that the 'validation sample' shows no associations just by chance. The logic behind the 'train then validate' approach seems to be that if the risk factors you identify as significant aren't strong enough, they won't be statistically significant when modeled on some randomly chosen half of your data. But that randomly chosen sample might show no association just by chance, or because it is too small for the risk factor(s) to reach statistical significance. It is the magnitude of the risk factor(s) AND their statistical significance that determine their importance, and for that reason it's best to build your model on the full dataset. Statistical significance weakens with smaller sample sizes, as it does with most statistical tests.

Doing logistic regression is an art almost as much as a statistical science. There are different approaches to use and different parameters to optimize depending on your study design.
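A minimal sketch of the bootstrap check described above (statsmodels assumed for the p-values; the 0.05 level, effect sizes, and data are illustrative):

    # Refit the logistic model on bootstrap resamples of the study population and
    # track how often the predictor of interest remains statistically significant.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 300
    X = sm.add_constant(rng.normal(size=(n, 2)))     # intercept + two risk factors
    p = 1.0 / (1.0 + np.exp(-(X[:, 1] - 0.5 * X[:, 2])))
    y = rng.binomial(1, p)

    B, hits = 1000, 0
    for _ in range(B):
        idx = rng.integers(0, n, n)                  # resample subjects with replacement
        fit = sm.Logit(y[idx], X[idx]).fit(disp=0)
        hits += fit.pvalues[1] < 0.05                # is risk factor 1 significant here?
    print(f"significant in {100 * hits / B:.1f}% of {B} resamples")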






              share|cite|improve this answer










              $endgroup$
















                Your Answer








                StackExchange.ready(function()
                var channelOptions =
                tags: "".split(" "),
                id: "65"
                ;
                initTagRenderer("".split(" "), "".split(" "), channelOptions);

                StackExchange.using("externalEditor", function()
                // Have to fire editor after snippets, if snippets enabled
                if (StackExchange.settings.snippets.snippetsEnabled)
                StackExchange.using("snippets", function()
                createEditor();
                );

                else
                createEditor();

                );

                function createEditor()
                StackExchange.prepareEditor(
                heartbeatType: 'answer',
                autoActivateHeartbeat: false,
                convertImagesToLinks: false,
                noModals: true,
                showLowRepImageUploadWarning: true,
                reputationToPostImages: null,
                bindNavPrevention: true,
                postfix: "",
                imageUploader:
                brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
                contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/4.0/"u003ecc by-sa 4.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
                allowUrls: true
                ,
                onDemand: true,
                discardSelector: ".discard-answer"
                ,immediatelyShowMarkdownHelp:true
                );



                );














                draft saved

                draft discarded
















                StackExchange.ready(
                function ()
                StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f405041%2fphilosophical-question-on-logistic-regression-why-isnt-the-optimal-threshold-v%23new-answer', 'question_page');

                );

                Post as a guest















                Required, but never shown


























                5 Answers
                5






                active

                oldest

                votes








                5 Answers
                5






                active

                oldest

                votes









                active

                oldest

                votes






                active

                oldest

                votes









                18
















                $begingroup$

                A threshold isn't trained with the model because logistic regression isn't a classifier (cf., Why isn't Logistic Regression called Logistic Classification?). It is a model to estimate the parameter, $p$, that governs the behavior of the Bernoulli distribution. That is, you are assuming that the response distribution, conditional on the covariates, is Bernoulli, and so you want to estimate how the parameter that controls that variable changes as a function of the covariates. It is a direct probability model only. Of course, it can be used as a classifier subsequently, and sometimes is in certain contexts, but it is still a probability model.






                share|cite|improve this answer












                $endgroup$










                • 1




                  $begingroup$
                  Okay, I understand that part of the theory (thank you for that eloquent explanation!) but why can't we incorporate the classification aspect into the model? That is, why can't we find p, then find the threshold, and train the whole thing end-to-end to minimize some loss?
                  $endgroup$
                  – StatsSorceress
                  Apr 25 at 15:55






                • 4




                  $begingroup$
                  You certainly could (@Sycorax's answer speaks to that possibility). But because that isn't what LR itself is, but rather some ad hoc augmentation, you would need to code up the full optimization scheme yourself. Note BTW, that Frank Harrell has pointed out that process will lead to what might be considered an inferior model by many standards.
                  $endgroup$
                  – gung
                  Apr 25 at 16:02







                • 1




                  $begingroup$
                  Hmm. I read the accepted answer in the related question here, and I agree with it in theory, but sometimes in machine learning classification applications we don't care about the relative error types, we just care about "correct classification". In that case, could you train end-to-end as I describe?
                  $endgroup$
                  – StatsSorceress
                  Apr 25 at 16:29






                • 4




                  $begingroup$
                  As I said, you very much can set up your own custom optimization that will train the model & select the threshold simultaneously. You just have to do it yourself & the final model is likely to be poorer by most standards.
                  $endgroup$
                  – gung
                  Apr 25 at 18:13






                • 1




                  $begingroup$
                  @StatsSorceress "... sometimes in machine learning classification ...". There should be a big emphasis on sometimes. It's hard to imagine a project where accuracy is the correct answer. In my experience, it always involves precision and recall of a minority class.
                  $endgroup$
                  – Wayne
                  Apr 26 at 12:10















                18
















                $begingroup$

                A threshold isn't trained with the model because logistic regression isn't a classifier (cf., Why isn't Logistic Regression called Logistic Classification?). It is a model to estimate the parameter, $p$, that governs the behavior of the Bernoulli distribution. That is, you are assuming that the response distribution, conditional on the covariates, is Bernoulli, and so you want to estimate how the parameter that controls that variable changes as a function of the covariates. It is a direct probability model only. Of course, it can be used as a classifier subsequently, and sometimes is in certain contexts, but it is still a probability model.






                share|cite|improve this answer












                $endgroup$










                • 1




                  $begingroup$
                  Okay, I understand that part of the theory (thank you for that eloquent explanation!) but why can't we incorporate the classification aspect into the model? That is, why can't we find p, then find the threshold, and train the whole thing end-to-end to minimize some loss?
                  $endgroup$
                  – StatsSorceress
                  Apr 25 at 15:55






                • 4




                  $begingroup$
                  You certainly could (@Sycorax's answer speaks to that possibility). But because that isn't what LR itself is, but rather some ad hoc augmentation, you would need to code up the full optimization scheme yourself. Note BTW, that Frank Harrell has pointed out that process will lead to what might be considered an inferior model by many standards.
                  $endgroup$
                  – gung
                  Apr 25 at 16:02







                • 1




                  $begingroup$
                  Hmm. I read the accepted answer in the related question here, and I agree with it in theory, but sometimes in machine learning classification applications we don't care about the relative error types, we just care about "correct classification". In that case, could you train end-to-end as I describe?
                  $endgroup$
                  – StatsSorceress
                  Apr 25 at 16:29






                • 4




                  $begingroup$
                  As I said, you very much can set up your own custom optimization that will train the model & select the threshold simultaneously. You just have to do it yourself & the final model is likely to be poorer by most standards.
                  $endgroup$
                  – gung
                  Apr 25 at 18:13






                • 1




                  $begingroup$
                  @StatsSorceress "... sometimes in machine learning classification ...". There should be a big emphasis on sometimes. It's hard to imagine a project where accuracy is the correct answer. In my experience, it always involves precision and recall of a minority class.
                  $endgroup$
                  – Wayne
                  Apr 26 at 12:10













                18














                18










                18







                $begingroup$

                A threshold isn't trained with the model because logistic regression isn't a classifier (cf., Why isn't Logistic Regression called Logistic Classification?). It is a model to estimate the parameter, $p$, that governs the behavior of the Bernoulli distribution. That is, you are assuming that the response distribution, conditional on the covariates, is Bernoulli, and so you want to estimate how the parameter that controls that variable changes as a function of the covariates. It is a direct probability model only. Of course, it can be used as a classifier subsequently, and sometimes is in certain contexts, but it is still a probability model.






                share|cite|improve this answer












                $endgroup$



                A threshold isn't trained with the model because logistic regression isn't a classifier (cf., Why isn't Logistic Regression called Logistic Classification?). It is a model to estimate the parameter, $p$, that governs the behavior of the Bernoulli distribution. That is, you are assuming that the response distribution, conditional on the covariates, is Bernoulli, and so you want to estimate how the parameter that controls that variable changes as a function of the covariates. It is a direct probability model only. Of course, it can be used as a classifier subsequently, and sometimes is in certain contexts, but it is still a probability model.







                share|cite|improve this answer















                share|cite|improve this answer




                share|cite|improve this answer



                share|cite|improve this answer








                edited Apr 26 at 11:10

























                answered Apr 25 at 15:43









                gunggung

                114k34 gold badges284 silver badges557 bronze badges




                114k34 gold badges284 silver badges557 bronze badges










                • 1




                  $begingroup$
                  Okay, I understand that part of the theory (thank you for that eloquent explanation!) but why can't we incorporate the classification aspect into the model? That is, why can't we find p, then find the threshold, and train the whole thing end-to-end to minimize some loss?
                  $endgroup$
                  – StatsSorceress
                  Apr 25 at 15:55






                • 4




                  $begingroup$
                  You certainly could (@Sycorax's answer speaks to that possibility). But because that isn't what LR itself is, but rather some ad hoc augmentation, you would need to code up the full optimization scheme yourself. Note BTW, that Frank Harrell has pointed out that process will lead to what might be considered an inferior model by many standards.
                  $endgroup$
                  – gung
                  Apr 25 at 16:02







                • 1




                  $begingroup$
                  Hmm. I read the accepted answer in the related question here, and I agree with it in theory, but sometimes in machine learning classification applications we don't care about the relative error types, we just care about "correct classification". In that case, could you train end-to-end as I describe?
                  $endgroup$
                  – StatsSorceress
                  Apr 25 at 16:29






                • 4




                  $begingroup$
                  As I said, you very much can set up your own custom optimization that will train the model & select the threshold simultaneously. You just have to do it yourself & the final model is likely to be poorer by most standards.
                  $endgroup$
                  – gung
                  Apr 25 at 18:13






                • 1




                  $begingroup$
                  @StatsSorceress "... sometimes in machine learning classification ...". There should be a big emphasis on sometimes. It's hard to imagine a project where accuracy is the correct answer. In my experience, it always involves precision and recall of a minority class.
                  $endgroup$
                  – Wayne
                  Apr 26 at 12:10












                • 1




                  $begingroup$
                  Okay, I understand that part of the theory (thank you for that eloquent explanation!) but why can't we incorporate the classification aspect into the model? That is, why can't we find p, then find the threshold, and train the whole thing end-to-end to minimize some loss?
                  $endgroup$
                  – StatsSorceress
                  Apr 25 at 15:55






                • 4




                  $begingroup$
                  You certainly could (@Sycorax's answer speaks to that possibility). But because that isn't what LR itself is, but rather some ad hoc augmentation, you would need to code up the full optimization scheme yourself. Note BTW, that Frank Harrell has pointed out that process will lead to what might be considered an inferior model by many standards.
                  $endgroup$
                  – gung
                  Apr 25 at 16:02







                • 1




                  $begingroup$
                  Hmm. I read the accepted answer in the related question here, and I agree with it in theory, but sometimes in machine learning classification applications we don't care about the relative error types, we just care about "correct classification". In that case, could you train end-to-end as I describe?
                  $endgroup$
                  – StatsSorceress
                  Apr 25 at 16:29






                • 4




                  $begingroup$
                  As I said, you very much can set up your own custom optimization that will train the model & select the threshold simultaneously. You just have to do it yourself & the final model is likely to be poorer by most standards.
                  $endgroup$
                  – gung
                  Apr 25 at 18:13






                • 1




                  $begingroup$
                  @StatsSorceress "... sometimes in machine learning classification ...". There should be a big emphasis on sometimes. It's hard to imagine a project where accuracy is the correct answer. In my experience, it always involves precision and recall of a minority class.
                  $endgroup$
                  – Wayne
                  Apr 26 at 12:10







                1




                1




                $begingroup$
                Okay, I understand that part of the theory (thank you for that eloquent explanation!) but why can't we incorporate the classification aspect into the model? That is, why can't we find p, then find the threshold, and train the whole thing end-to-end to minimize some loss?
                $endgroup$
                – StatsSorceress
                Apr 25 at 15:55




                $begingroup$
                Okay, I understand that part of the theory (thank you for that eloquent explanation!) but why can't we incorporate the classification aspect into the model? That is, why can't we find p, then find the threshold, and train the whole thing end-to-end to minimize some loss?
                $endgroup$
                – StatsSorceress
                Apr 25 at 15:55




                4




                4




                $begingroup$
                You certainly could (@Sycorax's answer speaks to that possibility). But because that isn't what LR itself is, but rather some ad hoc augmentation, you would need to code up the full optimization scheme yourself. Note BTW, that Frank Harrell has pointed out that process will lead to what might be considered an inferior model by many standards.
                $endgroup$
                – gung
                Apr 25 at 16:02





                $begingroup$
                You certainly could (@Sycorax's answer speaks to that possibility). But because that isn't what LR itself is, but rather some ad hoc augmentation, you would need to code up the full optimization scheme yourself. Note BTW, that Frank Harrell has pointed out that process will lead to what might be considered an inferior model by many standards.
                $endgroup$
                – gung
                Apr 25 at 16:02





                1




                1




                $begingroup$
                Hmm. I read the accepted answer in the related question here, and I agree with it in theory, but sometimes in machine learning classification applications we don't care about the relative error types, we just care about "correct classification". In that case, could you train end-to-end as I describe?
                $endgroup$
                – StatsSorceress
                Apr 25 at 16:29




                $begingroup$
                Hmm. I read the accepted answer in the related question here, and I agree with it in theory, but sometimes in machine learning classification applications we don't care about the relative error types, we just care about "correct classification". In that case, could you train end-to-end as I describe?
                $endgroup$
                – StatsSorceress
                Apr 25 at 16:29




                4




                4




                $begingroup$
                As I said, you very much can set up your own custom optimization that will train the model & select the threshold simultaneously. You just have to do it yourself & the final model is likely to be poorer by most standards.
                $endgroup$
                – gung
                Apr 25 at 18:13




                $begingroup$
                As I said, you very much can set up your own custom optimization that will train the model & select the threshold simultaneously. You just have to do it yourself & the final model is likely to be poorer by most standards.
                $endgroup$
                – gung
                Apr 25 at 18:13




                1




                1




                $begingroup$
                @StatsSorceress "... sometimes in machine learning classification ...". There should be a big emphasis on sometimes. It's hard to imagine a project where accuracy is the correct answer. In my experience, it always involves precision and recall of a minority class.
                $endgroup$
                – Wayne
                Apr 26 at 12:10




                $begingroup$
                @StatsSorceress "... sometimes in machine learning classification ...". There should be a big emphasis on sometimes. It's hard to imagine a project where accuracy is the correct answer. In my experience, it always involves precision and recall of a minority class.
                $endgroup$
                – Wayne
                Apr 26 at 12:10













                13
















                $begingroup$

                It's because the optimal threshold is not only a function of the true positive rate (TPR), the false positive rate (FPR), accuracy or whatever else. The other crucial ingredient is the cost and the payoff of correct and wrong decisions.



                If your target is a common cold, your response to a positive test is to prescribe two aspirin, and the cost of a true untreated positive is an unnecessary two days' worth of headaches, then your optimal decision (not classification!) threshold is quite different than if your target is some life-threatening disease, and your decision is (a) some comparatively simple procedure like an appendectomy, or (b) a major intervention like months of chemotherapy! And note that although your target variable may be binary (sick/healthy), your decisions may have more values (send home with two aspirin/run more tests/admit to hospital and watch/operate immediately).



                Bottom line: if you know your cost structure and all the different decisions, you can certainly train a decision support system (DSS) directly, which includes a probabilistic classification or prediction. I would, however, strongly argue that discretizing predictions or classifications via thresholds is not the right way to go about this.



                See also my answer to the earlier "Classification probability threshold" thread. Or this answer of mine. Or that one.






                share|cite|improve this answer










                $endgroup$



















                  13
















                  $begingroup$

                  It's because the optimal threshold is not only a function of the true positive rate (TPR), the false positive rate (FPR), accuracy or whatever else. The other crucial ingredient is the cost and the payoff of correct and wrong decisions.



                  If your target is a common cold, your response to a positive test is to prescribe two aspirin, and the cost of a true untreated positive is an unnecessary two days' worth of headaches, then your optimal decision (not classification!) threshold is quite different than if your target is some life-threatening disease, and your decision is (a) some comparatively simple procedure like an appendectomy, or (b) a major intervention like months of chemotherapy! And note that although your target variable may be binary (sick/healthy), your decisions may have more values (send home with two aspirin/run more tests/admit to hospital and watch/operate immediately).



                  Bottom line: if you know your cost structure and all the different decisions, you can certainly train a decision support system (DSS) directly, which includes a probabilistic classification or prediction. I would, however, strongly argue that discretizing predictions or classifications via thresholds is not the right way to go about this.



                  See also my answer to the earlier "Classification probability threshold" thread. Or this answer of mine. Or that one.






                  share|cite|improve this answer










                  $endgroup$

















                    13














                    13










                    13







                    $begingroup$

                    It's because the optimal threshold is not only a function of the true positive rate (TPR), the false positive rate (FPR), accuracy or whatever else. The other crucial ingredient is the cost and the payoff of correct and wrong decisions.



                    If your target is a common cold, your response to a positive test is to prescribe two aspirin, and the cost of a true untreated positive is an unnecessary two days' worth of headaches, then your optimal decision (not classification!) threshold is quite different than if your target is some life-threatening disease, and your decision is (a) some comparatively simple procedure like an appendectomy, or (b) a major intervention like months of chemotherapy! And note that although your target variable may be binary (sick/healthy), your decisions may have more values (send home with two aspirin/run more tests/admit to hospital and watch/operate immediately).



                    Bottom line: if you know your cost structure and all the different decisions, you can certainly train a decision support system (DSS) directly, which includes a probabilistic classification or prediction. I would, however, strongly argue that discretizing predictions or classifications via thresholds is not the right way to go about this.



                    See also my answer to the earlier "Classification probability threshold" thread. Or this answer of mine. Or that one.






                    share|cite|improve this answer










                    $endgroup$



                    It's because the optimal threshold is not only a function of the true positive rate (TPR), the false positive rate (FPR), accuracy or whatever else. The other crucial ingredient is the cost and the payoff of correct and wrong decisions.



                    If your target is a common cold, your response to a positive test is to prescribe two aspirin, and the cost of a true untreated positive is an unnecessary two days' worth of headaches, then your optimal decision (not classification!) threshold is quite different than if your target is some life-threatening disease, and your decision is (a) some comparatively simple procedure like an appendectomy, or (b) a major intervention like months of chemotherapy! And note that although your target variable may be binary (sick/healthy), your decisions may have more values (send home with two aspirin/run more tests/admit to hospital and watch/operate immediately).



                    Bottom line: if you know your cost structure and all the different decisions, you can certainly train a decision support system (DSS) directly, which includes a probabilistic classification or prediction. I would, however, strongly argue that discretizing predictions or classifications via thresholds is not the right way to go about this.



                    See also my answer to the earlier "Classification probability threshold" thread. Or this answer of mine. Or that one.







                    share|cite|improve this answer













                    share|cite|improve this answer




                    share|cite|improve this answer



                    share|cite|improve this answer










                    answered Apr 25 at 16:08









                    Stephan KolassaStephan Kolassa

                    57.7k10 gold badges113 silver badges213 bronze badges




                    57.7k10 gold badges113 silver badges213 bronze badges
























                        4
















                        $begingroup$

                        Philosophical concerns aside, this would cause computational difficulties.



                        The reason why is that functions with continuous output are relatively easy to optimize. You look for the direction where the function increases, and then go that way. If we alter our loss function to include the "cutoff" step, our output becomes discrete, and our loss function is therefore also discrete. Now when we alter the parameters of our logistic function by "a little bit" and jointly alter the cutoff value by "a little bit", our loss gives an identical value, and optimization becomes difficult. Of course, it's not impossible (There's a whole field of study in discrete optimization) but continuous optimization is by far the easier problem to solve when you are optimizing many parameters. Conveniently, once the logistic model has been fit, finding the optimal cutoff, though still a discrete output problem, is now only in one variable, and we can just do a grid search, or some such, which is totally viable in one variable.






                        share|cite|improve this answer










                        $endgroup$



















                          4
















                          $begingroup$

                          Philosophical concerns aside, this would cause computational difficulties.



                          The reason why is that functions with continuous output are relatively easy to optimize. You look for the direction where the function increases, and then go that way. If we alter our loss function to include the "cutoff" step, our output becomes discrete, and our loss function is therefore also discrete. Now when we alter the parameters of our logistic function by "a little bit" and jointly alter the cutoff value by "a little bit", our loss gives an identical value, and optimization becomes difficult. Of course, it's not impossible (There's a whole field of study in discrete optimization) but continuous optimization is by far the easier problem to solve when you are optimizing many parameters. Conveniently, once the logistic model has been fit, finding the optimal cutoff, though still a discrete output problem, is now only in one variable, and we can just do a grid search, or some such, which is totally viable in one variable.






                          share|cite|improve this answer










                          $endgroup$

















                            4














                            4










                            4







                            $begingroup$

                            Philosophical concerns aside, this would cause computational difficulties.



                            The reason why is that functions with continuous output are relatively easy to optimize. You look for the direction where the function increases, and then go that way. If we alter our loss function to include the "cutoff" step, our output becomes discrete, and our loss function is therefore also discrete. Now when we alter the parameters of our logistic function by "a little bit" and jointly alter the cutoff value by "a little bit", our loss gives an identical value, and optimization becomes difficult. Of course, it's not impossible (There's a whole field of study in discrete optimization) but continuous optimization is by far the easier problem to solve when you are optimizing many parameters. Conveniently, once the logistic model has been fit, finding the optimal cutoff, though still a discrete output problem, is now only in one variable, and we can just do a grid search, or some such, which is totally viable in one variable.






                            share|cite|improve this answer










                            $endgroup$



                            Philosophical concerns aside, this would cause computational difficulties.



                            The reason why is that functions with continuous output are relatively easy to optimize. You look for the direction where the function increases, and then go that way. If we alter our loss function to include the "cutoff" step, our output becomes discrete, and our loss function is therefore also discrete. Now when we alter the parameters of our logistic function by "a little bit" and jointly alter the cutoff value by "a little bit", our loss gives an identical value, and optimization becomes difficult. Of course, it's not impossible (There's a whole field of study in discrete optimization) but continuous optimization is by far the easier problem to solve when you are optimizing many parameters. Conveniently, once the logistic model has been fit, finding the optimal cutoff, though still a discrete output problem, is now only in one variable, and we can just do a grid search, or some such, which is totally viable in one variable.







                            share|cite|improve this answer













                            share|cite|improve this answer




                            share|cite|improve this answer



                            share|cite|improve this answer










                            answered Apr 26 at 15:20









                            ScottScott

                            5152 silver badges16 bronze badges




                            5152 silver badges16 bronze badges
























                                3
















                                $begingroup$

                                Regardless of the underlying model, we can work out the sampling distributions of TPR and FPR at a threshold. This implies that we can characterize the variability in TPR and FPR at some threshold, and we can back into a desired error rate trade-off.



                                A ROC curve is a little bit deceptive because the only thing that you control is the threshold, however the plot displays TPR and FPR, which are functions of the threshold. Moreover, the TPR and FPR are both statistics, so they are subject to the vagaries of random sampling. This implies that if you were to repeat the procedure (say by cross-validation), you could come up with a different FPR and TPR at some specific threshold value.



                                However, if we can estimate the variability in the TPR and FPR, then repeating the ROC procedure is not necessary. We just pick a threshold such that the endpoints of a confidence interval (with some width) are acceptable. That is, pick the model so that the FPR is plausibly below some researcher-specified maximum, and/or the TPR is plausibly above some researcher-specified minimum. If your model can't attain your targets, you'll have to build a better model.



                                Of course, what TPR and FPR values are tolerable in your usage will be context-dependent.



                                For more information, see ROC Curves for Continuous Data
                                by Wojtek J. Krzanowski and David J. Hand.






                                share|cite|improve this answer












                                $endgroup$














                                • $begingroup$
                                  This doesn't really answer my question, but it's a very nice description of ROC curves.
                                  $endgroup$
                                  – StatsSorceress
                                  Apr 25 at 15:51










                                • $begingroup$
                                  In what way does this not answer your question? What is your question, if not asking about how to choose a threshold for classification?
                                  $endgroup$
                                  – Sycorax
                                  Apr 25 at 15:52






                                • 2




                                  $begingroup$
                                  I'm not aware of any statistical procedure that works that way. Why is this square wheel a good idea? What problem does it solve?
                                  $endgroup$
                                  – Sycorax
                                  Apr 25 at 15:56






                                • 1




                                  $begingroup$
                                  "How do I choose a threshold in a way that reduces training time?" seems like a very different question from the one in your original post.
                                  $endgroup$
                                  – Sycorax
                                  Apr 25 at 16:29






                                • 1




                                  $begingroup$
                                  Regardless, I don't see how this saves time. Making an ROC curve is not the most expensive part of estimating a model, so moving threshold choice into the optimization step seems ad hoc and unnecessary.
                                  $endgroup$
                                  – Sycorax
                                  Apr 25 at 16:32















Usually in biomedical research, we don't use a training set---we just apply logistic regression to the full dataset to see which predictors are significant risk factors for the outcome we're looking at, or to look at one predictor of interest while controlling for the effect of other possible predictors on the outcome.

I'm not sure quite what you mean by threshold values, but there are various parameters one may seek to optimize: AUC, cutoff values for dichotomizing a continuous predictor variable, positive and negative predictive values, confidence intervals and p-values, and false positive and false negative rates.

Logistic regression looks at a population of subjects and assesses the strength and direction of association of risk factors with the outcome of interest in that population. It is also possible to "run it in reverse," so to speak, and estimate an individual's risk of the outcome given the risk factors that the individual has. The model assigns each individual a probability of the outcome based on their individual risk factors, and the default cutoff is 0.5: if a subject's predicted probability of the outcome (based on all the data and subjects in your model) is 0.5 or above, the model predicts they will have the outcome; if below 0.5, it predicts they won't. But you can adjust this cutoff, for example to flag more individuals who might be at risk of the outcome, albeit at the price of more false positives predicted by the model. You can adjust the cutoff to optimize screening decisions (for instance, to predict which individuals should be advised to have further medical follow-up) and to construct the positive predictive value, negative predictive value, and false negative and false positive rates of a screening test based on the logistic regression model.

You can develop the model on half your dataset and test it on the other half, but you don't really have to (doing so cuts your 'training' data in half and thus reduces the power to find significant predictors in the model). So yes, you can 'train the whole thing end to end'. Of course, in biomedical research, you would want to validate the model on another population and another dataset before saying your results generalize to a wider population. Another approach is bootstrap-style: run your model on a subsample of the study population, replace those subjects back into the pool, and repeat with another sample, many times (typically 1000 times). If you get significant results a prescribed majority of the time (e.g., 95% of the time), then your model can be deemed validated---at least on your own data. But again, the smaller the sample you run your model on, the less likely it is that some predictors will be statistically significant risk factors for the outcome; this is especially true for biomedical studies with limited numbers of participants.
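As an illustration of the cutoff adjustment just described, here is a short sketch (hypothetical code, not part of the original answer) that tabulates sensitivity, specificity, and the predictive values of a fitted model at several cutoffs; lowering the cutoff flags more at-risk individuals at the price of more false positives:

import numpy as np

def screening_table(y, p, cutoffs=(0.3, 0.5, 0.7)):
    """Summarize the screening rule 'predict the outcome when p >= cutoff'
    for 0/1 outcomes y and predicted probabilities p (NumPy arrays)."""
    for c in cutoffs:
        pred = p >= c
        tp = np.sum(pred & (y == 1)); fp = np.sum(pred & (y == 0))
        fn = np.sum(~pred & (y == 1)); tn = np.sum(~pred & (y == 0))
        ppv = tp / (tp + fp) if tp + fp else float("nan")  # positive predictive value
        npv = tn / (tn + fn) if tn + fn else float("nan")  # negative predictive value
        print(f"cutoff={c:.2f}  sensitivity={tp / (tp + fn):.2f}  "
              f"specificity={tn / (tn + fp):.2f}  PPV={ppv:.2f}  NPV={npv:.2f}")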

Using half of your data to 'train' your model and then 'validating' it on the other half is an unnecessary burden. You don't do that for t-tests or linear regression, so why do it for logistic regression? The most it will do is let you say 'yes, it works', but if you use your full dataset you determine that anyway. Breaking your data into smaller datasets risks failing to detect significant risk factors in the study population (or the validation population) when they are in fact present, due to small sample size, having too many predictors for your study size, and the possibility that your 'validation sample' will show no associations just by chance. The logic behind the 'train then validate' approach seems to be that if the risk factors you identify aren't strong enough, they won't be statistically significant when modeled on some randomly chosen half of your data. But that randomly chosen sample might show no association just by chance, or because it is too small for the risk factor(s) to reach statistical significance. It is the magnitude of the risk factor(s) and their statistical significance together that determine their importance, and for that reason it's best to build your model on the full dataset. Statistical significance becomes harder to achieve with smaller sample sizes, as it does with most statistical tests.

Doing logistic regression is an art almost as much as a statistical science. There are different approaches to use and different parameters to optimize depending on your study design.
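The bootstrap-style validation described earlier in this answer might look roughly like the following sketch (illustrative only: the statsmodels formula, the column names y, age, smoker, and bmi, and the 0.05 level are assumptions; the "significant in at least 95% of resamples" rule is the answer's own criterion):

import numpy as np
import statsmodels.formula.api as smf

def bootstrap_significance(df, formula="y ~ age + smoker + bmi",
                           term="smoker", n_boot=1000, seed=0):
    """Refit the logistic model on resamples of the study population and
    report the fraction of resamples in which `term` is significant."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_boot):
        # Resample subjects with replacement, keeping the original sample size.
        sample = df.sample(n=len(df), replace=True,
                           random_state=int(rng.integers(1 << 31)))
        fit = smf.logit(formula, data=sample).fit(disp=0)
        hits += fit.pvalues[term] < 0.05
    return hits / n_boot  # e.g. a value >= 0.95 would count as 'validated' here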






– Jeremy (answered May 2 at 20:47; score -2)


















