How to give a higher importance to certain features in a (k-means) clustering model?































I am clustering data with numeric and categorical variables. To process the categorical variables for the cluster model, I create dummy variables. However, I feel like this results in a higher importance for these dummy variables because multiple dummy variables represent one categorical variable.



For example, I have a categorical variable Airport that will result in multiple dummy variables: LAX, JFK, MIA and BOS. Now suppose I also have a numeric Temperature variable. I also scale all variables to be between 0 and 1. Now my Airport variable seems to be 4 times more important than the Temperature variable, and the clusters will be mostly based on the Airport variable.



My problem is that I want all variables to have the same importance. Is there a way to do this? I was thinking of scaling the variables in a different way but I don't know how to scale them in order to give them the same importance.
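To make the concern concrete, here is a minimal stdlib-only sketch (the airport and temperature values are hypothetical, not from the question): after 0–1 scaling, changing the airport flips two dummy columns at once, so a single category change already contributes more to the Euclidean distance than the largest possible temperature change.

```python
import math

def euclid(p, q):
    """Plain Euclidean distance, as k-means uses."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Columns: LAX, JFK, MIA, BOS, Temperature (all scaled to [0, 1])
a = [1, 0, 0, 0, 0.0]   # LAX, coldest day
b = [0, 1, 0, 0, 0.0]   # JFK, coldest day
c = [1, 0, 0, 0, 1.0]   # LAX, hottest day

print(euclid(a, b))  # sqrt(2) ~ 1.414: one airport change flips two dummies
print(euclid(a, c))  # 1.0: the largest possible temperature change
```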










      machine-learning clustering feature-scaling dummy-variables






asked Apr 16 at 8:33 by Eva























3 Answers







































You cannot really use k-means clustering if your data contains categorical variables, since k-means uses Euclidean distance, which does not make much sense for categorical variables. Check out the answers to this similar question.



You can use the following rules for performing clustering with k-means or one of its variants:



          If your data contains only metric variables:



          Scale the data and use k-means (R) (Python).



          If your data contains only categorical variables:



          Use k-modes (R) (Python).



          If your data contains categorical and metric variables:



          Scale the metric variables and use k-prototypes (R) (Python).
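These decision rules can be sketched as a tiny helper; the function and its labels are my own paraphrase of the answer, not an official API (in Python, the k-modes and k-prototypes implementations are typically found in the third-party `kmodes` package):

```python
def pick_algorithm(col_types):
    """col_types: iterable of "metric" / "categorical", one entry per column."""
    kinds = set(col_types)
    if kinds == {"metric"}:
        return "scale all variables, then k-means"
    if kinds == {"categorical"}:
        return "k-modes"
    if kinds == {"metric", "categorical"}:
        return "scale the metric variables, then k-prototypes"
    raise ValueError("unexpected column types: %r" % sorted(kinds))

print(pick_algorithm(["metric", "metric"]))
print(pick_algorithm(["categorical", "categorical"]))
print(pick_algorithm(["metric", "categorical"]))
```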






answered Apr 16 at 9:15, edited Aug 9 at 13:05, by georg-un
















































            Clearly the objective function uses a sum over the features.



So if you want to increase the importance of a feature, scale it accordingly. If you scale a feature by 2, its squared differences grow by a factor of 4, so its weight in the objective grows by 4 as well.



However, I would just not use k-means for one-hot variables. The mean is for continuous variables; minimizing the sum of squares on a one-hot variable has weird semantics.
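The squared-growth claim is easy to check numerically; this is a stdlib sketch with made-up numbers, showing that multiplying a column by w multiplies its term in the squared-Euclidean objective by w², leaving the other terms untouched.

```python
# Up-weighting feature 0 by w = 2 multiplies its squared-difference term
# by w**2 = 4; feature 1's term is unchanged.
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

p, q = [0.2, 0.5], [0.8, 0.1]
print(sq_dist(p, q))                        # ~0.52 (0.36 from feature 0, 0.16 from feature 1)

w = 2.0
pw, qw = [w * p[0], p[1]], [w * q[0], q[1]]
print(sq_dist(pw, qw))                      # ~1.60 (4 * 0.36 from feature 0, 0.16 unchanged)
```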






answered Apr 16 at 13:34 by Anony-Mousse














































You cannot use the k-means clustering algorithm if your data contains categorical variables; k-modes is the suitable choice for clustering categorical data. However, there are several algorithms for clustering mixed data, which are actually variations or modifications of the basic ones.
Please check the following paper:

"Survey of State-of-the-Art Mixed Data Clustering Algorithms", Amir Ahmad and Shehroz Khan, 2019.






answered Apr 16 at 22:18 by Christos Karatsalos


















































