


Check reference list in pandas column using numpy vectorization




















I have a reference list



ref = ['September', 'August', 'July', 'June', 'May', 'April', 'March']


And a dataframe



import numpy as np
import pandas as pd

df = pd.DataFrame({'Month_List': [['July'], ['August'], ['July', 'June'], ['May', 'April', 'March']]})
df
            Month_List
0               [July]
1             [August]
2         [July, June]
3  [May, April, March]


I want to check which elements of the reference list are present in each row and convert the result into a binary list.



I can achieve this using apply



def convert_month_to_binary(ref, lst):
    s = pd.Series(ref)
    return s.isin(lst).astype(int).tolist()

df['Binary_Month_List'] = df['Month_List'].apply(lambda x: convert_month_to_binary(ref, x))
df

            Month_List      Binary_Month_List
0               [July]  [0, 0, 1, 0, 0, 0, 0]
1             [August]  [0, 1, 0, 0, 0, 0, 0]
2         [July, June]  [0, 0, 1, 1, 0, 0, 0]
3  [May, April, March]  [0, 0, 0, 0, 1, 1, 1]


However, apply is very slow on large datasets, so I would like to use NumPy vectorization instead. How can I improve the performance?



Extension:



I want NumPy vectorization because I now need to apply another function to each of these lists.



I tried the following, but it is very slow, and apply gives similar results:



def count_one(lst):
    index = [i for i, e in enumerate(lst) if e != 0]
    return len(index)

vfunc = np.vectorize(count_one)
df['Value'] = vfunc(df['Binary_Month_List'])














    FYI - Vectorization works on a case by case basis. There's no magic function that works in all scenarios, if you are looking for one.

    – Divakar
    Sep 27 at 15:33























python pandas numpy














asked Sep 27 at 14:11 by Hardik Gupta
edited Sep 27 at 15:23






















4 Answers



















It is better not to store lists in pandas columns this way, but it is possible with MultiLabelBinarizer, using DataFrame.reindex to add the missing categories in the order of ref. If performance is important, convert the values to a NumPy array first and then to lists:



from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df1 = pd.DataFrame(mlb.fit_transform(df['Month_List']),columns=mlb.classes_)
df['Binary_Month_List'] = df1.reindex(columns=ref, fill_value=0).values.tolist()


Or with Series.str.join, Series.str.get_dummies and reindex:



df['Binary_Month_List'] = (df['Month_List'].str.join('|')
                                           .str.get_dummies()
                                           .reindex(columns=ref, fill_value=0)
                                           .values
                                           .tolist())
print(df)
            Month_List      Binary_Month_List
0               [July]  [0, 0, 1, 0, 0, 0, 0]
1             [August]  [0, 1, 0, 0, 0, 0, 0]
2         [July, June]  [0, 0, 1, 1, 0, 0, 0]
3  [May, April, March]  [0, 0, 0, 0, 1, 1, 1]
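
To see what each step of the str.join / str.get_dummies / reindex chain contributes, the intermediate results can be inspected one at a time. This is only a minimal sketch on the sample data from the question; the names joined, dummies, binary and binary_lists are illustrative and not part of the answer. It assumes no label contains the '|' separator, since such a label would be split in two.

```python
import pandas as pd

ref = ['September', 'August', 'July', 'June', 'May', 'April', 'March']
df = pd.DataFrame({'Month_List': [['July'], ['August'], ['July', 'June'],
                                  ['May', 'April', 'March']]})

# Step 1: join each list into one '|'-separated string, e.g. 'July|June'
joined = df['Month_List'].str.join('|')

# Step 2: one indicator column per distinct label found in the data.
# Columns come out sorted; labels absent from the data get no column.
dummies = joined.str.get_dummies(sep='|')

# Step 3: reorder the columns to match ref and add all-zero columns
# for reference labels that never occur (here 'September').
binary = dummies.reindex(columns=ref, fill_value=0)

binary_lists = binary.to_numpy().tolist()
print(list(dummies.columns))
print(binary_lists)
```

Without the reindex step the columns would be in alphabetical order and 'September' would be missing, so the lists would not line up with ref.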


The approaches differ in performance (timed on the sample repeated 1000 times, giving 4000 rows):



df = pd.concat([df] * 1000, ignore_index=True)

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

In [338]: %timeit (df['Month_List'].str.join('|').str.get_dummies().reindex(columns=ref, fill_value=0).values.tolist())
31.4 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [339]: %timeit pd.DataFrame(mlb.fit_transform(df['Month_List']),columns=mlb.classes_).reindex(columns=ref, fill_value=0).values.tolist()
5.57 ms ± 94.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [340]: %timeit df['Binary_Month_List2'] =df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()
58.6 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
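
The timings above depend on the machine and on the installed library versions, so it may be worth re-measuring locally. The following is a rough sketch of how that could be done outside IPython with the standard timeit module; the helper names with_apply and with_get_dummies are hypothetical wrappers around the question's apply approach and the str.get_dummies approach, and no particular timing result is implied.

```python
import timeit

import pandas as pd

ref = ['September', 'August', 'July', 'June', 'May', 'April', 'March']
base = pd.DataFrame({'Month_List': [['July'], ['August'], ['July', 'June'],
                                    ['May', 'April', 'March']]})
# Same enlargement as in the timings above: 4 rows repeated 1000 times
big = pd.concat([base] * 1000, ignore_index=True)


def with_apply(col):
    """Row-by-row approach from the question."""
    s = pd.Series(ref)
    return col.apply(lambda lst: s.isin(lst).astype(int).tolist()).tolist()


def with_get_dummies(col):
    """str.join + str.get_dummies + reindex approach."""
    return (col.str.join('|').str.get_dummies()
               .reindex(columns=ref, fill_value=0).to_numpy().tolist())


# Both candidates must agree before their timings are worth comparing
assert with_apply(big['Month_List']) == with_get_dummies(big['Month_List'])

for func in (with_apply, with_get_dummies):
    # Best of three single runs, reported in milliseconds
    seconds = min(timeit.repeat(lambda: func(big['Month_List']),
                                number=1, repeat=3))
    print(f'{func.__name__}: {seconds * 1000:.1f} ms')
```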
































  • how can I use vectorization ?

    – Hardik Gupta
    Sep 27 at 15:08





































We can use explode with get_dummies. Note that explode is available from pandas 0.25 onwards. On pandas 2.0 and later, where sum(level=0) has been removed, use groupby(level=0).sum() instead.



df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()
Out[79]:
[[0, 0, 1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0],
 [0, 0, 1, 1, 0, 0, 0],
 [0, 0, 0, 0, 1, 1, 1]]

# To assign the result as a new column:
# df['new'] = df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()
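
For newer pandas versions, the same explode idea can be sketched with pd.get_dummies and a groupby on the index level. This is a variant under stated assumptions rather than the answer's exact code: it assumes pandas 0.25 or later (for explode), and it uses max() instead of sum() so that a label repeated within one list still yields 1. The names exploded, dummies and binary are illustrative.

```python
import pandas as pd

ref = ['September', 'August', 'July', 'June', 'May', 'April', 'March']
df = pd.DataFrame({'Month_List': [['July'], ['August'], ['July', 'June'],
                                  ['May', 'April', 'March']]})

# One row per (original row, label); the original index value is repeated
exploded = df['Month_List'].explode()

# Indicator columns per label, then collapse back to one row per original
# index value. max() keeps the result binary even with repeated labels.
dummies = pd.get_dummies(exploded).groupby(level=0).max()

# Align columns with ref and normalise the dtype to plain integers
binary = dummies.reindex(columns=ref, fill_value=0).astype(int)

df['Binary_Month_List'] = binary.to_numpy().tolist()
print(df)
```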
































  • I am getting this error - 'Series' object has no attribute 'explode'

    – Hardik Gupta
    Sep 27 at 14:58











  • @Hardikgupta "notice explode is available after 0.25 "

    – Michael Gardner
    Sep 27 at 15:09











  • if we have to use numpy vectorization, how can we achieve it? because I want to now apply another function to each list (which is in series)

    – Hardik Gupta
    Sep 27 at 15:10





































Here's one with NumPy tools -



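def isin_lists(df_col, ref):
    # Flatten all row lists into one 1D array of labels
    a = np.concatenate(df_col)
    b = np.asarray(ref)

    # Column of each label in ref, found by binary search on the sorted order of ref
    sidx = b.argsort()
    c = sidx[np.searchsorted(b, a, sorter=sidx)]

    # Row index of each flattened label
    l = np.array([len(i) for i in df_col])
    r = np.repeat(np.arange(len(l)), l)

    # Mark the (row, column) pairs; assumes every label occurs in ref
    out = np.zeros((len(l), len(b)), dtype=bool)
    out[r, c] = 1
    return out.view('i1')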
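
The searchsorted lookup above relies on every label being present in ref: a label that is missing can either raise an IndexError or silently mark a neighbouring column. Where that cannot be guaranteed, a guarded variant along the following lines might help. It is a sketch, not part of the answer; the name isin_lists_safe and the choice to ignore unknown labels are assumptions.

```python
import numpy as np


def isin_lists_safe(lists, ref):
    """Binary membership matrix; labels missing from ref are ignored."""
    ref_arr = np.asarray(ref)
    lengths = np.array([len(lst) for lst in lists], dtype=int)
    out = np.zeros((len(lengths), len(ref_arr)), dtype=np.int8)
    if lengths.sum() == 0:
        return out  # nothing to mark

    # Flatten the labels and remember which row each one came from
    flat = np.array([label for lst in lists for label in lst])
    rows = np.repeat(np.arange(len(lengths)), lengths)

    # Binary search against the sorted order of ref
    sidx = ref_arr.argsort()
    pos = np.searchsorted(ref_arr, flat, sorter=sidx)
    pos = np.clip(pos, 0, len(ref_arr) - 1)  # keep out-of-range hits indexable
    cols = sidx[pos]

    # Keep only exact matches, which drops labels that are not in ref
    valid = ref_arr[cols] == flat
    out[rows[valid], cols[valid]] = 1
    return out


ref = ['September', 'August', 'July', 'June', 'May', 'April', 'March']
lists = [['July'], ['August', 'Foo'], [], ['May', 'April', 'March', 'Zed']]
print(isin_lists_safe(lists, ref))
```

Here 'Foo' and 'Zed' are made-up labels that do not occur in ref, and the empty list produces an all-zero row.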


Output for given sample -



In [79]: bin_ar = isin_lists(df['Month_List'], ref)

In [80]: bin_ar
Out[80]:
array([[0, 0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 1, 1]], dtype=int8)

# To assign as lists for each row into `df`
In [81]: df['Binary_Month_List'] = bin_ar.tolist()

# To get counts
In [82]: df['Value'] = bin_ar.sum(1)

In [83]: df
Out[83]:
            Month_List      Binary_Month_List  Value
0               [July]  [0, 0, 1, 0, 0, 0, 0]      1
1             [August]  [0, 1, 0, 0, 0, 0, 0]      1
2         [July, June]  [0, 0, 1, 1, 0, 0, 0]      2
3  [May, April, March]  [0, 0, 0, 0, 1, 1, 1]      3


If you can't use the intermediate bin_ar for some reason and only have the 'Binary_Month_List' column to work with -



In [15]: df['Value'] = np.vstack(df['Binary_Month_List']).sum(axis=1)
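
Once the lists are stacked into a 2D array, other row-wise steps can often be written as array operations along axis=1 instead of a per-row Python function. The sketch below shows two such steps on the sample binary lists: the count that count_one computes, and, purely as a hypothetical further step, the position of the first 1 in each row. Whether a given function can be rewritten this way has to be checked case by case.

```python
import numpy as np

binary_lists = [[0, 0, 1, 0, 0, 0, 0],
                [0, 1, 0, 0, 0, 0, 0],
                [0, 0, 1, 1, 0, 0, 0],
                [0, 0, 0, 0, 1, 1, 1]]

# Stack the equal-length lists into one array of shape (4, 7)
arr = np.vstack(binary_lists)

# Number of non-zero entries per row (what count_one computes)
counts = np.count_nonzero(arr, axis=1)

# Column index of the first 1 in each row; -1 for rows without any 1
has_one = arr.any(axis=1)
first_one = np.where(has_one, arr.argmax(axis=1), -1)

print(counts.tolist())     # [1, 1, 2, 3]
print(first_one.tolist())  # [2, 1, 2, 4]
```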
































  • I cannot use sum directly, as there are more steps in that functions. Just for this example I am calculating count of 1's

    – Hardik Gupta
    Sep 27 at 15:51











  • @Hardikgupta Got it. Edited post.

    – Divakar
    Sep 27 at 15:56





































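I am not sure whether this will be faster, but CountVectorizer can also be used in this case. Note that it lowercases tokens by default and orders its columns alphabetically, so lowercase=False and a reindex are needed to match ref.



from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(binary=True, lowercase=False)

mys = [','.join(i) for i in df['Month_List']]
X = vect.fit_transform(mys)
# Use get_feature_names() on scikit-learn versions before 1.0
col_names = vect.get_feature_names_out()
# pd.SparseDataFrame was removed in pandas 1.0, so build a regular DataFrame
ndf = pd.DataFrame(X.toarray(), columns=col_names)
df['Binary_Month_List'] = ndf.reindex(columns=ref, fill_value=0).values.tolist()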













MultiLabelBinarizer / str.get_dummies answer: answered Sep 27 at 14:14 by jezrael, edited Sep 27 at 14:34























explode / get_dummies answer: answered Sep 27 at 14:17 by WeNYoBen, edited Sep 27 at 14:24















    • I am getting this error - 'Series' object has no attribute 'explode'

      – Hardik Gupta
      Sep 27 at 14:58











    • @Hardikgupta "notice explode is available after 0.25 "

      – Michael Gardner
      Sep 27 at 15:09











    • if we have to use numpy vectorization, how can we achieve it? because I want to now apply another function to each list (which is in series)

      – Hardik Gupta
      Sep 27 at 15:10

















    • I am getting this error - 'Series' object has no attribute 'explode'

      – Hardik Gupta
      Sep 27 at 14:58











    • @Hardikgupta "notice explode is available after 0.25 "

      – Michael Gardner
      Sep 27 at 15:09











    • if we have to use numpy vectorization, how can we achieve it? because I want to now apply another function to each list (which is in series)

      – Hardik Gupta
      Sep 27 at 15:10
















    I am getting this error - 'Series' object has no attribute 'explode'

    – Hardik Gupta
    Sep 27 at 14:58





    I am getting this error - 'Series' object has no attribute 'explode'

    – Hardik Gupta
    Sep 27 at 14:58













    @Hardikgupta "notice explode is available after 0.25 "

    – Michael Gardner
    Sep 27 at 15:09





    @Hardikgupta "notice explode is available after 0.25 "

    – Michael Gardner
    Sep 27 at 15:09













    if we have to use numpy vectorization, how can we achieve it? because I want to now apply another function to each list (which is in series)

    – Hardik Gupta
    Sep 27 at 15:10





    if we have to use numpy vectorization, how can we achieve it? because I want to now apply another function to each list (which is in series)

    – Hardik Gupta
    Sep 27 at 15:10











    4



















    Here's one with NumPy tools -



    def isin_lists(df_col, ref):
    a = np.concatenate(df_col)
    b = np.asarray(ref)

    sidx = b.argsort()
    c = sidx[np.searchsorted(b,a,sorter=sidx)]

    l = np.array([len(i) for i in df_col])
    r = np.repeat(np.arange(len(l)),l)

    out = np.zeros((len(l),len(b)), dtype=bool)
    out[r,c] = 1
    return out.view('i1')


    Output for given sample -



    In [79]: bin_ar = isin_lists(df['Month_List'], ref)

    In [80]: bin_ar
    Out[80]:
    array([[0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1]], dtype=int8)

    # To assign as lists for each row into `df`
    In [81]: df['Binary_Month_List'] = bin_ar.tolist()

    # To get counts
    In [82]: df['Value'] = bin_ar.sum(1)

    In [83]: df
    Out[83]:
    Month_List Binary_Month_List Value
    0 [July] [0, 0, 1, 0, 0, 0, 0] 1
    1 [August] [0, 1, 0, 0, 0, 0, 0] 1
    2 [July, June] [0, 0, 1, 1, 0, 0, 0] 2
    3 [May, April, March] [0, 0, 0, 0, 1, 1, 1] 3


    If you can't use the intermediate bin_ar for some reason and have only 'Binary_Month_List' header to work with -



    In [15]: df['Value'] = np.vstack(df['Binary_Month_List']).sum(axis=1)





    share|improve this answer



























    • I cannot use sum directly, as there are more steps in that functions. Just for this example I am calculating count of 1's

      – Hardik Gupta
      Sep 27 at 15:51











    • @Hardikgupta Got it. Edited post.

      – Divakar
      Sep 27 at 15:56















    edited Sep 27 at 15:55

    answered Sep 27 at 14:49

    Divakar

    I am not sure whether this will be faster, but scikit-learn's CountVectorizer can also be used in this case.



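    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    # binary=True records presence (0/1) instead of counts
    vect = CountVectorizer(binary=True)

    # Join each row's list into one string for the vectorizer to tokenize
    mys = [','.join(i) for i in df['Month_List']]
    X = vect.fit_transform(mys)

    # get_feature_names_out() needs scikit-learn >= 1.0;
    # DataFrame.sparse.from_spmatrix() needs pandas >= 0.25
    col_names = vect.get_feature_names_out()
    ndf = pd.DataFrame.sparse.from_spmatrix(X, index=df.index, columns=col_names)
    df = df.join(ndf)
    df['Binary_Month_List'] = X.toarray().tolist()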
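
One thing to watch with this approach: by default CountVectorizer lowercases tokens and orders its columns alphabetically, so the binary lists will not follow the order of the reference list. A minimal sketch of one way around that, assuming the same hypothetical `ref` and sample rows as in the question's expected output, is to pass the reference list as a fixed `vocabulary` and turn lowercasing off:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumed inputs (not taken from the original post)
ref = ['September', 'August', 'July', 'June', 'May', 'April', 'March']
month_lists = [['July'], ['August'], ['July', 'June'], ['May', 'April', 'March']]

# A fixed vocabulary pins column i to ref[i]; lowercase=False keeps the
# tokens identical to the capitalized names in ref
vect = CountVectorizer(binary=True, lowercase=False, vocabulary=ref)

# Join each row's list into one string for the vectorizer to tokenize
docs = [','.join(months) for months in month_lists]
X = vect.fit_transform(docs)

# Dense nested lists, one per row, in the same column order as ref
binary_lists = X.toarray().tolist()
```

Note that the default tokenizer only keeps tokens of two or more word characters, which is fine for month names but would drop single-character values.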





    answered Oct 5 at 13:17

    shantanuo
