Check reference list in pandas column using numpy vectorization


I have a reference list

ref = ['September', 'August', 'July', 'June', 'May', 'April', 'March']

And a dataframe

df = pd.DataFrame({'Month_List': [['July'], ['August'], ['July', 'June'], ['May', 'April', 'March']]})
df
 Month_List
0 [July]
1 [August]
2 [July, June]
3 [May, April, March]

I want to check which elements of the reference list are present in each row, and convert the result into a binary list.

I can achieve this using apply:

def convert_month_to_binary(ref,lst):
 s = pd.Series(ref)
 return s.isin(lst).astype(int).tolist() 

df['Binary_Month_List'] = df['Month_List'].apply(lambda x: convert_month_to_binary(ref, x))
df

 Month_List Binary_Month_List
0 [July] [0, 0, 1, 0, 0, 0, 0]
1 [August] [0, 1, 0, 0, 0, 0, 0]
2 [July, June] [0, 0, 1, 1, 0, 0, 0]
3 [May, April, March] [0, 0, 0, 0, 1, 1, 1]

However, apply is very slow on large datasets, so I would like to use NumPy vectorization instead. How can I improve performance?

Extension:

I wanted to use NumPy vectorization because I now need to apply another function to each of these lists.

I am trying the following, but it is very slow, and apply gives similar results:

def count_one(lst):
 index = [i for i, e in enumerate(lst) if e != 0] 
 return len(index)

vfunc = np.vectorize(count_one)
df['Value'] = vfunc(df['Binary_Month_List'])

edited Sep 27 at 15:23

asked Sep 27 at 14:11

Hardik Gupta

FYI - Vectorization works on a case by case basis. There's no magic function that works in all scenarios, if you are looking for one.

– Divakar
Sep 27 at 15:33

python pandas numpy


4 Answers

In pandas it is better not to store lists in a column like this, but it is possible with MultiLabelBinarizer, using DataFrame.reindex to add the missing categories. Finally, convert the values to a NumPy array and then to lists if performance is important:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df1 = pd.DataFrame(mlb.fit_transform(df['Month_List']),columns=mlb.classes_)
df['Binary_Month_List'] = df1.reindex(columns=ref, fill_value=0).values.tolist()

Or with Series.str.join, Series.str.get_dummies and reindex:

df['Binary_Month_List'] = (df['Month_List'].str.join('|')
 .str.get_dummies()
 .reindex(columns=ref, fill_value=0)
 .values
 .tolist())
print (df)
 Month_List Binary_Month_List
0 [July] [0, 0, 1, 0, 0, 0, 0]
1 [August] [0, 1, 0, 0, 0, 0, 0]
2 [July, June] [0, 0, 1, 1, 0, 0, 0]
3 [May, April, March] [0, 0, 0, 0, 1, 1, 1]

Performance differs between the approaches:

df = pd.concat([df] * 1000, ignore_index=True)

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

In [338]: %timeit (df['Month_List'].str.join('|').str.get_dummies().reindex(columns=ref, fill_value=0).values.tolist())
31.4 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [339]: %timeit pd.DataFrame(mlb.fit_transform(df['Month_List']),columns=mlb.classes_).reindex(columns=ref, fill_value=0).values.tolist()
5.57 ms ± 94.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [340]: %timeit df['Binary_Month_List2'] =df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()
58.6 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
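As an aside to the opening remark of this answer, here is a minimal sketch (not part of the original answer) of keeping the indicators as a DataFrame rather than a column of lists, so that later steps can work on whole columns. The name `ind` is an illustrative assumption; the chain itself is the str.join / str.get_dummies / reindex approach shown above.

```python
import pandas as pd

ref = ['September', 'August', 'July', 'June', 'May', 'April', 'March']
df = pd.DataFrame({'Month_List': [['July'], ['August'], ['July', 'June'],
                                  ['May', 'April', 'March']]})

# One 0/1 column per reference month, in the order given by `ref`.
ind = (df['Month_List'].str.join('|')
                       .str.get_dummies()
                       .reindex(columns=ref, fill_value=0))

# Follow-up computations can then operate on columns instead of Python lists.
df['Value'] = ind.sum(axis=1)
```

Whether this layout is preferable depends on what the later steps need; it mainly avoids converting back and forth between lists and arrays.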

edited Sep 27 at 14:34

answered Sep 27 at 14:14

jezrael


how can I use vectorization ?

– Hardik Gupta
Sep 27 at 15:08


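We can use explode with get_dummies; note that explode is available from pandas 0.25 onward. (In pandas 2.0 and later, where the level argument of sum was removed, replace .sum(level=0) with .groupby(level=0).sum().)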

df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()
Out[79]: 
[[0, 0, 1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0],
 [0, 0, 1, 1, 0, 0, 0],
 [0, 0, 0, 0, 1, 1, 1]]

#df['new']=df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()
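For readers on a recent pandas, the same idea can be sketched with a groupby on the index level. This is an adaptation under stated assumptions, not the answerer's code: the intermediate names are illustrative, and `max()` is used in place of `sum()` so that a month repeated within one row still yields 1.

```python
import pandas as pd

ref = ['September', 'August', 'July', 'June', 'May', 'April', 'March']
df = pd.DataFrame({'Month_List': [['July'], ['August'], ['July', 'June'],
                                  ['May', 'April', 'March']]})

# One row per month; the original row index is repeated for each item.
exploded = df['Month_List'].explode()

# 0/1 indicator columns, one per distinct month found in the data.
dummies = exploded.str.get_dummies()

# Collapse back to one row per original row, then align columns with `ref`.
binary = (dummies.groupby(level=0).max()
                 .reindex(columns=ref, fill_value=0))

result = binary.values.tolist()
```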

edited Sep 27 at 14:24

answered Sep 27 at 14:17

WeNYoBen


I am getting this error - 'Series' object has no attribute 'explode'

– Hardik Gupta
Sep 27 at 14:58

@Hardikgupta "notice explode is available after 0.25 "

– Michael Gardner
Sep 27 at 15:09

if we have to use numpy vectorization, how can we achieve it? because I want to now apply another function to each list (which is in series)

– Hardik Gupta
Sep 27 at 15:10


Here's one with NumPy tools -

def isin_lists(df_col, ref):
 a = np.concatenate(df_col)
 b = np.asarray(ref)

 sidx = b.argsort()
 c = sidx[np.searchsorted(b,a,sorter=sidx)]

 l = np.array([len(i) for i in df_col])
 r = np.repeat(np.arange(len(l)),l)

 out = np.zeros((len(l),len(b)), dtype=bool)
 out[r,c] = 1
 return out.view('i1')

Output for given sample -

In [79]: bin_ar = isin_lists(df['Month_List'], ref)

In [80]: bin_ar
Out[80]: 
array([[0, 0, 1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0],
 [0, 0, 1, 1, 0, 0, 0],
 [0, 0, 0, 0, 1, 1, 1]], dtype=int8)

# To assign as lists for each row into `df`
In [81]: df['Binary_Month_List'] = bin_ar.tolist()

# To get counts
In [82]: df['Value'] = bin_ar.sum(1)

In [83]: df
Out[83]: 
 Month_List Binary_Month_List Value
0 [July] [0, 0, 1, 0, 0, 0, 0] 1
1 [August] [0, 1, 0, 0, 0, 0, 0] 1
2 [July, June] [0, 0, 1, 1, 0, 0, 0] 2
3 [May, April, March] [0, 0, 0, 0, 1, 1, 1] 3

If you can't use the intermediate bin_ar for some reason and have only the 'Binary_Month_List' column to work with -

In [15]: df['Value'] = np.vstack(df['Binary_Month_List']).sum(axis=1)
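As far as I can tell, the searchsorted lookup in isin_lists relies on every item in the column also appearing in ref. The following is a hedged variant, not the answerer's code, that compares items against ref by broadcasting, so unknown items and empty rows simply produce zeros. The function name `binary_matrix` and its internals are illustrative assumptions.

```python
from itertools import chain

import numpy as np
import pandas as pd

ref = ['September', 'August', 'July', 'June', 'May', 'April', 'March']
df = pd.DataFrame({'Month_List': [['July'], ['August'], ['July', 'June'],
                                  ['May', 'April', 'March']]})

def binary_matrix(col, ref):
    """Return an (n_rows, n_ref) int8 array; entry [i, j] is 1 if ref[j] is in row i."""
    ref_arr = np.array(ref, dtype=object)
    lens = np.array([len(x) for x in col], dtype=np.intp)
    flat = np.array(list(chain.from_iterable(col)), dtype=object)
    rows = np.repeat(np.arange(len(lens)), lens)  # row id of every flattened item

    # Compare every flattened item with every reference entry: (n_items, n_ref).
    hits = flat[:, None] == ref_arr[None, :]
    item_idx, ref_idx = np.nonzero(hits)

    out = np.zeros((len(lens), len(ref_arr)), dtype=np.int8)
    out[rows[item_idx], ref_idx] = 1
    return out

bin_ar = binary_matrix(df['Month_List'], ref)

# Later steps can stay on the 2D array, e.g. counting the ones per row.
counts = np.count_nonzero(bin_ar, axis=1)
```

The comparison builds an intermediate array of size n_items × n_ref, so this trades memory for robustness; for a long reference list the searchsorted version above is likely the better fit.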

edited Sep 27 at 15:55

answered Sep 27 at 14:49

Divakar


I cannot use sum directly, as there are more steps in that functions. Just for this example I am calculating count of 1's

– Hardik Gupta
Sep 27 at 15:51

@Hardikgupta Got it. Edited post.

– Divakar
Sep 27 at 15:56


I am not sure whether this will be faster, but CountVectorizer can also be used in this case.

from sklearn.feature_extraction.text import CountVectorizer
vect=CountVectorizer(binary=True)

mys=([(','.join(i)) for i in df['Month_List']])
X=vect.fit_transform(mys)
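col_names=vect.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
ndf=pd.DataFrame(X.toarray(), columns=col_names)  # pd.SparseDataFrame was removed in pandas 1.0; a dense frame is used here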
df=df.join(ndf).astype(str)
df['Binary_Month_List'] = df.iloc[:, 1:].values.tolist()

answered Oct 5 at 13:17

shantanuo

23.5k61 gold badges173 silver badges284 bronze badges

add a comment
|

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/4.0/"u003ecc by-sa 4.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f58136267%2fcheck-reference-list-in-pandas-column-using-numpy-vectorization%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

4 Answers
4

active

oldest

votes

4 Answers
4

active

oldest

votes

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df1 = pd.DataFrame(mlb.fit_transform(df['Month_List']),columns=mlb.classes_)
df['Binary_Month_List'] = df1.reindex(columns=ref, fill_value=0).values.tolist()

Or with Series.str.join, Series.str.get_dummies and reindex:

df['Binary_Month_List'] = (df['Month_List'].str.join('|')
 .str.get_dummies()
 .reindex(columns=ref, fill_value=0)
 .values
 .tolist())
print (df)
 Month_List Binary_Month_List
0 [July] [0, 0, 1, 0, 0, 0, 0]
1 [August] [0, 1, 0, 0, 0, 0, 0]
2 [July, June] [0, 0, 1, 1, 0, 0, 0]
3 [May, April, March] [0, 0, 0, 0, 1, 1, 1]

Performance is different:

df = pd.concat([df] * 1000, ignore_index=True)

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

In [338]: %timeit (df['Month_List'].str.join('|').str.get_dummies().reindex(columns=ref, fill_value=0).values.tolist())
31.4 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [339]: %timeit pd.DataFrame(mlb.fit_transform(df['Month_List']),columns=mlb.classes_).reindex(columns=ref, fill_value=0).values.tolist()
5.57 ms ± 94.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [340]: %timeit df['Binary_Month_List2'] =df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()
58.6 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

edited Sep 27 at 14:34

answered Sep 27 at 14:14

jezrael

440k35 gold badges487 silver badges533 bronze badges

how can I use vectorization ?

– Hardik Gupta
Sep 27 at 15:08

add a comment
|

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df1 = pd.DataFrame(mlb.fit_transform(df['Month_List']),columns=mlb.classes_)
df['Binary_Month_List'] = df1.reindex(columns=ref, fill_value=0).values.tolist()

Or with Series.str.join, Series.str.get_dummies and reindex:

df['Binary_Month_List'] = (df['Month_List'].str.join('|')
 .str.get_dummies()
 .reindex(columns=ref, fill_value=0)
 .values
 .tolist())
print (df)
 Month_List Binary_Month_List
0 [July] [0, 0, 1, 0, 0, 0, 0]
1 [August] [0, 1, 0, 0, 0, 0, 0]
2 [July, June] [0, 0, 1, 1, 0, 0, 0]
3 [May, April, March] [0, 0, 0, 0, 1, 1, 1]

Performance is different:

df = pd.concat([df] * 1000, ignore_index=True)

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

In [338]: %timeit (df['Month_List'].str.join('|').str.get_dummies().reindex(columns=ref, fill_value=0).values.tolist())
31.4 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [339]: %timeit pd.DataFrame(mlb.fit_transform(df['Month_List']),columns=mlb.classes_).reindex(columns=ref, fill_value=0).values.tolist()
5.57 ms ± 94.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [340]: %timeit df['Binary_Month_List2'] =df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()
58.6 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

edited Sep 27 at 14:34

answered Sep 27 at 14:14

jezrael

440k35 gold badges487 silver badges533 bronze badges

how can I use vectorization ?

– Hardik Gupta
Sep 27 at 15:08

add a comment
|

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df1 = pd.DataFrame(mlb.fit_transform(df['Month_List']),columns=mlb.classes_)
df['Binary_Month_List'] = df1.reindex(columns=ref, fill_value=0).values.tolist()

Or with Series.str.join, Series.str.get_dummies and reindex:

df['Binary_Month_List'] = (df['Month_List'].str.join('|')
 .str.get_dummies()
 .reindex(columns=ref, fill_value=0)
 .values
 .tolist())
print (df)
 Month_List Binary_Month_List
0 [July] [0, 0, 1, 0, 0, 0, 0]
1 [August] [0, 1, 0, 0, 0, 0, 0]
2 [July, June] [0, 0, 1, 1, 0, 0, 0]
3 [May, April, March] [0, 0, 0, 0, 1, 1, 1]

Performance is different:

df = pd.concat([df] * 1000, ignore_index=True)

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

In [338]: %timeit (df['Month_List'].str.join('|').str.get_dummies().reindex(columns=ref, fill_value=0).values.tolist())
31.4 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [339]: %timeit pd.DataFrame(mlb.fit_transform(df['Month_List']),columns=mlb.classes_).reindex(columns=ref, fill_value=0).values.tolist()
5.57 ms ± 94.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [340]: %timeit df['Binary_Month_List2'] =df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()
58.6 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

edited Sep 27 at 14:34

answered Sep 27 at 14:14

jezrael

440k35 gold badges487 silver badges533 bronze badges

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df1 = pd.DataFrame(mlb.fit_transform(df['Month_List']),columns=mlb.classes_)
df['Binary_Month_List'] = df1.reindex(columns=ref, fill_value=0).values.tolist()

Or with Series.str.join, Series.str.get_dummies and reindex:

df['Binary_Month_List'] = (df['Month_List'].str.join('|')
 .str.get_dummies()
 .reindex(columns=ref, fill_value=0)
 .values
 .tolist())
print (df)
 Month_List Binary_Month_List
0 [July] [0, 0, 1, 0, 0, 0, 0]
1 [August] [0, 1, 0, 0, 0, 0, 0]
2 [July, June] [0, 0, 1, 1, 0, 0, 0]
3 [May, April, March] [0, 0, 0, 0, 1, 1, 1]

Performance is different:

df = pd.concat([df] * 1000, ignore_index=True)

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

In [338]: %timeit (df['Month_List'].str.join('|').str.get_dummies().reindex(columns=ref, fill_value=0).values.tolist())
31.4 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [339]: %timeit pd.DataFrame(mlb.fit_transform(df['Month_List']),columns=mlb.classes_).reindex(columns=ref, fill_value=0).values.tolist()
5.57 ms ± 94.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [340]: %timeit df['Binary_Month_List2'] =df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()
58.6 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

edited Sep 27 at 14:34

answered Sep 27 at 14:14

jezrael

440k35 gold badges487 silver badges533 bronze badges

edited Sep 27 at 14:34

answered Sep 27 at 14:14

jezrael

440k35 gold badges487 silver badges533 bronze badges

answered Sep 27 at 14:14

jezrael

440k35 gold badges487 silver badges533 bronze badges

answered Sep 27 at 14:14

jezrael

440k35 gold badges487 silver badges533 bronze badges

how can I use vectorization ?

– Hardik Gupta
Sep 27 at 15:08

add a comment
|

how can I use vectorization ?

– Hardik Gupta
Sep 27 at 15:08

how can I use vectorization ?

– Hardik Gupta
Sep 27 at 15:08

add a comment
|

We can using explode with get_dummies, notice explode is available after 0.25

df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()
Out[79]: 
[[0, 0, 1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0],
 [0, 0, 1, 1, 0, 0, 0],
 [0, 0, 0, 0, 1, 1, 1]]

#df['new']=df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()

edited Sep 27 at 14:24

answered Sep 27 at 14:17

WeNYoBen

171k11 gold badges63 silver badges98 bronze badges

I am getting this error - 'Series' object has no attribute 'explode'

– Hardik Gupta
Sep 27 at 14:58

@Hardikgupta "notice explode is available after 0.25 "

– Michael Gardner
Sep 27 at 15:09

if we have to use numpy vectorization, how can we achieve it? because I want to now apply another function to each list (which is in series)

– Hardik Gupta
Sep 27 at 15:10

add a comment
|

We can using explode with get_dummies, notice explode is available after 0.25

df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()
Out[79]: 
[[0, 0, 1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0],
 [0, 0, 1, 1, 0, 0, 0],
 [0, 0, 0, 0, 1, 1, 1]]

#df['new']=df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()

edited Sep 27 at 14:24

answered Sep 27 at 14:17

WeNYoBen

171k11 gold badges63 silver badges98 bronze badges

I am getting this error - 'Series' object has no attribute 'explode'

– Hardik Gupta
Sep 27 at 14:58

@Hardikgupta "notice explode is available after 0.25 "

– Michael Gardner
Sep 27 at 15:09

if we have to use numpy vectorization, how can we achieve it? because I want to now apply another function to each list (which is in series)

– Hardik Gupta
Sep 27 at 15:10

add a comment
|

We can using explode with get_dummies, notice explode is available after 0.25

df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()
Out[79]: 
[[0, 0, 1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0],
 [0, 0, 1, 1, 0, 0, 0],
 [0, 0, 0, 0, 1, 1, 1]]

#df['new']=df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()

edited Sep 27 at 14:24

answered Sep 27 at 14:17

WeNYoBen

171k11 gold badges63 silver badges98 bronze badges

We can using explode with get_dummies, notice explode is available after 0.25

df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()
Out[79]: 
[[0, 0, 1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0],
 [0, 0, 1, 1, 0, 0, 0],
 [0, 0, 0, 0, 1, 1, 1]]

#df['new']=df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()

edited Sep 27 at 14:24

answered Sep 27 at 14:17

WeNYoBen

171k11 gold badges63 silver badges98 bronze badges

edited Sep 27 at 14:24

answered Sep 27 at 14:17

WeNYoBen

171k11 gold badges63 silver badges98 bronze badges

answered Sep 27 at 14:17

WeNYoBen

171k11 gold badges63 silver badges98 bronze badges

answered Sep 27 at 14:17

WeNYoBen

171k11 gold badges63 silver badges98 bronze badges

I am getting this error - 'Series' object has no attribute 'explode'

– Hardik Gupta
Sep 27 at 14:58

@Hardikgupta "notice explode is available after 0.25 "

– Michael Gardner
Sep 27 at 15:09

if we have to use numpy vectorization, how can we achieve it? because I want to now apply another function to each list (which is in series)

– Hardik Gupta
Sep 27 at 15:10

add a comment
|

I am getting this error - 'Series' object has no attribute 'explode'

– Hardik Gupta
Sep 27 at 14:58

@Hardikgupta "notice explode is available after 0.25 "

– Michael Gardner
Sep 27 at 15:09

if we have to use numpy vectorization, how can we achieve it? because I want to now apply another function to each list (which is in series)

– Hardik Gupta
Sep 27 at 15:10

I am getting this error - 'Series' object has no attribute 'explode'

– Hardik Gupta
Sep 27 at 14:58

@Hardikgupta "notice explode is available after 0.25 "

– Michael Gardner
Sep 27 at 15:09

if we have to use numpy vectorization, how can we achieve it? because I want to now apply another function to each list (which is in series)

– Hardik Gupta
Sep 27 at 15:10

add a comment
|

Here's one with NumPy tools -

def isin_lists(df_col, ref):
 a = np.concatenate(df_col)
 b = np.asarray(ref)

 sidx = b.argsort()
 c = sidx[np.searchsorted(b,a,sorter=sidx)]

 l = np.array([len(i) for i in df_col])
 r = np.repeat(np.arange(len(l)),l)

 out = np.zeros((len(l),len(b)), dtype=bool)
 out[r,c] = 1
 return out.view('i1')

Output for given sample -

In [79]: bin_ar = isin_lists(df['Month_List'], ref)

In [80]: bin_ar
Out[80]: 
array([[0, 0, 1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0],
 [0, 0, 1, 1, 0, 0, 0],
 [0, 0, 0, 0, 1, 1, 1]], dtype=int8)

# To assign as lists for each row into `df`
In [81]: df['Binary_Month_List'] = bin_ar.tolist()

# To get counts
In [82]: df['Value'] = bin_ar.sum(1)

In [83]: df
Out[83]: 
 Month_List Binary_Month_List Value
0 [July] [0, 0, 1, 0, 0, 0, 0] 1
1 [August] [0, 1, 0, 0, 0, 0, 0] 1
2 [July, June] [0, 0, 1, 1, 0, 0, 0] 2
3 [May, April, March] [0, 0, 0, 0, 1, 1, 1] 3

If you can't use the intermediate bin_ar for some reason and have only 'Binary_Month_List' header to work with -

In [15]: df['Value'] = np.vstack(df['Binary_Month_List']).sum(axis=1)

edited Sep 27 at 15:55

answered Sep 27 at 14:49

Divakar

177k14 gold badges129 silver badges222 bronze badges

I cannot use sum directly, as there are more steps in that functions. Just for this example I am calculating count of 1's

– Hardik Gupta
Sep 27 at 15:51

@Hardikgupta Got it. Edited post.

– Divakar
Sep 27 at 15:56

add a comment
|

Here's one with NumPy tools -

def isin_lists(df_col, ref):
 a = np.concatenate(df_col)
 b = np.asarray(ref)

 sidx = b.argsort()
 c = sidx[np.searchsorted(b,a,sorter=sidx)]

 l = np.array([len(i) for i in df_col])
 r = np.repeat(np.arange(len(l)),l)

 out = np.zeros((len(l),len(b)), dtype=bool)
 out[r,c] = 1
 return out.view('i1')

Output for given sample -

In [79]: bin_ar = isin_lists(df['Month_List'], ref)

In [80]: bin_ar
Out[80]: 
array([[0, 0, 1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0],
 [0, 0, 1, 1, 0, 0, 0],
 [0, 0, 0, 0, 1, 1, 1]], dtype=int8)

# To assign as lists for each row into `df`
In [81]: df['Binary_Month_List'] = bin_ar.tolist()

# To get counts
In [82]: df['Value'] = bin_ar.sum(1)

In [83]: df
Out[83]: 
 Month_List Binary_Month_List Value
0 [July] [0, 0, 1, 0, 0, 0, 0] 1
1 [August] [0, 1, 0, 0, 0, 0, 0] 1
2 [July, June] [0, 0, 1, 1, 0, 0, 0] 2
3 [May, April, March] [0, 0, 0, 0, 1, 1, 1] 3

If you can't use the intermediate bin_ar for some reason and have only 'Binary_Month_List' header to work with -

In [15]: df['Value'] = np.vstack(df['Binary_Month_List']).sum(axis=1)

edited Sep 27 at 15:55

answered Sep 27 at 14:49

Divakar

177k14 gold badges129 silver badges222 bronze badges

I cannot use sum directly, as there are more steps in that functions. Just for this example I am calculating count of 1's

– Hardik Gupta
Sep 27 at 15:51

@Hardikgupta Got it. Edited post.

– Divakar
Sep 27 at 15:56

add a comment
|

Here's one with NumPy tools -

def isin_lists(df_col, ref):
 a = np.concatenate(df_col)
 b = np.asarray(ref)

 sidx = b.argsort()
 c = sidx[np.searchsorted(b,a,sorter=sidx)]

 l = np.array([len(i) for i in df_col])
 r = np.repeat(np.arange(len(l)),l)

 out = np.zeros((len(l),len(b)), dtype=bool)
 out[r,c] = 1
 return out.view('i1')

Output for given sample -

In [79]: bin_ar = isin_lists(df['Month_List'], ref)

In [80]: bin_ar
Out[80]: 
array([[0, 0, 1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0],
 [0, 0, 1, 1, 0, 0, 0],
 [0, 0, 0, 0, 1, 1, 1]], dtype=int8)

# To assign as lists for each row into `df`
In [81]: df['Binary_Month_List'] = bin_ar.tolist()

# To get counts
In [82]: df['Value'] = bin_ar.sum(1)

In [83]: df
Out[83]: 
 Month_List Binary_Month_List Value
0 [July] [0, 0, 1, 0, 0, 0, 0] 1
1 [August] [0, 1, 0, 0, 0, 0, 0] 1
2 [July, June] [0, 0, 1, 1, 0, 0, 0] 2
3 [May, April, March] [0, 0, 0, 0, 1, 1, 1] 3

If you can't use the intermediate bin_ar for some reason and have only 'Binary_Month_List' header to work with -

In [15]: df['Value'] = np.vstack(df['Binary_Month_List']).sum(axis=1)

edited Sep 27 at 15:55

answered Sep 27 at 14:49

Divakar

177k14 gold badges129 silver badges222 bronze badges

I am not sure whether this will be faster, but scikit-learn's CountVectorizer can also be used in this case.

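import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(binary=True)

# Join each row's list into one comma-separated string for the vectorizer
mys = [','.join(i) for i in df['Month_List']]
X = vect.fit_transform(mys)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
col_names = vect.get_feature_names_out()
# pd.SparseDataFrame was removed in pandas 1.0; build a dense frame instead
ndf = pd.DataFrame(X.toarray(), columns=col_names, index=df.index)
df = df.join(ndf).astype(str)
# Assumes Month_List is the only original column; values are the strings '0'/'1'
df['Binary_Month_List'] = df.iloc[:, 1:].values.tolist()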
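
One thing to watch with this approach: CountVectorizer lowercases tokens by default and orders a learned vocabulary alphabetically, so the columns will not follow the order of a reference list. A possible way around that is to pass the reference list as a fixed vocabulary. This is a sketch under assumptions: the ref list and sample rows below are made up for illustration, since the original data is not shown here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical reference order and sample rows (assumptions for illustration).
ref = ['September', 'August', 'July', 'June', 'May', 'April', 'March']
rows = [['July'], ['August'], ['July', 'June'], ['May', 'April', 'March']]

# Text is lowercased by default, so the fixed vocabulary is lowercased to match.
# With a list vocabulary, column i corresponds to ref[i].
vect = CountVectorizer(binary=True, vocabulary=[m.lower() for m in ref])
docs = [','.join(r) for r in rows]
bin_ar = vect.fit_transform(docs).toarray()
print(bin_ar.tolist())
```

Tokens that are not in the fixed vocabulary are ignored, so unknown month names simply leave their row unchanged.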

answered Oct 5 at 13:17

shantanuo

