Check reference list in pandas column using numpy vectorization
I have a reference list:

    import numpy as np
    import pandas as pd

    ref = ['September', 'August', 'July', 'June', 'May', 'April', 'March']

And a dataframe:

    df = pd.DataFrame({'Month_List': [['July'], ['August'], ['July', 'June'], ['May', 'April', 'March']]})
    df
                Month_List
    0               [July]
    1             [August]
    2         [July, June]
    3  [May, April, March]
I want to check which elements of the reference list are present in each row, and convert the result into a binary list. I can achieve this using apply:

    def convert_month_to_binary(ref, lst):
        s = pd.Series(ref)
        return s.isin(lst).astype(int).tolist()

    df['Binary_Month_List'] = df['Month_List'].apply(lambda x: convert_month_to_binary(ref, x))
    df
                Month_List      Binary_Month_List
    0               [July]  [0, 0, 1, 0, 0, 0, 0]
    1             [August]  [0, 1, 0, 0, 0, 0, 0]
    2         [July, June]  [0, 0, 1, 1, 0, 0, 0]
    3  [May, April, March]  [0, 0, 0, 0, 1, 1, 1]
However, using apply on large datasets is very slow, so I am looking to use numpy vectorization. How can I improve performance?
Extension:

I wanted to use numpy vectorization because I now need to apply another function to this list. I am trying it like this, but performance is very slow, with similar results to apply:

    def count_one(lst):
        index = [i for i, e in enumerate(lst) if e != 0]
        return len(index)

    vfunc = np.vectorize(count_one)
    df['Value'] = vfunc(df['Binary_Month_List'])
python pandas numpy
asked Sep 27 at 14:11 by Hardik Gupta (edited Sep 27 at 15:23)
FYI - Vectorization works on a case-by-case basis. There's no magic function that works in all scenarios, if you are looking for one. – Divakar, Sep 27 at 15:33
4 Answers
In pandas it is better not to use lists this way, but it is possible with MultiLabelBinarizer and DataFrame.reindex to add the missing categories; finally, convert the values to a NumPy array and then to lists if performance is important:

    from sklearn.preprocessing import MultiLabelBinarizer

    mlb = MultiLabelBinarizer()
    df1 = pd.DataFrame(mlb.fit_transform(df['Month_List']), columns=mlb.classes_)
    df['Binary_Month_List'] = df1.reindex(columns=ref, fill_value=0).values.tolist()
Or with Series.str.join, Series.str.get_dummies and reindex:

    df['Binary_Month_List'] = (df['Month_List'].str.join('|')
                                               .str.get_dummies()
                                               .reindex(columns=ref, fill_value=0)
                                               .values
                                               .tolist())
    print(df)
                Month_List      Binary_Month_List
    0               [July]  [0, 0, 1, 0, 0, 0, 0]
    1             [August]  [0, 1, 0, 0, 0, 0, 0]
    2         [July, June]  [0, 0, 1, 1, 0, 0, 0]
    3  [May, April, March]  [0, 0, 0, 0, 1, 1, 1]
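To see why the reindex step matters, here is a minimal sketch using the sample data from the question. The intermediate names dummies and aligned are only illustrative and do not appear in the answer. str.get_dummies returns its columns in sorted order and only for labels that actually occur, so 'September' is missing until reindex adds it and restores the order of ref.

```python
import pandas as pd

ref = ['September', 'August', 'July', 'June', 'May', 'April', 'March']
df = pd.DataFrame({'Month_List': [['July'], ['August'], ['July', 'June'],
                                  ['May', 'April', 'March']]})

# Join each list into one '|'-separated string, then build indicator columns.
dummies = df['Month_List'].str.join('|').str.get_dummies()
print(list(dummies.columns))   # sorted labels that occur in the data

# Reorder the columns to match ref and add all-zero columns for absent labels.
aligned = dummies.reindex(columns=ref, fill_value=0)
print(aligned.to_numpy().tolist())
```

Note that this approach assumes no label contains the '|' separator itself.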
Performance differs (timed on the sample repeated 1000 times):

    df = pd.concat([df] * 1000, ignore_index=True)

    from sklearn.preprocessing import MultiLabelBinarizer
    mlb = MultiLabelBinarizer()

    In [338]: %timeit (df['Month_List'].str.join('|').str.get_dummies().reindex(columns=ref, fill_value=0).values.tolist())
    31.4 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [339]: %timeit pd.DataFrame(mlb.fit_transform(df['Month_List']),columns=mlb.classes_).reindex(columns=ref, fill_value=0).values.tolist()
    5.57 ms ± 94.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [340]: %timeit df['Binary_Month_List2'] =df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()
    58.6 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
answered Sep 27 at 14:14 by jezrael (edited Sep 27 at 14:34)
How can I use vectorization? – Hardik Gupta, Sep 27 at 15:08
We can use explode with get_dummies; note that explode is only available from pandas 0.25 onwards:

    # sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
    df.Month_List.explode().str.get_dummies().groupby(level=0).sum().reindex(columns=ref, fill_value=0).values.tolist()
    Out[79]:
    [[0, 0, 1, 0, 0, 0, 0],
     [0, 1, 0, 0, 0, 0, 0],
     [0, 0, 1, 1, 0, 0, 0],
     [0, 0, 0, 0, 1, 1, 1]]

    #df['new']=df.Month_List.explode().str.get_dummies().groupby(level=0).sum().reindex(columns=ref, fill_value=0).values.tolist()
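The chain above can be broken into steps, as in the sketch below. The sample frame is hypothetical: its first row deliberately repeats 'July', which is not part of the question's data. With a repeat, summing per original row yields 2 rather than 1, so the sketch aggregates with max instead; this is an assumption about the desired result (a strictly binary list), not something the answer states.

```python
import pandas as pd

ref = ['September', 'August', 'July', 'June', 'May', 'April', 'March']
# Hypothetical sample: row 0 contains 'July' twice.
df = pd.DataFrame({'Month_List': [['July', 'July'], ['August'],
                                  ['May', 'April', 'March']]})

# One row per list element; the original row label is repeated in the index.
exploded = df['Month_List'].explode()

# One indicator column per distinct month found in the data.
dummies = exploded.str.get_dummies()

# Collapse back to one row per original index label.
# max keeps the values at 0/1 even when a month repeats within a row.
binary = (dummies.groupby(level=0).max()
                 .reindex(columns=ref, fill_value=0))

result = binary.to_numpy().tolist()
print(result)
```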
answered Sep 27 at 14:17 by WeNYoBen (edited Sep 27 at 14:24)
I am getting this error: 'Series' object has no attribute 'explode' – Hardik Gupta, Sep 27 at 14:58
@Hardikgupta "notice explode is available after 0.25" – Michael Gardner, Sep 27 at 15:09
If we have to use numpy vectorization, how can we achieve it? I now want to apply another function to each list in the series. – Hardik Gupta, Sep 27 at 15:10
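Here's one with NumPy tools:

    def isin_lists(df_col, ref):
        a = np.concatenate(df_col)                    # all labels from all rows, flattened
        b = np.asarray(ref)
        sidx = b.argsort()                            # order that sorts ref
        c = sidx[np.searchsorted(b,a,sorter=sidx)]    # column index in ref for each label
        l = np.array([len(i) for i in df_col])        # length of each row's list
        r = np.repeat(np.arange(len(l)),l)            # row index for each label
        out = np.zeros((len(l),len(b)), dtype=bool)
        out[r,c] = 1
        return out.view('i1')                         # reinterpret bool as int8 without copying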
Output of isin_lists for the given sample:

    In [79]: bin_ar = isin_lists(df['Month_List'], ref)

    In [80]: bin_ar
    Out[80]:
    array([[0, 0, 1, 0, 0, 0, 0],
           [0, 1, 0, 0, 0, 0, 0],
           [0, 0, 1, 1, 0, 0, 0],
           [0, 0, 0, 0, 1, 1, 1]], dtype=int8)

    # To assign as lists for each row into `df`
    In [81]: df['Binary_Month_List'] = bin_ar.tolist()

    # To get counts
    In [82]: df['Value'] = bin_ar.sum(1)

    In [83]: df
    Out[83]:
                Month_List      Binary_Month_List  Value
    0               [July]  [0, 0, 1, 0, 0, 0, 0]      1
    1             [August]  [0, 1, 0, 0, 0, 0, 0]      1
    2         [July, June]  [0, 0, 1, 1, 0, 0, 0]      2
    3  [May, April, March]  [0, 0, 0, 0, 1, 1, 1]      3
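As written, isin_lists appears to assume that every label in the column occurs in ref and that no row is empty. The following is a hedged sketch of a more defensive variant; the name isin_lists_safe, the choice to silently ignore unknown labels, and the flattening via itertools are all assumptions for illustration, not part of the answer.

```python
import itertools
import numpy as np

def isin_lists_safe(lists, ref):
    """Binary membership matrix; labels absent from ref are ignored."""
    b = np.asarray(ref)
    lens = np.array([len(x) for x in lists])
    out = np.zeros((len(lens), len(b)), dtype=np.int8)
    if lens.sum() == 0:
        return out                                   # nothing to mark
    # Flatten in Python so that empty rows cause no dtype surprises.
    a = np.array(list(itertools.chain.from_iterable(lists)))
    sidx = b.argsort()
    pos = np.searchsorted(b, a, sorter=sidx)         # position within sorted ref
    pos = np.clip(pos, 0, len(b) - 1)                # searchsorted can return len(b)
    c = sidx[pos]                                    # candidate column in ref
    valid = b[c] == a                                # keep exact matches only
    r = np.repeat(np.arange(len(lens)), lens)        # row index of each label
    out[r[valid], c[valid]] = 1
    return out

ref = ['September', 'August', 'July', 'June', 'May', 'April', 'March']
# Hypothetical input: 'October' is not in ref and the third row is empty.
rows = [['July'], ['August', 'October'], [], ['May', 'April', 'March']]
safe = isin_lists_safe(rows, ref)
print(safe)
```

The extra equality check costs one more pass over the flattened labels, which keeps the function vectorized apart from the flattening and length computation.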
If you can't use the intermediate bin_ar array for some reason and only have the 'Binary_Month_List' column to work with:

    In [15]: df['Value'] = np.vstack(df['Binary_Month_List']).sum(axis=1)
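More generally, once the list column is stacked into a 2D array, row-wise steps can often be written as axis-wise NumPy operations instead of a per-row Python function. The sketch below rebuilds the array with np.array on the column's list of lists (an alternative to np.vstack chosen here for illustration) and assumes all lists have equal length. The first_idx step is a hypothetical stand-in for "further steps"; the question does not say what they are.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Binary_Month_List': [[0, 0, 1, 0, 0, 0, 0],
                                         [0, 1, 0, 0, 0, 0, 0],
                                         [0, 0, 1, 1, 0, 0, 0],
                                         [0, 0, 0, 0, 1, 1, 1]]})

# Shape (n_rows, n_months); requires every list to have the same length.
arr = np.array(df['Binary_Month_List'].tolist())

# Equivalent of count_one from the question, computed for all rows at once.
counts = np.count_nonzero(arr, axis=1)

# Hypothetical further step: position of the first 1 in each row.
first_idx = arr.argmax(axis=1)

df['Value'] = counts
print(counts.tolist(), first_idx.tolist())
```

By contrast, np.vectorize is documented as a convenience wrapper whose implementation is essentially a for loop, which is consistent with the similar timings the question reports for it and apply.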
answered Sep 27 at 14:49 by Divakar (edited Sep 27 at 15:55)
I cannot use sum directly, as there are more steps in that function. I am only counting 1's for this example. – Hardik Gupta, Sep 27 at 15:51
@Hardikgupta Got it. Edited post. – Divakar, Sep 27 at 15:56
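I am not sure if this will be faster, but scikit-learn's CountVectorizer can also be used in this case.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(binary=True)
# Join each row's list into one comma-separated string for the vectorizer
mys = [','.join(i) for i in df['Month_List']]
X = vect.fit_transform(mys)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
col_names = vect.get_feature_names_out()
# pd.SparseDataFrame was removed in pandas 1.0; build a dense frame instead
ndf = pd.DataFrame(X.toarray(), columns=col_names)
df = df.join(ndf).astype(str)
df['Binary_Month_List'] = df.iloc[:, 1:].values.tolist()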
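One thing to watch with this approach: by default CountVectorizer lowercases the tokens and orders its columns alphabetically, so the flags will not line up with the order of a reference list. A possible sketch that keeps the reference order is to pass the list as a fixed `vocabulary` and switch off lowercasing. The `ref` and `df` below are assumed sample data, not taken from the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Assumed sample data for illustration
ref = ['September', 'August', 'July', 'June', 'May', 'April', 'March']
df = pd.DataFrame({'Month_List': [['July'], ['August'], ['July', 'June'],
                                  ['May', 'April', 'March']]})

# A fixed vocabulary pins the column order to ref; lowercase=False keeps
# 'July' from being turned into 'july' before matching
vect = CountVectorizer(binary=True, lowercase=False, vocabulary=ref)
docs = [','.join(months) for months in df['Month_List']]
binary_ar = vect.fit_transform(docs).toarray()

df['Binary_Month_List'] = binary_ar.tolist()
df['Value'] = binary_ar.sum(axis=1)
print(df)
```

Tokens that are not in the vocabulary are simply ignored, so unknown months produce no flag here.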
answered Oct 5 at 13:17
shantanuo
FYI - Vectorization works on a case by case basis. There's no magic function that works in all scenarios, if you are looking for one.
– Divakar
Sep 27 at 15:33