word frequency from file using partial match
I have a text file like this:

    tom
    and
    jerry
    went
    to
    america
    and
    england

I want to get the frequency of each word. When I tried the following command

    cat test.txt | sort | uniq -c

I got the following output:

    1 america
    2 and
    1 england
    1 jerry
    1 to
    1 tom
    1 went

But I need partial matches too, i.e., the word "to" is also present in the word "tom", so my expected count for "to" is 2. Is this possible using Unix commands?

text-processing command-line
asked Sep 20 at 14:40 by jasonroy; edited Sep 20 at 15:11 by terdon

4 Answers

Here's one way, but it isn't very elegant:

    $ sort -u file | while IFS= read -r word; do
        printf '%s\t%s\n' "$word" "$(grep -cFe "$word" file)"
      done
    america 1
    and 3
    england 1
    jerry 1
    to 2
    tom 1
    went 1

answered Sep 20 at 14:56 by terdon; edited Sep 20 at 20:48 by Stéphane Chazelas

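The -F and -e options in the command above are worth a note: -F makes grep treat each word as a fixed string rather than a regular expression, and -e protects words that begin with a dash. A minimal sketch of the difference, assuming hypothetical sample data that contains a regex metacharacter (the data and the count_fixed and count_regex helpers are illustrative, not from the question):

```shell
#!/bin/sh
# Hypothetical sample data: "a.c" contains the regex metacharacter ".".
tmp=$(mktemp)
printf '%s\n' 'a.c' 'abc' 'a.cd' > "$tmp"

# Count the lines of file $2 that contain $1 as a literal substring.
count_fixed() { grep -cFe "$1" "$2"; }

# Same, but $1 is interpreted as a basic regular expression.
count_regex() { grep -ce "$1" "$2"; }

fixed=$(count_fixed 'a.c' "$tmp")
regex=$(count_regex 'a.c' "$tmp")
printf 'fixed=%s regex=%s\n' "$fixed" "$regex"
rm -f "$tmp"
```

With -F only the two lines that literally contain "a.c" are counted; without it, "abc" matches as well because "." stands for any character.
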
An awk approach:

    awk '
      !x {c[$0]; next}
      {for (i in c) if (index($0, i)) c[i]++}
      END{for (i in c) print c[i]"\t"i}' file x=1 file | sort -k1rn

Which on your input gives

    3 and
    2 to
    1 america
    1 england
    1 jerry
    1 tom
    1 went

We process the input in two passes. In the first pass, we record the list of distinct words as the keys of the c hash table.

In the second pass, for each line in the file, we loop over all the keys in c and increment the corresponding value if the key is found in the line.

The list of distinct words in the file ends up being stored in memory. If those are English words, it shouldn't be a problem, as there are fewer than 200,000 distinct words in the English language.

answered Sep 20 at 20:54 by Stéphane Chazelas; edited Sep 21 at 7:29

Comments:

Thank you. This command works. If I run this command against a large file of around 30 GB, will a machine with 8 GB of RAM handle that? – jasonroy, Sep 20 at 21:11

@TweetMan It depends on how many unique words there are. It stores all unique words in memory. – Stéphane Chazelas, Sep 20 at 21:14

Hmm. Then that would be a problem. It may crash the system. – jasonroy, Sep 20 at 21:20

Awk isn't safe with large files and it bogs down. You may want to look into loading the data into a SQL database and querying it that way. – A.Danischewski, Sep 21 at 0:43

@A.Danischewski, awk as a language has no issue with large files. There are many implementations of an interpreter for that language. Could you please be more specific as to which implementation can't handle large files and how? How would using a SQL database help here? – Stéphane Chazelas, Sep 21 at 7:41

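As the comments note, memory use is governed by the number of distinct words rather than by the size of the file. One way to gauge that before running the awk script is to count the distinct lines and their total size with sort -u; typical sort implementations spill to temporary files on disk rather than holding the whole input in RAM. A rough sketch follows. The distinct_stats name is hypothetical, and the byte total is only a lower bound on what awk will need, since per-entry overhead varies by implementation:

```shell
#!/bin/sh
# Print "<distinct words> <bytes they occupy>" for a one-word-per-line file.
distinct_stats() {
    # LC_ALL=C compares lines byte by byte, so only identical lines are merged.
    LC_ALL=C sort -u -- "$1" | wc -lc | awk '{print $1, $2}'
}

# Demo on the sample from the question.
tmp=$(mktemp)
printf '%s\n' tom and jerry went to america and england > "$tmp"
stats=$(distinct_stats "$tmp")
echo "$stats"
rm -f "$tmp"
```

On the sample this reports 7 distinct words occupying 38 bytes (newlines included).
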
This won't crash the system, but it may take a long time to run, since it parses the input multiple times. Assuming the input file is called "in":

    sort -u < in | while read w
    do
        printf "%d\t%s\n" `grep -c "$w" in` "$w"
    done

which on your input got me:

    1 america
    3 and
    1 england
    1 jerry
    2 to
    1 tom
    1 went

answered Sep 21 at 0:33 by sitaram

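If re-reading the input once per word is the bottleneck, one possible variation (not taken from any of the answers here) is to collapse the file to its distinct words and their exact counts first, and then do the substring comparison only among the distinct words. The big file is then read once, by sort, at the cost of a comparison step that is quadratic in the number of distinct words. A sketch, assuming one whitespace-free, non-empty word per line; partial_counts is a hypothetical name:

```shell
#!/bin/sh
# Print "<count><TAB><word>", where count is the number of lines containing word.
partial_counts() {
    LC_ALL=C sort -- "$1" | LC_ALL=C uniq -c |
    awk '
        { n[$2] = $1 }                  # exact count of each distinct word
        END {
            for (w in n) {
                total = 0
                for (s in n)            # add the count of every word s containing w
                    if (index(s, w)) total += n[s]
                print total "\t" w
            }
        }' | LC_ALL=C sort -k1,1rn -k2,2
}

# Demo on the sample from the question.
tmp=$(mktemp)
printf '%s\n' tom and jerry went to america and england > "$tmp"
result=$(partial_counts "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

The distinct words are still held in memory by awk, so the caveat about the number of unique words applies here too.
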
It's not clear to me whether the partial matches are to be anchored to the beginning of the line. Assuming the answer is yes, what might speed things up here is binary search via the venerable look command. Of course, look requires its input file to be sorted, so first create a sorted version of the original file:

    sort file > file.sorted

Then loop through the original file, looking up one word at a time against the sorted file:

    while read -r word; do
        printf "%s %d\n" "$word" "$(look -b "$word" file.sorted | wc -l)"
    done <file

Some systems don't need the -b flag to be passed to look to force a binary search. Disk caching of the sorted file could help speed things up even further.

answered Sep 21 at 3:45 by iruvar; edited Sep 21 at 15:17

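Anchoring changes the results: on the sample input, "and" is a prefix of two lines but a substring of three, since it also occurs inside "england". On systems without look, the prefix interpretation can be sketched with a two-pass awk script that tests index() == 1 instead of accepting a match at any position. This is an illustration only; prefix_counts is a hypothetical name, it prints each distinct word once rather than once per input line, and it gives up the binary-search speed that look provides:

```shell
#!/bin/sh
# Print "<word> <number of lines starting with word>" for each distinct word.
prefix_counts() {
    awk '
        !second { seen[$0]; next }      # pass 1: collect the distinct words
        { for (w in seen) if (index($0, w) == 1) seen[w]++ }   # pass 2: prefix hits
        END { for (w in seen) print w, seen[w] }
    ' "$1" second=1 "$1" | LC_ALL=C sort
}

# Demo on the sample from the question.
tmp=$(mktemp)
printf '%s\n' tom and jerry went to america and england > "$tmp"
prefixes=$(prefix_counts "$tmp")
printf '%s\n' "$prefixes"
rm -f "$tmp"
```

Here "to" still counts 2 (it starts both "to" and "tom"), while "and" drops from 3 to 2.
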