Explicit song lyrics checkerA convenient wrapper for Telegram Bot librarySong lyric generator using Markov Chains - PythonSIC C command checker programming fragmentPython qwerty keyboard checkerPython 3 palindrome checkerVoting plugin for an IRC botWPF Palindrome Checker Application99 beers song in Rust
What potential problems are there with dumping dex in bard build?
In what way were Renaissance battles like chess matches?
Is it possible to save a (science) PhD in 10 months?
Understanding the use of 'eo' in a sentence from LLPSI
Why do 可不 and 可不是 by themselves imply agreement?
Wood versus marble rolling pin 'performance'
What spacing difference is acceptable with tile?
What does it take to make metal music?
Is there an operator or an easy way to match an expression one or more times with the LIKE operator in SQL?
How did composers "test" their music?
Why would gloves be necessary for handling flobberworms?
Employer reneged on negotiated clauses that weren't part of agreed contract, citing budget cuts - what can I do?
"The" for the first time only
Locked out of my own server: getting "Too many authentication failures" right away when connecting via ssh
How do I complete the "A Brilliant Smile" triumph?
Is this bible in Koine Greek?
Why is the air inside airliners so dry (low humidity)?
Characteristic scale degrees
What is self hosted version control system?
Tic Tac Toe console program
Is SSH key with passphrase a 2FA
The output -1 becomes a slash in the loop
How do you link two checking accounts from two different banks in two different countries?
How do I disable vim from producing backup files?
Explicit song lyrics checker
A convenient wrapper for Telegram Bot librarySong lyric generator using Markov Chains - PythonSIC C command checker programming fragmentPython qwerty keyboard checkerPython 3 palindrome checkerVoting plugin for an IRC botWPF Palindrome Checker Application99 beers song in Rust
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty
margin-bottom:0;
$begingroup$
To stay in practice with my Python I've decided to write an explicit song checker. It checks each word in the song against an array of 14 explicit words contained in the text file. I've decided not to include the words that are to be checked against, for I am not sure about the rule of having explicit words in a program's code. Feedback on efficiency and structure is what I'm going for mostly. Style and other critiques are invited as well.
explicit_song_checker.py
explicit_words = [
#explicit words not shown
]
def isClean(song_path):
with open(song_path) as song:
for line in song:
words = line.split(" ")
for word in words:
for explicit_word in explicit_words:
if word == explicit_word:
return False
return True
def main():
song = raw_input("Enter path to song: n")
if isClean(song):
print("CLEAN")
else:
print("EXPLICIT")
if __name__ == '__main__':
main()
python strings file
$endgroup$
add a comment
|
$begingroup$
To stay in practice with my Python I've decided to write an explicit song checker. It checks each word in the song against an array of 14 explicit words contained in the text file. I've decided not to include the words that are to be checked against, for I am not sure about the rule of having explicit words in a program's code. Feedback on efficiency and structure is what I'm going for mostly. Style and other critiques are invited as well.
explicit_song_checker.py
explicit_words = [
#explicit words not shown
]
def isClean(song_path):
with open(song_path) as song:
for line in song:
words = line.split(" ")
for word in words:
for explicit_word in explicit_words:
if word == explicit_word:
return False
return True
def main():
song = raw_input("Enter path to song: n")
if isClean(song):
print("CLEAN")
else:
print("EXPLICIT")
if __name__ == '__main__':
main()
python strings file
$endgroup$
6
$begingroup$
The only tiny thing (not big enough for an answer, IMHO) I want to add to all the great answers already given, is about naming variables. It should be clear from the name what a variable contains. What does another programmer expect that a variable namedsongcontains? It cannot be a song, because that is something that you hear. So probably the lyrics then? Or the file? Ah, the path to the file that contains the lyrics of the song. I would try to make variable names as "explicit" as possible ;-) Sosong_path(as used in the function) is already much better :-)
$endgroup$
– wovano
Jun 15 at 19:31
9
$begingroup$
As a counterexample, check out the Scunthorpe problem for what can happen if your check is overzealous…
$endgroup$
– gidds
Jun 16 at 0:32
$begingroup$
Not sure if scope creeping the review, but would be nice if your script could take a path to a folder and print all songs inside it.
$endgroup$
– Viktor Mellgren
Jun 17 at 8:19
$begingroup$
why not use a dictionary and some sort of hash. a linear search would be horrible
$endgroup$
– Andrew Scott Evans
Jun 18 at 7:23
1
$begingroup$
There is a meta CR post that involves this post
$endgroup$
– Sᴀᴍ Onᴇᴌᴀ
Jun 19 at 22:46
add a comment
|
$begingroup$
To stay in practice with my Python I've decided to write an explicit song checker. It checks each word in the song against an array of 14 explicit words contained in the text file. I've decided not to include the words that are to be checked against, for I am not sure about the rule of having explicit words in a program's code. Feedback on efficiency and structure is what I'm going for mostly. Style and other critiques are invited as well.
explicit_song_checker.py
explicit_words = [
#explicit words not shown
]
def isClean(song_path):
with open(song_path) as song:
for line in song:
words = line.split(" ")
for word in words:
for explicit_word in explicit_words:
if word == explicit_word:
return False
return True
def main():
song = raw_input("Enter path to song: n")
if isClean(song):
print("CLEAN")
else:
print("EXPLICIT")
if __name__ == '__main__':
main()
python strings file
$endgroup$
To stay in practice with my Python I've decided to write an explicit song checker. It checks each word in the song against an array of 14 explicit words contained in the text file. I've decided not to include the words that are to be checked against, for I am not sure about the rule of having explicit words in a program's code. Feedback on efficiency and structure is what I'm going for mostly. Style and other critiques are invited as well.
explicit_song_checker.py
explicit_words = [
#explicit words not shown
]
def isClean(song_path):
with open(song_path) as song:
for line in song:
words = line.split(" ")
for word in words:
for explicit_word in explicit_words:
if word == explicit_word:
return False
return True
def main():
song = raw_input("Enter path to song: n")
if isClean(song):
print("CLEAN")
else:
print("EXPLICIT")
if __name__ == '__main__':
main()
python strings file
python strings file
edited Jun 15 at 16:59
dfhwze
13k3 gold badges24 silver badges94 bronze badges
13k3 gold badges24 silver badges94 bronze badges
asked Jun 15 at 3:44
LinnyLinny
3,5172 gold badges11 silver badges51 bronze badges
3,5172 gold badges11 silver badges51 bronze badges
6
$begingroup$
The only tiny thing (not big enough for an answer, IMHO) I want to add to all the great answers already given, is about naming variables. It should be clear from the name what a variable contains. What does another programmer expect that a variable namedsongcontains? It cannot be a song, because that is something that you hear. So probably the lyrics then? Or the file? Ah, the path to the file that contains the lyrics of the song. I would try to make variable names as "explicit" as possible ;-) Sosong_path(as used in the function) is already much better :-)
$endgroup$
– wovano
Jun 15 at 19:31
9
$begingroup$
As a counterexample, check out the Scunthorpe problem for what can happen if your check is overzealous…
$endgroup$
– gidds
Jun 16 at 0:32
$begingroup$
Not sure if scope creeping the review, but would be nice if your script could take a path to a folder and print all songs inside it.
$endgroup$
– Viktor Mellgren
Jun 17 at 8:19
$begingroup$
why not use a dictionary and some sort of hash. a linear search would be horrible
$endgroup$
– Andrew Scott Evans
Jun 18 at 7:23
1
$begingroup$
There is a meta CR post that involves this post
$endgroup$
– Sᴀᴍ Onᴇᴌᴀ
Jun 19 at 22:46
add a comment
|
6
$begingroup$
The only tiny thing (not big enough for an answer, IMHO) I want to add to all the great answers already given, is about naming variables. It should be clear from the name what a variable contains. What does another programmer expect that a variable namedsongcontains? It cannot be a song, because that is something that you hear. So probably the lyrics then? Or the file? Ah, the path to the file that contains the lyrics of the song. I would try to make variable names as "explicit" as possible ;-) Sosong_path(as used in the function) is already much better :-)
$endgroup$
– wovano
Jun 15 at 19:31
9
$begingroup$
As a counterexample, check out the Scunthorpe problem for what can happen if your check is overzealous…
$endgroup$
– gidds
Jun 16 at 0:32
$begingroup$
Not sure if scope creeping the review, but would be nice if your script could take a path to a folder and print all songs inside it.
$endgroup$
– Viktor Mellgren
Jun 17 at 8:19
$begingroup$
why not use a dictionary and some sort of hash. a linear search would be horrible
$endgroup$
– Andrew Scott Evans
Jun 18 at 7:23
1
$begingroup$
There is a meta CR post that involves this post
$endgroup$
– Sᴀᴍ Onᴇᴌᴀ
Jun 19 at 22:46
6
6
$begingroup$
The only tiny thing (not big enough for an answer, IMHO) I want to add to all the great answers already given, is about naming variables. It should be clear from the name what a variable contains. What does another programmer expect that a variable named
song contains? It cannot be a song, because that is something that you hear. So probably the lyrics then? Or the file? Ah, the path to the file that contains the lyrics of the song. I would try to make variable names as "explicit" as possible ;-) So song_path (as used in the function) is already much better :-)$endgroup$
– wovano
Jun 15 at 19:31
$begingroup$
The only tiny thing (not big enough for an answer, IMHO) I want to add to all the great answers already given, is about naming variables. It should be clear from the name what a variable contains. What does another programmer expect that a variable named
song contains? It cannot be a song, because that is something that you hear. So probably the lyrics then? Or the file? Ah, the path to the file that contains the lyrics of the song. I would try to make variable names as "explicit" as possible ;-) So song_path (as used in the function) is already much better :-)$endgroup$
– wovano
Jun 15 at 19:31
9
9
$begingroup$
As a counterexample, check out the Scunthorpe problem for what can happen if your check is overzealous…
$endgroup$
– gidds
Jun 16 at 0:32
$begingroup$
As a counterexample, check out the Scunthorpe problem for what can happen if your check is overzealous…
$endgroup$
– gidds
Jun 16 at 0:32
$begingroup$
Not sure if scope creeping the review, but would be nice if your script could take a path to a folder and print all songs inside it.
$endgroup$
– Viktor Mellgren
Jun 17 at 8:19
$begingroup$
Not sure if scope creeping the review, but would be nice if your script could take a path to a folder and print all songs inside it.
$endgroup$
– Viktor Mellgren
Jun 17 at 8:19
$begingroup$
why not use a dictionary and some sort of hash. a linear search would be horrible
$endgroup$
– Andrew Scott Evans
Jun 18 at 7:23
$begingroup$
why not use a dictionary and some sort of hash. a linear search would be horrible
$endgroup$
– Andrew Scott Evans
Jun 18 at 7:23
1
1
$begingroup$
There is a meta CR post that involves this post
$endgroup$
– Sᴀᴍ Onᴇᴌᴀ
Jun 19 at 22:46
$begingroup$
There is a meta CR post that involves this post
$endgroup$
– Sᴀᴍ Onᴇᴌᴀ
Jun 19 at 22:46
add a comment
|
6 Answers
6
active
oldest
votes
$begingroup$
I recommend practicing Python 3 rather than Python 2 these days.
According to PEP 8, isClean() should be is_clean().
One or more of your loops could be replaced by some use of the any() function. Note that this suggests that an is_explicit() function would be a more natural concept than an is_clean() function.
I would expect song lyrics to contain a mixture of uppercase and lowercase. Words may be delimited by punctuation as well as spaces. Therefore, words = line.split(" ") is probably too naïve. Furthermore, I would expect each song to be short enough to fit entirely in memory very comfortably, so processing each file line by line is an unnecessary complication.
I would rewrite the program to use a regular expression instead.
import re
explicit_words = [
…
]
is_explicit = re.compile(
r'b(?:' +
'|'.join(re.escape(w) for w in explicit_words) +
r')b',
re.IGNORECASE).search
def main():
with open(raw_input("Enter path to song: ")) as song:
print("EXPLICIT" if is_explicit(song.read()) else "CLEAN")
if __name__ == '__main__':
main()
The way I chose to write the is_explicit() function above might show too much Haskell influence for some Python programmers' taste. Here's a more conventional way of writing it:
def is_explicit(text):
return re.search(
r'b(?:' +
'|'.join(re.escape(w) for w in explicit_words) +
r')b',
text,
re.IGNORECASE
)
$endgroup$
$begingroup$
The regex answer has a variety of things I don't like. Firstly, this compiles the regex every time. 2nd, regex has no guarantees of O(1) like a set would.
$endgroup$
– Oscar Smith
Jun 17 at 23:50
$begingroup$
@OscarSmith The program calls the helper function once frommain(), and the word list contains just 14 entries, so neither objection was a significant concern in this solution.
$endgroup$
– 200_success
Jun 18 at 0:00
$begingroup$
Oh, I was mainly commenting on the bottom block, where it's in the function.
$endgroup$
– Oscar Smith
Jun 18 at 0:21
$begingroup$
I don't think using the regular expression makes this code any easier to understand. xkcd.com/1171
$endgroup$
– Captain Delano
Jun 18 at 7:32
add a comment
|
$begingroup$
Splitting line into words using words = line.split(" ") will only split the line on space characters. If two words are separated by a tab character, or other white space character other than a space, the split will not happen at that point, which may confuse your explicit checker.
Using words = line.split() would split on spaces, tab characters, and any other white space characters.
You might also want to investigate splitting using the word break regular expression b, which would treat non-word characters as break points, so that a line containing an explicit word immediately followed by a comma (such as "explicit,") or period, or semicolon, etc. won't sneak through the filter.
for explicit_word in explicit_words:
if word == explicit_word:
return False
Python has great searching built into it. Instead of explicitly writing a loop over all explicit words, you could just ask Python to tell you if a word is in the list of explicit words:
if word in explicit_words:
return False
Or better, use a set() of explicit words, and the in operation drops from $O(n)$ to $O(1)$ time complexity:
explicit_words =
"explicit", "bad", "ugly", "offensive", "words"
# ...
if word in explicit_words:
return False
We've improved things a little, but we still have an explicit (pun intended) loop over the list of words in a line. We can use the any() method, and a list comprehension, to get Python to do the looping for us, which should be faster.
for line in song:
words = line.split()
if any(word in explicit_words for word in words):
return False
If word in explicits returns True for any of the word in the list words, the any() method will "short-circuit" and immediately return True without examining the remaining items in words.
If "explicit" is a bad word, and the song lyrics contains "Explicit" or "EXPLICIT", you wouldn't want to mark the song as "CLEAN", right? Perhaps you want word.lower() in explicit_words? As @wizzwizz4 pointed out, .casefold() is better than .lower().
Resulting method:
explicit_words = set(explicit_word.casefold()
for explicit_word in ("explicit", "bad", "ugly", "offensive", "words"))
def isClean(song_path):
with open(song_path) as song:
for line in song:
words = line.split()
if any(word.casefold() in explicit_words for word in words):
return False
return True
$endgroup$
1
$begingroup$
Why does it need to iterate over the song lines?song.split()would also split onn(andr, of course). Call me unpythonic, but I would directlyreturn any(word.lower() in explicit_words for word in song.split())
$endgroup$
– Helio
Jun 15 at 12:22
$begingroup$
@Helio I expect you would needsong.read().split(), notsong.split(). As for why I didn’t? I considered making it one-statement, and decided against doing so because thesplit()doesn’t consider punctuation as word separators. That area still needed replacement with a regexbstyle word splitting, so embedding it inside of theany(...)hides the work that still needed to be done.
$endgroup$
– AJNeufeld
Jun 15 at 12:37
$begingroup$
yes, my bad; I've posted the comment thoughtlessly. A naïve alternative to regular expressions would be replacing punctuation symbols by spaces and then runningsplit. However, that would be too clumsy. If this code is not an exercise, maybe the OP should use some specialized framework like NLTK.
$endgroup$
– Helio
Jun 15 at 12:56
3
$begingroup$
Why notstr.casefoldinstead ofstr.lower? They're the same for (most of) Latin, but different for stuff like Greek.
$endgroup$
– wizzwizz4
Jun 15 at 16:17
$begingroup$
@wizzwizz4 Fair point; well made. All the words inexplicit_wordswould need to be run throughcasefold()as well.
$endgroup$
– AJNeufeld
Jun 15 at 16:26
add a comment
|
$begingroup$
Some suggestions:
- Using
argparseto parse arguments is preferable to any kind of interactive input, because it means that- it's much easier to automate and integrate in other scripts,
- you could trivially support passing multiple songs in one command,
- it's what any experienced shell user would expect, and
- if the user adds
--helpor doesn't pass any arguments they will get a synopsis for the command.
- The list of explicit words should not be part of the code, because it is configuration and is very likely to change often. You'll want to put words in a separate file and read them in during initialization. You could very easily make this an optional argument (with
argparse) so that you could do fancy stuff like./explicit_song_checker.py --words=./words.txt --words=./other.txt ./kill-the-queen.txt ./bulls-on-parade.txt blackwill automatically format your code to be pretty idiomatic, such as using two empty lines between top-level functions.flake8will check for style issues. I believe it will assume that you're writing Python 3, and so will complain about the use ofraw_input(calledinputin Python 3).mypycan be used to enforce type hints, which is a super helpful way to make your code clearer without resorting to comments.
$endgroup$
add a comment
|
$begingroup$
To simplify your input you could change
word == explicit_wordtoexplicit_word in word.This can lead to the Scunthorpe problem, and so from here you can check if the word only contains prefixes and suffixes.
Given the amount of common prefixes alone, it's unlikely you'd catch all variations using a blacklist alone. It would also catch gibberish/semigibberish that you may not have thought of - "Antioverfucked" / "Overantifucked".
Your code misses words without spaces "I don't give afuck" and so you may want to perform a 'similarity check'.
This could also match thinks like "sexx".
Your code doesn't reasonably handle ZWSP and other forms of Unicode abuse.
>>> print('expu200blitive')
explitive
>>> 'expu200blitive' == 'explitive'
False
>>> print('expu1DD3litive')
expᷓlitiveYour code doesn't reasonably detect leet speak.
Your code doesn't reasonably detect creative Unicode character selections.
>>> print('u2131u01DAu0255u212A')
ℱǚɕKYour code doesn't reasonably handle phonetic abuse.
"Phuck" is pronounced "Fuck".
"Bend Dover" sounds like "bend over".
"I CUP" spelt out is "I see you pee".
"See you next Tuesday" each starting phonetic spells "CU...".
When I say doesn't reasonably, I mean if someone combined the above to get something ridiculous like "4n7ipᷓhǚɕK" your word list would have to be ridiculous in size to match it.
It should also be fairly obvious that someone probably will find a way to bypass your check no matter what you do.
Practicality vs Perfection
Given how widespread leet speak is you should handle that at a bare minimum. This is fairly easy, you just 'translate' from this alphabet. And perform the checks you're already doing.
You should then probably use a 'similar check' to match "sexx" or "afuck". This is as duplicating a letter or missing a space is readable and easy.
This check may result in problems as "Scunthorpe" could be understood as "s cunt thorp e" so would be split into two two known words and two extra letters, and then the similarity check my come back low enough to say it's a swearword.
From here ZWSP and other unicode abuse probably are the next most common. And with things like deasciiify it's only getting easier.
$endgroup$
1
$begingroup$
Yeah, if catching explicit language was easy, you wouldn't have things like controversies over video game language filters getting over-aggressive and filtering out legitimate speech.
$endgroup$
– nick012000
Jun 17 at 2:37
1
$begingroup$
If you only check for prefixes and suffixes, you still have the Penistone and Cockfosters problems.
$endgroup$
– Mike Scott
Jun 17 at 16:43
add a comment
|
$begingroup$
explicit_words = [
#explicit words not shown
]
Instead of embedding the word list within the program, you might want to treat it as configuration, and load it from a data file. The advantage of this is that it can allow the use of alternative configurations by command-line option (e.g. use a list of French obscenities for French-language lyrics).
On efficiency: we're only interested in the set of words in the lyrics. Given that many songs have repeated lyrics, then it's probably worth reducing the lyrics to a set of words before comparing that set with the banned words (use & for set intersection), or set.isdisjoint() for a potentially faster test.
$endgroup$
add a comment
|
$begingroup$
The code feedback is already done wonderfully. I'd like to write about some pointers that might be important, apart from the leet speak:
- The word Phoque , is French for seal. There are numerous examples of words which are perfectly meaningful in some language but if pronounced in English, might be considered offensive. You might need to do a phonetic similarity check
- The context can be taken into account. For example, the word 'blow' or 'mom' need not be offensive. It becomes offensive only when it is surrounded by contextual words that make it offensive, ('job', etc). In these cases, you might need to come up with an algorithm that outputs a probability score and filter songs based on that.
$endgroup$
$begingroup$
I'd like to take this up as a project. This has tremendous scope. For example, you can integrate this into the smartphones of kids . If youre interested, drop me a message
$endgroup$
– N Praveen Chandhar
Jun 18 at 6:14
add a comment
|
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/4.0/"u003ecc by-sa 4.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f222339%2fexplicit-song-lyrics-checker%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
6 Answers
6
active
oldest
votes
6 Answers
6
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
I recommend practicing Python 3 rather than Python 2 these days.
According to PEP 8, isClean() should be is_clean().
One or more of your loops could be replaced by some use of the any() function. Note that this suggests that an is_explicit() function would be a more natural concept than an is_clean() function.
I would expect song lyrics to contain a mixture of uppercase and lowercase. Words may be delimited by punctuation as well as spaces. Therefore, words = line.split(" ") is probably too naïve. Furthermore, I would expect each song to be short enough to fit entirely in memory very comfortably, so processing each file line by line is an unnecessary complication.
I would rewrite the program to use a regular expression instead.
import re
explicit_words = [
…
]
is_explicit = re.compile(
r'b(?:' +
'|'.join(re.escape(w) for w in explicit_words) +
r')b',
re.IGNORECASE).search
def main():
with open(raw_input("Enter path to song: ")) as song:
print("EXPLICIT" if is_explicit(song.read()) else "CLEAN")
if __name__ == '__main__':
main()
The way I chose to write the is_explicit() function above might show too much Haskell influence for some Python programmers' taste. Here's a more conventional way of writing it:
def is_explicit(text):
return re.search(
r'b(?:' +
'|'.join(re.escape(w) for w in explicit_words) +
r')b',
text,
re.IGNORECASE
)
$endgroup$
$begingroup$
The regex answer has a variety of things I don't like. Firstly, this compiles the regex every time. 2nd, regex has no guarantees of O(1) like a set would.
$endgroup$
– Oscar Smith
Jun 17 at 23:50
$begingroup$
@OscarSmith The program calls the helper function once frommain(), and the word list contains just 14 entries, so neither objection was a significant concern in this solution.
$endgroup$
– 200_success
Jun 18 at 0:00
$begingroup$
Oh, I was mainly commenting on the bottom block, where it's in the function.
$endgroup$
– Oscar Smith
Jun 18 at 0:21
$begingroup$
I don't think using the regular expression makes this code any easier to understand. xkcd.com/1171
$endgroup$
– Captain Delano
Jun 18 at 7:32
add a comment
|
$begingroup$
I recommend practicing Python 3 rather than Python 2 these days.
According to PEP 8, isClean() should be is_clean().
One or more of your loops could be replaced by some use of the any() function. Note that this suggests that an is_explicit() function would be a more natural concept than an is_clean() function.
I would expect song lyrics to contain a mixture of uppercase and lowercase. Words may be delimited by punctuation as well as spaces. Therefore, words = line.split(" ") is probably too naïve. Furthermore, I would expect each song to be short enough to fit entirely in memory very comfortably, so processing each file line by line is an unnecessary complication.
I would rewrite the program to use a regular expression instead.
import re
explicit_words = [
…
]
is_explicit = re.compile(
r'b(?:' +
'|'.join(re.escape(w) for w in explicit_words) +
r')b',
re.IGNORECASE).search
def main():
with open(raw_input("Enter path to song: ")) as song:
print("EXPLICIT" if is_explicit(song.read()) else "CLEAN")
if __name__ == '__main__':
main()
The way I chose to write the is_explicit() function above might show too much Haskell influence for some Python programmers' taste. Here's a more conventional way of writing it:
def is_explicit(text):
return re.search(
r'b(?:' +
'|'.join(re.escape(w) for w in explicit_words) +
r')b',
text,
re.IGNORECASE
)
$endgroup$
$begingroup$
The regex answer has a variety of things I don't like. Firstly, this compiles the regex every time. 2nd, regex has no guarantees of O(1) like a set would.
$endgroup$
– Oscar Smith
Jun 17 at 23:50
$begingroup$
@OscarSmith The program calls the helper function once frommain(), and the word list contains just 14 entries, so neither objection was a significant concern in this solution.
$endgroup$
– 200_success
Jun 18 at 0:00
$begingroup$
Oh, I was mainly commenting on the bottom block, where it's in the function.
$endgroup$
– Oscar Smith
Jun 18 at 0:21
$begingroup$
I don't think using the regular expression makes this code any easier to understand. xkcd.com/1171
$endgroup$
– Captain Delano
Jun 18 at 7:32
add a comment
|
$begingroup$
I recommend practicing Python 3 rather than Python 2 these days.
According to PEP 8, isClean() should be is_clean().
One or more of your loops could be replaced by some use of the any() function. Note that this suggests that an is_explicit() function would be a more natural concept than an is_clean() function.
I would expect song lyrics to contain a mixture of uppercase and lowercase. Words may be delimited by punctuation as well as spaces. Therefore, words = line.split(" ") is probably too naïve. Furthermore, I would expect each song to be short enough to fit entirely in memory very comfortably, so processing each file line by line is an unnecessary complication.
I would rewrite the program to use a regular expression instead.
import re
explicit_words = [
…
]
is_explicit = re.compile(
r'b(?:' +
'|'.join(re.escape(w) for w in explicit_words) +
r')b',
re.IGNORECASE).search
def main():
with open(raw_input("Enter path to song: ")) as song:
print("EXPLICIT" if is_explicit(song.read()) else "CLEAN")
if __name__ == '__main__':
main()
The way I chose to write the is_explicit() function above might show too much Haskell influence for some Python programmers' taste. Here's a more conventional way of writing it:
def is_explicit(text):
return re.search(
r'b(?:' +
'|'.join(re.escape(w) for w in explicit_words) +
r')b',
text,
re.IGNORECASE
)
$endgroup$
I recommend practicing Python 3 rather than Python 2 these days.
According to PEP 8, isClean() should be is_clean().
One or more of your loops could be replaced by some use of the any() function. Note that this suggests that an is_explicit() function would be a more natural concept than an is_clean() function.
I would expect song lyrics to contain a mixture of uppercase and lowercase. Words may be delimited by punctuation as well as spaces. Therefore, words = line.split(" ") is probably too naïve. Furthermore, I would expect each song to be short enough to fit entirely in memory very comfortably, so processing each file line by line is an unnecessary complication.
I would rewrite the program to use a regular expression instead.
import re
explicit_words = [
…
]
is_explicit = re.compile(
r'b(?:' +
'|'.join(re.escape(w) for w in explicit_words) +
r')b',
re.IGNORECASE).search
def main():
with open(raw_input("Enter path to song: ")) as song:
print("EXPLICIT" if is_explicit(song.read()) else "CLEAN")
if __name__ == '__main__':
main()
The way I chose to write the is_explicit() function above might show too much Haskell influence for some Python programmers' taste. Here's a more conventional way of writing it:
def is_explicit(text):
return re.search(
r'b(?:' +
'|'.join(re.escape(w) for w in explicit_words) +
r')b',
text,
re.IGNORECASE
)
answered Jun 15 at 4:40
200_success200_success
136k21 gold badges175 silver badges448 bronze badges
136k21 gold badges175 silver badges448 bronze badges
$begingroup$
The regex answer has a variety of things I don't like. Firstly, this compiles the regex every time. 2nd, regex has no guarantees of O(1) like a set would.
$endgroup$
– Oscar Smith
Jun 17 at 23:50
$begingroup$
@OscarSmith The program calls the helper function once frommain(), and the word list contains just 14 entries, so neither objection was a significant concern in this solution.
$endgroup$
– 200_success
Jun 18 at 0:00
$begingroup$
Oh, I was mainly commenting on the bottom block, where it's in the function.
$endgroup$
– Oscar Smith
Jun 18 at 0:21
$begingroup$
I don't think using the regular expression makes this code any easier to understand. xkcd.com/1171
$endgroup$
– Captain Delano
Jun 18 at 7:32
add a comment
|
$begingroup$
The regex answer has a variety of things I don't like. Firstly, this compiles the regex every time. 2nd, regex has no guarantees of O(1) like a set would.
$endgroup$
– Oscar Smith
Jun 17 at 23:50
$begingroup$
@OscarSmith The program calls the helper function once frommain(), and the word list contains just 14 entries, so neither objection was a significant concern in this solution.
$endgroup$
– 200_success
Jun 18 at 0:00
$begingroup$
Oh, I was mainly commenting on the bottom block, where it's in the function.
$endgroup$
– Oscar Smith
Jun 18 at 0:21
$begingroup$
I don't think using the regular expression makes this code any easier to understand. xkcd.com/1171
$endgroup$
– Captain Delano
Jun 18 at 7:32
$begingroup$
The regex answer has a variety of things I don't like. Firstly, this compiles the regex every time. 2nd, regex has no guarantees of O(1) like a set would.
$endgroup$
– Oscar Smith
Jun 17 at 23:50
$begingroup$
The regex answer has a variety of things I don't like. Firstly, this compiles the regex every time. 2nd, regex has no guarantees of O(1) like a set would.
$endgroup$
– Oscar Smith
Jun 17 at 23:50
$begingroup$
@OscarSmith The program calls the helper function once from
main(), and the word list contains just 14 entries, so neither objection was a significant concern in this solution.$endgroup$
– 200_success
Jun 18 at 0:00
$begingroup$
@OscarSmith The program calls the helper function once from
main(), and the word list contains just 14 entries, so neither objection was a significant concern in this solution.$endgroup$
– 200_success
Jun 18 at 0:00
$begingroup$
Oh, I was mainly commenting on the bottom block, where it's in the function.
$endgroup$
– Oscar Smith
Jun 18 at 0:21
$begingroup$
Oh, I was mainly commenting on the bottom block, where it's in the function.
$endgroup$
– Oscar Smith
Jun 18 at 0:21
$begingroup$
I don't think using the regular expression makes this code any easier to understand. xkcd.com/1171
$endgroup$
– Captain Delano
Jun 18 at 7:32
$begingroup$
I don't think using the regular expression makes this code any easier to understand. xkcd.com/1171
$endgroup$
– Captain Delano
Jun 18 at 7:32
add a comment
|
$begingroup$
Splitting line into words using words = line.split(" ") will only split the line on space characters. If two words are separated by a tab character, or other white space character other than a space, the split will not happen at that point, which may confuse your explicit checker.
Using words = line.split() would split on spaces, tab characters, and any other white space characters.
You might also want to investigate splitting using the word break regular expression b, which would treat non-word characters as break points, so that a line containing an explicit word immediately followed by a comma (such as "explicit,") or period, or semicolon, etc. won't sneak through the filter.
for explicit_word in explicit_words:
if word == explicit_word:
return False
Python has great searching built into it. Instead of explicitly writing a loop over all explicit words, you could just ask Python to tell you if a word is in the list of explicit words:
if word in explicit_words:
return False
Or better, use a set() of explicit words, and the in operation drops from $O(n)$ to $O(1)$ time complexity:
explicit_words =
"explicit", "bad", "ugly", "offensive", "words"
# ...
if word in explicit_words:
return False
We've improved things a little, but we still have an explicit (pun intended) loop over the list of words in a line. We can use the any() method, and a list comprehension, to get Python to do the looping for us, which should be faster.
for line in song:
words = line.split()
if any(word in explicit_words for word in words):
return False
If word in explicits returns True for any of the word in the list words, the any() method will "short-circuit" and immediately return True without examining the remaining items in words.
If "explicit" is a bad word, and the song lyrics contains "Explicit" or "EXPLICIT", you wouldn't want to mark the song as "CLEAN", right? Perhaps you want word.lower() in explicit_words? As @wizzwizz4 pointed out, .casefold() is better than .lower().
Resulting method:
explicit_words = set(explicit_word.casefold()
for explicit_word in ("explicit", "bad", "ugly", "offensive", "words"))
def isClean(song_path):
with open(song_path) as song:
for line in song:
words = line.split()
if any(word.casefold() in explicit_words for word in words):
return False
return True
$endgroup$
1
$begingroup$
Why does it need to iterate over the song lines?song.split()would also split onn(andr, of course). Call me unpythonic, but I would directlyreturn any(word.lower() in explicit_words for word in song.split())
$endgroup$
– Helio
Jun 15 at 12:22
$begingroup$
@Helio I expect you would needsong.read().split(), notsong.split(). As for why I didn’t? I considered making it one-statement, and decided against doing so because thesplit()doesn’t consider punctuation as word separators. That area still needed replacement with a regexbstyle word splitting, so embedding it inside of theany(...)hides the work that still needed to be done.
$endgroup$
– AJNeufeld
Jun 15 at 12:37
$begingroup$
yes, my bad; I've posted the comment thoughtlessly. A naïve alternative to regular expressions would be replacing punctuation symbols by spaces and then runningsplit. However, that would be too clumsy. If this code is not an exercise, maybe the OP should use some specialized framework like NLTK.
$endgroup$
– Helio
Jun 15 at 12:56
3
$begingroup$
Why notstr.casefoldinstead ofstr.lower? They're the same for (most of) Latin, but different for stuff like Greek.
$endgroup$
– wizzwizz4
Jun 15 at 16:17
$begingroup$
@wizzwizz4 Fair point; well made. All the words inexplicit_wordswould need to be run throughcasefold()as well.
$endgroup$
– AJNeufeld
Jun 15 at 16:26
add a comment
|
$begingroup$
Splitting line into words using words = line.split(" ") will only split the line on space characters. If two words are separated by a tab character, or other white space character other than a space, the split will not happen at that point, which may confuse your explicit checker.
Using words = line.split() would split on spaces, tab characters, and any other white space characters.
You might also want to investigate splitting using the word break regular expression b, which would treat non-word characters as break points, so that a line containing an explicit word immediately followed by a comma (such as "explicit,") or period, or semicolon, etc. won't sneak through the filter.
for explicit_word in explicit_words:
if word == explicit_word:
return False
Python has great searching built into it. Instead of explicitly writing a loop over all explicit words, you could just ask Python to tell you if a word is in the list of explicit words:
if word in explicit_words:
return False
Or better, use a set() of explicit words, and the in operation drops from $O(n)$ to $O(1)$ time complexity:
explicit_words =
"explicit", "bad", "ugly", "offensive", "words"
# ...
if word in explicit_words:
return False
We've improved things a little, but we still have an explicit (pun intended) loop over the list of words in a line. We can use the any() method, and a list comprehension, to get Python to do the looping for us, which should be faster.
for line in song:
words = line.split()
if any(word in explicit_words for word in words):
return False
If word in explicits returns True for any of the word in the list words, the any() method will "short-circuit" and immediately return True without examining the remaining items in words.
If "explicit" is a bad word, and the song lyrics contains "Explicit" or "EXPLICIT", you wouldn't want to mark the song as "CLEAN", right? Perhaps you want word.lower() in explicit_words? As @wizzwizz4 pointed out, .casefold() is better than .lower().
Resulting method:
explicit_words = set(explicit_word.casefold()
for explicit_word in ("explicit", "bad", "ugly", "offensive", "words"))
def isClean(song_path):
with open(song_path) as song:
for line in song:
words = line.split()
if any(word.casefold() in explicit_words for word in words):
return False
return True
$endgroup$
1
$begingroup$
Why does it need to iterate over the song lines?song.split()would also split onn(andr, of course). Call me unpythonic, but I would directlyreturn any(word.lower() in explicit_words for word in song.split())
$endgroup$
– Helio
Jun 15 at 12:22
$begingroup$
@Helio I expect you would needsong.read().split(), notsong.split(). As for why I didn’t? I considered making it one-statement, and decided against doing so because thesplit()doesn’t consider punctuation as word separators. That area still needed replacement with a regexbstyle word splitting, so embedding it inside of theany(...)hides the work that still needed to be done.
$endgroup$
– AJNeufeld
Jun 15 at 12:37
$begingroup$
yes, my bad; I've posted the comment thoughtlessly. A naïve alternative to regular expressions would be replacing punctuation symbols by spaces and then runningsplit. However, that would be too clumsy. If this code is not an exercise, maybe the OP should use some specialized framework like NLTK.
$endgroup$
– Helio
Jun 15 at 12:56
3
$begingroup$
Why notstr.casefoldinstead ofstr.lower? They're the same for (most of) Latin, but different for stuff like Greek.
$endgroup$
– wizzwizz4
Jun 15 at 16:17
$begingroup$
@wizzwizz4 Fair point; well made. All the words inexplicit_wordswould need to be run throughcasefold()as well.
$endgroup$
– AJNeufeld
Jun 15 at 16:26
add a comment
|
$begingroup$
Splitting line into words using words = line.split(" ") will only split the line on space characters. If two words are separated by a tab character, or other white space character other than a space, the split will not happen at that point, which may confuse your explicit checker.
Using words = line.split() would split on spaces, tab characters, and any other white space characters.
You might also want to investigate splitting using the word break regular expression b, which would treat non-word characters as break points, so that a line containing an explicit word immediately followed by a comma (such as "explicit,") or period, or semicolon, etc. won't sneak through the filter.
for explicit_word in explicit_words:
if word == explicit_word:
return False
Python has great searching built into it. Instead of explicitly writing a loop over all explicit words, you could just ask Python to tell you if a word is in the list of explicit words:
if word in explicit_words:
return False
Or better, use a set() of explicit words, and the in operation drops from $O(n)$ to $O(1)$ time complexity:
explicit_words =
"explicit", "bad", "ugly", "offensive", "words"
# ...
if word in explicit_words:
return False
We've improved things a little, but we still have an explicit (pun intended) loop over the list of words in a line. We can use the any() method, and a list comprehension, to get Python to do the looping for us, which should be faster.
for line in song:
words = line.split()
if any(word in explicit_words for word in words):
return False
If word in explicits returns True for any of the word in the list words, the any() method will "short-circuit" and immediately return True without examining the remaining items in words.
If "explicit" is a bad word, and the song lyrics contains "Explicit" or "EXPLICIT", you wouldn't want to mark the song as "CLEAN", right? Perhaps you want word.lower() in explicit_words? As @wizzwizz4 pointed out, .casefold() is better than .lower().
Resulting method:
explicit_words = set(explicit_word.casefold()
for explicit_word in ("explicit", "bad", "ugly", "offensive", "words"))
def isClean(song_path):
with open(song_path) as song:
for line in song:
words = line.split()
if any(word.casefold() in explicit_words for word in words):
return False
return True
$endgroup$
Splitting line into words using words = line.split(" ") will only split the line on space characters. If two words are separated by a tab character, or other white space character other than a space, the split will not happen at that point, which may confuse your explicit checker.
Using words = line.split() would split on spaces, tab characters, and any other white space characters.
You might also want to investigate splitting using the word break regular expression b, which would treat non-word characters as break points, so that a line containing an explicit word immediately followed by a comma (such as "explicit,") or period, or semicolon, etc. won't sneak through the filter.
for explicit_word in explicit_words:
if word == explicit_word:
return False
Python has great searching built into it. Instead of explicitly writing a loop over all explicit words, you could just ask Python to tell you if a word is in the list of explicit words:
if word in explicit_words:
return False
Or better, use a set() of explicit words, and the in operation drops from $O(n)$ to $O(1)$ time complexity:
explicit_words =
"explicit", "bad", "ugly", "offensive", "words"
# ...
if word in explicit_words:
return False
We've improved things a little, but we still have an explicit (pun intended) loop over the list of words in a line. We can use the any() method, and a list comprehension, to get Python to do the looping for us, which should be faster.
for line in song:
words = line.split()
if any(word in explicit_words for word in words):
return False
If word in explicits returns True for any of the word in the list words, the any() method will "short-circuit" and immediately return True without examining the remaining items in words.
If "explicit" is a bad word, and the song lyrics contains "Explicit" or "EXPLICIT", you wouldn't want to mark the song as "CLEAN", right? Perhaps you want word.lower() in explicit_words? As @wizzwizz4 pointed out, .casefold() is better than .lower().
Resulting method:
explicit_words = set(explicit_word.casefold()
for explicit_word in ("explicit", "bad", "ugly", "offensive", "words"))
def isClean(song_path):
with open(song_path) as song:
for line in song:
words = line.split()
if any(word.casefold() in explicit_words for word in words):
return False
return True
edited Jun 15 at 16:33
answered Jun 15 at 4:35
AJNeufeldAJNeufeld
13.7k1 gold badge13 silver badges42 bronze badges
13.7k1 gold badge13 silver badges42 bronze badges
1
$begingroup$
Why does it need to iterate over the song lines?song.split()would also split onn(andr, of course). Call me unpythonic, but I would directlyreturn any(word.lower() in explicit_words for word in song.split())
$endgroup$
– Helio
Jun 15 at 12:22
$begingroup$
@Helio I expect you would needsong.read().split(), notsong.split(). As for why I didn’t? I considered making it one-statement, and decided against doing so because thesplit()doesn’t consider punctuation as word separators. That area still needed replacement with a regexbstyle word splitting, so embedding it inside of theany(...)hides the work that still needed to be done.
$endgroup$
– AJNeufeld
Jun 15 at 12:37
$begingroup$
yes, my bad; I've posted the comment thoughtlessly. A naïve alternative to regular expressions would be replacing punctuation symbols by spaces and then runningsplit. However, that would be too clumsy. If this code is not an exercise, maybe the OP should use some specialized framework like NLTK.
$endgroup$
– Helio
Jun 15 at 12:56
3
$begingroup$
Why notstr.casefoldinstead ofstr.lower? They're the same for (most of) Latin, but different for stuff like Greek.
$endgroup$
– wizzwizz4
Jun 15 at 16:17
$begingroup$
@wizzwizz4 Fair point; well made. All the words inexplicit_wordswould need to be run throughcasefold()as well.
$endgroup$
– AJNeufeld
Jun 15 at 16:26
add a comment
|
1
$begingroup$
Why does it need to iterate over the song lines?song.split()would also split onn(andr, of course). Call me unpythonic, but I would directlyreturn any(word.lower() in explicit_words for word in song.split())
$endgroup$
– Helio
Jun 15 at 12:22
$begingroup$
@Helio I expect you would needsong.read().split(), notsong.split(). As for why I didn’t? I considered making it one-statement, and decided against doing so because thesplit()doesn’t consider punctuation as word separators. That area still needed replacement with a regexbstyle word splitting, so embedding it inside of theany(...)hides the work that still needed to be done.
$endgroup$
– AJNeufeld
Jun 15 at 12:37
$begingroup$
yes, my bad; I've posted the comment thoughtlessly. A naïve alternative to regular expressions would be replacing punctuation symbols by spaces and then runningsplit. However, that would be too clumsy. If this code is not an exercise, maybe the OP should use some specialized framework like NLTK.
$endgroup$
– Helio
Jun 15 at 12:56
3
$begingroup$
Why notstr.casefoldinstead ofstr.lower? They're the same for (most of) Latin, but different for stuff like Greek.
$endgroup$
– wizzwizz4
Jun 15 at 16:17
$begingroup$
@wizzwizz4 Fair point; well made. All the words inexplicit_wordswould need to be run throughcasefold()as well.
$endgroup$
– AJNeufeld
Jun 15 at 16:26
1
1
$begingroup$
Why does it need to iterate over the song lines?
song.split() would also split on n (and r, of course). Call me unpythonic, but I would directly return any(word.lower() in explicit_words for word in song.split())$endgroup$
– Helio
Jun 15 at 12:22
$begingroup$
Why does it need to iterate over the song lines?
song.split() would also split on n (and r, of course). Call me unpythonic, but I would directly return any(word.lower() in explicit_words for word in song.split())$endgroup$
– Helio
Jun 15 at 12:22
$begingroup$
@Helio I expect you would need
song.read().split(), not song.split(). As for why I didn’t? I considered making it one-statement, and decided against doing so because the split() doesn’t consider punctuation as word separators. That area still needed replacement with a regex b style word splitting, so embedding it inside of the any(...) hides the work that still needed to be done.$endgroup$
– AJNeufeld
Jun 15 at 12:37
$begingroup$
@Helio I expect you would need
song.read().split(), not song.split(). As for why I didn’t? I considered making it one-statement, and decided against doing so because the split() doesn’t consider punctuation as word separators. That area still needed replacement with a regex b style word splitting, so embedding it inside of the any(...) hides the work that still needed to be done.$endgroup$
– AJNeufeld
Jun 15 at 12:37
$begingroup$
yes, my bad; I've posted the comment thoughtlessly. A naïve alternative to regular expressions would be replacing punctuation symbols by spaces and then running
split. However, that would be too clumsy. If this code is not an exercise, maybe the OP should use some specialized framework like NLTK.$endgroup$
– Helio
Jun 15 at 12:56
$begingroup$
yes, my bad; I've posted the comment thoughtlessly. A naïve alternative to regular expressions would be replacing punctuation symbols by spaces and then running
split. However, that would be too clumsy. If this code is not an exercise, maybe the OP should use some specialized framework like NLTK.$endgroup$
– Helio
Jun 15 at 12:56
3
3
$begingroup$
Why not
str.casefold instead of str.lower? They're the same for (most of) Latin, but different for stuff like Greek.$endgroup$
– wizzwizz4
Jun 15 at 16:17
$begingroup$
Why not
str.casefold instead of str.lower? They're the same for (most of) Latin, but different for stuff like Greek.$endgroup$
– wizzwizz4
Jun 15 at 16:17
$begingroup$
@wizzwizz4 Fair point; well made. All the words in
explicit_words would need to be run through casefold() as well.$endgroup$
– AJNeufeld
Jun 15 at 16:26
$begingroup$
@wizzwizz4 Fair point; well made. All the words in
explicit_words would need to be run through casefold() as well.$endgroup$
– AJNeufeld
Jun 15 at 16:26
add a comment
|
$begingroup$
Some suggestions:
- Using
argparseto parse arguments is preferable to any kind of interactive input, because it means that- it's much easier to automate and integrate in other scripts,
- you could trivially support passing multiple songs in one command,
- it's what any experienced shell user would expect, and
- if the user adds
--helpor doesn't pass any arguments they will get a synopsis for the command.
- The list of explicit words should not be part of the code, because it is configuration and is very likely to change often. You'll want to put words in a separate file and read them in during initialization. You could very easily make this an optional argument (with
argparse) so that you could do fancy stuff like./explicit_song_checker.py --words=./words.txt --words=./other.txt ./kill-the-queen.txt ./bulls-on-parade.txt blackwill automatically format your code to be pretty idiomatic, such as using two empty lines between top-level functions.flake8will check for style issues. I believe it will assume that you're writing Python 3, and so will complain about the use ofraw_input(calledinputin Python 3).mypycan be used to enforce type hints, which is a super helpful way to make your code clearer without resorting to comments.
$endgroup$
add a comment
|
$begingroup$
Some suggestions:
- Using
argparseto parse arguments is preferable to any kind of interactive input, because it means that- it's much easier to automate and integrate in other scripts,
- you could trivially support passing multiple songs in one command,
- it's what any experienced shell user would expect, and
- if the user adds
--helpor doesn't pass any arguments they will get a synopsis for the command.
- The list of explicit words should not be part of the code, because it is configuration and is very likely to change often. You'll want to put words in a separate file and read them in during initialization. You could very easily make this an optional argument (with
argparse) so that you could do fancy stuff like./explicit_song_checker.py --words=./words.txt --words=./other.txt ./kill-the-queen.txt ./bulls-on-parade.txt blackwill automatically format your code to be pretty idiomatic, such as using two empty lines between top-level functions.flake8will check for style issues. I believe it will assume that you're writing Python 3, and so will complain about the use ofraw_input(calledinputin Python 3).mypycan be used to enforce type hints, which is a super helpful way to make your code clearer without resorting to comments.
$endgroup$
add a comment
|
$begingroup$
Some suggestions:
- Using
argparseto parse arguments is preferable to any kind of interactive input, because it means that- it's much easier to automate and integrate in other scripts,
- you could trivially support passing multiple songs in one command,
- it's what any experienced shell user would expect, and
- if the user adds
--helpor doesn't pass any arguments they will get a synopsis for the command.
- The list of explicit words should not be part of the code, because it is configuration and is very likely to change often. You'll want to put words in a separate file and read them in during initialization. You could very easily make this an optional argument (with
argparse) so that you could do fancy stuff like./explicit_song_checker.py --words=./words.txt --words=./other.txt ./kill-the-queen.txt ./bulls-on-parade.txt blackwill automatically format your code to be pretty idiomatic, such as using two empty lines between top-level functions.flake8will check for style issues. I believe it will assume that you're writing Python 3, and so will complain about the use ofraw_input(calledinputin Python 3).mypycan be used to enforce type hints, which is a super helpful way to make your code clearer without resorting to comments.
$endgroup$
Some suggestions:
- Using
argparseto parse arguments is preferable to any kind of interactive input, because it means that- it's much easier to automate and integrate in other scripts,
- you could trivially support passing multiple songs in one command,
- it's what any experienced shell user would expect, and
- if the user adds
--helpor doesn't pass any arguments they will get a synopsis for the command.
- The list of explicit words should not be part of the code, because it is configuration and is very likely to change often. You'll want to put words in a separate file and read them in during initialization. You could very easily make this an optional argument (with
argparse) so that you could do fancy stuff like./explicit_song_checker.py --words=./words.txt --words=./other.txt ./kill-the-queen.txt ./bulls-on-parade.txt blackwill automatically format your code to be pretty idiomatic, such as using two empty lines between top-level functions.flake8will check for style issues. I believe it will assume that you're writing Python 3, and so will complain about the use ofraw_input(calledinputin Python 3).mypycan be used to enforce type hints, which is a super helpful way to make your code clearer without resorting to comments.
edited Jun 15 at 6:38
answered Jun 15 at 6:27
l0b0l0b0
6,77613 silver badges32 bronze badges
6,77613 silver badges32 bronze badges
add a comment
|
add a comment
|
$begingroup$
To simplify your input you could change
word == explicit_wordtoexplicit_word in word.This can lead to the Scunthorpe problem, and so from here you can check if the word only contains prefixes and suffixes.
Given the amount of common prefixes alone, it's unlikely you'd catch all variations using a blacklist alone. It would also catch gibberish/semigibberish that you may not have thought of - "Antioverfucked" / "Overantifucked".
Your code misses words without spaces "I don't give afuck" and so you may want to perform a 'similarity check'.
This could also match thinks like "sexx".
Your code doesn't reasonably handle ZWSP and other forms of Unicode abuse.
>>> print('expu200blitive')
explitive
>>> 'expu200blitive' == 'explitive'
False
>>> print('expu1DD3litive')
expᷓlitiveYour code doesn't reasonably detect leet speak.
Your code doesn't reasonably detect creative Unicode character selections.
>>> print('u2131u01DAu0255u212A')
ℱǚɕKYour code doesn't reasonably handle phonetic abuse.
"Phuck" is pronounced "Fuck".
"Bend Dover" sounds like "bend over".
"I CUP" spelt out is "I see you pee".
"See you next Tuesday" each starting phonetic spells "CU...".
When I say doesn't reasonably, I mean if someone combined the above to get something ridiculous like "4n7ipᷓhǚɕK" your word list would have to be ridiculous in size to match it.
It should also be fairly obvious that someone probably will find a way to bypass your check no matter what you do.
Practicality vs Perfection
Given how widespread leet speak is you should handle that at a bare minimum. This is fairly easy, you just 'translate' from this alphabet. And perform the checks you're already doing.
You should then probably use a 'similar check' to match "sexx" or "afuck". This is as duplicating a letter or missing a space is readable and easy.
This check may result in problems as "Scunthorpe" could be understood as "s cunt thorp e" so would be split into two two known words and two extra letters, and then the similarity check my come back low enough to say it's a swearword.
From here ZWSP and other unicode abuse probably are the next most common. And with things like deasciiify it's only getting easier.
$endgroup$
1
$begingroup$
Yeah, if catching explicit language was easy, you wouldn't have things like controversies over video game language filters getting over-aggressive and filtering out legitimate speech.
$endgroup$
– nick012000
Jun 17 at 2:37
1
$begingroup$
If you only check for prefixes and suffixes, you still have the Penistone and Cockfosters problems.
$endgroup$
– Mike Scott
Jun 17 at 16:43
add a comment
|
$begingroup$
To simplify your input you could change
word == explicit_wordtoexplicit_word in word.This can lead to the Scunthorpe problem, and so from here you can check if the word only contains prefixes and suffixes.
Given the amount of common prefixes alone, it's unlikely you'd catch all variations using a blacklist alone. It would also catch gibberish/semigibberish that you may not have thought of - "Antioverfucked" / "Overantifucked".
Your code misses words without spaces "I don't give afuck" and so you may want to perform a 'similarity check'.
This could also match thinks like "sexx".
Your code doesn't reasonably handle ZWSP and other forms of Unicode abuse.
>>> print('expu200blitive')
explitive
>>> 'expu200blitive' == 'explitive'
False
>>> print('expu1DD3litive')
expᷓlitiveYour code doesn't reasonably detect leet speak.
Your code doesn't reasonably detect creative Unicode character selections.
>>> print('u2131u01DAu0255u212A')
ℱǚɕKYour code doesn't reasonably handle phonetic abuse.
"Phuck" is pronounced "Fuck".
"Bend Dover" sounds like "bend over".
"I CUP" spelt out is "I see you pee".
"See you next Tuesday" each starting phonetic spells "CU...".
When I say doesn't reasonably, I mean if someone combined the above to get something ridiculous like "4n7ipᷓhǚɕK" your word list would have to be ridiculous in size to match it.
It should also be fairly obvious that someone probably will find a way to bypass your check no matter what you do.
Practicality vs Perfection
Given how widespread leet speak is you should handle that at a bare minimum. This is fairly easy, you just 'translate' from this alphabet. And perform the checks you're already doing.
You should then probably use a 'similar check' to match "sexx" or "afuck". This is as duplicating a letter or missing a space is readable and easy.
This check may result in problems as "Scunthorpe" could be understood as "s cunt thorp e" so would be split into two two known words and two extra letters, and then the similarity check my come back low enough to say it's a swearword.
From here ZWSP and other unicode abuse probably are the next most common. And with things like deasciiify it's only getting easier.
$endgroup$
1
$begingroup$
Yeah, if catching explicit language was easy, you wouldn't have things like controversies over video game language filters getting over-aggressive and filtering out legitimate speech.
$endgroup$
– nick012000
Jun 17 at 2:37
1
$begingroup$
If you only check for prefixes and suffixes, you still have the Penistone and Cockfosters problems.
$endgroup$
– Mike Scott
Jun 17 at 16:43
add a comment
|
$begingroup$
To simplify your input you could change
word == explicit_wordtoexplicit_word in word.This can lead to the Scunthorpe problem, and so from here you can check if the word only contains prefixes and suffixes.
Given the amount of common prefixes alone, it's unlikely you'd catch all variations using a blacklist alone. It would also catch gibberish/semigibberish that you may not have thought of - "Antioverfucked" / "Overantifucked".
Your code misses words without spaces "I don't give afuck" and so you may want to perform a 'similarity check'.
This could also match thinks like "sexx".
Your code doesn't reasonably handle ZWSP and other forms of Unicode abuse.
>>> print('expu200blitive')
explitive
>>> 'expu200blitive' == 'explitive'
False
>>> print('expu1DD3litive')
expᷓlitiveYour code doesn't reasonably detect leet speak.
Your code doesn't reasonably detect creative Unicode character selections.
>>> print('u2131u01DAu0255u212A')
ℱǚɕKYour code doesn't reasonably handle phonetic abuse.
"Phuck" is pronounced "Fuck".
"Bend Dover" sounds like "bend over".
"I CUP" spelt out is "I see you pee".
"See you next Tuesday" each starting phonetic spells "CU...".
When I say doesn't reasonably, I mean if someone combined the above to get something ridiculous like "4n7ipᷓhǚɕK" your word list would have to be ridiculous in size to match it.
It should also be fairly obvious that someone probably will find a way to bypass your check no matter what you do.
Practicality vs Perfection
Given how widespread leet speak is you should handle that at a bare minimum. This is fairly easy, you just 'translate' from this alphabet. And perform the checks you're already doing.
You should then probably use a 'similar check' to match "sexx" or "afuck". This is as duplicating a letter or missing a space is readable and easy.
This check may result in problems as "Scunthorpe" could be understood as "s cunt thorp e" so would be split into two two known words and two extra letters, and then the similarity check my come back low enough to say it's a swearword.
From here ZWSP and other unicode abuse probably are the next most common. And with things like deasciiify it's only getting easier.
$endgroup$
To simplify your input you could change
word == explicit_wordtoexplicit_word in word.This can lead to the Scunthorpe problem, and so from here you can check if the word only contains prefixes and suffixes.
Given the amount of common prefixes alone, it's unlikely you'd catch all variations using a blacklist alone. It would also catch gibberish/semigibberish that you may not have thought of - "Antioverfucked" / "Overantifucked".
Your code misses words without spaces "I don't give afuck" and so you may want to perform a 'similarity check'.
This could also match thinks like "sexx".
Your code doesn't reasonably handle ZWSP and other forms of Unicode abuse.
>>> print('expu200blitive')
explitive
>>> 'expu200blitive' == 'explitive'
False
>>> print('expu1DD3litive')
expᷓlitiveYour code doesn't reasonably detect leet speak.
Your code doesn't reasonably detect creative Unicode character selections.
>>> print('u2131u01DAu0255u212A')
ℱǚɕKYour code doesn't reasonably handle phonetic abuse.
"Phuck" is pronounced "Fuck".
"Bend Dover" sounds like "bend over".
"I CUP" spelt out is "I see you pee".
"See you next Tuesday" each starting phonetic spells "CU...".
When I say doesn't reasonably, I mean if someone combined the above to get something ridiculous like "4n7ipᷓhǚɕK" your word list would have to be ridiculous in size to match it.
It should also be fairly obvious that someone probably will find a way to bypass your check no matter what you do.
Practicality vs Perfection
Given how widespread leet speak is you should handle that at a bare minimum. This is fairly easy, you just 'translate' from this alphabet. And perform the checks you're already doing.
You should then probably use a 'similar check' to match "sexx" or "afuck". This is as duplicating a letter or missing a space is readable and easy.
This check may result in problems as "Scunthorpe" could be understood as "s cunt thorp e" so would be split into two two known words and two extra letters, and then the similarity check my come back low enough to say it's a swearword.
From here ZWSP and other unicode abuse probably are the next most common. And with things like deasciiify it's only getting easier.
answered Jun 17 at 0:07
PeilonrayzPeilonrayz
30.7k4 gold badges46 silver badges123 bronze badges
30.7k4 gold badges46 silver badges123 bronze badges
1
$begingroup$
Yeah, if catching explicit language was easy, you wouldn't have things like controversies over video game language filters getting over-aggressive and filtering out legitimate speech.
$endgroup$
– nick012000
Jun 17 at 2:37
1
$begingroup$
If you only check for prefixes and suffixes, you still have the Penistone and Cockfosters problems.
$endgroup$
– Mike Scott
Jun 17 at 16:43
add a comment
|
1
$begingroup$
Yeah, if catching explicit language was easy, you wouldn't have things like controversies over video game language filters getting over-aggressive and filtering out legitimate speech.
$endgroup$
– nick012000
Jun 17 at 2:37
1
$begingroup$
If you only check for prefixes and suffixes, you still have the Penistone and Cockfosters problems.
$endgroup$
– Mike Scott
Jun 17 at 16:43
1
1
$begingroup$
Yeah, if catching explicit language was easy, you wouldn't have things like controversies over video game language filters getting over-aggressive and filtering out legitimate speech.
$endgroup$
– nick012000
Jun 17 at 2:37
$begingroup$
Yeah, if catching explicit language was easy, you wouldn't have things like controversies over video game language filters getting over-aggressive and filtering out legitimate speech.
$endgroup$
– nick012000
Jun 17 at 2:37
1
1
$begingroup$
If you only check for prefixes and suffixes, you still have the Penistone and Cockfosters problems.
$endgroup$
– Mike Scott
Jun 17 at 16:43
$begingroup$
If you only check for prefixes and suffixes, you still have the Penistone and Cockfosters problems.
$endgroup$
– Mike Scott
Jun 17 at 16:43
add a comment
|
$begingroup$
explicit_words = [
#explicit words not shown
]
Instead of embedding the word list within the program, you might want to treat it as configuration, and load it from a data file. The advantage of this is that it can allow the use of alternative configurations by command-line option (e.g. use a list of French obscenities for French-language lyrics).
On efficiency: we're only interested in the set of words in the lyrics. Given that many songs have repeated lyrics, then it's probably worth reducing the lyrics to a set of words before comparing that set with the banned words (use & for set intersection), or set.isdisjoint() for a potentially faster test.
$endgroup$
add a comment
|
$begingroup$
explicit_words = [
#explicit words not shown
]
Instead of embedding the word list within the program, you might want to treat it as configuration, and load it from a data file. The advantage of this is that it can allow the use of alternative configurations by command-line option (e.g. use a list of French obscenities for French-language lyrics).
On efficiency: we're only interested in the set of words in the lyrics. Given that many songs have repeated lyrics, then it's probably worth reducing the lyrics to a set of words before comparing that set with the banned words (use & for set intersection), or set.isdisjoint() for a potentially faster test.
$endgroup$
add a comment
|
$begingroup$
explicit_words = [
#explicit words not shown
]
Instead of embedding the word list within the program, you might want to treat it as configuration, and load it from a data file. The advantage of this is that it can allow the use of alternative configurations by command-line option (e.g. use a list of French obscenities for French-language lyrics).
On efficiency: we're only interested in the set of words in the lyrics. Given that many songs have repeated lyrics, then it's probably worth reducing the lyrics to a set of words before comparing that set with the banned words (use & for set intersection), or set.isdisjoint() for a potentially faster test.
$endgroup$
explicit_words = [
#explicit words not shown
]
Instead of embedding the word list within the program, you might want to treat it as configuration, and load it from a data file. The advantage of this is that it can allow the use of alternative configurations by command-line option (e.g. use a list of French obscenities for French-language lyrics).
On efficiency: we're only interested in the set of words in the lyrics. Given that many songs have repeated lyrics, then it's probably worth reducing the lyrics to a set of words before comparing that set with the banned words (use & for set intersection), or set.isdisjoint() for a potentially faster test.
edited Jun 17 at 14:07
answered Jun 17 at 13:59
Toby SpeightToby Speight
33.3k7 gold badges46 silver badges143 bronze badges
33.3k7 gold badges46 silver badges143 bronze badges
add a comment
|
add a comment
|
$begingroup$
The code feedback is already done wonderfully. I'd like to write about some pointers that might be important, apart from the leet speak:
- The word Phoque , is French for seal. There are numerous examples of words which are perfectly meaningful in some language but if pronounced in English, might be considered offensive. You might need to do a phonetic similarity check
- The context can be taken into account. For example, the word 'blow' or 'mom' need not be offensive. It becomes offensive only when it is surrounded by contextual words that make it offensive, ('job', etc). In these cases, you might need to come up with an algorithm that outputs a probability score and filter songs based on that.
$endgroup$
$begingroup$
I'd like to take this up as a project. This has tremendous scope. For example, you can integrate this into the smartphones of kids . If youre interested, drop me a message
$endgroup$
– N Praveen Chandhar
Jun 18 at 6:14
add a comment
|
$begingroup$
The code feedback is already done wonderfully. I'd like to write about some pointers that might be important, apart from the leet speak:
- The word Phoque , is French for seal. There are numerous examples of words which are perfectly meaningful in some language but if pronounced in English, might be considered offensive. You might need to do a phonetic similarity check
- The context can be taken into account. For example, the word 'blow' or 'mom' need not be offensive. It becomes offensive only when it is surrounded by contextual words that make it offensive, ('job', etc). In these cases, you might need to come up with an algorithm that outputs a probability score and filter songs based on that.
$endgroup$
$begingroup$
I'd like to take this up as a project. This has tremendous scope. For example, you can integrate this into the smartphones of kids . If youre interested, drop me a message
$endgroup$
– N Praveen Chandhar
Jun 18 at 6:14
add a comment
|
$begingroup$
The code feedback is already done wonderfully. I'd like to write about some pointers that might be important, apart from the leet speak:
- The word Phoque , is French for seal. There are numerous examples of words which are perfectly meaningful in some language but if pronounced in English, might be considered offensive. You might need to do a phonetic similarity check
- The context can be taken into account. For example, the word 'blow' or 'mom' need not be offensive. It becomes offensive only when it is surrounded by contextual words that make it offensive, ('job', etc). In these cases, you might need to come up with an algorithm that outputs a probability score and filter songs based on that.
$endgroup$
The code feedback is already done wonderfully. I'd like to write about some pointers that might be important, apart from the leet speak:
- The word Phoque , is French for seal. There are numerous examples of words which are perfectly meaningful in some language but if pronounced in English, might be considered offensive. You might need to do a phonetic similarity check
- The context can be taken into account. For example, the word 'blow' or 'mom' need not be offensive. It becomes offensive only when it is surrounded by contextual words that make it offensive, ('job', etc). In these cases, you might need to come up with an algorithm that outputs a probability score and filter songs based on that.
answered Jun 18 at 6:13
N Praveen ChandharN Praveen Chandhar
111 bronze badge
111 bronze badge
$begingroup$
I'd like to take this up as a project. This has tremendous scope. For example, you can integrate this into the smartphones of kids . If youre interested, drop me a message
$endgroup$
– N Praveen Chandhar
Jun 18 at 6:14
add a comment
|
$begingroup$
I'd like to take this up as a project. This has tremendous scope. For example, you can integrate this into the smartphones of kids . If youre interested, drop me a message
$endgroup$
– N Praveen Chandhar
Jun 18 at 6:14
$begingroup$
I'd like to take this up as a project. This has tremendous scope. For example, you can integrate this into the smartphones of kids . If youre interested, drop me a message
$endgroup$
– N Praveen Chandhar
Jun 18 at 6:14
$begingroup$
I'd like to take this up as a project. This has tremendous scope. For example, you can integrate this into the smartphones of kids . If youre interested, drop me a message
$endgroup$
– N Praveen Chandhar
Jun 18 at 6:14
add a comment
|
Thanks for contributing an answer to Code Review Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f222339%2fexplicit-song-lyrics-checker%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
6
$begingroup$
The only tiny thing (not big enough for an answer, IMHO) I want to add to all the great answers already given, is about naming variables. It should be clear from the name what a variable contains. What does another programmer expect that a variable named
songcontains? It cannot be a song, because that is something that you hear. So probably the lyrics then? Or the file? Ah, the path to the file that contains the lyrics of the song. I would try to make variable names as "explicit" as possible ;-) Sosong_path(as used in the function) is already much better :-)$endgroup$
– wovano
Jun 15 at 19:31
9
$begingroup$
As a counterexample, check out the Scunthorpe problem for what can happen if your check is overzealous…
$endgroup$
– gidds
Jun 16 at 0:32
$begingroup$
Not sure if scope creeping the review, but would be nice if your script could take a path to a folder and print all songs inside it.
$endgroup$
– Viktor Mellgren
Jun 17 at 8:19
$begingroup$
why not use a dictionary and some sort of hash. a linear search would be horrible
$endgroup$
– Andrew Scott Evans
Jun 18 at 7:23
1
$begingroup$
There is a meta CR post that involves this post
$endgroup$
– Sᴀᴍ Onᴇᴌᴀ
Jun 19 at 22:46