Explicit song lyrics checkerA convenient wrapper for Telegram Bot librarySong lyric generator using Markov Chains - PythonSIC C command checker programming fragmentPython qwerty keyboard checkerPython 3 palindrome checkerVoting plugin for an IRC botWPF Palindrome Checker Application99 beers song in Rust

What potential problems are there with dumping dex in bard build?

In what way were Renaissance battles like chess matches?

Is it possible to save a (science) PhD in 10 months?

Understanding the use of 'eo' in a sentence from LLPSI

Why do 可不 and 可不是 by themselves imply agreement?

Wood versus marble rolling pin 'performance'

What spacing difference is acceptable with tile?

What does it take to make metal music?

Is there an operator or an easy way to match an expression one or more times with the LIKE operator in SQL?

How did composers "test" their music?

Why would gloves be necessary for handling flobberworms?

Employer reneged on negotiated clauses that weren't part of agreed contract, citing budget cuts - what can I do?

"The" for the first time only

Locked out of my own server: getting "Too many authentication failures" right away when connecting via ssh

How do I complete the "A Brilliant Smile" triumph?

Is this bible in Koine Greek?

Why is the air inside airliners so dry (low humidity)?

Characteristic scale degrees

What is self hosted version control system?

Tic Tac Toe console program

Is SSH key with passphrase a 2FA

The output -1 becomes a slash in the loop

How do you link two checking accounts from two different banks in two different countries?

How do I disable vim from producing backup files?

Explicit song lyrics checker

A convenient wrapper for Telegram Bot librarySong lyric generator using Markov Chains - PythonSIC C command checker programming fragmentPython qwerty keyboard checkerPython 3 palindrome checkerVoting plugin for an IRC botWPF Palindrome Checker Application99 beers song in Rust

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty
margin-bottom:0;

To stay in practice with my Python I've decided to write an explicit song checker. It checks each word in the song against an array of 14 explicit words contained in the text file. I've decided not to include the words that are to be checked against, for I am not sure about the rule of having explicit words in a program's code. Feedback on efficiency and structure is what I'm going for mostly. Style and other critiques are invited as well.

explicit_song_checker.py

explicit_words = [
 #explicit words not shown 
]

def isClean(song_path):
 with open(song_path) as song:
 for line in song:
 words = line.split(" ")
 for word in words:
 for explicit_word in explicit_words:
 if word == explicit_word:
 return False

 return True

def main():
 song = raw_input("Enter path to song: n")
 if isClean(song):
 print("CLEAN")
 else:
 print("EXPLICIT")

if __name__ == '__main__':
 main()

edited Jun 15 at 16:59

dfhwze

13k3 gold badges24 silver badges94 bronze badges

asked Jun 15 at 3:44

Linny

3,5172 gold badges11 silver badges51 bronze badges

6

$begingroup$
The only tiny thing (not big enough for an answer, IMHO) I want to add to all the great answers already given, is about naming variables. It should be clear from the name what a variable contains. What does another programmer expect that a variable named song contains? It cannot be a song, because that is something that you hear. So probably the lyrics then? Or the file? Ah, the path to the file that contains the lyrics of the song. I would try to make variable names as "explicit" as possible ;-) So song_path (as used in the function) is already much better :-)
$endgroup$
– wovano
Jun 15 at 19:31

9

$begingroup$
As a counterexample, check out the Scunthorpe problem for what can happen if your check is overzealous…
$endgroup$
– gidds
Jun 16 at 0:32

$begingroup$
Not sure if scope creeping the review, but would be nice if your script could take a path to a folder and print all songs inside it.
$endgroup$
– Viktor Mellgren
Jun 17 at 8:19

$begingroup$
why not use a dictionary and some sort of hash. a linear search would be horrible
$endgroup$
– Andrew Scott Evans
Jun 18 at 7:23

1

$begingroup$
There is a meta CR post that involves this post
$endgroup$
– Sᴀᴍ Onᴇᴌᴀ
Jun 19 at 22:46

add a comment
|

explicit_song_checker.py

explicit_words = [
 #explicit words not shown 
]

def isClean(song_path):
 with open(song_path) as song:
 for line in song:
 words = line.split(" ")
 for word in words:
 for explicit_word in explicit_words:
 if word == explicit_word:
 return False

 return True

def main():
 song = raw_input("Enter path to song: n")
 if isClean(song):
 print("CLEAN")
 else:
 print("EXPLICIT")

if __name__ == '__main__':
 main()

edited Jun 15 at 16:59

dfhwze

13k3 gold badges24 silver badges94 bronze badges

asked Jun 15 at 3:44

Linny

3,5172 gold badges11 silver badges51 bronze badges

6

$begingroup$
The only tiny thing (not big enough for an answer, IMHO) I want to add to all the great answers already given, is about naming variables. It should be clear from the name what a variable contains. What does another programmer expect that a variable named song contains? It cannot be a song, because that is something that you hear. So probably the lyrics then? Or the file? Ah, the path to the file that contains the lyrics of the song. I would try to make variable names as "explicit" as possible ;-) So song_path (as used in the function) is already much better :-)
$endgroup$
– wovano
Jun 15 at 19:31

9

$begingroup$
As a counterexample, check out the Scunthorpe problem for what can happen if your check is overzealous…
$endgroup$
– gidds
Jun 16 at 0:32

$begingroup$
Not sure if scope creeping the review, but would be nice if your script could take a path to a folder and print all songs inside it.
$endgroup$
– Viktor Mellgren
Jun 17 at 8:19

$begingroup$
why not use a dictionary and some sort of hash. a linear search would be horrible
$endgroup$
– Andrew Scott Evans
Jun 18 at 7:23

1

$begingroup$
There is a meta CR post that involves this post
$endgroup$
– Sᴀᴍ Onᴇᴌᴀ
Jun 19 at 22:46

add a comment
|

explicit_song_checker.py

explicit_words = [
 #explicit words not shown 
]

def isClean(song_path):
 with open(song_path) as song:
 for line in song:
 words = line.split(" ")
 for word in words:
 for explicit_word in explicit_words:
 if word == explicit_word:
 return False

 return True

def main():
 song = raw_input("Enter path to song: n")
 if isClean(song):
 print("CLEAN")
 else:
 print("EXPLICIT")

if __name__ == '__main__':
 main()

edited Jun 15 at 16:59

dfhwze

13k3 gold badges24 silver badges94 bronze badges

asked Jun 15 at 3:44

Linny

3,5172 gold badges11 silver badges51 bronze badges

explicit_song_checker.py

explicit_words = [
 #explicit words not shown 
]

def isClean(song_path):
 with open(song_path) as song:
 for line in song:
 words = line.split(" ")
 for word in words:
 for explicit_word in explicit_words:
 if word == explicit_word:
 return False

 return True

def main():
 song = raw_input("Enter path to song: n")
 if isClean(song):
 print("CLEAN")
 else:
 print("EXPLICIT")

if __name__ == '__main__':
 main()

python strings file

edited Jun 15 at 16:59

dfhwze

13k3 gold badges24 silver badges94 bronze badges

asked Jun 15 at 3:44

Linny

3,5172 gold badges11 silver badges51 bronze badges

edited Jun 15 at 16:59

dfhwze

13k3 gold badges24 silver badges94 bronze badges

asked Jun 15 at 3:44

Linny

3,5172 gold badges11 silver badges51 bronze badges

edited Jun 15 at 16:59

dfhwze

13k3 gold badges24 silver badges94 bronze badges

edited Jun 15 at 16:59

dfhwze

13k3 gold badges24 silver badges94 bronze badges

edited Jun 15 at 16:59

dfhwze

13k3 gold badges24 silver badges94 bronze badges

asked Jun 15 at 3:44

Linny

3,5172 gold badges11 silver badges51 bronze badges

asked Jun 15 at 3:44

Linny

3,5172 gold badges11 silver badges51 bronze badges

asked Jun 15 at 3:44

Linny

3,5172 gold badges11 silver badges51 bronze badges

6

$begingroup$
The only tiny thing (not big enough for an answer, IMHO) I want to add to all the great answers already given, is about naming variables. It should be clear from the name what a variable contains. What does another programmer expect that a variable named song contains? It cannot be a song, because that is something that you hear. So probably the lyrics then? Or the file? Ah, the path to the file that contains the lyrics of the song. I would try to make variable names as "explicit" as possible ;-) So song_path (as used in the function) is already much better :-)
$endgroup$
– wovano
Jun 15 at 19:31

9

$begingroup$
As a counterexample, check out the Scunthorpe problem for what can happen if your check is overzealous…
$endgroup$
– gidds
Jun 16 at 0:32

$begingroup$
Not sure if scope creeping the review, but would be nice if your script could take a path to a folder and print all songs inside it.
$endgroup$
– Viktor Mellgren
Jun 17 at 8:19

$begingroup$
why not use a dictionary and some sort of hash. a linear search would be horrible
$endgroup$
– Andrew Scott Evans
Jun 18 at 7:23

1

$begingroup$
There is a meta CR post that involves this post
$endgroup$
– Sᴀᴍ Onᴇᴌᴀ
Jun 19 at 22:46

add a comment
|

6

$begingroup$
The only tiny thing (not big enough for an answer, IMHO) I want to add to all the great answers already given, is about naming variables. It should be clear from the name what a variable contains. What does another programmer expect that a variable named song contains? It cannot be a song, because that is something that you hear. So probably the lyrics then? Or the file? Ah, the path to the file that contains the lyrics of the song. I would try to make variable names as "explicit" as possible ;-) So song_path (as used in the function) is already much better :-)
$endgroup$
– wovano
Jun 15 at 19:31

9

$begingroup$
As a counterexample, check out the Scunthorpe problem for what can happen if your check is overzealous…
$endgroup$
– gidds
Jun 16 at 0:32

$begingroup$
Not sure if scope creeping the review, but would be nice if your script could take a path to a folder and print all songs inside it.
$endgroup$
– Viktor Mellgren
Jun 17 at 8:19

$begingroup$
why not use a dictionary and some sort of hash. a linear search would be horrible
$endgroup$
– Andrew Scott Evans
Jun 18 at 7:23

1

$begingroup$
There is a meta CR post that involves this post
$endgroup$
– Sᴀᴍ Onᴇᴌᴀ
Jun 19 at 22:46

The only tiny thing (not big enough for an answer, IMHO) I want to add to all the great answers already given, is about naming variables. It should be clear from the name what a variable contains. What does another programmer expect that a variable named song contains? It cannot be a song, because that is something that you hear. So probably the lyrics then? Or the file? Ah, the path to the file that contains the lyrics of the song. I would try to make variable names as "explicit" as possible ;-) So song_path (as used in the function) is already much better :-)

– wovano
Jun 15 at 19:31

As a counterexample, check out the Scunthorpe problem for what can happen if your check is overzealous…

– gidds
Jun 16 at 0:32

Not sure if scope creeping the review, but would be nice if your script could take a path to a folder and print all songs inside it.

– Viktor Mellgren
Jun 17 at 8:19

why not use a dictionary and some sort of hash. a linear search would be horrible

– Andrew Scott Evans
Jun 18 at 7:23

There is a meta CR post that involves this post

– Sᴀᴍ Onᴇᴌᴀ
Jun 19 at 22:46

add a comment
|

6 Answers
6

active

oldest

votes

I recommend practicing Python 3 rather than Python 2 these days.

According to PEP 8, isClean() should be is_clean().

One or more of your loops could be replaced by some use of the any() function. Note that this suggests that an is_explicit() function would be a more natural concept than an is_clean() function.

I would expect song lyrics to contain a mixture of uppercase and lowercase. Words may be delimited by punctuation as well as spaces. Therefore, words = line.split(" ") is probably too naïve. Furthermore, I would expect each song to be short enough to fit entirely in memory very comfortably, so processing each file line by line is an unnecessary complication.

I would rewrite the program to use a regular expression instead.

import re

explicit_words = [
 …
]

is_explicit = re.compile(
 r'b(?:' +
 '|'.join(re.escape(w) for w in explicit_words) +
 r')b',
 re.IGNORECASE).search

def main():
 with open(raw_input("Enter path to song: ")) as song:
 print("EXPLICIT" if is_explicit(song.read()) else "CLEAN")

if __name__ == '__main__':
 main()

The way I chose to write the is_explicit() function above might show too much Haskell influence for some Python programmers' taste. Here's a more conventional way of writing it:

def is_explicit(text):
 return re.search(
 r'b(?:' +
 '|'.join(re.escape(w) for w in explicit_words) +
 r')b',
 text,
 re.IGNORECASE
 )

answered Jun 15 at 4:40

200_success

136k21 gold badges175 silver badges448 bronze badges

$begingroup$
The regex answer has a variety of things I don't like. Firstly, this compiles the regex every time. 2nd, regex has no guarantees of O(1) like a set would.
$endgroup$
– Oscar Smith
Jun 17 at 23:50

$begingroup$
@OscarSmith The program calls the helper function once from main(), and the word list contains just 14 entries, so neither objection was a significant concern in this solution.
$endgroup$
– 200_success
Jun 18 at 0:00

$begingroup$
Oh, I was mainly commenting on the bottom block, where it's in the function.
$endgroup$
– Oscar Smith
Jun 18 at 0:21

$begingroup$
I don't think using the regular expression makes this code any easier to understand. xkcd.com/1171
$endgroup$
– Captain Delano
Jun 18 at 7:32

add a comment
|

Splitting line into words using words = line.split(" ") will only split the line on space characters. If two words are separated by a tab character, or other white space character other than a space, the split will not happen at that point, which may confuse your explicit checker.

Using words = line.split() would split on spaces, tab characters, and any other white space characters.

You might also want to investigate splitting using the word break regular expression b, which would treat non-word characters as break points, so that a line containing an explicit word immediately followed by a comma (such as "explicit,") or period, or semicolon, etc. won't sneak through the filter.

 for explicit_word in explicit_words:
 if word == explicit_word:
 return False

Python has great searching built into it. Instead of explicitly writing a loop over all explicit words, you could just ask Python to tell you if a word is in the list of explicit words:

 if word in explicit_words:
 return False

Or better, use a set() of explicit words, and the in operation drops from $O(n)$ to $O(1)$ time complexity:

explicit_words = 
 "explicit", "bad", "ugly", "offensive", "words" 


# ...

 if word in explicit_words:
 return False

We've improved things a little, but we still have an explicit (pun intended) loop over the list of words in a line. We can use the any() method, and a list comprehension, to get Python to do the looping for us, which should be faster.

 for line in song:
 words = line.split()
 if any(word in explicit_words for word in words):
 return False

If word in explicits returns True for any of the word in the list words, the any() method will "short-circuit" and immediately return True without examining the remaining items in words.

If "explicit" is a bad word, and the song lyrics contains "Explicit" or "EXPLICIT", you wouldn't want to mark the song as "CLEAN", right? Perhaps you want word.lower() in explicit_words? As @wizzwizz4 pointed out, .casefold() is better than .lower().

Resulting method:

explicit_words = set(explicit_word.casefold()
 for explicit_word in ("explicit", "bad", "ugly", "offensive", "words"))

def isClean(song_path):
 with open(song_path) as song:
 for line in song:
 words = line.split()
 if any(word.casefold() in explicit_words for word in words):
 return False 
 return True

edited Jun 15 at 16:33

answered Jun 15 at 4:35

AJNeufeld

13.7k1 gold badge13 silver badges42 bronze badges

1

$begingroup$
Why does it need to iterate over the song lines? song.split() would also split on n (and r, of course). Call me unpythonic, but I would directly return any(word.lower() in explicit_words for word in song.split())
$endgroup$
– Helio
Jun 15 at 12:22

$begingroup$
@Helio I expect you would need song.read().split(), not song.split(). As for why I didn’t? I considered making it one-statement, and decided against doing so because the split() doesn’t consider punctuation as word separators. That area still needed replacement with a regex b style word splitting, so embedding it inside of the any(...) hides the work that still needed to be done.
$endgroup$
– AJNeufeld
Jun 15 at 12:37

$begingroup$
yes, my bad; I've posted the comment thoughtlessly. A naïve alternative to regular expressions would be replacing punctuation symbols by spaces and then running split. However, that would be too clumsy. If this code is not an exercise, maybe the OP should use some specialized framework like NLTK.
$endgroup$
– Helio
Jun 15 at 12:56

3

$begingroup$
Why not str.casefold instead of str.lower? They're the same for (most of) Latin, but different for stuff like Greek.
$endgroup$
– wizzwizz4
Jun 15 at 16:17

$begingroup$
@wizzwizz4 Fair point; well made. All the words in explicit_words would need to be run through casefold() as well.
$endgroup$
– AJNeufeld
Jun 15 at 16:26

add a comment
|

Some suggestions:

Using argparse to parse arguments is preferable to any kind of interactive input, because it means that
- it's much easier to automate and integrate in other scripts,
- you could trivially support passing multiple songs in one command,
- it's what any experienced shell user would expect, and
- if the user adds --help or doesn't pass any arguments they will get a synopsis for the command.

The list of explicit words should not be part of the code, because it is configuration and is very likely to change often. You'll want to put words in a separate file and read them in during initialization. You could very easily make this an optional argument (with argparse) so that you could do fancy stuff like ./explicit_song_checker.py --words=./words.txt --words=./other.txt ./kill-the-queen.txt ./bulls-on-parade.txt

black will automatically format your code to be pretty idiomatic, such as using two empty lines between top-level functions.

flake8 will check for style issues. I believe it will assume that you're writing Python 3, and so will complain about the use of raw_input (called input in Python 3).

mypy can be used to enforce type hints, which is a super helpful way to make your code clearer without resorting to comments.

edited Jun 15 at 6:38

answered Jun 15 at 6:27

l0b0

6,77613 silver badges32 bronze badges

add a comment
|

To simplify your input you could change word == explicit_word to explicit_word in word.

This can lead to the Scunthorpe problem, and so from here you can check if the word only contains prefixes and suffixes.

Given the amount of common prefixes alone, it's unlikely you'd catch all variations using a blacklist alone. It would also catch gibberish/semigibberish that you may not have thought of - "Antioverfucked" / "Overantifucked".

Your code misses words without spaces "I don't give afuck" and so you may want to perform a 'similarity check'.

This could also match thinks like "sexx".

Your code doesn't reasonably handle ZWSP and other forms of Unicode abuse.

>>> print('expu200blitive')
explitive
>>> 'expu200blitive' == 'explitive'
False
>>> print('expu1DD3litive')
expᷓlitive

Your code doesn't reasonably detect leet speak.

Your code doesn't reasonably detect creative Unicode character selections.
```
>>> print('u2131u01DAu0255u212A')
ℱǚɕK
```

Your code doesn't reasonably handle phonetic abuse.

"Phuck" is pronounced "Fuck".

"Bend Dover" sounds like "bend over".

"I CUP" spelt out is "I see you pee".

"See you next Tuesday" each starting phonetic spells "CU...".

When I say doesn't reasonably, I mean if someone combined the above to get something ridiculous like "4n7ipᷓhǚɕK" your word list would have to be ridiculous in size to match it.

It should also be fairly obvious that someone probably will find a way to bypass your check no matter what you do.

Practicality vs Perfection

Given how widespread leet speak is you should handle that at a bare minimum. This is fairly easy, you just 'translate' from this alphabet. And perform the checks you're already doing.

You should then probably use a 'similar check' to match "sexx" or "afuck". This is as duplicating a letter or missing a space is readable and easy.

This check may result in problems as "Scunthorpe" could be understood as "s cunt thorp e" so would be split into two two known words and two extra letters, and then the similarity check my come back low enough to say it's a swearword.

From here ZWSP and other unicode abuse probably are the next most common. And with things like deasciiify it's only getting easier.

answered Jun 17 at 0:07

Peilonrayz

30.7k4 gold badges46 silver badges123 bronze badges

1

$begingroup$
Yeah, if catching explicit language was easy, you wouldn't have things like controversies over video game language filters getting over-aggressive and filtering out legitimate speech.
$endgroup$
– nick012000
Jun 17 at 2:37

1

$begingroup$
If you only check for prefixes and suffixes, you still have the Penistone and Cockfosters problems.
$endgroup$
– Mike Scott
Jun 17 at 16:43

add a comment
|

explicit_words = [
 #explicit words not shown 
]

Instead of embedding the word list within the program, you might want to treat it as configuration, and load it from a data file. The advantage of this is that it can allow the use of alternative configurations by command-line option (e.g. use a list of French obscenities for French-language lyrics).

On efficiency: we're only interested in the set of words in the lyrics. Given that many songs have repeated lyrics, then it's probably worth reducing the lyrics to a set of words before comparing that set with the banned words (use & for set intersection), or set.isdisjoint() for a potentially faster test.

edited Jun 17 at 14:07

answered Jun 17 at 13:59

Toby Speight

33.3k7 gold badges46 silver badges143 bronze badges

add a comment
|

The code feedback is already done wonderfully. I'd like to write about some pointers that might be important, apart from the leet speak:

The word Phoque , is French for seal. There are numerous examples of words which are perfectly meaningful in some language but if pronounced in English, might be considered offensive. You might need to do a phonetic similarity check

The context can be taken into account. For example, the word 'blow' or 'mom' need not be offensive. It becomes offensive only when it is surrounded by contextual words that make it offensive, ('job', etc). In these cases, you might need to come up with an algorithm that outputs a probability score and filter songs based on that.

answered Jun 18 at 6:13

N Praveen Chandhar

111 bronze badge

$begingroup$
I'd like to take this up as a project. This has tremendous scope. For example, you can integrate this into the smartphones of kids . If youre interested, drop me a message
$endgroup$
– N Praveen Chandhar
Jun 18 at 6:14

add a comment
|

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/4.0/"u003ecc by-sa 4.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f222339%2fexplicit-song-lyrics-checker%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

6 Answers
6

active

oldest

votes

6 Answers
6

active

oldest

votes

I recommend practicing Python 3 rather than Python 2 these days.

According to PEP 8, isClean() should be is_clean().

I would rewrite the program to use a regular expression instead.

import re

explicit_words = [
 …
]

is_explicit = re.compile(
 r'b(?:' +
 '|'.join(re.escape(w) for w in explicit_words) +
 r')b',
 re.IGNORECASE).search

def main():
 with open(raw_input("Enter path to song: ")) as song:
 print("EXPLICIT" if is_explicit(song.read()) else "CLEAN")

if __name__ == '__main__':
 main()

The way I chose to write the is_explicit() function above might show too much Haskell influence for some Python programmers' taste. Here's a more conventional way of writing it:

def is_explicit(text):
 return re.search(
 r'b(?:' +
 '|'.join(re.escape(w) for w in explicit_words) +
 r')b',
 text,
 re.IGNORECASE
 )

answered Jun 15 at 4:40

200_success

136k21 gold badges175 silver badges448 bronze badges

$begingroup$
The regex answer has a variety of things I don't like. Firstly, this compiles the regex every time. 2nd, regex has no guarantees of O(1) like a set would.
$endgroup$
– Oscar Smith
Jun 17 at 23:50

$begingroup$
@OscarSmith The program calls the helper function once from main(), and the word list contains just 14 entries, so neither objection was a significant concern in this solution.
$endgroup$
– 200_success
Jun 18 at 0:00

$begingroup$
Oh, I was mainly commenting on the bottom block, where it's in the function.
$endgroup$
– Oscar Smith
Jun 18 at 0:21

$begingroup$
I don't think using the regular expression makes this code any easier to understand. xkcd.com/1171
$endgroup$
– Captain Delano
Jun 18 at 7:32

add a comment
|

I recommend practicing Python 3 rather than Python 2 these days.

According to PEP 8, isClean() should be is_clean().

I would rewrite the program to use a regular expression instead.

import re

explicit_words = [
 …
]

is_explicit = re.compile(
 r'b(?:' +
 '|'.join(re.escape(w) for w in explicit_words) +
 r')b',
 re.IGNORECASE).search

def main():
 with open(raw_input("Enter path to song: ")) as song:
 print("EXPLICIT" if is_explicit(song.read()) else "CLEAN")

if __name__ == '__main__':
 main()

The way I chose to write the is_explicit() function above might show too much Haskell influence for some Python programmers' taste. Here's a more conventional way of writing it:

def is_explicit(text):
 return re.search(
 r'b(?:' +
 '|'.join(re.escape(w) for w in explicit_words) +
 r')b',
 text,
 re.IGNORECASE
 )

answered Jun 15 at 4:40

200_success

136k21 gold badges175 silver badges448 bronze badges

$begingroup$
The regex answer has a variety of things I don't like. Firstly, this compiles the regex every time. 2nd, regex has no guarantees of O(1) like a set would.
$endgroup$
– Oscar Smith
Jun 17 at 23:50

$begingroup$
@OscarSmith The program calls the helper function once from main(), and the word list contains just 14 entries, so neither objection was a significant concern in this solution.
$endgroup$
– 200_success
Jun 18 at 0:00

$begingroup$
Oh, I was mainly commenting on the bottom block, where it's in the function.
$endgroup$
– Oscar Smith
Jun 18 at 0:21

$begingroup$
I don't think using the regular expression makes this code any easier to understand. xkcd.com/1171
$endgroup$
– Captain Delano
Jun 18 at 7:32

add a comment
|

I recommend practicing Python 3 rather than Python 2 these days.

According to PEP 8, isClean() should be is_clean().

I would rewrite the program to use a regular expression instead.

import re

explicit_words = [
 …
]

is_explicit = re.compile(
 r'b(?:' +
 '|'.join(re.escape(w) for w in explicit_words) +
 r')b',
 re.IGNORECASE).search

def main():
 with open(raw_input("Enter path to song: ")) as song:
 print("EXPLICIT" if is_explicit(song.read()) else "CLEAN")

if __name__ == '__main__':
 main()

The way I chose to write the is_explicit() function above might show too much Haskell influence for some Python programmers' taste. Here's a more conventional way of writing it:

def is_explicit(text):
 return re.search(
 r'b(?:' +
 '|'.join(re.escape(w) for w in explicit_words) +
 r')b',
 text,
 re.IGNORECASE
 )

answered Jun 15 at 4:40

200_success

136k21 gold badges175 silver badges448 bronze badges

I recommend practicing Python 3 rather than Python 2 these days.

According to PEP 8, isClean() should be is_clean().

I would rewrite the program to use a regular expression instead.

import re

explicit_words = [
 …
]

is_explicit = re.compile(
 r'b(?:' +
 '|'.join(re.escape(w) for w in explicit_words) +
 r')b',
 re.IGNORECASE).search

def main():
 with open(raw_input("Enter path to song: ")) as song:
 print("EXPLICIT" if is_explicit(song.read()) else "CLEAN")

if __name__ == '__main__':
 main()

The way I chose to write the is_explicit() function above might show too much Haskell influence for some Python programmers' taste. Here's a more conventional way of writing it:

def is_explicit(text):
 return re.search(
 r'b(?:' +
 '|'.join(re.escape(w) for w in explicit_words) +
 r')b',
 text,
 re.IGNORECASE
 )

answered Jun 15 at 4:40

200_success

136k21 gold badges175 silver badges448 bronze badges

answered Jun 15 at 4:40

200_success

136k21 gold badges175 silver badges448 bronze badges

answered Jun 15 at 4:40

200_success

136k21 gold badges175 silver badges448 bronze badges

answered Jun 15 at 4:40

200_success

136k21 gold badges175 silver badges448 bronze badges

$begingroup$
The regex answer has a variety of things I don't like. Firstly, this compiles the regex every time. 2nd, regex has no guarantees of O(1) like a set would.
$endgroup$
– Oscar Smith
Jun 17 at 23:50

$begingroup$
@OscarSmith The program calls the helper function once from main(), and the word list contains just 14 entries, so neither objection was a significant concern in this solution.
$endgroup$
– 200_success
Jun 18 at 0:00

$begingroup$
Oh, I was mainly commenting on the bottom block, where it's in the function.
$endgroup$
– Oscar Smith
Jun 18 at 0:21

$begingroup$
I don't think using the regular expression makes this code any easier to understand. xkcd.com/1171
$endgroup$
– Captain Delano
Jun 18 at 7:32

add a comment
|

$begingroup$
The regex answer has a variety of things I don't like. Firstly, this compiles the regex every time. 2nd, regex has no guarantees of O(1) like a set would.
$endgroup$
– Oscar Smith
Jun 17 at 23:50

$begingroup$
@OscarSmith The program calls the helper function once from main(), and the word list contains just 14 entries, so neither objection was a significant concern in this solution.
$endgroup$
– 200_success
Jun 18 at 0:00

$begingroup$
Oh, I was mainly commenting on the bottom block, where it's in the function.
$endgroup$
– Oscar Smith
Jun 18 at 0:21

$begingroup$
I don't think using the regular expression makes this code any easier to understand. xkcd.com/1171
$endgroup$
– Captain Delano
Jun 18 at 7:32

The regex answer has a variety of things I don't like. Firstly, this compiles the regex every time. 2nd, regex has no guarantees of O(1) like a set would.

– Oscar Smith
Jun 17 at 23:50

@OscarSmith The program calls the helper function once from main(), and the word list contains just 14 entries, so neither objection was a significant concern in this solution.

– 200_success
Jun 18 at 0:00

Oh, I was mainly commenting on the bottom block, where it's in the function.

– Oscar Smith
Jun 18 at 0:21

I don't think using the regular expression makes this code any easier to understand. xkcd.com/1171

– Captain Delano
Jun 18 at 7:32

add a comment
|

Using words = line.split() would split on spaces, tab characters, and any other white space characters.

 for explicit_word in explicit_words:
 if word == explicit_word:
 return False

Python has great searching built into it. Instead of explicitly writing a loop over all explicit words, you could just ask Python to tell you if a word is in the list of explicit words:

 if word in explicit_words:
 return False

Or better, use a set() of explicit words, and the in operation drops from $O(n)$ to $O(1)$ time complexity:

explicit_words = 
 "explicit", "bad", "ugly", "offensive", "words" 


# ...

 if word in explicit_words:
 return False

 for line in song:
 words = line.split()
 if any(word in explicit_words for word in words):
 return False

Resulting method:

explicit_words = set(explicit_word.casefold()
 for explicit_word in ("explicit", "bad", "ugly", "offensive", "words"))

def isClean(song_path):
 with open(song_path) as song:
 for line in song:
 words = line.split()
 if any(word.casefold() in explicit_words for word in words):
 return False 
 return True

edited Jun 15 at 16:33

answered Jun 15 at 4:35

AJNeufeld

13.7k1 gold badge13 silver badges42 bronze badges

1

$begingroup$
Why does it need to iterate over the song lines? song.split() would also split on n (and r, of course). Call me unpythonic, but I would directly return any(word.lower() in explicit_words for word in song.split())
$endgroup$
– Helio
Jun 15 at 12:22

$begingroup$
@Helio I expect you would need song.read().split(), not song.split(). As for why I didn’t? I considered making it one-statement, and decided against doing so because the split() doesn’t consider punctuation as word separators. That area still needed replacement with a regex b style word splitting, so embedding it inside of the any(...) hides the work that still needed to be done.
$endgroup$
– AJNeufeld
Jun 15 at 12:37

$begingroup$
yes, my bad; I've posted the comment thoughtlessly. A naïve alternative to regular expressions would be replacing punctuation symbols by spaces and then running split. However, that would be too clumsy. If this code is not an exercise, maybe the OP should use some specialized framework like NLTK.
$endgroup$
– Helio
Jun 15 at 12:56

3

$begingroup$
Why not str.casefold instead of str.lower? They're the same for (most of) Latin, but different for stuff like Greek.
$endgroup$
– wizzwizz4
Jun 15 at 16:17

$begingroup$
@wizzwizz4 Fair point; well made. All the words in explicit_words would need to be run through casefold() as well.
$endgroup$
– AJNeufeld
Jun 15 at 16:26

add a comment
|

Using words = line.split() would split on spaces, tab characters, and any other white space characters.

 for explicit_word in explicit_words:
 if word == explicit_word:
 return False

Python has great searching built into it. Instead of explicitly writing a loop over all explicit words, you could just ask Python to tell you if a word is in the list of explicit words:

 if word in explicit_words:
 return False

Or better, use a set() of explicit words, and the in operation drops from $O(n)$ to $O(1)$ time complexity:

explicit_words = 
 "explicit", "bad", "ugly", "offensive", "words" 


# ...

 if word in explicit_words:
 return False

 for line in song:
 words = line.split()
 if any(word in explicit_words for word in words):
 return False

Resulting method:

explicit_words = set(explicit_word.casefold()
 for explicit_word in ("explicit", "bad", "ugly", "offensive", "words"))

def isClean(song_path):
 with open(song_path) as song:
 for line in song:
 words = line.split()
 if any(word.casefold() in explicit_words for word in words):
 return False 
 return True

edited Jun 15 at 16:33

answered Jun 15 at 4:35

AJNeufeld

13.7k1 gold badge13 silver badges42 bronze badges

1

$begingroup$
Why does it need to iterate over the song lines? song.split() would also split on n (and r, of course). Call me unpythonic, but I would directly return any(word.lower() in explicit_words for word in song.split())
$endgroup$
– Helio
Jun 15 at 12:22

$begingroup$
@Helio I expect you would need song.read().split(), not song.split(). As for why I didn’t? I considered making it one-statement, and decided against doing so because the split() doesn’t consider punctuation as word separators. That area still needed replacement with a regex b style word splitting, so embedding it inside of the any(...) hides the work that still needed to be done.
$endgroup$
– AJNeufeld
Jun 15 at 12:37

$begingroup$
yes, my bad; I've posted the comment thoughtlessly. A naïve alternative to regular expressions would be replacing punctuation symbols by spaces and then running split. However, that would be too clumsy. If this code is not an exercise, maybe the OP should use some specialized framework like NLTK.
$endgroup$
– Helio
Jun 15 at 12:56

3

$begingroup$
Why not str.casefold instead of str.lower? They're the same for (most of) Latin, but different for stuff like Greek.
$endgroup$
– wizzwizz4
Jun 15 at 16:17

$begingroup$
@wizzwizz4 Fair point; well made. All the words in explicit_words would need to be run through casefold() as well.
$endgroup$
– AJNeufeld
Jun 15 at 16:26

add a comment
|

Using words = line.split() would split on spaces, tab characters, and any other white space characters.

 for explicit_word in explicit_words:
 if word == explicit_word:
 return False

Python has great searching built into it. Instead of explicitly writing a loop over all explicit words, you could just ask Python to tell you if a word is in the list of explicit words:

 if word in explicit_words:
 return False

Or better, use a set() of explicit words, and the in operation drops from $O(n)$ to $O(1)$ time complexity:

explicit_words = 
 "explicit", "bad", "ugly", "offensive", "words" 


# ...

 if word in explicit_words:
 return False

 for line in song:
 words = line.split()
 if any(word in explicit_words for word in words):
 return False

Resulting method:

explicit_words = set(explicit_word.casefold()
 for explicit_word in ("explicit", "bad", "ugly", "offensive", "words"))

def isClean(song_path):
 with open(song_path) as song:
 for line in song:
 words = line.split()
 if any(word.casefold() in explicit_words for word in words):
 return False 
 return True

edited Jun 15 at 16:33

answered Jun 15 at 4:35

AJNeufeld

13.7k1 gold badge13 silver badges42 bronze badges

Using words = line.split() would split on spaces, tab characters, and any other white space characters.

 for explicit_word in explicit_words:
 if word == explicit_word:
 return False

Python has great searching built into it. Instead of explicitly writing a loop over all explicit words, you could just ask Python to tell you if a word is in the list of explicit words:

 if word in explicit_words:
 return False

Or better, use a set() of explicit words, and the in operation drops from $O(n)$ to $O(1)$ time complexity:

explicit_words = 
 "explicit", "bad", "ugly", "offensive", "words" 


# ...

 if word in explicit_words:
 return False

 for line in song:
 words = line.split()
 if any(word in explicit_words for word in words):
 return False

Resulting method:

explicit_words = set(explicit_word.casefold()
 for explicit_word in ("explicit", "bad", "ugly", "offensive", "words"))

def isClean(song_path):
 with open(song_path) as song:
 for line in song:
 words = line.split()
 if any(word.casefold() in explicit_words for word in words):
 return False 
 return True

edited Jun 15 at 16:33

answered Jun 15 at 4:35

AJNeufeld

13.7k1 gold badge13 silver badges42 bronze badges

edited Jun 15 at 16:33

answered Jun 15 at 4:35

AJNeufeld

13.7k1 gold badge13 silver badges42 bronze badges

answered Jun 15 at 4:35

AJNeufeld

13.7k1 gold badge13 silver badges42 bronze badges

answered Jun 15 at 4:35

AJNeufeld

13.7k1 gold badge13 silver badges42 bronze badges

1

$begingroup$
Why does it need to iterate over the song lines? song.split() would also split on n (and r, of course). Call me unpythonic, but I would directly return any(word.lower() in explicit_words for word in song.split())
$endgroup$
– Helio
Jun 15 at 12:22

$begingroup$
@Helio I expect you would need song.read().split(), not song.split(). As for why I didn’t? I considered making it one-statement, and decided against doing so because the split() doesn’t consider punctuation as word separators. That area still needed replacement with a regex b style word splitting, so embedding it inside of the any(...) hides the work that still needed to be done.
$endgroup$
– AJNeufeld
Jun 15 at 12:37

$begingroup$
yes, my bad; I've posted the comment thoughtlessly. A naïve alternative to regular expressions would be replacing punctuation symbols by spaces and then running split. However, that would be too clumsy. If this code is not an exercise, maybe the OP should use some specialized framework like NLTK.
$endgroup$
– Helio
Jun 15 at 12:56

3

$begingroup$
Why not str.casefold instead of str.lower? They're the same for (most of) Latin, but different for stuff like Greek.
$endgroup$
– wizzwizz4
Jun 15 at 16:17

$begingroup$
@wizzwizz4 Fair point; well made. All the words in explicit_words would need to be run through casefold() as well.
$endgroup$
– AJNeufeld
Jun 15 at 16:26

add a comment
|

1

$begingroup$
Why does it need to iterate over the song lines? song.split() would also split on n (and r, of course). Call me unpythonic, but I would directly return any(word.lower() in explicit_words for word in song.split())
$endgroup$
– Helio
Jun 15 at 12:22

$begingroup$
@Helio I expect you would need song.read().split(), not song.split(). As for why I didn’t? I considered making it one-statement, and decided against doing so because the split() doesn’t consider punctuation as word separators. That area still needed replacement with a regex b style word splitting, so embedding it inside of the any(...) hides the work that still needed to be done.
$endgroup$
– AJNeufeld
Jun 15 at 12:37

$begingroup$
yes, my bad; I've posted the comment thoughtlessly. A naïve alternative to regular expressions would be replacing punctuation symbols by spaces and then running split. However, that would be too clumsy. If this code is not an exercise, maybe the OP should use some specialized framework like NLTK.
$endgroup$
– Helio
Jun 15 at 12:56

3

$begingroup$
Why not str.casefold instead of str.lower? They're the same for (most of) Latin, but different for stuff like Greek.
$endgroup$
– wizzwizz4
Jun 15 at 16:17

$begingroup$
@wizzwizz4 Fair point; well made. All the words in explicit_words would need to be run through casefold() as well.
$endgroup$
– AJNeufeld
Jun 15 at 16:26

Why does it need to iterate over the song lines? song.split() would also split on n (and r, of course). Call me unpythonic, but I would directly return any(word.lower() in explicit_words for word in song.split())

– Helio
Jun 15 at 12:22

@Helio I expect you would need song.read().split(), not song.split(). As for why I didn’t? I considered making it one-statement, and decided against doing so because the split() doesn’t consider punctuation as word separators. That area still needed replacement with a regex b style word splitting, so embedding it inside of the any(...) hides the work that still needed to be done.

– AJNeufeld
Jun 15 at 12:37

yes, my bad; I've posted the comment thoughtlessly. A naïve alternative to regular expressions would be replacing punctuation symbols by spaces and then running split. However, that would be too clumsy. If this code is not an exercise, maybe the OP should use some specialized framework like NLTK.

– Helio
Jun 15 at 12:56

Why not str.casefold instead of str.lower? They're the same for (most of) Latin, but different for stuff like Greek.

– wizzwizz4
Jun 15 at 16:17

@wizzwizz4 Fair point; well made. All the words in explicit_words would need to be run through casefold() as well.

– AJNeufeld
Jun 15 at 16:26

add a comment
|

Some suggestions:

Using argparse to parse arguments is preferable to any kind of interactive input, because it means that
- it's much easier to automate and integrate in other scripts,
- you could trivially support passing multiple songs in one command,
- it's what any experienced shell user would expect, and
- if the user adds --help or doesn't pass any arguments they will get a synopsis for the command.

The list of explicit words should not be part of the code, because it is configuration and is very likely to change often. You'll want to put words in a separate file and read them in during initialization. You could very easily make this an optional argument (with argparse) so that you could do fancy stuff like ./explicit_song_checker.py --words=./words.txt --words=./other.txt ./kill-the-queen.txt ./bulls-on-parade.txt

black will automatically format your code to be pretty idiomatic, such as using two empty lines between top-level functions.

flake8 will check for style issues. I believe it will assume that you're writing Python 3, and so will complain about the use of raw_input (called input in Python 3).

mypy can be used to enforce type hints, which is a super helpful way to make your code clearer without resorting to comments.

edited Jun 15 at 6:38

answered Jun 15 at 6:27

l0b0

6,77613 silver badges32 bronze badges

add a comment
|

Some suggestions:

Using argparse to parse arguments is preferable to any kind of interactive input, because it means that
- it's much easier to automate and integrate in other scripts,
- you could trivially support passing multiple songs in one command,
- it's what any experienced shell user would expect, and
- if the user adds --help or doesn't pass any arguments they will get a synopsis for the command.

The list of explicit words should not be part of the code, because it is configuration and is very likely to change often. You'll want to put words in a separate file and read them in during initialization. You could very easily make this an optional argument (with argparse) so that you could do fancy stuff like ./explicit_song_checker.py --words=./words.txt --words=./other.txt ./kill-the-queen.txt ./bulls-on-parade.txt

black will automatically format your code to be pretty idiomatic, such as using two empty lines between top-level functions.

flake8 will check for style issues. I believe it will assume that you're writing Python 3, and so will complain about the use of raw_input (called input in Python 3).

mypy can be used to enforce type hints, which is a super helpful way to make your code clearer without resorting to comments.

edited Jun 15 at 6:38

answered Jun 15 at 6:27

l0b0

6,77613 silver badges32 bronze badges

add a comment
|

Some suggestions:

Using argparse to parse arguments is preferable to any kind of interactive input, because it means that
- it's much easier to automate and integrate in other scripts,
- you could trivially support passing multiple songs in one command,
- it's what any experienced shell user would expect, and
- if the user adds --help or doesn't pass any arguments they will get a synopsis for the command.

The list of explicit words should not be part of the code, because it is configuration and is very likely to change often. You'll want to put words in a separate file and read them in during initialization. You could very easily make this an optional argument (with argparse) so that you could do fancy stuff like ./explicit_song_checker.py --words=./words.txt --words=./other.txt ./kill-the-queen.txt ./bulls-on-parade.txt

black will automatically format your code to be pretty idiomatic, such as using two empty lines between top-level functions.

flake8 will check for style issues. I believe it will assume that you're writing Python 3, and so will complain about the use of raw_input (called input in Python 3).

mypy can be used to enforce type hints, which is a super helpful way to make your code clearer without resorting to comments.

edited Jun 15 at 6:38

answered Jun 15 at 6:27

l0b0

6,77613 silver badges32 bronze badges

Some suggestions:

Using argparse to parse arguments is preferable to any kind of interactive input, because it means that
- it's much easier to automate and integrate in other scripts,
- you could trivially support passing multiple songs in one command,
- it's what any experienced shell user would expect, and
- if the user adds --help or doesn't pass any arguments they will get a synopsis for the command.

The list of explicit words should not be part of the code, because it is configuration and is very likely to change often. You'll want to put words in a separate file and read them in during initialization. You could very easily make this an optional argument (with argparse) so that you could do fancy stuff like ./explicit_song_checker.py --words=./words.txt --words=./other.txt ./kill-the-queen.txt ./bulls-on-parade.txt

black will automatically format your code to be pretty idiomatic, such as using two empty lines between top-level functions.

flake8 will check for style issues. I believe it will assume that you're writing Python 3, and so will complain about the use of raw_input (called input in Python 3).

mypy can be used to enforce type hints, which is a super helpful way to make your code clearer without resorting to comments.

edited Jun 15 at 6:38

answered Jun 15 at 6:27

l0b0

6,77613 silver badges32 bronze badges

edited Jun 15 at 6:38

answered Jun 15 at 6:27

l0b0

6,77613 silver badges32 bronze badges

answered Jun 15 at 6:27

l0b0

6,77613 silver badges32 bronze badges

answered Jun 15 at 6:27

l0b0

6,77613 silver badges32 bronze badges

add a comment
|

To simplify your input you could change word == explicit_word to explicit_word in word.

This can lead to the Scunthorpe problem, and so from here you can check if the word only contains prefixes and suffixes.

Given the amount of common prefixes alone, it's unlikely you'd catch all variations using a blacklist alone. It would also catch gibberish/semigibberish that you may not have thought of - "Antioverfucked" / "Overantifucked".

Your code misses words without spaces "I don't give afuck" and so you may want to perform a 'similarity check'.

This could also match thinks like "sexx".

Your code doesn't reasonably handle ZWSP and other forms of Unicode abuse.

>>> print('expu200blitive')
explitive
>>> 'expu200blitive' == 'explitive'
False
>>> print('expu1DD3litive')
expᷓlitive

Your code doesn't reasonably detect leet speak.

Your code doesn't reasonably detect creative Unicode character selections.
```
>>> print('u2131u01DAu0255u212A')
ℱǚɕK
```

Your code doesn't reasonably handle phonetic abuse.

"Phuck" is pronounced "Fuck".

"Bend Dover" sounds like "bend over".

"I CUP" spelt out is "I see you pee".

"See you next Tuesday" each starting phonetic spells "CU...".

When I say doesn't reasonably, I mean if someone combined the above to get something ridiculous like "4n7ipᷓhǚɕK" your word list would have to be ridiculous in size to match it.

It should also be fairly obvious that someone probably will find a way to bypass your check no matter what you do.

Practicality vs Perfection

Given how widespread leet speak is you should handle that at a bare minimum. This is fairly easy, you just 'translate' from this alphabet. And perform the checks you're already doing.

From here ZWSP and other unicode abuse probably are the next most common. And with things like deasciiify it's only getting easier.

answered Jun 17 at 0:07

Peilonrayz

30.7k4 gold badges46 silver badges123 bronze badges

1

$begingroup$
Yeah, if catching explicit language was easy, you wouldn't have things like controversies over video game language filters getting over-aggressive and filtering out legitimate speech.
$endgroup$
– nick012000
Jun 17 at 2:37

1

$begingroup$
If you only check for prefixes and suffixes, you still have the Penistone and Cockfosters problems.
$endgroup$
– Mike Scott
Jun 17 at 16:43

add a comment
|

To simplify your input you could change word == explicit_word to explicit_word in word.

This can lead to the Scunthorpe problem, and so from here you can check if the word only contains prefixes and suffixes.

Given the amount of common prefixes alone, it's unlikely you'd catch all variations using a blacklist alone. It would also catch gibberish/semigibberish that you may not have thought of - "Antioverfucked" / "Overantifucked".

Your code misses words without spaces "I don't give afuck" and so you may want to perform a 'similarity check'.

This could also match thinks like "sexx".

Your code doesn't reasonably handle ZWSP and other forms of Unicode abuse.

>>> print('expu200blitive')
explitive
>>> 'expu200blitive' == 'explitive'
False
>>> print('expu1DD3litive')
expᷓlitive

Your code doesn't reasonably detect leet speak.

Your code doesn't reasonably detect creative Unicode character selections.
```
>>> print('u2131u01DAu0255u212A')
ℱǚɕK
```

Your code doesn't reasonably handle phonetic abuse.

"Phuck" is pronounced "Fuck".

"Bend Dover" sounds like "bend over".

"I CUP" spelt out is "I see you pee".

"See you next Tuesday" each starting phonetic spells "CU...".

When I say doesn't reasonably, I mean if someone combined the above to get something ridiculous like "4n7ipᷓhǚɕK" your word list would have to be ridiculous in size to match it.

It should also be fairly obvious that someone probably will find a way to bypass your check no matter what you do.

Practicality vs Perfection

Given how widespread leet speak is you should handle that at a bare minimum. This is fairly easy, you just 'translate' from this alphabet. And perform the checks you're already doing.

From here ZWSP and other unicode abuse probably are the next most common. And with things like deasciiify it's only getting easier.

answered Jun 17 at 0:07

Peilonrayz

30.7k4 gold badges46 silver badges123 bronze badges

1

$begingroup$
Yeah, if catching explicit language was easy, you wouldn't have things like controversies over video game language filters getting over-aggressive and filtering out legitimate speech.
$endgroup$
– nick012000
Jun 17 at 2:37

1

$begingroup$
If you only check for prefixes and suffixes, you still have the Penistone and Cockfosters problems.
$endgroup$
– Mike Scott
Jun 17 at 16:43

add a comment
|

To simplify your input you could change word == explicit_word to explicit_word in word.

This can lead to the Scunthorpe problem, and so from here you can check if the word only contains prefixes and suffixes.

Given the amount of common prefixes alone, it's unlikely you'd catch all variations using a blacklist alone. It would also catch gibberish/semigibberish that you may not have thought of - "Antioverfucked" / "Overantifucked".

Your code misses words without spaces "I don't give afuck" and so you may want to perform a 'similarity check'.

This could also match thinks like "sexx".

Your code doesn't reasonably handle ZWSP and other forms of Unicode abuse.

>>> print('expu200blitive')
explitive
>>> 'expu200blitive' == 'explitive'
False
>>> print('expu1DD3litive')
expᷓlitive

Your code doesn't reasonably detect leet speak.

Your code doesn't reasonably detect creative Unicode character selections.
```
>>> print('u2131u01DAu0255u212A')
ℱǚɕK
```

Your code doesn't reasonably handle phonetic abuse.

"Phuck" is pronounced "Fuck".

"Bend Dover" sounds like "bend over".

"I CUP" spelt out is "I see you pee".

"See you next Tuesday" each starting phonetic spells "CU...".

When I say doesn't reasonably, I mean if someone combined the above to get something ridiculous like "4n7ipᷓhǚɕK" your word list would have to be ridiculous in size to match it.

It should also be fairly obvious that someone probably will find a way to bypass your check no matter what you do.

Practicality vs Perfection

Given how widespread leet speak is you should handle that at a bare minimum. This is fairly easy, you just 'translate' from this alphabet. And perform the checks you're already doing.

From here ZWSP and other unicode abuse probably are the next most common. And with things like deasciiify it's only getting easier.

answered Jun 17 at 0:07

Peilonrayz

30.7k4 gold badges46 silver badges123 bronze badges

To simplify your input you could change word == explicit_word to explicit_word in word.

This can lead to the Scunthorpe problem, and so from here you can check if the word only contains prefixes and suffixes.

Given the amount of common prefixes alone, it's unlikely you'd catch all variations using a blacklist alone. It would also catch gibberish/semigibberish that you may not have thought of - "Antioverfucked" / "Overantifucked".

Your code misses words without spaces "I don't give afuck" and so you may want to perform a 'similarity check'.

This could also match thinks like "sexx".

Your code doesn't reasonably handle ZWSP and other forms of Unicode abuse.

>>> print('expu200blitive')
explitive
>>> 'expu200blitive' == 'explitive'
False
>>> print('expu1DD3litive')
expᷓlitive

Your code doesn't reasonably detect leet speak.

Your code doesn't reasonably detect creative Unicode character selections.
```
>>> print('u2131u01DAu0255u212A')
ℱǚɕK
```

Your code doesn't reasonably handle phonetic abuse.

"Phuck" is pronounced "Fuck".

"Bend Dover" sounds like "bend over".

"I CUP" spelt out is "I see you pee".

"See you next Tuesday" each starting phonetic spells "CU...".

When I say doesn't reasonably, I mean if someone combined the above to get something ridiculous like "4n7ipᷓhǚɕK" your word list would have to be ridiculous in size to match it.

It should also be fairly obvious that someone probably will find a way to bypass your check no matter what you do.

Practicality vs Perfection

Given how widespread leet speak is you should handle that at a bare minimum. This is fairly easy, you just 'translate' from this alphabet. And perform the checks you're already doing.

From here ZWSP and other unicode abuse probably are the next most common. And with things like deasciiify it's only getting easier.

answered Jun 17 at 0:07

Peilonrayz

30.7k4 gold badges46 silver badges123 bronze badges

answered Jun 17 at 0:07

Peilonrayz

30.7k4 gold badges46 silver badges123 bronze badges

answered Jun 17 at 0:07

Peilonrayz

30.7k4 gold badges46 silver badges123 bronze badges

answered Jun 17 at 0:07

Peilonrayz

30.7k4 gold badges46 silver badges123 bronze badges

1

$begingroup$
Yeah, if catching explicit language was easy, you wouldn't have things like controversies over video game language filters getting over-aggressive and filtering out legitimate speech.
$endgroup$
– nick012000
Jun 17 at 2:37

1

$begingroup$
If you only check for prefixes and suffixes, you still have the Penistone and Cockfosters problems.
$endgroup$
– Mike Scott
Jun 17 at 16:43

add a comment
|

1

$begingroup$
Yeah, if catching explicit language was easy, you wouldn't have things like controversies over video game language filters getting over-aggressive and filtering out legitimate speech.
$endgroup$
– nick012000
Jun 17 at 2:37

1

$begingroup$
If you only check for prefixes and suffixes, you still have the Penistone and Cockfosters problems.
$endgroup$
– Mike Scott
Jun 17 at 16:43

Yeah, if catching explicit language was easy, you wouldn't have things like controversies over video game language filters getting over-aggressive and filtering out legitimate speech.

– nick012000
Jun 17 at 2:37

If you only check for prefixes and suffixes, you still have the Penistone and Cockfosters problems.

– Mike Scott
Jun 17 at 16:43

add a comment
|

explicit_words = [
 #explicit words not shown 
]

edited Jun 17 at 14:07

answered Jun 17 at 13:59

Toby Speight

33.3k7 gold badges46 silver badges143 bronze badges

add a comment
|

explicit_words = [
 #explicit words not shown 
]

edited Jun 17 at 14:07

answered Jun 17 at 13:59

Toby Speight

33.3k7 gold badges46 silver badges143 bronze badges

add a comment
|

explicit_words = [
 #explicit words not shown 
]

edited Jun 17 at 14:07

answered Jun 17 at 13:59

Toby Speight

33.3k7 gold badges46 silver badges143 bronze badges

explicit_words = [
 #explicit words not shown 
]

edited Jun 17 at 14:07

answered Jun 17 at 13:59

Toby Speight

33.3k7 gold badges46 silver badges143 bronze badges

edited Jun 17 at 14:07

answered Jun 17 at 13:59

Toby Speight

33.3k7 gold badges46 silver badges143 bronze badges

answered Jun 17 at 13:59

Toby Speight

33.3k7 gold badges46 silver badges143 bronze badges

answered Jun 17 at 13:59

Toby Speight

33.3k7 gold badges46 silver badges143 bronze badges

add a comment
|

The code feedback is already done wonderfully. I'd like to write about some pointers that might be important, apart from the leet speak:

The word Phoque , is French for seal. There are numerous examples of words which are perfectly meaningful in some language but if pronounced in English, might be considered offensive. You might need to do a phonetic similarity check

The context can be taken into account. For example, the word 'blow' or 'mom' need not be offensive. It becomes offensive only when it is surrounded by contextual words that make it offensive, ('job', etc). In these cases, you might need to come up with an algorithm that outputs a probability score and filter songs based on that.

answered Jun 18 at 6:13

N Praveen Chandhar

111 bronze badge

$begingroup$
I'd like to take this up as a project. This has tremendous scope. For example, you can integrate this into the smartphones of kids . If youre interested, drop me a message
$endgroup$
– N Praveen Chandhar
Jun 18 at 6:14

add a comment
|

The code feedback is already done wonderfully. I'd like to write about some pointers that might be important, apart from the leet speak:

The word Phoque , is French for seal. There are numerous examples of words which are perfectly meaningful in some language but if pronounced in English, might be considered offensive. You might need to do a phonetic similarity check

The context can be taken into account. For example, the word 'blow' or 'mom' need not be offensive. It becomes offensive only when it is surrounded by contextual words that make it offensive, ('job', etc). In these cases, you might need to come up with an algorithm that outputs a probability score and filter songs based on that.

answered Jun 18 at 6:13

N Praveen Chandhar

111 bronze badge

$begingroup$
I'd like to take this up as a project. This has tremendous scope. For example, you can integrate this into the smartphones of kids . If youre interested, drop me a message
$endgroup$
– N Praveen Chandhar
Jun 18 at 6:14

add a comment
|

The code feedback is already done wonderfully. I'd like to write about some pointers that might be important, apart from the leet speak:

The word Phoque , is French for seal. There are numerous examples of words which are perfectly meaningful in some language but if pronounced in English, might be considered offensive. You might need to do a phonetic similarity check

The context can be taken into account. For example, the word 'blow' or 'mom' need not be offensive. It becomes offensive only when it is surrounded by contextual words that make it offensive, ('job', etc). In these cases, you might need to come up with an algorithm that outputs a probability score and filter songs based on that.

answered Jun 18 at 6:13

N Praveen Chandhar

111 bronze badge

The code feedback is already done wonderfully. I'd like to write about some pointers that might be important, apart from the leet speak:

The word Phoque , is French for seal. There are numerous examples of words which are perfectly meaningful in some language but if pronounced in English, might be considered offensive. You might need to do a phonetic similarity check

The context can be taken into account. For example, the word 'blow' or 'mom' need not be offensive. It becomes offensive only when it is surrounded by contextual words that make it offensive, ('job', etc). In these cases, you might need to come up with an algorithm that outputs a probability score and filter songs based on that.

answered Jun 18 at 6:13

N Praveen Chandhar

111 bronze badge

answered Jun 18 at 6:13

N Praveen Chandhar

111 bronze badge

answered Jun 18 at 6:13

N Praveen Chandhar

111 bronze badge

answered Jun 18 at 6:13

N Praveen Chandhar

111 bronze badge

$begingroup$
I'd like to take this up as a project. This has tremendous scope. For example, you can integrate this into the smartphones of kids . If youre interested, drop me a message
$endgroup$
– N Praveen Chandhar
Jun 18 at 6:14

add a comment
|

$begingroup$
I'd like to take this up as a project. This has tremendous scope. For example, you can integrate this into the smartphones of kids . If youre interested, drop me a message
$endgroup$
– N Praveen Chandhar
Jun 18 at 6:14

I'd like to take this up as a project. This has tremendous scope. For example, you can integrate this into the smartphones of kids . If youre interested, drop me a message

– N Praveen Chandhar
Jun 18 at 6:14

add a comment
|

draft saved

draft discarded

Thanks for contributing an answer to Code Review Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Bsrgvty

6 Answers
6

Practicality vs Perfection

Your Answer

Post as a guest

6 Answers
6

6 Answers
6

Practicality vs Perfection

Practicality vs Perfection

Practicality vs Perfection

Practicality vs Perfection

Post as a guest

Popular posts from this blog

6 Answers 6

Practicality vs Perfection

Your Answer

Sign up or log in

Post as a guest

Post as a guest

6 Answers 6

6 Answers 6

Practicality vs Perfection

Practicality vs Perfection

Practicality vs Perfection

Practicality vs Perfection

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Popular posts from this blog

6 Answers
6

6 Answers
6

6 Answers
6