[IMPROVEMENT] Filter bad words #1139
There are a few issues there, but it's a decent PR. Should be quick to fix.
Just generally curious but would it be possible to match patterns? Like |
I also found a list from a university that was around 10k words, but most of the words were not profanity in themselves, so I gave up and just took some words to use as placeholders. Examples of irrelevant words: gay, transgender, mum...
src/lib_ccx/ccx_encoders_helpers.c
Outdated
@@ -468,6 +540,6 @@ void shell_sort(void *base, int nb, size_t size, int(*compar)(const void*p1, con

 void ccx_encoders_helpers_perform_shellsort_words(void)
 {
-	shell_sort(spell_lower, spell_words, sizeof(*spell_lower), string_cmp_function, NULL);
-	shell_sort(spell_correct, spell_words, sizeof(*spell_correct), string_cmp_function, NULL);
+	shell_sort(spell_lower.words, spell_lower.len, sizeof(*spell_lower.words), string_cmp_function, NULL);
Seems strange to be passing a length there and also using sizeof, why?
I didn't write that code, but:
> Seems strange to be passing a length
So that the function knows how many elements it needs to sort.
> using sizeof
To know the size of each element.
The best would probably be to use a stdlib sorting function (qsort).
Also, I didn't sort the list of profane words, which means binary search is not guaranteed to work (so far it has worked because the list was already sorted).
I'll be fixing that tomorrow.
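To illustrate the point about element count versus element size, and why the list has to be sorted before a binary search, here is a minimal sketch using the standard `qsort` and `bsearch`. The names `cmp_cstr_sketch`, `sort_words_sketch` and `contains_word_sketch` are hypothetical and are not taken from CCExtractor; the sketch also assumes a case-sensitive comparison, which may differ from what the real code does.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Comparator for arrays of C strings. Both qsort and bsearch pass
 * pointers to elements, i.e. pointers to (const char *). */
int cmp_cstr_sketch(const void *a, const void *b)
{
	const char *const *sa = a;
	const char *const *sb = b;
	return strcmp(*sa, *sb);
}

/* Sort `len` strings in place. `len` is the number of elements,
 * sizeof(*words) is the size of one element: qsort needs both. */
void sort_words_sketch(const char **words, size_t len)
{
	assert(words != NULL);
	qsort(words, len, sizeof(*words), cmp_cstr_sketch);
}

/* Return 1 if `word` is in `sorted`, 0 otherwise. The result is only
 * meaningful if `sorted` was ordered with the same comparator. */
int contains_word_sketch(const char *const *sorted, size_t len, const char *word)
{
	assert(sorted != NULL && word != NULL);
	return bsearch(&word, sorted, len, sizeof(*sorted), cmp_cstr_sketch) != NULL;
}
```

Calling `contains_word_sketch` on a list that was never passed through `sort_words_sketch` may miss entries that are present, which is the situation described above for the profane word list.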
The use of shell sort was added in #41 because it was faster than quicksort.
Also, to clarify, spell_words was the length of the array (now spell_correct.len), so no behaviour should have changed.
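For readers unfamiliar with the algorithm, a generic shell sort that takes both an element count and an element size could look roughly like the sketch below. This is not the `shell_sort` from `ccx_encoders_helpers.c`: the name `shell_sort_sketch`, its four-argument signature, the simple halving gap sequence and the helper `cmp_int_sketch` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical generic shell sort (NOT the CCExtractor implementation).
 * `nb` is the number of elements, `size` the size in bytes of one element.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int shell_sort_sketch(void *base, size_t nb, size_t size,
		      int (*compar)(const void *, const void *))
{
	assert(base != NULL && compar != NULL);

	unsigned char *arr = base;
	unsigned char *tmp = malloc(size); /* holds the element being inserted */
	if (tmp == NULL)
		return -1;

	/* Gapped insertion sort, halving the gap on every pass. */
	for (size_t gap = nb / 2; gap > 0; gap /= 2)
	{
		for (size_t i = gap; i < nb; i++)
		{
			memcpy(tmp, arr + i * size, size);
			size_t j = i;
			/* Shift larger elements one gap to the right. */
			while (j >= gap && compar(arr + (j - gap) * size, tmp) > 0)
			{
				memcpy(arr + j * size, arr + (j - gap) * size, size);
				j -= gap;
			}
			memcpy(arr + j * size, tmp, size);
		}
	}
	free(tmp);
	return 0;
}

/* Example comparator for int elements. */
int cmp_int_sketch(const void *a, const void *b)
{
	int x = *(const int *)a;
	int y = *(const int *)b;
	return (x > y) - (x < y);
}
```

The count tells the sort how far to iterate, and the element size tells it how many bytes to move per element, which is why a call site passes both a length and a `sizeof`.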
That's not too reassuring :-) You need to test a lot and be sure that you are not introducing any bug...
@NilsIrl once you have tested it, write here specifically how you tested it, the output, etc., and we'll merge.
It would probably be better to modify the subtitles before they are passed to the encoder. This abstracts these features away from the encoders. What do you think, @cfsmp3? For example, right now I'm implementing the scc encoder and have to deal with this, which is even worse because I need to implement the "previous" version, the version before this PR. There also seem to be autodash and trim_subs, which might be related (though I haven't looked into them yet and they don't appear in all encoders).
Not so easy - for example for 608 you'd need to modify the grid, which is
fixed in size - it would be a pain in the ass.
It's not a bad idea though, but I don't think it would save as much time as
you'd think :-)
On Tue, Dec 17, 2019 at 1:35 PM Nils ANDRÉ-CHANG wrote:
> It would probably be better to modify the subtitles before they are passed
> to the encoder. This abstracts these features away from encoders.
> What do you think @cfsmp3
> For example, right now I'm implementing the scc encoder and have to deal
> with this which is even worse because I need to implement the "previous"
> version, the version before this PR.
Multi-word swear words will not work. Fix 1: Remove multi-word swear words
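One plausible reason multi-word entries cannot match, assuming the filter looks up each space-delimited token on its own (an assumption, the thread does not spell this out), is sketched below. The helper `count_token_matches_sketch`, the 256-byte buffer and the word "darn" are made up for illustration; "Judas Priest" is one of the entries from the list under review.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: count how many space-delimited tokens of `line`
 * are exactly equal to an entry of `list`. Lines longer than 255 bytes
 * are truncated in this sketch. */
int count_token_matches_sketch(const char *line, const char *const *list, size_t len)
{
	char buf[256];
	int hits = 0;

	assert(line != NULL && list != NULL);

	/* strtok modifies its input, so work on a local copy. */
	strncpy(buf, line, sizeof(buf) - 1);
	buf[sizeof(buf) - 1] = '\0';

	for (char *tok = strtok(buf, " "); tok != NULL; tok = strtok(NULL, " "))
	{
		for (size_t i = 0; i < len; i++)
		{
			if (strcmp(tok, list[i]) == 0)
				hits++;
		}
	}
	return hits;
}
```

With a list containing "darn" and "Judas Priest", the line "well darn it" yields one match, while the line "Judas Priest" yields none, because neither "Judas" nor "Priest" alone equals the two-word entry.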
a0c861a to 5fcb31d
src/lib_ccx/ccx_encoders_helpers.c
Outdated
"Jesus fuck",
"Jesus Harold Christ",
"Jesus wept",
"Judas Priest",
I'm not sure this should be considered as a swear word :p
If there's ever a documentary on https://en.wikipedia.org/wiki/Judas_Priest, this will be annoying ;)
@NilsIrl The existing capfile (the dictionary) no longer works:
E:\GitHub\ccextractor\windows\Debug>ccextractorwin.exe --capfile ....\Dictionary\MattS_dictionary.txt E:\Downloads\c83f765c661595e1bfa4750756a54c006c6f2c697a436bc0726986f71f0706cd.ts
Please fix this before we can merge.
d009e9e to e1d3060
@canihavesomecoffee all OK now?
Didn't retest yet. Can do that tomorrow.
Dictionary still works as intended, built-in list also works. Will trigger a final re-run of the Test Suite to check.
CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results:
It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you). Your PR breaks these cases:
Check the result page for more info.
CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results:
It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you). Your PR breaks these cases:
Check the result page for more info.
I'd say it's mergeable.
Fix #1114
This PR reformats most of the capitalization parts as well to make a better interface.
This PR also removes one of the lists that was used for capitalization before, spell_lower, as it was useless.
PS: btw, I took the list of words from Wikipedia, I don't have that good of a vocab 😂