This is a very good idea.
-
I was watching a video and found a lot of stolen comments, so I tried to measure the effectiveness of this idea:
```javascript
// Collect the text of every comment currently loaded on the page
const comments = Array.from(document.querySelectorAll('#content-text')).map(e => e.textContent)
const uniqueComments = Array.from(new Set(comments))
// Count exact repetitions of each unique comment and keep only the repeated ones
const mostRepeatedComments = uniqueComments
  .map(e => ({
    comment: e,
    repetitions: comments.filter(comment => comment === e).length
  }))
  .filter(e => e.repetitions > 1)
  .sort((a, b) => b.repetitions - a.repetitions)
console.log('Comments count:', comments.length)
console.log('Unique comments count:', uniqueComments.length)
console.table(mostRepeatedComments)
```

It was this video. I loaded 801 comments and then the above code found ten exact duplicates. One of those (the one with "so bad that Google") was duplicated eight times: the author, five sex bots that are easy to blacklist, and two accounts with seemingly normal names and avatars.

Notice that the above code does not do any kind of string normalization; it only checks for exact matches, so it is pretty likely that those last two accounts are bots too, even if their names are not in the blacklist. If some normalization is applied by replacing

Based on this test result, I think this is a pretty good way to detect bots with zero false positives, even when doing normalization: none of the detected comments are simple common comments like "First!". On the other hand, why would a bot steal a comment and still use a normal name? Weird.
-
One thing that's pretty common for spammers is stealing comments from other users. An example: original comment, stolen comment.
Sure, in the above example the current filters can detect that the comment comes from a spammer because the name matches one or more words in the blacklists. Still, detecting duplicated comments would be a useful feature in case spammers (or whoever writes the scripts they use) try to work around those blacklists.
One thing that needs to be considered is what to do about common comments (such as "first"), which may be repeated unintentionally. Because of that possibility, a minimum comment length or complexity needs to be set (maybe skip comments with fewer than 20 characters).
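A minimal sketch of the length threshold, assuming the comments are already available as an array of plain strings. The name `findDuplicateComments` and the default `minLength` of 20 are only illustrative, not part of the existing filters:

```javascript
// Hypothetical helper: groups exact duplicate comments, skipping short ones.
// `minLength` (default 20) is an assumed threshold, not a tuned value.
function findDuplicateComments(comments, minLength = 20) {
  const counts = new Map()
  for (const comment of comments) {
    // Short comments such as "first" are likely to repeat unintentionally
    if (comment.length < minLength) continue
    counts.set(comment, (counts.get(comment) || 0) + 1)
  }
  // Keep only comments seen more than once, most repeated first
  return Array.from(counts, ([comment, repetitions]) => ({ comment, repetitions }))
    .filter(e => e.repetitions > 1)
    .sort((a, b) => b.repetitions - a.repetitions)
}
```

Counting with a `Map` visits each comment once, which may matter on videos with many thousands of comments.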
Another thing is what spammers can do to work around this, like adding invisible characters and/or replacing characters with lookalikes. To counteract it, some message normalization would be needed: removing invisible characters, replacing lookalikes, applying Unicode normalization, and normalizing spaces (replace consecutive spaces with one, then trim). Another possibility is that they use a more aggressive lookalike list, such as `l` with `I` (lIlIlI), instead of the common `a` with `а`; then some string diff algorithm could help.

This last point, if too complex, can be skipped and is not that important, as just detecting exact duplicated comments already seems to be a good criterion for detecting many spammers.
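A rough sketch of the normalization steps and of a string diff check. Everything here is an assumption for illustration: the `LOOKALIKES` table is deliberately tiny, the invisible-character list is not exhaustive, and Levenshtein distance is just one possible choice of diff algorithm:

```javascript
// Assumed, deliberately tiny lookalike table (Cyrillic -> Latin);
// a real list would need to be much larger.
const LOOKALIKES = new Map([
  ['\u0430', 'a'], // Cyrillic "а"
  ['\u0435', 'e'], // Cyrillic "е"
  ['\u043E', 'o']  // Cyrillic "о"
])

// Hypothetical helper applying the normalization steps in order
function normalizeComment(text) {
  return text
    .normalize('NFKC') // Unicode compatibility normalization
    .replace(/[\u200B-\u200D\u2060\uFEFF\u00AD]/g, '') // some invisible characters
    .replace(/./gu, ch => LOOKALIKES.get(ch) || ch) // replace known lookalikes
    .replace(/\s+/g, ' ') // collapse consecutive whitespace into one space
    .trim()
}

// Levenshtein edit distance, keeping only two rows of the table
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
    }
    prev = curr
  }
  return prev[b.length]
}

// Similarity in [0, 1]; 1 means the strings are identical
function similarity(a, b) {
  const longest = Math.max(a.length, b.length)
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest
}
```

Two comments could then be treated as near-duplicates when `similarity(normalizeComment(a), normalizeComment(b))` exceeds some cutoff. The right cutoff is unknown and would need testing against real comments, and comparing every pair of comments this way gets expensive on long threads.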