Check for stolen comments as a criteria #218

qgustavor · 2021-12-26T14:46:11Z

One thing that's pretty common for spammers is stealing comments from other users. An example: original comment, stolen comment.

Sure, in the above example the current filters can detect that the comment comes from a spammer because the name matches one or more words in the blacklists. Still, detecting duplicated comments would be a useful feature in case spammers (or whoever writes the scripts they use) try to work around those blacklists.

One thing that needs to be considered: what to do about common comments (such as "first"), which may be repeated unintentionally? Because of that possibility, a minimum comment length or complexity needs to be set (maybe skip comments with fewer than 20 characters).
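A minimal sketch of that length threshold, assuming the comments are already available as an array of strings. The function name `findDuplicatedComments` and the sample data are hypothetical, and the default of 20 is just the value floated above, not a tuned one:

```javascript
// Hypothetical helper: count exact duplicates, ignoring short comments.
// The 20-character default is the value suggested above, not a tuned threshold.
function findDuplicatedComments (comments, minLength = 20) {
  const counts = new Map()
  for (const text of comments) {
    // Count code points instead of UTF-16 units so an emoji counts as one character
    if (Array.from(text.trim()).length < minLength) continue
    counts.set(text, (counts.get(text) || 0) + 1)
  }
  return Array.from(counts)
    .filter(([, repetitions]) => repetitions > 1)
    .map(([comment, repetitions]) => ({ comment, repetitions }))
    .sort((a, b) => b.repetitions - a.repetitions)
}

// Made-up sample data: "first" is repeated but too short to be counted
const sampleComments = [
  'first',
  'first',
  'This video explains the topic better than my teacher did',
  'This video explains the topic better than my teacher did'
]
const duplicatedSample = findDuplicatedComments(sampleComments)
console.log(duplicatedSample)
```

A length check is the simplest option; a "complexity" measure (for example, the number of distinct words) could replace it without changing the rest.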

Another thing to consider is what spammers can do to work around this, like adding invisible characters and/or replacing characters with lookalikes. To counteract that, some message normalization would be needed: removing invisible characters, replacing lookalikes, applying Unicode normalization, and normalizing spaces (replace consecutive spaces with one, then trim). Another possibility is that they use a more aggressive lookalike list, such as l with I (lIlIlI), instead of the common a with а; in that case some string diff algorithm could help.
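The normalization steps listed above could look roughly like this. It is only a sketch: the `LOOKALIKES` table and the `INVISIBLE` character class are small illustrative assumptions, not complete lists, and `normalizeComment` is a hypothetical name:

```javascript
// Assumption: a tiny illustrative lookalike table; a real one would be much larger.
const LOOKALIKES = new Map([
  ['\u0430', 'a'], // Cyrillic а
  ['\u0435', 'e'], // Cyrillic е
  ['\u043E', 'o'], // Cyrillic о
  ['\u0440', 'p'], // Cyrillic р
  ['\u0441', 'c'] // Cyrillic с
])

// Soft hyphen, zero-width characters, direction marks, word joiner, BOM (not exhaustive)
const INVISIBLE = /[\u00AD\u200B-\u200F\u2060\uFEFF]/g

function normalizeComment (text) {
  return Array.from(
    text
      .normalize('NFKC') // Unicode normalization, e.g. fullwidth letters become ASCII
      .replace(INVISIBLE, '') // remove invisible characters
  )
    .map(char => LOOKALIKES.get(char) || char) // replace lookalikes
    .join('')
    .replace(/\s+/g, ' ') // collapse consecutive whitespace, including line breaks
    .trim()
}

console.log(normalizeComment('Gre\u0430t   vid\u200Beo\r\n'))
```

Two comments would then be compared by their normalized form instead of their raw text. Removing U+200D also breaks up joined emoji sequences, which should be harmless for comparison purposes.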

This last point, if too complex, can be skipped. It is not that important, as just detecting exact duplicated comments already seems to be a good criterion for detecting many spammers.
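For reference, the "string diff" idea could be sketched with a plain Levenshtein edit distance. This is one possible algorithm, swapped in for illustration; `isNearDuplicate` and its 10% threshold are hypothetical and untested against real spam:

```javascript
// Classic Levenshtein distance, keeping only two rows of the table in memory
function editDistance (a, b) {
  const s = Array.from(a)
  const t = Array.from(b)
  let previous = Array.from({ length: t.length + 1 }, (_, j) => j)
  for (let i = 1; i <= s.length; i++) {
    const current = [i]
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1
      current[j] = Math.min(
        previous[j] + 1, // deletion
        current[j - 1] + 1, // insertion
        previous[j - 1] + cost // substitution
      )
    }
    previous = current
  }
  return previous[t.length]
}

// Hypothetical rule: near-duplicates differ in at most 10% of their characters
function isNearDuplicate (a, b, maxRatio = 0.1) {
  const longest = Math.max(Array.from(a).length, Array.from(b).length)
  if (longest === 0) return true
  return editDistance(a, b) / longest <= maxRatio
}

console.log(isNearDuplicate(
  'I really love this channel so much',
  'I reaIIy love this channel so much'
))
```

Comparing every pair of comments this way is quadratic in the number of comments, which is one more reason to treat it as optional.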

KendallDoesCoding · 2022-01-02T08:09:49Z

This is a very good idea.

qgustavor · 2022-01-05T02:06:04Z

Author

I was watching a video and found a lot of stolen comments, so I tried to measure the effectiveness of this idea:

1. Open a popular video with many spammers;
2. Load as many comments as you wish;
3. Run this code in the browser console:

const comments = Array.from(document.querySelectorAll('#content-text')).map(e => e.textContent)
const uniqueComments = Array.from(new Set(comments))
const mostRepeatedComments = uniqueComments
  .map(e => ({
    comment: e,
    repetitions: comments.filter(comment => comment === e).length
  }))
  .filter(e => e.repetitions > 1)
  .sort((a, b) => b.repetitions - a.repetitions)
console.log('Comments count:', comments.length)
console.log('Unique comments count:', uniqueComments.length)
console.table(mostRepeatedComments)

It was this video. I loaded 801 comments and then the above code found ten exact duplicates. One of those (the one with "so bad that Google") was duplicated eight times: the author, five sex bots that are easy to blacklist, and two accounts with seemingly normal names and avatars. Note that the above code does not do any kind of string normalization and only checks for exact matches, so it is pretty likely that those last two accounts are bots too, even if their names are not in the blacklist.

If some normalization is applied by replacing .map(e => e.textContent) with .map(e => e.textContent.normalize('NFD').replace(/\W+/g, ' ')), then it returns 13 duplicated comments. Those three additional comments came from bots that added line breaks (e.g. the ones with "I am sure you will" and "this moment in time") or changed line breaks from \n to \r\n (the ones with "We congratulate ourselves").

Based on this test result, I think this is a pretty good way to detect bots with zero false positives: none of the detected comments are simple common comments like "First!", even when doing normalization. On the other hand, why would a bot steal a comment and still use a normal name? Weird.
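As a variation on the test above, the comments can be grouped by their normalized text in a single pass, keeping the original variants so the differences (such as \n versus \r\n) stay visible. This is a sketch that works on a plain array of strings; `groupByNormalizedText` and the sample data are hypothetical:

```javascript
// Same normalization expression as in the test above
const normalize = text => text.normalize('NFD').replace(/\W+/g, ' ')

function groupByNormalizedText (comments) {
  const groups = new Map()
  for (const text of comments) {
    const key = normalize(text)
    if (!groups.has(key)) groups.set(key, [])
    groups.get(key).push(text) // keep the original text of every variant
  }
  return Array.from(groups.values())
    .filter(variants => variants.length > 1)
    .sort((a, b) => b.length - a.length)
}

// Made-up sample: the first two entries differ only in their line breaks
const lineBreakSample = [
  'We congratulate ourselves\non this',
  'We congratulate ourselves\r\non this',
  'A different comment'
]
const groupedSample = groupByNormalizedText(lineBreakSample)
console.log(groupedSample)
```

Using a Map keyed by the normalized text avoids rescanning the whole list for every unique comment, which matters once thousands of comments are loaded.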

6 replies

KendallDoesCoding Jan 5, 2022

Tried this myself and it works pretty well; surprisingly, fewer comments than I expected.

mfaizsyahmi May 3, 2023

@KendallDoesCoding submit a pull request!

Lampe2020 May 7, 2023

I was watching a video and found a lot of stolen comments, so I tried to measure the effectiveness of this idea:

1. Open a popular video with many spammers;

2. Load many comments as you wish;

3. Run this code in browser console:

const comments = Array.from(document.querySelectorAll('#content-text')).map(e => e.textContent)
const uniqueComments = Array.from(new Set(comments))
const mostRepeatedComments = uniqueComments
  .map(e => ({
    comment: e,
    repetitions: comments.filter(comment => comment === e).length
  }))
  .filter(e => e.repetitions > 1)
  .sort((a, b) => b.repetitions - a.repetitions)
console.log('Comments count:', comments.length)
console.log('Unique comments count:', uniqueComments.length)
console.table(mostRepeatedComments)

[...]

At least in Firefox 113.0 it doesn't seem to work: it always outputs exactly one comment (no text), with the repetitions count being exactly the number of loaded comments.

qgustavor May 7, 2023
Author

I'm on mobile, but I don't think that's a browser issue, since I use Firefox and that's not something that would have cross-browser differences. I think either YouTube changed and broke this demo code (which would not affect Thio's script, as it uses the API) or scammers are copying comments while modifying them to avoid detection. Even if no changes are visible, they might be adding invisible characters or using lookalike ones.
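A quick way to tell those two explanations apart would be to check what the selector actually returns. The sketch below only summarizes an array of strings; `summarizeTexts` is a hypothetical helper, and the commented-out line shows how it might be fed from the page using the selector from the demo code:

```javascript
// Hypothetical diagnostic: how many of the collected texts are empty or distinct?
function summarizeTexts (texts) {
  const empty = texts.filter(text => text.trim() === '').length
  return { total: texts.length, empty, distinct: new Set(texts).size }
}

// In the browser console it could be fed from the page, using the demo's selector:
// summarizeTexts(Array.from(document.querySelectorAll('#content-text')).map(e => e.textContent))

// Made-up input mimicking the reported symptom: every element has no text
const symptomSummary = summarizeTexts(['', '', ''])
console.log(symptomSummary)
```

If `empty` equals `total`, the elements are being found but contain no text, which would point to a page change rather than to modified comments.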

Lampe2020 May 8, 2023

It must be YT breaking it, because I could clearly see several different comments from seemingly genuine people, but the script found just one, and without text. When I have the time, I'll maybe look at the script and try to fix it.
