
[FIX] Csv importer: work with more problematic data #7456

Merged: 5 commits into RocketChat:develop on Aug 8, 2017

Conversation

@reist (Contributor) commented Jul 9, 2017

This is a continuation of #6768, fixing the import when users.csv, channels.csv, or both are missing.
It also fixes issues hit while doing a massive import from our internal chat implementation into RC.
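
As illustration of the "missing files" part of the fix, here is a minimal, hypothetical sketch (not the actual importer code; the adm-zip usage and entry handling are assumptions) of tolerating an archive without users.csv or channels.csv instead of crashing:

```js
// Hypothetical sketch: collect whatever CSV files are present and let the
// importer skip the stages whose file is missing, rather than throwing.
const AdmZip = require('adm-zip'); // assumption: the upload is a zip archive

function collectCsvEntries(zipPath) {
	const zip = new AdmZip(zipPath);
	const found = { users: null, channels: null, messages: [] };

	for (const entry of zip.getEntries()) {
		const name = entry.entryName.toLowerCase();
		if (name === 'users.csv') {
			found.users = entry;
		} else if (name === 'channels.csv') {
			found.channels = entry;
		} else if (name.endsWith('.csv')) {
			found.messages.push(entry);
		}
	}

	// users.csv and/or channels.csv may legitimately be absent; callers
	// should check for null and skip that import stage.
	return found;
}
```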

@@ -254,7 +298,7 @@ Importer.CSV = class ImporterCSV extends Importer.Base {
  const creator = this.getUserFromUsername(msg.username);
  if (creator) {
    const msgObj = {
-     _id: `csv-${ csvChannel.id }-${ msg.ts }`,
+     _id: `csv-${ csvChannel.id }-${ roomCounter }-${ msg.ts }`,
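
As an aside, a minimal sketch (simplified and hypothetical; only the `_id` template mirrors the diff) of why the added per-room counter keeps message ids unique when several messages in one room share the same timestamp:

```js
// Simplified, hypothetical loop over one room's messages. Without the
// counter, two messages with an identical ts would collide on _id.
function buildMessageIds(csvChannel, messages) {
	let roomCounter = 0;
	return messages.map((msg) => {
		roomCounter++;
		return {
			_id: `csv-${ csvChannel.id }-${ roomCounter }-${ msg.ts }`,
			ts: new Date(parseInt(msg.ts, 10)),
			msg: msg.text,
			u: { username: msg.username }
		};
	});
}

// Example: two messages exported with the same timestamp still get distinct
// ids, e.g. csv-general-1-1499600000000 and csv-general-2-1499600000000.
```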
Contributor commented:

The only issue with this is that people who try to re-import data that was imported before this change will see their data again...

@reist (Contributor Author) commented Jul 14, 2017

I made our export code take a timestamp range as input, with the output always ordered by room-specific counters. The importer code doesn't change that order per room, so the extra counter should be stable for imports done with it.
The only ways I see this code duplicating data on re-import are these two cases:

  1. The data being exported in a different order or with different file splitting, instead of the same files being used.
  2. The data being a different slice of the original chats, but one that includes some already-imported messages.

Both of these look like edge cases to me and are not very likely to happen.

Before this change, the error crashed the import entirely and reported it in the log, but it didn't mark the import as failed and didn't stop the client-side querying. That made the failure annoying to find (backtracking through the logs) and required restarting the RC instance to reset the import status.
That crash handling is something this pull request doesn't fix; it only fixes the various causes of crashes that I encountered.

@reist (Contributor Author) commented:

@graywolf336 Is my explanation satisfactory, or should I rewrite the code to store the last timestamp and add a suffix on duplicate timestamps, so old imports never have a chance to be duplicated? Or even store an array of seen timestamps... Otherwise the code depends on the CSV line order being time-based.
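
For illustration, a rough sketch of the alternative described above (a hypothetical helper, not code from this pull request): remember the last timestamp per room and append a suffix only when a timestamp repeats, so ids from imports done before this change stay unchanged:

```js
// Hypothetical per-room id builder: ids keep the old csv-<channel>-<ts>
// scheme unless a timestamp repeats, in which case a suffix is appended.
// Note: as discussed above, this assumes the CSV lines are ordered by time,
// since only the previous timestamp is remembered.
function makeIdBuilder(channelId) {
	let lastTs = null;
	let dupCount = 0;
	return function idFor(ts) {
		if (ts === lastTs) {
			dupCount++;
			return `csv-${ channelId }-${ ts }-${ dupCount }`;
		}
		lastTs = ts;
		dupCount = 0;
		return `csv-${ channelId }-${ ts }`; // identical to the pre-change id
	};
}

// Usage: const idFor = makeIdBuilder(csvChannel.id);
// Timestamps [100, 100, 101] yield csv-c1-100, csv-c1-100-1, csv-c1-101.
```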

Contributor commented:

Yeah, I think it is good. However, you might want to include those caveats in the pull request's description so that they're easily visible to anyone who views this pull request from the change log. 👍

@reist (Contributor Author) commented:

@graywolf336 I tried changing the code so the IDs only get an extra counter when needed, so my changes won't break retries of old re-imports, and I noticed something odd.
It does work, but after the import the message counters in the admin rooms list double up. I didn't notice any duplicates in the room view, though...
The funny thing is that this also happens with the old import code and IDs: duplicated message counts, but unique messages when viewed.
Does this make sense? If so, I'll update the pull request with the safer code.

@reist (Contributor Author) commented Jul 20, 2017

Added view cleanup for skipped files and a block for total message count.

@graywolf336 (Contributor) left a comment:

Looks good to me, thanks for the improvements.

This way, it will re-import old imports just fine.
For new imports, the order of timestamps in the CSV files will always have to be the same, so either using the same dumps or enforcing order on the CSV dump is required.
@RocketChat deleted a comment Aug 7, 2017
@rodrigok modified the milestones: 0.59.0, 0.58.0-rc.2 Aug 8, 2017
@rodrigok merged commit 76b22c2 into RocketChat:develop Aug 8, 2017
rodrigok added a commit that referenced this pull request Aug 14, 2017
[FIX] Csv importer: work with more problematic data