
S3 sync: s3 -> local redownloads unchanged files #648

Closed
laupow opened this issue Feb 13, 2014 · 35 comments
@laupow

laupow commented Feb 13, 2014

We store a pile of files in S3 and it's handy to have a local copy of our S3 buckets for development and backup. At first glance aws s3 sync looks like it'll work.

I ran sync on our entire bucket and it completed successfully; it downloaded the whole bucket to local disk. The second time I ran the command it redownloaded some files that hadn't changed (on S3 or locally) alongside the new ones.

sh-3.2# aws s3 sync s3://serc serc
download: s3://serc/files/NAGTWorkshops/gsa03/activities/EET_Poster.pdf to serc/files/NAGTWorkshops/gsa03/activities/EET_Poster.pdf
download: s3://serc/files/NAGTWorkshops/rtop/genetics_40_minutes.v3.mp4 to serc/files/NAGTWorkshops/rtop/genetics_40_minutes.v3.mp4
download: s3://serc/files/NAGTWorkshops/rtop/introductory_meteorology.mov to serc/files/NAGTWorkshops/rtop/introductory_meteorology.mov
... 5000 more files ...

These files were just downloaded with the first sync. The local modified time & size match S3's values.

While I never rule out the possibility of user error, I don't see an obvious cause. The first S3 -> local sync completed normally; when I run it again, it redownloads some files that haven't changed. Not all, just some, and it's the same files redownloaded every time.

My cli version is aws-cli/1.2.13 Python/2.7.6 Darwin/10.8.0
This may or may not be related to issue #599, but I won't personally make that call.

@jamesls
Member

jamesls commented Feb 20, 2014

Could you run aws s3 sync s3://bucket/ . --dryrun --debug 2>&1 | grep comparator?

If it's syncing files it will print a log message as to why it's doing so:

2014-02-20 13:03:20,881 - awscli.customizations.s3.comparator - DEBUG - syncing: bucket/foo/bar/file-3 -> /private/tmp/cp/file-3, size_changed: False, last_modified_time_changed: True
2014-02-20 13:03:20,883 - awscli.customizations.s3.comparator - DEBUG - syncing: bucket/foo/bar/file-4 -> /private/tmp/cp/file-4, size_changed: False, last_modified_time_changed: True
2014-02-20 13:03:20,886 - awscli.customizations.s3.comparator - DEBUG - syncing: bucket/foo/bar/file-7 -> /private/tmp/cp/file-7, size_changed: False, last_modified_time_changed: True

I'd be curious to see what the log messages say.

@laupow
Author

laupow commented Feb 21, 2014

Output:

2014-02-21 09:47:26,876 - awscli.customizations.s3.comparator - DEBUG - syncing: serc/files/NAGTWorkshops/rtop/genetics_40_minutes.v3.mp4 -> /Volumes/RAID/home/backup/s3backup/serc/files/NAGTWorkshops/rtop/genetics_40_minutes.v3.mp4, file does not exist at destination
2014-02-21 09:47:26,878 - awscli.customizations.s3.comparator - DEBUG - syncing: serc/files/NAGTWorkshops/rtop/introductory_meteorology.mov -> /Volumes/RAID/home/backup/s3backup/serc/files/NAGTWorkshops/rtop/introductory_meteorology.mov, file does not exist at destination
2014-02-21 09:47:26,879 - awscli.customizations.s3.comparator - DEBUG - syncing: serc/files/NAGTWorkshops/rtop/kaatjes_glg_101_class.mov -> /Volumes/RAID/home/backup/s3backup/serc/files/NAGTWorkshops/rtop/kaatjes_glg_101_class.mov, file does not exist at destination
2014-02-21 09:47:26,881 - awscli.customizations.s3.comparator - DEBUG - syncing: serc/files/NAGTWorkshops/rtop/kw_video.mp4 -> /Volumes/RAID/home/backup/s3backup/serc/files/NAGTWorkshops/rtop/kw_video.mp4, file does not exist at destination

These files exist locally; the first sync downloaded them:

sh-3.2# stat /Volumes/RAID/home/backup/s3backup/serc/files/NAGTWorkshops/rtop/introductory_meteorology.mov
234881029 53464380 -rw-r--r-- 1 root com.apple.local.ard_admin 0 644979152 "Nov 22 12:54:03 2013" "Feb 13 12:08:57 2014" "Feb 13 12:08:57 2014" "Nov 22 12:54:03 2013" 4096 1259728 0 /Volumes/RAID/home/backup/s3backup/serc/files/NAGTWorkshops/rtop/introductory_meteorology.mov

Additional syncs certainly grab new files that don't exist locally, as expected, but sync also reports on every run that these same files don't exist and then tries to redownload them.

The files in my log example are large movie files but the issue doesn't appear to be isolated to one file size or type. There are plenty of small thumbnail images and other file types, too.

@michaelthoward

I discovered this separately and opened it as a case through AWS support.
Here was the conclusion I came to:

See: Case 163857941

The problem with the aws s3 sync CLI is with dates.
Some dates in the local filesystem do not get updated properly when aws s3 sync is run. Therefore, those files get re-downloaded if aws s3 sync is run again.

When mirroring with aws s3 sync from S3 to a local EC2 Linux filesystem, some of the files in the local filesystem do not get their timestamps updated to match the timestamps on S3.
Therefore, when you run the aws s3 sync command again it pulls those files down again.
File sizes seem to be correct after the first run of aws s3 sync.

Looking at the log file previously mentioned in Case 163857941 (s3://xxxx/case163857941/debugSyncLog.txt):
Search for "#### END OF RUN 1 ".
After that you will see an ls -l listing of the local filesystem after the aws s3 sync command has run. Most of the files are dated Jan 31 22:??, but there are 7 files with a timestamp of Feb 21 16:18/19; you can count these by searching for "Feb 21". Scroll down a little further and you will see the listing from S3; observe that all of those timestamps are dated 2014-01-31. Now continue your search for "Feb 21": after the second run there is only one file dated "Feb 21". Your developers should be able to figure this out from here. If they need assistance or a sample dataset, let me know.

Michael

@curiosity26

I'm having a similar issue, but going the opposite direction. I want to backup a server. There's 500+ GB of data, so I run the command:

aws s3 sync /mnt/main/backup s3://mybucket/backup

The first time it rolls through pretty well: it gets 2,200+ parts completed and stops. I figured there was probably a limit on the number of parts transferred per session; no biggie, just start the sync again, right? Wrong. The AWS client starts resyncing from the top, gets 2,200+ parts in, and stops. Needless to say, I have one subdirectory of backup synced.

I have versioning turned on and I can confirm that new versions of these files are created every time sync runs. I suspended versioning to see if it would help, but it appears to do the same thing.

The only thing I can think of to do next is to install s3fs, mount the bucket as a filesystem mountpoint, and run rsync -azut. I'd rather not have to do that. Any ideas?!

@michaelthoward

My recollection is that the date-related problem occurs both ways.
I have not encountered a problem with any type of "limit" when uploading/downloading files from s3.
Seems very strange.
I suggest you open a case with aws support.
Michael

@jamesls
Member

jamesls commented Aug 13, 2014

I believe this is fixed in the latest version of the CLI. Can anyone confirm whether you're still seeing this on later versions of the CLI (>= 1.4.1)? If so, I'll reopen and take another look.

@jamesls jamesls closed this as completed Aug 13, 2014
@laupow
Author

laupow commented Aug 15, 2014

Hard to say. I updated awscli and ran sync a couple of times just to be sure, but it was still downloading files again. It still seems to affect only a subset of files each time, but there's no apparent pattern among the problem files: files of all types, sizes, and dates are affected.

sh-3.2# aws --version
aws-cli/1.4.2 Python/2.7.6 Darwin/10.8.0
sh-3.2# aws s3 sync --size-only s3://serc serc/
2014-08-15 08:38:58,049 - MainThread - awscli.customizations.s3.comparator - DEBUG - syncing: serc/files/NAGTWorkshops/gsa03/activities/eet_poster.pdf -> /Volumes/RAID/home/backup/s3backup/serc/files/NAGTWorkshops/gsa03/activities/eet_poster.pdf, file does not exist at destination
2014-08-15 08:39:00,790 - MainThread - awscli.customizations.s3.comparator - DEBUG - syncing: serc/files/NAGTWorkshops/rtop/budd_sed_geo_video.mp4 -> /Volumes/RAID/home/backup/s3backup/serc/files/NAGTWorkshops/rtop/budd_sed_geo_video.mp4, file does not exist at destination
2014-08-15 08:39:00,792 - MainThread - awscli.customizations.s3.comparator - DEBUG - syncing: serc/files/NAGTWorkshops/rtop/cervato_video_humidity.v2.mp4 -> /Volumes/RAID/home/backup/s3backup/serc/files/NAGTWorkshops/rtop/cervato_video_humidity.v2.mp4, file does not exist at destination
sh-3.2# stat /Volumes/RAID/home/backup/s3backup/serc/files/NAGTWorkshops/gsa03/activities/eet_poster.pdf
234881029 53066054 -rw-r--r-- 1 root com.apple.local.ard_admin 0 689649 "Nov 21 12:01:00 2013" "Nov 21 12:01:00 2013" "Aug 15 08:36:22 2014" "Nov 21 12:00:59 2013" 4096 1352 0 /Volumes/RAID/home/backup/s3backup/serc/files/NAGTWorkshops/gsa03/activities/eet_poster.pdf

I also ran a limited sync on a different box where there has always been a problem file (eet_poster.pdf), and it's possible this particular case is related to case sensitivity:

$ aws --version
aws-cli/1.4.2 Python/2.7.5 Darwin/13.3.0
$ aws s3 sync s3://serc/files/NAGTWorkshops/gsa03/ serc --dryrun --debug 2>&1 | grep comparator
2014-08-15 09:05:13,408 - MainThread - awscli.customizations.s3.comparator - DEBUG - syncing: serc/files/NAGTWorkshops/gsa03/activities/EET_Poster.pdf -> /private/tmp/serc/activities/EET_Poster.pdf, size_changed: False, last_modified_time_changed: True
2014-08-15 09:05:13,415 - MainThread - awscli.customizations.s3.comparator - DEBUG - syncing: serc/files/NAGTWorkshops/gsa03/activities/eet_poster.pdf -> /private/tmp/serc/activities/eet_poster.pdf, file does not exist at destination

But I was still having the same problem (files not found when they actually exist) with plenty of other files:

2014-08-15 11:19:51,329 - MainThread - awscli.customizations.s3.comparator - DEBUG - syncing: serc/images/nagt/william_furnish_neil_miner_650.jpg -> /private/tmp/serc/images/nagt/william_furnish_neil_miner_650.jpg, file does not exist at destination
$ stat /private/tmp/serc/images/nagt/william_furnish_neil_miner_650.jpg
16777218 7942517 -rw-r--r-- 1 mlauer wheel 0 74370 "Nov 21 11:47:21 2013" "Nov 21 11:47:21 2013" "Aug 15 10:17:21 2014" "Nov 21 11:47:21 2013" 4096 152 0 /private/tmp/serc/images/nagt/william_furnish_neil_miner_650.jpg

@ghost

ghost commented Apr 8, 2015

Not sure why this has been marked closed. Currently experiencing a similar issue:

#1268

@reverie

reverie commented Feb 4, 2016

@jamesls I'm also not sure this is fixed. I can run an s3 sync twice in a row. Out of ~10k files, the same ~80 get re-downloaded every time. Nothing comes up for the 'comparator' grep, though.

@chops888

I am having the same issue once "--exact-timestamps" is turned on. Randomly, multiple old files in the same bucket download to the local machine; on the next sync they're fine. I ran it with --debug | grep 'modified time' and noticed that the timestamps were all off by 1 second whenever this anomaly occurs. The comparison is in this file, and I think I'll just mod it to ignore diffs of < 2 seconds. That'll work for me, anyway. Hope this helps someone else.

@reverie

reverie commented Feb 19, 2016

Even with --size-only I get dupes. Some of them are files that have the same filename with different capitalizations, so that may be one issue.

@chops888

Different capitalization == different file on Linux, and thus probably in awscli even on Windows. I would definitely expect that behavior on Linux.

@chops888

@reverie - run the sync with --debug on and try to find the lines where the program compares size, timestamp, and filename. Look for the reason it wants to sync a particular file.

@reverie

reverie commented Feb 19, 2016

Re: capitalization, I'm not surprised that it's treating them as different files. I'm surprised that it's re-syncing them every time. Anyway, that's only some of them. Looks from the log like "file does not exist at destination" is the given reason for syncing.

@chops888

Looks like I've created a workaround for my issue by changing line 40 in exacttimestamps.py from
return self.total_seconds(delta) == 0
to
return self.total_seconds(delta) > -1

@chops888

Actually,
return -1 < self.total_seconds(delta) < 1
is probably safer.
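
In context, that change amounts to something like the sketch below. The class name, method signature, and import path are assumptions about how exacttimestamps.py is laid out in that awscli version, not a verified copy of the file, so check your installed copy before editing:

# Sketch of the workaround above: widen the download-time comparison so a
# timestamp skew of less than one second no longer forces a re-download.
from awscli.customizations.s3.syncstrategy.base import SizeAndLastModifiedSync


class ExactTimestampsSync(SizeAndLastModifiedSync):

    def compare_time(self, src_file, dest_file):
        delta = dest_file.last_update - src_file.last_update
        if src_file.operation_name == 'download':
            # Original line: return self.total_seconds(delta) == 0
            return -1 < self.total_seconds(delta) < 1
        return super(ExactTimestampsSync, self).compare_time(
            src_file, dest_file)

This widens the equality test rather than fixing whatever produces the one-second skew, so it's a local workaround, not a fix.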

@zaphod-42

This continues to be an issue. With the latest AWS CLI installed, I still have about 100 files out of roughly 200k that continue to download on every sync.

@chops888

Did you edit exacttimestamps.py as indicated above? This should fix the issue until you update awscli again.

@chops888

On one of my servers this file is located at /usr/local/lib/python2.7/dist-packages/awscli/customizations/s3/syncstrategy/exacttimestamps.py

@zaphod-42

I wasn't using the --exact-timestamps flag, but I'll check it out.

@zaphod-42

That didn't fix it for me. I also tried with "--dryrun --debug 2>&1 | grep comparator" and got no output.

@chops888

Do you get anything relevant without piping to grep? Also, in sizeonly.py you could mod the debug output to include src_file.size and dst_file.size to see exactly what the two sizes are for the files that are continually resyncing.
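
For anyone who wants to try that, here is a rough sketch of the kind of logging tweak meant here. The SizeOnlySync class name, the compare_size helper, and the .src/.dest/.size attributes are assumptions about how sizeonly.py is structured, so adapt it to what you actually find in the file:

# Sketch only: log both sizes whenever the size-only strategy decides to sync,
# so you can see exactly why a file keeps being re-transferred.
import logging

from awscli.customizations.s3.syncstrategy.base import BaseSync

LOG = logging.getLogger(__name__)


class SizeOnlySync(BaseSync):

    def determine_should_sync(self, src_file, dest_file):
        same_size = self.compare_size(src_file, dest_file)
        if not same_size:
            LOG.debug("syncing: %s -> %s, src size: %s, dest size: %s",
                      src_file.src, src_file.dest,
                      src_file.size, dest_file.size)
        return not same_size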

@zaphod-42

OK, I created a second local directory so I could target one of the directories instead of running through 260k files every time. I have the same problem there, so I grepped for a specific filename and got this (I added the asterisks):

MainThread - awscli.customizations.s3.syncstrategy.base - DEBUG - syncing: //wp-content/uploads/2017/03/DOC_image10-1-1-150x150.png -> /Volumes/Backup/tmp_backup/DOC_image10-1-1-150x150.png, file does not exist at destination
(dryrun) download: s3://***/***wp-content/uploads/2017/03/DOC_image10-1-1-150x150.png to tmp_backup/DOC_image10-1-1-150x150.png

I've run it 3 times, and get the same result every time (I run an actual sync in between the debugs)

@zaphod-42

So evidently the files are never being created locally. I double-checked and that seems to be the case: that file is never added.

@chops888

chops888 commented Sep 29, 2017

Huh. Well, that explains why they try to download every time, but not why they're failing to write in the first place. I don't suppose the destination is Windows and the total path length (folders plus filename) is too long? If Windows, run Procmon to see why it's failing to write. If Linux, do you have full ownership of the target dir?

@zaphod-42

Nope, the destination is a Mac, and that's not even the longest filename in the directory; I checked, and there are longer filenames that sync successfully.

@zaphod-42

I've also checked S3 to make sure the file isn't corrupt or something, but I can view it without issue. The only thing I can see is that most of the other files have been updated at least once since versioning was enabled on this bucket, but I'm not sure why that would cause an issue.

@chops888

How about deleting and re-uploading the problem files in the bucket?

@zaphod-42

Wow, no luck with that either. Well, at least it's Friday; maybe I'll have a eureka moment over the weekend.

@FrontSide

Happening for me too. When I do a sync from the bucket to a local disc it re-downloads every single file each time. No laughs.

@jquast

jquast commented Nov 7, 2017

Saw this too, and it was because the files were from the future. In my case, 5 of 10,000 files kept re-downloading; those 5 files were dated 2018, and it's currently the year 2017.
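
If you suspect the same cause, a quick standard-library scan of the local destination will show any files whose mtime is in the future (the path at the bottom is just a placeholder):

# List local files whose modification time is later than "now"; such
# future-dated files kept re-downloading in the case described above.
import os
import time


def find_future_files(root):
    now = time.time()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mtime = os.path.getmtime(path)
            if mtime > now:
                print(path, time.ctime(mtime))


find_future_files('/path/to/local/sync')  # placeholder path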

@wpccolorblind

Seeing the same.

@arieljlira

Similar issue here when mirroring a local folder to an S3 bucket: the AWS CLI re-uploads the same modified file several times (EC2 instance, CentOS Linux release 7.5.1804, aws-cli/1.16.89).
In my case the system clock was several seconds ahead of the real time, so I activated chronyd (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html) and it corrected the system clock by -67.612530 seconds.
Now every aws-cli sync uploads files only the first time, as expected.
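
If you want to check for that kind of skew before touching NTP, one rough approach is to compare the local clock against the Date header an HTTPS server returns. The sketch below uses only the Python standard library, and the S3 endpoint URL is just an example of a well-synced server, not a requirement:

# Rough clock-skew check: compare local UTC time with the server's Date header.
# A skew of more than a second or two can make `aws s3 sync` re-transfer files.
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.error import HTTPError
from urllib.request import urlopen

url = 'https://s3.amazonaws.com'  # example; any well-synced HTTPS endpoint works
try:
    resp = urlopen(url)
except HTTPError as err:  # S3 may answer 403 here; the Date header is still set
    resp = err

server_time = parsedate_to_datetime(resp.headers['Date'])
skew = (datetime.now(timezone.utc) - server_time).total_seconds()
print('local clock is %.1f seconds ahead of the server' % skew)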

@yvele

yvele commented May 13, 2019

Same problem from S3 to S3: all the files have already been copied using sync, and when I re-run the same sync lots of files are copied again with the reason "file does not exist at destination"!

I'm using aws-cli/1.16.150 Python/3.7.3 Darwin/16.7.0 botocore/1.12.140

@cmcfarling

Same issue when syncing a local directory to S3 on Windows. Of about 8 GB of data, it wants to re-copy about 1 GB every time. Even though the files exist at the destination, the dryrun debug command reports that the files don't exist. Here's just one example:

2019-12-13 15:50:30,391 - MainThread - awscli.customizations.s3.syncstrategy.base - DEBUG - syncing: E:\ServerFolders\ADDRESS.DAT -> bucket1/Archive/ADDRESS.DAT, file does not exist at destination
