
AWS S3 sync does not sync all the files #3273

Open
webdigi opened this issue Apr 18, 2018 · 80 comments
Labels
feature-request A feature should be added or improved. needs-review This issue or pull request needs review from a core team member. p2 This is a standard priority issue s3sync s3syncstrategy s3

Comments

@webdigi

webdigi commented Apr 18, 2018

We have several hundred thousand files, and S3 sync generally handles them reliably. However, we have noticed several files that were changed about a year ago and now differ, yet they never sync or update.

The source and destination timestamps also differ (S3 has the more recent file), but the sync never happens.

The command is as follows:
aws s3 sync s3://source /local-folder --delete

All the files that do not sync have the same date but are spread across multiple different folders.

Is there an S3 touch command to change the timestamp and possibly get the files to sync again?

@JordonPhillips
Member

You may be able to work around this with --exact-timestamps, though that can result in excess transfers if you're uploading.

To help in reproducing, could you get me some information about one of the files that isn't syncing?

  • What is the exact file size locally?
  • What is the exact file size in S3?
  • What is the last modified time locally?
  • What is the last modified time in S3?
  • Is the local file a symlink / behind a symlink?
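For anyone gathering these details, here is one way to pull them (a sketch; the bucket, key, and path are hypothetical, and stat -c is the GNU form, use stat -f on BSD/macOS):

# Local file size and last-modified time
stat -c 'size=%s bytes mtime=%y' /var/www/folder/index.html

# Is the local file a symlink, or behind one?
ls -l /var/www/folder/index.html
readlink -f /var/www/folder/index.html

# Size and LastModified of the S3 object
aws s3api head-object --bucket my-bucket --key index.html \
  --query '{Size: ContentLength, LastModified: LastModified}'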

@JordonPhillips JordonPhillips added the closing-soon This issue will automatically close in 4 days unless further comments are made. label Apr 27, 2018
@webdigi
Author

webdigi commented Apr 30, 2018

Example command run
aws s3 sync s3://bucket/ /var/www/folder/ --delete

Several files are missing
Exact local size: 2625
Exact s3: 2625
Exact time stamp local: 06-Jan-2017 9:32:31
Exact time stamp s3: 20-Jun-2017 10:14:57
normal file in S3 and local

There are several cases like this in a list of around 50,000 files. However, all of the files missed by the sync have timestamps at various times on 20 Jun 2017.

Using --exact-timestamps causes many more files to download even though their contents are identical, but it still misses the files in the example above.

@kyleknap kyleknap removed the closing-soon This issue will automatically close in 4 days unless further comments are made. label Jun 14, 2018
@overcache

overcache commented Jul 3, 2018

Same issue here.
aws s3 sync dist/ s3://bucket --delete did not replace s3://bucket/index.html with dist/index.html.

dist/index.html and s3://bucket/index.html have the same file size, but their modification times are different.

Actually, sometimes awscli did upload the file, and sometimes it did not.

@zyv
Contributor

zyv commented Jul 26, 2018

Same here, --exact-timestamps doesn't help - index.html is not overwritten.

@samdammers

We experienced this issue as well today/last week. Again, index.html is the same file size, but the contents and modified times are different.

@stephram

Is anybody aware of a workaround for this?

@lylejohnson

I just ran into this. Same problem as reported by @icymind and @samdammers: the contents of my (local) index.html file had changed, but its file size was the same as the earlier copy in S3. The aws s3 sync command didn't upload it. My "workaround" was to delete index.html from S3 and then run the sync again (which then uploaded it as if it were a new file, I guess).
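In command form, that workaround is roughly this (hypothetical bucket and build directory):

aws s3 rm s3://my-bucket/index.html    # remove the stale copy so the file looks new
aws s3 sync ./build s3://my-bucket/    # re-uploads index.html as a new file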

@smxdevst

smxdevst commented Feb 20, 2019

Server: EC2 linux
Version: aws-cli/1.16.108 Python/2.7.15 Linux/4.9.62-21.56.amzn1.x86_64 botocore/1.12.98


After running aws s3 sync over 270 TB of data, I lost a few GB of files. Sync didn't copy files with special characters at all.

Example of file /data/company/storage/projects/1013815/3.Company Estimates/B. Estimates

Had to use cp -R -n

@checkmypi

checkmypi commented Feb 21, 2019

Same issue here: an XML file of the same size but a different timestamp is not synced correctly.

I was able to reproduce this issue

bug.tar.gz
Download the attached tar file and then run:

tar -zxvf bug.tar.gz
aws s3 sync a/ s3://<some-bucket-name>/<some_dir>/ --delete
aws s3 sync b/ s3://<some-bucket-name>/<some_dir>/ --delete

You'll see that even though the repomd.xml files in directories a and b differ in contents and timestamps, attempting to sync b doesn't do anything.

Tested on
aws-cli/1.16.88 Python/2.7.15 Darwin/16.7.0 botocore/1.12.78
aws-cli/1.16.109 Python/2.7.5 Linux/3.10.0-693.17.1.el7.x86_64 botocore/1.12.99

@chrispruitt

I'm seeing the same issue: when syncing a directory of files from S3 (where one file was updated) to a local directory, that file does not get updated in the local directory.

@lqueryvg

I'm seeing this too. In my case it's a react app with index.html that refers to generated .js files. I'm syncing them with the --delete option to delete old files which are no longer referred to. The index.html is sometimes not uploaded, resulting in an old index.html which points to .js files which no longer exist.

Hence my website stops working!

I'm currently clueless as to why this is happening.

Does anyone have any ideas or workarounds?

@marns93

marns93 commented Mar 27, 2019

We have the same problem but just found a workaround. I know it is not the best way, but it works:

aws s3 cp s3://SRC s3://DEST ...
aws s3 sync s3://SRC s3://DEST ... --delete

It seems to us that the copy works fine, so first we copy, and after that we use the sync command to delete files which are no longer present.
Hope the issue will be fixed ASAP.
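Spelled out with hypothetical bucket names (note that aws s3 cp needs --recursive to copy a whole prefix):

# Copy everything unconditionally, then let sync handle only the deletions
aws s3 cp s3://source-bucket/ s3://dest-bucket/ --recursive
aws s3 sync s3://source-bucket/ s3://dest-bucket/ --delete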

@lqueryvg

I added --exact-timestamps to my pipeline and the problem hasn't recurred. But it was intermittent in the first place, so I can't be sure that fixed it. If it happens again I'll go with @marns93's suggestion.

@JasonQSY

We've hit this problem, and --exact-timestamps resolves it for us. I'm not sure if it's exactly the same problem.

@elliot-nelson

I'm seeing this issue, and it's very obvious because each call only has to copy a handful (under a dozen) files.

The situation in which it happens is just like reported above: if the folder being synced into contains a file with different file contents but identical file size, sync will skip copying the new updated file from S3.

We ended up changing scripts to aws s3 cp --recursive to fix it, but this is a nasty bug -- for the longest time we thought we had some kind of race condition in our own application, not realizing that aws-cli was simply choosing not to copy the updated file(s).

@benjamin-issa

I saw this as well with an HTML file.

aws-cli/1.16.168 Python/3.6.0 Windows/2012ServerR2 botocore/1.12.158

@nabilfreeman

I copy-pasted the s3 sync command from a GitHub gist and it had --size-only set on it. Removing that fixed the problem!

@jam13

jam13 commented Sep 23, 2019

Just ran into this issue with build artifacts being uploaded to a bucket. Our HTML tended to only change hash codes for asset links and so size was always the same. S3 sync was skipping these if the build was too soon after a previous one. Example:

10:01 - Build 1 runs
10:05 - Build 2 runs
10:06 - Build 1 is uploaded to s3
10:10 - Build 2 is uploaded to s3

Build 2 has HTML files with a timestamp of 10:05; however, the HTML files uploaded to S3 by build 1 have a timestamp of 10:06, as that's when the objects were created. This results in them being ignored by s3 sync, as the remote files are "newer" than the local files.

I'm now using s3 cp --recursive followed by s3 sync --delete, as suggested earlier.

Hope this might be helpful to someone.

@jay-w-jensen

I had the same issue earlier this week; I was not using --size-only. Our index.html differed by a single character (. became #), so the size was the same, but the timestamp on S3 was 40 minutes earlier than the timestamp of the new index.html. I deleted the index.html as a temporary workaround, but it's infeasible to double-check every deployment.

@sabretus

Same here: files with the same name but different timestamps and contents are not synced from S3 to local, and --delete does not help.

@magraeber

We experience the same issue. An index.html with the same size but a newer timestamp is not copied.

This issue was reported over a year ago. Why is it not fixed?

Actually, it makes the sync command useless.

@Rimce

Rimce commented Nov 12, 2019

--exact-timestamps fixed the issue

@tompetrillo

I am also affected by this issue. I added --exact-timestamps, and that seemed to fix the files I was looking at, though I have not done an exhaustive search. I have on the order of 100k files and 20 GB, a lot less than the others here.

@jason-beijing

jason-beijing commented Jan 29, 2020

I have faced the same issue: aws s3 sync skips some files, even with different contents and different dates. The log shows the skipped files as synced, but they actually are not.
When I run aws s3 sync again, those files get synced. Very weird!

@cbelsole

I had this issue when building a site with Hugo and I finally figured it out. I use submodules for my Hugo theme and was not pulling them down on CI. This was causing warnings in Hugo but not failures.

# On local
                   | EN
-------------------+-----
  Pages            | 16
  Paginator pages  |  0
  Non-page files   |  0
  Static files     |  7
  Processed images |  0
  Aliases          |  7
  Sitemaps         |  1
  Cleaned          |  0

# On CI
                   | EN  
-------------------+-----
  Pages            |  7  
  Paginator pages  |  0  
  Non-page files   |  0  
  Static files     |  2  
  Processed images |  0  
  Aliases          |  0  
  Sitemaps         |  1  
  Cleaned          |  0  

Once I updated the submodules everything worked as expected.

@darrynten

darrynten commented Mar 11, 2020

We've also been affected by this issue, so much so that a platform went down for ~18 hours after a new vendor/autoload.php file didn't sync and was out of date with vendor/composer/autoload_real.php, so the whole app couldn't load.

This is a very strange problem, and I can't believe the issue has been open for this long.

Why would a sync not use hashes instead of last modified? Makes 0 sense.

For future Googlers, a redacted error I was getting:

---
PHP message: PHP Fatal error:  Uncaught Error: Class 'ComposerAutoloaderInitXXXXXXXXXXXXX' not found in /xxx/xxx/vendor/autoload.php:7
Stack trace:
#0 /xxx/xxx/bootstrap/app.php(3): require_once()
#1 /xxx/xxx/public/index.php(14): require('/xxx/xxx...')
#2 {main}
  thrown in /xxx/xxx/vendor/autoload.php on line 7" while reading response header from upstream: ...
---

@pamela-mei

pamela-mei commented Apr 26, 2022

I'm experiencing the issue as well. Is there any plan for a fix?
I tested re-uploading a file with the same name but different sizes/timestamps, and unfortunately the sync does not succeed.
Currently the only solution is to remove the un-synced files manually and re-trigger the sync.
I don't understand why this bug has been open for more than 4 years without a fix.

@klesher

klesher commented May 4, 2022

I'm seeing this too. In my case it's a react app with index.html that refers to generated .js files. I'm syncing them with the --delete option to delete old files which are no longer referred to. The index.html is sometimes not uploaded, resulting in an old index.html which points to .js files which no longer exist.
Hence my website stops working!
I'm currently clueless as to why this is happening.
Does anyone have any ideas or workarounds?

I experienced this same problem today. From my testing, it looks like --exact-timestamps will fix the problem if you want to download from S3 to local and your local file has the same file size and an older timestamp. However, it doesn't appear to make a difference if you are trying to upload from local to S3. If you want to upload a local file with the same file size and an older timestamp, you have to use cp or delete the file in S3 first.

This just tripped us up as well with our build system when deploying to S3 between various feature branches. Some of them were built at different times, so of course the timestamps were different ( and sometimes older).

Our quick (albeit kinda gross because it negates the large benefits of sync) workaround was to just touch all of the files before sync:
find . -exec touch {} \;
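A slightly tighter variant restricts this to regular files and batches the arguments; touching resets every mtime to now, so the next sync sees all local files as newer than their S3 copies:

find . -type f -exec touch {} +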

@JoshMcCullough

JoshMcCullough commented May 7, 2022

AWS, please help us. We shouldn't have to do such workarounds. The fact that the sync function is simply not synchronizing correctly causes a lot of issues for people. Such issues might go unnoticed for some time, and may be hard to track down, causing lots of wasted time and effort on the part of your customers (as they try to diagnose these issues). Please, please ... can we get some action on this?

@htrappmann

Totally support this request!!!

@tooptoop4

Friends, use https://github.com/peak/s5cmd instead; it's much faster too.

@FredrikZeiner

Using --exact-timestamps doesn't always solve the problem for some reason. We still once in a while have deploys that go bad. Looking into switching to rclone or s5cmd.

@cj2

cj2 commented Jun 25, 2022

We are having this issue with a Snowball device and have been billed thousands in overages trying to compensate for the sync command not functioning properly. This is basic stuff...

@mossmoss

mossmoss commented Jul 9, 2022

Yikes! I think I'm seeing this issue now. I'm aws s3 syncing a large Adobe (AEM) content management system with several 500 GB+ app instances, each with many (mostly smallish) files in hashed directories. I prepare the apps on a souped-up EC2 instance and aws s3 sync them up to S3.

When aws s3 syncing the prepared app down to the target machines, the author application had issues we could not fix that were not present on the source author.

The publish app instances did not seem to have problems; they were also aws s3 synced down from source to target. But now I'm worried any issues on these may just not have been noticed yet. I will use aws s3 cp --recursive from now on!

It seems to work, but it also seems like it might be slower? One great thing about aws s3 sync is that it does seem fast, but that's useless if it can't be relied on to accurately copy from source to target. I must have retried the aws s3 sync 3 or 4+ times, thinking it was a transient upload error, before finding this issue.

What a WASTE of time, resources, and money.

I wonder if redoing the aws s3 sync with the --exact-timestamps --delete flags would fix the existing target and save time, since most of the files are already there? Touching all the source files prior to syncing (as noted above) also seems like an option if it's indeed a workaround.

@mckenzm

mckenzm commented Oct 13, 2022

I'm seeing this with Elastic Beanstalk deploys to empty folders such as ./images and ./media. Just the odd new file not coming through. But we can see the files in S3, and a manual CLI call from the instance does get them. It's almost as if the read of S3 is cached and hasn't been flushed on the update of a new file, or that flush is delayed. Our workaround is a cron job.

@tim-finnigan tim-finnigan added needs-review This issue or pull request needs review from a core team member. p2 This is a standard priority issue labels Nov 15, 2022
@jeremy-brooks

I experienced an issue where aws s3 sync would completely ignore files from the source location. I then discovered that the KMS key originally used to encrypt them had been marked for deletion, but no error messages appeared when running sync. When I tried cp instead, an error message popped up telling me the problem. Luckily I could temporarily re-enable the old key, change the objects' encryption, and sync the files. A shame there were no error messages when running sync; that would have saved a bit of time for sure.

So, in summary: check the KMS keys for any objects ignored by aws s3 sync, because unfortunately you won't see any useful error messages about it from the command.
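A sketch of how to check for this condition (hypothetical bucket/key; head-object reports the KMS key for SSE-KMS objects, and describe-key shows whether that key is pending deletion):

# Which KMS key encrypts the object, if any?
aws s3api head-object --bucket my-bucket --key path/to/object \
  --query '{Encryption: ServerSideEncryption, KmsKeyId: SSEKMSKeyId}'

# Is that key pending deletion? (KeyState would read PendingDeletion)
aws kms describe-key --key-id <kms-key-arn-from-above> \
  --query 'KeyMetadata.[KeyState, DeletionDate]'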

@abaaslx

abaaslx commented Dec 29, 2022

Ran into this trying to set up a custom build cache for our React app. The issue was solved by using aws s3 cp --recursive.

Details:

We had no problem deploying new builds of our app, but hit this bug when trying to redeploy old cached versions even with --exact-timestamps!

i.e. the following command didn't work! In particular, index.html wouldn't be synced:

aws s3 sync s3://{PATH_TO_CACHED_BUILD} s3://{PATH_TO_BUILD_SERVED_TO_CLOUDFRONT} --exact-timestamps

Possibly noteworthy: s3://{PATH_TO_CACHED_BUILD} was, at one point in the past, the build being served.

@JoshMcCullough

@abaaslx --exact-timestamps only applies when transferring from S3 to local -- not local to S3 or S3 to S3.

The fact that some of your files are not being transferred to S3 won't be fixed by that option. This issue is one we are all seeing in this thread, and AFAIK there's been no acknowledgement of the root cause or a solution.
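For clarity, the one direction where the flag does anything (hypothetical paths):

# Applies here: S3 -> local; same-sized files are re-downloaded unless timestamps match exactly
aws s3 sync s3://my-bucket/ ./local-copy --exact-timestamps

# No effect here: local -> S3; same-size, older-mtime files are still skipped
aws s3 sync ./local-copy s3://my-bucket/ --exact-timestamps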

yorickpeterse added a commit to inko-lang/website that referenced this issue Mar 15, 2023
This should hopefully solve the issue of `aws s3 sync` saying it
uploaded a file but not actually doing it. See
aws/aws-cli#3273 for more details.
@mrthankyou

mrthankyou commented Mar 20, 2023

@JoshMcCullough,

This issue is one we are all seeing in this thread, and AFAIK there's been no acknowledgement of the root cause or a solution.

Is this a confirmed issue or not? I'd appreciate your insight on why --exact-timestamps won't work.


I'd like to get a solution here, as this is preventing me from doing deployments (numerous files are not uploading). Is there anything we can provide to help the maintainers work towards a solution?

@vigenere23

vigenere23 commented Jun 20, 2023

Still occurring, same problem. Also important to mention: I'm pretty sure the one important thing cp does not do is remove files that you don't upload (like the sync --delete option does). So if you upload files with hashed names, that might get pretty ugly...

@stephen-thomas-buildingestimates

Still an issue for us in 2023

@kayhustle

Having the same issue as well. Running the AWS CLI on Windows. Files updated in the S3 bucket are not downloaded locally.

moonape1226 pushed a commit to MisoAI/distribution that referenced this issue Aug 7, 2023
There is some issue with aws sync:
aws/aws-cli#3273

We use aws cp instead
@Guria

Guria commented Aug 14, 2023

Sorry for clogging this thread, but I'll try to sum up what I've seen in my case and what I have read in this issue discussion.

We know for sure

Note

Mentioned several times in the thread, but worth repeating:
--exact-timestamps is partially related to the issue, but it will not help when syncing from local to S3.
Only try this option if you're syncing from S3 to local.

Note

s3 cp --recursive is not a replacement for s3 sync --delete, so it isn't a good workaround in most cases.
We can combine them (s3 cp --recursive && s3 sync --delete), but that would double traffic and quota usage.

We are still trying to find out

Note

There has been a good attempt at a solid explanation of the issue, and it would be great if it were true. Unfortunately, we have evidence that proves otherwise.

  • When a file is synced or copied to s3, the timestamp it receives on the bucket is the date it was copied, which is always newer than the date of the source file. This is just how s3 works.

That is a true statement, and it can sometimes lead to the discussed issue, for example when a newer CI build finishes before the previous build's results have been uploaded to S3.

  • Files are only synced if the size changes, or the timestamp on the target is older than the source.
  • This means that if source files are updated but the size of the files remains unchanged and the dates on those changed files pre-date when they were last copied, s3 sync will not sync them again.

If this were true, it would explain a lot. But it isn't: in the case I observed, my local file's timestamp was 6 minutes newer than the file's timestamp in S3.

Note

We have a script that demonstrates that the s3 sync command ignores changes made to a file after the initial copy was uploaded to S3.
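Not that script, but a minimal sketch of how a same-size change can be skipped (hypothetical bucket; touch -d as used here is GNU coreutils):

mkdir -p repro && cd repro
printf 'AAAA\n' > file.txt
aws s3 sync . s3://my-test-bucket/sync-repro/   # uploads; S3 LastModified = now

printf 'BBBB\n' > file.txt                      # same size, new contents
touch -d '1 hour ago' file.txt                  # mtime now pre-dates the S3 copy
aws s3 sync . s3://my-test-bucket/sync-repro/   # skipped: same size, local "older"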

Note

Another observation: this issue may occur when the local file is newer than the S3 copy by less than 1 day, which suggests the root cause may be related to timezones and the difference between mtime and ctime.

@Mitko-Kerezov

Just hit this issue as well and can confirm everything said. It was an HTML file, the same size as in S3; I was using --exact-timestamps but syncing from local to S3, so no dice. The local file was modified less than 1 day after the one in S3.
I will try touching the file prior to running aws s3 sync in hopes of slight mitigation.

@tabascoterrier

I just got bit by this too.

In my case I was doing aws s3 sync s3://remote /local --delete (S3 -> local). The file was previously modified less than 24 hours before, and the file size remained the same.

Adding --exact-timestamps seems to resolve for me.

@mark-ship-it

Is this still an issue with v2? This issue has been open for 5 and a half years and seems like a serious flaw.

@okummer

okummer commented Jan 5, 2024

It just occurred here with v2. Touching the target files on the local disk still helps.

@firewaller

+1

@JoshMcCullough

We removed the rm -rf ... we were executing before the S3 sync, just to see if this had been magically fixed. But we ran into the same issue again, so it's not fixed. 😦 We have two scripts, local > S3 and S3 > local. I updated the latter to include --exact-timestamps and left the rm commented out. We'll see how it goes...
