Unable to set charset when mime-types are guessed (S3) #1346

Open
gf3 opened this issue May 27, 2015 · 28 comments
Labels
community contribution-ready feature-request A feature should be added or improved. p2 This is a standard priority issue s3mimetype s3

Comments

@gf3

gf3 commented May 27, 2015

We are syncing a directory of various file types to an S3 bucket and aws-cli is correctly guessing the mime-types, however in our case it's important that it also append the charset. For example the guessed content-type for index.html might look like this:

Content-Type: text/html

But we'd like a way to tell aws-cli that the charset for all the synced files is UTF-8, for instance:

Content-Type: text/html; charset=utf-8

Version: aws-cli/1.7.26 Python/2.7.6 Darwin/14.4.0

@jdjkelly

+1

@chainlink

👍

@mathiasbynens

👍

@kieran

kieran commented May 27, 2015

🐼

@adammeghji

👍

@darcyclarke

👍

@kyleknap
Contributor

Marking as feature request. Any suggestions on how you would like to see it exposed in the CLI would be appreciated.

@kyleknap kyleknap added the feature-request A feature should be added or improved. label May 27, 2015
@gf3
Author

gf3 commented May 27, 2015

@kyleknap perhaps via a --charset option, whose value would be appended to the guessed mime-type?
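For illustration, here is a minimal sketch of what the suggested behavior could look like. The `charset` argument stands in for the proposed `--charset` option, which does not exist in aws-cli; the function name and the choice to touch only `text/*` types are assumptions made for this example.

```python
import mimetypes
from typing import Optional


def content_type_with_charset(filename: str, charset: Optional[str] = None) -> Optional[str]:
    """Guess a Content-Type from the filename, optionally appending a charset.

    `charset` models the proposed (hypothetical) --charset option.
    """
    content_type, _ = mimetypes.guess_type(filename)
    if content_type is None:
        # Nothing could be guessed, so there is nothing to append to.
        return None
    # Assumption for this sketch: only text/* types receive the parameter.
    if charset and content_type.startswith("text/"):
        return f"{content_type}; charset={charset}"
    return content_type
```

With `charset="utf-8"`, `index.html` would map to `text/html; charset=utf-8`, while binary types such as `image/png` would be left unchanged.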

@quiver
Contributor

quiver commented May 28, 2015

You can explicitly set the Content-Type for s3 cp/sync and for the s3api put-object API.

For s3 cp/sync, use the --content-type option.

$ aws s3 cp --content-type 'text/plain; charset=utf-8' index.html s3://BUCKET/index.html
upload: ./index.html to s3://BUCKET/index.html
$ aws s3api head-object --bucket BUCKET --key index.html
{
    "AcceptRanges": "bytes",
    "ContentType": "text/plain; charset=utf-8",
    "LastModified": "Thu, 28 May 2015 14:18:42 GMT",
    "ContentLength": 12,
    "ETag": "\"6f5902ac237024bdd0c176cb93063dc4\"",
    "Metadata": {}
}

$ aws s3 sync foo s3://BUCKET/foo  --content-type 'text/html; charset=utf-8'
upload: foo/index.html to s3://BUCKET/foo/index.html
$ aws s3api head-object --bucket BUCKET --key foo/index.html
{
    "AcceptRanges": "bytes",
    "ContentType": "text/html; charset=utf-8",
    "LastModified": "Thu, 28 May 2015 14:30:54 GMT",
    "ContentLength": 12,
    "ETag": "\"6f5902ac237024bdd0c176cb93063dc4\"",
    "Metadata": {}
}

For s3api put-object, use the --content-type option.

$ aws s3api put-object --content-type 'text/html; charset=latin-1' --bucket BUCKET --key index2.html --body index.html
$ aws s3api head-object --bucket BUCKET --key index2.html
{
    "AcceptRanges": "bytes",
    "ContentType": "text/html; charset=latin-1",
    "LastModified": "Thu, 28 May 2015 14:26:03 GMT",
    "ContentLength": 12,
    "ETag": "\"6f5902ac237024bdd0c176cb93063dc4\"",
    "Metadata": {}
}

Is this different from what you want?

@gf3
Author

gf3 commented May 28, 2015

@quiver yes it's a bit different; we have a client-side app that we're syncing, with a multitude of different file types. we'd really like to keep taking advantage of the mime-type guessing feature, which saves us from having to batch-upload files by type.

@quiver
Contributor

quiver commented May 28, 2015

@gf3
Got it. thanks for your reply.

@gf3
Author

gf3 commented Jun 18, 2015

@kyleknap just checking in here, anything i can do to help this along?

@mbystryantsev

+1

@ezzatron

ezzatron commented Feb 5, 2017

As a stop-gap, could the default "guessed" MIME type for HTML be changed to text/html; charset=utf-8 somehow?
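As a rough illustration of that stop-gap idea, Python's standard `mimetypes` module lets a registry map an extension to any type string, including one that already carries a charset. This sketch uses a private `MimeTypes` registry and only shows the lookup; whether and how aws-cli could be made to consult such an override is not shown here and would be an assumption.

```python
import mimetypes

# A private registry, so the process-wide mimetypes tables are left untouched.
registry = mimetypes.MimeTypes()

# Map the HTML extensions to a type string that already includes the charset.
for ext in (".html", ".htm"):
    registry.add_type("text/html; charset=utf-8", ext)

content_type, _ = registry.guess_type("index.html")
print(content_type)  # text/html; charset=utf-8
```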

@tihomir-kit

Any updates on this perhaps?

When we use --content-type "text/html; charset=utf-8", the files actually default to text/plain, which in turn means index.html is downloaded instead of being served. How do I address this? I have the same scenario as @gf3, where I'm trying to sync up a client-side app.

Thanks!

@dmahlow

dmahlow commented Aug 28, 2017

Running into this problem myself now, trying to sync a bunch of static website files to a bucket and s3cmd is not setting the correct charset=utf8 content-type when uploading the files which contain utf8 characters.

I'd like to keep the deployment job simple by just syncing the directory up the pipe instead of having to define the content-type on a per file-type or file basis. Any way this is now possible?

@perennialmind

As @dmahlow mentioned, you can define the content-type on a per-file-type basis. Just to illustrate what that might look like:

aws s3 sync --exclude "*" --include "*.html" --content-type "text/html; charset=utf-8" --delete ./public s3://www.example.com
aws s3 sync --include "*" --exclude "*.html" --delete ./public  s3://www.example.com
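The same two-pass pattern can be generated from a mapping instead of being written by hand. The sketch below only builds the command lines and does not run them; the helper name `build_sync_commands` and the example mapping are hypothetical, and the flags are the ones used in the commands above.

```python
import shlex
from typing import Dict, List


def build_sync_commands(source: str, dest: str, content_types: Dict[str, str]) -> List[List[str]]:
    """Build one `aws s3 sync` command per pattern, plus a catch-all pass."""
    commands = []
    for pattern, content_type in content_types.items():
        # One pass per pattern: exclude everything, re-include the pattern,
        # and set its Content-Type explicitly.
        commands.append([
            "aws", "s3", "sync",
            "--exclude", "*", "--include", pattern,
            "--content-type", content_type,
            "--delete", source, dest,
        ])
    # Final pass: everything not handled above, with the type guessed by the CLI.
    catch_all = ["aws", "s3", "sync", "--include", "*"]
    for pattern in content_types:
        catch_all += ["--exclude", pattern]
    catch_all += ["--delete", source, dest]
    commands.append(catch_all)
    return commands


if __name__ == "__main__":
    mapping = {
        "*.html": "text/html; charset=utf-8",
        "*.css": "text/css; charset=utf-8",
    }
    for cmd in build_sync_commands("./public", "s3://www.example.com", mapping):
        print(shlex.join(cmd))
```

Each printed line can be reviewed before being run, which keeps the mapping in one place while the CLI still does the uploading.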

@ASayre
Contributor

ASayre commented Feb 6, 2018

Good Morning!

We're closing this issue here on GitHub, as part of our migration to UserVoice for feature requests involving the AWS CLI.

This will let us get the most important features to you, by making it easier to search for and show support for the features you care the most about, without diluting the conversation with bug reports.

As a quick UserVoice primer (if not already familiar): after an idea is posted, people can vote on the ideas, and the product team will be responding directly to the most popular suggestions.

We’ve imported existing feature requests from GitHub - Search for this issue there!

And don't worry, this issue will still exist on GitHub for posterity's sake. As it’s a text-only import of the original post into UserVoice, we’ll still be keeping in mind the comments and discussion that already exist here on the GitHub issue.

GitHub will remain the channel for reporting bugs.

Once again, this issue can now be found by searching for the title on: https://aws.uservoice.com/forums/598381-aws-command-line-interface

-The AWS SDKs & Tools Team

@jamesls
Member

jamesls commented Apr 6, 2018

Based on community feedback, we have decided to return feature requests to GitHub issues.

@jamesls jamesls reopened this Apr 6, 2018
@theory

theory commented May 22, 2018

Thanks to @perennialmind's comment for a reasonable work around. Would be nice to be able to specify mappings of some kind, though, to avoid finicky configs like this.

@pnc

pnc commented Jun 17, 2019

@gf3 First of all, the NERVE of appearing in my ONLINE EXPERIENCE. How dare you.

Second: the mimetypes module happily returns the guessed encoding (same as using file --mime-encoding [filename] on the command line) as the second element of its return tuple, but it's currently getting thrown away here:

return mimetypes.guess_type(filename)[0]

My Python environment is hosed, but I'll take a run at a patch unless somebody beats me to it.

There's an argument to be made that it isn't aws-cli's responsibility to include the charset= portion of Content-Type for text/html files, but it's such a common use case (and the resulting mojibake so terrifying when it's omitted) that it seems worthwhile to me.

@pnc

pnc commented Jun 17, 2019

Alright, so guess_type doesn't actually use libmagic under the hood, and only understands/guesses compression encodings, not text encodings. The following commit "works" to set a charset automatically on uploaded files:

https://github.com/aws/aws-cli/compare/develop...pnc:libmagic?expand=1

However, it:

  1. Probably doesn't work on Windows (at least without Cygwin)
  2. Needs to be tweaked so it doesn't cause the S3 copy unit tests to fail (I think they rely on it guessing based on filename alone?)
  3. Adds another dependency

Leaving it for posterity in case someone wants to pick up the torch, but this doesn't seem super viable unless someone from the core team encourages it.
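The point about `guess_type` can be checked directly: the second element of the returned tuple describes a compression encoding, not a text encoding. The second half of the sketch is not the libmagic approach from the linked branch; it is a simpler, stdlib-only substitute that just tests whether the bytes decode as strict UTF-8 (the helper name is hypothetical, and pure-ASCII content also passes the check).

```python
import mimetypes

# The "encoding" slot is about compression, so it is None for plain HTML.
print(mimetypes.guess_type("index.html"))   # ('text/html', None)
print(mimetypes.guess_type("site.tar.gz"))  # ('application/x-tar', 'gzip')


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes form valid UTF-8 (ASCII included)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A check like this needs the complete file contents: a truncated sample can cut a multi-byte character in half and produce a false negative.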

@JessicaSachs

+1 for getting this solved correctly, please! My s3 copy commands are littered with include and exclude statements now 👎... looking very similar to justatheory's

@theory

theory commented Jul 5, 2020

Still an issue, I've updated my blog publish script from the broken link above to this script. Sure wish I could specify mappings explicitly and call it once!

@ewan-realitymine

This is still an issue in the v2 CLI, and it's a real pain!

I don't think the current default is sensible. I understand the compatibility impact in updating this, but please put this behind a feature flag at least.

@kdaily kdaily added the needs-review This issue or pull request needs review from a core team member. label Nov 23, 2021
@linorabolini

Discovered this issue today in our codebase.
We are using the AWS CLI to upload files to S3.
It would be lovely to have a way to ensure that the text/html charset is always UTF-8 by default.

@justindho justindho added community contribution-ready and removed needs-review This issue or pull request needs review from a core team member. labels May 11, 2022
@justindho justindho moved this to Contribution Ready in AWS CLI Community Contributions May 11, 2022
@andcip

andcip commented May 13, 2022

+1

@tim-finnigan tim-finnigan added the p2 This is a standard priority issue label Nov 10, 2022
@maaiika

maaiika commented Jun 14, 2023

Hi guys, this works for my python script.

import sys
from io import StringIO

from awscli.clidriver import create_clidriver
from awscli.customizations.s3 import utils


def on_queued_charset(self, future, **kwargs):
    # Replacement for the subscriber's on_queued hook: guess the type as usual,
    # then append a charset for text-like types.
    guessed_type = utils.guess_content_type(self._get_filename(future))
    if not guessed_type:
        return

    if "text/" in guessed_type or "application/" in guessed_type or guessed_type == 'image/svg+xml':
        guessed_type += ";charset=UTF-8"
    future.meta.call_args.extra_args["ContentType"] = guessed_type


# Monkey-patch the CLI's content-type subscriber before creating the driver.
utils.BaseProvideContentTypeSubscriber.on_queued = on_queued_charset
driver = create_clidriver()

# Capture the CLI's output while the sync runs.
old_stdout = sys.stdout
old_stderr = sys.stderr
sys.stdout = cli_stdout = StringIO()
sys.stderr = cli_stderr = StringIO()
args = [
    "s3",
    "sync",
    "local_dir",
    "s3://bucket",
    "--region=us-east-2",
    "--delete",
]
try:
    cli_status = driver.main(args)
finally:
    # Restore the real streams even if the CLI raises.
    sys.stdout = old_stdout
    sys.stderr = old_stderr
print(cli_stdout.getvalue(), cli_stderr.getvalue())
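One thing to watch in the script above: the substring test `"application/" in guessed_type` also matches binary types such as `application/zip` and `application/pdf`, so those would get a charset parameter too. A narrower variant might use an explicit allowlist, sketched below. The set of types and the helper name are assumptions for illustration, not something aws-cli defines.

```python
# Non-text/* types that are nevertheless textual (an assumed, incomplete list).
CHARSET_TYPES = {
    "application/javascript",
    "application/json",
    "application/xml",
    "image/svg+xml",
}


def add_charset(content_type: str, charset: str = "UTF-8") -> str:
    """Append a charset parameter only to textual content types."""
    if content_type.startswith("text/") or content_type in CHARSET_TYPES:
        return f"{content_type};charset={charset}"
    return content_type
```

Inside `on_queued_charset`, the `if` block could then be replaced by `guessed_type = add_charset(guessed_type)`.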

Projects
Status: Contribution Ready
Development

No branches or pull requests