Unable to set `charset` when mime-types are guessed (S3) #1346

gf3 · 2015-05-27T16:18:25Z

We are syncing a directory of various file types to an S3 bucket, and aws-cli is correctly guessing the mime-types. However, in our case it's important that it also append the charset. For example, the guessed content-type for index.html might look like this:

Content-Type: text/html

But we'd like a way to tell aws-cli that the charset for all the synced files is, for instance, UTF-8:

Content-Type: text/html; charset=utf-8

Version: aws-cli/1.7.26 Python/2.7.6 Darwin/14.4.0


jdjkelly · 2015-05-27T16:19:04Z

+1

chainlink · 2015-05-27T16:19:11Z

👍

mathiasbynens · 2015-05-27T16:31:45Z

👍

kieran · 2015-05-27T16:34:37Z

🐼

adammeghji · 2015-05-27T16:40:59Z

👍

darcyclarke · 2015-05-27T17:01:28Z

👍

kyleknap · 2015-05-27T17:32:56Z

Marking as feature request. Any suggestions on how you would like to see it exposed in the CLI would be appreciated.

gf3 · 2015-05-27T17:57:26Z

@kyleknap perhaps via a --charset option, whose value would be appended to the guessed mime-type?
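A minimal sketch of what such an option might do, assuming the charset is only appended to `text/*` types. The helper name `guess_content_type_with_charset` is hypothetical and not part of aws-cli; only the standard-library `mimetypes.guess_type` call is real.

```python
import mimetypes


def guess_content_type_with_charset(filename, charset=None):
    """Guess a Content-Type from the filename and optionally append a charset.

    Hypothetical helper for illustration only; not an aws-cli API.
    """
    guessed_type, _encoding = mimetypes.guess_type(filename)
    if guessed_type is None:
        return None
    # Assumption: only text/* types get a charset parameter.
    if charset and guessed_type.startswith("text/"):
        return f"{guessed_type}; charset={charset}"
    return guessed_type


print(guess_content_type_with_charset("index.html", "utf-8"))
print(guess_content_type_with_charset("logo.png", "utf-8"))
```

Restricting the suffix to `text/*` is a design choice of this sketch; a real option might well apply it more broadly or let the user choose.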

quiver · 2015-05-28T14:37:36Z

You can explicitly set content-type for s3 cp/sync and s3api put-object APIs.

For s3 cp/sync, use --content-type option.

$ aws s3 cp --content-type 'text/plain; charset=utf-8' index.html s3://BUCKET/index.html
upload: ./index.html to s3://BUCKET/index.html
$ aws s3api head-object --bucket BUCKET --key index.html
{
    "AcceptRanges": "bytes",
    "ContentType": "text/plain; charset=utf-8",
    "LastModified": "Thu, 28 May 2015 14:18:42 GMT",
    "ContentLength": 12,
    "ETag": "\"6f5902ac237024bdd0c176cb93063dc4\"",
    "Metadata": {}
}

$ aws s3 sync foo s3://BUCKET/foo  --content-type 'text/html; charset=utf-8'
upload: foo/index.html to s3://BUCKET/foo/index.html
$ aws s3api head-object --bucket BUCKET --key foo/index.html
{
    "AcceptRanges": "bytes",
    "ContentType": "text/html; charset=utf-8",
    "LastModified": "Thu, 28 May 2015 14:30:54 GMT",
    "ContentLength": 12,
    "ETag": "\"6f5902ac237024bdd0c176cb93063dc4\"",
    "Metadata": {}
}

For s3api put-object, use --content-type option.

$ aws s3api put-object --content-type 'text/html; charset=latin-1' --bucket BUCKET --key index2.html --body index.html
$ aws s3api head-object --bucket BUCKET --key index2.html
{
    "AcceptRanges": "bytes",
    "ContentType": "text/html; charset=latin-1",
    "LastModified": "Thu, 28 May 2015 14:26:03 GMT",
    "ContentLength": 12,
    "ETag": "\"6f5902ac237024bdd0c176cb93063dc4\"",
    "Metadata": {}
}

Is this different from what you want?

gf3 · 2015-05-28T14:42:59Z

@quiver yes, it's a bit different; we have a client-side app that we're syncing, with a multitude of different file types. We'd really like to continue to take advantage of the mime-type guessing feature, which saves us from having to batch upload files based on type.

quiver · 2015-05-28T15:47:55Z

@gf3
Got it. thanks for your reply.

gf3 · 2015-06-18T15:11:34Z

@kyleknap just checking in here, anything i can do to help this along?

mbystryantsev · 2017-01-04T17:57:19Z

+1

ezzatron · 2017-02-05T01:15:04Z

As a stop-gap, could the default "guessed" MIME type for HTML be changed to text/html; charset=utf-8 somehow?
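For illustration only, this is roughly what overriding the guessed type looks like with Python's standard `mimetypes` module. It is a sketch that assumes the override runs in the same Python process that later guesses the type; nothing in this thread confirms that aws-cli would pick it up.

```python
import mimetypes

# Register a charset-qualified type for .html in this process's type map.
mimetypes.add_type("text/html; charset=utf-8", ".html")

guessed_type, _encoding = mimetypes.guess_type("index.html")
print(guessed_type)
```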

tihomir-kit · 2017-06-19T11:23:04Z

Any updates on this perhaps?

When we use --content-type "text/html; charset=utf-8", the files actually default to text/plain, which in turn means the index.html file is simply downloaded instead of served. How do I address this? I have the same scenario as @gf3, where I'm trying to sync up a client-side app.

Thanks!

dmahlow · 2017-08-28T15:31:25Z

Running into this problem myself now: I'm trying to sync a bunch of static website files to a bucket, and s3cmd is not setting the correct charset=utf8 content-type when uploading files that contain UTF-8 characters.

I'd like to keep the deployment job simple by just syncing the directory up the pipe instead of having to define the content-type on a per file-type or file basis. Any way this is now possible?

perennialmind · 2017-09-30T00:31:36Z

As @dmahlow mentioned, you can define the content-type on a per-file-type basis. Just to illustrate what that might look like:

aws s3 sync --exclude "*" --include "*.html" --content-type "text/html; charset=utf-8" --delete ./public s3://www.example.com
aws s3 sync --include "*" --exclude "*.html" --delete ./public  s3://www.example.com

ASayre · 2018-02-06T10:20:52Z

Good Morning!

We're closing this issue here on GitHub, as part of our migration to UserVoice for feature requests involving the AWS CLI.

This will let us get the most important features to you, by making it easier to search for and show support for the features you care the most about, without diluting the conversation with bug reports.

As a quick UserVoice primer (if not already familiar): after an idea is posted, people can vote on the ideas, and the product team will be responding directly to the most popular suggestions.

We’ve imported existing feature requests from GitHub - Search for this issue there!

And don't worry, this issue will still exist on GitHub for posterity's sake. As it’s a text-only import of the original post into UserVoice, we’ll still be keeping in mind the comments and discussion that already exist here on the GitHub issue.

GitHub will remain the channel for reporting bugs.

Once again, this issue can now be found by searching for the title on: https://aws.uservoice.com/forums/598381-aws-command-line-interface

-The AWS SDKs & Tools Team

jamesls · 2018-04-06T21:15:48Z

Based on community feedback, we have decided to return feature requests to GitHub issues.

theory · 2018-05-22T02:47:50Z

Thanks to @perennialmind's comment for a reasonable workaround. It would be nice to be able to specify mappings of some kind, though, to avoid finicky configs like this.
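As a rough sketch of the mapping idea, a small script could generate one `aws s3 sync` invocation per mapped pattern plus a final pass for everything else. The function `build_sync_commands` and the mapping below are hypothetical; only the `--include`, `--exclude`, and `--content-type` flags come from the commands already shown in this thread, and `--delete` is left out of the sketch.

```python
import shlex


def build_sync_commands(source, destination, content_types):
    """Build one `aws s3 sync` argument list per mapped glob pattern,
    followed by a final pass for all remaining files.

    `content_types` maps a glob pattern to an explicit Content-Type.
    Hypothetical helper for illustration only.
    """
    commands = []
    for pattern, content_type in content_types.items():
        commands.append([
            "aws", "s3", "sync",
            "--exclude", "*", "--include", pattern,
            "--content-type", content_type,
            source, destination,
        ])
    # Final pass: files not matched above keep the guessed content type.
    remainder = ["aws", "s3", "sync", "--include", "*"]
    for pattern in content_types:
        remainder += ["--exclude", pattern]
    remainder += [source, destination]
    commands.append(remainder)
    return commands


MAPPINGS = {
    "*.html": "text/html; charset=utf-8",
    "*.css": "text/css; charset=utf-8",
}

for command in build_sync_commands("./public", "s3://www.example.com", MAPPINGS):
    print(shlex.join(command))
```

This only prints the commands; running them (for example via `subprocess.run`) is left to the caller.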

pnc · 2019-06-17T13:44:40Z

@gf3 First of all, the NERVE of appearing in my ONLINE EXPERIENCE. How dare you.

Second: the mimetypes module happily returns the guessed encoding (same as using file --mime-encoding [filename] on the command line) as the second element of its return tuple, but it's currently getting thrown away here:

aws-cli/awscli/customizations/s3/utils.py, line 294 in 072688c:

return mimetypes.guess_type(filename)[0]

My Python environment is hosed, but I'll take a run at a patch unless somebody beats me to it.

There's an argument to be made that it isn't aws-cli's responsibility to include the charset= portion of Content-Type for text/html files, but it's such a common use case (and the resulting mojibake so terrifying when it's omitted) that it seems worthwhile to me.

pnc · 2019-06-17T15:35:30Z

Alright, so guess_type doesn't actually use libmagic under the hood, and only understands/guesses compression encodings, not text encodings. The following commit "works" to set a charset automatically on uploaded files:

https://github.com/aws/aws-cli/compare/develop...pnc:libmagic?expand=1

However, it:

Probably doesn't work on Windows (at least without Cygwin)
Needs to be tweaked so it doesn't cause the S3 copy unit tests to fail (I think they rely on it guessing based on filename alone?)
Adds another dependency

Leaving it for posterity in case someone wants to pick up the torch, but this doesn't seem super viable unless someone from the core team encourages it.
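To illustrate the point about guess_type above, here is a quick check using only the standard library. The filenames are made up for the example.

```python
import mimetypes

# For an HTML file the second tuple element is None:
# no text encoding is guessed from the filename.
html_guess = mimetypes.guess_type("index.html")
print(html_guess)

# The second element is only populated for compression suffixes such as .gz.
archive_guess = mimetypes.guess_type("archive.tar.gz")
print(archive_guess)
```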

JessicaSachs · 2019-10-09T19:59:04Z

+1 for getting this solved correctly, please! My s3 copy commands are littered with include and exclude statements now 👎... looking very similar to justatheory's

theory · 2020-07-05T20:50:27Z

Still an issue, I've updated my blog publish script from the broken link above to this script. Sure wish I could specify mappings explicitly and call it once!

ewan-realitymine · 2021-11-08T17:18:43Z

This is still an issue in the v2 cli, it's a real pain!

I don't think the current default is sensible. I understand the compatibility impact in updating this, but please put this behind a feature flag at least.

linorabolini · 2022-04-06T09:36:48Z

Discovered this issue today in our codebase.
We are using aws CLI to upload files via S3.
It would be lovely to have a way to ensure that the text/html charset is always UTF-8 by default.

andcip · 2022-05-13T08:57:10Z

+1

maaiika · 2023-06-14T02:59:15Z

Hi guys, this works for my python script.

import sys
from io import StringIO

from awscli.clidriver import create_clidriver
from awscli.customizations.s3 import utils


def on_queued_charset(self, future, **kwargs):
    # Replacement for the subscriber's on_queued hook that sets ContentType.
    guessed_type = utils.guess_content_type(self._get_filename(future))
    if not guessed_type:
        return

    # Append a charset to text/*, application/* and SVG types.
    if "text/" in guessed_type or "application/" in guessed_type or guessed_type == "image/svg+xml":
        guessed_type += ";charset=UTF-8"
    future.meta.call_args.extra_args["ContentType"] = guessed_type


# Monkey-patch the content-type subscriber before creating the driver.
utils.BaseProvideContentTypeSubscriber.on_queued = on_queued_charset
driver = create_clidriver()

# Capture the CLI's stdout and stderr while the sync runs.
old_stdout = sys.stdout
old_stderr = sys.stderr
sys.stdout = cli_stdout = StringIO()
sys.stderr = cli_stderr = StringIO()
args = [
    "s3",
    "sync",
    "local_dir",
    "s3://bucket",
    "--region=us-east-2",
    "--delete",
]
cli_status = driver.main(args)
sys.stdout = old_stdout
sys.stderr = old_stderr
print(cli_stdout.getvalue(), cli_stderr.getvalue())

kyleknap added the feature-request A feature should be added or improved. label May 27, 2015

kyleknap added s3 s3mimetype labels Nov 19, 2015

jleclanche mentioned this issue Dec 8, 2016

Incorrect text on Jade Spirit HearthSim/hs-bugs#651

Closed

ezzatron mentioned this issue Feb 5, 2017

Use UTF-8 encoding when guessing a MIME type of text/html for S3 uploads. #2426

Closed

ASayre closed this as completed Feb 6, 2018

jdorfman mentioned this issue Feb 8, 2018

Serve CDN files with charset=UTF-8 jsdelivr/bootstrapcdn#949

Closed

jamesls reopened this Apr 6, 2018

kdaily added the needs-review This issue or pull request needs review from a core team member. label Nov 23, 2021

justindho added community contribution-ready and removed needs-review This issue or pull request needs review from a core team member. labels May 11, 2022

justindho moved this to Contribution Ready in AWS CLI Community Contributions May 11, 2022

justindho added this to AWS CLI Community Contributions May 11, 2022

bparmentier mentioned this issue Aug 11, 2022

fix: encoding issue in CDN responses Redocly/redoc#2130

Merged

tim-finnigan added the p2 This is a standard priority issue label Nov 10, 2022

This was referenced Jul 2, 2023

Fix text encoding issue for CDN scpwiki/sigma#86

Merged

Fix text encoding issue for CDN Nu-SCPTheme/Black-Highlighter#208

Merged

Fix text encoding issue for CDN Basalt-Team/Basalt#25

Merged

tim-finnigan mentioned this issue May 16, 2024

aws s3 type inference should supply a charset for text/ mime types #8574

Closed

2 tasks
