X Tutup
Skip to content
This repository was archived by the owner on Mar 23, 2026. It is now read-only.

S3: fix Unicode handling in System and User Metadata#13663

Merged
bentsku merged 14 commits intomainfrom
fix-s3-unicode
Feb 3, 2026
Merged

S3: fix Unicode handling in System and User Metadata#13663
bentsku merged 14 commits intomainfrom
fix-s3-unicode

Conversation

@bentsku
Copy link
Contributor

@bentsku bentsku commented Jan 29, 2026

Motivation

As reported by #13641, we had an issue with Unicode handling in certain cases. This had not popped up until now because User and System metadata is usually done and set via headers already: the client sends some headers that are then stored, and returned to the client in other operations.

Headers only support the latin-1 charset (also known as ISO 8859-1), so this would "filter" the incoming values and force them to be compliant in S3 and safe to return.

But there are other ways to create objects (which are mentioned in the docs, they even mention SOAP which is now disabled): S3 pre-signed POST. You can create objects by sending a form to S3, so those fields would be contained in the body.

They even mention it specifically:

When using non-US-ASCII characters in your metadata values, the provided Unicode string is examined for non-US-ASCII characters. Values of such headers are character decoded as per RFC 2047 before storing and encoded as per RFC 2047 to make them mail-safe before returning. If the string contains only US-ASCII characters, it is presented as is.

We did not have this behavior, which meant that if we sent invalid headers, they would be returned as-is to the client and crash the WSGI web server with a pretty bad exception 😬

I've scoured the code base to try to understand where, when and how those headers could be returned, and I believe this PR to be fairly comprehensive.

There is also one other case which isn't mentioned in the documentation, which is pre-signed URLs. There, the headers are sent as part of the query string, so they could still contain invalid characters that would not work in headers. This case has also been checked via the new tests.

Another case not covered by the docs is System Metadata (such as Content-Disposition), where S3 just strips and replace the non-latin-1 characters by a space character.

Changes

  • add tests around unicode handling throughout S3:
    • when returning keys containing unicode characters (such as Location in multipart uploads)
    • when returning System Metadata: here S3 will strip the bad characters and replace them with a space character
    • when setting User Metadata: S3 will try to decode it following the RFC2047
    • when returning User Metadata: S3 will encode it following the RFC 2047: if all characters are printable or space characters, it will encode it like quoted-printable, or if it cannot, it will encode it with Base64
  • add new utilities around header encoding. First try was to use the standard library, but of course AWS is doing things its own way, so I had to get inspired by it but change some of the hardcoded values, those utils live in headers.py
  • add unit tests around those, with AWS-validated values
  • rework a bit how we handle metadata and properly encode it throughout the provider
  • url escape the Location when it contains the object key

@bentsku bentsku self-assigned this Jan 29, 2026
@bentsku bentsku added aws:s3 Amazon Simple Storage Service semver: minor Non-breaking changes which can be included in minor releases, but not in patch releases docs: skip Pull request does not require documentation changes notes: skip Pull request does not have to be mentioned in the release notes labels Jan 29, 2026
@bentsku bentsku modified the milestones: 2026.03, 4.14 Jan 29, 2026
@github-actions
Copy link

github-actions bot commented Jan 29, 2026

Test Results - Preflight, Unit

23 112 tests  +24   21 253 ✅ +24   6m 11s ⏱️ +2s
     1 suites ± 0    1 859 💤 ± 0 
     1 files   ± 0        0 ❌ ± 0 

Results for commit 4050b57. ± Comparison against base commit a8c43f2.

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Jan 29, 2026

S3 Image Test Results (AMD64 / ARM64)

    2 files      2 suites   7m 49s ⏱️
  559 tests   507 ✅  52 💤 0 ❌
1 118 runs  1 014 ✅ 104 💤 0 ❌

Results for commit 4050b57.

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Jan 29, 2026

Test Results (amd64) - Acceptance

7 tests  ±0   5 ✅ ±0   3m 2s ⏱️ ±0s
1 suites ±0   2 💤 ±0 
1 files   ±0   0 ❌ ±0 

Results for commit 4050b57. ± Comparison against base commit a8c43f2.

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Jan 29, 2026

Test Results (amd64) - Integration, Bootstrap

    5 files  ±    0      5 suites  ±0   1h 37m 35s ⏱️ - 58m 37s
2 072 tests  - 3 527  1 906 ✅  - 3 135  166 💤  - 392  0 ❌ ±0 
2 078 runs   - 3 527  1 906 ✅  - 3 135  172 💤  - 392  0 ❌ ±0 

Results for commit 4050b57. ± Comparison against base commit a8c43f2.

This pull request removes 3534 and adds 7 tests. Note that renamed tests count towards both.
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_lambda_dynamodb
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_opensearch_crud
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_search_books
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_setup
tests.aws.scenario.kinesis_firehose.test_kinesis_firehose.TestKinesisFirehoseScenario ‑ test_kinesis_firehose_s3
tests.aws.scenario.lambda_destination.test_lambda_destination_scenario.TestLambdaDestinationScenario ‑ test_destination_sns
tests.aws.scenario.lambda_destination.test_lambda_destination_scenario.TestLambdaDestinationScenario ‑ test_infra
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_prefill_dynamodb_table
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_stepfunctions_input_recipient_list[step_function_input0-SUCCEEDED]
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_stepfunctions_input_recipient_list[step_function_input1-SUCCEEDED]
…
tests.aws.services.s3.test_s3.TestS3 ‑ test_multipart_with_unicode_character_location
tests.aws.services.s3.test_s3.TestS3 ‑ test_system_metadata_with_unicode
tests.aws.services.s3.test_s3.TestS3 ‑ test_user_metadata_rfc2047_bad_b64_encoded
tests.aws.services.s3.test_s3.TestS3 ‑ test_user_metadata_rfc2047_encoded
tests.aws.services.s3.test_s3.TestS3PresignedPost ‑ test_post_object_with_unicode_metadata
tests.aws.services.s3.test_s3.TestS3PresignedUrl ‑ test_get_response_overrides_unicode_metadata_with_sig_s3
tests.aws.services.s3.test_s3.TestS3PresignedUrl ‑ test_put_unicode_metadata_with_sig_s3

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Jan 29, 2026

LocalStack Community integration with Pro

    2 files  ±    0      2 suites  ±0   1h 15m 4s ⏱️ - 43m 32s
2 048 tests  - 3 130  1 878 ✅  - 2 912  170 💤  - 218  0 ❌ ±0 
2 050 runs   - 3 130  1 878 ✅  - 2 912  172 💤  - 218  0 ❌ ±0 

Results for commit 4050b57. ± Comparison against base commit a8c43f2.

This pull request removes 3137 and adds 7 tests. Note that renamed tests count towards both.
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_lambda_dynamodb
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_opensearch_crud
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_search_books
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_setup
tests.aws.scenario.kinesis_firehose.test_kinesis_firehose.TestKinesisFirehoseScenario ‑ test_kinesis_firehose_s3
tests.aws.scenario.lambda_destination.test_lambda_destination_scenario.TestLambdaDestinationScenario ‑ test_destination_sns
tests.aws.scenario.lambda_destination.test_lambda_destination_scenario.TestLambdaDestinationScenario ‑ test_infra
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_prefill_dynamodb_table
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_stepfunctions_input_recipient_list[step_function_input0-SUCCEEDED]
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_stepfunctions_input_recipient_list[step_function_input1-SUCCEEDED]
…
tests.aws.services.s3.test_s3.TestS3 ‑ test_multipart_with_unicode_character_location
tests.aws.services.s3.test_s3.TestS3 ‑ test_system_metadata_with_unicode
tests.aws.services.s3.test_s3.TestS3 ‑ test_user_metadata_rfc2047_bad_b64_encoded
tests.aws.services.s3.test_s3.TestS3 ‑ test_user_metadata_rfc2047_encoded
tests.aws.services.s3.test_s3.TestS3PresignedPost ‑ test_post_object_with_unicode_metadata
tests.aws.services.s3.test_s3.TestS3PresignedUrl ‑ test_get_response_overrides_unicode_metadata_with_sig_s3
tests.aws.services.s3.test_s3.TestS3PresignedUrl ‑ test_put_unicode_metadata_with_sig_s3

♻️ This comment has been updated with latest results.

@bentsku bentsku marked this pull request as ready for review January 29, 2026 23:59
@bentsku bentsku requested a review from k-a-il as a code owner January 29, 2026 23:59
@bentsku bentsku requested a review from aidehn January 30, 2026 09:43
Comment on lines +34 to +35
if all(c in _NO_ENCODING_CHARS for c in header_bytes):
return header
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: this could be probably be replaced header.isascii()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually did not work, isascii was a bit too permissive!

Copy link
Contributor

@aidehn aidehn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice man! A lot of the headers.py code has gone over my head slightly but I trust it works since it's well tested. Just had a few questions and comments about the code but looks good!

After looking at the code I agree it's probably better to just store it encoded + I think we already spoke about this but it'll be good to leave a comment somewhere that it's stored encoded for future reference!

Also, on a separate note and maybe not one to worry about (maybe?) But do we know how this will behave for previously stored objects whose headers were not encoded?

Copy link
Collaborator

@k-a-il k-a-il left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting that this issue didn't come up earlier. Great fix, LGTM 👍

Copy link
Contributor Author

@bentsku bentsku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the reviews, I'll address the comments 🙏

Also, on a separate note and maybe not one to worry about (maybe?) But do we know how this will behave for previously stored objects whose headers were not encoded?

That's a very good question, if we had gone the way to actually encode the headers every time they are requested, this would not be an issue. This might actually push me to do it correctly, which would also reduce the amount of code change in the models for now. It itches my brain too much, and I think it will just be cleaner overall. Sorry for the change 😅

@localstack-bot
Copy link
Contributor

Currently, only patch changes are allowed on main. Your PR labels (aws:s3, semver: minor, docs: skip, notes: skip) indicate that it cannot be merged into the main at this time.

@bentsku bentsku requested a review from aidehn January 30, 2026 19:46
Copy link
Contributor Author

@bentsku bentsku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @aidehn, I've re-requested your review as I've done substantial changes 😅 I went for the decoding/encoding routing now, because I could actually write new tests that showed that S3 was properly decoding the headers 👍 and the questions you raised pushed me to chose the other design, which now makes more sense!

Most of the changes are in https://github.com/localstack/localstack/pull/13663/changes/d152d48bbf7acbc866f7a51cf22628b5ebd4fbce..3791ce2d0313e84fbe3fe5b824f0936f2e53e3a8, hopefully it makes easy enough tor review

Comment on lines +34 to +35
if all(c in _NO_ENCODING_CHARS for c in header_bytes):
return header
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually did not work, isascii was a bit too permissive!

@bentsku bentsku marked this pull request as draft January 30, 2026 20:01
@bentsku bentsku marked this pull request as ready for review February 1, 2026 19:56
Copy link
Contributor

@aidehn aidehn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just had another look through the PR, nice to see the other approach taken! I think both were good for their own reasons, but this does make the code look a lot cleaner, and leaves less ambiguity on how the data is stored! Plus we don't need to worry about changing the data that is stored in the future 🔥

So from what I understand from the PR is:

  • Users will provide encoded user_metadata headers for objects which we will decode and store.
  • When users retrieve their S3 Object we will encode the user_metadata to prevent the issue in the ticket as before we weren't encoding the metadata which was causing failures when we store invalid metadata.
  • Generally users can't upload objects with invalid metadata but it is possible with presigned uploads which doesn't validate the metadata. This is why we don't have any decoding logic for PostObject.
    • In our tests we have to override botocore validation since it validates the user metadata when in reality AWS does not do this for presigned URL uploads and is specific to botocore

Is this correct? I think the last two points are the parts I'm not 100% about but other than that, looks good to me! 🎉

@bentsku
Copy link
Contributor Author

bentsku commented Feb 2, 2026

Thanks a lot for the review, sorry to have made you review it twice 😅

Users will provide encoded user_metadata headers for objects which we will decode and store.

Yes, or at least users will be able to provider encoded Metadata if they want to use non-latin-1 characters. They can keep providing non-encoded data if it is latin-1 encoded.

When users retrieve their S3 Object we will encode the user_metadata to prevent the issue in the ticket as before we weren't encoding the metadata which was causing failures when we store invalid metadata.

Yes, exactly, we will encode it if it needs to be, if the value is latin-1-compatible we send as is, if not we encode it.

Generally users can't upload objects with invalid metadata but it is possible with presigned uploads which doesn't validate the metadata. This is why we don't have any decoding logic for PostObject.

I think you just spotted a mistake I have made 😅 I'll add a test for this case, but I think we should decode in PostObject as well

In our tests we have to override botocore validation since it validates the user metadata when in reality AWS does not do this for presigned URL uploads and is specific to botocore

So, the behavior here is pretty weird in general and is possibly broken: boto/botocore#2552. Which I believe is why botocore just disallows is entirely I guess to avoid errors / weird behavior, and just say "nope you cannot do that" 😄

@bentsku
Copy link
Contributor Author

bentsku commented Feb 2, 2026

Update: PostObject does not decode the data and returns it encoded, even if is only ASCII. I missed that but had implemented it correctly by accident 😅 should be good now! and it also showed a small issue, I was sometimes returning encoded data when it did not need to be.

Copy link
Contributor

@aidehn aidehn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Only question is surrounding a TODO on line 4500 in providers.py, but feel free to merge if it's meant to be there! 🎉

@bentsku bentsku merged commit c7db927 into main Feb 3, 2026
49 checks passed
@bentsku bentsku deleted the fix-s3-unicode branch February 3, 2026 09:58
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

aws:s3 Amazon Simple Storage Service docs: skip Pull request does not require documentation changes notes: skip Pull request does not have to be mentioned in the release notes semver: minor Non-breaking changes which can be included in minor releases, but not in patch releases

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants

X Tutup