S3: fix Unicode handling in System and User Metadata#13663
Conversation
S3 Image Test Results (AMD64 / ARM64): 2 files, 2 suites, 7m 49s ⏱️ Results for commit 4050b57.
Test Results (amd64) - Integration, Bootstrap: 5 files ±0, 5 suites ±0, 1h 37m 35s ⏱️ (- 58m 37s). Results for commit 4050b57, compared against base commit a8c43f2. This pull request removes 3534 and adds 7 tests. Note that renamed tests count towards both.
LocalStack Community integration with Pro: 2 files ±0, 2 suites ±0, 1h 15m 4s ⏱️ (- 43m 32s). Results for commit 4050b57, compared against base commit a8c43f2. This pull request removes 3137 and adds 7 tests. Note that renamed tests count towards both.
```python
if all(c in _NO_ENCODING_CHARS for c in header_bytes):
    return header
```
note: this could probably be replaced with `header.isascii()`
actually did not work, isascii was a bit too permissive!
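To illustrate why `str.isascii()` turned out to be too permissive here, a minimal standalone sketch (the `_NO_ENCODING_CHARS` definition and `needs_encoding` helper below are assumptions for illustration, not the actual LocalStack code): control characters such as `\n` are ASCII, so `isascii()` accepts them, yet they still cannot appear verbatim in a header value.

```python
# Hypothetical sketch: the exact byte range is an assumption, not the real
# LocalStack definition. Printable ASCII only: 0x20 (space) through 0x7E (~).
_NO_ENCODING_CHARS = frozenset(range(0x20, 0x7F))

def needs_encoding(header: str) -> bool:
    header_bytes = header.encode("utf-8")
    # Any byte outside printable ASCII (control characters, non-ASCII bytes)
    # means the value needs encoding before it can travel in a header.
    return not all(c in _NO_ENCODING_CHARS for c in header_bytes)

# "line\nbreak" passes isascii() but still needs encoding: "\n" is ASCII
# yet is not safe in a header value.
```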
aidehn
left a comment
Nice man! A lot of the headers.py code has gone over my head slightly but I trust it works since it's well tested. Just had a few questions and comments about the code but looks good!
After looking at the code I agree it's probably better to just store it encoded + I think we already spoke about this but it'll be good to leave a comment somewhere that it's stored encoded for future reference!
Also, on a separate note (and maybe not one to worry about?): do we know how this will behave for previously stored objects whose headers were not encoded?
k-a-il
left a comment
Interesting that this issue didn't come up earlier. Great fix, LGTM 👍
bentsku
left a comment
Thanks a lot for the reviews, I'll address the comments 🙏
Also, on a separate note (and maybe not one to worry about?): do we know how this will behave for previously stored objects whose headers were not encoded?
That's a very good question. If we had gone the way of actually encoding the headers every time they are requested, this would not be an issue. This might actually push me to do it correctly, which would also reduce the amount of code change in the models for now. It itches my brain too much, and I think it will just be cleaner overall. Sorry for the change 😅
Currently, only patch changes are allowed on main. Your PR labels (aws:s3, semver: minor, docs: skip, notes: skip) indicate that it cannot be merged into main at this time.
Sorry @aidehn, I've re-requested your review as I've made substantial changes 😅 I went for the decoding/encoding route now, because I could actually write new tests showing that S3 was properly decoding the headers 👍 and the questions you raised pushed me to choose the other design, which now makes more sense!
Most of the changes are in https://github.com/localstack/localstack/pull/13663/changes/d152d48bbf7acbc866f7a51cf22628b5ebd4fbce..3791ce2d0313e84fbe3fe5b824f0936f2e53e3a8, hopefully that makes it easy enough to review
aidehn
left a comment
Just had another look through the PR, nice to see the other approach taken! I think both were good for their own reasons, but this does make the code look a lot cleaner, and leaves less ambiguity on how the data is stored! Plus we don't need to worry about changing the data that is stored in the future 🔥
So what I understand from the PR is:

- Users will provide encoded `user_metadata` headers for objects, which we will decode and store.
- When users retrieve their S3 Object, we will encode the `user_metadata` to prevent the issue in the ticket; before, we weren't encoding the metadata, which was causing failures when we stored invalid metadata.
- Generally users can't upload objects with invalid metadata, but it is possible with presigned uploads, which don't validate the metadata. This is why we don't have any decoding logic for `PostObject`.
- In our tests we have to override `botocore` validation, since it validates the user metadata when in reality AWS does not do this for presigned URL uploads; the check is specific to `botocore`.
Is this correct? I think the last two points are the parts I'm not 100% about but other than that, looks good to me! 🎉
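The decode-on-write / encode-on-read flow summarized above could be sketched like this. This is a hypothetical illustration using RFC 2047 MIME encoded-words via the stdlib `email.header` module; the function names and the exact encoding scheme are assumptions, not LocalStack's actual implementation:

```python
from email.header import Header, decode_header

def decode_metadata_value(value: str) -> str:
    # On PutObject: decode any "=?utf-8?...?=" encoded-words so the plain
    # Unicode string is what gets stored.
    return "".join(
        part.decode(charset or "ascii") if isinstance(part, bytes) else part
        for part, charset in decode_header(value)
    )

def encode_metadata_value(value: str) -> str:
    # On GetObject/HeadObject: re-encode on the way out so the response
    # header stays header-safe; plain printable ASCII passes through as-is.
    if value.isascii() and value.isprintable():
        return value
    return Header(value, "utf-8").encode()
```

Storing the decoded value and encoding only at response time means stored objects never carry an ambiguous mix of encoded and raw metadata.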
Thanks a lot for the review, sorry to have made you review it twice 😅
Yes, or at least users will be able to provide encoded
Yes, exactly, we will encode it if it needs to be, if the value is
I think you just spotted a mistake I have made 😅 I'll add a test for this case, but I think we should decode in
So, the behavior here is pretty weird in general and is possibly broken: boto/botocore#2552. Which I believe is why botocore just disallows it entirely, I guess to avoid errors / weird behavior, and just says "nope you cannot do that" 😄
Force-pushed 3791ce2 to db306a5
Update:
aidehn
left a comment
LGTM! My only question is about a TODO on line 4500 in providers.py, but feel free to merge if it's meant to be there! 🎉
Motivation
As reported in #13641, we had an issue with Unicode handling in certain cases. This had not popped up until now because User and System metadata is usually set via headers: the client sends some headers that are then stored, and returned to the client in other operations.
Headers only support the `latin-1` charset (also known as `ISO 8859-1`), so this would "filter" the incoming values and force them to be compliant in S3 and safe to return.

But there are other ways to create objects (which are mentioned in the docs; they even mention SOAP, which is now disabled): S3 pre-signed POST. You can create objects by sending a form to S3, so those fields would be contained in the body.
They even mention it specifically:
We did not have this behavior, which meant that if we sent invalid headers, they would be returned as-is to the client and crash the WSGI web server with a pretty bad exception 😬
I've scoured the code base to try to understand where, when and how those headers could be returned, and I believe this PR to be fairly comprehensive.
There is also one other case which isn't mentioned in the documentation, which is pre-signed URLs. There, the headers are sent as part of the query string, so they could still contain invalid characters that would not work in headers. This case has also been checked via the new tests.
Another case not covered by the docs is System Metadata (such as `Content-Disposition`), where S3 just strips and replaces the non-latin-1 characters with a space character.

Changes
- `Location` in multipart uploads
- `headers.py`
- `Location` when it contains the object key
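The System Metadata behavior described in the motivation (non-latin-1 characters replaced by a space) could be sketched as follows; this is an illustrative guess at the rule, not the PR's actual code:

```python
# Illustrative sketch (assumed behavior, not LocalStack code): replace any
# character that does not fit in latin-1 (code point > 0xFF) with a space,
# as S3 does for system metadata such as Content-Disposition.
def strip_non_latin1(value: str) -> str:
    return "".join(c if ord(c) <= 0xFF else " " for c in value)
```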