S3: fix Unicode handling in System and User Metadata#13663
Conversation
S3 Image Test Results (AMD64 / ARM64): 2 files, 2 suites, 7m 49s ⏱️ Results for commit 4050b57.
Test Results (amd64) - Integration, Bootstrap: 5 files ±0, 5 suites ±0, 1h 37m 35s ⏱️ (- 58m 37s). Results for commit 4050b57, compared against base commit a8c43f2. This pull request removes 3534 and adds 7 tests. Note that renamed tests count towards both.
LocalStack Community integration with Pro: 2 files ±0, 2 suites ±0, 1h 15m 4s ⏱️ (- 43m 32s). Results for commit 4050b57, compared against base commit a8c43f2. This pull request removes 3137 and adds 7 tests. Note that renamed tests count towards both.
```python
if all(c in _NO_ENCODING_CHARS for c in header_bytes):
    return header
```
note: this could probably be replaced with `header.isascii()`
actually did not work, isascii was a bit too permissive!
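To illustrate why `str.isascii()` turned out to be too permissive here, a minimal standalone sketch (the `_NO_ENCODING_CHARS` definition and `needs_encoding` helper below are assumptions for illustration, not the actual LocalStack code): control characters such as `\n` are ASCII, so `isascii()` accepts them, yet they still cannot appear verbatim in a header value.

```python
# Hypothetical sketch: the exact byte range is an assumption, not the real
# LocalStack definition. Printable ASCII only: 0x20 (space) through 0x7E (~).
_NO_ENCODING_CHARS = frozenset(range(0x20, 0x7F))

def needs_encoding(header: str) -> bool:
    header_bytes = header.encode("utf-8")
    # Any byte outside printable ASCII (control characters, non-ASCII bytes)
    # means the value needs encoding before it can travel in a header.
    return not all(c in _NO_ENCODING_CHARS for c in header_bytes)

# "line\nbreak" passes isascii() but still needs encoding: "\n" is ASCII
# yet is not safe in a header value.
```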
aidehn
left a comment
Nice man! A lot of the headers.py code has gone over my head slightly but I trust it works since it's well tested. Just had a few questions and comments about the code but looks good!
After looking at the code I agree it's probably better to just store it encoded + I think we already spoke about this but it'll be good to leave a comment somewhere that it's stored encoded for future reference!
Also, on a separate note (and maybe not one to worry about?): do we know how this will behave for previously stored objects whose headers were not encoded?
k-a-il
left a comment
Interesting that this issue didn't come up earlier. Great fix, LGTM 👍
bentsku
left a comment
Thanks a lot for the reviews, I'll address the comments 🙏
Also, on a separate note (and maybe not one to worry about?): do we know how this will behave for previously stored objects whose headers were not encoded?
That's a very good question. If we had gone the way of actually encoding the headers every time they are requested, this would not be an issue. This might actually push me to do it correctly, which would also reduce the amount of code change in the models for now. It itches my brain too much, and I think it will just be cleaner overall. Sorry for the change 😅
Currently, only patch changes are allowed on main. Your PR labels (aws:s3, semver: minor, docs: skip, notes: skip) indicate that it cannot be merged into main at this time.
Sorry @aidehn, I've re-requested your review as I've made substantial changes 😅 I went for the decoding/encoding route now, because I could actually write new tests showing that S3 was properly decoding the headers 👍 and the questions you raised pushed me to choose the other design, which now makes more sense!
Most of the changes are in https://github.com/localstack/localstack/pull/13663/changes/d152d48bbf7acbc866f7a51cf22628b5ebd4fbce..3791ce2d0313e84fbe3fe5b824f0936f2e53e3a8, hopefully that makes it easy enough to review
aidehn
left a comment
Just had another look through the PR, nice to see the other approach taken! I think both were good for their own reasons, but this does make the code look a lot cleaner, and leaves less ambiguity on how the data is stored! Plus we don't need to worry about changing the data that is stored in the future 🔥
So what I understand from the PR is:

- Users will provide encoded `user_metadata` headers for objects, which we will decode and store.
- When users retrieve their S3 Object, we will encode the `user_metadata` to prevent the issue in the ticket; before, we weren't encoding the metadata, which was causing failures when we stored invalid metadata.
- Generally users can't upload objects with invalid metadata, but it is possible with presigned uploads, which don't validate the metadata. This is why we don't have any decoding logic for `PostObject`.
- In our tests we have to override `botocore` validation, since it validates the user metadata when in reality AWS does not do this for presigned URL uploads; the check is specific to `botocore`.
Is this correct? I think the last two points are the parts I'm not 100% about but other than that, looks good to me! 🎉
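The decode-on-write / encode-on-read flow summarized above could be sketched like this. This is a hypothetical illustration using RFC 2047 MIME encoded-words via the stdlib `email.header` module; the function names and the exact encoding scheme are assumptions, not LocalStack's actual implementation:

```python
from email.header import Header, decode_header

def decode_metadata_value(value: str) -> str:
    # On PutObject: decode any "=?utf-8?...?=" encoded-words so the plain
    # Unicode string is what gets stored.
    return "".join(
        part.decode(charset or "ascii") if isinstance(part, bytes) else part
        for part, charset in decode_header(value)
    )

def encode_metadata_value(value: str) -> str:
    # On GetObject/HeadObject: re-encode on the way out so the response
    # header stays header-safe; plain printable ASCII passes through as-is.
    if value.isascii() and value.isprintable():
        return value
    return Header(value, "utf-8").encode()
```

Storing the decoded value and encoding only at response time means stored objects never carry an ambiguous mix of encoded and raw metadata.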
Thanks a lot for the review, sorry to have made you review it twice 😅
Yes, or at least users will be able to provide encoded
Yes, exactly, we will encode it if it needs to be, if the value is
I think you just spotted a mistake I have made 😅 I'll add a test for this case, but I think we should decode in
So, the behavior here is pretty weird in general and is possibly broken: boto/botocore#2552. Which I believe is why botocore just disallows it entirely, I guess to avoid errors / weird behavior, and just says "nope you cannot do that" 😄
Force-pushed 3791ce2 to db306a5
Update:
aidehn
left a comment
LGTM! My only question is about a TODO on line 4500 in providers.py, but feel free to merge if it's meant to be there! 🎉
Motivation
As reported in #13641, we had an issue with Unicode handling in certain cases. This had not popped up until now because User and System metadata is usually set via headers: the client sends some headers that are then stored, and returned to the client in other operations.
Headers only support the `latin-1` charset (also known as `ISO 8859-1`), so this would "filter" the incoming values and force them to be compliant in S3 and safe to return.

But there are other ways to create objects (which are mentioned in the docs; they even mention SOAP, which is now disabled): S3 pre-signed POST. You can create objects by sending a form to S3, so those fields would be contained in the body.
They even mention it specifically:
We did not have this behavior, which meant that if we sent invalid headers, they would be returned as-is to the client and crash the WSGI web server with a pretty bad exception 😬
I've scoured the code base to try to understand where, when and how those headers could be returned, and I believe this PR to be fairly comprehensive.
There is also one other case which isn't mentioned in the documentation, which is pre-signed URLs. There, the headers are sent as part of the query string, so they could still contain invalid characters that would not work in headers. This case has also been checked via the new tests.
Another case not covered by the docs is System Metadata (such as `Content-Disposition`), where S3 just strips and replaces the non-latin-1 characters with a space character.

Changes
- `Location` in multipart uploads
- `headers.py`
- `Location` when it contains the object key
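The System Metadata behavior described in the motivation (non-latin-1 characters replaced by a space) could be sketched as follows; this is an illustrative guess at the rule, not the PR's actual code:

```python
# Illustrative sketch (assumed behavior, not LocalStack code): replace any
# character that does not fit in latin-1 (code point > 0xFF) with a space,
# as S3 does for system metadata such as Content-Disposition.
def strip_non_latin1(value: str) -> str:
    return "".join(c if ord(c) <= 0xFF else " " for c in value)
```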