S3Sync.net

General Category => Questions => Topic started by: mcm on November 25, 2008, 03:14:27 PM



Title: Problems with UTF-8 encoded characters in filename
Post by: mcm on November 25, 2008, 03:14:27 PM
Hi (and thanks for the fine program),

I have a problem with using UTF-8 characters in filenames (using Gentoo Linux, set to UTF-8).

What I want: my UTF-8 encoded filenames should end up on S3 with their original UTF-8 encoding (so I can access the objects without problems from other tools). I couldn't figure out how to do that.

This is the behaviour I observed:
--------------------------------------

I have two local files:

1) "test/iso-ä" (with the last character being a german 'umlaut a', encoded in ISO-8859-1, that is hex(e4))
2) "test/utf8-ä" (with the last character being a german 'umlaut a', UTF-8 encoded, that is hex(c3, a4))

(The representation/encoding here in the forum may be wrong, but I double-checked my local file names with a hex editor. A small script to recreate the two files follows below.)
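
For reference, a minimal Ruby sketch (assuming Ruby 1.9+ string encodings and a Linux filesystem that takes raw name bytes; this is not part of s3sync) that recreates the two files with exactly these name bytes:

require "fileutils"

# Build the two names from explicit byte escapes so no editor/forum re-encoding
# can interfere; on Linux the bytes are passed through to the filesystem as-is.
FileUtils.mkdir_p("test")
iso_name  = "test/iso-\xE4".force_encoding("ASCII-8BIT")       # umlaut a as ISO-8859-1: e4
utf8_name = "test/utf8-\xC3\xA4".force_encoding("ASCII-8BIT")  # umlaut a as UTF-8: c3 a4

File.write(iso_name,  "iso test file\n")
File.write(utf8_name, "utf8 test file\n")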
 
Two scenarios:

A) Transferring the files using s3sync without S3SYNC_NATIVE_CHARSET set (thus defaulting to ISO-8859-1)
B) Transferring the files using s3sync with S3SYNC_NATIVE_CHARSET set to 'UTF-8'

In both cases (A and B) each file seems to end up with exactly the same name on S3, regardless of the charset setting (checked with S3Fox and the right_aws Ruby gem) - that is:
- for the ISO file (1): the umlaut a is stored UTF-8 encoded as hex(c3, a4) -> that's probably OK, since S3 uses UTF-8
- for the UTF-8 file (2): the umlaut a is stored as hex(c3, 83, c2, a4) -> why can't it just end up as plain UTF-8? (see the sketch below)
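
My guess at what is happening - just an illustration in modern Ruby (1.9+ String#encode), not actual s3sync code: if the local name bytes are treated as ISO-8859-1 and converted to UTF-8, a name that is already UTF-8 gets encoded a second time:

# Treat the filename bytes as ISO-8859-1 and convert them to UTF-8,
# which is presumably what happens when the native charset is ISO-8859-1.
iso_name  = "iso-\xE4".force_encoding("ISO-8859-1")       # umlaut a as e4
utf8_name = "utf8-\xC3\xA4".force_encoding("ISO-8859-1")  # umlaut a already as c3 a4

puts iso_name.encode("UTF-8").bytes.map { |b| format("%02x", b) }.join(" ")
# => 69 73 6f 2d c3 a4             ("iso-" + c3 a4: correct UTF-8)

puts utf8_name.encode("UTF-8").bytes.map { |b| format("%02x", b) }.join(" ")
# => 75 74 66 38 2d c3 83 c2 a4    ("utf8-" + c3 83 c2 a4: double-encoded)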

When resyncing the unchanged files in scenario (A), both files are recognized as already being on the server and are not retransferred.
When resyncing the unchanged files in scenario (B), both files are always retransferred (unnecessarily).

When downloading / syncing back the files with the settings from (A), both files end up with their original names.

When downloading / syncing back the files with the settings from (B), the local file nodes are created with the same names, encoded the same way as on S3. After the nodes are created, the download from S3 (get_stream) fails with a 404 error for both files. The downloaded files contain the S3 XML error response, in which the umlaut in the requested key is encoded as hex(c3, 83, c2, a4) for the ISO file (1) and as hex(c3, 83, c2, 83, c3, 82, c2, a4) for the UTF-8 file (2).
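
For what it's worth, the bytes in the 404 response look like one more round of the same mis-conversion. A quick illustration (again modern Ruby as a stand-in, not s3sync code; whatever the actual code path is, the bytes line up):

# Take the already double-encoded key as stored on S3 for file (2), i.e. c3 83 c2 a4,
# treat it as ISO-8859-1 once more, and convert it to UTF-8 again:
stored_key = "utf8-\xC3\x83\xC2\xA4".force_encoding("ISO-8859-1")

puts stored_key.encode("UTF-8").bytes.map { |b| format("%02x", b) }.join(" ")
# => 75 74 66 38 2d c3 83 c2 83 c3 82 c2 a4
#    exactly the umlaut bytes quoted in the 404 response for file (2)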

-------------------

Is this really how it's supposed to work or am I doing something wrong?

I don't really care about the ISO-encoded files since I want to get rid of them anyway. But it would be nice if there were a way to sync the UTF-8 files up/down without altering their names (on S3 as well).

Thank you very much,
Michael


Title: Re: Problems with UTF-8 encoded characters in filename
Post by: ferrix on November 25, 2008, 07:18:18 PM
Sorry; as I understand the S3 spec, you can't just "send utf-8". Object names must be URL-encoded.
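
For example (just an illustration of the escaping with Ruby's CGI helper, not necessarily what s3sync does internally), the UTF-8 key bytes get percent-escaped for the request URL:

require "cgi"

key = "test/utf8-\xC3\xA4".force_encoding("UTF-8")  # umlaut a as c3 a4
puts CGI.escape(key)
# => "test%2Futf8-%C3%A4"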

Scenario (B) sounds like the correct approach.  But I see you are getting errors, so maybe there is some bug in s3sync.  Can you make a tar of your test files and send it to me? (see README for contact)


Title: Re: Problems with UTF-8 encoded characters in filename
Post by: mcm on November 26, 2008, 06:34:46 AM
Quote from: ferrix on November 25, 2008, 07:18:18 PM
Sorry; as I understand the S3 spec, you can't just "send utf-8". Object names must be URL-encoded.

Yes, for transfer. But after decoding by S3, the object key on the server should be the same UTF-8 character as in the original filename, shouldn't it? In my case the UTF-8 version already seems to be mangled (double-encoded) on the server, independent of the S3SYNC_NATIVE_CHARSET I upload it with.
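
The round trip I would expect (just a sketch of the expectation, using CGI escaping as a stand-in for whatever s3sync and S3 actually do):

require "cgi"

key = "test/utf8-\xC3\xA4".force_encoding("UTF-8")  # original UTF-8 name, umlaut a = c3 a4

escaped = CGI.escape(key)        # what goes over the wire: "test%2Futf8-%C3%A4"
stored  = CGI.unescape(escaped)  # what S3 should keep as the object key after decoding

puts stored == key
# => true - the key on S3 should be byte-identical to the local UTF-8 name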

I just sent you an email with the test files. If you can confirm that this is a bug, I can try to take a closer look and fix it.

regards,
Michael