Filenames with UTF-8 characters are badly encoded

A.Skwar

Newbie

Posts: 8


« on: September 05, 2007, 06:06:49 AM »

Hello.

I transferred some files and directories to S3 with s3sync. One of these directories is named "Ausflüge". I transferred the files with the --public-read option. My system (Gentoo Linux x86) is set to UTF-8.

Now, to access files in this "directory", I need to request "Ausfl%C3%83%C2%BCge", i.e. "AusflÃ¼ge". So it seems that s3sync (or S3 itself?) did not notice that the source filenames were UTF-8 encoded.
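For illustration, here is a minimal Python sketch of how a key like this could come about. It assumes the UTF-8 bytes of the name are mistaken for ISO-8859-1 text and converted to UTF-8 a second time before percent-encoding. That is a guess that matches the observed key, not a confirmed description of what s3sync does internally.

```python
from urllib.parse import quote

name = "Ausfl\u00fcge"  # "Ausflüge"

# Expected behaviour: percent-encode the UTF-8 bytes of the name once.
correct = quote(name.encode("utf-8"))

# Assumed failure mode: the UTF-8 bytes are read as ISO-8859-1 text,
# which turns the two bytes of "ü" into the two characters "Ã" and "¼" ...
mojibake = name.encode("utf-8").decode("iso-8859-1")

# ... and that text is then encoded to UTF-8 again and percent-encoded.
double_encoded = quote(mojibake.encode("utf-8"))

print(correct)         # Ausfl%C3%BCge
print(double_encoded)  # Ausfl%C3%83%C2%BCge
```

The second value is exactly the key reported above, which is consistent with the name having been converted to UTF-8 twice.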

Is there a way to tell s3sync (S3?) that the filename is UTF-8 encoded?

Thanks,
Alexander



ferrix

Sr. Member

Posts: 363

(I am greg13070 on AWS forum)

Re: Filenames with UTF-8 characters are badly encoded

« Reply #1 on: September 05, 2007, 08:38:11 AM »

Yes. Check the README for the native character encoding options.



A.Skwar

Newbie

Posts: 8

Re: Filenames with UTF-8 characters are badly encoded

« Reply #2 on: September 05, 2007, 11:54:19 AM »

You're referring to S3SYNC_NATIVE_CHARSET? Thanks for the hint, I missed that variable.

What are the legal values? Is utf-8 legal? Does case matter (i.e. is Windows-1252 the same as WINDOWS-1252)?

Also, would it be possible to enhance s3sync so that it sets the charset automatically, according to the locale?

Thx,
Alexander



ferrix

Sr. Member

Posts: 363

(I am greg13070 on AWS forum)

Re: Filenames with UTF-8 characters are badly encoded

« Reply #3 on: September 05, 2007, 07:40:04 PM »

Quote from: A.Skwar on September 05, 2007, 11:54:19 AM

Don't actually know what's legal.. I pass this to the conversion class Iconv from the ruby library. Guess you could look there for more info. See the file S3encoder.rb for the entirety of how I do escaping.

Also I don't know how to detect the right answer.. As I mentioned in another thread, I'm open to suggestions.



A.Skwar

Newbie

Posts: 8

Re: Filenames with UTF-8 characters are badly encoded

« Reply #4 on: September 06, 2007, 04:04:39 AM »

Quote from: ferrix on September 05, 2007, 07:40:04 PM

Don't actually know what's legal.. I pass this to the conversion class Iconv from the ruby library.

Well, "UTF-8" works, and that's all I care about.

Quote from: ferrix on September 05, 2007, 07:40:04 PM

Also I don't know how to detect the right answer..

You mean regarding setting S3SYNC_NATIVE_CHARSET automatically to the "correct" value?

If you look at how the locale is set up (i.e. the values of the environment variables LANG and LC_*), you can infer what the "correct" value would be. If the locale is POSIX or C, the charset would be US-ASCII; if it is something like de_DE.utf-8 (or de_DE.utf8), it would be UTF-8. Another possible value is something like en_US.ISO8859-1, which would mean ISO-8859-1, and so on. Basically, the locale setting also tells you which character encoding is in use.

HOWEVER: It would still be very useful if there were a way to override the automatically detected value.
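A rough Python sketch of that idea follows. The function name charset_from_locale is made up for this example, and treating S3SYNC_NATIVE_CHARSET as the override is an assumption about how such a feature might be wired up, not how s3sync currently behaves. The parsing is deliberately simplified: a locale without an explicit codeset (e.g. plain de_DE) falls back to US-ASCII here, whereas a real system may map it to something else. On POSIX systems, nl_langinfo(CODESET) reports the codeset of the active locale directly and would be the more robust source.

```python
import os


def charset_from_locale(environ=None):
    """Guess the filename charset from locale variables (simplified sketch)."""
    env = os.environ if environ is None else environ

    # An explicit setting always wins over auto-detection.
    override = env.get("S3SYNC_NATIVE_CHARSET")
    if override:
        return override

    # POSIX precedence for the character type: LC_ALL, LC_CTYPE, then LANG.
    value = ""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var, "")
        if value:
            break

    # Unset, "C" and "POSIX" all mean plain ASCII. A locale without a
    # codeset part is also mapped to ASCII here (an assumption).
    if value in ("", "C", "POSIX") or "." not in value:
        return "US-ASCII"

    # Locale names look like language_TERRITORY.codeset@modifier.
    codeset = value.split(".", 1)[1].split("@", 1)[0]
    normalized = codeset.lower().replace("-", "").replace("_", "")

    if normalized == "utf8":
        return "UTF-8"
    if normalized.startswith("iso8859") and normalized[7:].isdigit():
        return "ISO-8859-" + normalized[7:]
    return codeset  # pass anything else through unchanged


print(charset_from_locale({"LANG": "de_DE.utf8"}))       # UTF-8
print(charset_from_locale({"LANG": "en_US.ISO8859-1"}))  # ISO-8859-1
print(charset_from_locale({"LANG": "C"}))                # US-ASCII
```

Passing the environment in as a dictionary keeps the function easy to try out without changing the real locale of the shell.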

Best regards,
Alexander



lowflyinghawk

Jr. Member

Posts: 52

Re: Filenames with UTF-8 characters are badly encoded

« Reply #5 on: September 14, 2007, 07:35:31 PM »

I don't follow you here...a filename in common linux filesystems is just a sequence of bytes not including '\0' or '/'. these bytes are displayed by various tools according to the current locale, but that has no effect on the underlying name.
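To make the "just a sequence of bytes" point concrete, here is a small Python sketch. The helper is_valid_filename is hypothetical and only checks the two byte values mentioned above (it ignores special names such as "." and ".."). It shows that the stored bytes stay the same while the displayed text depends on which charset a tool assumes.

```python
def is_valid_filename(raw: bytes) -> bool:
    """True if raw is non-empty and contains neither NUL nor '/'."""
    return len(raw) > 0 and b"\x00" not in raw and b"/" not in raw


# The name as stored in the directory entry: 9 bytes, no charset attached.
raw = b"Ausfl\xc3\xbcge"

# What a tool shows depends on the charset it assumes when decoding.
shown_in_utf8_locale = raw.decode("utf-8")         # "Ausflüge"
shown_in_latin1_locale = raw.decode("iso-8859-1")  # "AusflÃ¼ge"

print(is_valid_filename(raw))     # True
print(is_valid_filename(b"a/b"))  # False
print(len(raw))                   # 9
```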

s3 isn't the problem, according to AWS:

"We support from U+0001 to U+10FFFF (null character is encoded like this: %C0%80). This range is supported for GETs and PUTs, but some XML parsers might choke on LIST if there are any unprintable characters in the keys. (We return entity references for these characters, but some parsers choke on these)."

there is a good reason why "some parsers choke": most chars below #x20 are not legal in xml 1.0, which is what a list bucket request returns. however, it turns out the ruby REXML parser will correctly parse all those illegal chars even though it isn't supposed to, so you can list, get and put these chars as well as the legal ones.
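As a sketch of which characters are affected, the Python snippet below checks a key against the Char production of XML 1.0 (tab, line feed and carriage return are the only characters below #x20 that it allows). The function names are made up for this example; it only classifies characters and says nothing about how any particular parser reacts to them.

```python
def is_xml10_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.0 'Char' production."""
    return (
        cp in (0x09, 0x0A, 0x0D)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def illegal_xml10_chars(key: str) -> list:
    """Return the code points in key that XML 1.0 does not allow."""
    return [ord(ch) for ch in key if not is_xml10_char(ord(ch))]


print(illegal_xml10_chars("Ausfl\u00fcge"))    # [] (the umlaut is legal)
print(illegal_xml10_chars("bad\x01key\x1f"))   # [1, 31]
```

A key for which this list is non-empty is one that a strict XML 1.0 parser may reject in a LIST response.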

I have tested get/put on files with the illegal chars in their names, as well as files with umlauts, with no problem.

shorter answer: either s3sync is doing something bad (I doubt it if it is using REXML) or...


