S3Sync.net
Author Topic: Filenames with UTF-8 characters are badly encoded  (Read 4698 times)
A.Skwar
« on: September 05, 2007, 06:06:49 AM »

Hello.

I transferred some files/dirs to S3 with s3sync. One of these directories is named "Ausflüge". I transferred the files with the --public-read option. My system (Gentoo Linux x86) is set to UTF-8.

Now, to be able to access files in this "directory", I need to access "Ausfl%C3%83%C2%BCge", i.e. "AusflÃ¼ge". So it seems that s3sync (or S3 itself?) did not notice that the source filenames were UTF-8 encoded.

Is there a way to tell s3sync (S3?) that the filename is UTF-8 encoded?

Thanks,
Alexander
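
The key reported above, Ausfl%C3%83%C2%BCge, is what you get when the UTF-8 bytes of the name are treated as ISO-8859-1 and converted to UTF-8 a second time. A minimal Ruby sketch of that effect follows. It assumes a current Ruby (String#encode rather than Iconv), and percent_encode is a hypothetical helper written for illustration, not s3sync's actual escaping code.

```ruby
# Hypothetical helper: percent-encode every byte outside the URL-unreserved set.
def percent_encode(str)
  str.bytes.map { |b|
    ch = b.chr
    b < 128 && ch.match?(/[A-Za-z0-9_.~-]/) ? ch : format("%%%02X", b)
  }.join
end

name = "Ausfl\u00FCge" # "Ausflüge" as a UTF-8 string

# Correct handling: the name is already UTF-8, so its bytes are escaped as they are.
correct = percent_encode(name)

# Faulty handling: the same bytes are assumed to be ISO-8859-1
# and are converted to UTF-8 once more before escaping.
mistaken = name.dup.force_encoding("ISO-8859-1").encode("UTF-8")
double = percent_encode(mistaken)

puts correct # Ausfl%C3%BCge
puts double  # Ausfl%C3%83%C2%BCge
```

The second result matches the key from the post, which suggests (but does not prove) that the name was converted from the wrong source charset somewhere on the way to S3.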
ferrix
(I am greg13070 on AWS forum)
« Reply #1 on: September 05, 2007, 08:38:11 AM »

Yes.  Check the README for the native character encoding options.
A.Skwar
« Reply #2 on: September 05, 2007, 11:54:19 AM »

You're referring to S3SYNC_NATIVE_CHARSET? Thx for the hint, I missed that variable.

What are legal values? Is utf-8 legal? Does casing matter (i.e. is Windows-1252 == WINDOWS-1252)?

However, would it be possible to enhance s3sync so that it automatically sets the charset according to the locale?

Thx,
Alexander
ferrix
« Reply #3 on: September 05, 2007, 07:40:04 PM »

Don't actually know what's legal. I pass this to the conversion class Iconv from the Ruby library. Guess you could look there for more info. See the file S3encoder.rb for the entirety of how I do escaping.

Also I don't know how to detect the right answer. As I mentioned in another thread, I'm open to suggestions.
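
On the question of legal values: Iconv was dropped from Ruby's standard library in Ruby 2.0, so the sketch below uses Ruby's built-in Encoding class instead. It only shows how a charset name can be checked in a current Ruby. Whether Iconv accepted the same names, or treated case the same way, is an assumption that this thread does not confirm. known_charset? is a hypothetical helper.

```ruby
# Hypothetical helper: true if Ruby's Encoding class recognizes the name.
def known_charset?(name)
  Encoding.find(name)
  true
rescue ArgumentError
  # Encoding.find raises ArgumentError for names it does not know.
  false
end

puts known_charset?("UTF-8")           # true
puts known_charset?("utf-8")           # true
puts known_charset?("no-such-charset") # false

# Encoding.find ignores case, so both spellings resolve to the same encoding.
puts Encoding.find("Windows-1252") == Encoding.find("WINDOWS-1252") # true
```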
A.Skwar
« Reply #4 on: September 06, 2007, 04:04:39 AM »

Quote from: ferrix
Don't actually know what's legal.. I pass this to the conversion class Iconv from the ruby library.

Well, "UTF-8" works and that's all I care about :)

Quote from: ferrix
Also I don't know how to detect the right answer..

You mean regarding setting S3SYNC_NATIVE_CHARSET automatically to the "correct" value?

If you have a look at how the locale is set up (i.e. what values the environment variables LANG and LC_* are set to), you could deduce what the "correct" value would be. If it's set to POSIX or C, it would be US-ASCII; if it's set to something like de_DE.utf-8 (or de_DE.utf8), it would be UTF-8. Other possible values would be something like en_US.ISO8859-1, which would mean ISO-8859-1, and so on. Basically, the locale setting also tells you which character encoding is used.

HOWEVER: It would still be very useful, if there were a way to override the automatically detected value.

Best regards,
Alexander
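
The detection idea above, including the override, could be sketched in Ruby roughly as follows. charset_from_locale is a hypothetical helper, not part of s3sync. The lookup order LC_ALL, LC_CTYPE, LANG follows the usual POSIX precedence. Treating an unset locale as US-ASCII and letting S3SYNC_NATIVE_CHARSET win over detection are assumptions made for this sketch.

```ruby
# Hypothetical helper: guess a charset name from locale-style variables.
# `env` is anything that answers [] with string keys (ENV or a plain Hash).
def charset_from_locale(env = ENV)
  # Assumption: an explicit setting always wins over detection.
  override = env["S3SYNC_NATIVE_CHARSET"]
  return override if override && !override.empty?

  # POSIX precedence: LC_ALL, then LC_CTYPE, then LANG.
  values = %w[LC_ALL LC_CTYPE LANG].map { |key| env[key] }
  locale = values.find { |value| value && !value.empty? }

  # Assumption: an unset locale behaves like the C locale.
  return "US-ASCII" if locale.nil? || %w[C POSIX].include?(locale)

  # Take the codeset between "." and an optional "@modifier".
  codeset = locale[/\.([^@]+)/, 1]
  return nil if codeset.nil? # e.g. "de_DE": no codeset given, so no guess

  case codeset.downcase.gsub(/[-_]/, "")
  when "utf8" then "UTF-8"
  when /\Aiso8859(\d+)\z/ then "ISO-8859-#{$1}"
  else codeset # pass unrecognized names through unchanged
  end
end

puts charset_from_locale("LANG" => "de_DE.utf8")      # UTF-8
puts charset_from_locale("LANG" => "en_US.ISO8859-1") # ISO-8859-1
puts charset_from_locale("LANG" => "C")               # US-ASCII
```

Returning nil when the locale names no codeset leaves the decision to the caller rather than guessing a system default.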
lowflyinghawk
« Reply #5 on: September 14, 2007, 07:35:31 PM »

I don't follow you here...a filename in common linux filesystems is just a sequence of bytes not including '\0' or '/'.  these bytes are displayed by various tools according to the current locale, but that has no effect on the underlying name.

s3 isn't the problem, according to AWS:

"We support from U+0001 to U+10FFFF (null character is encoded like this: %C0%80). This range is supported for GETs and PUTs, but some XML parsers might choke on LIST if there are any unprintable characters in the keys. (We return entity references like  for these characters, but some parsers choke on these)."

there is a good reason why "some parsers choke": most chars below #x20 are not legal in xml 1.0, which is what a list bucket request returns.  however, it turns out the ruby REXML parser will correctly parse all those illegal chars even though it isn't supposed to, so you can list, get and put these chars as well as the legal ones.

I have tested get/put of files with the illegal chars in them as well as files with umlauts, with no problem.

shorter answer:  either s3sync is doing something bad (I doubt it if it is using REXML) or...
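
The XML 1.0 restriction mentioned above can be checked per character. The sketch below follows the Char production of the XML 1.0 specification. xml10_char? and xml10_illegal_chars are hypothetical helper names, and the code says nothing about how REXML or S3 actually handle such keys.

```ruby
# True if the code point matches the Char production of XML 1.0:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
def xml10_char?(codepoint)
  [0x9, 0xA, 0xD].include?(codepoint) ||
    (0x20..0xD7FF).cover?(codepoint) ||
    (0xE000..0xFFFD).cover?(codepoint) ||
    (0x10000..0x10FFFF).cover?(codepoint)
end

# Hypothetical helper: list the characters of a key that XML 1.0 does not allow.
def xml10_illegal_chars(key)
  key.each_char.reject { |ch| xml10_char?(ch.ord) }
end

p xml10_illegal_chars("Ausfl\u00FCge")    # empty: umlauts are legal
p xml10_illegal_chars("bad\u0001key\tok") # one entry: U+0001 (the tab is legal)
```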