S3Sync.net
Topic: Amazon object limit of 5 GB causing problems for rsync (s3sync)
cm_gui (Newbie, 13 posts)
« on: February 20, 2007, 01:16:29 PM »

Hi All

Right now, we are using rsync to back up our data to our own backup servers.
We are thinking of using Amazon S3 with the s3sync.rb tool.

But Amazon's limit of 5 GB per object is causing a problem for us, because
many of the folders we are rsync-ing to our backup servers are much
bigger than 5 GB. And we cannot break the folders up into smaller chunks,
as that would be very messy and would in fact negate the benefits of rsync.

Does anybody else have this problem?
And how do you workaround it?

Right now, we are thinking of using s3cmd's put command to put modified files
individually onto Amazon, but we don't think this is as good as rsync.

Thank you.

Gui



An explanation of why we cannot break our data up into smaller chunks:
For example, right now we can run
rsync --delete -aruvze ssh /usr/ftp 192.168.1.2:/back/
and the entire ftp folder gets backed up onto our backup server 192.168.1.2.
If we delete some subfolders in /usr/ftp, the same ones in 192.168.1.2:/back get deleted.

If we were to run s3sync.rb on each sub-sub-subfolder in /usr/ftp individually against Amazon S3,
we would have to write scripts to delete those sub-sub-subfolders on S3 that
have been deleted on our server. We would have to go down to the sub-sub-subfolder level
because the 1st-level subfolders in /usr/ftp are also much larger than 5 GB.
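
Roughly, the wrapper we would end up writing looks something like the sketch below (hypothetical Ruby only; the bucket and prefix names are made up, it assumes the sub-sub-subfolders fit under the limit, and the deletion bookkeeping is only noted in a comment because s3sync.rb alone gives us no obvious way to do it):

#!/usr/bin/env ruby
# Hypothetical sketch only: run one s3sync per sub-sub-subfolder of /usr/ftp,
# on the assumption that each of those is small enough for the per-object limit.
BUCKET_AND_PREFIX = 'bucket1:backup/ftp'   # made-up bucket and prefix

Dir.glob('/usr/ftp/*/*/*').select { |d| File.directory?(d) }.each do |dir|
  rel = dir.sub('/usr/ftp/', '')
  system("./s3sync.rb -r --ssl #{dir} #{BUCKET_AND_PREFIX}/#{rel}")
end

# ...and we would still have to track which sub-sub-subfolders were deleted
# locally since the last run and delete their keys on S3 ourselves, which is
# exactly the bookkeeping that rsync --delete already does for us.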

ferrix (Sr. Member, 363 posts; greg13070 on the AWS forum)
« Reply #1 on: February 20, 2007, 04:06:13 PM »

The size of the folder is irrelevant, only the size of each node.  s3sync maps one file per node.  So if you have a file > 5G then you can't use s3sync.  Otherwise it should be OK.

Note however I think there may still be an S3 bug about not being able to send a file that is >2GB because of some .. hardware issues.  But I'm too lazy to look up the details right now.  AWS forum should be swarming with stuff about it.
lowflyinghawk (Jr. Member, 52 posts)
« Reply #2 on: February 20, 2007, 06:37:03 PM »

AWS has never announced a fix for this, nor a schedule; "we're working on it" is all you get.  lots of people have complained about it though, and you should post another complaint to keep the topic warm.
ferrix (Sr. Member, 363 posts; greg13070 on the AWS forum)
« Reply #3 on: February 20, 2007, 08:30:09 PM »

Meh.. I don't have any files >2gb myself.. or else I long since would have been complaining and/or figuring out some way around it.

As you all may notice, the best way to motivate me is .. when I have the problem myself  Grin
cm_gui (Newbie, 13 posts)
« Reply #4 on: March 01, 2007, 01:08:07 PM »

Hi Ferrix

Thank you for the reply.

Are you saying that I can run
./s3sync.rb -r --ssl /local/test1 bucket1:prefix1/data1
even if the test1 folder is greater than 5GB ?

Wouldn't /local/test1 become a single object on Amazon?

Gui


lowflyinghawk (Jr. Member, 52 posts)
« Reply #5 on: March 01, 2007, 07:12:38 PM »

no, it's one key per file, so a folder containing 10 2G files maps to 10 separate keys each of which is PUT separately.  the only limit is on the individual keys, i.e. foo:bar/baz can't be over 2G, but the total of foo:/* doesn't matter.  remember, S3 is not a file system on a disk, it is a name/value database, so only the individual keys matter.  /foo/bar and /foo/bar/baz are not contents of the folder "/foo" in the way you may be used to thinking of it, each one is just a key.
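
to make the mapping concrete, here is a rough illustration (a ruby sketch written for this thread, not s3sync.rb's internals; it reuses the /local/test1 path and bucket1:prefix1/data1 from the command above):

# Rough illustration only -- this is not s3sync.rb's actual code. It just shows
# that /local/test1 becomes one key per file under bucket1:prefix1/data1, and
# that any size limit applies to each file, never to the folder as a whole.
LIMIT = 2 * 1024 ** 3   # the ~2 GB practical ceiling mentioned above, in bytes

Dir.glob('/local/test1/**/*').select { |p| File.file?(p) }.each do |path|
  key = 'prefix1/data1/' + path.sub('/local/test1/', '')
  if File.size(path) > LIMIT
    puts "too big for a single PUT: #{path}"
  else
    puts "PUT bucket1:#{key}  (#{File.size(path)} bytes)"
  end
end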
cm_gui (Newbie, 13 posts)
« Reply #6 on: March 02, 2007, 02:18:12 AM »

hi lowflyinghawk

So are you saying that if you have a 20GB folder (containing 10 2GB files),
you can do this
./s3sync.rb -r --ssl /local/20GBFolder  bucket1:prefix1/data1

The entire 20GB /local/20GBFolder can be uploaded to Amazon's bucket1:prefix1/data1  at one go?
As long as none of the files inside the 20GB folder are bigger than 2GB, the above will work?

The Amazon 5GB limit is actually a limit on the file size and not object size?
Or one single file is an object?

I thought that if I do this -- ./s3sync.rb -r --ssl /local/20GBFolder bucket1:prefix1/data1 --
then /local/20GBFolder would become a single object on Amazon?


Thank you.



« Last Edit: March 02, 2007, 02:22:51 AM by cm_gui »

lowflyinghawk (Jr. Member, 52 posts)
« Reply #7 on: March 02, 2007, 06:41:52 AM »

yes, that's what I'm saying.  the limit applies only to the data stored under one key, so you can store 100 2G files all under your directory foo/bar.  these keys are *not* directories, i.e. /foo/bar and /foo/baz are completely separate from one another.  better than me saying it, here it is from the s3sync developer:

"The size of the folder is irrelevant, only the size of each node.  s3sync maps one file per node.  So if you have a file > 5G then you can't use s3sync.  Otherwise it should be OK."

always with the caveat that amazon hasn't fixed the >2G issue AFAIK.
cm_gui (Newbie, 13 posts)
« Reply #8 on: March 02, 2007, 02:23:23 PM »

Hi lowflyinghawk

Thank you for the clarification.

When I upload a folder, say /data/folder1, to Amazon using s3sync.rb,
I cannot use s3cmd's get or Firefox S3 Organizer to download subfolders or individual files inside /data/folder1 from Amazon.
In fact, Firefox S3 Organizer sees /data/folder1 as a single file on Amazon and I cannot get anything out of it using Firefox S3.
I know the author of s3sync.rb has said that s3sync.rb may not be compatible with other tools,
but s3cmd's get command also cannot download subfolders or individual files inside /data/folder1.
This led me to think that /data/folder1 was uploaded as a single object.
However, s3sync.rb is able to download subfolders inside /data/folder1.

When I upload individual files one by one to Amazon using s3cmd's put command -- duplicating
the same /data/folder1 directory structure on Amazon -- I cannot use s3sync to download anything
from /data/folder1 at all.
So it seems that s3cmd and s3sync are not compatible?

lowflyinghawk (Jr. Member, 52 posts)
« Reply #9 on: March 02, 2007, 04:40:13 PM »

greg:  you're going to have to answer this one.  I don't know a thing about firefox s3, and I assume the usual "other tools may not be compatible" applies to it, but what about s3sync and s3cmd?
ferrix (Sr. Member, 363 posts; greg13070 on the AWS forum)
« Reply #10 on: March 05, 2007, 07:39:18 AM »

They're not especially "compatible", as you can tell from the following caveats.  s3cmd can't create the "folder nodes" that s3sync expects, and there's no facility for automatically adding s3sync-like meta-data for permissions and ownership.  It's not a defect, per se.. s3cmd is not intended as a "light weight s3sync" or anything of the kind.  It is strictly meant to cover the low level s3 operations like other s3 "shells".  Things such as listing and creating buckets, poking at individual files, etc.

Let me say that again and clarify: s3cmd is LOW LEVEL.. direct access to the keys and objects on s3.  You can't be thinking about them like directories and files, because they're not, and they don't behave as such.  You can't use s3cmd to "get a sub directory" because there's no such thing.

Having said that, s3cmd should be able to list the keys in your bucket (added by s3sync or anything else) and then you can do a "get" for whatever key you want. 

I would say in general you shouldn't be trying to poke at s3sync'd stuff with s3cmd.. it's just not intended.  It would be more likely, if anything, for me to enhance s3sync to handle single files some day.  The code for that is mostly there already, but the initial conditions setup is brittle, and I am disinclined to kick it around any more than necessary.
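
For example, something along these lines (list and get are the commands mentioned above; check the usage text s3cmd.rb prints for the exact argument order, and the file name here is made up):

./s3cmd.rb list bucket1:prefix1/data1
./s3cmd.rb get bucket1:prefix1/data1/somefile.txt /tmp/somefile.txt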
« Last Edit: March 05, 2007, 07:43:27 AM by ferrix »

cm_gui (Newbie, 13 posts)
« Reply #11 on: March 05, 2007, 05:08:10 PM »

Thank you once again Greg.

Do you think you can come up with a command in s3cmd to show the Upload Time of the uploaded data on Amazon?

Or the unix timestamp (modified time) of the files if they are preserved on the Amazon server after being uploaded.

If I can use s3cmd to see the upload time of the files on Amazon, I can then use s3cmd's put command to upload only the files which are newer, and won't need to use s3sync.

s3cmd may be useful in instances where users want to be able to access the Amazon data using other tools like Firefox S3 Organizer.

Or maybe you can drop some hints on how to modify your scripts to do this?     If it isn't too difficult, I might want to give it a shot myself -- although I'm a newbie.
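
Roughly, the idea would be something like the sketch below (hypothetical only -- upload_newer and the s3_times hash are invented names for illustration, and the put argument order would need checking against s3cmd.rb's usage text):

# Hypothetical sketch of the "upload only what is newer" idea -- none of this
# is an existing s3cmd/s3sync feature. s3_times is assumed to be a Hash of
# key => upload time in Unix seconds, built from whatever listing s3cmd can print.
def upload_newer(local_root, bucket, prefix, s3_times)
  Dir.glob("#{local_root}/**/*").select { |p| File.file?(p) }.each do |path|
    key = "#{prefix}/#{path.sub("#{local_root}/", '')}"
    next if s3_times[key] && s3_times[key] >= File.mtime(path).to_i
    # argument order for put is an assumption; check s3cmd.rb's usage text
    system("./s3cmd.rb put #{bucket}:#{key} #{path}")
  end
end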


« Last Edit: March 07, 2007, 12:55:18 PM by cm_gui »

ferrix (Sr. Member, 363 posts; greg13070 on the AWS forum)
« Reply #12 on: March 06, 2007, 04:49:03 AM »

Hints:
- Learn how the S3 REST interface works by reading the Amazon documentation for it.
- Get a working knowledge of Ruby with http://www.rubycentral.com/book/index.html
- Look at s3sync.rb and s3cmd.rb code and comments
cm_gui (Newbie, 13 posts)
« Reply #13 on: March 23, 2007, 05:21:20 PM »

Thanks ferrix

I managed to get s3cmd.rb to print each file's last_modified (upload time) by adding the
following line after line no. 118 in s3cmd.rb:
puts Time.parse( item.last_modified ).to_i
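
Presumably that line ends up inside the loop that prints each listing entry; pulled out on its own, the idea is roughly this (the entries array and the .key accessor are assumptions about the listing objects, not something verified against s3cmd.rb):

require 'time'   # Time.parse comes from the 'time' standard library

# Standalone sketch of the same idea. `entries` stands for the array of listing
# entries that s3cmd.rb's list loop walks over; each entry is assumed to respond
# to .key and .last_modified (an ISO 8601 timestamp string).
def print_upload_times(entries)
  entries.each do |item|
    puts "#{item.key}\t#{Time.parse(item.last_modified).to_i}"
  end
end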

Now, I want to create an empty folder in Amazon S3, but that is another post.
http://s3sync.net/forum/index.php?topic=39.0

« Last Edit: March 23, 2007, 06:04:17 PM by cm_gui »