Title: s3sync.rb deleting sibling directories on when syncing to S3
Post by: RobS on April 01, 2008, 01:31:10 PM

I just started using s3sync.rb to back up the files on my web server to S3. I really like the program, and its features were exactly what I was looking for.
I wrote a little wrapper script that runs the program to selectively back up directories from my web server. When I ran it, I found an interesting problem that may be me doing something wrong or may be a bug. I'm not sure.

The problem occurs when s3sync.rb is run in delete mode, syncing to a target directory in S3 that has sibling directories whose names start with the same characters as the target directory. When this occurs, s3sync.rb removes the other directories, and I don't think that it should. (However, I'm new at this and I want to see what you think.) So, if I have a directory on S3 containing the directories test and test2, then when I try to sync test, it will remove test2. What appears to be happening is that s3sync.rb is matching the first part of the directory name and not taking into account that the match does not end at a directory boundary (/).

This is kind of hard to explain, so let me give you an example that shows what is happening pretty clearly. I have a local directory with the following files:

$ ls -al

The whole directory tree looks like the following:

$ find .

My S3 bucket is currently empty:

$ s3cmd.rb list vpc_test

I want to sync the test and test2 directories to S3 using s3sync.rb, but not the test3 directory. To do this I run s3sync.rb twice: once to back up the test directory and once to back up the test2 directory. (This example is a bit contrived, but it should get the point across. For a case this simple you could probably use exclude to do the same thing in one run and not hit this problem.)
The first time I do this, things work fine:

$ s3sync.rb -s -r -v --delete ./test/ vpc_test:backup_a/test

And all my files are on S3:

$ s3cmd.rb list vpc_test

However, when I run this again (with no changes to the local source directories), the first command deletes the directory that the second command put on S3:

$ s3sync.rb -s -r -v --delete ./test/ vpc_test:backup_a/test

And we see that the files have been removed from S3:

$ s3cmd.rb list vpc_test

The second command restores the files from its directory:

$ s3sync.rb -s -r -v --delete ./test2/ vpc_test:backup_a/test2

And we see all the files back in S3:

$ s3cmd.rb list vpc_test

Note that exactly the same behavior is exhibited if I use the following format of the s3sync.rb commands:

s3sync.rb -s -r -v --delete ./test vpc_test:backup_a

To me, this isn't the way that it should work. Am I missing something, or is this a subtle bug? Thanks for any insight.

Rob

Title: Re: s3sync.rb deleting sibling directories on when syncing to S3
Post by: ferrix on April 02, 2008, 03:29:27 AM

Sure seems like a bug. Instead of test and test2, what if you use "test" and "apple", or some other pair where one name doesn't begin with the other? It shouldn't matter, but I wonder if the S3 listing logic is returning all of "test*" when we expect "test/*".
Let me know if that makes a difference. As may be apparent, I haven't had any time for this project lately ......

Title: Re: s3sync.rb deleting sibling directories on when syncing to S3
Post by: RobS on April 02, 2008, 04:59:08 PM

Thanks for your quick response. I thought that the behavior seemed weird.
As you requested, I repeated the steps in my previous post with the two directories being 'test' and 'apple' instead of 'test' and 'test2' (I just renamed 'test2' to 'apple' for this test). As you can see below, the behavior of the program does change when the directories are named this way. The program now works as I would expect: on the second run of the two s3sync.rb commands, no files are deleted or created.

$ s3cmd.rb list vpc_test

Title: Re: s3sync.rb deleting sibling directories on when syncing to S3
Post by: ferrix on April 02, 2008, 11:08:26 PM

I still call it a bug, but this helps explain why it hasn't been caught for so long.
Title: Re: s3sync.rb deleting sibling directories on when syncing to S3
Post by: RobS on April 03, 2008, 02:56:52 PM

I dug into this a little more and have a theory as to what is going on. This is from a pretty cursory look at the source code, so I may be way off. Corrections are appreciated.
It looks like what is happening is that when the 'list_bucket vpc_test max-keys 200 prefix backup_a/test delimiter / with 100 retries left' command is called, S3 responds with both "backup_a/test/" and "backup_a/test2/" as common prefixes. This is technically correct if S3 treats the key as a plain string, does a simple string comparison against the specified prefix, and pays no attention to the slash (/) being a directory delimiter. However, it appears that s3sync.rb expects S3 to do the prefix match with the slash delimiters taken into account. (My quick reading of the S3 docs would seem to indicate that S3 is behaving as documented. However, it was just a quick look, and I've never programmed with S3 before, so I could easily have read it wrong.) That appears to be what is throwing s3sync.rb off in this case.

Thoughts?

Rob

Title: Re: s3sync.rb deleting sibling directories on when syncing to S3
Post by: RobS on April 03, 2008, 04:27:39 PM

Based upon the theory I presented in my previous post, I looked at the source a little more, and I have an idea as to how to fix this. Note that this is based upon a very cursory glance at the code and my theory of the problem as stated in my previous post. This could easily be wrong or break other things. Please advise if this is an ill-conceived fix!
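The listing behavior suspected in the theory above can be illustrated with a small self-contained model. This is only a sketch: `list_keys` is a hypothetical helper that imitates prefix-plus-delimiter listing over an in-memory array, it is not code from s3sync.rb and makes no S3 calls, and the file names under each directory are invented for illustration.

```ruby
# Toy model of a bucket listing with a prefix and a delimiter.
# list_keys is a hypothetical helper, not part of s3sync.rb or any S3 library.
def list_keys(keys, prefix, delimiter)
  entries = []
  common_prefixes = []
  keys.sort.each do |key|
    # The prefix is compared as a plain string; "/" has no special meaning here.
    next unless key.start_with?(prefix)
    rest = key[prefix.length..-1]
    cut = rest.index(delimiter)
    if cut
      # Roll the key up to the first delimiter that follows the prefix.
      common_prefixes << prefix + rest[0..cut]
    else
      entries << key
    end
  end
  { entries: entries, common_prefixes: common_prefixes.uniq }
end

# Invented sample keys; only the directory names come from the thread.
keys = [
  "backup_a/test/a.txt",
  "backup_a/test2/b.txt",
  "backup_a/apple/c.txt"
]

bare = list_keys(keys, "backup_a/test", "/")
p bare[:common_prefixes]      # => ["backup_a/test/", "backup_a/test2/"]

slashed = list_keys(keys, "backup_a/test/", "/")
p slashed[:common_prefixes]   # => []
p slashed[:entries]           # => ["backup_a/test/a.txt"]
```

With the bare prefix, "test2" is swept in because it merely begins with "test"; "apple" never matches, which is consistent with the test/apple experiment earlier in the thread.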
It looks like this can be fixed by checking the items returned by the list_bucket command (both the entries and common_prefix_entries) to see if the part of the name beyond the prefix starts with a slash (/). If it does, then we're on a directory boundary and things will work fine. If it doesn't, then this is one of the cases that was causing the problem, so we should just skip that item.

There is already code to extract the part of the name beyond the prefix to check for excludes, so I just used this excludePath variable to do the check for the initial slash. Then I added some if logic around the code that calls S3Node.new or recurses into s3TreeRecurse again (depending on the item type) to skip these calls when the prefix doesn't align with a directory boundary. There's probably a more elegant way to do what I did, but for a quick test this seems to work. The following patch shows my implementation:

--- s3sync.rb 2008-01-06 10:25:55.000000000 -0500

This appears to fix the problem that I'm seeing. Someone who knows the code should check this carefully to make sure that it doesn't mess up anything else.

Thoughts?

Rob
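The boundary check described in this post can be sketched roughly as follows. This is not Rob's patch and not code from s3sync.rb: `on_directory_boundary?` is a hypothetical helper, and the treatment of a prefix that already ends in a slash (or of a name equal to the prefix) is an assumption added so the example is complete.

```ruby
# Hypothetical helper illustrating the "skip items off a directory boundary" idea.
# Returns true when the listed name sits at or below the prefix directory.
def on_directory_boundary?(name, prefix)
  return false unless name.start_with?(prefix)
  # Assumption: an empty prefix or one ending in "/" is already on a boundary.
  return true if prefix.empty? || prefix.end_with?("/")
  # The part of the name beyond the prefix (the role excludePath plays in the post).
  rest = name[prefix.length..-1]
  # Assumption: a name equal to the prefix is kept as well.
  rest.empty? || rest.start_with?("/")
end

prefix = "backup_a/test"
listed = ["backup_a/test/", "backup_a/test2/"]

# Keep only the items that belong to the directory being synced.
kept = listed.select { |name| on_directory_boundary?(name, prefix) }
p kept   # => ["backup_a/test/"]
```

Filtering on the client side like this leaves the listing request unchanged, which matches the approach described in the post; whether it interacts badly with other parts of s3sync.rb is not something this sketch can show.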