S3Sync.net

General Category => Report Bugs => Topic started by: RobS on April 01, 2008, 01:31:10 PM



Title: s3sync.rb deleting sibling directories when syncing to S3
Post by: RobS on April 01, 2008, 01:31:10 PM
I just started using s3sync.rb to back up the files on my web server to S3.  I really like the program, and its features are exactly what I was looking for.

I wrote a little wrapper script to run the program to selectively back up directories from my web server. When I ran this, I found an interesting problem with the program that may be me doing something wrong or may be a bug.  I'm not sure.

The problem occurs when s3sync.rb is run in delete mode, syncing to a target directory in S3 that has sibling directories whose names start with the same characters as the target directory.  When this occurs, s3sync.rb removes the other directories, and I don't think that it should.  (However, I'm new at this and I want to see what you think.)  For example, if a directory on S3 contains the subdirectories test and test2, syncing test removes test2.

What appears to be happening is that s3sync.rb is matching the first part of the directory name and not taking into account that it is not at a directory boundary (/).
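To make that suspicion concrete, here is a minimal sketch in plain Ruby.  It is not taken from s3sync.rb; the helper names naive_match and boundary_match are made up for illustration, and the keys are modelled on the example further down.  It only contrasts a plain string prefix match with one that respects the / boundary.

```ruby
# Illustrative only: these helpers are hypothetical and not part of s3sync.rb.

# Plain string comparison: any key that begins with the prefix matches.
def naive_match(keys, prefix)
  keys.select { |key| key.start_with?(prefix) }
end

# Boundary-aware comparison: the prefix must be the whole key or be
# followed immediately by a '/'.
def boundary_match(keys, prefix)
  keys.select { |key| key == prefix || key.start_with?(prefix + '/') }
end

keys = ['backup_a/test/test_file', 'backup_a/test2/dir/file_c']

p naive_match(keys, 'backup_a/test')     # includes the test2 key as well
p boundary_match(keys, 'backup_a/test')  # only the key under test/
```

With the plain match, the key under test2 is treated as if it belonged to test, which would explain why a --delete run on test considers it stale.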

This is kind of hard to explain, so let me give you an example that shows what is happening pretty clearly.

I have a local directory with the following files:
$ ls -al
total 10
drwxrwxr-x   5 vpcweb vpcweb 2048 Mar 31 17:46 .
drwxr-x---  20 vpcweb nobody 2048 Mar 31 17:45 ..
drwxrwxr-x   4 vpcweb vpcweb 2048 Mar 31 16:31 test
drwxrwxr-x   4 vpcweb vpcweb 2048 Mar 19 18:19 test2
drwxrwxr-x   2 vpcweb vpcweb 2048 Mar 31 17:46 test3
$

The whole directory tree looks like the following:
$ find .
.
./test
./test/test_file2
./test/test_file
./test/dir_abc
./test/dir_abc/test_file_ab
./test/dir_abc/test_file_a
./test/dir_a
./test/dir_a/file2
./test/dir_a/dir2
./test/dir_a/dir2/file1
./test/dir_a/dir2/dir3
./test/dir_a/dir2/dir3/file
./test3
./test2
./test2/dir
./test2/dir/file_c
./test2/test_dir
./test2/test_dir/test_file2
$

My S3 bucket is currently empty:
$ s3cmd.rb list vpc_test
--------------------
$

I want to sync the test and test2 directories to S3 using s3sync.rb but not the test3 directory.  To do this I run s3sync.rb twice: once to back up the test directory and once to back up the test2 directory.  (This example is a bit contrived but it should get the point across.  For a case this simple you could probably use exclude to do the same thing in one run and not hit this problem.)

The first time I do this, things work fine:
$ s3sync.rb -s -r -v --delete ./test/ vpc_test:backup_a/test
Create node dir_a
Create node dir_a/dir2
Create node dir_a/dir2/dir3
Create node dir_a/dir2/dir3/file
Create node dir_a/dir2/file1
Create node dir_a/file2
Create node dir_abc
Create node dir_abc/test_file_a
Create node dir_abc/test_file_ab
Create node test_file
Create node test_file2
$

$ s3sync.rb -s -r -v --delete ./test2/ vpc_test:backup_a/test2
Create node dir
Create node dir/file_c
Create node test_dir
Create node test_dir/test_file2
$

And all my files are on S3:
$ s3cmd.rb list vpc_test
--------------------
backup_a/test/dir_a
backup_a/test/dir_a/dir2
backup_a/test/dir_a/dir2/dir3
backup_a/test/dir_a/dir2/dir3/file
backup_a/test/dir_a/dir2/file1
backup_a/test/dir_a/file2
backup_a/test/dir_abc
backup_a/test/dir_abc/test_file_a
backup_a/test/dir_abc/test_file_ab
backup_a/test/test_file
backup_a/test/test_file2
backup_a/test2/dir
backup_a/test2/dir/file_c
backup_a/test2/test_dir
backup_a/test2/test_dir/test_file2
$

However, when I run this again (with no changes to the local source directories) the first command deletes the directory that the second command put on S3:
$ s3sync.rb -s -r -v --delete ./test/ vpc_test:backup_a/test
Remove node 2/dir/file_c
Remove node 2/test_dir/test_file2
Remove node 2/test_dir
Remove node 2/dir
$

And we see that the files have been removed from S3:
$ s3cmd.rb list vpc_test
--------------------
backup_a/test/dir_a
backup_a/test/dir_a/dir2
backup_a/test/dir_a/dir2/dir3
backup_a/test/dir_a/dir2/dir3/file
backup_a/test/dir_a/dir2/file1
backup_a/test/dir_a/file2
backup_a/test/dir_abc
backup_a/test/dir_abc/test_file_a
backup_a/test/dir_abc/test_file_ab
backup_a/test/test_file
backup_a/test/test_file2
$

The second command restores the files from its directory:
$ s3sync.rb -s -r -v --delete ./test2/ vpc_test:backup_a/test2
Create node dir
Create node dir/file_c
Create node test_dir
Create node test_dir/test_file2
$

And we see all the files back in S3:
$ s3cmd.rb list vpc_test
--------------------
backup_a/test/dir_a
backup_a/test/dir_a/dir2
backup_a/test/dir_a/dir2/dir3
backup_a/test/dir_a/dir2/dir3/file
backup_a/test/dir_a/dir2/file1
backup_a/test/dir_a/file2
backup_a/test/dir_abc
backup_a/test/dir_abc/test_file_a
backup_a/test/dir_abc/test_file_ab
backup_a/test/test_file
backup_a/test/test_file2
backup_a/test2/dir
backup_a/test2/dir/file_c
backup_a/test2/test_dir
backup_a/test2/test_dir/test_file2
$

Note that exactly the same behavior is exhibited if I use the following format of the s3sync.rb commands:
   s3sync.rb -s -r -v --delete ./test vpc_test:backup_a
   s3sync.rb -s -r -v --delete ./test2 vpc_test:backup_a

To me, this isn't the way that it should work.  Am I missing something or is this a subtle bug?

Thanks for any insight.

Rob




Title: Re: s3sync.rb deleting sibling directories when syncing to S3
Post by: ferrix on April 02, 2008, 03:29:27 AM
Sure seems like a bug.  What happens if, instead of test and test2, you use "test" and "apple" or some other pair where neither name begins with the other?  It shouldn't matter, but I wonder if the S3 listing logic is returning everything matching "test*" when we expect "test/*".

Let me know if that makes a difference.  As may be apparent, I haven't had any time for this project lately...


Title: Re: s3sync.rb deleting sibling directories when syncing to S3
Post by: RobS on April 02, 2008, 04:59:08 PM
Thanks for your quick response.  I thought that the behavior seemed weird.

As you requested, I repeated the steps in my previous post with the two directories named 'test' and 'apple' instead of 'test' and 'test2' (I just renamed 'test2' to 'apple' for this test).  As you can see below, the behavior of the program does change when the directories are named this way.  The program now works as I would expect: on the second run of the two s3sync.rb commands, no files are deleted or created.

$ s3cmd.rb list vpc_test
--------------------

$ s3sync.rb -s -r -v --delete ./test/ vpc_test:backup_a/test
Create node dir_a
Create node dir_a/dir2
Create node dir_a/dir2/dir3
Create node dir_a/dir2/dir3/file
Create node dir_a/dir2/file1
Create node dir_a/file2
Create node dir_abc
Create node dir_abc/test_file_a
Create node dir_abc/test_file_ab
Create node test_file
Create node test_file2

$ s3sync.rb -s -r -v --delete ./apple/ vpc_test:backup_a/apple
Create node dir
Create node dir/file_c
Create node test_dir
Create node test_dir/test_file2

$ s3cmd.rb list vpc_test
--------------------
backup_a/apple/dir
backup_a/apple/dir/file_c
backup_a/apple/test_dir
backup_a/apple/test_dir/test_file2
backup_a/test/dir_a
backup_a/test/dir_a/dir2
backup_a/test/dir_a/dir2/dir3
backup_a/test/dir_a/dir2/dir3/file
backup_a/test/dir_a/dir2/file1
backup_a/test/dir_a/file2
backup_a/test/dir_abc
backup_a/test/dir_abc/test_file_a
backup_a/test/dir_abc/test_file_ab
backup_a/test/test_file
backup_a/test/test_file2

$ s3sync.rb -s -r -v --delete ./test/ vpc_test:backup_a/test

$ s3sync.rb -s -r -v --delete ./apple/ vpc_test:backup_a/apple

$ s3cmd.rb list vpc_test
--------------------
backup_a/apple/dir
backup_a/apple/dir/file_c
backup_a/apple/test_dir
backup_a/apple/test_dir/test_file2
backup_a/test/dir_a
backup_a/test/dir_a/dir2
backup_a/test/dir_a/dir2/dir3
backup_a/test/dir_a/dir2/dir3/file
backup_a/test/dir_a/dir2/file1
backup_a/test/dir_a/file2
backup_a/test/dir_abc
backup_a/test/dir_abc/test_file_a
backup_a/test/dir_abc/test_file_ab
backup_a/test/test_file
backup_a/test/test_file2



Title: Re: s3sync.rb deleting sibling directories when syncing to S3
Post by: ferrix on April 02, 2008, 11:08:26 PM
I still call it a bug, but this helps explain why it hasn't been caught for so long.


Title: Re: s3sync.rb deleting sibling directories when syncing to S3
Post by: RobS on April 03, 2008, 02:56:52 PM
I dug into this a little more and have a theory as to what is going on.  This is from a pretty cursory look at the source code, so I may be way off.  Corrections are appreciated.

It looks like when the 'list_bucket vpc_test max-keys 200 prefix backup_a/test delimiter / with 100 retries left' command is called, S3 responds with both "backup_a/test/" and "backup_a/test2/" as common prefixes.  This is technically correct if S3 treats the key as a plain string, does a simple string comparison against the specified prefix, and pays no attention to slashes (/) as directory delimiters.  However, s3sync.rb appears to expect S3 to match the prefix only at slash boundaries.  (My quick reading of the S3 docs suggests that S3 is behaving as documented.  However, it was just a quick look and I've never programmed with S3 before, so I could easily have read it wrong.)  That mismatch appears to be what throws s3sync.rb off in this case.
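Here is a rough local simulation of the listing behavior described above, in plain Ruby with no network calls.  The function simulate_list is hypothetical (it is neither the S3 API nor anything in s3sync.rb), it ignores max-keys and paging, and it only encodes my reading of the prefix and delimiter rules, so treat it as a sketch of the theory rather than a statement of what S3 does.

```ruby
# Hypothetical stand-in for a prefix + delimiter listing.
# Returns [entries, common_prefixes]:
#   entries         - matching keys with no delimiter after the prefix
#   common_prefixes - each distinct key start, up to and including the
#                     first delimiter found after the prefix
def simulate_list(keys, prefix, delimiter)
  entries = []
  common_prefixes = []
  keys.sort.each do |key|
    # The prefix test is a plain string comparison; '/' is not special here.
    next unless key.start_with?(prefix)
    rest = key[prefix.length..-1]
    cut = rest.index(delimiter)
    if cut.nil?
      entries << key
    else
      common_prefixes << prefix + rest[0, cut + delimiter.length]
    end
  end
  [entries, common_prefixes.uniq]
end

keys = [
  'backup_a/test/dir_a',
  'backup_a/test/test_file',
  'backup_a/test2/dir',
  'backup_a/test2/dir/file_c'
]

entries, prefixes = simulate_list(keys, 'backup_a/test', '/')
p entries   # no plain entries at this level
p prefixes  # both the test/ and test2/ prefixes are reported
```

Under these assumptions, asking for prefix backup_a/test yields both common prefixes, whereas asking for backup_a/test/ (with the trailing slash) yields only the keys under test.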

Thoughts?

Rob


Title: Re: s3sync.rb deleting sibling directories when syncing to S3
Post by: RobS on April 03, 2008, 04:27:39 PM
Based upon the theory I presented in my previous post, I looked at the source a little more and I have an idea as to how to fix this.  Note that this is based upon a very cursory glance at the code and my theory of the problem as stated in my previous post.  This could easily be wrong or break other things.  Please advise if this is an ill-conceived fix! 

It looks like this can be fixed by checking the items returned by the list_bucket command (both the entries and common_prefix_entries) to see if the part of the name beyond the prefix starts with a slash (/).  If it does, then we're on a directory boundary and things will work fine.  If it doesn't, then this is one of those cases that was causing the problem, so we should just skip that item.

There is already code to extract the part of the name beyond the prefix to check for excludes.  So, I just used this excludePath variable to do the check for the initial slash.  Then I added some if logic around the code that calls S3Node.new or recurses into s3TreeRecurse again (depending on the item type) to skip these calls when the prefix doesn't align with a directory boundary.  There's probably a more elegant way to do what I did, but for a quick test this seems to work.

The following patch shows my implementation:
--- s3sync.rb   2008-01-06 10:25:55.000000000 -0500
+++ s3sync.rb.mod   2008-04-03 17:08:51.000000000 -0400
@@ -314,22 +314,30 @@
                   if not (item.kind_of? String)
                      # this is an item
                      excludePath = item.name.slice($S3SyncOriginalS3Prefix.length...item.name.length)
-                     if $S3SyncExclude and $S3SyncExclude.match(excludePath)
-                        debug("skipping S3 item #{excludePath} due to --exclude")
+                     if !excludePath.empty? && excludePath[0,1] != '/'
+                        debug("file not on directory boundary. skipped")
                      else
-                        debug("S3 item #{item.name}")
-                        g.yield(S3Node.new(bucket, prefix, item))
+                        if $S3SyncExclude and $S3SyncExclude.match(excludePath)
+                           debug("skipping S3 item #{excludePath} due to --exclude")
+                        else
+                           debug("S3 item #{item.name}")
+                           g.yield(S3Node.new(bucket, prefix, item))
+                        end
                      end
                   else
                      # it's a prefix (i.e. there are sub keys)
                      partialPath = item.slice(prefix.length..item.length) # will have trailing slash
                      excludePath = item.slice($S3SyncOriginalS3Prefix.length...item.length)
-                     # recurse
-                     if $S3SyncExclude and $S3SyncExclude.match(excludePath)
-                        debug("skipping prefix #{excludePath} due to --exclude")
+                     if !excludePath.empty? && excludePath[0,1] != '/'
+                        debug("file not on directory boundary. skipped")
                      else
-                        debug("prefix found: #{partialPath}")
-                        s3TreeRecurse(g, bucket, prefix, partialPath) if $S3syncOptions['--recursive']
+                        # recurse
+                        if $S3SyncExclude and $S3SyncExclude.match(excludePath)
+                           debug("skipping prefix #{excludePath} due to --exclude")
+                        else
+                           debug("prefix found: #{partialPath}")
+                           s3TreeRecurse(g, bucket, prefix, partialPath) if $S3syncOptions['--recursive']
+                        end
                      end
                   end
                end

This appears to fix the problem that I'm seeing.  Someone who knows the code should check this carefully to make sure that it doesn't mess up anything else.
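For anyone who wants to poke at the boundary rule outside of s3sync.rb, here is a minimal stand-alone sketch.  The helper name on_directory_boundary? is hypothetical and does not exist in s3sync.rb.  Unlike the patch, the sketch also tolerates an empty prefix and a prefix with a trailing slash; I have not checked whether s3sync.rb normalizes those cases before this code runs, so that part is an assumption on my side.

```ruby
# Hypothetical helper, not part of s3sync.rb.
# Returns true when `name` is the prefix itself or sits below it on a '/'
# boundary, and false for look-alike siblings such as "test2" vs "test".
def on_directory_boundary?(name, original_prefix)
  # Assumption: a trailing slash on the prefix carries no extra meaning.
  base = original_prefix.chomp('/')
  # Assumption: with no prefix at all, every key in the bucket is in scope.
  return true if base.empty?
  return false unless name.start_with?(base)
  remainder = name[base.length..-1]
  remainder.empty? || remainder.start_with?('/')
end

puts on_directory_boundary?('backup_a/test/dir_a', 'backup_a/test')  # true
puts on_directory_boundary?('backup_a/test2/dir', 'backup_a/test')   # false
```

The first two lines of the check are where this differs from the patch, which tests only whether the text after $S3SyncOriginalS3Prefix is empty or begins with a slash.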

Thoughts?

Rob