General Category => Feature Requests => Topic started by: ferrix on February 20, 2007, 06:29:38 AM

Title: Option to encrypt stored data using gpg
Post by: ferrix on February 20, 2007, 06:29:38 AM
This also implies some more work to cache the etag value of the encrypted contents locally, or else using only the modify date in the comparator.

Could also store the unencrypted md5 sum to S3 as meta data, though that doesn't come down in the bucket list, so it's not clear that it would be too useful.

I think the "check date first, then issue HEAD to check md5" approach would probably be fast enough.

Edit 6/23/2008: The "--no-md5" option has since been added, which uses only the modify date and file size, as mentioned above.  For encryption it may be sufficient to force this option, rather than doing some kind of wacky md5/etag cache.
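
For illustration, the "check date first, then issue HEAD to check md5" comparison could look roughly like the Ruby sketch below. It is not s3sync's actual code: needs_upload?, its parameters, and the block standing in for the HEAD request are hypothetical names, and it assumes the unencrypted md5 can be read from the HEAD response.

```ruby
require 'digest/md5'

# Hypothetical comparator, not taken from s3sync.rb.
# remote_mtime: modify date known from the bucket listing (a Time).
# The block stands in for the expensive HEAD request and should return
# the md5 of the unencrypted contents as a hex string.
def needs_upload?(local_path, remote_mtime, use_md5: true)
  # Cheap check first: a local file that is not newer is assumed unchanged.
  return false if File.mtime(local_path) <= remote_mtime
  # "--no-md5" style behaviour: trust the date alone.
  return true unless use_md5 && block_given?
  # Only now pay for the HEAD request and the local md5 calculation.
  remote_md5 = yield
  remote_md5 != Digest::MD5.file(local_path).hexdigest
end
```

With use_md5: false the block is never called, which is the behaviour one would want to force for encrypted uploads.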

Title: Re: Option to encrypt stored data using gpg
Post by: lowflyinghawk on February 20, 2007, 10:22:19 AM
my 2 cents: I don't think encrypting/decrypting the stored data is the business of an rsync lookalike.  you would be asking for trouble from users who forgot their keys, changed or buggy encryption code in a library, etc etc etc.  you could provide a utility that does the encryption in a separate process (still asking for trouble) or just point the users to things like GPG and its many wrappers.  any time your code is transforming the user's data, especially when it is not reversible, is asking for a big helping of blame, hold the mustard.

of course encrypting the stored data is a separate question from transferring the files using SSL, which I *do* think s3sync should (and already does) support.

Title: Re: Option to encrypt stored data using gpg
Post by: xpg on April 11, 2007, 02:53:34 AM
I see your point lowflyinghawk, but I still think that having encryption of the data stored to S3 is a rather good idea. It is true that using encryption requires a bit more thought with regards to key and algorithm handling. My guess is that there are quite a lot of paranoid people out there who would enjoy such a feature. (I for one would like the added security.) But you might have a very relevant point about whether this is something an rsync-like program should do; it would certainly make s3sync.rb more of a dedicated backup program.

With regards to the storage of the unencrypted md5 sum, ferrix, I have noticed that s3sync.rb creates nodes for each directory. This might be an appropriate place for storing the unencrypted md5sum, rather than having to issue separate HEAD requests for each file. But then again, I might have missed the point entirely :-)

Title: Re: Option to encrypt stored data using gpg
Post by: fredo on December 07, 2007, 04:21:22 PM
Here's my shot at implementing encryption.

It's probably not very well coded and I'm not completely satisfied with the way it's implemented (see todo below), but it works. If somebody has time (or is brave enough) to give it a try, I'd be glad to hear their ideas to improve it.

I have changed :
  • s3try.rb (to catch decryption errors),
  • HTTPStreaming.rb (to add a CryptedStream class on the same model as the ProgressStream class),
  • s3sync.rb (renamed here s3syncC.rb).

All other files are unchanged. It also requires the openssl and digest/sha2 ruby libraries, but they're usually bundled in the ruby package.

  • Encryption is only used on the file contents; The file names, directories and symlinks are not encrypted.
  • Encryption is used when uploading files to S3 if there is an $ENCRYPTION_ALGO constant set in the config.yml file (pointing to the desired openssl encryption algorithm, for example "aes-256-cbc"). Additionally, you may set an $ENCRYPTION_KEY constant for your password, though you may also type in the password at runtime if you're not comfortable storing your password in clear text in the config file.
  • Decryption is used when downloading from S3 if the "encrypted" flag is set in the metadata (this flag is set automatically when uploaded). No other metadata is created, not even the unencrypted file md5. For this to work, s3sync.rb calculates both crypted and uncrypted md5 before assessing if a given file needs a refresh (not good - see todo). If the password is incorrect, an error is thrown and the local file (if there is one) is not overwritten.
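
To make the idea of encrypting file contents on the fly more concrete, here is a minimal Ruby sketch of chunked encryption and decryption with the openssl library. It is not the code from the attached patch. The method names, deriving the key with a plain SHA-256 of the password, and prepending a random IV to the ciphertext are all assumptions made for the example; a hardened version would more likely use a salted key-derivation function.

```ruby
require 'openssl'
require 'digest/sha2'

# Hypothetical helper: builds a cipher for the given direction.
# algo plays the role of the $ENCRYPTION_ALGO setting.
def build_cipher(direction, password, iv, algo = 'aes-256-cbc')
  cipher = OpenSSL::Cipher.new(algo)
  direction == :encrypt ? cipher.encrypt : cipher.decrypt
  cipher.key = Digest::SHA256.digest(password) # 32 bytes (assumed derivation)
  cipher.iv = iv
  cipher
end

# Reads input in chunks and writes the IV followed by the ciphertext.
def encrypt_stream(input, output, password, chunk_size = 16 * 1024)
  iv = OpenSSL::Random.random_bytes(16)
  cipher = build_cipher(:encrypt, password, iv)
  output.write(iv)
  while (chunk = input.read(chunk_size))
    output.write(cipher.update(chunk))
  end
  output.write(cipher.final) # flushes the last padded block
end

# Reverses encrypt_stream: reads the IV, then decrypts chunk by chunk.
def decrypt_stream(input, output, password, chunk_size = 16 * 1024)
  iv = input.read(16)
  cipher = build_cipher(:decrypt, password, iv)
  while (chunk = input.read(chunk_size))
    output.write(cipher.update(chunk))
  end
  # A wrong password typically surfaces here as OpenSSL::Cipher::CipherError.
  output.write(cipher.final)
end
```

Writing the decrypted output to a temporary file and renaming it only after the final block succeeds is one way to avoid overwriting the local file when the password is wrong.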

Note that unencrypted files will be handled well too, but once you start to use encryption you cannot revert to the official s3sync.rb as it will not recognise that a file on S3 is encrypted and will update local files with encrypted data. You have been warned !

Todo :

- either store uncrypted file md5 as metadata or optimize the comparison process (with this version each file is encrypted once for md5 comparison, and possibly a second time if the file needs to be uploaded - it could cache the encrypted file -). Though usually the speed of the syncing process is bandwidth-bound so encrypting twice is not slowing things too much.
- have a command line option to force encryption or no encryption
- other ?

Title: Re: Option to encrypt stored data using gpg
Post by: fredo on December 13, 2007, 05:54:43 AM
My second try, this time by storing the unencrypted MD5 as metadata.

Files changed from official version : s3try.rb (same as try #1 above), HTTPStreaming.rb (some changes to CryptedStream class), and s3sync.rb.

On the plus side : 1) less load on the CPU, as the local files do not have to be encrypted to compare MD5 with S3. This makes a difference with the prior approach when comparing identical files between S3 and local, where the process could be CPU-bound instead of bandwidth-bound; 2) far fewer changes to the official version (about 10 lines of code added to s3sync.rb).

Downside : a get headers command is necessary for each file, this slows down the process noticeably (x2 when files are both present on S3 and locally, no changes otherwise)

Configuration variables are unchanged.
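
As a rough sketch of the metadata approach described in this post, the Ruby below shows how the unencrypted MD5 might be attached on upload and picked out of the response headers when comparing. The header names "x-amz-meta-encrypted" and "x-amz-meta-plain-md5" and the method names are made up for the example (only the "x-amz-meta-" prefix is S3's convention for user metadata). It also assumes, as the thread does, that the ETag of an unencrypted object is its MD5.

```ruby
require 'digest/md5'

# Metadata to send along with an encrypted upload (hypothetical header names).
def upload_metadata(local_path)
  {
    'x-amz-meta-encrypted' => 'true',
    'x-amz-meta-plain-md5' => Digest::MD5.file(local_path).hexdigest
  }
end

# Picks the md5 of the unencrypted contents out of the headers of a
# "get headers" (HEAD) response, given as a Hash with lowercase keys.
def remote_plain_md5(headers)
  if headers['x-amz-meta-encrypted'] == 'true'
    headers['x-amz-meta-plain-md5']
  else
    # Unencrypted object: the ETag is assumed to be the quoted md5.
    headers['etag'].to_s.delete('"')
  end
end

# True when the local file matches what is stored remotely.
def in_sync?(local_path, headers)
  Digest::MD5.file(local_path).hexdigest == remote_plain_md5(headers)
end
```

The local file is hashed once and never encrypted for the comparison, which is where the CPU saving comes from; the price is the extra header request per file.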

Title: Re: Option to encrypt stored data using gpg
Post by: edalquist on December 13, 2007, 05:32:48 PM
That is a great patch fredo and exactly what I was looking for. I want to copy a backup to s3 and having the backup tool understand how to do the encryption makes the process much more efficient. It would be great if this feature could get into the core version.

Title: Re: Option to encrypt stored data using gpg
Post by: ferrix on December 14, 2007, 07:02:25 PM
Patience :)

Title: Re: Option to encrypt stored data using gpg
Post by: jh on January 28, 2008, 09:08:24 AM
I think the right way to do this is to allow an arbitrary filter before s3 writes and after s3 reads. 

Instead of the current:

Read File --> Put File on S3;  and Get File from S3 --> Write File

we'd have:

Read File --> Arbitrary Filter --> Put File; and Get File --> Arbitrary Filter --> Write File

Reasonable choices for the filters are "gzip -f" and "gunzip" (to cut down on bandwidth and other costs), and some variation of gpg that doesn't require user input.  (Is there such a thing?)
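
A minimal Ruby sketch of the filter idea follows, under some loud assumptions: put_file, get_file and command_filter are hypothetical names, a plain Hash stands in for S3, and whole files are held in memory rather than streamed. A filter is anything that responds to call and maps bytes to bytes.

```ruby
require 'zlib'
require 'open3'

# Read File --> Arbitrary Filter --> Put File (store is a Hash standing in for S3).
def put_file(path, store, key, filter = nil)
  data = File.binread(path)
  store[key] = filter ? filter.call(data) : data
end

# Get File --> Arbitrary Filter --> Write File.
def get_file(store, key, path, filter = nil)
  data = store[key]
  File.binwrite(path, filter ? filter.call(data) : data)
end

# Wraps an external command as a filter: data goes to its stdin and the
# filtered result is read from its stdout.
def command_filter(*argv)
  lambda do |data|
    out, status = Open3.capture2(*argv, stdin_data: data, binmode: true)
    raise "filter failed: #{argv.join(' ')}" unless status.success?
    out
  end
end

# In-process stand-ins for "gzip" and "gunzip".
gzip = Zlib.method(:gzip)
gunzip = Zlib.method(:gunzip)
```

With this shape, put_file(path, store, key, gzip) and get_file(store, key, path, gunzip) give the compressed round trip, and command_filter('gzip', '-c') or a non-interactive encryption command could be swapped in without touching the transfer code.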