S3Sync.net
Author Topic: hard links
lowflyinghawk
« on: February 19, 2007, 09:39:28 PM »

as I recall, s3sync doesn't attempt to deal with hard links, i.e. it just copies the data multiple times.  note that hard links exist in linux, mac, and (yes) ntfs filesystems.  the code below will detect them on linux; I don't know about mac, and it's not likely to work on windows.  note that, just like rsync, it doesn't try to find links that point outside the directory tree being examined, because that would mean looking at the entire filesystem to be sure.

once detected, I imagine some bookkeeping could be invented to save multiple copies by keeping notes in metadata (a hypothetical sketch of that follows the script below).  if hard links worked, it would be a lot easier to come up with a snapshot-style backup utility built on s3sync, the same as you would with rsync.

note this is *not* a feature request...I'm not using s3sync myself ;-).

--- cut ---
#!/usr/bin/ruby -w

require 'find'
require 'yaml'

def find_hard_links(path)
  links = Hash.new { |hash, key| hash[key] = [] }
  Find.find(path) do |f|
    if File.file?(f)
      s = File.stat(f)
      # nlink > 1 means other directory entries share this inode
      links[s.ino].push(f) if s.nlink > 1
    end
  end
  links
end

path = ARGV[0] || "."

# links is a hash with key = inode number and value = array of paths
links = find_hard_links(path)
YAML::dump(links,$stdout)
--- cut ---
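
to make the bookkeeping idea above concrete, here is roughly what I mean (purely hypothetical: the 's3' client object and the 'hardlink-to' metadata key are made up, not anything s3sync actually has).  the first path seen on an inode carries the real bytes; every later link becomes an empty stub whose metadata names the canonical key:

--- cut ---
require 'find'

# hypothetical sketch: upload each inode's data once; hardlinked
# duplicates become empty stubs whose metadata points at the canonical
# key.  's3' is an imaginary client with put(key, data, metadata).
def upload_tree(s3, path)
  canonical = {}  # inode number => key holding the actual data
  Find.find(path) do |f|
    next unless File.file?(f)
    st = File.stat(f)
    if st.nlink > 1 && canonical[st.ino]
      # bookkeeping only: empty body, metadata names the real copy
      s3.put(f, "", "x-amz-meta-hardlink-to" => canonical[st.ino])
    else
      s3.put(f, File.read(f), {})
      canonical[st.ino] = f if st.nlink > 1
    end
  end
end
--- cut ---

restore would then have to turn the stubs back into links, which is its own can of worms.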
ferrix
(I am greg13070 on AWS forum)
« Reply #1 on: February 20, 2007, 06:21:17 AM »

There are so many obstacles to hard-link-style backups with S3; this is just one of them.

Before S3, I backed everything up incrementally into hard link directory structures using rsync, and loved it.  Every day it would create a new directory structure full of hard links to the previous day's, and then only "copy" the files that had changed.  That way I had some 20 inter-linked hot backups, and I backed them off geometrically so I had data going back a whole year!  That's the real goal for me with rsync, S3, and hard links.  I don't care about a couple of files that are hardlinked together; I want to leverage the concept of links to make the backups themselves more powerful.  See http://www.mikerubel.org/computers/rsync_snapshots/
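
The core of that trick, for anyone who hasn't seen the page, is just a rotation like this (a bare-bones sketch in ruby; the paths and the depth of the rotation are made up, and it assumes GNU cp and rsync are on the PATH):

--- cut ---
#!/usr/bin/ruby -w

# mikerubel-style snapshot rotation: hardlink yesterday's tree, then
# let rsync rewrite only the files that changed.

src  = "/home/me/"
dest = "/backups"

# shift the window: drop the oldest snapshot, age the rest
system("rm", "-rf", "#{dest}/backup.3")
2.downto(0) do |i|
  from = "#{dest}/backup.#{i}"
  File.rename(from, "#{dest}/backup.#{i + 1}") if File.exist?(from)
end

# backup.0 starts life as a forest of hardlinks into backup.1 (cheap),
# then rsync unlinks and rewrites just the changed files
system("cp", "-al", "#{dest}/backup.1", "#{dest}/backup.0") if File.exist?("#{dest}/backup.1")
system("rsync", "-a", "--delete", src, "#{dest}/backup.0/")
--- cut ---

Because rsync writes a changed file to a temp name and renames it into place, the hardlinked copy in backup.1 is left untouched; unchanged files stay as shared inodes and cost nothing.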

I think that once you get past all the obstacles to doing this on S3 (no renaming, no modification of metadata without a full re-upload, and a couple of others), you will have a structure that basically treats S3 like a block device... although you might not have thought of it that way, that's what the solution would entail.

So the reason I am not pursuing these ends is that I think one or more of the "FUSE"-type projects for S3 will eventually mature to the point where it can be used efficiently as an rsync backup target (at least with some modification).  I think trying to coerce s3sync into being "smart" enough to solve these issues itself is ultimately folly: it would warp the design/implementation so far that it would be trying to solve two rather incompatible problems with the same code base.

Would be thrilled to continue the discussion on this. 

P.S. You're not using s3sync?  Wow... you've got to be one of the leading contributors to the community (here, and back on the AWS thread).  I'm honored, but confused... why the interest if you're not using the tool?  Did you find something better?  Maybe I'll start using it ;-)

P.P.S. The memory usage of your innocent-looking ruby snippet means it can't scale to the level of what I was talking about with intra-backup hardlink structures -- not that you intended it to, necessarily.
lowflyinghawk
« Reply #2 on: February 20, 2007, 08:15:30 AM »

yes, I agree with you about the obstacles... I've been thinking about it for a while, but no finished idea has bloomed.  it's possible a filesystem could make it all simple, but I'm not holding my breath... s3 definitely has some issues when you want it to pretend to be a hard drive ;-).

the "find the hardlinks" code might be useful to warn users in a log message, e.g. "files 'x' 'y.blah'" are hardlinks, data will be duplicated".

memory usage: yes, if you had a full filesystem backed up using hard links, this would be a problem, but there is no way to find all the links without looking at all the files.  I suppose one could iterate over the files one inode at a time, spit out the result, then go again, but frankly I'm too lazy to go far with that... I have around 50,000 files in my HOME directory, and even if all of them were paired hardlinks, the memory usage wouldn't be enough to get excited about (50,000 inode-to-path entries at maybe a hundred bytes each is on the order of 5 MB).  if you had a million files it would be an issue, but if you have filesystems like that, you have bigger issues to think about.

me: I've been fooling around with s3 for two reasons: 1) I need to back up my pics, and 2) I wanted to learn ruby (or python, but ruby won out).  I started out with a little app, no classes, no rubyisms, etc., and it grew like Topsy.  soon I had a big app with one big class that did most of the work, a bunch of little helper classes, and 4,000 command-line switches.  I finally didn't like that much, so I refactored the whole thing into a bunch of obvious classes (bucket, service, ...) and smaller focused apps, e.g. s3mkbucket, s3rm, etc.  now the code uses 'yield' and blocks idiomatically, and as a bonus it doesn't fill up huge arrays with interim results, and the utilities are much more typical unix-like (small, focused scripts).  I learned quite a bit about ruby vs. c++ in the process, and I ended up with some useful gadgets.  fooling around with it also made me relearn some css and html so I could use s3 as a webserver for pics and whatnot.
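
the interim-array point is just the difference between these two shapes (a toy example, not my actual code):

--- cut ---
require 'find'

# builds the whole result in memory before the caller sees anything
def big_files_array(path)
  result = []
  Find.find(path) { |f| result << f if File.file?(f) && File.size(f) > 2**20 }
  result
end

# hands each match to the caller's block as it is found; no interim array
def each_big_file(path)
  Find.find(path) { |f| yield f if File.file?(f) && File.size(f) > 2**20 }
end

each_big_file(".") { |f| puts f }
--- cut ---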

I like participating in the s3sync discussions because I've learned a lot by so doing, even if I did occasionally broadcast my ignorance (e.g. SSL x.509 certs). 

why not s3sync?  as I said, one goal was to learn ruby, and you can't really do that just by looking at code.  what I ended up with is not really rsync-like, although it performs many of the same functions.  for example, none of my gadgets generate their own list of keys to archive, retrieve, etc.  on the other hand, my s3archive *does* look before leaping, i.e. it doesn't just blindly copy bits without checking what is there first, and my s3get is the same way: it looks first and only retrieves if necessary.
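
"looking first" boils down to a HEAD request plus a checksum compare, roughly like this (a sketch: 's3.head' and 's3.put' are imaginary client methods, but the ETag of a plain PUT object really is the MD5 of its bytes):

--- cut ---
require 'digest/md5'

# hypothetical look-before-leap: only PUT when the remote copy is
# missing or its ETag differs from the local MD5
def upload_if_changed(s3, bucket, key, file)
  data = File.read(file)
  local_md5 = Digest::MD5.hexdigest(data)
  remote = s3.head(bucket, key) rescue nil  # nil when the key doesn't exist
  return if remote && remote.etag.delete('"') == local_md5
  s3.put(bucket, key, data)
end
--- cut ---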

my stuff does do some things I doubt s3sync does.  for example, I can look at ACLs either as xml or in summary format, and I can use canned ACLs to set permissions, or selectively modify the xml (via REXML::Document) to change permissions for a single grantee on one or more keys or buckets.  in other words, I wrapped the ACL-related code in S3.rb and turned it into some utilities.
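
the grantee-editing part is mostly REXML plumbing; the shape of it is something like this (a sketch only: fetching the ?acl document and PUT-ing it back are elided):

--- cut ---
require 'rexml/document'

# take the AccessControlPolicy XML from a GET ?acl request, change one
# email grantee's permission, and return the modified document; the
# caller would PUT the result back with ?acl.
def set_permission(acl_xml, grantee_email, permission)
  doc = REXML::Document.new(acl_xml)
  doc.elements.each("//Grant") do |grant|
    email = grant.elements["Grantee/EmailAddress"]
    if email && email.text == grantee_email
      grant.elements["Permission"].text = permission
    end
  end
  doc.to_s
end
--- cut ---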
ferrix
(I am greg13070 on AWS forum)
« Reply #3 on: February 20, 2007, 09:18:14 AM »

In your endeavors, have you discovered a way to modify a node's metadata without re-PUT-ing the node?  That ability would be incredibly huge for me if it existed.  I just couldn't find anything about it in the S3 API docs, so I assume it is impossible.

Isn't yield fun?  Before using generators I was literally running into memory limitations building up paths in memory (I run on some low-powered virtual machines).
lowflyinghawk
« Reply #4 on: February 20, 2007, 03:46:23 PM »

AFAIK the metadata can't be modified.  this seems like an artificial limitation, given that ACLs, which are really just metadata, can be.  something else that would be useful for the commercial guys is being able to set permissions on prefixes, i.e. make them more like directories; then you could effectively have any number of buckets.

yes, yield, blocks, et al. are fun.  ruby is an interesting jolt to your mindset if you are used to something like c++: mixins, include, extension of existing classes, etc., are quite different, and the libraries are pretty good.  I'm sure I still don't understand the scope rules though ;-).  on the other hand, compared to c++, it is weird getting used to how many mistakes you can make without seeing a complaint (until your boss is looking and that odd branch in the code runs for the first time).  the compiler catches a lot of errors for you in c++, and it takes getting used to not having it.  one exception is c++ templates, where member functions can contain all sorts of errors that don't show up until they are instantiated, i.e. the compiler really doesn't do much beyond tokenization unless the function is actually called.

I programmed a lot in perl at one time, then came back to it after 5 years or so of heavy c++, and it just doesn't seem to scale (other than CPAN, which is incredible)... that grafted-on OO business just doesn't do it for me.  all in all, ruby is a nice language.  funnily enough, what made me start looking at it is something I don't even use: Rails... now there is some power on display!