Option to use size/modified time in comparisons

ferrix

Sr. Member

Posts: 363

(I am greg13070 on AWS forum)

« on: February 20, 2007, 05:47:45 AM »

This is non-trivial in some ways to do correctly, but in certain situations it could be much faster than reading every byte of every local file to calculate its etag.



lowflyinghawk

Jr. Member

Posts: 52

Re: Option to use size/modified time in comparisons

« Reply #1 on: February 20, 2007, 10:13:18 AM »

s3 isn't like rsync, where just a few bytes can be sent or retrieved to sync the files, so if the size is different you are stuck getting the whole thing, right? a size check seems like a cheap way to avoid calculating the etag before a retrieve. of course for puts there is no way around it, and eventually you will do the etag check after a retrieve if you want to be maximally safe (but that's on the retrieved bits, not on the ones you already have).

timestamps might be something to restore, but if the bits are the same, fixing the timestamp is cheap: it needs nothing but the metadata.



ferrix

Sr. Member

Posts: 363

(I am greg13070 on AWS forum)

Re: Option to use size/modified time in comparisons

« Reply #2 on: February 20, 2007, 05:14:09 PM »

I already check the size before the etag and skip etag if the size is different. But the problem is that 99% of the time the size is the same, so the etag check has to go ahead. Of course no one would claim that "same size" is good enough to know that the files are the same.

The point of my comment on modified time is not to restore timestamps... but rather to use them as a method of determining whether to sync *in lieu* of etag checking. This way we would always be able to check using local metadata alone, rather than md5'ing the file.

I wouldn't make this the default, or take the old behavior away. It's just some low-hanging fruit to speed things up, especially on my slow-ass Windows XP machine.
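A minimal sketch of the check order discussed here, in Python rather than the tool's own code: size first, then an opt-in modified-time shortcut, and the full md5 only as the default fallback. The names `file_md5`, `needs_sync`, `trust_mtime` and the `remote_*` arguments are hypothetical, and it assumes the etag is the hex md5 of the object's bytes and that a remote modified time is available to the caller.

```python
import hashlib
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Hex md5 of a file, read in chunks so large files are not loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_sync(path, remote_size, remote_etag, remote_mtime=None, trust_mtime=False):
    """Return True if the local file should be transferred.

    remote_size, remote_etag and remote_mtime are hypothetical values the
    caller has already fetched from the remote listing or metadata.
    """
    info = os.stat(path)
    # Cheap check first: a different size always means different content.
    if info.st_size != remote_size:
        return True
    # Opt-in shortcut: same size and same whole-second mtime counts as unchanged.
    if trust_mtime and remote_mtime is not None:
        return int(info.st_mtime) != int(remote_mtime)
    # Default behavior: read every byte and compare against the etag.
    return file_md5(path) != remote_etag.strip('"')
```

With `trust_mtime` left at `False` the behavior is the existing one, so the shortcut stays strictly optional.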



lowflyinghawk

Jr. Member

Posts: 52

Re: Option to use size/modified time in comparisons

« Reply #3 on: February 20, 2007, 06:47:28 PM »

timestamps would only be adequate under special circumstances, i.e. you'd have to have special knowledge about the files in question. rsync takes the opposite tack though, believing the files are the same if the timestamp and size match (of course it has options to control the behavior).

here is another thing I didn't know, from the rsync man page:

When comparing two timestamps, rsync treats the timestamps as
being equal if they differ by no more than the modify-window
value. This is normally 0 (for an exact match), but you may
find it useful to set this to a larger value in some situations.
In particular, when transferring to or from an MS Windows FAT
filesystem (which represents times with a 2-second resolution),
--modify-window=1 is useful (allowing times to differ by up to 1
second).
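The comparison rule quoted above can be sketched in a few lines of Python. This is only an illustration of the rule as the man page states it, not rsync's actual code; `same_mtime` and its parameters are made-up names, and timestamps are assumed to be seconds since the epoch.

```python
def same_mtime(local_mtime, remote_mtime, modify_window=0):
    """True if the two timestamps differ by no more than modify_window seconds."""
    return abs(local_mtime - remote_mtime) <= modify_window


# Exact match required by default; a window of 1 tolerates FAT's 2-second rounding.
print(same_mtime(1171990000, 1171990001))                    # False
print(same_mtime(1171990000, 1171990001, modify_window=1))   # True
```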

ps: unless your bandwidth is extreme and your drives awfully slow, I don't know if it is worth worrying about the time to compute the checksum. I tested by filling a 1G file with bytes from /dev/urandom and then calculating the md5 on the file. interestingly, doing it in ruby is just as fast as doing it with md5sum (a compiled utility on linux): about 15 seconds total. if you had 1MB/s of upstream bandwidth it would still take you 17 minutes to upload the bytes, so I think the md5 calculation time is in the noise.
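The arithmetic behind that estimate, as a quick Python check. It assumes binary units (1G = 2^30 bytes, 1MB/s = 2^20 bytes per second); with decimal units the upload comes out slightly shorter, at about 16.7 minutes. The 15 seconds is the md5 timing reported in the post, and `upload_minutes` is just an illustrative helper.

```python
def upload_minutes(size_bytes, bytes_per_second):
    """Minutes needed to push size_bytes at a steady bytes_per_second."""
    return size_bytes / bytes_per_second / 60


GIB = 1024 ** 3      # assumed meaning of "1G"
MIB = 1024 ** 2      # assumed meaning of "1MB"
MD5_SECONDS = 15     # timing reported in the post

minutes = upload_minutes(GIB, MIB)
md5_share = MD5_SECONDS / (minutes * 60)

print(f"upload: {minutes:.1f} minutes")              # upload: 17.1 minutes
print(f"md5 is {md5_share:.1%} of the upload time")  # md5 is 1.5% of the upload time
```

The balance shifts when nothing needs uploading: a sync run that finds every file unchanged still pays the md5 cost for each one, which is the case the original request is aimed at.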


« Last Edit: February 20, 2007, 07:10:40 PM by lowflyinghawk »
