PPRuNe Forums - View Single Post - On the topic of duplicate files.....
14th June 2013 | 09:42
mixture
 
On the topic of duplicate files.....

Mr Mac sent me a PM chasing me for a follow-up.

Evidently he is still stewing in his own juices about my comments on a historical thread (here).

Background

Mr OSFO said he was looking for software that would search for duplicate images on his computer and highlight these for deletion.

Scope

Mr OSFO specifically used the word duplicate.

I'm rather partial to any of the definitions of the word duplicate provided by the Oxford Dictionary, for example "make or be an exact copy of".

It is widespread practice in the IT industry to use cryptographic hashes to identify duplicate files. There is no debate that using cryptographic hashes is the fastest and most efficient way to do this. Therefore using "fuzzy logic" or whatever when the goal is to identify DUPLICATE files is a waste of time: it takes longer to process and is generally a wheel that really does not need to be reinvented.
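For anyone wondering what the hash approach looks like in practice, here is a rough Python sketch. It is only an illustration: the function names (file_digest, find_duplicates), the choice of SHA-256 and the size pre-filter are my assumptions, not a description of how any particular product works.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Group files under root by content; return only groups of 2 or more."""
    # Files of different sizes cannot be identical, so only hash
    # files that share a size with at least one other file.
    by_size = defaultdict(list)
    for p in Path(root).rglob("*"):
        if p.is_file():
            by_size[p.stat().st_size].append(p)

    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) > 1:
            for p in paths:
                by_hash[file_digest(p)].append(p)

    # Keep only digests shared by more than one file.
    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

Calling find_duplicates on a photo folder returns lists of paths whose contents are byte-for-byte identical. Change a single byte in a file and its digest changes completely, so this finds exact copies only, which is the whole point when the goal is DUPLICATE files.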

Mac then came back suggesting his tools were useful for detecting similar files.

I expressed some doubt in that statement. Given that organisations such as Google with deep pockets and large development teams are unable to build an algorithm that actually works at detecting similar files, there is good reason to doubt the ability of a one-man-band shareware/freeware peddler to achieve this.

Evaluation

However, Mac chased me.... so I picked four images, two pairs of visually similar images (see below), and downloaded Mac's toys.

I've done a fair amount of beta testing and product evaluation work on stuff more complex than this, so I know how to challenge a piece of software.

I deliberately picked challenging images. The two images in each pair were taken on a tripod at the exact same location and within a few minutes of each other. The only thing that changed slightly was zoom or exposure.

Obviously for my tests I used the real high-res images without the watermark etc.

(1) Dupeguru

In its default setting, Dupeguru picked up no similar files.

I turned the detection settings down to their loosest. Dupeguru correctly identified the forest scenes as being similar (26%), but failed to identify the coastal scenes as being similar.

An unscientific test on a small number of samples, but I would suggest Dupeguru is probably not to be relied upon as a surefire way to identify similar files. However, as a first-stage tool to filter out some similar files before resorting to manual review, I can see it would have its uses.

I would give it 2.5/5.

(2) Visipics

No matter how loose I made the settings, Visipics came up with no matches for similar files.

0/5.



Happy now, Mac?




Last edited by mixture; 14th June 2013 at 09:45.