Name | Modified | Size
---|---|---
finddupe.dynamically-linked.stripped | 2009-01-19 | 64.9 kB
finddupe.dynamically-linked.debuggable | 2009-01-19 | 117.6 kB
finddupe.c | 2009-01-19 | 98.4 kB
Catalog_File_Structure | 2009-01-19 | 1.8 kB
Makefile | 2009-01-19 | 883 Bytes
Theory_of_Operation | 2009-01-19 | 4.0 kB
README | 2009-01-19 | 5.1 kB
Totals: 7 Items | | 292.8 kB
README for finddupe.c
Copyright (c) 2009 Scott Mitchell Jennings, under the terms of the GNU General Public License
Sun Jan 18 16:12:26 PST 2009

* Preamble / Justification

The overriding principle in the development of finddupe is speed. There are many programs available for managing and cataloging various archives of data, and many for finding duplication within archives. Many are extremely featureful and have amazing graphical interfaces. When I began trying them all, I found that none of them were effectively usable on the extremely large archives now easily possible with the availability of large and inexpensive disk drives. Most were so slow merely to open their catalogs that I usually found it was faster to locate the file using /usr/bin/find, if the disk drive was online.

Eventually I just used shell scripts which ran "find" and "ls" to produce text files showing what files were on what disks. Then I'd find what I had, by name or by size, using grep. Somewhat tedious, not exactly fast, but still faster than anything else I'd found. No program to load, just a quick grep. The slow part was keeping the text files up to date as I organized my data.

Meanwhile, I had been using an old utility called "finddupe" basically forever. It was extremely fast at establishing that there were no duplicate files in large repositories, but when there were duplicates it slowed way down, as it did a bytewise compare between all potentially duplicate files. I began to teach myself C by hacking on it. My first hack was to make it ignore my .xvpics directories. My next hack was to make it report duplicates in quotes, so I could feed its output to shell scripts. Eventually, the added features made the new finddupe virtually unrecognizable as the old finddupe.

* Invocation

Finddupe accepts multiple command line switches in either long (two dashes) or short forms. (--hardlink-dups could be seen as dangerous, and has no short form.) It then expects a list of paths. Paths which are directories are searched recursively for all regular files on the same filesystem.
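As an illustration of that kind of traversal, here is a minimal sketch (not finddupe's own code) of a deep walk that stays on one filesystem, using nftw(3) with FTW_MOUNT; printing the size and name is just a placeholder for whatever per-file work a real tool would do:

    /* Minimal sketch: recurse into each argument, visiting only regular
     * files on the starting filesystem.  Illustrative only. */
    #define _XOPEN_SOURCE 500
    #include <ftw.h>
    #include <stdio.h>
    #include <sys/stat.h>

    static int visit(const char *path, const struct stat *sb,
                     int typeflag, struct FTW *ftwbuf)
    {
        (void)ftwbuf;
        if (typeflag == FTW_F && S_ISREG(sb->st_mode))
            printf("%lld %s\n", (long long)sb->st_size, path); /* size, name */
        return 0;                               /* 0 means: keep walking */
    }

    int main(int argc, char **argv)
    {
        for (int i = 1; i < argc; i++)
            /* FTW_MOUNT: stay on the filesystem of the starting path;
             * FTW_PHYS: report symlinks themselves, do not follow them. */
            if (nftw(argv[i], visit, 32, FTW_MOUNT | FTW_PHYS) == -1)
                perror(argv[i]);
        return 0;
    }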
Here is the current list of accepted command line switches:

-v or --verbose or --verbose=N
    Each -v increases "verbosity" by one level, or N is an explicit level.

-c or --catalog
    This causes finddupe to check the files in the paths specified against all files in all catalogs, as well as against themselves. (By default, command line paths are checked only against themselves.)

-a or --all
    This causes finddupe to check for duplicates everywhere, in all files in all catalogs.

-m or --force-md5
    This causes finddupe to ensure that MD5 data exists for all files referenced on the command line. By default, MD5 data is only generated when it is needed to identify duplication. This feature is extremely useful for removable media, ensuring that the MD5 data will be available in the future, even if the media itself is not presently available.

-p or --paranoid
    When duplicates are identified (files are of identical size and have identical MD5 sums), this switch causes finddupe to also do a bytewise compare of the two files' contents. (Slow. A sketch of such a check appears after this list.)

-h or --show-hard
    This causes finddupe to treat files that are hardlinked as if they are duplicates. By default, hard links are not considered duplicates, but are *not* ignored (as in the original finddupe) and will always be reported in the duplicate list if their content matches that of files in at least one other inode.

-z N or --ignore-less=N
    This causes finddupe to ignore files smaller than N in size when looking for duplicates.

-. or --ignore-hidden
    This causes finddupe to ignore all files and directories whose names start with a '.'. Entire branches (all subdirectories) are also ignored.

-e or --edit
    This causes finddupe to allow you to edit the comments for all volumes of media referenced by the command line paths. If any of the command line paths explicitly reference a single file, comments for those files will also be edited. Note that files whose comments start with a '.' are *always* ignored when searching for duplication.

-n or --no-catalogs
    This causes finddupe never to read in or write out any catalog files. This makes its behaviour much closer to the original finddupe. It also means that MD5 data will have to be generated from scratch, if needed.

-rN or --reports=N
    This causes finddupe to report only the first N groups of duplicate files.

-fpathname or --catalog-dir=pathname
    This causes finddupe to use "pathname" as the directory to store catalog files in. By default this is "~/.finddupe".

-xfilename or --exclude=filename
    This causes finddupe to ignore files and/or directories of exactly this filename.

--hardlink-dups
    This causes finddupe to attempt to hard link all duplicate files together. There is no guarantee which file will be trashed, so one of them will inherit the stats of the other.

-d or --debug or --debug=N
    Finddupe is still in beta testing, and contains many lines of code useful only for debugging. These switches act like the -v switch, increasing the verbosity of the debug messages.

-smj
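As a footnote to the -p/--paranoid switch above: the bytewise check it describes amounts to comparing the two candidate files chunk by chunk after their sizes and MD5 sums already match. The following is only a minimal illustrative sketch of such a comparison, not the code finddupe actually uses:

    /* Minimal sketch of a paranoid-style bytewise comparison of two files
     * whose sizes and MD5 sums already match.  Illustrative only; error
     * handling is deliberately minimal. */
    #include <stdio.h>
    #include <string.h>

    /* Returns 1 if the two files have identical contents, 0 if they
     * differ, -1 if either file could not be opened. */
    static int same_contents(const char *a, const char *b)
    {
        FILE *fa = fopen(a, "rb"), *fb = fopen(b, "rb");
        int result = -1;

        if (fa && fb) {
            unsigned char bufa[65536], bufb[65536];
            size_t na, nb;
            result = 1;
            do {
                na = fread(bufa, 1, sizeof bufa, fa);
                nb = fread(bufb, 1, sizeof bufb, fb);
                if (na != nb || memcmp(bufa, bufb, na) != 0) {
                    result = 0;              /* lengths or bytes differ */
                    break;
                }
            } while (na == sizeof bufa);     /* short read: end of file */
        }
        if (fa) fclose(fa);
        if (fb) fclose(fb);
        return result;
    }

A caller would treat a return of 1 as confirmation of the duplicate, 0 as a (rare) MD5 collision, and -1 as an I/O problem.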