File duplicate finder

A file duplicate finder written in Java 8 with native MD5 check support.

Categories: Java Languages
GroupId: com.github.cbismuth
ArtifactId: fdupes-java
Last Version: 1.3.1
Type: jar
Description: A file duplicate finder written in Java 8 with native MD5 check support.
Project URL: https://github.com/cbismuth/fdupes-java
Project Organization: Pivotal Software, Inc.
Source Code Management: https://github.com/cbismuth/fdupes-java

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/com.github.cbismuth/fdupes-java/ -->
<dependency>
    <groupId>com.github.cbismuth</groupId>
    <artifactId>fdupes-java</artifactId>
    <version>1.3.1</version>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/com.github.cbismuth/fdupes-java/
implementation 'com.github.cbismuth:fdupes-java:1.3.1'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/com.github.cbismuth/fdupes-java/
implementation ("com.github.cbismuth:fdupes-java:1.3.1")

Buildr:

'com.github.cbismuth:fdupes-java:jar:1.3.1'

Ivy:

<dependency org="com.github.cbismuth" name="fdupes-java" rev="1.3.1">
  <artifact name="fdupes-java" type="jar" />
</dependency>

Grape:

@Grapes(
@Grab(group='com.github.cbismuth', module='fdupes-java', version='1.3.1')
)

SBT:

libraryDependencies += "com.github.cbismuth" % "fdupes-java" % "1.3.1"

Leiningen:

[com.github.cbismuth/fdupes-java "1.3.1"]

Dependencies

compile (7)

Group / Artifact                                          Type  Version
org.springframework.boot : spring-boot-starter-web        jar   1.4.1.RELEASE
org.springframework.boot : spring-boot-starter-logging    jar   1.4.1.RELEASE
com.google.guava : guava                                  jar   19.0
org.apache.spark : spark-network-common_2.11              jar   1.6.2
org.zeroturnaround : zt-exec                              jar   1.9
com.opencsv : opencsv                                     jar   3.8
io.dropwizard.metrics : metrics-servlets                  jar   3.1.2

test (1)

Group / Artifact                                          Type  Version
org.springframework.boot : spring-boot-starter-test       jar   1.4.1.RELEASE

Project Modules

There are no modules declared in this project.

fdupes-java

Description

A command-line duplicate file finder written in Java 8. It finds all duplicated files in the input paths and their subdirectories.

Usage

Executable files are available on the release page. Download the latest one and run the command line below.

java -jar fdupes-1.3.0.jar <PATH1> [<PATH2>]...

Output

Paths of duplicated files are reported in a duplicates.log file dumped in the current working directory.

Note: reported paths are double-quoted and whitespace-escaped to be *nix-compliant.

Options

Here are optional command line switches:

-Dlogging.level.fdupes=<LEVEL>    the logging level of fdupes-java        (default is INFO)
-Dlogging.level.root=<LEVEL>      the logging level of embedded libraries (default is WARN)

-Xmx<SIZE><UNIT>                  the max amount of memory to allocate (e.g. 512m)

-Dfdupes.parallelism=<NUMBER>     the number of threads to parallelize execution   (default is 1)
-Dfdupes.buffer.size=<SIZE><UNIT> the buffer size used for byte-by-byte comparison (default is 64k)

Note: logging levels must be one of: ALL, TRACE, DEBUG, INFO, WARN, ERROR, OFF.
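
As an illustration of how such -D switches can be consumed, here is a minimal Java sketch. It is not taken from the fdupes-java sources: the class name OptionsSketch is hypothetical, and treating k, m and g as 1024-based multiples is an assumption. Only the property names and their defaults come from the list above.

```java
import java.util.Locale;

// Hypothetical helper, not part of fdupes-java.
final class OptionsSketch {

    /** Parses sizes such as "512", "64k", "3m" or "1g" into a number of bytes. */
    static long parseSize(String value) {
        String v = value.trim().toLowerCase(Locale.ROOT);
        if (v.isEmpty()) {
            throw new IllegalArgumentException("empty size");
        }
        long multiplier = 1L;
        char unit = v.charAt(v.length() - 1);
        if (unit == 'k') {
            multiplier = 1024L;                  // assumption: binary multiples
        } else if (unit == 'm') {
            multiplier = 1024L * 1024L;
        } else if (unit == 'g') {
            multiplier = 1024L * 1024L * 1024L;
        }
        // Strip the unit suffix only when one was recognized.
        String digits = multiplier == 1L ? v : v.substring(0, v.length() - 1);
        return Long.parseLong(digits) * multiplier;
    }

    /** Reads -Dfdupes.parallelism, falling back to the documented default of 1. */
    static int parallelism() {
        return Integer.getInteger("fdupes.parallelism", 1);
    }

    /** Reads -Dfdupes.buffer.size, falling back to the documented default of 64k. */
    static long bufferSize() {
        return parseSize(System.getProperty("fdupes.buffer.size", "64k"));
    }
}
```

System properties passed with -D must appear before -jar on the command line, otherwise the JVM hands them to the application as regular arguments.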

Examples

Find duplicated files in a single directory and its subdirectories with default options:

java -jar fdupes-1.3.0.jar ~/pictures

Find duplicated files in two directories plus a single file with custom options:

java -Xmx1g                       \
     -Dfdupes.parallelism=8       \
     -Dfdupes.buffer.size=3m      \
     -Dlogging.level.fdupes=DEBUG \
     -Dlogging.level.root=DEBUG   \
     -jar fdupes-1.3.0.jar        \
     ~/pictures                   \
     ~/downloads                  \
     ~/desktop/DSC00042.JPG

Note: <PATH1> [<PATH2>]... can be either regular files, directories or both.

Benchmark

Hardware

Processor   Intel® Core™ i7-5500U CPU @ 2.40GHz × 4
Memory      15.4 GB
Disk        SSD Samsung MZ7LN256 rev. 3L6Q

Software

OS          Ubuntu 16.04 LTS 64-bit
Java        JRE 1.8.0_92-b14 64-bit

Command line

java -Xmx8g                       \
     -Dfdupes.parallelism=8       \
     -Dfdupes.buffer.size=512k    \
     -Dlogging.level.fdupes=INFO  \
     -Dlogging.level.root=ERROR   \
     -jar fdupes-1.3.0.jar        \
     ~/Pictures/tmp

Results

Total files count        69406
Total files size         148 GB
Total duplicates count   8196
Total duplicates size    49,597.715 MB
Execution time           3m1.164s

Requirements

A Java 8 runtime environment is the only requirement.

Motivation

The original fdupes application has two major caveats that fdupes-java works around.

When used together with options -s or --symlink, a user could accidentally preserve a symlink while deleting the file it points to.

Symlinks are ignored in fdupes-java.

Furthermore, when specifying a particular directory more than once, all files within that directory will be listed as their own duplicates, leading to data loss should a user preserve a file without its "duplicate" (the file itself!).

Duplicated input directories and files are filtered in fdupes-java.
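
The two safeguards above can be sketched in a few lines of Java. This is an illustration under stated assumptions, not the project's actual code: the class InputFilterSketch and its filter method are hypothetical names, and resolving each input with toRealPath() is one possible way to recognize the same location given twice.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of fdupes-java.
final class InputFilterSketch {

    /** Drops symlinks, missing paths and repeated inputs; keeps first-seen order. */
    static Set<Path> filter(Collection<Path> inputs) throws IOException {
        Set<Path> unique = new LinkedHashSet<>();
        for (Path input : inputs) {
            if (Files.isSymbolicLink(input) || !Files.exists(input)) {
                continue; // symlinks are ignored, missing paths are skipped
            }
            // The real path collapses spellings such as "dir" and "dir/." into one entry.
            unique.add(input.toRealPath());
        }
        return unique;
    }
}
```

The sketch only removes inputs that resolve to the same path. An input nested inside another input directory would need an extra check, which is left out here.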

Algorithms

  • Files are compared by size first, then by MD5 signature, and finally by a buffered byte-by-byte comparison.
  • The original file is detected by comparing creation, last access and last modification times.
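
The three comparison stages of the first bullet can be sketched as follows. This is a minimal single-threaded illustration, not the fdupes-java implementation: the class DuplicateSketch, its method names and the fixed 64 KiB buffer are assumptions made for the example.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of fdupes-java.
final class DuplicateSketch {

    private static final int BUFFER = 64 * 1024; // assumed buffer size

    /** Returns groups of regular files whose contents are byte-for-byte identical. */
    static List<List<Path>> findDuplicates(Collection<Path> files) throws IOException {
        // Stage 1: only files of equal size can be duplicates.
        Map<Long, List<Path>> bySize = new HashMap<>();
        for (Path f : files) {
            bySize.computeIfAbsent(Files.size(f), k -> new ArrayList<>()).add(f);
        }
        List<List<Path>> result = new ArrayList<>();
        for (List<Path> sameSize : bySize.values()) {
            if (sameSize.size() < 2) {
                continue;
            }
            // Stage 2: split each size group by MD5 signature.
            Map<String, List<Path>> byMd5 = new HashMap<>();
            for (Path f : sameSize) {
                byMd5.computeIfAbsent(md5(f), k -> new ArrayList<>()).add(f);
            }
            // Stage 3: confirm each MD5 group byte by byte.
            for (List<Path> sameMd5 : byMd5.values()) {
                result.addAll(confirm(sameMd5));
            }
        }
        return result;
    }

    static String md5(Path file) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        byte[] buffer = new byte[BUFFER];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                digest.update(buffer, 0, n);
            }
        }
        return new BigInteger(1, digest.digest()).toString(16);
    }

    /** Partitions candidates into groups of identical content, dropping singletons. */
    static List<List<Path>> confirm(List<Path> candidates) throws IOException {
        List<List<Path>> groups = new ArrayList<>();
        for (Path candidate : candidates) {
            boolean placed = false;
            for (List<Path> group : groups) {
                if (sameContent(group.get(0), candidate)) {
                    group.add(candidate);
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                List<Path> group = new ArrayList<>();
                group.add(candidate);
                groups.add(group);
            }
        }
        groups.removeIf(g -> g.size() < 2);
        return groups;
    }

    static boolean sameContent(Path a, Path b) throws IOException {
        try (InputStream x = new BufferedInputStream(Files.newInputStream(a), BUFFER);
             InputStream y = new BufferedInputStream(Files.newInputStream(b), BUFFER)) {
            int bx;
            do {
                bx = x.read();
                if (bx != y.read()) {
                    return false;
                }
            } while (bx != -1);
            return true;
        }
    }
}
```

Ordering the stages from cheapest to most expensive means most files are ruled out by a size lookup alone, and the full byte-by-byte read only runs on files that already share both size and MD5 signature.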

Issues

Here is how issues are triaged:

  • Bug: identifies an unexpected result or application behaviour.
  • Feature: adds a new end-user feature.
  • Enhancement: improves the way the application behaves but produces the same result.
  • Spike: improves implementation design but does not change application behaviour and produces the same result.

Credits

Written by Christophe Bismuth, licensed under the MIT License (MIT).

This project is finely profiled with the awesome JProfiler from ej-technologies!

https://www.ej-technologies.com/products/jprofiler/overview.html

Versions

Version
1.3.1
1.3.0
1.2.0
1.2.0-RC4
1.2.0-RC3
1.2.0-RC2
1.2.0-RC1