deplicate.github.io by deplicate

Status
Description
Features
Installation
Usage
- Quick Examples
- Advanced Examples
API Reference

Status

Description

deplicate is a high-performance, multi-filter duplicate file finder written in pure Python, with low memory impact and several advanced features.

It finds all the duplicate files in one or more directories; you can also scan a set of files directly. Recent releases let you remove the detected duplicates and/or apply a custom action to them.

As far as we know, it is currently the most complete and fastest duplicate finder tool for Python.

Features

[x] Optimized for speed
[x] N-tree layout for low memory consumption
[x] Multi-threaded (partially)
[x] Raw drive data access to maximize I/O performance (Unix only)
[x] xxHash algorithm for fast file identification
[x] File size and signature checking for quick duplicate exclusion
[x] Extended file attributes scanning
[x] Multi-filtering
[x] Full error handling
[x] Unicode decoding
[x] Safe from directory recursion loop
[ ] SSD detection
[x] Duplicates purging
[x] Support for moving duplicates to trash/recycle bin
[x] Custom action handling over deletion
[x] Command Line Interface (https://github.com/vuolter/deplicate-cli)
[x] Unified structured result
[x] Support posix_fadvise
[ ] Graphical User Interface
[ ] Incremental file chunk checking
[ ] Hard-link scanning
[ ] Duplicate directories recognition
[ ] Multi-processing
[ ] Fully documented
[ ] PyPy support
[ ] ~~Exif data scanning~~
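
The "file size and signature checking" item above can be illustrated with a minimal sketch. This is not deplicate's implementation: it uses only the standard library, swaps in hashlib.blake2b where deplicate uses xxHash, and the helper names file_signature and group_duplicates are hypothetical. The idea is that files with a unique size are discarded before any content is read, so hashing is only paid for on plausible candidates.

```python
import hashlib
import os
from collections import defaultdict


def file_signature(path, chunk_size=1 << 16):
    """Hash a file in chunks so large files are never loaded whole."""
    digest = hashlib.blake2b(digest_size=16)  # stand-in for xxHash
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_duplicates(paths):
    """Return lists of paths whose size and content hash both match."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[file_signature(path)].append(path)
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups
```

A matching hash is strong evidence rather than byte-by-byte proof, which is one reason a real tool may apply further filters after this step.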

Installation

Type in your command shell with administrator/root privileges:

pip install deplicate

On Unix-based systems, this is generally achieved by prefixing the command with sudo:

sudo pip install deplicate

If the above commands fail, consider installing it with the option --user:

pip install --user deplicate

Note: You can install it together with its Command Line Interface by typing pip install deplicate[cli].

If the pip command is not found on your system, but you have the Python interpreter and the setuptools package (>=20.8.1) installed, you can try installing from source:

1. Download the latest source code archive in ZIP or TAR format.
2. Extract the downloaded archive.
3. From the extracted path, run the command python setup.py install.

Usage

Import the module duplicate in your script:

import duplicate

Call its function find to get the duplicate files:

duplicate.find('/path')

Or call purge to also remove them:

duplicate.purge('/path')

In both cases, you’ll get a duplicate.ResultInfo object with the following properties:

dups – Tuples of paths of duplicate files.
deldups – Tuple of paths of purged duplicate files.
duperrors – Tuple of paths of files not filtered due to errors.
scanerrors – Tuple of paths of files not scanned due to errors.
delerrors – Tuple of paths of files not purged due to errors.

Note: By default directory paths are scanned recursively.

Note: By default, files smaller than 100 KiB or bigger than 100 GiB are not scanned.

Note: File paths are returned in canonical form.

Note: Tuples of duplicate files are sorted in descending order according to input priority, file modification time and name length.
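
To show how a result with these properties might be consumed, here is a small sketch. It builds a local stand-in namedtuple with the same field names rather than calling the library, the sample paths are invented for illustration, and summarize is a hypothetical helper. It also assumes that dups is a tuple of tuples, one inner tuple per group of identical files.

```python
from collections import namedtuple

# Stand-in with the field names listed above; a real object would come
# from duplicate.find or duplicate.purge.
ResultInfo = namedtuple(
    "ResultInfo", "dups deldups duperrors scanerrors delerrors"
)

# Hypothetical sample data, for illustration only.
result = ResultInfo(
    dups=(
        ("/data/a.bin", "/data/copy/a.bin"),
        ("/data/b.bin", "/data/b (1).bin", "/tmp/b.bin"),
    ),
    deldups=(),
    duperrors=(),
    scanerrors=("/data/locked.bin",),
    delerrors=(),
)


def summarize(info):
    """Count duplicate groups, redundant copies and failed paths."""
    groups = len(info.dups)
    # One file per group is the "original"; the rest are redundant copies.
    redundant = sum(len(group) - 1 for group in info.dups)
    failed = len(info.duperrors) + len(info.scanerrors) + len(info.delerrors)
    return groups, redundant, failed


print(summarize(result))  # (2, 3, 1)
```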

Quick Examples

Scan a single directory for duplicates:

import duplicate

duplicate.find('/path/to/dir')

Scan two (or more) files for duplicates:

import duplicate

duplicate.find('/path/to/file1', '/path/to/file2')

Scan a single directory for duplicates and move them to the trash/recycle bin:

import duplicate

duplicate.purge('/path/to/dir')

Scan a single directory for duplicates and delete them:

import duplicate

duplicate.purge('/path/to/dir', trash=False)

Scan several directories together:

import duplicate

duplicate.find('/path/to/dir1', '/path/to/dir2', '/path/to/dir3')

Scan from an iterable:

import duplicate

iterable = ['/path/to/dir1', '/path/to/dir2', '/path/to/dir3']

duplicate.find.from_iterable(iterable)

Scan ignoring the minimum file size threshold:

import duplicate

duplicate.find('/path/to/dir', minsize=0)
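
Since minsize and maxsize are expressed in bytes, a tiny conversion helper can make thresholds easier to read. The sketch below is only an illustration: to_bytes and the unit constants are hypothetical names, and the two default values are the ones listed in the API Reference (102400 and 107374182400 bytes).

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# Defaults as documented in the API Reference, rebuilt from binary units.
DEFAULT_MINSIZE = 100 * KIB   # 102400 bytes
DEFAULT_MAXSIZE = 100 * GIB   # 107374182400 bytes


def to_bytes(value, unit):
    """Convert a size such as (5, 'MiB') to a byte count."""
    units = {"B": 1, "KiB": KIB, "MiB": MIB, "GiB": GIB}
    return int(value * units[unit])
```

A value such as to_bytes(5, 'MiB') could then be passed as minsize to skip anything smaller than 5 MiB.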

Advanced Examples

Scan without recursing directories:

import duplicate

duplicate.find('/path/to/file1', '/path/to/file2', '/path/to/dir1',
               recursive=False)

Note: In non-recursive mode, as in the case above, directory paths are simply ignored.
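
The effect of the recursive flag on the input paths can be sketched with the standard library alone. This is an assumption-laden illustration, not deplicate's code: iter_files is a hypothetical helper that expands directories with os.walk when recursive is true and skips them otherwise, matching the behaviour described in the note above.

```python
import os


def iter_files(paths, recursive=True):
    """Yield file paths, expanding directories only in recursive mode."""
    for path in paths:
        if os.path.isdir(path):
            if not recursive:
                continue  # directories are ignored in non-recursive mode
            # os.walk does not follow directory symlinks by default.
            for root, _dirs, files in os.walk(path):
                for name in files:
                    yield os.path.join(root, name)
        elif os.path.isfile(path):
            yield path
```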

Scan checking file names and hidden files:

import duplicate

duplicate.find.from_iterable(['/path/to/file1', '/path/to/dir1'],
                             comparename=True, scanhidden=True)

Scan excluding files with extension .doc:

import duplicate

duplicate.find('/path/to/dir', exclude="*.doc")
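
For intuition about what a wildcard pattern such as "*.doc" selects, the following sketch uses the standard fnmatch module. Whether deplicate matches patterns exactly this way (against the file name only, case-sensitively) is an assumption made for the example; is_excluded is a hypothetical helper and the sample paths are invented.

```python
import fnmatch
import os


def is_excluded(path, pattern):
    """Match the file name (not the full path) against a wildcard pattern."""
    return fnmatch.fnmatchcase(os.path.basename(path), pattern)


paths = ["/docs/report.doc", "/docs/report.docx", "/docs/notes.txt"]
kept = [p for p in paths if not is_excluded(p, "*.doc")]
print(kept)  # ['/docs/report.docx', '/docs/notes.txt']
```

Under shell-style matching the pattern must cover the whole name, so "*.doc" leaves report.docx alone; a pattern like "*.doc*" would be needed to catch both.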

Scan including file links:

import duplicate

duplicate.find('/path/to/file1', '/path/to/file2', '/path/to/file3',
               scanlinks=True)

Scan for duplicates, handling errors with a custom action (printing):

import duplicate

def error_callback(exc, filename):
    print(filename)

duplicate.find('/path/to/dir', onerror=error_callback)

Scan for duplicates and apply a custom action (printing), instead of purging:

import duplicate

def purge_callback(filename):
    print(filename)
    raise duplicate.SkipException

duplicate.purge('/path/to/dir', ondel=purge_callback)

Scan for duplicates, apply a custom action (printing) and move them to the trash/recycle bin:

import duplicate

def purge_callback(filename):
    print(filename)

duplicate.purge('/path/to/dir', ondel=purge_callback)

Scan for duplicates, handling errors with a custom action (printing), and apply a custom action (moving to path), instead of purging:

import shutil
import duplicate

def error_callback(exc, filename):
    print(filename)

def purge_callback(filename):
    shutil.move(filename, '/path/to/custom-dir')
    raise duplicate.SkipException

duplicate.purge('/path/to/dir',
                ondel=purge_callback, onerror=error_callback)
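
The interplay between ondel, onerror and SkipException in the examples above can be pictured with a small stand-alone model. Everything here is hypothetical: purge_all is not part of the library, the local SkipException only imitates duplicate.SkipException, and the real control flow inside deplicate may differ. The sketch only captures the documented contract: ondel runs before a file is purged, raising the skip exception cancels the default action, and onerror receives the exception and the filename.

```python
class SkipException(Exception):
    """Local stand-in for duplicate.SkipException."""


def purge_all(filenames, remove, ondel=None, onerror=None):
    """Hypothetical driver: run ondel, then remove unless it asks to skip."""
    purged, failed = [], []
    for filename in filenames:
        try:
            if ondel is not None:
                ondel(filename)
            remove(filename)
        except SkipException:
            continue  # the callback handled the file itself
        except Exception as exc:
            failed.append(filename)
            if onerror is not None:
                onerror(exc, filename)
        else:
            purged.append(filename)
    return purged, failed


seen, removed = [], []


def log_and_skip(filename):
    """Record the file, then cancel the default removal."""
    seen.append(filename)
    raise SkipException


purged, failed = purge_all(["a", "b"], removed.append, ondel=log_and_skip)
```

After this runs, seen holds both names while removed, purged and failed stay empty, which mirrors the "custom action instead of purging" examples.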

API Reference

Exceptions

duplicate.SkipException(*args, **kwargs)
- Description: Raised to skip file scanning, filtering or purging.
- Return: Self instance.
- Parameters: Same as built-in Exception.
- Properties: Same as built-in Exception.
- Methods: Same as built-in Exception.

Classes

duplicate.Cache(maxlen=DEFAULT_MAXLEN)
- Description: Internal shared cache class.
- Return: Self instance.
- Parameters:
  - maxlen – Maximum number of entries stored.
- Properties:
  - DEFAULT_MAXLEN
    - Description: Default maximum number of entries stored.
    - Value: 128.
- Methods:
  - …
  - clear(self)
    - Description: Clear the cache if not acquired by any object.
    - Return: True if the cache was cleared, otherwise False.
    - Parameters: None.
duplicate.Deplicate(paths, minsize=DEFAULT_MINSIZE, maxsize=DEFAULT_MAXSIZE, include=None, exclude=None, comparename=False, comparemtime=False, comparemode=False, recursive=True, followlinks=False, scanlinks=False, scanempties=False, scansystem=True, scanarchived=True, scanhidden=True)
- Description: Duplicate main class.
- Return: Self instance.
- Parameters:
  - paths – Iterable of directory and/or file paths.
  - minsize – (optional) Minimum size in bytes of files to include in scanning.
  - maxsize – (optional) Maximum size in bytes of files to include in scanning.
  - include – (optional) Wildcard pattern of files to include in scanning.
  - exclude – (optional) Wildcard pattern of files to exclude from scanning.
  - comparename – (optional) Check file name.
  - comparemtime – (optional) Check file modification time.
  - compareperms – (optional) Check file mode (permissions).
  - recursive – (optional) Scan directory recursively.
  - followlinks – (optional) Follow symbolic links pointing to directories.
  - scanlinks – (optional) Scan symbolic links pointing to files (hard links included).
  - scanempties – (optional) Scan empty files.
  - scansystems – (optional) Scan OS files.
  - scanarchived – (optional) Scan archived files.
  - scanhidden – (optional) Scan hidden files.
- Properties:
  - DEFAULT_MINSIZE
    - Description: Minimum size of files to include in scanning (in bytes).
    - Value: 102400.
  - DEFAULT_MAXSIZE
    - Description: Maximum size of files to include in scanning (in bytes).
    - Value: 107374182400.
  - result
    - Description: Result of the find or purge invocation (None by default).
    - Value: duplicate.ResultInfo.
- Methods:
  - find(self, onerror=None, notify=None)
    - Description: Find duplicate files.
    - Return: None.
    - Parameters:
      - onerror – (optional) Callback function called with two arguments, exception and filename, when an error occurs during file scanning or filtering.
      - notify – (internal) Notifier callback.
  - purge(self, trash=True, ondel=None, onerror=None, notify=None)
    - Description: Find and purge duplicate files.
    - Return: None.
    - Parameters:
      - trash – (optional) Move duplicate files to trash/recycle bin, instead of deleting.
      - ondel – (optional) Callback function called with one argument, filename, before purging a duplicate file.
      - onerror – (optional) Callback function called with two arguments, exception and filename, when an error occurs during file scanning, filtering or purging.
      - notify – (internal) Notifier callback.
duplicate.ResultInfo(dupinfo, delduplist, scnerrlist, delerrors)
- Description: Duplicate result class.
- Return: collections.namedtuple('ResultInfo', 'dups deldups duperrors scanerrors delerrors').
- Parameters:
  - dupinfo – (internal) Instance of duplicate.structs.DupInfo.
  - delduplist – (internal) Iterable of purged files (deleted or trashed).
  - scnerrlist – (internal) Iterable of files not scanned (due to errors).
  - delerrors – (internal) Iterable of files not purged (due to errors).
- Properties: Same as collections.namedtuple.
- Methods: Same as collections.namedtuple.

Functions

duplicate.find(*paths, minsize=duplicate.Deplicate.DEFAULT_MINSIZE, maxsize=duplicate.Deplicate.DEFAULT_MAXSIZE, include=None, exclude=None, comparename=False, comparemtime=False, comparemode=False, recursive=True, followlinks=False, scanlinks=False, scanempties=False, scansystem=True, scanarchived=True, scanhidden=True, onerror=None, notify=None)
- Description: Find duplicate files.
- Return: duplicate.ResultInfo.
- Parameters:
  - paths – Directory and/or file paths, passed as separate positional arguments.
  - minsize – (optional) Minimum size in bytes of files to include in scanning.
  - maxsize – (optional) Maximum size in bytes of files to include in scanning.
  - include – (optional) Wildcard pattern of files to include in scanning.
  - exclude – (optional) Wildcard pattern of files to exclude from scanning.
  - comparename – (optional) Check file name.
  - comparemtime – (optional) Check file modification time.
  - compareperms – (optional) Check file mode (permissions).
  - recursive – (optional) Scan directory recursively.
  - followlinks – (optional) Follow symbolic links pointing to directories.
  - scanlinks – (optional) Scan symbolic links pointing to files (hard links included).
  - scanempties – (optional) Scan empty files.
  - scansystems – (optional) Scan OS files.
  - scanarchived – (optional) Scan archived files.
  - scanhidden – (optional) Scan hidden files.
  - onerror – (optional) Callback function called with two arguments, exception and filename, when an error occurs during file scanning or filtering.
  - notify – (internal) (optional) Notifier callback.
duplicate.purge(*paths, minsize=duplicate.Deplicate.DEFAULT_MINSIZE, maxsize=duplicate.Deplicate.DEFAULT_MAXSIZE, include=None, exclude=None, comparename=False, comparemtime=False, comparemode=False, recursive=True, followlinks=False, scanlinks=False, scanempties=False, scansystem=True, scanarchived=True, scanhidden=True, trash=True, ondel=None, onerror=None, notify=None)
- Description: Find and purge duplicate files.
- Return: duplicate.ResultInfo.
- Parameters:
  - paths – Directory and/or file paths, passed as separate positional arguments.
  - minsize – (optional) Minimum size in bytes of files to include in scanning.
  - maxsize – (optional) Maximum size in bytes of files to include in scanning.
  - include – (optional) Wildcard pattern of files to include in scanning.
  - exclude – (optional) Wildcard pattern of files to exclude from scanning.
  - comparename – (optional) Check file name.
  - comparemtime – (optional) Check file modification time.
  - compareperms – (optional) Check file mode (permissions).
  - recursive – (optional) Scan directory recursively.
  - followlinks – (optional) Follow symbolic links pointing to directories.
  - scanlinks – (optional) Scan symbolic links pointing to files (hard links included).
  - scanempties – (optional) Scan empty files.
  - scansystems – (optional) Scan OS files.
  - scanarchived – (optional) Scan archived files.
  - scanhidden – (optional) Scan hidden files.
  - trash – (optional) Move duplicate files to trash/recycle bin, instead of deleting.
  - ondel – (optional) Callback function called with one argument, filename, before purging a duplicate file.
  - onerror – (optional) Callback function called with two arguments, exception and filename, when an error occurs during file scanning, filtering or purging.
  - notify – (internal) (optional) Notifier callback.
