scrubbbbs edited this page Mar 11, 2025 · 26 revisions

Welcome to the cbird wiki! Just random stuff for now

Indexing

cbird indexing is based on a single top-level directory and stores information using relative paths. This prevents common operations from breaking the entire index, which matters because the index can take quite a while to compute:

  • It is not possible to break the index by renaming/copying/moving the top-level directory
  • The index can be shared on the network without causing any problems

However this design comes with a few issues:

  • Paths containing symlinks cannot always be simplified, as they could point outside of the index. This will result in unwanted duplicates
  • -use is always needed when working in a sub-directory (the default directory is CWD). This is less annoying with -use @, which searches the parent tree for the first index it finds.
  • Windows doesn't have a single-root filesystem, so this design can be a challenge there. The mklink program can be used to link all content into a top-level directory of your choosing, and then you can use -i.links true
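
As a rough illustration of the parent-tree search that -use @ is described as doing, here is a minimal Python sketch. It is not cbird's code. It assumes an index is recognized by a directory named _index (the name used elsewhere on this page), and find_index_root is a made-up helper name.

```python
from pathlib import Path
from typing import Optional

# Assumption: an index lives in a directory called "_index" under the top-level directory.
INDEX_DIR_NAME = "_index"


def find_index_root(start: Path) -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains an index."""
    current = start.absolute()  # symlinks are deliberately not resolved here
    for candidate in (current, *current.parents):
        if (candidate / INDEX_DIR_NAME).is_dir():
            return candidate
    return None  # no index found anywhere in the parent tree
```

Called from a sub-directory such as photos/2021/trip, this returns the first ancestor that holds an _index directory, which is the directory you would otherwise have to pass to -use by hand.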

Indexing operation

The indexing process consists of a few steps:

  • Reading options: -use, -i.* etc
  • Verifying: Check state of the database (missing expected files, inconsistent algos, out-of-date items etc)
  • Scanning: Find all candidate files for addition. Apply filters, follow links etc
  • Deleting: Remove files in the database that were missing from the scan
  • Computing: algos compute hashes, features, etc and add them to the index
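
The Scanning, Deleting and Computing steps can be pictured as set differences between what is in the database and what the scan found. The Python sketch below is only that picture. plan_update is a hypothetical name, and it ignores out-of-date items, which the Verifying step is described as handling.

```python
from typing import Set, Tuple


def plan_update(indexed: Set[str], scanned: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Split relative paths into (to_delete, to_compute).

    indexed: relative paths currently in the database
    scanned: relative paths found by the scan after filters were applied
    """
    to_delete = indexed - scanned   # in the database but missing from the scan
    to_compute = scanned - indexed  # found by the scan but not yet indexed
    return to_delete, to_compute


to_delete, to_compute = plan_update(
    indexed={"a.jpg", "old/b.jpg"},
    scanned={"a.jpg", "new/c.jpg"},
)
print(sorted(to_delete), sorted(to_compute))
```

One consequence worth noticing: anything a filter hides from the scan looks "missing" in this model, so it would land in to_delete.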

Index options

Options to the indexer are referred to as "index params", which are set with -i.<name> <value>. You can get a list of names with -h or -list-index-params which will also give the current value and a short description.

There are a couple of gotchas with index params to be aware of:

  • index params are not saved to disk and must be specified every time you run cbird. You can simplify this in a few ways:
    • make a command alias or script to invoke cbird so this remains a constant
    • create a global or local "args" file containing defaults (see: Arguments Files)
  • index options apply to the -update operation, so they must appear before it
  • index options apply to the path scanned for changes, which is either the path given to -use or the optional path given to -update

Using symlinks

The directory tree may contain links, and symlinks will be followed with -i.links true. This potentially causes unwanted duplicates, so we have the following options:

  • -i.dups ignores duplicate inodes, which are guaranteed to be duplicate files - but this cannot work across filesystems.
  • -i.resolve attempts to resolve links before adding files. However this can only work if the link resolution does not point outside of the top-level directory.
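
To make the two ideas concrete, here is a hedged Python sketch of inode-based de-duplication and of checking whether a resolved link stays inside the top-level directory. It does not show how cbird implements -i.dups or -i.resolve. Both function names are invented for illustration.

```python
import os
from typing import Iterable, List, Set, Tuple


def dedupe_by_inode(paths: Iterable[str]) -> List[str]:
    """Keep the first path seen for each (device, inode) pair."""
    seen: Set[Tuple[int, int]] = set()
    unique: List[str] = []
    for path in paths:
        st = os.stat(path)  # os.stat follows symlinks to the target file
        key = (st.st_dev, st.st_ino)
        if key in seen:
            continue  # same underlying file reached through another path
        seen.add(key)
        unique.append(path)
    return unique


def resolves_inside(root: str, path: str) -> bool:
    """True if `path`, with links resolved, is still located under `root`."""
    real_root = os.path.realpath(root)
    real_path = os.path.realpath(path)
    try:
        return os.path.commonpath([real_root, real_path]) == real_root
    except ValueError:  # e.g. paths on different drives on Windows
        return False
```

Inode numbers are only unique within one filesystem, so this check can only recognize two paths as the same file when they live on the same filesystem, which is why the approach cannot work across filesystems.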

Limiting the scan

By default, -update scans all files and directories from the top-level directory (given by -use). All supported file types will be considered. This can be limited in several ways:

  • passing a directory after -update will only scan in that directory; index options will only apply to this directory and not to the index as a whole. Global queries like -similar may not work as expected unless the same settings are always given.
  • -i.algos chooses which algorithms to compute
  • -i.types chooses which file types (image/video) to consider
  • -i.fsize sets the minimum file size
  • -i.dirs enables recursive scanning of sub-directories

Filters can also be applied on file paths (full relative path from the top-level) using -i.include or -i.exclude, which can be used multiple times.

  • If both include/exclude are specified, the include is considered first, then the exclude.
  • This will not prevent scanning of all files/directories; the filter is applied to file paths only

Examples:

  • -i.include "*.jpg" => only add files with .jpg suffix
  • -i.exclude "/some/dir/" => do not add any files with path matching "/some/dir/"
  • -i.include ":(jpg|png|gif)$" -i.exclude "*/originals/*" => add jpg, png, and gif files, except if they are located in an "originals" folder
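
The include-then-exclude order can be sketched in Python as below. This is a guess at the semantics based only on the examples above: a pattern starting with ":" is treated as a regular expression, anything else as a glob matched against the whole relative path. cbird's real matcher may behave differently, and the function names are hypothetical.

```python
import re
from fnmatch import fnmatchcase
from typing import Sequence


def matches(pattern: str, rel_path: str) -> bool:
    """Assumed syntax: ':<regex>' is a regular expression, otherwise a glob."""
    if pattern.startswith(":"):
        return re.search(pattern[1:], rel_path) is not None
    return fnmatchcase(rel_path, pattern)


def accept(rel_path: str, includes: Sequence[str], excludes: Sequence[str]) -> bool:
    """Apply include patterns first, then exclude patterns."""
    if includes and not any(matches(p, rel_path) for p in includes):
        return False  # includes were given and none matched
    if any(matches(p, rel_path) for p in excludes):
        return False  # matched an exclude pattern
    return True


includes = [":(jpg|png|gif)$"]
excludes = ["*/originals/*"]
print(accept("photos/a.jpg", includes, excludes))            # True
print(accept("photos/originals/a.jpg", includes, excludes))  # False
print(accept("photos/a.tif", includes, excludes))            # False
```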

Choosing search algorithms (algos)

Normally, you don't care as they are all enabled by default (see space usage). If you know that particular algos are not going to be useful, you can speed up the scan and save a little space with -i.algos. The fdct, orb, and color algos are much slower to compute and perhaps not useful for very large data sets.

Previously-indexed algos cannot be removed by changing -i.algos. Instead, cbird ensures that previously used algos always remain in effect, which prevents incomplete search results. See Removing algos if you want to save space or force items to be re-indexed.

The available algos are:

  • 0 means don't use any algos, cbird will only be able to find exact duplicates with -dups
  • dct finds rescaled and lightly altered images. You should always have this enabled as it is basically free to compute and store
  • fdct finds rescaled and cropped images
  • orb finds rescaled, cropped, and rotated images
  • color finds images with similar color palette; can find mirrored images quickly, useful for sorting/organizing

Examples:

  • -i.algos 0 => fastest option, only -dups will work
  • -i.algos dct => fast indexing but can't find cropped or rotated images
  • -i.algos dct+fdct => slow indexing, fdct can help find cropped images as well, much faster than orb
  • -i.algos dct+orb => slow indexing, orb can find challenging duplicates like rotations

Space usage

As of v0.8, the space requirement for all cbird algorithms (with cache files) is around 30 KB per image. As such, it will usually be of no concern. However, there are a few options available to manage this:

  • -i.algos sets the algos for indexing. The algos in order of heaviest-to-lightest are orb,fdct,video,color,dct. Note that md5 checksums are always enabled.
  • -i.nfeat sets the number of transform-invariant features used per image. These are what allow cbird to find images that are cropped or rotated. This only affects orb and fdct algos. Fewer features will reduce search quality somewhat, but may still be usable.
  • -vacuum can compact the database files if you have made a lot of deletions/removals.
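
For a quick back-of-the-envelope estimate, the per-image figure quoted above can be multiplied out. The Python sketch below assumes the figure means kilobytes of 1024 bytes and that it holds roughly for your collection. Both are assumptions, and estimate_index_bytes is not part of cbird.

```python
KB = 1024  # assumption: one "KB" is 1024 bytes


def estimate_index_bytes(image_count: int, per_image_kb: float = 30.0) -> int:
    """Rough index size in bytes for `image_count` images with all algos enabled."""
    return int(image_count * per_image_kb * KB)


# 1000 images at about 30 KB each
print(estimate_index_bytes(1000))  # 30720000
```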

Removing algos

To remove algos you are no longer using, you can remove the files from the database with -select-* -remove, and then re-index them with -update and the changed algos. Since the heavier algos dominate indexing time, it might make sense to delete the affected database files and cache directly instead, as sqlite can be slow:

  • _index/cache/ contains files to speed up -similar, and can be rebuilt from database files
  • _index/media0.db contains file paths, checksums, and dct hash; do not delete this directly
  • _index/media<N>.db contains data for algo N. (1=fdct,2=orb,3=color,4=video). Deleting these drops that particular algo
  • _index/video contains video file indexes. Deleting this effectively drops the video algo
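
The file layout above boils down to a small lookup table. The Python sketch below just restates it. The numbering comes from the list above, while the helper names and the forward-slash path joining are illustrative choices.

```python
# Algo name -> N in _index/media<N>.db, as listed above (dct lives in media0.db).
ALGO_DB_NUMBER = {"dct": 0, "fdct": 1, "orb": 2, "color": 3, "video": 4}


def algo_db_path(index_dir: str, algo: str) -> str:
    """Return the database file that holds data for `algo`."""
    return f"{index_dir}/media{ALGO_DB_NUMBER[algo]}.db"


def is_safe_to_delete(algo: str) -> bool:
    """media0.db also holds file paths and checksums, so it must not be deleted directly."""
    return ALGO_DB_NUMBER[algo] != 0


print(algo_db_path("_index", "orb"))  # _index/media2.db
print(is_safe_to_delete("dct"))       # False
```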

Hardware decoding

Hardware video decoding has been improved in v0.8 and supports common devices from AMD, Nvidia and Intel. Because FFmpeg does not have a uniform "hwaccel" interface for all platforms, and codecs/drivers can be buggy, there are a lot of options provided to help find something that works. If you need something performant that "just works" then Nvidia seems to be the only choice at the moment.

Specifying Hardware Decoders

The -i.hwdec <hwdec> option adds a hardware decoder to the list of available decoders. The format of <hwdec> is <libav-device>,[cbird-options].

  • <libav-device> is the same as passed through ffmpeg's -init_hw_device option, as documented here.
  • There is one exception: for Nvidia, the "cuda" device is not used. Instead, pass nvdec:<index>,(...).
  • Each hardware decoding job consumes one CPU thread (see -i.idxthr).
  • Multiple decoders are supported; cbird will occupy hardware decoders before using software decoding.

[cbird-options] are key=value pairs appended to libav-device. If you want to see what values are available you can pass "help" as a value.

The following keys are defined:

  • family device family name for format support detection (always required)
  • vendor device vendor name for format support detection (required unless it is implied)
  • jobs number of parallel jobs, >1 may be needed to saturate the hardware, but risks running out of GPU memory.
  • enable ";"-separated list of codecs to enable, provided they pass the vendor/family check
  • disable ";"-separated list of codecs to disable, even if they pass the format check

Examples

  • -i.hwdec nvdec,family=help => get available family values
  • -i.hwdec nvdec,family=ampere => use the first/default Nvidia GPU, detect format support based on ampere series capabilities.
  • -i.hwdec nvdec:1,family=ampere,enable=av1 => use the second Nvidia GPU, but only for av1 codec
  • -i.hwdec qsv,family=tigerlake => use Intel QuickSync on Windows (maybe Linux if driver is configured correctly)
  • -i.hwdec vaapi,vendor=intel,family=kabylake => use Intel QuickSync on Linux
  • -i.hwdec vaapi,vendor=amd,family=vcn1 => use AMD GPU on Linux (untested)
  • -i.hwdec d3d11va,vendor=amd,family=uvd6 => use AMD GPU on Windows (untested)
  • -i.hwdec vulkan,vendor=amd,family=vcn3 => use AMD GPU on Windows (untested)
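
To show how the <hwdec> string breaks down, here is a small Python sketch that splits it into the device part and the key=value options. It is not cbird's parser. It assumes the device part itself contains no commas, and the function names are made up.

```python
from typing import Dict, List, Tuple


def parse_hwdec(spec: str) -> Tuple[str, Dict[str, str]]:
    """Split '<libav-device>,key=value,...' into (device, options)."""
    device, *pairs = spec.split(",")  # assumption: no commas inside the device part
    options: Dict[str, str] = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        options[key] = value
    return device, options


def codec_list(value: str) -> List[str]:
    """Split a ';'-separated codec list as used by the enable/disable keys."""
    return [codec for codec in value.split(";") if codec]


device, options = parse_hwdec("nvdec:1,family=ampere,enable=av1")
print(device)                 # nvdec:1
print(options)                # {'family': 'ampere', 'enable': 'av1'}
print(codec_list("vp9;hevc")) # ['vp9', 'hevc']
```

On a real command line the ";" in a codec list is usually a shell separator, so the whole <hwdec> value would typically need quoting.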

Testing Hardware Decoding

You can test hardware decoding with ffmpeg, see the hwaccel intro page for info.

  • ffmpeg -init_hw_device qsv -hwaccel qsv -i file.mp4 -f null - -benchmark
  • use Task Manager or nvtop to see gpu activation and resource usage.

You can also test hardware decoding with cbird's -test-video-decoder command. It takes input from the current file selection and uses the arguments passed through index options.

  • cbird -select-files <files and dirs> -with suffix mp4 -i.hwdec nvdec,family=ampere -i.decthr 1 -test-video-decoder -maxframes 5000 -show -loop -no-fallback
  • -loop will try to decode each selected file. Make sure task manager/nvtop is running and look for resource leaks!
  • -maxframes limits number of frames decoded to check more files
  • -no-fallback prevents using the software decoder
  • -v shows verbose output from codec setup
  • see commands.cpp for other options

Working around Issues

To date, only a few devices are tested so problems are expected. Features have been added to work around potential issues.

  • Run hardware decoder in a separate process with -i.forkhw true so it cannot crash the main process or cause a resource leak that crashes the system or driver. This seems to be needed with the current Windows QuickSync driver which leaks memory badly.
  • Disable the problem codecs with the "disable" option, for example to disable vp9 and hevc, use -i.hwdec (...),disable=vp9;hevc
  • Use fewer jobs per decoder, too many jobs may exhaust gpu memory

Arguments Files

cbird processes arguments in the order given (positional arguments), except for a few arguments that have global effects (-v, -q, etc). Typing the same arguments over and over gets annoying, so you can create lists of arguments and recall them by default or as needed.

Arguments files are text files with one command argument per line, for example to pass -use @ -i.decthr 8 the file would contain:

-use
@
# comment
-i.decthr
8

Arguments files may only contain positional arguments; you will get an error for any global arguments such as -headless or -verbose.
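
Reading such a file is simple enough to sketch in a few lines of Python. This mirrors the format described above (one argument per line, # for comments). Skipping blank lines and trimming surrounding whitespace are assumptions rather than documented behavior, and parse_args_lines is a hypothetical name.

```python
from typing import Iterable, List


def parse_args_lines(lines: Iterable[str]) -> List[str]:
    """Turn the lines of an arguments file into an argument list."""
    args: List[str] = []
    for line in lines:
        text = line.strip()  # assumption: surrounding whitespace is not significant
        if not text or text.startswith("#"):
            continue  # skip blank lines (assumed) and comments
        args.append(text)
    return args


example = ["-use\n", "@\n", "# comment\n", "-i.decthr\n", "8\n"]
print(parse_args_lines(example))  # ['-use', '@', '-i.decthr', '8']
```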

The -args argument is replaced with saved arguments. It accepts the following parameters:

  • -args global => loads args from ~/.config/cbird.args.txt
  • -args local => loads args from _index/args.txt
  • -args none => disables default args processing
  • -args <file.txt> => loads args from any text file

By default, -args global -args local is always tried, unless the global, local, or none option is supplied. To monitor which args are being loaded, pass -v to see verbose logging. Note that -args local should always follow -use, since -use determines the file location.

Using selections and results

There are two types of lists in cbird, "selections" and "results". A selection (aka "group" or "MediaGroup") is basically a file list. Technically it is a list of "Media" objects so it doesn't necessarily have to be a file, it could just be raw image data and a description. For example, -select-grid can cut up an image into separate items for searching.

A result (aka "MediaGroupList") is a two-dimensional list where each item is a selection, and the first item is (by convention) the needle in the search query. Results usually come from search queries but can also be formed by other commands like -group-by.

In cbird, there is always a current selection and result.

  • The selection is built from -select- commands (mostly), which can be combined as needed to get the desired set of files.
  • The result comes from consuming a selection, and clears the current selection
  • Each -select- command appends to the current selection
  • -select-none clears the current selection
  • The selection is referenced with the "@" sign in commands that take a dir/path/glob argument, for example -similar-to @ or -similar-in @ for subset queries.
  • The selection is cleared when it is consumed or invalidated (-similar-to, -similar-in, -group-by, -nuke, -remove). This prevents having both a selection and a result at the same time, which would make commands that work with both (-head, -sort) ambiguous.

Filtering

Filters can be used for a few things, but usually they are used to:

  • limit scope of the search
  • remove or select certain results

Limiting The Scope

Say you know that the originals were taken before a certain date, so you only want to find duplicates of these. You can use -with to select the subset, then search within the subset using "@" to refer to it.

cbird -select-type i -with exif#Photo.DateTimeOriginal#todate '<2022-01-03' -similar-to @

You have a project folder that is known to have valid copies of your assets/originals folder, and you don't want to include it. You can use a regular expression to select the search set, then search within it.

cbird -select-all -without relPath ':^projects/.*' -similar-in @

Removing Unlikely Matches

You have a folder "incoming" with new content. If an incoming file is larger than the existing content, you want to examine it, since it's more likely to be a dupe you want to keep. Note: there is no guarantee that this assumption is correct, but it may suit your needs.

cbird -similar-to ./incoming/ -with fileSize '<=%needle' -show

You can now batch-delete the other ones by inverting the filter:

cbird -similar-to ./incoming/ -without fileSize '<=%needle' -first -select-result -nuke

Combining Filters

If you want to combine filters there are two boolean options. These are evaluated left to right. The -or-* version (added in v0.7.1) must be preceded by a -with filter.

  • -with [this] -with [that] -- with this AND that
  • -with [this] -or-with [that] -- with this OR that
  • -with [this] -or-with [that] -with [theother] -- with (this OR that) AND theother
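
The left-to-right rule can be modeled as a simple fold over the filter list. The Python sketch below reproduces the three cases listed above. It is only a model of the described behavior, not cbird's implementation, and the "with"/"or-with" tags and the evaluate function are invented for the example.

```python
from typing import Callable, Dict, List, Tuple

Item = Dict[str, bool]
Predicate = Callable[[Item], bool]


def evaluate(filters: List[Tuple[str, Predicate]], item: Item) -> bool:
    """Fold filters left to right: 'with' ANDs, 'or-with' ORs with the running result."""
    if not filters:
        return True
    first_op, first_pred = filters[0]
    if first_op != "with":
        raise ValueError("-or-with must be preceded by a -with filter")
    result = first_pred(item)
    for op, pred in filters[1:]:
        if op == "with":
            result = result and pred(item)
        elif op == "or-with":
            result = result or pred(item)
        else:
            raise ValueError(f"unknown filter type: {op}")
    return result


this = lambda item: item["this"]
that = lambda item: item["that"]
theother = lambda item: item["theother"]

item = {"this": False, "that": True, "theother": True}
# (this OR that) AND theother
print(evaluate([("with", this), ("or-with", that), ("with", theother)], item))  # True
```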

Using Property Expressions

A property can be followed by a hash (#) and a series of transformations/functions.

Lowercasing to Test Strings

  • -with name#lower '~robert paulson'

Date/time conversions

  • -with exif#Photo.DateTimeOriginal#month '<2020-01'

Using Filter Expressions

Boolean tests on properties (v0.7.1)

  • -with name#lower '~milk || ~cookies' == images of only milk, only cookies, or both
  • -with name#lower '~milk && ~cookies' == images with both milk and cookies

Type conversions for metadata properties to enable correct evaluation (v0.7.1)

  • -with exif#Photo.DateTimeOriginal#todate '>=2021-01-01 && <2021-02-01'
  • date is sometime in January 2021

Comparison with the needle property (v0.7.1)

  • -similar-to ./originals -with res '==%needle' -with suffix '==%needle' -with fileSize '<%needle'
  • assume dupes are of lower quality due to smaller size at the same resolution and file type
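
A needle comparison can be pictured as filtering each result group against its first item. The Python sketch below models a result as a list of groups with the needle first, as described in "Using selections and results". Keeping the needle and dropping groups with no remaining matches are assumptions about the behavior, and filter_against_needle is a hypothetical helper.

```python
import operator
from typing import Any, Callable, Dict, List

Media = Dict[str, Any]


def filter_against_needle(
    results: List[List[Media]],
    prop: str,
    compare: Callable[[Any, Any], bool],
) -> List[List[Media]]:
    """Keep matches where compare(match[prop], needle[prop]) is true."""
    filtered: List[List[Media]] = []
    for group in results:
        if not group:
            continue
        needle, matches = group[0], group[1:]
        kept = [m for m in matches if compare(m[prop], needle[prop])]
        if kept:  # assumption: a group with no matches left is dropped
            filtered.append([needle, *kept])
    return filtered


results = [[
    {"path": "originals/a.jpg", "fileSize": 900},  # needle
    {"path": "copies/a_small.jpg", "fileSize": 400},
    {"path": "copies/a_big.jpg", "fileSize": 1200},
]]
# Roughly what -with fileSize '<%needle' expresses
smaller = filter_against_needle(results, "fileSize", operator.lt)
print([m["path"] for m in smaller[0]])  # ['originals/a.jpg', 'copies/a_small.jpg']
```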