Home
Welcome to the cbird wiki! Just random stuff for now
cbird indexing is based on a single top-level directory and stores information using relative paths. This prevents the kinds of changes that could break the entire index, which can take quite a while to compute:
- It is not possible to break the index by renaming/copying/moving the top-level directory
- The index can be shared on the network without causing any problems
However this design comes with a few issues:
- Paths containing symlinks cannot always be simplified, as they could point outside of the index. This will result in unwanted duplicates
- -use is always needed when working in a sub-directory (the default directory is CWD). This is less annoying with -use @, which searches the parent tree for the first index it finds.
- Since Windows doesn't have a single-root filesystem, it can be a challenge. The mklink program can be used to link all content into some top-level directory of your choosing; then you can use -i.links true
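The parent-tree search behind -use @ can be pictured with a short Python sketch. This is not cbird's code: the function name find_index_root is hypothetical, and it assumes an index is recognized by an _index directory inside the top-level directory.

```python
from pathlib import Path
from typing import Optional


def find_index_root(start: Path, marker: str = "_index") -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `marker`.

    Hypothetical helper: assumes an index is identified by an `_index`
    sub-directory, which may not match cbird's actual detection logic.
    """
    start = start.resolve()
    # Check the starting directory first, then each parent up to the root.
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None
```

Because the walk stops at the first match, a nested index would win over one further up the tree, which is consistent with "the first index it finds".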
The indexing process consists of a few steps
- Reading options: -use, -i.* etc
- Verifying: Check state of the database (missing expected files, inconsistent algos, out-of-date items etc)
- Scanning: Find all candidate files for addition. Apply filters, follow links etc
- Deleting: Remove files in the database that were missing from the scan
- Computing: algos compute hashes, features, etc and add them to the index
Options to the indexer are referred to as "index params", which are set with -i.<name> <value>. You can get a list of names with -h or -list-index-params, which will also give the current value and a short description.
There are a couple of gotchas with index params to be aware of:
- index params are not saved to disk and must be specified every time you run cbird. You can simplify this in a few ways:
  - make a command alias or script to invoke cbird so this remains a constant
  - create a global or local "args" file containing defaults (see: Arguments Files)
- index options apply to the -update operation, and so they must appear before it
- index options apply to the path scanned for changes; this is either the path given to -use or the optional path given to -update
The directory tree may contain links, and symlinks will be followed with -i.links true. This potentially causes unwanted duplicates, so we have the following options:
- -i.dups ignores duplicate inodes, which are guaranteed to be duplicate files, but this cannot work across filesystems.
- -i.resolve attempts to resolve links before adding files. However, this can only work if the link resolution does not point outside of the top-level directory.
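The two ideas above can be illustrated in Python. This is a sketch under assumptions, not cbird's implementation: resolve_inside and inode_key are hypothetical names, and the code only shows why a link can be simplified when its target stays under the top-level directory, and why an inode comparison identifies the same file on one filesystem.

```python
import os


def resolve_inside(path: str, root: str):
    """Return the resolved path relative to root, or None if it escapes root."""
    real_root = os.path.realpath(root)
    real_path = os.path.realpath(path)
    try:
        inside = os.path.commonpath([real_root, real_path]) == real_root
    except ValueError:
        # Raised when the paths cannot be compared (e.g. different drives on Windows).
        inside = False
    return os.path.relpath(real_path, real_root) if inside else None


def inode_key(path: str):
    """Files sharing a (device, inode) pair are the same file on disk."""
    st = os.stat(path)  # os.stat follows symlinks
    return (st.st_dev, st.st_ino)
```

The device number is part of the key because inode numbers are only unique within one filesystem, which matches the limitation noted for -i.dups.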
By default, -update scans all files and directories from the top-level directory (given by -use). All supported file types will be considered. This can be limited in several ways:
- passing a directory after -update will only scan in that directory; index options will only apply to this directory and not to the index as a whole. Global queries like -similar may not work as expected unless the same settings are always given.
- -i.algos chooses which algorithms to compute
- -i.types chooses which file types (image/video) to consider
- -i.fsize sets the minimum file size
- -i.dirs enables recursive scanning of sub-directories
Filters can also be applied on file paths (full relative path from the top-level) using -i.include or -i.exclude, which can be used multiple times.
- If both include/exclude are specified, the include is considered first, then the exclude.
- This will not prevent scanning all files/directories; the filter is applied to file paths only
Examples:
- -i.include "*.jpg" => only add files with .jpg suffix
- -i.exclude "/some/dir/" => do not add any files with path matching "/some/dir/"
- -i.include ":(jpg|png|gif)$" -i.exclude "*/originals/*" => add jpg, png, and gif files, except if they are located in "originals" folder
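The include-then-exclude order can be sketched in Python. Everything here is an assumption for illustration: the helper names are hypothetical, glob matching is approximated with fnmatch (cbird's own matching rules may differ), a leading ":" is taken to mark a regular expression as in the examples above, and multiple includes are treated as "any may match".

```python
import re
from fnmatch import fnmatchcase


def matches(pattern: str, rel_path: str) -> bool:
    """A leading ':' marks a regular expression; anything else is a glob (assumed)."""
    if pattern.startswith(":"):
        return re.search(pattern[1:], rel_path) is not None
    return fnmatchcase(rel_path, pattern)


def accept(rel_path: str, includes=(), excludes=()) -> bool:
    """Apply include patterns first, then exclude patterns."""
    if includes and not any(matches(p, rel_path) for p in includes):
        return False
    return not any(matches(p, rel_path) for p in excludes)
```

With includes of ":(jpg|png|gif)$" and excludes of "*/originals/*", a path such as photos/originals/x.png passes the include step and is then rejected by the exclude step.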
Normally, you don't need to care which algos are used, as they are all enabled by default (see space usage). If you know that particular algos are not going to be useful, you can speed up the scan and save a little space with -i.algos. The fdct, orb, and color algos are much slower to compute and perhaps not useful for very large data sets.
Previously-indexed algos cannot be removed by changing -i.algos. Instead, cbird ensures that previously used algos always remain in effect, which prevents incomplete search results. See Removing algos if you want to save space or force items to be re-indexed.
The available algos are:
- 0 means don't use any algos; cbird will only be able to find exact duplicates with -dups
- dct finds rescaled and lightly altered images. You should always have this enabled as it is basically free to compute and store
- fdct finds rescaled and cropped images
- orb finds rescaled, cropped, and rotated images
- color finds images with similar color palette; can find mirrored images quickly, useful for sorting/organizing
Examples:
- -i.algos 0 => fastest option, only -dups will work
- -i.algos dct => fast indexing but can't find cropped or rotated images
- -i.algos dct+fdct => slow indexing, fdct can help find cropped images as well, much faster than orb
- -i.algos dct+orb => slow indexing, orb can find challenging duplicates like rotations
As of v0.8, the space requirement for all cbird algorithms (with cache files) is around 30 KB per image. As such, it will usually be of no concern. However, there are a few options available to manage this:
- -i.algos sets the algos for indexing. The algos in order of heaviest-to-lightest are orb, fdct, video, color, dct. Note that md5 checksums are always enabled.
- -i.nfeat sets the number of transform-invariant features used per image. These are what allow cbird to find images that are cropped or rotated. This only affects the orb and fdct algos. Fewer features will reduce search quality somewhat, but may still be usable.
- -vacuum can compact the database files if you have made a lot of deletions/removals.
To remove algos you are no longer using, you can remove the files from the database with -select-* -remove, and then re-index them with -update using the changed algos. Since the heavier algos dominate indexing time, it might make sense just to delete the affected database files and cache directly, as sqlite can be slow:
- _index/cache/ contains files to speed up -similar, and can be rebuilt from database files
- _index/media0.db contains file paths, checksums, and dct hash; do not delete this directly
- _index/media<N>.db contains data for algo N (1=fdct, 2=orb, 3=color, 4=video). Deleting these drops that particular algo
- _index/video contains video file indexes. Deleting this effectively drops the video algo
Hardware video decoding has been improved in v0.8 and supports common devices from AMD, Nvidia and Intel. Because FFmpeg does not have a uniform "hwaccel" interface for all platforms, and codecs/drivers can be buggy, there are a lot of options provided to help find something that works. If you need something performant that "just works" then Nvidia seems to be the only choice at the moment.
The -i.hwdec <hwdec> option adds a hardware decoder to the list of available decoders. The format of <hwdec> is <libav-device>,[cbird-options].
- <libav-device> is the same as passed through ffmpeg's -init_hw_device option, as described in the FFmpeg documentation for that option.
- There is one exception: for Nvidia, the "cuda" device is not used; instead pass nvdec:<index>,(...).
- Hardware decoding jobs consume one CPU thread each (see -i.idxthr).
- Multiple decoders are supported; cbird will occupy hw decoders before using software decoding.
[cbird-options] are key=value pairs appended to libav-device. If you want to see what values are available you can pass "help" as a value.
The following keys are defined:
- family: device family name for format support detection (always required)
- vendor: device vendor name for format support detection (required unless it is implied)
- jobs: number of parallel jobs; >1 may be needed to saturate the hardware, but risks running out of GPU memory
- enable: ";"-separated list of codecs to enable, provided they pass the vendor/family check
- disable: ";"-separated list of codecs to disable, even if they pass the format check
Examples
- -i.hwdec nvdec,family=help => get available family values
- -i.hwdec nvdec,family=ampere => use the first/default Nvidia GPU, detect format support based on ampere series capabilities.
- -i.hwdec nvdec:1,family=ampere,enable=av1 => use the second Nvidia GPU, but only for av1 codec
- -i.hwdec qsv,family=tigerlake => use Intel QuickSync on Windows (maybe Linux if driver is configured correctly)
- -i.hwdec vaapi,vendor=intel,family=kabylake => use Intel QuickSync on Linux
- -i.hwdec vaapi,vendor=amd,family=vcn1 => use AMD GPU on Linux (untested)
- -i.hwdec d3d11va,vendor=amd,family=uvd6 => use AMD GPU on Windows (untested)
- -i.hwdec vulkan,vendor=amd,family=vcn3 => use AMD GPU on Windows (untested)
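To make the <hwdec> format concrete, here is a small Python sketch that splits the example strings above into their parts. It is illustrative only: parse_hwdec is a hypothetical helper, it handles just the simple forms shown in the examples (more complex libav device strings are not covered), and it does not reflect how cbird itself parses the option.

```python
def parse_hwdec(spec: str) -> dict:
    """Split '<libav-device>,[key=value,...]' into device, device argument, options."""
    device, *pairs = spec.split(",")
    # 'nvdec:1' -> name 'nvdec', argument '1'; 'qsv' -> name 'qsv', no argument
    name, _, device_arg = device.partition(":")
    options = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        # enable/disable hold ';'-separated codec lists
        options[key] = value.split(";") if key in ("enable", "disable") else value
    return {"device": name, "device_arg": device_arg or None, "options": options}
```

Note that a value such as disable=vp9;hevc contains a semicolon, so the whole -i.hwdec argument usually needs quoting in a shell.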
You can test hardware decoding with ffmpeg; see the hwaccel intro page for info.
ffmpeg -init_hw_device qsv -hwaccel qsv -i file.mp4 -f null - -benchmark
- use Task Manager or nvtop to see GPU activation and resource usage.
You can also test hardware decoding with cbird's -test-video-decoder command. It takes input from the current file selection and uses the arguments passed through index options.
cbird -select-files <files and dirs> -with suffix mp4 -i.hwdec nvdec,family=ampere -i.decthr 1 -test-video-decoder -maxframes 5000 -show -loop -no-fallback
- -loop will try to decode each selected file. Make sure Task Manager/nvtop is running and look for resource leaks!
- -maxframes limits the number of frames decoded per file, so more files can be checked
- -no-fallback prevents using the software decoder
- -v shows verbose output from codec setup
- see commands.cpp for other options
To date, only a few devices are tested so problems are expected. Features have been added to work around potential issues.
- Run the hardware decoder in a separate process with -i.forkhw true so it cannot crash the main process or cause a resource leak that crashes the system or driver. This seems to be needed with the current Windows QuickSync driver, which leaks memory badly.
- Disable the problem codecs with the "disable" option; for example, to disable vp9 and hevc, use -i.hwdec (...),disable=vp9;hevc
- Use fewer jobs per decoder; too many jobs may exhaust GPU memory
cbird processes arguments in the order given (positional arguments), except for a few arguments that have global effects (-v, -q, etc). Typing the same things over and over can get a bit annoying, so you can create lists of arguments and recall them by default or as needed.
Arguments files are text files with one command argument per line. For example, to pass -use @ -i.decthr 8 the file would contain:
-use
@
# comment
-i.decthr
8
Arguments files may only contain positional arguments; you will get an error for any global arguments such as -headless or -verbose.
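The file format described above (one argument per line, with "#" comment lines) can be read with a few lines of Python. This is a sketch, not cbird's parser: parse_args_file is a hypothetical name, and skipping blank lines and trimming surrounding whitespace are assumptions.

```python
def parse_args_file(text: str) -> list:
    """Return the arguments in file order, one per non-comment line."""
    args = []
    for line in text.splitlines():
        line = line.strip()  # assumption: surrounding whitespace is not significant
        if not line or line.startswith("#"):
            continue  # skip blank lines (assumed) and comment lines
        args.append(line)
    return args
```

Since each line is one argument, a value containing spaces would not need shell-style quoting under this reading of the format.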
The -args argument is replaced with saved arguments. It accepts the following parameters:
- -args global => loads args from ~/.config/cbird.args.txt
- -args local => loads args from _index/args.txt
- -args none => disables default args processing
- -args <file.txt> => loads args from any text file
By default, -args global -args local is always tried, unless the global, local, or none option is supplied. To monitor what args are being loaded, pass -v to see verbose logging. Note that -args local should always follow -use since this determines the file location.
There are two types of lists in cbird, "selections" and "results". A selection (aka "group" or "MediaGroup") is basically a file list. Technically it is a list of "Media" objects so it doesn't necessarily have to be a file, it could just be raw image data and a description. For example, -select-grid can cut up an image into separate items for searching.
A result (aka "MediaGroupList") is a two-dimensional list where each item is a selection, and the first item is (by convention) the needle in the search query. Results usually come from search queries but can also be formed by other commands like -group-by.
In cbird, there is always a current selection and result.
- The selection is built from -select- commands (mostly), which can be combined as needed to get the desired set of files.
- The result comes from consuming a selection, and clears the current selection
- Each -select- command appends to the current selection
- -select-none clears the current selection
- The selection is referenced in commands that take a dir/path/glob argument with the "@" sign, for example -similar-to @, -similar-in @ for subset queries.
- The selection is cleared when it is consumed or invalidated (-similar-to, -similar-in, -group-by, -nuke, -remove); this prevents the case where there is both a selection and a result at the same time, which would make some commands that work with both ambiguous (-head, -sort).
Filters can be used for a few things, but usually they are used to:
- limit scope of the search
- remove or select certain results
Say you know that the originals were taken before a certain date, so you only want to find duplicates of these.
You can use -with to select the subset, then search within the subset using "@" to refer to it.
cbird -select-type i -with exif#Photo.DateTimeOriginal#todate '<2022-01-03' -similar-to @
You have a project folder that is known to have valid copies of your assets/originals folder, and you don't want to include it. You can use a regular expression to select the search set, then search within it.
cbird -select-all -without relPath ':^projects/.*' -similar-in @
You have a folder "incoming" with new content, if the file size is larger than the existing content, then you want to examine it since it's more likely to be a dupe you want to keep. Note: there is no guarantee that this assumption is correct, but it may suit your needs.
cbird -similar-to ./incoming/ -with fileSize '<=%needle' -show
You can now batch-delete the other ones by inverting the filter:
cbird -similar-to ./incoming/ -without fileSize '<=%needle' -first -select-result -nuke
If you want to combine filters there are two boolean options. These are evaluated left to right. The -or-* version (added in v0.7.1) must be preceded by a -with filter.
- -with [this] -with [that] -- with this AND that
- -with [this] -or-with [that] -- with this OR that
- -with [this] -or-with [that] -with [theother] -- with (this OR that) AND theother
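The left-to-right evaluation in the table above can be modelled as a simple fold. The Python below is a sketch of that reading only; combine and the "with"/"or-with" labels are hypothetical stand-ins, not cbird internals.

```python
def combine(filters, item) -> bool:
    """Evaluate [(op, predicate), ...] left to right; op is 'with' or 'or-with'."""
    result = True
    for op, predicate in filters:
        if op == "with":
            result = result and predicate(item)   # AND with everything so far
        elif op == "or-with":
            result = result or predicate(item)    # OR with everything so far
        else:
            raise ValueError(f"unknown op: {op}")
    return result
```

Folding from the left is what makes -with [this] -or-with [that] -with [theother] group as (this OR that) AND theother, rather than this OR (that AND theother).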
A property can be followed by hash (#) and a series of transformations/functions
Lowercasing to Test Strings
-with name#lower '~robert paulson'
Date/time conversions
-with exif#Photo.DateTimeOriginal#month '<2020-01'
Boolean tests on properties (v0.7.1)
- -with name#lower '~milk || ~cookies' == images of only milk, only cookies, or both
- -with name#lower '~milk && ~cookies' == images with both milk and cookies
Type conversions for metadata properties to enable correct evaluation (v0.7.1)
-with exif#Photo.DateTimeOriginal#todate '>=2021-01-01 && <2021-02-01'
- date is sometime in January 2021
Comparison with the needle property (v0.7.1)
-similar-to ./originals -with res '==%needle' -with suffix '==%needle' -with fileSize '<%needle'
- assume dupes are of lower quality due to smaller size at the same resolution and file type