Feature: ability to match ROMs via MD5 & SHA1 (#945)
emmercm authored Mar 21, 2024
1 parent f016c87 commit 307770c
Showing 26 changed files with 313 additions and 161 deletions.
4 changes: 2 additions & 2 deletions docs/advanced/internals.md
@@ -14,8 +14,8 @@ Information about the inner workings of `igir`.
- Parent/clone information is inferred if the DAT has none (see [DATs docs](../dats/processing.md#parentclone-inference))
- Parent/clone ROMs sets are merged or split (`--merge-roms <type>`) (see [arcade docs](../usage/arcade.md))
- ROMs in the DAT are filtered to only those desired (`--filter-*` options) (see [filtering & preference docs](../roms/filtering-preferences.md))
- Input files are matched to ROMs in the DAT
- Patch files are matched to ROMs found
- Input files are matched to ROMs in the DAT (see [matching docs](../roms/matching.md))
- Patch files are matched to ROMs found (see [patching docs](../roms/patching.md))
- ROM preferences are applied (`--single`, see [filtering & preference docs](../roms/filtering-preferences.md#preferences-for-1g1r))
- ROMs are combined (`--zip-dat-name`)
- ROMs are written to the output directory, if specified (`copy`, `move`, `link`)
2 changes: 1 addition & 1 deletion docs/alternatives.md
@@ -19,7 +19,7 @@ There are a few different popular ROM managers that have similar features:
| DATs: combine multiple |||||
| Archives: extraction formats | ✅ many formats ([reading archives docs](input/reading-archives.md)) |`.zip`, `.7z`, `.rar` | ⚠️ `.zip`, `.7z` | ⚠️ `.zip`, `.7z` |
| Archives: creation formats |`.zip` only by design ([writing archives docs](output/writing-archives.md)) |`.zip`, `.7z`, `.rar` | ⚠️ `.zip` (TorrentZip), `.7z` | ⚠️ `.zip`, `.7z` |
| ROMs: DAT matching strategies | CRC32+size | ✅ CRC32+size, MD5, SHA1 | ✅ CRC32+size, MD5, SHA1 ||
| ROMs: DAT matching strategies | CRC32+size, MD5, SHA1 | ✅ CRC32+size, MD5, SHA1 | ✅ CRC32+size, MD5, SHA1 ||
| ROMs: CHD scanning || ⚠️ via chdman | ✅ v1-5 natively | ⚠️ v1-4 natively |
| ROMs: scan/checksum caching | ❌ by design ||||
| ROMs: header parsing |||| ⚠️ via plugins |
4 changes: 2 additions & 2 deletions docs/commands.md
@@ -63,14 +63,14 @@ After performing one of the ROM writing commands, verify that the file was writt

### `clean`

Files in the output directory that do not match any ROM in any [DAT](dats/introduction.md) will be deleted.
Files in the output directory that do not [match any ROM](roms/matching.md) in any [DAT](dats/introduction.md) will be deleted.

See the [output cleaning page](output/cleaning.md) for more information.

## ROM reporting

### `report`

A report will be generated of what input files were matched by what DAT, and what games in what [DATs](dats/introduction.md) have missing ROMs.
A report will be generated of what [input files were matched](roms/matching.md) by what [DAT](dats/introduction.md), and what games in what DATs have missing ROMs.

See the [reporting page](output/reporting.md) for more information.
2 changes: 1 addition & 1 deletion docs/dats/processing.md
@@ -64,7 +64,7 @@ There have been a few DAT-like formats developed over the years. `igir` supports
```

- [CMPro](http://www.logiqx.com/DatFAQs/CMPro.php)
- [Hardware Target Game Database](https://github.com/frederic-mahe/Hardware-Target-Game-Database) SMDBs that contain file sizes
- [Hardware Target Game Database](https://github.com/frederic-mahe/Hardware-Target-Game-Database) SMDBs

!!! tip

4 changes: 2 additions & 2 deletions docs/input/reading-archives.md
@@ -20,9 +20,9 @@

**You should prefer archive formats that have CRC32 checksum information for each file.**

`igir` uses CRC32 information to match ROMs to DAT entries. If an archive already contains CRC32 information for each file, then `igir` won't need to extract each file and compute its CRC32 itself. This can save a lot of time on large files, especially.
By default, `igir` uses CRC32 information to [match ROMs](../roms/matching.md) to DAT entries. If an archive already contains CRC32 information for each file, then `igir` doesn't need to extract each file and compute its CRC32. This can save a lot of time on large archives.

This is why you should use the [`igir zip` command](../output/writing-archives.md) when organizing your primary ROM collection. It is much faster to scan archives with CRC32 information, speeding up actions such as merging new ROMs into an existing collection.
This is why you should use the [`igir zip` command](../output/writing-archives.md) when organizing your primary ROM collection. It is much faster for `igir` to scan archives with CRC32 information, speeding up actions such as merging new ROMs into an existing collection.

**You should prefer archive formats that `igir` can extract natively.**

2 changes: 1 addition & 1 deletion docs/output/writing-archives.md
@@ -6,7 +6,7 @@

It is intentional that `igir` only supports `.zip` archives right now.

`.zip` archives store CRC32 information in their "file table" which helps drastically speed up `igir`'s file scanning, and they are easy to create without proprietary tools (e.g. Rar).
`.zip` archives store CRC32 information in their "central directory" which helps drastically speed up `igir`'s file scanning, and they are easy to create without proprietary tools (e.g. 7-Zip, Rar).

See the [reading archives](../input/reading-archives.md) page for more information on archive formats and their capabilities.

61 changes: 61 additions & 0 deletions docs/roms/matching.md
@@ -0,0 +1,61 @@
# ROM Matching

When `igir` [scans ROM files](../input/file-scanning.md) in the input directory, it calculates a number of checksums to uniquely identify each file. These checksums are then matched to ones found in [DATs](../dats/introduction.md).

By default, `igir` will use CRC32 + filesize to match input files to ROMs found in DATs. CRC32 checksums are fast to calculate, and many [archive formats](../input/reading-archives.md) include them in their directory of files, which greatly speeds up scanning.

!!! note

The main drawback of CRC32 checksums is their small keyspace of 4.29 billion unique values (see below). This might seem like a lot, but it is small enough that two different files can plausibly have the same CRC32. The chance of these "collisions" can be reduced by also comparing the filesizes of the two files.
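
To put a rough number on the collision risk described above, here is a minimal sketch using the standard birthday approximation. The `collisionProbability` helper is hypothetical and not part of `igir`, and it assumes checksum values are spread uniformly across the keyspace, which is only an approximation for CRC32. It also ignores the extra protection that comparing filesizes provides.

```typescript
// Hypothetical helper, not part of igir: estimates the chance that at least
// two of `fileCount` distinct files share a checksum, using the birthday
// approximation p ≈ 1 - exp(-n(n-1) / (2N)) where N = 2^bits.
// Assumption: checksum values are uniformly distributed.
function collisionProbability(fileCount: number, bits: number): number {
  const keyspace = 2 ** bits;
  const exponent = -(fileCount * (fileCount - 1)) / (2 * keyspace);
  // -expm1(x) computes 1 - exp(x) without losing precision when x is tiny
  return -Math.expm1(exponent);
}

const crc32Only = collisionProbability(100000, 32);
const md5 = collisionProbability(100000, 128);

console.log(`CRC32, 100k files: ${(crc32Only * 100).toFixed(1)}%`);
console.log(`MD5, 100k files: ${md5.toExponential(2)}`);
```

Under these assumptions, a collection of 100,000 distinct files has well over a 50% chance of containing at least one CRC32 collision, while the same estimate for MD5 is vanishingly small.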

## Automatically using other checksum algorithms

Some DAT release groups do not include every checksum for every file. For example, MAME CHDs only include SHA1 checksums and nothing else, not even filesize information.

And some DAT release groups do not include filesize information for every file, preventing a safe use of CRC32. For example, not every [Hardware Target Game Database SMDB](https://github.com/frederic-mahe/Hardware-Target-Game-Database/tree/master/EverDrive%20Pack%20SMDBs) includes file sizes, but they typically include all the normal checksums.

!!! success

For situations like these, `igir` will automatically detect what combination of checksums it needs to calculate for input files to be able to match them to DATs. This can greatly slow down file scanning, especially with archives.

For example, if you provide all of these DATs at once with the [`--dat <path>` option](../dats/processing.md):

- No-Intro's Nintendo Game Boy DAT (which includes filesize, CRC32, MD5, and SHA1 information)
- Hardware Target Game Database's Atari Lynx SMDB (which includes CRC32, MD5, SHA1, and SHA256 information but _not_ filesize)
- MAME ListXML (which only includes SHA1 information for CHD "disks")

...then `igir` will determine that SHA1 is necessary to calculate because not every ROM in every DAT includes CRC32 _and_ filesize information.

!!! note

When generating a [dir2dat](../dats/dir2dat.md) with the `igir dir2dat` command, `igir` will calculate CRC32, MD5, and SHA1 information for every file. This helps ensure that the generated DAT has the most complete information it can.

## Manually using other checksum algorithms

!!! danger

Most people do not need to calculate checksums above CRC32. CRC32 + filesize is sufficient to match ROMs and test written files in the vast majority of cases. The below information is for people who _truly_ know they need higher checksums.

You can specify higher checksum algorithms with the `--input-min-checksum <algorithm>` option like this:

```shell
igir [commands..] [options] --input-min-checksum MD5
```

```shell
igir [commands..] [options] --input-min-checksum SHA1
```

If not every ROM in every DAT provides the checksum you specify, `igir` may automatically calculate and match files based on other checksums (see above).

The reason you might want to do this is to have a higher confidence that found files _exactly_ match ROMs in DATs. Just keep in mind that enabling non-CRC32 checksums will _greatly_ slow down scanning of files within archives.

Here is a table that shows the keyspace for each checksum algorithm, where a higher number of bits reduces the chance of collisions:

| Algorithm | Digest size | Unique values | Example value |
|-----------|-------------|----------------------------|--------------------------------------------|
| CRC32 | 32 bits | 2^32 = 4.29 billion | `30a184a7` |
| MD5 | 128 bits | 2^128 = 340.28 undecillion | `52bb8f12b27cebd672b1fd8a06145b1c` |
| SHA1 | 160 bits | 2^160 = 1.46 quindecillion | `666d29a15d92f62750dd665a06ce01fbd09eb98a` |

When files are [tested](../commands.md#test) after being written, `igir` will use the highest checksum available from the scanned file to check the written file. This lets you have equal confidence that a file was written correctly as well as matched correctly.
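
As an illustration of the three algorithms in the table above, the following sketch computes all of them for an in-memory buffer. It is not `igir`'s implementation: the `crc32` and `checksums` functions and the `Checksums` interface are hypothetical names. MD5 and SHA1 come from Node.js's built-in `node:crypto` module, while CRC32 is hand-rolled here with the common table-based method.

```typescript
import { createHash } from 'node:crypto';

// Lookup table for the standard CRC-32 (reflected polynomial 0xEDB88320)
const CRC32_TABLE = new Uint32Array(256);
for (let i = 0; i < 256; i += 1) {
  let value = i;
  for (let bit = 0; bit < 8; bit += 1) {
    value = (value & 1) ? (0xedb88320 ^ (value >>> 1)) : (value >>> 1);
  }
  CRC32_TABLE[i] = value >>> 0;
}

// Hypothetical helper: CRC32 of a buffer as 8 lowercase hex characters
function crc32(data: Uint8Array): string {
  let crc = 0xffffffff;
  for (let i = 0; i < data.length; i += 1) {
    crc = CRC32_TABLE[(crc ^ data[i]) & 0xff] ^ (crc >>> 8);
  }
  return ((crc ^ 0xffffffff) >>> 0).toString(16).padStart(8, '0');
}

// Hypothetical shape, for illustration only
interface Checksums {
  size: number;
  crc32: string;
  md5: string;
  sha1: string;
}

function checksums(data: Uint8Array): Checksums {
  return {
    size: data.length,
    crc32: crc32(data),
    md5: createHash('md5').update(data).digest('hex'),
    sha1: createHash('sha1').update(data).digest('hex'),
  };
}

const example = checksums(Buffer.from('123456789'));
console.log(example);
```

The hex strings have 8, 32, and 40 characters respectively, matching the 32-, 128-, and 160-bit digest sizes in the table. For real ROM files, streaming the contents through the hashes would avoid loading large files fully into memory.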
6 changes: 2 additions & 4 deletions docs/usage/personal.md
@@ -47,9 +47,8 @@ The `igir_library_sync.sh` script helps me keep this collection organized and me
# @param {...string} $@ Input directories to merge into this collection
set -euo pipefail

here="$(pwd)"
# shellcheck disable=SC2064
trap "cd \"${here}\"" EXIT
trap "cd \"${PWD}\"" EXIT
cd "$(dirname "$0")"


@@ -103,9 +102,8 @@ I have this script `igir_pocket_sync.sh` at the root of my Analogue Pocket's SD
#!/usr/bin/env bash
set -euo pipefail

here="$(pwd)"
# shellcheck disable=SC2064
trap "cd \"${here}\"" EXIT
trap "cd \"${PWD}\"" EXIT
cd "$(dirname "$0")"


1 change: 1 addition & 0 deletions index.ts
@@ -31,6 +31,7 @@ gracefulFs.gracefulify(realFs);
logger.notice(`Exiting ${Constants.COMMAND_NAME} early`);
await ProgressBarCLI.stop();
process.exit(0);
// TODO(cemmer): does exit here cause cleanup not to happen?
});

// Parse CLI arguments
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -87,6 +87,7 @@ nav:
- input/file-scanning.md
- input/reading-archives.md
- ROM Processing:
- roms/matching.md
- roms/filtering-preferences.md
- roms/headers.md
- roms/patching.md
2 changes: 1 addition & 1 deletion src/console/logger.ts
@@ -146,7 +146,7 @@ export default class Logger {
.replace(new RegExp(`(${Constants.COMMAND_NAME}) (( ?[a-z0-9])+)`, 'g'), `$1 ${chalk.magenta('$2')}`)

.replace(/(\[options\.*\])/g, chalk.cyan('$1'))
.replace(/([^a-zA-Z0-9-])(-[a-zA-Z0-9]+)/g, `$1${chalk.cyanBright('$2')}`)
.replace(/([^a-zA-Z0-9-])(-[a-zA-Z0-9]([a-zA-Z0-9]|\n[ \t]*)*)/g, `$1${chalk.cyanBright('$2')}`)
.replace(/(--[a-zA-Z0-9][a-zA-Z0-9-]+(\n[ \t]+)?[a-zA-Z0-9-]+) ((?:[^ -])[^"][^ \n]*|"(?:[^"\\]|\\.)*")/g, `$1 ${chalk.underline('$3')}`)
.replace(/(--[a-zA-Z0-9][a-zA-Z0-9-]+(\n[ \t]+)?[a-zA-Z0-9-]+)/g, chalk.cyan('$1'))
.replace(/(<[a-zA-Z]+>)/g, chalk.blue('$1'))
40 changes: 37 additions & 3 deletions src/igir.ts
@@ -36,6 +36,7 @@ import DAT from './types/dats/dat.js';
import Parent from './types/dats/parent.js';
import DATStatus from './types/datStatus.js';
import File from './types/files/file.js';
import { ChecksumBitmask } from './types/files/fileChecksums.js';
import IndexedFiles from './types/indexedFiles.js';
import Options from './types/options.js';
import OutputFactory from './types/outputFactory.js';
@@ -74,7 +75,7 @@ export default class Igir {

// Scan and process input files
let dats = await this.processDATScanner();
const indexedRoms = await this.processROMScanner();
const indexedRoms = await this.processROMScanner(this.determineScanningBitmask(dats));
const roms = indexedRoms.getFiles();
const patches = await this.processPatchScanner();

@@ -220,11 +221,44 @@ export default class Igir {
return dats;
}

private async processROMScanner(): Promise<IndexedFiles> {
private determineScanningBitmask(dats: DAT[]): number {
let matchChecksum = this.options.getInputMinChecksum() ?? ChecksumBitmask.CRC32;

if (this.options.shouldDir2Dat()) {
Object.keys(ChecksumBitmask)
.filter((bitmask): bitmask is keyof typeof ChecksumBitmask => Number.isNaN(Number(bitmask)))
// Has not been enabled yet
.filter((bitmask) => ChecksumBitmask[bitmask] > 0)
.filter((bitmask) => !(matchChecksum & ChecksumBitmask[bitmask]))
.forEach((bitmask) => {
matchChecksum |= ChecksumBitmask[bitmask];
this.logger.trace(`generating a dir2dat, enabling ${bitmask} file checksums`);
});
}

dats.forEach((dat) => {
const datMinimumBitmask = dat.getRequiredChecksumBitmask();
Object.keys(ChecksumBitmask)
.filter((bitmask): bitmask is keyof typeof ChecksumBitmask => Number.isNaN(Number(bitmask)))
// Has not been enabled yet
.filter((bitmask) => ChecksumBitmask[bitmask] > 0)
.filter((bitmask) => !(matchChecksum & ChecksumBitmask[bitmask]))
// Should be enabled for this DAT
.filter((bitmask) => datMinimumBitmask & ChecksumBitmask[bitmask])
.forEach((bitmask) => {
matchChecksum |= ChecksumBitmask[bitmask];
this.logger.trace(`${dat.getNameShort()}: needs ${bitmask} file checksums, enabling`);
});
});

return matchChecksum;
}

private async processROMScanner(checksumBitmask: number): Promise<IndexedFiles> {
const romScannerProgressBarName = 'Scanning for ROMs';
const romProgressBar = await this.logger.addProgressBar(romScannerProgressBarName);

const rawRomFiles = await new ROMScanner(this.options, romProgressBar).scan();
const rawRomFiles = await new ROMScanner(this.options, romProgressBar).scan(checksumBitmask);

await romProgressBar.setName('Detecting ROM headers');
const romFilesWithHeaders = await new ROMHeaderProcessor(this.options, romProgressBar)
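
The key filtering in `determineScanningBitmask` relies on a TypeScript detail: numeric enums compile to objects that map both names to values and values to names, so `Object.keys()` returns the numeric keys as well. The sketch below shows the pattern in isolation. `ExampleBitmask` and its values are assumptions standing in for `ChecksumBitmask`, whose definition is not shown in this diff, and `unionOf` and `enabledNames` are hypothetical helpers rather than `igir` functions.

```typescript
// Hypothetical stand-in for ChecksumBitmask; the real values are not shown in this diff
enum ExampleBitmask {
  NONE = 0,
  CRC32 = 1,
  MD5 = 2,
  SHA1 = 4,
}

// Object.keys() on a numeric enum returns '0', '1', '2', '4' as well as the
// names, so keep only non-numeric keys, then drop the zero-valued member
const names = Object.keys(ExampleBitmask)
  .filter((key): key is keyof typeof ExampleBitmask => Number.isNaN(Number(key)))
  .filter((key) => ExampleBitmask[key] > 0);

// OR every required bitmask into the minimum, so each needed algorithm ends up enabled
function unionOf(required: number[], minimum: number = ExampleBitmask.CRC32): number {
  return required.reduce((acc, bitmask) => acc | bitmask, minimum);
}

// List the algorithm names whose bit is set in the given bitmask
function enabledNames(bitmask: number): string[] {
  return names.filter((name) => (bitmask & ExampleBitmask[name]) !== 0);
}

// e.g. one DAT needs only CRC32 while another needs SHA1
const combined = unionOf([ExampleBitmask.CRC32, ExampleBitmask.SHA1]);
console.log(enabledNames(combined));
```

Accumulating with bitwise OR means an algorithm is only ever added, never removed, which fits the idea of a user-specified minimum that individual DATs can raise.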
48 changes: 31 additions & 17 deletions src/modules/argumentsParser.ts
@@ -6,6 +6,7 @@ import Logger from '../console/logger.js';
import Constants from '../constants.js';
import ArrayPoly from '../polyfill/arrayPoly.js';
import ConsolePoly from '../polyfill/consolePoly.js';
import { ChecksumBitmask } from '../types/files/fileChecksums.js';
import ROMHeader from '../types/files/romHeader.js';
import Internationalization from '../types/internationalization.js';
import Options, { GameSubdirMode, MergeMode } from '../types/options.js';
@@ -64,8 +65,9 @@ export default class ArgumentsParser {
parse(argv: string[]): Options {
this.logger.trace(`Parsing CLI arguments: ${argv}`);

const groupInput = 'Input options (supports globbing):';
const groupRomInput = 'ROM input options:';
const groupDatInput = 'DAT input options:';
const groupPatchInput = 'Patch input options:';
const groupRomOutput = 'ROM output options (processed in order):';
const groupRomZip = 'ROM zip command options:';
const groupRomLink = 'ROM link command options:';
@@ -176,33 +178,30 @@ export default class ArgumentsParser {

yargsParser
.option('input', {
group: groupInput,
group: groupRomInput,
alias: 'i',
description: 'Path(s) to ROM files or archives',
description: 'Path(s) to ROM files or archives (supports globbing)',
demandOption: true,
type: 'array',
requiresArg: true,
})
.option('input-exclude', {
group: groupInput,
group: groupRomInput,
alias: 'I',
description: 'Path(s) to ROM files or archives to exclude from processing',
description: 'Path(s) to ROM files or archives to exclude from processing (supports globbing)',
type: 'array',
requiresArg: true,
})
.option('patch', {
group: groupInput,
alias: 'p',
description: `Path(s) to ROM patch files or archives (supported: ${PatchFactory.getSupportedExtensions().join(', ')})`,
type: 'array',
requiresArg: true,
})
.option('patch-exclude', {
group: groupInput,
alias: 'P',
description: 'Path(s) to ROM patch files or archives to exclude from processing',
type: 'array',
.option('input-min-checksum', {
group: groupRomInput,
description: 'The minimum checksum level to calculate and use for matching',
choices: Object.keys(ChecksumBitmask)
.filter((bitmask) => Number.isNaN(Number(bitmask)))
.filter((bitmask) => ChecksumBitmask[bitmask as keyof typeof ChecksumBitmask] > 0)
.map((bitmask) => bitmask.toUpperCase()),
coerce: ArgumentsParser.getLastValue, // don't allow string[] values
requiresArg: true,
default: ChecksumBitmask[ChecksumBitmask.CRC32].toUpperCase(),
})

.option('dat', {
@@ -286,6 +285,21 @@ export default class ArgumentsParser {
return true;
})

.option('patch', {
group: groupPatchInput,
alias: 'p',
description: `Path(s) to ROM patch files or archives (supports globbing) (supported: ${PatchFactory.getSupportedExtensions().join(', ')})`,
type: 'array',
requiresArg: true,
})
.option('patch-exclude', {
group: groupPatchInput,
alias: 'P',
description: 'Path(s) to ROM patch files or archives to exclude from processing (supports globbing)',
type: 'array',
requiresArg: true,
})

.option('fixdat', {
type: 'boolean',
coerce: (val: boolean) => {