Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up MetadataLoader.ts #277

Open
bpatrik opened this issue Apr 26, 2021 · 13 comments
Open

Clean up MetadataLoader.ts #277

bpatrik opened this issue Apr 26, 2021 · 13 comments

Comments

@bpatrik
Copy link
Owner

bpatrik commented Apr 26, 2021

Pigallery2 is using multiple exif, iptc parers (as I could not find one that can do all at once) to get all the metadata for a photo:

  1. fs.statSync for silesize, creation date (if i do not find any better from exif)

    const stat = fs.statSync(fullPath);

  2. ts-exif-parser for most of the exif data:

    const exif = ExifParserFactory.create(data).parse();

  3. image-size to get the photo dimension if 2) ts-exif-parser fails:

    const info = imageSize(fullPath);

    (The image dimension is a must for this app to work properly )

  4. ts-node-iptc for GPS/location, keywords, caption related info:

    const iptcData = IptcParser.parse(data);

  5. exifreader for faces and now for orientation as this lib parses that properly

    const exif = ExifReader.load(data);

This class needs a major refactor. Probably. exifreader would be able to handle all cases now. Reducing to only one lib, photo indexing would be probably way faster (instead of 4-5 file open, 1 would do the trick)

@bpatrik
Copy link
Owner Author

bpatrik commented Apr 26, 2021

cbhushan@ in #212 (comment) suggested https://mutiny.cz/exifr/ but that is missing Face region support: MikeKovarik/exifr#59

@mattiasw
Copy link

Hi! If I can do anything to help ease a transition, please let me know. /Maintainer of ExifReader

@bpatrik
Copy link
Owner Author

bpatrik commented Apr 30, 2021

Thank you, @mattiasw for your library. I think that is the best candidate for pigallry2 to have "one exif reader to rule them all".

So far I'm not planning to put effort in it on my side.
I opened this bug if anyone feels the power to implement it :)
We already have some unit tests in place for the Metadata loader , so a refactor should be relatively safe.

@bpatrik
Copy link
Owner Author

bpatrik commented May 10, 2021

I "hacked up" a quick benchmark between 3 exif readers.
(#294 (comment) suggests exiftool is not as fast as we hoped)

Disclaimer: I did not double check if all three can proved all the needed metadata. But I thinks so, or at least their developers are quite responsive.

Libs to test:

  • "exiftool-vendored": "14.3.0"
  • "exifr": "7.0.0"
  • "exifreader": "3.15.0"

Test was done on Wind 10, with SSHD (it is possoble that the HDD cached the photos after a while), number of photos: 698, dir size: 3,28GB. Each test was run 10 times and an average was calculated.

Result:

Exifreader: 4116.9349999999995 ms
exiftool: 34532.78095 ms
exifr: 990.25858 ms

Reproduce the test:

MetadaLoader.ts: loadPhotoMetadata(fullPath: string): got modified to:

 /****exiftool*******/
public static loadPhotoMetadata(fullPath: string): Promise<PhotoMetadata> {
 return new Promise<PhotoMetadata>(async (resolve, reject) => {
      const exif = await exiftool.read(fullPath, ['--ifd1:all']);
      resolve({
        size: {width: 1, height: 1},
        orientation: OrientationTypes.TOP_LEFT,
        creationDate: 0,
        fileSize: 0
      });
    });
  }
 /****exifr*******/
public static loadPhotoMetadata(fullPath: string): Promise<PhotoMetadata> {
 return new Promise<PhotoMetadata>(async (resolve, reject) => {
      const exif = await exifr.parse(fullPath);
      resolve({
        size: {width: 1, height: 1},
        orientation: OrientationTypes.TOP_LEFT,
        creationDate: 0,
        fileSize: 0
      });
    });
  }
 /****exifreader test*******/
public static loadPhotoMetadata(fullPath: string): Promise<PhotoMetadata> {
return new Promise<PhotoMetadata>(async (resolve, reject) => {
      const fd = fs.openSync(fullPath, 'r');
      const data = Buffer.allocUnsafe(Config.Server.photoMetadataSize);
      fs.read(fd, data, 0, Config.Server.photoMetadataSize, 0, async (err) => {
        fs.closeSync(fd);
        if (err) {
          return reject({file: fullPath, error: err});
        }
        const exif = ExifReader.load(data); // this is a synchronous call, no await it required
        resolve({
          size: {width: 1, height: 1},
          orientation: OrientationTypes.TOP_LEFT,
          creationDate: 0,
          fileSize: 0
        });
      });
    });
  }

NOTE: for exifreader, not the whole file is read, but only part of it. The exif data is only at the beginning of the file, no need to read the actual photo.
NOTE2: I could not find an api for exiftool to read only the beginning of the file
NOTE3: It seems to be possbile to read only the beginning of the file with exifr, although the test were done by passing the file path.

Test sandbox:
pigallrery2/test.ts

import {DiskMangerWorker} from './src/backend/model/threading/DiskMangerWorker';
import {Config} from './src/common/config/private/Config';
import {ProjectPath} from './src/backend/ProjectPath';

const run = async () => {
  Config.Server.Media.folder =<path to a folder with 698 photos>
  ProjectPath.reset();
  let time = 0;
  const TIMES = 10;
  for (let i = 0; i < TIMES; ++i) {
    console.log('round', i);
    const start = process.hrtime();
    await DiskMangerWorker.scanDirectory('/');
    const end = process.hrtime(start);
    time += (end[0] * 1000 + end[1] / 1000000);
    console.log('round time:', (end[0] * 1000 + end[1] / 1000000));
  }
  console.log(time / TIMES);
};
run().catch(console.error);

@mcdamo
Copy link
Contributor

mcdamo commented May 12, 2021

It looks like exiftool is out of the running.

I also think that benchmark might not be fair for exifr vs exifreader: it appears exifr is very fast by default because it only imports IFD0, EXIF, GPS tags, so no IPTC or XMP(?)

I tried a benchmark with loading all tags exifr.parse(file, true) and it was of similar performance to exifreader.

bpatrik added a commit that referenced this issue May 13, 2021
@bpatrik
Copy link
Owner Author

bpatrik commented May 13, 2021

I made an experimental branch that build bpatrik/pigallery2:experimental-debian-buster

I made multiple experiments:

public static async loadPhotoMetadata(fullPath: string): Promise<PhotoMetadata> {
if (ActiveExperiments[Experiments.loadPhotoMetadata.name] === Experiments.loadPhotoMetadata.groups.exifr) {
return await this.loadWithExifr(fullPath);
}
if (ActiveExperiments[Experiments.loadPhotoMetadata.name] === Experiments.loadPhotoMetadata.groups.exifrAll) {
return await this.loadWithExifr(fullPath);
}
if (ActiveExperiments[Experiments.loadPhotoMetadata.name] === Experiments.loadPhotoMetadata.groups.exifrSelected) {
return await this.loadWithExifr(fullPath);
}
if (ActiveExperiments[Experiments.loadPhotoMetadata.name] === Experiments.loadPhotoMetadata.groups.exifreader) {
return await this.loadWithExifReader(fullPath);
}
if (ActiveExperiments[Experiments.loadPhotoMetadata.name] === Experiments.loadPhotoMetadata.groups.exiftool) {
return await this.loadWithExiftool(fullPath);
}
return this.currentMethod(fullPath);
}

And run my benchmark.

Result:

PiGallery2 v1.8.7 (experimental), 13.05.2021

Version: v1.8.7, built at: Thu May 13 2021 10:31:47 GMT+0000 (Coordinated Universal Time), git commit:36d5ccbbd4c077d25c99f47c215740a46122cad4
System: Raspberry Pi 4 4G Model B, SandisK Mobile Ultra 32Gb CLass10, UHS-I, HDD: Western Digital Elements 1TB (WDBUZG0010BBK)

Gallery: directories: 31, photos: 2036, videos: 35, diskUsage : 22.08GB, persons : 1241, unique persons (faces): 14

Action Average Duration Result
[baseline]Scanning directory (current production method) 10 383.2 ms media: 698, directories: 0
[loadPhotoMetadata=exifr]Scanning directory (does not parse the lib output) 2 502.6 ms media: 698, directories: 0
[loadPhotoMetadata=exifrAll]Scanning directory (does not parse the lib output) 2 454.3 ms media: 698, directories: 0
[loadPhotoMetadata=exifrSelected]Scanning directory (does not parse the lib output) 2 437.0 ms media: 698, directories: 0
[loadPhotoMetadata=exifreader]Scanning directory (does not parse the lib output) 9 595.9 ms media: 698, directories: 0
[loadPhotoMetadata=exiftool]Scanning directory (copied from PR #294) 76 023.4 ms media: 698, directories: 0

*Measurements run 20 times and an average was calculated.

run for : 2316553.0ms


I do not see much difference between the ˛exifr settings. Also for exifreader, I have to read a file, which I just do by feeling. Most likely I read more bytes than absolutely necessary, that is why it is slower (I guess). I could not find an exifreader API that could read the file itself.

Based on this I find exifr the most appealing for this project.

@bpatrik
Copy link
Owner Author

bpatrik commented Jul 7, 2021

exifr now support sidecars: MikeKovarik/exifr#62

@grasdk
Copy link
Contributor

grasdk commented Feb 9, 2024

3. `image-size` to get the photo dimension       

A comment from a veteran user of exiftool on the exiftool forums, indicate that metadata independent determination of image size is a good idea, even if you find data in the exif:
https://exiftool.org/forum/index.php?topic=13310.msg71936#msg71936

So perhaps it would be fine to keep image-size around after the cleanup.

@bpatrik
Copy link
Owner Author

bpatrik commented Feb 20, 2024

3. `image-size` to get the photo dimension       

A comment from a veteran user of exiftool on the exiftool forums, indicate that metadata independent determination of image size is a good idea, even if you find data in the exif: https://exiftool.org/forum/index.php?topic=13310.msg71936#msg71936

So perhaps it would be fine to keep image-size around after the cleanup.

keeping image-size sounds good to me.

@mceachen
Copy link

mceachen commented Mar 25, 2024

Howdy! I'm the author of exiftool-vendored.

Windows has especially poor child process spawn performance: I've seen Windows take upwards of 10 seconds on a lightly loaded server to just allow the process to spin. I had to increase timeouts to 30 seconds on GHA! For reference, a slow Linux server can spin up a subprocess in 3-100ms.

As exiftool is running in a separate process, it suffers from this OS performance hit, but exiftool allows you to amortize that cost over many thousands of file reads if you use it's "-stay_open" mode (which my library leverages).

I'd update your benchmark to try reading from multiple files concurrently, rather than waiting for every read -- exiftool-vendored will spawn a pool of exiftool child processes (and manage their lifecycle, shutting them down when idle and spinning them up as your demand ebbs and flows).

One possibly surprising thing with my library: in an effort to avoid bug reports saying that exiftool fork-bombed or IO-starved their box, I default the maxProcs setting to be 1/4 the number of CPUs on the current box. In other words, it's not set up to win any benchmark with default settings. Changing maxProcs to be require("node:os").cpus().length should help: in my tests, read performance on Windows is ~<200ms/file, but of course this depends on CPU speed, filesystem performance, and arguments provided to the read command--it turns out that well-written perl is quite fast: on my AMD 5950x, Ubuntu 22, and a gen-3 SSD, read() throughput is 500+ files/second (which backs into ~60ms/read())

Edit: and just to clarify, I'm not asking you to switch EXIF libraries--I'm just wanting to highlight the issues with using exiftool on fork-hostile OSes. If exifr behaves as expected for you, then that may indeed be your best option. 🍻

@enrique-lozano
Copy link

Hi @mceachen,

By "reading from multiple files concurrently, rather than waiting for every read" you mean using a Promise.all() or something similar or what?? I tried that and the results are better (120-140ms per file instead of 200) but really poor compared to others.

Of course this can be improved with web workers but the other libraries will have a really big improvement in performance with web workers too.

Maybe I'm missing something or maybe is the OS (I'm in windows). But I use a lot exiftool to process images in bulk and is way faster than this :(

@bpatrik
Copy link
Owner Author

bpatrik commented Jun 28, 2024

Thank you for the pointers @mceachen. Really appreciate them.

Generally I would be happy to switch to exiftools given its the "standard" tool that everyone just accepts. If it's performance can kinda match with what we today have, it would be a nice switch.

On the other hand, we just had a major MetadataLoader refactor recently. I don't think I would want to invest into an other one. But If someone wants to give it a try, I happy to support.

@grasdk
Copy link
Contributor

grasdk commented Oct 31, 2024

I think this is done, but #915 might be reopened as a suggestion?
@bpatrik

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants