Summary of a number of differences in mime type reporting before and after Tika #48

Open
malclocke opened this issue May 12, 2021 · 4 comments

Comments

@malclocke

Hello 👋

In light of the differences showing up in MIME type reporting before and after the switch to Tika, I thought it would be worth getting ahead of the bug reports by running MIME type detection over a large set of example files, both before and after the change.

I found a set of about 500 files here: https://gitlab.freedesktop.org/xdg/shared-mime-info/-/tree/master/tests/mime-detection. Unfortunately, as far as I can tell, Tika doesn't have a similar set of test files in its source.

I then ran the following test script against this set of files:

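require "marcel"

# Print "<basename> <detected MIME type>" for each file given on the command line.
ARGV.each do |filename|
  basename = File.basename(filename)

  File.open(filename) do |file|
    # Marcel looks at the file contents and uses the name as an additional hint.
    puts "%s %s" % [basename, Marcel::MimeType.for(file, name: basename)]
  end
end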

I ran this script against two versions of Marcel: v0.3.3 and HEAD at the time of writing (a525d5b).

The attached CSV shows every instance where the two versions reported a different MIME type. There are 286 in total. Most of these MIME types are, I would say, fairly niche and could probably be ignored without ever causing anyone a problem, but there are some common ones in there. Conversely, the set of files doesn't cover every MIME type known to humanity, so other differences will no doubt still show up.
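For anyone who wants to repeat the comparison, here is a minimal sketch of one way to diff two runs of the script above. It is not necessarily how the attached CSV was produced. It assumes each run's output was saved to a text file (the paths are whatever you pass on the command line), and the helper names `parse_report` and `changed_types` are made up for illustration.

```ruby
require "csv"

# Parse "basename mime/type" lines (the output format of the script above)
# into a Hash of basename => type. Splitting on the LAST space keeps
# basenames that contain spaces intact.
def parse_report(text)
  text.each_line.with_object({}) do |line, types|
    basename, _, type = line.chomp.rpartition(" ")
    types[basename] = type unless basename.empty? || type.empty?
  end
end

# Return [basename, old_type, new_type] rows for every file that appears in
# both reports with a different type.
def changed_types(old_report, new_report)
  old_types = parse_report(old_report)
  new_types = parse_report(new_report)

  (old_types.keys & new_types.keys).sort.filter_map do |basename|
    old_type = old_types[basename]
    new_type = new_types[basename]
    [basename, old_type, new_type] unless old_type == new_type
  end
end

# Command-line use: pass the two saved report files, old version first.
if $PROGRAM_NAME == __FILE__ && ARGV.size == 2
  rows = changed_types(File.read(ARGV[0]), File.read(ARGV[1]))
  print CSV.generate_line(["file", "old type", "new type"])
  rows.each { |row| print CSV.generate_line(row) }
end
```

Files that appear in only one of the two reports are skipped, so the output only contains genuine type changes.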

Anyway, I figured this list may be useful. Feel free to close this issue if it's not. 🥰

mimetype_for_diff-v0.3.3-a525d5b3.csv

@gmcgibbon
Member

gmcgibbon commented May 12, 2021

I don't think we can legally use those files as our fixtures, but this is good to know nonetheless. Thanks for the info!

Here's a table version of the CSV detailing all the affected types (minus types I've already fixed in PRs). We can track fixes for these in this issue:

PR open? file v0.3.3 type a525d5b type
[ ] 32x-rom.32x application/x-genesis-32x-rom application/octet-stream
[ ] 3ds-tloz-mm.3ds image/x-3ds application/octet-stream
[ ] 4jsno.669 audio/x-mod application/octet-stream
[ ] adf-test.adf application/x-amiga-disk-format application/octet-stream
[ ] aero_alt.cur image/x-win-bitmap application/octet-stream
[ ] all_w.m3u8 audio/x-mpegurl application/vnd.apple.mpegurl
[ ] Anaphraseus-1.21-beta.oxt application/vnd.openofficeorg.extension application/zip
[ ] androide.k7 application/x-thomson-cassette audio/mpeg
[ ] aportis.pdb application/x-aportisdoc chemical/x-pdb
[ ] archive.lrz application/x-lrzip application/octet-stream
[ ] ascii.stl model/stl application/vnd.ms-pki.stl
[ ] atari-2600-test.A26 application/x-atari-2600-rom application/octet-stream
[ ] atari-7800-test.A78 application/x-atari-7800-rom application/octet-stream
[ ] atari-lynx-chips-challenge.lnx application/x-atari-lynx-rom application/octet-stream
[ ] bathead.sk image/x-skencil application/octet-stream
[ ] bibtex.bib text/x-bibtex text/x-matlab
[ ] binary.stl model/stl application/vnd.ms-pki.stl
[ ] blitz.m7 application/x-thomson-cartridge-memo7 application/octet-stream
[ ] break.mtm audio/x-mod application/octet-stream
[ ] bug106330.iso application/x-cd-image application/x-iso9660-image
[ ] bug-30656-xchat.conf application/octet-stream text/x-config
[ ] ccfilm.axv video/ogg application/ogg
[ ] classiq1.hfe application/x-hfe-floppy-image application/octet-stream
[ ] combined.karbon application/x-karbon application/zip
[ ] comics.cb7 application/x-cb7 application/x-7z-compressed
[ ] comics.cbt application/x-cbt application/x-gtar
[x] ct_faac-adts.aac audio/aac audio/x-aac
[ ] cyborg.med audio/x-mod application/octet-stream
[ ] dbus-comment.service text/x-dbus-service application/octet-stream
[ ] dbus.service text/x-dbus-service application/octet-stream
[ ] debian-goodies_0.63_all.deb application/vnd.debian.binary-package application/x-debian-package
[ ] dia.shape application/x-dia-shape image/svg+xml
[ ] disk.img application/x-raw-disk-image application/octet-stream
[ ] disk.raw-disk-image application/x-raw-disk-image application/octet-stream
[ ] disk.vhd text/x-vhdl application/x-vhd
[ ] dreamcast-us-samba-de-amigo-track-1.bin application/x-sega-cd-rom application/octet-stream
[ ] Empty.chrt application/x-kchart application/zip
[ ] en_US.zip.meta4 application/metalink4+xml application/xml
[ ] esm.mjs application/javascript application/octet-stream
[ ] eu_en_Sword_of_Vermilion.bin application/x-sega-cd-rom audio/mpeg
[ ] example_42_all.snap application/vnd.snap application/octet-stream
[ ] feed2 application/rss+xml application/xml
[ ] feed.atom application/atom+xml application/xml
[ ] feed.rss application/rss+xml application/xml
[ ] feeds.opml text/x-opml+xml application/xml
[ ] fuji.themepack application/x-windows-themepack application/vnd.ms-cab-compressed
[ ] game-boy-color-test.gbc application/x-gameboy-color-rom application/octet-stream
[ ] game-boy-test.gb application/x-gameboy-color-rom application/octet-stream
[ ] game-gear-test.gg application/x-gamegear-rom application/octet-stream
[ ] GammaChart.exr image/x-exr image/aces
[ ] gedit.flatpakref application/vnd.flatpak.ref application/octet-stream
[ ] genesis1.bin application/x-genesis-rom application/octet-stream
[ ] genesis2.bin application/x-genesis-rom application/octet-stream
[ ] gnome.flatpakrepo application/vnd.flatpak.repo application/octet-stream
[ ] gtk-builder.ui application/x-designer application/xml
[ ] hbo-playlist.qtl application/x-quicktime-media-link application/octet-stream
[ ] hello.flatpak application/vnd.flatpak application/octet-stream
[ ] helloworld.groovy text/x-modelica text/x-groovy
[ ] helloworld.java text/x-java text/x-java-source
[ ] helloworld.xpi application/x-xpinstall application/zip
[ ] hello.xdgapp application/vnd.flatpak application/octet-stream
[ ] hereyes_remake.mo3 audio/x-mo3 application/octet-stream
[ ] image.sqsh application/vnd.squashfs application/octet-stream
[ ] ISOcyr1.ent application/xml-external-parsed-entity text/plain
[ ] iso-file.iso application/x-cd-image application/x-iso9660-image
[ ] IWAD.WAD application/x-doom-wad application/x-doom
[ ] javascript-without-extension application/javascript application/x-sh
[ ] jc-win.ani application/x-navi-animation application/octet-stream
[ ] json-ld-full-iri.jsonld application/ld+json application/octet-stream
[ ] layersupdatesignals.flw application/x-kivio application/zip
[ ] Leafpad-0.8.17-x86_64.AppImage application/x-iso9660-appimage application/x-elf
[ ] Leafpad-0.8.18.1.glibc2.4-x86_64.AppImage application/x-iso9660-appimage application/x-elf
[ ] linguist.ts text/vnd.qt.linguist application/xml
[ ] live-streaming.m3u audio/x-mpegurl application/vnd.apple.mpegurl
[ ] ls application/x-sharedlib application/x-elf
[ ] m64p_test_rom.n64 application/x-n64-rom application/octet-stream
[ ] m64p_test_rom.v64 application/x-n64-rom application/octet-stream
[ ] m64p_test_rom.z64 application/x-n64-rom application/octet-stream
[ ] markdown.md text/markdown text/x-web-markdown
[ ] mega-drive-rom.gen application/x-genesis-rom application/octet-stream
[ ] Metroid_japan.fds application/x-fds-disk application/octet-stream
[ ] msg0001.gsm audio/x-gsm application/octet-stream
[ ] msx2-metal-gear.msx application/x-msx-rom application/octet-stream
[ ] msx-penguin-adventure.msx application/x-msx-rom application/octet-stream
[ ] my-data.json-patch application/json-patch+json application/octet-stream
[ ] mypaint.ora image/openraster application/zip
[ ] neo-geo-pocket-color-test.ngc application/x-neo-geo-pocket-color-rom application/octet-stream
[ ] neo-geo-pocket-test.ngp application/x-neo-geo-pocket-rom application/octet-stream
[ ] nrl.trig application/trig application/octet-stream
[ ] ooo.stw application/vnd.sun.xml.writer.template application/vnd.sun.xml.writer
[ ] ooo-test.fodg application/vnd.oasis.opendocument.graphics-flat-xml application/xml
[ ] ooo-test.fodp application/vnd.oasis.opendocument.presentation-flat-xml application/xml
[ ] ooo-test.fods application/vnd.oasis.opendocument.spreadsheet-flat-xml application/xml
[ ] ooo-test.fodt application/vnd.oasis.opendocument.text-flat-xml application/xml
[ ] ooo.vor application/vnd.stardivision.writer application/x-staroffice-template
[ ] Oriental_tattoo_by_daftpunk22.eps image/x-eps application/postscript
[ ] panasonic_lumix_dmc_fz38_05.rw2 image/x-panasonic-rw2 image/x-raw-panasonic
[ ] petite-ouverture-a-danser.ly text/x-lilypond application/octet-stream
[ ] pico-rom.bin application/x-sega-pico-rom application/octet-stream
[ ] playlist.asx audio/x-ms-asx application/x-ms-asx
[ ] playlist.mrl text/x-mrml application/octet-stream
[ ] playlist.wpl application/vnd.ms-wpl text/html
[ ] plugins.qmltypes text/x-qml application/octet-stream
[ ] pocket-word.psw application/x-pocket-word application/octet-stream
[ ] Presentation.kpt application/x-kpresenter application/gzip
[ ] project.glade application/x-glade application/xml
[ ] PWAD.WAD application/x-doom-wad application/x-doom
[ ] pyside.py text/x-python3 text/x-python
[ ] raw-mjpeg.mjpeg video/x-mjpeg image/jpeg
[ ] README-pandoc-flavored-markdown.md text/markdown text/x-matlab
[ ] rectangle.qml text/x-qml application/octet-stream
[ ] registry-nt.reg text/x-ms-regedit application/octet-stream
[ ] registry.reg text/x-ms-regedit text/plain
[ ] reStructuredText.rst application/octet-stream text/x-rst
[ ] rgb-reference.ktx image/ktx application/octet-stream
[ ] ringtone.ime text/x-imelody application/octet-stream
[ ] ringtone.mmf application/x-smaf application/vnd.smaf
[ ] ripoux.sap application/x-thomson-sap-image application/octet-stream
[ ] sample1.nzb application/x-nzb application/xml
[ ] sample.vsdx application/vnd.ms-visio.drawing.main+xml application/x-tika-ooxml
[ ] saturn-test.bin application/x-saturn-rom application/octet-stream
[ ] sega-cd-test.iso application/x-sega-cd-rom application/x-iso9660-image
[ ] serafettin.rar application/pdf application/x-rar-compressed;version=4
[ ] settopbox.ts video/mp2t application/octet-stream
[ ] sg1000-test.sg application/x-sg1000-rom application/octet-stream
[ ] shebang.qml text/x-qml application/x-sh
[ ] shell-calls-awk application/x-perl application/x-sh
[ ] simon.669 audio/x-mod application/octet-stream
[ ] sms-test.sms application/x-sms-rom application/octet-stream
[ ] sqlite2.kexi application/x-kexiproject-sqlite2 application/octet-stream
[ ] sqlite3.kexi application/vnd.sqlite3 application/x-sqlite3
[ ] stream.nsc application/x-netshow-channel application/octet-stream
[ ] subtitle-microdvd.sub text/x-microdvd application/octet-stream
[ ] subtitle-mpsub.sub text/x-microdvd application/octet-stream
[ ] subtitle.srt application/x-subrip application/octet-stream
[ ] subtitle.ssa text/x-ssa application/octet-stream
[ ] subtitle-subviewer.sub text/x-microdvd application/octet-stream
[ ] systemd.automount text/x-systemd-unit application/octet-stream
[ ] systemd.device text/x-systemd-unit application/octet-stream
[ ] systemd.mount text/x-systemd-unit application/octet-stream
[ ] systemd.path text/x-systemd-unit application/octet-stream
[ ] systemd.scope text/x-systemd-unit application/octet-stream
[ ] systemd.service text/x-dbus-service application/octet-stream
[ ] systemd.slice text/x-systemd-unit application/octet-stream
[ ] systemd.socket text/x-systemd-unit application/octet-stream
[ ] systemd.swap text/x-systemd-unit application/octet-stream
[ ] systemd.target text/x-systemd-unit application/octet-stream
[ ] systemd.timer text/x-systemd-unit application/octet-stream
[ ] test10.gpx application/gpx+xml application/xml
[ ] test3.py text/x-python3 text/x-python
[ ] test.aa audio/x-pn-audibleaudio application/octet-stream
[ ] test.aax audio/x-pn-audibleaudio video/quicktime
[ ] test.alz application/x-alz application/octet-stream
[ ] test.bflng text/html application/xml
[ ] test.bsdiff application/x-bsdiff application/octet-stream
[ ] testcases.ksp application/x-kspread application/gzip
[ ] test.ccmx application/x-ccmx application/octet-stream
[ ] test-cdda.toc application/x-cdrdao-toc application/octet-stream
[ ] test-cdrom.toc application/x-cdrdao-toc application/octet-stream
[ ] test.class application/x-java application/java-vm
[ ] test.cl text/x-opencl-src text/x-common-lisp
[ ] test.cmake text/x-cmake application/octet-stream
[ ] test.coffee application/vnd.coffeescript text/x-coffeescript
[ ] test.csvs text/csv-schema application/octet-stream
[ ] test.dot text/vnd.graphviz application/msword
[ ] test.d text/x-dsrc text/x-d
[ ] test-en.mo application/x-gettext-translation application/octet-stream
[ ] test-en.po text/x-gettext-translation application/octet-stream
[ ] test.eps image/x-eps application/postscript
[ ] test.feature text/x-gherkin application/octet-stream
[ ] test.fit image/fits application/fits
[ ] test.fl application/x-fluid application/octet-stream
[ ] test.fli video/x-flic video/x-fli
[ ] test.g3 image/fax-g3 image/g3fax
[ ] test.gbr image/x-gimp-gbr application/octet-stream
[ ] test.gcode text/x.gcode application/octet-stream
[ ] test.geojson application/geo+json application/json
[ ] test.geo.json application/json application/octet-stream
[ ] test.gih image/x-gimp-gih application/octet-stream
[ ] test.gnd application/gnunet-directory application/octet-stream
[ ] test.gpx application/gpx+xml application/xml
[ ] test.gs text/x-genie application/octet-stream
[ ] test.html text/html application/xml
[ ] test-html-with-svg.html text/html image/svg+xml
[ ] test.ilbm image/x-ilbm audio/x-aiff
[ ] test.im1 image/x-sun-raster application/octet-stream
[ ] test.iptables text/x-iptables application/octet-stream
[ ] test.ipynb application/x-ipynb+json application/octet-stream
[ ] test_issue127.py text/x-python3 text/x-python
[ ] test.it87 application/x-it87 application/octet-stream
[ ] test.jar application/x-java-archive application/java-archive
[ ] test.jceks application/x-java-jce-keystore application/octet-stream
[ ] test.jks application/x-java-keystore application/octet-stream
[ ] test.jnlp application/x-java-jnlp-file application/xml
[ ] test.kdc image/x-kodak-kdc image/tiff
[ ] test-kounavail2.kwd application/x-kword application/gzip
[ ] test.lzo application/x-lzop application/octet-stream
[ ] test.manifest text/cache-manifest text/plain
[ ] test.metalink application/metalink+xml application/xml
[ ] test.mml application/mathml+xml application/octet-stream
[ ] test.mobi application/x-mobipocket-ebook application/octet-stream
[ ] test.mof text/x-mof application/x-mobipocket-ebook
[ ] test.mo text/x-modelica text/plain
[ ] test.mpc audio/x-musepack application/vnd.mophun.certificate
[ ] test.msi application/x-msi application/x-ms-installer
[ ] test.ogg audio/ogg audio/vorbis
[ ] test.ooc text/x-ooc application/octet-stream
[ ] test.opus audio/ogg audio/opus
[ ] test.owx application/owl+xml application/xml
[ ] test.oxps application/oxps application/zip
[ ] test.p12 application/pkcs12 application/x-pkcs12
[ ] test.pat image/x-gimp-pat application/octet-stream
[ ] test.pgn application/vnd.chess-pgn application/x-chess-pgn
[ ] test.php application/x-php text/html
[ ] test.pl application/x-perl text/x-perl
[ ] test.pm application/x-perl application/x-tika-msoffice
[ ] test.pmd application/x-pagemaker text/x-perl
[ ] test.por application/x-spss-por application/octet-stream
[ ] test.pot text/x-gettext-translation-template application/vnd.ms-powerpoint
[ ] test.py3 text/x-python3 application/x-sh
[ ] test.py text/x-python3 text/x-python
[ ] test.pyx text/x-python application/octet-stream
[ ] test.qp application/x-qpress application/octet-stream
[ ] test.qti application/x-qtiplot application/octet-stream
[ ] test.raml application/raml+yaml application/octet-stream
[ ] test-reordered.ipynb application/x-ipynb+json application/octet-stream
[ ] test.rs text/rust application/rls-services+xml
[x] test.sass text/x-sass application/octet-stream
[ ] test.sav application/x-spss-sav application/octet-stream
[x] test.scss text/x-scss application/octet-stream
[ ] test-secret.key application/pgp-keys application/vnd.apple.keynote
[ ] test-secret-key.skr application/pgp-keys application/octet-stream
[ ] test.sgi image/x-sgi image/x-rgb
[ ] test.sqlite2 application/x-sqlite2 application/octet-stream
[ ] test.sqlite3 application/vnd.sqlite3 application/x-sqlite3
[ ] test.ss text/x-scheme text/plain
[ ] test.svh text/x-svhdr application/octet-stream
[ ] test.sv text/x-svsrc application/octet-stream
[ ] test.t application/x-perl application/x-lz4
[ ] test.tar.lz4 application/x-lz4 application/x-lzip
[ ] test.tar.lz application/x-lzip application/zstd
[ ] test.tar.zst application/octet-stream application/msword
[ ] test-template.dot application/msword-template application/x-tex
[ ] test.tex text/x-tex image/x-tga
[ ] test.tga image/x-tga image/tiff
[ ] test.tif image/tiff application/octet-stream
[ ] test.ts video/mp2t text/troff
[ ] test.ttl text/turtle application/octet-stream
[ ] test.ttx application/x-font-ttx application/xml
[ ] test.twig text/x-twig application/octet-stream
[ ] test.url application/x-mswinurl application/octet-stream
[ ] test.uue text/x-uuencode application/octet-stream
[ ] test.vala text/x-vala application/octet-stream
[ ] test-vpn.pcf application/x-cisco-vpn-settings application/x-font-pcf
[ ] test.wim application/x-ms-wim application/octet-stream
[ ] test.xar application/x-xar application/vnd.xara
[ ] test.xht application/xhtml+xml application/xml
[ ] test.xhtml application/xhtml+xml application/xml
[ ] test.xlr application/vnd.ms-works application/x-tika-msworks-spreadsheet
[ ] test.xml.in application/xml text/plain
[ ] test.xpm image/x-xpixmap image/x-xbitmap
[ ] test.xps application/oxps application/zip
[ ] test.xsl application/xslt+xml application/xml
[ ] test.yaml application/x-yaml text/x-yaml
[ ] test.zst application/octet-stream application/zstd
[ ] text.qmlproject text/x-qml application/octet-stream
[ ] text.wwf application/x-wwf application/pdf
[ ] TS010082249.pub application/vnd.ms-publisher application/x-mspublisher
[ ] Utils.jsm application/javascript text/plain
[ ] virtual-boy-wario-land.vb application/x-virtual-boy-rom text/x-vbdotnet
[ ] webfinger.jrd application/jrd+json application/octet-stream
[ ] white_640x480.kra application/x-krita application/zip
[ ] wii.wad application/x-wii-wad application/x-doom
[ ] wonderswan-color-chocobo.wsc application/x-wonderswan-color-rom application/octet-stream
[ ] wonderswan-rockman-forte.ws application/x-wonderswan-rom application/octet-stream
[ ] x_speex_ogg.spx video/ogg audio/speex
[ ] zeb.3ds image/x-3ds application/octet-stream
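
A rough way to triage the rows above is to bucket them by whether the new type is a generic fallback. The sketch below is only illustrative: the list of "generic" types is an assumption of this example rather than anything defined by Marcel or Tika, and the names `Row`, `parse_row`, `triage` and `triage_counts` are hypothetical.

```ruby
# Types treated as generic fallbacks in this sketch. This list is an
# assumption made for illustration, not something Marcel or Tika defines.
GENERIC_TYPES = %w[
  application/octet-stream
  application/zip
  application/gzip
  application/xml
  text/plain
].freeze

Row = Struct.new(:pr_open, :file, :old_type, :new_type)

# Parse one row such as "[ ] test.zst application/octet-stream application/zstd".
# Returns nil for lines that do not match, such as the header line.
def parse_row(line)
  match = line.strip.match(/\A\[( |x)\]\s+(\S+)\s+(\S+)\s+(\S+)\z/)
  return nil unless match

  Row.new(match[1] == "x", match[2], match[3], match[4])
end

# Put a row into a rough triage bucket based on the generic-type list.
def triage(row)
  old_generic = GENERIC_TYPES.include?(row.old_type)
  new_generic = GENERIC_TYPES.include?(row.new_type)

  if new_generic && !old_generic
    :less_specific   # detection probably lost
  elsif old_generic && !new_generic
    :more_specific   # detection probably gained
  else
    :other_change    # needs a closer look
  end
end

# Count rows per bucket for an array of table lines.
def triage_counts(lines)
  lines.filter_map { |line| parse_row(line) }.map { |row| triage(row) }.tally
end
```

The buckets say nothing about which value is actually correct. They only suggest where to look first.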

@pixeltrix
Contributor

I can see that getting legitimate test files for some of these could be fraught with issues - Genesis ROMs for example 😬

@pixeltrix
Contributor

Also, not all of the differences are bugs. For example, serafettin.rar returns application/pdf in 0.3.3, which is obviously wrong, so we shouldn't change that back (though there's still a question of whether the new value is correct).

@malclocke
Author

@pixeltrix I agree that these are not all bugs. When a PR is submitted to fix a 'regression', it might be nice to ask for some kind of reference (e.g. an RFC or similar) showing why the old value is more correct than the new one.
