Changelog

All notable changes to this project are documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning (as of version 1.4.0).

[Unreleased]

[2.2.0] - 2025-01-10

Changed

  • Upgrade dependencies: zimscraperlib 5.0.0, warcio 1.7.5, cdxj_index 1.4.6 and others
  • Use the rewriting logic provided by zimscraperlib
  • Remove most HTML / CSS / JS rewriting logic which is now part of zimscraperlib 5
  • Fix wombat setup settings (especially isSW) (#293)

Fixed

  • Stop checking main entry processability when it is already found (#424)

[2.1.3] - 2024-11-01

Changed

  • Upgrade to wombat 3.8.3 (#414)

[2.1.2] - 2024-10-08

Added

  • Enrich test website with img srcset situations (in preparation for #403)

Changed

  • Upgrade dependencies, including wombat 3.8.2 (#407)

Fixed

  • HTML document can be retrieved as fetch resource type (#405)

[2.1.1] - 2024-09-05

Changed

  • Upgrade dependencies, including wombat 3.8.0 (#386)

[2.1.0] - 2024-08-09

Added

  • New fuzzy-rule for cheatography.com (#342), der-postillon.com (#330), iranwire.com (#363)
  • Properly rewrite redirect target url when present in HTML tag (#237)
  • New --encoding-aliases argument to pass encoding/charset aliases (#331)
  • Add support for SVG favicon (#148)
  • Automatically index PDF content and use PDF title (#289 and #290)

Changed

  • Upgrade to python-scraperlib 4.0.0
  • Generate fuzzy rules tests in Python and Javascript (#284)
  • Refactor HTML rewriter class to make it more open to change and expressive (#305)
  • Detect charset in document header only for HTML documents (#331)
  • Use software property from warcinfo record to set ZIM Scraper metadata (#357)
  • Store ContentDate as metadata, based on WARC-Date (#358)
  • Remove domain specific rules (#328)
  • Revisit retrieve_illustration logic to prefer best favicons (#352 and #369)
  • Upgrade dependencies (zimscraperlib 4.0.0, wombat.js 3.7.12 and others) (#376)

Fixed

  • Handle case where the redirect target is bad / unsupported (#332 and #356)
  • Fixed WARC files handling order to follow creation order (#366)
  • Remove subsequent slashes in URLs, both in Python and JS (#365)
  • Ignore non HTTP(S) WARC records (#351)
  • Fix vimeo_cdn_fix fuzzy rule for proper operation in Javascript (#348)
  • Performance issue linked to new "extensible" HTML rewriting rules (#370)

[2.0.3] - 2024-07-24

Changed

  • Moved rules definition from JSON to YAML and documented update process (#216)
  • Upgrade to wombat.js 3.7.11

Added

  • Exit with cleaner message when no entries are expected in the ZIM (#336) and when main entry is not processable (#337)
  • Add debug log for items whose content is empty (#344)

Fixed

  • The rewrite mode of some resources is still not correctly identified (#326)

[2.0.2] - 2024-06-18

Added

  • Add --ignore-content-header-charsets option to disable automatic retrieval of content charsets from content first bytes (#318)
  • Add --content-header-bytes-length option to specify how many first bytes to consider when searching for content charsets in header (#320)
  • Add --ignore-http-header-charsets option to disable automatic retrieval of content charsets from content HTTP Content-Type headers (#318)
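As a rough illustration of how the three options above could interact, here is a minimal sketch. It is not warc2zim's actual code: the function name, the regular expressions, the default of 1024 bytes, and the order in which candidates are collected are all assumptions made for this example. Only the three option names come from the entries above.

```python
import re
from typing import List, Optional

# Hypothetical patterns: look for "charset=<name>" in raw bytes or in a header value.
CONTENT_CHARSET_RE = re.compile(rb"""charset\s*=\s*["']?\s*([A-Za-z0-9_\-]+)""", re.IGNORECASE)
HEADER_CHARSET_RE = re.compile(r"""charset\s*=\s*["']?\s*([A-Za-z0-9_\-]+)""", re.IGNORECASE)


def candidate_charsets(
    content: bytes,
    content_type: Optional[str],
    *,
    ignore_content_header_charsets: bool = False,  # mirrors --ignore-content-header-charsets
    content_header_bytes_length: int = 1024,  # mirrors --content-header-bytes-length (default assumed)
    ignore_http_header_charsets: bool = False,  # mirrors --ignore-http-header-charsets
) -> List[str]:
    """Collect charset candidates declared by the content itself and by HTTP headers."""
    candidates: List[str] = []
    if not ignore_content_header_charsets:
        # Only the first N bytes of the content are searched.
        match = CONTENT_CHARSET_RE.search(content[:content_header_bytes_length])
        if match:
            candidates.append(match.group(1).decode("ascii").lower())
    if not ignore_http_header_charsets and content_type:
        # Charset declared in the HTTP Content-Type header, if any.
        match = HEADER_CHARSET_RE.search(content_type)
        if match:
            candidates.append(match.group(1).lower())
    return candidates
```

A caller would then try to decode the content with each candidate in turn. What happens when none succeeds is left out of this sketch.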

Changed

  • Simplify logic deciding content charset, stop guessing with chardet (#312)

Fixed

  • Rewrite only content with mimetype text-html when WARC-Resource-Type is html (#313)

[2.0.1] - 2024-06-13

Added

  • Add support for multiple languages in --lang CLI argument (#300)

Changed

  • Use the new WARC-Resource-Type header to decide rewrite mode (when present in WARC) (#296)
  • Upgrade Python dependencies + wombat.js 3.7.5

Fixed

  • Drop integrity attribute in HTML <script> and <link> tags (#298)
  • Use automatic detection of content encoding also for JS, JSON and CSS files (#301)
  • Set correct charset in HTML documents (#253)

[2.0.0] - 2024-06-04

Added

  • Allow specifying a scraper suffix for the ZIM scraper metadata at the CLI (#168)
  • New test website to test many known situations supposed to be handled (#166)

Changed

  • Replace Service Worker approach by scraper-side rewriting of static content (kiwix/overview#95)
  • Adopted Python bootstrap conventions (#152)
  • Upgrade dependencies, especially move to Python 3.12 (only) and zimscraperlib 3.3.2
  • Change wording in logs about the return code 100 (which is not an error code)
  • Added checks in converter.py to verify that the output directory exists, log an appropriate error message and exit cleanly if the checks fail (#106)
  • Added check for invalid zim file names (#232)
  • Changed default publisher metadata from 'Kiwix' to 'openZIM' (#150)

[1.5.5] - 2024-01-18

Changed

  • Code restructuring in preparation for 2.x

[1.5.4] - 2023-09-18

Changed

  • Using wabac.js 2.16.11
  • Using cover resize method for favicon to prevent issues with too-small ones
  • Fixed direct link hack when inside an outer frame (kiwix-serve 3.5+) #119

[1.5.3] - 2023-08-23

Changed

  • Using wabac.js 2.16.9

[1.5.2] - 2023-08-02

Changed

  • Using scraperlib 3.1.1, openZIM metadata now always set, using default if missing
  • Using wabac.js 2.16.6

[1.5.1] - 2023-02-06

Changed

  • Using wabac.js 2.15.2

[1.5.0] - 2023-02-02

Added

  • Don't crash on failure to convert illustration (skip the illustration instead)

Changed

  • Fixed 404 page (#96)
  • Don't crash on missing Location headers on potential redirect
  • Fixed incorrect ISO-639-3 --lang not replaced with eng
  • Don't fallback to eng if the host doesn't have the matching locale
  • Using wabac.js 2.15.0 with fix for scope conflict in SW/DB
  • Payload entries now use original ~text/html mimetype instead of text/html;raw=true
  • Don't crash on icon link with no href

[1.4.3] - 2022-06-21

Changed

  • Using wabac.js 2.12.0
  • Prevent duplicate entries from failing (including illustrations)
  • Fixed crash on HTTP 300 records (#94)

[1.4.0] - 2022-06-14

Added

  • Additional fuzzy matching rules for youtube and vimeo, and additional test cases
  • Support for youtube videos, which require POST request handling to work.
  • Support for canonicalizing POST request data into URL for fuzzy matching (using cdxj-indexer)
  • Support loading custom sw.js from a local file path

Changed

  • Updated zimscraperlib to 1.6 using libzim7.2
  • Updated warcio to 1.7.4
  • Added support for {period} replacement in --zim-file
  • Using fixed MarkupSafe version (Jinja2 dependency)

[1.3.6]

  • updated zimscraperlib (for libzim fix)

[1.3.5]

  • don't crash on records without WARC-Target-URI
  • fixed failure if url contains a fragment
  • updated wabac.js to 2.7.3

[1.3.4]

  • Added --custom-css option

[1.3.3]

  • Added --progress-file option

[1.3.2]

  • Update to wabac.js 2.1.6

[1.3.1]

  • Favicon loading fixes: In topFrame.html, load favicon URL directly from ZIM A/ record, bypassing service worker H/ lookup.

[1.3.0]

  • Supports 'fuzzy matching' with additional redirects added from normalized URL to exact URL
  • Add fuzzy matching rules for youtube and '?timestamp' URLs
  • Fix canonicalization where URLs that contain http/https were being incorrectly stripped (openzim/zimit#37)

[1.2.0]

  • Accepts directory inputs as well as individual files. If a directory is given, all .warc and .warc.gz files in it are processed recursively.
  • If the trailing slash is missing on the main URL, e.g. --url https://example.com?test=value, a slash is added and the URL is treated as --url https://example.com/?test=value
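The two behaviours above can be sketched as follows. This is an illustrative sketch using only the Python standard library, not the project's implementation: the helper names normalize_main_url and iter_warc_files are hypothetical, and sorting the discovered files is an assumption made to keep the example deterministic.

```python
from pathlib import Path
from typing import Iterable, Iterator
from urllib.parse import urlsplit, urlunsplit


def normalize_main_url(url: str) -> str:
    """Add the missing "/" path when the main URL has a host but an empty path."""
    parts = urlsplit(url)
    if parts.netloc and not parts.path:
        parts = parts._replace(path="/")
    return urlunsplit(parts)


def iter_warc_files(inputs: Iterable[str]) -> Iterator[Path]:
    """Yield individual files as-is and expand directories recursively."""
    for item in inputs:
        path = Path(item)
        if path.is_dir():
            # Keep only .warc and .warc.gz files found anywhere below the directory.
            yield from sorted(
                p
                for p in path.rglob("*")
                if p.is_file() and p.name.endswith((".warc", ".warc.gz"))
            )
        else:
            yield path
```

For example, normalize_main_url("https://example.com?test=value") returns "https://example.com/?test=value", matching the case described above.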

[1.1.0]

  • Now defaults to including all URLs unless --include-domains is specified (removed -a)
  • Arguments are now checked before starting. Also returns 100 on valid arguments but no WARC provided.

[1.0.1]

  • Now skipping WARC records that redirect to self (http -> https mostly)

[1.0.0]

  • Initial release