Over-zealous extension splitting #146

SethMMorton · 2022-01-30T01:38:26Z

Closes #145.

The old algorithm for suffix splitting when applying PATH sorting was over-zealous, and it appears to give bad results if filename collections a) have the form "word.then.dot.then.word.then.dot" and b) one or more (but not all) of the filenames looks like "word.then.dot.5.then.dot".

The new algorithm will not attempt to split ALL suffixes, but only the first two it encounters, and then only if the length of the suffix (plus the leading ".") is not more than 5 characters. The rule that the suffix cannot start with the regex /.\d/ still applies.

It's hard to say that this was a "bug" because the algorithm was chosen as it was deliberately, but it certainly was not the desired behavior.

I'm not sure sure it is *actually* a bug, but the PATH algorithm's way of splitting extensions was over-zealous and in practice will split off more extensions that is probably desired. To fix this, we will need to add a heuristic, but this commit adds tests to demonstrate the problem.

The prior algorithm went as follows: Obtain ALL suffixes from the base component of the filename. Then, starting from the back, keep the suffixes split until a suffix is encountered that begins with the regular expression /.\d/. It was assumed that this was intended to be a floating point number, and not an extension, and thus the splitting would stop at that point. Some input has been seen where the filenames are composed nearly entirely of Word.then.dot.and.then.dot. One entry amongst them contained Word.then.dot.5.then.dot. This caused this one entry to be treated differently from the rest of the entries due to the ".5", and the sorting order was not as expected. The new algorithm is as follows: Obtain a maxium of two suffixes. Keep these suffixes until one of them has a length greater than 4 or starts with the regular expression /.\d/. This heuristic of course is not bullet-proof, but it will do a better job on most real-world filenames than the previous algorithm.

codecov · 2022-01-30T01:39:56Z

Codecov Report

Merging #146 (4832c15) into master (4f0b3a8) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #146   +/-   ##
=======================================
  Coverage   96.62%   96.62%           
=======================================
  Files          10       10           
  Lines         622      622           
=======================================
  Hits          601      601           
  Misses         21       21

Impacted Files	Coverage Δ
natsort/utils.py	`98.38% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4f0b3a8...4832c15. Read the comment docs.

SethMMorton added 2 commits January 29, 2022 17:31

SethMMorton added the bug label Jan 30, 2022

SethMMorton mentioned this pull request Jan 30, 2022

Wrong os_sorted sorting with special character in filename #145

Closed

Update changelog

4832c15

SethMMorton merged commit 24d7a4c into master Jan 30, 2022

SethMMorton deleted the over-zealous-extension-splitting branch January 30, 2022 22:12

SethMMorton changed the title ~~Over zealous extension splitting~~ Over-zealous extension splitting Jan 31, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Over-zealous extension splitting #146

Over-zealous extension splitting #146

SethMMorton commented Jan 30, 2022

codecov bot commented Jan 30, 2022 •

edited

Loading

Over-zealous extension splitting #146

Over-zealous extension splitting #146

Conversation

SethMMorton commented Jan 30, 2022

codecov bot commented Jan 30, 2022 • edited Loading

Codecov Report

codecov bot commented Jan 30, 2022 •

edited

Loading