Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Over-zealous extension splitting #146

Merged
merged 3 commits into from
Jan 30, 2022

Conversation

SethMMorton
Copy link
Owner

Closes #145.

The old algorithm for suffix splitting when applying PATH sorting was over-zealous, and it appears to give bad results if filename collections a) have the form "word.then.dot.then.word.then.dot" and b) one or more (but not all) of the filenames looks like "word.then.dot.5.then.dot".

The new algorithm will not attempt to split ALL suffixes, but only the first two it encounters, and then only if the length of the suffix (plus the leading ".") is not more than 5 characters. The rule that the suffix cannot start with the regex /.\d/ still applies.


It's hard to say that this was a "bug" because the algorithm was chosen as it was deliberately, but it certainly was not the desired behavior.

I'm not sure sure it is *actually* a bug, but the PATH algorithm's way
of splitting extensions was over-zealous and in practice will split off
more extensions that is probably desired. To fix this, we will need to
add a heuristic, but this commit adds tests to demonstrate the problem.
The prior algorithm went as follows: Obtain ALL suffixes from the base
component of the filename. Then, starting from the back, keep the
suffixes split until a suffix is encountered that begins with the
regular expression /.\d/. It was assumed that this was intended to be a
floating point number, and not an extension, and thus the splitting
would stop at that point.

Some input has been seen where the filenames are composed nearly entirely
of Word.then.dot.and.then.dot. One entry amongst them contained
Word.then.dot.5.then.dot. This caused this one entry to be treated
differently from the rest of the entries due to the ".5", and the
sorting order was not as expected.

The new algorithm is as follows: Obtain a maxium of two suffixes. Keep
these suffixes until one of them has a length greater than 4 or starts
with the regular expression /.\d/.

This heuristic of course is not bullet-proof, but it will do a better
job on most real-world filenames than the previous algorithm.
@codecov
Copy link

codecov bot commented Jan 30, 2022

Codecov Report

Merging #146 (4832c15) into master (4f0b3a8) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #146   +/-   ##
=======================================
  Coverage   96.62%   96.62%           
=======================================
  Files          10       10           
  Lines         622      622           
=======================================
  Hits          601      601           
  Misses         21       21           
Impacted Files Coverage Δ
natsort/utils.py 98.38% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4f0b3a8...4832c15. Read the comment docs.

@SethMMorton SethMMorton merged commit 24d7a4c into master Jan 30, 2022
@SethMMorton SethMMorton deleted the over-zealous-extension-splitting branch January 30, 2022 22:12
@SethMMorton SethMMorton changed the title Over zealous extension splitting Over-zealous extension splitting Jan 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wrong os_sorted sorting with special character in filename
1 participant