Commit
feat: check file name uniqueness with Unicode canonical case fold normalization

This commit changes the OCF container file name uniqueness check to perform the Unicode canonical case fold normalization step defined in https://www.w3.org/TR/charmod-norm/#CanonicalFoldNormalizationStep

That is, we normalize the file name to NFD, then apply full case folding, before checking for uniqueness.

Previously, the behavior was:
- we checked for uniqueness of the lower case form (String.toLowerCase)
- we checked for uniqueness of the NFC-normalized lower case form

This was flawed, since String.toLowerCase is not equivalent to Unicode full case folding.

Also, previously, only a warning (OPF-061) was reported when names were not unique after NFC normalization. This is now an error, using the same code as the other uniqueness failures (OPF-060). This commit removes OPF-061, which is no longer used.

Fixes #1246
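To illustrate why lower-casing alone is not enough, here is a minimal sketch using only the JDK. It is not the code from this commit: the class and method names (`CaseFoldDemo`, `lowerKey`, `approxFoldKey`) are hypothetical, and the upper-then-lower round trip is only a rough JDK stand-in for Unicode full case folding, which the commit itself obtains from ICU4J.

```java
import java.text.Normalizer;
import java.util.Locale;

class CaseFoldDemo {

  // Old-style comparison key: lower-casing only.
  static String lowerKey(String name) {
    return name.toLowerCase(Locale.ROOT);
  }

  // ASSUMPTION: approximates "NFD, then full case folding" with JDK calls only.
  // Upper-casing expands characters such as U+00DF (sharp s) to "SS", which
  // plain lower-casing never does. This is not a complete case folding.
  static String approxFoldKey(String name) {
    String nfd = Normalizer.normalize(name, Normalizer.Form.NFD);
    return nfd.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    String a = "Stra\u00DFe.xhtml"; // contains U+00DF LATIN SMALL LETTER SHARP S
    String b = "STRASSE.xhtml";

    // Lower-casing keeps U+00DF as is, so the two names look distinct.
    System.out.println(lowerKey(a).equals(lowerKey(b))); // false

    // With the folding approximation, both names map to the same key.
    System.out.println(approxFoldKey(a).equals(approxFoldKey(b))); // true
  }
}
```

The two names differ under `String.toLowerCase` but collide once sharp s is expanded, which is the kind of case the old check could miss.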
Showing 10 changed files with 69 additions and 25 deletions.
39 changes: 39 additions & 0 deletions
src/main/java/org/w3c/epubcheck/util/text/UnicodeUtils.java
@@ -0,0 +1,39 @@
package org.w3c.epubcheck.util.text;

import com.google.common.base.Preconditions;
import com.ibm.icu.text.CaseMap;
import com.ibm.icu.text.Normalizer2;

public final class UnicodeUtils
{

  // Canonical decomposition (NFD), as required by the canonical case fold step
  private static final Normalizer2 NFD_NORMALIZER = Normalizer2.getNFDInstance();
  private static final CaseMap.Fold CASE_FOLDER = CaseMap.fold();

  private UnicodeUtils()
  {
    // static utility class
  }

  /**
   * Applies Unicode Canonical Case Fold Normalization as defined in
   * https://www.w3.org/TR/charmod-norm/#CanonicalFoldNormalizationStep
   *
   * This applies, in sequence:
   * - canonical decomposition (NFD)
   * - case folding
   *
   * Note that the result is **not** recomposed (NFC), i.e. the optional
   * post-folding NFC normalization is not applied.
   *
   * In other words, the result is suitable for case-insensitive string
   * comparison, but not for display.
   *
   * @param string
   *          the string to normalize
   * @return the string normalized by applying NFD then case folding
   */
  public static String canonicalCaseFold(String string)
  {
    Preconditions.checkArgument(string != null);
    return CASE_FOLDER.apply(NFD_NORMALIZER.normalize(string));
  }
}
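As a rough illustration of how a normalized key like the one above can drive a uniqueness check, here is a self-contained sketch. Everything in it is hypothetical (`UniqueNameCheck`, `findDuplicates`, `APPROX_FOLD`); it is not the checker code touched by this commit. To stay JDK-only, `APPROX_FOLD` stands in for `UnicodeUtils.canonicalCaseFold` and only approximates full case folding.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

class UniqueNameCheck {

  // ASSUMPTION: JDK-only stand-in for UnicodeUtils.canonicalCaseFold.
  // NFD, then an upper/lower round trip that approximates full case folding.
  static final UnaryOperator<String> APPROX_FOLD = s ->
      Normalizer.normalize(s, Normalizer.Form.NFD)
          .toUpperCase(Locale.ROOT)
          .toLowerCase(Locale.ROOT);

  // Returns every name whose normalized key was already produced by an
  // earlier name in the list.
  static List<String> findDuplicates(List<String> names, UnaryOperator<String> key) {
    Map<String, String> seen = new HashMap<>();
    List<String> duplicates = new ArrayList<>();
    for (String name : names) {
      String previous = seen.putIfAbsent(key.apply(name), name);
      if (previous != null) {
        duplicates.add(name);
      }
    }
    return duplicates;
  }

  public static void main(String[] args) {
    List<String> names = Arrays.asList(
        "caf\u00E9.xhtml",   // precomposed e-acute (U+00E9)
        "CAFE\u0301.xhtml",  // upper case E followed by U+0301 COMBINING ACUTE ACCENT
        "\uFF21.xhtml",      // fullwidth A: differs from "a" only by compatibility mapping
        "a.xhtml");
    // Only the second name collides with an earlier one; the fullwidth name
    // stays distinct because NFD is canonical, not compatibility, decomposition.
    System.out.println(findDuplicates(names, APPROX_FOLD).size()); // 1
  }
}
```

Passing the key function as a parameter keeps the duplicate detection independent of the normalization in use, so the ICU-based method could be swapped in as `UnicodeUtils::canonicalCaseFold`.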
File renamed without changes.
File renamed without changes.
Binary file added (+1.78 KB): ...es/epub3/04-ocf/files/ocf-filename-duplicate-after-compatibility-normalization-valid.epub
Binary file added (+1.81 KB): ...st/resources/epub3/04-ocf/files/ocf-filename-duplicate-after-full-case-folding-error.epub