Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Miscellaneous fixes to IUP #681

Merged
merged 21 commits into from
Feb 6, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,6 @@
import com.ibm.icu.text.Normalizer2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason for having special values for code point and name was to reduce the size of the data in memory. ( I think the .bin files are also zipped so at least the name part probably isn't so bad on disk).

That probably isn't a big issue having (eg) 80000 extra name strings at this point, so I'm ok with these changes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the size of the data in memory remains the same; getUnicodeMap only transiently creates the full map.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. It is a different issue (performance) if the rebuilding is repeated a lot during normal usage.

import com.ibm.icu.text.Transform;
import com.ibm.icu.text.Transliterator;
import com.ibm.icu.text.UTF16;
import com.ibm.icu.text.UnicodeSet;
import com.ibm.icu.util.ICUException;
import com.ibm.icu.util.VersionInfo;
Expand Down Expand Up @@ -535,28 +534,12 @@ private UnicodeMap<String> getCachedMap(UcdProperty prop2, String sourceFileName

public static String getResolvedValue(
IndexUnicodeProperties props, UcdProperty prop, String codepoint, String value) {
if (value == null && props != null) {
if (getResolvedDefaultValueType(prop) == DefaultValueType.CODE_POINT) {
return codepoint;
}
}
if (prop == UcdProperty.Name && value != null && value.endsWith("-#")) {
return value.substring(0, value.length() - 1) + Utility.hex(codepoint);
}
return value;
return props.getProperty(prop).getValue(codepoint.codePointAt(0));
Copy link
Member

@macchiati macchiati Feb 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of this method is to counteract the 'compression' by using # in names and <code point>. If that isn't done any more, then these methods could be refactored away.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above, the compression is still done at rest, but let us decompress in one place.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, looks good

}

public static String getResolvedValue(
IndexUnicodeProperties props, UcdProperty prop, int codepoint, String value) {
if (value == null && props != null) {
if (getResolvedDefaultValueType(prop) == DefaultValueType.CODE_POINT) {
return UTF16.valueOf(codepoint);
}
}
if (prop == UcdProperty.Name && value != null && value.endsWith("-#")) {
return value.substring(0, value.length() - 1) + Utility.hex(codepoint);
}
return value;
return props.getProperty(prop).getValue(codepoint);
}

UnicodeMap<String> nameMap;
Expand Down Expand Up @@ -693,17 +676,66 @@ class IndexUnicodeProperty extends UnicodeProperty.BaseProperty {
}
}

@Override
public boolean isTrivial() {
return _getRawUnicodeMap().isEmpty()
|| _getRawUnicodeMap().keySet("").equals(UnicodeSet.ALL_CODE_POINTS);
}

@Override
protected UnicodeMap<String> _getUnicodeMap() {
var raw = _getRawUnicodeMap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, looks like you are leaving them in the internal map, but then refashioning the map. Would be better to do this once in the internal map instead, probably.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then we are back to high memory usage.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For additional context: I started looking at that while trying to make the versioned JSPs use less memory by storing only differences for earlier versions; right now the staging JSPs have a container with more memory than they used to so they can hold the whole UCD history.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(And in fact that container eventually runs out of memory while loading Unicode Version 5.0. The fact that it takes it about twenty hours to reach Unicode 5.0 when on any of my machines it loads all history in 15ish minutes raises questions as to the cultivar of potato we run the JSPs on, but there is room for performance improvement here both in precomputation and memory usage.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take this back; that can be done in a broader rework of UnicodeProperty

if (prop == UcdProperty.Name
|| raw.containsValue("<code point>")
|| raw.containsValue("<codepoint>")) {
final long start = System.currentTimeMillis();
UnicodeMap<String> newMap = new UnicodeMap<>();
for (UnicodeMap.EntryRange<String> range : raw.entryRanges()) {
if (range.codepoint == -1) {
newMap.put(range.string, range.value);
} else if (DefaultValueType.forString(range.value)
== DefaultValueType.CODE_POINT
|| (prop == UcdProperty.Name && range.value.endsWith("#"))) {
for (int c = range.codepoint; c <= range.codepointEnd; ++c) {
newMap.put(c, resolveValue(range.value, c));
}
} else {
newMap.putAll(range.codepoint, range.codepointEnd, range.value);
}
}
final long stop = System.currentTimeMillis();
final long Δt_in_ms = stop - start;
// We do not want to construct these UnicodeMaps that map most of the code space to
// itself, not so much because building them is costly, but because whatever we do
// on them is almost certainly a bad idea (for instance calling `values()` will be
// extremely slow). Log a trace so we can figure out where we are using this.
System.out.println(
"Built " + prop + " " + ucdVersion + " map in " + Δt_in_ms + " ms");
new Throwable().printStackTrace(System.out);

return newMap;
} else {
return raw;
}
}

protected UnicodeMap<String> _getRawUnicodeMap() {
return load(prop);
}

@Override
protected String _getValue(int codepoint) {
final String result = _getUnicodeMap().get(codepoint);
if (DefaultValueType.forString(result) == DefaultValueType.CODE_POINT) {
final String result = _getRawUnicodeMap().get(codepoint);
return resolveValue(result, codepoint);
}

private String resolveValue(String rawValue, int codepoint) {
if (DefaultValueType.forString(rawValue) == DefaultValueType.CODE_POINT) {
return Character.toString(codepoint);
} else if (prop == UcdProperty.Name && rawValue != null && rawValue.endsWith("#")) {
return rawValue.substring(0, rawValue.length() - 1) + Utility.hex(codepoint);
} else {
return result;
return rawValue;
}
}

Expand Down Expand Up @@ -731,6 +763,7 @@ protected List _getValueAliases(String valueAlias, List result) {
}
}
if (!result.contains(valueAlias)) {
// TODO(egg): We should not be constructing this map for this.
if (_getUnicodeMap().containsValue(valueAlias)) {
result.add(valueAlias);
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -58,6 +58,8 @@ public ShimUnicodePropertyFactory(IndexUnicodeProperties factory) {
UnicodeProperty prop = factory.getProperty(propName);
switch (propName) {
case "Bidi_Mirroring_Glyph":
// The default is <none> in BidiMirroring.txt, but TUP incorrectly has it as
// <code point>.
prop =
replaceCpValues(
prop,
Expand All @@ -68,22 +70,21 @@ public ShimUnicodePropertyFactory(IndexUnicodeProperties factory) {
prop = replaceValues(prop, oldValue -> oldValue == null ? "\u0000" : oldValue);
break;
case "FC_NFKC_Closure":
// The default is <code point> in PropertyValueAliases.txt, but TUP incorrectly
// has it as <none>.
prop =
replaceCpValues(
prop, (cp, oldValue) -> fixFC_NFKC_Closure(cp, oldValue));

break;
case "Joining_Type":
prop = replaceCpValues(prop, (cp, oldValue) -> fixJoining_Type(cp, oldValue));
break;
case "Joining_Group":
prop = modifyJoining_Group(prop);
break;
case "Jamo_Short_Name":
prop = modifyJamo_Short_Name(prop);
break;
case "Name":
// prop = modifyName(prop);
// TUP reports the special label <control-XXXX> as the value of the Name
// property. This is incorrect, the actual value of the Name property is "" for
// those, see https://www.unicode.org/versions/latest/ch04.pdf#G135245 and
// https://www.unicode.org/reports/tr44/#Name.
prop = replaceCpValues(prop, (cp, x) -> fixName(cp, x));
break;
case "Numeric_Value":
Expand Down Expand Up @@ -301,40 +302,24 @@ private String fixNumericValue(String value) {
private String fixName(int cp, String value) {
if (control.contains(cp)) {
return "<control-" + Utility.hex(cp, 4) + ">";
} else if (value != null && value.contains("#")) {
return value.replace("#", Utility.hex(cp, 4));
} else {
return value;
}
}

private String fixFC_NFKC_Closure(int cp, String oldValue) {
if (oldValue.equals("<code point>") || equalsString(cp, oldValue)) {
if (equalsString(cp, oldValue)) {
return null;
} else {
return oldValue;
}
}

// Joining_Type needs fix in IUP
private String fixJoining_Type(int cp, String oldValue) {
if (defaultTransparent.contains(cp) && "Non_Joining".equals(oldValue)) {
return "Transparent";
} else {
return oldValue;
}
}

// Jamo_Short_Name needs fix in IUP
private UnicodeProperty modifyJamo_Short_Name(UnicodeProperty prop) {
return copyPropReplacingMap(prop, prop.getUnicodeMap().put('ᄋ', ""));
}

// Joining_Group needs fix in IUP (really, in UCD data)
private UnicodeProperty modifyJoining_Group(UnicodeProperty prop) {
return copyPropReplacingMap(prop, prop.getUnicodeMap().put('ۃ', "Teh_Marbuta_Goal"));
}

/** Very useful. May already be in ICU, but not sure. */
public boolean equalsString(int codepoint, String value) {
return codepoint == value.codePointAt(0)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -550,6 +550,12 @@ public UnicodeMap<String> getUnicodeMap() {
return getUnicodeMap(false);
}

public boolean isTrivial() {
final var map = getUnicodeMap();
return map.isEmpty()
|| map.keySet("").equals(UnicodeSet.ALL_CODE_POINTS) && map.stringKeys().isEmpty();
}

/**
* @return the unicode map
*/
Expand Down
Original file line number Diff line number Diff line change
@@ -1,6 +1,5 @@
package org.unicode.text.UCD;

import com.ibm.icu.dev.util.UnicodeMap;
import com.ibm.icu.text.SymbolTable;
import com.ibm.icu.text.UnicodeSet;
import java.text.ParsePosition;
Expand Down Expand Up @@ -63,13 +62,6 @@ public static VersionedProperty forJSPs() {
private static final Set<String> TOOL_ONLY_PROPERTIES =
Set.of("toNFC", "toNFD", "toNFKC", "toNFKD");

private static boolean isTrivial(UnicodeMap<String> map) {
return map.isEmpty()
|| (map.values().size() == 1
&& map.getSet(map.values().iterator().next())
.equals(UnicodeSet.ALL_CODE_POINTS));
}

public UnicodeProperty getProperty() {
return property;
}
Expand Down Expand Up @@ -112,11 +104,11 @@ public VersionedProperty set(String xPropertyName) {
propSource = getIndexedProperties(version);
property = propSource.getProperty(xPropertyName);
if ((property == null && TOOL_ONLY_PROPERTIES.contains(xPropertyName))
|| (property != null && isTrivial(property.getUnicodeMap()) && allowRetroactive)) {
|| (property != null && property.isTrivial() && allowRetroactive)) {
propSource = ToolUnicodePropertySource.make(version);
property = propSource.getProperty(xPropertyName);
}
if (property == null || isTrivial(property.getUnicodeMap())) {
if (property == null || property.isTrivial()) {
if (!throwOnUnknownProperty) {
return null;
}
Expand Down
Loading