-
-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Miscellaneous fixes to IUP #681
Changes from all commits
f986f63
98c2213
69c40c0
037bf98
739a622
4b4b036
f820fc4
0c615a8
e836f3c
8db9d63
3ae2ed2
0982bc2
e61cde1
e173336
6d64e71
4d7a2bf
b05ca4a
ee95c8c
9fab320
d2cde59
e3c13d4
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -10,7 +10,6 @@ | |
import com.ibm.icu.text.Normalizer2; | ||
import com.ibm.icu.text.Transform; | ||
import com.ibm.icu.text.Transliterator; | ||
import com.ibm.icu.text.UTF16; | ||
import com.ibm.icu.text.UnicodeSet; | ||
import com.ibm.icu.util.ICUException; | ||
import com.ibm.icu.util.VersionInfo; | ||
|
@@ -535,28 +534,12 @@ private UnicodeMap<String> getCachedMap(UcdProperty prop2, String sourceFileName | |
|
||
public static String getResolvedValue( | ||
IndexUnicodeProperties props, UcdProperty prop, String codepoint, String value) { | ||
if (value == null && props != null) { | ||
if (getResolvedDefaultValueType(prop) == DefaultValueType.CODE_POINT) { | ||
return codepoint; | ||
} | ||
} | ||
if (prop == UcdProperty.Name && value != null && value.endsWith("-#")) { | ||
return value.substring(0, value.length() - 1) + Utility.hex(codepoint); | ||
} | ||
return value; | ||
return props.getProperty(prop).getValue(codepoint.codePointAt(0)); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The purpose of this method is to counteract the 'compression' by using There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See above, the compression is still done at rest, but let us decompress in one place. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yup, looks good |
||
} | ||
|
||
public static String getResolvedValue( | ||
IndexUnicodeProperties props, UcdProperty prop, int codepoint, String value) { | ||
if (value == null && props != null) { | ||
if (getResolvedDefaultValueType(prop) == DefaultValueType.CODE_POINT) { | ||
return UTF16.valueOf(codepoint); | ||
} | ||
} | ||
if (prop == UcdProperty.Name && value != null && value.endsWith("-#")) { | ||
return value.substring(0, value.length() - 1) + Utility.hex(codepoint); | ||
} | ||
return value; | ||
return props.getProperty(prop).getValue(codepoint); | ||
} | ||
|
||
UnicodeMap<String> nameMap; | ||
|
@@ -693,17 +676,66 @@ class IndexUnicodeProperty extends UnicodeProperty.BaseProperty { | |
} | ||
} | ||
|
||
@Override | ||
public boolean isTrivial() { | ||
return _getRawUnicodeMap().isEmpty() | ||
|| _getRawUnicodeMap().keySet("").equals(UnicodeSet.ALL_CODE_POINTS); | ||
} | ||
|
||
@Override | ||
protected UnicodeMap<String> _getUnicodeMap() { | ||
var raw = _getRawUnicodeMap(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, looks like you are leaving them in the internal map, but then refashioning the map. Would be better to do this once in the internal map instead, probably. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. But then we are back to high memory usage. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For additional context: I started looking at that while trying to make the versioned JSPs use less memory by storing only differences for earlier versions; right now the staging JSPs have a container with more memory than they used to so they can hold the whole UCD history. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (And in fact that container eventually runs out of memory while loading Unicode Version 5.0. The fact that it takes it about twenty hours to reach Unicode 5.0 when on any of my machines it loads all history in 15ish minutes raises questions as to the cultivar of potato we run the JSPs on, but there is room for performance improvement here both in precomputation and memory usage.) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I take this back; that can be done in a broader rework of UnicodeProperty |
||
if (prop == UcdProperty.Name | ||
|| raw.containsValue("<code point>") | ||
|| raw.containsValue("<codepoint>")) { | ||
final long start = System.currentTimeMillis(); | ||
UnicodeMap<String> newMap = new UnicodeMap<>(); | ||
for (UnicodeMap.EntryRange<String> range : raw.entryRanges()) { | ||
if (range.codepoint == -1) { | ||
newMap.put(range.string, range.value); | ||
} else if (DefaultValueType.forString(range.value) | ||
== DefaultValueType.CODE_POINT | ||
|| (prop == UcdProperty.Name && range.value.endsWith("#"))) { | ||
for (int c = range.codepoint; c <= range.codepointEnd; ++c) { | ||
newMap.put(c, resolveValue(range.value, c)); | ||
} | ||
} else { | ||
newMap.putAll(range.codepoint, range.codepointEnd, range.value); | ||
} | ||
} | ||
final long stop = System.currentTimeMillis(); | ||
final long Δt_in_ms = stop - start; | ||
// We do not want to construct these UnicodeMaps that map most of the code space to | ||
// itself, not so much because building them is costly, but because whatever we do | ||
// on them is almost certainly a bad idea (for instance calling `values()` will be | ||
// extremely slow). Log a trace so we can figure out where we are using this. | ||
System.out.println( | ||
"Built " + prop + " " + ucdVersion + " map in " + Δt_in_ms + " ms"); | ||
new Throwable().printStackTrace(System.out); | ||
|
||
return newMap; | ||
} else { | ||
return raw; | ||
} | ||
} | ||
|
||
protected UnicodeMap<String> _getRawUnicodeMap() { | ||
return load(prop); | ||
} | ||
|
||
@Override | ||
protected String _getValue(int codepoint) { | ||
final String result = _getUnicodeMap().get(codepoint); | ||
if (DefaultValueType.forString(result) == DefaultValueType.CODE_POINT) { | ||
final String result = _getRawUnicodeMap().get(codepoint); | ||
return resolveValue(result, codepoint); | ||
} | ||
|
||
private String resolveValue(String rawValue, int codepoint) { | ||
if (DefaultValueType.forString(rawValue) == DefaultValueType.CODE_POINT) { | ||
return Character.toString(codepoint); | ||
} else if (prop == UcdProperty.Name && rawValue != null && rawValue.endsWith("#")) { | ||
return rawValue.substring(0, rawValue.length() - 1) + Utility.hex(codepoint); | ||
} else { | ||
return result; | ||
return rawValue; | ||
} | ||
} | ||
|
||
|
@@ -731,6 +763,7 @@ protected List _getValueAliases(String valueAlias, List result) { | |
} | ||
} | ||
if (!result.contains(valueAlias)) { | ||
// TODO(egg): We should not be constructing this map for this. | ||
if (_getUnicodeMap().containsValue(valueAlias)) { | ||
result.add(valueAlias); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason for having special values for code point and name was to reduce the size of the data in memory. ( I think the .bin files are also zipped so at least the name part probably isn't so bad on disk).
That probably isn't a big issue having (eg) 80000 extra name strings at this point, so I'm ok with these changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the size of the data in memory remains the same; getUnicodeMap only transiently creates the full map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. It is a different issue (performance) if the rebuilding is repeated a lot during normal usage.