
[benchmark] SR-10855: Added non-ASCII characters to String benchmarking #25309

Merged

Conversation

Armenm
Contributor

@Armenm Armenm commented Jun 7, 2019

  • Added Cyrillic and Armenian names.
  • Added test for pathological case of Eszett (ß -> SS).
  • Added variants that work on longer string (tests for _SmallString optimization have .Small suffix).

Resolves SR-10855
Subtask of SR-8905: Gaps in String benchmarking
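
For context, the Eszett pathological case refers to a one-to-two character expansion under uppercasing, which can change a string's length; a quick illustration (in Python, which applies the same Unicode full case mapping):

```python
# Uppercasing U+00DF (ß) expands to the two characters "SS",
# so case conversion can grow a string.
word = "Straße"
upper = word.upper()

print(upper)                   # STRASSE
print(len(word), len(upper))   # 6 7
```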

@Armenm Armenm changed the title from "[benchmark] SR-10855: Added non-ASCII characters" to "[benchmark] SR-10855: Added non-ASCII characters to String benchmarking" on Jun 7, 2019
@jrose-apple
Contributor

@swift-ci Please benchmark

@jrose-apple
Contributor

@swift-ci Please smoke test


@Armenm
Contributor Author

Armenm commented Jun 8, 2019

Hey @gottesmm, I have decreased the workload; it should be enough to pass the check. Could you please re-run the benchmark test?

@beccadax
Contributor

beccadax commented Jun 8, 2019

@swift-ci Please benchmark

@beccadax
Contributor

beccadax commented Jun 8, 2019

@swift-ci Please smoke test


@garricn
Contributor

garricn commented Jun 8, 2019

🥳🇦🇲👏🏽🤓💪🏽💻📲

@beccadax
Contributor

beccadax commented Jun 8, 2019

@Armenm Looks like you still need to reduce the workload a little more. Maybe mixed strings with emoji should be measured in separate benchmarks?

Contributor

@palimondo palimondo left a comment


Welcome to the Swift project, Armen! Looks like we need to coordinate two PRs modifying the same benchmark… See PR #25305. You have filed SR-10855 first, so we'll start here.

Since the addition of other bicameral scripts is turning a single benchmark into a family of benchmarks, we should take this opportunity to design it properly, allowing for their relative comparison. We'll leave the original AngryPhonebook benchmark unchanged.

cc @mdab121

benchmark/single-source/AngryPhonebook.swift (resolved review threads)
@@ -39,3 +49,39 @@ public func run_AngryPhonebook(_ N: Int) {
}
}
}

Contributor

@palimondo palimondo Jun 8, 2019


Set the workload baseline so that the slowest benchmark stays under 1000 μs. Looking at the last measured performance of Armenian (27^2 = 729 name pairs), I'm guessing we could standardize on a workload size of 20 names per script. We should also try to keep the overall size roughly the same. Document this:

// Workloads for various scripts. Always 20 names for 400 pairings.
// To keep the performance of various scripts roughly comparable, aim for
// a total length of approximately 120 characters.
// E.g.: `latin.joined(separator: "").count == 124`

let latin = Array(words.prefix(20))

let cyrillic = [
  "Александр", "Аркадий", "Аня", "Даниил", "Дмитрий", "Эдуард", "Юрій", "Давид",
  "Анна", "Дмитрий", "Евгений", "Борис", "Владимир", "Артур", "Антон",
  "Антон", "Надія", "Алёна", "Алиса"]

let armenian = [
  "Արմեն", "Աննա", "Հարություն", "Միքայել", "Մարիա", "Դավիթ", "Վարդան", "Նարինե",
  "Հռիփսիմե", "Տիգրան", "Տաթև", "Ադամ", "Ազատ", "Ազնաւուր", "Գրիգոր", "Անի",
  "Լիլիթ", "Հայկ", "Անդրանիկ", "Գառնիկ"]

I think cyrillic needs a bit of tuning; the first 20 names I took are only 108 characters… could you please bring it up to ~120? Armenian at 114 is fine, I think…
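
The constraints above (20 names per script, ~120 characters in total) can be checked with a small helper; this is a hypothetical sketch in Python, not part of the benchmark suite, and the tolerance is an assumption:

```python
def check_workload(names, expected_count=20, target_chars=120, tolerance=10):
    """Return (count, total characters, whether the workload fits the constraints)."""
    count = len(names)
    total = sum(len(name) for name in names)
    ok = count == expected_count and abs(total - target_chars) <= tolerance
    return count, total, ok

# Stand-in list: twenty 6-character names, 120 characters in total.
sample = ["Abcdef"] * 20
print(check_workload(sample))  # (20, 120, True)
```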

Member

@milseman milseman Jun 11, 2019


Armenian and Cyrillic are both 2-byte encoded languages. Unless there are salient differences in how capitalization works (I don't know either enough to know), normalization, or use of combining characters, these would be the same workload from a technical perspective.

As @palimondo pointed out, what's also interesting would be the following:

let smallTwoByte = [
  // something up to 15 UTF-8 code units in length, i.e. 7 "characters" in Cyrillic or Armenian ...
]
let largeTwoByte = [
  // something > 16 UTF-8 code units in length, i.e. at least 8 "characters" in Cyrillic or Armenian...  
]

And the same for 3-byte for some latter-BMP language with capitalization (not sure off the top of my head), and without (e.g. Han ideographs), and similarly for 4-byte, and a large 1-byte (i.e. ASCII) workload. This can be done as a separate PR, though.
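
The small/large boundary above comes down to counting UTF-8 code units against the 15-byte small-string capacity; a sketch of the classification in Python (the byte counts are straightforward UTF-8 facts, the threshold is taken from the discussion above):

```python
SMALL_STRING_CAPACITY = 15  # UTF-8 code units that fit a 64-bit small string

def utf8_length(s):
    return len(s.encode("utf-8"))

def fits_small_string(s):
    return utf8_length(s) <= SMALL_STRING_CAPACITY

# Cyrillic and Armenian letters encode as 2 bytes each in UTF-8, so up to
# 7 characters (14 bytes) fit, while 8 or more characters (>= 16 bytes) do not.
print(utf8_length("Армен"), fits_small_string("Армен"))          # 10 True
print(utf8_length("Александр"), fits_small_string("Александр"))  # 18 False
```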

benchmark/single-source/AngryPhonebook.swift (resolved review thread)
@Armenm
Contributor Author

Armenm commented Jun 10, 2019

@palimondo Thanks for the great feedback and support. Yes, I will review and make all necessary changes.
Regarding Benchmark_Driver's BenchmarkDoctor: could you please share instructions on how to run it on a local machine? That would make it easier for me to contribute to other benchmarks.
cc @gottesmm

Thanks in advance.

@palimondo
Contributor

palimondo commented Jun 10, 2019

Sure, Benchmark_Driver has 3 subcommands:

$ ./Benchmark_Driver --help
usage: Benchmark_Driver [-h] COMMAND ...

optional arguments:
  -h, --help  show this help message and exit

Swift benchmark driver commands:
  COMMAND     See COMMAND -h for additional arguments
    run       Run benchmarks and output results to stdout
    check
    compare   Compare benchmark results

Example: ./Benchmark_Driver run -i 5 -f Prefix -f .*Suffix.*

You want to run the check to validate that your newly added benchmarks pass our sanity checks:

$ ./Benchmark_Driver check --help
usage: Benchmark_Driver check [-h] [-f PATTERN] [-t TESTS] [-o OPT] [-v | -md]
                              [BENCHMARK [BENCHMARK ...]]

positional arguments:
  BENCHMARK             benchmark to run (default: all)

optional arguments:
  -h, --help            show this help message and exit
  -f PATTERN, --filter PATTERN
                        run all tests whose name match regular expression
                        PATTERN, multiple filters are supported
  -t TESTS, --tests TESTS
                        directory containing Benchmark_O{,none,size} (default:
                        DRIVER_DIR)
  -o OPT, --optimization OPT
                        optimization level to use: {O,Onone,Osize}, (default:
                        O)
  -v, --verbose         show more details during benchmark analysis
  -md, --markdown       format report as Markdown table

So for the testing of new AngryPhonebook variants you'd run them with

$ ./Benchmark_Driver run -i 3 --output-dir log -f Angry

The --output-dir parameter triggers (as a side effect) nicer result formatting on the console (while writing the CSV-formatted log into the log directory). The -i 3 will measure 3 independent runs of each benchmark. The -f PATTERN will run all benchmarks that match the regex.

To run the benchmark validation locally you would do:

$ ./Benchmark_Driver check -f Angry

This takes a while, because it measures each benchmark for a second, ten times. You'll get color coded warnings and errors if we find something suspicious. To see all the nitty gritty while it does its thing do:

$ ./Benchmark_Driver check -f Angry --verbose

@palimondo
Contributor

palimondo commented Jun 10, 2019

cc @mdab121, you might find these tips for local benchmark validation @Armenm requested also useful… 😉 I guess I should find time to update the docs someday… 🤨

@palimondo
Contributor

palimondo commented Jun 10, 2019

@Armenm I've noticed one more thing that we could take care of here if we want to be super fair when comparing across different scripts… String has an optimization for small strings that can be packed as a value into the object body to avoid reference counting overhead. You can test it in the REPL of your locally built swift in the bin directory (which can also access internal methods and properties):

$ ./swift
***  You are running Swift's integrated REPL,  ***
***  intended for compiler and stdlib          ***
***  development and testing purposes only.    ***
***  The full REPL is built as part of LLDB.   ***
***  Type ':help' for assistance.              ***
(swift) "smol"._guts.isSmall
// r0 : Bool = true
(swift) "Long String > 16B"._guts.isSmall
// r1 : Bool = false

So to be totally fair, we should probably take the extra step to ensure we'll be comparing apples to apples, because in latin, all names end up being small strings:

(swift) latin.allSatisfy { $0._guts.isSmall }
// r7 : Bool = true

but

(swift) cyrillic.filter { !$0._guts.isSmall }
// r5 : [String] = ["Александр", "Владимир"]
(swift) armenian.filter { !$0._guts.isSmall }
// r7 : [String] = ["Հարություն", "Հռիփսիմե", "Ազնաւուր", "Անդրանիկ"]

Could you try to pick names whose lengths fit into small strings (up to 15 bytes when UTF-8 encoded) while still keeping the overall length at ~120 characters?

@milseman Do you want to have benchmarks per script (this PR adds Cyrillic and Armenian, PR #25305 is poised to add some Extended Latin)? We could probably add Greek to cover all major bicameral scripts… or do you want just one non-ASCII addition?

@Armenm
Contributor Author

Armenm commented Jun 10, 2019

Hey @palimondo, this is the first time I am touching this code, and I had only looked at it for 4-5 hours during try! Swift San Jose. Forgive me if it takes me some time to understand what's going on here :)
Really appreciate your help.

@palimondo
Contributor

@Armenm Don't worry, you're doing great and I'm overthinking things....

Member

@milseman milseman left a comment


This is looking great, thanks for working on this! Could you incorporate @palimondo's code review feedback (naming convention, extracting out the common code), and then we can merge this?


@Armenm
Contributor Author

Armenm commented Jun 12, 2019

Sure @milseman Will look into it. Thanks!

@Armenm Armenm force-pushed the SR-10855-non-ascii-angryphonebook branch from 1bf0ba1 to 4483e84 on June 26, 2019 04:34
@Armenm
Contributor Author

Armenm commented Jun 26, 2019

Hey @palimondo. I have refactored the code based on your comments. There are 20 names in each language and ~120 characters.
latin.joined(separator: "").count = 118
cyrillic.joined(separator: "").count = 119
armenian.joined(separator: "").count = 116

Also removed all names that did not fit into small strings.

@Armenm
Contributor Author

Armenm commented Jun 26, 2019

To run the benchmark validation locally you would do:

$ ./Benchmark_Driver check -f Angry

This takes a while, because it measures each benchmark for a second, ten times. You'll get color coded warnings and errors if we find something suspicious. To see all the nitty gritty while it does its thing do:

$ ./Benchmark_Driver check -f Angry --verbose

I was not able to run Benchmark_Driver locally. Could you please check the error below and give advice?

./Benchmark_Driver check -f Angry --verbose
Traceback (most recent call last):
  File "./Benchmark_Driver", line 774, in <module>
    exit(main())
  File "./Benchmark_Driver", line 770, in main
    return args.func(args)
  File "./Benchmark_Driver", line 576, in run_check
    doctor = BenchmarkDoctor(args)
  File "./Benchmark_Driver", line 343, in __init__
    self.driver = driver or BenchmarkDriver(args)
  File "./Benchmark_Driver", line 59, in __init__
    self.tests = tests or self._get_tests()
  File "./Benchmark_Driver", line 112, in _get_tests
    self._invoke(self._cmd_list_benchmarks).split('\n')[1:-1]
  File "./Benchmark_Driver", line 68, in _invoke
    cmd, stderr=self._subprocess.STDOUT)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 216, in check_output
    process = Popen(stdout=PIPE, *popenargs, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 394, in __init__
    errread, errwrite)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1047, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

@palimondo
Contributor

I was not able to run Benchmark_Driver locally. Could you please check the error below and give advice?

It's unable to find the benchmark binary. You need to be running it from the bin directory that also contains Benchmark_O.

@palimondo
Contributor

@swift-ci please benchmark

@palimondo
Contributor

@swift-ci please benchmark

@palimondo
Contributor

@swift-ci please smoke test

@palimondo
Contributor

@milseman since I was fixing the error and had these benchmarks running locally, I've added a .Large variant that tests the case conversion on a large string (~6000 characters ±150), since the originals operate on _SmallStrings. Could you please review?

@palimondo palimondo requested review from milseman and removed request for gottesmm July 3, 2019 16:54

@palimondo
Contributor

palimondo commented Jul 3, 2019

@Armenm, for future work in this area, you might be interested in what @milseman said in @mdab121's PR #25305 (comment)


@milseman Do you have any explanation for the performance of Latin.Large?

Per iteration, the small variants are doing 800 calls to ICU on the 2-byte variants (Armenian, Cyrillic), vs. 2 calls in the .Large, processing roughly the same amount of bytes (large has additional spaces and commas), so it makes sense that the .Large would be faster compared to small.

The AngryPhonebook.Latin is doing the case conversion on the optimized small ASCII strings, so this is clearly the best case scenario, but why would the AngryPhonebook.Latin.Large be ~3x slower than the other .Large variants? Profiling in Instruments I'm seeing an overhead from swift_bridgeObjectRetain and no calls to libicucore… is that to be expected or should I file a bug?
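
For clarity, the call-count arithmetic assumed in the comparison above (20 names paired all-to-all, with one uppercased() and one lowercased() call per pair) works out as follows:

```python
names = 20
pairings = names * names  # every first/last name combination: 400
icu_calls = pairings * 2  # one uppercased() + one lowercased() per pair: 800

print(pairings, icu_calls)  # 400 800
```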

@milseman
Member

milseman commented Jul 3, 2019

@palimondo The naming here is preventing me from being able to discuss this intelligently. Are we using "Latin" (which is an irrelevant designation for benchmark purposes) to mean that it consists entirely of ASCII? If so, we should use "ASCII" for the name. Similarly, are we using "Large" to mean that it will not fit in a small-string representation? Otherwise, we're just rearranging the deck chairs: more quicker iterations vs fewer slower iterations.

@palimondo
Contributor

Correct, Latin is ASCII. It's named this way in expectation of adding ExtendedLatin in #25305. I assumed using the names of Unicode blocks would not be confusing…

The Large variant suffix is added to distinguish it from the original AngryPhonebook, which uses small strings only. We could flip it to use the base name for the large variant and add a .Small suffix for those that use small strings, if you prefer?

@palimondo
Contributor

palimondo commented Jul 3, 2019

@milseman

Otherwise, we're just rearranging the deck chairs: more quicker iterations vs fewer slower iterations.

I don’t understand your point here. I was saying that the Latin.Large variant of this benchmark has uncovered that a normal string that isASCII performs about 2.5x slower than a comparably sized non-ASCII String, because it apparently does the conversion in ObjC per character(?) and not via the faster ICU. It looks like there could be an easy performance win just by using ICU in that case. Does that make sense?

@palimondo
Contributor

palimondo commented Jul 3, 2019

@milseman

Are we using "Latin" (which is an irrelevant designation for benchmark purposes) to mean that it consists entirely of ASCII?

It seems we have different assumptions about the implementation of optimizations for case conversion here. If I understand your perspective from the earlier discussion above, it makes sense to distinguish only between cases based on UTF-8 encoding length:

  • 1-byte, ASCII
  • 2-byte, Extended Latin, Armenian, Cyrillic — all the same?

As I’ve explained in #25305 (comment), it seems like it could be possible to create optimizations for case conversions specialized for different Unicode code blocks. Is that the wrong way to think about this?

Edit: Looking at Rust's implementation, I see that the case conversions are implemented using BoolTrie tables. Is this an approach you are planning to use?

 Pathological case, uppercase: ß -> SS
@palimondo
Contributor

@swift-ci please benchmark

@swift-ci
Contributor

swift-ci commented Jul 3, 2019

Performance: -O

Regression OLD NEW DELTA RATIO
Dictionary4 155 196 +26.5% 0.79x
ObjectiveCBridgeStubDateAccess 130 152 +16.9% 0.86x
Dictionary4OfObjects 197 224 +13.7% 0.88x (?)
 
Improvement OLD NEW DELTA RATIO
FlattenListFlatMap 4763 3701 -22.3% 1.29x (?)
FlattenListLoop 2705 2167 -19.9% 1.25x (?)
MapReduceAnyCollection 217 194 -10.6% 1.12x
 
Added MIN MAX MEAN MAX_RSS
AngryPhonebook.Armenian 586 587 587
AngryPhonebook.Armenian.Large 135 140 137
AngryPhonebook.Cyrillic 594 667 622
AngryPhonebook.Cyrillic.Large 146 169 154
AngryPhonebook.Latin 80 80 80
AngryPhonebook.Latin.Large 400 400 400
AngryPhonebook.Strasse 709 722 714
AngryPhonebook.Strasse.Large 114 114 114

Code size: -O

Regression OLD NEW DELTA RATIO
AngryPhonebook.o 1035 10241 +889.5% 0.10x

Performance: -Osize

Improvement OLD NEW DELTA RATIO
StringBuilderLong 910 820 -9.9% 1.11x (?)
 
Added MIN MAX MEAN MAX_RSS
AngryPhonebook.Armenian 598 598 598
AngryPhonebook.Armenian.Large 139 149 142
AngryPhonebook.Cyrillic 600 639 613
AngryPhonebook.Cyrillic.Large 146 151 148
AngryPhonebook.Latin 80 80 80
AngryPhonebook.Latin.Large 400 400 400
AngryPhonebook.Strasse 703 703 703
AngryPhonebook.Strasse.Large 114 116 115

Code size: -Osize

Regression OLD NEW DELTA RATIO
AngryPhonebook.o 1144 9891 +764.6% 0.12x

Performance: -Onone

Improvement OLD NEW DELTA RATIO
ArrayAppendOptionals 1550 610 -60.6% 2.54x (?)
DataAppendArray 3200 2900 -9.4% 1.10x (?)
ObjectiveCBridgeStubFromNSStringRef 110 101 -8.2% 1.09x (?)
 
Added MIN MAX MEAN MAX_RSS
AngryPhonebook.Armenian 668 669 669
AngryPhonebook.Armenian.Large 136 141 138
AngryPhonebook.Cyrillic 672 690 678
AngryPhonebook.Cyrillic.Large 143 146 144
AngryPhonebook.Latin 165 180 171
AngryPhonebook.Latin.Large 392 392 392
AngryPhonebook.Strasse 761 788 770
AngryPhonebook.Strasse.Large 111 123 115

Code size: -swiftlibs

Benchmark Check Report
⚠️Ⓜ️ AngryPhonebook.Strasse has very wide range of memory used between independent, repeated measurements.
AngryPhonebook.Strasse mem_pages [i1, i2]: min=[14, 14] 𝚫=0 R=[6, 18]
⚠️Ⓜ️ AngryPhonebook.Strasse.Large has very wide range of memory used between independent, repeated measurements.
AngryPhonebook.Strasse.Large mem_pages [i1, i2]: min=[28, 28] 𝚫=0 R=[0, 21]
How to read the data

The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac mini
  Model Identifier: Macmini8,1
  Processor Name: Intel Core i7
  Processor Speed: 3.2 GHz
  Number of Processors: 1
  Total Number of Cores: 6
  L2 Cache (per Core): 256 KB
  L3 Cache: 12 MB
  Memory: 64 GB

@palimondo
Contributor

palimondo commented Jul 3, 2019

@milseman I've added a benchmark for the pathological case of uppercased ß -> SS:

"Straße".uppercased() == "STRASSE"

You can see that even that one (Strasse.Large) is almost 4 times faster than the Latin.Large.

Heaviest Stack Trace from Instruments

Latin.Large
   4 Benchmark_O 3976.0  largeAngryPhonebook(_:_:)
   3 libswiftCore.dylib 2002.0  specialized String.uppercased()
   2 libswiftCore.dylib 1483.0  _StringGuts.append(_:)
   1 libswiftCore.dylib 450.0  swift_bridgeObjectRetain
   0 libswiftCore.dylib 266.0  _swift_retain_(swift::HeapObject*)

If this conversion is done per character and the API is returning String, because of the possibility of 1:2 conversion in case of ß -> SS… why aren't these small strings that would not require retain/release traffic? Am I misunderstanding what is going on here?

Strasse.Large
   6 Benchmark_O 1412.0  largeAngryPhonebook(_:_:)
   5 libswiftCore.dylib 827.0  specialized String.uppercased()
   4 libicucore.A.dylib 393.0  u_strToUpper
   3 libicucore.A.dylib 393.0  0x7fff74cbdd6f
   2 libicucore.A.dylib 392.0  0x7fff74cbd796
   1 libicucore.A.dylib 40.0  u_memcpy
   0 libsystem_platform.dylib 21.0  _platform_memmove$VARIANT$Base

@milseman
Member

milseman commented Jul 3, 2019

Correct, Latin is ASCII. Named this way in expectation of adding ExtendedLatin in #25305. I’ve assumed using names of Unicode blocks would not be confusing…

In general, Unicode code blocks are an irrelevant categorization for benchmarking, and “Latin” isn’t even the formal name of the code block (nor would the formal name be as recognizable). ASCII vs non-ASCII is a very relevant categorization, which just so happens to align with a Unicode code block because Unicode explicitly chose to be a superset of ASCII.

For example, a workload that consists entirely of Han Ideographs would be interesting, not because they all come from a given block, but because of the attributes they have in comparison to other workloads, such as case invariance (and to a lesser extent, encoding width for the BMP ones).

Similarly, improvements and additional fast-paths would apply based on these kinds of attributes, such as over case-invariant ranges of scalar values. The boundaries of such ranges are not strictly aligned with code block boundaries.

The Large variant suffix is added to distinguish it from original AngryPhonebook which uses small strings only. We could flip it to use the base name for large variant and add .Small suffix for those that use small strings if you prefer?

I was confused by your workload in https://github.com/apple/swift/pull/25309/files#r291816905, which had “small” workloads that don’t fit in the small-string representation (15 bytes of UTF-8). If the PR is consistent in small vs large aligning with small vs large string representations, then this is fine. Otherwise I think that “long” vs “short” is better.

I don’t understand your point here. I was saying that the Latin.Large variant of this benchmark has uncovered that a normal string that isASCII performs about 2.5x slower than a comparably sized non-ASCII String, because it apparently does the conversion in ObjC per character(?) and not via the faster ICU. It looks like there could be an easy performance win just by using ICU in that case. Does that make sense?

This seems unlikely, and perhaps we need to dig into the details if this is happening. Interfacing with ICU requires that we:

  1. Allocate a buffer (e.g. Array) of UInt16s and transcode our contents into it.
  2. Allocate another buffer for ICU to write the result to.
  3. Call ICU
  4. See if our prior buffer was big enough (and redo with a larger one if not)
  5. Create a new String (heap allocation if large) while transcoding the UTF-16 back to UTF-8.

In contrast, if we know a String is all-ASCII, we do the following:

  1. Allocate a String storage of sufficient capacity
  2. Copy the lowercased bytes in

Point #2 could probably be improved with wider/bulk operations.
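
To illustrate step 2 and how it could be widened, here is a sketch in Python of a per-byte ASCII lowercasing loop versus a precomputed 256-entry translation table standing in for a bulk operation; this is an illustration of the idea, not the stdlib's actual code:

```python
ASCII_UPPER = range(ord("A"), ord("Z") + 1)

def lowercase_per_byte(data):
    # Branches on every byte, like a naive all-ASCII fast path.
    return bytes(b + 32 if b in ASCII_UPPER else b for b in data)

# A precomputed table lets the whole buffer be processed in one bulk call.
LOWER_TABLE = bytes(b + 32 if b in ASCII_UPPER else b for b in range(256))

def lowercase_bulk(data):
    return data.translate(LOWER_TABLE)

print(lowercase_bulk(b"John APPLESEED"))  # b'john appleseed'
```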

2-byte, Extended Latin, Armenian, Cyrillic — all the same?

If Extended Latin, Armenian, and Cyrillic have similar/identical case-conversion properties (e.g. invariance), and ICU’s implementation is otherwise not distinguishing between them in its own fast paths, then yes.

As I’ve explained in #25305 (comment), it seems like it could be possible to create optimizations for case conversions specialized for different Unicode code blocks. Is that the wrong way to think about this?

Again, code block boundaries are irrelevant. You’re likely going to optimize for ranges of code points that may span or lie within code block boundaries.

Edit: Looking at Rust's implementation, I see that the case conversions are implemented using a BoolTrie tables. Is this an approach you are planning use?

A BoolTrie or something similar would be very useful for representing Unicode Scalar Properties in the stdlib without asking ICU on a scalar-by-scalar basis. That includes case conversion properties, though we might want to combine it with some kind of sparse interval set for case invariance.
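
A sparse interval set for case invariance could be as simple as a binary search over sorted disjoint scalar ranges; the sketch below uses two illustrative ranges (digits and CJK Unified Ideographs, both of which have no case mappings), not the real Unicode tables:

```python
from bisect import bisect_right

# Illustrative case-invariant scalar ranges (inclusive), sorted by start.
# Real data would be derived from the Unicode case-mapping properties.
RANGES = [(0x30, 0x39), (0x4E00, 0x9FFF)]  # digits; CJK Unified Ideographs
STARTS = [start for start, _ in RANGES]

def is_case_invariant(scalar):
    """Membership test via binary search over the sorted ranges."""
    i = bisect_right(STARTS, scalar) - 1
    return i >= 0 and scalar <= RANGES[i][1]

print(is_case_invariant(ord("7")))   # True
print(is_case_invariant(ord("A")))   # False
print(is_case_invariant(0x4E2D))     # True
```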

Am I misunderstanding what is going on here?

That profile is demonstrating something highly suspect about the implementation.

  • _StringGuts.append(_:) this might be performing highly-redundant introspection on every byte.

This should switch to https://forums.swift.org/t/pitch-add-a-string-initializer-with-access-to-uninitialized-storage/22611 when it happens. @Catfish-Man, here is another example of something that would benefit from this. We should propose it immediately after 5.1 wraps up.

  • swift_bridgeObjectRetain should early exit in the runtime, since small strings are immortal.
  • _swift_retain_(swift::HeapObject*) should not be happening at all inside here.

Something is going terribly wrong in the implementation (or benchmark construction, but I suspect the implementation) that’s sending us down a very slow path here.

  • specialized String.uppercased() Why is this “specialized”?

The difference between these two traces for uppercased() is ~1200, which should be easily recoverable or surpassed. Why is the difference between largeAngryPhonebook and uppercased() ~2000 in one run and ~600 in the other? That seems suspect as well to me.

Clarified the naming of tests that exercise `_SmallString`s with `.Small` suffix and used plain names for regular String workloads.
Reordered declarations. Documented workload sizes.
@palimondo
Contributor

@swift-ci please benchmark

@palimondo
Contributor

@swift-ci please smoke test

@palimondo
Contributor

@milseman

The difference between these two traces for uppercased() is ~1200, which should be easily recoverable or surpassed. Why is the difference between largeAngryPhonebook and uppercased() ~2000 in one run and ~600 in the other? That seems suspect as well to me.

In addition to uppercased(), the largeAngryPhonebook also contains a call to lowercased(), which takes 1966 in the suspicious ASCII case. In the case of Strasse, uppercased() takes 827 and lowercased() takes 582, which makes sense because the upper case conversion needs to grow the string.

I have filed a bug for this performance anomaly: SR-11069.


I have renamed the tests that exercise the _SmallString optimization to have a .Small suffix. The tests for regular String have just the plain name denoting the code block. To ease the relative comparison, the workload sizes are also documented in the source code:

SMALL      UTF-8   UTF-16     REGULAR        UTF-8     UTF-16
ascii      124 B   248 B      longASCII      6158 B    12316 B
strasse    140 B   240 B      longStrasse    6798 B    11996 B
armenian   232 B   232 B      longArmenian   10478 B   11676 B
cyrillic   238 B   238 B      longCyrillic   10718 B   11916 B
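
The UTF-8 and UTF-16 byte counts in a table like this can be measured by encoding the workload strings; a Python sketch (the sample strings are illustrative, not the actual workloads):

```python
def workload_sizes(text):
    """Return (UTF-8 bytes, UTF-16 bytes) for a workload string."""
    utf8 = len(text.encode("utf-8"))
    utf16 = len(text.encode("utf-16-le"))  # -le avoids counting a BOM
    return utf8, utf16

# ASCII doubles in size in UTF-16; 2-byte scripts like Armenian stay the same.
print(workload_sizes("John"))   # (4, 8)
print(workload_sizes("Արմեն"))  # (10, 10)
```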

@swift-ci
Contributor

swift-ci commented Jul 4, 2019

Performance: -O

Regression OLD NEW DELTA RATIO
Dictionary4 155 196 +26.5% 0.79x
ObjectiveCBridgeStubDateAccess 130 152 +16.9% 0.86x (?)
Dictionary4OfObjects 197 224 +13.7% 0.88x (?)
 
Improvement OLD NEW DELTA RATIO
MapReduceAnyCollection 218 194 -11.0% 1.12x (?)
 
Added MIN MAX MEAN MAX_RSS
AngryPhonebook.ASCII 392 411 398
AngryPhonebook.ASCII.Small 78 81 80
AngryPhonebook.Armenian 135 165 147
AngryPhonebook.Armenian.Small 608 669 630
AngryPhonebook.Cyrillic 150 159 153
AngryPhonebook.Cyrillic.Small 620 643 628
AngryPhonebook.Strasse 115 118 117
AngryPhonebook.Strasse.Small 717 732 722

Code size: -O

Regression OLD NEW DELTA RATIO
AngryPhonebook.o 1035 10235 +888.9% 0.10x

Performance: -Osize

Regression OLD NEW DELTA RATIO
SortLettersInPlace 378 411 +8.7% 0.92x (?)
 
Improvement OLD NEW DELTA RATIO
FlattenListLoop 2705 2125 -21.4% 1.27x (?)
 
Added MIN MAX MEAN MAX_RSS
AngryPhonebook.ASCII 392 392 392
AngryPhonebook.ASCII.Small 78 78 78
AngryPhonebook.Armenian 136 149 141
AngryPhonebook.Armenian.Small 590 591 591
AngryPhonebook.Cyrillic 145 150 148
AngryPhonebook.Cyrillic.Small 604 605 605
AngryPhonebook.Strasse 117 117 117
AngryPhonebook.Strasse.Small 717 718 717

Code size: -Osize

Regression OLD NEW DELTA RATIO
AngryPhonebook.o 1144 9901 +765.5% 0.12x

Performance: -Onone

Regression OLD NEW DELTA RATIO
ArrayAppendOptionals 600 1530 +155.0% 0.39x (?)
ObjectiveCBridgeStubToNSDateRef 2580 3040 +17.8% 0.85x (?)
 
Added MIN MAX MEAN MAX_RSS
AngryPhonebook.ASCII 392 411 404
AngryPhonebook.ASCII.Small 169 189 177
AngryPhonebook.Armenian 136 137 136
AngryPhonebook.Armenian.Small 707 717 710
AngryPhonebook.Cyrillic 143 146 144
AngryPhonebook.Cyrillic.Small 694 695 694
AngryPhonebook.Strasse 111 112 112
AngryPhonebook.Strasse.Small 769 769 769

Code size: -swiftlibs

Benchmark Check Report
⚠️Ⓜ️ AngryPhonebook.ASCII has very wide range of memory used between independent, repeated measurements.
AngryPhonebook.ASCII mem_pages [i1, i2]: min=[11, 11] 𝚫=0 R=[8, 16]
⚠️Ⓜ️ AngryPhonebook.Strasse has very wide range of memory used between independent, repeated measurements.
AngryPhonebook.Strasse mem_pages [i1, i2]: min=[28, 28] 𝚫=0 R=[0, 42]
⚠️Ⓜ️ AngryPhonebook.Cyrillic has very wide range of memory used between independent, repeated measurements.
AngryPhonebook.Cyrillic mem_pages [i1, i2]: min=[32, 32] 𝚫=0 R=[18, 20]
How to read the data

The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac mini
  Model Identifier: Macmini8,1
  Processor Name: Intel Core i7
  Processor Speed: 3.2 GHz
  Number of Processors: 1
  Total Number of Cores: 6
  L2 Cache (per Core): 256 KB
  L3 Cache: 12 MB
  Memory: 64 GB

@palimondo
Contributor

@Armenm, I'll merge this as soon as all the tests pass. I hope you were not discouraged by the meandering progress… Thank you for your contribution!

@palimondo palimondo merged commit f890cfb into swiftlang:master Jul 4, 2019
@milseman
Member

milseman commented Jul 8, 2019

@palimondo it seems removing all that senseless overhead gives us around a 30x speedup for the large case, and nice wins even for the smaller one.

#26007 (comment)

@Armenm
Contributor Author

Armenm commented Jul 10, 2019

@Armenm, I'll merge this as soon as all the tests pass. I hope you were not discouraged by the meandering progress… Thank you for your contribution!

Not discouraged at all, @palimondo. I would say quite the opposite: it was really cool to dive deep into the details, and I am happy to see that my humble contribution led to further implementation and optimization work on String.
Thanks everyone for the support and for making Swift a better language. I will definitely contribute in the future and encourage everyone to do so too.

8 participants