
[benchmark] SR-10855: Added non-ASCII characters to String benchmarking #25309

Merged

Conversation

Armenm
Contributor

@Armenm Armenm commented Jun 7, 2019

  • Added Cyrillic and Armenian names.
  • Added test for pathological case of Eszett (ß -> SS).
  • Added variants that work on longer string (tests for _SmallString optimization have .Small suffix).

Resolves SR-10855
Subtask of SR-8905: Gaps in String benchmarking
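
For context, the Eszett pathological case refers to a one-to-two character expansion under uppercasing, which can change a string's length; a quick illustration (in Python, which applies the same Unicode full case mapping):

```python
# Uppercasing U+00DF (ß) expands to the two characters "SS",
# so case conversion can grow a string.
word = "Straße"
upper = word.upper()

print(upper)                   # STRASSE
print(len(word), len(upper))   # 6 7
```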

@Armenm Armenm changed the title from "[benchmark] SR-10855: Added non-ASCII characters" to "[benchmark] SR-10855: Added non-ASCII characters to String benchmarking" on Jun 7, 2019
@jrose-apple
Contributor

@swift-ci Please benchmark

@jrose-apple
Contributor

@swift-ci Please smoke test


@Armenm
Contributor Author

Armenm commented Jun 8, 2019

Hey @gottesmm, I have decreased the workload; it should be enough to pass the check. Could you please re-run the benchmark test?

@beccadax
Contributor

beccadax commented Jun 8, 2019

@swift-ci Please benchmark

@beccadax
Contributor

beccadax commented Jun 8, 2019

@swift-ci Please smoke test


@garricn
Contributor

garricn commented Jun 8, 2019

🥳🇦🇲👏🏽🤓💪🏽💻📲

@beccadax
Contributor

beccadax commented Jun 8, 2019

@Armenm Looks like you still need to reduce the workload a little more. Maybe mixed strings with emoji should be measured in separate benchmarks?

Contributor

@palimondo palimondo left a comment


Welcome to the Swift project, Armen! Looks like we need to coordinate two PRs modifying the same benchmark… See PR #25305. You have filed SR-10855 first, so we'll start here.

Since the addition of other bicameral scripts is turning a single benchmark into a family of benchmarks, we should take this opportunity to design it properly, allowing for their relative comparison. We'll leave the original AngryPhonebook benchmark unchanged.

cc @mdab121

benchmark/single-source/AngryPhonebook.swift (resolved review threads)
@@ -39,3 +49,39 @@ public func run_AngryPhonebook(_ N: Int) {
}
}
}

Contributor

@palimondo palimondo Jun 8, 2019


Set the workload baseline so that the slowest benchmark stays under 1000 μs. Looking at the last measured performance of Armenian (27^2 = 729 name pairs), I'm guessing we could standardize on a workload size of 20 names per script. We should also try to keep the overall size roughly the same. Document this:

// Workloads for various scripts. Always 20 names for 400 pairings.
// To keep the performance of various scripts roughly comparable, aim for
// a total length of approximately 120 characters.
// E.g.: `latin.joined(separator: "").count == 124`

let latin = Array(words.prefix(20))

let cyrillic = [
  "Александр", "Аркадий", "Аня", "Даниил", "Дмитрий", "Эдуард", "Юрій", "Давид",
  "Анна", "Дмитрий", "Евгений", "Борис", "Владимир", "Артур", "Антон",
  "Антон", "Надія", "Алёна", "Алиса"]

let armenian = [
  "Արմեն", "Աննա", "Հարություն", "Միքայել", "Մարիա", "Դավիթ", "Վարդան", "Նարինե",
  "Հռիփսիմե", "Տիգրան", "Տաթև", "Ադամ", "Ազատ", "Ազնաւուր", "Գրիգոր", "Անի",
  "Լիլիթ", "Հայկ", "Անդրանիկ", "Գառնիկ"]

I think cyrillic needs a bit of tuning; the first 20 names I took are only 108 characters… could you please bring it up to ~120? Armenian at 114 is fine, I think…
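
The constraints above (20 names per script, ~120 characters in total) can be checked with a small helper; this is a hypothetical sketch in Python, not part of the benchmark suite, and the tolerance is an assumption:

```python
def check_workload(names, expected_count=20, target_chars=120, tolerance=10):
    """Return (count, total characters, whether the workload fits the constraints)."""
    count = len(names)
    total = sum(len(name) for name in names)
    ok = count == expected_count and abs(total - target_chars) <= tolerance
    return count, total, ok

# Stand-in list: twenty 6-character names, 120 characters in total.
sample = ["Abcdef"] * 20
print(check_workload(sample))  # (20, 120, True)
```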

Member

@milseman milseman Jun 11, 2019


Armenian and Cyrillic are both 2-byte encoded languages. Unless there are salient differences in how capitalization works (I don't know either enough to know), normalization, or use of combining characters, these would be the same workload from a technical perspective.

As @palimondo pointed out, what's also interesting would be the following:

let smallTwoByte = [
  // something up to 15 UTF-8 code units in length, i.e. 7 "characters" in Cyrillic or Armenian ...
]
let largeTwoByte = [
  // something > 16 UTF-8 code units in length, i.e. at least 8 "characters" in Cyrillic or Armenian...  
]

And the same for 3-byte for some latter-BMP language with capitalization (not sure off the top of my head), and without (e.g. Han ideographs), and similarly for 4-byte, and a large 1-byte (i.e. ASCII) workload. This can be done as a separate PR, though.
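
The small/large boundary above comes down to counting UTF-8 code units against the 15-byte small-string capacity; a sketch of the classification in Python (the byte counts are straightforward UTF-8 facts, the threshold is taken from the discussion above):

```python
SMALL_STRING_CAPACITY = 15  # UTF-8 code units that fit a 64-bit small string

def utf8_length(s):
    return len(s.encode("utf-8"))

def fits_small_string(s):
    return utf8_length(s) <= SMALL_STRING_CAPACITY

# Cyrillic and Armenian letters encode as 2 bytes each in UTF-8, so up to
# 7 characters (14 bytes) fit, while 8 or more characters (>= 16 bytes) do not.
print(utf8_length("Армен"), fits_small_string("Армен"))          # 10 True
print(utf8_length("Александр"), fits_small_string("Александр"))  # 18 False
```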

benchmark/single-source/AngryPhonebook.swift (resolved review thread)
@Armenm
Contributor Author

Armenm commented Jun 10, 2019

@palimondo Thanks for the great feedback and support. Yes, I will review and make all necessary changes.
Regarding Benchmark_Driver's BenchmarkDoctor: could you please share instructions on how to run it on a local machine? That would make it easier for me to contribute to other benchmarks.
cc @gottesmm

Thanks in advance.

@palimondo
Contributor

palimondo commented Jun 10, 2019

Sure, Benchmark_Driver has 3 subcommands:

$ ./Benchmark_Driver --help
usage: Benchmark_Driver [-h] COMMAND ...

optional arguments:
  -h, --help  show this help message and exit

Swift benchmark driver commands:
  COMMAND     See COMMAND -h for additional arguments
    run       Run benchmarks and output results to stdout
    check
    compare   Compare benchmark results

Example: ./Benchmark_Driver run -i 5 -f Prefix -f .*Suffix.*

You want to run the check to validate that your newly added benchmarks pass our sanity checks:

$ ./Benchmark_Driver check --help
usage: Benchmark_Driver check [-h] [-f PATTERN] [-t TESTS] [-o OPT] [-v | -md]
                              [BENCHMARK [BENCHMARK ...]]

positional arguments:
  BENCHMARK             benchmark to run (default: all)

optional arguments:
  -h, --help            show this help message and exit
  -f PATTERN, --filter PATTERN
                        run all tests whose name match regular expression
                        PATTERN, multiple filters are supported
  -t TESTS, --tests TESTS
                        directory containing Benchmark_O{,none,size} (default:
                        DRIVER_DIR)
  -o OPT, --optimization OPT
                        optimization level to use: {O,Onone,Osize}, (default:
                        O)
  -v, --verbose         show more details during benchmark analysis
  -md, --markdown       format report as Markdown table

So for the testing of new AngryPhonebook variants you'd run them with

$ ./Benchmark_Driver run -i 3 --output-dir log -f Angry

The --output-dir parameter triggers (as a side effect) nicer result formatting on the console (while writing the CSV-formatted log into the log directory). The -i 3 will measure 3 independent runs of each benchmark. The -f PATTERN will run all benchmarks that match the regex.

To run the benchmark validation locally you would do:

$ ./Benchmark_Driver check -f Angry

This takes a while, because it measures each benchmark for a second, ten times. You'll get color coded warnings and errors if we find something suspicious. To see all the nitty gritty while it does its thing do:

$ ./Benchmark_Driver check -f Angry --verbose

@palimondo
Contributor

palimondo commented Jun 10, 2019

cc @mdab121, you might find these tips for local benchmark validation @Armenm requested also useful… 😉 I guess I should find time to update the docs someday… 🤨

@palimondo
Contributor

palimondo commented Jun 10, 2019

@Armenm I've noticed one more thing that we could take care of here if we want to be super fair when comparing across different scripts… String has an optimization for small strings that can be packed as a value into the object body to avoid reference counting overhead. You can test it in the REPL of your locally built swift in the bin directory (which can also access internal methods and properties):

$ ./swift
***  You are running Swift's integrated REPL,  ***
***  intended for compiler and stdlib          ***
***  development and testing purposes only.    ***
***  The full REPL is built as part of LLDB.   ***
***  Type ':help' for assistance.              ***
(swift) "smol"._guts.isSmall
// r0 : Bool = true
(swift) "Long String > 16B"._guts.isSmall
// r1 : Bool = false

So to be totally fair, we should probably take the extra step to ensure we'll be comparing apples to apples, because in latin, all names end up being small strings:

(swift) latin.allSatisfy { $0._guts.isSmall }
// r7 : Bool = true

but

(swift) cyrillic.filter { !$0._guts.isSmall }
// r5 : [String] = ["Александр", "Владимир"]
(swift) armenian.filter { !$0._guts.isSmall }
// r7 : [String] = ["Հարություն", "Հռիփսիմե", "Ազնաւուր", "Անդրանիկ"]

Could you try to pick names whose lengths fit into small strings (up to 15 bytes when UTF-8 encoded) while still keeping the overall length at ~120 characters?

@milseman Do you want to have benchmarks per script (this PR adds Cyrillic and Armenian, PR #25305 is poised to add some Extended Latin)? We could probably add Greek to cover all major bicameral scripts… or do you want just one non-ASCII addition?

@Armenm
Contributor Author

Armenm commented Jun 10, 2019

Hey @palimondo, this is the first time I am touching this code, and I had only looked at it for 4-5 hours during try! Swift San Jose. Forgive me if it takes me some time to understand what's going on here :)
Really appreciate your help.

@palimondo
Contributor

@Armenm Don't worry, you're doing great and I'm overthinking things....

Member

@milseman milseman left a comment


This is looking great, thanks for working on this! Could you incorporate @palimondo's code review feedback (naming convention, extracting out the common code), and then we can merge this?


@Armenm
Contributor Author

Armenm commented Jun 12, 2019

Sure @milseman Will look into it. Thanks!

@Armenm Armenm force-pushed the SR-10855-non-ascii-angryphonebook branch from 1bf0ba1 to 4483e84 on June 26, 2019 04:34
@Armenm
Contributor Author

Armenm commented Jun 26, 2019

Hey @palimondo. I have refactored the code based on your comments. There are 20 names in each language and ~120 characters.
latin.joined(separator: "").count = 118
cyrillic.joined(separator: "").count = 119
armenian.joined(separator: "").count = 116

Also removed all names that did not fit into small strings.

@Armenm
Contributor Author

Armenm commented Jun 26, 2019

To run the benchmark validation locally you would do:

$ ./Benchmark_Driver check -f Angry

This takes a while, because it measures each benchmark for a second, ten times. You'll get color coded warnings and errors if we find something suspicious. To see all the nitty gritty while it does its thing do:

$ ./Benchmark_Driver check -f Angry --verbose

I was not able to run Benchmark_Driver locally. Could you please check the error below and give advice?

./Benchmark_Driver check -f Angry --verbose
Traceback (most recent call last):
  File "./Benchmark_Driver", line 774, in <module>
    exit(main())
  File "./Benchmark_Driver", line 770, in main
    return args.func(args)
  File "./Benchmark_Driver", line 576, in run_check
    doctor = BenchmarkDoctor(args)
  File "./Benchmark_Driver", line 343, in __init__
    self.driver = driver or BenchmarkDriver(args)
  File "./Benchmark_Driver", line 59, in __init__
    self.tests = tests or self._get_tests()
  File "./Benchmark_Driver", line 112, in _get_tests
    self._invoke(self._cmd_list_benchmarks).split('\n')[1:-1]
  File "./Benchmark_Driver", line 68, in _invoke
    cmd, stderr=self._subprocess.STDOUT)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 216, in check_output
    process = Popen(stdout=PIPE, *popenargs, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 394, in __init__
    errread, errwrite)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1047, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

@palimondo
Contributor

I was not able to run Benchmark_Driver locally. Could you please check the error below and give advice?

It's unable to find the benchmark binary. You need to be running it from the bin directory that also contains Benchmark_O.

@palimondo
Contributor

@swift-ci please benchmark

@palimondo
Contributor

@swift-ci please benchmark

@palimondo
Contributor

@swift-ci please smoke test

@palimondo
Contributor

@milseman since I was fixing the error and had these benchmarks running locally, I've added a .Large variant that tests the case conversion on a large string (~6000 characters ±150), since the originals operate on _SmallStrings. Could you please review?

@palimondo palimondo requested review from milseman and removed request for gottesmm July 3, 2019 16:54

@palimondo
Contributor

palimondo commented Jul 3, 2019

@Armenm, for future work in this area, you might be interested in what @milseman said in @mdab121's PR #25305 (comment)


@milseman Do you have any explanation for the performance of Latin.Large?

Per iteration, the small variants are doing 800 calls to ICU on the 2-byte variants (Armenian, Cyrillic), vs. 2 calls in the .Large, processing roughly the same amount of bytes (large has additional spaces and commas), so it makes sense that the .Large would be faster compared to small.

The AngryPhonebook.Latin is doing the case conversion on the optimized small ASCII strings, so this is clearly the best case scenario, but why would the AngryPhonebook.Latin.Large be ~3x slower than the other .Large variants? Profiling in Instruments I'm seeing an overhead from swift_bridgeObjectRetain and no calls to libicucore… is that to be expected or should I file a bug?
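
For clarity, the call-count arithmetic assumed in the comparison above (20 names paired all-to-all, with one uppercased() and one lowercased() call per pair) works out as follows:

```python
names = 20
pairings = names * names  # every first/last name combination: 400
icu_calls = pairings * 2  # one uppercased() + one lowercased() per pair: 800

print(pairings, icu_calls)  # 400 800
```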

@milseman
Member

milseman commented Jul 3, 2019

@palimondo The naming here is preventing me from being able to discuss this intelligently. Are we using "Latin" (which is an irrelevant designation for benchmark purposes) to mean that it consists entirely of ASCII? If so, we should use "ASCII" for the name. Similarly, are we using "Large" to mean that it will not fit in a small-string representation? Otherwise, we're just rearranging the deck chairs: more quicker iterations vs fewer slower iterations.

@palimondo
Contributor

Correct, Latin is ASCII. It's named this way in expectation of adding ExtendedLatin in #25305. I assumed using the names of Unicode blocks would not be confusing…

The Large variant suffix is added to distinguish it from the original AngryPhonebook, which uses small strings only. We could flip it to use the base name for the large variant and add a .Small suffix for those that use small strings, if you prefer?

@palimondo
Contributor

palimondo commented Jul 3, 2019

@milseman

Otherwise, we're just rearranging the deck chairs: more quicker iterations vs fewer slower iterations.

I don’t understand your point here. I was saying that the Latin.Large variant of this benchmark has uncovered that a normal string that isASCII performs about 2.5x slower than a comparably sized non-ASCII String, because it apparently does the conversion in ObjC per character(?) and not via the faster ICU. It looks like there could be an easy performance win just by using ICU in that case. Does that make sense?

@palimondo
Contributor

palimondo commented Jul 3, 2019

@milseman

Are we using "Latin" (which is an irrelevant designation for benchmark purposes) to mean that it consists entirely of ASCII?

It seems we have different assumptions about the implementation of optimizations for case conversion here. If I understand your perspective from the earlier discussion above, it makes sense to distinguish only between cases based on UTF-8 encoding length:

  • 1-byte, ASCII
  • 2-byte, Extended Latin, Armenian, Cyrillic — all the same?

As I’ve explained in #25305 (comment), it seems like it could be possible to create optimizations for case conversions specialized for different Unicode code blocks. Is that the wrong way to think about this?

Edit: Looking at Rust's implementation, I see that the case conversions are implemented using BoolTrie tables. Is this an approach you are planning to use?

 Pathological case, uppercase: ß -> SS
@palimondo
Contributor

@swift-ci please benchmark

@swift-ci
Contributor

swift-ci commented Jul 3, 2019

Performance: -O

Regression OLD NEW DELTA RATIO
Dictionary4 155 196 +26.5% 0.79x
ObjectiveCBridgeStubDateAccess 130 152 +16.9% 0.86x
Dictionary4OfObjects 197 224 +13.7% 0.88x (?)
 
Improvement OLD NEW DELTA RATIO
FlattenListFlatMap 4763 3701 -22.3% 1.29x (?)
FlattenListLoop 2705 2167 -19.9% 1.25x (?)
MapReduceAnyCollection 217 194 -10.6% 1.12x
 
Added MIN MAX MEAN MAX_RSS
AngryPhonebook.Armenian 586 587 587
AngryPhonebook.Armenian.Large 135 140 137
AngryPhonebook.Cyrillic 594 667 622
AngryPhonebook.Cyrillic.Large 146 169 154
AngryPhonebook.Latin 80 80 80
AngryPhonebook.Latin.Large 400 400 400
AngryPhonebook.Strasse 709 722 714
AngryPhonebook.Strasse.Large 114 114 114

Code size: -O

Regression OLD NEW DELTA RATIO
AngryPhonebook.o 1035 10241 +889.5% 0.10x

Performance: -Osize

Improvement OLD NEW DELTA RATIO
StringBuilderLong 910 820 -9.9% 1.11x (?)
 
Added MIN MAX MEAN MAX_RSS
AngryPhonebook.Armenian 598 598 598
AngryPhonebook.Armenian.Large 139 149 142
AngryPhonebook.Cyrillic 600 639 613
AngryPhonebook.Cyrillic.Large 146 151 148
AngryPhonebook.Latin 80 80 80
AngryPhonebook.Latin.Large 400 400 400
AngryPhonebook.Strasse 703 703 703
AngryPhonebook.Strasse.Large 114 116 115

Code size: -Osize

Regression OLD NEW DELTA RATIO
AngryPhonebook.o 1144 9891 +764.6% 0.12x

Performance: -Onone

Improvement OLD NEW DELTA RATIO
ArrayAppendOptionals 1550 610 -60.6% 2.54x (?)
DataAppendArray 3200 2900 -9.4% 1.10x (?)
ObjectiveCBridgeStubFromNSStringRef 110 101 -8.2% 1.09x (?)
 
Added MIN MAX MEAN MAX_RSS
AngryPhonebook.Armenian 668 669 669
AngryPhonebook.Armenian.Large 136 141 138
AngryPhonebook.Cyrillic 672 690 678
AngryPhonebook.Cyrillic.Large 143 146 144
AngryPhonebook.Latin 165 180 171
AngryPhonebook.Latin.Large 392 392 392
AngryPhonebook.Strasse 761 788 770
AngryPhonebook.Strasse.Large 111 123 115

Code size: -swiftlibs

Benchmark Check Report
⚠️Ⓜ️ AngryPhonebook.Strasse has very wide range of memory used between independent, repeated measurements.
AngryPhonebook.Strasse mem_pages [i1, i2]: min=[14, 14] 𝚫=0 R=[6, 18]
⚠️Ⓜ️ AngryPhonebook.Strasse.Large has very wide range of memory used between independent, repeated measurements.
AngryPhonebook.Strasse.Large mem_pages [i1, i2]: min=[28, 28] 𝚫=0 R=[0, 21]
How to read the data

The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac mini
  Model Identifier: Macmini8,1
  Processor Name: Intel Core i7
  Processor Speed: 3.2 GHz
  Number of Processors: 1
  Total Number of Cores: 6
  L2 Cache (per Core): 256 KB
  L3 Cache: 12 MB
  Memory: 64 GB

@palimondo
Contributor

palimondo commented Jul 3, 2019

@milseman I've added a benchmark for the pathological case of uppercased ß -> SS:

"Straße".uppercased() == "STRASSE"

You can see that even that one (Strasse.Large) is almost 4 times faster than the Latin.Large.

Heaviest Stack Trace from Instruments

Latin.Large
   4 Benchmark_O 3976.0  largeAngryPhonebook(_:_:)
   3 libswiftCore.dylib 2002.0  specialized String.uppercased()
   2 libswiftCore.dylib 1483.0  _StringGuts.append(_:)
   1 libswiftCore.dylib 450.0  swift_bridgeObjectRetain
   0 libswiftCore.dylib 266.0  _swift_retain_(swift::HeapObject*)

If this conversion is done per character and the API is returning String, because of the possibility of 1:2 conversion in case of ß -> SS… why aren't these small strings that would not require retain/release traffic? Am I misunderstanding what is going on here?

Strasse.Large
   6 Benchmark_O 1412.0  largeAngryPhonebook(_:_:)
   5 libswiftCore.dylib 827.0  specialized String.uppercased()
   4 libicucore.A.dylib 393.0  u_strToUpper
   3 libicucore.A.dylib 393.0  0x7fff74cbdd6f
   2 libicucore.A.dylib 392.0  0x7fff74cbd796
   1 libicucore.A.dylib 40.0  u_memcpy
   0 libsystem_platform.dylib 21.0  _platform_memmove$VARIANT$Base

@milseman
Member

milseman commented Jul 3, 2019

Correct, Latin is ASCII. Named this way in expectation of adding ExtendedLatin in #25305. I’ve assumed using names of Unicode blocks would not be confusing…

In general, Unicode code blocks are an irrelevant categorization for benchmarking, and “Latin” isn’t even the formal name of the code block (nor would the formal name be as recognizable). ASCII vs non-ASCII is a very relevant categorization, which just so happens to align with a Unicode code block because Unicode explicitly chose to be a superset of ASCII.

For example, a workload that consists entirely of Han Ideographs would be interesting, not because they all come from a given block, but because of the attributes they have in comparison to other workloads, such as case invariance (and to a lesser extent, encoding width for the BMP ones).

Similarly, improvements and additional fast-paths would apply based on these kinds of attributes, such as over case-invariant ranges of scalar values. The boundaries of such ranges are not strictly aligned with code block boundaries.

The Large variant suffix is added to distinguish it from original AngryPhonebook which uses small strings only. We could flip it to use the base name for large variant and add .Small suffix for those that use small strings if you prefer?

I was confused by your workload in https://github.com/apple/swift/pull/25309/files#r291816905, which had “small” workloads that don’t fit in the small-string representation (15 bytes of UTF-8). If the PR is consistent in small vs large aligning with small vs large string representations, then this is fine. Otherwise I think that “long” vs “short” is better.

I don’t understand your point here. I was saying that the Latin.Large variant of this benchmark has uncovered that a normal string that isASCII performs about 2.5x slower than a comparably sized non-ASCII String, because it apparently does the conversion in ObjC per character(?) and not via the faster ICU. It looks like there could be an easy performance win just by using ICU in that case. Does that make sense?

This seems unlikely, and perhaps we need to dig into the details if this is happening. Interfacing with ICU requires that we:

  1. Allocate a buffer (e.g. Array) of UInt16s and transcode our contents into it.
  2. Allocate another buffer for ICU to write the result to.
  3. Call ICU
  4. See if our prior buffer was big enough (and redo with a larger one if not)
  5. Create a new String (heap allocation if large) while transcoding the UTF-16 back to UTF-8.

In contrast, if we know a String is all-ASCII, we do the following:

  1. Allocate a String storage of sufficient capacity
  2. Copy the lowercased bytes in

Point #2 could probably be improved with wider/bulk operations.
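
To illustrate step 2 and how it could be widened, here is a sketch in Python of a per-byte ASCII lowercasing loop versus a precomputed 256-entry translation table standing in for a bulk operation; this is an illustration of the idea, not the stdlib's actual code:

```python
ASCII_UPPER = range(ord("A"), ord("Z") + 1)

def lowercase_per_byte(data):
    # Branches on every byte, like a naive all-ASCII fast path.
    return bytes(b + 32 if b in ASCII_UPPER else b for b in data)

# A precomputed table lets the whole buffer be processed in one bulk call.
LOWER_TABLE = bytes(b + 32 if b in ASCII_UPPER else b for b in range(256))

def lowercase_bulk(data):
    return data.translate(LOWER_TABLE)

print(lowercase_bulk(b"John APPLESEED"))  # b'john appleseed'
```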

2-byte, Extended Latin, Armenian, Cyrillic — all the same?

If Extended Latin, Armenian, and Cyrillic have similar/identical case-conversion properties (e.g. invariance), and ICU’s implementation is otherwise not distinguishing between them in its own fast paths, then yes.

As I’ve explained in #25305 (comment), it seems like it could be possible to create optimizations for case conversions specialized for different Unicode code blocks. Is that the wrong way to think about this?

Again, code block boundaries are irrelevant. You’re likely going to optimize for ranges of code points that may span or lie within code block boundaries.

Edit: Looking at Rust's implementation, I see that the case conversions are implemented using a BoolTrie tables. Is this an approach you are planning use?

A BoolTrie or something similar would be very useful for representing Unicode Scalar Properties in the stdlib without asking ICU on a scalar-by-scalar basis. That includes case conversion properties, though we might want to combine it with some kind of sparse interval set for case invariance.
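
A sparse interval set for case invariance could be as simple as a binary search over sorted disjoint scalar ranges; the sketch below uses two illustrative ranges (digits and CJK Unified Ideographs, both of which have no case mappings), not the real Unicode tables:

```python
from bisect import bisect_right

# Illustrative case-invariant scalar ranges (inclusive), sorted by start.
# Real data would be derived from the Unicode case-mapping properties.
RANGES = [(0x30, 0x39), (0x4E00, 0x9FFF)]  # digits; CJK Unified Ideographs
STARTS = [start for start, _ in RANGES]

def is_case_invariant(scalar):
    """Membership test via binary search over the sorted ranges."""
    i = bisect_right(STARTS, scalar) - 1
    return i >= 0 and scalar <= RANGES[i][1]

print(is_case_invariant(ord("7")))   # True
print(is_case_invariant(ord("A")))   # False
print(is_case_invariant(0x4E2D))     # True
```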

Am I misunderstanding what is going on here?

That profile is demonstrating something highly suspect about the implementation.

  • _StringGuts.append(_:) this might be performing highly-redundant introspection on every byte.

This should switch to https://forums.swift.org/t/pitch-add-a-string-initializer-with-access-to-uninitialized-storage/22611 when it happens. @Catfish-Man, here is another example of something that would benefit from this. We should propose it immediately after 5.1 wraps up.

  • swift_bridgeObjectRetain should early exit in the runtime, since small strings are immortal.
  • _swift_retain_(swift::HeapObject*) should not be happening at all inside here.

Something is going terribly wrong in the implementation (or benchmark construction, but I suspect the implementation) that’s sending us down a very slow path here.

  • specialized String.uppercased() Why is this “specialized”?

The difference between these two traces for uppercased() is ~1200, which should be easily recoverable or surpassed. Why is the difference between largeAngryPhonebook and uppercased() ~2000 in one run and ~600 in the other? That seems suspect as well to me.

Clarified the naming of tests that exercise `_SmallString`s with `.Small` suffix and used plain names for regular String workloads.
Reordered declarations. Documented workload sizes.
@palimondo
Contributor

@swift-ci please benchmark

@palimondo
Contributor

@swift-ci please smoke test

@palimondo
Contributor

@milseman

The difference between these two traces for uppercased() is ~1200, which should be easily recoverable or surpassed. Why is the difference between largeAngryPhonebook and uppercased() ~2000 in one run and ~600 in the other? That seems suspect as well to me.

In addition to uppercased(), the largeAngryPhonebook also contains a call to lowercased(), which takes 1966 in the suspicious ASCII case. In the case of Strasse, uppercased() takes 827 and lowercased() takes 582, which makes sense because the upper case conversion needs to grow the string.

I have filed a bug for this performance anomaly: SR-11069.


I have renamed the tests that exercise the _SmallString optimization to have a .Small suffix. The tests for regular String have just the plain name denoting the code block. To ease the relative comparison, the workload sizes are also documented in the source code:

SMALL      UTF-8   UTF-16     REGULAR        UTF-8     UTF-16
ascii      124 B   248 B      longASCII      6158 B    12316 B
strasse    140 B   240 B      longStrasse    6798 B    11996 B
armenian   232 B   232 B      longArmenian   10478 B   11676 B
cyrillic   238 B   238 B      longCyrillic   10718 B   11916 B
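
The UTF-8 and UTF-16 byte counts in a table like this can be measured by encoding the workload strings; a Python sketch (the sample strings are illustrative, not the actual workloads):

```python
def workload_sizes(text):
    """Return (UTF-8 bytes, UTF-16 bytes) for a workload string."""
    utf8 = len(text.encode("utf-8"))
    utf16 = len(text.encode("utf-16-le"))  # -le avoids counting a BOM
    return utf8, utf16

# ASCII doubles in size in UTF-16; 2-byte scripts like Armenian stay the same.
print(workload_sizes("John"))   # (4, 8)
print(workload_sizes("Արմեն"))  # (10, 10)
```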

@swift-ci
Contributor

swift-ci commented Jul 4, 2019

Performance: -O

Regression OLD NEW DELTA RATIO
Dictionary4 155 196 +26.5% 0.79x
ObjectiveCBridgeStubDateAccess 130 152 +16.9% 0.86x (?)
Dictionary4OfObjects 197 224 +13.7% 0.88x (?)
 
Improvement OLD NEW DELTA RATIO
MapReduceAnyCollection 218 194 -11.0% 1.12x (?)
 
Added MIN MAX MEAN MAX_RSS
AngryPhonebook.ASCII 392 411 398
AngryPhonebook.ASCII.Small 78 81 80
AngryPhonebook.Armenian 135 165 147
AngryPhonebook.Armenian.Small 608 669 630
AngryPhonebook.Cyrillic 150 159 153
AngryPhonebook.Cyrillic.Small 620 643 628
AngryPhonebook.Strasse 115 118 117
AngryPhonebook.Strasse.Small 717 732 722

Code size: -O

Regression OLD NEW DELTA RATIO
AngryPhonebook.o 1035 10235 +888.9% 0.10x

Performance: -Osize

Regression OLD NEW DELTA RATIO
SortLettersInPlace 378 411 +8.7% 0.92x (?)
 
Improvement OLD NEW DELTA RATIO
FlattenListLoop 2705 2125 -21.4% 1.27x (?)
 
Added MIN MAX MEAN MAX_RSS
AngryPhonebook.ASCII 392 392 392
AngryPhonebook.ASCII.Small 78 78 78
AngryPhonebook.Armenian 136 149 141
AngryPhonebook.Armenian.Small 590 591 591
AngryPhonebook.Cyrillic 145 150 148
AngryPhonebook.Cyrillic.Small 604 605 605
AngryPhonebook.Strasse 117 117 117
AngryPhonebook.Strasse.Small 717 718 717

Code size: -Osize

Regression OLD NEW DELTA RATIO
AngryPhonebook.o 1144 9901 +765.5% 0.12x

Performance: -Onone

Regression OLD NEW DELTA RATIO
ArrayAppendOptionals 600 1530 +155.0% 0.39x (?)
ObjectiveCBridgeStubToNSDateRef 2580 3040 +17.8% 0.85x (?)
 
Added MIN MAX MEAN MAX_RSS
AngryPhonebook.ASCII 392 411 404
AngryPhonebook.ASCII.Small 169 189 177
AngryPhonebook.Armenian 136 137 136
AngryPhonebook.Armenian.Small 707 717 710
AngryPhonebook.Cyrillic 143 146 144
AngryPhonebook.Cyrillic.Small 694 695 694
AngryPhonebook.Strasse 111 112 112
AngryPhonebook.Strasse.Small 769 769 769

Code size: -swiftlibs

Benchmark Check Report
⚠️Ⓜ️ AngryPhonebook.ASCII has very wide range of memory used between independent, repeated measurements.
AngryPhonebook.ASCII mem_pages [i1, i2]: min=[11, 11] 𝚫=0 R=[8, 16]
⚠️Ⓜ️ AngryPhonebook.Strasse has very wide range of memory used between independent, repeated measurements.
AngryPhonebook.Strasse mem_pages [i1, i2]: min=[28, 28] 𝚫=0 R=[0, 42]
⚠️Ⓜ️ AngryPhonebook.Cyrillic has very wide range of memory used between independent, repeated measurements.
AngryPhonebook.Cyrillic mem_pages [i1, i2]: min=[32, 32] 𝚫=0 R=[18, 20]
How to read the data

The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac mini
  Model Identifier: Macmini8,1
  Processor Name: Intel Core i7
  Processor Speed: 3.2 GHz
  Number of Processors: 1
  Total Number of Cores: 6
  L2 Cache (per Core): 256 KB
  L3 Cache: 12 MB
  Memory: 64 GB

@palimondo
Contributor

@Armenm, I'll merge this as soon as all the tests pass. I hope you were not discouraged by the meandering progress… Thank you for your contribution!

@palimondo palimondo merged commit f890cfb into swiftlang:master Jul 4, 2019
@milseman
Member

milseman commented Jul 8, 2019

@palimondo it seems removing all that senseless overhead gives us around a 30x speedup for the large case, and nice wins even for the smaller one.

#26007 (comment)

@Armenm
Contributor Author

Armenm commented Jul 10, 2019

@Armenm, I'll merge this as soon as all the tests pass. I hope you were not discouraged by the meandering progress… Thank you for your contribution!

Not discouraged at all, @palimondo. I would say quite the opposite: it was really cool to dive deep into the details, and I am happy to see that my humble contribution led to further implementation and optimization work on String.
Thanks everyone for the support and for making Swift a better language. I will definitely contribute in the future and encourage everyone to do so too.

8 participants