Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance optimization 😅 #1

Open
wants to merge 16 commits into
base: 12.0
Choose a base branch
from
Open

Conversation

wHo-EM-i
Copy link

No description provided.

thestinger and others added 10 commits December 12, 2021 10:00
Pretend that there is never room to grow the heap in order to prevent
usage of these unsafe legacy functions. There are likely no users of
these in practice as it is inherently broken to use them outside of
malloc.

Signed-off-by: anupritaisno1 <www.anuprita804@gmail.com>
In some cases, it can be useful to load libraries from zip files that
are only available by fd reference. For example, file descriptors of
APKs containing native libraries may be sent via Binder IPC for clients
to use.

Unfortunately, while this linker does support loading libraries from
file descriptors using android_dlopen_ext, using that API is not an
option because our dlopen calls originate from JNI loadLibrary requests
in ART.

This is necessary for compatibility with Google Play Services' dynamic
module system (Dynamite) without weakening the SELinux sandbox to allow
other apps to open module APKs from
/data/user_de/0/com.google.android.gms/app_chimera/m.

Change-Id: If44d5c3faf4f50e4704688b520b197ff151ae05a
This reduces entropy of the canary from 64-bit to 56-bit in exchange for
mitigating non-terminated C string overflows.

Signed-off-by: anupritaisno1 <www.anuprita804@gmail.com>
The hosts file is normally searched linearly.  This is very slow when
the file is large.  To mitigate this, read the hosts file and sort the
entries in an in-memory cache.  When an address is requested via
gethostbyname or getaddrinfo, binary search the cache.

In case where the cache is not available, return a suitable error code
and fall back to the existing lookup code.

This has been written to behave as much like the existing lookup code as
possible.  But note bionic and glibc differ in behavior for some corner
cases.  Choose the most standard compliant behavior for these where
possible.  Otherwise choose the behavior that seems most reasonable.

RM-290

Change-Id: I3b322883cbc48b0d76a0ce9d149b59faaac1dc58
(cherry picked from commit ed4c3a6bd449a4ed70645071a440ae146f194116)
If an exact name is not found in the hosts file and the host name
contains at least one dot, search for entries of the form "*.domain",
where domain is the portion of the host name after the first dot.  If
that is not found, repeat using the domain.

Example: a.b.c.example.com would search for the following in turn:
	a.b.c.example.com
	*.b.c.example.com
	*.c.example.com
	*.example.com
	*.com

Change-Id: I4b0bb81699151d5b371850daebf785e35ec9b180
…A57/Denver/Krait

Benchmarked on Nextbit Robin (MSM8992)

Before:
                                       iterations      ns/op
BM_string_strlen/8                            50M         75    0.106 GiB/s
BM_string_strlen/64                           10M        159    0.400 GiB/s
BM_string_strlen/512                           2M        819    0.625 GiB/s
BM_string_strlen/1024                       1000k       1547    0.662 GiB/s
BM_string_strlen/8Ki                         200k      12327    0.665 GiB/s
BM_string_strlen/16Ki                        100k      24579    0.667 GiB/s
BM_string_strlen/32Ki                         50k      48950    0.669 GiB/s
BM_string_strlen/64Ki                         20k      97195    0.674 GiB/s

After:
                                       iterations      ns/op
BM_string_strlen/8                            50M         13    0.574 GiB/s
BM_string_strlen/64                         1000k         23    2.703 GiB/s
BM_string_strlen/512                          20M        115    4.414 GiB/s
BM_string_strlen/1024                         10M        206    4.954 GiB/s
BM_string_strlen/8Ki                        1000k       1528    5.359 GiB/s
BM_string_strlen/16Ki                       1000k       2946    5.561 GiB/s
BM_string_strlen/32Ki                        500k       5910    5.544 GiB/s
BM_string_strlen/64Ki                        200k      11842    5.534 GiB/s

Signed-off-by: Bernhard Rosenkraenzer <Bernhard.Rosenkranzer@linaro.org>
Signed-off-by: Jake Weinstein <xboxlover360@gmail.com>
Signed-off-by: Vishalcj17 <vishalcj@aospa.co>
Change-Id: I1e74557046c901afd1356e8ebf3a6c39b0850b87
Change-Id: Ia82860326de114d48dea0e8cd93b836f1e826e15
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Overall, jemalloc performs significantly better than Scudo in Bionic's
real-world memory_replay traces (all times in milliseconds):

+----------------+-------+----------+
|     Trace      | Scudo | jemalloc |
+----------------+-------+----------+
|      SQL       |   27  |    21    |
| Angry Birds 2  |  2236 |   1501   |
|     Camera     |  4251 |   979    |
|  Candy Crush   |  2197 |   1325   |
|     Gmail      |  594  |   463    |
|      Maps      |  434  |   344    |
|     Photos     |  1330 |   477    |
|      PUBG      |  666  |   416    |
| surfaceflinger |  221  |   192    |
| system_server  |  1921 |   1416   |
|    SystemUI    |  102  |    79    |
|    YouTube     |  363  |   294    |
+----------------+-------+----------+

jemalloc also tends to use slightly less memory than Scudo for most
traces. These tests were conducted on desktop x86 Linux with glibc and
the latest stable version of each allocator, but they should still be
relevant. RSS values in KiB:

+----------------+--------+----------+
|     Trace      | Scudo  | jemalloc |
+----------------+--------+----------+
| Angry Birds 2  | 793948 |  746992  |
|     Camera     | 219372 |  251888  |
|  Candy Crush   | 548288 |  550148  |
|     Gmail      | 195236 |  193048  |
|      Maps      | 159860 |  159816  |
|     Photos     | 175436 |  171872  |
|      PUBG      | 233752 |  223572  |
| surfaceflinger | 94736  |  107068  |
| system_server  | 471048 |  484392  |
|    SystemUI    | 54432  |  60740   |
|    YouTube     | 139376 |  142252  |
+----------------+--------+----------+

While not representative of real-world usage, jemalloc also performs
fairly well in synthetic benchmarks (all times in nanoseconds):

+-----------------+---------+----------+
|    Benchmark    |  Scudo  | jemalloc |
+-----------------+---------+----------+
|     alloc 8     |   87.9  |   60.1   |
|     alloc 16    |   87.9  |    60    |
|     alloc 32    |   88.6  |   60.7   |
|     alloc 64    |   88.6  |   59.7   |
|    alloc 512    |   89.2  |    60    |
|    alloc 1024   |   89.4  |   59.8   |
|    alloc 8192   |   89.8  |   65.2   |
|   alloc 16384   |   92.7  |   69.1   |
|   alloc 32768   |   97.2  |    74    |
|   alloc 65536   |   109   |   83.8   |
|   alloc 131072  |  41536  |  42720   |
|    alloc40x 8   |   2156  |   2556   |
|   alloc40x 16   |   2155  |   2244   |
|   alloc40x 32   |   2234  |   2312   |
|   alloc40x 64   |   2234  |   2289   |
|   alloc40x 512  |   2274  |   8171   |
|  alloc40x 1024  |   2397  |   2162   |
|  alloc40x 8192  |   3550  |  78880   |
|  alloc40x 16384 |   3732  |  124454  |
|  alloc40x 32768 |   3849  |  275460  |
|  alloc40x 65536 |   4987  |  727598  |
| alloc40x 131072 | 2745207 | 3067980  |
|   alloc8192 1x  |   464   |   454    |
|   alloc8192 2x  |   510   |   488    |
|   alloc8192 3x  |   587   |   523    |
|   alloc8192 4x  |   665   |   557    |
|   alloc8192 5x  |   742   |   598    |
|   alloc8192 6x  |   818   |   633    |
|   alloc8192 7x  |   884   |   669    |
|   alloc8192 8x  |   960   |   699    |
|   alloc8192 9x  |   1045  |   734    |
|  alloc8192 10x  |   1131  |   770    |
|  alloc8192 11x  |   1207  |   806    |
|  alloc8192 12x  |   1282  |   841    |
|  alloc8192 13x  |   1363  |   877    |
|  alloc8192 14x  |   1442  |   912    |
|  alloc8192 15x  |   1512  |   944    |
|  alloc8192 16x  |   1587  |   978    |
|  alloc8192 24x  |   2256  |  21195   |
|  alloc8192 32x  |   2867  |  45446   |
|  alloc8192 40x  |   3522  |  71618   |
|  alloc8192 48x  |   4126  |  89740   |
|  alloc8192 56x  |   4786  |  114990  |
|  alloc8192 64x  |   5412  |  141082  |
|  alloc8192 72x  |   6049  |  170742  |
|  alloc8192 80x  |   6712  |  198480  |
|  alloc8192 88x  |   7331  |  221557  |
|  alloc8192 96x  |   7976  |  251462  |
|  alloc8192 104x |   8581  |  281626  |
|  alloc8192 112x |   9245  |  313164  |
|  alloc8192 120x |   9914  |  353147  |
|  alloc8192 128x |  10514  |  376625  |
|  alloc8192 136x |  11187  |  408194  |
|  alloc8192 144x |  11802  |  445694  |
|  alloc8192 160x |  13083  |  514547  |
|  alloc8192 176x |  14414  |  582501  |
|  alloc8192 192x |  15746  |  654346  |
|  alloc8192 208x |  17044  |  712620  |
|  alloc8192 224x |  18405  |  769963  |
|  alloc8192 240x |  19744  |  843969  |
|  alloc8192 256x |  21160  |  917803  |
+-----------------+---------+----------+

Scudo performs fairly well for a hardened memory allocator, but we're
optimizing for performance.

Full benchmark data with graphs: https://docs.google.com/spreadsheets/d/1LG_kxaK5cI14gGtnyM-nNNmfpMdV9Vh-LtYoq7H5J4s/edit

Change-Id: Ia4901eedfaa2c9779678c5b6532979de4919ee01
enh-google and others added 6 commits January 9, 2022 20:28
A decent chunk of the logcat profile is spent formatting the timestamps
for each line, and most of that time was going to snprintf(3). We should
find all the places that could benefit from a lighter-weight "format an
integer" and share something between them, but this is easy for now.

Before:

-----------------------------------------------------------
Benchmark                 Time             CPU   Iterations
-----------------------------------------------------------
BM_time_strftime        781 ns          775 ns       893102

After:

-----------------------------------------------------------
Benchmark                 Time             CPU   Iterations
-----------------------------------------------------------
BM_time_strftime        170 ns          168 ns      4139487

Much of the remaining time is in tzset() which seems unfortunate.

Test: treehugger
Change-Id: Ie0f7ee462ff1b1abea6f87d4a9a996d768e51056
clang was getting in the way of a strftime(3) optimization, and smaller
hammers weren't working, and this seems like the right choice for libc
anyway? If we have code that can usefully be optimized, we should do it
in the source. In general, though, no libc/libm author should be
ignorant of memset(3) or memcpy(3), and would have used it themselves if
it made sense. (And the compiler isn't using profiling data or anything;
it's just always assuming it should use the functions, and doesn't
consider whether the cost of the calls can be amortized or not.)

Test: treehugger
Change-Id: Ia7e22623e47bfbfcfe46c1af0d95ef7e8669c0f6
From a logcat profile:
```
     |--95.06%-- convertPrintable(char*, char const*, unsigned long)
     |    |--13.95%-- [hit in function]
     |    |
     |    |--35.96%-- mbrtoc32
     |    |    |--82.72%-- [hit in function]
     |    |    |
     |    |    |--11.07%-- mbsinit
     |    |    |
     |    |    |--5.96%-- @plt
```
I think we'd assumed that mbsinit() would be inlined, but since these
functions aren't all in wchar.cpp it wasn't being. This change moves the
implementation into a (more clearly named) inline function so we can
trivially reclaim that 11%+6%.

Benchmarks before:
```
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_stdlib_mbrtowc_1            8.03 ns         7.95 ns     87144997
BM_stdlib_mbrtowc_2            22.0 ns         21.8 ns     32002437
BM_stdlib_mbrtowc_3            30.0 ns         29.7 ns     23517699
BM_stdlib_mbrtowc_4            37.4 ns         37.1 ns     18895204
BM_stdlib_mbstowcs_ascii     792373 ns       782484 ns          890 bytes_per_second=609.389M/s
BM_stdlib_mbstowcs_wide    15836785 ns     15678316 ns           44 bytes_per_second=30.4138M/s
```

Benchmarks after:
```
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_stdlib_mbrtowc_1            5.76 ns         5.72 ns    121863813
BM_stdlib_mbrtowc_2            17.1 ns         16.9 ns     41487260
BM_stdlib_mbrtowc_3            24.2 ns         24.0 ns     29141629
BM_stdlib_mbrtowc_4            30.3 ns         30.1 ns     23229291
BM_stdlib_mbstowcs_ascii     783506 ns       775389 ns          903 bytes_per_second=614.965M/s
BM_stdlib_mbstowcs_wide    12787003 ns     12672642 ns           55 bytes_per_second=37.6273M/s
```

Bug: http://b/206523398
Test: treehugger
Change-Id: If8c6c39880096ddd2cbd323c68dca82e9849ace6
Signed-off-by: Vishalcj17 <vishalcj@aospa.co>
This saves a couple of syscalls in the common case, and also lets static
binaries run in a chroot without /dev/null as long as
stdin/stdout/stderr are actually connected to something (which the
toybox maintainer tried to do).

Test: manual with strace
Change-Id: Ic9a28896a07304a3bd428acfd9ddca9d22015f6e
Signed-off-by: Vishalcj17 <vishalcj@aospa.co>
This way, critical string functions are always at the start of a
cacheline.

Change-Id: I049e88d88a043911093641f44d9846fa5f6f3982
Suggested-By: Wilco Dijkstra <wilco.dijkstra@arm.com>
Test: TBD
Signed-off-by: Vishalcj17 <vishalcj@aospa.co>
This optimization is extracted from cortex-strings and bionic-ized,
and applied to arm-v7a cpus.

Stringbench results
https://android.git.linaro.org/gitweb/platform/external/stringbench.git

-----------------------------------------------------------------------
Nexus 6P (MSM8994):

Before:
15000 chars: 154.493394 seconds
 5000 chars:  51.545608 seconds

After:
15000 chars: 18.374188 seconds (88% improvement)
 5000 chars:  7.494449 seconds (85% improvement)
------------------------------------------------------------------------

OnePlus 3 (MSM8996):

Before:
15000 chars:166.577121 seconds
 5000 chars: 60.121731 seconds

After:
15000 chars: 13.684960 seconds (91% improvement)
 5000 chars:  5.413961 seconds (90% improvemend)
------------------------------------------------------------------------

Razer Phone (MSM8998)

Before:
15000 chars: 215.932986 seconds
 5000 chars:  72.147830 seconds

After:
15000 chars: 17.342402 seconds (92% improvement)
 5000 chars:  4.397512 seconds (94% improvement)
-------------------------------------------------------------------------

Change-Id: I1c3fb0c89ce2b3ee7e44f492367b6caf6db58ccf
Signed-off-by: Yingshiuan Pan <yingshiuan.pan@linaro.org>
Signed-off-by: Vishalcj17 <vishalcj@aospa.co>
wHo-EM-i pushed a commit to wHo-EM-i/bionic that referenced this pull request Aug 18, 2022
The first time should_trace() returns true, bionic_trace_begin() calls
open() on trace_marker.

The problem is that open() can call bionic_trace_begin(). We've observed
this happening, for example when:

* fdtrack is enabled. dlopen("libfdtrack.so") can be used to enable
  fdtrack on a process.
* ThreadA is busy unwinding inside fdtrack and is holding an fdtrack
  internal mutex.
* ThreadB calls bionic_trace_begin() for the first time since the
  property "debug.atrace.tags.enableflags" contains ATRACE_TAG_BIONIC.
* ThreadB calls open("/sys/kernel/tracing/trace_marker"). Since fdtrack
  is enabled, ThreadB tries to do unwinding as well.
* ThreadB, inside fdtrack's unwinding tries to grab the same mutex that
  ThreadA is holding.
* Mutex contention is reported using bionic_systrace, therefore
  bionic_trace_begin() is called again on ThreadB.
* ThreadB tries to grab g_lock in bionin_systrace.cpp, but that's
  already held by ThreadB itself, earlier on the stack. Therefore
  ThreadB is stuck.

I managed to reproduce the above scenario by manually pausing ThreadA
inside unwinding with a debugger and letting ThreadB hitting
bionic_trace_begin() for the first time.

We could avoid using g_lock while calling open() (either by releasing
g_lock and reacquiring it later, or by using atomics), but
bionic_trace_begin() would try to call open() again. In my tests, open()
does not call bionic_trace_begin() a third time, because fdtrack has
reentrancy protection, but there might be another code path inside open
that calls bionic_trace_begin again (it could be racy or only happen in
certain configurations).

This commit fixes the problem by implementing reentrancy protection in
bionic_systrace.

Sample callstack from ThreadA deadlocked before the fix:
```
  * frame #0: 0x0000007436db077c libc.so`syscall at syscall.S:41
    frame CAF-Extended#1: 0x0000007436db0ba0 libc.so`bionic_trace_begin(char const*) [inlined] __futex(ftx=0x000000743737a548, op=<unavailable>, value=2, timeout=0x0000000000000000, bitset=-1) at bionic_futex.h:45:16
    frame #2: 0x0000007436db0b8c libc.so`bionic_trace_begin(char const*) [inlined] __futex_wait_ex(ftx=0x000000743737a548, value=2) at bionic_futex.h:66:10
    frame #3: 0x0000007436db0b78 libc.so`bionic_trace_begin(char const*) [inlined] Lock::lock(this=0x000000743737a548) at bionic_lock.h:67:7
    frame #4: 0x0000007436db0b74 libc.so`bionic_trace_begin(char const*) [inlined] should_trace() at bionic_systrace.cpp:38:10
    frame #5: 0x0000007436db0b74 libc.so`bionic_trace_begin(message="Contending for pthread mutex") at bionic_systrace.cpp:59:8
    frame #6: 0x0000007436e193e4 libc.so`NonPI::MutexLockWithTimeout(pthread_mutex_internal_t*, bool, timespec const*) [inlined] NonPI::NormalMutexLock(mutex=0x0000007296cae9f0, shared=0, use_realtime_clock=false, abs_timeout_or_null=0x0000000000000000) at pthread_mutex.cpp:592:17
    frame #7: 0x0000007436e193c8 libc.so`NonPI::MutexLockWithTimeout(mutex=0x0000007296cae9f0, use_realtime_clock=false, abs_timeout_or_null=0x0000000000000000) at pthread_mutex.cpp:719:16
    frame #8: 0x0000007436e1912c libc.so`::pthread_mutex_lock(mutex_interface=<unavailable>) at pthread_mutex.cpp:839:12 [artificial]
    frame #9: 0x00000071a4e5b290 libfdtrack.so`std::__1::mutex::lock() [inlined] std::__1::__libcpp_mutex_lock(__m=<unavailable>) at __threading_support:256:10
    frame #10: 0x00000071a4e5b28c libfdtrack.so`std::__1::mutex::lock(this=<unavailable>) at mutex.cpp:31:14
    frame #11: 0x00000071a4e32634 libfdtrack.so`unwindstack::Elf::Step(unsigned long, unwindstack::Regs*, unwindstack::Memory*, bool*, bool*) [inlined] std::__1::lock_guard<std::__1::mutex>::lock_guard(__m=0x0000007296cae9f0) at __mutex_base:104:27
    frame #12: 0x00000071a4e32618 libfdtrack.so`unwindstack::Elf::Step(this=0x0000007296cae9c0, rel_pc=66116, regs=0x0000007266ca0470, process_memory=0x0000007246caa130, finished=0x0000007ff910efb4, is_signal_frame=0x0000007ff910efb0) at Elf.cpp:206:31
    frame #13: 0x00000071a4e2b3b0 libfdtrack.so`unwindstack::LocalUnwinder::Unwind(this=0x00000071a4ea1528, frame_info=<unavailable>, max_frames=34) at LocalUnwinder.cpp:102:22
    frame #14: 0x00000071a4e2a3ec libfdtrack.so`fd_hook(event=<unavailable>) at fdtrack.cpp:119:18
    frame #15: 0x0000007436dbf684 libc.so`::__open_2(pathname=<unavailable>, flags=<unavailable>) at open.cpp:72:10
    frame #16: 0x0000007436db0a04 libc.so`bionic_trace_begin(char const*) [inlined] open(pathname=<unavailable>, flags=524289) at fcntl.h:63:12
    frame #17: 0x0000007436db09f0 libc.so`bionic_trace_begin(char const*) [inlined] get_trace_marker_fd() at bionic_systrace.cpp:49:25
    frame #18: 0x0000007436db09c0 libc.so`bionic_trace_begin(message="pthread_create") at bionic_systrace.cpp:63:25
```

Bug: 213642769
Change-Id: I10d331859045cb4a8609b007f5c6cf2577ff44df
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants