GitHub - jimenezrick/patch-AuthenticAMD: Utility to patch binaries generated by the Intel C++ Compiler to get the maximum performance on AMD CPUs

r.untroubled.be/

GPL-3.0 license


patch-AuthenticAMD
====================

Utility to patch binaries generated by the Intel C++ Compiler to get the maximum performance on AMD
CPUs.

The Intel C++ Compiler adds a CPUID test to the binaries it generates. The test checks whether they
are running on an Intel CPU, so the binaries don't run with full optimizations on non-Intel CPUs. This
utility patches such CPUID tests, so the binaries can run on an AMD CPU as if they were on an Intel CPU.
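
To make the idea of a vendor test concrete, here is a minimal sketch in C of how the 12-character vendor string is laid out in the registers returned by CPUID with EAX = 0 (the order is EBX, EDX, ECX). The function name `vendor_from_regs` is hypothetical and is not taken from patch-AuthenticAMD.c; the register values are passed in as plain integers so the example does not need to execute CPUID itself.

```c
#include <stdint.h>

/* Hypothetical helper: assemble the vendor string from the three registers
   returned by CPUID leaf 0. Each register holds four ASCII characters,
   least significant byte first; the register order is EBX, EDX, ECX. */
void vendor_from_regs(uint32_t ebx, uint32_t edx, uint32_t ecx, char out[13])
{
    const uint32_t regs[3] = { ebx, edx, ecx };

    for (int r = 0; r < 3; r++)
        for (int b = 0; b < 4; b++)
            out[r * 4 + b] = (char)((regs[r] >> (8 * b)) & 0xff);
    out[12] = '\0';
}
```

Called with 0x756e6547, 0x49656e69 and 0x6c65746e, this yields "GenuineIntel"; an AMD CPU reports register values that spell "AuthenticAMD" instead.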

**Tested on Linux with Intel C++ Compiler 10.x/11.x (it might work with future releases of ICC).
It may also work with the Fortran compiler if it has the same CPUID test, but this is not
confirmed.**

*It seems that ICC 11.x no longer imposes a performance penalty when running the compiled
binaries on AMD. However, the CPUID tests are still present in those binaries and this program
can remove them.*

*Some GNU libraries also contain CPUID tests, so a static binary that includes their code could
be affected. In the tests performed, however, those comparisons used a different instruction, so
they were left intact. In any case, those tests are not evil like the Intel ones.*

How to compile
----------------

You must have the libelf library. On Ubuntu 8.04 just install the package libelfg0-dev. A
version around 0.8.6 should work well. Now you can compile with the command:

make

Benchmark
-----------

The source code tarball includes a file called benchmark-partial-sums.c (taken from
*The Computer Language Shootout*, http://shootout.alioth.debian.org). The Intel compiler can
optimize this code with SSE2.

Compile this code with:

icc -O3 -xW -o benchmark-partial-sums benchmark-partial-sums.c

To run the benchmark use:

time ./benchmark-partial-sums 100000000

These were the average results on my AMD64 CPU:

- GCC compiled executable --> 45.5s (compiled with -O3 -msse2)
- ICC original executable --> 31.5s (probably not taking the SSE2 optimized path in the binary)
- ICC patched executable --> 25.5s
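
To put these timings in relative terms, here is a small sketch in C (the helper name `speedup` is hypothetical) that divides a baseline time by the patched time. With the averages above, the patched executable comes out roughly 1.24 times as fast as the original ICC executable and roughly 1.78 times as fast as the GCC one.

```c
/* Hypothetical helper: speedup of a run relative to a baseline,
   computed as baseline time divided by run time. */
double speedup(double baseline_seconds, double run_seconds)
{
    return baseline_seconds / run_seconds;
}
```

For example, `speedup(31.5, 25.5)` compares the original ICC executable with the patched one, and `speedup(45.5, 25.5)` compares the GCC executable with the patched one.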

How to patch a binary generated by Intel C++ Compiler
-------------------------------------------------------

Just run:

patch-AuthenticAMD <executable_name>

How to patch the Intel C++ Compiler
-------------------------------------

The shared libraries used by the compiler are in /path/to/icc/lib. It seems that after patching
all of them, the binaries generated by ICC no longer contain the CPUID test, so they run perfectly
on AMD. Probably only one of the shared libraries is responsible for adding the test. Anyway, I
can't confirm this because I haven't tried it.

**Be warned that modifying, disassembling or reverse engineering the Intel C++ Compiler goes
against the Intel EULA (End User License Agreement). Do so at your own risk.**

If you want to try, run this command in /path/to/icc/lib:

for i in *; do patch-AuthenticAMD -ev "$i"; done

Report results
----------------

This tool seems to work well, but it has not been tested much. Please send me an email with your
results. Questions about the code, suggestions, or anything else are also welcome:

jimenezrick@gmail.com

The content of the doc directory
------------------------------------

- libelf by Example.mht: http://people.freebsd.org/~jkoshy/download/libelf/article.html
  a tutorial for libelf on FreeBSD. Almost everything it says is valid for Linux.
- naughty-intel.html: the author of this article explains everything one needs to know about
  the subject.

How it works
--------------

Here is a disassembled binary compiled by ICC 10.1:

0000000000402c5c <__intel_cpu_indicator_init>:
...
# Get CPU vendor string (EAX = 0)
402c84: 48 33 c0 xor %rax,%rax
402c87: 0f a2 cpuid
402c89: 89 45 f8 mov %eax,-0x8(%rbp)
402c8c: 89 5d fc mov %ebx,-0x4(%rbp)
402c8f: 89 4d ec mov %ecx,-0x14(%rbp)
402c92: 89 55 f4 mov %edx,-0xc(%rbp)
402c95: 48 c7 c0 01 00 00 00 mov $0x1,%rax
# Get CPU capabilities (EAX = 1)
402c9c: 0f a2 cpuid
402c9e: 89 45 f0 mov %eax,-0x10(%rbp)
402ca1: 89 5d e0 mov %ebx,-0x20(%rbp)
402ca4: 89 4d e8 mov %ecx,-0x18(%rbp)
402ca7: 89 55 e4 mov %edx,-0x1c(%rbp)
...
402cca: 8b 45 fc mov -0x4(%rbp),%eax
# Compare the first four bytes of your vendor string (EBX) with "Genu"
402ccd: 3d 47 65 6e 75 cmp $0x756e6547,%eax
402cd2: bb 01 00 00 00 mov $0x1,%ebx
402cd7: 75 1b jne 402cf4 <__intel_cpu_indicator_init+0x98>
402cd9: 8b 45 f4 mov -0xc(%rbp),%eax
# Compare the next four bytes of your vendor string (EDX) with "ineI"
402cdc: 3d 69 6e 65 49 cmp $0x49656e69,%eax
402ce1: 75 11 jne 402cf4 <__intel_cpu_indicator_init+0x98>
402ce3: 8b 45 ec mov -0x14(%rbp),%eax
# Compare the last four bytes of your vendor string (ECX) with "ntel"
402ce6: 3d 6e 74 65 6c cmp $0x6c65746e,%eax
402ceb: 75 07 jne 402cf4 <__intel_cpu_indicator_init+0x98>
402ced: ba 01 00 00 00 mov $0x1,%edx
402cf2: eb 02 jmp 402cf6 <__intel_cpu_indicator_init+0x9a>
402cf4: 33 d2 xor %edx,%edx
# If you have "GenuineIntel" everything goes OK (EDX = 1, otherwise EDX = 0).
# Later there are more tests to see the capabilities of your CPU, and they
# are taken into account.
...
# Here it loads into RAX the address of a global variable (_DYNAMIC+0x1d8)
# where a value representing the capabilities of your CPU is stored.
# This value also says if your CPU is non-Intel, which means that the
# true capabilities of your CPU (e.g. SSE) are not fully used.
402d7e: 48 8b 05 a3 56 20 00 mov 0x2056a3(%rip),%rax # 608428 <_DYNAMIC+0x1d8>
# EBX holds the value of this global variable, ready to be copied to
# memory. An Intel CPU with SSE and SSE2 has EBX = 0x800. An AMD CPU
# with SSE and SSE2 has EBX = 0x1, which means that the SSE and SSE2
# capabilities are not recognized.
402d85: 89 18 mov %ebx,(%rax)
...
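
The three cmp/jne pairs in the listing can be read as the following C sketch. The function name `intel_vendor_check` is hypothetical; the constants are the immediates from the disassembly, and the return value plays the role of the 1 or 0 that ends up in EDX at 402cf6.

```c
#include <stdint.h>

/* Hypothetical C rendering of the vendor test in the listing: returns 1
   only when EBX, EDX and ECX from CPUID leaf 0 hold "Genu", "ineI" and
   "ntel"; any mismatch falls through to 0, like the jump to 402cf4. */
int intel_vendor_check(uint32_t ebx, uint32_t edx, uint32_t ecx)
{
    if (ebx != 0x756e6547u)   /* "Genu" */
        return 0;
    if (edx != 0x49656e69u)   /* "ineI" */
        return 0;
    if (ecx != 0x6c65746eu)   /* "ntel" */
        return 0;
    return 1;
}
```

With the register values of an AMD CPU (which spell "AuthenticAMD"), the first comparison already fails and the function returns 0.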

The patch-AuthenticAMD utility replaces those three CMP instructions with three other CMPs that look
for the vendor string AuthenticAMD. The libelf library is used to analyze the structure of the
ELF binary to be patched, so we can find the executable sections and make the replacements only in
those sections. This guarantees that what we replace is a machine instruction and not something else.
It is also possible to bypass libelf and make replacements throughout the whole binary.
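
The replacement step can be sketched as follows. This is a minimal illustration, not the code from patch-AuthenticAMD.c: the function name `patch_vendor_cmps` is hypothetical, and it assumes the comparisons use the `cmp $imm32,%eax` encoding (opcode 0x3d followed by a four-byte immediate) seen in the listing above. It works on a buffer already in memory, such as the contents of one executable section.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: rewrite every `cmp $imm32,%eax` (opcode 0x3d) whose
   immediate is a piece of "GenuineIntel" so that it tests for the matching
   piece of "AuthenticAMD". Returns the number of instructions rewritten. */
size_t patch_vendor_cmps(unsigned char *code, size_t len)
{
    static const char *intel[3] = { "Genu", "ineI", "ntel" };
    static const char *amd[3]   = { "Auth", "enti", "cAMD" };
    size_t patched = 0;

    for (size_t i = 0; i + 5 <= len; i++) {
        if (code[i] != 0x3d)
            continue;
        for (int k = 0; k < 3; k++) {
            /* The immediate is stored little-endian, so its bytes are
               the four ASCII characters in reading order. */
            if (memcmp(code + i + 1, intel[k], 4) == 0) {
                memcpy(code + i + 1, amd[k], 4);
                patched++;
                i += 4; /* skip the immediate just rewritten */
                break;
            }
        }
    }
    return patched;
}
```

Because the scan only matches a byte pattern, running it over data rather than code could rewrite bytes that are not instructions, which is why restricting it to executable sections is the safer choice.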

The binaries generated with the Intel C++ Compiler usually have several execution branches: some of
them are for maximum compatibility with x86 processors and others are for maximum speed with SSE
optimizations. With this utility, the executable will take the fastest path your CPU supports.