10+% performance improvement of ggml_vec_dot_q4_0 on AVX2 #654

Merged

@SebastianApel (Contributor) commented Mar 31, 2023

UPDATE:

@slaren @rabidcopy @sw @howard0su

First: let me say "sorry" for the confusion: I was not clear in my original post about what the baseline was. This was mainly because #642 was merged AFTER I had branched to run the tests whose results I reported.

When I published this pull request, I had not realized that #642 existed, let alone that it had already been merged. My results were therefore based on the pre-#642 code.

I have re-run the benchmarks now with three versions: the "pre-PR642" AVX2 code, the AVX2 code from #642 and the AVX2 code from this PR (#654).

At least on my machine I get (for 100 iterations of the benchmark from #653) a 14% performance increase, which is similar to what @jart reported in #654 (comment).

I have therefore changed the title of the PR. Again, sorry for the confusion.

Please let me know your thoughts.

Results Summary (single thread performance)

| Codebase | FLOPS (single thread) | % compared to pre-#642 | % compared to #642 |
| --- | --- | --- | --- |
| pre #642 | 14769 FLOPS | 100% | 76% |
| with #642 | 19443 FLOPS | 132% | 100% |
| with #654 | 22217 FLOPS | 150% | 114% |

Please find the raw data of the benchmark runs here: benchmark-data.csv

Boxplot of the results:

[boxplot image]

Technical details

  • Machine: Intel(R) Core(TM) i5-4300M CPU @ 2.60GHz, 16 GB RAM
  • Compiler & switches:
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -mavx -mavx2 -mfma -mf16c -msse3
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread
I LDFLAGS:  
I CC:       cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
I CXX:      g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0

PREVIOUS POST:

(I leave it here for the sake of transparency.)
This change produces a ~1.5x performance increase of ggml_vec_dot_q4_0 on AVX2.

  • Avg. FLOPS per uSecond: 14797 (100%) for existing code
  • Avg. FLOPS per uSecond: 21828 (147%) for code in this pull request

Root causes for performance improvement:

  • Existing code merges two __m256i vectors in bytesFromNibbles and then separates them again in ggml_vec_dot_q4_0 (see the sketch right after this list)
  • Existing code uses C inline functions, which are apparently harder for the compiler to optimize
  • Code in this pull request uses a combination of C macros and "by-hand" loop unrolling
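To illustrate the first point: a minimal sketch (assuming a q4_0-style layout of packed 4-bit values; this is illustrative, not the exact code from this PR) of keeping the low and high nibbles in separate registers from the start, instead of interleaving them in a helper and pulling them apart again in the dot product.

```c
#include <immintrin.h>
#include <stdint.h>

// Illustrative only: load 32 bytes (64 packed 4-bit values) and split them
// into two 256-bit registers of low and high nibbles, without the
// interleave/de-interleave round trip of the old bytesFromNibbles helper.
static inline void nibbles_split(const uint8_t *p, __m256i *lo, __m256i *hi)
{
    const __m256i packed = _mm256_loadu_si256((const __m256i *)p);
    const __m256i mask   = _mm256_set1_epi8(0x0F);

    *lo = _mm256_and_si256(packed, mask);                        // low 4 bits of each byte
    *hi = _mm256_and_si256(_mm256_srli_epi16(packed, 4), mask);  // high 4 bits of each byte
}
```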

Benchmark data for the code in the master branch (commit 02c5b27), produced with the benchmark tool from pull request #653.

| Iteration | NThreads | SizeX | SizeY | SizeZ | Required FLOPS | Elapsed uSeconds | FLOPS per uSecond |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 11008 | 4096 | 128 | 11542724608 | 783404 | 14734.06 |
| 1 | 1 | 11008 | 4096 | 128 | 11542724608 | 781391 | 14772.02 |
| 2 | 1 | 11008 | 4096 | 128 | 11542724608 | 783799 | 14726.64 |
| 3 | 1 | 11008 | 4096 | 128 | 11542724608 | 777324 | 14849.31 |
| 4 | 1 | 11008 | 4096 | 128 | 11542724608 | 781041 | 14778.64 |
| 5 | 1 | 11008 | 4096 | 128 | 11542724608 | 778367 | 14829.41 |
| 6 | 1 | 11008 | 4096 | 128 | 11542724608 | 777931 | 14837.72 |
| 7 | 1 | 11008 | 4096 | 128 | 11542724608 | 780476 | 14789.34 |
| 8 | 1 | 11008 | 4096 | 128 | 11542724608 | 778450 | 14827.83 |
| 9 | 1 | 11008 | 4096 | 128 | 11542724608 | 778300 | 14830.69 |

Benchmark data for code in this pull request:

| Iteration | NThreads | SizeX | SizeY | SizeZ | Required FLOPS | Elapsed uSeconds | FLOPS per uSecond |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 11008 | 4096 | 128 | 11542724608 | 552244 | 20901.49 |
| 1 | 1 | 11008 | 4096 | 128 | 11542724608 | 540733 | 21346.44 |
| 2 | 1 | 11008 | 4096 | 128 | 11542724608 | 555696 | 20771.65 |
| 3 | 1 | 11008 | 4096 | 128 | 11542724608 | 553204 | 20865.22 |
| 4 | 1 | 11008 | 4096 | 128 | 11542724608 | 546703 | 21113.34 |
| 5 | 1 | 11008 | 4096 | 128 | 11542724608 | 505609 | 22829.35 |
| 6 | 1 | 11008 | 4096 | 128 | 11542724608 | 505045 | 22854.84 |
| 7 | 1 | 11008 | 4096 | 128 | 11542724608 | 525141 | 21980.24 |
| 8 | 1 | 11008 | 4096 | 128 | 11542724608 | 506639 | 22782.94 |
| 9 | 1 | 11008 | 4096 | 128 | 11542724608 | 505468 | 22835.72 |

@jart (Contributor) commented Mar 31, 2023

This PR would change the behavior of evaluation. It makes the output of LLaMA extremely different from what Apple M1 produces on the same inputs. Please compare your change with 02c5b27 which was fine.

What's interesting is that this PR causes behavior to regress in the exact same way #642 did, which was merged a moment ago.. I raised the same concern there too #642 (comment)

I've confirmed that this PR would cause an 11% performance boost on inference, which I measured on an Intel(R) Core(TM) i9-9900 CPU @ 3.10GHz.. Would it be possible to get that performance boost without changing the behavior of LLaMA? If so, why is it not possible to preserve consistency of behavior?

@rabidcopy (Contributor) commented Mar 31, 2023

> This PR would change the behavior of evaluation. It makes the output of LLaMA extremely different from what Apple M1 produces on the same inputs. Please compare your change with 02c5b27 which was fine.
>
> What's interesting is that this PR causes behavior to regress in the exact same way #642 did, which was merged a moment ago.. I raised the same concern there too #642 (comment)
>
> I've confirmed that this PR would cause an 11% performance boost on inference, which I measured on an Intel(R) Core(TM) i9-9900 CPU @ 3.10GHz.. Would it be possible to get that performance boost without changing the behavior of LLaMA? If so, why is it not possible to preserve consistency of behavior?

Edit: A lot of what I said doesn't seem particularly relevant. Here's a comparison between #642 without #617 and this PR.
Left is #642 minus #617, right is this PR. This PR squeezes out a little bit more speed, but the difference may be within the margin of error.
[screenshot comparison]
Then this is the same seed/prompt on the latest master, which has both #642 and #617 merged, and which strangely produces different output and a performance regression.

[screenshot comparison]

@x02Sylvie commented:

There are issues compiling this pull request on MSVC:

error C2065: '__m128i_u': undeclared identifier

It's probably resolvable as mentioned here: https://stackoverflow.com/questions/68939552/can-i-assign-the-result-of-intrinsic-that-returns-m128i-to-variable-of-the-typ
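For what it's worth, the usual workaround is to avoid the GCC-specific __m128i_u type and go through the unaligned-load intrinsic instead; a sketch (not necessarily how this PR ends up fixing it):

```c
#include <immintrin.h>
#include <stdint.h>

// GCC/Clang only - __m128i_u is not defined by MSVC:
//   __m128i v = *(const __m128i_u *)p;   // error C2065 on MSVC
//
// Portable alternative - works on GCC, Clang and MSVC:
static inline __m128i load_unaligned_128(const uint8_t *p)
{
    return _mm_loadu_si128((const __m128i *)p);
}
```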

@SebastianApel (Contributor, Author) commented Mar 31, 2023

@rabidcopy @jart

Could you elaborate a bit more on what you mean by "changes the behavior of the evaluation" and how you produced that outcome?

I tried to replicate your findings on my machine, but I was not successful.

I've tried to reproduce this with three execution runs, all with the same seed:

The text output was identical; the only difference was the improved timings (751.68 ms per run -> 516.78 ms per run), see below.

$ cat /proc/cpuinfo | grep name | uniq
model name	: Intel(R) Core(TM) i5-4300M CPU @ 2.60GHz

$ cat /sys/devices/cpu/caps/pmu_name
haswell

$ gcc -v | tail -1
gcc version 11.3.0 (Ubuntu 11.3.0-1ubuntu1~22.04) 

$ ./main -m /tmp/ggml-alpaca-7b-q4-ggjt.bin -p "Building a website can be done in 10 simple steps:" -n 100 -t 2 --seed 1

 Building a website can be done in 10 simple steps:
Describe your goal.
Brainstorm keywords/phrases.
Choose a domain name.
Set up hosting.
Plan out content structure.
Create Content for your site.
Test and debug your site.
Launch your site!
Market and Promote your site.
1. Describe Your Goal
The first step to creating a website is to clearly describe what you want your website to accomplish. This will help you figure out which features to include,
llama_print_timings:        load time =  6459.39 ms
llama_print_timings:      sample time =   108.39 ms /   100 runs   (    1.08 ms per run)
llama_print_timings: prompt eval time =  7418.28 ms /    14 tokens (  529.88 ms per token)
llama_print_timings:        eval time = 74415.93 ms /    99 runs   (  751.68 ms per run)
llama_print_timings:       total time = 83421.13 ms

$ diff ExecutionRun1.txt ExecutionRun2.txt
13,18c13,17
< llama_print_timings:        load time =  6459.39 ms
< llama_print_timings:      sample time =   108.39 ms /   100 runs   (    1.08 ms per run)
< llama_print_timings: prompt eval time =  7418.28 ms /    14 tokens (  529.88 ms per token)
< llama_print_timings:        eval time = 74415.93 ms /    99 runs   (  751.68 ms per run)
< llama_print_timings:       total time = 83421.13 ms
< 
---
> llama_print_timings:        load time =  6778.95 ms
> llama_print_timings:      sample time =    83.37 ms /   100 runs   (    0.83 ms per run)
> llama_print_timings: prompt eval time =  8431.62 ms /    14 tokens (  602.26 ms per token)
> llama_print_timings:        eval time = 51160.91 ms /    99 runs   (  516.78 ms per run)
> llama_print_timings:       total time = 60553.50 ms

$ LLAMA_OPENBLAS=ON make

I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -mavx -mavx2 -mfma -mf16c -msse3 -DGGML_USE_OPENBLAS -I/usr/local/include/openblas
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread
I LDFLAGS:  -lopenblas
I CC:       cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
I CXX:      g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0



$ diff ExecutionRun1.txt ExecutionRun3.txt
13,17c13,17
< llama_print_timings:        load time =  6459.39 ms
< llama_print_timings:      sample time =   108.39 ms /   100 runs   (    1.08 ms per run)
< llama_print_timings: prompt eval time =  7418.28 ms /    14 tokens (  529.88 ms per token)
< llama_print_timings:        eval time = 74415.93 ms /    99 runs   (  751.68 ms per run)
< llama_print_timings:       total time = 83421.13 ms
---
> llama_print_timings:        load time =  3867.08 ms
> llama_print_timings:      sample time =    80.33 ms /   100 runs   (    0.80 ms per run)
> llama_print_timings: prompt eval time =  5218.40 ms /    14 tokens (  372.74 ms per token)
> llama_print_timings:        eval time = 45214.81 ms /    99 runs   (  456.72 ms per run)
> llama_print_timings:       total time = 51430.97 ms

@rabidcopy (Contributor) commented Mar 31, 2023

For some reason the current master cancels out a fair portion of #642's original performance gains and gives different outputs in my case. Either way, I get a slightly higher uplift with this PR compared to #642, both with and without #617. System is Linux, compiling with LLAMA_OPENBLAS=ON make in all my findings. Might be worth compiling without BLAS. Edit: Same findings without BLAS, aside from the uplift that BLAS itself provides.

@sw (Collaborator) commented Mar 31, 2023

Joining the discussion here, coming from #642...

I'm seeing a tiny speed improvement of ~2% here, certainly not 1.5x. This is on an i3-8100.

No change in output compared to master after merge of #642, which is in line with @jart's observation:

> What's interesting is that this PR causes behavior to regress in the exact same way #642 did

@SebastianApel (Contributor, Author) commented:

> That's very strange. For some reason current master cancels out a fair portion of #642's original performance gains and gives different outputs in my case. Either way, I get a slightly higher uplift with this PR compared to #642 with and without #617. System is Linux, compiling with LLAMA_OPENBLAS=ON make in all my findings. Might be worth compiling without BLAS..

I've also tried to replicate with LLAMA_OPENBLAS=ON make: the output is still identical, and the performance is even better. See my comment above.

@SebastianApel (Contributor, Author) commented Mar 31, 2023

> Joining the discussion here, coming from #642...
>
> I'm seeing a tiny speed improvement of ~2% here, certainly not 1.5x. This is on an i3-8100.
>
> No change in output with master after merge of #642, which is in line with @jart's observation:
>
> > What's interesting is that this PR causes behavior to regress in the exact same way #642 did

@sw That's very interesting; apparently the performance is highly dependent on the specific combination of CPU, compiler, operating system, etc.

Would it be possible for you to try compiling with the compiler switch -march=native appended to lines 34 and 35 of the Makefile?
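For reference, a sketch of that change (assuming lines 34 and 35 hold the CFLAGS and CXXFLAGS definitions; appending with += has the same effect as editing the lines directly):

```make
# illustrative - append to the existing flag definitions in the Makefile
CFLAGS   += -march=native
CXXFLAGS += -march=native
```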

@sw (Collaborator) commented Mar 31, 2023

-march=native is very slightly faster (again ~2%), but I'm not seeing a significant difference between master and this PR. This seems more and more like a wild goose chase...

@rabidcopy (Contributor) commented:

Please disregard most if not all of what I said regarding #617 and #642. I had folders mixed up. There's no mystery performance regression. Performance is still a tiny bit better with this PR, but only by 10-15 ms for me.

@SebastianApel (Contributor, Author) commented:

@sw @rabidcopy

The problem (at least on my machine) is: even with the SAME executable, I get a lot of variance in the "ms per run" eval timings. Between the minimum (448 ms per run) and the maximum (598 ms per run) there is a delta of ~150 ms, which is about 33% of the time per run (see below).

My hypothesis: it depends heavily on what else is happening on the system.

This means it is going to be very hard to draw any conclusion from a single run of ./main.

I think we need a consistent "test harness" with identical settings and compiler switches that executes several runs to get a statistical view. I started to work towards something like this in PR #653; a minimal sketch of the idea follows the timings below.

llama_print_timings:        eval time = 56604.60 ms /    99 runs   (  571.76 ms per run)
llama_print_timings:        eval time = 46594.74 ms /    99 runs   (  470.65 ms per run)
llama_print_timings:        eval time = 46408.88 ms /    99 runs   (  468.78 ms per run)
llama_print_timings:        eval time = 59202.12 ms /    99 runs   (  598.00 ms per run)   <- MAX
llama_print_timings:        eval time = 48923.75 ms /    99 runs   (  494.18 ms per run)
llama_print_timings:        eval time = 46817.46 ms /    99 runs   (  472.90 ms per run)
llama_print_timings:        eval time = 44402.43 ms /    99 runs   (  448.51 ms per run)    <- MIN
llama_print_timings:        eval time = 47603.48 ms /    99 runs   (  480.84 ms per run)
llama_print_timings:        eval time = 47156.63 ms /    99 runs   (  476.33 ms per run)
llama_print_timings:        eval time = 47849.84 ms /    99 runs   (  483.33 ms per run)
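For illustration, a minimal sketch of such a harness (the workload function is a stand-in so the example is self-contained; a real harness would time the actual ggml kernel or a full model eval instead):

```c
#define _POSIX_C_SOURCE 199309L
#include <math.h>
#include <stdio.h>
#include <time.h>

// Stand-in workload so the sketch compiles on its own; a real harness
// would call the kernel under test (e.g. the q4_0 dot product) instead.
static double workload(void)
{
    volatile double acc = 0.0;
    for (int i = 0; i < 20000000; i++) acc += i * 1e-9;
    return acc;
}

// build: cc -O2 bench_sketch.c -lm && ./a.out
int main(void)
{
    enum { RUNS = 10 };
    double ms[RUNS], sum = 0.0, sum_sq = 0.0;

    for (int i = 0; i < RUNS; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        workload();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        ms[i] = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
        sum    += ms[i];
        sum_sq += ms[i] * ms[i];
        printf("run %2d: %8.2f ms\n", i, ms[i]);
    }

    const double avg = sum / RUNS;
    const double sd  = sqrt(sum_sq / RUNS - avg * avg);
    printf("avg %.2f ms, stddev %.2f ms over %d runs\n", avg, sd, RUNS);
    return 0;
}
```

Repeating the measurement inside one process and reporting the average and spread makes it much easier to tell a real 10-15% improvement from scheduler noise.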

@rabidcopy (Contributor) commented:

Just a small comparison on a 13B model. Left is PR, right is master. Compiled with BLAS and -march=native.
[screenshot comparison]

@howard0su (Collaborator) commented:

To help, I ran a stats calculation based on FLOPS per µs:

ministat.exe a.txt b.txt
x a.txt
+ b.txt
+--------------------------------------------------------------------------+
|xx                                                                       +|
|xx                                                     +                 +|
|xx                                                    ++ + +     +      ++|
||A                                                     |________AM______| |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10      14726.64      14849.31      14827.83     14797.566     44.044047
+  10      20771.65      22854.84      21980.24     21828.123     922.04901
Difference at 95.0% confidence
        7030.56 +/- 613.303
        47.5116% +/- 4.14462%
        (Student's t, pooled s = 652.731)

@SebastianApel (Contributor, Author) commented Apr 1, 2023

> To help, I ran a stats calculation based on FLOPS per µs:

@howard0su Nice! What do "a" and "b" stand for in your comment?

@SebastianApel changed the title from "~1.5x performance improvement of ggml_vec_dot_q4_0 on AVX2" to "10+% performance improvement of ggml_vec_dot_q4_0 on AVX2" on Apr 1, 2023
@Ameobea (Contributor) commented Apr 2, 2023

Impressive work! I honestly don't fully understand how this is providing performance improvements over the already optimized stuff in #642, but I can replicate the improvement locally on my machine as well.

It seems that after these changes, the AVX512 version can't keep pace and it's actually significantly slower to use that version now, even on AVX512-capable hardware. I've made some attempts to port your changes over to the AVX512 code as well, but without any success so far.

My best guess for the reason behind this is that the extremely dense 4-bit data has more overhead than benefit for AVX512 at this point. For example, there is no single instruction to broadcast a 32-bit value from memory to a zmm/512-bit register; _mm512_set1_ps seems to produce an expensive combination of a broadcast into a ymm followed by a vshuff32x4.

Additionally, I've noticed that while shorter in instruction count, the AVX512 code actually has a longer latency due to data dependencies. The CPU actually seems to be able to parallelize work across three execution ports in your improved AVX2 version (AVX2 execution trace) vs. only two in the AVX512 version (AVX512 execution trace)

I've created a godbolt comparison between your AVX2 version and my attempt at porting it to AVX512. You can switch between them by toggling the define on line 20.

Anyway, I figure there's a good chance you don't have access to an AVX512-enabled CPU yourself, so if you're not able to replicate or continue on with the AVX512 porting of your improvements, I think the best thing to do for now would be to remove or disable the AVX512 codepath and default back to AVX2 for all CPUs that support it.


EDIT: I've done some more investigation, and it seems that clang-13/clang++-13 is producing significantly better code than gcc for both your current AVX2 and my adapted AVX512 version. AVX512 is still significantly slower, but yeah I think that's something worth investigating if we're really chasing down perf.

Updated godbolt comparison using clang 13: https://c.godbolt.org/z/Yv5szs6Kh

clang-13 AVX2 execution trace: https://uica.uops.info/tmp/63000579f2b24ee182d450170a4f5dd8_trace.html

clang-13 AVX512 execution trace: https://uica.uops.info/tmp/d65cfb55599a41ffbb0cbcbe22c8dcbe_trace.html

With clang-13 on my 7950x CPU, I'm seeing:

  • 231.11 ms per run with AVX2
  • 298.52 ms per run with my ported AVX512 version.

With gcc-12, I'm seeing:

  • 247.57 ms per run with AVX2
  • 329.70 ms per run with AVX512

@Ameobea (Contributor) commented Apr 2, 2023

I've wasted a couple more hours on this. I've realized that once I get ~6 threads going, there are no detectable differences between the AVX2 and AVX512 versions. The whole thing becomes completely memory bound and no changes to compiler or AVX2 vs. AVX512 move more than a percent or two away from 100ms/token.

I'm not going to spend any more time trying to optimize this myself at this point. My opinion is that we should just drop the AVX512 implementation. It provides no detectable benefit for me in the best case, and makes things slower in the worst case.

@ggerganov (Owner) left a review comment:

I agree with @Ameobea that we should drop AVX512 for now until we have an implementation that is actually benefiting from these instructions. Moreover, we now use -march=native by default.

Regarding reproducibility of the results:
It is totally expected that the produced results vary between different SIMD implementations, since we are dealing with floating-point round-off errors and the model is auto-regressive, meaning that small variability at the start can cause totally different results at the end.
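As a tiny illustration of that round-off point (the values are arbitrary, chosen only to make the effect visible in one line of output):

```c
#include <stdio.h>

// Summing the same three floats in a different order gives a different
// result; SIMD implementations effectively reorder such sums, so their
// outputs can legitimately diverge token by token.
int main(void)
{
    const float a = 1e8f, b = -1e8f, c = 1.0f;
    printf("(a + b) + c = %.1f\n", (a + b) + c);  // 1.0
    printf("a + (b + c) = %.1f\n", a + (b + c));  // 0.0 - c is absorbed into b
    return 0;
}
```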

The thing we are interested in keeping approximately the same is perplexity.
This is the quality metric we currently have, and as long as it is not degraded, we can accept changes to the SIMD implementation that improve performance.

I haven't tested this PR on x86 since it is more tedious for me, but if you observe good perplexity values and performance improvement, please go ahead and merge these changes.

Important
When you evaluate the perplexity, remember that for batch sizes >= 32, ggml will currently switch to OpenBLAS sgemm if you are linking against it, and this piece of code will not be exercised at all. So make sure to either use a smaller batch size or, better, disable OpenBLAS when evaluating the perplexity.
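A hedged example of how that could look in practice (the -b batch-size and -f prompt-file flags are assumed from the llama.cpp example programs of that time; model and data paths are placeholders):

```sh
# rebuild without OpenBLAS so the ggml q4_0 path is what actually gets measured
make clean && make

# or keep OpenBLAS but force a batch size below 32 so sgemm is not used
./perplexity -m ./models/7B/ggml-model-q4_0.bin -f ./wiki.test.raw -b 16
```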

ggml.c: review comment (outdated, resolved)
@SebastianApel (Contributor, Author) commented:

> This is the quality metric we currently have, and as long as it is not degraded, we can accept changes to the SIMD implementation that improve performance.

I would be very happy to verify this. But: I started the perplexity binary and it tells me: 223.28 seconds per pass - ETA 40.63 hours. This means it would take nearly 4 days until I have a result (2 runs, one for #642, one for #654).

Is there someone with a beefier machine who could help out?

@SebastianApel (Contributor, Author) commented:

> we should drop AVX512 for now until we have an implementation that is actually benefiting from these instructions.

@ggerganov I would suggest moving that to a separate PR and keeping the removal in a single commit. This would help keep things clean, and it would make it easier to revert the commit if the AVX512 code is still wanted by someone.

@SebastianApel (Contributor, Author) commented:

@Ameobea

> Additionally, I've noticed that while shorter in instruction count, the AVX512 code actually has a longer latency due to data dependencies. The CPU actually seems to be able to parallelize work across three execution ports in your improved AVX2 version (AVX2 execution trace) vs. only two in the AVX512 version (AVX512 execution trace)

Awesome, thank you for the links - I did not know that existed; it's super helpful.

> Anyway, I figure there's a good chance you don't have access to an AVX512-enabled CPU yourself,

Yes, I currently do not have access to an AVX512 CPU. This is why I suggested moving that into a separate issue here: #654 (comment)

> I've done some more investigation, and it seems that clang-13/clang++-13 is producing significantly better code than gcc for both your current AVX2

I was also under the impression that clang seems to produce slightly better assembly. But I am not sure whether I was comparing it against gcc without -march=native.

@SebastianApel force-pushed the performance_improvement_q4_0_AVX2 branch from cb5cc71 to d8acf29 (April 2, 2023 20:03)
@SebastianApel requested a review from sw (April 2, 2023 20:04)
@SebastianApel force-pushed the performance_improvement_q4_0_AVX2 branch from d8acf29 to e621f62 (April 2, 2023 20:08)
@SebastianApel force-pushed the performance_improvement_q4_0_AVX2 branch from 9e62f03 to 69ef03d (April 2, 2023 20:13)
ggml.c: review comment (outdated, resolved)
@rabidcopy (Contributor) commented:

After some short but repeated testing of the latest commits, I see about the same performance as with the initial commit of this PR. Compared to current master it is, on average, 190 ms per run versus 165 ms per run, with this PR being faster. So a 10-15% performance increase is still pretty accurate. The output remains the same.

@SebastianApel requested a review from sw (April 3, 2023 07:37)
@sw (Collaborator) left a review comment:

Given @ggerganov's approval on a previous revision, and the detailed statistics provided by @SebastianApel (thanks!), I'm merging this even though I see no improvement on my machine.

@sw merged commit 437e778 into ggerganov:master on Apr 3, 2023
@rabidcopy (Contributor) commented:

Glad to see this merged. Little by little inference is getting faster.

@sw (Collaborator) commented Apr 4, 2023

This brought in a bunch of trailing whitespace, sorry for not catching that. @SebastianApel: please configure your editor properly.
