
break when there's nothing to read #2582

Merged: 2 commits merged into huggingface:main on Oct 21, 2024

Conversation

sywangyi (Contributor)

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sywangyi (Contributor Author)

We see a huge performance drop when using the CPU for inference if we bind CPUs. The command looks like:
docker run --rm --shm-size 1g --network host -e http_proxy=${http_proxy} -e https_proxy=${https_proxy} --cap-add=sys_nice --ipc=host --cpuset-cpus=0-55 -v ~/data:/data --device /dev/dri --privileged tgi_cpu2 --model-id=meta-llama/Meta-Llama-3.1-8B-Instruct
time_per_token="3.251889479s",
I found the launcher blocked in the buffered read. After the fix:
time_per_token="99.367121ms"
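
For readers following along, here is a minimal, illustrative sketch of the pattern the PR title describes (breaking out of the read loop when there is nothing left to read). The function name and buffer size are made up; this is a simplification, not the launcher's actual log_lines code:

    use std::io::{BufReader, Read};

    // Illustrative only: read child output in a loop, but stop once read()
    // reports zero bytes instead of spinning forever on an exhausted source.
    fn drain_output<R: Read>(mut reader: BufReader<R>) {
        let mut buf = [0u8; 8192];
        loop {
            match reader.read(&mut buf) {
                // Nothing left to read: break instead of looping on n == 0.
                Ok(0) => break,
                Ok(n) => print!("{}", String::from_utf8_lossy(&buf[..n])),
                // Any read error also ends the loop.
                Err(_) => break,
            }
        }
    }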

sywangyi marked this pull request as draft on September 28, 2024 at 02:13
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
sywangyi marked this pull request as ready for review on September 28, 2024 at 02:56
@sywangyi (Contributor Author)

@Narsil @ErikKaum @danieldk, please help review it.

@Narsil (Collaborator) commented Sep 30, 2024

This would break the debugger interactivity.

What I understand you're doing here is simply cutting the reading from stdin (whenever there's nothing to read, which is most of the time).

What I imagine the issue to be: on CPU, there are a lot of context interrupts because it tries to read from stdin.
It doesn't make sense to me why that should be so, but I can definitely see it being an issue.

We could definitely protect it with: if LevelFilter::current() >= tracing::Level::DEBUG (like we also have for the logging utility).
We can then bypass the stdin interactivity entirely. It's a bummer since debugger interactivity is really nice, but at least we can keep it around whenever LOG_LEVEL=debug is set in the environment (having an interactive Python debugger is a huge bonus for debugging speed).
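
A minimal sketch of the guard being suggested here, assuming the tracing crate's LevelFilter; the function name and surrounding structure are hypothetical:

    use tracing::level_filters::LevelFilter;

    // Hypothetical sketch: only wire up stdin forwarding (used for
    // interactive Python debuggers) when the launcher runs at debug level,
    // e.g. LOG_LEVEL=debug. Otherwise skip stdin handling entirely.
    fn maybe_forward_stdin() {
        if LevelFilter::current() >= tracing::Level::DEBUG {
            // spawn the stdin-forwarding thread here (omitted in this sketch)
        }
    }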

@Narsil (Collaborator) commented Oct 1, 2024

Here is the approach I'm talking about: #2596

Narsil mentioned this pull request on Oct 2, 2024
@sywangyi (Contributor Author) commented Oct 4, 2024

Hi @Narsil, PR #2596 still has the issue on my side, since it deletes the break in the log_lines function that reads stdout (https://github.com/huggingface/text-generation-inference/pull/2582/files#diff-deb98d21e32cf9a211b60ba46044f0ce4b6f75a1ec3adebc69185b7a845ecb18R1068-R1069). On my side, adding the break in log_lines does not affect the log output: if there's data in stdout, the loop that reads the output is still triggered, and I think it should be similar for stdin. Could you share how to use the stdin interactivity, so I can verify on my side?

@Narsil (Collaborator) commented Oct 9, 2024

@sywangyi ok, we are adding CI for Intel (CPU for now): #2630

At least we'll detect stuff like broken imports.
For this particular issue, I'm going to spend some time reproducing it and finding a cleaner fix.

@sywangyi (Contributor Author)

@Narsil glad to know that CI for Intel CPU is being added; yes, at least we can detect stuff like broken imports and broken APIs. cc @yao-matrix

@Narsil (Collaborator) commented Oct 11, 2024

Okay, I've set up a machine on AWS (Intel(R) Xeon(R) Platinum 8488C) and I still see 100ms/token.

On main.

docker run -v /mnt/:/data/  --shm-size 1g -e HF_TOKEN=$HF_TOKEN -p 8080:80 tgi --model-id meta-llama/Llama-3.2-1B-Instruct --cuda-graphs 0

I don't think the other arguments you're using are necessary, are they?
cpuset is only used to limit the number of CPUs, right?

I'm mostly trying to reproduce so we can make sure we can keep the benefit of this stdin manipulation in debug while removing the bug for CPU.

@sywangyi (Contributor Author)

cpuset is needed to reproduce it. Our validation team intended to launch two text-generation servers, one per socket, via CPU binding, and then found this issue.

@Narsil (Collaborator) commented Oct 15, 2024

Can we recheck on master? I merged the protection with the log level, which should limit the issue.

I am definitely not able to reproduce on my end, unfortunately.

@sywangyi (Contributor Author) commented Oct 15, 2024

I still see the issue on master, and I must apply the break in log_lines<R: Sized + Read>(mut bufread: BufReader) to avoid it. I just updated this PR to fix it.
This is my command:
docker run --rm --shm-size 1g --network host -e http_proxy=${http_proxy} -e https_proxy=${https_proxy} --cap-add=sys_nice --ipc=host --cpuset-cpus=0-55 -v /workspace/ywan171/data:/data --device /dev/dri --privileged tgi_cpu2 --model-id=meta-llama/Meta-Llama-3.1-8B-Instruct

I have two CPU sockets in the node, each socket with 56 physical cores, and I just bind to socket 0 to run the inference.

@Narsil (Collaborator) commented Oct 18, 2024

I'm really struggling to reproduce anything.

I reproduced your command line with every argument (even though I don't understand why --privileged, --net host, and --ipc host are actually required) and I still get 100ms/token without any issue.

The "fix" cannot be a proper fix, because if it works, that means it's deactivating the log facilities, which is not something we ever want to happen (having good logs is important).
https://doc.rust-lang.org/beta/std/io/trait.Read.html#tymethod.read specifies that the read might not be blocking depending on the target. Could the issue be that you're somehow running on a host where the read method is non-blocking, so this method starts spinning like crazy (and therefore escaping the logs "fixes" the spinning loop)?

Could it be a kernel flag, or a specific shell you're using?

I tried to find out whether some targets are known to be non-blocking, but it doesn't seem to be the case:
rust-lang/rust#39810 (comment)

If it's indeed non-blocking I/O that's causing the issue (which is the only explanation that makes sense to me for why the current fix would fix anything), then I'd favor fixes that actually force blocking I/O.
As a last resort, some kind of sleeping in the loop would work (with a lot of comments explaining the reason for it), as in the sketch below.

Removing the logs is not OK.
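
For illustration, a rough sketch of that last-resort idea, assuming non-blocking reads surface as WouldBlock errors; the names are invented and this is not the fix that was merged:

    use std::io::{ErrorKind, Read};
    use std::thread;
    use std::time::Duration;

    // Sketch only: back off briefly when a non-blocking source has nothing
    // ready, instead of spinning at 100% CPU, while still stopping on EOF.
    fn read_with_backoff<R: Read>(mut source: R) {
        let mut buf = [0u8; 4096];
        loop {
            match source.read(&mut buf) {
                // EOF: the child's output is closed, nothing more will come.
                Ok(0) => break,
                Ok(n) => print!("{}", String::from_utf8_lossy(&buf[..n])),
                // Nothing available yet on a non-blocking source: sleep a bit.
                Err(e) if e.kind() == ErrorKind::WouldBlock => {
                    thread::sleep(Duration::from_millis(10));
                }
                Err(_) => break,
            }
        }
    }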

@sywangyi (Contributor Author) commented Oct 18, 2024

Are you pinning CPU cores like "--cpuset-cpus=0-55"? Or could you adjust the CPU cores? Maybe I could try some sleep instead of break, but even when I break, I still see the log output in the terminal.

@Narsil (Collaborator) commented Oct 18, 2024

> Are you pinning CPU cores like "--cpuset-cpus=0-55"? Or could you adjust the CPU cores?

Indeed, I adapted it since I don't have the same number of cores. I also tried running the docker command under numactl to pin a specific socket.
I'm not sure if docker-land works the same way, though.

> Maybe I could try some sleep instead of break, but even when I break, I still see the log output in the terminal.

This doesn't make sense. If you break, you should exit the loop, therefore exiting our printing loop.
I can't trigger it either, but that's because the reads are blocking (therefore they never return zero).
Can you log something before breaking to see if you're actually managing to trigger it?

@Narsil (Collaborator) commented Oct 18, 2024

As a side note, when I'm using ALL cores over both sockets, I get ~110ms/token. Not sure if that value is normal, but I figure the socket crossing has overhead that cancels out the speedup, maybe?

Could it be a Docker version issue on my end too (27.1.1 with buildx)? Or something in the cgroups that defeats --cpuset-cpus? (I do see a difference based on how many CPUs I allow, though.)
Maybe something linked to AWS virtualization kicking in for me?
I'm also not sure how to handle this issue; I've tried hard to reproduce it on several machines so far.

@sywangyi (Contributor Author) commented Oct 21, 2024

Hi @Narsil, I debugged in my environment and found it's blocked in this thread: https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs#L1266-L1268. I printed "n" in log_lines and it's continuously "0", while the thread at https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs#L946-L948 behaves correctly.

diff --git a/launcher/src/main.rs b/launcher/src/main.rs
index d9f569f..56fdf20 100644
--- a/launcher/src/main.rs
+++ b/launcher/src/main.rs
@@ -1264,7 +1264,10 @@ fn download_convert_model(
     let download_stdout = BufReader::new(download_process.stdout.take().unwrap());

     thread::spawn(move || {
-        log_lines(download_stdout);
+        //log_lines(download_stdout);
+        for line in download_stdout.lines().map_while(Result::ok) {
+            tracing::info!("{line}");
+        }
     });
This fix also works on my side.

@Narsil (Collaborator) commented Oct 21, 2024

💡💡💡💡💡

We're just never closing that thread, which we indeed should (regardless of the output, actually).
I'm not sure why I'm not able to trigger it on my end, but it doesn't matter; closing it is correct. I think we can do it more cleanly than breaking, though.

@Narsil (Collaborator) commented Oct 21, 2024

Does #2674 fix it too?

@sywangyi (Contributor Author)

It hangs in the t.join(), since there's no break in the loop in log_lines.

@Narsil (Collaborator) commented Oct 21, 2024

Oh right, let's go with your fix then.

I'm still not sure this is technically correct given the docs: https://doc.rust-lang.org/std/io/trait.Read.html#tymethod.read
But doing a "cleaner" fix involves sending a channel over to the thread, which might be too much for such a simple thing.

I'm really curious how to reproduce that n=0 instead of Err(_) when reading from the stdout of a closed process.
What kind of distro are you running?
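
For reference, a sketch of what the channel-based variant could look like; the names are hypothetical, and the per-line shutdown check is exactly the kind of extra machinery that makes it feel heavyweight for such a simple thing:

    use std::io::{BufRead, BufReader, Read};
    use std::sync::mpsc::{Receiver, TryRecvError};
    use std::thread;

    // Sketch only: the logging thread stops either on EOF or when the
    // launcher signals shutdown (or drops the sending half of the channel).
    fn spawn_log_thread<R: Read + Send + 'static>(
        reader: BufReader<R>,
        shutdown: Receiver<()>,
    ) -> thread::JoinHandle<()> {
        thread::spawn(move || {
            for line in reader.lines().map_while(Result::ok) {
                if matches!(shutdown.try_recv(), Ok(()) | Err(TryRecvError::Disconnected)) {
                    // The launcher asked us to stop (or went away): exit early.
                    break;
                }
                println!("{line}");
            }
        })
    }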

Narsil merged commit 058d306 into huggingface:main on Oct 21, 2024
@sywangyi (Contributor Author)

docker --version
Docker version 26.1.3, build b72abbb
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 224
On-line CPU(s) list: 0-223
Thread(s) per core: 2
Core(s) per socket: 56
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 143
Model name: Intel(R) Xeon(R) Platinum 8480L
Stepping: 7
CPU MHz: 2000.000
CPU max MHz: 3800.0000
CPU min MHz: 800.0000
BogoMIPS: 4000.00
L1d cache: 48K
L1i cache: 32K
L2 cache: 2048K
L3 cache: 107520K
NUMA node0 CPU(s): 0-55,112-167
NUMA node1 CPU(s): 56-111,168-223

@sywangyi (Contributor Author)

I am also curious why the thread keeps looping while the process is closed.

@Narsil (Collaborator) commented Oct 23, 2024

I think it's OS-dependent more than CPU-dependent, no? I'm wondering how Docker handles this and whether the Docker version could affect it too.

The doc definitely says that the behavior varies based on the target.

@sywangyi (Contributor Author)

I think we use the same OS, since it's Ubuntu 22.04; see https://github.com/huggingface/text-generation-inference/blob/main/Dockerfile_intel#L127
