
break when there's nothing to read #2582

Merged: 2 commits merged into huggingface:main on Oct 21, 2024

Conversation

sywangyi (Contributor)

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sywangyi (Contributor Author)

We see a huge performance drop when using the CPU for inference if we bind CPUs. The command looks like:
docker run --rm --shm-size 1g --network host -e http_proxy=${http_proxy} -e https_proxy=${https_proxy} --cap-add=sys_nice --ipc=host --cpuset-cpus=0-55 -v ~/data:/data --device /dev/dri --privileged tgi_cpu2 --model-id=meta-llama/Meta-Llama-3.1-8B-Instruct
time_per_token="3.251889479s",
I found the launcher blocked in the buffered read. After the fix:
time_per_token="99.367121ms"
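
For readers following along, here is a minimal, illustrative sketch of the pattern the PR title describes (breaking out of the read loop when there is nothing left to read). The function name and buffer size are made up; this is a simplification, not the launcher's actual log_lines code:

    use std::io::{BufReader, Read};

    // Illustrative only: read child output in a loop, but stop once read()
    // reports zero bytes instead of spinning forever on an exhausted source.
    fn drain_output<R: Read>(mut reader: BufReader<R>) {
        let mut buf = [0u8; 8192];
        loop {
            match reader.read(&mut buf) {
                // Nothing left to read: break instead of looping on n == 0.
                Ok(0) => break,
                Ok(n) => print!("{}", String::from_utf8_lossy(&buf[..n])),
                // Any read error also ends the loop.
                Err(_) => break,
            }
        }
    }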

sywangyi marked this pull request as draft on September 28, 2024 at 02:13
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
sywangyi marked this pull request as ready for review on September 28, 2024 at 02:56
@sywangyi (Contributor Author)

@Narsil @ErikKaum @danieldk, please help review it.

@Narsil (Collaborator) commented Sep 30, 2024

This would break the debugger interactivity.

What I understand you're doing here is simply cutting the reading from stdin (whenever there's nothing to read, which is most of the time).

What I imagine the issue to be: on CPU, there are a lot of context interrupts because it tries to read from stdin.
It doesn't make sense to me why that should be so, but I can definitely see it being an issue.

We could definitely protect it with: if LevelFilter::current() >= tracing::Level::DEBUG (like we also have for the logging utility).
We can then bypass the stdin interactivity entirely. It's a bummer since debugger interactivity is really nice, but at least we can keep it around whenever LOG_LEVEL=debug is set in the environment (having an interactive Python debugger is a huge bonus for debugging speed).
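
A minimal sketch of the guard being suggested here, assuming the tracing crate's LevelFilter; the function name and surrounding structure are hypothetical:

    use tracing::level_filters::LevelFilter;

    // Hypothetical sketch: only wire up stdin forwarding (used for
    // interactive Python debuggers) when the launcher runs at debug level,
    // e.g. LOG_LEVEL=debug. Otherwise skip stdin handling entirely.
    fn maybe_forward_stdin() {
        if LevelFilter::current() >= tracing::Level::DEBUG {
            // spawn the stdin-forwarding thread here (omitted in this sketch)
        }
    }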

@Narsil (Collaborator) commented Oct 1, 2024

Here is the approach I'm talking about: #2596

Narsil mentioned this pull request on Oct 2, 2024
@sywangyi (Contributor Author) commented Oct 4, 2024

Hi @Narsil, PR #2596 still has the issue on my side, since it deletes the break in the log_lines function that reads stdout (https://github.com/huggingface/text-generation-inference/pull/2582/files#diff-deb98d21e32cf9a211b60ba46044f0ce4b6f75a1ec3adebc69185b7a845ecb18R1068-R1069). On my side, adding the break in log_lines does not affect the log output: if there's data in stdout, the loop that reads the output is still triggered, and I think it should be similar for stdin. Could you share how to use the stdin interactivity, so I can verify on my side?

@Narsil (Collaborator) commented Oct 9, 2024

@sywangyi ok, we are adding CI for Intel (CPU for now): #2630

At least we'll detect stuff like broken imports.
For this particular issue, I'm going to spend some time reproducing it and finding a cleaner fix.

@sywangyi (Contributor Author)

@Narsil glad to know that CI for Intel CPU is being added; yes, at least we can detect stuff like broken imports and broken APIs. cc @yao-matrix

@Narsil (Collaborator) commented Oct 11, 2024

Okay, I've set up a machine on AWS (Intel(R) Xeon(R) Platinum 8488C) and I still see 100ms/token.

On main.

docker run -v /mnt/:/data/  --shm-size 1g -e HF_TOKEN=$HF_TOKEN -p 8080:80 tgi --model-id meta-llama/Llama-3.2-1B-Instruct --cuda-graphs 0

I don't think the other arguments you're using are necessary, are they?
cpuset is only used to limit the number of CPUs, right?

I'm mostly trying to reproduce so we can make sure we can keep the benefit of this stdin manipulation in debug while removing the bug for CPU.

@sywangyi (Contributor Author)

cpuset is needed to reproduce it. Our validation team intended to launch two text-generation servers, one per socket, via CPU binding, and then found this issue.

@Narsil (Collaborator) commented Oct 15, 2024

Can we recheck on master? I merged the protection with the log level, which should limit the issue.

I am definitely not able to reproduce on my end, unfortunately.

@sywangyi (Contributor Author) commented Oct 15, 2024

I still see the issue on master, and I must apply the break in log_lines<R: Sized + Read>(mut bufread: BufReader) to avoid it. I just updated this PR to fix it.
This is my command:
docker run --rm --shm-size 1g --network host -e http_proxy=${http_proxy} -e https_proxy=${https_proxy} --cap-add=sys_nice --ipc=host --cpuset-cpus=0-55 -v /workspace/ywan171/data:/data --device /dev/dri --privileged tgi_cpu2 --model-id=meta-llama/Meta-Llama-3.1-8B-Instruct

I have two CPU sockets in the node, each socket with 56 physical cores, and I just bind to socket 0 to run the inference.

@Narsil (Collaborator) commented Oct 18, 2024

I'm really struggling to reproduce anything.

I reproduced your command line with every argument (even though I don't understand why --privileged, --net host, and --ipc host are actually required) and I still get 100ms/token without any issue.

The "fix" cannot be a proper fix, because if it works, that means it's deactivating the log facilities, which is not something we ever want to happen (having good logs is important).
https://doc.rust-lang.org/beta/std/io/trait.Read.html#tymethod.read specifies that the read might not be blocking depending on the target. Could the issue be that you're somehow running on a host where the read method is non-blocking, so this method starts spinning like crazy (and therefore escaping the logs "fixes" the spinning loop)?

Could it be a kernel flag, or a specific shell you're using?

I tried to find out whether some targets are known to be non-blocking, but it doesn't seem to be the case:
rust-lang/rust#39810 (comment)

If it's indeed non-blocking I/O that's causing the issue (which is the only explanation that makes sense to me for why the current fix would fix anything), then I'd favor fixes that actually force blocking I/O.
As a last resort, some kind of sleeping in the loop would work (with a lot of comments explaining the reason for it), as in the sketch below.

Removing the logs is not OK.
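
For illustration, a rough sketch of that last-resort idea, assuming non-blocking reads surface as WouldBlock errors; the names are invented and this is not the fix that was merged:

    use std::io::{ErrorKind, Read};
    use std::thread;
    use std::time::Duration;

    // Sketch only: back off briefly when a non-blocking source has nothing
    // ready, instead of spinning at 100% CPU, while still stopping on EOF.
    fn read_with_backoff<R: Read>(mut source: R) {
        let mut buf = [0u8; 4096];
        loop {
            match source.read(&mut buf) {
                // EOF: the child's output is closed, nothing more will come.
                Ok(0) => break,
                Ok(n) => print!("{}", String::from_utf8_lossy(&buf[..n])),
                // Nothing available yet on a non-blocking source: sleep a bit.
                Err(e) if e.kind() == ErrorKind::WouldBlock => {
                    thread::sleep(Duration::from_millis(10));
                }
                Err(_) => break,
            }
        }
    }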

@sywangyi (Contributor Author) commented Oct 18, 2024

Are you pinning CPU cores like "--cpuset-cpus=0-55"? Or could you adjust the CPU cores? Maybe I could try some sleep instead of break, but even when I break, I still see the log output in the terminal.

@Narsil (Collaborator) commented Oct 18, 2024

> Are you pinning CPU cores like "--cpuset-cpus=0-55"? Or could you adjust the CPU cores?

Indeed, I adapted it since I don't have the same number of cores. I also tried running the docker command under numactl to pin a specific socket.
I'm not sure if docker-land works the same way, though.

> Maybe I could try some sleep instead of break, but even when I break, I still see the log output in the terminal.

This doesn't make sense. If you break, you should exit the loop, therefore exiting our printing loop.
I can't trigger it either, but that's because the reads are blocking (therefore they never return zero).
Can you log something before breaking to see if you're actually managing to trigger it?

@Narsil (Collaborator) commented Oct 18, 2024

As a side note, when I'm using ALL cores over both sockets, I get ~110ms/token. Not sure if that value is normal, but I figure the socket crossing has overhead that cancels out the speedup, maybe?

Could it be a Docker version issue on my end too (27.1.1 with buildx)? Or something in the cgroups that defeats --cpuset-cpus? (I do see a difference based on how many CPUs I allow, though.)
Maybe something linked to AWS virtualization kicking in for me?
I'm also not sure how to handle this issue; I've tried hard to reproduce it on several machines so far.

@sywangyi (Contributor Author) commented Oct 21, 2024

Hi @Narsil, I debugged in my environment and found it's blocked in this thread: https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs#L1266-L1268. I printed "n" in log_lines and it's continuously "0", while the thread at https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs#L946-L948 behaves correctly.

diff --git a/launcher/src/main.rs b/launcher/src/main.rs
index d9f569f..56fdf20 100644
--- a/launcher/src/main.rs
+++ b/launcher/src/main.rs
@@ -1264,7 +1264,10 @@ fn download_convert_model(
     let download_stdout = BufReader::new(download_process.stdout.take().unwrap());

     thread::spawn(move || {
-        log_lines(download_stdout);
+        //log_lines(download_stdout);
+        for line in download_stdout.lines().map_while(Result::ok) {
+            tracing::info!("{line}");
+        }
     });
This fix also works on my side.

@Narsil (Collaborator) commented Oct 21, 2024

💡💡💡💡💡

We're just never closing that thread, which we indeed should (regardless of the output, actually).
I'm not sure why I'm not able to trigger it on my end, but it doesn't matter; closing it is correct. I think we can do it more cleanly than breaking, though.

@Narsil (Collaborator) commented Oct 21, 2024

Does #2674 fix it too?

@sywangyi (Contributor Author)

It hangs in the t.join(), since there's no break in the loop in log_lines.

@Narsil (Collaborator) commented Oct 21, 2024

Oh right, let's go with your fix then.

I'm still not sure this is technically correct given the docs: https://doc.rust-lang.org/std/io/trait.Read.html#tymethod.read
But doing a "cleaner" fix involves sending a channel over to the thread, which might be too much for such a simple thing.

I'm really curious how to reproduce that n=0 instead of Err(_) when reading from the stdout of a closed process.
What kind of distro are you running?
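
For reference, a sketch of what the channel-based variant could look like; the names are hypothetical, and the per-line shutdown check is exactly the kind of extra machinery that makes it feel heavyweight for such a simple thing:

    use std::io::{BufRead, BufReader, Read};
    use std::sync::mpsc::{Receiver, TryRecvError};
    use std::thread;

    // Sketch only: the logging thread stops either on EOF or when the
    // launcher signals shutdown (or drops the sending half of the channel).
    fn spawn_log_thread<R: Read + Send + 'static>(
        reader: BufReader<R>,
        shutdown: Receiver<()>,
    ) -> thread::JoinHandle<()> {
        thread::spawn(move || {
            for line in reader.lines().map_while(Result::ok) {
                if matches!(shutdown.try_recv(), Ok(()) | Err(TryRecvError::Disconnected)) {
                    // The launcher asked us to stop (or went away): exit early.
                    break;
                }
                println!("{line}");
            }
        })
    }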

Narsil merged commit 058d306 into huggingface:main on Oct 21, 2024
@sywangyi (Contributor Author)

docker --version
Docker version 26.1.3, build b72abbb
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 224
On-line CPU(s) list: 0-223
Thread(s) per core: 2
Core(s) per socket: 56
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 143
Model name: Intel(R) Xeon(R) Platinum 8480L
Stepping: 7
CPU MHz: 2000.000
CPU max MHz: 3800.0000
CPU min MHz: 800.0000
BogoMIPS: 4000.00
L1d cache: 48K
L1i cache: 32K
L2 cache: 2048K
L3 cache: 107520K
NUMA node0 CPU(s): 0-55,112-167
NUMA node1 CPU(s): 56-111,168-223

@sywangyi (Contributor Author)

I am also curious why the thread keeps looping while the process is closed.

@Narsil (Collaborator) commented Oct 23, 2024

I think it's OS-dependent more than CPU-dependent, no? I'm wondering how Docker handles this and whether the Docker version could affect it too.

The doc definitely says that the behavior varies based on the target.

@sywangyi (Contributor Author)

I think we use the same OS, since it's Ubuntu 22.04; see https://github.com/huggingface/text-generation-inference/blob/main/Dockerfile_intel#L127
