Proof of concept TCP server mode #278
Conversation
Shall we close #267 now that we have this?
Force-pushed from 8f1eeca to 4a79a33
Rebased
* fix coloring of last `n_batch` of prompt, and refactor line input
* forgot the newline that needs to be sent to the model
* (per #283) try to force flush of color reset in SIGINT handler
If anyone is looking for a working client/server implementation,
Force-pushed from 1075f48 to 5e65c52
@avilum One problem with your implementation is that the service spawns a new llama.cpp instance for every HTTP request, so it can take a long time to respond (the model has to be loaded every time). I suggest you try a different approach using this branch: start llama.cpp once, and on every HTTP request create a TCP connection to the singleton instance. This gives you a clean environment for every request without the overhead of reloading the model.
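A rough illustration of that connect-per-request pattern (the port and the direct prompt write are placeholders; the actual branch expects the argument protocol described later in this thread before any prompt data):

```cpp
// Minimal connect-per-request client sketch (POSIX sockets).
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(8080);                    // placeholder port
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);  // loopback, as hardcoded in the branch

    if (connect(fd, (sockaddr *)&addr, sizeof(addr)) < 0) { perror("connect"); return 1; }

    const char *prompt = "Hello, llama!\n";           // placeholder request payload
    write(fd, prompt, strlen(prompt));

    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)      // stream the reply to stdout
        fwrite(buf, 1, (size_t) n, stdout);

    close(fd);
    return 0;
}
```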
For the sake of the POC I ran the process for every prompt.
I disagree with this change. This is a large rearchitecting of the project that fundamentally changes its vision. No one will run your daemon. What value does having a TCP server mode offer, aside from fixing loading time? The issue of loading time is solved by #91 which we've implemented in the mmap branch: https://github.com/ggerganov/llama.cpp/tree/mmap It will be much easier to support win32 using mmap than it would be to support winsock.
Why the need to spawn multiple processes instead of multiple threads? Threads are already built in, unlike sockets or mmap. The only truly portable thing is standard I/O, which can be redirected and easily communicated with using simple file streams, which are supported by everything. Instead of changing the main implementation much at all, you could just build any modules outside the main implementation and communicate using these simple file streams. The main llama input would not need any sort of 'protocol'; it would just listen for '\n' or EOF like it currently does, and the modules could follow that paradigm while communicating through the file streams. Am I missing something here?

Again, if the case is that more processes are what is wanted, along with an ability to share state between them, a more general approach would be a C-style API with something simple like struct state{...}, save_state(*state), load_state(*state). Then any implementation could live as a separate module and use those general functions to manipulate the state however they wish, and this would keep the main program clean of any non-portable code.
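A minimal sketch of that suggested C-style API, assuming hypothetical state fields and file-based persistence (the real model/context state in llama.cpp is of course more involved than this):

```cpp
#include <cstdio>
#include <cstdlib>

struct state {
    int    n_past;        // hypothetical: number of tokens already evaluated
    size_t kv_size;       // hypothetical: size of the KV cache blob
    void  *kv_data;       // hypothetical: opaque KV cache contents
};

// Serialize the state to a file so another process/module can pick it up.
bool save_state(const state *s, const char *path) {
    FILE *f = fopen(path, "wb");
    if (!f) return false;
    fwrite(&s->n_past,  sizeof(s->n_past),  1, f);
    fwrite(&s->kv_size, sizeof(s->kv_size), 1, f);
    fwrite(s->kv_data, 1, s->kv_size, f);
    fclose(f);
    return true;
}

// Restore a previously saved state.
bool load_state(state *s, const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return false;
    fread(&s->n_past,  sizeof(s->n_past),  1, f);
    fread(&s->kv_size, sizeof(s->kv_size), 1, f);
    s->kv_data = malloc(s->kv_size);
    fread(s->kv_data, 1, s->kv_size, f);
    fclose(f);
    return true;
}
```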
That is true. It is better to keep the scope focused and make sure llama.cpp is as stable as possible.
We can bundle llama with node.js in its current form. There is a library called caxa: https://github.com/leafac/caxa It bundles any Node.js app into a single executable. As a side effect of the way it is architected, it unzips any executables from the dist folder at runtime, so we can just place a compiled llama version in there and bundle it together. Then we can build a REST API in node that can be easily called from nextjs.

I also found a way that turns an openapi.yml file into the single source of truth for the nodejs routes. Here is the repo that combines caxa and this technique to make a single binary RPC daemon: https://github.com/spirobel/monerochan-merchant-rpc

If there is interest, I can make something like this for llama.cpp. If we make a one-click template for a provider like DigitalOcean, people could spin up their own on-demand instances that work just like the OpenAI API.
@jart @spirobel there's no rearchitecting here; in fact my goal was to introduce a server/client model with minimal changes. If you don't pass the

The way this was implemented was basically by moving all the code (except for model loading code) from the
There are more unexplored applications of this TCP server mode. Here are a few ideas:
@ggerganov also thinks this is a good idea: #267 (comment). I don't think this PR is mutually exclusive with the work you are doing; it is still useful to load the model faster on new process startups.
Responsiveness and proper concurrency are always good! 😀👍 Nothing is more frustrating than lagging or hung-up programs. That being said,
please take a look at the single binary RPC that I built! I am happy to answer any questions! I genuinely believe it is better to implement this in Node.js instead of doing it in C++. Monero also has a C++ REST RPC daemon and it is not fun to work with: it always hangs or becomes unresponsive for long stretches, and the documentation is always out of date.
@anzz1 I don't see how it could work using threads. There's only one instance of the model in memory, AFAIK.
@spirobel If you want to implement a server mode in another program/language such as node.js and without changes to llama.cpp, there are two ways I see you can go about it:
I agree that higher level abstractions are better done in platforms like node.js or python, but in this case I don't think it would be possible to implement a server purely in node.js and have the same efficiency as a fork/connection server approach. Now, here is the last paragraph of the PR description:
As you can see, my goal with this PR was to provide a base server/protocol that can be wrapped in a higher level API. In fact, I implemented the parameter-passing protocol in a way that allows reusing the existing

Technically I could have implemented a higher level abstraction such as a simple HTTP endpoint that parses JSON messages, but it would require me to add at least a JSON parsing library, which goes against the goals of the project ("no dependencies"). I also think we would lose some flexibility by making assumptions about the format of the API; it's better to have a lower level implementation that can be tailored for various purposes.

Since this server mode is meant to be wrapped in a higher level abstraction, it might be better to implement it using Unix sockets instead of TCP sockets, which I might do later after getting some feedback. This is still an experiment/POC.
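For reference, switching from TCP to Unix domain sockets mostly changes how the listening socket is created and bound; a minimal sketch, assuming an arbitrary socket path and minimal error handling (not the PR's code):

```cpp
// Create a listening Unix domain socket instead of a TCP socket.
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int listen_unix(const char *path) {
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return -1; }

    sockaddr_un addr{};
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

    unlink(path);  // remove a stale socket file from a previous run
    if (bind(fd, (sockaddr *)&addr, sizeof(addr)) < 0) { perror("bind"); close(fd); return -1; }
    if (listen(fd, 16) < 0)                            { perror("listen"); close(fd); return -1; }
    return fd;     // accept() then works the same as with a TCP socket
}
```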
This TCP mode is not meant to be used directly; in my previous comments I've hinted that I created this as a lower level protocol meant to be wrapped in a higher level solution, possibly written in Node.js or Python. Right now a loopback address is hardcoded, so if you want to use this TCP mode over a network, it is necessary to wrap it in a proxy such as
It looks like this PR refactors the current code base too much. That's not a big problem if the changes are
So, I recommend collecting more feedback before merging this PR into mainline, and doing some formal design.
If we could write these APIs (in C), it would be possible to build chat servers in almost any popular programming language, with protocols like HTTP, gRPC, or WebSocket. Before that, we could design and write C++ APIs on top of the current code base. FYI, best regards :)
If you consider replacing global references (stdin/stdout/stderr) with function parameters "too much refactoring", then yes. Really, review the commits individually; you will see the changes to existing code are easy and actually good even if the TCP server module is not merged. I had created a prior PR #267 with just these base changes, because I considered them worthwhile in isolation.
No one is in a rush to merge this. I split the steps into separate commits, and it is very easy for me to keep rebasing, which is what I will do.
I appreciate the suggestion, but this is outside of the scope of what I'm willing to do. I wanted to introduce networking capabilities with minimal changes to existing code or architecture. If someone wants to do these more elaborate changes, they are free to do so in a separate PR; I will happily close this one if there's a better implementation.
Redid the commits on top of the latest C API changes. Now that the C API is implemented in llama.cpp, I've moved the program's main loop to run.cpp. It seems the resulting additions/removals are smaller now.
Force-pushed from f877c93 to 0635dc0
The goal is to allow running "run" while connected to other streams, such as TCP sockets.

Signed-off-by: Thiago Padilha <thiago@padilha.cc>
This new mode works by first loading the model then listening for TCP connections on a port. When a connection is received, arguments will be parsed using a simple protocol:

- First the number of arguments will be read, followed by a newline character.
- Then each argument will be read, separated by the 0 byte.
- With this we build an argument vector, similar to what is passed to the program entry point. We pass this to gpt_params_parse.

Finally `run` will be executed with the input/output streams connected to the socket.

Signed-off-by: Thiago Padilha <thiago@padilha.cc>
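For illustration, here is a hedged sketch of decoding that framing on the server side; the helper name and the use of a stdio stream wrapped around the socket are assumptions, not the PR's actual code:

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Reads the argument vector from a stdio stream wrapped around the socket
// (e.g. obtained with fdopen(connection_fd, "r+")).
static std::vector<std::string> read_args(FILE *stream) {
    std::vector<std::string> args;

    // 1. The number of arguments, terminated by a newline.
    char line[32];
    if (!fgets(line, sizeof(line), stream)) return args;
    int argc = std::atoi(line);

    // 2. Each argument, terminated by a 0 byte.
    for (int i = 0; i < argc; i++) {
        std::string arg;
        int c;
        while ((c = fgetc(stream)) != EOF && c != 0) {
            arg.push_back((char) c);
        }
        args.push_back(arg);
    }

    // The strings can then be turned into a char *argv[] and handed to the
    // existing command line parser, as the commit message describes.
    return args;
}
```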
Great job! Haven't had the chance to test it yet, but it seems well done.
I would like this to become a standalone example in the "./examples" folder.
The `main.cpp` example has to remain the way it is on `master`. Even if you have to duplicate the code from `main` in the `tcp_server` example - that is OK.
@ggerganov I'm not sure if I understand. Do you want me to copy all the code in main.cpp to tcp_server.cpp and have it become a standalone program?
I have some comments as a result of actually trying to wrap this with a node client last night:
I remediated 2, 5, and parts of 6 here. I did this by replacing the raw console output with a plaintext, line-based (IRC-like/SMTP-like) protocol: one message per line (with control characters escaped), with a message type keyword as the first word on each line. I implemented the keywords:
I didn't touch the input protocol used to start the model, or remediate all of the issues, because frankly my C++ isn't good enough 😂. I don't know how to manipulate the input stream like that -- I think it might need to be handled in a separate thread and then sent to the thread executing lambda_main, but that's beyond the time I have to invest in a side project, so I just did what I could.

Sample Output
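(The sample output itself is not reproduced above. As a hypothetical sketch only, with invented keyword names ("TOKEN", "DONE") and minimal control-character escaping, that kind of line framing could look like this:)

```cpp
#include <cstdio>
#include <string>

// Escape control characters so a newline inside generated text can't
// break the one-message-per-line framing.
static std::string escape_control(const std::string &text) {
    std::string out;
    for (unsigned char c : text) {
        if (c == '\n')      out += "\\n";
        else if (c == '\r') out += "\\r";
        else if (c == '\\') out += "\\\\";
        else if (c < 0x20) {
            char hex[8];
            std::snprintf(hex, sizeof(hex), "\\x%02x", (unsigned) c);
            out += hex;
        } else {
            out += (char) c;
        }
    }
    return out;
}

// Emit one protocol line: a keyword ("TOKEN", "DONE" are invented names
// for this sketch) followed by the escaped payload.
static void send_line(FILE *sock, const char *keyword, const std::string &payload) {
    std::fprintf(sock, "%s %s\n", keyword, escape_control(payload).c_str());
    std::fflush(sock);
}
```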
@ggerganov These are the changes I made to
Do you mean I should revert the changes listed above? These changes are mandatory for the implementation of the TCP server.
Yes, the
The `run` abstraction is necessary if we want to share the main loop with the TCP server; it is not practical for me to copy all the code in
@ggerganov I feel that we are losing a lot here - I, for one, would love to be able to use @tarruda's fork with the API. Is there a way to add the API to this project?
@tkafka I'm still maintaining these changes in my fork, and will keep rebasing for the foreseeable future (I might even set up a script to do this semi-automatically on a daily basis). Here's the code: https://github.com/tarruda/llama.cpp/tree/tcp_server I just rebased it.
Looks like the source files will be restructured sooner or later. #384 (comment)
@mqy since I started this PR, the files have been restructured multiple times. I will just keep updating the main example to support TCP until there's a better native solution that doesn't rely on copying and pasting code. I would have been fine with tcp_server being a separate program from main, as long as the main loop was in a reusable module (which is what I've done in "run.cpp"). Until there's a better option, I will just keep rebasing the main example.
Hi, I use your TCP fork and it's working very well for my use case. This is a very important feature that should be merged, IMO.
Happy to see this as a separate llama.cpp-based project. It's nice to keep llama.cpp as a useful base to build on top of. See gpt4all for example. 🚀
This builds on my other PR to implement a very simple TCP mode.
The new mode first loads the model, then listens for TCP connections on a port. When a connection is received, arguments will be parsed using a simple protocol:

- First the number of arguments will be read, followed by a newline character.
- Then each argument will be read, separated by the 0 byte.
- With this we build an argument vector, similar to what is passed to the program entry point. We pass this to `gpt_params_parse`.

Finally `llama_main` will be executed with the input/output streams connected to the socket.

I've included two sample bash scripts which can be used to test the new mode. This is how it works:

- `./chat_tcp_server.sh` in a terminal.
- `./chat_tcp_client.sh`. This will connect to the server and start a sample chat session.

One thing to note is that this mode is only implemented for Unixes. There are two reasons for that:

- `fork()`. The main advantage of using this mode is that it serves each connection in a separate process which inherits memory from the parent (so the model only has to be loaded once).

While the protocol is a bit "low level", it should be easy to write a higher level API on top of this, such as a node.js web server or next.js app.
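For readers unfamiliar with the fork-per-connection pattern, here is a hedged sketch of its general shape; the port, the loopback bind, and the trivial handler are placeholders, not the PR's actual code:

```cpp
#include <csignal>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
    // ... load the model once here, before accepting connections ...

    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (listen_fd < 0) { perror("socket"); return 1; }
    int yes = 1;
    setsockopt(listen_fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes));

    sockaddr_in addr{};
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK); // loopback only, as in the PR
    addr.sin_port        = htons(8080);            // placeholder port

    if (bind(listen_fd, (sockaddr *)&addr, sizeof(addr)) < 0) { perror("bind"); return 1; }
    if (listen(listen_fd, 16) < 0)                            { perror("listen"); return 1; }

    signal(SIGCHLD, SIG_IGN); // let the kernel reap finished children

    for (;;) {
        int conn_fd = accept(listen_fd, nullptr, nullptr);
        if (conn_fd < 0) continue;

        pid_t pid = fork();
        if (pid == 0) {                   // child: inherits the loaded model
            close(listen_fd);
            dup2(conn_fd, STDIN_FILENO);  // talk to the client through stdio
            dup2(conn_fd, STDOUT_FILENO);
            // ... parse the argument protocol and run the main loop here ...
            const char *msg = "hello from a forked worker\n";
            write(STDOUT_FILENO, msg, strlen(msg));
            close(conn_fd);
            _exit(0);
        }
        close(conn_fd);                   // parent: keep accepting
    }
}
```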