Proof of concept TCP server mode #278
Conversation
Shall we close #267 now that we have this?
Force-pushed from 8f1eeca to 4a79a33
Rebased
* fix coloring of last `n_batch` of prompt, and refactor line input
* forgot the newline that needs to be sent to the model
* (per #283) try to force flush of color reset in SIGINT handler
If anyone is looking for a working client/server implementation,
Force-pushed from 1075f48 to 5e65c52
@avilum One problem with your implementation is that the service spawns a new llama.cpp instance for every HTTP request, so it can take a long time to respond (the model has to be loaded every time). I suggest you try a different approach using this branch: start llama.cpp once, and on every HTTP request create a TCP connection to the singleton instance. This gives you a clean environment for every request without the overhead of reloading the model.
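A rough illustration of that connect-per-request pattern (the port and the direct prompt write are placeholders; the actual branch expects the argument protocol described later in this thread before any prompt data):

```cpp
// Minimal connect-per-request client sketch (POSIX sockets).
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(8080);                    // placeholder port
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);  // loopback, as hardcoded in the branch

    if (connect(fd, (sockaddr *)&addr, sizeof(addr)) < 0) { perror("connect"); return 1; }

    const char *prompt = "Hello, llama!\n";           // placeholder request payload
    write(fd, prompt, strlen(prompt));

    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)      // stream the reply to stdout
        fwrite(buf, 1, (size_t) n, stdout);

    close(fd);
    return 0;
}
```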
For the sake of the POC I ran the process for every prompt.
I disagree with this change. This is a large rearchitecting of the project that fundamentally changes its vision. No one will run your daemon. What value does having a TCP server mode offer, aside from fixing loading time? The issue of loading time is solved by #91 which we've implemented in the mmap branch: https://github.com/ggerganov/llama.cpp/tree/mmap It will be much easier to support win32 using mmap than it would be to support winsock.
Why the need to spawn multiple processes instead of multiple threads? Threads are already built in, unlike sockets or mmap. The only truly portable thing is standard I/O, which can be redirected and easily communicated with using simple file streams, which are supported by everything. Instead of changing the main implementation much at all, you could just build any modules outside the main implementation and communicate using these simple file streams. The main llama input would not need any sort of 'protocol'; it would just listen for '\n' or EOF like it currently does, and the modules could follow that paradigm while communicating through the file streams. Am I missing something here?

Again, if the case is that more processes are what is wanted, along with an ability to share state between them, a more general approach would be a C-style API with something simple like struct state{...}, save_state(*state), load_state(*state). Then any implementation could live as a separate module and use those general functions to manipulate the state however they wish, and this would keep the main program clean of any non-portable code.
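A minimal sketch of that suggested C-style API, assuming hypothetical state fields and file-based persistence (the real model/context state in llama.cpp is of course more involved than this):

```cpp
#include <cstdio>
#include <cstdlib>

struct state {
    int    n_past;        // hypothetical: number of tokens already evaluated
    size_t kv_size;       // hypothetical: size of the KV cache blob
    void  *kv_data;       // hypothetical: opaque KV cache contents
};

// Serialize the state to a file so another process/module can pick it up.
bool save_state(const state *s, const char *path) {
    FILE *f = fopen(path, "wb");
    if (!f) return false;
    fwrite(&s->n_past,  sizeof(s->n_past),  1, f);
    fwrite(&s->kv_size, sizeof(s->kv_size), 1, f);
    fwrite(s->kv_data, 1, s->kv_size, f);
    fclose(f);
    return true;
}

// Restore a previously saved state.
bool load_state(state *s, const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return false;
    fread(&s->n_past,  sizeof(s->n_past),  1, f);
    fread(&s->kv_size, sizeof(s->kv_size), 1, f);
    s->kv_data = malloc(s->kv_size);
    fread(s->kv_data, 1, s->kv_size, f);
    fclose(f);
    return true;
}
```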
That is true. It is better to keep the scope focused and make sure llama.cpp is as stable as possible.
We can bundle llama with node.js in its current form. There is a library called caxa: https://github.com/leafac/caxa It bundles any Node.js app into a single executable. As a side effect of the way it is architected, it unzips any executables from the dist folder at runtime, so we can just place a compiled llama version in there and bundle it together. Then we can build a REST API in node that can be easily called from nextjs.

I also found a way that turns an openapi.yml file into the single source of truth for the nodejs routes. Here is the repo that combines caxa and this technique to make a single binary RPC daemon: https://github.com/spirobel/monerochan-merchant-rpc

If there is interest, I can make something like this for llama.cpp. If we make a one-click template for a provider like DigitalOcean, people could spin up their own on-demand instances that work just like the OpenAI API.
@jart @spirobel there's no rearchitecting here; in fact my goal was to introduce a server/client model with minimal changes. If you don't pass the

The way this was implemented was basically by moving all the code (except for model loading code) from the
There are more unexplored applications of this TCP server mode. Here are a few ideas:
@ggerganov also thinks this is a good idea: #267 (comment). I don't think this PR is mutually exclusive with the work you are doing; it is still useful to load the model faster on new process startups.
Responsiveness and proper concurrency are always good! 😀👍 Nothing is more frustrating than lagging or hung-up programs. That being said,
please take a look at the single binary RPC that I built! I am happy to answer any questions! I genuinely believe it is better to implement this in Node.js instead of doing it in C++. Monero also has a C++ REST RPC daemon and it is not fun to work with: it always hangs or becomes unresponsive for long stretches, and the documentation is always out of date.
@anzz1 I don't see how it could work using threads. There's only one instance of the model in memory, AFAIK.
@spirobel If you want to implement a server mode in another program/language such as node.js and without changes to llama.cpp, there are two ways I see you can go about it:
I agree that higher level abstractions are better done in platforms like node.js or python, but in this case I don't think it would be possible to implement a server purely in node.js and have the same efficiency as a fork/connection server approach. Now, here is the last paragraph of the PR description:
As you can see, my goal with this PR was to provide a base server/protocol that can be wrapped in a higher level API. In fact, I implemented the parameter-passing protocol in a way that allows reusing the existing

Technically I could have implemented a higher level abstraction such as a simple HTTP endpoint that parses JSON messages, but it would require me to add at least a JSON parsing library, which goes against the goals of the project ("no dependencies"). I also think we would lose some flexibility by making assumptions about the format of the API; it's better to have a lower level implementation that can be tailored for various purposes.

Since this server mode is meant to be wrapped in a higher level abstraction, it might be better to implement it using Unix sockets instead of TCP sockets, which I might do later after getting some feedback. This is still an experiment/POC.
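For reference, switching from TCP to Unix domain sockets mostly changes how the listening socket is created and bound; a minimal sketch, assuming an arbitrary socket path and minimal error handling (not the PR's code):

```cpp
// Create a listening Unix domain socket instead of a TCP socket.
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int listen_unix(const char *path) {
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return -1; }

    sockaddr_un addr{};
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

    unlink(path);  // remove a stale socket file from a previous run
    if (bind(fd, (sockaddr *)&addr, sizeof(addr)) < 0) { perror("bind"); close(fd); return -1; }
    if (listen(fd, 16) < 0)                            { perror("listen"); close(fd); return -1; }
    return fd;     // accept() then works the same as with a TCP socket
}
```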
This TCP mode is not meant to be used directly; in my previous comments I've hinted that I created this as a lower level protocol meant to be wrapped in a higher level solution, possibly written in Node.js or Python. Right now a loopback address is hardcoded, so if you want to use this TCP mode over a network, it is necessary to wrap it in a proxy such as
It looks like this PR refactors the current code base too much. That's not a big problem if the changes are
So, I recommend collecting more feedback before merging this PR into mainline, and doing some formal design.
If we could write these APIs (in C), it would be possible to build chat servers in almost any popular programming language, with protocols like HTTP, gRPC, or WebSocket. Before that, we could design and write C++ APIs on top of the current code base. FYI, best regards :)
If you consider replacing global references (stdin/stdout/stderr) with function parameters "too much refactoring", then yes. Really, review the commits individually; you will see the changes to existing code are easy and actually good even if the TCP server module is not merged. I had created a prior PR #267 with just these base changes, because I considered them worthwhile in isolation.
No one is in a rush to merge this. I split the steps into separate commits, and it is very easy for me to keep rebasing, which is what I will do.
I appreciate the suggestion, but this is outside of the scope of what I'm willing to do. I wanted to introduce networking capabilities with minimal changes to existing code or architecture. If someone wants to do these more elaborate changes, they are free to do so in a separate PR; I will happily close this one if there's a better implementation.
Redid the commits on top of the latest C API changes. Now that the C API is implemented in llama.cpp, I've moved the program's main loop to run.cpp. It seems the resulting additions/removals are smaller now.
Force-pushed from f877c93 to 0635dc0
The goal is to allow running "run" while connected to other streams, such as TCP sockets.

Signed-off-by: Thiago Padilha <thiago@padilha.cc>
This new mode works by first loading the model then listening for TCP connections on a port. When a connection is received, arguments will be parsed using a simple protocol:

- First the number of arguments will be read, followed by a newline character.
- Then each argument will be read, separated by the 0 byte.
- With this we build an argument vector, similar to what is passed to the program entry point. We pass this to gpt_params_parse.

Finally `run` will be executed with the input/output streams connected to the socket.

Signed-off-by: Thiago Padilha <thiago@padilha.cc>
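For illustration, here is a hedged sketch of decoding that framing on the server side; the helper name and the use of a stdio stream wrapped around the socket are assumptions, not the PR's actual code:

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Reads the argument vector from a stdio stream wrapped around the socket
// (e.g. obtained with fdopen(connection_fd, "r+")).
static std::vector<std::string> read_args(FILE *stream) {
    std::vector<std::string> args;

    // 1. The number of arguments, terminated by a newline.
    char line[32];
    if (!fgets(line, sizeof(line), stream)) return args;
    int argc = std::atoi(line);

    // 2. Each argument, terminated by a 0 byte.
    for (int i = 0; i < argc; i++) {
        std::string arg;
        int c;
        while ((c = fgetc(stream)) != EOF && c != 0) {
            arg.push_back((char) c);
        }
        args.push_back(arg);
    }

    // The strings can then be turned into a char *argv[] and handed to the
    // existing command line parser, as the commit message describes.
    return args;
}
```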
Great job! Haven't had the chance to test it yet, but it seems well done.
I would like this to become a standalone example in the "./examples" folder.
The `main.cpp` example has to remain the way it is on `master`. Even if you have to duplicate the code from `main` in the `tcp_server` example - that is OK.
@ggerganov I'm not sure if I understand. Do you want me to copy all the code in main.cpp to tcp_server.cpp and have it become a standalone program?
I have some comments as a result of actually trying to wrap this with a node client last night:
I remediated 2, 5, and parts of 6 here. I did this by replacing the raw console output with a plaintext, line-based (IRC-like/SMTP-like) protocol: one message per line (with control characters escaped), with a message type keyword as the first word on each line. I implemented the keywords:
I didn't touch the input protocol used to start the model, or remediate all of the issues, because frankly my C++ isn't good enough 😂. I don't know how to manipulate the input stream like that -- I think it might need to be handled in a separate thread and then sent to the thread executing lambda_main, but that's beyond the time I have to invest in a side project, so I just did what I could.

Sample Output
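(The sample output itself is not reproduced above. As a hypothetical sketch only, with invented keyword names ("TOKEN", "DONE") and minimal control-character escaping, that kind of line framing could look like this:)

```cpp
#include <cstdio>
#include <string>

// Escape control characters so a newline inside generated text can't
// break the one-message-per-line framing.
static std::string escape_control(const std::string &text) {
    std::string out;
    for (unsigned char c : text) {
        if (c == '\n')      out += "\\n";
        else if (c == '\r') out += "\\r";
        else if (c == '\\') out += "\\\\";
        else if (c < 0x20) {
            char hex[8];
            std::snprintf(hex, sizeof(hex), "\\x%02x", (unsigned) c);
            out += hex;
        } else {
            out += (char) c;
        }
    }
    return out;
}

// Emit one protocol line: a keyword ("TOKEN", "DONE" are invented names
// for this sketch) followed by the escaped payload.
static void send_line(FILE *sock, const char *keyword, const std::string &payload) {
    std::fprintf(sock, "%s %s\n", keyword, escape_control(payload).c_str());
    std::fflush(sock);
}
```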
@ggerganov These are the changes I made to
Do you mean I should revert the changes listed above? These changes are mandatory for the implementation of the TCP server.
Yes, the
The `run` abstraction is necessary if we want to share the main loop with the TCP server; it is not practical for me to copy all the code in
@ggerganov I feel that we are losing a lot here - I, for one, would love to be able to use @tarruda's fork with the API. Is there a way to add the API to this project?
@tkafka I'm still maintaining these changes in my fork, and will keep rebasing for the foreseeable future (I might even set up a script to do this semi-automatically on a daily basis). Here's the code: https://github.com/tarruda/llama.cpp/tree/tcp_server I just rebased it.
Looks like the source files will be restructured sooner or later. #384 (comment)
@mqy since I started this PR, the files have been restructured multiple times. I will just keep updating the main example to support TCP until there's a better native solution that doesn't rely on copying and pasting code. I would have been fine with tcp_server being a separate program from main, as long as the main loop was in a reusable module (which is what I've done in "run.cpp"). Until there's a better option, I will just keep rebasing the main example.
Hi, I use your TCP fork and it's working very well for my use case. This is a very important feature that should be merged, IMO.
Happy to see this as a separate llama.cpp-based project. It's nice to keep llama.cpp as a useful base to build on top of. See gpt4all for example. 🚀
This builds on my other PR to implement a very simple TCP mode.
The new mode first loads the model, then listens for TCP connections on a port. When a connection is received, arguments will be parsed using a simple protocol:

- First the number of arguments will be read, followed by a newline character.
- Then each argument will be read, separated by the 0 byte.
- With this we build an argument vector, similar to what is passed to the program entry point. We pass this to `gpt_params_parse`.

Finally `llama_main` will be executed with the input/output streams connected to the socket.

I've included two sample bash scripts which can be used to test the new mode. This is how it works:

- `./chat_tcp_server.sh` in a terminal.
- `./chat_tcp_client.sh`. This will connect to the server and start a sample chat session.

One thing to note is that this mode is only implemented for Unixes. There are two reasons for that:

- `fork()`. The main advantage of using this mode is that it serves each connection in a separate process which inherits memory from the parent (so the model only has to be loaded once).

While the protocol is a bit "low level", it should be easy to write a higher level API on top of this, such as a node.js web server or next.js app.
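For readers unfamiliar with the fork-per-connection pattern, here is a hedged sketch of its general shape; the port, the loopback bind, and the trivial handler are placeholders, not the PR's actual code:

```cpp
#include <csignal>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
    // ... load the model once here, before accepting connections ...

    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (listen_fd < 0) { perror("socket"); return 1; }
    int yes = 1;
    setsockopt(listen_fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes));

    sockaddr_in addr{};
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK); // loopback only, as in the PR
    addr.sin_port        = htons(8080);            // placeholder port

    if (bind(listen_fd, (sockaddr *)&addr, sizeof(addr)) < 0) { perror("bind"); return 1; }
    if (listen(listen_fd, 16) < 0)                            { perror("listen"); return 1; }

    signal(SIGCHLD, SIG_IGN); // let the kernel reap finished children

    for (;;) {
        int conn_fd = accept(listen_fd, nullptr, nullptr);
        if (conn_fd < 0) continue;

        pid_t pid = fork();
        if (pid == 0) {                   // child: inherits the loaded model
            close(listen_fd);
            dup2(conn_fd, STDIN_FILENO);  // talk to the client through stdio
            dup2(conn_fd, STDOUT_FILENO);
            // ... parse the argument protocol and run the main loop here ...
            const char *msg = "hello from a forked worker\n";
            write(STDOUT_FILENO, msg, strlen(msg));
            close(conn_fd);
            _exit(0);
        }
        close(conn_fd);                   // parent: keep accepting
    }
}
```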