Initial attempt at straightforward document processing script. #731
In general, I prefer the local_mode (ctx = sycamore.init(exec_mode=ExecMode.LOCAL)) approach for three reasons:
- If you need to scale up, it's easy to switch it over to ray mode
- We can in the future add multiprocessing support to get more speed
- It preserves all of the metadata so the reliability work will be able to happen
That said, there is clearly a need for some rayless thing as people are starting to use local mode before it's really ready, and you ended up writing this example.
def iterInputs(inputs: list[str], aws_sess = None) -> Iterator[BinaryIO]:
You can replace all of this with
docs = BinaryScan(paths=inputs).local_source()
once https://github.com/aryn-ai/sycamore/pull/712/files is in.
Do we need to get this group together to make a call on the approach?
From: Eric Anderson ***@***.***>
Date: Tuesday, August 27, 2024 at 12:00 PM
To: aryn-ai/sycamore ***@***.***>
Cc: Jonathan Fritz ***@***.***>, Review requested ***@***.***>
Subject: Re: [aryn-ai/sycamore] Initial attempt at straightforward document processing script. (PR #731)
@eric-anderson commented on this pull request.
In examples/direct.py:
+import boto3
+import pyarrow.fs
+
+import aryn_sdk.partition
+import sycamore
+from sycamore.transforms.embed import SentenceTransformerEmbedder
+from sycamore.transforms.sketcher import Sketcher
+from sycamore.connectors.duckdb.duckdb_writer import (
+ DuckDBWriterClientParams,
+ DuckDBWriterTargetParams,
+ DuckDBWriter,
+)
+
+###############################################################################
+
+def iterInputs(inputs: list[str], aws_sess = None) -> Iterator[BinaryIO]:
This is a proposal/example, not meant to be checked in.
The basic idea here is not only to bypass Ray, but also to avoid the lazily evaluated pipeline abstraction. Instead, it's coded the way a typical programmer would expect to write the code. This approach is synchronous rather than functional and allows different documents to be treated differently on the fly.
Instead of DocSet, we deal with a list of Document. DocSet confuses people because it's not a set of documents.
One finding is that most existing transforms would be easier to use as simple functions. Then they could be the target of "map", either directly or via DocSet.
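To make the contrast with a lazy pipeline concrete, here is a minimal sketch of the synchronous, list-based style. It is not the code from this PR: `Doc`, `summarize`, `flag_short`, and `process` are hypothetical stand-ins invented for illustration, and `Doc` is only a toy substitute for Sycamore's Document.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a Sycamore Document (illustration only)."""
    path: str
    text: str
    properties: dict = field(default_factory=dict)


def summarize(doc: Doc) -> Doc:
    # Placeholder processing step: record the word count on the document.
    doc.properties["words"] = len(doc.text.split())
    return doc


def flag_short(doc: Doc) -> Doc:
    # Placeholder step that is only applied to some documents.
    doc.properties["short"] = True
    return doc


def process(docs: list[Doc]) -> list[Doc]:
    out = []
    for doc in docs:  # each step runs immediately, one document at a time
        doc = summarize(doc)
        # Ordinary control flow decides, per document, what happens next.
        if doc.properties["words"] < 3:
            doc = flag_short(doc)
        out.append(doc)
    return out


docs = process([Doc("a.txt", "one two"), Doc("b.txt", "one two three four")])
```

Because every step has already run when `process` returns, intermediate state can be inspected with a debugger or a print statement at any point in the loop.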
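As a rough illustration of the "simple function" idea, the sketch below writes a transform as a plain function and uses it both directly and as the target of `map`. Everything here is assumed for the example: `add_length` is a made-up transform, documents are plain dicts, and `TinyDocSet` is a toy wrapper, not Sycamore's DocSet.

```python
from typing import Callable


def add_length(doc: dict) -> dict:
    """A transform as a plain function: takes one document, returns one."""
    return {**doc, "length": len(doc["text"])}


class TinyDocSet:
    """Toy lazy wrapper (hypothetical, not Sycamore's DocSet)."""

    def __init__(self, docs: list[dict], steps: tuple = ()):
        self._docs = docs
        self._steps = steps

    def map(self, fn: Callable[[dict], dict]) -> "TinyDocSet":
        # Record the function; nothing runs until take_all() is called.
        return TinyDocSet(self._docs, self._steps + (fn,))

    def take_all(self) -> list[dict]:
        out = []
        for doc in self._docs:
            for step in self._steps:
                doc = step(doc)
            out.append(doc)
        return out


docs = [{"text": "alpha"}, {"text": "be"}]

single = add_length(docs[0])                          # called directly
mapped = list(map(add_length, docs))                  # target of the built-in map
lazy = TinyDocSet(docs).map(add_length).take_all()    # same function via a wrapper
```

The same function serves all three call sites, which is the point: the transform itself carries no knowledge of how it is scheduled.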
This code represents the exercise of simplifying without modifying Sycamore. The function iterInputs would be intended as an addition to the Sycamore library. There are FIXME comments for how Sycamore could become easier to use directly.
The remaining piece would be a way to encapsulate common processing sequences into higher-level single calls. We could do this generally, or just provide some off-the-shelf. This may turn into an exercise in naming.
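One possible shape for the "higher-level single call" idea is a small helper that bundles per-document steps into one callable. This is only a sketch under assumptions: `sequence`, `strip_text`, and `count_words` are hypothetical names, documents are plain dicts, and none of this is an existing Sycamore API.

```python
from functools import reduce
from typing import Callable

Step = Callable[[dict], dict]


def sequence(*steps: Step) -> Step:
    """Bundle several per-document steps into one callable, applied left to right."""
    def run(doc: dict) -> dict:
        return reduce(lambda current, step: step(current), steps, doc)
    return run


def strip_text(doc: dict) -> dict:
    # Hypothetical step: trim surrounding whitespace from the text.
    return {**doc, "text": doc["text"].strip()}


def count_words(doc: dict) -> dict:
    # Hypothetical step: record the number of whitespace-separated words.
    return {**doc, "words": len(doc["text"].split())}


# A named, off-the-shelf sequence that callers invoke as a single function.
clean_and_count = sequence(strip_text, count_words)

result = clean_and_count({"text": "  hello world  "})
```

A general-purpose combinator like this covers the "do this generally" option; shipping a few prebuilt sequences such as `clean_and_count` would be the off-the-shelf option, and that is where the naming question comes in.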