-
Notifications
You must be signed in to change notification settings - Fork 172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: map_data support for Nomic Embed Vision #308
Conversation
@ellipsis-dev review this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me! Reviewed everything up to ac9a55c in 1 minute and 52 seconds
More details
- Looked at
291
lines of code in4
files - Skipped
0
files when reviewing. - Skipped posting
3
drafted comments based on config settings.
1. nomic/atlas.py:61
- Draft comment:
Themap_data
function now handles image blobs by setting the modality to 'image' and defaulting the embedding model to 'nomic-embed-vision-v1.5' if not specified. This is a significant change as it integrates image handling directly into the existing data mapping function, which previously focused on embeddings and text data. Ensure that the default embedding model and the handling of image data are thoroughly tested, especially since this modality change affects how data is processed and indexed. - Reason this comment was not posted:
Confidence of 0% on close inspection, compared to threshold of 50%.
2. nomic/atlas.py:63
- Draft comment:
The introduction of '_blob_hash' as an indexed field for images is a critical change. Ensure that this new field is properly managed across the system, especially in functions that handle data indexing and retrieval. This change could impact how data is stored, retrieved, and indexed in the system. - Reason this comment was not posted:
Confidence of 0% on close inspection, compared to threshold of 50%.
3. nomic/atlas.py:133
- Draft comment:
The handling of topic models for image data should be carefully reviewed to ensure that the 'community_description_target_field' is appropriately used to generate meaningful topic labels from image data. This is a new application of the topic modeling feature, and its effectiveness and accuracy in the context of image data should be validated. - Reason this comment was not posted:
Confidence of 0% on close inspection, compared to threshold of 50%.
Workflow ID: wflow_lx0rEhsESDyvCAax
You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
⌛ 6 days left in your free trial, upgrade for $20/seat/month or contact us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ Changes requested. Reviewed everything up to e707f94 in 1 minute and 9 seconds
More details
- Looked at
312
lines of code in4
files - Skipped
0
files when reviewing. - Skipped posting
0
drafted comments based on config settings.
Workflow ID: wflow_mBvzWO0puHyHuefa
Want Ellipsis to fix these issues? Tag @ellipsis-dev
in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
⌛ 6 days left in your free trial, upgrade for $20/seat/month or contact us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me! Incremental review on 53bd566 in 3 minutes and 51 seconds
More details
- Looked at
22
lines of code in1
files - Skipped
0
files when reviewing. - Skipped posting
1
drafted comments based on config settings.
1. nomic/dataset.py:1494
- Draft comment:
It's important to handle exceptions within the ThreadPoolExecutor in_add_blobs
to ensure that all threads are properly managed and potential errors are gracefully handled. Consider adding a try-except block around thefuture.result()
call to catch and log exceptions, ensuring that all resources are properly cleaned up.
try:
response = future.result()
except Exception as e:
logger.error(f"Failed to upload batch: {e}")
continue
- Reason this comment was not posted:
Confidence of 20% on close inspection, compared to threshold of 50%.
Workflow ID: wflow_bwUvy5f5hGNJK5oV
You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
⌛ 6 days left in your free trial, upgrade for $20/seat/month or contact us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me! Incremental review on ca3746f in 1 minute and 51 seconds
More details
- Looked at
12
lines of code in1
files - Skipped
0
files when reviewing. - Skipped posting
1
drafted comments based on config settings.
1. examples/image/map_cifar10.py:34
- Draft comment:
It seems themap_data
function call is missing themodality
parameter which should be set to 'image' for correct handling of image data. Please add this parameter to ensure the data is processed correctly.
atlas.map_data(blobs=images,
identifier='cifar-50k-image-upload-with-topic',
data=datums,
id_field='id',
modality='image', # Specify the modality as image
topic_model={
"build_topic_model": True,
"community_description_target_field": "label"
},
)
- Reason this comment was not posted:
Based on historical feedback, this comment is too similar to comments previously marked by users as bad.
Workflow ID: wflow_9Td2T99khZS1gesN
You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
⌛ 3 days left in your free trial, upgrade for $20/seat/month or contact us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me! Incremental review on fa031db in 1 minute and 47 seconds
More details
- Looked at
14
lines of code in1
files - Skipped
0
files when reviewing. - Skipped posting
1
drafted comments based on config settings.
1. nomic/dataset.py:1376
- Draft comment:
The conversion of IDs to strings is a crucial step for ensuring consistency in data handling. However, it's important to ensure that this conversion does not lead to any unintended consequences elsewhere in the system where IDs are expected in their original format. If there are specific requirements or constraints on the ID format in other parts of the system, consider documenting these requirements or implementing checks to ensure compatibility. - Reason this comment was not posted:
Confidence changes required:33%
The code snippet provided fromnomic/dataset.py
includes a modification to the_add_blobs
method where a check is added to ensure that theids
are strings. If they are not, they are converted to strings. This is a necessary step because UUIDs or other types of IDs might not be strings by default, and the system might expect string identifiers for consistency in handling data.
The conversion using a list comprehension is straightforward and should work as expected. However, it's important to ensure that this conversion does not lead to any unintended consequences elsewhere in the system where IDs are expected in their original format.
Workflow ID: wflow_HuQX1xVBe90Ah06e
You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
⌛ 2 days left in your free trial, upgrade for $20/seat/month or contact us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me! Incremental review on 2850c8a in 2 minutes and 29 seconds
More details
- Looked at
38
lines of code in2
files - Skipped
0
files when reviewing. - Skipped posting
1
drafted comments based on config settings.
1. nomic/dataset.py:1151
- Draft comment:
The PR description mentions updates to themap_data
function to handleblobs
and setmodality
toimage
, but these changes are not visible in the diff. Please ensure that the necessary changes are included in the PR. - Reason this comment was not posted:
Confidence of 0% on close inspection, compared to threshold of 70%.
Workflow ID: wflow_nrrJNSBZIfy3muzi
You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
⌛ 2 days left in your free trial, upgrade for $20/seat/month or contact us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me! Incremental review on ca1b60f in 1 minute and 36 seconds
More details
- Looked at
22
lines of code in1
files - Skipped
0
files when reviewing. - Skipped posting
1
drafted comments based on config settings.
1. nomic/data_inference.py:88
- Draft comment:
The PR description mentions updates to handle_blob_hash
inconvert_pyarrow_schema_for_atlas
, but these changes are not visible in the diff. Please ensure that the intended changes are correctly implemented and included in the PR. - Reason this comment was not posted:
Confidence of 0% on close inspection, compared to threshold of 70%.
Workflow ID: wflow_WF3S6McrLoTDG4q3
You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
⌛ 2 days left in your free trial, upgrade for $20/seat/month or contact us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me! Incremental review on da0411f in 2 minutes and 10 seconds
More details
- Looked at
39
lines of code in1
files - Skipped
0
files when reviewing. - Skipped posting
1
drafted comments based on config settings.
1. nomic/dataset.py:1430
- Draft comment:
The schema for the hash table is hardcoded with the field names as 'id' and '_blob_hash'. It's better to use the dynamic field name from 'self.id_field' to maintain consistency and avoid potential errors if the id field is named differently.
hash_schema = pa.schema([(self.id_field, pa.string()), ("_blob_hash", pa.string())])
- Reason this comment was not posted:
Confidence of 0% on close inspection, compared to threshold of 70%.
Workflow ID: wflow_89sI8rQv8n81o3n8
You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
⌛ 9 days left in your free trial, upgrade for $20/seat/month or contact us.
* fix topic label field alias bug * version bump
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me! Incremental review on 3b0715d in 1 minute and 9 seconds
More details
- Looked at
27
lines of code in2
files - Skipped
0
files when reviewing. - Skipped posting
1
drafted comments based on config settings.
1. nomic/data_inference.py:91
- Draft comment:
The removal ofvalidation_alias
fromtopic_label_field
inNomicTopicOptions
is not explained in the PR description. Ensure this change does not affect other parts of the application wheretopic_label_field
is used or expected to be validated in a specific way. - Reason this comment was not posted:
Confidence changes required:66%
The PR description mentions that thetopic_label_field
inNomicTopicOptions
class has been updated to remove thevalidation_alias
parameter. This change is reflected in the code. However, the PR does not provide a rationale for this change. It's important to understand why this change was made, as it could affect how the field is used or validated elsewhere in the codebase. Since the PR description does not mention this, it's crucial to ensure that this change does not unintentionally break any functionality or expectations around thetopic_label_field
.
Workflow ID: wflow_STv01skh1QajWGhs
You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
⌛ 8 days left in your free trial, upgrade for $20/seat/month or contact us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me! Incremental review on 9978a8a in 47 seconds
More details
- Looked at
13
lines of code in1
files - Skipped
0
files when reviewing. - Skipped posting
1
drafted comments based on config settings.
1. setup.py:11
- Draft comment:
The version bump to3.0.35
is correctly implemented. - Reason this comment was not posted:
Confidence changes required:0%
The version bump in the setup.py file is correctly updated from 3.0.34 to 3.0.35 as per the PR description. This is a standard practice to increment the version number for new features or significant updates to ensure proper package management and dependency resolution.
Workflow ID: wflow_1CnRjrOO6Hq3GcV2
You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
⌛ 8 days left in your free trial, upgrade for $20/seat/month or contact us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ Changes requested. Incremental review on 7501db3 in 1 minute and 36 seconds
More details
- Looked at
18
lines of code in1
files - Skipped
0
files when reviewing. - Skipped posting
0
drafted comments based on config settings.
Workflow ID: wflow_hCRhMiUK2b3PDb5h
Want Ellipsis to fix these issues? Tag @ellipsis-dev
in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
⌛ 8 days left in your free trial, upgrade for $20/seat/month or contact us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me! Incremental review on 3e9e0f2 in 3 minutes and 6 seconds
More details
- Looked at
17
lines of code in1
files - Skipped
0
files when reviewing. - Skipped posting
0
drafted comments based on config settings.
Workflow ID: wflow_Xlqsc6EaPTFt9mEG
You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
⌛ 8 days left in your free trial, upgrade for $20/seat/month or contact us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me! Incremental review on 5a134db in 2 minutes and 33 seconds
More details
- Looked at
24
lines of code in1
files - Skipped
0
files when reviewing. - Skipped posting
1
drafted comments based on config settings.
1. nomic/dataset.py:1374
- Draft comment:
Consider adding more robust error handling in thesend_request
function within_add_blobs
. Currently, it only checks for a status code of 200 and raises an exception otherwise. It would be beneficial to handle different types of HTTP errors and possibly implement a retry mechanism for recoverable errors. - Reason this comment was not posted:
Confidence of 0% on close inspection, compared to threshold of 70%.
Workflow ID: wflow_q3uzC1aU5cORAWmw
You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
⌛ 8 days left in your free trial, upgrade for $20/seat/month or contact us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me! Incremental review on 016b3e1 in 1 minute and 54 seconds
More details
- Looked at
25
lines of code in1
files - Skipped
0
files when reviewing. - Skipped posting
1
drafted comments based on config settings.
1. examples/image/map_cifar10.py:26
- Draft comment:
Consider using slicing on the dataset to limit the number of iterations instead of using a break statement inside the loop. This would improve code readability and efficiency. - Reason this comment was not posted:
Confidence changes required:66%
The code snippet provided fromexamples/image/map_cifar10.py
shows the use of themap_data
function from theatlas
module, which is being updated in this PR to handle image data. The loop iterates over the dataset, appending images and metadata to lists, which are then passed tomap_data
. Themap_data
function is called with theblobs
parameter to handle image uploads, and a topic model is specified with a label field.
The code is straightforward and seems to align with the PR's intent to enhance image data handling. However, the loop control using if idx >= max_embeddings: break
is a bit unusual. It's more common and clearer to use slicing on the dataset directly to limit the number of iterations, which would be more efficient and readable.
Workflow ID: wflow_8DSDdGMKtIQ9IQv7
You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
⌛ 8 days left in your free trial, upgrade for $20/seat/month or contact us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me! Incremental review on e286831 in 1 minute and 6 seconds
More details
- Looked at
13
lines of code in1
files - Skipped
0
files when reviewing. - Skipped posting
1
drafted comments based on config settings.
1. nomic/dataset.py:1407
- Draft comment:
Increasing the batch size from 20 to 40 in the_add_blobs
method could potentially impact performance and error handling. Ensure that the server-side implementation can handle these larger batches efficiently and consider adjusting the number of workers or improving error handling mechanisms to maintain robustness. - Reason this comment was not posted:
Confidence changes required:66%
The change in the PR increases the batch size from 20 to 40 in the_add_blobs
method. This change could potentially impact the performance of the system, especially if the server or network cannot handle larger batch sizes efficiently. It's important to ensure that the server-side implementation can handle these larger batches without issues such as timeouts or memory constraints. Additionally, increasing the batch size without adjusting the number of workers or handling potential failures in batch processing might lead to less robust error handling and increased risk of data loss or corruption.
Workflow ID: wflow_s6SNIOCT3LGl6NgJ
You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
⌛ 8 days left in your free trial, upgrade for $20/seat/month or contact us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ Changes requested. Incremental review on 2b00dc3 in 1 minute and 21 seconds
More details
- Looked at
42
lines of code in1
files - Skipped
0
files when reviewing. - Skipped posting
0
drafted comments based on config settings.
Workflow ID: wflow_zK8fFgHYyBc6MZqX
Want Ellipsis to fix these issues? Tag @ellipsis-dev
in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
Summary:
Enhanced Nomic Embed Vision with support for image data handling and updated relevant methods and files.
Key points:
nomic/dataset.py
to handle image uploads and setmodality
toimage
._add_blobs
method.create_index
method inAtlasDataset
to handletopic_label_field
for image modality.examples/image/map_cifar10.py
to usemap_data
withblobs
.map_data
function innomic/atlas.py
to acceptblobs
.convert_pyarrow_schema_for_atlas
innomic/data_inference.py
to handle_blob_hash
.setup.py
to3.0.36
.Generated with ❤️ by ellipsis.dev