Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: map_data support for Nomic Embed Vision #308

Merged
merged 29 commits into from
Jul 9, 2024
Merged

feat: map_data support for Nomic Embed Vision #308

merged 29 commits into from
Jul 9, 2024

Conversation

zanussbaum
Copy link
Collaborator

@zanussbaum zanussbaum commented Jun 13, 2024

🚀 This description was created by Ellipsis for commit c8db85e

Summary:

Enhanced Nomic Embed Vision with support for image data handling and updated relevant methods and files.

Key points:

  • Enhanced Nomic Embed Vision with support for image data handling.
  • Updated nomic/dataset.py to handle image uploads and set modality to image.
  • Refactored concurrent futures handling in _add_blobs method.
  • Updated create_index method in AtlasDataset to handle topic_label_field for image modality.
  • Updated examples/image/map_cifar10.py to use map_data with blobs.
  • Updated map_data function in nomic/atlas.py to accept blobs.
  • Updated convert_pyarrow_schema_for_atlas in nomic/data_inference.py to handle _blob_hash.
  • Bumped version in setup.py to 3.0.36.

Generated with ❤️ by ellipsis.dev

@AndriyMulyar
Copy link
Contributor

AndriyMulyar commented Jun 16, 2024

@ellipsis-dev review this

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Reviewed everything up to ac9a55c in 1 minute and 52 seconds

More details
  • Looked at 291 lines of code in 4 files
  • Skipped 0 files when reviewing.
  • Skipped posting 3 drafted comments based on config settings.
1. nomic/atlas.py:61
  • Draft comment:
    The map_data function now handles image blobs by setting the modality to 'image' and defaulting the embedding model to 'nomic-embed-vision-v1.5' if not specified. This is a significant change as it integrates image handling directly into the existing data mapping function, which previously focused on embeddings and text data. Ensure that the default embedding model and the handling of image data are thoroughly tested, especially since this modality change affects how data is processed and indexed.
  • Reason this comment was not posted:
    Confidence of 0% on close inspection, compared to threshold of 50%.
2. nomic/atlas.py:63
  • Draft comment:
    The introduction of '_blob_hash' as an indexed field for images is a critical change. Ensure that this new field is properly managed across the system, especially in functions that handle data indexing and retrieval. This change could impact how data is stored, retrieved, and indexed in the system.
  • Reason this comment was not posted:
    Confidence of 0% on close inspection, compared to threshold of 50%.
3. nomic/atlas.py:133
  • Draft comment:
    The handling of topic models for image data should be carefully reviewed to ensure that the 'community_description_target_field' is appropriately used to generate meaningful topic labels from image data. This is a new application of the topic modeling feature, and its effectiveness and accuracy in the context of image data should be validated.
  • Reason this comment was not posted:
    Confidence of 0% on close inspection, compared to threshold of 50%.

Workflow ID: wflow_lx0rEhsESDyvCAax


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

6 days left in your free trial, upgrade for $20/seat/month or contact us.

@zanussbaum zanussbaum marked this pull request as ready for review June 17, 2024 14:38
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Reviewed everything up to e707f94 in 1 minute and 9 seconds

More details
  • Looked at 312 lines of code in 4 files
  • Skipped 0 files when reviewing.
  • Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_mBvzWO0puHyHuefa


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

6 days left in your free trial, upgrade for $20/seat/month or contact us.

nomic/dataset.py Outdated Show resolved Hide resolved
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on 53bd566 in 3 minutes and 51 seconds

More details
  • Looked at 22 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comments based on config settings.
1. nomic/dataset.py:1494
  • Draft comment:
    It's important to handle exceptions within the ThreadPoolExecutor in _add_blobs to ensure that all threads are properly managed and potential errors are gracefully handled. Consider adding a try-except block around the future.result() call to catch and log exceptions, ensuring that all resources are properly cleaned up.
            try:
                response = future.result()
            except Exception as e:
                logger.error(f"Failed to upload batch: {e}")
                continue
  • Reason this comment was not posted:
    Confidence of 20% on close inspection, compared to threshold of 50%.

Workflow ID: wflow_bwUvy5f5hGNJK5oV


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

6 days left in your free trial, upgrade for $20/seat/month or contact us.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on ca3746f in 1 minute and 51 seconds

More details
  • Looked at 12 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comments based on config settings.
1. examples/image/map_cifar10.py:34
  • Draft comment:
    It seems the map_data function call is missing the modality parameter which should be set to 'image' for correct handling of image data. Please add this parameter to ensure the data is processed correctly.
atlas.map_data(blobs=images,
               identifier='cifar-50k-image-upload-with-topic',
               data=datums,
               id_field='id',
               modality='image',  # Specify the modality as image
               topic_model={
                     "build_topic_model": True,
                     "community_description_target_field": "label"
                },
)
  • Reason this comment was not posted:
    Based on historical feedback, this comment is too similar to comments previously marked by users as bad.

Workflow ID: wflow_9Td2T99khZS1gesN


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

3 days left in your free trial, upgrade for $20/seat/month or contact us.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on fa031db in 1 minute and 47 seconds

More details
  • Looked at 14 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comments based on config settings.
1. nomic/dataset.py:1376
  • Draft comment:
    The conversion of IDs to strings is a crucial step for ensuring consistency in data handling. However, it's important to ensure that this conversion does not lead to any unintended consequences elsewhere in the system where IDs are expected in their original format. If there are specific requirements or constraints on the ID format in other parts of the system, consider documenting these requirements or implementing checks to ensure compatibility.
  • Reason this comment was not posted:
    Confidence changes required: 33%
    The code snippet provided from nomic/dataset.py includes a modification to the _add_blobs method where a check is added to ensure that the ids are strings. If they are not, they are converted to strings. This is a necessary step because UUIDs or other types of IDs might not be strings by default, and the system might expect string identifiers for consistency in handling data.

The conversion using a list comprehension is straightforward and should work as expected. However, it's important to ensure that this conversion does not lead to any unintended consequences elsewhere in the system where IDs are expected in their original format.

Workflow ID: wflow_HuQX1xVBe90Ah06e


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

2 days left in your free trial, upgrade for $20/seat/month or contact us.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on 2850c8a in 2 minutes and 29 seconds

More details
  • Looked at 38 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comments based on config settings.
1. nomic/dataset.py:1151
  • Draft comment:
    The PR description mentions updates to the map_data function to handle blobs and set modality to image, but these changes are not visible in the diff. Please ensure that the necessary changes are included in the PR.
  • Reason this comment was not posted:
    Confidence of 0% on close inspection, compared to threshold of 70%.

Workflow ID: wflow_nrrJNSBZIfy3muzi


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

2 days left in your free trial, upgrade for $20/seat/month or contact us.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on ca1b60f in 1 minute and 36 seconds

More details
  • Looked at 22 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comments based on config settings.
1. nomic/data_inference.py:88
  • Draft comment:
    The PR description mentions updates to handle _blob_hash in convert_pyarrow_schema_for_atlas, but these changes are not visible in the diff. Please ensure that the intended changes are correctly implemented and included in the PR.
  • Reason this comment was not posted:
    Confidence of 0% on close inspection, compared to threshold of 70%.

Workflow ID: wflow_WF3S6McrLoTDG4q3


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

2 days left in your free trial, upgrade for $20/seat/month or contact us.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on da0411f in 2 minutes and 10 seconds

More details
  • Looked at 39 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comments based on config settings.
1. nomic/dataset.py:1430
  • Draft comment:
    The schema for the hash table is hardcoded with the field names as 'id' and '_blob_hash'. It's better to use the dynamic field name from 'self.id_field' to maintain consistency and avoid potential errors if the id field is named differently.
        hash_schema = pa.schema([(self.id_field, pa.string()), ("_blob_hash", pa.string())])
  • Reason this comment was not posted:
    Confidence of 0% on close inspection, compared to threshold of 70%.

Workflow ID: wflow_89sI8rQv8n81o3n8


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

9 days left in your free trial, upgrade for $20/seat/month or contact us.

* fix topic label field alias bug
* version bump
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on 3b0715d in 1 minute and 9 seconds

More details
  • Looked at 27 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comments based on config settings.
1. nomic/data_inference.py:91
  • Draft comment:
    The removal of validation_alias from topic_label_field in NomicTopicOptions is not explained in the PR description. Ensure this change does not affect other parts of the application where topic_label_field is used or expected to be validated in a specific way.
  • Reason this comment was not posted:
    Confidence changes required: 66%
    The PR description mentions that the topic_label_field in NomicTopicOptions class has been updated to remove the validation_alias parameter. This change is reflected in the code. However, the PR does not provide a rationale for this change. It's important to understand why this change was made, as it could affect how the field is used or validated elsewhere in the codebase. Since the PR description does not mention this, it's crucial to ensure that this change does not unintentionally break any functionality or expectations around the topic_label_field.

Workflow ID: wflow_STv01skh1QajWGhs


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

8 days left in your free trial, upgrade for $20/seat/month or contact us.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on 9978a8a in 47 seconds

More details
  • Looked at 13 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comments based on config settings.
1. setup.py:11
  • Draft comment:
    The version bump to 3.0.35 is correctly implemented.
  • Reason this comment was not posted:
    Confidence changes required: 0%
    The version bump in the setup.py file is correctly updated from 3.0.34 to 3.0.35 as per the PR description. This is a standard practice to increment the version number for new features or significant updates to ensure proper package management and dependency resolution.

Workflow ID: wflow_1CnRjrOO6Hq3GcV2


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

8 days left in your free trial, upgrade for $20/seat/month or contact us.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Incremental review on 7501db3 in 1 minute and 36 seconds

More details
  • Looked at 18 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_hCRhMiUK2b3PDb5h


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

8 days left in your free trial, upgrade for $20/seat/month or contact us.

nomic/atlas.py Outdated Show resolved Hide resolved
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on 3e9e0f2 in 3 minutes and 6 seconds

More details
  • Looked at 17 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_Xlqsc6EaPTFt9mEG


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

8 days left in your free trial, upgrade for $20/seat/month or contact us.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on 5a134db in 2 minutes and 33 seconds

More details
  • Looked at 24 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comments based on config settings.
1. nomic/dataset.py:1374
  • Draft comment:
    Consider adding more robust error handling in the send_request function within _add_blobs. Currently, it only checks for a status code of 200 and raises an exception otherwise. It would be beneficial to handle different types of HTTP errors and possibly implement a retry mechanism for recoverable errors.
  • Reason this comment was not posted:
    Confidence of 0% on close inspection, compared to threshold of 70%.

Workflow ID: wflow_q3uzC1aU5cORAWmw


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

8 days left in your free trial, upgrade for $20/seat/month or contact us.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on 016b3e1 in 1 minute and 54 seconds

More details
  • Looked at 25 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comments based on config settings.
1. examples/image/map_cifar10.py:26
  • Draft comment:
    Consider using slicing on the dataset to limit the number of iterations instead of using a break statement inside the loop. This would improve code readability and efficiency.
  • Reason this comment was not posted:
    Confidence changes required: 66%
    The code snippet provided from examples/image/map_cifar10.py shows the use of the map_data function from the atlas module, which is being updated in this PR to handle image data. The loop iterates over the dataset, appending images and metadata to lists, which are then passed to map_data. The map_data function is called with the blobs parameter to handle image uploads, and a topic model is specified with a label field.

The code is straightforward and seems to align with the PR's intent to enhance image data handling. However, the loop control using if idx >= max_embeddings: break is a bit unusual. It's more common and clearer to use slicing on the dataset directly to limit the number of iterations, which would be more efficient and readable.

Workflow ID: wflow_8DSDdGMKtIQ9IQv7


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

8 days left in your free trial, upgrade for $20/seat/month or contact us.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on e286831 in 1 minute and 6 seconds

More details
  • Looked at 13 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comments based on config settings.
1. nomic/dataset.py:1407
  • Draft comment:
    Increasing the batch size from 20 to 40 in the _add_blobs method could potentially impact performance and error handling. Ensure that the server-side implementation can handle these larger batches efficiently and consider adjusting the number of workers or improving error handling mechanisms to maintain robustness.
  • Reason this comment was not posted:
    Confidence changes required: 66%
    The change in the PR increases the batch size from 20 to 40 in the _add_blobs method. This change could potentially impact the performance of the system, especially if the server or network cannot handle larger batch sizes efficiently. It's important to ensure that the server-side implementation can handle these larger batches without issues such as timeouts or memory constraints. Additionally, increasing the batch size without adjusting the number of workers or handling potential failures in batch processing might lead to less robust error handling and increased risk of data loss or corruption.

Workflow ID: wflow_s6SNIOCT3LGl6NgJ


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

8 days left in your free trial, upgrade for $20/seat/month or contact us.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Incremental review on 2b00dc3 in 1 minute and 21 seconds

More details
  • Looked at 42 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_zK8fFgHYyBc6MZqX


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

nomic/dataset.py Show resolved Hide resolved
@zanussbaum zanussbaum merged commit e4a1c73 into main Jul 9, 2024
0 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants