Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature 122 extend research context #123

Merged

Conversation

ydamit
Copy link
Collaborator

@ydamit ydamit commented Jun 24, 2024

Added three new files to handle the Usecase Model, Import Port, and Usecase code for a new feature to allow for extending existing Research Contexts. Since adding new source data to an existing Research Context isn't a practical solution for several reasons, this feature expects that the end user will provide new data sources together with identifying information about the existing Research Context, which the Usecase can then use to create a new Research Context by combining (and deduplicating) the new source data with the source data already present in the provided Research Context.

These files were created by cloning and adapting the corresponding files for the New Research Context feature. Models outside the scope of this training exercise (e.g. View Model) were not updated. A new error was created for the scenario in which no actually new source data is provided (i.e. all provided source data is duplicative of source data already found in the existing Research Context).

@ydamit ydamit requested review from alebg June 24, 2024 19:06

new_client_source_data_list = new_client_source_data_list_dto.data

if not isinstance(new_client_source_data_list, list):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maany quick question here: is this check with isinstance OK? Don't know if I'm misremembering, but I think you had some opinion against using this kind of check at runtime

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ydamit so, the DTO you're getting is a pydantic model (see ListSourceDataDTO at lib/core/dto/client_repository_dto.py), so pydantic already enforces at runtime that data is in fact a list. I.e., we don't need this check with isinstance. What you'll need to check instead is that the list is not empty (and thanks to pydantic, if it's not empty, it's guaranteed to be a list of SourceData in particular).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's pretty convenient! Do we still need it, then, in the New Research Context feature (which is where I copied it from)? Or is there something about that use case that does require it, that's different here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, my bad, that other one should be a check of a non-empty list too. There's some more cleanup / refactoring to do in the current KP, but we'll tackle that some other day. For now I'll just open a different issue.

description="Title of the new research context to be created to include new data sources."
)
new_research_context_description: str = Field(description="Description of the research context to be created.")
client_sub: str = Field(description="SUB of the client for which the research context is to be created.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ydamit There's a repository function in sqla_research_context_repository called get_research_context_client. You can use this one, instead of asking for a client_sub as an input. If you do this, you'll directly get the client you need and you wouldn't need to check if the client provided as an input matches the client of the old research_context

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted! But since we need client_sub to create new research contexts, I just want to triple-check: we never want to allow one SUB to extend a research context originally created by another SUB?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, since we're at it, you're right, we might want this eventually. If we implement this, then we'll change your check for some sort of permission check. So no problem, please just leave this as is!

Copy link
Collaborator

@alebg alebg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ydamit Great job! Looks organized enough. Apart from the inline comment I made, please:

  • Please provide a smaller amount of commits that are more conceptually unified. Navigating this git history will be kinda messy as it is right now. We don't have a specific standard at the moment (but we might soon), so please just take inspiration from our previous commits. If you're comfortable with git, to sort this out, please use git rebase to squash your commits into more organized ones (anything between 1 and 3 commits should be enough for all you've done so far). Then, force push back to this branch. If you don't feel comfortable enough with git to do this (rebasing + force pushes are potentially destructive of your work), let me know and we'll do it over a call

  • Also please provide a test for this usecase. We haven't done many usecase-specific tests, but since this one doesn't have new repositories, it is worth it. The usecase tests are at tests/api. The test_new_message_feature.py contains a substantial usecase test you can check to find your way about instantiating what you need. Let me know if you have any questions about this

  • Once these changes are done, I'll show you how to implement the rest of the chain (presenter, controller, and API endpoint) in a call. You'll need to implement those in this same branch before we accept this PR

@ydamit
Copy link
Collaborator Author

ydamit commented Jun 25, 2024

Sounds good, @alebg , I'll work on all this tomorrow!

@ydamit ydamit force-pushed the feature-122-extend_research_context branch 3 times, most recently from 972c18d to f603aa1 Compare June 26, 2024 15:46
@ydamit ydamit force-pushed the feature-122-extend_research_context branch from 82ee9c3 to 937db2c Compare June 26, 2024 15:47
@ydamit ydamit force-pushed the feature-122-extend_research_context branch from 6fc208e to 0dfcb64 Compare June 28, 2024 00:12
@alebg
Copy link
Collaborator

alebg commented Jun 28, 2024

@ydamit Great job!!! next up would be implementing the rest to have the whole feature

@ydamit
Copy link
Collaborator Author

ydamit commented Jul 1, 2024

@alebg , it took me until it was pretty late for you to figure out the flaw in my logic and fix everything. But it should be done now! All tests passing, and no references remaining anywhere to any component of the New Research Context feature!

@alebg
Copy link
Collaborator

alebg commented Jul 2, 2024

@ydamit looks great! Only piece missing that we discussed yesterday is a test to check if you try to "extend" an existing research context but not providing truly new source data IDs. I ran locally the test below (appended to the file with your tests, it's just your usecase success test with a slight modification towards the end), and it passed, so that means the code is handling that case properly now. But just please add a test for this error case. Mine was like this, but feel free to change it if you wish, and then commit it. After that we're good to merge!:

def test_error_extend_research_context_usecase_no_new_source_data(
    app_initialization_container: ApplicationContainer,
    db_session: TDatabaseFactory,
    fake: Faker,
    fake_client_with_research_context_and_sources: SQLAClient,  # Fake client with Research Context
    fake_source_data_list: list[SQLAResearchContext],
) -> None:
    usecase: ExtendResearchContextUseCase = app_initialization_container.extend_research_context_feature.usecase()

    assert usecase is not None

    client_with_context = fake_client_with_research_context_and_sources
    llm_name = fake.name()

    existing_research_context = random.choice(client_with_context.research_contexts)
    # Populate source data into the existing research context
    existing_source_data_ids = []
    for source_datum in client_with_context.source_data:
        existing_research_context.source_data.append(source_datum)
        existing_source_data_ids.append(source_datum.id)

    incoming_source_data_list = fake_source_data_list
    new_source_data_list: List[SQLASourceData] = []
    for source_data in incoming_source_data_list:
        while source_data.id is None or source_data.id in existing_source_data_ids + new_source_data_list:
            source_data.id = random.randint(1001, 2000)
        new_source_data_list.append(source_data)

    client_with_context.source_data.extend(new_source_data_list)
    new_source_data_ids = [sd.id for sd in new_source_data_list]
    assert new_source_data_ids != []

    llm = SQLALLM(
        llm_name=llm_name,
        research_contexts=client_with_context.research_contexts,
    )

    # Make title unique to query later
    existing_research_context_title = f"{existing_research_context.title}-{uuid.uuid4()}"
    existing_research_context.title = existing_research_context_title

    with db_session() as session:
        client_with_context.save(session=session, flush=True)
        session.commit()

    with db_session() as session:
        queried_existing_research_context = (
            session.query(SQLAResearchContext).filter_by(title=existing_research_context_title).first()
        )

        assert queried_existing_research_context is not None
        queried_existing_source_data_list = queried_existing_research_context.source_data
        queried_existing_source_data_ids = [sd.id for sd in queried_existing_source_data_list]

        queried_client = session.get(SQLAClient, queried_existing_research_context.client_id)

        assert queried_client is not None

        # Details for new Research Context data
        new_research_context_title = f"{fake.name().replace(' ', '_')}-{uuid.uuid4()}"
        new_research_context_description = f"{fake.text()}-{uuid.uuid4()}"

        request = ExtendResearchContextRequest(
            new_research_context_title=new_research_context_title,
            new_research_context_description=new_research_context_description,
            client_sub=queried_client.sub,
            llm_name=llm_name,
            new_source_data_ids=queried_existing_source_data_ids,
            existing_research_context_id=queried_existing_research_context.id,
        )
        response = usecase.execute(request=request)

        assert response is not None
        assert isinstance(response, ExtendResearchContextError)

@ydamit ydamit force-pushed the feature-122-extend_research_context branch from 8238b56 to b0abd5f Compare July 3, 2024 16:22
Copy link
Collaborator

@alebg alebg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job @ydamit ! Merging this :)

@alebg alebg merged commit 2db6843 into dream-aim-deliver:main Jul 4, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants