Review Genre extraction of bibliographic entities #36

Open
5 of 10 tasks
alliyya opened this issue Apr 30, 2022 · 6 comments
Assignees
Labels
Conversion: LINCS This is related to the conversion process using CIDOC-CRM and the CWRC vocabularies. (Main Branch) type:bug

Comments

@alliyya
Member

alliyya commented Apr 30, 2022

see results from query

Is genre being extracted correctly? Potentially re-extract.

Tasks:

@alliyya alliyya self-assigned this Apr 30, 2022
@alliyya
Member Author

alliyya commented Apr 30, 2022

@alliyya alliyya closed this as completed Apr 30, 2022
@alliyya alliyya reopened this Apr 30, 2022
@alliyya
Member Author

alliyya commented May 8, 2022

The bibliographic record IDs changed with the newer files (e.g. 5440d4e1-6a18-413d-a866-9364bf0c0e51), but the genre mapping still uses the old IDs (e.g. 94449), so the record IDs no longer align with the keys in genre_map:

genre_map ('94449': ['NOVEL', 'DETECTIVE'])

Also look into whether the genre_map lists are being properly appended to.
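A minimal sketch of how the ID mismatch described above could be checked. The function name `unmatched_ids` and the flat list of record IDs are assumptions for illustration, not the project's actual code; only the two IDs and the `genre_map` entry come from this thread.

```python
def unmatched_ids(record_ids, genre_map):
    """Return (record IDs with no genre entry, genre_map keys matching no record)."""
    records = set(record_ids)
    keys = set(genre_map)
    return sorted(records - keys), sorted(keys - records)


# IDs taken from the comment above: the new-style record ID and the old-style key.
record_ids = ["5440d4e1-6a18-413d-a866-9364bf0c0e51"]
genre_map = {"94449": ["NOVEL", "DETECTIVE"]}

missing, stale = unmatched_ids(record_ids, genre_map)
print("records without genres:", missing)
print("genre_map keys without records:", stale)
```

Non-empty results on both sides would suggest that the two sources use different ID schemes rather than that genres are simply absent.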

@alliyya alliyya added Conversion: CWRC This is related to the conversion process using the CWRC ontologies. (Classic Branch) Conversion: LINCS This is related to the conversion process using CIDOC-CRM and the CWRC vocabularies. (Main Branch) labels May 8, 2022
@alliyya
Member Author

alliyya commented May 8, 2022

Can confirm that the lists under the genre_map keys were being overwritten with every file parsed rather than appended to, leading to some additional missed genres.

Potential future TODO: establish whether any weighting should be attached to help narrow down genres. E.g. only use a genre if it has been associated with a textscope 3+ times, for cases where a genre is incorrect and is just a one-off mention referring to a different text.

@alliyya
Member Author

alliyya commented May 8, 2022

Next steps:

  • update genre_map lists to be appended to instead of overwritten.
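The append-instead-of-overwrite fix could look roughly like the sketch below. `merge_genres` and the per-file dictionaries are hypothetical names and shapes chosen for illustration; the actual parsing code is not shown in this thread.

```python
from collections import defaultdict


def merge_genres(genre_map, file_genres):
    """Append the genres found in one parsed file to the shared genre_map."""
    for record_id, genres in file_genres.items():
        for genre in genres:
            # Append rather than assign, so earlier files are not overwritten;
            # skip genres that are already recorded for this ID.
            if genre not in genre_map[record_id]:
                genre_map[record_id].append(genre)
    return genre_map


genre_map = defaultdict(list)
# Two hypothetical parsed files that both mention record 94449.
merge_genres(genre_map, {"94449": ["NOVEL"]})
merge_genres(genre_map, {"94449": ["DETECTIVE", "NOVEL"]})
print(dict(genre_map))  # {'94449': ['NOVEL', 'DETECTIVE']}
```

With a plain assignment (`genre_map[record_id] = genres`) the second file would have replaced the first file's list, which matches the overwriting behaviour described above.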

@SusanBrown
Collaborator

SusanBrown commented May 8, 2022 via email

alliyya added a commit that referenced this issue May 9, 2022
@alliyya
Member Author

alliyya commented May 9, 2022

> I don’t follow the logic of the “todo”: is this to omit rarely mentioned genres from the list of those associated with an author?

Essentially, yes. At a later point, we'd likely want to review the genres extracted and see how accurate they are.

Example: we have a textscope that's mentioned in 5 different entries; 4 entries use similar genres (e.g. cwrc:letter and cwrc:romance) to describe it, but 1 entry uses a genre that doesn't align or make sense for the particular work (cwrc:dictionary).

We could define rules such as taking the 3 most common genres of a text, or only using a genre if it's associated with a textscope more than 2 times.

It was more of an idea requiring further investigation than a concrete TODO.
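As a rough illustration of the filtering idea, the sketch below counts how often each genre is attached to one textscope and keeps only those above a threshold, optionally capped at the N most common. The function name `filter_genres`, its parameters, and the input shape are assumptions; the thresholds mirror the "more than 2 times" and "3 most common" examples above and have not been validated against the data.

```python
from collections import Counter


def filter_genres(mentions, min_count=3, top_n=None):
    """Keep genres mentioned at least min_count times, most frequent first."""
    counts = Counter(mentions)
    kept = [genre for genre, n in counts.most_common() if n >= min_count]
    # Optionally cap the result at the top_n most frequent genres.
    return kept if top_n is None else kept[:top_n]


# The example above: 4 entries agree on two genres, 1 entry uses an outlier.
entries = [["cwrc:letter", "cwrc:romance"]] * 4 + [["cwrc:dictionary"]]
mentions = [genre for entry in entries for genre in entry]

print(filter_genres(mentions))
```

Here `cwrc:dictionary` is dropped because it appears only once, while the two genres mentioned four times each are kept.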

@alliyya alliyya removed the Conversion: CWRC This is related to the conversion process using the CWRC ontologies. (Classic Branch) label May 11, 2022
alliyya added a commit that referenced this issue May 17, 2022
alliyya added a commit that referenced this issue Jul 19, 2022
2 participants