
Improving Dataverse's Schema.org JSON-LD schema to enable author names display in Google Dataset Search's #5029

Closed
chewsw opened this issue Sep 7, 2018 · 30 comments · Fixed by #9089

Comments

@chewsw commented Sep 7, 2018

Author names are missing from the dataset records that Google Dataset Search indexed from Dataverse.

The JSON-LD schema for authors should be updated to use "@type": "Person" or other appropriate types to differentiate between organizations and individuals as authors.

@pdurbin (Member) commented Sep 7, 2018

@chewsw thanks for opening this issue. Here's the thread from the dataverse-users mailing list: https://groups.google.com/d/msg/dataverse-community/TlQPNI3Ip2E/srLf29aSBAAJ

Originally we had "@type": "Person" in the JSON-LD output (in development, before release) but in Dataverse it's possible to have organizations as authors ("Gallup Organization", "Geological Survey (U.S.)", etc.) so we took it out. Please see discussion in these two places:

@kcondon changed the title from "Improving Dataverse's JSON-LD schema to enable author names display in Google Dataset Search's records" to "Improving Dataverse's JSON-LD schema to enable author names display in Google Dataset Search's" on Sep 10, 2018
@jggautier (Contributor) commented Sep 14, 2018

Maybe outlining some more details would help with estimation:

  • This will be required metadata that the depositor can change from the UI or API, indicating in some way whether the author names they enter are people or organizations
  • We need a plan for how installations can add author types to existing datasets

@jggautier (Contributor) commented Sep 26, 2018

I'm starting to think that Google Dataset Search prefers the creator property (as opposed to the author property that Dataverse uses). For every dataset landing page I've found in Google Dataset Search where Google shows the author, the creator property is used instead of the author property.

Unless someone finds something different, I'd propose that instead of the author property Dataverse uses the creator property, which can use the same sub-properties:

"creator": [
    {
      "affiliation": "affiliation",
      "@type": "Person",
      "name": "Lname, Fname"
    },
    {
      "@type": "Organization",
      "name": "Org name"
    }

(Google doesn't like affiliation when the @type organization is used with the author or creator properties.)

Also, Google's Structured Data Testing Tool is no longer showing errors when author or creator types are missing (and it defaults to "Thing" instead of Person or Organization), although I still agree that Dataverse's schema.org metadata should say whether dataset authors are people or organizations.

@pdurbin (Member) commented Oct 4, 2018

Today during sprint planning @jggautier explained his hunch on how switching from author to creator might help. This was while discussing #4371.

@jggautier (Contributor) commented Oct 5, 2018

I asked for clarification in the structured data section of Google's webmaster forum.

"Creator" seems like the more used property, but DataCite is using "author" with an @type. On this dataset on Google Dataset Search), authors are displayed even though the "author" property is used, so maybe Google does want to see the specified @type.

@chewsw (Author) commented Oct 8, 2018 via email

@jggautier (Contributor)

Author names are showing up on some but not all Google Dataset Search pages for datasets in Dataverse repositories, like this page for a dataset from the Texas Data Repository (TDR), and this page for a dataset from Harvard Dataverse. But those pages also say metadata is coming from DataCite, which publishes its own schema.org metadata and uses only the "author" property, but includes its guessed @type (e.g. this schema.org metadata from DataCite for that TDR dataset). From what I can tell so far, every Google Dataset Search page for datasets from a Dataverse repository includes author names only when the "dataset provided by" includes DataCite.

Harvard Dataverse upgraded to Dataverse 4.10.1 two days ago (Jan 8), which includes adding the "creator" property to the schema.org metadata. Once Google starts indexing more recently published datasets, we can see if authors are displayed on Google Dataset Search pages (especially when Google isn't also using DataCite's schema.org metadata).

@jggautier (Contributor)

Just an update: Datasets published in Harvard Dataverse after Jan 7, with the updated Schema.org metadata, are showing up in Google Dataset Search without the author names (like this one and this one). I think we can rule out any preference for "author" versus "creator" elements.

@pdurbin (Member) commented Jan 22, 2019

@jggautier bummer. Does that mean we should try adding "@type": "Person"? As indicated above, the Dataverse UI/API would need to allow dataset authors to choose between a person and an organization.

@jggautier (Contributor)

That or we could do what @mfenner wrote in #2243 that DataCite does, which is basically guess (with >90% accuracy).

@Venki18 commented Jun 5, 2019

@pdurbin and @jggautier, thank you for looking into this problem. I thought I'd add my findings here in case they help with making changes in the next version.

What I discovered using the Structured Data Testing Tool is that for NTU datasets the author fields show up as Thing. If I change the @type to "Person" for one of the authors, the tool doesn't show any error. I think we must include "@type": "Person" for all the authors in the ld+json script; then the Google Dataset Search results page will show the author names as persons.

[screenshot]

I see from an example GBIF (Global Biodiversity Information Facility) dataset record that the Google Dataset Search results page displays author names when "@type": "Person" is used.

[screenshot]

Please refer to the screenshot below:
[screenshot]

I think we need to add "@type": "Person" in Dataverse to show the author names on the Google Dataset Search page.
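
For illustration, here's a minimal before-and-after sketch of the change I'm suggesting (the name value is hypothetical). Without a type, the author is interpreted as a Thing:

    "author": [
      {
        "name": "Lastname, Firstname"
      }
    ]

With the type added, the tool validates it and Google can display the name:

    "author": [
      {
        "@type": "Person",
        "name": "Lastname, Firstname"
      }
    ]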

@pdurbin (Member) commented Jun 5, 2019

@Venki18 hi! Yes, the "every person and organization is a Thing" problem is well known to us, unfortunately.

I've been hoping we can use some new code added by @fcadili in pull request #4664 to pass in a string that could be either a person or an organization, and the code will tell us which it is.

I haven't studied the code yet, but here's a test he wrote that shows the code figuring out whether a string is an organization or a person, for example:

https://github.com/IQSS/dataverse/blob/v4.14/src/test/java/edu/harvard/iq/dataverse/export/OrganizationsTest.java
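
If that code pans out, I imagine the Schema.org export could use it along these lines. This is only a rough sketch: isOrganization() is a hypothetical stand-in for whatever the method from pull request #4664 is actually called, and the comma check is just a placeholder for its real logic.

    import javax.json.Json;
    import javax.json.JsonObject;

    public class CreatorJsonLdSketch {

        // Hypothetical stand-in for the person-vs-organization check from
        // pull request #4664; the real method name and logic differ.
        static boolean isOrganization(String name) {
            return !name.contains(",");
        }

        // Build a typed creator entry for the Schema.org JSON-LD export.
        static JsonObject creatorToJsonLd(String name) {
            return Json.createObjectBuilder()
                    .add("@type", isOrganization(name) ? "Organization" : "Person")
                    .add("name", name)
                    .build();
        }
    }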

[screenshot]

@Venki18 commented Jun 5, 2019

@pdurbin thank you for the quick reply. May I know how the export for TermsOfUse works? We have been using CC-BY-NC instead of CC0 and have changed the necessary text in the Bundle.properties file, but we left the underlying CC0 code as it is, so when datasets are exported to the ld+json format the license is exported as CC0. For all our datasets with waiver terms, the code exports what is entered in the additional text box; the default CC0 is taken as is.
Is there any way to show CC-BY-NC?

@pdurbin (Member) commented Jun 5, 2019

@Venki18 I'm not sure but let me at least give you and @Thanh-Thanh and others some pointers to the code:

https://github.com/IQSS/dataverse/blob/v4.14/src/main/java/edu/harvard/iq/dataverse/DatasetVersion.java#L1777

It looks like if CC0 isn't specified, the code will put in the free-form text the user entered as an alternative to CC0.
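
As a rough sketch of the two cases (the exact property and wording in Dataverse's output may differ, so treat this as an assumption rather than the actual export), I'd expect CC0 datasets to carry a CC0 reference:

    "license": "https://creativecommons.org/publicdomain/zero/1.0/"

and datasets with custom terms to carry whatever was typed into the text box:

    "license": "This dataset is made available under CC-BY-NC terms."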

This is somewhat off topic for this issue, of course, but I hope this helps! 😄 Please feel free to create as many issues as we need!

@pdurbin (Member) commented Jun 5, 2019

@Venki18 also, if you're interested @rigelk and I are talking about Schema.org JSON-LD, especially in relation to ActivityPub (#5883) in chat. You can catch up on the conversation at http://irclog.iq.harvard.edu/dataverse/2019-06-05

@jggautier changed the title from "Improving Dataverse's JSON-LD schema to enable author names display in Google Dataset Search's" to "Improving Dataverse's Schema.org JSON-LD schema to enable author names display in Google Dataset Search's" on Oct 22, 2020
@adam3smith (Contributor)

Just to register our interest in fixing this, using the Datacite strategy of guessing. Would you take a fix along those lines as a PR? Otherwise we'd address it locally -- not having authors in Google Dataset Search is a bummer.

@jggautier (Contributor) commented Oct 12, 2022

I think we were worried about how well the Datacite strategy of guessing the name type (#2243 (comment)), used in the Dataverse software's OpenAIRE metadata export, would work for all Dataverse installations.

These are some or hopefully all of the next steps proposed already in another GitHub issue (which I can't find right now), some by @qqmyers:

  • Test how well the algorithm for guessing author name types works for different installations, especially those with different types of names, and improve the algorithm if needed before using it in other metadata exports like the Schema.org export
  • Let installations decide whether or not to use the algorithm for any of their metadata exports
  • Use the algorithm only for datasets already created in an installation and ask depositors to indicate the type of author name being entered

I think a combination of these things could be done, like testing the algorithm and adding a way for installations to say that they don't want to use it. In this case I'd advocate for providing installations with guidance about how to evaluate the accuracy of the algorithm. If GUI work could be done to let depositors choose the name type when they enter names, installations could also decide not to use the algorithm to add "type" metadata to the author/creator names of datasets already created in their installations.

But one blocker might be a lack of resources (people and time) to review the algorithm and design and test GUI changes. Maybe a review of the algorithm's accuracy would be less resource intensive if it was distributed among a representative sample of Dataverse installations, like those with names from different cultures in their metadata. Each installation could test the accuracy of their own metadata, report back, and the community could make a decision about what to do then.

The algorithm is being used to determine the name type for names entered in other fields, like the Contact Person field, but I'm also assuming that we would prioritize identifying the types of names in the author/creator fields, because Google Dataset Search usually displays those names but doesn't when there's no name type in the Schema.org metadata.

@adam3smith (Contributor)

Thanks for responding so quickly!

I think we were worried about how well [the Datacite strategy of guessing the name type] (#2243 (comment)) (used in the Dataverse software's OpenAIRE metadata export) would work for all Dataverse installations.

I understand this, but I'm wondering whether this is the right trade-off:

  • With the current behavior, we're generating invalid JSON-LD and having our data presented in Google without proper attribution to its creators
  • With a worst-case scenario of implementing the fix and "guessing" working for only half the data (a highly unlikely scenario), we'd fix the invalid JSON-LD and have our holdings show up properly in Google, with the main downside that some people would be categorized as organizations and vice versa. The main effect of this would be, I think, that some auto-generated citations would look a bit off.

Am I overlooking a massive cost to incorrectly labelling a creator?
In other words, even if the algo would work poorly in some installations -- is there any scenario where it's worse than the status quo? And if it isn't, is there any way in which implementing a "guessing" fix would create a form of technology lock-in that we want to avoid? Otherwise, I don't really see the case against this.

@jggautier (Contributor)

Am I overlooking a massive cost to incorrectly labelling a creator?

I agree. Maybe making sure author names are displayed in Google Dataset Search outweighs any adverse effects of calling a person an organization or vice versa. And right now the only adverse effect we can think of is that the citations would look a bit off.

I don't think there's any technical lock-in, if I get your meaning. I think it would be easy to improve this over time, right? For example, repositories could run an improved algorithm and create or overwrite dataset versions with improved metadata, however that metadata is corrected.

The lock-in might be more behavioral and I'm probably suggesting that we front-load research because I always worry that the effectiveness of the changes that make it into each Dataverse release won't be evaluated soon enough, if at all. By the time we see that something needs to be fixed, the problem could be much more difficult to fix.

For example, I'm thinking of the problems I wrote about in #5920. The scale and consequences of those problems haven't been explored, but I think it wouldn't be great for discoverability and access if datasets that aren't really "closed access" are labelled as such.

Maybe it's enough that there's this record of a conversation about "what could go wrong with this decision" and an agreement that we don't need to evaluate the changes more proactively, as long as fixing any problems won't become harder over time (in this case, as more and more datasets are created).

@adam3smith (Contributor)

I don't think there's any technical lock-in, if I get your meaning. I think it would be easy to improve this over time, right?

Right, that's what I mean -- I don't think this would make implementing either better algorithmic solutions or a manual UI for type selection in the future harder.

The lock-in might be more behavioral and I'm probably suggesting that we front-load research because I always worry that the effectiveness of the changes that make it into each Dataverse release won't be evaluated soon enough, if at all. By the time we see that something needs to be fixed, the problem could be much more difficult to fix.

That's fair. I guess beyond better algos (which I expect would be hard) the relevant question is whether, and to what extent, there is the need to allow author-type selection in the GUI, which would obviously be more precise in theory, but is also another UI feature and itself prone to user error, especially for large, self-curated repositories like Harvard DV.
How would you do the research on this? Just sample and look manually? At QDR we can obviously look at our entire holdings but I also know our metadata well enough to be pretty certain that even a naive algo without first-name matching would get us to 100%.

@jggautier (Contributor)

Yes, the only way I could think of is to have people look at a sample of the guesses that the algorithm makes.

But looking at the discussion in this issue some more I remember that Martin Fenner said the algorithm works over 90% of the time, and I think DataCite would have tested it on a much greater number and variety of names than we could. So maybe that's enough to be confident that the algorithm would be accurate enough, and instead we could consider here or later on why someone would want to correct a guess that the algorithm made and how they might be able to.

@adam3smith (Contributor)

So maybe that's enough to be confident that the algorithm would be accurate enough, and instead we could consider here or later on why someone would want to correct a guess that the algorithm made and how they might be able to.

Great, yes, let's make that the plan.

@jggautier (Contributor) commented Oct 19, 2022

@adam3smith, about planning to review why someone would want to correct a guess that the algorithm made and how they might be able to, I wanted to clarify that I think this should be done before the algorithm is used to add metadata to other exports, like the Schema.org export.

What do you think?

Very early in the Schema.org conversation and development work, I asked a liaison at Google why Google Dataset Search would insist on a nameType. I thought the answer was vague and left a lot of room for speculation (like, their knowledge graph could take advantage of knowing more about dataset authors). At the time, I think that made it easy for us to feel okay with ignoring the nameType error we kept seeing when checking the Schema.org exports, until we started noticing that if the dataset didn't have a person or organization nameType, the author names wouldn't show up in Google Dataset Search.

But since you mentioned that the algorithm guessing wrong might result in an off-looking citation, and Martin Fenner wrote earlier that the algorithm is necessary for generating correct citations, maybe we could start by assuming that the only way that a depositor or curator would notice the wrong guess is if they saw a citation of their dataset that was off. They could always just correct the autogenerated citation themselves, right? I remember having to do this a lot when managing citations in Zotero. But what are the chances that this becomes a big inconvenience and someone wants a curator or repository manager to correct the generated citations? In your experience, is the risk of that happening so low that we don't need to consider how a depositor or curator could have the guess corrected? You mentioned that this won't happen in QDR, but might it for other Dataverse repositories?

@adam3smith (Contributor)

But what are the chances that this becomes a big inconvenience and someone wants a curator or repository manager to correct the generated citations?

I think for personal authors this is a real issue and it comes up with poor metadata in Zotero a fair amount, but:
Given the way the algo works, this is easy to avoid for person authors, and we don't need any additional functionality for it (it's also why I think it's irrelevant at QDR): as long as you add your name as "Lastname, Firstname" or you have an ORCID in the metadata, you're always going to be a person. "Lastname, Firstname" is, of course, already explicitly recommended by Dataverse as the mode of entry for just this reason.

The only scenario for which we'd therefore need new functionality is an organization author that gets misclassified by the algo, either because it has a comma in its name or because it has a first word that looks like a first name. This, I think, is going to be incredibly rare, and I'd argue it's not worth considering. The chances of it happening could be further reduced if the last step of the algorithm were skipped, i.e. if we treated all names without a comma as organizations, but I don't think that's a good idea given the trade-off of missing a large number of 'incorrectly' added persons.
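
For concreteness, here's a minimal sketch of the heuristic as I understand it (class and method names are hypothetical, and the real DataCite/Dataverse implementation differs in details, e.g. the first-name lookup it performs as a last step):

    public class NameTypeGuesserSketch {

        enum NameType { PERSON, ORGANIZATION }

        // Guess whether a creator name refers to a person or an organization.
        static NameType guess(String name, String orcid) {
            // An ORCID identifies a person, so trust it first.
            if (orcid != null && !orcid.trim().isEmpty()) {
                return NameType.PERSON;
            }
            // "Lastname, Firstname" -- the entry format Dataverse already
            // recommends -- is treated as a person.
            if (name.contains(",")) {
                return NameType.PERSON;
            }
            // Everything else is treated as an organization. The real
            // algorithm adds a first-name dictionary lookup here, which
            // catches persons entered without a comma at the cost of
            // occasionally misclassifying organizations whose first word
            // looks like a first name.
            return NameType.ORGANIZATION;
        }
    }

On this sketch, the examples from earlier in the thread ("Gallup Organization", "Geological Survey (U.S.)") would correctly come out as organizations, while a name entered as "Lastname, Firstname" or with an ORCID would come out as a person.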

@jggautier (Contributor)

Great points! Thanks @adam3smith.

@pdurbin (Member) commented Jan 24, 2023

This PR is related, at least:

Might fix it? Or go a long way? I'm not sure! 😄

@qqmyers (Member) commented Jan 24, 2023

It probably does close it but there's a lot of discussion above. Perhaps whatever's left, if anything, can be a new issue?

@pdurbin (Member) commented Jan 24, 2023

@qqmyers fantastic. I just marked that PR (#9089) as closing this issue.

To anyone reading this, please look through that pull request. If there are any remaining issues after it is merged, please open a fresh issue. Thanks! 🚀

@adam3smith (Contributor)

Here's how things look on our end 😄
[screenshot]

@jggautier (Contributor)

Thanks @qqmyers for always helping make sure things don't fall through the cracks!

The two things I see in this issue that aren't addressed in PR #9089 are:
