Trim whitespaces during metadata block TSV file ingest #10688
Comments
Sure, a PR would be lovely! Thanks!
I vaguely remember chatting with someone who was working on some automated checks of metadata block TSV files, to see if certain things were missing in the TSV files. Was it you @poikilotherm that I was chatting with about this? I'm wondering if those checks could also include a warning about leading and trailing spaces in datasetField names.
I think @poikilotherm and I have both been working on something like this independently. The tool I linked above started as a yaml to tsv converter for Dataverse metadata blocks but now also doubles as a linter and checks for trailing spaces, duplicate keys etc. We routinely run it as a CI script for all our metadata blocks now.
Ah great. Thanks @HenningTimm! Then I suppose that while your tool, and potentially @poikilotherm's tool if that's who I chatted with about this, prevents leading and trailing spaces from getting into the database in the first place, the check during TSV file ingest in #10696 will be another guard against that happening 👍
@jggautier you're likely thinking of https://github.com/gdcc/mdbtool by @poikilotherm. Please see also https://dataverse.zulipchat.com/#narrow/stream/375812-containers/topic/hacking.20on.20metadata.20blocks/near/340715866 And there's https://gist.github.com/vera/edcebfcab6406759029ccb7d9e8d470b by @vera, described at #9463 (comment)
When adding a custom metadata block using the api/admin/datasetfield/load API as described here, it is currently possible to create datasetField names that start or end with spaces. These datasetField names, read from the corresponding column of the TSV file, are entered into the database as is. For example, adding the following metadata block would introduce the field name "whitespaceDemoOne " (with one trailing space) to the database.
This has led to some confusing behavior during updates. Generating a new SOLR schema using the update-fields.sh script keeps the trailing spaces in place. However, SOLR does not work properly with these fields, and I had to manually remove them from the SOLR schema file for the search to work on those fields again.
Overview of the Feature Request
When reading a TSV file in the loadDatasetField method, trim leading and trailing whitespace around each TSV cell. For most, if not all, entries, leading and trailing whitespace does not serve a purpose and can be regarded as a typo with relative safety.
While only datasetField names and controlled vocabulary identifiers are entered into the database, this approach would also be helpful for all other columns of a metadata block.
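To illustrate the requested behavior outside of Dataverse, here is a minimal Python sketch of per-cell trimming. It is not the loadDatasetField implementation; the function name trim_tsv_cells and the sample title "Whitespace Demo One" are hypothetical, and the sketch assumes the file is plain tab-separated text without quoted cells.

```python
def trim_tsv_cells(tsv_text: str) -> str:
    """Strip leading and trailing whitespace from every tab-separated cell.

    Hypothetical helper for illustration only; assumes plain TSV
    without quoting or embedded tabs.
    """
    trimmed_lines = []
    for line in tsv_text.splitlines():
        # Split on tabs only, so spaces inside a cell (e.g. in a title) survive.
        cells = line.split("\t")
        trimmed_lines.append("\t".join(cell.strip() for cell in cells))
    # Lines are re-joined with "\n"; a trailing newline is not preserved.
    return "\n".join(trimmed_lines)


if __name__ == "__main__":
    # A data row with an empty first column, a field name with one trailing
    # space, and a title with one leading space (title is made up).
    row = "\twhitespaceDemoOne \t Whitespace Demo One"
    print(repr(trim_tsv_cells(row)))
```

Because the split happens on tab characters before stripping, empty cells stay empty and the column count of each row is unchanged; only the padding around each value is removed.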
The only column where trimming could be detrimental is displayFormat when picking an explicit character (the last option in the list). I am not sure if that would be a reasonable use of that variable.
What kind of user is the feature intended for?
This feature is intended as a safeguard for Sysadmins who need to add and maintain custom metadata blocks. It aims to avoid introducing malformed entries into the database since these cause trouble with SOLR during update processes (see above).
What inspired the request?
We work with several data stewards in research projects that develop their own metadata blocks. However, they do not have SSH access to the Dataverse server and thus rely on us (the library) to add and maintain the metadata blocks for them. While we do our best to screen for such errors, both manually and with software, this kind of error is easy to make, easy to miss in QA, and has slipped by us in the past. Preventing such errors in the first place would be a great help.
Any open or closed issues related to this feature request?
None that we know of.
Implementation
We would be happy to provide a PR implementing this. @erodde has already been working on one.