Trim whitespaces during metadata block TSV file ingest #10688

Closed
HenningTimm opened this issue Jul 16, 2024 · 5 comments · Fixed by #10696
Labels
Type: Feature a feature request

@HenningTimm
Contributor

When adding a custom metadata block using the api/admin/datasetfield/load API as described here, it is currently possible to create datasetField names that start or end with spaces. These names, read from the corresponding column of the TSV file, are entered into the database as is. For example, adding the following metadata block would introduce the field name "whitespaceDemoOne " (with one trailing space) to the database.

#metadataBlock	name	dataverseAlias	displayName												
	whitespaceDemo		Whitespace Demo												
#datasetField	name	title	description	watermark	fieldType	displayOrder	displayFormat	advancedSearchField	allowControlledVocabulary	allowmultiples	facetable	displayoncreate	required	parent	metadatablock_id
	whitespaceDemoOne 	One	Trailing Space		text	0		TRUE	TRUE	TRUE	FALSE	TRUE	FALSE		whitespaceDemo
	 whitespaceDemoTwo	Two	Leading Space		text	1		TRUE	TRUE	TRUE	FALSE	TRUE	FALSE		whitespaceDemo
	whitespaceDemoThree	Three	CV with errors		text	2		TRUE	TRUE	TRUE	FALSE	TRUE	FALSE		whitespaceDemo
#controlledVocabulary	DatasetField	Value	identifier	displayOrder											
	whitespaceDemoThree	CV1 		0											
	whitespaceDemoThree	 CV2		1											
	whitespaceDemoThree	CV3		2											

This has led to some confusing behavior during updates. Generating a new SOLR schema with the update-fields.sh script keeps the trailing spaces in place. However, SOLR does not work properly with these fields, and I had to manually remove the spaces from the SOLR schema file before search worked on those fields again.

Overview of the Feature Request
When reading a TSV file in the loadDatasetField method, trim leading and trailing whitespace from each TSV cell. For most, if not all, entries, leading and trailing whitespace serves no purpose and can be treated as a typo with relative safety.
While only datasetField names and controlled vocabulary identifiers are entered into the database, this approach would also be helpful for all other columns of a metadata block.

The only column where trimming could be detrimental is displayFormat, when picking an explicit character (the last option in the list). I am not sure whether that would be a reasonable use of that variable.
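
To illustrate the requested behavior, here is a minimal sketch in Python. It is not the Dataverse implementation (loadDatasetField is Java code), and the function name trim_tsv_cells is hypothetical. It trims every cell and does not special-case displayFormat.

```python
def trim_tsv_cells(line: str) -> list[str]:
    """Split one TSV line on tabs and strip whitespace around each cell.

    Hypothetical helper for illustration only; not part of Dataverse.
    """
    return [cell.strip() for cell in line.split("\t")]


# A datasetField row whose name has one trailing space, as in the example above.
row = "\twhitespaceDemoOne \tOne\tTrailing Space"
print(trim_tsv_cells(row))
```

With trimming in place, the name from this row would reach the database as "whitespaceDemoOne" without the trailing space.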

What kind of user is the feature intended for?
This feature is intended as a safeguard for Sysadmins who need to add and maintain custom metadata blocks. It aims to avoid introducing malformed entries into the database since these cause trouble with SOLR during update processes (see above).

What inspired the request?
We work with several data stewards in research projects that develop their own metadata blocks. However, they do not have SSH access to the Dataverse server and thus rely on us (the library) to add and maintain the metadata blocks for them. While we do our best to screen for such errors, both manually and with software, this kind of error is easy to make, easy to miss in QA, and slipped by us in the past. Preventing them in the first place would be a great help.

Any open or closed issues related to this feature request?
None that we know of.

Implementation
We would be happy to provide a PR implementing this. @erodde has already been working on one.

@HenningTimm HenningTimm added the Type: Feature a feature request label Jul 16, 2024
@pdurbin
Member

pdurbin commented Jul 16, 2024

Sure, a PR would be lovely! Thanks!

@jggautier
Contributor

I vaguely remember chatting with someone who was working on some automated checks of metadata block TSV files, to see if certain things were missing in the TSV files. Was it you @poikilotherm that I was chatting with about this?

I'm wondering if those checks could also include a warning about leading and trailing spaces in datasetField names.

erodde added a commit to erodde/dataverse that referenced this issue Jul 19, 2024
@HenningTimm
Contributor Author

I think @poikilotherm and I have both been working on something like this independently. The tool I linked above started as a YAML-to-TSV converter for Dataverse metadata blocks, but it now also doubles as a linter that checks for trailing spaces, duplicate keys, etc. We now routinely run it as a CI script for all our metadata blocks.
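
As a rough idea of what such a whitespace check can look like, here is a minimal Python sketch. It is not the code of the tool mentioned above, and the function name find_untrimmed_cells is hypothetical. It reports every cell that differs from its trimmed form, with 1-based line and column numbers.

```python
def find_untrimmed_cells(tsv_text: str) -> list[tuple[int, int, str]]:
    """Return (line number, column number, cell) for each cell that has
    leading or trailing whitespace. Hypothetical helper for illustration.
    """
    findings = []
    for line_no, line in enumerate(tsv_text.splitlines(), start=1):
        for col_no, cell in enumerate(line.split("\t"), start=1):
            if cell != cell.strip():
                findings.append((line_no, col_no, cell))
    return findings


sample = (
    "#datasetField\tname\ttitle\n"
    "\twhitespaceDemoOne \tOne\n"
    "\t whitespaceDemoTwo\tTwo\n"
    "\twhitespaceDemoThree\tThree\n"
)
for line_no, col_no, cell in find_untrimmed_cells(sample):
    print(f"line {line_no}, column {col_no}: {cell!r}")
```

A CI job could fail whenever the returned list is non-empty.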

@jggautier
Contributor

Ah great. Thanks @HenningTimm! Then I suppose that while your tool, and potentially @poikilotherm's tool if that's who I chatted with about this, prevents leading and trailing spaces from getting into the database in the first place, the check during TSV file ingest in #10696 will be another guard against that happening 👍

@pdurbin
Member

pdurbin commented Jul 19, 2024
