Trim whitespaces during metadata block TSV file ingest #10688

HenningTimm · 2024-07-16T13:45:02Z

When adding a custom metadata block using the api/admin/datasetfield/load API as described here, it is currently possible to create datasetField names that start or end with spaces. These datasetField names, read from the corresponding column of the TSV file, are entered into the database as is. For example, adding the following metadata block would introduce the field name "whitespaceDemoOne " (with one trailing space) to the database.

#metadataBlock	name	dataverseAlias	displayName												
	whitespaceDemo		Whitespace Demo												
#datasetField	name	title	description	watermark	fieldType	displayOrder	displayFormat	advancedSearchField	allowControlledVocabulary	allowmultiples	facetable	displayoncreate	required	parent	metadatablock_id
	whitespaceDemoOne 	One	Trailing Space		text	0		TRUE	TRUE	TRUE	FALSE	TRUE	FALSE		whitespaceDemo
	 whitespaceDemoTwo	Two	Leading Space		text	1		TRUE	TRUE	TRUE	FALSE	TRUE	FALSE		whitespaceDemo
	whitespaceDemoThree	Three	CV with errors		text	2		TRUE	TRUE	TRUE	FALSE	TRUE	FALSE		whitespaceDemo
#controlledVocabulary	DatasetField	Value	identifier	displayOrder											
	whitespaceDemoThree	CV1 		0											
	whitespaceDemoThree	 CV2		1											
	whitespaceDemoThree	CV3		2

This has led to some confusing behavior during updates. Generating a new SOLR schema using the update-fields.sh script keeps the trailing spaces in place. However, SOLR does not work properly with these fields, and I had to manually remove them from the SOLR schema file for the search to work on those fields again.

Overview of the Feature Request
When reading a TSV file in the loadDatasetField method, trim leading and trailing whitespace around each TSV cell. For most, if not all, entries, leading and trailing whitespace does not serve a purpose and can be regarded as a typo with relative safety.
While only datasetField names and controlled vocabulary identifiers are entered into the database, this approach would also be helpful for all other columns of a metadata block.

The only column where trimming could be detrimental is displayFormat, when picking an explicit character (the last option in the list). I am not sure if that would be a reasonable use of that variable.
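To make the proposal concrete, here is a minimal sketch of per-cell trimming, written in Python for brevity rather than in the Java of loadDatasetField. It is not the Dataverse implementation: the function name trim_tsv_cells and the keep_raw_columns escape hatch (for a column such as displayFormat) are assumptions made for illustration only.

```python
def trim_tsv_cells(tsv_text, keep_raw_columns=()):
    """Strip leading/trailing whitespace from every cell of a metadata block TSV.

    keep_raw_columns is a hypothetical option: cells under these header
    names are left untouched (e.g. "displayFormat").
    """
    rows = []
    header = []
    for line in tsv_text.splitlines():
        cells = line.split("\t")
        if cells[0].startswith("#"):
            # Section header such as "#datasetField": remember the column names.
            header = [cell.strip() for cell in cells]
            rows.append(header)
            continue
        rows.append([
            cell if (i < len(header) and header[i] in keep_raw_columns) else cell.strip()
            for i, cell in enumerate(cells)
        ])
    return rows


# Shortened version of the example block above (only four columns).
sample = (
    "#datasetField\tname\ttitle\tdisplayFormat\n"
    "\twhitespaceDemoOne \tOne\t \n"
    "\t whitespaceDemoTwo\tTwo\t\n"
)
rows = trim_tsv_cells(sample, keep_raw_columns={"displayFormat"})
for row in rows:
    print(row)
```

With keep_raw_columns left empty, every cell is trimmed, which matches the request as stated; the option only illustrates how the displayFormat concern could be handled if it turns out to matter.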

What kind of user is the feature intended for?
This feature is intended as a safeguard for Sysadmins who need to add and maintain custom metadata blocks. It aims to avoid introducing malformed entries into the database since these cause trouble with SOLR during update processes (see above).

What inspired the request?
We work with several data stewards in research projects that develop their own metadata blocks. However, they do not have SSH access to the Dataverse server and thus rely on us (the library) to add and maintain the metadata blocks for them. While we do our best to screen for such errors, both manually and with software, this kind of error is easy to make, easy to miss in QA, and has slipped by us in the past. Preventing them in the first place would be a great help.

Any open or closed issues related to this feature request?
None that we know of.

Implementation
We would be happy to provide a PR implementing this. @erodde has already been working on one.


pdurbin · 2024-07-16T13:50:15Z

Sure, a PR would be lovely! Thanks!


jggautier · 2024-07-18T14:28:17Z

I vaguely remember chatting with someone who was working on some automated checks of metadata block TSV files, to see if certain things were missing in the TSV files. Was it you @poikilotherm that I was chatting with about this?

I'm wondering if those checks could also include a warning about leading and trailing spaces in datasetField names.

HenningTimm · 2024-07-19T17:49:47Z

I think @poikilotherm and I have both been working on something like this independently. The tool I linked above started as a YAML-to-TSV converter for Dataverse metadata blocks but now also doubles as a linter and checks for trailing spaces, duplicate keys, etc. We routinely run it as a CI script for all our metadata blocks now.
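As an illustration of the kind of check discussed here, the sketch below reports every TSV cell that has leading or trailing whitespace. It is a standalone Python example under assumed names (find_untrimmed_cells is hypothetical) and does not reflect how the linter mentioned above or mdbtool is actually implemented.

```python
def find_untrimmed_cells(tsv_text):
    """Return (line_number, column_index, raw_cell) for each cell with surrounding whitespace.

    Line numbers are 1-based; column indices are 0-based.
    """
    findings = []
    for line_number, line in enumerate(tsv_text.splitlines(), start=1):
        for column_index, cell in enumerate(line.split("\t")):
            if cell != cell.strip():
                findings.append((line_number, column_index, cell))
    return findings


# Controlled vocabulary section from the example block in the issue description.
cv_section = (
    "#controlledVocabulary\tDatasetField\tValue\tidentifier\tdisplayOrder\n"
    "\twhitespaceDemoThree\tCV1 \t\t0\n"
    "\twhitespaceDemoThree\t CV2\t\t1\n"
    "\twhitespaceDemoThree\tCV3\t\t2\n"
)
findings = find_untrimmed_cells(cv_section)
for line_number, column_index, cell in findings:
    print(f"line {line_number}, column {column_index}: {cell!r}")
```

A CI job could fail the build when the returned list is non-empty, which would catch these cells before they are loaded.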

jggautier · 2024-07-19T18:11:08Z

Ah great. Thanks @HenningTimm! Then I suppose that while your tool, and potentially @poikilotherm's tool if that's who I chatted with about this, prevents leading and trailing spaces from getting into the database in the first place, the check during TSV file ingest in #10696 will be another guard against that happening 👍

pdurbin · 2024-07-19T19:21:02Z

@jggautier you're likely thinking of https://github.com/gdcc/mdbtool by @poikilotherm . Please see also https://dataverse.zulipchat.com/#narrow/stream/375812-containers/topic/hacking.20on.20metadata.20blocks/near/340715866

And there's https://gist.github.com/vera/edcebfcab6406759029ccb7d9e8d470b by @vera , described at #9463 (comment)

HenningTimm added the Type: Feature a feature request label Jul 16, 2024

erodde added a commit to erodde/dataverse that referenced this issue Jul 17, 2024

added whitespace trimming to dataset-field loading + null check for h…

72ded5c

…eader IQSS#10688

erodde mentioned this issue Jul 18, 2024

Added whitespace trimming to uploaded custom metadata TSV files #10696

Merged

erodde added a commit to erodde/dataverse that referenced this issue Jul 19, 2024

added realease note IQSS#10688

1cbe75f

jggautier mentioned this issue Jul 22, 2024

Multiple value in a child field #10674

Open

pdurbin added a commit to erodde/dataverse that referenced this issue Sep 11, 2024

Merge branch 'develop' into 10688_whitespace_trimming IQSS#10688

59789b8

ofahimIQSS closed this as completed in #10696 Dec 3, 2024

pdurbin added this to the 6.5 milestone Dec 3, 2024

HenningTimm mentioned this issue Dec 18, 2024

Changing lint levels for different Dataverse versions HenningTimm/yml2block#24

Open
