Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_raw_bids() cannot infer recording kind, filename extension, leading to awkward workarounds #408

Closed
hoechenberger opened this issue May 13, 2020 · 12 comments · Fixed by #410

Comments

@hoechenberger
Copy link
Member

hoechenberger commented May 13, 2020

Describe the problem

make_bids_folders() accepts a kind kwarg:

mne-bids/mne_bids/utils.py

Lines 200 to 201 in 1ce1717

def make_bids_folders(subject, session=None, kind=None, bids_root=None,
make_dir=True, overwrite=False, verbose=False):

write_raw_bids()infers the kind from the Raw object:

kind = _handle_kind(raw)

write_raw_bids() infers the filename extension from the Raw object:

mne-bids/mne_bids/write.py

Lines 1132 to 1138 in 1ce1717

raw_fname = raw.filenames[0]
if '.ds' in op.dirname(raw.filenames[0]):
raw_fname = op.dirname(raw.filenames[0])
# point to file containing header info for multifile systems
raw_fname = raw_fname.replace('.eeg', '.vhdr')
raw_fname = raw_fname.replace('.fdt', '.set')
_, ext = _parse_ext(raw_fname, verbose=verbose)

read_raw_bids(), in contrast, requires the "full name" of the input file

mne-bids/mne_bids/read.py

Lines 331 to 341 in 1ce1717

def read_raw_bids(bids_fname, bids_root, extra_params=None,
verbose=True):
"""Read BIDS compatible data.
Will attempt to read associated events.tsv and channels.tsv files to
populate the returned raw object with raw.annotations and raw.info['bads'].
Parameters
----------
bids_fname : str
Full name of the data file

This makes using read_raw_bids() awkward

This is demonstrated in the example gallery:

bids_fname = bids_basename + '_meg.fif'
raw = read_raw_bids(bids_fname, output_path)

Since read_raw_bids() doesn't know about the kind nor extension of a file, both have to be manually appended to the bids_basename. This leads to code that isn't data-format agnostic.

Describe your solution

  • We could add a kind kwarg to read_raw_bids() to allow specification of the requested recoding modality.
  • However, this would still leave the problem that we wouldn't know about the correct filename extension to use. Currently our way to automatically find the data file without prior knowledge of the extension in the mne-study-template is to concat bids_basename + '_' + kind, and then glob the (hopefully correctly assembled!) directory, excluding the .json sidecar files. This doesn't seem like a good solution.
  • The approach taken in the example gallery quoted above -- appending a constant string to the filename -- is not portable.If I were to convert data from FIFF to BranVision, I would expect all my mne-bids-based pipelines to continue working without any modifications.

Additional context

I think this touches the discussion we've had in #407 (comment); namely, that it would not only be nice, but it seems increasingly necessary to have a central registry of all the files in a dataset. In the concrete example, I should be able to tell read_raw_bids(): Please load raw data from: sub-01, ses-test, meg and shouldn't be bothered with the filename extension or folder hierarchy.

@agramfort
Copy link
Member

agramfort commented May 13, 2020 via email

@hoechenberger
Copy link
Member Author

we should have a video session to discuss this but my first reaction is that

My schedule is flexible. Let me know when you would be available – who else would like to chime in? @jasmainak @sappelhoff (just randomly pinging people here :D)

I can also set up a Doodle to find a date that would suit us all.

Thanks!

@sappelhoff
Copy link
Member

I don't have a particular opinion about this but I am happy to listen and chime in when I think I have something important to say. My schedule is also flexible (except mondays)

@jasmainak
Copy link
Member

I would suggest going a little back and forth on github first so we can think about the problem first. I like the idea of globbing.

But one problem with globbing is that you might have two files which match the basename. For example, if someone placed a derivative file in the same folder that has the same basename, then how do you choose the right file? It would be nice if there are guarantees in the validator from this happening. Could you check?

@agramfort
Copy link
Member

agramfort commented May 13, 2020 via email

@jasmainak
Copy link
Member

I think nothing stops you from adding non-compliant extra files in the BIDS folders. But I seem to recall that the validator throws warnings/errors if you do that. Maybe @sappelhoff will know. I'm fine with detecting this case and raising an exception until there is a compelling reason to do otherwise.

@sappelhoff
Copy link
Member

I think nothing stops you from adding non-compliant extra files in the BIDS folders

The validator should (and to my knowledge DOES) stop you if you do that. However:

  1. we know that the validator is not as perfect as we'd like
  2. people can add a .bidsignore to cheat the validator and still include all files they want to

That said, the validator does not go into the /derivatives folder of a BIDS dataset, because we do not yet impose any structure on derivatives. So whatever happens in /derivatives is out of our control. -> globbing seems to be not very robust to me, but would probably work for most cases.

Perhaps we can use a check whether globbing is ambiguous (several matches returned) ... and if it is, we warn and demand extra params ... or a check by the user of their BIDS data?

if it's not ambiguous (just a single match), we just proceed?

@hoechenberger
Copy link
Member Author

hoechenberger commented May 14, 2020

@agramfort and I had a brief phone call yesterday to discuss the issue. Here's an idea we came up with:

Scenario

Assume a dataset with simultaneous MEG and EEG recordings. MEG data is in FIFF format, and EEG data is stored as BrainVision.

Current solution

bids_root = '/bids'
bids_basename = make_bids_basename(subject='foo', session='bar', task='baz')

# Load MEG data.
kind = 'meg'
fname_ext = 'fif'
bids_fname = f'{bids_basename}_{kind}.{fname_ext}'
raw_meg = read_raw_bids(bids_fname=bids_fname, bids_root=bids_root)

# Load EEG data.
kind = 'eeg'
fname_ext = 'vhdr'
bids_fname = f'{bids_basename}_{kind}.{fname_ext}'
raw_eeg = read_raw_bids(bids_fname=bids_fname, bids_root=bids_root) 

How it could be better

bids_root = '/bids'
bids_basename = make_bids_basename(subject='foo', session='bar', task='baz')

# Load MEG data.
kind = 'meg'
raw_meg = read_raw_bids(bids_basename=bids_basename, bids_root=bids_root, kind=kind)

# Load EEG data.
kind = 'eeg'
raw_eeg = read_raw_bids(bids_basename=bids_basename, bids_root=bids_root, kind=kind)

Notice how

  • we can simply pass bids_basename and kind directly to read_raw_bids
  • the filename extension will be automatically inferred (based on kind and / or globbing)

➔ We don't require users to create a bids_fname string manually anymore, leading to less user code that's also more portable (filename-extension agnostic.)

Issues with this approach

  • It's obviously not backward-compatible, at least once we remove the bids_fname parameter from read_raw_bids().
  • Since the filename extension isn't specified by the user, there can be ambiguities in edge cases.

@sappelhoff

Perhaps we can use a check whether globbing is ambiguous (several matches returned) ... and if it is, we warn and demand extra params ... or a check by the user of their BIDS data?

if it's not ambiguous (just a single match), we just proceed?

I'd be in favor of raising an exception in an ambiguous situation :)

@jasmainak
Copy link
Member

people can add a .bidsignore to cheat the validator and still include all files they want to

I think you raise an interesting point though. mne-bids should also be bidsignore-aware when reading the file with globbing.

Assume a dataset with simultaneous MEG and EEG recordings.

This is by the way an unlikely scenario in the sense that there are long discussions going on around this. So it's not fully fleshed out in BIDS yet. There might be other mechanisms to disambiguate in this scenario. This is just to say that the kind argument may not be necessary.

@agramfort
Copy link
Member

agramfort commented May 14, 2020 via email

@hoechenberger
Copy link
Member Author

hoechenberger commented May 15, 2020

@agramfort

bids_basename = make_bids_basename(subject=subject, kind='meg')
raw = read_raw_bids(bids_basename=bids_basename, bids_root=bids_root)

Yes so the difference between this proposal and the one I had made is whether the kind shall be considered a part of the basename or not.

Briefly skimming over the discussion that @jasmainak linked to, it seems possible that at one point in the future we'll have a piece of metadata in our BIDS datasets that tells us whether multiple modalities were recorded simultaneously, e.g. MEG & EEG: something of the kind "SimultaneousRecording": "EEG, MEG". So considering that, I would say the bids_basename should stay free of the kind, and one should only specify the kind when actually intending to read a certain piece of data. Because then the basename could be easily reused for reading multiple kinds. So pretty much like I proposed above:

bids_basename = make_bids_basename(subject='foo')
raw_meg = read_raw_bids(bids_basename=bids_basename, bids_root=bids_root, kind='meg')
raw_eeg = read_raw_bids(bids_basename=bids_basename, bids_root=bids_root, kind='eeg')

Thoughts?

@hoechenberger
Copy link
Member Author

I'm working on this now and will propose a PR shortly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants