Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alevin --rad mode does not output cmd_info.json #688

Closed
jashapiro opened this issue Jul 19, 2021 · 12 comments
Closed

Alevin --rad mode does not output cmd_info.json #688

jashapiro opened this issue Jul 19, 2021 · 12 comments

Comments

@jashapiro
Copy link

Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?

Alevin

Describe the bug

When running alevin in --rad mode, the the cmd_info.json file is not output (and/or has very limited information).

To Reproduce

I am working with version 1.5.1 of salmon, run with the following command:

salmon alevin \
 -l ISR \
 --chromiumV3 \
 -1 read1.fastq.gz -2 read2.fastq.gz \
 -i splici_index \
 --tgMap splici.tx2gene.tsv \
 -o test_align \
 --rad 

Expected behavior
Without the --rad flag, there is a cmd_info.json file with the expected fields:

{
    "salmon_version": "1.5.1",
    "libType": "ISR",
    "chromiumV3": [],
    "mates1": "read1.fastq.gz",
    "mates2": "read2.fastq.gz",
    "index": "splici_index",
    "tgMap": "splici.tx2gene.tsv",
    "output": "test",
    "threads": "2",
    "auxDir": "aux_info"
}

I would expect an output file with the same information (roughly) when salmon alevin is invoked with --rad, to aid in tracking file provenance.

In prior versions (<= 1.4.0), there was a cmd.json file in --rad mode, but the contents were only:

{
  "auxDir": "aux_info",    
  "salmon_version": "1.4.0"   
}

(As a side note, the aux_info directory is empty in this case, but this may be expected)

Desktop (please complete the following information):

  • OS: OSX
  • Version
ProductName:	macOS
ProductVersion:	11.4
BuildVersion:	20F71

Also run using the biocontainers docker images, with the same result.

Additional context
Add any other context about the problem here.

@rob-p
Copy link
Collaborator

rob-p commented Jul 19, 2021

Hi @jashapiro,

Thanks for the report! I am curious if this is one issue or two. Specifically, are you saying that with 1.5.1 you get no cmd_info.json, while with 1.4.0 you got the stripped-down one, or that you get the stripped down version in both but you expect a full json file. I agree that, in any case, having some more info in the cmd_info.json output in RAD mode would be useful. It follows a separate code path, so this is likely a matter of making sure the function to write the json file is invoked at the right point.

Best,
Rob

@jashapiro
Copy link
Author

I suppose it is two issues. With 1.4.0 I get a stripped down cmd_info.json file, and with 1.5.1 I get none at all.

rob-p pushed a commit that referenced this issue Jul 20, 2021
This addresses #688 by adding reasonably fully featured cmd_info.json
and meta_info.json files when alevin is run in RAD mode.  Some
fields are not present in meta_info.json, but important ones like
the sequence hash are.
@rob-p
Copy link
Collaborator

rob-p commented Jul 20, 2021

Hi @jashapiro,

I've just pushed a commit that should address this, adding both a "full" cmd_info.json and a reasonable meta_info.json. If this is something you need on short notice we can try an push out a 1.5.2 soon, otherwise, this will be in the next release.

Best,
Rob

@jashapiro
Copy link
Author

Hi @rob-p,

It turns out I slightly misdiagnosed the problem... The "no cmd_info.json" was correct 1.5.1, but as you may have discovered, it is also missing in earlier versions. What I was seeing was a cmd_info.json file that is somehow generated along the alevin-fry pipeline we are using. For extra confusion, that was reporting a salmon version of 1.4.0, even when salmon 1.5.1 was used. We will dive in a bit deeper to see if we can find exactly where that file was generated.

@jashapiro
Copy link
Author

I didn't do more full testing, but it looks like the latter problem is indeed coming from alevin-fry quant (and this is a temporary but expected behavior). https://github.com/COMBINE-lab/alevin-fry/blob/967f5cbb404fb86a71291d88a73afa071570b575/libradicl/src/quant.rs#L1622-L1651

I'm not familiar enough with rust to know whether this will result in overwriting cmd_info.json and meta_info.json files that exist, but it does seem to me a good behavior would be to merge the info from salmon alevin and alevin-fry, particularly for the meta_info.json file, where both tools add useful information.

@rob-p
Copy link
Collaborator

rob-p commented Jul 20, 2021

Hi @jashapiro,

So there are definitely a few things going on here. The first is that you correctly diagnosed the missing cmd_info.json information when alevin is run in RAD mode. That was simply an oversight, and there is no reason that file shouldn't have been written. Second, there is also useful information that belongs in meta_info.json in the aux_info directory (like the SHA hash of the reference sequences); that was also missing but has now been added.

In addition to salmon's alevin command, each step of alevin-fry also writes some useful metadata when it executes. For example, there is a json file written by the generate-permit-list step, one written by the collate step, and one written by the quant step.

We've never run into the problem of the output of alevin-fry overwriting the output of alevin because we use a directory structure where the output quantifications reside in a separate directory from the input RAD file. However, I can now see that if you're writing the quants in the same place as the input, then there will be a conflict in the file names, and the existing files will be overwritten with the new ones.

I agree that both tools output useful information. I'm a bit ambivalent about assuming the salmon-generated files exist, and merging them into one output file, as I think there might be cases where those files aren't present and alevin-fry should still run properly since it doesn't require them to perform it's processing. One option would be to rename the alevin-fry output files to prefix/postfix them so they don't collide with the salmon files even if they live in the same directory. Then, one could (now or later) write a small command to merge the relevant json files into a unified output if that would be more convenient downstream. Let me know your thoughts.

Thanks!
Rob

@jashapiro
Copy link
Author

Hi @rob-p,

Ah, I hadn't looked carefully at the outputs, so I was overlooking the fact that the meta_info.json for salmon alevin is in the aux_info subdirectory, while the one for alevin-fry quant is in the output directory, so I don't think there will be a conflict. (We have been using the same directory in our pipeline for simplicity.) However, I do think having them named the same thing is a potential source of confusion. Is there a reason not to name the file produced by alevin-fry quant something more like quant.json or quant_info.json to be more in parallel with the collate.json and generate_permit_list.json files generated at those steps? Either way, I agree that merging the files within alevin-fry is probably not the best solution.

The cmd_info.json file, seems a special case: I am not sure what the ultimate goal for that file is; it seems now to be included "for R compatibility" though I am not fully clear on what that means (with .mtx input we don't need it, but tximeta/tximport may be looking for it?). If the final quant output directory does need the file, it would seem to make sense to copy it along somehow from the salmon alevin --rad output directory (with a stop along the way in the collate output I guess?). Presumably the aux_info would also be desired for tximeta if/when alevin-fry support is implemented there?

@rob-p
Copy link
Collaborator

rob-p commented Jul 20, 2021

Hi @jashapiro,

So there is no need for the cmd_info.json file currently for regular R compatibility; this is more specifically about tximport/tximeta. The meta_info.json output by alevin-fry is used in our function to read the USA-mode input --- it records the mode in which alevin-fry quant was run, and some other information. The meta_info.json output by salmon alevin is currently unused, but eventually can be used for provenance tracking of the reference aligned against, since it contains the signature of the reference sequence.

In general though, I agree it would be a good idea to clean up the file names. It's probably a good idea not to mess with any of the names or locations currently used by salmon, so I'll open up a related issue in the alevin-fry repo.

--Rob

@jashapiro
Copy link
Author

Hi @jashapiro,

I've just pushed a commit that should address this, adding both a "full" cmd_info.json and a reasonable meta_info.json. If this is something you need on short notice we can try an push out a 1.5.2 soon, otherwise, this will be in the next release.

Best,
Rob

Hi Rob-

I was happy to wait, but since you were so quick to push out alevin-fry 0.4.1 , it might be nice to have the version bump here too, so we can use all of the available info in our pipeline!

Thanks for your quick responses!
-Josh

@rob-p
Copy link
Collaborator

rob-p commented Jul 23, 2021

Hi Josh,

We've released 1.5.2 that incorporates these changes. Let me know if everything works from your end.

Best,
Rob

@jashapiro
Copy link
Author

Hi Rob,

Seems to work great! Thanks for the updates!

I would consider this issue closed!

-Josh

@rob-p
Copy link
Collaborator

rob-p commented Jul 23, 2021

Excellent! Thanks for reporting it.

@rob-p rob-p closed this as completed Jul 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants