-
Notifications
You must be signed in to change notification settings - Fork 260
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX: Accept any valid delimiters/EOF markers in TCK files #720
Conversation
for fiber_delimiter and eof_delimiter.
Thanks for this. Can you add a test, maybe with a currently failing small trk file? Can you think of a way of speeding up again? Can we first check what the binary format is, then use that in subsequent reads from the same file? |
@matthew-brett Yes, I can try adding a unit test. Currently though.. "make test" takes a really long time.. (it's stuck on 99% CPU usage for >30 min when I try it) Is it normal?
Maybe.. but I think it's a bit risky.. tck file could be generated by concatenating different .tck files from different sources, so depending on the implementation, it could end up with having a mix of different binary formats within a single file (I can think of a case where someone tries to by-pass the same obj. buffer that's loaded from a file and appended it to the new aggregated file) Using a proper float32 NaN/Inf function is slower, but loading 600K fiber would only take a few seconds so I wouldn't worry too much about optimizing this. I think it's best to do this in the right way (and simpler); looking for a specific binary pattern for NaN/Inf or any floating decimal values for that matter is a bit iffy. I've attached a sample .tck file that fails to load with the current implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple comments that may explain at least some of the slowdown. I think you're accidentally coercing to float64, and doing many more appends than necessary.
@effigies Thanks for the suggestion. I've applied a few of your suggestions and now my code runs about 5% faster compared to the current production version (instead of slower). I assumed that numpy.append with an empty array is a no-op, but I guess I was wrong. I've kept most of my original code, as your version doesn't handle the EOF delimiter checking correctly. I think there is a way to make it work.. but I thought mine is a bit simpler. I still can't get the unittest to work, but I've tested my code with various input files I had and made sure that it loads the correct number of fibers/streams. |
For some reason I can't get the Travis-CI page with the test failures - can you describe them? Can you add a small test file written by Matlab to confirm this fixes the read? |
different binary format from the current numpy.nan and numpy.inf
@matthew-brett I've added matlab_nan.tck in tests/data. Please let me know if there is anything else I should do. |
The test failures occur because numpy expects a real file object in |
RF: Use bytearray/frombuffer and other minor fixes
FIX: Restore missing delimiter error message
Codecov Report
@@ Coverage Diff @@
## master #720 +/- ##
=========================================
+ Coverage 89.1% 89.1% +<.01%
=========================================
Files 93 93
Lines 11468 11469 +1
Branches 1991 1990 -1
=========================================
+ Hits 10218 10220 +2
+ Misses 911 910 -1
Partials 339 339
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for tolerating my refactors. One tiny thing I failed to clean up.
@matthew-brett @MarcCote Do one or both of you have time for a quick look? In particular I'd appreciate a check that the logic of the DataError
matches the original intent; I tried to replicate the cases that would produce each message, but the structure changed a bit.
Thanks for the heads up Matthew. |
Thanks for pinging me. I'm not explicitly watching Nibabel repo anymore but I'm happy to help whenever I can. I'll have a look tonight. |
nibabel/streamlines/tck.py
Outdated
n_streams += 1 | ||
begin = delim + 1 | ||
|
||
# The rest gets appended to the leftover |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# The rest gets appended to the leftover | |
# The rest becomes the new leftover. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you apply this suggestion, as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. Thank you @soichih.
Great! Thanks! |
According to mrtrix/tck specification.
The current implementation of streamlines/tck.py / _read() expects that these delimiters are in specific binary format; whatever numpy.nan and numpy.inf happens to be). However, according to IEEE standard, NaN/Inf can have many different binary representations.
Therefore, streamlines generated by other programming languages (such as Matlab) can store these NaN and Inf with a slightly different binary format, and nibabel.streamlines.tck fails to detect these delimieters which results in the following error
I have re-implemented _read() so that it will actually look at each values to see if they are NaN values to determine if the values are stream line delimiters. My code runs around 30% slower (4.7sec v.s. 3.6sec for loading 500k streamlines) but I believe proper checking of NaN/Inf are unavoidable so that this code can load streamlines generated by other programming languages (or numpy might change the definition for numpy.nan and numpy.inf in the future)