Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In-memory data layout #52

Closed
tpanzarella opened this issue Jun 10, 2020 · 15 comments · Fixed by #53
Closed

In-memory data layout #52

tpanzarella opened this issue Jun 10, 2020 · 15 comments · Fixed by #53

Comments

@tpanzarella
Copy link
Member

In #46 I outlined some puzzling issues as it relates to how the driver organizes its in-memory data buffers. I created this notebook to more clearly articulate the issue and its downstream effects.

I'd like to use this issue to discuss this topic and come to a consensus on a path forward. I am happy to take on the work to play with some potential implementations and see how it affects, among other things, the performance of the system.

As is more completely articulated in the notebook linked to above, the current in-memory data layout has implications for performance, usability, and correctness. Per the latter, given how the data are laid out in memory an argument could be made that the meta-data on the point cloud is inconsistent with these data and could be considered a bug. I think as it stands today, users who simply use the PointCloud2 and/or PCL interfaces to these data may be insulated from those issues. However, users developing deeper perception algorithms that access the array data directly, and assume it is modeled consistent with the LiDAR physically, will certainly be affected.

Let's discuss openly and come to consensus.

@SteveMacenski
Copy link
Member

SteveMacenski commented Jun 10, 2020

Can't easily see the notebook, the github rendering for python notebooks is terrible. But I'll take your word for it.

The buffering lambda for structuring data was taken from the ouster client software they provide. Have you done a quick test to see if just structuring the data differently will help performance (e.g. just mess with the image format to me HxW rather than WxH so the buffering isn't missing caches, if that's the issue) before we full blown go to redesign it? The image processors are probably the simplest to understand to do since its a long vector of data (which a PC2 is as well, but less directly). You could also try just publishing one long ordered vector so there's absolutely no misses. It would be garbage to visualize, but it should give you a direct 1:1 for the ideal best case vs the current case to see if that is the cause of any performance issues (e.g. 1xH*W).

@tpanzarella
Copy link
Member Author

Can't easily see the notebook, the github rendering for python notebooks is terrible. But I'll take your word for it.

Maybe try this link instead. I think it is important to digest the notebook as it really elucidates the issue. I want to make sure you are in agreement with me here before moving forward. Without the details presented in the notebook, it will be hard to convey the nature of the problem.

Let me know if you get a chance to read it before I comment on your other suggestions above.

@SteveMacenski
Copy link
Member

SteveMacenski commented Jun 10, 2020

I understood what you meant when you sent this to me in an email. I can glance over that notebook but I'm not a python developer, so its not immediately clear to me important details I'd want to understand (ex (64, 1024) is not very helpful since I don't know how numpy organizes these things in print statements, which is width and which is height. I trust that you know it right, but then its not really meaningful for me to look closely at your analysis if I'm just not familiar with the conventions of numpy, etc. I think python notebooks are not a suitable replacement to an actual report - not that I'm asking for one). So I'm taking your word for it that things are indeed transposed since you play with python more than me. ln[5] is pretty meaningless to me since there's ... you can't possibly tell the structure from that printout so it doesn't tell me much other than the timestamps increase row-wise rather than column-wise. But if they all fired at once, then I don't know why its 0 and then 1464320, shouldn't it be 0 all the way across since the flash lidar is firing all at once? I would think that the rows would all be the same timestamp since they're representing one encoder-tick-triggered firing of the N laser beams.

I think the only place that needs to change is this https://github.com/SteveMacenski/ros2_ouster_drivers/blob/foxy-devel/ros2_ouster/include/ros2_ouster/OS1/OS1_util.hpp#L118 batch iterator function. I think you just want to embed the information differently to be transposed. So I'm not sure we need to be too much on the same page since the change should be small and the result should be evident if it lowers resource utilization. I believe the data is coming in per encoder tick fire (1/1024th azimuth) so its just how we batch up that data when it comes in needs to change. unless what you're proposing is that the information coming over the wire is coming in from full revolution beams, rather than over time beams, which I would be suspect of

Does that make sense? It should be straight forward to change that to test quickly if manipulating the batching up of data will fix your issue.

@tpanzarella
Copy link
Member Author

(ex (64, 1024) is not very helpful since I don't know how numpy organizes these things in print statements, which is width and which is height.

I assumed from the narrative of stating the LiDAR mode (1024x10) and general knowledge of the LiDAR (it always sends back 64 "rings" of data) this would not be ambiguous even without knowledge of Python or numpy. Sorry.

In this case 64 x 1024 is rows x columns or height x width.

I think python notebooks are not a suitable replacement to an actual report - not that I'm asking for one).

OK. There is not that much Python in there though. There is lots of English narrative and supporting graphics as well. That I took quite a bit of time to try to articulate. The Python is there so we can look at data inline with the narrative. It has been my experience that people like to see actual executable code to support a "report".

So I'm taking your word for it that things are indeed transposed since you play with python more than me.

The narrative (English not Python) goes on to explain that it is not just transposed data. The meta data on the point cloud claims 64 x 1024 (rows x cols or height x width). But it is tiling the actual cols transposed into the rows. So, there are really 3 "problems":

  1. The shape is inconsistent with the physical LiDAR
  2. The data are transposed
  3. The data are laid out in memory in Fortran-order (column-major) rather than C-order (row-major).

Now, on the last point, I am not sure if ROS specifies that the data array of an organized PointCloud2 be laid out in row-major or column-major order explicitly, but I would claim that most people would assume row-major. Maybe I am incorrect.

ln[5] is pretty meaningless to me since there's ... you can't possibly tell the structure from that printout so it doesn't tell me much other than the timestamps increase row-wise rather than column-wise.

Fair enough. I explicitly used the word "peek" in the narrative for this reason. However, I'd argue that understanding the Ouster LiDAR data and seeing this time image data and knowing (since it was stated in the analysis) that the LiDAR was running in 1024x10 mode (i.e., 10 Hz) it is pretty clear what is going on with the data (i.e., 0's along the top-left of the array (first col) and 99835904 along the bottom right of the array (last col ... roughly 100 ms later)).

But if they all fired at once, then I don't know why its 0 and then 1464320, shouldn't it be 0 all the way across since the flash lidar is firing all at once? I would think that the rows would all be the same

That is not how it works. Yes it is a flash LiDAR but it only exposes a column at a time. So, in the case I used in the notebook of running in 1024x10 mode, each row of the data array actually contains 16 columns worth of (transposed) data. This is why you do not see 0's all across the top row. Plus, there are 1024 columns per row in the array but only 64 rings (rows) on the LiDAR. Zeros across the first row of the array would assume 1024 rings (given how the data are being laid out in memory). I thought the graphics illustrated how the data are being tiled clearly (again, w/o needing to know any Python or numpy).

I think the only place that needs to change is this https://github.com/SteveMacenski/ros2_ouster_drivers/blob/foxy-devel/ros2_ouster/include/ros2_ouster/OS1/OS1_util.hpp#L118 batch iterator function.

I will take a look at it and provide an implementation that solves the problem I am bringing up here. Once I have it, we can review it. If you are OK with that. I won't be able to look at this again until (at best) end of next week.

@SteveMacenski
Copy link
Member

SteveMacenski commented Jun 11, 2020

Let me read over again this in more detail tomorrow and come back with more tangible thoughts. I finally just got ROS / python3 playing well together on my machine so I can actually reproduce that notebook now and play around to get an intuition. Before now, my machine's been basically bricked with ROS / python3 due to 18.04->20.04 issues (this week until now I've just been debugging and writing documentation in the mean time 😉).

@SteveMacenski
Copy link
Member

SteveMacenski commented Jun 12, 2020

So, in the case I used in the notebook of running in 1024x10 mode, each row of the data array actually contains 16 columns worth of (transposed) data. This is why you do not see 0's all across the top row.

I still don't understand why they're all not 0. If all 16 beams are shot at once, the first column of data should be all zeros (and in this python struct, the first row). We have 64 entries in the row, all 64 should be 0 or some invalid value because there was no exposure because your 16 beam lidar only uses 25% of the slots. From the analysis below, it is because the points aren't actually ordered? If so, then structuring this experiment as a 2D array and publishing images of the "lidar image" is deceptive to the data's meaning. Though I do understand that you want to get it to a point where that "lidar image" isn't deceptive, that's the point I think I'm gathering that you're trying to make.

Doing

In [8]: a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])  
In [9]: a                                                                       
Out[9]: 
array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])         
In [10]: a.shape                                                                
Out[10]: (3, 4)

Shows me that your (64, 1024) is 64 columns and 1024 rows, which is the same as how we use the data in ROS. So the 0, 1561600, 3122688 ..., 95253248, first column in your printed struct is the first column of timestamps (?) from the laser. That seems to check out 1:1 for what we want and how this is structured. Maybe you had a typo thinking that it was transposed?

Later you say in the document

From the data above, we can see that the first column of data is written into the first row of the array. If you look closer, basically all of the columnar data are being tiled into the array along its rows.

Which is suspect to me, since the shapes we just looked at seems to indicate that the first column is indeed the first column in this numpy array. I can't explain the contents of the array, those numbers seem all wrong to me, but the shape seems to indicate that it is correct.

Moving forward in your analysis

We first plot with our grad_cmap to show how the column firing times increase across the scan. What we would have expected was a smooth gradient from left-to-right. This would indicate column firings over time in the direction that the LiDAR spins. Which is, in fact, what happens. But this image shows how the data are being laid out in memory "incorrectly."

I agree if that was indeed a measure of time, that that's very strange. It should be smooth gradient from 0 -> 1024 and all of the rings 0->64 should be the same color. This makes me suspect that what you're plotting isn't actually time (either from a mistake on the sensor or in parsing the packets) or the points aren't actually in a structured order. Lets narrow in on the order assuming that the times are actually correct. You didn't indicate at the start of this analysis what the H5 file is of. If this is a pointcloud message from the output of the driver, then the above could be explained because the order is bogus, but then why are you plotting it as if it is in an image grid? If in ln[10] the plot is after you reorganize the points corresponding to not the order but the azimuth / elevation values, then that makes sense. Please verify that to me. The reshape command is unclear to me what exactly that is sorting.

What I would challenge you to do at this point is do the same thing with the image processors. Those should be in correct orders, because otherwise the image values would look bogus. So if you instead took the intensity image processor or something, and repeated this experiment, it should be in the correct gradient order. Assuming that your assertion here is that the pointcloud isn't in a structured order due to the batch processing. If your assertion that the order is weird from the sensor, then the image processors would be constantly reordering and sticking points around to get a structured image value. It would be valuable to know who's unordering things; the sensor or the batch processor. You could create a time image processor as well and the values of that directly should be a very similar experiment.

tl;dr, after diving into this, I see that there is an issue, but I don't think this experiment identified whether the issue is in the driver or the order that data is sent from the sensor. That should be something relatively easy to determine by having the image / pointcloud processors print to screen the azimuth and elevation numbers on reception. If they're in order, then its the batch processor for the pointcloud. If they're not in order, the batch processor for the pointcloud is just translating the information it receives in the order it receives it.

@SteveMacenski
Copy link
Member

Or I went down the totally wrong rabbit hole and what we're plotting isn't actually time for some reason, but I think from your reshape command, it shows that its probably right. But again, not totally sure how reshape just sorted the times into the correct azimuth / elevation slots since you didn't give it any knowledge about those values.

@tpanzarella
Copy link
Member Author

Shows me that your (64, 1024) is 64 columns and 1024 rows,

No. Look at your data again. It is rows x cols.

I agree if that was indeed a measure of time, that that's very strange

It is time. This is the time image. The units are nanoseconds relative to the first column firing, which is why the first 64 entries are zeros. Then every column after that is relative nanos from 0. Just convert them if you don't believe it. Running at 10 Hz you will see this is a 100 ms scan. Convert the last column, for example, you will see it is ~100 ms after the first column firing. It is correct.

The reshape command is unclear to me what exactly that is sorting.

Two things are happening. First I am reshaping from 64 x 1024 to 1024 x 64. Then the .T takes the transpose of that reshaped array. So, back to 64 x 1024 but the data are now correctly ordered via this view of the memory buffer. However, at the end I point out that even with this correct view we are now in column-major order, which is non-standard in our domain.

Anyway, I have a fix already in place in my tree. It works correctly. I just want to take some performance metrics on it. I think the hit will be minimal. I'll write up the technical details in the PR.

Sorry, it's almost 7PM here on Friday night and I'm trying to get dinner together for my kids, so, I cannot elaborate more at the moment.

@SteveMacenski
Copy link
Member

SteveMacenski commented Jun 12, 2020

Ah got you - sorry, yes, 64x1024, got you.

I think the hit will be minimal.

Ordering the points has a performance hit? That's not great. Is that because the data is coming in wrong or the batching up is wrong?

@tpanzarella
Copy link
Member Author

Ordering the points has a performance hit? That's not great.

Yeah. The performance hit is because once we have the batched scan as one big linear array in column-major order, we want to reorder it to row-major order. To do that, we traverse the source array, by column, which is fine and totally performant / cache friendly as the data are column-ordered. But writing into the destination array, we have no choice but to start filling it in in a very irregular memory access pattern which basically results in cache thrashing. Unfortunately, there is no way around it. As I mentioned above, I'll measure and quantify the impact, which I think will be minimal. I say that, because there was already a copy being done from the PCL data structure to the ROS Message prior to publication. So, we still make a copy, but rather than a big memcpy of the blob of column-ordered point data, we properly reorder it first.

If you are unfamiliar with these issues, I would highly suggest you watch Scott Myer's CPU Caches talk. Specific to this discussion, look at 23:30 - 27:00 in the linked to video. I think it may help elucidate some things. The whole video is well worth your time though if you haven't seen it.

Is that because the data is coming in wrong or the batching up is wrong?

Its tough to say "wrong". On the data input side, the LiDAR fires by column, and batches up 16 columns at a time which are then sent over UDP to the driver. The driver then writes those 16 columns worth of data into the allocated memory buffer (by column). Then it gets the next 16 columns, and so on until there is a whole scan. Once the scan accumulates, that buffer is simply memcpy'd into a ROS PointCloud2 message and sent on its way. The net result is the receiving application is delivered irregularly shaped data.

As of right now, I think the best thing we can do is what I have done in my tree and what I have explained here. I'll get the PR out soon. But, it actually does make me think, and I've had several early conversations about this with some people... maybe we need a new data structure (and associated algorithms to operate on those data) that more closely models 3D ring-based LiDAR. One that recognizes a priori that these things are inherently columnar, sparse, etc. Anyway, that is not our problem to solve in this issue or this driver. I think the PR that I'll submit that reorders the data in the way you would expect will be a major enhancement (bug fix?) to what we are talking about here.

@SteveMacenski
Copy link
Member

I'm familiar with caching.

We can chat in the PR I suppose. I think there's a slightly more intuitive solution that would avoid an extra traversal.

@tpanzarella
Copy link
Member Author

I think there's a slightly more intuitive solution that would avoid an extra traversal.

There is already an implicit copy in the PCL to ROS data structure conversion. This is why I did it there -- same number of copying as before. Just a reordering as well in my implementation. I think changing it elsewhere requires more areas of the code to be touched. Without tests, I am not ready to take on the task to ensure the correctness. We consume the point cloud.

@SteveMacenski
Copy link
Member

SteveMacenski commented Jun 15, 2020

Now there are 2 copies though, which is what I'd like to avoid if possible. There's a copy from the buffered pointcloud -> dense pointcloud -> sensor_msgs/PointCloud2. If we modified the buffering function, then it would be only buffered pointcloud -> sensor_msgs/PointCloud2. I'm pretty sure the PCLPointCloud ROS-PCL shim conversion still involves a copy.

But better than nothing. I might though see if I can't find some time in the next couple of days to prototype a solution.

@tpanzarella
Copy link
Member Author

I'm pretty sure the PCLPointCloud ROS-PCL shim conversion still involves a copy.

Right. So, I eliminated that one, but doing the re-ordering and copying in one step, then just calling pcl_conversions::moveFromPCL(cloud2, msg); which does a move not a copy.

@SteveMacenski
Copy link
Member

Ah ok. That works for me then 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants