
Use GPU for rasterization #136

Open · KhushalB opened this issue 4 years ago

KhushalB commented 4 years ago

During training, we find that GPU utilization is very low while CPU utilization sits at 100%, as most of the computing time goes into loading the data and preparing the semantic maps. I ran a profiler on the generate_agent_sample function in agent_sampling.py, and around 55% of the time is taken up by rasterization, particularly the transformations. (The rest of the time is spent reading the zarr data.)
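
For reference, the profiling setup was essentially of this shape (a simplified sketch, not my exact script; the paths and config name are placeholders, and the entry points are the standard ones from the l5kit examples):

```python
import cProfile
import pstats

from l5kit.configs import load_config_data
from l5kit.data import ChunkedDataset, LocalDataManager
from l5kit.dataset import AgentDataset
from l5kit.rasterization import build_rasterizer

# Placeholder path and config name: adjust to your local copy of the dataset.
dm = LocalDataManager("/path/to/l5kit_data")
cfg = load_config_data("./agent_motion_config.yaml")

zarr_dataset = ChunkedDataset(dm.require(cfg["train_data_loader"]["key"])).open()
rasterizer = build_rasterizer(cfg, dm)
dataset = AgentDataset(cfg, zarr_dataset, rasterizer)

profiler = cProfile.Profile()
profiler.enable()
for i in range(100):  # each __getitem__ ends up calling generate_agent_sample
    _ = dataset[i]
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```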

Would it be possible to have the rasterizers use the GPU by optionally passing the device, e.g. device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")? I've not worked with rasterization before and I'm not deeply familiar with l5kit yet, so I don't know how much effort it would take to refactor all that rasterization code. It would also help if one of the devs could point out how one could go about this.

lucabergamini commented 4 years ago

Hi @KhushalB, yeah, I've also run the profiler a lot over the past weeks trying to optimise that module. There are two major factors determining the runtime speed currently: reading the data out of the zarr dataset, and the coordinate transformations we apply while building the rasters.

The second one is clearly the one we should aim to optimise at the moment. Technically, you could move that transform operation onto the GPU easily (it's just a dot product), but I don't think it would change much unless we can pack all the operations together, and even then there is the cpu->gpu->cpu data transfer to consider. For rasterisation we use OpenCV, so there's no easy way to jump to the GPU there afaik.

Probably just packing together all the lane_coords and performing a single transform would already give us quite a noticeable speedup. But this one is not trivial either, as lanes have a variable number of points (so a simple reshape won't do).
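
Just to sketch the packing idea (illustrative code, not the current l5kit API):

```python
from typing import List

import numpy as np

def transform_points(points: np.ndarray, tf: np.ndarray) -> np.ndarray:
    # 2D points under a 3x3 homogeneous transform: a single dot product.
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    return (homogeneous @ tf.T)[:, :2]

def transform_lanes(lanes: List[np.ndarray], world_to_image: np.ndarray) -> List[np.ndarray]:
    # Pack all variable-length (N_i, 2) lanes into one array, transform them
    # with a single call, then split back at the original boundaries.
    lengths = [len(lane) for lane in lanes]
    packed = np.concatenate(lanes, axis=0)
    transformed = transform_points(packed, world_to_image)
    return np.split(transformed, np.cumsum(lengths)[:-1])
```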

cxz commented 4 years ago

Yes, I ran some experiments related to this. Vectorizing within a single sample (a call to Rasterizer#rasterize, so that all lanes, boxes, and crosswalks are each transformed in a single transform_points call, while also reducing memory-allocation overhead, np.vstack, etc. along the way) shows that doing the work on the CPU is still faster than the cost of copying to GPU memory, given the number of points involved in a frame. The speedup is noticeable, but only a low two-digit percentage (measured end-to-end on the whole rasterization for py_semantic).

The real gain would be if it was possible to vectorize across samples, or at least pre-compute the world-to-agent-coords transformations ahead of time.

Just thinking out loud, no concrete suggestions yet :-)

lucabergamini commented 4 years ago

Hey @cxz

The real gain would be if it was possible to vectorize across samples, or at least pre-compute the world-to-agent-coords transformations ahead of time.

Yeah, that sadly is not possible. The world_to_image matrix is always different because it intrinsically encodes the agent's location and rotation in that frame (which is unique 99% of the time, I would say).
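
To illustrate why it can't be shared across samples, the matrix is built per agent roughly like this (an illustrative sketch, not the exact code in the repo):

```python
import numpy as np

def build_world_to_image(agent_xy: np.ndarray, agent_yaw: float,
                         pixel_size: float, raster_center_px: np.ndarray) -> np.ndarray:
    # Translate the world so the agent sits at the origin...
    translate = np.array([[1.0, 0.0, -agent_xy[0]],
                          [0.0, 1.0, -agent_xy[1]],
                          [0.0, 0.0, 1.0]])
    # ...rotate by -yaw so the agent looks along +x...
    c, s = np.cos(-agent_yaw), np.sin(-agent_yaw)
    rotate = np.array([[c, -s, 0.0],
                       [s,  c, 0.0],
                       [0.0, 0.0, 1.0]])
    # ...then scale metres to pixels and move the agent to the raster centre.
    to_pixels = np.array([[1.0 / pixel_size, 0.0, raster_center_px[0]],
                          [0.0, 1.0 / pixel_size, raster_center_px[1]],
                          [0.0, 0.0, 1.0]])
    return to_pixels @ rotate @ translate
```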

Just thinking out loud, no concrete suggestions yet :-)

That's more than welcome! I've been optimising this thing for weeks but it's still a huge bottleneck (sometimes I honestly think there must be some bug in the code to justify this performance...)

cxz commented 4 years ago

I saw in one of the referenced papers an argument for the efficiency of the agent-centric coordinate system and heading normalization. Have you had any experience with something different on this specific dataset, or was it implemented this way from the start? I'm thinking the speedup in rasterization from removing heading normalization might compensate for the (metric) gains of having it, or maybe it's an obviously bad idea (tm)? :-)

lucabergamini commented 4 years ago

agent-centric coordinate system and heading normalization

Mmm... not sure what you mean by this. The rasters are already normalised on the heading (the agent is always aligned with the x axis, looking right). Still, the matrix to get there needs to account for the angle.

cxz commented 4 years ago

What I meant above is to remove the heading normalization, and with it all the related algebra.

Another, different idea I had: do you think it would make sense to always rasterize in the AV coordinate system and then fix the raster with a 2-D rotation? (I know it's not equivalent and much less precise, but I'm wondering whether it's good enough for the conditions present in the dataset, and it would be a lot faster.)

lucabergamini commented 4 years ago

do you think it would make sense to always rasterize in the AV coordinate system and then fix the raster with a 2-D rotation? (I know it's not equivalent and much less precise, but I'm wondering whether it's good enough for the conditions present in the dataset, and it would be a lot faster.)

It's technically possible. If we increase the raster size we could rasterize only once and still have enough horizon for all the agents in the 50-metre radius we're currently using. It would still require a massive refactor (or alternatively an LRU cache for the ego raster), as the current unit of work is the agent, while this basically shifts the focus to the frame. I'm thinking about how we could fit this into a PyTorch Dataset object.
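
Very roughly, the cache idea would look something like this (just a sketch; rasterize_frame_av is a placeholder for the expensive per-frame rasterisation, and the crop ignores the agent position for brevity):

```python
from functools import lru_cache

import cv2
import numpy as np

RASTER_SIZE = 448  # assumed: large enough to cover all agents in a 50 m radius
OUT_SIZE = 224     # assumed per-agent output size

def rasterize_frame_av(frame_index: int) -> np.ndarray:
    # Placeholder for the expensive rasterisation of a frame in AV coordinates.
    return np.zeros((RASTER_SIZE, RASTER_SIZE, 3), dtype=np.uint8)

@lru_cache(maxsize=32)
def get_frame_raster(frame_index: int) -> np.ndarray:
    # The expensive step runs once per frame instead of once per agent.
    return rasterize_frame_av(frame_index)

def get_agent_raster(frame_index: int, agent_yaw_deg: float) -> np.ndarray:
    frame_raster = get_frame_raster(frame_index)
    h, w = frame_raster.shape[:2]
    # Approximate the per-agent heading normalisation with a cheap 2-D rotation...
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), agent_yaw_deg, 1.0)
    rotated = cv2.warpAffine(frame_raster, rot, (w, h))
    # ...then crop a window (here simply the centre, for brevity).
    y0 = (h - OUT_SIZE) // 2
    x0 = (w - OUT_SIZE) // 2
    return rotated[y0:y0 + OUT_SIZE, x0:x0 + OUT_SIZE]
```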

cxz commented 4 years ago

Exactly. Well, I'm going to sleep on this and see if I can come up with simpler ideas. Thank you for the replies and for your work on l5kit, and sorry for somewhat hijacking the original issue :-)

KhushalB commented 4 years ago

Thanks for your valuable insights guys. I'll take a deeper look at the rasterizer modules with these in mind and see if anything comes to me.

lucabergamini commented 4 years ago

I'll leave this open so people can add their contributions :) Note that there is also this interesting discussion on Kaggle about the same "issue"

pestipeti commented 4 years ago

I think I found some duplicated calculations. In the generate_agent_sample method you calculate the world-to-image tr-matrix. If I am not mistaken, you calculate the same thing in both the box and the semantic rasterizer as well. You should pass the tr-matrix to the rasterizers instead.

lucabergamini commented 4 years ago

If I am not mistaken then you calculate the same thing in both the box- and the semantic rasterizer as well. You should pass the tr-matrix to the rasterizer

True! Although I think the actual impact of that is quite limited according to my profiler: for 10 frames that would be 11 calls to that function in total, against hundreds of calls to the transform function. Still, let me check the real impact of this in a branch :)

thomasbrandon commented 4 years ago

Having also noted that transform_points was a major performance hotspot, I rewrote it using Numba with good results. I get a 60-90% improvement in synthetic tests (depending on the number of points: 90% for 10 points, down to 60% for 10,000). This improves overall SemanticRasterizer performance by ~40% in single-process tests (iterating an AgentDataset with num_history_frames of 0, so mostly just testing SemanticRasterizer; I focused on this first but should hopefully see some improvement in BoxRasterizer too, looking at that now). You can see the results here along with the numba code.

Not sure if the l5kit maintainers would consider a dependency on Numba, but if so I'm happy to work on upstreaming (or for the maintainers to borrow my code). I haven't had any issues installing Numba and it's available on pip (and Conda) for Linux/Mac/Windows (and also installed on Kaggle), but I have no experience distributing it.

The code is also in a branch on my l5kit fork. It was just written for testing, so it's not necessarily the best way to integrate and there are some choices I'd reconsider (I'd at least probably move transform_points_subpixel into l5kit.transforms and update the transform_points there, and try harder to avoid the code duplication between the subpixel and non-subpixel versions).
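
For reference, the core of the Numba version has roughly this shape (simplified from my branch, so not the exact code):

```python
import numba
import numpy as np

@numba.njit(fastmath=True, cache=True)
def transform_points_nb(points: np.ndarray, transf_matrix: np.ndarray) -> np.ndarray:
    # Apply a 3x3 homogeneous transform to (N, 2) points, one row at a time;
    # the explicit loop compiles to tight machine code under Numba.
    out = np.empty_like(points)
    for i in range(points.shape[0]):
        x = points[i, 0]
        y = points[i, 1]
        out[i, 0] = transf_matrix[0, 0] * x + transf_matrix[0, 1] * y + transf_matrix[0, 2]
        out[i, 1] = transf_matrix[1, 0] * x + transf_matrix[1, 1] * y + transf_matrix[1, 2]
    return out
```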

Testing on Kaggle I did find a different performance pattern with improvements degrading for larger numbers of points (down to no improvement for 10000 points for the non subpixel version). Not quite sure why, but still see a similar 40% improvement in dataset performance (again just testing SemanticRasterizer).

lucabergamini commented 4 years ago

This improves overall SemanticRasterizer performance by ~40% in single-process tests

Well that's great!

Not sure if l5kit maintainers would consider a dependency on Numba

I mean...we have a dependency on jupyter just for the sake of the example notebooks so I don't really see an issue with having also Numba on the bandwagon xD

Testing on Kaggle I did find a different performance pattern with improvements degrading for larger numbers of points (down to no improvement for 10000 points for the non subpixel version)

It may be that the other bottleneck (data loading) is dominant there...

thomasbrandon commented 4 years ago

OK, getting it into shape to PR now. I've also converted the creation of the boxes for BoxRasterizer to Numba, with a good improvement there too (~60% faster iterating the dataset with just a BoxRasterizer and 10 history frames). One question: in draw_boxes the code just uses .astype(np.int64), truncating the floating-point values. Is there any reason it doesn't use the subpixel conversion used for the semantic map here? As part of my changes I've created an l5kit.transform.transform_points_subpixel function (as it shares an internal implementation with transform_points), so I could also use that for BoxRasterizer.
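
For context, the subpixel trick I mean is essentially the following (a simplified sketch; the shift value is just an example, and I'm using int32 since that's what OpenCV's drawing functions expect):

```python
import cv2
import numpy as np

CV2_SHIFT = 8  # example number of fractional bits

def cv2_subpixel(coords: np.ndarray) -> np.ndarray:
    # Scale float pixel coordinates into fixed-point ints that OpenCV will
    # interpret as having CV2_SHIFT fractional bits.
    return (coords * (2 ** CV2_SHIFT)).astype(np.int32)

img = np.zeros((224, 224, 3), dtype=np.uint8)
box = np.array([[10.3, 20.7], [60.1, 20.7], [60.1, 80.2], [10.3, 80.2]])
# With shift=CV2_SHIFT the polygon is drawn at sub-pixel precision instead of
# being truncated to whole pixels by a plain .astype(np.int64).
cv2.fillPoly(img, [cv2_subpixel(box)], color=(255, 255, 255), shift=CV2_SHIFT)
```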

I mean...we have a dependency on jupyter just for the sake of the example notebooks so I don't really see an issue with having also Numba on the bandwagon xD

One option I considered was implementing the Numba routines as optional replacements of the original routines (import numba and replace the functions in a try/except). That would avoid total failure if there are issues compiling the Numba routines (I don't think that's likely, but it is possible). It would also allow people to skip the Numba install and fall back to the numpy versions (not sure why, but there may be some case for it). You could also then have Numba as a purely optional dependency, but that would require people to realise this, so it may not be best. It also makes dev slightly easier as I can compare the implementations, so I'll implement them this way and then we can see if you think it's better to just remove the originals (or wait a version or two and reconsider after some experience).
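
i.e. something like this at module level (names are illustrative; the compiled kernel is the same explicit-loop one as in my earlier comment):

```python
import numpy as np

def _transform_points_np(points: np.ndarray, transf_matrix: np.ndarray) -> np.ndarray:
    # Pure-NumPy reference implementation, always available as the fallback.
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    return (homogeneous @ transf_matrix.T)[:, :2]

transform_points = _transform_points_np  # default binding

try:
    import numba

    @numba.njit(cache=True)
    def _transform_points_nb(points, transf_matrix):
        out = np.empty_like(points)
        for i in range(points.shape[0]):
            x, y = points[i, 0], points[i, 1]
            out[i, 0] = transf_matrix[0, 0] * x + transf_matrix[0, 1] * y + transf_matrix[0, 2]
            out[i, 1] = transf_matrix[1, 0] * x + transf_matrix[1, 1] * y + transf_matrix[1, 2]
        return out

    transform_points = _transform_points_nb
except ImportError:
    # Numba not installed (or failed to import): keep the NumPy version.
    pass
```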

Testing on Kaggle I did find a different performance pattern with improvements degrading for larger numbers of points (down to no improvement for 10000 points for the non subpixel version)

It may be that the other bottleneck (data loading) is dominant there...

Yeah, I would have expected that, but oddly the difference seems to show up more in the synthetic tests (just repeatedly calling the Numba function). So I'm not quite sure there; maybe it's because it's a shared system (or because it's a Xeon).

lucabergamini commented 4 years ago

It also makes dev slightly easier as I can compare the implementations, so I'll implement them this way and then we can see if you think it's better to just remove the originals (or wait a version or two and reconsider after some experience).

Yeah, we can deprecate and decommission the old functions in one or two release cycles.

thomasbrandon commented 4 years ago

To update, I haven't had too much time to spend on this but have now pushed an initial version. Just want to do some final checks and then will do an initial PR.

lucabergamini commented 4 years ago

To update, I haven't had too much time to spend on this but have now pushed an initial version. Just want to do some final checks and then will do an initial PR.

FYI: we're refactoring a lot of code in transforms (which will also affect the rasterisers). You will likely hit merge conflicts at some point, sorry about that :(

thomasbrandon commented 4 years ago

No problem, thanks for the heads-up. Would that just be #150, or are there other changes in the works? I'll check out that PR for conflicts/clashes.

lucabergamini commented 4 years ago

#150 will introduce the core changes, but I expect at least one more PR soon.

morizin commented 3 years ago

Would they get faster if we recoded them in torch instead of numpy? (I think not.)

Herolin12 commented 3 years ago

Hi @KhushalB, I've been struggling with the same problem you mentioned above. Are there any feasible solutions to speed up the training? By the way, your profiling results are fantastic. Would you mind sharing which profiler you used?