Bring CUDA support to Tracking.jl (#33)
* initial commit in the gpu branch

* GPU function blueprints for downconvert & carrier replica

* Gpu carrier generation draft

* downconvert loop blueprint

* Implement carrier replica gpu function

* Implement code replica and corr on GPU

* Delete unnecessary comments

* Reflect the development in readme

* Adjust Gain Controlled Signal for a GPU Signal

* Fix constructor, make use of vector operations

* Remove .vscode garbage, adjust global gitignore

* Change Float16 occurrences to Float32 for performance

* Change GPU carrier_replica to StructArray, adjust the calculation method

* Complete the GPU downconvert function

* Add CUDA package dependency

* GPU correlation and code replica blueprint

* Fix syntax

* Correlation using dot product on GPU

* Fix GPU correlate parameter types

* Add new dependencies

* Fix algorithm, use @views macro for performance

* Optimize carrier generation by taking fewer steps

* Union types for the carrier and code

* Gain control for the gpu signal implemented

* GPU code replica, GNSSSignals 28d32c4324e40a0b93391b06820deea98112a02d

* Add functions from ozmaden/GNSSBenchmarks.jl

* Reflect changes under GNSSSignals#feature/gpu

* Functioning GPU TrackingState

* account for CPU TrackingState

* Reflect GNSSSignals changes for tracking_loop

* Update README for the GPSL1 struct change

* Enforce AbstractArray

* AGC for CUDA signals

* functioning GPU tracking loop

* rectify start_sample

* Fix resize problems

* Stylistic change, variable names small letters

* Replace multiple function calls with a variable

* Remove conditional use_gpu flag, as it's taken care of in GNSSSignals.jl

* cleanup residual errors

* Fix tracking_loop trunc inexact error

* Fix CPU tracking loop

* Implement GPU StructArray gen_carrier_replica!

* Implement GPU StructArray correlate

* Implement GPU StructArray downconvert!

* Allow for both CuArray and StructArray of CuArrays tracking loop

* Performance improvement for the CuArray correlator, implement dot product via Hadamard product for StructArray of CuArrays

* Performance improvements for the StructArray of CuArrays correlate, implement Hadamard correlate

* Performance improvement for the CuArray correlate

* Create match_size_to_signal! function that checks if resizing is needed beforehand

* Delete extra match_size_to_signal! definitions, fix dot products, implement matrix correlation

* Remove Loop Vectorization compat

* Reflect changes in JuliaGNSS:master

* GPU TrackingState, DownconvertedSignalGPU, CarrierReplicaGPU

* GPU tracking state initializes iff signal is known

* GPU Tracking State doesn't need code, insert the main kernel

* Fix phase error in kernel; kernel works for start:end signal; TrackingState code type Nothing for GPU

* Checks for type equality of system.codes and signal, signal structarray assertion

* GPU TrackingState testset

* GPU tracking results testset

* GPU tracking_loop testset, add CUDA to test name

* GPU bit detector testset

* GPU GPSL5 testset

* GPU GPSL1 testset

* GPU GalileoE1B testset

* GPU discriminators testset

* GPU CN0 estimation testset

* GPU BOC testset

* GPU bit buffer testset

* Fix phase calculation (multiples of 2pi)

* Add CUDA tests to runtests includes

* Allow scalar indexing for cn0_estimation test

* Allowscalar deprecation

* Solve scalar indexing in accumulator results

* Fix GPU multi antenna tracking state

* Separate functions for matrix and vector cases

* Allowscalar for tracking loop tests

* Remove CUDA broadcasting functions, clean comments

* Update readme with a `CUDA.jl` example

* Merge ozmaden/Tracking#14

* Adjust GPU functions according to the change #31

* Make CUDA test names consistent

* Add multiple antenna GPU test

* Fix examples according to #31

* Check for signal and codes type consistency

* Add Julia BuildKite CI for CUDA tests

* Remove leftovers

* Remove unnecessary structs

* Remove the unnecessary carrier vector

* Remove unused functions and duplicates

Co-authored-by: Soeren Schoenbrod <soeren.schoenbrod@rwth-aachen.de>
coezmaden and Soeren Schoenbrod authored Nov 17, 2021
1 parent 5c32278 commit fc50604
Showing 20 changed files with 1,017 additions and 19 deletions.
16 changes: 16 additions & 0 deletions .buildkite/pipeline.yml
@@ -0,0 +1,16 @@
env:
SECRET_CODECOV_TOKEN: "Q3fuMdJjaQy9h/uk43rwSqz8M6ulvlCedU2Ir0S3QLP4t9F8cf7pzrTkX+nVhkGycZ/r5FRtTOwPr445R3wK5v9mEAsJN5GMOgI5w/L8m2XDwLmW3PN8RMno+fm2JVxZyPMNNmIQqbYEmmQcBS6Q3nywW3xi0Cl5umJuwDB+NdOFbpq3wc2wrnbOAbwlBJoCJmlH+F4ncuVY6EMmsgNKAf9RqUNWQxIthG616X1cNwuYEpL4dO/PWY2GMXWXTQ8ndO/713p4b5yIlzDP0mr2MrO+1A5fhgPc7Vr+f9mUlIAx+9AsWQYPrqPTkr2L5+mfaTodVE3u2Cop877WJZQD7w==;U2FsdGVkX1/wk2jzfWlRZ66IWgionQK/5Fu0pg3u0b26hhmmMjAjOklyi7QZKhJHjjt4KjK/dJzhd3eK28S0qQ=="

steps:
  - label: "Julia v1.6"
    plugins:
      - JuliaCI/julia#v1:
          version: "1.6"
      - JuliaCI/julia-test#v1: ~
      - JuliaCI/julia-coverage#v1:
          codecov: true
    agents:
      queue: "juliagpu"
      cuda: "*"
    if: build.message !~ /\[skip tests\]/
    timeout_in_minutes: 60
1 change: 1 addition & 0 deletions Project.toml
@@ -4,6 +4,7 @@ authors = ["Soeren Zorn <soeren.zorn@nav.rwth-aachen.de>"]
version = "0.14.8"

[deps]
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
DocStringExtensions = "ffbed154-4ef7-542d-bbb7-c09d3a79fcae"
GNSSSignals = "52c80523-2a4e-5c38-8979-05588f836870"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
32 changes: 28 additions & 4 deletions README.md
@@ -13,6 +13,7 @@ This implements a basic tracking functionality for GNSS signals. The correlation
* Secondary code detection
* Bit detection
* Phased array tracking
* GPU acceleration (CUDA)

## Getting started

@@ -25,15 +26,17 @@ pkg> add Tracking
## Usage

```julia
using GNSSSignals
using Tracking
using Tracking: Hz, GPSL1
using Tracking: Hz
carrier_doppler = 1000Hz
code_phase = 50
sampling_frequency = 2.5e6Hz
prn = 1
state = TrackingState(GPSL1, carrier_doppler, code_phase)
results = track(signal, state, prn, sampling_frequency)
next_results = track(next_signal, get_state(results), prn, sampling_frequency)
gpsl1 = GPSL1()
state = TrackingState(prn, gpsl1, carrier_doppler, code_phase)
results = track(signal, state, sampling_frequency)
next_results = track(next_signal, get_state(results), sampling_frequency)
```

If you'd like to track several signals at once (e.g. in the case of phased antenna arrays), you will have to specify the optional parameter `num_ants::NumAnts{N}` and pass a beamforming function to the `track` function:
@@ -42,3 +45,24 @@ If you'd like to track several signals at once (e.g. in the case of phased anten
state = TrackingState(GPSL1, carrier_doppler, code_phase, num_ants = NumAnts(4)) # 4 antenna channels
results = track(signal, state, prn, sampling_frequency, post_corr_filter = x -> x[1]) # Post corr filter is optional
```

### Usage with `CUDA.jl`
This package supports accelerating the tracking loop on the GPU. At the moment, only `CUDA.jl` is supported. To use it, opt in by passing the following argument when constructing a GNSS system (an `AbstractGNSS`):
``` julia
gpsl1_gpu = GPSL1(use_gpu = Val(true))
```
Beware that `num_samples` must be provided explicitly upon creating a `TrackingState`:
``` julia
state_gpu = TrackingState(prn, gpsl1_gpu, carrier_doppler, code_phase, num_samples = N)
```
Moreover, your signal must be a `StructArray{ComplexF32}` of `CuArray{Float32}` type:
``` julia
using CUDA, StructArrays
signal_cu = CuArray{ComplexF32}(signal_cpu)
signal_gpu = StructArray(signal_cu)
```
Otherwise the usage is identical to the example provided above, including the case for multi-antenna tracking:
``` julia
results_gpu = track(signal_gpu, state_gpu, sampling_frequency)
next_results_gpu = track(next_signal_gpu, get_state(results_gpu), sampling_frequency)
```
8 changes: 5 additions & 3 deletions src/Tracking.jl
@@ -5,9 +5,11 @@ module Tracking
StaticArrays,
TrackingLoopFilters,
StructArrays,
LoopVectorization
LoopVectorization,
CUDA

using Unitful: upreferred, Hz, dBHz, ms
import Base.zero, Base.length, Base.resize!
import Base.zero, Base.length, Base.resize!, LinearAlgebra.dot

export
get_early,
@@ -48,7 +50,7 @@ module Tracking

struct NumAnts{x}
end

NumAnts(x) = NumAnts{x}()

struct NumAccumulators{x}
5 changes: 5 additions & 0 deletions src/carrier_replica.jl
@@ -1,3 +1,8 @@
"""
$(SIGNATURES)
Fixed point CPU StructArray carrier replica generation
"""
function gen_carrier_replica!(
carrier_replica::StructArray{Complex{T}},
carrier_frequency,
121 changes: 120 additions & 1 deletion src/downconvert_and_correlate.jl
@@ -29,4 +29,123 @@ function downconvert_and_correlate(
accumulators_result = complex.(a_re, a_im)
C(map(+, get_accumulators(correlator), accumulators_result))
end
=#
=#

# CUDA Kernel
function downconvert_and_correlate_kernel(
res_re,
res_im,
signal_re,
signal_im,
codes,
code_frequency,
correlator_sample_shifts,
carrier_frequency,
sampling_frequency,
start_code_phase,
carrier_phase,
code_length,
prn,
num_samples,
num_ants,
num_corrs
)
cache = @cuDynamicSharedMem(Float32, (2 * blockDim().x, num_ants, num_corrs))
sample_idx = 1 + ((blockIdx().x - 1) * blockDim().x + (threadIdx().x - 1))
antenna_idx = 1 + ((blockIdx().y - 1) * blockDim().y + (threadIdx().y - 1))
corr_idx = 1 + ((blockIdx().z - 1) * blockDim().z + (threadIdx().z - 1))
iq_offset = blockDim().x
cache_index = threadIdx().x - 1

code_phase = accum_re = accum_im = dw_re = dw_im = carrier_re = carrier_im = 0.0f0
mod_floor_code_phase = Int(0)

if sample_idx <= num_samples && antenna_idx <= num_ants && corr_idx <= num_corrs
# generate carrier
carrier_im, carrier_re = CUDA.sincos(2π * ((sample_idx - 1) * carrier_frequency / sampling_frequency + carrier_phase))

# downconvert with the conjugate of the carrier
dw_re = signal_re[sample_idx, antenna_idx] * carrier_re + signal_im[sample_idx, antenna_idx] * carrier_im
dw_im = signal_im[sample_idx, antenna_idx] * carrier_re - signal_re[sample_idx, antenna_idx] * carrier_im

# calculate the code phase
code_phase = code_frequency / sampling_frequency * ((sample_idx - 1) + correlator_sample_shifts[corr_idx]) + start_code_phase

# wrap the code phase around the code length e.g. phase = 1024 -> modfloorphase = 1
mod_floor_code_phase = 1 + mod(floor(Int32, code_phase), code_length)

# multiply elementwise with the code
accum_re += codes[mod_floor_code_phase, prn] * dw_re
accum_im += codes[mod_floor_code_phase, prn] * dw_im
end

cache[1 + cache_index + 0 * iq_offset, antenna_idx, corr_idx] = accum_re
cache[1 + cache_index + 1 * iq_offset, antenna_idx, corr_idx] = accum_im

## Reduction
# wait until all the accumulators have done writing the results to the cache
sync_threads()

i::Int = blockDim().x ÷ 2
@inbounds while i != 0
if cache_index < i
cache[1 + cache_index + 0 * iq_offset, antenna_idx, corr_idx] += cache[1 + cache_index + 0 * iq_offset + i, antenna_idx, corr_idx]
cache[1 + cache_index + 1 * iq_offset, antenna_idx, corr_idx] += cache[1 + cache_index + 1 * iq_offset + i, antenna_idx, corr_idx]
end
sync_threads()
i ÷= 2
end

if (threadIdx().x - 1) == 0
res_re[blockIdx().x, antenna_idx, corr_idx] += cache[1 + 0 * iq_offset, antenna_idx, corr_idx]
res_im[blockIdx().x, antenna_idx, corr_idx] += cache[1 + 1 * iq_offset, antenna_idx, corr_idx]
end
return nothing
end
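
To make the kernel's arithmetic easier to check, here is a minimal CPU sketch of what a single (antenna, correlator) pair accumulates: carrier generation, downconversion with the conjugate carrier, code-phase wrapping, and the running sum. The name `reference_correlate` and the toy signal are assumptions for illustration and are not part of Tracking.jl. The sketch works in Float64, whereas the kernel uses Float32.

```julia
# Hypothetical CPU reference for one antenna and one correlator tap.
# Not part of Tracking.jl; it only mirrors the per-sample steps of the kernel.
function reference_correlate(
    signal::AbstractVector{<:Complex},
    code::AbstractVector{<:Real},
    code_frequency,
    carrier_frequency,
    sampling_frequency,
    start_code_phase,
    carrier_phase,        # in cycles, since it is multiplied by 2π below
    sample_shift::Integer,
)
    code_length = length(code)
    accum = zero(ComplexF64)
    for sample_idx in 1:length(signal)
        # Carrier replica for this sample
        s, c = sincos(2π * ((sample_idx - 1) * carrier_frequency / sampling_frequency + carrier_phase))
        # Downconvert by multiplying with the conjugate of the carrier
        dw = signal[sample_idx] * complex(c, -s)
        # Code phase of this sample, shifted by the correlator tap
        code_phase = code_frequency / sampling_frequency * ((sample_idx - 1) + sample_shift) + start_code_phase
        # Wrap into 1:code_length
        code_idx = 1 + mod(floor(Int, code_phase), code_length)
        accum += code[code_idx] * dw
    end
    return accum
end

# Toy signal: an 8-chip code sampled at 4 samples per chip on a 2 Hz carrier
code = [1, -1, 1, 1, -1, -1, 1, -1]
code_frequency, carrier_frequency, sampling_frequency = 4.0, 2.0, 16.0
num_samples = 32
signal = [
    code[1 + mod(floor(Int, code_frequency / sampling_frequency * n), length(code))] *
    cis(2π * n * carrier_frequency / sampling_frequency)
    for n in 0:num_samples - 1
]
prompt = reference_correlate(signal, code, code_frequency, carrier_frequency,
                             sampling_frequency, 0.0, 0.0, 0)
```

With a perfectly aligned replica, every sample contributes approximately one, so `prompt` should be close to `num_samples`. A check like this could serve as a CPU baseline when comparing against the GPU result.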

function downconvert_and_correlate_kernel_wrapper(
system,
signal,
correlator,
code_phase,
carrier_phase,
code_frequency,
correlator_sample_shifts,
carrier_frequency,
sampling_frequency,
signal_start_sample,
num_samples_left,
prn
)
num_corrs = length(correlator_sample_shifts)
num_ants = size(signal, 2)
num_samples = size(signal, 1)
block_dim_z = num_corrs
block_dim_y = num_ants
# keep num_corrs and num_ants in separate dimensions, truncate num_samples accordingly to fit
block_dim_x = prevpow(2, 1024 ÷ block_dim_y ÷ block_dim_z)
threads = (block_dim_x, block_dim_y, block_dim_z)
blocks = cld(size(signal, 1), block_dim_x)
res_re = CUDA.zeros(Float32, blocks, block_dim_y, block_dim_z)
res_im = CUDA.zeros(Float32, blocks, block_dim_y, block_dim_z)
shmem_size = sizeof(ComplexF32)*block_dim_x*block_dim_y*block_dim_z
@cuda threads=threads blocks=blocks shmem=shmem_size downconvert_and_correlate_kernel(
res_re,
res_im,
signal.re,
signal.im,
system.codes,
Float32(code_frequency),
correlator_sample_shifts,
Float32(carrier_frequency),
Float32(sampling_frequency),
Float32(code_phase),
Float32(carrier_phase),
size(system.codes, 1),
prn,
num_samples,
num_ants,
num_corrs
)
return sum(res_re .+ 1im*res_im, dims=1)
end
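
The launch geometry in the wrapper can be reproduced on the CPU to see why the x dimension is rounded down to a power of two. The sketch below uses the hypothetical helpers `launch_geometry` and `halving_sum!`, neither of which exists in Tracking.jl. It assumes the `1024` in the wrapper is the per-block thread budget, and it imitates the kernel's halving reduction with an ordinary loop.

```julia
# Hypothetical helper mirroring the thread/block/shared-memory sizes in the wrapper.
function launch_geometry(num_samples, num_ants, num_corrs; max_threads = 1024)
    # Antennas and correlators take the y and z dimensions; samples get what is left,
    # rounded down to a power of two.
    block_dim_x = prevpow(2, max_threads ÷ num_ants ÷ num_corrs)
    threads = (block_dim_x, num_ants, num_corrs)
    blocks = cld(num_samples, block_dim_x)
    # One Float32 for the real part and one for the imaginary part per thread
    shmem_bytes = sizeof(ComplexF32) * block_dim_x * num_ants * num_corrs
    return (; threads, blocks, shmem_bytes)
end

# CPU stand-in for the in-block reduction: each `idx` plays the role of one thread.
# The result is only the full sum when length(cache) is a power of two.
function halving_sum!(cache::AbstractVector)
    i = length(cache) ÷ 2
    while i != 0
        for idx in 1:i
            cache[idx] += cache[idx + i]
        end
        i ÷= 2
    end
    return cache[1]
end

geometry = launch_geometry(2500, 4, 3)
block_sum = halving_sum!(Float32.(1:geometry.threads[1]))
```

If the block length were not a power of two, some halving step would be odd and leave one element unpaired, so it would never reach `cache[1]`. That appears to be the reason for `prevpow(2, ...)` in the wrapper. Each block then writes one partial sum, and the final `sum(..., dims=1)` combines the partial sums across blocks.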
89 changes: 88 additions & 1 deletion src/tracking_loop.jl
@@ -50,6 +50,10 @@ function track(
correlator = get_correlator(state)
num_ants = get_num_ants(correlator)
size(signal, 2) == num_ants || throw(ArgumentError("The second dimension of the signal should be equal to the number of antennas specified by num_ants = NumAnts(N) in the TrackingState."))
if typeof(system.codes) <: CuMatrix
typeof(signal) <: StructArray || throw(ArgumentError("Signal is not a StructArray, initialize the signal properly and try again."))
typeof(signal.re) <: CuArray && typeof(state.system.codes) <: CuArray || throw(ArgumentError("Signal and GNSS codes are not of the same type. Please check if CPU or GPU is used."))
end
downconverted_signal_temp = get_downconverted_signal(state)
downconverted_signal = resize!(downconverted_signal_temp, size(signal, 1), signal)
carrier_replica = get_carrier(state)
@@ -277,6 +281,77 @@ function downconvert_and_correlate!(
)
end

# CUDA downconvert_and_correlate for num_ants > 1
function downconvert_and_correlate!(
system::AbstractGNSS{C},
signal::AbstractMatrix,
correlator::T,
code_replica,
code_phase,
carrier_replica,
carrier_phase,
downconverted_signal,
code_frequency,
correlator_sample_shifts,
carrier_frequency,
sampling_frequency,
signal_start_sample,
num_samples_left,
prn
) where {C <: CuMatrix, T <: AbstractCorrelator}
accumulator_result = downconvert_and_correlate_kernel_wrapper(
system,
view(signal, signal_start_sample:signal_start_sample - 1 + num_samples_left,:),
correlator,
code_phase,
carrier_phase,
code_frequency,
correlator_sample_shifts,
carrier_frequency,
sampling_frequency,
signal_start_sample,
num_samples_left,
prn
)
return T(map(+, get_accumulators(correlator), eachcol(Array(accumulator_result[1,:,:]))))
end

# CUDA downconvert_and_correlate for num_ants = 1
function downconvert_and_correlate!(
system::AbstractGNSS{C},
signal::AbstractVector,
correlator::T,
code_replica,
code_phase,
carrier_replica,
carrier_phase,
downconverted_signal,
code_frequency,
correlator_sample_shifts,
carrier_frequency,
sampling_frequency,
signal_start_sample,
num_samples_left,
prn
) where {C <: CuMatrix, T <: AbstractCorrelator}
accumulator_result = downconvert_and_correlate_kernel_wrapper(
system,
view(signal, signal_start_sample:signal_start_sample - 1 + num_samples_left),
correlator,
code_phase,
carrier_phase,
code_frequency,
correlator_sample_shifts,
carrier_frequency,
sampling_frequency,
signal_start_sample,
num_samples_left,
prn
)
addition(a,b) = a + first(b)
return T(map(addition, get_accumulators(correlator), eachcol(Array(accumulator_result[1,:,:]))))
end
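
The two methods above differ only in how the kernel output is folded back into the accumulators. The following sketch uses plain arrays as stand-ins, which is an assumption since the real accumulators live inside the correlator type. The kernel output has the shape `(1, num_ants, num_corrs)`, so `eachcol` yields one column per correlator tap.

```julia
# Single antenna: each column holds one value, so `first` unwraps it
single_result = reshape(ComplexF32[1 + 1im, 2 + 2im, 3 + 3im], 1, 1, 3)
single_accumulators = ComplexF32[10, 20, 30]
addition(a, b) = a + first(b)
single_updated = map(addition, single_accumulators, eachcol(single_result[1, :, :]))

# Two antennas: each column is a per-antenna vector and is added element-wise
multi_result = reshape(ComplexF32[1, 2, 3, 4, 5, 6], 1, 2, 3)
multi_accumulators = [zeros(ComplexF32, 2) for _ in 1:3]
multi_updated = map(+, multi_accumulators, eachcol(multi_result[1, :, :]))
```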

function choose(replica::CarrierReplicaCPU, signal::AbstractArray{Complex{Float64}})
replica.carrier_f64
end
@@ -289,6 +364,9 @@ end
function choose(replica::DownconvertedSignalCPU, signal::AbstractArray{Complex{T}}) where T <: Number
replica.downconverted_signal_f32
end
function choose(replica::Nothing, signal::AbstractArray)
nothing
end

"""
$(SIGNATURES)
@@ -338,7 +416,7 @@ function get_num_chips_to_integrate(
max_phase = Int(upreferred(get_code_frequency(system) *
get_integration_time(system, max_integration_time, secondary_code_or_bit_found)))
current_phase_mod_max_phase = mod(current_code_phase, max_phase)
max_phase - current_phase_mod_max_phase
return max_phase - current_phase_mod_max_phase
end

"""
@@ -419,4 +497,13 @@ function resize!(ds::DownconvertedSignalCPU, b::Integer, signal::AbstractMatrix{
StructArray{Complex{Float32}}((Matrix{Float32}(undef, b, num_ants), Matrix{Float32}(undef, b, num_ants))),
ds.downconverted_signal_f64
)
end

# No need for resizing when dealing with GPU signals
function resize!(ds::Nothing, b::Integer, signal::AbstractArray)
return ds
end
# No need for resizing the GPU GNSS codes
function resize!(codes::Nothing, b::Integer)
return codes
end
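
The `Nothing` methods above rely on multiple dispatch so that the shared `track` code path can call `resize!` and `choose` unconditionally, whether or not a CPU buffer exists. Here is a minimal sketch of the same pattern. The `CpuBuffer` type and the `match_size!` function are hypothetical and not part of Tracking.jl.

```julia
# Hypothetical CPU-side buffer
struct CpuBuffer
    data::Vector{Float32}
end

# CPU path: grow or shrink the buffer to the requested length
match_size!(buf::CpuBuffer, n::Integer) = (resize!(buf.data, n); buf)
# GPU path: no buffer is stored, so the placeholder is returned unchanged
match_size!(buf::Nothing, n::Integer) = buf

cpu_buf = match_size!(CpuBuffer(Float32[]), 8)
gpu_buf = match_size!(nothing, 8)
```

The caller does not need an `if use_gpu` branch, because the method is selected by the argument type.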