Bring CUDA support to Tracking.jl (#33)

* initial commit in the gpu branch
* GPU function blueprints for downconvert & car. rep.
* Gpu carrier generation draft
* downconvert loop blueprint
* Implement carrier replica gpu function
* Implement code replica and corr on GPU
* Delete unnecessary comments
* Reflect the development in readme
* Adjust Gain Controlled Signal for a GPU Signal
* Fix constructor, make use of vector operations
* Remove .vscode garbage, adjust global gitignore
* Change Float16 occurrences to Float32 due to perf.
* Change GPU carrier_replica to StructArray, adjust the calculation method
* Complete the GPU downconvert function
* Add CUDA package dependency
* GPU correlation and code replica blueprint
* Fix syntax
* Correlation using dot product on GPU
* Fix GPU correlate parameter types
* Add new dependencies
* Fix algorithm, use @views macro for performance
* Optimize carrier generation by taking fewer steps
* Union types for the carrier and code
* Gain control for the gpu signal implemented
* GPU code replica, GNSSSignals 28d32c4324e40a0b93391b06820deea98112a02d
* Add functions from ozmaden/GNSSBenchmarks.jl
* Reflect changes under GNSSSignals#feature/gpu
* Functioning GPU TrackingState
* account for CPU TrackingState
* Reflect GNSSSignals changes for tracking_loop
* Update README for the GPSL1 struct change
* Enforce AbstractArray
* AGC for CUDA signals
* functioning GPU tracking loop
* rectify start_sample
* Fix resize problems
* Stylistic change, variable names small letters
* Replace multiple function calls with a variable
* Remove conditional use_gpu flag, as it's taken care of in GNSSSignals.jl
* cleanup residual errors
* Fix tracking_loop trunc inexact error
* Fix CPU tracking loop
* Implement GPU StructArray gen_carrier_replica!
* Implement GPU StructArray correlate
* Implement GPU StructArray downconvert!
* Allow for both CuArray and StructArray of CuArrays tracking loop
* Performance improvement for the CuArray correlator, implement dot product via Hadamard product for StructArray of CuArrays
* Performance improvements for the StructArray of CuArrays correlate, implement Hadamard correlate
* Performance improvement for the CuArray correlate
* Create match_size_to_signal! function that checks if resizing is needed beforehand
* Delete extra match_size_to_signal! definitions, fix dot products, implement matrix correlation
* Remove Loop Vectorization compat
* Reflect changes in JuliaGNSS:master
* GPU TrackingState, DownconvertedSignalGPU, CarrierReplicaGPU
* GPU tracking state initializes iff signal is known
* GPU Tracking State doesn't need code, insert the main kernel
* Fix phase error in kernel; kernel works for start:end signal; TrackingState code type Nothing for GPU
* Checks for type equality of system.codes and signal, signal structarray assertion
* GPU TrackingState testset
* GPU tracking results testset
* GPU tracking_loop testset, add CUDA to test name
* GPU bit detector testset
* GPU GPSL5 testset
* GPU GPSL1 testset
* GPU GalileoE1B testset
* GPU discriminators testset
* GPU CN0 estimation testset
* GPU BOC testset
* GPU bit buffer testset
* Fix phase calculation (multiples of 2pi)
* Add CUDA tests to runtests includes
* Allow scalar indexing for cn0_estimation test
* Allowscalar deprecation
* Solve scalar indexing in accumulator results
* Fix GPU multi antenna tracking state
* Separate functions for matrix and vector cases
* Allowscalar for tracking loop tests
* Remove CUDA broadcasting functions, clean comments
* Update readme with a `CUDA.jl` example
* Merge ozmaden/Tracking#14
* Adjust GPU functions according to the change #31
* Make CUDA test names consistent
* Add multiple antenna GPU test
* Fix examples according to #31
* Check for signal and codes type consistency
* Add Julia BuildKite CI for CUDA tests
* Remove leftovers
* Remove unnecessary structs
* Remove the unnecessary carrier vector
* Remove unused functions and duplicates

Co-authored-by: Soeren Schoenbrod <soeren.schoenbrod@rwth-aachen.de>
JuliaGNSS · Nov 17, 2021 · fc50604
1 parent 5c32278
Showing 20 changed files with 1,017 additions and 19 deletions.
diff --git a/.buildkite/pipeline.yml b/.buildkite/pipeline.yml
@@ -0,0 +1,16 @@
+env:
+  SECRET_CODECOV_TOKEN: "Q3fuMdJjaQy9h/uk43rwSqz8M6ulvlCedU2Ir0S3QLP4t9F8cf7pzrTkX+nVhkGycZ/r5FRtTOwPr445R3wK5v9mEAsJN5GMOgI5w/L8m2XDwLmW3PN8RMno+fm2JVxZyPMNNmIQqbYEmmQcBS6Q3nywW3xi0Cl5umJuwDB+NdOFbpq3wc2wrnbOAbwlBJoCJmlH+F4ncuVY6EMmsgNKAf9RqUNWQxIthG616X1cNwuYEpL4dO/PWY2GMXWXTQ8ndO/713p4b5yIlzDP0mr2MrO+1A5fhgPc7Vr+f9mUlIAx+9AsWQYPrqPTkr2L5+mfaTodVE3u2Cop877WJZQD7w==;U2FsdGVkX1/wk2jzfWlRZ66IWgionQK/5Fu0pg3u0b26hhmmMjAjOklyi7QZKhJHjjt4KjK/dJzhd3eK28S0qQ=="
+
+steps:
+  - label: "Julia v1.6"
+    plugins:
+      - JuliaCI/julia#v1:
+          version: "1.6"
+      - JuliaCI/julia-test#v1: ~
+      - JuliaCI/julia-coverage#v1:
+          codecov: true
+    agents:
+      queue: "juliagpu"
+      cuda: "*"
+    if: build.message !~ /\[skip tests\]/
+    timeout_in_minutes: 60
diff --git a/Project.toml b/Project.toml
@@ -4,6 +4,7 @@ authors = ["Soeren Zorn <soeren.zorn@nav.rwth-aachen.de>"]
 version = "0.14.8"
 
 [deps]
+CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
 DocStringExtensions = "ffbed154-4ef7-542d-bbb7-c09d3a79fcae"
 GNSSSignals = "52c80523-2a4e-5c38-8979-05588f836870"
 LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"

diff --git a/README.md b/README.md
@@ -13,6 +13,7 @@ This implements a basic tracking functionality for GNSS signals. The correlation
 * Secondary code detection
 * Bit detection
 * Phased array tracking
+* GPU acceleration (CUDA)
 
 ## Getting started
 
@@ -25,15 +26,17 @@ pkg> add Tracking
 ## Usage
 
 ```julia
+using GNSSSignals
 using Tracking
-using Tracking: Hz, GPSL1
+using Tracking: Hz
 carrier_doppler = 1000Hz
 code_phase = 50
 sampling_frequency = 2.5e6Hz
 prn = 1
-state = TrackingState(GPSL1, carrier_doppler, code_phase)
-results = track(signal, state, prn, sampling_frequency)
-next_results = track(next_signal, get_state(results), prn, sampling_frequency)
+gpsl1 = GPSL1()
+state = TrackingState(prn, gpsl1, carrier_doppler, code_phase)
+results = track(signal, state, sampling_frequency)
+next_results = track(next_signal, get_state(results), sampling_frequency)
 ```
 
 If you'd like to track several signals at once (e.g. in the case of phased antenna arrays), you will have to specify the optional parameter `num_ants::NumAnts{N}` and pass a beamforming function to the `track` function:
@@ -42,3 +45,24 @@ If you'd like to track several signals at once (e.g. in the case of phased anten
 state = TrackingState(GPSL1, carrier_doppler, code_phase, num_ants = NumAnts(4)) # 4 antenna channels
 results = track(signal, state, prn, sampling_frequency, post_corr_filter = x -> x[1]) # Post corr filter is optional
 ```
+
+### Usage with `CUDA.jl`
+This package supports accelerating the tracking loop on the GPU. At the moment, only `CUDA.jl` is supported. To use it, opt in by passing the following argument when creating an `AbstractGNSS`:
+``` julia
+gpsl1_gpu = GPSL1(use_gpu = Val(true))
+```
+Beware that `num_samples` must be provided explicitly upon creating a `TrackingState`:
+``` julia
+state_gpu = TrackingState(prn, gpsl1_gpu, carrier_doppler, code_phase, num_samples = N)
+```
+Moreover, your signal must be a `StructArray{ComplexF32}` whose real and imaginary parts are each a `CuArray{Float32}`:
+``` julia
+using CUDA, StructArrays
+signal_cu = CuArray{ComplexF32}(signal_cpu)
+signal_gpu = StructArray(signal_cu)
+```
+Otherwise the usage is identical to the example provided above, including the case for multi-antenna tracking:
+``` julia
+results_gpu = track(signal_gpu, state_gpu, sampling_frequency)
+next_results_gpu = track(next_signal_gpu, get_state(results_gpu), sampling_frequency)
+```
diff --git a/src/Tracking.jl b/src/Tracking.jl
@@ -5,9 +5,11 @@ module Tracking
         StaticArrays,
         TrackingLoopFilters,
         StructArrays,
-        LoopVectorization
+        LoopVectorization,
+        CUDA
+
     using Unitful: upreferred, Hz, dBHz, ms
-    import Base.zero, Base.length, Base.resize!
+    import Base.zero, Base.length, Base.resize!, LinearAlgebra.dot  
 
     export
         get_early,
@@ -48,7 +50,7 @@ module Tracking
 
     struct NumAnts{x}
     end
-
+    
     NumAnts(x) = NumAnts{x}()
 
     struct NumAccumulators{x}

diff --git a/src/carrier_replica.jl b/src/carrier_replica.jl
@@ -1,3 +1,8 @@
+"""
+$(SIGNATURES)
+
+Fixed point CPU StructArray carrier replica generation
+"""
 function gen_carrier_replica!(
     carrier_replica::StructArray{Complex{T}},
     carrier_frequency,

diff --git a/src/downconvert_and_correlate.jl b/src/downconvert_and_correlate.jl
@@ -29,4 +29,123 @@ function downconvert_and_correlate(
     accumulators_result = complex.(a_re, a_im)
     C(map(+, get_accumulators(correlator), accumulators_result))
 end
-=#
+=#
+
+# CUDA Kernel 
+function downconvert_and_correlate_kernel(
+    res_re,
+    res_im,
+    signal_re,
+    signal_im,
+    codes,
+    code_frequency,
+    correlator_sample_shifts,
+    carrier_frequency,
+    sampling_frequency,
+    start_code_phase,
+    carrier_phase,
+    code_length,
+    prn,
+    num_samples,
+    num_ants,
+    num_corrs
+)   
+    cache = @cuDynamicSharedMem(Float32, (2 * blockDim().x, num_ants, num_corrs))   
+    sample_idx   = 1 + ((blockIdx().x - 1) * blockDim().x + (threadIdx().x - 1))
+    antenna_idx  = 1 + ((blockIdx().y - 1) * blockDim().y + (threadIdx().y - 1))
+    corr_idx     = 1 + ((blockIdx().z - 1) * blockDim().z + (threadIdx().z - 1))
+    iq_offset = blockDim().x
+    cache_index = threadIdx().x - 1 
+
+    code_phase = accum_re = accum_im = dw_re = dw_im = carrier_re = carrier_im = 0.0f0
+    mod_floor_code_phase = Int(0)
+
+    if sample_idx <= num_samples && antenna_idx <= num_ants && corr_idx <= num_corrs
+        # generate carrier
+        carrier_im, carrier_re = CUDA.sincos(2π * ((sample_idx - 1) * carrier_frequency / sampling_frequency + carrier_phase))
+
+        # downconvert with the conjugate of the carrier
+        dw_re = signal_re[sample_idx, antenna_idx] * carrier_re + signal_im[sample_idx, antenna_idx] * carrier_im
+        dw_im = signal_im[sample_idx, antenna_idx] * carrier_re - signal_re[sample_idx, antenna_idx] * carrier_im
+
+        # calculate the code phase
+        code_phase = code_frequency / sampling_frequency * ((sample_idx - 1) + correlator_sample_shifts[corr_idx]) + start_code_phase
+
+        # wrap the code phase around the code length (1-based index), e.g. code_length = 1023, code_phase = 1023.2 -> mod_floor_code_phase = 1
+        mod_floor_code_phase = 1 + mod(floor(Int32, code_phase), code_length)
+
+        # multiply elementwise with the code
+        accum_re += codes[mod_floor_code_phase, prn] * dw_re
+        accum_im += codes[mod_floor_code_phase, prn] * dw_im
+    end
+
+    cache[1 + cache_index + 0 * iq_offset, antenna_idx, corr_idx] = accum_re
+    cache[1 + cache_index + 1 * iq_offset, antenna_idx, corr_idx] = accum_im
+
+    ## Reduction
+    # wait until all threads have finished writing their results to the cache
+    sync_threads()
+
+    i::Int = blockDim().x ÷ 2
+    @inbounds while i != 0
+        if cache_index < i
+            cache[1 + cache_index + 0 * iq_offset, antenna_idx, corr_idx] += cache[1 + cache_index + 0 * iq_offset + i, antenna_idx, corr_idx]
+            cache[1 + cache_index + 1 * iq_offset, antenna_idx, corr_idx] += cache[1 + cache_index + 1 * iq_offset + i, antenna_idx, corr_idx]
+        end
+        sync_threads()
+        i ÷= 2
+    end
+
+    if (threadIdx().x - 1) == 0
+        res_re[blockIdx().x, antenna_idx, corr_idx] += cache[1 + 0 * iq_offset, antenna_idx, corr_idx]
+        res_im[blockIdx().x, antenna_idx, corr_idx] += cache[1 + 1 * iq_offset, antenna_idx, corr_idx]
+    end
+    return nothing
+end
+
+function downconvert_and_correlate_kernel_wrapper(
+    system,
+    signal,
+    correlator,
+    code_phase,
+    carrier_phase,
+    code_frequency,
+    correlator_sample_shifts,
+    carrier_frequency,
+    sampling_frequency,
+    signal_start_sample,
+    num_samples_left,
+    prn
+)
+    num_corrs = length(correlator_sample_shifts)
+    num_ants = size(signal, 2)
+    num_samples = size(signal, 1)
+    block_dim_z = num_corrs
+    block_dim_y = num_ants
+    # keep num_corrs and num_ants in separate dimensions, truncate num_samples accordingly to fit
+    block_dim_x = prevpow(2, 1024 ÷ block_dim_y ÷ block_dim_z)
+    threads = (block_dim_x, block_dim_y, block_dim_z)
+    blocks = cld(size(signal, 1), block_dim_x)
+    res_re = CUDA.zeros(Float32, blocks, block_dim_y, block_dim_z)
+    res_im = CUDA.zeros(Float32, blocks, block_dim_y, block_dim_z)
+    shmem_size = sizeof(ComplexF32)*block_dim_x*block_dim_y*block_dim_z
+    @cuda threads=threads blocks=blocks shmem=shmem_size downconvert_and_correlate_kernel(
+        res_re, 
+        res_im, 
+        signal.re, 
+        signal.im,
+        system.codes,
+        Float32(code_frequency),
+        correlator_sample_shifts,
+        Float32(carrier_frequency),
+        Float32(sampling_frequency),
+        Float32(code_phase),
+        Float32(carrier_phase),
+        size(system.codes, 1),
+        prn,
+        num_samples, 
+        num_ants,
+        num_corrs
+    )
+    return sum(res_re .+ 1im*res_im, dims=1)
+end
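The launch geometry and the in-block reduction used by the wrapper and kernel above can be tried out on the CPU. The following is a minimal sketch, not part of the commit: `blockwise_sum` is a hypothetical helper that reuses the wrapper's sizing rule (`prevpow(2, 1024 ÷ num_ants ÷ num_corrs)` and `cld`) and emulates the shared-memory tree reduction with a plain `Vector`. It assumes a single antenna/correlator slice and sequential "threads", so it shows the arithmetic only, not the synchronization.

```julia
# Hypothetical CPU emulation of the launch geometry and per-block reduction.
# Not part of Tracking.jl; names are illustrative only.
function blockwise_sum(values::AbstractVector{Float32}, num_ants::Integer, num_corrs::Integer)
    # same sizing rule as the wrapper: a power-of-two number of sample threads,
    # chosen so that block_dim_x * num_ants * num_corrs stays within 1024
    block_dim_x = prevpow(2, 1024 ÷ num_ants ÷ num_corrs)
    blocks = cld(length(values), block_dim_x)
    partial_sums = zeros(Float32, blocks)
    for block in 1:blocks
        # each emulated thread stores one sample; out-of-range threads store zero
        cache = zeros(Float32, block_dim_x)
        for thread in 1:block_dim_x
            sample_idx = (block - 1) * block_dim_x + thread
            if sample_idx <= length(values)
                cache[thread] = values[sample_idx]
            end
        end
        # tree reduction: halve the active range until one value is left
        i = block_dim_x ÷ 2
        while i != 0
            for thread in 1:i
                cache[thread] += cache[thread + i]
            end
            i ÷= 2
        end
        partial_sums[block] = cache[1]
    end
    # the wrapper performs this last step with `sum(..., dims=1)`
    return block_dim_x, blocks, sum(partial_sums)
end
```

The power-of-two block size matters here: the halving loop only visits every cache entry when `block_dim_x` is a power of two, which is presumably why the wrapper rounds down with `prevpow`.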
diff --git a/src/tracking_loop.jl b/src/tracking_loop.jl
@@ -50,6 +50,10 @@ function track(
     correlator = get_correlator(state)
     num_ants = get_num_ants(correlator)
     size(signal, 2) == num_ants || throw(ArgumentError("The second dimension of the signal should be equal to the number of antennas specified by num_ants = NumAnts(N) in the TrackingState."))
+    if typeof(system.codes) <: CuMatrix
+        typeof(signal) <: StructArray || throw(ArgumentError("Signal is not a StructArray, initialize the signal properly and try again."))
+        typeof(signal.re) <: CuArray && typeof(state.system.codes) <: CuArray || throw(ArgumentError("Signal and GNSS codes are not of the same type. Please check if CPU or GPU is used."))
+    end
     downconverted_signal_temp = get_downconverted_signal(state)
     downconverted_signal = resize!(downconverted_signal_temp, size(signal, 1), signal)
     carrier_replica = get_carrier(state)
@@ -277,6 +281,77 @@ function downconvert_and_correlate!(
     )
 end
 
+# CUDA downconvert_and_correlate for num_ants > 1
+function downconvert_and_correlate!(
+    system::AbstractGNSS{C},
+    signal::AbstractMatrix,
+    correlator::T,
+    code_replica,
+    code_phase,
+    carrier_replica,
+    carrier_phase,
+    downconverted_signal,
+    code_frequency,
+    correlator_sample_shifts,
+    carrier_frequency,
+    sampling_frequency,
+    signal_start_sample,
+    num_samples_left,
+    prn
+) where {C <: CuMatrix, T <: AbstractCorrelator}
+    accumulator_result = downconvert_and_correlate_kernel_wrapper(
+        system,
+        view(signal, signal_start_sample:signal_start_sample - 1 + num_samples_left,:),
+        correlator,
+        code_phase,
+        carrier_phase,
+        code_frequency,
+        correlator_sample_shifts,
+        carrier_frequency,
+        sampling_frequency,
+        signal_start_sample,
+        num_samples_left,
+        prn
+    )
+    return T(map(+, get_accumulators(correlator), eachcol(Array(accumulator_result[1,:,:]))))
+end
+
+# CUDA downconvert_and_correlate for num_ants = 1
+function downconvert_and_correlate!(
+    system::AbstractGNSS{C},
+    signal::AbstractVector,
+    correlator::T,
+    code_replica,
+    code_phase,
+    carrier_replica,
+    carrier_phase,
+    downconverted_signal,
+    code_frequency,
+    correlator_sample_shifts,
+    carrier_frequency,
+    sampling_frequency,
+    signal_start_sample,
+    num_samples_left,
+    prn
+) where {C <: CuMatrix, T <: AbstractCorrelator}
+    accumulator_result = downconvert_and_correlate_kernel_wrapper(
+        system,
+        view(signal, signal_start_sample:signal_start_sample - 1 + num_samples_left),
+        correlator,
+        code_phase,
+        carrier_phase,
+        code_frequency,
+        correlator_sample_shifts,
+        carrier_frequency,
+        sampling_frequency,
+        signal_start_sample,
+        num_samples_left,
+        prn
+    )
+    addition(a,b) = a + first(b)
+    return T(map(addition, get_accumulators(correlator), eachcol(Array(accumulator_result[1,:,:]))))
+end
+
 function choose(replica::CarrierReplicaCPU, signal::AbstractArray{Complex{Float64}})
     replica.carrier_f64
 end
@@ -289,6 +364,9 @@ end
 function choose(replica::DownconvertedSignalCPU, signal::AbstractArray{Complex{T}}) where T <: Number
     replica.downconverted_signal_f32
 end
+function choose(replica::Nothing, signal::AbstractArray)
+    nothing
+end
 
 """
 $(SIGNATURES)
@@ -338,7 +416,7 @@ function get_num_chips_to_integrate(
     max_phase = Int(upreferred(get_code_frequency(system) *
         get_integration_time(system, max_integration_time, secondary_code_or_bit_found)))
     current_phase_mod_max_phase = mod(current_code_phase, max_phase)
-    max_phase - current_phase_mod_max_phase
+    return max_phase - current_phase_mod_max_phase
 end
 
 """
@@ -419,4 +497,13 @@ function resize!(ds::DownconvertedSignalCPU, b::Integer, signal::AbstractMatrix{
             StructArray{Complex{Float32}}((Matrix{Float32}(undef, b, num_ants), Matrix{Float32}(undef, b, num_ants))),
         ds.downconverted_signal_f64
     )
+end
+
+# No need for resizing when dealing with GPU signals
+function resize!(ds::Nothing, b::Integer, signal::AbstractArray)
+    return ds
+end
+# No need for resizing the GPU GNSS codes
+function resize!(codes::Nothing, b::Integer)
+    return codes
 end
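For checking the CUDA `downconvert_and_correlate!` methods above against a known answer, a plain CPU reference of the same per-sample arithmetic can be useful. The sketch below is an assumption-laden illustration, not code from the commit: `downconvert_and_correlate_reference` is a hypothetical name, it handles a single antenna and a single PRN code vector, and it treats `carrier_phase` as a value in cycles because the kernel multiplies it by `2π`.

```julia
# CPU reference sketch (hypothetical helper, not part of Tracking.jl).
# Mirrors the kernel's per-sample arithmetic for one antenna and one PRN.
function downconvert_and_correlate_reference(
    signal::AbstractVector{<:Complex},
    code::AbstractVector{<:Real},      # one code period of a single PRN
    correlator_sample_shifts,          # e.g. -1:1 for early, prompt, late
    code_frequency,
    carrier_frequency,
    sampling_frequency,
    start_code_phase,
    carrier_phase,                     # in cycles, as in the kernel
)
    code_length = length(code)
    accumulators = zeros(ComplexF64, length(correlator_sample_shifts))
    for (corr_idx, shift) in enumerate(correlator_sample_shifts)
        for sample_idx in 1:length(signal)
            # carrier replica at this sample
            s, c = sincos(2π * ((sample_idx - 1) * carrier_frequency / sampling_frequency + carrier_phase))
            # downconvert with the conjugate of the carrier
            downconverted = signal[sample_idx] * complex(c, -s)
            # code phase of the shifted replica, wrapped to 1:code_length
            code_phase = code_frequency / sampling_frequency * ((sample_idx - 1) + shift) + start_code_phase
            code_idx = 1 + mod(floor(Int, code_phase), code_length)
            # multiply with the code chip and accumulate
            accumulators[corr_idx] += code[code_idx] * downconverted
        end
    end
    return accumulators
end
```

With a synthetic signal built from the same code and carrier, the prompt accumulator (shift `0`) should come out close to the number of samples, which gives a simple sanity check for the GPU path.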