
Check efficiency of RT implementation #637

Closed
fverdugo opened this issue Aug 12, 2021 · 14 comments
Assignees
Labels
performance Performance enhancements

Comments

@fverdugo
Member

By looking at

ContraVariantPiolaMap()(get_shapefuns(reffe),phi,sign_flip)

I would say that we are reevaluating the reference polynomials each time we visit a new cell.

@fverdugo
Member Author

I think that we need to replace

    lazy_map(_transform_rt_shapefuns,
             cell_reffe,
             get_cell_map(Triangulation(model)),
             lazy_map(Broadcasting(constant_field), sign_flip))

by

cell_fs = lazy_map(get_shape_functions,cell_reffe)
lazy_map(ContraPiolaMap(),cell_fs,
             get_cell_map(Triangulation(model)),
             lazy_map(Broadcasting(constant_field), sign_flip))

and remove _transform_rt_shapefuns.

Perhaps the same applies to the cell dofs.
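The gist of the proposed change can be illustrated with a plain-Julia sketch (hypothetical names, not the Gridap API): evaluate the reference shape functions once, outside the per-cell map, instead of re-deriving them every time a cell is visited.

```julia
# Sketch of the optimization: hoist a computation that is identical for all
# cells out of the per-cell loop. All names below are illustrative only.

struct RefFE end  # stand-in for a reference finite element

# Pretend this is expensive (it re-derives the reference polynomials).
shapefuns(::RefFE) = [1.0, 2.0, 3.0]

# Naive: recomputes the reference shape functions for every cell.
naive(cell_reffe) = [sum(shapefuns(r)) for r in cell_reffe]

# Optimized: all cells share the same reference FE, so compute once and reuse.
function optimized(cell_reffe)
    fs = shapefuns(first(cell_reffe))
    [sum(fs) for _ in cell_reffe]
end

cells = fill(RefFE(), 4)
@assert naive(cells) == optimized(cells) == fill(6.0, 4)
```

In Gridap, a similar effect comes from `lazy_map` specializing on `Fill` arrays, so the mapped function is evaluated once per reference FE rather than once per cell.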

@amartinhuertas
Member

Base line commit 67e5ee1 (2D Darcy on a box)

Profile data attached below

10×10 DataFrame
 Row │ Assembly RT [s]  n      k      d      Total RT [s]  num_cells  l2_err_norm  Preassembly RT [s]  experiment      Solve RT [s] 
     │ Float64          Int64  Int64  Int64  Float64       Int64      Float64      Float64             String          Float64      
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │        0.33423      10      0      2      0.337986        100  9.02765e-15          0.00177146  solve_darcy_rt    0.00198408
   2 │        0.366117     20      0      2      0.378329        400  3.08589e-14          0.00284878  solve_darcy_rt    0.00816082
   3 │        0.374344     30      0      2      0.403508        900  3.03932e-14          0.00273725  solve_darcy_rt    0.0226722
   4 │        0.387381     40      0      2      0.430464       1600  7.15834e-14          0.00387972  solve_darcy_rt    0.0282138
   5 │        0.4075       50      0      2      0.47579        2500  9.25386e-14          0.00481167  solve_darcy_rt    0.0538609
   6 │        0.432286     60      0      2      0.534182       3600  1.78746e-13          0.00550788  solve_darcy_rt    0.0934964
   7 │        0.456456     70      0      2      0.594821       4900  2.19828e-13          0.00683736  solve_darcy_rt    0.129203
   8 │        0.486746     80      0      2      0.682113       6400  2.63806e-13          0.00761905  solve_darcy_rt    0.18718
   9 │        0.512964     90      0      2      0.77064        8100  3.08166e-13          0.00885309  solve_darcy_rt    0.243275
  10 │        0.545432    100      0      2      0.859867      10000  1.90947e-13          0.00986676  solve_darcy_rt    0.302959

Screenshot from 2021-08-13 11-58-53

profile_bef_opts.zip

@amartinhuertas
Member

amartinhuertas commented Aug 13, 2021

After the first optimization pointed out by @fverdugo above (92a69ab):

We have improved performance: the width of the mountains in the left-hand-side panel (assembly) has been reduced. There is still a type instability in the evaluation of the monomial basis (I will take a look). The huge amount of time spent in the explicit GC call after assembly is surprising.

Anyway, I guess @fverdugo is already aware of this ...

10×10 DataFrame
 Row │ Assembly RT [s]  n      k      d      Total RT [s]  num_cells  l2_err_norm  Preassembly RT [s]  experiment      Solve RT [s] 
     │ Float64          Int64  Int64  Int64  Float64       Int64      Float64      Float64             String          Float64      
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │        0.264812     10      0      2      0.268516        100  9.02765e-15          0.00176761  solve_darcy_rt    0.00165767
   2 │        0.285107     20      0      2      0.296845        400  3.08589e-14          0.00287766  solve_darcy_rt    0.00787332
   3 │        0.295462     30      0      2      0.325127        900  3.03932e-14          0.00289937  solve_darcy_rt    0.0256846
   4 │        0.303083     40      0      2      0.353405       1600  7.15834e-14          0.0038059   solve_darcy_rt    0.0411777
   5 │        0.316631     50      0      2      0.396771       2500  9.25386e-14          0.00445374  solve_darcy_rt    0.0594063
   6 │        0.334186     60      0      2      0.443978       3600  1.78746e-13          0.00535773  solve_darcy_rt    0.0943956
   7 │        0.35817      70      0      2      0.497818       4900  2.19828e-13          0.00622951  solve_darcy_rt    0.128155
   8 │        0.380576     80      0      2      0.573763       6400  2.63806e-13          0.00737641  solve_darcy_rt    0.185429
   9 │        0.401534     90      0      2      0.655064       8100  3.08166e-13          0.00713316  solve_darcy_rt    0.244715
  10 │        0.425449    100      0      2      0.752229      10000  1.90947e-13          0.00877073  solve_darcy_rt    0.31425

Screenshot from 2021-08-13 13-22-43

profile_aft.op1.zip
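As an aside, type instabilities like the one suspected in the monomial-basis evaluation are usually diagnosed with `@code_warntype` or `Base.return_types` (standard Julia tooling; the functions below are toy stand-ins, not Gridap code):

```julia
# A toy type instability: the return type depends on a runtime value,
# so the compiler can only infer a Union.
unstable(flag) = flag ? 1 : 1.0   # Int or Float64, decided at runtime
stable(flag)   = flag ? 1.0 : 2.0 # always Float64

# `@code_warntype unstable(true)` highlights the Union in red in the REPL;
# `Base.return_types` exposes the same inference result programmatically.
@assert Base.return_types(unstable, (Bool,))[1] == Union{Int, Float64}
@assert Base.return_types(stable, (Bool,))[1] == Float64
```

Union-typed intermediates in an inner assembly loop typically show up in the flamegraph as dynamic-dispatch frames, which is what makes the "mountains" wide.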

@amartinhuertas
Member

After marathon sessions of profiling, I think I have detected a performance pitfall in the current version of Gridap when working with differential operators (e.g., gradient) applied to ArrayBlocks of fields. The pitfall is reproduced with the MWE below. I have come up with a solution in commit dc4350f. I am concerned about the following:

  1. Is it the best possible solution?
  2. Is this problem also present in other scenarios that should also be solved?
using Test
using Gridap
import Gridap: ∇, divergence
using LinearAlgebra
using Gridap.CellData
using Profile
using ProfileView

u(x) = VectorValue(2*x[1],x[1]+x[2])

divergence(::typeof(u)) = (x) -> 3

p(x) = x[1]-x[2]

∇p(x) = VectorValue(1,-1)

∇(::typeof(p)) = ∇p

f(x) = u(x) + ∇p(x)

domain = (0,1,0,1)
partition = (100,100)
order = 1
model = CartesianDiscreteModel(domain,partition)

V = FESpace(model,ReferenceFE(raviart_thomas,Float64,order),conformity=:Hdiv,
  dirichlet_tags=[5,6])

Q = FESpace(model,ReferenceFE(lagrangian,Float64,order); conformity=:L2)

U = TrialFESpace(V,u)
P = TrialFESpace(Q)

Y = MultiFieldFESpace([V, Q])
X = MultiFieldFESpace([U, P])

trian = Triangulation(model)
degree = 2
dΩ = Measure(trian,degree)


points=get_cell_points(dΩ.quad)
v=get_fe_basis(U)
pb=get_trial_fe_basis(Q)
div_v_q=∫(divergence(v)*pb)dΩ
div_v_q_auto=div_v_q.dict[dΩ.quad.trian]
div_v_q_man=lazy_map(evaluate,get_data(divergence(v)*pb),get_data(points))


a((u, p),(v, q)) = ∫( (∇⋅v)*p )*dΩ
Y = MultiFieldFESpace([V, Q])
X = MultiFieldFESpace([U, P])
y=get_fe_basis(Y)
x=get_trial_fe_basis(X)
dc=a(x,y)
div_v_q_auto_block=dc.dict[dΩ.quad.trian]


@noinline function myloop!(cache,arr)
  l = 0
  for i in 1:length(arr)
    ai = getindex!(cache,arr,i)
    l += length(ai)
  end
  l
end

# Performance pitfall is here
cache = array_cache(div_v_q_auto_block);
@profile myloop!(cache,div_v_q_auto_block)

# Without blocks it is OK, no performance pitfall.
cache = array_cache(div_v_q_auto);
@profile myloop!(cache,div_v_q_auto)
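For readers unfamiliar with the `array_cache`/`getindex!` protocol that `myloop!` exercises: the point is to reuse one preallocated buffer across iterations instead of allocating a fresh result per cell. A minimal stand-alone sketch of the same idiom (plain Julia, hypothetical names, not the Gridap implementation):

```julia
# Minimal version of the cache-reuse idiom behind array_cache/getindex!.
struct CellValues   # stand-in for a lazy cell array
    ncells::Int
    nvals::Int
end
Base.length(a::CellValues) = a.ncells

# One buffer, allocated once, reused for every cell.
my_array_cache(a::CellValues) = Vector{Float64}(undef, a.nvals)

function my_getindex!(cache, a::CellValues, i)
    for j in 1:a.nvals
        cache[j] = i + j   # fill in place: no per-cell allocation
    end
    cache
end

function drive(arr)
    c = my_array_cache(arr)
    total = 0.0
    for i in 1:length(arr)
        total += sum(my_getindex!(c, arr, i))
    end
    total
end

@assert drive(CellValues(3, 2)) == 21.0  # (2+3) + (3+4) + (4+5)
```

When this protocol is respected, iterating the whole cell array allocates O(1) memory; the pitfall above shows up precisely as unexpected per-cell allocations and dispatch inside such a loop.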

@amartinhuertas
Member

After second optimization (dc4350f)

We have improved the performance of the assembly even more. An additional mountain has disappeared on the left-hand side of the flamegraph. We are getting there!

10×10 DataFrame
 Row │ Assembly RT [s]  n      k      d      Total RT [s]  num_cells  l2_err_norm  Preassembly RT [s]  experiment      Solve RT [s] 
     │ Float64          Int64  Int64  Int64  Float64       Int64      Float64      Float64             String          Float64      
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │        0.256516     10      0      2      0.261268        100  9.02765e-15          0.0017999   solve_darcy_rt    0.00179685
   2 │        0.259209     20      0      2      0.270993        400  3.08589e-14          0.00279935  solve_darcy_rt    0.00863311
   3 │        0.254477     30      0      2      0.282098        900  3.03932e-14          0.00307789  solve_darcy_rt    0.0245428
   4 │        0.262536     40      0      2      0.310051       1600  7.15834e-14          0.00367827  solve_darcy_rt    0.0438366
   5 │        0.27004      50      0      2      0.347974       2500  9.25386e-14          0.00426771  solve_darcy_rt    0.0678215
   6 │        0.281427     60      0      2      0.387864       3600  1.78746e-13          0.00531458  solve_darcy_rt    0.0948862
   7 │        0.288045     70      0      2      0.430728       4900  2.19828e-13          0.00608961  solve_darcy_rt    0.130244
   8 │        0.298291     80      0      2      0.495547       6400  2.63806e-13          0.00676337  solve_darcy_rt    0.186996
   9 │        0.308904     90      0      2      0.56214        8100  3.08166e-13          0.00802405  solve_darcy_rt    0.244622
  10 │        0.31568     100      0      2      0.6469        10000  1.90947e-13          0.00962899  solve_darcy_rt    0.313966

Screenshot from 2021-08-13 21-15-47

@fverdugo
Member Author

The "mountains" look very nice now!

If you are able to fix the plateau, you would easily get a 2x improvement.

@amartinhuertas
Member

If you are able to fix the plateau, you would easily get a 2x improvement.

The plateau is the GC (the explicit call after assembly). Do you think this is solvable?
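One way to quantify such a plateau is `@timed`, which reports GC time separately from total time (standard Julia; the workload below is a toy, not the Darcy driver):

```julia
# Separate GC time from total time for an allocation-heavy workload.
function allocate_heavily(n)
    s = 0.0
    for _ in 1:n
        s += sum(rand(10))  # one small heap allocation per iteration
    end
    s
end

allocate_heavily(1)  # compile first, so the timing excludes compilation
stats = @timed allocate_heavily(100_000)
println("total: ", stats.time, " s,  gc: ", stats.gctime, " s,  bytes: ", stats.bytes)
@assert 0 <= stats.gctime <= stats.time
```

If `gctime` dominates, the fix is usually to reduce allocation pressure (or to drop any explicit `GC.gc()` calls) rather than to tune the collector.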

@fverdugo
Member Author

fverdugo commented Aug 13, 2021

What is the size of the mesh in this example?

@amartinhuertas
Member

What is the size of the mesh in this example?

100x100 quads

@fverdugo
Member Author

This GC time seems like a lot! What happens if you comment out that line? Are you running with --check-bounds=no?

@amartinhuertas
Member

Are you running with --check-bounds=no?

No, I was not. Results with --check-bounds=no are in the table below: a 5-8% improvement, but the GC plateau is still there.

10×10 DataFrame
 Row │ Assembly RT [s]  n      k      d      Total RT [s]  num_cells  l2_err_norm  Preassembly RT [s]  experiment      Solve RT [s] 
     │ Float64          Int64  Int64  Int64  Float64       Int64      Float64      Float64             String          Float64      
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │        0.223511     10      0      2      0.227706        100  9.02765e-15          0.00209083  solve_darcy_rt    0.00176186
   2 │        0.226838     20      0      2      0.237474        400  3.08589e-14          0.00235199  solve_darcy_rt    0.00810505
   3 │        0.229696     30      0      2      0.257955        900  3.03932e-14          0.00273573  solve_darcy_rt    0.0247038
   4 │        0.233849     40      0      2      0.281739       1600  7.15834e-14          0.00316141  solve_darcy_rt    0.0439335
   5 │        0.236913     50      0      2      0.309103       2500  9.25386e-14          0.00362761  solve_darcy_rt    0.0673354
   6 │        0.239644     60      0      2      0.339201       3600  1.78746e-13          0.00399835  solve_darcy_rt    0.0928566
   7 │        0.248194     70      0      2      0.383949       4900  2.19828e-13          0.00459787  solve_darcy_rt    0.127483
   8 │        0.257276     80      0      2      0.445318       6400  2.63806e-13          0.00531908  solve_darcy_rt    0.181995
   9 │        0.262293     90      0      2      0.509795       8100  3.08166e-13          0.00591585  solve_darcy_rt    0.240132
  10 │        0.267833    100      0      2      0.582029      10000  1.90947e-13          0.00689507  solve_darcy_rt    0.304982

Screenshot from 2021-08-14 14-09-22
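Besides the global `--check-bounds=no` flag, bounds checks can also be removed locally in hot loops with `@inbounds` (standard Julia; a toy kernel for illustration, not the assembly code):

```julia
# The same reduction with and without bounds checking.
function sum_checked(a)
    s = 0.0
    for i in eachindex(a)
        s += a[i]          # bounds check on every access (by default)
    end
    s
end

function sum_inbounds(a)
    s = 0.0
    @inbounds for i in eachindex(a)
        s += a[i]          # check elided: caller guarantees valid indices
    end
    s
end

a = collect(1.0:100.0)
@assert sum_checked(a) == sum_inbounds(a) == 5050.0
```

With `eachindex` the compiler can often prove in-bounds access by itself; `@inbounds` matters most for non-trivial index arithmetic, such as the indirect cell-to-dof indexing done during assembly.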

@amartinhuertas
Member

This GC time seems like a lot! What happens if you comment out that line?

The assembly times improved dramatically! Why did you add that GC() line after assembly? (I guess to reduce permanent memory allocation, right?)

julia> include("experiments/prepare_results.jl")
10×10 DataFrame
 Row │ Assembly RT [s]  n      k      d      Total RT [s]  num_cells  l2_err_norm  Preassembly RT [s]  experiment      Solve RT [s] 
     │ Float64          Int64  Int64  Int64  Float64       Int64      Float64      Float64             String          Float64      
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │      0.00652435     10      0      2    0.00971113        100  9.02765e-15          0.00208224  solve_darcy_rt    0.00110454
   2 │      0.0110655      20      0      2    0.0207089         400  3.08589e-14          0.00231101  solve_darcy_rt    0.00684298
   3 │      0.0137443      30      0      2    0.0384344         900  3.03932e-14          0.00237185  solve_darcy_rt    0.0209766
   4 │      0.0179322      40      0      2    0.0576663        1600  7.15834e-14          0.00290676  solve_darcy_rt    0.0360791
   5 │      0.0229827      50      0      2    0.0851398        2500  9.25386e-14          0.00322762  solve_darcy_rt    0.0561803
   6 │      0.0287668      60      0      2    0.126569         3600  1.78746e-13          0.00372828  solve_darcy_rt    0.0940649
   7 │      0.0357792      70      0      2    0.176257         4900  2.19828e-13          0.00432622  solve_darcy_rt    0.134461
   8 │      0.0409472      80      0      2    0.25984          6400  2.63806e-13          0.00502693  solve_darcy_rt    0.209684
   9 │      0.0485386      90      0      2    0.31709          8100  3.08166e-13          0.00580612  solve_darcy_rt    0.260729
  10 │      0.0565112     100      0      2    0.439944        10000  1.90947e-13          0.00671831  solve_darcy_rt    0.375843

Screenshot from 2021-08-14 14-23-07

@amartinhuertas
Member

New optimization: using DIV and integrating over the reference domain. Only very mild improvements observed. I would say this is because of the optimization for CartesianGrids: the Jacobian is the same for all cells, and this is exploited. For unstructured meshes I would expect higher improvements.

10×10 DataFrame
 Row │ Assembly RT [s]  n      k      d      Total RT [s]  num_cells  l2_err_norm  Preassembly RT [s]  experiment      Solve RT [s] 
     │ Float64          Int64  Int64  Int64  Float64       Int64      Float64      Float64             String          Float64      
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │      0.0046118      10      0      2    0.00750439        100  9.02765e-15          0.00186444  solve_darcy_rt    0.00102815
   2 │      0.00776088     20      0      2    0.0173738         400  3.08589e-14          0.00219309  solve_darcy_rt    0.00726899
   3 │      0.0108752      30      0      2    0.036804          900  3.0013e-14           0.0025957   solve_darcy_rt    0.0226412
   4 │      0.0153467      40      0      2    0.0583026        1600  7.15834e-14          0.00300031  solve_darcy_rt    0.0389771
   5 │      0.0196759      50      0      2    0.0817289        2500  9.27815e-14          0.0033414   solve_darcy_rt    0.0583115
   6 │      0.0222976      60      0      2    0.114485         3600  1.79296e-13          0.00364426  solve_darcy_rt    0.0841313
   7 │      0.03052        70      0      2    0.150756         4900  2.20256e-13          0.00441119  solve_darcy_rt    0.115382
   8 │      0.0378711      80      0      2    0.213512         6400  2.63806e-13          0.00531192  solve_darcy_rt    0.170246
   9 │      0.0440555      90      0      2    0.277877         8100  3.08355e-13          0.00609523  solve_darcy_rt    0.226845
  10 │      0.0508673     100      0      2    0.362304        10000  1.91369e-13          0.007078    solve_darcy_rt    0.304359

Screenshot from 2021-08-14 15-34-49
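For context, the `DIV`-in-reference-domain optimization rests on a standard property of the contravariant Piola map: the cell Jacobian cancels inside divergence integrals, so they can be precomputed on the reference cell. With $F_K$ the cell map and $J_K = \nabla F_K$ (assuming $\det J_K > 0$):

```latex
u \circ F_K = \frac{1}{\det J_K}\, J_K\, \hat u,
\qquad
(\nabla\cdot u)\circ F_K = \frac{1}{\det J_K}\, \hat\nabla\cdot\hat u,
```

so that

```latex
\int_K (\nabla\cdot u)\, q \,\mathrm{d}x
  = \int_{\hat K} \frac{\hat\nabla\cdot\hat u}{\det J_K}\, \hat q\, |\det J_K| \,\mathrm{d}\hat x
  = \int_{\hat K} (\hat\nabla\cdot\hat u)\, \hat q \,\mathrm{d}\hat x .
```

On a Cartesian grid $\det J_K$ is the same constant on every cell, which is why the saving here is mild compared to what one would expect on an unstructured mesh.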

@amartinhuertas
Member

Perhaps the same applies to the cell dofs.

I looked at this and came up with the optimization in commit a250431. However, I did not observe significant performance improvements. This might be because the examples I have been playing with use AffineMap, CartesianMap, etc.; for general unstructured mappings we might observe an improvement.

For the record, the code I used to measure the improvement is the following:

using Test
using Gridap
using Gridap.Geometry
using Gridap.ReferenceFEs
using Gridap.FESpaces
using Gridap.CellData
using Gridap.TensorValues
using Gridap.Fields
using Gridap.Io
using Profile
using ProfileView
using BenchmarkTools


order = 1

reffe = ReferenceFE(raviart_thomas,order)

domain =(0,1,0,1)
partition = (100,100)
model = CartesianDiscreteModel(domain,partition) |> simplexify

V = FESpace(model,reffe,conformity=DivConformity())
U = TrialFESpace(V)

v(x) = VectorValue(-0.5*x[1]+1.0,-0.5*x[2])
vh = interpolate(v,V)

e = v - vh

Ω = Triangulation(model)
dΩ = Measure(Ω,2*order)

el2 = sqrt(sum( ∫( e⋅e )dΩ ))
@test el2 < 1.0e-10

@noinline function myloop!(cache,arr)
  l = 0
  for i in 1:length(arr)
    ai = getindex!(cache,arr,i)
    l += length(ai)
  end
  l
end

array=Gridap.FESpaces._cell_vals(V,v)
@time cache = array_cache(array); 
@benchmark myloop!(cache,array)

@amartinhuertas amartinhuertas self-assigned this Aug 19, 2021
@amartinhuertas amartinhuertas added the performance Performance enhancements label Aug 19, 2021