
Check efficiency of RT implementation #637

Closed
fverdugo opened this issue Aug 12, 2021 · 14 comments
Assignees
Labels
performance Performance enhancements

Comments

@fverdugo
Member

By looking at

ContraVariantPiolaMap()(get_shapefuns(reffe),phi,sign_flip)

I would say that we are reevaluating the reference polynomials each time we visit a new cell.

@fverdugo
Member Author

I think that we need to replace

    lazy_map(_transform_rt_shapefuns,
             cell_reffe,
             get_cell_map(Triangulation(model)),
             lazy_map(Broadcasting(constant_field), sign_flip))

by

cell_fs = lazy_map(get_shape_functions,cell_reffe)
lazy_map(ContraPiolaMap(),cell_fs,
             get_cell_map(Triangulation(model)),
             lazy_map(Broadcasting(constant_field), sign_flip))

and remove _transform_rt_shapefuns.

Perhaps the same applies to the cell dofs.
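The gist of the proposed change can be illustrated with a plain-Julia sketch (hypothetical names, not the Gridap API): evaluate the reference shape functions once, outside the per-cell map, instead of re-deriving them every time a cell is visited.

```julia
# Sketch of the optimization: hoist a computation that is identical for all
# cells out of the per-cell loop. All names below are illustrative only.

struct RefFE end  # stand-in for a reference finite element

# Pretend this is expensive (it re-derives the reference polynomials).
shapefuns(::RefFE) = [1.0, 2.0, 3.0]

# Naive: recomputes the reference shape functions for every cell.
naive(cell_reffe) = [sum(shapefuns(r)) for r in cell_reffe]

# Optimized: all cells share the same reference FE, so compute once and reuse.
function optimized(cell_reffe)
    fs = shapefuns(first(cell_reffe))
    [sum(fs) for _ in cell_reffe]
end

cells = fill(RefFE(), 4)
@assert naive(cells) == optimized(cells) == fill(6.0, 4)
```

In Gridap, a similar effect comes from `lazy_map` specializing on `Fill` arrays, so the mapped function is evaluated once per reference FE rather than once per cell.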

@amartinhuertas
Member

Base line commit 67e5ee1 (2D Darcy on a box)

Profile data attached below

10×10 DataFrame
 Row │ Assembly RT [s]  n      k      d      Total RT [s]  num_cells  l2_err_norm  Preassembly RT [s]  experiment      Solve RT [s] 
     │ Float64          Int64  Int64  Int64  Float64       Int64      Float64      Float64             String          Float64      
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │        0.33423      10      0      2      0.337986        100  9.02765e-15          0.00177146  solve_darcy_rt    0.00198408
   2 │        0.366117     20      0      2      0.378329        400  3.08589e-14          0.00284878  solve_darcy_rt    0.00816082
   3 │        0.374344     30      0      2      0.403508        900  3.03932e-14          0.00273725  solve_darcy_rt    0.0226722
   4 │        0.387381     40      0      2      0.430464       1600  7.15834e-14          0.00387972  solve_darcy_rt    0.0282138
   5 │        0.4075       50      0      2      0.47579        2500  9.25386e-14          0.00481167  solve_darcy_rt    0.0538609
   6 │        0.432286     60      0      2      0.534182       3600  1.78746e-13          0.00550788  solve_darcy_rt    0.0934964
   7 │        0.456456     70      0      2      0.594821       4900  2.19828e-13          0.00683736  solve_darcy_rt    0.129203
   8 │        0.486746     80      0      2      0.682113       6400  2.63806e-13          0.00761905  solve_darcy_rt    0.18718
   9 │        0.512964     90      0      2      0.77064        8100  3.08166e-13          0.00885309  solve_darcy_rt    0.243275
  10 │        0.545432    100      0      2      0.859867      10000  1.90947e-13          0.00986676  solve_darcy_rt    0.302959

Screenshot from 2021-08-13 11-58-53

profile_bef_opts.zip

@amartinhuertas
Member

amartinhuertas commented Aug 13, 2021

After the first optimization pointed out by @fverdugo above (92a69ab):

We have improved performance: the width of the mountains in the left-hand-side panel (assembly) has been reduced. There is still a type instability in the evaluation of the monomial basis (I will take a look). The huge amount of time spent in the explicit GC call after assembly is surprising.

Anyway, I guess @fverdugo is already aware of this ...

10×10 DataFrame
 Row │ Assembly RT [s]  n      k      d      Total RT [s]  num_cells  l2_err_norm  Preassembly RT [s]  experiment      Solve RT [s] 
     │ Float64          Int64  Int64  Int64  Float64       Int64      Float64      Float64             String          Float64      
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │        0.264812     10      0      2      0.268516        100  9.02765e-15          0.00176761  solve_darcy_rt    0.00165767
   2 │        0.285107     20      0      2      0.296845        400  3.08589e-14          0.00287766  solve_darcy_rt    0.00787332
   3 │        0.295462     30      0      2      0.325127        900  3.03932e-14          0.00289937  solve_darcy_rt    0.0256846
   4 │        0.303083     40      0      2      0.353405       1600  7.15834e-14          0.0038059   solve_darcy_rt    0.0411777
   5 │        0.316631     50      0      2      0.396771       2500  9.25386e-14          0.00445374  solve_darcy_rt    0.0594063
   6 │        0.334186     60      0      2      0.443978       3600  1.78746e-13          0.00535773  solve_darcy_rt    0.0943956
   7 │        0.35817      70      0      2      0.497818       4900  2.19828e-13          0.00622951  solve_darcy_rt    0.128155
   8 │        0.380576     80      0      2      0.573763       6400  2.63806e-13          0.00737641  solve_darcy_rt    0.185429
   9 │        0.401534     90      0      2      0.655064       8100  3.08166e-13          0.00713316  solve_darcy_rt    0.244715
  10 │        0.425449    100      0      2      0.752229      10000  1.90947e-13          0.00877073  solve_darcy_rt    0.31425

Screenshot from 2021-08-13 13-22-43

profile_aft.op1.zip
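As an aside, type instabilities like the one suspected in the monomial-basis evaluation are usually diagnosed with `@code_warntype` or `Base.return_types` (standard Julia tooling; the functions below are toy stand-ins, not Gridap code):

```julia
# A toy type instability: the return type depends on a runtime value,
# so the compiler can only infer a Union.
unstable(flag) = flag ? 1 : 1.0   # Int or Float64, decided at runtime
stable(flag)   = flag ? 1.0 : 2.0 # always Float64

# `@code_warntype unstable(true)` highlights the Union in red in the REPL;
# `Base.return_types` exposes the same inference result programmatically.
@assert Base.return_types(unstable, (Bool,))[1] == Union{Int, Float64}
@assert Base.return_types(stable, (Bool,))[1] == Float64
```

Union-typed intermediates in an inner assembly loop typically show up in the flamegraph as dynamic-dispatch frames, which is what makes the "mountains" wide.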

@amartinhuertas
Member

After marathon sessions of profiling, I think I have detected a performance pitfall in the current version of Gridap when working with differential operators (e.g., gradient) applied to ArrayBlocks of fields. The pitfall is reproduced with the MWE below. I have come up with a solution in commit dc4350f. I am concerned about the following:

  1. Is it the best possible solution?
  2. Is this problem also present in other scenarios that should also be solved?
using Test
using Gridap
import Gridap: ∇, divergence
using LinearAlgebra
using Gridap.CellData
using Profile
using ProfileView

u(x) = VectorValue(2*x[1],x[1]+x[2])

divergence(::typeof(u)) = (x) -> 3

p(x) = x[1]-x[2]

∇p(x) = VectorValue(1,-1)

∇(::typeof(p)) = ∇p

f(x) = u(x) + ∇p(x)

domain = (0,1,0,1)
partition = (100,100)
order = 1
model = CartesianDiscreteModel(domain,partition)

V = FESpace(model,ReferenceFE(raviart_thomas,Float64,order),conformity=:Hdiv,
  dirichlet_tags=[5,6])

Q = FESpace(model,ReferenceFE(lagrangian,Float64,order); conformity=:L2)

U = TrialFESpace(V,u)
P = TrialFESpace(Q)

Y = MultiFieldFESpace([V, Q])
X = MultiFieldFESpace([U, P])

trian = Triangulation(model)
degree = 2
dΩ = Measure(trian,degree)


points=get_cell_points(dΩ.quad)
v=get_fe_basis(U)
pb=get_trial_fe_basis(Q)
div_v_q=∫(divergence(v)*pb)dΩ
div_v_q_auto=div_v_q.dict[dΩ.quad.trian]
div_v_q_man=lazy_map(evaluate,get_data(divergence(v)*pb),get_data(points))


a((u, p),(v, q)) = ∫( (∇⋅v)*p )*dΩ
Y = MultiFieldFESpace([V, Q])
X = MultiFieldFESpace([U, P])
y=get_fe_basis(Y)
x=get_trial_fe_basis(X)
dc=a(x,y)
div_v_q_auto_block=dc.dict[dΩ.quad.trian]


@noinline function myloop!(cache,arr)
  l = 0
  for i in 1:length(arr)
    ai = getindex!(cache,arr,i)
    l += length(ai)
  end
  l
end

# Performance pitfall is here
cache = array_cache(div_v_q_auto_block);
@profile myloop!(cache,div_v_q_auto_block)

# Without blocks it is OK, no performance pitfall.
cache = array_cache(div_v_q_auto);
@profile myloop!(cache,div_v_q_auto)
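For readers unfamiliar with the `array_cache`/`getindex!` protocol that `myloop!` exercises: the point is to reuse one preallocated buffer across iterations instead of allocating a fresh result per cell. A minimal stand-alone sketch of the same idiom (plain Julia, hypothetical names, not the Gridap implementation):

```julia
# Minimal version of the cache-reuse idiom behind array_cache/getindex!.
struct CellValues   # stand-in for a lazy cell array
    ncells::Int
    nvals::Int
end
Base.length(a::CellValues) = a.ncells

# One buffer, allocated once, reused for every cell.
my_array_cache(a::CellValues) = Vector{Float64}(undef, a.nvals)

function my_getindex!(cache, a::CellValues, i)
    for j in 1:a.nvals
        cache[j] = i + j   # fill in place: no per-cell allocation
    end
    cache
end

function drive(arr)
    c = my_array_cache(arr)
    total = 0.0
    for i in 1:length(arr)
        total += sum(my_getindex!(c, arr, i))
    end
    total
end

@assert drive(CellValues(3, 2)) == 21.0  # (2+3) + (3+4) + (4+5)
```

When this protocol is respected, iterating the whole cell array allocates O(1) memory; the pitfall above shows up precisely as unexpected per-cell allocations and dispatch inside such a loop.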

@amartinhuertas
Member

After second optimization (dc4350f)

We have improved the performance of the assembly even more. An additional mountain has disappeared on the left-hand side of the flamegraph. We are getting there!

10×10 DataFrame
 Row │ Assembly RT [s]  n      k      d      Total RT [s]  num_cells  l2_err_norm  Preassembly RT [s]  experiment      Solve RT [s] 
     │ Float64          Int64  Int64  Int64  Float64       Int64      Float64      Float64             String          Float64      
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │        0.256516     10      0      2      0.261268        100  9.02765e-15          0.0017999   solve_darcy_rt    0.00179685
   2 │        0.259209     20      0      2      0.270993        400  3.08589e-14          0.00279935  solve_darcy_rt    0.00863311
   3 │        0.254477     30      0      2      0.282098        900  3.03932e-14          0.00307789  solve_darcy_rt    0.0245428
   4 │        0.262536     40      0      2      0.310051       1600  7.15834e-14          0.00367827  solve_darcy_rt    0.0438366
   5 │        0.27004      50      0      2      0.347974       2500  9.25386e-14          0.00426771  solve_darcy_rt    0.0678215
   6 │        0.281427     60      0      2      0.387864       3600  1.78746e-13          0.00531458  solve_darcy_rt    0.0948862
   7 │        0.288045     70      0      2      0.430728       4900  2.19828e-13          0.00608961  solve_darcy_rt    0.130244
   8 │        0.298291     80      0      2      0.495547       6400  2.63806e-13          0.00676337  solve_darcy_rt    0.186996
   9 │        0.308904     90      0      2      0.56214        8100  3.08166e-13          0.00802405  solve_darcy_rt    0.244622
  10 │        0.31568     100      0      2      0.6469        10000  1.90947e-13          0.00962899  solve_darcy_rt    0.313966

Screenshot from 2021-08-13 21-15-47

@fverdugo
Member Author

The "mountains" look very nice now!

If you are able to fix the plateau, you would easily get a 2x improvement.

@amartinhuertas
Member

If you are able to fix the plateau, you would easily get a 2x improvement.

The plateau is the GC (the explicit call after assembly). Do you think this is solvable?
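One way to quantify such a plateau is `@timed`, which reports GC time separately from total time (standard Julia; the workload below is a toy, not the Darcy driver):

```julia
# Separate GC time from total time for an allocation-heavy workload.
function allocate_heavily(n)
    s = 0.0
    for _ in 1:n
        s += sum(rand(10))  # one small heap allocation per iteration
    end
    s
end

allocate_heavily(1)  # compile first, so the timing excludes compilation
stats = @timed allocate_heavily(100_000)
println("total: ", stats.time, " s,  gc: ", stats.gctime, " s,  bytes: ", stats.bytes)
@assert 0 <= stats.gctime <= stats.time
```

If `gctime` dominates, the fix is usually to reduce allocation pressure (or to drop any explicit `GC.gc()` calls) rather than to tune the collector.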

@fverdugo
Member Author

fverdugo commented Aug 13, 2021

What is the size of the mesh in this example?

@amartinhuertas
Member

What is the size of the mesh in this example?

100x100 quads

@fverdugo
Member Author

This GC time seems like a lot! What happens if you comment out that line? Are you running with --check-bounds=no?

@amartinhuertas
Member

Are you running with --check-bounds=no?

No, I was not. Results with --check-bounds=no are in the table below: a 5-8% improvement, but the GC plateau is still there.

10×10 DataFrame
 Row │ Assembly RT [s]  n      k      d      Total RT [s]  num_cells  l2_err_norm  Preassembly RT [s]  experiment      Solve RT [s] 
     │ Float64          Int64  Int64  Int64  Float64       Int64      Float64      Float64             String          Float64      
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │        0.223511     10      0      2      0.227706        100  9.02765e-15          0.00209083  solve_darcy_rt    0.00176186
   2 │        0.226838     20      0      2      0.237474        400  3.08589e-14          0.00235199  solve_darcy_rt    0.00810505
   3 │        0.229696     30      0      2      0.257955        900  3.03932e-14          0.00273573  solve_darcy_rt    0.0247038
   4 │        0.233849     40      0      2      0.281739       1600  7.15834e-14          0.00316141  solve_darcy_rt    0.0439335
   5 │        0.236913     50      0      2      0.309103       2500  9.25386e-14          0.00362761  solve_darcy_rt    0.0673354
   6 │        0.239644     60      0      2      0.339201       3600  1.78746e-13          0.00399835  solve_darcy_rt    0.0928566
   7 │        0.248194     70      0      2      0.383949       4900  2.19828e-13          0.00459787  solve_darcy_rt    0.127483
   8 │        0.257276     80      0      2      0.445318       6400  2.63806e-13          0.00531908  solve_darcy_rt    0.181995
   9 │        0.262293     90      0      2      0.509795       8100  3.08166e-13          0.00591585  solve_darcy_rt    0.240132
  10 │        0.267833    100      0      2      0.582029      10000  1.90947e-13          0.00689507  solve_darcy_rt    0.304982

Screenshot from 2021-08-14 14-09-22
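Besides the global `--check-bounds=no` flag, bounds checks can also be removed locally in hot loops with `@inbounds` (standard Julia; a toy kernel for illustration, not the assembly code):

```julia
# The same reduction with and without bounds checking.
function sum_checked(a)
    s = 0.0
    for i in eachindex(a)
        s += a[i]          # bounds check on every access (by default)
    end
    s
end

function sum_inbounds(a)
    s = 0.0
    @inbounds for i in eachindex(a)
        s += a[i]          # check elided: caller guarantees valid indices
    end
    s
end

a = collect(1.0:100.0)
@assert sum_checked(a) == sum_inbounds(a) == 5050.0
```

With `eachindex` the compiler can often prove in-bounds access by itself; `@inbounds` matters most for non-trivial index arithmetic, such as the indirect cell-to-dof indexing done during assembly.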

@amartinhuertas
Member

This GC time seems like a lot! What happens if you comment out that line?

The assembly times improved dramatically! Why did you add that GC() line after assembly? (I guess to reduce permanent memory allocation, right?)

julia> include("experiments/prepare_results.jl")
10×10 DataFrame
 Row │ Assembly RT [s]  n      k      d      Total RT [s]  num_cells  l2_err_norm  Preassembly RT [s]  experiment      Solve RT [s] 
     │ Float64          Int64  Int64  Int64  Float64       Int64      Float64      Float64             String          Float64      
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │      0.00652435     10      0      2    0.00971113        100  9.02765e-15          0.00208224  solve_darcy_rt    0.00110454
   2 │      0.0110655      20      0      2    0.0207089         400  3.08589e-14          0.00231101  solve_darcy_rt    0.00684298
   3 │      0.0137443      30      0      2    0.0384344         900  3.03932e-14          0.00237185  solve_darcy_rt    0.0209766
   4 │      0.0179322      40      0      2    0.0576663        1600  7.15834e-14          0.00290676  solve_darcy_rt    0.0360791
   5 │      0.0229827      50      0      2    0.0851398        2500  9.25386e-14          0.00322762  solve_darcy_rt    0.0561803
   6 │      0.0287668      60      0      2    0.126569         3600  1.78746e-13          0.00372828  solve_darcy_rt    0.0940649
   7 │      0.0357792      70      0      2    0.176257         4900  2.19828e-13          0.00432622  solve_darcy_rt    0.134461
   8 │      0.0409472      80      0      2    0.25984          6400  2.63806e-13          0.00502693  solve_darcy_rt    0.209684
   9 │      0.0485386      90      0      2    0.31709          8100  3.08166e-13          0.00580612  solve_darcy_rt    0.260729
  10 │      0.0565112     100      0      2    0.439944        10000  1.90947e-13          0.00671831  solve_darcy_rt    0.375843

Screenshot from 2021-08-14 14-23-07

@amartinhuertas
Member

New optimization: using DIV and integrating over the reference domain. Only very mild improvements observed. I would say this is because of the optimization for CartesianGrids: the Jacobian is the same for all cells, and this is exploited. For unstructured meshes I would expect higher improvements.

10×10 DataFrame
 Row │ Assembly RT [s]  n      k      d      Total RT [s]  num_cells  l2_err_norm  Preassembly RT [s]  experiment      Solve RT [s] 
     │ Float64          Int64  Int64  Int64  Float64       Int64      Float64      Float64             String          Float64      
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │      0.0046118      10      0      2    0.00750439        100  9.02765e-15          0.00186444  solve_darcy_rt    0.00102815
   2 │      0.00776088     20      0      2    0.0173738         400  3.08589e-14          0.00219309  solve_darcy_rt    0.00726899
   3 │      0.0108752      30      0      2    0.036804          900  3.0013e-14           0.0025957   solve_darcy_rt    0.0226412
   4 │      0.0153467      40      0      2    0.0583026        1600  7.15834e-14          0.00300031  solve_darcy_rt    0.0389771
   5 │      0.0196759      50      0      2    0.0817289        2500  9.27815e-14          0.0033414   solve_darcy_rt    0.0583115
   6 │      0.0222976      60      0      2    0.114485         3600  1.79296e-13          0.00364426  solve_darcy_rt    0.0841313
   7 │      0.03052        70      0      2    0.150756         4900  2.20256e-13          0.00441119  solve_darcy_rt    0.115382
   8 │      0.0378711      80      0      2    0.213512         6400  2.63806e-13          0.00531192  solve_darcy_rt    0.170246
   9 │      0.0440555      90      0      2    0.277877         8100  3.08355e-13          0.00609523  solve_darcy_rt    0.226845
  10 │      0.0508673     100      0      2    0.362304        10000  1.91369e-13          0.007078    solve_darcy_rt    0.304359

Screenshot from 2021-08-14 15-34-49
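For context, the `DIV`-in-reference-domain optimization rests on a standard property of the contravariant Piola map: the cell Jacobian cancels inside divergence integrals, so they can be precomputed on the reference cell. With $F_K$ the cell map and $J_K = \nabla F_K$ (assuming $\det J_K > 0$):

```latex
u \circ F_K = \frac{1}{\det J_K}\, J_K\, \hat u,
\qquad
(\nabla\cdot u)\circ F_K = \frac{1}{\det J_K}\, \hat\nabla\cdot\hat u,
```

so that

```latex
\int_K (\nabla\cdot u)\, q \,\mathrm{d}x
  = \int_{\hat K} \frac{\hat\nabla\cdot\hat u}{\det J_K}\, \hat q\, |\det J_K| \,\mathrm{d}\hat x
  = \int_{\hat K} (\hat\nabla\cdot\hat u)\, \hat q \,\mathrm{d}\hat x .
```

On a Cartesian grid $\det J_K$ is the same constant on every cell, which is why the saving here is mild compared to what one would expect on an unstructured mesh.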

@amartinhuertas
Member

Perhaps the same applies to the cell dofs.

I looked at this and came up with the optimization in commit a250431. However, I did not observe significant performance improvements. This might be because the examples I have been playing with use AffineMap, CartesianMap, etc.; for general unstructured mappings we might observe an improvement.

For the record, the code I used to measure the improvement is the following:

using Test
using Gridap
using Gridap.Geometry
using Gridap.ReferenceFEs
using Gridap.FESpaces
using Gridap.CellData
using Gridap.TensorValues
using Gridap.Fields
using Gridap.Io
using Profile
using ProfileView
using BenchmarkTools


order = 1

reffe = ReferenceFE(raviart_thomas,order)

domain =(0,1,0,1)
partition = (100,100)
model = CartesianDiscreteModel(domain,partition) |> simplexify

V = FESpace(model,reffe,conformity=DivConformity())
U = TrialFESpace(V)

v(x) = VectorValue(-0.5*x[1]+1.0,-0.5*x[2])
vh = interpolate(v,V)

e = v - vh

Ω = Triangulation(model)
dΩ = Measure(Ω,2*order)

el2 = sqrt(sum( ∫( e⋅e )dΩ ))
@test el2 < 1.0e-10

@noinline function myloop!(cache,arr)
  l = 0
  for i in 1:length(arr)
    ai = getindex!(cache,arr,i)
    l += length(ai)
  end
  l
end

array=Gridap.FESpaces._cell_vals(V,v)
@time cache = array_cache(array); 
@benchmark myloop!(cache,array)

@amartinhuertas amartinhuertas self-assigned this Aug 19, 2021
@amartinhuertas amartinhuertas added the performance Performance enhancements label Aug 19, 2021