Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Profile: Thread and task-specific profiling #41742

Merged
merged 13 commits into from
Sep 2, 2021

Conversation

IanButterworth
Copy link
Sponsor Member

@IanButterworth IanButterworth commented Jul 31, 2021

Aims to fix #41713

  • Adds thread information into the profile samples and adds grouping options to Profile.print
    Profile.print(groupby = [:threads, :tasks]) to group first by threads, then tasks
    or Profile.print(groupby = :threads) or :tasks to group by those respectively
    or default behavior via Profile.print(groupby = :none)

Todo:

  • extend to MacOS
  • extend to Windows
  • add special handling given windows doesn't have multithread profiling
  • Add task id
  • Change kwarg to groupby = :threads groupby = :tasks groupby = [:threads, :tasks] Thanks @felixcremer!
  • Move buffer count nthreads factoring into another PR Profile: allocate buffer for n instruction pointers per thread #41821
  • Figure out what format to export the data as for external consumers (vscode profiling, ProfileView etc.) - Profile.fetch by default now strips the metadata out to ensure backwards compat with external consumers. They can enable and handle when ready
  • Add % utilization per thread
  • Add sample clock time
  • Test flat printing
  • Add to news

Edit: See latest examples below

@IanButterworth

This comment has been minimized.

@vchuravy vchuravy requested a review from timholy July 31, 2021 06:38
@IanButterworth

This comment has been minimized.

@IanButterworth

This comment has been minimized.

@IanButterworth
Copy link
Sponsor Member Author

IanButterworth commented Aug 1, 2021

Added a heuristic to detect, flag, count and summarize idling as a % per-group. See Total snapshots: 805 (100% idle)
The heuristic might to too brittle, so would be good to figure out what to do instead. Or at least add a test for it.
Also compressed the formatting where only a single thread or task is nested.

Screenshot from 2021-08-01 22-05-37

The same result with no grouping

Screenshot from 2021-08-01 22-05-56

Another example that shows initial idle time on other threads for

function foo()
    x = rand(1000,1000)
    Threads.@threads for i in 1:100
        x * x
    end
end

Screenshot from 2021-08-01 23-39-46

@vchuravy
Copy link
Sponsor Member

vchuravy commented Aug 2, 2021

add special handling given windows doesn't have multithread profiling

I was surprised when I looked at the code just now and realized this as well. We should make this clear in the docs and maybe track it in a separate issue?

Multiplies the requested buffer samples by nthreads given the process is actually sampling nthreads instruction pointers per time period.

Might be best to split into two PRs? That change is orthogonal and less complicated

@IanButterworth IanButterworth force-pushed the ib/threadyprofiling branch 3 times, most recently from 6174b74 to af1e549 Compare August 3, 2021 03:43
@IanButterworth

This comment has been minimized.

@felixcremer
Copy link
Contributor

Playing around with some keyword arguments I got the following error, when I tried to combine the groupby with the threads keyword.
This doesn't happen for groupby = :task

julia> Profile.print(groupby = :thread, threads=[1,2])
Overhead ╎ [+additional indent] Count File:Line; Function
=========================================================
ERROR: MethodError: no method matching mod(::Int64, ::Vector{Int64})
Closest candidates are:
  mod(::Union{Int128, Int16, Int32, Int64, Int8}, ::Unsigned) at int.jl:272
  mod(::Integer, ::Base.OneTo) at range.jl:1401
  mod(::Integer, ::AbstractUnitRange{<:Integer}) at range.jl:1402
  ...
Stacktrace:
 [1] _broadcast_getindex_evalf
   @ ./broadcast.jl:670 [inlined]
 [2] _broadcast_getindex
   @ ./broadcast.jl:643 [inlined]
 [3] getindex
   @ ./broadcast.jl:597 [inlined]
 [4] copy
   @ ./broadcast.jl:943 [inlined]
 [5] materialize
   @ ./broadcast.jl:904 [inlined]
 [6] _intersect
   @ ~/Documents/Programmieren/julia/usr/share/julia/stdlib/v1.8/Profile/src/Profile.jl:225 [inlined]
 [7] print(io::Base.TTY, data::Vector{UInt64}, lidict::Dict{UInt64, Vector{Base.StackTraces.StackFrame}}; format::Symbol, C::Bool, combine::Bool, maxdepth::Int64, mincount::Int64, noisefloor::Int64, sortedby::Symbol, groupby::Symbol, recur::Symbol, threads::Vector{Int64}, tasks::UnitRange{UInt64})
   @ Profile ~/Documents/Programmieren/julia/usr/share/julia/stdlib/v1.8/Profile/src/Profile.jl:210
 [8] print(data::Vector{UInt64}, lidict::Dict{UInt64, Vector{Base.StackTraces.StackFrame}}; kwargs::Base.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:groupby, :threads), Tuple{Symbol, Vector{Int64}}}})
   @ Profile ~/Documents/Programmieren/julia/usr/share/julia/stdlib/v1.8/Profile/src/Profile.jl:280
 [9] top-level scope
   @ REPL[34]:1

Also when I try to use format=:flat I get the following error:

ERROR: TypeError: non-boolean (Nothing) used in boolean context
Stacktrace:
 [1] print(io::Base.TTY, data::Vector{UInt64}, lidict::Dict{UInt64, Vector{Base.StackTraces.StackFrame}}; format::Symbol, C::Bool, combine::Bool, maxdepth::Int64, mincount::Int64, noisefloor::Int64, sortedby::Symbol, groupby::Symbol, recur::Symbol, threads::UnitRange{Int64}, tasks::UnitRange{UInt64})
   @ Profile ~/Documents/Programmieren/julia/usr/share/julia/stdlib/v1.8/Profile/src/Profile.jl:205
 [2] print(data::Vector{UInt64}, lidict::Dict{UInt64, Vector{Base.StackTraces.StackFrame}}; kwargs::Base.Pairs{Symbol, Symbol, Tuple{Symbol, Symbol}, NamedTuple{(:groupby, :format), Tuple{Symbol, Symbol}}})
   @ Profile ~/Documents/Programmieren/julia/usr/share/julia/stdlib/v1.8/Profile/src/Profile.jl:280
 [3] top-level scope
   @ REPL[55]:1

@IanButterworth
Copy link
Sponsor Member Author

IanButterworth commented Aug 3, 2021

ERROR: MethodError: no method matching mod(::Int64, ::Vector{Int64})

Once the suggestion in #41759 is added I'll use that. There wasn't an efficient intersect method for what I needed, so the dispatch is worse currently

Also when I try to use format=:flat I get the following error

Thanks for testing! I haven't done any testing of the flat format yet. I'll move onto that

@IanButterworth IanButterworth changed the title Thread-specific profiling Profile: Thread and task-specific profiling Aug 7, 2021
src/signals-unix.c Outdated Show resolved Hide resolved
@IanButterworth IanButterworth force-pushed the ib/threadyprofiling branch 2 times, most recently from 7131a46 to b8e9b7a Compare August 8, 2021 08:23
@IanButterworth
Copy link
Sponsor Member Author

I think this is ready for review. I've walked some of the wilder features back to the basics, and retained the default behavior to not split Profile.print by thread/task. So it's opt-in now

@vchuravy vchuravy added needs news A NEWS entry is required for this change profiler labels Aug 9, 2021
@IanButterworth
Copy link
Sponsor Member Author

Trying to get to the bottom of the issue on Linux32 as it seems real

@IanButterworth IanButterworth removed the status:merge me PR is reviewed. Merge when all tests are passing label Aug 31, 2021
@vchuravy
Copy link
Sponsor Member

vchuravy commented Sep 1, 2021

You can also compile for Linux 32bit on Linux 64 by setting the XC_HOST variable I think (whatever we use for mingw)

@IanButterworth
Copy link
Sponsor Member Author

IanButterworth commented Sep 2, 2021

It seems that on 32-bit linux some of the ips are entered as 0x00000000 and thus being interpreted as a block end. And the check this PR added to Profile.fetch() is noticing.

See the ?'s below

julia> Profile.clear()

julia> @profile busywait(1, 20)

julia> Profile.fetch()
ERROR: AssertionError: metadata stripping failed i=20 j=0 data[1:i]=UInt32[0xf7f39547, 0xf7f025cb, 0xf7af6a0c, 0xf79f9ae5, 0xe8694f8e, 0xe86c1831, 0xe8875ac0, 0xf799c4f7, 0xf79c4202, 0xf79f952c, 0xf7efb634, 0xf7e0c8f9, 0x00000004, 0xefb94010, 0x247e7c04, 0x00000002, 0x00000000, 0xf7f3a10c, 0xf7dca17e, 0xf7dca2e2]
Stacktrace:
 [1] fetch(; include_meta::Bool)
   @ Profile /buildworker/worker/package_linux32/build/usr/share/julia/stdlib/v1.8/Profile/src/Profile.jl:519
 [2] fetch()
   @ Profile /buildworker/worker/package_linux32/build/usr/share/julia/stdlib/v1.8/Profile/src/Profile.jl:494
 [3] top-level scope
   @ REPL[65]:1

julia> d = Profile.fetch(include_meta = true);

julia> for i in 1:length(d)-1
           if d[i + 1] == 0 && !in(d[i], [1,2]) 
               # the entry before a block end is the idle state, which can only be 1 or 2 (given we add 1 to the state val to avoid zero)
               # so indicate anything zeros that don't match that pattern
               @info i
           end
       end
[ Info: 9875
[ Info: 36298
[ Info: 38375
[ Info: 62623

julia> d[9850:9900]
51-element Vector{UInt32}:
 0xf79f952c
 0xf7efb634
 0xf7e0c8f9
 0x00000002 # threadid
 0xefb90010 # taskid
 0x3bf946fc # cycleclock
 0x00000002 # idle state (+1)
 0x00000000 # block end
 0xf7f39547
 0xf7f025cb
 0xf7af6a0c
 0xf79f9ae5
 0xe8694f8e
 0xe86c1831
 0xe8875ac0
 0xf799c4f7
 0xf79c4202
 0xf79f952c
 0xf7efb634
 0xf7e0c8f9
 0x00000004 # threadid
 0xefb94010 # taskid
 0x3bfa0ede # cycleclock
 0x00000002 # idle state (+1)
 0x00000000 # block end
 0xf7dca235
 0x00000000 # ???????
 0x75c085c0
 0x00000001 # threadid
 0xefb08010 # taskid
 0x3bfb1482 # cycleclock
 0x00000001 # idle state (+1)
 0x00000000 # block end
 0xf7f39547
 0xf7f025cb
 0xf7af6a0c
 0xf79f9ae5
 0xe8694f8e
 0xe86c1831
 0xe8875ac0
 0xf799c4f7
 0xf79c4202
 0xf79f952c
 0xf7efb634
 0xf7e0c8f9
 0x00000002 # threadid
 0xefb90010 # taskid
 0x3c30e878 # cycleclock
 0x00000002 # idle state (+1)
 0x00000000 # block end
 0xf7f3a10c

@vtjnash any idea what's happening here?


Btw @vchuravy I couldn't figure this out

You can also compile for Linux 32bit on Linux 64 by setting the XC_HOST variable I think (whatever we use for mingw)

@IanButterworth
Copy link
Sponsor Member Author

I added a bandaid that should in most cases ignore rogue 0's when detecting the block ends.
If we can't figure out why the rogue 0's are happening, perhaps this is ok?

@vchuravy
Copy link
Sponsor Member

vchuravy commented Sep 2, 2021

Can you squash the PR down?

@IanButterworth
Copy link
Sponsor Member Author

We can just squash-merge, right?

@vchuravy vchuravy merged commit c5af061 into JuliaLang:master Sep 2, 2021
@IanButterworth IanButterworth deleted the ib/threadyprofiling branch September 2, 2021 18:36
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
- Adds thread and task ids to profile samples
- Implements thread and task selection for Profile.print()
- Implements thread and task groupby options for Profile.print()
- Add include_meta to Profile.fetch() which defaults to false to ensure backwards compat with external profiling tooling
- store time of each profile sample (cycleclock)
- add sleep_check_state to metadata and show % utilization
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022
- Adds thread and task ids to profile samples
- Implements thread and task selection for Profile.print()
- Implements thread and task groupby options for Profile.print()
- Add include_meta to Profile.fetch() which defaults to false to ensure backwards compat with external profiling tooling
- store time of each profile sample (cycleclock)
- add sleep_check_state to metadata and show % utilization
vtjnash added a commit that referenced this pull request May 15, 2023
Regression caused by #41742, which inverted the loop without inverting
the logic. And fix a number of related formatting mistakes.

Fix #49732
vtjnash added a commit that referenced this pull request May 15, 2023
Regression caused by #41742, which inverted the loop without inverting
the logic. And fix a number of related formatting mistakes.

Fix #49732
vtjnash added a commit that referenced this pull request May 16, 2023
Regression caused by #41742, which inverted the loop without inverting
the logic. And fix a number of related formatting mistakes.

Fix #49732
KristofferC pushed a commit that referenced this pull request May 19, 2023
Regression caused by #41742, which inverted the loop without inverting
the logic. And fix a number of related formatting mistakes.

Fix #49732

(cherry picked from commit 45748b8)
KristofferC pushed a commit that referenced this pull request May 27, 2023
Regression caused by #41742, which inverted the loop without inverting
the logic. And fix a number of related formatting mistakes.

Fix #49732

(cherry picked from commit 45748b8)
kpamnany pushed a commit that referenced this pull request Jun 21, 2023
Regression caused by #41742, which inverted the loop without inverting
the logic. And fix a number of related formatting mistakes.

Fix #49732

(cherry picked from commit 45748b8)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

most of the profile runtime is in task_done_hook unless threads = 1
5 participants