Upload trace fails because traces are misordered #14
That analysis isn't quite right.
Something very strange is going on - see screenshot.
Trace file for above:
Hi, David,
I'll try to address your questions and discussion, but tl;dr: here
<https://github.com/google/schedviz/blob/master/server/fs_storage.go#L110>,
change the NewCollection call to:
collection, err := sched.NewCollection(
collectionProto.EventSet,
sched.UsingEventLoaders(sched.FaultTolerantEventLoaders()),
sched.NormalizeTimestamps(true))
(that is, adding the
sched.UsingEventLoaders(sched.FaultTolerantEventLoaders()) collection
option) and your collection will load.
Now, to your questions and comments, and some background:
1. You discuss timestamp normalization. That's a step much later in the
pipeline, and is unrelated to inference errors like this. The idea there
is that in many cases event timestamps are in a fairly useless epoch (e.g.,
nanoseconds since boot). Timestamp normalization simply shifts all events'
timestamps so that the trace starts at 'time 0'.
2. Clipping is something we do internally, but don't yet have
externally; this lack may be the source of your problem. As you probably
know, in the kernel, each CPU streams its events to its own individual
buffer, whose size is specified at trace configuration. When these buffers
overflow, we can do one of two things: overwrite the oldest records (which
occurs in flight-data-recorder mode), or stop emitting new events (which
occurs in one-shot mode). In either case, we lose information for the
overflowed CPU, either at the start of the trace (in the former case) or at
the end of the trace (in the latter.) Since scheduling is a whole-system
phenomenon, we cannot tolerate such lost data, and therefore we must
correct for these overflows. This is done by 1) identifying which CPU
buffers overflowed; 2) finding the latest (for FDR-mode, earliest for
one-shot) first (last) event among these buffers, and 3) marking all
events, in any CPU buffer, earlier (later) than that event's timestamp as
'clipped'. Clipped events will then not be considered by SchedViz. Again,
I don't think we're doing this in the external SchedViz; it's certainly
something we'd like to do and would welcome contributions on (a rough
sketch follows this list).
3. It looks like the kernel's own trace reader ignores pad events
<https://github.com/torvalds/linux/blob/master/kernel/trace/ring_buffer.c#L3649>
.
4. These scheduling events are not entirely reliable. For example, we
frequently get multiple wakeup events for the same sleeping thread, and
sometimes wakeups on running threads. There are also some race conditions
in the kernel that can apparently reorder events so that their ordering is
incoherent (e.g., a thread wakes up on CPU B, then appears to migrate from
CPU A to B). This phenomenon is exacerbated as more event types are
collected, since the problematic mechanism is hit when an event is emitted
while another event is being emitted. To handle this, we've introduced a
variety of event loaders
<https://github.com/google/schedviz/blob/master/analysis/sched_event_loaders.go>.
The default event loader tolerates duplicate wakeups and
wakeup-while-running, since this is a really common issue, but anything
else is rejected with an inference error. But there are two others
provided: the fault tolerant loader, which allows problematic migrates to
be dropped, and a switch-only loader, which ignores all events but
switches. These represent increasingly permissive event-stream
interpretation, with increasing uncertainty (though typically the
fault-tolerant loader only needs to drop O(10) events, out of millions, to
make the trace coherent.)
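For illustration, a minimal sketch of the clipping step from point 2, for
flight-data-recorder mode. The event type, the overflowed flags, and clipFDR
are illustrative names rather than SchedViz APIs, and each per-CPU slice is
assumed to be sorted by timestamp:

package main

import "fmt"

type event struct {
	Timestamp uint64
	Clipped   bool
}

// clipFDR marks as clipped every event older than the latest "first event"
// found among the CPUs whose buffers overflowed (FDR mode loses the start of
// an overflowed buffer, so nothing before that point is trustworthy).
func clipFDR(perCPU map[int][]event, overflowed map[int]bool) {
	var cutoff uint64
	for cpu, evs := range perCPU {
		if overflowed[cpu] && len(evs) > 0 && evs[0].Timestamp > cutoff {
			cutoff = evs[0].Timestamp
		}
	}
	for _, evs := range perCPU {
		for i := range evs {
			if evs[i].Timestamp < cutoff {
				evs[i].Clipped = true
			}
		}
	}
}

func main() {
	perCPU := map[int][]event{
		0: {{Timestamp: 100}, {Timestamp: 200}},
		1: {{Timestamp: 150}, {Timestamp: 250}}, // cpu1 overflowed
	}
	clipFDR(perCPU, map[int]bool{1: true})
	fmt.Println(perCPU) // cpu0's event at t=100 is now clipped
}

One-shot mode is the mirror image: take the earliest last event among the
overflowed buffers and clip everything later than it.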
It looks like the per-CPU buffers in your trace are really small (20KiB),
which means that buffer overflow is very likely and the lack of clipping
problematic. However, the good news is that the fault-tolerant loader is
generally able to work around the lack of clipping, hence you can use it to
load your traces. You might also try increasing the per-CPU buffer size
(or decreasing trace duration) until overflow is uncommon. Internally, we
mostly use 8MiB per-CPU buffers, and very seldom overflow even on 2s
collections; the time-to-overflow of course depends on the system load and
the amount of scheduling thrash, but traces under 10s very rarely overflow
at all.
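As an aside, the per-CPU buffer size is the standard ftrace knob; here is a
minimal sketch, not a SchedViz utility, of bumping it to 8MiB before
collecting (the tracefs mount point may differ on your system):

package main

import (
	"log"
	"os"
)

func main() {
	// buffer_size_kb sets the size of each per-CPU ring buffer, in KiB.
	// /sys/kernel/debug/tracing is the traditional mount point; newer
	// kernels also expose /sys/kernel/tracing.
	const knob = "/sys/kernel/debug/tracing/buffer_size_kb"
	if err := os.WriteFile(knob, []byte("8192\n"), 0o644); err != nil {
		log.Fatalf("setting %s: %v", knob, err)
	}
}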
Lee
On Wed, Jan 8, 2020 at 4:41 AM, david-laight wrote:

I've been getting 'inference error (CPU)' trying to load some traces. Having written a program to read and dump (and adjust) the raw trace data, I think I know what is happening.

A process (pid 3857) in the trace (I'll try to attach) has run a few times on cpu2. At (raw) time 39866b80c873 it is migrated to cpu1 (stolen by) and runs. At 39866c19166e cpu1 migrates it back to cpu2 and then enables it. At 39866c1a093f it executes on cpu2. These are the only 'migrates' for pid 3857 in the trace, so they must be relevant.

The file upload fails because the process state is 'unknown' on cpu2 @179948522, having gone 'sleeping' on cpu1 @179792871.

Digging through the *.go files, I think I understand the problem. The large absolute timestamps are converted to small values in setTimestampNormalize(). This code subtracts the timestamp of the first real trace entry from all the others. However, it is only (AFAICT) looking at the data from a single cpu, so it subtracts different amounts from the timestamps in the different files. This slight difference doesn't usually matter, but when the migrate is done by the cpu the process last ran on, immediately prior to the process being woken, the process's schedule trace can end up with a timestamp earlier than the migrate.

Note that the trace files contain a lot of 'pad' entries where they've been 'adjusted' so that they all start at the same time and only contain valid trace.

(Oh, can you also fix ringBufferEvent() so it doesn't reject 'Padding' with a time interval of zero - these happen if a pad is added after a 'time extend' entry.)

trace.tar.gz
<https://github.com/google/schedviz/files/4035275/trace.tar.gz>
I need to load traces that overflow. I intend running tests that take hours and trigger the trace to stop when there are excessive scheduling delays. Since the test is processing real-time audio, 100ms of trace is a long time! I will use a slightly longer trace, but I'm testing the loading of wrapped traces.

When I've dumped the raw trace files with my own decoder, I don't see any missing events or wakeups mis-sequenced against migrates.

If I (well, my program!) look through all the trace files, find the timestamp of the first trace after the wrap (typically the first trace in the second 4k page), and take the latest of those timestamps across all the cpus, I then want to discard all the entries before that timestamp, since the trace after that point is complete.

If I just change all the entries themselves to 'pad' ones, the trace still fails to load. Zeroing the header flags byte doesn't help. Changing the page header timestamp to a few ns before the trace timestamp and using a single pad to delete all the unwanted traces gives me a trace file that loads.

So I'm fairly certain that something is wrong with the logic that timestamps the trace entries and then merges them into a single trace. I can probably debug it, but the instructions I have for building and running it aren't very informative about what they are actually doing.

What do I have to rerun to rebuild the code that is run when the browser uploads a file? I'm guessing I don't need to restart the webserver after such a change.
Did you try the code change I suggested at the start of my response? That works for me.
Is the foo.go code interpreted? I've never looked at any before. If that change just allows the process state transition through, it is just papering over the cracks in the code that parses the traces. There are no missing (or mis-sequenced) traces in the files I'm using when I decode them with a little program I wrote.
It is compiled. You'll need to rebuild and rerun.
I don't believe it is 'papering over the cracks'. I believe its results
are correct, and that it's dropping only those events that don't work with
the preponderance of evidence from all events.
If you'd like to use your own trace interpreter to generate processed
traces to view in SchedViz, this is quite possible; look at how we load
traces from eBPF for an example.
I've managed to add some trace and found one of the problems. When I try to delete unwanted events by converting them to 'pad', the time deltas aren't added in. The library code in linux/tools/lib/traceevent/kbuffer-parse.c (which is what I based my code on) assumes that the high 27 bits are a time delta for all trace types. The schedviz code, on the other hand, has some strange comments about the time delta of pads being zero versus non-zero meaning something, and it also fails to add in the time delta for pad entries. Always adding in the time delta would simplify the code - especially since I'm 99.999% sure type 31 entries haven't been used for a long, long, long time.

Anyway, I've now got a program that deletes the early events from traces that have wrapped.

An extra question: if I trace an extra event (and include its format in the tar file), will I be able to enable it as a marker on the 'events' page?
You should be able to include non-sched events and visualize them in the
viewer.
The "strange comment" I think you are referring to comes from the kernel: The time_delta doesn't appear to be used for padding events when iterating over events in the kernel (this is the code we looked at when writing the SchedViz parser): |
I was looking at this: https://github.com/torvalds/linux/blob/master/tools/lib/traceevent/kbuffer-parse.c#L345, which is intended to be library code that applications can use to read the raw trace. It doesn't bother to parse absolute time stamps - I suspect they are never used (it skips the wrong length!). The kernel probably never writes 'pad' entries (except at the end of trace blocks), so it knows the time update is irrelevant. If 'pad' can be used for deleted items, then the timestamp probably needs to be processed.
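To make the point concrete, a minimal sketch, not the SchedViz parser, of the decode being argued for here. It assumes the layout described above (low 5 bits type/length, high 27 bits time delta, with the kernel's conventional type value of 29 for padding) and simply accumulates the delta for every record:

package main

import "fmt"

// Padding records use type value 29 in the kernel's ring-buffer convention;
// 30 is a time extend and 31 an absolute timestamp (not modeled here).
const typePadding = 29

// decode splits one 32-bit record header into its type/length (low 5 bits)
// and time delta (high 27 bits).
func decode(hdr uint32) (typeLen uint32, delta uint64) {
	return hdr & 0x1f, uint64(hdr >> 5)
}

// advance accumulates the delta for every record, padding included - the
// simplification suggested above.
func advance(hdr uint32, ts uint64) uint64 {
	_, delta := decode(hdr)
	return ts + delta
}

func main() {
	ts := uint64(1000)
	ts = advance(7|(25<<5), ts)           // ordinary event, delta 25
	ts = advance(typePadding|(40<<5), ts) // event deleted by turning it into a pad, delta 40
	fmt.Println(ts)                       // 1065
}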