1/1000 timebase causes deviation when converting with other protocols #3

Open
winlinvip opened this issue Apr 11, 2023 · 25 comments

@winlinvip

The timebase of RTMP/FLV is 1/1000, while that of MPEGTS and other protocols may be 1/90000. When converting MPEGTS to RTMP/FLV, small deviations are introduced, which can cause stuttering.

Is it possible to support other timebases in enhanced RTMP?

See issues SRS#512 and SRS#547.
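
For illustration, a minimal sketch of the deviation being described, assuming 1024-sample AAC frames at 44.1 kHz carried as 90 kHz MPEGTS ticks and truncated to RTMP's 1 ms ticks (illustrative only, not from any spec):

#include <stdio.h>
#include <stdint.h>

/* Each 1024-sample AAC frame at 44.1 kHz is 2089.7959... ticks at 90 kHz.
 * Truncating the running 90 kHz clock to whole milliseconds introduces a
 * sub-millisecond error that varies from frame to frame. */
int main(void) {
    double pts_90k = 0.0;
    for (int frame = 1; frame <= 100; frame++) {
        pts_90k += 1024.0 * 90000.0 / 44100.0;
        uint32_t rtmp_ms = (uint32_t)(pts_90k / 90.0);   /* truncated to ms */
        double err_ms = pts_90k / 90.0 - rtmp_ms;
        if (frame % 25 == 0)
            printf("frame %3d: exact %.3f ms, RTMP %u ms, error %.3f ms\n",
                   frame, pts_90k / 90.0, (unsigned)rtmp_ms, err_ms);
    }
    return 0;
}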

@veovera
Owner

veovera commented Apr 17, 2023

Great topic; we will add this to our backlog to investigate.

@igorshevach

igorshevach commented Aug 2, 2024

We at Kaltura are having issues with AAC audio timestamps that get cropped with some players due to the lack of timestamp precision, exactly because of this.
Please add a composition offset / timestamp enhancement! E.g., extend the timestamp with a variable-bit-length field depending on the appropriate timescale, or fix it at a higher timescale.

@veovera
Owner

veovera commented Aug 2, 2024

please add composition offset/ timestamp enhancement!

Can you provide more details on what you want to see on the wire and where? Perhaps an end-to-end description of a packet.

@igorshevach

igorshevach commented Aug 4, 2024

I apologize for the misleading use of "composition offset"; I meant chunk types 0, 1 and 2, i.e. those containing timestamp information. A new optional timestamp extension field would be placed after the audio/video tag header: a field named timestampFraction representing the fractional part of the timestamp at 100-nanosecond resolution (10^-7 s). The field would need to encode up to 10^4 distinct values, i.e. up to 2 additional bytes. Why 100 ns? Because that is the resolution of file timestamps used by operating systems, and it is considered the maximum feasible time resolution for timestamps. The required resolution and number of bits are subject to discussion.
With a total resolution of 10^-7 s, the fractional field counts 1/10000 of a millisecond, and the total timestamp is T' = timestamp in milliseconds / 1000 + fraction / 10^7 seconds. The only remaining part is signaling the presence of the fractional part in the bitstream. Without loss of generality we may require that multiple independent extensions apply concurrently, and as such that is outside the scope of this discussion, i.e. up to you :)
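
A short sketch of the reconstruction described above, with hypothetical field names (the fraction counts 100 ns units within one millisecond, so it ranges over [0, 10000) and fits in 14 bits, i.e. up to 2 bytes):

#include <stdint.h>

static double total_timestamp_seconds(uint32_t timestamp_ms,
                                      uint16_t timestamp_fraction /* 0..9999 */) {
    /* T' = timestamp_ms / 1000 + fraction * 1e-7 seconds */
    return timestamp_ms / 1000.0 + timestamp_fraction / 1e7;
}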

@zenomt

zenomt commented Aug 4, 2024

the current timescale of 1000/second is more than adequate for the original intended purpose of synchronizing video, audio, and data messages within human perception for playback in Flash Player, and given how the Flash timing model works. however, i agree that the coarseness of the timescale is annoying when transmultiplexing for other formats (like MP4 or M2TS) or environments (Safari's implementation of Media Source Extensions will have audible pops if audio frame timestamps aren't accurate to within a sample or two). currently, to accurately transmultiplex to a format like MP4 or M2TS, way too much knowledge about the audio codecs in use is required in order to keep sample-accurate time, along with annoying heuristics to allow for discontinuities in the original RTMP timestamps.

i strongly recommend against changes to the Chunk Stream (like extending the timestamps in some way). this would need to be negotiated in some way right at the beginning, which most likely would require a new RTMP Chunk Stream version number (the current publicly documented version of the RTMP Chunk Stream is version 3; later version numbers are used for other proprietary things, and having a number greater than one of those would imply support for the proprietary and undocumented extensions to the Chunk Stream). further, this kind of change would just be for the Chunk Stream transport for RTMP; it wouldn't address RTMFP or FLV.

rather than changing the Chunk Stream, what if there was a new AudioPacketType for Enhanced RTMP audio messages that encodes a (signed 32 bit) number of nanoseconds offset for the RTMP timestamps of all* following audio messages in that stream (or track i guess) until superseded, probably right before the very next audio message. these messages could be stored in FLV, and could be transmitted/forwarded by RTMP Chunk Stream and RTMFP with no additional work.

(*) perhaps instead of "all following audio messages until superseded", the offset should only apply to "the next audio message(s) having the same ordinary RTMP timestamp", which would simplify some other processing, especially for expiring/abandoning one of these along with the audio message it goes with. i think it's likely that if there's a high resolution offset, it'll probably be different frame-to-frame; otherwise a permanent sub-millisecond offset isn't useful since that's just shifting the entire millisecond-accurate timeline for synchronization, and that's below human perceptibility.

for some time i'd been thinking the right way to address this problem would be to negotiate a different timescale (number of timestamp ticks per second) in the connect/_result handshake when first connecting to a server. however, this is significantly more complicated than a new AudioPacketType message, would shorten the time to RTMP timestamp rollover (which people already don't get right), and wouldn't work for FLV.
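
A hypothetical sketch of the separate-message idea above (names illustrative, not spec): a signed 32-bit nanosecond offset refining the 1 ms RTMP timestamp of the audio message(s) it applies to:

#include <stdint.h>

typedef struct {
    uint32_t rtmp_timestamp_ms;  /* coarse 1 ms RTMP timestamp        */
    int32_t  offset_ns;          /* signed high-resolution correction */
} HighResOffset;

/* Refined presentation time in nanoseconds. */
static int64_t refined_time_ns(HighResOffset o) {
    return (int64_t)o.rtmp_timestamp_ms * 1000000 + o.offset_ns;
}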

@veovera
Owner

veovera commented Aug 4, 2024

Thank you for the detailed feedback and suggestions. Here are my thoughts:

  1. I agree that it's crucial for the entire system to function cohesively, and supporting FLV with a new timestamp is an essential aspect of this.
  2. We need to clarify our primary objective. Are we aiming to enhance human perception for playback, or are we focusing on facilitating tooling and transmuxing? As mentioned, human perception for playback currently works well. So, if the goal is to support tooling and transmuxing, we need to address that specifically.
  3. Can you provide an end-to-end example to illustrate the issue? For example, a real-world use case where the coarseness of the RTMP timescale has caused significant problems. This could be something like a live streaming scenario where precise synchronization across different platforms is critical. Understanding the specific contexts where this issue arises will help us better address the problem.

Regarding the timescale, the current 1000/second is adequate for its original purpose of synchronizing video, audio, and data messages within human perception for playback in Flash Player, considering the Flash timing model. However, I agree that the coarseness of this timescale can be problematic when transmuxing to other formats like MP4 or M2TS or dealing with environments like Safari's Media Source Extensions, where accurate timestamps are critical to avoid audio issues.

At the risk of being redundant, the suggested proposal sounds like:

  1. Introduce a New AudioPacketType:

    • Create a new AudioPacketType for Enhanced RTMP audio messages.
    • This type will encode a (signed 32-bit) number of nanoseconds offset for RTMP timestamps of all following messages in that stream until superseded.
    • This will allow for more precise timestamping without altering the existing Chunk Stream or RTMP protocol versions.
  2. Maintain Backward Compatibility:

    • Ensure that the new PacketType can be ignored by systems that do not support it, maintaining backward compatibility.
    • The new messages should be stored in FLV and can be transmitted/forwarded by RTMP Chunk Stream and RTMFP with no additional work.
  3. Simplify Processing:

    • Apply the offset only to the next message(s) having the same ordinary RTMP timestamp. This approach will simplify processing and handling of these messages.

Questions:

  1. Is the AudioPacketType that carries the timestamp offset a separate message, or is it combined with another AudioPacketType (e.g., CodedFrames) into one message? The latter is more optimal but less backward compatible; that may be acceptable since E-RTMP for audio is new and still in alpha.
  2. How do we handle a late joiner? Perhaps the solution is that the offset is only valid for the current (i.e. one) Audio message?
  3. What about Video and Data messages?

This proposal aims to provide a more precise and flexible timestamping mechanism that will facilitate smoother transmuxing and compatibility with high-precision environments.

@zenomt thanks for the suggestions on how to solve this. Your insights are invaluable in shaping this approach.

Looking forward to all the feedback and any further suggestions!

@zenomt

zenomt commented Aug 5, 2024

i think the issue raised by @igorshevach is more to do with transmuxing and playback in systems (like Safari) where more precise timestamps are required to avoid audio glitches. i wouldn't worry about more precise timing for video or data unless or until it actually becomes a problem. i don't think that's likely until we're talking about >100 frames/second, and even then we most likely would have acceptable fidelity and jitter at rates approaching 250 Hz. also, unlike audio playback, video frames don't have an inherent duration, and are supposed to be presented at the specified time, whereas audio frames do have inherent duration, and the up-to-a-millisecond error of the timestamps is what can lead to audio pops and glitches.

@veovera i was proposing a new separate message to precede each audio message and having the same timestamp of that message to encode a high resolution offset. however, having a whole nother message is a lot of extra bytes, which can add up in FLVs and on the wire. instead i think i like your insight of "E-RTMP Audio isn't 'done' yet" better. i'd suggest having a new "CodedFrames with high resolution offset" AudioPacketType with a new signed 16 bit field in a fixed position between the FourCc and the coded data, encoding a + or - from the RTMP timestamp in units of 1/32768000 second (about 30.5 nanoseconds). if nanosecond precision is needed, make it signed 24 bits in units of 1/8388608000 second (about 0.119 nanoseconds). keep the existing "CodedFrames" AudioPacketType as-is, for when the offset is 0 or unknown.

using a new AudioPacketType for "CodedFrames with high resolution offset" is the simplest all-around i think, because most processing stages and simple forwarders just need to recognize that packet type as a "coded frame" and treat it as such (like for applying a transmission deadline). it's only a transmux or final playback & rendering that would need to take the high res offset into account.
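
A sketch of the unit conversion for the suggested signed 16-bit field (assumption: two's-complement ticks of 1/32768000 s, about 30.5176 ns each, covering just under +/- 1 ms):

#include <stdint.h>

static double offset_seconds(int16_t ticks) {
    return ticks / 32768000.0;   /* 32768 ticks per millisecond */
}
/* e.g. offset_seconds(32767) ~ 0.00099997 s; offset_seconds(-32768) = -0.001 s */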

@zenomt

zenomt commented Aug 5, 2024

note: the only reason to use a signed high-res offset instead of unsigned is to accommodate different rounding policies for the traditional coarse RTMP timestamp (1 ms accuracy); that is, "round to nearest ms" or "round down by truncating the fraction of a ms". it would be much simpler to say that the offset is unsigned 16 (or 24) bits of fractional milliseconds after the coarse RTMP timestamp. i'm not sure accommodating different rounding policies is necessary or desirable, particularly since processing the coarse timestamps when fine timing is needed today must already allow for an up-to-1-ms error.

having it be signed is more flexible and allows for either rounding policy, but requires a smidge more effort by processors to correctly handle a negative offset. but note that today's "composition time offset" in video for AVC and HEVC is already signed, and processors need to do the right thing there too.

i have no strong preference either way, but i think going with signed costs very little and retains more flexibility for use cases we might not be seeing right away.
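
To illustrate why a signed offset accommodates both rounding policies, a sketch splitting an exact time into a coarse millisecond timestamp plus an offset under each policy (names hypothetical):

#include <stdint.h>

static void split_time(int64_t exact_ns, int round_nearest,
                       uint32_t *coarse_ms, int32_t *offset_ns) {
    int64_t ms = round_nearest ? (exact_ns + 500000) / 1000000  /* nearest ms */
                               : exact_ns / 1000000;            /* truncate   */
    *coarse_ms = (uint32_t)ms;
    *offset_ns = (int32_t)(exact_ns - ms * 1000000);
    /* truncation always yields offset_ns >= 0; round-to-nearest can yield a
     * negative offset, which is why a signed field retains more flexibility */
}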

@igorshevach

I would like to thank everyone involved for the thoughts and propositions. I feel the direction of the solution is now correct. I only want to emphasize that we should not underestimate the significance of timestamp correctness, judging from the codec implementations in use and the quality of the equipment. I think that by extending both the audio and video tag headers we ensure that no matter what codecs come into use in the future, no further additions will be needed in this regard. @zenomt Can you please elaborate on how the rounding decision is made? Is it documented elsewhere?

@winlinvip
Author

A timescale of 1/1000 obviously causes many issues. This is why nginx-rtmp and SRS, when converting RTMP to MPEGTS, do not rely on the RTMP timestamp but instead recalculate timestamps based on the AAC sample count (sketched below); otherwise there would be audible audio noise. However, this method has many potential pitfalls and does not solve the underlying problem of insufficient timestamp precision in RTMP. It only accurately recalculates the audio timestamps for MPEGTS, whose timescale of 90000 makes it 90 times more precise than RTMP.
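
A sketch of that recalculation as done conceptually in SRS/nginx-rtmp (names illustrative): derive the 90 kHz PTS from the running AAC sample count rather than from the 1 ms RTMP timestamp:

#include <stdint.h>

static int64_t aac_pts_90k(uint64_t frames_sent,
                           uint32_t samples_per_frame,  /* 1024 for AAC  */
                           uint32_t sample_rate) {      /* e.g. 44100 Hz */
    return (int64_t)(frames_sent * samples_per_frame) * 90000 / sample_rate;
}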

RTMP timestamp rollover is a significant potential risk. A 24-bit timestamp wraps around roughly every 4.7 hours, and different software implementations of extended timestamps are inconsistent, which makes it difficult to verify whether an implementation truly complies with the standard. MPEGTS uses a longer timestamp, and it is advisable to use more bits to avoid rollover issues. WebRTC's RTP timestamp has an even shorter effective length, making it more prone to rollover. The principle is the same for lengths shorter or longer than 24 bits: longer timestamps postpone rollover, while shorter ones wrap around sooner.

I personally recommend using a longer bit length, since current network bandwidth and the large audio/video bitrates involved can easily accommodate the extra bits needed to avoid timestamp rollover and to support a more precise timescale.

@zenomt

zenomt commented Aug 13, 2024

when converting RTMP to MPEGTS, do not rely on the RTMP timestamp but instead recalculate timestamps based on the AAC sample count

if you don't rely on the RTMP timestamps at all, then audio and video will go out of sync if there are any missing audio frames, or if the actual audio sample rate differs from the nominal sample rate, even by a little bit. to work properly, using the AAC sample count also requires some heuristics that look at the RTMP timestamps to see if you're "close enough" to the RTMP time to decide you haven't missed one or more frames, or that the sample clock hasn't drifted too far from the wall clock. if there's too big a discrepancy, you need to signal a discontinuity and resynchronize.

Otherwise, there would be audible audio noise problems.

not if you use RTMP timestamps for their intended purpose. :) RTMP's timestamps were intended to synchronize audio, video, and data for playback in Flash Player. when there's an audio track, the timestamp of each audio message establishes/snaps the "current system time" at the instant of that message's first decoded audio sample being played, and then the system time advances with real time as long as audio is still playing up to the next audio message and its timestamp. video and data frames are then rendered according to the system time. this can cause video frame rendering jitter of up to 1ms, which is still more accurate than can be reproduced with your monitor for nearly all practical values of "your monitor".

@winlinvip
Author

winlinvip commented Aug 13, 2024

@zenomt On the contrary, using RTMP timestamps will lead to audio noise. Initially, SRS used RTMP timestamps, which caused issues, so it switched to recalculating timestamps from AAC sample counts. In fact, nginx-rtmp does the same. There is a very detailed analysis of this in ossrs/srs#547 (comment).

In short, the RTMP timestamp is not accurate. For 44100 Hz audio, each AAC frame lasts:

1024/44100.0 = 0.02321995 s = 23.21995 ms

The audio frame timestamp is set to 23 ms, losing about 0.22 ms per frame, and this is what causes the audio noise when converting to HLS. Using RTMP timestamps makes the RTMP-to-HLS conversion easier to calculate, since you only need to multiply by 90, but it is not correct.
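
A sketch of that drift, assuming a muxer that advances the timestamp by a fixed 23 ms per frame (the ~0.22 ms dropped per frame accumulates; truncating a running exact clock instead keeps the error under 1 ms):

#include <stdio.h>

int main(void) {
    double exact_ms = 0.0;
    long   rtmp_ms  = 0;
    for (int frame = 1; frame <= 1000; frame++) {
        exact_ms += 1024.0 * 1000.0 / 44100.0;  /* 23.21995... ms per frame */
        rtmp_ms  += 23;                          /* per-frame rounding       */
    }
    /* ~220 ms of accumulated drift after 1000 frames (~23 s of audio) */
    printf("exact %.1f ms, rtmp %ld ms, drift %.1f ms\n",
           exact_ms, rtmp_ms, exact_ms - rtmp_ms);
    return 0;
}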

@zenomt

zenomt commented Aug 13, 2024

@zenomt On the contrary, using RTMP timestamps will lead to audio noise.

my point is that using RTMP timestamps as intended (that is, using the RTMP/Flash timing model) does not lead to audio noise or desynchronization. the RTMP/Flash timing model does not involve scheduling audio samples to play back at a particular time; rather, the timestamps of the audio messages and the continuous playback of samples at their natural sampling rate establishes the clock against which video and data messages are rendered.

when playing back audio in a system that schedules audio samples to play at a particular time, then yes, RTMP timestamps have insufficient precision to align to within a single audio sample.

@veovera
Owner

veovera commented Aug 13, 2024

Thank you for the thoughtful discussion on this topic. There are several approaches to solving this problem, and while there isn’t a single 'right' way, what follows is our formal proposed solution that maintains compatibility with standard timestamp tracking practices. I encourage you to review it and share any feedback you may have.

E-RTMP Specification

  • Audio timestamp offset signal within the packet type <link>
  • Video timestamp offset signal within the packet type <link>
  • Timestamp offset types. For now we only propose a presentation time offset. In the future we might have things like composition time, decoding time offsets <link>
  • Audio bitstream parsing logic <link>
  • Video bitstream parsing logic <link>
  • Enhanced timestamps capability flags <link>

Writeup

We are enhancing both audio and video RTMP messages by adding the optional capability to apply nanosecond offsets to the standard 32-bit RTMP timestamps, which are in milliseconds. When required, this enhancement allows us to fine-tune the presentation time of each message within the media streams with much greater precision. The nanosecond offset is particularly useful for addressing RTMP’s timescale limitations and improving compatibility with formats like MP4 and M2TS, as well as supporting environments like Safari's Media Source Extensions. By applying this fine-grained offset, we can ensure that audio, video, and data streams remain perfectly synchronized across various media formats and playback environments, without needing to alter the core 32-bit RTMP timestamps. However, it's important to note that the nanosecond offset in Enhanced RTMP (E-RTMP) is optional and should only be used when higher precision is necessary for specific audio and/or video messages.

In this specification, when the VideoPacketType or AudioPacketType is identified as TimestampOffsets, the system first checks if additional offsets need to be processed and then retrieves the type of timestamp offset. If the type is TimestampOffsetType.Nano, the system processes this nanosecond-level precision offset by fetching an unsigned 20-bit nanosecond value (just enough to add up to one millisecond). This value is then applied to the media message timestamp, providing the needed precision for synchronization. If the same TimestampOffsetType is encountered multiple times within the same packet, the bits should be combined from left to right to create a larger value, enabling offsets greater than 1 millisecond. This approach is particularly beneficial in specialized solutions where the presentation time needs significant adjustments beyond mere precision, such as when addressing substantial delays or timing corrections.
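
For illustration, a sketch of the combination rule just described, under the proposal's own assumptions (names hypothetical): repeated Nano offsets are combined left-to-right, each 20-bit field appended to the right of the accumulated value, which is then applied to the 1 ms RTMP timestamp:

#include <stdint.h>

enum TimestampOffsetType { TsOffsetNano = 0 };  /* hypothetical value */

static uint64_t combine_nano_offsets(const uint32_t *fields, int count) {
    uint64_t nanos = 0;
    for (int i = 0; i < count; i++)
        nanos = (nanos << 20) | (fields[i] & 0xFFFFF);  /* append 20 bits */
    return nanos;  /* then added to the media message timestamp */
}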

We considered various approaches, such as not allowing multiple offsets, replacing the old value, or supporting the combination of offsets. Ultimately, we opted to support combining offsets to enhance the system's flexibility, even though this feature may be rarely required and only in specific scenarios. After processing the nanosecond offset, it is integrated with the existing timestamp handling logic to adjust the presentation time of the media samples as necessary.

Looking ahead, we plan to explore adding other types of timestamp offsets related to composition, decoding, and other aspects of media playback, further expanding our capability to fine-tune the presentation of media streams.

So, who wants to test this? :)

@igorshevach

igorshevach commented Aug 18, 2024

@veovera what is the testing procedure? do you provide a sample rtmp stream in the specified format, or encoder software?

@zenomt

zenomt commented Aug 18, 2024

@veovera : i have a few concerns about the proposal above.

  1. i understand the "backward compatible" aspect of having a separate message preceding coded frames to add extra precision if you want. however:
    • this is a lot of extra bytes on the wire vs the amount of information being sent. especially for audio where the coded frames are already pretty small and message overhead is already significant.
    • a separate message complicates real-time treatments (like transmission deadlines). you want to make sure that a timestamp offset message has a chance to get through, but you don't want to try forever if you're going to potentially abandon its accompanying audio message. the worst case is that the following audio message makes it through (in a transport like RTMFP, which has partial reliability and retransmission at the individual message level), but the timestamp offset message didn't make it through before its deadline. this isn't insurmountable, but it's a significant increase in complexity to handle properly.
    • related to the above point, a separate message might make it impossible to properly process audio messages when in a "super low latency" mode (possible with RTMFP), where you take delivery of messages as they arrive in the network and put them into a short reorder/dejitter buffer. there's a possibility that the timestamp offset message for an audio frame might arrive after the playout time for a frame that's already arrived, or worse (especially if you allow for shifting the time by 40 bits of nanoseconds, which is up to 1024 seconds) the audio message might actually be in the wrong spot/order in the reorder buffer and playout would be all jumbly.
    • this message becomes yet another "sequence special" message that needs to be preserved (at least temporarily, and cleared at the appropriate time), and played out for "late joiners" who arrive between the timestamp offset message and its accompanying coded frame(s) (again adding complexity).
  2. assuming a separate timestamp offset message, i don't understand the utility of the "shift to add more bits" (to go to, say, 40 bits of nanoseconds allowing up to 1024 seconds). if there's a major synchronization problem, i feel the RTMP timestamps should be adjusted, rather than trying to adjust the timeline with a timestamp offset message.
  3. in the "shifting left to add more bits" case, what happens if the first 20 bits is nanoseconds, but the second 20 bits is something else (once there is another TimestampOffsetType to set the field to)? if they all have to be the same type, why is the type repeated?

if you're already planning on signaling support with a capsEx flag, i'd recommend the much simpler approach of a new "coded frames with extra precision" type that includes a field to get to nanoseconds. that way the coded frame and its high-precision timestamp are atomically bound, which solves all of the transmission deadline, reorder, and "sequence special" problems, and is much less overhead compared to the alternative. and i would recommend against being able to shift by more than one ms, or of having different possible precisions (since then you need to signal support for new precisions too).

@veovera
Owner

veovera commented Aug 19, 2024

@zenomt Thanks for your detailed feedback! To ensure I understand your points correctly:

  • Are you suggesting that a TimestampOffsetType should not be repeated within a message? This would imply rewording the proposed behavior below.
// If the same TimestampOffsetType is encountered more than once in the same
// packet, we combine the bits left-to-right to create a larger value. This
// ensures that the first offset is placed in the more significant bits, and
// subsequent offsets are appended to the right. This is useful if there is
// a need to offset the presentation by more than 1 millisecond, which might
// be required in unique solutions where the presentation time needs to be
// offset for reasons beyond precision (e.g., significant delays or
// corrections).
  • Could you provide an example of how you would propose signaling the TimestampNanoOffset capability? Are you suggesting an alternative to the approach proposed below?
enum CapsExMask {
  Reconnect           = 0x01,
  Multitrack          = 0x02,
  TimestampNanoOffset = 0x04,  // Indicates support for nanosecond offset
};

@veovera
Owner

veovera commented Aug 19, 2024

@veovera. what is the testing procedure? do you provide sample rtmp stream in specified format or encoder software?

@igorshevach
Thanks for your interest in testing! As part of our open-source initiative, we provide the E-RTMP specification and encourage the community to contribute to its development and testing. There have already been many valuable contributions.

VSO does not provide sample E-RTMP/FLV streams, files, or encoder software directly. We rely on the community to create and share such resources. While there is no sample content specifically for enhanced timestamps at this time, we hope those interested in this capability will be able to test it within their own setups and contribute back.

If enhanced timestamp capability is what you're looking for, we hope you find the specification straightforward for implementing E-RTMP in your solution.

We welcome any feedback and contributions to help refine and enhance the specification based on real-world use. The feedback we've received so far has been very compelling, and we look forward to any further input you or anyone else may have!

@zenomt

zenomt commented Aug 19, 2024

if using the "TimestampOffsets" message, i'm suggesting that there only be one offset in it, because i don't believe there's a reasonable use for > 1ms of offset, and the bit shifting won't work if there are different kinds of offsets you're trying to combine together in the same message. if there's a compelling reason i don't currently understand to encode offsets > 1ms, then i'd say that repeating the TimestampOffsetType doesn't make sense and shouldn't be done, especially if the fields are to be combined by shift+add. (also, "shift+add" would mean that the first "nanoseconds" field is not in fact nanoseconds, but actually number of 1048576 nanosecond periods, which makes the semantics of that field even less precise).
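
The arithmetic behind that parenthetical, as a tiny illustration: under shift+add, a value of 1 in the first 20-bit field contributes 2^20 ns, so the first "nanoseconds" field is really in ~1.05 ms units:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* first field = 1, second field = 0 */
    uint64_t combined = ((uint64_t)1 << 20) | 0;
    printf("%llu ns\n", (unsigned long long)combined);  /* 1048576 */
    return 0;
}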

but i'm really suggesting not having the "TimestampOffsets" message, and instead having a new type of Coded Frames message that includes 3 more bytes to encode the number of additional (and i think it should be signed, so + or -) nanoseconds (and only nanoseconds) to add to the RTMP timestamp to get the "high precision" timestamp. support for this new coded frames message could be negotiated between client and server with a capsEx flag, maybe called HighPrecisionCodedFrames or something (depending on what the actual message type ends up getting called).

if a server tells a client that it supports the high precision coded frames messages, then it (BCP 14) MUST also be prepared to translate those messages to the normal-precision coded frame types when forwarding those messages to a client that didn't signal that it understands them.

PS. if you really really want to have the TimestampOffsets message and encode offsets > 1ms, then i'd make the field variable-length (minimum of 20 bits, total length derived from the RTMP message length). so it could be 20 bits, or 28, or 36, or 44 bits of nanoseconds (or whatever units, in the future), depending on how long the RTMP message is. or you could use a Variable Length Unsigned Integer (VLU) if you wanted to leave the door open for additional fields in the future. but i also really think this message isn't the right solution, for all the reasons i listed in my previous message.

@zenomt

zenomt commented Aug 20, 2024

closing the loop on my objections: after an offline conversation with @veovera , i see i missed & misunderstood a crucial point in the current proposal. i thought the proposal was for a separate message that would apply a nanosecond offset to following RTMP messages (and would therefore have a huge additional on-the-wire overhead). however, i'm 💯 on board with the actual proposal of an optional field inside the same RTMP message to apply a high-res offset.

i have some minor concerns on how much code it'll take to properly handle this case, both for parsing and potentially rewriting for clients that don't understand this new message type. i'm hoping to have time this weekend to try it out to see if it's onerous or no big deal (my gut feeling is "not that big a deal" but i want to make sure).

@veovera
Owner

veovera commented Aug 22, 2024

closing the loop

Great to hear, and thank you for taking the time to clarify the details. After our offline conversation I made some clarifications in the specification. The updated information is linked below.

E-RTMP Specification

  • Audio timestamp offset signal within the packet type <link>
  • Video timestamp offset signal within the packet type <link>
  • Timestamp offset types. For now we only propose a presentation time offset. In the future we might have things like composition time, decoding time offsets <link>
  • Audio bitstream parsing logic <link>
  • Video bitstream parsing logic <link>
  • Enhanced timestamps capability flags <link>

Once the feedback for this feature has been solidified we will merge the feature/timestamp-offset branch into the main branch.

@zenomt

zenomt commented Aug 24, 2024

@veovera i read through the new revision above, and it looks good. i haven't implemented it yet -- i'm still thinking through the cleanest way for that.

there are still two things that are nagging at me though, but they are minor things that are more about the encoding than the general idea:

  1. this new message type (actually a prefix sub-message in the same RTMP message) is specific to "timestamp offset", so future message modifications that aren't about timestamps will require another message type and more specialized logic.
  2. as written, with the packet length implicit from the type, a parser needs to know the length of every "timestamp offset" type, and if future ones were ever defined, filtering them out or translating them for a peer that doesn't understand them is (a) necessary and (b) painful.

in taking a step back, it occurred to me that this is more like an "option" added to the RTMP message, similar to an RTP extension header or to the "message options" in http://zenomt.com/ns/rtmfp#media.

what if, instead of a "TimestampOffsets" packet type, there was an "Option" packet that had a 4-bit type, 4-bit payload length (in bytes), and then that many bytes (or maybe that many plus one, so you could have 1-16 bytes instead of 0-15) of payload. instead of a "more coming" bit, you could just have more "Option" packets, with the constraint that all the "Option"s had to come first in the message. that would allow other kinds of options in the future, if there was ever a need, and they wouldn't be constrained to just different kinds of timestamp offsets. the most important part, though, is that the only check that needs to be done when sending to a peer is "do they understand the Option packet", and the "filter out" transform is now just "filter them all out" (implemented by "just skip over all the options bytes when forwarding"). peers that understand "Option" packets at all can skip over option types they don't understand because their lengths are explicit. a peer could/should still signal whether it understands particular option types, in case that's important to the other peer.

an enhanced audio message with a nanosecond offset option could look like

soundFormat     = UB[4] = 9: ExHeader
audioPacketType = UB[4] = 6: Option
optionType      = UB[4] = 0: Timestamp Offset Nanos option type
optionLength    = UB[4] = 3: there are 3 bytes of payload
nanosOffset     = UI24     : those 3 bytes of payload

soundFormat     = UB[4] = 9: ExHeader
audioPacketType = UB[4] = 1: CodedFrames
audioFourCc     = FOURCC as AudioFourCc
[audio data]

this approach isn't as clean for video though, since the VideoPacketType would be repeated each time. however, this approach does enable "enhanced RTMP" peers to apply enhanced message options even to legacy RTMP audio and video messages, if that would ever be beneficial.
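
A sketch of skipping a run of the proposed Option packets (all names illustrative; assumes buf spans only the options region and that the 4-bit length counts payload bytes directly, not "plus one"):

#include <stdint.h>
#include <stddef.h>

static size_t skip_options(const uint8_t *buf, size_t len) {
    size_t pos = 0;
    while (pos < len) {
        uint8_t optionType   = buf[pos] >> 4;    /* dispatch on known types */
        uint8_t optionLength = buf[pos] & 0x0F;  /* payload bytes           */
        (void)optionType;                        /* unknown types skippable */
        if (pos + 1 + optionLength > len)
            break;                               /* malformed; stop         */
        pos += 1 + optionLength;
    }
    return pos;  /* forwarding to a legacy peer: drop buf[0..pos) entirely */
}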

@zenomt

zenomt commented Aug 24, 2024

also, unless i'm missing something, i think the "Fetch audioPacketType once more after processing audio timestamp offsets" here leaves the parser having read only 4 bits and not being at a byte boundary to continue processing, where it would be at a byte boundary if there hadn't been the audio timestamp offsets packet.

@zenomt

zenomt commented Aug 24, 2024

the video pseudocode looks to have the same problem.

@veovera
Owner

veovera commented Aug 26, 2024

also, unless i'm missing something, i think the "Fetch audioPacketType once more after processing audio timestamp offsets" here leaves the parser having read only 4 bits and not being at a byte boundary to continue processing, where it would be at a byte boundary if there hadn't been the audio timestamp offsets packet.

Great catch! Yes, it looks like there is a bug in the pseudocode where we end up not on a byte boundary. This means that instead of a 20-bit offset we can actually have a 24-bit offset (16 bits would not be enough) to make sure we stay aligned on a byte boundary. I'll update this in the documentation. Thank you for pointing it out! Also, I'm currently reviewing the additional suggestion...
