-
Notifications
You must be signed in to change notification settings - Fork 200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
floppy: apply NCO-based 2nd-order PLL to WDATA line #864
base: master
Are you sure you want to change the base?
Conversation
Thanks for this! I need to consider whether to apply this as a default, or only, clock recovery. I can probably simplify the edges of this code a bit, where it interfaces to the rest of the firmware. My main concern is how long it takes to run per bit clock, as the old STM32 Goteks aren't very fast, and FlashFloppy is tested up to ED data rate (MFM 500ns clock). What Gotek model(s) has this been tested on? However, fundamentally it is more principled than ae42450 and adjusting the frequency rather than messing directly with the clock phase seems obviously better for dealing with aggressive precomp. I can test this against Greaseweazle easily enough, but my gut says that this implementation shouls be quite a bit more robust. Possibly I should try something like this in Greaseweazle too, which has a PLL that's not really a PLL. Albeit the existing implementation has proven to work pretty well even on marginal disks. I could have two implementations though, and then A/B them. |
I only have 2x Gotek SFRKC30.AT4.35 so I've only tested with the Artery AT32F435. I was pretty careful to choose an implementation that avoids expensive math. There's one uint32_t multiply and one int32_t divide per sample. Everything else is adds, subtracts, and bit shifts. I didn't check the disassembly to see if the multiple and division are replaced by the compiler. |
The other question that comes to mind is how were the PLL parameters arrived at? I'm no PLL expert, and I don't know whether for example the loop frequency and damping factors chosen are good ones, nor whether the resulting calculated k_l and k_i are good. Was there much trial and error? |
I'll admit that I'm not particularly strong on PLL design either. I included a link to https://www.dsprelated.com/showarticle/967.php in the comment with the parameters as I referred to that site often while developing this implementation. I did experiment with various parameters to see what range of loop frequency would work. I left damping factor alone as all the information I turned up said 1 was a safe choice. Ultimately, I found anything from 300kHz up to 750kHz worked fine for a 1us MFM clock with write precomp ranging from 100ns up to 350ns. Given that, I worked backwards from k_l and k_i that were convenient for bit shifting which also resulted in a loop frequency in that range. That resulted in k_l = 1/8, k_i = 1/256 (715kHz) and k_l = 1/16, k_i = 1/1024 (358kHz). Both worked fine with 3.5" HD and Centurion 8" DD flux images and on real hardware. To do all this experimentation (and not need Usagi Electric to fire up the Centurion dozens of times), I wrote a tool (https://github.com/mx-shift/Centurion_Floppy/tree/main/kryoflux_to_flashfloppy) that converts Kryoflux files to a list of tick counts approximating how the AT32F435's timer capture unit generates samples via the DMA engine. This tool can also apply additional write precompensation. Then I extracted the core algorithm from IRQ_wdata_dma into another tool (https://github.com/mx-shift/Centurion_Floppy/tree/main/flashfloppy_to_hfe) so I could try different implementations and parameters. You'll notice that the tool has algorithms from FlashFloppy v4.31, a snapshot of master, Greaseweazle's two PLLs, a design based on an FDC9216, and finally this NCO-based PLL approach. If you have Kryoflux files of other disks you'd like to verify these algorithms against, I'll gladly run them through the tooling and verify HxC can decode the generated HFEs. As I only got into floppy emulation because Usagi Electric was having difficulty with Centurion, I don't have a large library of flux images to draw from for verification. |
Okay that's very awesome! I assume you found this NCO-PLL scheme to work best? Sounds like you used empirical methods to pick good k_l and k_i, and f_n emerged from that rather than working forwards, and that certainly seems fine to me. Regarding the signed divisions, you should find that these compile to three instructions. Unfortunately conventional signed division (which rounds to zero) is not the same as arithmetic shift (which rounds to -inf). For the purposes here however, perhaps arithmetic shift is okay and will save two instructions? It is "safe enough" to try The main weirdness of ASR is that a negative number repeatedly shifted will tend to -1 rather than 0. |
Did you in the end find any reason to pick between 358kHz vs 715kHz? I'd certainly be inclined to go with what experiments say if they are conclusive either way but, since WDATA should have a very stable clock close to what FlashFloppy expects and basically all the noise is bit shifts introduced by precomp, the lower-bandwidth 358kHz parameters would be a natural choice. By the way, you've probably seen this doc before, but I found this useful for thinking about PLL behaviours in relation to floppy drive data: https://info-coach.fr/atari/hardware/_fd-hard/AN-505.pdf |
This NCO-PLL scheme was the only one that could handle the Centurion's
extreme write precomp. I found no difference in results between 358kHz and
715kHz though only 715kHz was tried against actual Centurion hardware. I
expect 358kHz would work as well. I didn't want to trouble Usagi Electric
is all.
Re: bit shifts vs divisions. I tend to always write the code with division
and let the compiler figure it out. Old habits from being a performance
engineer: write the obvious thing until you have results showing you need
to do something different. So the only explicit bit shifts in the code
right now are the initial value of the PLL period and the unsigned scaling
up of timer samples. Everything else is written with multiplication and
division, most of them with power-of-2 constants. If it's a problem on
older Goteks, I can look at what the compiler is doing to see how much
further it can be streamlined.
…On Sun, Jan 21, 2024, 12:37 AM Keir Fraser ***@***.***> wrote:
Did you in the end find any reason to pick between 358kHz vs 715kHz? I'd
certainly be inclined to go with what experiments say if they are
conclusive either way but, since WDATA should have a very stable clock
close to what FlashFloppy expects (say comfortably within 500ppm?) and
basically all the noise is bit shifts introduced by precomp, the
lower-bandwidth 358kHz parameters would be a natural choice.
By the way, you've probably seen this doc before, but I found this useful
for thinking about PLL behaviours in relation to floppy drive data:
https://info-coach.fr/atari/hardware/_fd-hard/AN-505.pdf
Yes, I do skip most of the maths!
—
Reply to this email directly, view it on GitHub
<#864 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACIEHF266G555W46WFH36WLYPTHV3AVCNFSM6AAAAABCDREAVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMBSGU2TKMZQG4>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Well that's fine, I have no strong reason to prefer 358kHz vs 715kHz. Clearly the latter has really good tolerance for bit shifts! And yes makes sense I suppose to leave as is. I have a feeling this may only get applied for the '435 build of FlashFloppy anyway. That given you could probably even have used single-precision floats, no problem! Three-instruction divisions are unlikely to need optimisation there. |
Someone just pointed out a slight numerical accuracy problem in the
phase_integral. I need to move the divide by 256 down to the phase_step
calculation. As is, small errors aren't being accumulated in the integral.
I'll push an updated commit sometime tomorrow.
…On Sun, Jan 21, 2024, 12:59 AM Keir Fraser ***@***.***> wrote:
Well that's fine, I have no strong reason to prefer 358kHz vs 715kHz.
Clearly the latter has really good tolerance for bit shifts!
And yes makes sense I suppose to leave as is. I have a feeling this may
only get applied for the '435 build of FlashFloppy anyway. That given you
could probably even have used single-precision floats, no problem!
Three-instruction divisions are unlikely to need optimisation there.
—
Reply to this email directly, view it on GitHub
<#864 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACIEHF4T4WATCW55IXYJ7NDYPTKHZAVCNFSM6AAAAABCDREAVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMBSGU2TSOBWHE>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Interesting, having a play with your conversion and analysis tools. I don't actually find 350ns precomp reliable with the 715kHz PLL and an MFM clock of 1us. For example (paraphrasing slightly, but this is the sequence of commands ignoring paths):
If I inspect the result using HxC software's track analyzer, there are plenty of badly decoded bitcells and most sectors fail data CRC. Also the resulting HFE seems a bit long, hence HxC software thinks the RPM is approx 298. This is probably just some artifact of your HFE-stuffing code. |
Warning. Ramble: Another thing I took to wondering is if there is a better way to handle precomp shifts, given that we basically know that the only high-frequency noise we will see is the deterministic shifting caused by precomp. Whenever I start thinking about it I reckon I always end up wanting something like a conservative low-bandwidth PLL though. It is interesting that a PLL (with appropriate parameters!) can tolerate even this aggressive shifting. It's not immediately obvious to me that it should be possible. After all the flux timings are expected to be written direct to disk, and only passed through a PLL on readback, at which point the precomp shifts should be mostly compensated by effects of the magnetic media. |
Huh. Is gw applying any precomp? My tool adds additional precomp.
…On Sun, Jan 21, 2024, 3:53 AM Keir Fraser ***@***.***> wrote:
Interesting, having a play with your conversion and analysis tools. I
don't actually find 350ns precomp reliable with the 715kHz PLL and an MFM
clock of 1us. For example (paraphrasing slightly, but this is the sequence
of commands ignoring paths):
dd if=/dev/urandom of=img1440.img bs=1024 count=1440
gw convert --format=ibm.1440 img1440.img img1440_00.0.raw
kryoflux_to_flashfloppy img1440_00.0.raw --write-precomp-ns 350
flashfloppy_to_hfe img1440_00.0.revolution1.ff_samples a.hfe nco_715k
If I inspect the result using HxC software's track analyzer, there are
plenty of badly decoded bitcells and most sectors fail data CRC.
Also the resulting HFE seems a bit long, hence HxC software thinks the RPM
is approx 298. This is probably just some artifact of your HFE-stuffing
code.
—
Reply to this email directly, view it on GitHub
<#864 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACIEHF3EQHXH6CLFGX43YLTYPT6URAVCNFSM6AAAAABCDREAVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMBSGYYDKMBTGA>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
No Greaseweazle doesn't apply any precomp to image conversions. Could the problem be that the data is more random than previous tests, and that certain bit sequences are defeating the higher bandwidth PLL? The lower bandwidth version worked okay on this test input. Worth investigating anyway. |
I believe you are right. I'm starting to play with even lower bandwidth versions to see if 400ns can be tolerated. |
Another interesting measurement for low bandwidth PLLs will be their tolerance to frequency shift. This can also be tested with Greaseweazle as a custom format can be specified with a shifted bitrate (eg 490-510 in place of 500kbps). We should make sure the PLL can lock and lock reasonably fast to deviations of a few percent I'd say? |
I was thinking along the same lines. I started to work through the formulas again and remembered that f_n (loop natural frequency) determines how far the PLL will "pull in" an input that has a frequency offset. One way to think of this PLL design is as a PLL that tracks a nominal 72MHz clock. Since MFM doesn't provide a clock pulse consistently, I'm simulating the PLL clock running N cycles until the MFM edge and then spreading the accumulated error found at that edge across those N cycles. So when thinking about how much the PLL will tolerate a frequency offset, it's in reference to the nominal 72MHz clock and the tolerance is divided across however many ticks occur before an MFM edge. For nco_715k, this means the PLL tolerates roughly 72MHz +/- 715kHz. With a 1MHz MFM clock, it's possible to see 4us between MFM pulses. So the 715kHz tolerance is spread over 288 PLL clock ticks resulting in 2.482kHz tolerance on the incoming 1MHz MFM clock. The tolerance actually gets worse with slower MFM clocks. To experiment with slower MFM clocks, I modified https://github.com/mx-shift/Centurion_Floppy/tree/main/flashfloppy_to_hfe to allow specifying the HFE bitrate which is used to calculate write_bc_ticks just like FlashFloppy does. With a 250kbps HFE, both nco_715k and nco_358k were able to track an ibm.360 image. (Aside: I also added https://github.com/mx-shift/Centurion_Floppy/blob/main/scripts/kryo_to_hfe.sh to quickly generate HFEs with 0, 300, 350, and 400ns precomp from a kryoflux image). I also added nco_178k to see what happens if the loop natural frequency is dropped further. This works fine with ibm.1440 and ibm.360 images containing random data and converted with One way to improve this is to slow the PLL virtual clock down to be much closer to the expected range of MFM clocks. That requires some thinking through how to handle the PLL virtual clock and the timer samples overflowing at different times. |
Playing around a bit more, I think I was wrong about the tolerance being spread across the measurement time. The PLL only gets updated at the end of the next MFM edge. Even if that 4us from the previous edge, the phase error is divided by the number of PLL clock ticks before updating the PLL. So the PLL can tolerate x% at the update which translates to x% across the measurement window. Zeta is what becomes important for tolerating jitter while keeping a wide loop natural frequency. As zeta approaches 0, the jitter tolerance improves until a point where the PLL loses stability. As zeta increases above 1, phase tracking to the input clock is improved until a point where the PLL loses stability. Putting these two pieces together, I calculated parameters for loop natural frequency of 3%, 2%, and 1% of 72MHz and zeta at various points between 0 and 1. I've added all of these to https://github.com/mx-shift/Centurion_Floppy/tree/main/flashfloppy_to_hfe as nco__. No matter what I do, I can't tolerate 400ns of write precomp. nco_1440k_0p25 seems to be a good balance of 2% frequency tolerance while also tolerating 350ns write precomp. I have not tried this with intentional frequency offsets. I did try it against https://github.com/Nakazoto/CenturionComputer/tree/main/Drives/Floppy/FluxReading/Old/RawDS_Kryoflux which is a Centurion floppy read via greaseweazle. Note that since this was created from actual media, it only has moderate write precomp remaining as most of it was compensated for by the drive and media. Also note that Centurion adds additional write precomp starting at track 42. So looking at track 42 with an additional 150ns or so of write precomp is an example of what Centurion FFC looks like going into FlashFloppy. It has both a small frequency offset (and drift) as well as extreme write precomp. nco_1440k_0p25 seems to handle this just fine. |
473e8f6
to
bdcbd3c
Compare
Updated the PR with the numerical accuracy fix for the integral and to adjust the PLL to 1440kHz loop frequency and loop dampening of 0.25. |
Centurion Finch Floppy Controller (FFC, https://github.com/Nakazoto/CenturionComputer/wiki/FFC-Board) is an 8" floppy controller that applies up to 350ns write precompensation. FlashFloppy's existing approach of re-aligning the bit clock to the last pulse edge causes enough additional error for bitcells to be missed or injected in this case. This commit applies a 2nd-order (proportional and integral) PLL to the incoming WDATA to mitigate both frequency offset and extreme write precompensation. The implementation is based on a virtual numerically-controlled oscillator nominally running at the same frequency at the WDATA timer but with a 16.16 fixed-point representation allowing for considerable precision in frequency adjustment. Proportional and integral constants were calcuated to nominally provide a 1440kHz natural loop frequency and a damping factor of 0.25 to allow for 2% frequency offset and strong jitter rejection against write precompenstaion.
bdcbd3c
to
ee6882e
Compare
Okay, I will have a play with this when I have some time. Could be next weekend. I can't really comment authoritatively on how to interpret PLL parameters, especially when it's a digital approximation of an APLL applied to an incomplete clock stream. I think the natural frequency is actually to do with the frequency that the PLL oscillates at when perturbed. I'm also not sure the absolute f_n and f_s have much meaning for us when the only parameters we depend on in the code (k_l and k_i) depend only on the ratio between f_n and f_s. Anyway, empirical evaluation of sensible-looking 2^-N values for k_l and k_i seems a reasonable approach. |
Interestingly, a quick fairly arbitrary test with data bitrate +/- 3% of nominal, and 350ns precomp, seems handled better by nco_358k than nco_1440k_0p25. Seems there's possibility of searching this space somewhat automatically, at the cost of implementing a harness:
Then run a multi-dim search across (rate,precomp,k_l,k_i) -> nr_valid_sectors Then pick optimal k_l,k_i for preferred tradeoff between worst-case rate vs precomp. We could chuck in a few of the other 'pll' algorithms as baselines. Then the whole thing seems empirically watertight at least, within our constraints on values for k_l,k_i. |
I forked and added a parametric NCO and a test harness at https://github.com/keirf/Centurion_Floppy I searched across bit rates up to +/-8% from 500kbps nominal. And across precomp values up to 350ns. The best parameters I can find are k_l = 1/16 and k_i = 1/128 (though k_i = 1/256 is a very close second). It is good that the k_l matches your preferred NCOs (358k and 1440k_0p25). While k_i sits somewhere between, but closer to nco_358k. |
Neat! I haven't had a chance this week to poke at it. I do think I have a
mistake in the phase error calculation. It should _not_ be divided by
write_bc_ticks. First, the error might be accumulated over multiple
bitcells. Second, dividing the error to smooth it across all the clock
ticks would require replaying each tick to calculate a new phase_step and
associated phase_error. Since we don't know anything about the nature of
the phase error between the two edges, it's best to treat it as a single,
large error introduced as a transient. Basically, remove the divide by
write_bc_ticks and leave the rest of the code the same.
…On Fri, Jan 26, 2024, 8:42 AM Keir Fraser ***@***.***> wrote:
I forked and added a parametric NCO and a test harness at
https://github.com/keirf/Centurion_Floppy
I searched across bit rates up to +/-8% from 500kbps nominal. And across
precomp values up to 350ns. The best parameters I can find are k_l = 1/16
and k_i = 1/128 (though k_i = 1/256 is a very close second).
—
Reply to this email directly, view it on GitHub
<#864 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACIEHF4G34XYDIANUOZ23T3YQPMITAVCNFSM6AAAAABCDREAVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMJSGM3DAOJYGA>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Dividing by write_bc_ticks is simply to remove the data frequency from phase calculations. It's nothing to do with averaging out the phase error across bitcells. That would be wrong, and it's not what's done here, so all good. That said, I want to finesse the code some on the FlashFloppy side, and then I will want to re-extract that into the test program and re-run the search. But I haven't found anything in the existing NCO-PLL code I'd strictly call wrong. |
The delta between the ideal bitcell center and the actual edge is the total
phase error at that instant according to the PLL clock. I originally added
the divide by write_bc_ticks thinking it would average the error but it
didn't. It just feeds back less error equivalent to scaling k_l and k_i by
the same amount.
…On Fri, Jan 26, 2024, 9:48 AM Keir Fraser ***@***.***> wrote:
Dividing by write_bc_ticks is simply to remove the data frequency from
phase calculations. It's nothing to do with averaging out the phase error
across bitcells. That would be wrong, and it's not what's done here, so all
good.
That said, I want to finesse the code some on the FlashFloppy side, and
then I will want to re-extract that into the test program and re-run the
search. But I haven't found anything in the existing NCO-PLL code I'd
strictly call *wrong*.
—
Reply to this email directly, view it on GitHub
<#864 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACIEHFZVVKP3OBQ6D2JSR7DYQPUATAVCNFSM6AAAAABCDREAVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMJSGQ2DSNRSHA>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
But think about the dimensions of the values. phase_step is dimensionless, then multipled by write_bc_ticks (clock ticks) to get bc_step (clock ticks). distance_from_curr_bc_left and bc_step are also in units of clock ticks. But then phase_error needs to be dimensionless because it is used to calculate phase_step, hence dividing through by something in units of clock ticks surely makes sense. Anyway, inarguably the code does seem to work :) |
phase_step is nco-ticks/timer-tick
write_bc_ticks is timer-ticks/mfm-tick
So phase_step*write_bc_ticks is nco-ticks/mfm-tick and current_bc_left_edge
is supposed to be in nco-ticks (modulo 32-bits). phase_error should also be
in nco-ticks.
…On Fri, Jan 26, 2024, 10:56 AM Keir Fraser ***@***.***> wrote:
But think about the dimensions of the values. phase_step is dimensionless,
then multipled by write_bc_ticks (clock ticks) to get bc_step (clock
ticks). distance_from_curr_bc_left and bc_step are also in units of clock
ticks. But then phase_error needs to be dimensionless because it is used to
calculate phase_step, hence dividing through by something in units of clock
ticks surely makes sense.
Anyway, inarguably the code does seem to work :)
—
Reply to this email directly, view it on GitHub
<#864 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACIEHFZXRTJDKRXWY73LXL3YQP35DAVCNFSM6AAAAABCDREAVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMJSGUZTQMJVHE>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Finally found some time to revisit this. I cleaned up https://github.com/mx-shift/Centurion_Floppy/tree/main/flashfloppy_to_hfe quite a bit and replaced all the nco_* variations with a parameterized version as you had done in your fork. I also reworked your harness.py into sweep_ibm.py which scans across both ibm.360 and ibm.1440 formats to see what happens with different MFM clock rates. I also added summarize_results.py which will show how well the bitrate/precomp combinations are covered as well as the coverage of the set of parameters that had the widest coverage. I've also worked up a revision of the NCO PLL that behaves like the math says it should. I call this nco_v2 in flashfloppy_to_hfe. sweep_ibm.py will run both nco_v1 and nco_v2 and generate results for each. I haven't let it run completely but early results show nco_v2 working quite well with p_mul=1, p_div=2048, i_mul=1, i_div=8192. I also added a CSV output of phase_error over time to flashfloppy_to_hfe. TechTangents provided me with a kryoflux image of a floppy that failed formatting in a Xerox 645S word processor due to the drive spindle speed varying so much that it couldn't read it's own writes. I ran that through flashfloppy_master, greaseweazle_default_pll, nco_v1, and nco_v2 and plotted all of their phase errors: nco_v2 slightly edges out nco_v1. I also did the same with a Centurion floppy as I know it has a modest amount of write precomp: Again, nco_v2 is slightly better than nco_v1. I'll update again once my run of sweep_ibm.py finishes. |
Nice. Can it tolerate 400ns precomp in a 1.44m pc disk? |
Nope. That seems to be pretty difficult to do.
…On Tue, Feb 20, 2024, 12:17 AM Keir Fraser ***@***.***> wrote:
Nice. Can it tolerate 400ns precomp in a 1.44m pc disk?
—
Reply to this email directly, view it on GitHub
<#864 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACIEHF45JJTSAZO7MKAKESDYURLY5AVCNFSM6AAAAABCDREAVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNJTGY4DINZVGY>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Well this is great anyway. I still haven't got my head round the mathematical correctness of v1 versus v2, but I need to give it a bit more thought, and it's hard to argue against empirical results, plus dropping a divide, and the common-sense interpretation that we are propagating and accumulating errors with greater precision, in v2. My main concern is whether v2 is sufficiently "scale invariant" to various write_bc_ticks values. Sweeping over precomp and clock variations at base clock 1us versus 2us will certainly be interesting: will the same parameters work best for both base clocks in v1 and/or v2. |
TL;DR Use nco_v2[p_mul=1,p_div=2048,i_mul=1,i_div=8192] and it should work for everything up to and including ibm.2880. Sweep finished overnight. Results are pretty clear: both nco_v1[p_mul=1,p_div=16,i_mul=1,i_div=128] and nco_v2[p_mul=1,p_div=2048,i_mul=1,i_div=8192] can cover:
nco_v1 can also do ibm.1440 with +7% data rate and 350ns precomp. I'm surprised they are different on this one data point. I also looked at ibm.2880 for those two specific parameter sets. I can't use the same tooling since it is all built around generating HFE which can't store a track long enough for ibm.2880. Instead, I looked at the phase error plot of a nominal ibm.2880 just to check that nothing went particularly haywire. nco_v1 does slightly better here but both nco_v1 and nco_v2 are worse than flashfloppy_master. The error phase is well within the tolerance needed for correct decode though (bitcells are 36 sample ticks wide for 1000kbps). I suspect the extra phase error noise for both nco_v1 and nco_v2 come down to using 32-bits for the NCO. Moving up to 64-bits might give it enough precision to settle down instead of oscillating around 35kHz which is very close to where the math says this filter should oscillate. Full summaries of results: |
I'll gladly do the work to update this PR with nco_v2. Up to you. |
Doesn't seem there's much between 'em. You can do the ibm.2880 bit rate but not the exact format. Just do ibm.1440 but at rate=1000 and rpm=600. You'll need to add an rpm config parameter to your
100ms of track should be plenty in which to evaluate the PLLs. |
Thanks! I didn't think to shorten the tracks.
You're right that nco_v1 and nco_v2 aren't very different in behavior or
implementation. The extra divide in v1 mostly just scales the input
parameters by write_bc_ticks (2048 / 72 is 28 which is close enough to 16
and 8192 / 72 is 113 which is close enough to 128). I thought that might be
a problem for picking a single set of parameters that with across different
data rates. For 250kbps, the tolerance is so wide it doesn't matter. It
might for 1000. nco_v2 doesn't have this problem.
The other difference is nco_v1 assumes both the PLL uint32 and the sample
clock uint16 will overflow at the same time. Since the PLL frequency is
constantly being adjusted, they won't. nco_v2 doesn't make that assumption.
In practice, that error is tiny compared to the normal phase error seen so
it doesn't meaningfully change the results at 500kbps. I doubt it will at
1000kbps either.
So my preference for nco_v2 is mostly about removing the write_bc_ticks
scaling so the PLL parameters are fixed across data rates.
Rick
…On Tue, Feb 20, 2024, 10:18 PM Keir Fraser ***@***.***> wrote:
Doesn't seem there's much between 'em.
You can do the ibm.2880 bit rate but not the exact format. Just do
ibm.1440 but at rate=1000 and rpm=600. You'll need to add an rpm config
parameter to your class Format and write that into the generated diskdef.
Something like:
Format(
'ibm.ed',
heads=2,
cylinders=80,
sectors_per_cylinder=18,
bytes_per_sector=512,
data_rate_kbps=1000,
rpm=600
)
100ms of track should be plenty in which to evaluate the PLLs.
—
Reply to this email directly, view it on GitHub
<#864 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACIEHFZOEXOGYVJWVND3BG3YUWGTLAVCNFSM6AAAAABCDREAVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNJVHE3DQMJZHA>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Let's get the higher rate numbers and see. We do indeed want one set of parameters for all rates if possible. So whichever of v1 and v2 is best for that is going to win. |
Results are not what I expected. Both nco_v1 and nco_v2 have a wide range of parameters that work fine at 1000kbps. What's really surprising is |
Yes interesting, not only is Is there anything else from v2 that's worth considering in combo with v1 for a v3 or v4? |
v3 is two small adjustments to v2. |
I'd stick with I also note from previous comment: Is this the following nco_v2 code?
|
Ultimately my path forward is as follows: Get an acceptable patch ready for firmware, then port that as directly as possible back out into the test harness (so the code is line for line as close to identical as possible), and check it against the nco_v1 baseline again. This will allow to confidently test firmware optimisations too. |
No, that bit of code is limiting the tunable range of the NCO. If it is
left unlimited, it can go so low that the decoding process gets very slow.
The relevant piece from nco_v2 is:
// If this edge would fall within the previous bitcell, it is a runt.
// Since the pll timestamps will wrap on overflow, the timestamps
can't
// be directly compared. Instead, compare the distance from the
// previous bitcell's left edge as the subtraction will account any
// wrapping.
uint32_t curr_pll_edge = prev_pll_edge +
(uint32_t)duration_between_samples * PLL_PERIOD_NOMINAL;
uint32_t curr_pll_bc_left = prev_pll_bc_left + pll_write_bc_ticks;
if ((curr_pll_edge - prev_pll_bc_left) < (curr_pll_bc_left -
prev_pll_bc_left)) {
printf("Runt\n");
continue;
}
Note how prev_pll_edge and prev_pll_bc_left are both preserved from the
previous iteration. curr_pll_edge is calculated with nominal PLL ticks
while curr_pll_bc_left uses the current pll_period. Because those advance
at different rates and thus might overflow at different times, they are
always converted into a distance by subtracting from some other point,
which takes care of the overflow, before any comparison. By preserving both
across iterations, any accumulated skew from the different rates is
preserved.
…On Thu, Feb 22, 2024, 3:22 AM Keir Fraser ***@***.***> wrote:
I'd stick with nco_v1[p_mul=1, p_div=16, i_mul=1, i_div=128].
I also note from previous comment: The other difference is nco_v1 assumes
both the PLL uint32 and the sample clock uint16 will overflow at the same
time.
Is this the following nco_v2 code?
if (pll_period < PLL_PERIOD_MIN) {
pll_period = PLL_PERIOD_MIN;
} else if (pll_period > PLL_PERIOD_MAX) {
pll_period = PLL_PERIOD_MAX;
}
—
Reply to this email directly, view it on GitHub
<#864 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACIEHF35J4XWA7TXA3KIKYLYU4S63AVCNFSM6AAAAABCDREAVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNJZGI2DKMZYGM>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
That might be worth keeping then? As for the rest, as long as phase_step (aka pll_period) doesn't get very large, I don't see that prev_sample and prev_bc_left can really deviate far enough that the arithmetic involving curr_sample will overflow if we use the latter (since they should be within plus or minus bc_step/2 of each other). I also don't see how the more complex code in nco_v2 helps anyway: prev_pll_edge appears to always be prev_sample<<16 plus a fixed delta. |
Centurion Finch Floppy Controller
(FFC, https://github.com/Nakazoto/CenturionComputer/wiki/FFC-Board) is an 8" floppy controller that applies up to 350ns write precompensation. FlashFloppy's existing approach of re-aligning the bit clock to the last pulse edge causes enough additional error for bitcells to be missed or injected in this case.
This commit applies a 2nd-order (proportional and integral) PLL to the incoming WDATA to mitigate both frequency offset and extreme write precompensation. The implementation is based on a virtual numerically-controlled oscillator with a frequency 2^16 times FlashFloppy's tick rate. Proportional and integral constants were chosen to nominally provide a 715kHz natural loop frequency and a damping factor of 1 and then adjusted to be powers of 2 to allow computation to use bit shifts instead of multiplication and division.
This commit has been successfully tested with a variety of PC-compatible floppy controllers as well as Centurion FFC.