Lets actually try Hybrid Emulation
Re: Lets actually try Hybrid Emulation
Real dhrystones:
400MHz: 47169 dhrystones/s, 26 DMIPS (and mouse jerky!)
800MHz: 84033 dhrystones/s, 47 DMIPS
1000MHz: 126582 dhrystones/s, 72 DMIPS
1200MHz: 169491 dhrystones/s, 96 DMIPS
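As a quick sanity check on the DMIPS column, here is a minimal sketch that redoes the conversion. It assumes the usual convention that 1 DMIPS equals 1757 Dhrystones/s (the VAX 11/780 baseline) and that the figures above were truncated to whole numbers; the names `results` and `dmips` are just for illustration.

```python
# Dhrystone results quoted above: ARM clock in MHz -> Dhrystones per second.
results = {400: 47169, 800: 84033, 1000: 126582, 1200: 169491}

# Conventional baseline: a VAX 11/780 scores 1757 Dhrystones/s, i.e. 1 DMIPS.
VAX_DHRYSTONES_PER_SECOND = 1757


def dmips(dhrystones_per_second):
    """Convert a Dhrystones/s score to whole DMIPS (truncated, not rounded)."""
    return dhrystones_per_second // VAX_DHRYSTONES_PER_SECOND


for mhz, score in results.items():
    value = dmips(score)
    print(f"{mhz}MHz: {value} DMIPS ({value / mhz:.3f} DMIPS per MHz)")
```

The per-MHz column makes it easy to see whether the score scales linearly with the ARM clock.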
Re: Lets actually try Hybrid Emulation
Now, back to "why does it feel slow".
Sysinfo drive speed (DH1):
with ARM as cpu: 701,545 bytes/second
with 68020 soft core: 2,665,871 bytes/second
So drive reads are about 3.8x slower, which would make things like browsing the disk feel slow. Is it all down to interrupt latency?
Re: Lets actually try Hybrid Emulation
One more data point: at 1200MHz Musashi is still not worth it. It is still significantly slower than TG68 (like 30% of the speed...). Qemu seems worth it, other than the latency...
Re: Lets actually try Hybrid Emulation
I also noticed in sysinfo that 'chip speed vs A600' is 12 for the TG68K. It's about 3.18 in qemu.
Now 701545*12/3.18 =~ 2600000. Very similar to the drive speed fraction, hmmm.
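The comparison can be spelled out with the SysInfo numbers already quoted in the thread. This is only a sketch of the arithmetic, and the variable names are made up for illustration.

```python
# SysInfo figures quoted in the thread.
arm_drive = 701_545       # bytes/s on DH1 with the ARM (qemu) as CPU
tg68_drive = 2_665_871    # bytes/s on DH1 with the 68020 soft core
tg68_chip = 12.0          # 'chip speed vs A600' with TG68K
qemu_chip = 3.18          # 'chip speed vs A600' with qemu

drive_ratio = tg68_drive / arm_drive   # how much faster the soft core reads the drive
chip_ratio = tg68_chip / qemu_chip     # how much faster the soft core reaches chip RAM

# Drive speed the ARM would reach if it scaled with the chip speed ratio.
predicted_drive = arm_drive * chip_ratio

print(f"drive ratio: {drive_ratio:.2f}x")
print(f"chip ratio:  {chip_ratio:.2f}x")
print(f"predicted drive speed: {predicted_drive:,.0f} bytes/s (measured {tg68_drive:,})")
```

The two ratios come out within about 1% of each other, which is what makes chip RAM access look like the common factor.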
Re: Lets actually try Hybrid Emulation
Some thoughts on chipram speed...
Actually on the bustest qemu does about the same as an A1200 (6MB/s). However the TG68k does much better than the A1200 (18MB/s).
The ARM should be able to benefit from the same, even if its own caching is off. However, it's behind the HPS-FPGA bridge bottleneck.
-
- Top Contributor
- Posts: 374
- Joined: Sun Sep 27, 2020 10:16 am
- Has thanked: 207 times
- Been thanked: 86 times
Re: Lets actually try Hybrid Emulation
Thanks for the testing!
The upgrade is still very good!
Any instability during benchmarks at 1.2GHz?
Remastering Classic Game Cinematics: My new Youtube fun, check it out
https://www.youtube.com/@neocaron87
Re: Lets actually try Hybrid Emulation
Yes it seems stable at 1.2GHz. Though I didn't run it for long...
I'm investigating speeding up the chip/chipram access. This goes via the full HPS-FPGA bridge.
I have a few thoughts on this:
i) Frequency the bridge is clocked at.
ii) Increase bridge width
iii) Try the lightweight bridge (supposed to be faster)
iv) Burst support
So far on these:
i) I was previously using 114MHz having found 32MHz glacially slow. I just tried it at 228MHz with some improvement.
ii) I made a 32-bit version of the bridge (I was using 16-bit). This also supports the 'longword' feature of the core for faster chipram access, though I'm not yet sure I'm using it right.
iii) I tried to change this setting hoping it would 'just work'. It gave me an error that it can only address 17 bits, though I need to confirm this.
iv) Not tried yet.
At 228MHz, no burst, ARM at 800MHz: if I do 32-bit writes in a tight loop to something FPGA side that is always ready, I get ~27MB/s.
With the ARM at 1200MHz that increases slightly to ~35MB/s.
(For reference, if the bridge were 'ideal' it could get to ~1GB/s at this speed...)
At 114MHz, no burst, ARM at 800MHz, the same loop gets ~20MB/s.
With the ARM at 1200MHz that increases slightly to ~25MB/s.
At 32MHz, no burst, ARM at 800MHz, the same loop gets ~8MB/s.
With the ARM at 1200MHz that increases slightly to ~10MB/s.
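To put those write figures next to the 'ideal' number, here is a rough sketch. It assumes 'ideal' means one 32-bit transfer per bridge clock and counts a MB as 10^6 bytes; `ideal_mb_per_s` and the `measured` table are illustrative names, and the measured values are simply the ones quoted above.

```python
BUS_BYTES = 4  # 32-bit bridge

# (bridge clock in MHz, ARM clock in MHz) -> measured write throughput in MB/s
measured = {
    (228, 800): 27, (228, 1200): 35,
    (114, 800): 20, (114, 1200): 25,
    (32, 800): 8, (32, 1200): 10,
}


def ideal_mb_per_s(bridge_mhz, bus_bytes=BUS_BYTES):
    """Upper bound assuming one bus-wide transfer on every bridge clock."""
    return bridge_mhz * bus_bytes


for (bridge_mhz, arm_mhz), mb_s in measured.items():
    ideal = ideal_mb_per_s(bridge_mhz)
    share = 100 * mb_s / ideal
    print(f"bridge {bridge_mhz}MHz, ARM {arm_mhz}MHz: "
          f"{mb_s}MB/s of {ideal}MB/s ideal ({share:.1f}%)")
```

Under that assumption the 228MHz bridge tops out at 912MB/s, so the measured writes use only a few percent of it, which suggests per-transaction overhead matters more here than raw clock or width.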
Another idea just popped in my head when writing this. Perhaps I should see if the lightweight bridge is faster despite the limited address range. It may be possible to bank-switch. Though I'm not sure how I could plug that into qemu, but that is another problem.
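For what bank-switching could look like, here is a purely hypothetical model. It assumes the window really is 17 address bits of bytes (the error message is unconfirmed) and that a bank register on the FPGA side selects which slice the window maps; `BankedWindow` and `split_address` are invented for illustration and are not part of qemu or the core.

```python
WINDOW_BITS = 17                 # assumption: from the unconfirmed "17 bits" error
WINDOW_SIZE = 1 << WINDOW_BITS   # 128KB visible at a time


def split_address(addr):
    """Split a flat address into (bank number, offset inside the window)."""
    return addr >> WINDOW_BITS, addr & (WINDOW_SIZE - 1)


class BankedWindow:
    """Toy model of a small window plus a bank register in front of a larger memory."""

    def __init__(self, backing):
        self.backing = backing    # bytearray standing in for FPGA-side memory
        self.bank = 0             # currently selected bank
        self.bank_switches = 0    # counts the extra register writes needed

    def read_byte(self, addr):
        bank, offset = split_address(addr)
        if bank != self.bank:
            # In real hardware this would be an extra write across the bridge.
            self.bank = bank
            self.bank_switches += 1
        return self.backing[(self.bank << WINDOW_BITS) | offset]


memory = bytearray(range(256)) * 8192   # 2MB of test data
window = BankedWindow(memory)
print(window.read_byte(0x100004), window.bank_switches)
```

The counter shows the catch: accesses that hop between banks pay for a bank register write each time, so any gain depends on how local the access pattern is.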
Re: Lets actually try Hybrid Emulation
Would faster DDR3 RAM help?
foft wrote: ↑Thu Aug 11, 2022 6:24 pm Yes it seems stable at 1.2GHz. Though I didn't run it for long...
I know Coolbho3k was looking into getting the DDR3 ram running at its rated 1066 speed instead of the current 800. My guess is that for any latency or bandwidth limiting scenarios it could make a massive difference. Maybe you should talk to him about this, or investigate on your own to see what's possible.
Here's what he said on the subject:
"There may be a way to overclock the memory too. The memory chips on the DE10 Nano BOM are rated at DDR3-1066, while the DE10 Nano runs them at DDR3-800. I'm not sure if this will affect the FPGA side of things. If so, I'm also not sure if this would help alleviate the need for the SDRAM for some cores. It might be worth looking into."
Re: Lets actually try Hybrid Emulation
Well, faster RAM is always good. It probably doesn't help much with this bottleneck.
Although... there are always alternative approaches. In this one I took the approach of leaving the core mostly intact. Then I plumbed in the CPU emulator using the HPS-FPGA bridge to access chip RAM and hardware registers.
I think we're stuck with the HPS-FPGA bridge to write to the hardware registers.
Chip RAM though, that could be changed. The DDR RAM can already be accessed from both HPS and FPGA pretty transparently. This is used by e.g. the scaler. So we could put chip RAM in the DDR and then point the hardware logic DMA at this instead.
Without changing that we could also enable caching on the chip RAM area. I tried it and it booted fine, but when I went to sysinfo I saw a corrupted screen. So we need to flush the cache sometimes - but when and how? Is chip RAM uncacheable on all 'real' accelerators?
Re: Lets actually try Hybrid Emulation
So I wrote a very simple program in devpac:
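loop:
	move.l $100000,d0	; eight back-to-back longword reads
	move.l $100004,d1	; from absolute addresses in the
	move.l $100008,d2	; $100000-$100010 range
	move.l $10000c,d3
	move.l $100010,d4
	move.l $100004,d5
	move.l $100008,d6
	move.l $10000c,d7
	jmp loop		; repeat forever for a steady capture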
Which I captured in signaltap, specifically the chip sdram reading signals.
Firstly here is how it looks on TG68:
Then using qemu with the HPS/FPGA bridge clocked at 28MHz:
Finally with qemu and the HPS/FPGA bridge clocked at 28*7MHz (*8 does not always synthesize ok):
Now on TG68 you can see how it takes about 20 114MHz (28*4) cycles to read 4 bytes. So about 21MB/s (4*4*28000000/20/1024/1024).
On the ARM with natural bridge (32-bit and 28MHz) it takes about 56 114MHz (28*4) cycles to read 4 bytes. So about 7.5MB/s.
On the ARM with the sped up bridge (32-bit and 28*7) it takes about 28 114MHz (28*4) cycles to read 4 bytes. So about 15MB/s.
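The three throughput figures follow from the same formula used in the TG68 line. Here it is as a small sketch, using the nominal 28MHz base clock exactly as the post's arithmetic does; `mb_per_s` is just an illustrative name.

```python
BASE_CLOCK_HZ = 28_000_000             # nominal 28MHz, as in the post's arithmetic
CAPTURE_CLOCK_HZ = 4 * BASE_CLOCK_HZ   # the "114MHz (28*4)" clock the cycles are counted in
BYTES_PER_READ = 4                     # one move.l


def mb_per_s(cycles_per_read):
    """Throughput in MB/s (MB = 1024*1024 bytes) for one longword read per N cycles."""
    return BYTES_PER_READ * CAPTURE_CLOCK_HZ / cycles_per_read / 1024 / 1024


for label, cycles in [("TG68", 20),
                      ("ARM, bridge at 28MHz", 56),
                      ("ARM, bridge at 28*7MHz", 28)]:
    print(f"{label}: {cycles} cycles per read -> {mb_per_s(cycles):.1f}MB/s")
```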
In the January release I was using the sped up bridge at 28*4MHz and a 16-bit HPS-FPGA bridge. Can you guess the improvement I see in 'bustest' by changing it to 28*7MHz and a 32-bit HPS-FPGA bridge? None, argh! It still shows me 6MB/s.
So what is going on with bustest? Well, it seems like every other transaction is slow for some reason:
Oh, and this last picture has the signal names, which I accidentally chopped off the others - oops.
Re: Lets actually try Hybrid Emulation
Anyway long story short, there is a 10 cycle (at 114MHz) 'waste' overhead due to the HPS-FPGA bridge. Then another 4 cycles (average) due to clock domain alignment (from 28x7 to 28). So 14 cycles waste per transaction. So I guess we have only ~6 cycles (1.5 cycles at 28MHz) to do the actual memory access to still reach TG68 level memory access performance.
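The cycle budget in that post can be written out as below. This is only a restatement of the numbers given (20 cycles per read on TG68, 10 cycles of bridge overhead, 4 cycles average for clock domain alignment), with illustrative variable names.

```python
TG68_CYCLES_PER_READ = 20    # 114MHz cycles per longword read on TG68
BRIDGE_OVERHEAD_CYCLES = 10  # HPS-FPGA bridge 'waste' per transaction
CDC_ALIGN_CYCLES = 4         # average clock domain alignment cost (28x7 -> 28)
FAST_CLOCKS_PER_28MHZ = 4    # 114MHz is 28MHz * 4

wasted = BRIDGE_OVERHEAD_CYCLES + CDC_ALIGN_CYCLES
budget_fast = TG68_CYCLES_PER_READ - wasted          # cycles left at 114MHz
budget_28mhz = budget_fast / FAST_CLOCKS_PER_28MHZ   # same budget in 28MHz cycles

print(f"wasted per transaction: {wasted} cycles")
print(f"left for the memory access: {budget_fast} cycles "
      f"({budget_28mhz} cycles at 28MHz)")
```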
Re: Lets actually try Hybrid Emulation
Well, I uploaded a new core and matching qemu with what I have to GitHub: the hardware description, the patched qemu and the compiled setup:
https://github.com/scrameta/Minimig-AGA_MiSTer_Hybrid
https://github.com/scrameta/qemu_MiSTer_Hybrid
https://github.com/scrameta/MiSTer_Hybrid_Support
The changes did not give as big a boost as I hoped, though they are as follows:
i) HPS-FPGA bridge changed to 32-bit from 16-bit.
ii) HPS-FPGA bridge clock changed from 114MHz to 170MHz.
iii) HPS-FPGA 16-deep fifo for writes.
iv) HPS-FPGA 16-deep pipelined read support.
v) Expose CACR and VBR to the FPGA. For now qemu just defaults them to 1 and 0.
Note that I've only seen the native memcpy use the pipelined reads, and even then only 2 deep, so it doesn't help much in reality.
edit: Update, I reverted the rtg cache change, it caused corruption. Also note that e.g. doom runs much nicer overclocked.
Re: Lets actually try Hybrid Emulation
So I was thinking, perhaps it'd be better to change how the hard disk works in hybrid mode.
It's kind of bonkers to go the route it goes... sd card->ide.cpp->spi->fpga->hps/fpga bridge->qemu
It'd probably make more sense to go sd card->qemu.
I've noticed the MiSTer poll slows down when qemu is used. I don't know yet if this is cpu contention or down to it waiting for something from the FPGA.
- LamerDeluxe
- Top Contributor
- Posts: 1230
- Joined: Sun May 24, 2020 10:25 pm
- Has thanked: 876 times
- Been thanked: 281 times
Re: Lets actually try Hybrid Emulation
Really great that progress is being made again on this project. It is very fascinating to follow.
- Caldor
- Top Contributor
- Posts: 930
- Joined: Sat Jul 25, 2020 11:20 am
- Has thanked: 112 times
- Been thanked: 111 times
Re: Lets actually try Hybrid Emulation
This does sound like it could end up making the CPU emulation faster than just using the FPGA overall.
I was speculating a bit on what might make for faster disk access, but ended up concluding I just do not know enough about the FPGA code and how much work different solutions might require, or what might and might not be possible. But some way of accessing disks differently ought to help.
Isn't it a similar problem to the one the PiStorm has? Well... I think its disk access is faster, but the PiStorm problem, I think, is access to the slow RAM? So if that is the case I would suspect that giving QEMU direct access to the disk would give similar results to what PiStorm sees.
Re: Lets actually try Hybrid Emulation
I'm trying again to get uae4all working.
This code is really non-trivial to tear apart, I've tried about 3 times, so I'm trying to instead get it running with minimal changes.
Memory -> point to hps/fpga bridge or memory instead
GUI -> just make it start straight away
video/audio -> point to the dummy device
Once that lives I can try turning off some more parts!
So far... diagrom runs, but it's set to 68k for some reason and no jit, but that is probably just a setting. So I'll change that setting then add interrupts. Then, fingers crossed :)
Re: Lets actually try Hybrid Emulation
OK, jit from uae4arm lives too.
Next, wire up interrupts again, then to workbench... Tomorrow!
I also think it's probably pausing the jit to do other uae stuff, so I should find/remove that too.
Re: Lets actually try Hybrid Emulation
Interrupts wired up, and a bunch of unneeded graphics/sound code and threads disabled.
DiagROM seems to be running well.
Real kickstart gives me a yellow screen briefly then it reboots. This is even before getting to the disk prompt etc.
Re: Lets actually try Hybrid Emulation
Pop this on your sd card, then start the minimig hybrid core:
https://github.com/scrameta/MiSTer_Hybrid_Support
Re: Lets actually try Hybrid Emulation
foft wrote: ↑Fri Sep 16, 2022 4:31 pm Pop this on your sd card, then start the minimig hybrid core:
https://github.com/scrameta/MiSTer_Hybrid_Support
Will try it tonight!
Thank you!!!
Re: Lets actually try Hybrid Emulation
I come back to it every year or so.
No-one else is interested in pushing this further though? Thought someone might try porting for instance emu86 or merge the core changes etc.
I was hopeful with the uae4arm cpu. Got that working a while back with diagrom but the actual os didn’t boot. Must be something simple…