Lets actually try Hybrid Emulation

foft
Posts: 342
Joined: Thu Dec 03, 2020 11:05 am
Has thanked: 29 times
Been thanked: 125 times

Re: Lets actually try Hybrid Emulation

Unread post by foft »

Real dhrystones:
400MHz: 47169 dhrystones/s, 26 DMIPS (and mouse jerky!)
800MHz: 84033 dhrystones/s, 47 DMIPS
1000MHz: 126582 dhrystones/s, 72 DMIPS
1200MHz: 169491 dhrystones/s, 96 DMIPS
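
For anyone checking the DMIPS column: the figures are consistent with the usual convention of dividing Dhrystones per second by 1757 (the VAX 11/780 score) and dropping the fraction. A minimal sketch, assuming that convention; the function and variable names are only for illustration.

```python
# DMIPS = Dhrystones per second / 1757 (the VAX 11/780 reference score).
VAX_DHRYSTONES_PER_SECOND = 1757

def dmips(dhrystones_per_second):
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND

# Measured values quoted above, keyed by ARM clock in MHz.
measured = {400: 47169, 800: 84033, 1000: 126582, 1200: 169491}

for mhz, score in measured.items():
    print(f"{mhz}MHz: {dmips(score):.1f} DMIPS")
```

Truncating each result to a whole number reproduces the 26, 47, 72 and 96 listed above.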
foft

Re: Lets actually try Hybrid Emulation

Unread post by foft »

Now, back to "why does it feel slow".
Sysinfo drive speed (DH1):
with ARM as cpu: 701,545 bytes/second
with 68020 soft core: 2,665,871 bytes/second

So drive reads are almost 4x slower (2,665,871 / 701,545 is about 3.8), which would make things like browsing the disk feel slow. Is it all down to interrupt latency?
foft

Re: Lets actually try Hybrid Emulation

Unread post by foft »

One more data point: at 1200MHz Musashi is still not worth it. It is still significantly slower than TG68 (like 30% of the speed...). Qemu seems worth it, other than the latency...
foft

Re: Lets actually try Hybrid Emulation

Unread post by foft »

I also noticed in sysinfo that 'chip speed vs A600' is 12 for the tg68k. It's about 3.18 in qemu.

Now 701545*12/3.18 =~2600000. Very similar to the drive speed fraction, hmmm.
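
To make that comparison easier to re-run, here is a small sketch of the same arithmetic using only the SysInfo numbers quoted in this thread. The variable names are made up for illustration.

```python
arm_bytes_per_s = 701_545      # SysInfo DH1 speed with the ARM (qemu) as CPU
tg68_bytes_per_s = 2_665_871   # SysInfo DH1 speed with the 68020 soft core

chip_speed_tg68 = 12.0   # SysInfo 'chip speed vs A600' on the tg68k
chip_speed_qemu = 3.18   # the same figure under qemu

drive_ratio = tg68_bytes_per_s / arm_bytes_per_s
chip_ratio = chip_speed_tg68 / chip_speed_qemu

# Scale the ARM drive speed by the chip speed ratio and compare it
# with the soft core's measured drive speed.
predicted = arm_bytes_per_s * chip_ratio

print(f"drive ratio {drive_ratio:.2f}x, chip speed ratio {chip_ratio:.2f}x")
print(f"predicted soft core drive speed: {predicted:,.0f} bytes/second")
```

The two ratios come out within about 1% of each other, which fits the idea that drive speed is tracking chip access speed here, though it does not prove it.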
foft

Re: Lets actually try Hybrid Emulation

Unread post by foft »

So, since I vanished in January, did anyone try anything fun with this? Aranym jit, emu68k?
foft

Re: Lets actually try Hybrid Emulation

Unread post by foft »

Some thoughts on chipram speed...

Actually on the bustest qemu does about the same as an A1200 (6MB/s). However the TG68k does much better than the A1200 (18MB/s).

The ARM should be able to benefit from the same, even if its own caching is off. However, it sits behind the HPS-FPGA bridge bottleneck.
Neocaron
Top Contributor
Posts: 374
Joined: Sun Sep 27, 2020 10:16 am
Has thanked: 207 times
Been thanked: 86 times

Re: Lets actually try Hybrid Emulation

Unread post by Neocaron »

foft wrote: Wed Aug 10, 2022 3:21 pm Real dhrystones:
400MHz: 47169 dhrystones/s, 26 DMIPS (and mouse jerky!)
800MHz: 84033 dhrystones/s, 47 DMIPS
1000MHz: 126582 dhrystones/s, 72 DMIPS
1200MHz: 169491 dhrystones/s, 96 DMIPS
Thanks for the testing!
The upgrade is still very good!
Any instability during benchmarks at 1.2GHz?

Remastering Classic Game Cinematics: My new Youtube fun, check it out :D
https://www.youtube.com/@neocaron87

foft

Re: Lets actually try Hybrid Emulation

Unread post by foft »

Yes it seems stable at 1.2GHz. Though I didn't run it for long...

I'm investigating speeding up the chip/chipram access. This goes via the full HPS-FPGA bridge.

I have a few thoughts on this:
i) Frequency the bridge is clocked at.
ii) Increase bridge width
iii) Try the lightweight bridge (supposed to be faster)
iv) Burst support

So far on these:
i) I was previously using 114MHz having found 32MHz glacially slow. I just tried it at 228MHz with some improvement.
ii) I made a 32-bit version of the bridge (I was using 16-bit). This also supports the 'longword' feature of the core for faster chipram access, though I'm not yet sure I'm using it right.
iii) I tried to change this setting hoping it would 'just work'. It gave me an error that it can only address 17 bits, though I need to confirm this.
iv) Not tried yet.

At 228MHz no burst, arm at 800MHz, if I do 32-bit writes in a tight loop to something FPGA side that is always ready, I get ~27MB/s.
With the arm at 1200MHz that increases slightly to ~35MB/s.
(For reference, if the bridge was 'ideal' it could get to 1GB/s at this speed...)

At 114MHz no burst, arm at 800MHz, if I do 32-bit writes in a tight loop to something FPGA side that is always ready, I get ~20MB/s.
With the arm at 1200MHz that increases slightly to ~25MB/s.

At 32MHz no burst, arm at 800MHz, if I do 32-bit writes in a tight loop to something FPGA side that is always ready, I get ~8MB/s.
With the arm at 1200MHz that increases slightly to ~10MB/s.
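
To put those numbers against the 'ideal' figure, here is a rough sketch that computes the theoretical bridge bandwidth (one 32-bit transfer per bridge clock) and how many bridge clocks each write is effectively taking. It assumes MB means 10^6 bytes; the helper names are made up for illustration.

```python
BYTES_PER_WRITE = 4  # 32-bit writes, as in the tight loop described above

# (bridge clock in MHz, measured MB/s with ARM at 800MHz, measured MB/s at 1200MHz)
MEASUREMENTS = [(228, 27, 35), (114, 20, 25), (32, 8, 10)]

def ideal_mb_per_s(bridge_mhz):
    """Upper bound: one 32-bit transfer on every bridge clock, no stalls."""
    return bridge_mhz * BYTES_PER_WRITE

def bridge_cycles_per_write(bridge_mhz, measured_mb_per_s):
    """Bridge clocks that elapse per write at the measured rate."""
    writes_per_us = measured_mb_per_s / BYTES_PER_WRITE  # MB/s equals bytes per microsecond
    return bridge_mhz / writes_per_us

for mhz, at_800, at_1200 in MEASUREMENTS:
    ideal = ideal_mb_per_s(mhz)
    print(f"{mhz}MHz bridge: ideal {ideal}MB/s, "
          f"{100 * at_800 / ideal:.1f}% used with ARM at 800MHz, "
          f"{100 * at_1200 / ideal:.1f}% at 1200MHz, "
          f"{bridge_cycles_per_write(mhz, at_800):.1f} bridge clocks per write")
```

With the ARM at 800MHz this works out to roughly 34, 23 and 16 bridge clocks per write at 228, 114 and 32MHz. That may point to a largely fixed per-transaction cost outside the bridge clock domain.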

Another idea just popped in my head when writing this. Perhaps I should see if the lightweight bridge is faster despite the limited address range. It may be possible to bank-switch. Though I'm not sure how I could plug that into qemu, but that is another problem.
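
As a purely hypothetical sketch of what bank switching could look like: split each address into a bank number and an offset inside a 17-bit window, and only touch the bank register when the bank changes. The 17-bit limit is taken from the unconfirmed error message above. The bank register and every name here are assumptions, not something that exists in the core or in qemu.

```python
WINDOW_BITS = 17                 # assumption: from the error message, not confirmed
WINDOW_SIZE = 1 << WINDOW_BITS   # 128KB visible through the lightweight bridge

def split_address(addr):
    """Split a full address into (bank, offset within the window)."""
    return addr >> WINDOW_BITS, addr & (WINDOW_SIZE - 1)

class BankedWindow:
    """Tracks the selected bank and counts how often it has to change."""

    def __init__(self):
        self.current_bank = None
        self.bank_switches = 0

    def access(self, addr):
        bank, offset = split_address(addr)
        if bank != self.current_bank:
            # On hardware this would be a write to a (hypothetical) bank register.
            self.current_bank = bank
            self.bank_switches += 1
        return offset

window = BankedWindow()
for addr in (0x100000, 0x100004, 0x100008, 0x020000):
    print(hex(addr), "-> offset", hex(window.access(addr)))
print("bank switches:", window.bank_switches)
```

Each bank switch is an extra bridge write, so whether this wins would depend on how often accesses hop between banks.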
Neocaron

Re: Lets actually try Hybrid Emulation

Unread post by Neocaron »

foft wrote: Thu Aug 11, 2022 6:24 pm Yes it seems stable at 1.2GHz. Though I didn't run it for long...

I'm investigating speeding up the chip/chipram access. This goes via the full HPS-FPGA bridge.

I have a few thoughts on this:
i) Frequency the bridge is clocked at.
ii) Increase bridge width
iii) Try the lightweight bridge (supposed to be faster)
iv) Burst support

So far on these:
i) I was previously using 114MHz having found 32MHz glacially slow. I just tried it at 228MHz with some improvement.
ii) I made a 32-bit version of the bridge (I was using 16-bit). This also supports the 'longword' feature of the core for faster chipram access, though I'm not yet sure I'm using it right.
iii) I tried to change this setting hoping it would 'just work'. It gave me an error that it can only address 17 bits, though I need to confirm this.
iv) Not tried yet.

At 228MHz no burst, arm at 800MHz, if I do 32-bit writes in a tight loop to something FPGA side that is always ready, I get ~27MB/s.
With the arm at 1200MHz that increases slightly to ~35MB/s.
(For reference, if the bridge was 'ideal' it could get to 1GB/s at this speed...)

At 114MHz no burst, arm at 800MHz, if I do 32-bit writes in a tight loop to something FPGA side that is always ready, I get ~20MB/s.
With the arm at 1200MHz that increases slightly to ~25MB/s.

Another idea just popped in my head when writing this. Perhaps I should see if the lightweight bridge is faster despite the limited address range. It may be possible to bank-switch. Though I'm not sure how I could plug that into qemu, but that is another problem.
Would faster DDR3 RAM help?

I know Coolbho3k was looking into getting the DDR3 ram running at its rated 1066 speed instead of the current 800. My guess is that for any latency or bandwidth limiting scenarios it could make a massive difference. Maybe you should talk to him about this, or investigate on your own to see what's possible.

Here's what he said on the subject:
" There may be a way to overclock the memory too. The memory chips on the DE10 Nano BOM are rated at DDR3-1066, while the DE10 Nano runs them at DDR3-800. I'm not sure if this will affect the FPGA side of things. If so, I'm also not sure if this would help alleviate the need for the SDRAM for some cores. It might be worth looking into."


foft

Re: Lets actually try Hybrid Emulation

Unread post by foft »

Well, faster ram is always good. It probably doesn't help much with this bottleneck though.

Although... there are always alternative approaches. In this one I took the approach of leaving the core mostly intact. Then I plumbed in the CPU emulator using the HPS-FPGA bridge to access chip ram and hardware registers.

I think we're stuck with the HPS-FPGA bridge to write to the hardware registers.

Chip ram though, that could be changed. The DDR ram can already be accessed from both HPS and FPGA pretty transparently. This is used by e.g. the scaler. So we could put chip ram in the DDR and then point the hardware logic DMA at this instead.

Without changing that we could also enable caching on the chip ram area. I tried it and it booted fine, but when I went to sysinfo I saw a corrupted screen. So we need to flush the cache sometimes - but when and how? Is chip ram uncacheable on all 'real' accelerators?
foft

Re: Lets actually try Hybrid Emulation

Unread post by foft »

So I wrote a very simple program in devpac:
loop:
    move.l $100000,d0
    move.l $100004,d1
    move.l $100008,d2
    move.l $10000c,d3
    move.l $100010,d4
    move.l $100004,d5
    move.l $100008,d6
    move.l $10000c,d7
    jmp loop

Which I captured in signaltap, specifically the chip sdram reading signals.

Firstly here is how it looks on TG68:
[Attachment: read_tight_loop_unrolled_tg68.png]
Then using qemu with the HPS/FPGA bridge clocked at 28MHz:
[Attachment: read_tight_loop_unrolled_arm_28.png]
Finally with qemu and the HPS/FPGA bridge clocked at 28*7MHz: (*8 does not always synthesize ok)
[Attachment: read_tight_loop_unrolled_arm_28x7.png]
Now on TG68 you can see how it takes about 20 114MHz (28*4) cycles to read 4 bytes. So about 21MB/s (4*4*28000000/20/1024/1024)
On the ARM with natural bridge (32-bit and 28MHz) it takes about 56 114MHz (28*4) cycles to read 4 bytes. So about 7.5MB/s.
On the ARM with the sped up bridge (32-bit and 28*7) it takes about 28 114MHz (28*4) cycles to read 4 bytes. So about 15MB/s.
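
The three bandwidth figures above all come from the same formula, so here it is as a small sketch. It uses the 28MHz x 4 capture clock and 1024*1024 bytes per MB, as in the bracketed calculation above; the function name is made up for illustration.

```python
BASE_HZ = 28_000_000
CAPTURE_HZ = 4 * BASE_HZ   # the "114MHz" (28*4) clock the cycles are counted in
BYTES_PER_READ = 4         # one move.l per transaction

def mb_per_s(cycles_per_read):
    """Bandwidth in MB/s (1024*1024 bytes) for a given cycle count per longword read."""
    reads_per_s = CAPTURE_HZ / cycles_per_read
    return reads_per_s * BYTES_PER_READ / (1024 * 1024)

for label, cycles in (("TG68", 20), ("ARM, 28MHz bridge", 56), ("ARM, 28*7MHz bridge", 28)):
    print(f"{label}: {cycles} cycles per read -> {mb_per_s(cycles):.1f}MB/s")
```

Halving the cycles per read doubles the bandwidth, which is why going from 56 to 28 cycles takes the ARM from about 7.5MB/s to about 15MB/s.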

In the January release I was using the sped-up bridge at 28*4MHz and a 16-bit HPS-FPGA bridge. Can you guess the improvement I see in 'bustest' by changing it to 28*7MHz and a 32-bit HPS-FPGA bridge? None, argh! It shows me 6MB/s.

So what is going on with bustest? Well it seems like every other transaction is slow for some reason:
[Attachment: bustest_longword.png]
Oh and this last picture has the signal names, which I accidentally chopped off the others - oops.
foft

Re: Lets actually try Hybrid Emulation

Unread post by foft »

Actually TG68 has a similar pattern on bustest, i.e. slow/fast/slow/fast. It's just that its fast is very fast!
[Attachment: tg68_bustest.png]
foft

Re: Lets actually try Hybrid Emulation

Unread post by foft »

Anyway long story short, there is a 10 cycle (at 114MHz) 'waste' overhead due to the HPS-FPGA bridge. Then another 4 cycles (average) due to clock domain alignment (from 28x7 to 28). So 14 cycles waste per transaction. So I guess we have only ~6 cycles (1.5 cycles at 28MHz) to do the actual memory access to still reach TG68 level memory access performance.
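
The cycle budget spelled out as a quick sketch, using the 20 cycles per read from the TG68 capture earlier in the thread. The variable names are only for illustration.

```python
CAPTURE_CLOCKS_PER_BASE_CLOCK = 4   # 114MHz capture clock vs the 28MHz base clock

tg68_cycles_per_read = 20   # TG68 figure read off the earlier capture
bridge_overhead = 10        # HPS-FPGA bridge 'waste'
alignment_overhead = 4      # average clock domain alignment cost (28x7 to 28)

waste = bridge_overhead + alignment_overhead
budget_114 = tg68_cycles_per_read - waste
budget_28 = budget_114 / CAPTURE_CLOCKS_PER_BASE_CLOCK

print(f"waste per transaction: {waste} cycles at 114MHz")
print(f"left for the memory access: {budget_114} cycles at 114MHz "
      f"({budget_28} cycles at 28MHz)")
```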
foft

Re: Lets actually try Hybrid Emulation

Unread post by foft »

Some promising experiments with fifos for immediate write completion and pipelined reads...
foft

Re: Lets actually try Hybrid Emulation

Unread post by foft »

Well, I've put an updated core and matching qemu with what I have on GitHub - the hardware description, the patched qemu and the compiled setup:
https://github.com/scrameta/Minimig-AGA_MiSTer_Hybrid
https://github.com/scrameta/qemu_MiSTer_Hybrid
https://github.com/scrameta/MiSTer_Hybrid_Support

The changes did not give as big a boost as I hoped, though they are as follows:
i) HPS-FPGA bridge changed to 32-bit from 16-bit.
ii) HPS-FPGA bridge clock changed from 114MHz to 170MHz.
iii) HPS-FPGA 16-deep fifo for writes.
iv) HPS-FPGA 16-deep pipelined read support.
v) Expose CACR and VBR to the FPGA. For now qemu just defaults them to 1 and 0.

Note that I've only seen native memcpy use the pipelined read, and even then only 2 deep, so it doesn't help much in reality.

edit: Update, I reverted the rtg cache change, it caused corruption. Also note that e.g. doom runs much nicer overclocked.
foft

Re: Lets actually try Hybrid Emulation

Unread post by foft »

So I was thinking, perhaps it'd be better to change how the hard disk works in hybrid mode.

It's kind of bonkers to go the route it goes... sd card->ide.cpp->spi->fpga->hps/fpga bridge->qemu :lol:
It'd probably make more sense to go sd card->qemu.

I've noticed the MiSTer poll slows down when qemu is used. I don't know yet if this is cpu contention or down to it waiting for something from the FPGA.
kolla
Posts: 191
Joined: Sat Jun 13, 2020 7:56 am
Has thanked: 17 times
Been thanked: 33 times

Re: Lets actually try Hybrid Emulation

Unread post by kolla »

Yes, keep as much as possible "close" to the CPU and fast ram, especially I/O. Ideally, when RTG is used, the FPGA should have almost no use :)
LamerDeluxe
Top Contributor
Posts: 1230
Joined: Sun May 24, 2020 10:25 pm
Has thanked: 876 times
Been thanked: 281 times

Re: Lets actually try Hybrid Emulation

Unread post by LamerDeluxe »

Really great that progress is being made again on this project. It is very fascinating to follow.
Caldor
Top Contributor
Posts: 930
Joined: Sat Jul 25, 2020 11:20 am
Has thanked: 112 times
Been thanked: 111 times

Re: Lets actually try Hybrid Emulation

Unread post by Caldor »

This does sound like it could end up making the CPU emulation faster than just using the FPGA overall :)

I was speculating a bit on what might make for faster disk access, but ended up concluding I just do not know enough about the FPGA code and how much work different solutions might require, or what might and might not be possible. But some way of accessing disks differently ought to help.

I think it's a similar problem to the one the PiStorm has? Well... I think its disk access is faster, but the PiStorm problem, I think, is access to the slow RAM? If that is the case, I would suspect that giving QEMU direct access to the disk would give similar results to what PiStorm sees.
foft

Re: Lets actually try Hybrid Emulation

Unread post by foft »

I'm trying again to get uae4all working.

This code is really non-trivial to tear apart, I've tried about 3 times, so I'm trying to instead get it running with minimal changes.

Memory -> point to hps/fpga bridge or memory instead
GUI -> just make it start straight away
video/audio -> point to the dummy device

Once that lives I can try turning off some more parts!

So far... diagrom runs, but it's set to 68k for some reason and no jit, though that is probably just a setting. So I'll change that setting, then add interrupts. Then, fingers crossed :)
foft

Re: Lets actually try Hybrid Emulation

Unread post by foft »

OK, jit from uae4arm lives too.

Next, wire up interrupts again, then to workbench... Tomorrow!

I also think it's probably pausing the jit to do other uae stuff, so I should find/remove that too.
foft

Re: Lets actually try Hybrid Emulation

Unread post by foft »

Interrupts wired up, and a bunch of unneeded graphics/sound code and threads disabled.

DiagROM seems to be running well. :)

Real kickstart gives me a yellow screen briefly then it reboots. :? This is even before getting to the disk prompt etc.
Solskogen
Posts: 91
Joined: Mon May 25, 2020 5:33 am
Has thanked: 11 times
Been thanked: 6 times

Re: Lets actually try Hybrid Emulation

Unread post by Solskogen »

are you using amiberry or the old uae4arm?
foft

Re: Lets actually try Hybrid Emulation

Unread post by foft »

I’m using TomB’s uae4arm
SuperFrog
Posts: 32
Joined: Tue Jun 01, 2021 1:57 pm
Has thanked: 3 times
Been thanked: 1 time

Re: Lets actually try Hybrid Emulation

Unread post by SuperFrog »

Can someone please explain how to test minimig hybrid?!

I would really love to check it, but I have no idea where to start from. :(
foft

Re: Lets actually try Hybrid Emulation

Unread post by foft »

Pop this on your sd card, then start the minimig hybrid core:
https://github.com/scrameta/MiSTer_Hybrid_Support
SuperFrog

Re: Lets actually try Hybrid Emulation

Unread post by SuperFrog »

foft wrote: Fri Sep 16, 2022 4:31 pm Pop this on your sd card, then start the minimig hybrid core:
https://github.com/scrameta/MiSTer_Hybrid_Support
Will try it tonight!

Thank you!!!
Juri
Posts: 48
Joined: Sun May 24, 2020 6:49 pm
Has thanked: 12 times
Been thanked: 4 times

Re: Lets actually try Hybrid Emulation

Unread post by Juri »

Hi, what happened to the minimig hybrid core? Dead project? Thanks

JF
Arek0xff
Posts: 3
Joined: Sat Aug 15, 2020 7:39 am
Has thanked: 1 time

Re: Lets actually try Hybrid Emulation

Unread post by Arek0xff »

Is the project alive?

foft

Re: Lets actually try Hybrid Emulation

Unread post by foft »

I come back to it every year or so.

No-one else is interested in pushing this further though? I thought someone might try porting, for instance, Emu68, or merging the core changes etc.

I was hopeful with the uae4arm cpu. Got that working a while back with diagrom but the actual os didn’t boot. Must be something simple…
