& the code is placed at 0x401ae964 (i.e. malloc-ed ram, no fpga involved)
Note that A6, aka stack pointer, is also in fast ram.
...
I have a binary that loads the same code to a malloc'ed array then executes it:
User mode: 0.6 seconds
my 'Mister' machine (from newcli): 5 mins, 15 seconds
official 'Virtual 68k' machine: 1.3 seconds
The only reasons I can think of for that are:
i) The qemu machine has some throttling settings and I need to tell it to go flat out?
ii) It accesses the chip memory or io area all the time, despite the code not telling it to. MMU tables, or something like that?
robinsonb5 wrote: ↑Sun May 16, 2021 5:47 pm
What happens if you surround the test program with a Disable() / Enable() pair?
That didn't seem to change it.
Though, something interesting. I ran it several times and it went at full speed sometimes! (To be clear that was with the unchanged build where I didn't add Disable/Enable)
After some red herrings with icount etc., it seems to be a single chain of translation blocks, which is what I'd expect. So I guess one of them accesses the hardware area, otherwise I really don't understand.
So, found out a few more things...
i) The irq implementation was still incorrect
ii) The slow code is running completely locally in fast ram, with no irqs and no hps-fpga bridge access.
The problem with the irqs was an off-by-one error and me not understanding edge triggered irqs properly.
I thought 'edge triggered' meant that on any edge I'd get an irq, so I had just wired up the irq lines directly, thinking I'd get an irq whenever they changed. I've now changed this to an xor on the old/new irq flags, or'ed together to give a single irq on any change.
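To make the xor/or idea concrete, here is a minimal C model of it. This is only a sketch of the logic, not the actual fpga/HDL code: the struct, the function name and the 8-bit line width are all made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical state for the edge detector: the line levels at the last sample. */
typedef struct {
    uint8_t prev_lines;
} irq_edge_state;

/* Returns 1 if any irq line changed level since the previous call, else 0.
 * xor marks each line that toggled; testing for non-zero or's them together. */
static int irq_any_edge(irq_edge_state *s, uint8_t lines)
{
    assert(s);
    uint8_t changed = (uint8_t)(s->prev_lines ^ lines);
    s->prev_lines = lines;      /* remember levels for the next sample */
    return changed != 0;
}
```

Holding a line at a constant level gives no further irq here; only a change (rising or falling) on some line does.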
The off-by-one error was a mistake in the .dts file. I'm actually pretty shocked it worked at all and passed all the DiagROM tests like this. Anyway, fixed it now.
For the 'slow loop' code I now know it's all in one tb (translation block) chain. I have the 68k code and the arm code logged. When it's running nothing further is logged, since it's all in the (previously logged) dynamically compiled arm code. While it was running I had signal tap up to check for irqs and any hps avalon slave access - no access, no irqs (since I call Disable/Enable now). So, next step is ... trying to run this block of arm machine code to figure out why it doesn't work.
So in theory qemu is executing these jit instructions...
However when I debug it with gdb, I don't seem to get code at these addresses. In case there is an offset (it maps it as both read/execute and read/write at two addresses) I thought I'd do this a few times on all the executing threads:
display/i $pc
stepi
Unfortunately I can't find the code it's running! Also the stack doesn't show properly (and yes, I did try rebuilding qemu with -g and -O0). Ufff!
foft wrote: ↑Tue May 18, 2021 6:18 pm
So... this seems to happen if the stack shares an mmu page with code. I'm referring here to the qemu mmu, not the emulated 68k mmu.
This is the case when starting my tests from newcli, and presumably at other times too.
Are there any amiga programs to force the stack location? They might be worth a try.
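For anyone wanting to check their own addresses, the "shares a page" condition is just a shift and compare. A minimal sketch in C, where the helper name is made up and the page size is passed in because I'm not asserting here which page size qemu uses for this target:

```c
#include <assert.h>
#include <stdint.h>

/* True (1) if two 32-bit guest addresses fall in the same page,
 * for a page size of 1 << page_bits bytes. Hypothetical helper. */
static int same_page(uint32_t a, uint32_t b, unsigned page_bits)
{
    assert(page_bits < 32);
    return (a >> page_bits) == (b >> page_bits);
}
```

E.g. with 4k pages (page_bits = 12), code at 0x401ae964 and a stack pointer at 0x401ae000 are in the same page, while a stack pointer at 0x401af000 is not.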
foft wrote: ↑Tue May 18, 2021 6:45 pm
Perhaps we can patch this, to add an extra 8k to each stack then grow downwards from there. So we never share code and stack. http://aminet.net/package/util/boot/StackAttack2
Even unmodified it makes the stack significantly larger if there's loads of available memory - so it might already help. If you have more than 128 meg free the stack will be 128k - so that alone should be enough to make sure there's no page clash, yes?
I just installed StackAttack2 'as-is'. With my simple loop it now worked every time (of about 5-6 tries). I tried (real) dhrystones and get about 55000. That still isn't quite the 400000, so I wonder if something else is going on there; anyway it's much, much better than I got before.
Anyway this seems worth improving. Though of course the issue remains for other programs with data near code, so if it's possible to cut the overhead in qemu that'd be good. I guess the same tlb entry is often hit, so caching the last one might save a tree lookup, for instance.
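Roughly what I mean by caching the last hit, as a standalone C sketch. This is not qemu's code or its data structures: the table, the names and the 4k page size are all invented for illustration, and a binary search over a sorted array stands in for the tree lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BITS 12u   /* assumed 4 KiB pages, for illustration only */

typedef struct { uint32_t page; int flags; } page_entry;

/* Sorted by page number; made-up entries standing in for the real tree. */
static const page_entry table[] = {
    { 0x401aeu, 1 }, { 0x401b0u, 0 }, { 0x401b4u, 0 },
};

static unsigned slow_lookups;   /* counts how often the slow path runs */

/* Slow path: binary search (stand-in for the tree walk). */
static const page_entry *slow_lookup(uint32_t page)
{
    size_t lo = 0, hi = sizeof table / sizeof table[0];
    slow_lookups++;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].page == page) return &table[mid];
        if (table[mid].page < page) lo = mid + 1; else hi = mid;
    }
    return NULL;
}

/* Fast path: reuse the previous result when the page is unchanged. */
static const page_entry *cached_lookup(uint32_t addr)
{
    static const page_entry *last;          /* one-entry cache */
    uint32_t page = addr >> PAGE_BITS;
    if (last && last->page == page) return last;
    const page_entry *e = slow_lookup(page);
    assert(e == NULL || e->page == page);
    if (e) last = e;                        /* only cache real hits */
    return e;
}
```

Repeated accesses to the same page then cost one compare instead of a full search; whether that helps in practice depends on how often consecutive slow-path accesses really do land on the same page.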