Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

We need some help to finally fix Darwin PowerPC assembler so that not just build succeeds but also tests pass #211

Open
barracuda156 opened this issue Nov 24, 2022 · 28 comments

Comments

@barracuda156
Copy link
Contributor

In a couple of recent PRs I have made some fixes to Darwin PPC assembler here (nothing big, just fixing which was actually breaking the build), which fixed building of context and made it somewhat functional (in a sense that certain dependencies that require context now build successfully, folly and friends being the case).
However, tests do not pass: macports/macports-ports#16407 (comment)

From earlier PRs by @kernigh and @DaoWen for PPC ELF it looks like the issue is quite non-trivial and requires a thorough understanding of both what Boost is doing and the magic code :)
Unfortunately, my knowledge of PPC assembler is rather limited and does not suffice here. I am not going to drop the case, since we really want this fixed properly, but any advice or help will be greatly appreciated.

This has been broken apparently forever – can we fix it now? :)

(Considering a recent case with Boost shared pointer (boostorg/smart_ptr#105), I cannot be entirely sure that it is assembler which is to blame and not again some wrong alignment, but probably it is assembler, given that it was also broken for *BSD.)

P. S. For the record, alternative version of PPC code for context from the following repo also fails, at least with Fibonacci test: https://github.com/twlostow/libcontext

P. P. S. @iains I am aware you are very busy atm, so please feel free to completely ignore the issue. But still tagging you as the person who understands Darwin magic code best.

@olk
Copy link
Member

olk commented Nov 25, 2022

Seams to be a NULL pointer is dereferenced - this might happen because of the calling convention is violated.
I don't own a PPC64(?) system and I'm too busy atm.
You could debug a simple test app (simple jump) by stepping through the assembler instructions and compare the content in teh registers with what the calling convention requires.

@kernigh
Copy link
Contributor

kernigh commented Nov 27, 2022

My PPC32 hardware is running OpenBSD, but I have a QEMU running Darwin. I can't compile Boost for Darwin, because my gcc 3.3 can't do C++11, but I can compile ppc32_sysv_macho.S and call them from C.

I found a problem: make_fcontext(_, _, there) fails to pass the transfer_t there. It is passing r3 = address of transfer_t, but there wants r3:r4 = transfer_t. You can jump there, but there can't jump to its transfer_t's fctx; jump_fcontext will crash.

This diff seems to fix make_context, but ontop_context might still be broken.

diff --git a/make_ppc32_sysv_macho_gas.S b/make_ppc32_sysv_macho_gas.S
--- a/make_ppc32_sysv_macho_gas.S
+++ b/make_ppc32_sysv_macho_gas.S
@@ -85,11 +85,12 @@
     clrrwi  r3, r3, 4
 
     ; reserve space for context-data on context-stack
-    ; including 64 byte of linkage + parameter area (R1  16 == 0)
+    ; including 64 byte of linkage + parameter area (R1 % 16 == 0)
     subi  r3, r3, 336
 
     ; third arg of make_fcontext() == address of context-function
-    stw  r5, 240(r3)
+    ; store as trampoline's R31
+    stw  r5, 224(r3)
 
     ; set back-chain to zero
     li   r0, 0
@@ -106,12 +107,15 @@
     ; load LR
     mflr  r0
     ; jump to label 1
-    bl  l1
-l1:
+    bcl  20, 31, L1
+L1:
     ; load LR into R4
     mflr  r4
+    ; compute abs address of trampoline, use as PC
+    addi  r5, r4, lo16(Ltrampoline - L1)
+    stw  r5, 240(r3)
     ; compute abs address of label finish
-    addi  r4, r4, lo16((finish - .) + 4)
+    addi  r4, r4, lo16(Lfinish - L1)
     ; restore LR
     mtlr  r0
     ; save address of finish as return-address for context-function
@@ -123,15 +127,28 @@ l1:
 
     blr  ; return pointer to context-data
 
-finish:
-    ; save return address into R0
-    mflr  r0
-    ; save return address on stack, set up stack frame
-    stw  r0, 4(r1)
-    ; allocate stack space, R1  16 == 0
-    stwu  r1, -16(r1)
+Ltrampoline:
+    ; We get R31 = context-function, R3 = address of transfer_t,
+    ; but we need to pass R3:R4 = transfer_t.
+    mtctr  r31
+    lwz  r4, 4(r3)
+    lwz  r3, 0(r3)
+    bctr
 
+Lfinish:
+    ; load address of _exit into CTR
+    bcl  20, 31, L2
+L2:
+    mflr  r4
+    addis  r4, r4, ha16(Lexitp - L2)
+    lwz  r4, lo16(Lexitp - L2)(r4)
+    mtctr  r4
     ; exit code is zero
     li  r3, 0
     ; exit application
-    bl  _exit
+    bctr
+
+.const_data
+.align 2
+Lexitp:
+    .long  __exit

I put a few other changes in the diff.

  • I used L to hide local labels from gdb.
  • I used bcl 20, 31, L1 as a branch hint, like gcc; see MPC7450 RISC Microprocessor Family Reference Manual (MPC7450UM.pdf), 6.7.1.4.2 Position-Independent Code Example.
  • I added Lexitp, because my linker rejected bl _exit, "ld: /var/tmp//ccBVLXX8.o has external relocation entries in non-writable section (__TEXT,__text) for symbols: _exit". If the C++ compiler comes with a better linker, you might not need this change; but you should change _exit to __exit, which changes the call from exit(0) to _exit(0), to be like the other platforms.

Someone else, who has the C++ compiler, can check the diff, edit it, and git commit it.

My diff doesn't fix an alignment problem. make_fcontext sets the correct alignment 16, but jump_fcontext does addi r1, r1, 244, and 244 isn't a multiple of 16. When you jump to the new context, you will have a misaligned stack pointer. This will break altivec code (but PPC32 compilers tend to disable altivec) and slow down code that has 8-byte floats on the stack. I fixed the alignment for ELF in df8fb6b by rearranging the frame from 244 to 240 bytes. A similar fix might help Mach-o, but ELF and Mach-o have different stack layouts, so you can't copy ELF's layout.

@barracuda156
Copy link
Contributor Author

@kernigh Thank you very much! I will try building it now and running tests.

For the stack, I think we can make it 240, since now it is saving “hidden”, which was copied from ELF version or wherever. In fact an earlier libcontext version did not have it, and used 240.

P. S. If you can run at least Tiger in QEMU, @catap has fixed GCC10 for it.

@iains
Copy link

iains commented Nov 27, 2022

I cannot comment on much here (no time to load context myself, so not sure exactly what frame you are discussing) but ...

My diff doesn't fix an alignment problem. make_fcontext sets the correct alignment 16, but jump_fcontext does addi r1, r1, 244, and 244 isn't a multiple of 16. When you jump to the new context, you will have a misaligned stack pointer. This will break altivec code (but PPC32 compilers tend to disable altivec) and slow down code that has 8-byte floats on the stack. I fixed the alignment for ELF in df8fb6b by rearranging the frame from 244 to 240 bytes. A similar fix might help Mach-o, but ELF and Mach-o have different stack layouts, so you can't copy ELF's layout.

Darwin ppc32 ABI includes altivec, it cannot be disabled (although 10.4 will run on a G3, it does so by specific code that tests if altivec is present and skips the save/restore - it does not remove the space from the ABI).

So this needs fixing - we know what the stack layout of a saved context should be ... if the synthesised one does not preserve alignment, it is probably still not quite right.

barracuda156 added a commit to barracuda156/context that referenced this issue Nov 27, 2022
@kernigh
Copy link
Contributor

kernigh commented Nov 27, 2022

@barracuda156 I believe that we need the "hidden" pointer to return a transfer_t, as transfer_t jump_fcontext(void * from, void (* fn)(transfer_t)) is really void jump_fcontext(transfer_t * hidden, void * from, void (* fn)(transfer_t)) in Darwin's ABI. Commit ba35720 had added the hidden pointer when it changed jump_fcontext's return type from void * to transfer_t.

The stack layout is in the 32-bit PowerPC Function Calling Conventions of the old OS X ABI Function Call Guide in Apple's documentation archive. The calling routine reserves 8(SP) for the link register and 4(SP) for the condition register. The called routine may save its LR and CR in the calling routine's frame. One might shrink jump_fcontext's frame from 244 to 240 bytes by saving its LR and CR in its caller's frame.

@iains The 244-byte stack frame belongs to jump_fcontext, where it does subi r1, r1, 244 and addi r1, r1, 224; but ontop_fcontext and make_fcontext use the same frame.

When I said, "PPC32 compilers tend to disable altivec", I meant that they don't emit altivec instructions. In the ABI, each function must preserve altivec registers v20 to v31 (AltiVec Technology Programming Interface Manual ALTIVECPIM.pdf, 3.2 Register Usage Conventions; example in Apple's _setjmp). Our jump_fcontext is missing code to save and restore v20..v31, but if the compiler didn't emit altivec, then the program never uses v20..v31, and never needs jump_fcontext to preserve v20..v31.

Some libraries, like libjpeg-turbo and pixman, have code that checks if the CPU has altivec, then calls altivec code. After they return from the altivec code, they would stop using v20..v31, so there would be no problem with v20..v31. I fear a problem with a misaligned stack pointer. A fiber with a misaligned SP might fail to do jpeg or pixman.

@barracuda156
Copy link
Contributor Author

@kernigh With your patch (and also adjusting 244 to 240, I did that earlier before your comment and wanted to see what happens anyway), Fibonacci test passes fine:

sergey-fedorovs-power-mac-g5:boost svacchanda$ /opt/local/bin/g++-mp-12 fibonacci.cpp -I/opt/local/libexec/boost/1.76/include  -std=c++17 -lboost_context-mt -L/opt/local/libexec/boost/1.76/lib -g -O0
sergey-fedorovs-power-mac-g5:boost svacchanda$ sudo /opt/local/bin/gdb-apple ./a.out
sudo: ignoring time stamp from the future
Password:
GNU gdb 6.3.50.20050815-cvs (Sun Oct 30 18:17:59 UTC 2022)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "--host=powerpc-apple-darwin10.0.0d2 --target="...Reading symbols for shared libraries .
warning: Could not find object file "/opt/local/var/macports/build/_opt_PPCSnowLeopardPorts_devel_boost176/boost176/work/boost_1_76_0/bin.v2/libs/context/build/darwin-12.2.0/release/threadapi-pthread/threading-multi/visibility-hidden/posix/stack_traits.o" - no debug information available for "libs/context/src/posix/stack_traits.cpp".

..... done

(gdb) run
Starting program: /Users/svacchanda/Dev/boost/a.out 
Reading symbols for shared libraries +++++warning: Could not find object file "/opt/local/var/macports/build/_opt_PPCSnowLeopardPorts_devel_boost176/boost176/work/boost_1_76_0/bin.v2/libs/chrono/build/darwin-12.2.0/release/threading-multi/visibility-hidden/chrono.o" - no debug information available for "libs/chrono/src/chrono.cpp".

warning: Could not find object file "/opt/local/var/macports/build/_opt_PPCSnowLeopardPorts_devel_boost176/boost176/work/boost_1_76_0/bin.v2/libs/chrono/build/darwin-12.2.0/release/threading-multi/visibility-hidden/thread_clock.o" - no debug information available for "libs/chrono/src/thread_clock.cpp".

warning: Could not find object file "/opt/local/var/macports/build/_opt_PPCSnowLeopardPorts_devel_boost176/boost176/work/boost_1_76_0/bin.v2/libs/chrono/build/darwin-12.2.0/release/threading-multi/visibility-hidden/process_cpu_clocks.o" - no debug information available for "libs/chrono/src/process_cpu_clocks.cpp".

.warning: Could not find object file "/opt/local/var/macports/build/_opt_PPCSnowLeopardPorts_devel_boost176/boost176/work/boost_1_76_0/bin.v2/libs/thread/build/darwin-12.2.0/release/threadapi-pthread/threading-multi/visibility-hidden/pthread/thread.o" - no debug information available for "libs/thread/src/pthread/thread.cpp".

warning: Could not find object file "/opt/local/var/macports/build/_opt_PPCSnowLeopardPorts_devel_boost176/boost176/work/boost_1_76_0/bin.v2/libs/thread/build/darwin-12.2.0/release/threadapi-pthread/threading-multi/visibility-hidden/pthread/once.o" - no debug information available for "libs/thread/src/pthread/once.cpp".

warning: Could not find object file "/opt/local/var/macports/build/_opt_PPCSnowLeopardPorts_devel_boost176/boost176/work/boost_1_76_0/bin.v2/libs/thread/build/darwin-12.2.0/release/threadapi-pthread/threading-multi/visibility-hidden/future.o" - no debug information available for "libs/thread/src/future.cpp".

... done

Program exited normally.

No more Bus errors, awesome!

Two questions:

  1. So what will be the optimal way to save hidden, given alignment requirement?

  2. To what extent this code can be used for Darwin ppc64 (of course with adjusting to 64-bit registers and replacing some insns with 64-bit equivalents)? Also, current ppc64 code in Boost does not save FP registers at all. Not sure that is what we want.

@evanmiller
Copy link
Contributor

@barracuda156 The fibonacci program should print out the first 10 numbers from the Fibonacci sequence. I'm not seeing them in your output, so I wouldn't yet call the test passing.

@evanmiller
Copy link
Contributor

All right, I've been twisting my brain up with this code this morning trying to make sense of it... Disclaimer, I'm not an assembly programmer or systems expert, just making sense of what I can from the links in this thread.

I think the additional change needed in ontop_ppc32_sysv_macho_gas.S amounts to this:

     ; restore CTR
     mtctr r6
 
+    ; set the first arg to the on-top function (r5 is already correct)
+    mr r4, r7
+
     ; jump to ontop-function
     bctr

Basically, the on-top function takes a transfer_t and returns a transfer_t. Because transfer_t is a composite type, the return value will go into r3, but it needs to be passed as the first two registers. In this case, that's R4:R5 rather than R3:R4 that we saw in the make_fcontext callback since r3 is now being used for a return value. R5 is correctly set at this point in the code, so all we need to do is get R4 right.

I'll test this code locally once I get my Boost environment set up on my test machine... In the meantime, @barracuda156 feel free to try it out since you seem to have everything up and running. Btw I strongly suggest NOT making the 244 -> 240 change you mentioned without fully understanding all the implications.

@barracuda156
Copy link
Contributor Author

@evanmiller I am temporarily away from native hardware, but ppc32 can be tested in Rosetta with reasonable reliability. Will try that in coming days.

@barracuda156
Copy link
Contributor Author

@kernigh When you have time, could you please comment on ppc64 versions? Is there something special to be done or just adjust byte sizes?

Hopefully we can fix the whole Darwin PPC implementation in one go (after all) and not leave ppc64 for later.

@evanmiller
Copy link
Contributor

Good news! Well, mixed...

Compiling with the two changes above (@kernigh's plus my register tweak), I am able to get the Fibonacci sequence to print!

$ ./a.out
0 1 1 2 3 5 8 13 21 34
main: done
fish: Job 1, './a.out', terminated by signal SIGSEGV (Address boundary error)

This means that control is successfully passed several times between a fiber and the main program. Hooray!

However, as you can see, there's a segfault when the program ends... I am guessing it is triggered during fiber_exit, and perhaps my change to ontop_fcontext was incorrect or insufficient. I'll need to play around with a debugger to get some more information here. More later...

evanmiller added a commit to evanmiller/context that referenced this issue Dec 27, 2022
For make_fcontext, use the diff provided here:

boostorg#211 (comment)

For ontop_context, adapt the Linux PPC32 fixes from here:

boostorg@df8fb6b

Co-authored-by: George Koehler <[email protected]>
evanmiller added a commit to evanmiller/context that referenced this issue Dec 27, 2022
For make_fcontext, use the diff provided here:

boostorg#211 (comment)

For ontop_context, adapt the Linux PPC32 fixes from here:

boostorg@df8fb6b

Co-authored-by: George Koehler <[email protected]>
@evanmiller
Copy link
Contributor

OK, I think I've gotten it working now... see #215 for the latest and greatest.

In addition to @kernigh's make_fcontext patch above, I studied his patch in #123 to fix the ontop_fcontext code. Basically the Darwin code suffered from the same problems as the Linux code: the incorrect transfer_t was being written to the return value, and C++ exceptions were not properly handled, resulting in a crash when the fiber was being deallocated (Boost deliberately throws a C++ exception to unwind the fiber stack). The solution was to mirror the Linux code, calling out to the C++ tail function (which calls the ontop-function), rather than calling the ontop-function directly. Except for the incorrectly transfer_t destination, the registers were already correctly set up for this function call, so the changes to ontop_fcontext ended up being quite small.

The Fibonacci test now prints correct output, and doesn't crash, so I'm pretty pleased with it. Once I finish fiddling with the file structure I'll remove the Draft status. Stack alignment and PPC64 I will leave to others, or else save for another holiday.

evanmiller added a commit to evanmiller/context that referenced this issue Dec 27, 2022
For make_fcontext, use the diff provided here:

boostorg#211 (comment)

For ontop_context, adapt the Linux PPC32 fixes from here:

boostorg@df8fb6b

Co-authored-by: George Koehler <[email protected]>
@evanmiller
Copy link
Contributor

Btw it looks like the entire Darwin PPC64 make_fcontext function was accidentally commented out in this 2016 commit, and never restored:

ba35720#diff-4812b5eb7dffd2775492b2b706f826bf1463a239399169eb1711ffec9085a49bL53

So PPC64 "needs some work", to say the least.

@barracuda156
Copy link
Contributor Author

barracuda156 commented Dec 30, 2022

@evanmiller @kernigh @olk A quick update. gcc12 does no help, and fiber fails to build with the new fixes to context from #215

Specifically, these components of fiber fail to compile (with assembler errors):

barrier
condition_variable
context
mutex
recursive_mutex
recursive_timed_mutex
timed_mutex
scheduler
waker
algo/work_stealing

@barracuda156
Copy link
Contributor Author

@evanmiller @kernigh @olk A quick update. gcc12 does no help, and fiber fails to build with the new fixes to context from #215

Specifically, these components of fiber fail to compile (with assembler errors):

barrier
condition_variable
context
mutex
recursive_mutex
recursive_timed_mutex
timed_mutex
scheduler
waker
algo/work_stealing

fiber will be fixed once this is merged: boostorg/fiber#306

@barracuda156
Copy link
Contributor Author

Btw it looks like the entire Darwin PPC64 make_fcontext function was accidentally commented out in this 2016 commit, and never restored:

ba35720#diff-4812b5eb7dffd2775492b2b706f826bf1463a239399169eb1711ffec9085a49bL53

So PPC64 "needs some work", to say the least.

@evanmiller Sorry, where in particular?

@evanmiller
Copy link
Contributor

evanmiller commented May 15, 2023

Btw it looks like the entire Darwin PPC64 make_fcontext function was accidentally commented out in this 2016 commit, and never restored:
ba35720#diff-4812b5eb7dffd2775492b2b706f826bf1463a239399169eb1711ffec9085a49bL53
So PPC64 "needs some work", to say the least.

@evanmiller Sorry, where in particular?

See line linked in above comment – the entire PPC64 file was accidentally commented out in 2016, so no one has compiled it in years.

Line 53 of src/asm/make_ppc64_sysv_macho_gas.S

image

@Kojoley
Copy link
Contributor

Kojoley commented May 30, 2023

the entire PPC64 file was accidentally commented out in 2016, so no one has compiled it in years.

Is there is a way to cross-compile to that target from a modern macos? It would be nice to have a CI job.

@iains
Copy link

iains commented May 30, 2023

the entire PPC64 file was accidentally commented out in 2016, so no one has compiled it in years.

Is there is a way to cross-compile to that target from a modern macos? It would be nice to have a CI job.

Yes, it can be done - but requires:

  • a set of "binutils" capable of targeting powerpc-darwin (which means building them locally and is a little fiddly at present).

  • A MacOSX 10.5 SDK on the build machine (there are no PPC64 slices in later SDK versions and no PPC slices after 10.6)

Having said that, I do this reasonably often from x86-64-darwin2* (and earlier).

It is also actually possible at least to run compile tests on the cross-compiler (with TCL and dejagnu needed), and if you have a real hardware box, execute tests can be run remotely (although it is much better performance to do the build locally if you have decent hardware)

edit: "binutils" == cctools, ld64 and a version of dsymutil built from the LLVM sources (with powerpc patched in)... hence the "bit fiddly" - but, nevertheless doable.

@Kojoley
Copy link
Contributor

Kojoley commented May 30, 2023

  • a set of "binutils" capable of targeting powerpc-darwin (which means building them locally and is a little fiddly at present).

AppleClang is stripped from PPC support?

I don't have hardware nor (would) use it. I'm planning on reworking asm compiling to dispatch targets via preprocessing (develop...Kojoley:context:feature/autodispatch) and having as much targets as possible on CI would reduce chances of breaking your targets. I think my case would be covered by just testing if b2 asm_sources succeeds and that requires only clang --target=... -x assembler-with-cpp foo.S -o foo.o to work.

@iains
Copy link

iains commented May 30, 2023

Apple clang never supported PPC (at least, no released or published version did). Neither does upstream clang (without patches and that support is incomplete). The last Apple toolchain with PPC support is the Xcode 3.2.6 gcc-4.2.1.

I was talking about using a GCC cross toolchain (which continues to support powerpc-darwin, even with current trunk [14.0.0]).

Having said that, I realise you need a pretty restricted set of functionality - essentially preprocessor + assembler, right?

(I will see if there's some easier sub-set of tools to do that - but you can be certain that there is nothing available 'out of the box').

@Kojoley
Copy link
Contributor

Kojoley commented May 30, 2023

Having said that, I realise you need a pretty restricted set of functionality - essentially preprocessor + assembler, right?

Yes, though I forgot that you guys added asm/tail_ontop_ppc32_sysv.cpp thing.

(I will see if there's some easier sub-set of tools to do that - but you can be certain that there is nothing available 'out of the box').

I will just ping here to test the PR then.

@barracuda156
Copy link
Contributor Author

edit: "binutils" == cctools, ld64 and a version of dsymutil built from the LLVM sources (with powerpc patched in)... hence the "bit fiddly" - but, nevertheless doable.

Are these darwin-xtools or some other modification?

@Kojoley
Copy link
Contributor

Kojoley commented Jun 14, 2023

(I will see if there's some easier sub-set of tools to do that - but you can be certain that there is nothing available 'out of the box').

I will just ping here to test the PR then.

Here is the PR #228 if you care enough to test it.

@olk
Copy link
Member

olk commented Aug 18, 2023

should be work now

@olk olk closed this as completed Aug 18, 2023
@barracuda156
Copy link
Contributor Author

@olk I guess, ppc64 still broken though? I have not seen anyone fix it.

@olk
Copy link
Member

olk commented Aug 18, 2023

@olk I guess, ppc64 still broken though? I have not seen anyone fix it.

deutsch I assumed it is fixed ... maybe I was confused be the lengthy discussion

@barracuda156
Copy link
Contributor Author

barracuda156 commented Aug 18, 2023

@olk Last time I looked at it – it was broken to the point of one of the source files being accidentally commented out, and that for years went unnoticed :)
Ah, just above: #211 (comment)

It is trivial to make it build, however Fibonacci test fails. Someone with a better understanding of assembler and ABI has to look into it.

@olk olk reopened this Aug 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants