or, An Elementary Explanation of Mach-o Linkage with
Applications to Having Your C++ Program Not Constantly Crash in
dynamic_cast
Sometimes it seems like a good idea to split your large program into
multiple dynamic libraries, and share C++ objects across dylib
boundaries. While this approach is certainly feasible, the waters are
surprisingly treacherous, especially for apps that use C++ RTTI. This
guide is meant for deploying to Apple Platforms ca. Mavericks/iOS7, but
it probably applies to any mach-o platform with clang as a compiler, and
the ELF case is similar enough that this guide will probably still be
useful.
Symbol Visibility
While there are is a veritible managerie of symbol types that exist
in mach-o binaries (and we will be encountering many of these shortly),
there is only one distinction you can make in source code to control
which symbol type gets generated during linking: "default" visibility vs
"hidden" visibility.
- default visibility
-
default visibility creates what
nm -m
calls an
external symbol. These symbols can be looked up by client
libraries during dynamic loading. hidden visibility
-
hidden visibility creates what
nm -m
calls a
non-external symbol. These cannot be seen anywhere outside of
the library where they exist.
Confusingly, if your app uses best practice, you'll pass the
-fvisibility=hidden
flag to the linker, which sets the
"default" for symbols that are not explicitly marked in source code is
hidden. This is a good idea because you don't want all of your
private functions in dynamic libraries leaking out to the rest of your
program. In fact, this is probably why you wanted to split your program
across executables in the first place. Using this flag, for symbols that
you want to be accessible to outside libraries, you must explicitly mark
them as having default visibility. I don't know any
justification for these names - if I were designing gcc, I would have
gone with "external" and "private" instead of "default" and
"hidden".
Defining "Undefined" Symbols
If you give a function a definition, it becomes a defined
symbol in your executable or shared library. If the function has no
definition, but does have a declaration, it gets marked as an
undefined symbol.
defined, non-external symbols are private to the
specific library where they were created
defined, external symbols are exported to all other
libraries. By default, if you link to any library that has a symbol with
the same name, you'll get a linker error.
undefined, non-external symbols are an immediate
linker error
undefined, external symbols will be looked up at
runtime. By default, the linker will check to make sure that some
dynamic library that you've linked to contains the symbol in question,
and if it's missing it will issue an error. This behavior can be
disabled, causing a runtime error in the missing symbol case.
So far, so boring. Hopefully everything up until now has been a
review if you are even remotely considering breaking your program up
into multiple binaries.
A Root of Some Evil: stdc++ dynamic_cast
Implementation
Details
To understand why any of these linker specifics is even relevant to
keeping your program crash-free, we need to understand an important
detail about how dynamic_cast
is implemented in stdc++ and
c++abi - two type infos are detected as equal through a pointer
compare [1]. This means that
dynamic_cast
can be wrong if there's more
than one copy of the type info structure - the structures may be
identical, but it doesn't matter because only their locations in memory
are being compared. So, because of this optimization, we actually have
to care about how and when these type info objects get generated, and
how they behave at runtime.
Vague Linkage (Yes, That's Actually What It's Called)
The way type infos work in gcc and clang is that each .o
file (that is, compiled .cpp
file) generates a
weak version of each class's type info. This is called
vague linkage in the docs (such as there are docs for this
low-level crap). At link time, these are combined into a single copy of
that type info. For classes that are marked in code to be
hidden, that's more or less the end of the story - the typeinfo
is marked non-external, and of course
dynamic_cast
s within that binary work, and since the class
was hidden, you aren't supposed to be able to use it outside of that
binary, so in theory dynamic_cast
s outside of the binary
aren't a concern. In practice, this requires some discipline from
programmers to avoid disaster - if you mark a class as hidden,
you really can't use it safely outside of the binary
where it was defined.
For classes that are marked default, the story is a bit more
complicated as it involves an entirely new symbol type, called a
weak external symbol. These symbols are
coalesced at runtime such that there is only one copy
of each in a given program (which may comprise several binaries). This
is precisely the behavior that enables the pointer compare in
dynamic_cast
- without this, there would be no way to share
types across dynamic library boundaries.
Sounds Great, What Could Go Wrong?
Well, if you've programmed carefully, you should be safe from any
dynamic_cast
problems. The issue is, if you've programmed
slightly sloppily, in the current toolchain, you'll get no major errors
at compile time, and your only indication is that something is wrong is
that dynamic_cast
will sometimes inexplicitly return zero,
probably causing a crash. Here are three common cases that I've
seen:
Using a hidden class from outside its home binary
How could this happen? Well, both clang and g++ happily will mark
member functions as default even if the class is
hidden. So, you can include the class's header from another
binary and use the any of the functions that were marked
default. This is a recipe for disaster, because both
binaries will create a non-external, private copy of
the type info and the pointer comparison will fail.
To detect this, run nm -mo * | c++filt
on all your
binaries, and ensure that there aren't any duplicate "non-external
typeinfo" symbols.
Having a class marked hidden in one binary and
default in another
This can happen if you are using a macro system to mark your classes
that breaks down somehow. I've also seen this when template class
explicit instantiations are not marked with the correct visibility. This
will result in the exact same sort of problems seen above, as the weak
symbol loading will not coalesce the weak
external version with the non-external version.
Similarly to the last case, to detect this, run
nm -mo * | c++filt
and ensure that for any typeinfos that
have at least one non-external copies, there are no other
copies of that typeinfo.
Having a class marked hidden in one compilation unit and
default in another in the same binary
This is sort of the "ultimate screw case", and I've only seen it
happen with template instantiations - if you have an explicit
instantiation that's marked default, and somehwhere else, in a
different compilation unit in the same binary, you have an implicit
instantiation, it will be marked hidden.
What happens here? Well, in modern (10.7+) versions of
ld
, this is a warning. ("direct access
...
means the weak symbol cannot be overriden at runtime
"). The
compilation unit with the hidden definition will incorrectly
use the local definition, and dynamic casts will fail. Other compilation
units will be fine, and the symbol will be marked external in
the binary.
On older versions of ld
(10.6 and below), things are
even worse. The symbol will be non-external in the binary if
it's marked hidden for even one compilation unit - making
dynamic_cast
s fail throughout the entire binary, even in
compilation units where the visibility was correct. No meaningful
warning is generated.
To detect this insidious problem, if you're compiling with the 10.7+
linker, I recommend turning fatal_warnings
on. If you're
using the 10.6 linker, the nm
method from before will work
because the symbol will be marked non-external in the binary.
If you're super-paranoid, you can run nm -om * | c++filt
on
all your .o
files - object files where the visibility was
hidden will report weak private external typeinfo
where object files with default visibility will report
weak external typeinfo
(note that if you forget the
-m
flag on nm
these two cases are
indistinguishable).
footnote
[1]: You can verify this yourself by
looking at the easy-to-read llvm source (check the
is_equal
function implementation). This is actually
determined by a compile-time flag called
_LIBCXX_DYNAMIC_FALLBACK
. If this flag is on, a string
compare occurs during dynamic_cast
. This is intended to
allow c++abi to be used on platforms that don't allow for weak symbols
or COMDAT. This flag is off in apple-provided builds of c++abi.