Crash reporting overhaul: Difference between revisions
(Overall structure & information about fix-stacks)
(Added information on dump_syms and completed the list of client-side tools)
Revision as of 14:22, 23 February 2022
Introduction
This page describes the various components involved in the rewrite of our crash reporting machinery, the rationale behind each rewrite, the goals we set for each component as well as the plan and progress information for each of them.
Client-side tools and components
exception handlers
minidump writers
crash monitor
crash reporter client
minidump-analyzer
Server-side tools and components
minidump_stackwalker
Status: completed
Developer(s): gankra, gsvelto
Source code: https://github.com/luser/rust-minidump/
Original source code:
- https://chromium.googlesource.com/breakpad/breakpad/
- https://hg.mozilla.org/mozilla-central/file/40bc01de5e10/toolkit/crashreporter/breakpad-patches
- https://github.com/mozilla-services/minidump-stackwalk
Description
Rationale
Plan
Results
dump_syms
Overview
Status: completed
Developer(s): calixte, gsvelto
Source code: https://github.com/mozilla/dump_syms
Original source code:
- https://chromium.googlesource.com/breakpad/breakpad/
- https://hg.mozilla.org/mozilla-central/file/40bc01de5e10/toolkit/crashreporter/breakpad-patches
Bugs:
- bug 1588538
- bug 1588534
- bug 1588739
- bug 1588740
Description
The dump_syms tool is used to extract symbol files (.sym) from binaries and libraries. It generates both symbols and stack unwinding information and stores them in the Breakpad symbol file format (https://chromium.googlesource.com/breakpad/breakpad/+/HEAD/docs/symbol_files.md).
We use this tool both to extract symbol files from Firefox builds and from system libraries across all supported platforms.
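To make the symbol file format more concrete, here is a minimal sketch of a reader for the MODULE, FILE, FUNC and PUBLIC records described in the Breakpad symbol file documentation. It is not code from dump_syms or Breakpad: the names parse_sym, Func, SymbolFile and SAMPLE are hypothetical, the sample records are made up for illustration, and all other record types (line records, STACK, INFO and so on) are simply skipped.

```python
from dataclasses import dataclass, field


@dataclass
class Func:
    address: int  # offset from the module's load address
    size: int     # size of the function in bytes
    name: str     # function name as stored in the file


@dataclass
class SymbolFile:
    module: tuple = ()                           # (os, arch, debug id, name)
    files: dict = field(default_factory=dict)    # file number -> source path
    funcs: list = field(default_factory=list)    # parsed FUNC records
    publics: list = field(default_factory=list)  # (address, name) pairs


def parse_sym(text):
    """Parse a subset of the records of a Breakpad symbol file."""
    sym = SymbolFile()
    for line in text.splitlines():
        fields = line.split(" ")
        kind, rest = fields[0], fields[1:]
        # FUNC and PUBLIC records may carry an optional "m" marker.
        if kind in ("FUNC", "PUBLIC") and rest and rest[0] == "m":
            rest = rest[1:]
        if kind == "MODULE" and len(rest) >= 4:
            # MODULE <os> <arch> <debug id> <name>
            sym.module = (rest[0], rest[1], rest[2], " ".join(rest[3:]))
        elif kind == "FILE" and len(rest) >= 2:
            # FILE <number> <path>; the number is decimal
            sym.files[int(rest[0])] = " ".join(rest[1:])
        elif kind == "FUNC" and len(rest) >= 4:
            # FUNC <address> <size> <parameter size> <name>; numbers are hex
            name = " ".join(rest[3:])
            sym.funcs.append(Func(int(rest[0], 16), int(rest[1], 16), name))
        elif kind == "PUBLIC" and len(rest) >= 3:
            # PUBLIC <address> <parameter size> <name>; numbers are hex
            sym.publics.append((int(rest[0], 16), " ".join(rest[2:])))
    return sym


# Made-up sample input, for illustration only.
SAMPLE = """\
MODULE Linux x86_64 6EDC6ACDB282125843FD59DA9C81BD830 libexample.so
FILE 0 /src/example.cpp
FUNC 1130 28 0 (anonymous namespace)::helper(int)
1130 10 12 0
PUBLIC 2160 0 main
"""

sym = parse_sym(SAMPLE)
```

Function names can contain spaces, which is why the sketch rejoins every field after the numeric ones instead of taking a single token.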
Rationale
The Breakpad-based tools suffer from a number of different issues:
- They lack support for recent additions to the native debugging formats, particularly DWARF5. Upstream isn't in a hurry to add them, so we had to roll our own changes, but they're incomplete.
- Each platform has its own tool and none of them can be cross-compiled, so we have three distinct implementations of dump_syms: one for Windows, one for Linux and one for macOS.
- The Windows implementation relies on Microsoft's closed-source DLLs from the DIA SDK to access PDB files. Besides making it impossible to run the tool on non-Windows platforms, this exposes us to bugs that we cannot fix.
- Function name demangling is platform-dependent, as such the same function yields different symbols on different platforms (e.g. the anonymous namespace being presented as (anonymous namespace) on Linux and as `anonymous namespace` on macOS).
- The Windows dump suffers from bugs in Microsoft's demangler implementation.
- We have to use ugly tricks to fix up certain symbols that are synthesized by LLVM and which the Microsoft demangler does not understand.
- The implementation is slow and consumes large amounts of memory. Dumping a debug build of libxul can take several minutes and consume over 4 GiB of RAM.
- The Linux implementation is incapable of dealing with compressed debug information.
Plan
The goals for this rewrite are the following:
- Consolidate all the tools into a single portable and retargetable executable.
- Leverage Rust's existing ecosystem of crates to read debug information instead of rolling our own.
- Significantly improve the performance and reduce the resource usage of this tool. This is especially important considering that dumping symbol files is in the critical path of all our builds on automation and takes an appreciable amount of time and resources.
To achieve these goals we would like to use a mix of Sentry's Symbolic (https://crates.io/crates/symbolic) Rust crates, to access debug information and to demangle the symbols, and crates that allow direct access to the debug information such as goblin (https://crates.io/crates/goblin) and pdb (https://crates.io/crates/pdb).
All these crates are well maintained, have responsive upstream communities and support more functionality than Breakpad. Additionally, they support Rust as a tier 1 language when it comes to handling and demangling symbols, which is a nice touch given the nature of our codebase.
Results
The new dump_syms tool has been rolled out across all of Mozilla's infrastructure and has been in use since the summer of 2020. It is significantly faster than the old tool (we've seen reductions of an order of magnitude in the time needed to dump libxul) and consumes an order of magnitude less memory. It has broad support for modern debug information, including parts that were reverse-engineered specifically for the new tool such as Apple compact unwinding information (https://gankra.github.io/blah/compact-unwinding/).
The symbols it emits are of higher quality than those of the old tool, are uniform across different platforms and have much better coverage. Additionally, the symbol files tend to be smaller thanks to significantly reduced redundancy in the output.
During the course of the project we contributed changes to the crates we used, and Sentry in particular accommodated a number of changes that we needed to implement the new tool.
fix-stacks
Overview
Status: completed
Developer(s): njn, glandium
Source code: https://github.com/mozilla/fix-stacks/
Original source code:
- https://hg.mozilla.org/mozilla-central/file/55f06c70f4e5/tools/rb/fix_linux_stack.py
- https://hg.mozilla.org/mozilla-central/file/55f06c70f4e5/tools/rb/fix_macosx_stack.py
- https://hg.mozilla.org/mozilla-central/file/55f06c70f4e5/tools/rb/fix_stack_using_bpsyms.py
Bugs:
- bug 1596292
Description
The fix-stacks tool looks for raw stack traces within the output of our test runs and replaces the raw memory addresses with function names so that the output is readable.
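The core idea can be sketched in a few lines of Python. This is an illustration only and not how fix-stacks itself is implemented: the frame syntax matched below (a placeholder name followed by "[module +0xoffset]"), the SYMBOLS table and the helpers lookup and fix_line are all assumptions made for the example, and a real tool would read the symbols from the module's debug information rather than from a hard-coded list.

```python
import bisect
import re

# Hypothetical symbol table for one module, sorted by start address:
# (start address, size in bytes, function name).
SYMBOLS = [
    (0x1000, 0x40, "init"),
    (0x1040, 0x100, "parse"),
    (0x2000, 0x20, "main"),
]
STARTS = [start for start, _size, _name in SYMBOLS]

# Assumed raw frame syntax: "<placeholder>[<module> +0x<offset>]".
FRAME_RE = re.compile(
    r"(?P<func>\S+)\[(?P<module>[^\]]+) \+0x(?P<offset>[0-9a-fA-F]+)\]"
)


def lookup(offset):
    """Return the name of the function containing offset, or None."""
    # Find the last symbol that starts at or before the offset.
    index = bisect.bisect_right(STARTS, offset) - 1
    if index < 0:
        return None
    start, size, name = SYMBOLS[index]
    # Reject offsets that fall in a gap after the end of that function.
    return name if offset < start + size else None


def fix_line(line):
    """Replace raw frames in a line of test output with function names."""
    def replace(match):
        name = lookup(int(match.group("offset"), 16))
        if name is None:
            return match.group(0)  # leave unresolved frames untouched
        return f"{name} [{match.group('module')} +0x{match.group('offset')}]"

    return FRAME_RE.sub(replace, line)


fixed = fix_line("#01: ???[libexample.so +0x1044]")
```

Lines that contain no frame, and frames whose offset matches no known function, pass through unchanged, so the function can be applied to every line of a test log.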
Rationale
The legacy implementation of fix-stacks is split into three different Python scripts, each one being platform-dependent. The Linux and macOS scripts rely on calling platform-specific tools such as addr2line or otool. These tools are called several times and take a significant amount of time to process large debug information (such as that produced by a debug build of libxul). The macOS version is so slow that it's disabled by default in certain tasks because it would cause the tasks to time out. The version relying on Breakpad symbols is platform-independent but requires an additional step (generating the symbols) and consumes enormous amounts of memory (see bug 1493365). We don't have a version that uses native debug information on Windows.
Plan
The goals for this rewrite are the following:
- Consolidate all the scripts into a single platform-agnostic executable
- Use native debug information so we don't need an extra processing step
- Significantly improve the performance and reduce the resource usage of this tool given it affects the runtime of tests both on automation and locally
To achieve these goals we would like to use Sentry's Symbolic (https://crates.io/crates/symbolic) Rust crates. These crates provide a platform-agnostic interface to read debug information, which makes them a perfect fit for our use case.
Results
The project was deemed complete in April 2020, with the old scripts removed and the new tool used across all tasks and all platforms. The resulting tool is significantly smaller than the original scripts, provides better output and is anywhere from 2x to 100x (!) faster than the scripts while using less memory. The performance improvements shorten the execution of tasks with failures both on the try server and locally, and enabled us to have stack-fixing in tasks that previously couldn't afford it.
njn wrote a detailed blog post [2] describing his approach and results.