Crash reporting overhaul

From MozillaWiki
Revision as of 11:41, 15 March 2022 by Gsvelto (talk | contribs) (Added information about the minidump storage for crash annotations project)

Introduction

This page describes the components involved in the rewrite of our crash reporting machinery, the rationale behind each rewrite, the goals we set for each component, and the plan and progress for each of them.

Client-side tools and components

Exception handlers

Minidump writers

Crash monitor

Minidump storage for crash annotations

Status: not started
Developer(s):
Source code:
Original source code:

Bugs:

Description

Crash annotations are pieces of information that accompany a minidump to form a complete crash report. They contain critical information such as the Firefox version and build ID, but also ancillary information such as how much memory a process was using, or a user-provided string associated with a failed assertion that crashed the process.

Currently crash annotations are stored in a JSON file (with an .extra suffix) that is sent along with the minidump to Socorro. Depending on the type of crash this file is either written out by the exception handler (if the main process crashed) or the contents of the annotations are forwarded to the main process which then writes them out (in the case of a child process crash).

Rationale

There are several issues with the current system:

  • Having a separate file adds significant complexity both when submitting and processing crash reports, and also additional failure modes (like only one of the files being present in the report)
  • The file needs to be written out after the minidump has been written out, adding complexity to the exception handler
  • For child processes an extra IPC channel is needed to send the annotations
  • Setting annotations is a relatively expensive process
  • Some annotations are synthesized at crash time and dealt with by ad-hoc code; there is no unified mechanism to handle them together with the others

Plan

Annotations should be stored within the minidump and read directly from the crashed process. This requires several steps:

  • The crash annotations interface in Gecko needs to be modified so that a process can flag where its annotations are stored
  • The crash-time annotations need to be removed and replaced with regular ones
  • We need to add a mechanism to distinguish between a process's own annotations and global ones that must be included in every crash
  • Minidump writers need to be modified to identify where the annotations are stored in a process's memory, read them and write them out within the minidump
  • Finally, the stackwalker tool needs to be taught to look for the annotations in the minidump and print them out
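
As an illustration of the last two steps, here is a minimal sketch of how annotations could be written into a stream by a minidump writer and read back by the stackwalker. The layout (a record count followed by length-prefixed key/value pairs) and the function names are assumptions made for this example only; no format has been decided yet.

```rust
use std::collections::BTreeMap;

/// Appends a length-prefixed UTF-8 string (u32 little-endian length, then bytes).
fn write_string(out: &mut Vec<u8>, text: &str) {
    out.extend_from_slice(&(text.len() as u32).to_le_bytes());
    out.extend_from_slice(text.as_bytes());
}

/// Encodes annotations using a hypothetical stream layout:
/// a u32 record count followed by key/value pairs of length-prefixed strings.
pub fn encode_annotations(annotations: &BTreeMap<String, String>) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(annotations.len() as u32).to_le_bytes());
    for (key, value) in annotations {
        write_string(&mut out, key);
        write_string(&mut out, value);
    }
    out
}

/// Reads a little-endian u32 at `*pos` and advances the cursor.
fn read_u32(bytes: &[u8], pos: &mut usize) -> Option<u32> {
    let end = (*pos).checked_add(4)?;
    let s = bytes.get(*pos..end)?;
    *pos = end;
    Some(u32::from_le_bytes([s[0], s[1], s[2], s[3]]))
}

/// Reads a length-prefixed UTF-8 string at `*pos` and advances the cursor.
fn read_string(bytes: &[u8], pos: &mut usize) -> Option<String> {
    let len = read_u32(bytes, pos)? as usize;
    let end = (*pos).checked_add(len)?;
    let text = std::str::from_utf8(bytes.get(*pos..end)?).ok()?;
    *pos = end;
    Some(text.to_owned())
}

/// Decodes a stream produced by `encode_annotations`.
/// Returns None on truncated or malformed input instead of panicking.
pub fn decode_annotations(bytes: &[u8]) -> Option<BTreeMap<String, String>> {
    let mut pos = 0;
    let count = read_u32(bytes, &mut pos)?;
    let mut annotations = BTreeMap::new();
    for _ in 0..count {
        let key = read_string(bytes, &mut pos)?;
        let value = read_string(bytes, &mut pos)?;
        annotations.insert(key, value);
    }
    Some(annotations)
}
```

The decoder validates every length against the remaining input, since a stream read from a crashed process cannot be trusted to be well-formed.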

Crash reporter client

Status: not started
Developer(s):
Source code:
Original source code:

Bugs:

Description

The crash reporter client is the tool we use to submit crash reports when the browser crashes entirely. Its role is to gather the minidump and the crash annotations file, add missing annotations, send a crash ping and prompt the user to submit the crash (possibly with a comment). Once the user interacts with it, the crash reporter client submits the crash, records its submission and restarts Firefox.

Rationale

The crash reporter client is a particularly rigid piece of code due to its platform-specific nature, the fact that it cannot use any of the libraries we use in Firefox and that it's usually launched from within an exception handler. Because of the above it suffers from a multitude of issues:

  • We have platform-specific code for the UI, file management and network operations, making maintenance a nightmare
  • This is mostly platform-specific C/C++ code, but macOS also has some Objective-C thrown into the mix
  • The UI of the macOS version cannot be changed because it uses a binary description generated from a tool that was obsoleted ages ago (the last version of it ran on PowerPC macs only)
  • The Windows version has poor handling of paths with non-ASCII characters in them
  • The Windows version has poor high-DPI screen support
  • There are no tests covering its functionality
  • It relies heavily on environment variables to communicate with Firefox; this causes problems on Linux (see bug 1752703) and makes it hard to run manually for testing
  • The entire processing flow is synchronous and blocks the UI at every step (parsing the annotations, creating the local stack trace, submitting it, etc.)
  • Localization is done via an INI file and cannot use Fluent
  • Given the hard-coded nature of the UI, we don't know how it behaves with RTL languages; probably very poorly

Plan

We should rewrite the client in Rust leveraging the standard library and common crates as much as possible to remove the platform-specific code:

  • The UI still needs to be done using platform-specific code, but we have a chance to modernize it, especially on Windows and macOS
  • We must use Fluent for localization and ensure proper support of RTL languages
  • Networking code poses a problem. Ideally we'd like to use reqwest but it's not vendored in mozilla-central yet. It might still require platform-specific code though we hope to avoid it
  • We should leverage Rust's asynchronous facility to make processing non-blocking and the UI responsive
  • We should add platform-independent tests for the parts of the code that do not require UI interaction
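
To illustrate the non-blocking processing flow, here is a minimal sketch that moves the slow steps onto a worker thread and reports progress over a channel. It deliberately uses standard library threads and channels rather than async/await, so that it runs without an async runtime; the `Progress` steps, the `start_submission` function and the placeholder crash ID are hypothetical and not part of any existing code.

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// Progress messages sent from the worker to the UI thread.
/// The steps mirror the ones listed in the rationale; the names are hypothetical.
#[derive(Debug, PartialEq)]
pub enum Progress {
    AnnotationsParsed,
    StackTraceReady,
    Submitted { crash_id: String },
}

/// Starts processing on a worker thread and returns immediately,
/// leaving the calling (UI) thread free to keep handling events.
pub fn start_submission(minidump_name: String) -> Receiver<Progress> {
    let (sender, receiver) = mpsc::channel();
    thread::spawn(move || {
        // Each message stands in for a slow step (parsing, stack walking, upload).
        let steps = vec![
            Progress::AnnotationsParsed,
            Progress::StackTraceReady,
            Progress::Submitted {
                crash_id: format!("local-{}", minidump_name),
            },
        ];
        for step in steps {
            // Sending fails only if the receiver was dropped, e.g. the UI closed.
            if sender.send(step).is_err() {
                break;
            }
        }
    });
    receiver
}
```

A UI event loop would poll the returned receiver with `try_recv()` between events and update the window as messages arrive. Since none of this code touches the UI, it can be covered by platform-independent tests.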

minidump-analyzer

Status: not started
Developer(s):
Source code: https://github.com/luser/rust-minidump/
Original source code:

Bugs:

Description

The minidump-analyzer tool is similar to minidump_stackwalker in that it processes minidumps and emits stack traces in Socorro-compatible JSON format. The main differences between the tools are that minidump-analyzer runs on client machines rather than on our servers, it uses native debug information where possible to unwind stacks instead of Breakpad symbol files, and it doesn't symbolicate its output. minidump-analyzer is run on a client machine for every crash that generated a valid minidump, and its output is used to populate the contents of the crash ping.

Rationale

The minidump-analyzer suffers from many of the same problems as minidump_stackwalker with regard to stability, maintenance burden and lack of automated testing. It is yet another Breakpad-based stack walker that produces slightly different results than the others. Additionally, support for using native unwinding information was never fully implemented. We only ever implemented support for the Windows x86-64 platform, and we would have to implement support for other platforms from scratch in order to make it fully functional.

We'd like to replace this tool with one that reuses the same code as Socorro's minidump_stackwalker in order to reduce maintenance and keep results consistent between Socorro and crash telemetry. Additionally, we'd like to add support for all missing platforms (Linux and macOS, as well as Windows/AArch64).

Plan

We plan on reusing the stackwalker developed as part of the minidump_stackwalker project, using Sentry's symbolic crate to parse the native debug information. This will require a few changes to the rust-minidump crate:

  • We need to add support for fetching unwinding directives from native debug information via the symbolic crate
  • We need to add machinery to find the appropriate files on the client machine instead of fetching them from a symbol server
  • We don't need to wire up the symbolicator to the native debug information as the stack traces we emit are raw and will be symbolicated later
  • We'll have to vendor rust-minidump and its dependencies into mozilla-central
  • We'll have to build and package the tool like we do with the existing one
  • Finally we should remove Breakpad's processor sources from the build as they won't be needed anymore

Sentry is already working on integrating symbolic with rust-minidump, so we're currently waiting for that work to land. This might require very little work on our side in the end.

Server-side tools and components

minidump_stackwalker

Status: completed
Developer(s): gankra, gsvelto
Source code: https://github.com/luser/rust-minidump/
Original source code:

Description

The Socorro service we use for ingesting and processing crashes relies on a Breakpad-based stackwalker originally written by Ted Mielczarek to extract stack traces from minidumps. This tool takes a minidump as input, fetches the appropriate symbol files from our symbol-server and ultimately emits the stack traces and additional information in JSON format.

Rationale

Like our other server-side tools, the stackwalker is based on a forked version of Breakpad that differs from the one we have in mozilla-central. This has caused stack traces seen on developers' machines or on try to diverge from those on Socorro. The extra work needed to keep the two in sync manually slows down development, and Breakpad's slow-moving upstream doesn't help. The tool is also not fully robust in the face of malformed or unexpected inputs. We've often spent time tracking issues that showed up on Socorro months after the fact because we had no useful output to work on, nor was it easy to tell whether the failure was happening within the tool or was caused by the input. Last but not least, we don't have proper automated tests for this tool and rely on manual testing of every release, which is time-consuming and exposes us to regressions.

Plan

We plan on rewriting the stackwalker tool by extending the rust-minidump crate. Several changes will be needed to the crate including:

  • Bringing the minidump layout structures up-to-date
  • Implementing a CFI & FP-based stackwalker for the x86, x86-64, ARM and AArch64 architectures
  • Teaching the stackwalker to talk to a symbol server to fetch the required symbol files
  • Implementing parsers of unsupported minidump streams
  • Updating all the human-readable mappings of various values and constants to bring them on par with Breakpad
  • Teaching the stackwalker to read crash annotations from the Mozilla-specific .extra file and output Socorro-compatible JSON
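
To give an idea of what the frame-pointer (FP) half of the stackwalker involves, here is a minimal sketch of FP-based unwinding for x86-64, where each frame stores the caller's frame pointer at `[rbp]` and the return address at `[rbp + 8]`. The `StackMemory` type and the `walk_frame_pointers` function are simplified stand-ins written for this example; they are not the rust-minidump API, and a real stackwalker would try CFI first and validate every recovered frame.

```rust
/// A simplified snapshot of the stack memory captured in a minidump.
pub struct StackMemory {
    /// Address in the crashed process that corresponds to `bytes[0]`.
    pub base: u64,
    pub bytes: Vec<u8>,
}

impl StackMemory {
    /// Reads a little-endian u64 at `address`, or None if it is out of range.
    pub fn read_u64(&self, address: u64) -> Option<u64> {
        let offset = address.checked_sub(self.base)? as usize;
        let end = offset.checked_add(8)?;
        let slice = self.bytes.get(offset..end)?;
        let mut raw = [0u8; 8];
        raw.copy_from_slice(slice);
        Some(u64::from_le_bytes(raw))
    }
}

/// Follows the frame-pointer chain starting from the crashing context.
/// Returns the instruction address of each frame, innermost first.
/// The result always contains `pc` and stops growing at `max_frames` entries.
pub fn walk_frame_pointers(stack: &StackMemory, pc: u64, fp: u64, max_frames: usize) -> Vec<u64> {
    let mut frames = vec![pc];
    let mut fp = fp;
    while frames.len() < max_frames {
        let caller_fp = stack.read_u64(fp);
        let return_address = fp.checked_add(8).and_then(|a| stack.read_u64(a));
        let (caller_fp, return_address) = match (caller_fp, return_address) {
            (Some(f), Some(r)) => (f, r),
            // The frame pointer leads outside the captured stack: stop walking.
            _ => break,
        };
        if return_address == 0 {
            break;
        }
        frames.push(return_address);
        // The stack grows downwards, so callers must live at higher addresses.
        // Anything else means the chain is finished or corrupted.
        if caller_fp <= fp {
            break;
        }
        fp = caller_fp;
    }
    frames
}
```

The bounds checks and the monotonicity check are what keep the walker from panicking or looping forever on a corrupted stack, which is the kind of robustness the old tool lacked.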

Results

The new stackwalker tool was deployed on Socorro in mid-December 2021 and was described by many as the perfect deployment:

  • The new tool proved to be twice as fast as the old one while consuming less memory
  • We had no regressions save for bug 1757890, which was caught later on
  • The new tool is covered by an extensive test suite
  • Results from the new tool were better than the old one
    • Problematic minidumps that crashed the old tool were now being handled correctly
    • Stack traces were better overall, being much better on macOS (more on this later)
    • Issues were now logged in detail in the debug output which is also accessible on Socorro making solving issues much simpler

In addition to replacing the old tool the new one brought along a very useful new feature: support for Apple compact unwinding format which :gankra reverse-engineered from LLVM's sources. This turned the quality of our macOS stack traces from mediocre to exact overnight.

Last but not least the project has been picked up by Sentry for use in their software as a replacement for their own Breakpad-based stackwalker. Sentry developers have been contributing changes to the crate at a steady pace and will likely take over its maintenance.

dump_syms

Overview

Status: completed
Developer(s): calixte, gsvelto
Source code: https://github.com/mozilla/dump_syms
Original source code:

Bugs:

Description

The dump_syms tool is used to extract symbol files (.sym) from binaries and libraries. It generates both symbols and stack unwinding information and stores them in the Breakpad symbol file format [1].
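
As an example of what a symbol file contains, a function is described by a `FUNC` record of the form `FUNC [m] address size parameter_size name`, where the numeric fields are hexadecimal and the name runs to the end of the line. The following sketch parses such a record; the `FuncRecord` type and `parse_func_record` function are written for this example and are not taken from dump_syms.

```rust
/// One FUNC record from a Breakpad symbol file.
#[derive(Debug, PartialEq)]
pub struct FuncRecord {
    /// Start address of the function, relative to the module's load address.
    pub address: u64,
    /// Size of the function in bytes.
    pub size: u64,
    /// Size of the parameters pushed on the stack, in bytes.
    pub parameter_size: u64,
    /// Function name; it may contain spaces.
    pub name: String,
}

/// Parses a line such as "FUNC 1130 28 0 main".
/// Returns None for other record types or malformed lines.
pub fn parse_func_record(line: &str) -> Option<FuncRecord> {
    let rest = line.trim_end().strip_prefix("FUNC ")?;
    // An optional "m" flag marks a function that is known under multiple names.
    let rest = rest.strip_prefix("m ").unwrap_or(rest);
    // Split off the three numeric fields; whatever remains is the name.
    let mut fields = rest.splitn(4, ' ');
    let address = u64::from_str_radix(fields.next()?, 16).ok()?;
    let size = u64::from_str_radix(fields.next()?, 16).ok()?;
    let parameter_size = u64::from_str_radix(fields.next()?, 16).ok()?;
    let name = fields.next()?.to_owned();
    Some(FuncRecord {
        address,
        size,
        parameter_size,
        name,
    })
}
```

Limiting the split to four fields keeps demangled names such as `foo(int, char const*)` intact.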

We use this tool both to extract symbol files from Firefox builds and from system libraries across all supported platforms.

Rationale

The Breakpad-based tools suffer from a number of different issues:

  • They lack support for recent additions to the native debugging formats, particularly DWARF5. Upstream isn't in a hurry to add them, so we had to roll our own changes, but they're incomplete.
  • Each platform has its own tool and each tool cannot be cross-compiled. So we have three distinct implementations of dump_syms: one for Windows, one for Linux and one for macOS.
  • The Windows implementation relies on Microsoft's closed-source DLLs from the DIA SDK to access PDB files. Besides making it impossible to run the tool under non-Windows platforms this exposes us to bugs that we cannot fix.
  • Function name demangling is platform-dependent, as such the same function yields different symbols on different platforms (e.g. the anonymous namespace being presented as (anonymous namespace) on Linux and as `anonymous namespace` on macOS).
  • The Windows dump suffers from bugs in Microsoft's demangler implementation.
  • We have to use ugly tricks to fix up certain symbols that are synthesized by LLVM and which the Microsoft demangler does not understand.
  • The implementation is slow and consumes large amounts of memory. Dumping a debug build of libxul can take several minutes and consume over 4 GiB of RAM.
  • The Linux implementation is incapable of dealing with compressed debug information.

Plan

The goals for this rewrite are the following:

  • Consolidate all the tools into a single portable and retargetable executable
  • Leverage Rust's existing ecosystem of crates to read debug information instead of rolling our own.
  • Significantly improve the performance and reduce the resource usage of this tool. This is especially important considering that dumping symbol files is in the critical path of all our builds on automation and takes an appreciable amount of time and resources.

To achieve this goal we would like to use a mix of Sentry's Symbolic Rust crates - to access debug information and to demangle the symbols - and crates that allow direct access to the debug information such as goblin and pdb.

All these crates are well maintained, have responsive upstream communities and support more functionality than Breakpad. Additionally, they support Rust as a tier 1 language when it comes to handling and demangling symbols, which is a nice touch given the nature of our codebase.

Results

The new dump_syms tool has been rolled out across all of Mozilla infrastructure and has been in use since the summer of 2020. It is significantly faster than the old tool (we've seen reductions of an order of magnitude in the time needed to dump libxul) and consumes an order of magnitude less memory. It has broad support for modern debug information (including parts that were reverse-engineered specifically for the new tool such as Apple compact unwinding information).

The symbols it emits are of higher quality than those of the old tool, are uniform across different platforms and have much better coverage. Additionally, the symbol files tend to be smaller thanks to significantly reduced redundancy in the output.

During the course of the project we contributed changes to the crates we used, and Sentry in particular accommodated a number of changes that we needed to implement the new tool.

fix-stacks

Overview

Status: completed
Developer(s): njn, glandium
Source code: https://github.com/mozilla/fix-stacks/
Original source code:

Bugs:

Description

The fix-stacks tool looks for raw stack traces within the output of our test runs and replaces the raw memory addresses with function names so that the output is readable.
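
The core of this operation is mapping an address to the function that contains it. Here is a minimal sketch, assuming a symbol table sorted by start address and a frame format in which an unknown function appears as `???` followed by `[file +0xOFFSET]`. The types, the function names and the frame format are assumptions made for this example; the real tool reads native debug information and also emits file and line information.

```rust
/// A function known to the symbol table (hypothetical, simplified).
pub struct Symbol {
    pub start: u64,
    pub size: u64,
    pub name: String,
}

/// Symbols sorted by start address so lookups can use binary search.
pub struct SymbolTable {
    symbols: Vec<Symbol>,
}

impl SymbolTable {
    pub fn new(mut symbols: Vec<Symbol>) -> Self {
        symbols.sort_by_key(|s| s.start);
        SymbolTable { symbols }
    }

    /// Returns the name of the function containing `address`, if any.
    pub fn lookup(&self, address: u64) -> Option<&str> {
        // Index of the first symbol that starts after `address`.
        let index = self.symbols.partition_point(|s| s.start <= address);
        let candidate = self.symbols.get(index.checked_sub(1)?)?;
        if address - candidate.start < candidate.size {
            Some(candidate.name.as_str())
        } else {
            None
        }
    }
}

/// Tries to replace "???" in a frame like "#01: ???[libfoo.so +0x2010]".
fn try_fix_frame(line: &str, table: &SymbolTable) -> Option<String> {
    let open = line.find("???[")?;
    let plus = open + line[open..].find(" +0x")?;
    let close = plus + line[plus..].find(']')?;
    let offset = u64::from_str_radix(&line[plus + 4..close], 16).ok()?;
    let name = table.lookup(offset)?;
    Some(format!("{}{}{}", &line[..open], name, &line[open + 3..]))
}

/// Rewrites a stack frame line if it can be resolved; other lines pass through unchanged.
pub fn fix_frame(line: &str, table: &SymbolTable) -> String {
    try_fix_frame(line, table).unwrap_or_else(|| line.to_owned())
}
```

Because lines that don't match are returned untouched, the tool can be used as a filter over the full test output.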

Rationale

The legacy implementation of fix-stacks is split into three different Python scripts, each of which is platform-dependent. The Linux and macOS scripts rely on calling platform-specific tools such as addr2line or otool. These tools are called several times and take a significant amount of time to process large debug information (such as that produced by a debug build of libxul). The macOS version is so slow that it's disabled by default in certain tasks because it would cause the tasks to time out. The version relying on Breakpad symbols is platform-independent but requires an additional step (generating the symbols) and consumes enormous amounts of memory (see bug 1493365). We don't have a version that uses native debug information on Windows.

Plan

The goals for this rewrite are the following:

  • Consolidate all the scripts into a single platform-agnostic executable
  • Use native debug information so we don't need an extra processing step
  • Significantly improve the performance and reduce the resource usage of this tool given it affects the runtime of tests both on automation and locally

To achieve this goal we would like to use Sentry's Symbolic Rust crates (https://crates.io/crates/symbolic). These crates provide a platform-agnostic interface to read debug information thus being a perfect fit for our use-case.

Results

The project was deemed complete in April 2020, with the old scripts removed and the new tool used across all tasks and all platforms. The resulting tool is significantly smaller than the original scripts, provides better output, and is anywhere from 2x to 100x (!) faster than the scripts while using less memory. The performance improvements shorten the execution of tasks with failures both on the try server and locally, and enabled us to have stack-fixing in tasks that previously couldn't afford it.

njn wrote a detailed blog post [2] describing his approach and results.

Telemetry-based dashboards

Overview

Status: not started
Developer(s):
Source code:
Original source code:

Description

Rationale

Plan