Crash reporting improvements

From MozillaWiki
Revision as of 09:45, 20 April 2022 by Gsvelto (talk | contribs) (Added another use-case for the instruction disassembly project)

Introduction

This page lists the various improvements that we want to introduce after having finished overhauling the existing crash reporting machinery (see the Crash reporting overhaul page for more information). Many of the tasks described here were features that had been requested years ago but could not be implemented in a reasonable amount of time using the old Breakpad-based tooling.

List of projects

Minidump storage for crash annotations

Status: not started
Developer(s):
Source code:
Original source code:

Bugs:

Description

Crash annotations are pieces of information that accompany a minidump to form a complete crash report. They contain critical information such as the Firefox version and build ID, but also ancillary information such as how much memory a process was using, or a user-provided string associated with a failed assertion that crashed the process.

Currently, crash annotations are stored in a JSON file (with an .extra suffix) that is sent along with the minidump to Socorro. Depending on the type of crash, this file is either written out by the exception handler (if the main process crashed), or the annotations are forwarded to the main process, which then writes them out (in the case of a child process crash).
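To make the current two-file layout concrete, here is a minimal sketch of a minidump with its sibling .extra file. The function names are hypothetical and the annotation keys used with them are only illustrative; this is not the actual Gecko or Socorro code. It also shows the check that any consumer has to perform because the report is split across two files.

```python
import json
from pathlib import Path


def write_extra_file(minidump_path: Path, annotations: dict) -> Path:
    """Write the annotations as JSON next to the minidump, with an .extra suffix."""
    extra_path = minidump_path.with_suffix(".extra")
    extra_path.write_text(json.dumps(annotations), encoding="utf-8")
    return extra_path


def load_crash_report(minidump_path: Path) -> dict:
    """Load both halves of a crash report, failing if either file is missing."""
    extra_path = minidump_path.with_suffix(".extra")
    if not minidump_path.exists() or not extra_path.exists():
        raise FileNotFoundError("incomplete crash report: need both files")
    return {
        "minidump": minidump_path.read_bytes(),
        "annotations": json.loads(extra_path.read_text(encoding="utf-8")),
    }
```

A report is only usable when both files are present, which is the kind of extra failure mode that a single-file format would avoid.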

Rationale

There are several issues with the current system:

  • Having a separate file adds significant complexity both when submitting and processing crash reports, and also additional failure modes (like only one of the files being present in the report)
  • The file needs to be written out after the minidump has been written out, adding complexity to the exception handler
  • For child processes an extra IPC channel is needed to send the annotations
  • Setting annotations is a relatively expensive process
  • Some annotations are synthesized at crash time and dealt with by ad-hoc code; there is no unified mechanism to handle them together with the others

Given the above, storing the annotations within the minidump would simplify the crash reporting flow, eliminate an additional IPC channel and make it much cheaper and simpler for user code to set annotations.

Plan

Annotations should be stored within the minidump and read directly from the crashed process. This requires several steps:

  • The crash annotations interface in Gecko needs to be modified so that a process can flag where its annotations are stored
  • The crash-time annotations need to be removed and replaced with regular ones
  • We need to add a mechanism to distinguish between a process' own annotations and global ones that must be included in every crash
  • Minidump writers need to be modified to identify where the annotations are stored in a process' memory, read them and write them out within the minidump
  • Finally, the stackwalker tool needs to be taught to look for the annotations in the minidump and print them out
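The steps above can be illustrated with a small sketch. The length-prefixed layout, the function names and the rule that per-process annotations override global ones are all assumptions made for the example; the actual in-memory representation and minidump stream format have not been designed yet. The encoder stands in for what a minidump writer would emit after reading the annotations from the crashed process, and the decoder for what the stackwalker would do.

```python
import struct


def encode_annotations(annotations: dict) -> bytes:
    """Serialize annotations as a count followed by length-prefixed UTF-8 strings."""
    out = bytearray(struct.pack("<I", len(annotations)))
    for key, value in annotations.items():
        for text in (key, value):
            raw = text.encode("utf-8")
            out += struct.pack("<I", len(raw)) + raw
    return bytes(out)


def _read_string(blob: bytes, offset: int):
    """Read one length-prefixed string, rejecting truncated input."""
    if offset + 4 > len(blob):
        raise ValueError("truncated length prefix")
    (length,) = struct.unpack_from("<I", blob, offset)
    start, end = offset + 4, offset + 4 + length
    if end > len(blob):
        raise ValueError("truncated string payload")
    return blob[start:end].decode("utf-8"), end


def decode_annotations(blob: bytes) -> dict:
    """Parse the hypothetical annotation stream back into a dictionary."""
    if len(blob) < 4:
        raise ValueError("truncated annotation count")
    (count,) = struct.unpack_from("<I", blob, 0)
    offset = 4
    result = {}
    for _ in range(count):
        key, offset = _read_string(blob, offset)
        value, offset = _read_string(blob, offset)
        result[key] = value
    return result


def merge_annotations(global_annotations: dict, process_annotations: dict) -> dict:
    """Combine global and per-process annotations (process values win: an assumption)."""
    merged = dict(global_annotations)
    merged.update(process_annotations)
    return merged
```

Because the writer only copies bytes out of the crashed process, the crashing process itself does no work at crash time in this model.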

Additionally, some changes will be required to Socorro on the ingestion side. Socorro currently relies on the .extra file contents for filtering. For example, annotations containing the product version are used to decide whether a crash is coming from a version of Firefox that is very old and thus should be dropped. If we store the annotations within the minidump, we need to provide a way for Socorro to extract them without processing the full minidump, so that it can still apply its filtering rules. To this end, we need to write a streamlined minidump pre-processor that only extracts this information and provides it in JSON format. This might prove useful for other types of filtering we don't currently do (such as rejecting reports caused by hardware faults or unconditionally accepting those that might indicate security-sensitive issues). The rust-minidump crate provides all the necessary functionality to write this tool.
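As an illustration of the kind of filtering the pre-processor output would enable, here is a minimal sketch that takes annotations already extracted from a minidump and emits a JSON verdict. The function names, the output fields, the "Version" key and the decision to reject reports with a malformed version are assumptions made for the example, not Socorro's actual rules.

```python
import json
from typing import Optional


def major_version(version: str) -> Optional[int]:
    """Return the leading major version number, or None if it cannot be parsed."""
    head = version.split(".", 1)[0]
    if head.isascii() and head.isdigit():
        return int(head)
    return None


def preprocess_report(annotations: dict, oldest_accepted_major: int) -> str:
    """Produce a JSON document with the annotations and an accept/reject verdict."""
    major = major_version(annotations.get("Version", ""))
    if major is None:
        accept, reason = False, "missing or malformed version"
    elif major < oldest_accepted_major:
        accept, reason = False, "version too old"
    else:
        accept, reason = True, "ok"
    return json.dumps(
        {"annotations": annotations, "accept": accept, "reason": reason},
        sort_keys=True,
    )
```

The same verdict structure could later carry other decisions, such as rejecting likely hardware faults, without Socorro having to process the full minidump first.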

Telemetry-based dashboards

Overview

Status: not started
Developer(s):
Source code:
Original source code:

Description

Rationale

Plan

Disassembly in the stack walker

Overview

Status: not started
Developer(s):
Source code: https://github.com/rust-minidump/rust-minidump
Original source code: N/A

Description

Critical information is sometimes missing when inspecting a crash. For example, crashes hitting non-canonical addresses on x86-64 don't have the real crashing address but a useless placeholder instead (see bug 1493342). In other cases, such as when executing an illegal instruction, we only have the address of the instruction but no idea what it was.

Rationale

To fill in the missing information in crash reports it would be useful to disassemble the crashing instruction and be able to inspect it:

  • For non-canonical addresses we could reconstruct the real crashing address from the registers and immediate values in the instructions
  • For invalid instructions we could tell if the instruction is valid but unsupported, or downright invalid (in the case of a bit-flip or corrupted executable for example)
  • For privileged or unsupported instructions we'd be able to tell if it's our fault or if the machine configuration is not adequate
  • For null pointer accesses we'd be able to remove the fixed offset often applied to the pointer and make the crash more obvious (or tell it apart from bit-flips in the lower bits)
  • Hardware bugs often result in impossible crashes where the crash reason simply could not have been triggered by the faulting instruction. For example the crash reason is an invalid access but the faulting instruction is a branch, or an arithmetic operation that does not access memory. With the disassembled instruction in hand we could detect those cases and flag the crash report as suspicious.
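The first and fourth points can be sketched in a few lines. The example assumes 48-bit virtual addresses (the common x86-64 configuration; systems with 5-level paging use 57 bits) and assumes the disassembler has already provided the base register value, index register value, scale and displacement of the memory operand. The function names are hypothetical.

```python
MASK_64 = (1 << 64) - 1


def effective_address(base: int, index: int, scale: int, disp: int) -> int:
    """Compute base + index * scale + disp, wrapping to 64 bits like the CPU does."""
    return (base + index * scale + disp) & MASK_64


def is_canonical(addr: int, virtual_bits: int = 48) -> bool:
    """An address is canonical if its upper bits all equal the top virtual address bit."""
    upper = addr >> (virtual_bits - 1)
    all_ones = (1 << (64 - virtual_bits + 1)) - 1
    return upper in (0, all_ones)


def null_deref_offset(base: int, index: int, scale: int, disp: int):
    """Return the fixed offset if the access is null plus a displacement, else None."""
    if base == 0 and index * scale == 0:
        return disp
    return None
```

For an instruction whose base register holds 0x4141414141414141 and whose displacement is 8, effective_address gives 0x4141414141414149, which is_canonical reports as non-canonical; that reconstructed value is the real crashing address that the crash report lacks.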

Plan

We could integrate a disassembler such as iced into the rust-minidump stackwalker and use it to disassemble the crashing instruction. Our minidump writers usually include the memory area around the crashing instruction, so we could also disassemble the entire area. We'd then use the raw result to verify and adjust the crashing address, and add human-readable disassembly to the JSON output so that it can be surfaced.
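A minimal sketch of the last step, assuming the disassembler has already produced a small summary of the crashing instruction. The output field names ("crashing_instruction", "possible_hardware_fault"), the shape of the instruction summary and the set of crash reasons treated as memory accesses are all assumptions made for the example; they are not part of the existing stackwalker output.

```python
import json

# Crash reasons that imply the faulting instruction touched memory
# (illustrative set, not exhaustive).
MEMORY_ACCESS_REASONS = {
    "EXCEPTION_ACCESS_VIOLATION_READ",
    "EXCEPTION_ACCESS_VIOLATION_WRITE",
}


def annotate_crash_info(crash_info: dict, instruction: dict) -> dict:
    """Add the disassembly and a suspicion flag to a copy of the crash info."""
    suspicious = (
        crash_info.get("type") in MEMORY_ACCESS_REASONS
        and not instruction["accesses_memory"]
    )
    annotated = dict(crash_info)
    annotated["crashing_instruction"] = instruction["text"]
    annotated["possible_hardware_fault"] = suspicious
    return annotated


def to_json(crash_info: dict, instruction: dict) -> str:
    """Render the annotated crash info as JSON for downstream consumers."""
    return json.dumps(annotate_crash_info(crash_info, instruction), sort_keys=True)
```

An access violation reported on an instruction such as "add eax, ebx", which touches no memory, would be flagged here, while the same reason on "mov rax, [rbx+8]" would not.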

Support inlined functions in crash stacks

Overview

Status: not started
Developer(s):
Source code:
Original source code: N/A
Bugs:

Description

Rationale

Plan