User:KaiRo/Filing Crash Bugs

From MozillaWiki

This is a DRAFT for an article that might go up on MozillaWiki or MDC, wherever it fits better.

If you file a bug for a crash you have encountered or found via crash stats, there are a few rules you can follow if you want developers to be as efficient as possible in actually debugging the problem and finding a fix. (If you are unfamiliar with how we use Bugzilla in general, there's a helpful video to watch.)

  • For one thing, use the "Bugzilla - Report this bug in ..." links in a representative crash report on crash-stats (under the table of crash information on the "Details" tab). If you are looking at a list of crashes, e.g. a Signature Summary, ideally look through a few reports and select a representative one that has a good stack and as few extensions installed as possible.
    • The best way to tell whether a stack is "good" is to look at a few of them. Stacks that contain random hexadecimal addresses or that end after just a few lines (frames) are somewhat "bad". If you have multiple crash reports available, try to find one where the stack contains no such random addresses and goes down for more than 10-15 frames, if possible. That helps developers follow the whole trace of functions that were called to get to the place that finally crashed.
  • Select a fitting product and component for what you think is the cause of the crash.
    • If you don't have good knowledge of our code and Bugzilla hierarchies, it can be helpful to look into bugs connected with recent changes to the files linked in the significant top frames of the stack. This takes a number of clicks: click those links in the right-hand column of the stack, then click "revisions" at the top of the resulting page (on hg.mozilla.org) to get links to bugs that changed this code recently. The product and component mentioned most often in the recent entries there are usually a good start.
    • Which frames are significant is something you need to use some personal judgement on. Skip frames that are generic things like "abort", "alloc(ate)", "processNextEvent" or similar, and look for function names that sound like they perform some crucial action rather than a generic intermediate step.
    • The less experience you have, the more of this will be guesswork, but don't fear: you won't get judged badly if you guess wrong, guessing is a learning experience, and you'll be pretty proud of yourself once you manage to guess correctly a few times! Make your guess as educated as possible. The more often you do this, the better you will get.
  • Make sure a link to the representative crash is in the initial comment, the Crash Signature field filled out and the "crash" keyword set on the bug. If you used the "Report this bug" links, then this is already done correctly.
  • Include a link to more reports with the same signature, e.g. by copying the URL from the "More Reports" link next to the signature in the Details table of the crash report.
  • It helps to include the top few stack frames in the bug - you can cut the list off at a point where it seems to become generic and unrelated to the actual issue, but if you are not sure, include it all.
    • This is another step that involves some guessing. A good rule of thumb is to not go much further than 10-15 frames. Often you have something generic like nsThread::ProcessNextEvent in there, which usually is a good point to cut off and add a "[...]" at the end. See for example what was done in bug 1074196.
    • "Generic" here refers to what the code is doing. If it's part of an event loop, for example (names like "ProcessNextEvent" point to that), it is very generic. Functions that just handle the error but didn't produce it themselves, like "malloc_abort", are also pretty generic.
  • If it's a crash you encountered, it helps to tell developers in the bug what you remember doing when the crash occurred, if possible.
  • If you file based on data, please include when the crash started to appear, how high it ranks in overall volume (and/or in topcrash reports), and anything that deviates from the average sample (e.g. only happening on specific OSes, specific Firefox versions, specific graphics cards, or significant correlations with any modules or add-ons). Also, if the comments on the crash reports reveal common actions and/or possible steps to reproduce, please include that. See the walkthrough below for some tips on how to get that information.
  • Include any clues you have about how this crash could be reproduced. When a developer can reproduce the crash, it's very likely they can find a fix as well.
  • Examples of good crash bug reports: bug 1035168
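
The rules of thumb above for judging and trimming a stack can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frames are plain strings copied by hand from a crash report, the helper names (looks_good, trim_stack), the "@0x..." address format and the sample frame names are hypothetical, and the thresholds are just one reading of the "10-15 frames" guideline.

```python
import re

# A frame that is nothing but a raw address, e.g. "0x7ffd1234" or "@0x1a2b3c".
# Assumption: frames are plain strings copied by hand from a crash report.
RAW_ADDRESS = re.compile(r"^@?0x[0-9a-fA-F]+$")

# Substring that marks an event-loop frame, taken from the example in the text.
CUT_MARKER = "ProcessNextEvent"


def looks_good(frames, min_frames=10):
    """Return True if no frame is a bare address and the stack is reasonably deep."""
    has_raw_address = any(RAW_ADDRESS.match(frame.strip()) for frame in frames)
    return not has_raw_address and len(frames) >= min_frames


def trim_stack(frames, max_frames=15):
    """Keep frames up to the first event-loop frame (inclusive) or max_frames."""
    kept = []
    for frame in frames[:max_frames]:
        kept.append(frame)
        if CUT_MARKER in frame:
            break
    if len(kept) < len(frames):
        kept.append("[...]")  # signal that the list was cut off
    return kept


# Hypothetical frames, invented for illustration only.
sample = [
    "mozilla::Example::DoWork()",
    "mozilla::Example::Run()",
    "nsThread::ProcessNextEvent(bool, bool*)",
    "NS_ProcessNextEvent(nsIThread*, bool)",
    "0x7ffd1234",
]

print(looks_good(sample))
print("\n".join(trim_stack(sample)))
```

Whether to keep the event-loop frame itself before the "[...]" is a judgment call; this sketch keeps it so readers can see where the cut was made. No script replaces looking at the stack yourself, and when in doubt, include it all.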

Example walkthrough

Let's take bug 1074196, which I mentioned before, as an example again. As noted at the start of its initial comment, I filed it from this crash report: https://crash-stats.mozilla.com/report/index/fd16294d-0bbf-4530-b2d4-65c2d2140929. This specific crash report expires 6 months after it was submitted and some info is only shown for a shorter time, so you might not see exactly what I did.

The lower section of the report has the crash stack under a header of "Crashing Thread". The right-hand column of this table has file names that are linked to the actual code. For example, clicking the first one goes to line 228 of nsUrlClassifierPrefixSet.cpp. There, I click the "gpascutto@197683" link to the left of code line 228, which leads to the code changes that a person with the (nick) name gpascutto made in changeset no. 197683 of our code. From there, I click the link "9d24ecc84a50" (the short hash that identifies this changeset) next to "changeset 197683". That shows me the comment gpascutto wrote for this changeset, in this case "Bug 1046038 - Replace linear prefix array by array of arrays. r=mmc". Finally, I can click on the bug linked there and look at its data, like product and component.

This takes a lot of clicks unfortunately, and there are ideas on how to improve it (the CVS-based system we had before Mercurial/"hg" showed the comment and bug link right away when you moved your mouse over the name of the person who changed the line in the first code view of this process) - but for now, this series of clicks is what is needed.
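
If you collect several changeset comments this way, pulling the bug numbers out of them can be sketched as below. This assumes the comments follow the "Bug <number> - <summary>. r=<reviewer>" shape of the one quoted above, which not every changeset does. The function names are hypothetical, and the URL is the usual form of a bug page on bugzilla.mozilla.org.

```python
import re

# Assumption: the comment looks like "Bug <number> - <summary>. r=<reviewer>".
# Real comments vary, so "no match" is an expected outcome, not an error.
COMMIT_PATTERN = re.compile(
    r"^Bug (?P<bug>\d+)\s*-\s*(?P<summary>.*?)(?:\s+r=(?P<reviewer>\S+))?$"
)


def parse_commit_message(message):
    """Return bug number, summary and reviewer as a dict, or None if no match."""
    match = COMMIT_PATTERN.match(message.strip())
    if match is None:
        return None
    return {
        "bug": int(match.group("bug")),
        "summary": match.group("summary").rstrip(". "),
        "reviewer": match.group("reviewer"),  # None when no "r=" part exists
    }


def bug_url(bug_number):
    """Build a link to the bug page on bugzilla.mozilla.org."""
    return "https://bugzilla.mozilla.org/show_bug.cgi?id=%d" % bug_number


info = parse_commit_message(
    "Bug 1046038 - Replace linear prefix array by array of arrays. r=mmc"
)
print(info)
print(bug_url(info["bug"]))
```

The product and component still have to be read from the bug page itself; the sketch only saves retyping the bug number.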

With those steps, I find that the bug is in the product "Toolkit", component "Phishing Protection", so in the crash report, I click the "Toolkit" link next to "Report this bug in" within the Bugzilla section, about the middle of the page when scrolling down. Back on the crash report, I go to the top and click the "More Reports" link to get to a signature summary page with more info about those crashes.

Looking through those, I try to find properties that are out of the ordinary (like only a few product versions being listed in the "Products" section, only very specific OSes or graphics adapters being listed, or far fewer installations than crashes being listed in "Crashes per Install"). In the list in the "Reports" tab, things like the same address on all reports are interesting as well. All of those should be listed in a comment on the bug report. Posting a link to the "More Reports" page is usually a very good idea as well.
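
The kind of check described above, looking for a property that is the same on (nearly) all reports, can be sketched as follows. Everything here is an assumption for illustration: the report data is invented, the field names ("os", "address") are hypothetical and not the crash-stats schema, and the 90% threshold is arbitrary.

```python
from collections import Counter


def dominant_values(reports, field, threshold=0.9):
    """Return (value, share) pairs whose share of the reports is >= threshold."""
    counts = Counter(report[field] for report in reports if field in report)
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (value, count / total)
        for value, count in counts.most_common()
        if count / total >= threshold
    ]


# Invented sample data; the field names are hypothetical.
sample_reports = [
    {"os": "Windows NT", "address": "0x0"},
    {"os": "Windows NT", "address": "0x0"},
    {"os": "Mac OS X", "address": "0x0"},
    {"os": "Windows NT", "address": "0x0"},
]

print(dominant_values(sample_reports, "address"))
print(dominant_values(sample_reports, "os"))
```

In this made-up sample, the crash address is identical on every report, which is worth mentioning in the bug, while no single OS reaches the threshold. A high share is only interesting if it differs from the overall user population: most crashes coming from the most widely used OS is not out of the ordinary by itself.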

Understanding the Bugzilla hierarchy

Explaining the full hierarchy is not really possible, and it changes all the time as components are added and some are retired. But I can give you a few pointers to the cornerstones, so to speak. Note that all this has grown organically over the 15+ years that the Mozilla project and Bugzilla have existed, so some things might not be the way one would design them from scratch.

The most important pieces:

  1. There are Bugzilla "products" for, logically, the major products we have at Mozilla, like Firefox, Firefox for Android, Firefox OS, Thunderbird, SeaMonkey, etc. - those are *only* about things that are specific to that one product and not in code that several of those products share.
  2. For shared code, there's the "Core" product, which has all the underpinnings of Gecko and "the Mozilla platform", like Graphics, JavaScript Engine, networking, etc. - and there's "Toolkit" for other shared code that is closer to or includes user interface pieces, like Add-ons Manager. The boundary between those two is fluid and many of us often don't know what is in which of them; you have to search and learn here.
  3. There are a lot of supporting "products" like for websites and server software, legal issues, localizations, IT operations, 3rd-party plugins, etc. - you can ignore them most of the time, but sometimes you might get directed to one for something. Best is to not try to understand more of those than what you run across and need for a certain task.
  4. All those "products" in Bugzilla have "components" underneath them. When filing a bug in Bugzilla, once you have selected a product, a description of each component is displayed when you select it from the list. Use those descriptions to make your best guess of where your bug fits. We put bugs into "wrong" components all the time ourselves, as for some things it's just hard to figure out or the lines are blurry. Don't shy away from setting one, it's easy to move the bug to a different one afterwards if you guess wrong.

The general rule here, as with many other things in our community, is: Guess and try to do what makes most sense to you, nobody will think less of you if you guess wrong. Be bold, try what sounds like the best way to you, and learn as you go. It's better to try and learn than not to try in the first place.