Breakpad/Status Meetings/2016-01-06

From MozillaWiki

Latest revision as of 19:49, 6 January 2016


Meeting Info

Breakpad status meetings occur on Wed at 11:00am Pacific Time.

Conference numbers:

   Vidyo: Stability 
   650-903-0800 x92 conf 98200#
   800-707-2533 (pin 369) conf 98200# 

IRC backchannel: #breakpad
Mountain View: Dancing Baby (3rd floor)

Operations Updates

  • What's up dear Rabbit?
    • http://pad.mocotoolsprod.net/p/postmortem-dec-2015-4
    • problem 1: disks filled up due to no server recycling (jp to file bugs and own)
    • problem 2: processed crashes going down.
      • 80% difference, verified: Rabbit is silently losing 80%.
      • 43 unable to send as many crashes (proxy SSL problems).
      • Abandon our RabbitMQ. We're using cloudamqpapp.com.
      • Lars has been unable to fully test this on staging. Once it is tested on staging, Lars will push it to prod.
      • Backup plan if it continues to lose things: upgrading the pika libs.
      • We have preserved logs of the failed UUIDs from 1st Jan.
      • Enumerating UUIDs on S3 takes a very long time, which is a problem if we try to reprocess all of December. "A lot of data pain in the butt".
    • problem 3: on 28th Dec all processors died. ujson couldn't handle bad data. Breakpad bug? ujson bug? (We're on ujson 1.33; 1.34 is available.) Lars has a patch that drops ujson for regular json.
    • (jp) How do we do monitoring of this going forward?
      • reaching out to experts for help.
      • alerting socorro-dev@mozilla.com
  • Any news regarding a Hashicorp config web system thing?
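  • Processor #Fail (see problem 3 above)
    • https://bugzilla.mozilla.org/show_bug.cgi?id=1235436
    • How do we get access to these for local debugging?
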
No STAGE PUSHES until Lars says so.

Project Updates

  • S3 bucket names with dots. lonnen?
  • WebQA PII?
    • let's not test privileged access.
    • but continue to test NOT privileged access.

Deployment Triage

PR Triage

other business

Travel, etc

  • lars
    • jury duty 2016-01-11 - 2016-01-13
    • other news - accepted offer of keynote at PyCon2016

Links