Breakpad/Status Meetings/2016-08-03

== Meeting Info ==

Breakpad status meetings occur on Wed at 10:00am Pacific Time.

Conference numbers:

* Vidyo: Stability
* 650-903-0800 x92 conf 98200#
* 800-707-2533 (pin 369) conf 98200#

IRC backchannel: #breakpad
Mountain View: Dancing Baby (3rd floor)

== Operations Updates ==

[https://bugzilla.mozilla.org/buglist.cgi?priority=P1&resolution=---&query_format=advanced&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&component=Infra&product=Socorro&list_id=13148014 P1 Infra Bugs]

* focused on "not being on fire"
** seems to be going well
* root cause of last week's issues
** configuration mismatch with the rest of the cluster
** puppet missed putting the .yaml file in there
** the nodes defaulted to 2GB and when they exhausted that, everything went to hell
** we initially suspected that it was retention related
** we debated, but didn't land, a change that would lower retention temporarily
* new pingdom accounts are coming if you already have one
* monitoring of ES
** Jason has been helping us figure out our ES config and make it more robust
** JP has a new monitoring agent
** we expect to have new, aggressive alerts
* super search errors are now checked in the webapp health check (see the sketch after this list)
** should catch individual shard failures
** shard failures now show up in pingdom and sentry
** JP will own a plan for failure
* python upgrade
** on the horizon
** JP wants a stable stage and prod before he does it
** let's do it this week, shortly after our next ship to prod
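
Because a partially failed Elasticsearch query still returns HTTP 200, a health check has to inspect the per-shard counts in the response body. A minimal sketch of that idea, assuming the elasticsearch-py client; the index name is hypothetical and this is not the actual Socorro healthcheck code:

<pre>
# Illustrative sketch only -- not Socorro's actual healthcheck code.
# Assumes the elasticsearch-py client; the index name is hypothetical.
from elasticsearch import Elasticsearch


def check_es_shards(hosts, index="socorro_reports"):
    """Return (ok, detail), where ok is False if any shard failed."""
    es = Elasticsearch(hosts)
    # size=0: we only care about the _shards bookkeeping, not the hits.
    result = es.search(index=index, body={"query": {"match_all": {}}, "size": 0})
    shards = result.get("_shards", {})
    failed = shards.get("failed", 0)
    if failed:
        return False, "%d of %d shards failed" % (failed, shards.get("total", 0))
    return True, "all %d shards responded" % shards.get("total", 0)
</pre>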

== Project Updates ==

* Socorro::Middleware component to Graveyard
** aiming to reduce to ::infra ::general ::webapp
** https://bugzilla.mozilla.org/show_bug.cgi?id=1291808
* Monitoring/healthcheck now checks for ES shard errors. In prod. Every minute.
* Home page AJAX code is now cached! Yay faster home page.
* during the ES fire there was talk among release drivers of a spike
** Laura was saying we needed to be stable so they could investigate
* the API now sends cache headers (see the sketch after this list)
** the webapp is using them now, which improved performance
* google auth is getting ready to go out
** we see an error on stage that we cannot reproduce
** we suspect it's security scanner tools, and benign
** one more fix is going out and then we're ready for prod
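
A minimal sketch of the cache-header idea, assuming a Django view; the endpoint name and max-age are hypothetical, not the actual Socorro code:

<pre>
# Illustrative sketch only -- the endpoint name and max-age are hypothetical.
# A Cache-Control header lets browsers (and the webapp's own AJAX calls)
# reuse a response instead of re-fetching it.
from django.http import JsonResponse
from django.views.decorators.cache import cache_control


@cache_control(public=True, max_age=300)  # cacheable for five minutes
def supersearch_fields(request):
    payload = {"fields": []}  # built from the model layer in reality
    return JsonResponse(payload)
</pre>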

=== Deployment Triage ===

=== PR Triage ===

== Major Projects ==

=== Migrating off of persona ===

* Deployed in stage
** emails will be sent as soon as we put this in prod
** we also need to execute the migration for aliases: https://bugzilla.mozilla.org/show_bug.cgi?id=1287548
*** lonnen will migrate to clonnen, etc.; google auth doesn't support aliases

=== Sending public data to parquet for reading from spark/re:dash ===

* Adrian and Peter have a prototype that adds another crash storage which sends to S3
** https://github.com/peterbe/socorro/pull/17/files
* Mark's awareness of reprocessing (aka primary keys)
** how useful is it, and how often do we do it?
* we are going to unify raw and processed crashes into a single crash report JSON document based on the public schema (see the sketch after this list)
** avoids duplicate info, unifies all the info we have into one doc, and uses the prettier name where we have it
** this happens only at the point where we transmit to the telemetry data platform
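
A minimal sketch of the unification idea; the schema fragment and field names below are hypothetical stand-ins for the real public crash schema:

<pre>
# Illustrative sketch only -- the schema fragment and field names are
# hypothetical stand-ins for the real public crash schema.
# Idea: one JSON document per crash, built from the raw and processed
# crashes, keeping each field once under its public ("pretty") name.

PUBLIC_SCHEMA = {
    # public name: (which crash to read from, field name there)
    "uuid": ("processed", "uuid"),
    "product": ("raw", "ProductName"),
    "version": ("raw", "Version"),
    "signature": ("processed", "signature"),
}


def to_public_document(raw_crash, processed_crash):
    sources = {"raw": raw_crash, "processed": processed_crash}
    doc = {}
    for public_name, (source, field) in PUBLIC_SCHEMA.items():
        if field in sources[source]:
            doc[public_name] = sources[source][field]
    return doc
</pre>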

=== Symbols service refactoring (snappy, somewhat tangential to us) ===

* No update.

=== Signature generation across crash reporters ===

=== Splitting out collector ===

* No update.

=== Collecting client-side JavaScript errors ===

* No update.

=== Handling more PII data in crashes ===

=== Sending stacks for all crashes from the client ===

* No update.

=== Replacing FTPscraper ===

* Nick Thomas suggests we put a file called "socorro.json" in the build artifacts on TaskCluster (see the sketch below).
** A bad idea?
* https://public.etherpad-mozilla.org/p/socorro-releng-index-201607
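
A minimal sketch of what consuming such a file could look like; the index namespace, artifact path, and the contents of socorro.json are all hypothetical at this stage:

<pre>
# Illustrative sketch only -- the index namespace, artifact path, and
# the contents of socorro.json are all hypothetical at this stage.
# Idea: instead of scraping FTP directory listings, read one small JSON
# artifact that each build publishes and the TaskCluster index points at.
import json
import urllib2  # Python 2, matching Socorro at the time


def fetch_build_info(namespace):
    """Fetch socorro.json for the latest build indexed under namespace."""
    url = ("https://index.taskcluster.net/v1/task/%s"
           "/artifacts/public/build/socorro.json" % namespace)
    return json.load(urllib2.urlopen(url))


# e.g. fetch_build_info("gecko.v2.mozilla-central.latest.firefox.linux64-opt")
</pre>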

== other business ==

== Travel, etc ==

== Links ==