Breakpad/Status Meetings/2016-08-03
Latest revision as of 17:28, 10 August 2016
Meeting Info
Breakpad status meetings occur on Wed at 10:00am Pacific Time.
Conference numbers:
- Vidyo: Stability
- 650-903-0800 x92 conf 98200#
- 800-707-2533 (pin 369) conf 98200#
IRC backchannel: #breakpad
Mountain View: Dancing Baby (3rd floor)
Operations Updates
- focused on "not being on fire"
- seems to be going well
- root cause of last week's issues
- configuration mismatch with the rest of the cluster
- puppet missed putting the .yaml file in there
- they defaulted to 2GB, and when they exhausted that everything went to hell
- we initially suspected that it was retention related
- debated but didn't land a change that would lower retention temporarily
- new pingdom accounts coming if you have one already
- monitoring of ES
- Jason has been helping us to figure out our ES config and make it more robust
- JP has new monitoring agent
- we expect to have new, aggressive alerts
- super search errors are checked in webapp health check
- should catch individual shard failures
- shard failures break pingdom and sentry now
- JP will own a plan for failure
- python upgrade
- on the horizon
- JP wants a stable stage and prod before he does it
- let's do it this week, shortly after our next ship to prod
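
A minimal sketch of the shard-failure check mentioned in the monitoring notes above, not the actual webapp health check code. It assumes the health check has the parsed JSON body of an Elasticsearch search response, which reports per-request shard counts under `_shards`. The function names `shard_failures` and `healthcheck` are hypothetical.

```python
def shard_failures(search_response):
    """Return how many shards failed, as reported by an ES search response.

    Elasticsearch search responses carry a "_shards" object with
    "total", "successful" and "failed" counts.
    """
    shards = search_response.get("_shards", {})
    return shards.get("failed", 0)


def healthcheck(search_response):
    """Raise if any shard failed, so external monitoring sees an error.

    Hypothetical helper: the real check lives in the webapp and may differ.
    """
    failed = shard_failures(search_response)
    if failed:
        raise RuntimeError("{} Elasticsearch shard(s) failed".format(failed))
    return True
```

Raising on a single failed shard, rather than only when the whole query fails, is what lets a check like this catch individual shard failures.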
Project Updates
- Socorro::Middleware component to Graveyard
- aiming to reduce to ::infra ::general ::webapp
- https://bugzilla.mozilla.org/show_bug.cgi?id=1291808
- Monitoring/healthcheck now checks for ES shards errors. In prod. Every minute.
- Home page AJAX code now cached! Yay faster home page.
- in the ES fire there was talk of a spike among release drivers
- laura was saying we needed to be stable so they could investigate
- api now has cache headers
- webapp is using them now, improved perf
- google auth is getting ready to go out
- we see an error on stage that we cannot reproduce
- we suspect it's security scanner tools, benign
- one more fix going out and then we're ready for prod
Deployment Triage
PR Triage
Major Projects
Migrating off of persona
- Deployed in stage
- emails will be sent as soon as we put this in prod
- also need to execute the migration for aliases https://bugzilla.mozilla.org/show_bug.cgi?id=1287548
- lonnen will migrate to clonnen, etc, gauth doesn't support aliases
Sending public data to parquet for reading from spark/re:dash
- Adrian and Peter have a prototype to add another crash storage that sends to S3
- https://github.com/peterbe/socorro/pull/17/files
- Mark's awareness of reprocessing (a.k.a. primary keys)
- how useful is it and how often do we do it?
- we are going to unify raw and processed crashes into a single crash report JSON document based on the public schema
- avoids duplicate info, unifies all the info we have into one doc and uses the prettier name where we have it
- starts only at the point where we transmit to telemetry data platform
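
A rough sketch of what unifying a raw and a processed crash into one document could look like, not the prototype's actual code. It assumes the public schema can be expressed as a mapping from source field names to the "prettier" name used in the unified document. The function name `unify_crash` and the field names in the usage note are illustrative only.

```python
def unify_crash(raw_crash, processed_crash, public_names):
    """Merge raw and processed crash dicts into one public document.

    public_names maps a source key to the name used in the unified
    document. Keys not listed there are treated as non-public and dropped.
    Processed values are applied last, so they win when a raw key and a
    processed key map to the same unified name.
    """
    unified = {}
    for source in (raw_crash, processed_crash):
        for key, value in source.items():
            if key in public_names:
                unified[public_names[key]] = value
    return unified
```

For example, mapping both a raw `ProductName` key and a processed `product` key to `product` leaves a single `product` entry in the result, and a raw `Email` key that is absent from the mapping is left out.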
Symbols service refactoring (snappy, somewhat tangential to us)
No update.
Signature generation across crash reporters
- No update
- but Peter is asking about it here: https://bugzilla.mozilla.org/show_bug.cgi?id=828452
Splitting out collector
No update.
Collecting client-side JavaScript errors
- No update
Handling more PII data in crashes
Sending stacks for all crashes from the client
- no update
Replacing FTPscraper
- Nick Thomas suggests we put a file called "socorro.json" in the build artifacts on TaskCluster.
- A bad idea?
- https://public.etherpad-mozilla.org/p/socorro-releng-index-201607