Breakpad/Status Meetings/2016-08-03

== Meeting Info ==

Breakpad status meetings occur on Wed at 10:00am Pacific Time.

Conference numbers:

* Vidyo: Stability
* 650-903-0800 x92 conf 98200#
* 800-707-2533 (pin 369) conf 98200#

IRC backchannel: #breakpad
Mountain View: Dancing Baby (3rd floor)

== Operations Updates ==

[https://bugzilla.mozilla.org/buglist.cgi?priority=P1&resolution=---&query_format=advanced&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&component=Infra&product=Socorro&list_id=13148014 P1 Infra Bugs]

* focused on "not being on fire"
** seems to be going well
* root cause of last week's issues
** configuration mismatch with the rest of the cluster
** puppet missed putting the .yaml file in there
** the nodes defaulted to 2GB and when they exhausted that, everything went to hell
** we initially suspected that it was retention related
** we debated, but didn't land, a change that would lower retention temporarily
* new pingdom accounts are coming if you already have one
* monitoring of ES
** Jason has been helping us figure out our ES config and make it more robust
** JP has a new monitoring agent
** we expect to have new, aggressive alerts
* super search errors are now checked in the webapp health check (see the sketch after this list)
** should catch individual shard failures
** shard failures now show up in pingdom and sentry
** JP will own a plan for failure
* python upgrade
** on the horizon
** JP wants a stable stage and prod before he does it
** let's do it this week, shortly after our next ship to prod
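
Because a partially failed Elasticsearch query still returns HTTP 200, a health check has to inspect the per-shard counts in the response body. A minimal sketch of that idea, assuming the elasticsearch-py client; the index name is hypothetical and this is not the actual Socorro healthcheck code:

<pre>
# Illustrative sketch only -- not Socorro's actual healthcheck code.
# Assumes the elasticsearch-py client; the index name is hypothetical.
from elasticsearch import Elasticsearch


def check_es_shards(hosts, index="socorro_reports"):
    """Return (ok, detail), where ok is False if any shard failed."""
    es = Elasticsearch(hosts)
    # size=0: we only care about the _shards bookkeeping, not the hits.
    result = es.search(index=index, body={"query": {"match_all": {}}, "size": 0})
    shards = result.get("_shards", {})
    failed = shards.get("failed", 0)
    if failed:
        return False, "%d of %d shards failed" % (failed, shards.get("total", 0))
    return True, "all %d shards responded" % shards.get("total", 0)
</pre>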

== Project Updates ==

* Socorro::Middleware component to Graveyard
** aiming to reduce to ::infra ::general ::webapp
** https://bugzilla.mozilla.org/show_bug.cgi?id=1291808
* Monitoring/healthcheck now checks for ES shard errors. In prod. Every minute.
* Home page AJAX code is now cached! Yay faster home page.
* during the ES fire there was talk among release drivers of a spike
** Laura was saying we needed to be stable so they could investigate
* the API now sends cache headers (see the sketch after this list)
** the webapp is using them now, which improved performance
* google auth is getting ready to go out
** we see an error on stage that we cannot reproduce
** we suspect it's security scanner tools, and benign
** one more fix is going out and then we're ready for prod
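
A minimal sketch of the cache-header idea, assuming a Django view; the endpoint name and max-age are hypothetical, not the actual Socorro code:

<pre>
# Illustrative sketch only -- the endpoint name and max-age are hypothetical.
# A Cache-Control header lets browsers (and the webapp's own AJAX calls)
# reuse a response instead of re-fetching it.
from django.http import JsonResponse
from django.views.decorators.cache import cache_control


@cache_control(public=True, max_age=300)  # cacheable for five minutes
def supersearch_fields(request):
    payload = {"fields": []}  # built from the model layer in reality
    return JsonResponse(payload)
</pre>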

=== Deployment Triage ===

=== PR Triage ===

== Major Projects ==

=== Migrating off of persona ===

* Deployed in stage
** emails will be sent as soon as we put this in prod
** we also need to execute the migration for aliases: https://bugzilla.mozilla.org/show_bug.cgi?id=1287548
*** lonnen will migrate to clonnen, etc.; google auth doesn't support aliases

=== Sending public data to parquet for reading from spark/re:dash ===

* Adrian and Peter have a prototype that adds another crash storage which sends to S3
** https://github.com/peterbe/socorro/pull/17/files
* Mark's awareness of reprocessing (aka primary keys)
** how useful is it, and how often do we do it?
* we are going to unify raw and processed crashes into a single crash report JSON document based on the public schema (see the sketch after this list)
** avoids duplicate info, unifies all the info we have into one doc, and uses the prettier name where we have it
** this happens only at the point where we transmit to the telemetry data platform
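
A minimal sketch of the unification idea; the schema fragment and field names below are hypothetical stand-ins for the real public crash schema:

<pre>
# Illustrative sketch only -- the schema fragment and field names are
# hypothetical stand-ins for the real public crash schema.
# Idea: one JSON document per crash, built from the raw and processed
# crashes, keeping each field once under its public ("pretty") name.

PUBLIC_SCHEMA = {
    # public name: (which crash to read from, field name there)
    "uuid": ("processed", "uuid"),
    "product": ("raw", "ProductName"),
    "version": ("raw", "Version"),
    "signature": ("processed", "signature"),
}


def to_public_document(raw_crash, processed_crash):
    sources = {"raw": raw_crash, "processed": processed_crash}
    doc = {}
    for public_name, (source, field) in PUBLIC_SCHEMA.items():
        if field in sources[source]:
            doc[public_name] = sources[source][field]
    return doc
</pre>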

=== Symbols service refactoring (snappy, somewhat tangential to us) ===

* No update.

=== Signature generation across crash reporters ===

=== Splitting out collector ===

* No update.

=== Collecting client-side JavaScript errors ===

* No update.

=== Handling more PII data in crashes ===

=== Sending stacks for all crashes from the client ===

* No update.

=== Replacing FTPscraper ===

* Nick Thomas suggests we put a file called "socorro.json" in the build artifacts on TaskCluster (see the sketch below).
** A bad idea?
* https://public.etherpad-mozilla.org/p/socorro-releng-index-201607
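
A minimal sketch of what consuming such a file could look like; the index namespace, artifact path, and the contents of socorro.json are all hypothetical at this stage:

<pre>
# Illustrative sketch only -- the index namespace, artifact path, and
# the contents of socorro.json are all hypothetical at this stage.
# Idea: instead of scraping FTP directory listings, read one small JSON
# artifact that each build publishes and the TaskCluster index points at.
import json
import urllib2  # Python 2, matching Socorro at the time


def fetch_build_info(namespace):
    """Fetch socorro.json for the latest build indexed under namespace."""
    url = ("https://index.taskcluster.net/v1/task/%s"
           "/artifacts/public/build/socorro.json" % namespace)
    return json.load(urllib2.urlopen(url))


# e.g. fetch_build_info("gecko.v2.mozilla-central.latest.firefox.linux64-opt")
</pre>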

== other business ==

== Travel, etc ==

== Links ==