User:Armenzg/Test pool efficiency

Revision as of 18:43, 28 September 2015 by Armenzg (talk | contribs) (runner dashboard)

This page collects information about running *tests*: how to understand where job time goes, which metrics we need, and how to determine the efficiency of the system.

Information about jobs

Non-running-tests wall time:

  • machine reboot time (if applicable)
  • runner (if applicable)
  • buildslave connecting to the master and being assigned a job
  • Buildbot steps other than the mozharness call
  • Buildbot step lag (due to master lag)
  • mozharness actions that do not run tests
    • clobber
    • download-and-extract
    • checkout
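
As a rough illustration of how the items above relate to efficiency, here is a minimal sketch that splits a job's wall time into test time and everything else. The `JobTiming` class, its field names and the sample numbers are hypothetical and are not taken from any existing Buildbot or mozharness data format.

```python
from dataclasses import dataclass


@dataclass
class JobTiming:
    """Wall-time breakdown of one test job, in seconds (hypothetical fields)."""
    total_wall: float     # from the start of the job to its completion
    running_tests: float  # time spent actually executing tests


def overhead_seconds(job: JobTiming) -> float:
    """Wall time not spent running tests (reboot, runner, Buildbot steps, clobber, ...)."""
    return job.total_wall - job.running_tests


def efficiency(job: JobTiming) -> float:
    """Fraction of wall time spent running tests; 0.0 for a job with no wall time."""
    if job.total_wall <= 0:
        return 0.0
    return job.running_tests / job.total_wall


# Made-up example: a 20 minute job that runs tests for 15 minutes.
job = JobTiming(total_wall=1200.0, running_tests=900.0)
print(overhead_seconds(job))  # 300.0
print(efficiency(job))        # 0.75
```

Tracking the overhead per category (reboot, runner, Buildbot steps, mozharness actions) rather than as a single number would show which item is worth optimizing first.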

We always reboot Windows testers, since runner isn't managing all the processes there. We also reboot after any Android, emulator, mochitest or reftest job, since those change the system state in ways we haven't been able to identify; the only way to get back to a known good state is to reboot.
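
The reboot rule described above can be written down as a small predicate. This is only a sketch: the function name, the platform and job name strings, and the substring matching are assumptions for illustration, not how the real configuration expresses it.

```python
# Job name fragments after which a reboot is needed, per the rule above.
# The exact strings are illustrative assumptions.
REBOOT_JOB_FRAGMENTS = ("android", "emulator", "mochitest", "reftest")


def needs_reboot(platform: str, job_name: str) -> bool:
    """Return True if the machine should reboot after this job."""
    # Windows testers always reboot because runner does not manage them.
    if platform.lower().startswith("win"):
        return True
    # These jobs change system state in ways that have not been identified.
    name = job_name.lower()
    return any(fragment in name for fragment in REBOOT_JOB_FRAGMENTS)


print(needs_reboot("windows7-32", "xpcshell"))  # True
print(needs_reboot("linux64", "mochitest-1"))   # True
print(needs_reboot("linux64", "xpcshell"))      # False
```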

Known bugs

  • We are currently experiencing lag introduced by the masters. Possible mitigations:
    • reduce the number of active jobs running on a master
    • reduce the number of Buildbot steps
    • reduce output
      • this impacts step lag because log processing happens over the same channel as the start/stop commands
      • can we make mozharness not output to stdio and have log_uploader.py upload the mozharness log and set log_url to it?
    • send logs back to the master in bigger chunks (fewer interruptions of the masters)
    • http://hg.mozilla.org/build/buildbotcustom/file/03644c855bb4/bin/log_uploader.py#l111
      • the data is somewhat structured already - that function serializes it out to the current format
  • bug 1209112 - Virtualenv cache always gets clobbered
  • bug 1208223 - We lack Mozharness metrics for test jobs (per-action)
  • We lack per-step Buildbot metrics
    • We have some data on Pulse, but we don't know the real elapsedTime
  • We don't have runner for Windows test jobs
    • Having it would move cleanup steps to before Buildbot starts up
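
To make the idea of sending logs back to the master in bigger chunks concrete, here is a minimal buffering sketch. `ChunkedLogSender` and its `send` callback are hypothetical; this is not Buildbot's or mozharness's actual log API, only an illustration of trading many small messages for fewer large ones.

```python
class ChunkedLogSender:
    """Buffer log lines and hand them to `send` in larger chunks.

    `send` is any callable taking one string; in a real system it would be
    whatever ships log data to the master (an assumption in this sketch).
    """

    def __init__(self, send, chunk_size=64 * 1024):
        self._send = send
        self._chunk_size = chunk_size  # flush threshold, in characters
        self._buffer = []
        self._buffered = 0

    def write(self, line):
        """Queue a line; flush once the buffer reaches the threshold."""
        self._buffer.append(line)
        self._buffered += len(line)
        if self._buffered >= self._chunk_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered as a single message."""
        if self._buffer:
            self._send("".join(self._buffer))
            self._buffer = []
            self._buffered = 0


# Tiny threshold so the behaviour is visible: three writes, two messages.
sent = []
sender = ChunkedLogSender(sent.append, chunk_size=10)
sender.write("abcd\n")
sender.write("efgh\n")  # buffer reaches 10 characters, first message goes out
sender.write("x\n")
sender.flush()          # final flush at the end of the step
print(len(sent))        # 2
```

A larger threshold means fewer interruptions of the master but later visibility of log output, so the value would need tuning against how live the logs have to be.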

Optimizations

Auditing

  • Evaluate which jobs can be combined or re-shuffled

Sources

http://activedata.allizom.org/tools/query.html#query_id=SDcCQmDR

buildbot_status    duration
exception              3473
failure             1353995
retry                107128
success           174430338
warnings            8688192
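
One way to read the table above is as each status's share of the total duration. The sketch below uses the figures exactly as listed; the unit of `duration` is not stated on this page, so only proportions are computed.

```python
# Durations per buildbot_status, copied from the table above (unit not stated).
durations = {
    "exception": 3473,
    "failure": 1353995,
    "retry": 107128,
    "success": 174430338,
    "warnings": 8688192,
}

total = sum(durations.values())
shares = {status: value / total for status, value in durations.items()}

# Print statuses from largest to smallest share of total duration.
for status, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{status:<10} {share:7.2%}")
```

With these numbers, successful jobs account for roughly 94.5% of the total duration and jobs with warnings for most of the remainder.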
  • Buildbot master lag: dashboard
    • The master lag is calculated by measuring the reported duration of one of the initial steps, which should be nearly instantaneous
    • What is the impact on jobs?
  • Tree uptimes, end to end, branch load, time per push dashboard
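
The master lag measurement described above can be sketched as follows. The step record layout (`name`, `start`, `end` in seconds) and the probe step name are assumptions made for this example; the dashboard's actual implementation is not shown on this page.

```python
def master_lag(steps, probe_step, expected=0.0):
    """Estimate master lag from one step that should be nearly instantaneous.

    `steps` is a list of dicts with "name", "start" and "end" (seconds);
    this layout is hypothetical. Returns None if the probe step is missing.
    """
    for step in steps:
        if step["name"] == probe_step:
            reported = step["end"] - step["start"]
            # Anything beyond the expected duration is attributed to the master.
            return max(0.0, reported - expected)
    return None


# Made-up job: the probe step should take ~0s but was reported as 12.5s.
steps = [
    {"name": "set props", "start": 100.0, "end": 112.5},
    {"name": "run mozharness", "start": 112.5, "end": 1500.0},
]
print(master_lag(steps, "set props"))  # 12.5
```

If every step pays a similar lag, multiplying this estimate by the number of Buildbot steps would give a rough answer to the question of impact on jobs, under that assumption.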