Notes collected from Stability Week 2013, Aug 19-23, Mozilla MV (in reverse order). Action Items have been extracted to a separate page.

Day 4: 08/22/2013

Rethinking Socorro Storage

Storages:

Postgres
ES
HBase
Raw crashes (which include JSON and dump) go 100% to HBase and 10% to Postgres and ES.
...little help?
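
The 100% / 10% split above can be pictured as a per-crash routing decision. The following is only a sketch under stated assumptions: the function name `destinations_for`, the store labels, and the idea of sampling by hashing the crash ID are illustrative, and are not taken from how Socorro actually throttles.

```python
import hashlib

SAMPLE_PERCENT = 10  # share of raw crashes that also go to Postgres and ES


def destinations_for(crash_id, sample_percent=SAMPLE_PERCENT):
    """Return the stores a raw crash should be written to.

    Every crash goes to HBase. A deterministic subset of roughly
    sample_percent% (chosen by hashing the crash ID, an assumption made
    for this sketch) also goes to Postgres and ES.
    """
    digest = hashlib.sha1(crash_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    stores = ["hbase"]
    if bucket < sample_percent:
        stores += ["postgres", "es"]
    return stores
```

Hashing rather than drawing a random number means the same crash ID always lands in the same bucket, so a resubmitted crash is routed the same way each time.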

SLAs

Keep all crashes forever. Keep derived data as long as possible. Keep everything else for 6 months. (The 6 months we were able to negotiate *only* for privacy reasons.)

Availability

Write crashes: 100%
Processing crashes: a couple hours to a day
UI/API: can be down
Priority processing: 1-2 mins (we do 30 seconds)

CAP

Consistency, availability (for writes, too), partition-tolerance: pick any 2.
Postgres: C, A
HBase: C, P
Mongo, Cassandra, ES: A, P
It's an emerging pattern to have a storage with each set of 2. We do.

Considerations when choosing

Which of A, P, C are we missing, and how hard is it to work around that?
Ease of querying
Support at Mozilla
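
The first consideration ("which of A, P, C are we missing") can be worked out mechanically from the CAP pairs listed in the notes. A small sketch, restricted to the three stores Socorro uses; the names `STORES`, `missing_property` and `covered_by_some_store` are made up for illustration.

```python
CAP = frozenset("CAP")

# Properties each store provides, as listed in the CAP notes above.
STORES = {
    "Postgres": frozenset("CA"),
    "HBase": frozenset("CP"),
    "ES": frozenset("AP"),
}


def missing_property(store):
    """Return the single CAP letter the given store does not provide."""
    (missing,) = CAP - STORES[store]
    return missing


def covered_by_some_store():
    """Return the CAP letters provided by at least one store."""
    return frozenset().union(*STORES.values())
```

With one store per pair, every property is covered somewhere, which is the "emerging pattern" the notes describe; the open question is how hard it is to work around the one letter each store lacks.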

NoSQL left behind! ;-)

HBase problems

Thrift is flaky. Does it matter? Could we use it better? Should we use something else?
- Why is it flaky? Server problem? JVM misconfig? rhelmer observes that, when we have problems, we'd hear that the servers were having problems at the same time. The CrashMovers today have had over 100 Thrift failures, and it's 10am.
- Alternatives: protocol buffers, Apache Avro, HTTP
- Stage runs into these failures much more often than prod, and it's more of a disaster, because there are fewer processors, and it can't spare one to hang out in a retry loop. On-call gets paged, because now we care about staging.
- Might be nice to have a staging HBase cluster of our own to play with so we can experiment and track down the Thrift issue.
Hard to query
- There are good tools available for this, but we'd bottleneck on IT installing them so we could try them out. Having our own playground cluster would help with that.
No insight into problems
Hard to install, so hard to develop with
We're not taking advantage of its good stuff, like map-reduce
No 24/7 on-call from IT. There's one guy. He doesn't give us much advance notice about upgrades.
Horrifying backup situation (HBS)
Need training
No effective monitoring. It's not clear when it's gone down.
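
One way to make the Thrift flakiness above measurable is a tiny harness that hammers an operation and reports how often it fails. This is a sketch only: `measure_failure_rate` is a hypothetical helper, and `make_flaky` is a simulated stand-in, not a real Thrift or HBase client call.

```python
def measure_failure_rate(operation, attempts):
    """Call `operation` `attempts` times and return (failures, rate).

    `operation` is any zero-argument callable, for example a function that
    opens a connection and reads one row. Any exception counts as a failure.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    failures = 0
    for _ in range(attempts):
        try:
            operation()
        except Exception:
            failures += 1
    return failures, failures / attempts


def make_flaky(fail_every):
    """Hypothetical stand-in for a Thrift call: fails on every Nth call."""
    calls = {"n": 0}

    def operation():
        calls["n"] += 1
        if calls["n"] % fail_every == 0:
            raise IOError("simulated transport failure")
        return "ok"

    return operation
```

Pointed at a playground cluster, a test could assert that the measured rate stays under some agreed threshold, which would give a failing test to experiment against.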

Actions

Push on the ops/training/support troubles. Try and get some dedicated attention (a person, for instance) allocated to us. [laura]
- Backups: costed on Amazon Glacier [lonnen], waiting on bmoss to approve the expense [laura]
Write a failing test against some kind of playground environment to expose Thrift failures. Once that's failing, we can do experiments to figure out what's wrong. [erik, rhelmer]
Research alternative connection methods to Thrift [selena]
Rhelmer will look at CM4 and see what we can get out of that for monitoring, etc [rhelmer]
Go shopping for query tools - Cascading, Hive, Impala, etc - must be open source [selena]
Research alternatives to HBase [selena]

Socorro and FHR: Testing, the master plan or the mutating reality

Q: Should QA worry about dev as much anymore, given that it's a random playground in unknown states?
A: Dev isn't monitored by IT or set up the same as anything else. Stage is a better indicator.
Automation suite now runs a test 5 times before declaring failure. We weren't doing that right before.
Nobody's been ambiently complaining about crash-stats in meetings this year! Yay quality.
Resource: https://etherpad.mozilla.org/socorro-testing
xFail reporting: http://vaidikkapoor.info/mozwebqa-dashboard/
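
The "run a test 5 times before declaring failure" policy might look roughly like the sketch below. The helper name `run_with_reruns` is hypothetical, and the assumption that a single passing run is enough to pass overall is ours; the notes do not spell out the exact rule.

```python
def run_with_reruns(test, max_runs=5):
    """Run `test` up to `max_runs` times; pass as soon as one run passes.

    Returns (passed, runs_used). A run fails if it raises AssertionError.
    """
    for run in range(1, max_runs + 1):
        try:
            test()
            return True, run
        except AssertionError:
            continue
    return False, max_runs
```

Recording `runs_used` alongside the result keeps intermittent failures visible even when the test eventually passes.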

Figure out how much end-to-end testing to add each time we add a new component: do we test the whole system with, say, the new crash storage installed? Can we embiggen our testing of configuration, since that's so hard to get right? Which configuration? Well, we currently support 6 configs: dev, stage, prod, integrationtest1 (old monitor), integrationtest2 (rabbit), and vagrant. Let's have fewer: a local-development one (which obsoletes or is the same as our Vagrant one), a prod one,
- https://github.com/mozilla/socorro/blob/master/scripts/monitor-integration-test.sh depends on PG but nothing else: not ES or HBase.

Actions

Get Selenium suite polling for new Socorro commits so the tests get run more promptly and we have less trouble tracking down regressions. [mbrandt]
How about some tests for our stored procs? Write some. They'll call the procs through the middleware layer. [selena]
Stop running tests against dev. [mbrandt]
Audit stage's config; make sure it's the same as prod. [selena, rhelmer, lonnen]

Stats

We should knock some windows in this airport.

Actions

File a bug and do something about it. [lonnen] -> bug 908334

Day 3: 08/21/2013

Socorro: future of dev

Who uses what
- Counts: Vagrant: 2, Other VM: 5, Local: everybody else 6+
- Local: everything, everything but hbase, etc
- Using dev resources? (dev, mware, hbase) - most
- Local fakedata / remote fakedata
Dev and stage run same code now
Dev was busted a lot. What should we do with it?
- run try branches? (some resistance, the current box is useless)
- don't put it back in inventory
- turn it into a real environment?
- maintain dev datastores?
Vagrant is too complex: need to trim it down to bare minimum
- rhelmer has this working better on other projects
- strip it down
- base it on CentOS instead of Ubuntu
  - Then we can build binary packages for prod on CI no matter what host OS CI is running.
- dev needs to continue as a resource for datastores: everything from the middleware back (testing crons etc etc)

Actions

redo Vagrant-lite (rhelmer)
developer HBase polished and working (tmary bug 872810, rhelmer to follow up) bug 907964
keep dev for now for use as a datastore etc (no real action here)
build new dev env on the VM dumitru gave us (lonnen)
nag IT for puppet write access for everybody (laura)
follow up on options for try (lonnen)

Improving User Support after Crashing

Instructions to users on how we can help them
Brainstorm ideas for how to get things in front of users:
- After crash, start in safemode (esp if we know it's add-on related)
- Show UI to users on startup on how to help them
- Allowing users to ask for help
- Improving the data quality sent to FHR
- On the support side, we find it hard to help users out based only on crash-stats
FHR in future may look into self-healing options (disable plugin if that causes instability ??)
Part of initial FHR idea was that server would report back if anything interesting is in the packet and any steps can be taken to improve the state
Sending emails to the user (already there in Socorro) and attaching a resolution to their issue ?
- Cannot send emails in custom language etc?
  - Bug on file ?
By adding metadata into the Empty Dump crashes we can classify users and do some outreach for help
- For example, we know some of these users already (e.g. Nvidia users with certain graphics cards), and they are an outreach away ?
Outreach options
- Emails (bsmedberg)
- FHR ? (but if it's a start-up crash ?)
- Support article embedded in crash report ?
  - Outstanding Question : How long do we keep these support articles ?
Link to SUMO on FHR?
- It's in the header under Learn More... and in the tips to specific articles
What's blocking us from processing all (100%) of the crashes? (Why do we need this? :bsmedberg to fill in here) (laura)
- Monitor is a bottleneck but going away soon
- Correlations
- Cron jobs take long
- Storage is a concern
- Saving the minidump not important
Enhancing about:crashes https://wiki.mozilla.org/File:About_crashes.png was an older idea and we are discussing something along the same lines here
What can we do about start-up crashes ??
Right now, we offer Safe Mode option after 3 start-up crashes
- Most of them are third party related
  - Libraries that get loaded into the process
Build things into the installer: the reinstaller should check if we had start-up crashes (crash reporter?) and pass information to users on previous similar crashes and help them out
- Limitations : Only possible on some platforms
Or fire-up another browser( ??) or something else here to pass on the related information
[Rob Strong] Would Changing existing Plugin directory structure help the start-up crashes ?
- :bsmedberg, not that much
Review of the current state of Firefox crashes
- What happens when you crash:
  - First crash, restart Firefox
  - Second crash, restart Firefox, disable tabs
  - Third crash, if startup crash, start safe mode
What might cause Firefox crashes that we can ID and be actionable
- External extension
- Antivirus
- Graphics driver (user may update or disable HW accel)
- Plugins (user may update)
- Firefox version (user may update)
Only 5% of crashes have email addresses
Consider the severity (frequency)
Catch users before they leave
Redesign crash reporter to better collect emails addresses.

Actions

[KaiRo] Finish up bug on listing last 3 days of crashes in about:support - bug 765285
[gps] expose "disable addon with ID X" API from chrome to "healthreport" - bug 912815
[Arun] come back for ideas for design changes in the client

https://etherpad.mozilla.org/ux-stability-workweek (my notes -Arun)

[bsmedberg] Support classification
[brandonsavage] API to say "tell me what classification for this crash is" [Magic 8 ball API] - bug 915667
[bsmedberg to follow up] Reduce the junk in the help menu.
[bsmedberg] Redesign email flow
[bsmedberg] Support article embedded in crash report ? Need to investigate this further for implementation
[laura] Develop plan for processing 100% of crashes only for their support classification
[laura] Develop plan for API for Firefox client to query for crash classification

Socorro Triage

Open Webapp bugs, sorted by last changed
- List size at start of session: 206
- List size at end of session: 151
- Resolved as WFM: 10 - WONTFIX: 26 (4 changed resolution in the week after) - INCO: 3 - FIXED: 23

DAY 2: 08/20/2013

New B2G Socorro Features

Check bugs on file for all the requests this afternoon, then we can triage them
Reports per-device
- This quarter, search by device: https://bugzilla.mozilla.org/show_bug.cgi?id=853455
- B2G vs App crashes
- ADI ratio, if we had ADIs
- Reports per-partner - later
- Regular TCBS for B2G
  - Work out what the blockers are: don't know releases, what else?
B2G Crash view enhancement (versions)
- 1.1.* vs 1.1.1.* vs 1.1.1.0
reports by buildID
- facet on per-device report
partner access to minidumps is an open question
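
The version views mentioned above (1.1.* vs 1.1.1.* vs 1.1.1.0) amount to wildcard matching on version strings. A minimal sketch using shell-style patterns from the standard library; `matching_versions` is a hypothetical helper and the version list in the usage note is made up.

```python
from fnmatch import fnmatchcase


def matching_versions(pattern, versions):
    """Return the versions matching a shell-style pattern such as '1.1.*'."""
    return [version for version in versions if fnmatchcase(version, pattern)]
```

For example, given `["1.0.1.0", "1.1.0.0", "1.1.1.0", "1.1.1.2", "1.2.0.0"]`, the pattern `"1.1.1.*"` selects only the two 1.1.1 builds, while `"1.1.*"` also includes 1.1.0.0.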

Actions

We'll link to per-device crash lists from per-vendor wiki pages
Chasing ADIs, verifying
akeybl/khu - to follow up with TAMs about getting a crash report per partner (soft blocked on crashme app, instructions to get crash ID)
- required: product name, version, platform, build_id, build_type (beta/release/etc)
- optional: beta_number, repository
  - where can we find the above information? Can we do it via some adb cmds?
    - required info will be in a crash - probably available via adb commands, it's in settings app
Build IDs of final release are highlighted in green on this spreadsheet : https://docs.google.com/a/mozilla.com/spreadsheet/ccc?key=0Av0LdM1CVycIdDBxVFhqNmNkSnBhVnNSc05vMmhRdXc#gid=4
bajaj - checklist item for channel to be release-
KaiRo to poke catlee for proper channel naming internally (pinged on bug 869905 for now)

B2G Stability Process

What do we need to do with partners? -khu
- we need gecko and gaia source code (all open source parts)
- partner needs to provide symbols for each build
- partner needs to turn on crash reporter and provide a crash report for this device/build
- With QC symbols we do not need binaries but since we know that they are not going to be able to provide QC symbols we NEED binaries
do we file bugs? or do we only provide a partner view?
- depends on whether or not we can make our partners care to track their own stability
- we can highlight these issues with partners

Decision Tree for crashes we find in crash-stats:

see top crash in KaiRo's reports (soon to be Socorro), file bug
- Device Specific
  - Gecko
    - Mozilla will be the first to look
    - Nominate for blocker of upcoming release
  - Proprietary Code (RIL, sensor drivers, etc.)
    - Mark with POVB, or file in Boot2Gecko::Partner Issues
    - Pass the bug off to appropriate TAM [WHO IS THIS?] for partner triage, for investigation by OEM to start
    - Mozilla to follow up as necessary
    - Nominate for blocker of upcoming release
- Not Device Specific
  - Gecko Crash
    - Normal stability process
  - Content Crash
    - Normal stability process
  - Open Source Crashes (Android level)
    - Handled case-by-case
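
The decision tree above can also be written down as a lookup table, which makes gaps easy to spot. This is only a sketch: the key names (`"gecko"`, `"proprietary"`, `"content"`, `"open_source"`) and the function `next_steps` are labels invented here; the step texts are copied from the tree.

```python
# (device_specific, component) -> follow-up steps, in the order listed above.
DECISION_TREE = {
    (True, "gecko"): [
        "Mozilla will be the first to look",
        "Nominate for blocker of upcoming release",
    ],
    (True, "proprietary"): [
        "Mark with POVB, or file in Boot2Gecko::Partner Issues",
        "Pass the bug off to appropriate TAM for partner triage",
        "Mozilla to follow up as necessary",
        "Nominate for blocker of upcoming release",
    ],
    (False, "gecko"): ["Normal stability process"],
    (False, "content"): ["Normal stability process"],
    (False, "open_source"): ["Handled case-by-case"],
}


def next_steps(device_specific, component):
    """Return the follow-up steps for a crash, or raise if no branch fits."""
    try:
        return DECISION_TREE[(device_specific, component)]
    except KeyError:
        raise ValueError(
            "no branch for device_specific=%r, component=%r"
            % (device_specific, component)
        )
```

Combinations the tree does not cover, such as a device-specific content crash, raise an error rather than silently picking a branch.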

Actions

communicate the device-specific crash report
[akeybl] Propose creation of a confidential partner-specific Boot2Gecko::Partner Issues (just like POVB)
[done][nhirata w/ robert wood] automated testing ( http://woodrobert.wordpress.com/2013/05/31/gaia-ui-endurance-tests/ ) should enable crash metrics for TAMs to use (endurance testing)

B2G Stability Requirements

What do we need to analyze B2G stability and raise stability?
- reports per-device
  - What is needed to integrate B2G crash reports that KaiRo generates into crash-stats
- https://crash-analysis.mozilla.com/rkaiser/2013-08-19/2013-08-19.b2g.topcrashes.weekly.html
- surface the buildID to understand geo-specific stability issues
- team reports (?)
[gkw] What about JS app/Gaia "crashes"
- blocked on dcamp's team's work
  - JS bug https://bugzilla.mozilla.org/show_bug.cgi?id=630464
  - DOM bug https://bugzilla.mozilla.org/show_bug.cgi?id=355430
- consider doing raw crash counts (even without stacks) to justify the importance of these
- Gaia: about:gaia-crashes (or equivalent name) page/app:
  - https://bugzilla.mozilla.org/show_bug.cgi?id=857843
Opt-in submit your crash via cell network
[bsmedberg] Build a process watcher to watch for hanging processes and kill/report
Cheng's common issues:
- 4/24 needed to pull their battery due to hang - see responsiveness action
- apps closing unexpectedly
- Firefox OS reboot (system crash) - should have crashes for this already
What does QA do in stability testing right now and what requirements has QA to improve this work? Taipei QA plans to write scripts to do the testing, and use high speed camera to record down the results.
Dupes: the checksum on the minidump will be the same
- action: checksum minidumps, store the checksum, use for dupe detection
Need about:crashes for b2g
Need crashmenow for b2g
OOM killer on apps playing music or in the foreground
When is the new Symbol Upload UI going to come up (bug # ?? )
- For the short term, do we still need to request partners to upload the symbols for 1.1 in the same old fashion ?
Could we direct them to some automated script/tool that does automatic symbol upload of every build instead of the manual process where they need to submit the file each and every time ? (Not sure how Geeksphone do it ?)
Do we need anything else from partners on the stability side ?
- KaiRo mentioned having a test case where they could crash and pass on the details in a bug #
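
The dupe-detection idea above (checksum the minidump, store the checksum, compare on arrival) could be sketched as follows. The choice of SHA-256, the class name `DupeDetector`, and the in-memory dictionary are assumptions for illustration; a real deployment would keep the checksums in one of the crash stores.

```python
import hashlib


def minidump_checksum(dump_bytes):
    """Return a hex checksum of the raw minidump bytes (SHA-256 assumed)."""
    return hashlib.sha256(dump_bytes).hexdigest()


class DupeDetector:
    """Remember the first crash ID seen for each minidump checksum."""

    def __init__(self):
        self._first_seen = {}  # checksum -> crash ID of the first report

    def check(self, crash_id, dump_bytes):
        """Return the crash ID this report duplicates, or None if it is new."""
        checksum = minidump_checksum(dump_bytes)
        original = self._first_seen.setdefault(checksum, crash_id)
        return None if original == crash_id else original
```

Because a resubmitted report carries a byte-identical minidump, the second submission maps to the same checksum and is reported as a duplicate of the first.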

Actions

exposing system app URLs publicly - bug 915397
reports per-device - bug 853455
[NEEDS BUG, Socorro] report by OS version 1.1.*, 1.1.2, etc.
[NEEDS BUG, Socorro] surface the buildID to understand geo-specific stability issues
[bsmedberg] replacing debuggerd (bug #?)
[NEEDS BUG, bsmedberg] need to check for process responsiveness from an external place (app pings), and also pull system logs when reporting - https://bugzilla.mozilla.org/show_bug.cgi?id=908000
way to report issues to Mozilla from settings, with crash reports (input? etc?) - bug 915409
KaiRo->Fabrice to discuss B2G version number-less reports (bug 910836)
nhirata/Laura to look at https://bugzilla.mozilla.org/show_bug.cgi?id=895246 (duplicate reports)
- client side bug as well, why are we getting dupes at all?
- switch dupe detection to use minidump checksumming - bug 907499
Get activation time be reported as install time - bug 915405
bug 908896 about:crashes on the phone
fg OOM kills (FHR? crashes?) - bug 915407

Meet with B2G Engineering working on Crashes

Pain points for B2G crashes
- we don't get crash reports for one of our devices (TCL)
- Geeksphone crashes reflected the ZTE Open well
Gonk specific crashes
- Camera crashes were prevalent
Lack of code for devices [AI]
- line number, but not our code
- We need to put pressure on partners to honor MPL for open source
support suggests Firefox OS "crashes and hangs a lot"
Do we have devices on hand?
- Physical access - QA?
stability bugs our partners want fixed, versus us wanting it fixed
external pre-release populations
- If not GP, then Nexus 4
PARTNER VIEW to Socorro+1
- Put this in front of partner QA, get them to investigate and file bugs
- Would this info not be present post release though ?
bsmedberg is making a symbol upload app, will be added to the release checklist
release checklist should maybe be added to the release checklist
we're making an app to force a crash for partners to test, and provide us with a crash submission
foreground applications that go out of memory should be seen, but may not (action item)
does the crash reporter have comments?

no they do not

https://etherpad.mozilla.org/StabilityWeek2013-Notes-B2G

Actions

[akeybl/legal] We need to have access to source(gecko only? or also Gaia?) both. ok, thx.
[akeybl/branding/TAMs] find out about requiring that OEMs send us their device to get stability support
we need to document how to investigate issues (source, device access, partner contacts, etc.)
[laura] to check with Anurag where we are in having Socorro consume ADI data
[KaiRo/jlebar] foreground applications that go out of memory should be found
we need a bug for uploading crash reports once on wifi and idle
[bsmedberg] we need a bug for comments and emails in B2G - https://bugzilla.mozilla.org/show_bug.cgi?id=907998

Metrics

6 analysts
understanding insights -> formulate question -> select techniques -> collect data -> run analysis -> interpret results
not simple to provide data upon request - need to think about the right way to ask/frame the question
telemetry doesn't have user identifier, so no analytics on user level there (session level only)
irregularities in data, need to figure out what to do with them before analysis
bsmedberg: we have some kinds of problems we'd like to have automated
crash-stats data is biased because the total sample is not including non-crashing installations
current FHR data on crashes shouldn't be seen as reliable, we are working on better FHR crash metrics
phones only submit crashes over wifi, and some people only ever have data, that skews data as well
sending on dialup may fail often
graph for crash rates over time (1-2 years)
- existing data
- Annie's team, Benjamin Sullins can do the actual graphing work

Actions

KaiRo to drive project on long-term crash rate graph. bug 915438

High Level CrashKill Review

Review of responsibility and current process
Reviewing what we check in stability daily/weekly/etc. - bug queries, custom crash queries, crash-stats
- https://etherpad.mozilla.org/stability-checks
- per-OS
- crash spikes
- top crash list
- most feedback?
high level crash statistic over time -cheng
- we at least know where we are today www.arewestableyet.com -kairo
what % of our users are crashing more than we think is acceptable -cheng
- FHR?
plugin hangs are worse than plugin crashes
- reprioritize? or at least triage for understanding
- we're already working to unblock this triaging
Topcrash criteria
- https://wiki.mozilla.org/CrashKill/Topcrash
New Top Crash metric based upon volume, # of impacted users, type of crash, etc.
- most feedback, slowest startup, etc.
Key question: are we going to lose users?
- level of frustration (#/type of comments) - also input, twitter, SUMO: what else?
Things to add to the metric:
- # of (unique)impacted users, or percentage of users on channel [TODAY]
- crashes per install [TODAY]
  - impacts more than X installations (150k on a critical crash?)
- # of comments [TODAY]
- # of curse words (multilingually)
- hangs versus crashes [TODAY]
- plugin versus browser [TODAY]
- startup crashes!! also permanent versus non-permanent [TODAY]
- URL of the crashing site (like the doodle...)
- more crash types
  - crashes while editing [NEEDS BUG]
  - crashes soon after update [TODAY]
- Browser uptime (2 crashes in an hour of using the browser is too high) [TODAY]
- histogram of recent crashes? [NEEDS BUG]
- crash and don't get your tabs back (session restore) [non-actionable?]
"Soak" time for RCs

Actions

arm/flash top crash bugs -kairo - filed bug 918085 for a solution to include that
[socorro] triage bugs that kairo has on file for custom crash reports
longterm crashes, plugin crashes, hangs, plugin hangs - filed as bug 915438
annotate phase of startup as part of crash, along with actual time -bsmedberg (NEEDS BUG) - bug 907994
bug on new crash severity rating bug 918077

Socorro Brainstorming

Combining signatures associated with the same bug # in Socorro (m:n relationship)
- Goal: often there is one bug with multiple signatures, and it's listed lower than it would be if bucketed as a single crash. Want to be able to measure the volume of a bug
- "Top crashes by bug" - https://bugzilla.mozilla.org/show_bug.cgi?id=717797
- Could be done with an alternate crash classifier - but it's really dynamic
- Developers would like a single source of truth for topcrashers
- show bugzilla tracking flags and whiteboard for QA workflow
Identifying/Marking bugs as external issues based upon channel prevalence
Crash "emblems" both auto and manually set - like Mac symbol for Mac-only crashes, Flash plugin crash, etc.
Tags/whiteboards for crashes/signatures pulled automatically from bugzilla
- https://bugzilla.mozilla.org/show_bug.cgi?id=418698
Be able to mark signatures as generic (e.g. EnterBaseline/EnterMethodJIT/EnterJaegershot)
User customization of columns/icons/emblems (Kairo: we need a bug for this)
- This is interesting to a lot of people
- Customized reports need to be sharable
- Customization of search fields, report/list, etc
Bringing up new operating systems or OS updates
- Support more than Linux/Windows/OSX
- Needed for B2G anyway
White or blacklisting of fields
- New annotations should be passed through without a code change
- Need a single source of truth about what is public/private data for all components to look at
- Shouldn't need a code change for adding a new field: config/UI
- Should be as low touch as possible
Getting rid of HBase? [selena is interested :)]
- or at least changing its role
- Reviewing crash storage mechanisms (short, medium, long term)
- Generate a set of requirements, then come up with the alternatives
- Three things we could be doing with hbase that we don't:
  - Impala -- real-time querying of hbase
  - Mahout -- machine learning for hadoop
  - "train the monkeys, don't be the monkeys"
- Think about processing pipeline vs other analytics
- OLTP vs OLAP
Adding more data to ES so we can search on things like graphic card info?
- Should be able to search on any field in the raw crash
- Not only search by any data but be able to present any data in the results.
- Dimension/facet the results in any way. Drill your way down to tight correlations. The only tricky part is where it's not obvious how to define the facet buckets.
- bsmedberg and everybody else is hugely on board with this.
Correlations
- The current state of correlations
  - https://crash-analysis.mozilla.com/crash_analysis/20130820/20130820_Firefox_25.0a2-interesting-modules.txt
  - Looks at a small subset of Firefox (dbaron) python script to correlate signatures with modules/dlls and addons, Human readable text file
  - Need processed data in Postgres to run correlations in Postgres -- would like to have an expiry period on processed data if we put it in Postgres
  - Add processed_json https://bugzilla.mozilla.org/show_bug.cgi?id=907305
  - need correlation column fixed
- [bajaj]Will it help if we have third party correlation/extension info on the crash-stat report for android?
- Some crash-stat views have broken correlation in desktop (was that ever working ? )
  - it only runs for a small subset of products/releases e.g. those listed in https://crash-analysis.mozilla.com/crash_analysis/20130820/
- Block bringing correlations in on getting procJSON in pg (and prioritize that)
- Add correlations for Android at some point (small # users yet, low value)
[bajaj] List of reports by Android API version ?
- Supersearch will solve
when patches/updates from other companies come in (MS updates, Mac OS updates, Flash updates, etc.)
- maybe support pulling in calendar info?
- event timeline/graph annotations as mentioned yesterday would help with this
- long term crashes/ADI graph could show this
Reports on failure to stackwalk
- Depends on JSON mdsw
- mdsw has a verbose command-line version - tells you whether something is exact or if it got confused - not in the machine readable version
- right now 50% exact stackwalking, what the other problems are we don't know - gives an idea of degree of confidence
- filed as https://bugzilla.mozilla.org/show_bug.cgi?id=907312
Symbols and Symbols API - bsmedberg, selena, kairo, ted to discuss a better way to upload
Sig summary: number of affected installations per signature per day - do more with it. Search for it, ratio of crashes of installation (TCB installation)
B2G needs (B2G session this afternoon ?)
- custom reports (nhirata to play around with API in parallel)
  - Reports via devices
  - Report by device/build id... [this gives away country now that I think about it]
  - reports via application [sprint team interest]
- Use of different metrics that need to be adjusted for counting ADI
- Use of different metrics to understand the crash data?
- Third Party upload of symbols made easier
- Third Party looking at crash reports/access to dumps (Privacy/Legal issues?)
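
The "top crashes by bug" idea from the start of this brainstorm (several signatures rolled up under one bug, in an m:n relationship) comes down to a small aggregation. A sketch under stated assumptions: the function name and the two input dictionaries are hypothetical shapes, not Socorro's schema.

```python
from collections import defaultdict


def top_crashes_by_bug(signature_counts, bug_signatures):
    """Sum crash counts per bug; return (bug_id, total) pairs, largest first.

    signature_counts: dict of signature -> crash count
    bug_signatures: dict of bug ID -> iterable of signatures. The relation
    is m:n, so a signature listed under several bugs counts for each.
    """
    totals = defaultdict(int)
    for bug_id, signatures in bug_signatures.items():
        for signature in signatures:
            totals[bug_id] += signature_counts.get(signature, 0)
    # Sort by total descending, then by bug ID for a stable order.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))
```

A bug whose signatures each rank low individually can rise to the top once their volumes are combined, which is the effect the notes are after.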

[note : nhirata uses https://github.com/sotaroikeda/firefox-diagrams/wiki/Firefox-Diagrams to understand/reproduce crashes ]

Work Week, Medium Term, and Long term (year) planning for stability and socorro
If you're involved in Socorro make sure you're on https://lists.mozilla.org/listinfo/tools-socorro

DAY 1 : 08/19/2013

OOM/EMPTY crashes

15%-20% of our crashes are EMPTY, lots of those are OOM
Could we have a flag that tells us about malloc OOM?
Can we inform the user before we run OOM and propose actions to save the situation (close tabs, ..., restart Firefox)?
For people who run out of VM (without leaking) and have free RAM, 64bit would help.
For the GFX driver bug where the same memory is mapped all over again, 64bit doesn't help
We need to use fallible allocators more
Slow script dialog might contribute to being crashy on slow/old machines
It would be awesome if we could get reports from our chrome JS exceptions
Could we get some kind of MTBF instead of only crashes per ADI?
Can we annotate amount of usage or number of pageloads?
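
The difference between the two metrics raised above is what goes in the denominator: crashes per ADI divides by active installations, while MTBF divides usage time by crashes, which is why it needs a usage annotation. A minimal sketch; both function names and the numbers in the usage note are illustrative.

```python
def crashes_per_adi(crashes, adi):
    """Crashes divided by active daily installations for the same day."""
    return crashes / adi


def mtbf_hours(total_usage_hours, crashes):
    """Mean time between failures: total usage time divided by crash count."""
    if crashes == 0:
        return float("inf")  # no failures observed in the period
    return total_usage_hours / crashes
```

With 50 crashes across 1000 installations and 5000 hours of total usage, this gives 0.05 crashes per ADI and an MTBF of 100 hours; only the second number changes if the same users browse twice as long.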

Actions

[gps] Annotation on "how many tabs were open?"
[dmajor] Annotation on OS (maybe also build architecture?) so EMPTY dump reports still get that (and use it in Socorro)
- Bug already exists: bug 838061
[ask Cww if to do at all] put "increasing memory swap helps to crash less" on SUMO radar
[tabled] Investigate with UX/UR of warning users of impending doom and proposing ways out
[bsmedberg to file] Annotate that the slow script dialog came up - bug 907993
[JS team, laura to nag] bug 630464 get JS stacks on the toplevel for uncaught exceptions
[not yet] Talk to jjensen about what crash data we can put in FHR - bug 875562 is leading up to this
Can we determine and eliminate duplicate crash reports (ones that are really just re-submitted)
- [crashkill]review heuristics - bug 907499 will do this in a better way.
- [laura/kairo]triage old bugs for aggregates with and without dupes - with bug 907499 we'll be able to eliminate real dupes completely
We have information on gfx chipset and driver, memory, hardware/cpu but don't use it to narrow down the cause of crashes
- [laura/kairo] prioritize/review bugs during triage - bug 853468 gives us summarized info on graphics chipsets
- Can potentially compare with information like add-ons that are installed
[kairo to file] Annotate/show events like Firefox or MS releases on the crash charts? SUMO already has the ability to note "events" on their graphs. They use it for releases as well. Maybe we can steal or learn.
[Kairo to follow up with privacy, contacted afowler] Submit all URLs from all tabs with a crash report

Automated Stability

Jesse, Rob, :gkw, :tracy, :bc
Fuzzing ?
- Bug hunter - automated tool to repro crashes . We do have a variety of Platforms/OSes where these run
  - Does Bug hunter manipulate UI ?? - "NO".Currently tests crashes on load, has plugins installed
- :Tomcat/:bc occasionally work on it and file bugs
- Most common crashes are thread related, OOM
- Not running on ASAN builds right now..(decoder to help with this)
- Adding some more randomization ??(Jesse proposed ...)
- Some crashes are reproducible only in the tool and not with STR :|, then Scoobi filed a bug from Socorro and it was worked on since then
  - Why ?
    - bclary mentioned about crashy ad code being loaded on a Tuesday (and thus being detected), whereas it is being looked at on a Wednesday, which does not have crashy ad code.
- How many bugs are filed using bug hunter ? Unclear this is looked at once in a while when :bc, :Tomcat have a chance
- Are issues that are repro by bug hunter/reproduced by :bc filed as bugs at all?
- Disconnect between fuzzer bugs and the signature in crash-stat ? Right now Scoobi helps with this.
[Jesse] test-suites passing on different sanitizers
[Jesse] Time on machines to run fuzzing ? Something that needs releng's input. Fuzzing most of the DOM, WebRTC, WebAudio, JS code
- Or utilize nightly population by filtering fuzzer test cases (as it sometimes finds security issues), if cost is an issue ?
- We found about 10 bugs a week running fuzzers on ~100 machines. The fuzzers run breakpad, parse the text output, match it to see if it's a known bug, and check for assertion failures, too much recursion, ref count leaks, and inconsistent rendering for a given DOM tree

Actions

QA to look at the output from the bug hunter tool given the current scenario
[bsmedberg to help with the action needed to resolve the issues] Disconnect between fuzzer bugs and the signature in crash-stat ?
gkw, to ensure decoder uploads stacks as well with the bug reports
Jesse to release part of fuzzers so nightly population can run it
[KaiRo] escalate necessity of tests passing on various sanitizers so we can have tests and fuzzing find issues using those - as of the platform meeting on 9/24, it looks like tests are run and passing on mozilla-central on ASan builds.
If we can't pass all TBPL tests using ASan/etc, can we enable ASan/etc for the subset of tests that do pass (using test manifests for ASan opt-in)?
[KaiRo] Ease routine jobs of putting ranks into bugs and updating topcrash keyword - bug 913437
- Get a service on Socorro that will give current ranks of a signature on current releases. - bug 915373
Can we get Nightly ASan builds that can auto-update so adventurous users can use Nightly ASan as their regular browser?

bsmedberg hour

(see #s)

(1) JSON version of minidump stackwalk
- Goal is to add all sorts of additional data to identify JIT frames, go back to the caller etc (??)
- exploitability would move into the JSON minidump stackwalk
- dump lookup output (https://bugzilla.mozilla.org/show_bug.cgi?id=818069) -bsmedberg
- remember to "branch" the JSON so that we can just chop off the "sensitive" data rather than entirely reparsing when we share crash data
Priorities
- Pretty well covered on Desktop/Mobile
- (2) Low end computers need more focus
- (3) Issues that happen to a lot of people a little, instead of a lot to a few people
Top crash rating is skewed, we prioritize crashes like this : 1 million crashes to 1 million ADI over 1 million crasher over 100,000 ADI
- "Skewed" because we'd be getting more crash reports if those users hadn't ditched Firefox
- Focus more on crashes caused by third parties because that's the major part that is left and this affects our users a lot
  - Out of date drivers, extensions, plug-ins, anti-virus
- (4) "far crashes" - happening far away from their cause, signatures pretty useless
  - IPC, GC, races, ...
(5) Firefox OS (also a priority)
- User research from Cheng shows that crashes/hangs are awful
- [Hang Detection] Are we collecting hangs at all as of today?? Need engineering effort here.
- OOMs are very frequent, and possibly being identified by users as "crashes"
- defer 5 to tomorrow, when jlebar is around :)
Engineering activities
- (6) empty crash signatures
- (7) improving stack walking (depends a lot on JSON minidump stackwalking)
(8) Make uploading symbols very easy, whether encrypted or not
- Oracle,
- REQUIRE that partners provide symbols (Skype toolbar, top binary extensions, etc.) - we can do this in AMO, bsmedberg has already had initial convos with the AMO guys about this
  - Would we allow "hashed symbols" like we get for Adobe Flash? Yes - :bsmedberg has a tool that does this better than the way Adobe does it!
(4 continued) Collecting more data on far crashes (memory corruption??)
- Collecting full dumps?
- Doing it on Nightly/Aurora is much harder than necessary, so maybe only Beta and Release??
- Needs to opt-in for privacy and bandwidth reasons (entire used address space -- gigabytes)
- Needs a privacy review
- Do it the MS way ?? (bsmedberg?)
  - talk to a server: Have I seen this before? Do I want a full dump for this signature?
- Adding the UI to breakpad is hard
- Could this be executed like a Test Pilot program ?
- Implement it in a way that the existing crash-reporter UI can do it when the user restarts Firefox
- Store a crash-report on user's computer and upload on demand ?

Actions

See above bolded #s
[lars] JSON minidump_stackwalk getting deployed

JS Engineering

:nbp and Jesse from the JS team are here
What are the pain points for JS team related to Socorro ?
- Reports are useless :|
- distinguish between read/write (difficult with mixed reads & writes ?)
[selena] GC crashes: we end up having the count, but not the complete volume across everything (not per signature)
- Previous bug https://bugzilla.mozilla.org/show_bug.cgi?id=803209
Mutation of crashes when signatures are morphed ?
- How: talk to Scoobidiver about how to do this since apparently he does it by hand
https://bugzilla.mozilla.org/show_bug.cgi?id=803209 GC crashes over time
- Need followup for graph - bbajaj to file

From the bottom /\

Quickly identifying commonalities between multiple signatures, or separating signatures
- Often what's interesting to JIT devs is offsets, i.e. the last bits of the addresses
- Customizable Search with multiple filters and dig into deeper level to get buckets of crashes( similar to group by ? )
- Search by the last byte ?
- flexible, faceted search (?? need help from Socorro folk here)
  - allows filtering by different criteria, but there is a mismatch between the facets that the JS team needs and the ones that we have. Need to get a list of what they do need - is it common with the needs of other teams? or is it just about being more flexible?
  - example, look at the addresses of the reports associated with:

     https://crash-stats.mozilla.com/report/list?signature=js%3A%3ADestroyContext%28JSContext*%2C+js%3A%3ADestroyContextMode%29&product=Firefox&query_type=contains&range_unit=weeks&process_type=any&hang_type=any&date=2013-08-19+22%3A00%3A00&range_value=1
    This shows how the memory address changed over time; tracking this kind of modification can give hints about which data structure has been modified.

- crash address (offset is what they're interested in for large page sizes, but the raw address for small page sizes) -- look at the addresses and see "similar offset patterns"
- FEATURE: search by last byte of the crash address (adrian says possible now)
- alternate crash classifiers - bsmedberg/lars: building a way to implement alternate classifiers
  - have to know what you're looking for - not emergent
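The "search by last byte" idea can be sketched as a grouping step over crash addresses. This is only an illustration under assumptions: the function names and the sample addresses are made up, and it does not use any Socorro API.

```python
from collections import Counter


def low_bits(address: str, bits: int = 8) -> int:
    """Return the lowest `bits` bits of a hex address string such as '0x7f3a10c4'."""
    return int(address, 16) & ((1 << bits) - 1)


def group_by_low_bits(addresses, bits: int = 8) -> Counter:
    """Count crash addresses by their low bits (bits=8 means the last byte)."""
    return Counter(low_bits(address, bits) for address in addresses)


# Made-up sample addresses, for illustration only.
addresses = ["0x7f3a10c4", "0x12ab50c4", "0x0", "0x5e0190c4", "0x8"]

for value, count in group_by_low_bits(addresses).most_common():
    print(f"0x{value:02x}: {count}")
```

Passing `bits=12` would group by the offset within a 4 KiB page instead of by the last byte, which is closer to the "similar offset patterns" case.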

(IonMonkey doesn't maintain a frame pointer (wow)) - no object on stack, only garbage

  IonMonkey has its own stack layout.  The hacking tips page includes a brief description of IonMonkey stack frames:
  https://developer.mozilla.org/en-US/docs/SpiderMonkey/Hacking_Tips#Finding_the_script_of_Ion_generated_assembly_%28from_gdb%29

Isolating crashes in JIT code
- Annotating the code, so that during processing an alternate stack-walking algorithm can be applied
- Compile all related functions and list them all? (Not sure about the size limitations here)
  - If it fits in 10K, can it fit in a minidump?
- Needs instrumentation ^^

Multiple kinds of addresses
- NULL based, random, others which are completely objects
- If we see a NaN, possibly we forgot to unbox something ?
- Random ones
- Ones that show the characteristics of an object, and those could be exploitable (0x7fff… mostly on x64 systems)
- Poisoned memory 0x…dadadada (not rooted value)
Certain memory patterns are suspicious, need a way to classify them
- 0x41 - exploitable ? How ??(:nbp to fill) (apparently a crash instruction on x86 & x64)

   This is a pattern (0x41414141) which is commonly used by hackers as a convenient way to analyze memory, as it is easy to inject in a text box ("A") and easy to spot, since it causes an execution crash.  Any execution violation with this address should be considered sec-high and exploitable.

- frame poisoning address
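A rough sketch of how the address patterns listed above might be turned into a classifier. The thresholds and labels are assumptions for illustration: the 0x1000 window for NULL-based addresses is a guess, and the NaN and frame-poisoning cases are left out because the notes do not give their bit patterns.

```python
def classify_address(address: str, null_window: int = 0x1000) -> str:
    """Bucket a crash address (hex string) using the heuristics from the notes.

    null_window is an assumed cutoff for "NULL plus a small offset".
    """
    value = int(address, 16)
    low32 = value & 0xFFFFFFFF
    if value < null_window:
        return "null-based"
    if low32 == 0x41414141:
        return "0x41 pattern"   # treated as likely exploitable in the notes
    if low32 == 0xDADADADA:
        return "poisoned"       # 0x...dadadada, not rooted value
    if value >> 32 and hex(value).startswith("0x7fff"):
        return "object-like"    # 0x7fff... on x64 systems
    return "other"


# Made-up sample addresses, for illustration only.
for sample in ["0x8", "0x41414141", "0xdadadada", "0x7fff5fbff8a0", "0x12345678"]:
    print(sample, "->", classify_address(sample))
```

A script like this could run over JIT crashes to split them into buckets for a more detailed top-crash list.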

[bsmedberg] Classification of searches - map reduce program ?
- Not sure how to make it secure ? Limit it to mozillians ?
- Adrian says it's very easy :)
- JIT crashes -> run a script to classify them further for a top-crash list (NaN crashes, NULL ptr etc..)

Making flexible Elasticsearch queries? [Adrian]

Actions

[bajaj] Create a bug# to have the total GC crashes on Nightly graphed
[KaiRo] to ask Scoobidiver for information on anything that is potentially automatable (Scoobidiver could get hit by a bus, God forbid!) - actually, it looks like he disappeared right before this stability week :(
[Laura] to find list of what JS team wants to filter on, determine whether these are common between teams or we need a flexible solution
Enable search results to be split by custom criteria (e.g. addresses) instead of signature
[bsmedberg working that out with nbp] Better way of classification for JIT crashes
Do something (in Socorro UI) with "interesting" addresses (mark them?) - bug 918101
[bsmedberg/nbp] Store where we are in JIT code in a page, dump it as part of the crash stack
[Socorro] exploitable crashes could be marked differently, see above for heuristics
[bsmedberg] Make stackwalker know about (Ion) JIT frames

Radeon Update & Driver Investigation

3min overview
- always happens with AMD graphics, but with different CPUs (mainly AMD but also Intel)
- memory/register corruption from minidumps, values that cannot happen if code is executing how it should
- when the bad thing happens inside a hot function, it happens more often
- something in the graphics driver, probably in kernel mode and pretty surely not directly related to a gfx call from Firefox, does a memory check where it sets a few bytes in memory at a fixed offset in xul.dll and immediately sets them back to the original; if we execute the function in that window, we crash
- Does happen on Aurora/Nightly but since we have daily builds there, not that noticeable and more prominent in Release Builds
- stuck getting through AMD QA
- unrelated to acceleration in any way (not an easy disable)
- Clean install takes a long time to reproduce
virtual memory exhaustion related to gfx drivers
- dual Intel/NVidia cards
- 60-70% of OOM crashes we have data for have enough memory available but run out of virtual memory space
- https://bugzilla.mozilla.org/show_bug.cgi?id=767343 (was planned to be discussed in this meeting in this morning's crash-kill) https://bugzilla.mozilla.org/show_bug.cgi?id=859955 is the main tracking bug
3rd party outreach
- We need to give their QA to have them engaged? -milan
- Getting to the right person would help

Actions

[bsmedberg] learning kernel debugging for the radeon crash, from our new in-house expert
[KaiRo/akeybl/bizdev?] following up on escalation with AMD, in preparation for blogging - pushed off as dmajor found out more details on what's behind this issue
- having an engineer would make things much much more likely to be resolved, at MS you can even pay $$ for support tickets to be reviewed by egr
[lsblakk/bsmedberg] highly correlated crashes in FF23.0, discussing in post-mortem
[bsmedberg] to find bug # for empty crash (probably bug 837835)
- [milan] to follow up on above bug
MS support ticket on bug 812695 (text corruption on Win7) [KaiRo sent email to jrmuizel/bas to check if we really think it's an MS issue]

GFX Engineering

Ran into situations that happen on older cards
- Can we purchase gfx cards before they're EOL'd or when they're released? -milan
- We can't even find older cards on ebay? -erik
- In a general theoretical sense, probably, but in reality the cases we encountered were not available anywhere - ashughes
- vendors? -akeybl
Are there ways that we can get info from a helpful user? -bsmedberg
- dxdiag, system information dumps, etc.?
- gfx team - hard to talk about issues in general
most of the time issues are timing issues, likely not covered through unittests
gfx team - getting access to HW is the main issue
public/private crash data
- is this discouraging developers? -milan
- we don't want this to be a limiting factor, and it shouldn't be with employees -bsmedberg
- people without access are split into contributors (e.g. Scoobidiver) and external partners (FirefoxOS, Oracle (for Java), graphics card vendors)
Not requiring devs to be good at SQL is also a pain point
- Peter will be giving a brownbag on Thursday 2pm PDT on Airmo on the API -laura
  - API https://crash-stats.mozilla.com/api/
- jydoop (?) https://github.com/mozilla/jydoop/
- flow chart for how to get to data, or where to go next with an investigation?
getting access to user computers -cheng
- we'll pay you $500 to take your computer off your hands, doesn't work
- it's all about the level of user expertise
- we could fly engineers out if we needed to
- how bad is it for the user?
Milan - getting exact hardware info, using a debugger, etc. - "I'm an expert!" checkbox +1
why aren't users going from crash -> helping? -akeybl
- the bug number is "below the fold" -kbrosnan
bsmedberg - debugger needs to be already attached
we piss off users when we crash them, we need to remember that -cheng
what about issues where millions of users crash once? -bsmedberg
- still impacts overall stability perception, obviously

Actions

[Milan/Anthony/Marc] how to get access to gfx hardware, or access to user computers (remote access?)
- is there a gfx card service like https://appthwack.com/
- Sounds like there is \o/
- Perhaps this can be used just for top crash
- [bajaj] bug # incoming
[bsmedberg] flow chart for developers: how to get the data that you need, next steps in the investigation, etc.
[KaiRo for filing bugs] Put things that are now in app notes into proper crash annotations instead - bug 918102 filed as a tracker
affected graphics chips per signature - bug 853468 (planned for Q3)
- [brandon] bug 853468 Graphics vendors and devices - q3 goal to add this to signature summary
[Laura/bsmedberg] help get around the legalities involved with crash data access to contributors and external partners (and once those are resolved, the logistics)
- needs subgoals!

CrashKill/StabilityWeek2013
