Platform/GFX/WRperfmeetings
Meeting notes for our wr-perf sync sessions.
Q3
Jul 2
Attendees: GW, JM, JCB
- Glenn is taking on the list of work below, which is somewhat entangled at the moment
- Some of this work will become less tied together, so more can be done in parallel, but it is not clear exactly when that will be
- First goals: C, E - likely diversions into pieces of other work to get those done
- Expected outcomes on this work: better performance, rendering quality in some cases, and simplicity
- Outstanding question: how can we measure this work as we go to make sure we are moving things in the right direction?
(A) Text Runs
- Draw text runs to persistent texture cache, blit as image.
(B) Support multiple child tasks for a given picture.
- Limit render task size to picture cache tile size.
(C) Simplify clip API and implementation.
- WR clip API is much more expressive than needed.
(D) Optimize GPU clipping implementation.
- Take advantage of reduced clipping scope.
- Handle large clip masks that are mostly redundant.
- Gradient clip mask primitive.
(E) Visibility / clip-chain building optimizations and caching.
- Most work we do here per-frame is redundant.
(F) Reduce / remove GPU cache and vertex texture usage.
(G) Scene Building profiling and optimization.
(I) Segment building
- Move to scene building / interning pass.
- Simpler implementation (nine-patch / fixed grid only)
(J) Unify shaders
- Unify text and image shaders, others possible too.
- Reduce batching cost and renderer cost.
(K) Faster instance building
- Build directly into mapped buffers in render backend etc
(L) Simplify raster root calculations and generation.
- Quality improvements
- Reduce cost of picture traversal / bounds checks
(N) Optimize how shared_clips are generated and handled per-frame.
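As a rough illustration of item (A), the idea of drawing a text run once and then blitting it as an image can be sketched as a cache keyed by the run's contents. This is a minimal sketch under assumptions, not WebRender code: `TextRunKey`, `CacheHandle`, `TextRunCache` and the frame-age eviction policy are all hypothetical names and choices made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical key identifying a text run: font, size and glyph sequence.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct TextRunKey {
    pub font_id: u32,
    /// Font size in 1/64 pixel units so the key stays hashable.
    pub size_64ths: u32,
    pub glyph_ids: Vec<u32>,
}

/// Hypothetical handle to a rasterized run stored in a texture cache.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CacheHandle(pub u32);

struct Entry {
    handle: CacheHandle,
    last_used_frame: u64,
}

#[derive(Default)]
pub struct TextRunCache {
    entries: HashMap<TextRunKey, Entry>,
    next_handle: u32,
    /// Number of runs rasterized so far (cache misses).
    pub rasterized: usize,
}

impl TextRunCache {
    /// Returns the cached handle for `key`, rasterizing only on a miss.
    pub fn request(&mut self, key: &TextRunKey, frame: u64) -> CacheHandle {
        if let Some(entry) = self.entries.get_mut(key) {
            entry.last_used_frame = frame;
            return entry.handle;
        }
        // Miss: a real implementation would draw every glyph of the run
        // into a cached texture here. This sketch only counts the work.
        let handle = CacheHandle(self.next_handle);
        self.next_handle += 1;
        self.rasterized += 1;
        self.entries.insert(
            key.clone(),
            Entry { handle, last_used_frame: frame },
        );
        handle
    }

    /// Drops runs that were not requested within the last `max_age` frames.
    pub fn evict(&mut self, current_frame: u64, max_age: u64) {
        self.entries
            .retain(|_, e| current_frame.saturating_sub(e.last_used_frame) <= max_age);
    }

    /// Number of runs currently cached.
    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

With a cache like this, a run that stays on screen across frames costs one lookup per frame instead of one draw per glyph. The trade-off is texture memory, which is why some eviction policy is needed.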
Work that is not entangled, can be done independently:
(H) Consider replacing rayon with a simpler crossbeam-channel approach.
(M) Box shadow rendering optimizations / simplifications
- Remove from being a clip kind and make into a normal primitive.
(O) General optimization work of frame builder.
- Tighter loops for each part of the frame.
- Memory allocations / freelists etc
(P) Glyph cache work - one issue is the way we tie the lifetime of CPU glyphs to GPU glyphs in the texture cache. Also general batching issues.
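For the "Memory allocations / freelists" point under (O), the general technique is to recycle slots instead of allocating and freeing per frame. The following is a minimal, generic sketch of that technique and not the existing WebRender implementation; `FreeList`, `Handle` and `capacity_in_slots` are names chosen for this example, and it deliberately omits stale-handle detection.

```rust
/// Index-based handle into a `FreeList` (hypothetical type for this sketch).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Handle(pub usize);

/// Stores values in a flat vector and reuses freed slots.
pub struct FreeList<T> {
    slots: Vec<Option<T>>,
    /// Indices of slots that are currently empty and can be reused.
    free: Vec<usize>,
}

impl<T> FreeList<T> {
    pub fn new() -> Self {
        FreeList { slots: Vec::new(), free: Vec::new() }
    }

    /// Inserts a value, reusing a freed slot when one is available.
    pub fn insert(&mut self, value: T) -> Handle {
        match self.free.pop() {
            Some(index) => {
                self.slots[index] = Some(value);
                Handle(index)
            }
            None => {
                self.slots.push(Some(value));
                Handle(self.slots.len() - 1)
            }
        }
    }

    /// Removes the value behind `handle` and marks its slot as reusable.
    pub fn remove(&mut self, handle: Handle) -> Option<T> {
        let value = self.slots.get_mut(handle.0)?.take();
        if value.is_some() {
            self.free.push(handle.0);
        }
        value
    }

    /// Looks up the value behind `handle`, if the slot is occupied.
    pub fn get(&self, handle: Handle) -> Option<&T> {
        self.slots.get(handle.0)?.as_ref()
    }

    /// Total number of slots ever allocated, occupied or not.
    pub fn capacity_in_slots(&self) -> usize {
        self.slots.len()
    }
}
```

Because handles here carry no generation or epoch, a handle kept after `remove` can alias a later insert; a production version would likely need to guard against that.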
Q2
May 19
- GW is convinced the text run optimizations are worth doing. That could be a significant win for CPU time.
- GW will focus on that in a week or two. https://bugzilla.mozilla.org/show_bug.cgi?id=1639346
- Data target init is taking up a lot of time. It would be valuable to understand that, and why GPU times show so much time in the first flush.
- BP is going to investigate that https://bugzilla.mozilla.org/show_bug.cgi?id=1639336
- Why do batches break? That could be a missing piece of the puzzle. Nical to add it to the list for investigation again
- JM to reach out to more reporters of bugs on old hardware and continue work on getting decent measurements
- DM working on avoiding uploading instance data for each draw call, blocked by ANGLE update, will mark his task blocking the frame building meta bug
May 12 2020
GPU View walkthrough meeting
GPU View tips:
- ... TODO magic symbol lookup string to make Firefox symbols work
- Ctrl-Z to zoom in, Z to zoom out; '+' to add threads;
- Select an object ID and use "Lookup" to find details about it, eg. a texture allocation;
- Filter by thread ID in the Event Viewer to focus on a single (scene builder) thread;
GPU frequency ramp-up delay:
- can take up to a second before GPU is at full throttle again after low amounts of work;
- talking to Intel on getting any details on clock change latency;
- the better we optimize fast frames, the worse this actually gets;
- make the cheap frames more expensive and the expensive ones cheaper? eg. render more tiles outside the viewport?
- keep fixing bugs that cause spike frames -- unneeded invalidations;
- is there a "prefetch" API to give the driver/GPU a heads-up? we have 9ms to do it.
- (do mobile-first APIs eg. Apple have something like this? or just really good schedulers?)
Delay between submitting CPU work and GPU doing it:
- manually force more flushes, see what happens (eg. using GetData to avoid Angle ignoring it);
- identify source of 1ms delay in the submit; help from Intel and/or experiment in stand-alone sample;
- start submitting earlier -- overlap frame builder and scene builder threads;
- will be easier once they're a single thread again;
- starting earlier is limited by the need to sample at vsync;
- no obvious "bookkeeping" style GPU work we could start earlier;
- clears are a maybe -- but beware of tiled architectures;
- unlikely to find much work there that can be split off -- it's mostly talking to the (single threaded) GL context;
Submission cost:
- Angle overhead; is Update Resource leading to Maps?
- try switching from uniforms to UBOs;
- try removing uniforms altogether: transform is unneeded, other state can be folded into shader permutations;
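To make the last point concrete, folding per-draw state into shader permutations means each combination of flags selects a precompiled shader variant instead of being uploaded as a uniform. The sketch below assumes three boolean flags purely for illustration; `DrawState`, `ShaderTable`, the define names and the placeholder program IDs are hypothetical and stand in for real shader compilation.

```rust
/// Hypothetical per-draw state that might otherwise be passed as uniforms.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct DrawState {
    pub alpha_pass: bool,
    pub dual_source_blending: bool,
    pub antialiasing: bool,
}

/// Three boolean flags give 2^3 possible shader variants.
pub const PERMUTATION_COUNT: usize = 1 << 3;

impl DrawState {
    /// Packs the flags into a dense index in `0..PERMUTATION_COUNT`.
    pub fn permutation_index(self) -> usize {
        (self.alpha_pass as usize)
            | ((self.dual_source_blending as usize) << 1)
            | ((self.antialiasing as usize) << 2)
    }

    /// Preprocessor defines a variant would be compiled with (illustrative names).
    pub fn defines(self) -> Vec<&'static str> {
        let mut out = Vec::new();
        if self.alpha_pass {
            out.push("ALPHA_PASS");
        }
        if self.dual_source_blending {
            out.push("DUAL_SOURCE_BLENDING");
        }
        if self.antialiasing {
            out.push("ANTIALIASING");
        }
        out
    }
}

/// Lazily "compiles" one program per permutation and reuses it afterwards.
pub struct ShaderTable {
    programs: [Option<u32>; PERMUTATION_COUNT],
    /// Number of variants compiled so far.
    pub compiled: usize,
}

impl ShaderTable {
    pub fn new() -> Self {
        ShaderTable { programs: [None; PERMUTATION_COUNT], compiled: 0 }
    }

    /// Returns the program for `state`, compiling it on first use.
    pub fn program_for(&mut self, state: DrawState) -> u32 {
        let index = state.permutation_index();
        if let Some(id) = self.programs[index] {
            return id;
        }
        // Placeholder for real compilation with `state.defines()`.
        let id = 100 + index as u32;
        self.programs[index] = Some(id);
        self.compiled += 1;
        id
    }
}
```

The cost of this design is that the variant count doubles with every flag, and switching programs can itself break batches, so it only pays off for state that changes rarely between draws.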
Useful links:
- Call of Duty latency talk
- Reduce number of uniforms
- Submit GPU work more often
- GPU frequency blog post
May 5 2020
Agenda:
- Status for Intel gen 7.5 and 7 - any new bugs/issues/things to worry about?
- Intel gen 6 - how is that shaping up? How far off from being shippable?
- Any other Intel targets we should consider looking at more closely right now?
- Discuss any other useful tasks/bugs to fix people have in mind that would be beneficial in general
Notes:
- We should still try to make things better for older Intel generations; how much is unclear
- Plenty of ideas right now of how we could do so, but we don't know which will give us the clearest wins if at all
Glenn's ideas:
- We still spend time updating vertex buffers and the GPU cache - try to improve this, or reduce our usage
- Port gpu scatter updates to vertex data updates
- changing how we map/update vertex buffer objects
- change batching strategy
- unifying shader cases
- rather than drawing each glyph treat text runs like cached render task
In particular, 2 of the ideas will bring improvements to other platforms/hardware as well:
- using 1:1 texture for solid rects in the alpha pass - will help batching on all platforms
- share instance data between frame builder and renderer so fb can write directly. Should be easy to check if there are wins.
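The second idea, letting the frame builder write instance data directly, can be sketched as a writer over a caller-provided slice. In the sketch a plain `Vec` stands in for a mapped GPU buffer, and `Instance`, its fields and `InstanceWriter` are hypothetical; the real layout and mapping mechanism are not described in these notes.

```rust
/// Hypothetical per-primitive instance, laid out as the GPU would read it.
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(C)]
pub struct Instance {
    pub prim_address: i32,
    pub clip_address: i32,
    pub z: i32,
    pub flags: i32,
}

/// Writes instances straight into a caller-provided slice, which in a real
/// renderer could be a mapped vertex buffer owned by the renderer.
pub struct InstanceWriter<'a> {
    target: &'a mut [Instance],
    written: usize,
}

impl<'a> InstanceWriter<'a> {
    pub fn new(target: &'a mut [Instance]) -> Self {
        InstanceWriter { target, written: 0 }
    }

    /// Writes one instance in place. Returns `false` when the buffer is full,
    /// so the caller can flush and continue with a fresh buffer.
    pub fn push(&mut self, instance: Instance) -> bool {
        match self.target.get_mut(self.written) {
            Some(slot) => {
                *slot = instance;
                self.written += 1;
                true
            }
            None => false,
        }
    }

    /// Number of instances written so far.
    pub fn written(&self) -> usize {
        self.written
    }
}
```

Compared with building a `Vec<Instance>` in the frame builder and copying it at draw time, this removes one copy per batch. Measuring whether that copy is significant is the "easy to check" part mentioned above.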
Next Steps:
- We need to spend a bit more time investigating what the issues are
- We have WR on for some gen 7.5 and 7 devices in Nightly
- Need to spend more time profiling so we can get a clearer picture of which idea will help the most
- Glenn can profile on 7.5, probably the same issues but different magnitudes happen on the older gens
- Non-4K stuff needs more looking at
- Add edge casey-ness and complexity to ideas listed above
- Likely that frame building bugs will help in general
- Jeff to put up some profiles on gen 6
- Look at screen resolution of 7.5 gpus -> https://sql.telemetry.mozilla.org/queries/70757/source#178018
- Invalidation is pretty high on low resolution. See bug 374980 (fixing this should help a lot)
April 20 2020
Agenda:
- Would like to ensure we can be in a shippable state for 7.5, 7 and 6 by end of H1 (EO June)
- Intel Gen 7.5 - to confirm: anything else to do there or are we in a shippable state now? I know there are some further tweaks on the radar but my understanding is we are 'good enough' now. Want to prioritize work that will give us improvements for the rest of this grouping of Intel
- Intel Gen 7 - what do we need to do there to ship?
- Intel Gen 6
- What will unblock us there? Looks like 7% of pop, so we should agree how much time is reasonable to spend
Notes:
Intel unblocking
- We will enable WR on 7.5 and at least some % of 7
- We need to do more profiling and testing for 6 to have a sense of how far off we are there
- jbonisteel and jrmuizel to figure out where to prioritize that in the coming weeks
- We can likely also try enabling WR for Win7 soon, maybe start of 78 TBD based on fallout bugs from above
- Newer lower end Intel - not somewhere we are on. Will need to decide what to do there (do we need to get hardware?)
DirectComp
- on in AMD in Nightly, no complaints so far. Bug 1632239 extends that
- mstange is currently working on a fix that should help the 'grey line' issue
Other
- Nical asked about whether we have all the counters that we want in the on-screen profiler
- Improve tracy profiler integration
- Nical asked about texture cache eviction work. He’ll take a look at it after the API message stuff
April 7 2020
Agenda:
- Intel Gen 7-7.5 progress blockers
- Review other bugs in wr-perf-p1
- Guardian covid bug? https://bugzilla.mozilla.org/show_bug.cgi?id=1627458
Notes & Next Steps:
- For Intel Gen 7-7.5 blockers
- Dzmitry's scatter mode helps with GPU caching stalls on angle
- Can we use this for all Windows? Would be nice to have the same path everywhere
- We will do a test to see if this can be enabled for all Windows
- Vertex data textures cause CPU
- GW has workaround for this, seems like reasonable enough solution for now
- There are some cases when scrolling down, when youtube creates and deletes DOM elements and that trips up picture caching
- GW fixed the main cases, will get those landed.
After meeting, GW will kick-off a try build with these fixes included and then we can re-test to see how things are looking for this target.
- Partial Present?
- In Beta
- Do we have cases where people have been testing DC off and partial present on in Nightly?
- Presumably people who have the hardware stretching issue have that path
- We will keep eye out for any other bugs
In general, lowering CPU times is valuable.
- Bug 1623669
- GW to write up some notes here - general optimization, doesn't block Intel
- Miko's idea for a bigger refactor
- One approach - get rid of gecko transform display items, bring this WR style spatial tree into gecko, share it between gecko and wr (not an easy change to make but benefits might be nice)
- Big project - would need to work out how to take steps towards that.
- Gecko supplying spatial IDs and making them persistent would be a good starting point to investigate and get initial wins
- Miko to write out a bug with some ideas and we can figure out next steps/how to prototype
- Nical - How hard to have dirty rects per tile?
- Not sure. Ideally not bigger than 512x512
- Nical make a bug about investigating this - rasterizing in bigger chunks.
- Guardian Covid map bug
- No clear straightforward wins there. Chrome is also not great on that page (still seems a bit better than FF)
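On Nical's question about dirty rects per tile, the basic arithmetic is to split one dirty rect into the parts that fall in each tile. The sketch below assumes 512x512 tiles on a grid anchored at the origin, taken from the "not bigger than 512x512" remark; `Rect`, `TILE_SIZE` and `dirty_rects_per_tile` are hypothetical names, and rects are half-open (`x1` and `y1` are exclusive).

```rust
/// Axis-aligned rect in device pixels; `x1` and `y1` are exclusive.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Rect {
    pub x0: i32,
    pub y0: i32,
    pub x1: i32,
    pub y1: i32,
}

/// Assumed tile size for this sketch.
pub const TILE_SIZE: i32 = 512;

/// Splits `dirty` into one clipped rect per touched tile.
/// Each entry is `((tile_x, tile_y), dirty rect clipped to that tile)`.
pub fn dirty_rects_per_tile(dirty: Rect) -> Vec<((i32, i32), Rect)> {
    let mut out = Vec::new();
    if dirty.x0 >= dirty.x1 || dirty.y0 >= dirty.y1 {
        return out; // Empty rect: nothing to invalidate.
    }
    // div_euclid rounds toward negative infinity, so negative
    // coordinates map to the correct tile as well.
    let tx0 = dirty.x0.div_euclid(TILE_SIZE);
    let ty0 = dirty.y0.div_euclid(TILE_SIZE);
    let tx1 = (dirty.x1 - 1).div_euclid(TILE_SIZE);
    let ty1 = (dirty.y1 - 1).div_euclid(TILE_SIZE);
    for ty in ty0..=ty1 {
        for tx in tx0..=tx1 {
            let clipped = Rect {
                x0: dirty.x0.max(tx * TILE_SIZE),
                y0: dirty.y0.max(ty * TILE_SIZE),
                x1: dirty.x1.min((tx + 1) * TILE_SIZE),
                y1: dirty.y1.min((ty + 1) * TILE_SIZE),
            };
            out.push(((tx, ty), clipped));
        }
    }
    out
}
```

For example, a 20x20 dirty rect that straddles the boundary between two tiles yields two small per-tile rects rather than invalidating both tiles in full.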
March 19 2020
Improving Texture caching seems like the best bet for improving perf to unblock Intel Gen 7.5
- Nical has submitted: https://phabricator.services.mozilla.com/D67575
- Once that lands, Jrmuizel to re-measure perf and see how things stack up
- GW to see if he can repro the jerky scroll (frame skipping) on YouTube homepage on his older hardware (https://bugzilla.mozilla.org/show_bug.cgi?id=1576637)
- Nical also to spin off individual bugs for low-hanging texture caching fruit, so we can see what can be potentially done in parallel
Other tasks to prioritize this Quarter likely for GW:
- Segment building - @GW is there a bug on file for this? https://bugzilla.mozilla.org/show_bug.cgi?id=1611908
DirectComposition
- How to ship on older hardware?
- Markus, Jeff and Glenn to discuss - see if there is a way to make the artifacting a bit better
- Jrmuizel to see if we can't learn more about why Chrome doesn't ship on AMD
General
- Jrmuizel to take a look more at gen7 Intel hardware to see if there is anything we should do to unblock there
- jbonisteel to take a look at the WR future doc, potentially set up a triage meeting to discuss more