Platform/GFX/WRperfmeetings

Meeting notes for our wr-perf sync sessions.

Q3

Jul 2

Attendees: GW, JM, JCB

Glenn is taking on the list of work below, which is somewhat entangled at the moment
Some of this work will become less tied together, so more can be done in parallel, but it is not yet clear when
First goals: C and E, likely with diversions into pieces of other work to get those done
Expected outcomes of this work: better performance, better rendering quality in some cases, and simplicity
- Outstanding question: how can we measure this work as we go to make sure we are moving things in the right direction?

(A) Text Runs

Draw text runs to persistent texture cache, blit as image.

(B) Support multiple child tasks for a given picture.

Limit render task size to picture cache tile size.

(C) Simplify clip API and implementation.

WR clip API is much more expressive than needed.

(D) Optimize GPU clipping implementation.

Take advantage of reduced clipping scope. 
Handle large clip masks that are mostly redundant. 
Gradient clip mask primitive.

(E) Visibility / clip-chain building optimizations and caching.

Most work we do here per-frame is redundant.

(F) Reduce / remove GPU cache and vertex texture usage.

(G) Scene Building profiling and optimization.

(I) Segment building

Move to scene building / interning pass. 
Simpler implementation (nine-patch / fixed grid only)
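As a rough illustration of the "nine-patch / fixed grid only" idea, the sketch below splits a rectangle into a fixed 3x3 grid from four edge insets. The `Rect` type, the `nine_patch` function and its inset parameters are hypothetical names chosen for this example, not WebRender's actual segment builder, and the sketch assumes the insets do not overlap.

```rust
/// Axis-aligned rectangle; a hypothetical stand-in, not a WebRender type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x0: f32,
    pub y0: f32,
    pub x1: f32,
    pub y1: f32,
}

/// Split `outer` into a fixed 3x3 grid using the given edge insets.
/// Segments are returned row by row (top-left first). Zero-area segments
/// are kept so the grid shape is always the same.
/// Assumption: left + right <= width and top + bottom <= height.
pub fn nine_patch(outer: Rect, left: f32, top: f32, right: f32, bottom: f32) -> [Rect; 9] {
    let xs = [outer.x0, outer.x0 + left, outer.x1 - right, outer.x1];
    let ys = [outer.y0, outer.y0 + top, outer.y1 - bottom, outer.y1];
    let mut segments = [outer; 9];
    for row in 0..3 {
        for col in 0..3 {
            segments[row * 3 + col] = Rect {
                x0: xs[col],
                y0: ys[row],
                x1: xs[col + 1],
                y1: ys[row + 1],
            };
        }
    }
    segments
}
```

With a fixed grid, the centre segment (index 4) is the part that needs no corner handling, while the eight outer segments are the only ones that may need a clip mask.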

(J) Unify shaders

Unify text and image shaders, others possible too. 
Reduce batching cost and renderer cost.

(K) Faster instance building

Build directly into mapped buffers in render backend etc

(L) Simplify raster root calculations and generation.

Quality improvements 
Reduce cost of picture traversal / bounds checks

(N) Optimize how shared_clips are generated and handled per-frame.

Work that is not entangled, can be done independently:

(H) Consider replacing rayon with a simpler crossbeam-channel approach.

(M) Box shadow rendering optimizations / simplifications

Remove from being a clip kind and make into a normal primitive.

(O) General optimization work of frame builder.

Tighter loops for each part of the frame.
Memory allocations / freelists etc
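For readers unfamiliar with the freelist pattern mentioned here, a minimal sketch follows. It reuses freed slots before growing the backing storage, which avoids repeated allocation in per-frame loops. The `FreeList` type and its methods are hypothetical and written only for this note; they are not taken from the WebRender code base.

```rust
/// A minimal index-based free list: freed slots are reused before the
/// backing Vec grows. Hypothetical illustration only.
pub struct FreeList<T> {
    slots: Vec<Option<T>>,
    free: Vec<usize>,
}

impl<T> FreeList<T> {
    pub fn new() -> Self {
        FreeList { slots: Vec::new(), free: Vec::new() }
    }

    /// Store `value`, reusing a freed slot when one is available.
    /// Returns the index of the slot that now holds the value.
    pub fn insert(&mut self, value: T) -> usize {
        match self.free.pop() {
            Some(index) => {
                self.slots[index] = Some(value);
                index
            }
            None => {
                self.slots.push(Some(value));
                self.slots.len() - 1
            }
        }
    }

    /// Remove and return the value at `index`, marking the slot reusable.
    /// Returns None for out-of-range or already-empty slots.
    pub fn remove(&mut self, index: usize) -> Option<T> {
        let value = self.slots.get_mut(index)?.take();
        if value.is_some() {
            self.free.push(index);
        }
        value
    }

    /// Borrow the value at `index`, if the slot is occupied.
    pub fn get(&self, index: usize) -> Option<&T> {
        self.slots.get(index)?.as_ref()
    }

    /// Number of slots ever allocated (occupied plus free).
    pub fn slot_count(&self) -> usize {
        self.slots.len()
    }
}
```

A plain index is used as the handle to keep the sketch short; a real implementation would likely add a generation counter so stale handles can be detected.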

(P) Glyph cache work - one issue is the way we tie the lifetime of CPU glyphs to GPU glyphs in the texture cache. Also general batching issues.

Q2

May 19

GW is convinced that text run optimizations are worth doing. That could be a significant win for CPU time.
- GW will focus on that in a week or two. https://bugzilla.mozilla.org/show_bug.cgi?id=1639346
Data target init is taking up a lot of time. It would be valuable to understand why, and why GPU times are so high in the first flush.
- BP is going to investigate that https://bugzilla.mozilla.org/show_bug.cgi?id=1639336
Why do batches break? This could be a missing piece of the puzzle. Nical to add it to the list for investigation again
JM to reach out to more reporters of bugs on old hardware and continue work on getting decent measurements
DM is working on avoiding uploading instance data for each draw call; this is blocked by the ANGLE update. He will mark his task as blocking the frame building meta bug

May 12 2020

GPU View walkthrough meeting

GPU View tips:

... TODO magic symbol lookup string to make Firefox symbols work
Ctrl-Z to zoom in, Z to zoom out; '+' to add threads;
Select an object ID and use "Lookup" to find details about it, eg. a texture allocation;
Filter by thread ID in the Event Viewer to focus on a single (scene builder) thread;

GPU frequency ramp-up delay:

can take up to a second before GPU is at full throttle again after low amounts of work;
- talking to Intel on getting any details on clock change latency;
the better we optimize fast frames, the worse this actually gets;
make the cheap frames more expensive and the expensive ones cheaper? eg. render more tiles outside the viewport?
keep fixing bugs that cause spike frames -- unneeded invalidations;
is there a "prefetch" API to give the driver/GPU a heads-up? we have 9ms to do it.
- (do mobile-first APIs eg. Apple have something like this? or just really good schedulers?)

Delay between submitting CPU work and GPU doing it:

manually force more flushes, see what happens (eg. using GetData to avoid Angle ignoring it);
identify source of 1ms delay in the submit; help from Intel and/or experiment in stand-alone sample;
start submitting earlier -- overlap frame builder and scene builder threads;
- will be easier once they're a single thread again;
- starting earlier is limited by the need to sample at vsync;
- no obvious "bookkeeping" style GPU work we could start earlier;
  - clears are a maybe -- but beware of tiled architectures;
unlikely to find much work there that can be split off -- it's mostly talking to the (single threaded) GL context;

Submission cost:

Angle overhead; is Update Resource leading to Maps?
- try switching from uniforms to UBOs;
- try removing uniforms altogether: transform is unneeded, other state can be folded into shader permutations;

Useful links:

May 5 2020

Agenda:

Status for Intel gen 7.5 and 7 - any new bugs/issues/things to worry about?
Intel gen 6 - how is that shaping up? How far off from being shippable?
Any other Intel targets we should consider looking at more closely right now?
Discuss any other useful tasks/bugs to fix people have in mind that would be beneficial in general

Notes:

We should still try to make things better for older Intel generations, though how much is unclear
There are plenty of ideas right now for how we could do so, but we don't know which, if any, will give us the clearest wins

Glenn's ideas:

We still spend time updating vertex buffers and the GPU cache - try to improve this, or reduce our usage
Port GPU scatter updates to vertex data updates
Change how we map/update vertex buffer objects
Change batching strategy
Unify shader cases
Rather than drawing each glyph, treat text runs like a cached render task

In particular, 2 of the ideas will bring improvements to other platforms/hardware as well:

using 1:1 texture for solid rects in the alpha pass - will help batching on all platforms
share instance data between frame builder and renderer so fb can write directly. Should be easy to check if there are wins.

Next Steps:

We need to spend a bit more time investigating what the issues are
We have some 7.5 and 7 enabled in Nightly
- Need to spend more time profiling so we can get a clearer picture of which idea will help the most
- Glenn can profile on 7.5; the older gens probably have the same issues at different magnitudes
Non-4K stuff needs more looking at
Annotate the ideas listed above with their edge cases and complexity
Likely that frame building bugs will help in general
Jeff to put up some profiles on gen 6
Look at screen resolution of 7.5 gpus -> https://sql.telemetry.mozilla.org/queries/70757/source#178018
Invalidation is pretty high on low resolution. See bug 374980 (fixing this should help a lot)

April 20 2020

Agenda:

We would like to ensure we are in a shippable state for 7.5, 7 and 6 by the end of H1 (end of June)
- https://wiki.mozilla.org/Platform/GFX/WebRender_Where
Intel Gen 7.5 - to confirm: is there anything else to do there, or are we in a shippable state now? There are some further tweaks on the radar, but my understanding is that we are 'good enough' now. We want to prioritize work that will give us improvements for the rest of this grouping of Intel
Intel Gen 7 - what do we need to do there to ship?
Intel Gen 6
- What will unblock us there? Looks like 7% of pop, so we should agree how much time is reasonable to spend

Notes:

Intel unblocking

We will enable WR on 7.5 and at least some % of 7
We need to do more profiling and testing for 6 to get a sense of how far off we are there
- jbonisteel and jrmuizel to figure out where to prioritize that in the coming weeks
We can likely also try enabling WR for Win7 soon, maybe start of 78 TBD based on fallout bugs from above
Newer lower-end Intel - WR is not on there yet. We will need to decide what to do there (do we need to get hardware?)

DirectComp

On for AMD in Nightly, no complaints so far. Bug 1632239 extends that
mstange is currently working on a fix that should help the 'grey line' issue

Other

Nical asked about whether we have all the counters that we want in the on-screen profiler
Improve tracy profiler integration
Nical asked about texture cache eviction work. He’ll take a look at it after the API message stuff

April 7 2020

Agenda:

Intel Gen 7-7.5 progress blockers
Review other bugs in wr-perf-p1
- Guardian covid bug? https://bugzilla.mozilla.org/show_bug.cgi?id=1627458

Notes & Next Steps:

For Intel Gen 7-7.5 blockers
- Dzmitry's scatter mode helps with GPU caching stalls on angle
  - Can we use this for all Windows? Would be nice to have the same path everywhere
  - We will do a test to see if this can be enabled for all Windows
- Vertex data textures cause CPU overhead
  - GW has workaround for this, seems like reasonable enough solution for now
- In some cases when scrolling down, YouTube creates and deletes DOM elements, which trips up picture caching
  - GW fixed the main cases, will get those landed.

After meeting, GW will kick-off a try build with these fixes included and then we can re-test to see how things are looking for this target.

Partial Present?
- In Beta
- Do we have cases where people have been testing with DC off and partial present on in Nightly?
  - Presumably people who have the hardware stretching issue have that path
- We will keep an eye out for any other bugs

In general, lowering CPU times is valuable.

Bug 1623669
- GW to write up some notes here - general optimization, doesn't block Intel

Miko's idea for a bigger refactor
- One approach - get rid of Gecko transform display items, bring the WR-style spatial tree into Gecko, and share it between Gecko and WR (not an easy change to make, but the benefits might be nice)
- Big project - would need to work out how to take steps towards that.
- Gecko supplying spatial IDs and making them persistent would be a good starting point to investigate and get initial wins
- Miko to write out a bug with some ideas and we can figure out next steps/how to prototype

Nical - How hard would it be to have dirty rects per tile?
- Not sure. Ideally not bigger than 512x512
- Nical to file a bug about investigating this - rasterizing in bigger chunks.
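To make the "dirty rects per tile" question concrete, here is a small sketch that intersects one invalidated rectangle with a fixed tile grid and returns a separate dirty rect for each tile it touches. The `DeviceRect` type, the `dirty_rects_per_tile` function and the choice of passing the tile size as a parameter are assumptions made for illustration; the notes do not describe how WR would implement this.

```rust
/// Integer device-space rectangle with an exclusive max edge.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DeviceRect {
    pub x0: i32,
    pub y0: i32,
    pub x1: i32,
    pub y1: i32,
}

/// For every tile of a `tile_size` x `tile_size` grid touched by `dirty`,
/// return the tile coordinate and the part of `dirty` inside that tile.
/// Returns an empty Vec for an empty rect or a non-positive tile size.
pub fn dirty_rects_per_tile(dirty: DeviceRect, tile_size: i32) -> Vec<((i32, i32), DeviceRect)> {
    if tile_size <= 0 || dirty.x1 <= dirty.x0 || dirty.y1 <= dirty.y0 {
        return Vec::new();
    }
    // div_euclid rounds towards negative infinity, so negative
    // coordinates map to the correct tile as well.
    let tx0 = dirty.x0.div_euclid(tile_size);
    let ty0 = dirty.y0.div_euclid(tile_size);
    let tx1 = (dirty.x1 - 1).div_euclid(tile_size);
    let ty1 = (dirty.y1 - 1).div_euclid(tile_size);

    let mut out = Vec::new();
    for ty in ty0..=ty1 {
        for tx in tx0..=tx1 {
            let rect = DeviceRect {
                x0: dirty.x0.max(tx * tile_size),
                y0: dirty.y0.max(ty * tile_size),
                x1: dirty.x1.min((tx + 1) * tile_size),
                y1: dirty.y1.min((ty + 1) * tile_size),
            };
            out.push(((tx, ty), rect));
        }
    }
    out
}
```

Because each returned rect is clipped to its tile, no per-tile dirty rect can exceed the tile size, so a 512 tile size would keep every rect within the 512x512 bound mentioned above.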

Guardian Covid map bug
- No clear, straightforward wins there. Chrome is also not great on that page (though it still seems a bit better than FF)

March 19 2020

Improving Texture caching seems like the best bet for improving perf to unblock Intel Gen 7.5

Nical has submitted: https://phabricator.services.mozilla.com/D67575
Once that lands, Jrmuizel to re-measure perf and see how things stack up
GW to see if he can repro the jerky scroll (frame skipping) on YouTube homepage on his older hardware (https://bugzilla.mozilla.org/show_bug.cgi?id=1576637)
Nical also to spin off individual bugs for low-hanging texture caching fruit, so we can see what can be potentially done in parallel

Other tasks to prioritize this Quarter likely for GW:

Segment building - @GW is there a bug on file for this? https://bugzilla.mozilla.org/show_bug.cgi?id=1611908

DirectComposition

How to ship on older hardware?
Markus, Jeff and Glenn to discuss - see if there is a way to make the artifacting a bit better
Jrmuizel to see if we can't learn more about why Chrome doesn't ship on AMD

General

Jrmuizel to take a look more at gen7 Intel hardware to see if there is anything we should do to unblock there
jbonisteel to take a look at the WR future doc, potentially set up a triage meeting to discuss more
