CHOI HONGSU
1 min read

PostProcessing Bloom optimization

 

Migrated Compute Shader-based Bloom to Fragment Shader

  • 85% reduction in COSPostProcessing GPU time (8.37 ms → 1.26 ms)
  • 91% reduction in Bloom RT memory (16.6 MB → 1.4 MB)
  • 15.1 MB reduction in total RenderTexture memory (95.0 MB → 79.9 MB)

Approach

Fragment Shader migration

Chosen

Pros

  • Async dependency removed, pipeline simplified, RT format freedom

Cons

  • Requires full replacement

Keep Compute Shader + fix the bug

Pros

  • Minimal change

Cons

  • Removing frame latency erases the Async advantage; structural debt remains

Why the Fragment Shader migration was chosen: removing the frame latency to fix the bug erases Compute's performance advantage. The migration was also an opportunity to clean up the History RT, GC allocations, and other accumulated tech debt.

Implementation

  1. Step 1: Compute → Fragment migration

    Replaced Bloom Passes 4 (Prefilter), 5 (DownSample), and 6 (UpSample) with Fragment-Shader-based equivalents.

    • RenderGraph: ComputePass → RasterRenderPass chain
    • Compatibility mode: DispatchCompute → DrawProcedural blit
    • enableRandomWrite = false
  2. Step 2: History RT removal + direct hand-off

    Removed the persistent COSBloomHistoryFrameRT system. The Bloom result is now handed off directly through the bloomResultTexture field.
    → Frees 4.15 MB of always-resident memory and removes the 1-frame latency.

  3. Step 3: RT format and resolution reduction

  4. Step 4: Sampling and interpolator reduction

  5. Step 5: Remove GC allocations

    Replaced per-frame array allocations (new) with fixed-size cached fields.

  6. Step 6: Bug fixes (alpha, PreMiscPass)

    PreMiscPass: fully skipped when miscActivated == false.
    (Once Distortion/RadialBlur are confirmed unused, the pass will be removed.)
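
The 1-frame latency removed in Step 2 can be pictured with a toy model. The sketch below is plain Python rather than engine code, and every name in it (`run_with_history`, `run_direct`, the string frame labels) is hypothetical. It only illustrates why compositing from a persistent history RT shows the previous frame's Bloom, while a direct hand-off shows the current frame's.

```python
def run_with_history(frames):
    """Composite each frame with the Bloom stored in a persistent history slot.

    The slot is written after compositing, so frame N is composited with
    the Bloom computed from frame N-1 (None on the very first frame).
    """
    history = None  # stands in for a persistent history RT
    used = []
    for frame in frames:
        used.append(history)         # composite reads last frame's Bloom
        history = f"bloom({frame})"  # this frame's Bloom lands in history afterwards
    return used


def run_direct(frames):
    """Composite each frame with the Bloom computed in the same frame."""
    used = []
    for frame in frames:
        bloom_result = f"bloom({frame})"  # stands in for a directly handed-off result
        used.append(bloom_result)
    return used


frames = ["f0", "f1", "f2"]
print(run_with_history(frames))  # [None, 'bloom(f0)', 'bloom(f1)']
print(run_direct(frames))        # ['bloom(f0)', 'bloom(f1)', 'bloom(f2)']
```

In the history variant the extra slot also has to stay allocated between frames, which is the always-resident memory that Step 2 frees.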

Validation

Tools: Memory Profiler / AGI · Build: Dev · Scene:  

Device | GPU | API | RenderTexture (before) | RenderTexture (after) | COSPostProcessing GPU time (before) | COSPostProcessing GPU time (after)
Galaxy S21 | Mali-G78 | Vulkan 1.1.0 | 95.0 MB | 79.9 MB (-15.1 MB) | 8.368 ms | 1.263 ms (−85%)

Unity Memory Profiler (Galaxy S21)

[Screenshots: before / after]

Android GPU Inspector (Galaxy S21)

[Screenshots: before / after]


RT memory

RT | Before | After | Notes
HistoryTexture × 2 | 10.0 MB | 0 MB | System removed
TempRTBloom0 | 5.0 MB | 1.0 MB | Format + resolution
TempRTBloom1 | 1.3 MB | 395.5 KB |
TempRTBloom2 | 342.9 KB | 52.9 KB |
Bloom RT total | 16.6 MB | 1.4 MB (-91%) |
Total RenderTexture | 95.0 MB | 79.9 MB (-15.1 MB) |
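
As a cross-check, the per-RT rows can be summed to reproduce the totals. This is a small arithmetic sketch in Python, not part of the project. It assumes the profiler reports binary units (1 MB = 1024 KB), and the helper name `to_mb` is made up for illustration.

```python
KB_PER_MB = 1024  # assumption: the profiler reports binary units


def to_mb(value, unit):
    """Convert a (value, unit) pair from the table to megabytes."""
    return value / KB_PER_MB if unit == "KB" else value


# (before, after) per Bloom render texture, copied from the table above
bloom_rts = {
    "HistoryTexture x2": ((10.0, "MB"), (0.0, "MB")),
    "TempRTBloom0": ((5.0, "MB"), (1.0, "MB")),
    "TempRTBloom1": ((1.3, "MB"), (395.5, "KB")),
    "TempRTBloom2": ((342.9, "KB"), (52.9, "KB")),
}

before_mb = sum(to_mb(*before) for before, _ in bloom_rts.values())
after_mb = sum(to_mb(*after) for _, after in bloom_rts.values())
reduction_pct = (1 - after_mb / before_mb) * 100

print(f"Bloom RT total: {before_mb:.1f} MB -> {after_mb:.1f} MB ({reduction_pct:.0f}% less)")
print(f"Total RenderTexture delta: {95.0 - 79.9:.1f} MB")
```

The sums come out to 16.6 MB and 1.4 MB, a 91% reduction, matching the reported totals. The 15.1 MB drop in total RenderTexture memory is in line with the Bloom RT savings.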

GPU performance (Android GPU Inspector)

Item | Before | After
COSPostProcessing GPU time | 8.368 ms | 1.263 ms (−85%)
Base resolution | 1/2 | 1/4
DownSample samples | 5/pixel | 4/pixel
UpSample samples | 8/pixel | 4/pixel
UpSample Interpolator | 18 floats | 8 floats
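
The figures in this table can be related with some quick arithmetic. The Python sketch below is a rough estimate, not a profiler result. It assumes the base resolution is a per-axis scale (so 1/2 means a quarter of the pixels), and that texture fetches at the base level are a reasonable proxy for work. The function names are hypothetical.

```python
def reduction_pct(before, after):
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100


def relative_fetches(axis_scale, samples_per_pixel):
    """Texture fetches for one pass, relative to one fetch per full-resolution pixel.

    Assumes `axis_scale` applies to both width and height, so the pixel
    count scales with its square.
    """
    return axis_scale ** 2 * samples_per_pixel


# GPU time from the table (Android GPU Inspector, Galaxy S21)
print(f"GPU time: -{reduction_pct(8.368, 1.263):.1f}%")

for name, before_samples, after_samples in [("DownSample", 5, 4), ("UpSample", 8, 4)]:
    before = relative_fetches(1 / 2, before_samples)
    after = relative_fetches(1 / 4, after_samples)
    print(f"{name}: {before:.2f} -> {after:.2f} fetches per full-res pixel "
          f"({reduction_pct(before, after):.1f}% fewer)")
```

Under these assumptions the base-level fetch count drops by 80.0% for DownSample and 87.5% for UpSample, in the same range as the measured 84.9% drop in GPU time. Fetch count is only one factor, so this is a plausibility check rather than an explanation of the full saving.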

Before / After

[Image comparison: Bloom, before / after]

Tradeoffs & Future Work

Tradeoffs

  • Quarter base resolution can lose bloom detail on close-up large objects. If needed, restore to 1/2 by changing the ReduceTextureDescSize argument from 2 → 1.
  • Compute Shader removal requires resetting the URP Inspector's Compute Shader slot (one-time).
  • COSBlurIsolatedFeature uses a separate kernel and is unaffected by this change.
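
The resolution tradeoff in the first bullet can be made concrete with a small sketch. It assumes the `ReduceTextureDescSize` argument acts as a per-axis right shift, which is consistent with 2 giving 1/4 and 1 giving 1/2 but is not confirmed by the source. `reduced_size` is a hypothetical stand-in written in Python, and the 2400×1080 input is only an example size.

```python
def reduced_size(width, height, shift):
    """Hypothetical stand-in: halve each axis `shift` times, never below 1 pixel."""
    return max(1, width >> shift), max(1, height >> shift)


full = (2400, 1080)  # example render size, an assumption
for shift in (2, 1):
    w, h = reduced_size(*full, shift)
    share = (w * h) / (full[0] * full[1])
    print(f"shift={shift}: {w}x{h}, {share:.4f} of full-resolution pixels")
```

If the assumption holds, restoring the base level to 1/2 quadruples its pixel count, so the base Bloom RT's memory and per-pass work would grow by roughly the same factor at an unchanged format.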

Conclusion

The Async buffer-overlap structure of the Compute Shader was the root cause of buffer collisions in dialog / screen-overlay environments.
Removing the frame latency to fix the bug erased Compute's performance advantage. The migration to a Fragment Shader was therefore also used to clean up accumulated structural debt: the History RT, GC allocations, and dead code.

Result: COSPostProcessing GPU time −85% (8.37 ms → 1.26 ms),
Bloom RT memory −91% (16.6 MB → 1.4 MB), 3 rendering bugs fixed.
Visual quality remained within the acceptable range per QA review.

This was a case where a bug fix triggered a refactor: digging into the root cause often opens up far more room for improvement than a localized fix.

Tags

Bloom
PostProcessing Bloom optimization · Cookie Run: Oven Smash · Choi Hongsu · Hongsu