SyncNoise: Geometrically Consistent Noise Prediction for Text-based 3D Scene Editing

1Hong Kong Polytechnic University

2Stability AI

3Johns Hopkins University


SyncNoise enables high-quality, controllable editing that closely adheres to the textual instructions while making minimal changes to irrelevant regions. It attains geometrically consistent edits without compromising fine-grained textures.


Abstract

Text-based 2D diffusion models have demonstrated impressive capabilities in image generation and editing. Building on these advances, such models also show substantial potential for 3D editing tasks. However, achieving consistent edits across multiple viewpoints remains a challenge. While iterative dataset update (IDU) can achieve global consistency, it often suffers from slow convergence and over-smoothed textures. To overcome these limitations, we propose SyncNoise, a novel geometry-guided multi-view consistent noise editing approach for high-fidelity 3D scene editing. Our method synchronously edits multiple views with 2D diffusion models while enforcing multi-view noise predictions to be geometrically consistent, which ensures global consistency in both semantic structure and low-frequency appearance. To further enhance local consistency in high-frequency details, we select a group of anchor views and propagate their edits to neighboring frames through cross-view reprojection. To improve the reliability of multi-view correspondences, we introduce depth supervision during training to enhance the reconstruction of precise geometries. By enhancing geometric consistency at both the noise and pixel levels, our method achieves high-quality 3D edits that respect the textual instructions, especially in scenes with complex textures.
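To make the noise-level constraint above more concrete, below is a minimal, illustrative sketch (not the released implementation) of averaging U-Net decoder features over geometrically corresponding pixels so that per-view noise predictions agree. The tensor shapes, correspondence format, and function name are assumptions made for this example.

# Illustrative sketch only: aligning U-Net decoder features across views by
# averaging features at geometrically corresponding pixels. The correspondence
# format and shapes are hypothetical placeholders, not the authors' code.
import torch

def align_features(feats, correspondences):
    """Average decoder features over corresponding pixels across views.

    feats:            (V, C, H, W) per-view decoder feature maps.
    correspondences:  list of (view_a, ya, xa, view_b, yb, xb) index tuples
                      marking pixels that observe the same 3D point.
    """
    aligned = feats.clone()
    for va, ya, xa, vb, yb, xb in correspondences:
        # Replace both features with their mean so the two views agree.
        mean = 0.5 * (feats[va, :, ya, xa] + feats[vb, :, yb, xb])
        aligned[va, :, ya, xa] = mean
        aligned[vb, :, yb, xb] = mean
    return aligned

# Toy usage: 2 views, 8-channel features on a 4x4 grid.
feats = torch.randn(2, 8, 4, 4)
corr = [(0, 1, 1, 1, 2, 2)]   # pixel (1,1) in view 0 <-> pixel (2,2) in view 1
aligned = align_features(feats, corr)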

Pipeline

Our method simultaneously edits multi-view images while enforcing geometric consistency at both the noise and pixel levels:

  1. We construct reliable correspondences based on precise 3D geometries.
  2. We enforce multi-view noise consistency by aligning U-Net decoder features across views.
  3. We use cross-view reprojection to maintain pixel-level consistency by propagating edited anchor views to their neighboring views (see the reprojection sketch after this list).
  4. To minimize reprojection artifacts, we refine these views with a 2D diffusion model.
  5. Finally, we update the 3D scene based on the edited multi-view images.
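As a concrete illustration of step 3, the following is a minimal sketch (not the authors' code) of propagating an edited anchor view to a neighboring view via depth-based reprojection. The pinhole-camera conventions, function name, and toy inputs are assumptions made for this example, and occlusion handling is omitted.

# Illustrative sketch only: warp edited anchor-view pixels into a neighboring
# view using per-pixel depth and camera matrices. Occlusions are ignored here;
# a z-buffer would be needed in practice.
import numpy as np

def reproject(anchor_rgb, anchor_depth, K, pose_anchor, pose_neighbor, out_hw):
    """Warp anchor-view pixels into the neighboring view.

    anchor_rgb:    (H, W, 3) edited anchor image.
    anchor_depth:  (H, W) per-pixel depth in the anchor camera.
    K:             (3, 3) shared pinhole intrinsics.
    pose_*:        (4, 4) camera-to-world matrices.
    out_hw:        (H', W') resolution of the neighboring view.
    """
    H, W = anchor_depth.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3)

    # Unproject anchor pixels to 3D points in world coordinates.
    cam_pts = (np.linalg.inv(K) @ pix.T).T * anchor_depth.reshape(-1, 1)
    cam_pts_h = np.concatenate([cam_pts, np.ones((cam_pts.shape[0], 1))], axis=1)
    world_pts = (pose_anchor @ cam_pts_h.T).T

    # Project the 3D points into the neighboring camera.
    neigh_pts = (np.linalg.inv(pose_neighbor) @ world_pts.T).T[:, :3]
    proj = (K @ neigh_pts.T).T
    uv = proj[:, :2] / np.clip(proj[:, 2:3], 1e-6, None)

    # Scatter anchor colors to the nearest pixel in the neighboring view.
    out = np.zeros((*out_hw, 3), dtype=anchor_rgb.dtype)
    u, v = np.round(uv[:, 0]).astype(int), np.round(uv[:, 1]).astype(int)
    valid = (u >= 0) & (u < out_hw[1]) & (v >= 0) & (v < out_hw[0]) & (neigh_pts[:, 2] > 0)
    out[v[valid], u[valid]] = anchor_rgb.reshape(-1, 3)[valid]
    return out

# Toy usage: identical cameras, so the warp reproduces the anchor image.
H, W = 4, 4
rgb = np.random.rand(H, W, 3)
depth = np.ones((H, W))
K = np.array([[2.0, 0.0, 2.0], [0.0, 2.0, 2.0], [0.0, 0.0, 1.0]])
warped = reproject(rgb, depth, K, np.eye(4), np.eye(4), (H, W))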


Results

Qualitative comparisons. Our SyncNoise delivers more consistent (e.g., "rainbow table"), finer-grained (e.g., "wood carving", "Spiderman"), and more faithfully instruction-following 3D editing (e.g., "Iron Man wearing the helmet", "robot", "Thanos"), with minimal changes to irrelevant regions.








Citation

If you use this work or find it helpful, please consider citing it with the following BibTeX entry:

@article{li2024syncnoise,
         author = {Li, Ruihuang and Chen, Liyi and Zhang, Zhengqiang and Jampani, Varun and Patel, Vishal M. and Zhang, Lei},
         title = {SyncNoise: Geometrically Consistent Noise Prediction for Text-based 3D Scene Editing},
         journal = {arXiv preprint},
         year = {2024},
        }



This website is constructed using the source code provided by Instruct-Nerf2Nerf, and we are grateful for their template.