In the field of facial editing, existing methods mostly rely on optimization-based techniques to produce edits. While such methods achieve good editing quality, they are far less efficient than direct inference. Direct inference, however, is hard to obtain: the lack of suitable training data makes supervised learning impractical for facial editing. We propose a two-stage self-supervised deep learning method that overcomes this obstacle and makes direct inference possible.

To handle facial mesh data effectively, we introduce a random clustering method that partitions the entire facial mesh into blocks containing equal numbers of faces. This fine-grained partition lets the model capture richer facial feature information while reducing human bias, improving its expressiveness and adaptability. The model adopts an encoder-decoder architecture. During pre-training, random facial blocks are masked; the encoder extracts features from the unmasked blocks, and the decoder attempts to reconstruct the masked ones. This forces the model to learn the relationships between facial blocks rather than merely memorize the features within each block. In the fine-tuning stage, we introduce the concept of control points and train on two sets of different 3D facial meshes. By freezing the pre-trained encoder and combining its features with control-point features, the model learns to generate precise facial edits conditioned on the given control points. To optimize performance, we design a new loss function that enforces both the accuracy and the structural integrity of the edited result. Experiments show that our model generates a high-quality edited facial mesh in about 0.01 seconds, substantially improving the efficiency and interactivity of facial editing.
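The equal-face partitioning described above can be sketched as follows. This is a minimal illustration that enforces only the equal-size constraint; a real mesh clustering would also grow spatially contiguous regions over the face-adjacency graph. All names and sizes here are hypothetical, not taken from the thesis implementation.

```python
import numpy as np

def random_equal_blocks(num_faces: int, num_blocks: int, seed: int = 0) -> np.ndarray:
    """Randomly partition face indices into blocks of equal size.

    Returns an array of shape (num_blocks, num_faces // num_blocks),
    where each row lists the face indices assigned to one block.
    """
    assert num_faces % num_blocks == 0, "faces must divide evenly into blocks"
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_faces)          # random assignment of faces
    return perm.reshape(num_blocks, num_faces // num_blocks)

# e.g. a mesh with 9000 faces split into 100 blocks of 90 faces each
blocks = random_equal_blocks(num_faces=9000, num_blocks=100)
```

Because every block holds the same number of faces, each block can be flattened into a fixed-length token, which is what makes the transformer-style masking in pre-training straightforward.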
Our proposed self-supervised deep learning method achieves significant improvements in editing quality and ease of operation over previous methods, opening new possibilities for real-time facial editing applications.
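The two stages can be sketched with toy linear maps standing in for the encoder and decoder. This is an illustration of the training signal only: the pooled-context shortcut, the array sizes, and the control-point dimensionality are assumptions for the sketch, not the thesis architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_BLOCKS, BLOCK_DIM, LATENT = 100, 810, 128   # 100 blocks, 810 numbers per block (assumed)
MASK_RATIO, CONTROL = 0.5, 16                    # mask half the blocks; 16-dim control feature

# Toy linear "encoder"/"decoder" standing in for the real networks.
W_enc = rng.normal(scale=0.01, size=(BLOCK_DIM, LATENT))
W_dec = rng.normal(scale=0.01, size=(LATENT, BLOCK_DIM))

def pretrain_step(blocks: np.ndarray) -> float:
    """One masked-reconstruction step: encode visible blocks, decode masked ones."""
    num_masked = int(MASK_RATIO * NUM_BLOCKS)
    masked = rng.choice(NUM_BLOCKS, size=num_masked, replace=False)
    visible = np.setdiff1d(np.arange(NUM_BLOCKS), masked)
    latent = blocks[visible] @ W_enc             # encoder sees only unmasked blocks
    context = latent.mean(axis=0)                # pooled context (stand-in for attention)
    recon = np.tile(context @ W_dec, (num_masked, 1))
    return float(np.mean((recon - blocks[masked]) ** 2))  # loss on masked blocks only

def edit_step(blocks: np.ndarray, control: np.ndarray, W_head: np.ndarray) -> np.ndarray:
    """Fine-tuning/inference: frozen-encoder features + control-point features,
    one forward pass (no per-edit optimization loop)."""
    latent = (blocks @ W_enc).mean(axis=0)       # W_enc is frozen after pre-training
    cond = np.concatenate([latent, control])
    return (cond @ W_head).reshape(blocks.shape)

blocks = rng.normal(size=(NUM_BLOCKS, BLOCK_DIM))
loss = pretrain_step(blocks)
W_head = rng.normal(scale=0.01, size=(LATENT + CONTROL, NUM_BLOCKS * BLOCK_DIM))
edited = edit_step(blocks, rng.normal(size=CONTROL), W_head)
```

The point of the sketch is the division of labor: pre-training supervises only the masked blocks, forcing inter-block reasoning, while editing is a single conditioned forward pass, which is what enables the roughly 0.01-second inference time reported above.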