
Enhanced 3D Virtual Try-On with Residuals

The document presents a project on improving 3D virtual try-on technology by integrating residual connections into the existing M3D-VTON synthesis model. This enhancement aims to better differentiate between the front and back parts of clothing, preserve logos, and reduce artifacts, resulting in more realistic 2D and 3D outputs. Experimental results demonstrate that the proposed method significantly outperforms previous models on the MPV3D dataset.


Monocular-to-3D Virtual Try-On using Deep Residual U-Net

Hasib Zunair, ID: 40126681


COMP 6381 Digital Geometric Modeling Project Paper, Fall 2021
Concordia University
hasibzunair@[Link]

Abstract

3D virtual try-on aims to synthetically fit a target clothing image onto a 3D human shape while preserving realistic details such as pose and identity of the person. Existing methods heavily depend on annotated 3D shapes and garment templates, which limits their practical use. While 2D virtual try-on is another alternative, it ignores the 3D body information and cannot fully represent the human body. Recently, M3D-VTON was proposed to generate textured 3D try-on meshes only from 2D images of person and clothing by formulating the 3D try-on problem as 2D try-on and depth estimation. However, we find that the synthesis model in the M3D-VTON pipeline uses a simple U-Net architecture. We hypothesize that this is insufficient to synthesize body parts and model the complex relation between front and back parts of clothing only from the 2D clothing image, ultimately leading to unrealistic 3D try-on results. We improve this by implementing residual units in the existing synthesis model. Studying its effect demonstrates that it improves 2D try-on outputs, mainly by differentiating between front and back parts of clothing, preserving the logo of clothing and reducing artifacts. This ultimately results in a better textured 3D try-on mesh. Benchmarking our method on the MPV3D dataset shows that it performs significantly better than previous works. Code is available at https://[Link]/resm3dvton/.

1. Introduction

3D virtual try-on aims to synthetically fit a target clothing image onto a 3D human shape while preserving realistic details such as pose and identity of the person. Existing methods heavily depend on annotated 3D shapes and garment templates, which limits their practical use. While 2D virtual try-on is another alternative, it is highly challenging because it involves several tasks such as cloth warping, image segmentation, image compositing, and image synthesis. It also ignores the 3D body information and cannot fully represent the human body. M3D-VTON [5] was recently proposed to generate textured 3D try-on meshes only from 2D images of person and clothing by formulating the 3D try-on problem as 2D try-on and depth estimation.

However, we find that the synthesis model in the M3D-VTON pipeline uses a simple U-Net architecture. We hypothesize that this is insufficient to synthesize body parts and differentiate between front and back parts of clothing only from the 2D image, and that this would ultimately lead to unrealistic outputs affecting the final 3D try-on result. We aim to improve this by implementing residual units in the existing synthesis model. Residual learning is known to ease the training of these networks, and it can reduce the number of parameters and the compute cost. Further, the rich skip connections within the network could facilitate information propagation and help learn better representations to output better 2D try-on images, and finally better 3D try-on meshes.

2. Methodology & Experimental Results

2.1. Methodology

M3D-VTON. Figure 1 (left) is an overview of the 3D virtual try-on pipeline that we build on. We can see that there are many components involved. The major components are monocular prediction, depth refinement and texture fusion.

The monocular prediction module produces warped clothing, person segmentation and double depth maps which give a base 3D shape. The depth refinement module produces the refined depth maps which capture the warped clothing details as well as the high-frequency details which the previous module oversmooths. The texture fusion module merges the warped clothing with the unchanged person part to output 2D try-on results. After getting the 2D try-on image and depth maps, we unproject the front-view and back-view depth maps to get 3D point clouds and triangulate them with screened Poisson reconstruction. Since the try-on image and depth maps are spatially aligned, the try-on image can be used to color the front side of the mesh. As for the back texture, the image is inpainted using the fast marching method, where the face area is filled with the surrounding hair color, and is then mirrored to finally texture the back side of the mesh.
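The unprojection step described above, turning the front-view and back-view depth maps into a single point cloud, can be illustrated with a minimal sketch. It assumes an orthographic camera with a uniform pixel size and marks background pixels with `None`; the function names `unproject` and `double_depth_to_cloud` and the toy depth values are hypothetical and are not taken from the M3D-VTON code.

```python
# Hypothetical sketch: unproject front and back depth maps into one point cloud.
# Assumption: orthographic camera with a uniform pixel size; background pixels
# are marked with None. This is an illustration, not M3D-VTON's actual code.

def unproject(depth, pixel_size=1.0):
    """Turn an H x W depth map (list of rows) into a list of (x, y, z) points."""
    height = len(depth)
    width = len(depth[0]) if height else 0
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0  # image centre
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z is None:  # skip background pixels
                continue
            x = (u - cx) * pixel_size
            y = (cy - v) * pixel_size  # flip so that +y points up
            points.append((x, y, z))
    return points


def double_depth_to_cloud(front, back, pixel_size=1.0):
    """Merge the front-view and back-view surfaces into a single cloud."""
    return unproject(front, pixel_size) + unproject(back, pixel_size)


# Toy 2 x 2 depth maps: one background pixel, three body pixels each.
front = [[None, 0.9], [1.0, 1.1]]
back = [[None, 1.5], [1.6, 1.4]]
cloud = double_depth_to_cloud(front, back)
print(len(cloud))
```

Because the pixel grid is shared by both depth maps and the try-on image, each front point can later take its color from the pixel it came from. A surface reconstruction method such as screened Poisson reconstruction would then be run on the merged cloud.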

Figure 1. Overview of the proposed framework (left) with an illustration of a plain unit and its residual counterpart (right). The right panel contrasts plain connections with residual connections in the synthesis model. The major components of the framework are monocular prediction, depth refinement and texture fusion. Left image taken from M3D-VTON [5].

This allows us to achieve the monocular-to-3D conversion, producing the reconstructed 3D try-on mesh with the clothing changed and person identity retained.

Residual connections. The existing synthesis model in the texture fusion module, which combines all previous outputs, comprises a 5-layer encoder and a 5-layer decoder, similar to a U-Net [3]. We argue that the plain connections in this encoder-decoder network are not enough to synthesize body parts and differentiate between front and back parts of clothing only from the 2D image. Errors in this step would ultimately lead to unrealistic outputs affecting the final 3D try-on result.

To address the above problem, we propose to use residual connections [1]. The main idea is that residual connections have been shown to improve information propagation and to learn better representations of the input data, especially in image recognition tasks [1]. Each connection can be mathematically defined as:

$$x_{i+1} = f\big(h(x_i) + F(x_i)\big)$$

where $x_i$ and $x_{i+1}$ are the input and output of the $i$-th residual unit, $F(\cdot)$ is the residual function, $f(\cdot)$ is the activation function and $h(x_i)$ is an identity mapping function, for instance $h(x_i) = x_i$. Figure 1 (right) shows an illustration of a plain unit and its residual counterpart. The residual block also consists of batch normalization (BN), ReLU activation and convolutional layers. This approach uses identity mapping [2], which facilitates training and addresses the degradation problem that is mainly due to vanishing gradients. We refer readers to [1, 2] for more details.

We augment the existing U-Net [3] model in the texture fusion module by replacing the plain connections with residual connections. This results in a new synthesis architecture where the encoder and decoder layers consist of residual blocks, similar to Deep Residual U-Net [4]. We think that residual connections in the synthesis model are capable of handling the problem of front and back parts of clothing, preserving logos and reducing artifacts, so as to output better 2D try-on results and eventually better-looking textured 3D try-on meshes. Since our work builds on M3D-VTON [5] directly, we follow the same architecture design in the other modules as well as the same training and testing protocols. All experiments are performed on a Linux workstation with a 4.8 GHz CPU, 64 GB of RAM and an RTX 3080 GPU. Experiments are conducted using the Python programming language and the PyTorch deep learning framework.
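The difference between a plain unit and a residual unit can be made concrete with a small sketch. It assumes the common form x_{i+1} = f(h(x_i) + F(x_i)) with h as the identity and f as ReLU; the residual function here is a toy per-element scale and shift standing in for the BN, ReLU and convolution stack, and all names and values are hypothetical.

```python
# Hypothetical sketch of a plain unit versus a residual unit on a 1-D feature vector.
# Assumptions: h is the identity, f is ReLU, and F is a toy residual function
# (per-element scale and shift) standing in for the BN-ReLU-conv stack.

def relu(values):
    """Activation f(.): clamp negative entries to zero."""
    return [max(0.0, v) for v in values]


def residual_function(x, weight=0.5, bias=-1.0):
    """Toy stand-in for F(.): the learned transformation inside the unit."""
    return [weight * v + bias for v in x]


def plain_unit(x):
    """Plain connection: x_{i+1} = f(F(x_i))."""
    return relu(residual_function(x))


def residual_unit(x):
    """Residual connection: x_{i+1} = f(h(x_i) + F(x_i)) with h(x_i) = x_i."""
    fx = residual_function(x)
    return relu([xi + fi for xi, fi in zip(x, fx)])


x = [1.0, 2.0, 4.0]
print(plain_unit(x))     # [0.0, 0.0, 1.0]
print(residual_unit(x))  # [0.5, 2.0, 5.0]
```

In this toy case the plain unit zeroes out the first two entries, while the residual unit still carries the input forward through the identity path. That preserved path is the property the synthesis model is expected to benefit from.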

Figure 2. Comparison of 2D and 3D try-on mesh outputs with recent state-of-the-art M3D-VTON. Columns, left to right: reference person, target clothes, M3D-VTON (2D), ours (2D), M3D-VTON (3D mesh), ours (3D mesh).

2.2. Experimental Results

We find that better 2D try-on results lead to better textured 3D meshes. In particular, as shown in Figure 2, the 3D try-on meshes do not have the back part of the clothing in the front, preserve the logo of the clothing and show fewer artifacts.

Figure 4 shows some examples of the final 2D try-on outputs compared to previous work. In many cases, we see that the baseline model is unable to differentiate between the front and back of clothing. It also tends to change the skin color of persons. The baseline model also fails to preserve the logo of the clothing image. This is due to the limited capability of the U-Net architecture employed in the baseline model. In comparison, the proposed method generates realistic try-on results which differentiate the front and back parts of clothing as well as preserve the logo of clothing. It also reduces artifacts in non-target body parts such as skin. We show some more examples, where the baseline tends to output a blurry logo and to synthesize the back part of clothing in the front. In comparison, our method mitigates these problems.

We also show two examples in Figure 3 of outputs from our model on out-of-distribution images. They are out-of-distribution in the sense that the model is trained on the MPV3D dataset, which consists only of images of women and women's top clothing, while the images here are of men. We can see that the model is able to reconstruct the 3D try-on mesh with the clothing changed and person identity retained.

Figure 3. Results of our method on out-of-distribution images. Given the reference person image (left) and target clothing image (middle), our method can reconstruct the 3D try-on mesh (right) with the clothing changed and person identity retained.

Finally, we show some quantitative results in Table 1 on two metrics which are currently used to benchmark try-on methods. Our method consistently outperforms baseline methods, with an improvement of almost 5 points over the previous best method on FID score (15.16 versus 19.87).

Method                     FID      SSIM
VITON (CVPR, 2018)         28.43    0.8807
CP-VTON (ECCV, 2018)       20.05    0.8503
CP-VTON+ (2020)            23.18    0.8782
ACGPN (CVPR, 2020)         20.19    0.8924
M3D-VTON (ICCV, 2021)      19.87    0.9725
Ours                       15.16    0.9814

Table 1. 2D try-on SSIM and FID scores on the MPV3D test set. The best result in each column is from our method; lower FID and higher SSIM are better. Our method consistently outperforms baseline methods.

3. Conclusions

To summarize, we integrate residual connections into the synthesis model of a recent 3D virtual try-on pipeline. Studying its effect demonstrates that it improves 2D try-on outputs, mainly by differentiating between front and back parts of clothing, preserving the logo of clothing and reducing artifacts. This ultimately results in a better textured 3D try-on mesh. Benchmarking our method on the MPV3D dataset shows that it performs significantly better than previous works.

References
[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
[3] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[4] Zhengxin Zhang, Qingjie Liu, and Yunhong Wang. Road extraction by deep residual U-Net. IEEE Geoscience and Remote Sensing Letters, 15(5):749–753, 2018.
[5] Fuwei Zhao, Zhenyu Xie, Michael Kampffmeyer, Haoye Dong, Songfang Han, Tianxiang Zheng, Tao Zhang, and Xiaodan Liang. M3D-VTON: A monocular-to-3D virtual try-on network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13239–13249, 2021.

Appendix: Extra Results
Figure 4 shows more examples of the final try-on outputs compared to previous work. In many cases, we see that the baseline model is unable to differentiate between the front and back of clothing. It also tends to change the skin color of the person. The baseline model also fails to preserve the logo of the clothing image. This is due to the limited capability of the U-Net architecture employed in the baseline model. In comparison, the proposed method generates realistic try-on results which differentiate the front and back parts of clothing and preserve the logo of clothing. It also reduces artifacts in non-target body parts such as skin.

Figure 4. Extensive visual comparison of 2D try-on outputs with M3D-VTON. Each group shows, from left to right, the reference person, the target clothes, the M3D-VTON output and ours. Our method generates realistic try-on results which differentiate the front and back parts of clothing and preserve the logo of clothing. It also reduces artifacts in non-target body parts such as skin.
