NCHO: Unsupervised Learning for Neural 3D Composition of Humans and Objects

¹Seoul National University, ²Meta Reality Labs
ICCV 2023

From 3D scans of the “source human” in casual clothing and with additional outerwear or objects, our method automatically decomposes objects from the source human and builds a compositional generative model that enables the creation of 3D avatars of novel human identities with a variety of outerwear and objects, all in an unsupervised manner.


In this work, we present a novel framework for learning a compositional generative model of humans and objects (backpacks, coats, scarves, and more) from real-world 3D scans. Our compositional model is interaction-aware: it accounts for both the spatial relationship between humans and objects and the mutual shape change caused by physical contact. The key challenge is that, since humans and objects are in contact, their 3D scans are merged into a single piece. To decompose them without manual annotations, we propose to leverage two sets of 3D scans of a single person, one with objects and one without. Our approach learns to decompose objects and naturally compose them back into a generative human model in an unsupervised manner. Despite our simple setup, which requires capturing only a single subject with objects, our experiments demonstrate the strong generalization of our model: it enables the natural composition of objects onto diverse identities in various poses, as well as the composition of multiple objects, both of which are unseen in the training data.



Disentangled Generation

Our compositional model provides separate controls over the shape of the human part and the shape of the object part.

Same scarf on different human identities.

Same outerwear on different human identities.

Different backpacks on the same human identity.

Different outerwear on the same human identity.


Our compositional model allows smooth interpolation of either the human part or the object part without degrading the other part.

Human part interpolation.

Object part interpolation.
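
The interpolation shown above can be illustrated with a minimal sketch. It assumes that the human part and the object part are each controlled by their own latent vector and that intermediate shapes come from linearly blending one vector while holding the other fixed. The names `lerp`, `interpolate_human_part`, `z_human_a`, `z_human_b`, and `z_object`, as well as the 64-dimensional codes, are hypothetical and chosen for illustration only; the actual latent representation and decoder used by NCHO are not described on this page.

```python
import numpy as np


def lerp(z_start, z_end, t):
    """Linearly interpolate between two latent codes for t in [0, 1]."""
    return (1.0 - t) * z_start + t * z_end


def interpolate_human_part(z_human_a, z_human_b, z_object, num_steps=5):
    """Return (human_code, object_code) pairs in which only the human code varies.

    Assumption: a decoder (not shown here) would map each pair to a composed shape.
    """
    ts = np.linspace(0.0, 1.0, num_steps)
    return [(lerp(z_human_a, z_human_b, t), z_object.copy()) for t in ts]


# Hypothetical 64-dimensional latent codes, sampled only for demonstration.
rng = np.random.default_rng(0)
z_human_a = rng.standard_normal(64)
z_human_b = rng.standard_normal(64)
z_object = rng.standard_normal(64)

pairs = interpolate_human_part(z_human_a, z_human_b, z_object)
```

Object part interpolation would follow the same pattern with the roles swapped: the human code stays fixed while the object code is blended.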

Multiple Objects

Our model allows the composition of two or more objects, even though our training data contain no scans of the source human with multiple objects.

Composition of multiple objects.
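
As a rough illustration of what composing several objects onto one human can mean for implicit shapes, the sketch below takes the pointwise union (maximum) of occupancy functions. This is a generic technique for combining implicit shapes and is an assumption made here for illustration, not the composition mechanism of NCHO: in particular, it ignores the interaction-aware shape change described above. The helpers `sphere_occupancy` and `compose`, and the sphere stand-ins for a body and two objects, are hypothetical.

```python
import numpy as np


def sphere_occupancy(center, radius):
    """Toy occupancy function: 1.0 inside a sphere, 0.0 outside."""
    center = np.asarray(center, dtype=float)

    def occ(points):
        dist = np.linalg.norm(points - center, axis=-1)
        return (dist <= radius).astype(float)

    return occ


def compose(human_occ, object_occs):
    """Union of a human occupancy with any number of object occupancies."""

    def occ(points):
        values = [human_occ(points)] + [f(points) for f in object_occs]
        # Pointwise maximum: a point is occupied if any part occupies it.
        return np.max(np.stack(values, axis=0), axis=0)

    return occ


# Spheres stand in for a body and two attached objects.
human = sphere_occupancy([0.0, 0.0, 0.0], 1.0)
object_a = sphere_occupancy([0.0, 0.0, 1.0], 0.5)
object_b = sphere_occupancy([0.0, 1.0, 0.0], 0.5)

composed = compose(human, [object_a, object_b])
query_points = np.array([
    [0.0, 0.0, 0.0],  # inside the human
    [0.0, 0.0, 1.2],  # inside object_a only
    [0.0, 1.2, 0.0],  # inside object_b only
    [5.0, 5.0, 5.0],  # outside everything
])
occupancy = composed(query_points)
```

Because `compose` accepts a list of object functions, the same call handles one object or several without any change.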


    author    = {Kim, Taeksoo and Saito, Shunsuke and Joo, Hanbyul},
    title     = {NCHO: Unsupervised Learning for Neural 3D Composition of Humans and Objects},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    year      = {2023}