CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization

1 Xidian University
2 Chongqing University of Posts and Telecommunications
*Corresponding author

CatVersion concatenates embeddings in the feature-dense space of the text encoder within the diffusion model to achieve personalized concept inversion from a handful of examples. It restores personalized concepts more faithfully and enables more robust editing.

Abstract

We propose CatVersion, an inversion-based method that learns a personalized concept from a handful of examples. Users can then generate images that embody the personalized concept with text prompts, achieving text-to-image personalization. In contrast to existing methods that emphasize word-embedding learning or parameter fine-tuning, which can cause concept dilution or overfitting, our method concatenates embeddings in the feature-dense space of the text encoder within the diffusion model to learn the gap between the personalized concept and its base class, aiming to preserve as much of the diffusion model's prior knowledge as possible while faithfully restoring the personalized concept. To this end, we first dissect how the text encoder participates in the image generation process to identify the feature-dense space. We then concatenate embeddings onto the Keys and Values in this space to learn the gap between the personalized concept and its base class; in this way, the concatenated embeddings ultimately manifest as a residual on the original attention output. To quantify the results more accurately and without bias, we improve the CLIP image alignment score by using masks. Both qualitatively and quantitatively, CatVersion restores personalized concepts more faithfully and enables more robust editing.
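A brief sketch of why the concatenation acts as a residual (the notation K*, V*, and \Lambda is introduced here for illustration and does not come from the paper): appending learned tokens K* and V* to the Keys and Values of an attention layer splits the softmax mass between the original and the appended tokens, so the output decomposes as

\[
\operatorname{Attn}\bigl(Q,\,[K;K^{*}],\,[V;V^{*}]\bigr)
  \;=\; \Lambda\,\operatorname{softmax}\!\Bigl(\tfrac{QK^{\top}}{\sqrt{d}}\Bigr)V
  \;+\; (I-\Lambda)\,\operatorname{softmax}\!\Bigl(\tfrac{Q{K^{*}}^{\top}}{\sqrt{d}}\Bigr)V^{*},
\]

where \Lambda = diag(\lambda_1, ..., \lambda_n) and \lambda_i is the fraction of attention mass that query i assigns to the original keys. The first term is the original attention output (rescaled), so the entire contribution of the learned embeddings appears as the second, residual term.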

Visualization Results of CatVersion

Comparisons with Existing Methods

How does it work?

  1. First, we identify the feature-dense layers in the CLIP text encoder.
  2. Then, we concatenate the residual embeddings with the Keys and Values in these layers, as shown in (a); see the code sketch after this list.
  3. During optimization, we use the base class word (e.g., "dog") of the personalized concept as the text input and optimize the residual embeddings using a handful of images depicting the personalized concept, as shown in (b).
  4. During inference, the residual embeddings of CatVersion can be deleted or replaced to meet different personalization needs, as shown in (c).
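Below is a minimal PyTorch sketch of step 2: a self-attention layer whose Keys and Values are extended with a small set of learnable residual embeddings. The class name, tensor shapes, and the number of residual tokens are our assumptions for illustration; this is not the official CatVersion implementation, which applies the concatenation inside the feature-dense layers of the CLIP text encoder.

    import torch
    import torch.nn as nn

    class ResidualKVAttention(nn.Module):
        """Self-attention whose Keys and Values are extended with learnable
        residual embeddings. A minimal sketch of the mechanism; layer
        placement and sizes are illustrative, not the official code."""

        def __init__(self, dim, num_residual_tokens=3):
            super().__init__()
            self.scale = dim ** -0.5
            self.to_q = nn.Linear(dim, dim, bias=False)
            self.to_k = nn.Linear(dim, dim, bias=False)
            self.to_v = nn.Linear(dim, dim, bias=False)
            # The only parameters optimized during personalization
            # (assumption: all other weights stay frozen).
            self.res_k = nn.Parameter(torch.randn(num_residual_tokens, dim) * 0.02)
            self.res_v = nn.Parameter(torch.randn(num_residual_tokens, dim) * 0.02)

        def forward(self, tokens):
            # tokens: (B, L, dim) text-token features from the preceding layer.
            q, k, v = self.to_q(tokens), self.to_k(tokens), self.to_v(tokens)

            # Concatenate the residual embeddings along the token axis of
            # K and V; the query sequence is left untouched.
            b = tokens.shape[0]
            k = torch.cat([k, self.res_k.unsqueeze(0).expand(b, -1, -1)], dim=1)
            v = torch.cat([v, self.res_v.unsqueeze(0).expand(b, -1, -1)], dim=1)

            attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
            return attn @ v  # (B, L, dim)

In a training loop, only the res_k and res_v tensors would be passed to the optimizer, with the base-class prompt (e.g., "a photo of a dog") and the frozen diffusion model's denoising loss providing supervision (step 3); deleting or swapping these two tensors at inference recovers the original encoder or switches to another concept (step 4).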

BibTeX

@misc{zhao2023catversion,
  title={CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization},
  author={Ruoyu Zhao and Mingrui Zhu and Shiyin Dong and Nannan Wang and Xinbo Gao},
  year={2023},
  eprint={2311.14631},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}