Image harmonization, which aims to make composite images look more realistic, is an important and challenging task. A composite, synthesized by combining the foreground of one image with the background of another, inevitably looks inharmonious because the two were captured under different imaging conditions, i.e., lighting. Current solutions mainly adopt an encoder-decoder architecture built on convolutional neural networks (CNNs) to capture the context of the composite image and infer how the foreground should look given the surrounding background. In this work, we address image harmonization with the Transformer, leveraging its strong ability to model long-range context dependencies to adjust the foreground lighting so that it is compatible with the background lighting while keeping structure and semantics unchanged. We present the design of our two vision Transformer frameworks and the corresponding methods, together with comprehensive experiments and an empirical study that demonstrate the power of the Transformer and investigate its use for vision. Our methods achieve state-of-the-art performance on image harmonization as well as on four additional vision and graphics tasks, i.e., image enhancement, image inpainting, white-balance editing, and portrait relighting, which indicates the superiority of our approach. Code, models, more results, and details can be found at the project website http://ouc.ai/project/HarmonyTransformer.
               