Some interesting papers I have read or plan to read.
2020
February
January
- Image Style Transfer Using Convolutional Neural Networks
- U-Net: Convolutional Networks for Biomedical Image Segmentation
- Aggregated Residual Transformations for Deep Neural Networks
- Squeeze-and-Excitation Networks
- ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices
- MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
- Image Transformer
CVPR 2019
GAN
[Progressive Pose Attention Transfer for Person Image Generation](https://arxiv.org/pdf/1904.03349.pdf) code
MirrorGAN: Learning Text-to-image Generation by Redescription
Text -> Image, then Image -> Text: the generated image is re-described back into text, and the re-description serves as a semantic consistency signal.
Joint Discriminative and Generative Learning for Person Re-identification
Semantics Disentangling for Text-to-Image Generation
Unsupervised Person Image Generation with Semantic Parsing Transformation
2019-05-16
StoryGAN: A Sequential Conditional GAN for Story Visualization
Object-driven Text-to-Image Synthesis via Adversarial Training
Text2Scene: Generating Compositional Scenes from Textual Descriptions
DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis
Text Guided Person Image Synthesis
Infers poses from the text, then takes a person image, the text, and the inferred pose as input to generate new person images.
Semantics Disentangling for Text-to-Image Generation
R2GAN: Cross-modal Recipe Retrieval with Generative Adversarial Network
2019-05-15
Fashion-AttGAN: Attribute-Aware Fashion Editing with Multi-Objective GAN
An improvement on their earlier AttGAN ("Facial attribute editing by only changing what you want").
2019-05-14
Self-Supervised GANs via Auxiliary Rotation Loss
A traditional conditional GAN with a rotation-angle prediction loss as the supervising (self-supervision) signal.
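A minimal sketch of the auxiliary rotation loss idea, assuming a discriminator with an extra four-way rotation head (`rotation_head` is a hypothetical name, not the paper's notation): rotate each image by 0/90/180/270 degrees and train the head to predict the angle.

```python
import torch
import torch.nn.functional as F

def rotation_loss(disc, images):
    """Auxiliary self-supervision: predict each image's rotation angle."""
    rotated, labels = [], []
    for k in range(4):  # k * 90 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    rotated = torch.cat(rotated)          # (4B, C, H, W)
    labels = torch.cat(labels)            # (4B,)
    logits = disc.rotation_head(rotated)  # assumed extra 4-way head
    return F.cross_entropy(logits, labels)
```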
CollaGAN: Collaborative GAN for Missing Image Data Imputation
Mode Seeking Generative Adversarial Networks for Diverse Image Synthesis
Solves the cGAN mode collapse problem by introducing a regularization term.
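A minimal sketch of such a mode-seeking regularizer: maximize the ratio of the distance between two generated images to the distance between their latent codes, implemented here by minimizing the inverse ratio. The generator signature `G(z, c)` is an assumption.

```python
import torch

def mode_seeking_loss(G, c, z_dim, eps=1e-5):
    """Penalize the generator when distinct latent codes map to similar images."""
    z1 = torch.randn(c.size(0), z_dim)
    z2 = torch.randn(c.size(0), z_dim)
    d_img = torch.mean(torch.abs(G(z1, c) - G(z2, c)))  # distance between images
    d_z = torch.mean(torch.abs(z1 - z2))                # distance between codes
    return 1.0 / (d_img / d_z + eps)  # minimizing this maximizes the ratio
```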
MirrorGAN: Learning Text-to-image Generation by Redescription
Videos
Efficient Large Scale Video Classification
Future Person Localization in First-Person Videos
NLP
2019-03-01 Convolutional Image Captioning. Image captioning: describing the content observed in an image.
Reconstruction Network for Video Captioning
Learning Semantic Concepts and Order for Image and Sentence Matching
Visual Question Generation as Dual Task of Visual Question Answering
2019-02-28 Are You Talking to Me? Reasoned Visual Dialog Generation through Adversarial Learning
A Generative Adversarial Approach for Zero-Shot Learning from Noisy Texts
Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks
Actor and Action Video Segmentation from a Sentence
Objects as context for detecting their semantic parts
Six Challenges for Neural Machine Translation
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks
Images
2019-03-01 Self-Attention Generative Adversarial Networks
2019-03-02 Learning Residual Images for Face Attribute Manipulation tensorflow keras
2019-03-03 Disentangling Factors of Variation by Mixing Them
2019-03-03 Deep Semantic Face Deblurring
SGAN: An Alternative Training of Generative Adversarial Networks
DeblurGAN: Blind Motion Deblurring Using Conditional Adversarial Networks
2019-03-03 Face Aging With Conditional Generative Adversarial Networks
2019-03-04 Age Progression/Regression by Conditional Adversarial Autoencoder
2019-03-04 Face Aging with Identity-Preserved Conditional Generative Adversarial Networks
2019-03-05 Towards Open-Set Identity Preserving Face Synthesis
Person Transfer GAN to Bridge Domain Gap for Person Re-Identification
2019-03-08 Separating Style and Content for Generalized Style Transfer
Pose-Robust Face Recognition via Deep Residual Equivariant Mapping
Mask-aware Photorealistic Face Attribute Manipulation
2019-03-10 Perceptual Adversarial Networks for Image-to-Image Transformation
End-to-End Dense Video Captioning with Masked Transformer
2019-03-01 Predicting Yelp Star Reviews Based on Network Structure with Deep Learning
Chinese Typeface Transformation with Hierarchical Adversarial Network
Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models
Zero-Shot Visual Recognition using Semantics-Preserving Adversarial Embedding Networks
Unsupervised Domain Adaptation with Adversarial Residual Transform Networks
Learning Pose Specific Representations by Predicting Different Views
What have we learned from deep representations for action recognition?
Feature Space Transfer for Data Augmentation
Photo-realistic Facial Texture Transfer
Generating a Fusion Image: One’s Identity and Another’s Shape
FairGAN: Fairness-aware Generative Adversarial Networks
Learning Intrinsic Image Decomposition from Watching the World
Residual Dense Network for Image Super-Resolution
2019-03-20 Detecting and Recognizing Human-Object Interactions
Semantic Facial Expression Editing using Autoencoded Flow
Multi-Content GAN for Few-Shot Font Style Transfer
Load Balanced GANs for Multi-view Face Image Synthesis
Exploring Disentangled Feature Representation Beyond Face Identification
High-Quality Face Image SR Using Conditional Generative Adversarial Networks
GP-GAN: Gender Preserving GAN for Synthesizing Faces from Landmarks
3D Human Pose Estimation in the Wild by Adversarial Learning
Generative Image Inpainting with Contextual Attention
TextureGAN: Controlling Deep Image Synthesis with Texture Patches
Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks
2019-01-24
Generative Adversarial Text to Image Synthesis
2019-02-28 StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks
Scribbler: Controlling Deep Image Synthesis with Sketch and Color
Deep Reinforcement Learning: An Overview
Invertible Conditional GANs for image editing
A Pose-Sensitive Embedding for Person Re-Identification with Expanded Cross Neighborhood Re-Ranking
Pose Guided Person Image Generation
Human Action Generation with Generative Adversarial Networks
Deep Video Generation, Prediction and Completion of Human Action Sequences skeleton generation
Pose Guided Human Video Generation skeleton generation
SwapNet: Image Based Garment Transfer
A Variational U-Net for Conditional Appearance and Shape Generation
2019-01-04
Visualizing and Understanding Convolutional Networks
An Introduction to Image Synthesis with Generative Adversarial Nets
Generative Semantic Manipulation with Contrasting GAN
In this paper, a distance loss is used to measure the difference between real and fake images.
2018-11-16
Disentangled Representation Learning GAN for Pose-Invariant Face Recognition
The authors try to solve the problem of Pose-Invariant Face Recognition.
By controlling the pose code $c$ and the noise $z$, we can diversify the generated images.
The discriminator has multiple tasks: classifying real images as real, fake images as fake, and predicting the identities and poses of real images.
The generator tries to maximize the probability that its generated images are classified with the true identities and the target poses.
There are $n$ input images, each with a corresponding generated image, plus one more image generated from the fused representation. For each generated image, there are two losses: identity and pose classification.
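A hedged sketch of the two per-image losses described above (identity and pose classification on generated images); `identity_head` and `pose_head` are hypothetical module names, not the paper's notation.

```python
import torch.nn.functional as F

def generator_cls_losses(disc, fake_images, id_labels, pose_labels):
    """Generator-side losses: fakes should be classified with the
    true identities and the target poses."""
    loss_id = F.cross_entropy(disc.identity_head(fake_images), id_labels)
    loss_pose = F.cross_entropy(disc.pose_head(fake_images), pose_labels)
    return loss_id + loss_pose
```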
Disentangling factors of variation in deep representations using adversarial training
2018-11-15
Soft-Gated Warping-GAN for Pose-Guided Person Image Synthesis
Pose Transferrable Person Re-Identification
A Semi-supervised Deep Generative Model for Human Body Analysis
Deep Video-Based Performance Cloning
A Hybrid Model for Identity Obfuscation by Face Replacement
Shuffle-Then-Assemble: Learning Object-Agnostic Visual Relationship Features
Learning Disentangled Representations with Semi-Supervised Deep Generative Models
2018-11-10
StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks
Semi and Weakly Supervised Semantic Segmentation Using Generative Adversarial Network
A-Fast-RCNN: Hard Positive Generation via Adversary for Object Detection
Tag Disentangled Generative Adversarial Networks for Object Image Re-rendering
High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs
2018-11-08
TextureGAN: Controlling Deep Image Synthesis with Texture Patches
VITAL: VIsual Tracking via Adversarial Learning
Problem setting: tracking-by-detection, i.e., drawing samples around the target object in the first stage and classifying each sample as the target object or as background in the second stage.
Image Denoising via CNNs: An Adversarial Approach
Image denoising is a fundamental image processing problem whose objective is to remove the noise while preserving the original image structure.
Discriminative Region Proposal Adversarial Networks for High-Quality Image-to-Image Translation
From source to target and back: symmetric bi-directional adaptive GAN
Conditional Image-to-Image Translation
Deformable Shape Completion with Graph Convolutional Autoencoders
2018-11-05
Deep Visual Analogy-Making tf code
Given a pair of images $(a,b)$ and a query image $c$, we try to generate $d$, the corresponding image of $c$, such that $a$ is to $b$ as $c$ is to $d$.
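A minimal sketch of the additive variant of this analogy, with hypothetical `encoder`/`decoder` networks: apply the transformation $f(b) - f(a)$ to $f(c)$ in embedding space, then decode.

```python
def make_analogy(encoder, decoder, a, b, c):
    """a : b :: c : d, solved additively in feature space."""
    fa, fb, fc = encoder(a), encoder(b), encoder(c)
    fd = fc + (fb - fa)   # transfer the a -> b transformation onto c
    return decoder(fd)    # generated image d
```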
Person Transfer GAN to Bridge Domain Gap for Person Re-Identification code
A new Multi-Scene Multi-Time person ReID dataset (MSMT17) is proposed.
[transfer styles but keep identities] A method is proposed to bridge the domain gap by transferring persons from dataset A to another dataset B. The transferred persons from A should keep their identities while presenting styles, e.g., backgrounds and lighting, similar to persons in B.
To preserve identity, an identity loss is introduced that encourages the masked (person) region of the generated image to be similar to that of the ground-truth image.
Separating Style and Content for Generalized Style Transfer
Connecting Pixels to Privacy and Utility: Automatic Redaction of Private Information in Images code
FoldingNet: Point Cloud Auto-encoder via Deep Grid Deformation
Disentangling 3D Pose in A Dendritic CNN for Unconstrained 2D Face Alignment
A New Representation of Skeleton Sequences for 3D Action Recognition
2018-11-04
Synthesizing Images of Humans in Unseen Poses
Our model is trained on (example, label) tuples of the form $((I_s, p_s, p_t), I_t)$, where $I_s$, $p_s$ and $p_t$ are the source image, source 2D pose and target 2D pose, and $I_t$ is the target image.
Our model first segments the scene into foreground and background layers. It further segments the person's body into different part layers, such as the arms and legs, allowing each part to be moved independently of the others.
Generating a Fusion Image: One’s Identity and Another’s Shape
Given two RGB images $x$ and $y$, we try to generate a new image that combines the identity of $x$ with the shape or pose of $y$.
A Generative Model of People in Clothing
The authors try to generate different people with different clothes but with the same specified shapes.
2018-11-02
Multimodal Deep Learning for Robust RGB-D Object Recognition
Depth-aware CNN for RGB-D Segmentation
Learning Rich Features from RGB-D Images for Object Detection and Segmentation
Deep Bilinear Learning for RGB-D Action Recognition
In this paper, we present a novel tensor-structured cube feature (the multi-modal sequences with temporal information can be regarded as a tensor, structured along two different dimensions: temporal and modality), and propose to learn time-varying information from multi-modal action history sequences for RGB-D action recognition.
In this paper, we address this challenge by proposing a novel deep bilinear framework, where a bilinear block consisting of two linear pooling layers (a modality pooling layer and a temporal pooling layer) is defined to pool the input tensor along the modality and temporal directions separately. In this way, the structures along the temporal and modal dimensions are both preserved. By stacking the proposed bilinear blocks and other network layers (e.g., ReLU and softmax), we develop our deep bilinear model to jointly learn the action history and modality information in videos. Results have shown that learning modality-temporal mutual information is beneficial for the recognition of RGB-D actions.
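A rough sketch of such a bilinear block under an assumed input shape `(batch, modality, time, feature)`: two linear layers pool along the modality and temporal axes respectively. Dimensions and names are assumptions, not the paper's exact configuration.

```python
import torch.nn as nn

class BilinearBlock(nn.Module):
    """Two linear pooling layers: one over modalities, one over time."""
    def __init__(self, n_modalities, n_frames, m_out, t_out):
        super().__init__()
        self.modality_pool = nn.Linear(n_modalities, m_out, bias=False)
        self.temporal_pool = nn.Linear(n_frames, t_out, bias=False)

    def forward(self, x):              # x: (batch, modality, time, feature)
        x = x.permute(0, 2, 3, 1)      # put the modality axis last
        x = self.modality_pool(x)      # pool across modalities
        x = x.permute(0, 3, 2, 1)      # put the time axis last
        x = self.temporal_pool(x)      # pool across time
        return x                       # (batch, m_out, feature, t_out)
```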
2018-10-31
A Pose-Sensitive Embedding for Person Re-Identification with Expanded Cross Neighborhood Re-Ranking
coarse pose: front, back, or side orientation of a person relative to the camera.
fine pose: the joint skeleton.
Everybody Dance Now
The task is to generate an action video conditioned on a target figure and a source video.
The drawback is that a separate generator G must be trained for each new person.
$L_{VGG}$ loss: instead of using per-pixel loss functions that depend only on low-level pixel information, the networks are trained with perceptual loss functions that depend on high-level features from a pretrained loss network. During training, perceptual losses measure image similarities more robustly than per-pixel losses.
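A standard sketch of a VGG-based perceptual loss (comparing pretrained VGG16 feature maps instead of raw pixels); the layer cut-off is an assumption, not necessarily the configuration used in the paper.

```python
import torch.nn.functional as F
import torchvision

# Frozen VGG16 up to relu3_3 as the fixed "loss network"
# (newer torchvision: vgg16(weights=torchvision.models.VGG16_Weights.IMAGENET1K_V1))
vgg = torchvision.models.vgg16(pretrained=True).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad = False

def perceptual_loss(generated, target):
    """Compare high-level features rather than per-pixel values."""
    return F.mse_loss(vgg(generated), vgg(target))
```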
CR-GAN: Learning Complete Representations for Multi-view Generation
- learn complete representations to handle the unseen-data problem.
The authors aim to generate multi-view images of a person given a single image of that person.
Geometry-Contrastive GAN for Facial Expression Transfer
- handle the misalignment across different subjects or facial expressions.
Pose Guided Human Video Generation
[difficult] Learning to Forecast and Refine Residual Motion for Image-to-Video Generation
We study a form of classic problems in video generation that can be framed as image-to-video translation tasks, where a system receives one or more images as input and translates them into a video containing realistic motions of a single object.
2018-10-29
Pose-Normalized Image Generation for Person Re-identification
Critically, once trained, the model can be applied to a new dataset without any model fine-tuning as long as the test image’s pose is also normalized.
We train two re-id models. One is trained on the original images in the training set to extract identity-invariant features in the presence of pose variation. The other is trained on images synthesized with normalized poses by our PN-GAN to compute re-id features free of pose variation. The two are then fused as the final feature representation.
VITON: An Image-based Virtual Try-on Network
We present an image-based virtual try-on approach, relying merely on plain RGB images without leveraging any 3D information. We propose a virtual try-on network (VITON), a coarse-to-fine framework that seamlessly transfers a target clothing item in a product image to the corresponding region of a clothed person in a 2D image.
The mask is then used as a guidance to warp the target clothing item to account for deformations.
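An illustrative, simplified version of that compositing step, assuming a predicted composition mask with values in $[0, 1]$: paste the warped garment onto the coarse result.

```python
def composite(coarse_person, warped_cloth, cloth_mask):
    """Alpha-composite the warped clothing item using the predicted mask."""
    return cloth_mask * warped_cloth + (1.0 - cloth_mask) * coarse_person
```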
Disentangled Person Image Generation
Generating novel, yet realistic, images of persons is a challenging task due to the complex interplay between the different image factors, such as the foreground, background and pose information. In this work, we aim at generating such images based on a novel, two-stage reconstruction pipeline that learns a disentangled representation of the aforementioned image factors and generates novel person images at the same time.
In stage one, a real image is used to train 3 independent encoders, i.e., Pose Encoder, Foreground Encoder, and Background Encoder.
In stage two, we can sample features from the three encoders respectively to get pose, foreground, and background features, and combine these three features to generate new images.
In particular, we aim at sampling from a standard distribution, e.g., a Gaussian distribution, to first generate new embedding features and from them generate new images.
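A hedged sketch of that sampling step, with hypothetical mapping networks (one per factor) and a decoder standing in for the paper's actual modules.

```python
import torch

def sample_person_image(map_pose, map_fg, map_bg, decoder, z_dim=64):
    """Draw Gaussian noise per factor, map to embeddings, then decode."""
    zs = [torch.randn(1, z_dim) for _ in range(3)]
    pose_emb, fg_emb, bg_emb = map_pose(zs[0]), map_fg(zs[1]), map_bg(zs[2])
    return decoder(torch.cat([pose_emb, fg_emb, bg_emb], dim=1))
```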
Natural and Effective Obfuscation by Head Inpainting
- detecting 68 facial keypoints using the Python dlib toolbox paper (see the sketch at the end of this note)
We focus on the scenario where the user wants to obfuscate some identities in a social media photo by inpainting new heads for them. We use facial landmarks to provide strong guidance for the head inpainter. We factor the head inpainting task into two stages: (1) landmark detection or generation and (2) head inpainting conditioned on body context and landmarks.
It takes either the original image or the black-head image as input, in order to give flexibility in cases where the original image is not available. Given an original or head-obfuscated input, stage I detects or generates landmarks, respectively. Stage II takes the black-head image and the landmarks as input, and outputs the generated image.
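A minimal sketch of the 68-landmark detection step mentioned in the note above, using the standard dlib pipeline; the pretrained `shape_predictor_68_face_landmarks.dat` model must be downloaded separately, and the image path is a placeholder.

```python
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = dlib.load_rgb_image("photo.jpg")
for face in detector(img, 1):  # 1 = upsample the image once
    shape = predictor(img, face)
    landmarks = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
```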
Deformable GANs for Pose-based Human Image Generation
Specifically, given an image of a person and a target pose, we synthesize a new image of that person in the novel pose. In order to deal with pixel-to-pixel misalignments caused by the pose differences, we introduce deformable skip connections in the generator of our Generative Adversarial Network.
2018-10-27
Cross-Modality Person Re-Identification with Generative Adversarial Training
Studies Re-ID between infrared and RGB images, which is essentially a cross-modality problem widely encountered in real-world scenarios.
Predicting Human Interaction via Relative Attention Model
Essentially, a good algorithm should effectively model the mutual influence between the two interacting subjects. Also, only a small region in the scene is discriminative for identifying the on-going interaction.
An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data
We build our model on top of the Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM), which learns to selectively focus on discriminative joints of skeleton within each frame of the inputs and pays different levels of attention to the outputs of different frames.
For spatial joints of skeleton, we propose a spatial attention module which conducts automatic mining of discriminative joints. A certain type of action is usually only associated with and characterized by the combinations of a subset of kinematic joints.
For a sequence, the amount of valuable information provided by different frames is in general not equal. Only some of the frames (key frames) contain the most discriminative information while the other frames provide context information.
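An illustrative sketch of the spatial attention idea described above (not the paper's exact module): score each joint's feature, softmax across joints, and take the weighted sum per frame.

```python
import torch
import torch.nn as nn

class JointAttention(nn.Module):
    """Select discriminative skeleton joints via learned attention weights."""
    def __init__(self, feat_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, joints):          # (batch, n_joints, feat_dim)
        w = torch.softmax(self.score(joints), dim=1)  # per-joint weights
        return (w * joints).sum(dim=1)  # attended frame feature
```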
Co-occurrence Feature Learning from Skeleton Data for Action Recognition and Detection with Hierarchical Aggregation
Focuses on the problem of skeleton-based human action recognition and detection.
By investigating the convolution operation, we may decompose it into two steps, i.e., local feature aggregation across the spatial domain (width and height) and global feature aggregation across channels (see the sketch after this note).
The input is skeleton sequences and skeleton temporal differences.
For multiple persons, inputs of multiple persons go through the same subnetwork and their conv6 feature maps are merged with either concatenation along channels or element-wise maximum / mean operation.
The method also extends to action detection.
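A generic sketch of that decomposition: a depthwise convolution aggregates local spatial features, then a 1x1 (pointwise) convolution aggregates globally across channels. This illustrates the idea, not the paper's exact layers.

```python
import torch.nn as nn

def decomposed_conv(in_ch, out_ch, k=3):
    """Spatial aggregation (depthwise) followed by channel aggregation (1x1)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch),  # spatial
        nn.Conv2d(in_ch, out_ch, 1),                               # channels
    )
```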
Pose Guided Person Image Generation
https://arxiv.org/pdf/1601.01006.pdf
A2g-GAN
2018-10-26
IJCAI 2018
Exploiting Images for Video Recognition with Hierarchical Generative Adversarial Networks
The two-level HiGAN is designed to have a low-level conditional GAN and a high-level conditional GAN. The low-level conditional GAN is built to connect videos and their corresponding video frames by learning a mapping function from frame features to video features in the target domain. The high-level conditional GAN, on the other hand, is modeled to bridge the gap between source images and target videos by formulating a mapping function from video features to image-frame features.
Memory Attention Networks for Skeleton-based Action Recognition
2018-10-25
StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation
Pose Guided Person Image Generation
Disentangled Person Image Generation
Deformable GANs for Pose-based Human Image Generation
2018-10-23
Generating Realistic Videos from Keyframes with Concatenated GANs
Given two video frames $X_0$ and $X_{n+1}$, we aim to generate a series of intermediate frames $Y_1, Y_2, \dots, Y_n$, such that the resulting video consisting of frames $X_0, Y_1, \dots, Y_n, X_{n+1}$ appears realistic to a human watcher.
Human Action Generation with Generative Adversarial Networks
Deep Video Generation, Prediction and Completion of Human Action Sequences
The model itself is originally designed for video generation, i.e., generating human action videos from random noise. We split the generation process into two stages: first, we generate human skeleton sequences from random noise, and then we transform the skeleton images into real pixel-level images.
The model is independent of the training subjects: we train the model on some subjects but test it on totally different subjects.