yushenxiang / MachineLearningTips

This records what I have read and learned from papers and books about machine learning.

1)Image-to-Image Translation with Conditional Adversarial Nets #5

yushenxiang opened 6 years ago

yushenxiang commented 6 years ago

2017-11-10 1) Image-to-Image Translation with Conditional Adversarial Nets, reading summary:
Training input: an image plus random noise (injected via dropout). Training output: a realistic image corresponding to the input.
Test input: an image plus random noise (dropout). Test output: a realistic image corresponding to the input.
G: U-Net, an encoder-decoder with skip connections.
(Encoder: C64-C128-C256-C512-C512-C512-C512-C512; U-Net decoder: CD512-CD1024-CD1024-C1024-C1024-C512-C256-C128. After the last layer in the decoder, a convolution is applied to map to the number of output channels (3 in general, except in colorization, where it is 2), followed by a Tanh function. As an exception to the above notation, BatchNorm is not applied to the first C64 layer in the encoder. All ReLUs in the encoder are leaky, with slope 0.2, while ReLUs in the decoder are not leaky.)
D: the paper uses a 70x70 PatchGAN. (C64-C128-C256-C512. After the last layer, a convolution is applied to map to a 1-dimensional output, followed by a Sigmoid function. As an exception to the above notation, BatchNorm is not applied to the first C64 layer. All ReLUs are leaky, with slope 0.2.)
Advantage: the PatchGAN can be applied to arbitrarily large images.
Loss function: G* = arg min_G max_D L_cGAN(G, D) + lambda * L_L1(G). (Adding both terms together, with lambda = 100, reduces artifacts.)
Training details:
1) Weights were initialized from a Gaussian distribution with mean 0 and standard deviation 0.02.
2) Batch normalization is applied; batch size 1 was used for some experiments and 4 for others, with little difference observed.
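For reference, below is a minimal Keras sketch of the 70x70 PatchGAN discriminator described above (C64-C128-C256-C512, then a convolution to a 1-channel output plus Sigmoid, no BatchNorm on the first C64, LeakyReLU with slope 0.2). The 256x256 input size and the concatenation of the input/target pair are assumptions for illustration, not the authors' released code.

```python
from tensorflow.keras import layers, Model

def build_patchgan_discriminator(img_shape=(256, 256, 3)):
    src = layers.Input(shape=img_shape)   # conditioning image
    tgt = layers.Input(shape=img_shape)   # real or generated image
    x = layers.Concatenate()([src, tgt])

    # C64-C128-C256-C512: the first three blocks downsample, the last keeps size
    for i, filters in enumerate([64, 128, 256, 512]):
        x = layers.Conv2D(filters, 4, strides=2 if i < 3 else 1, padding="same")(x)
        if i > 0:  # no BatchNorm on the first C64 layer
            x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.2)(x)

    # map to a 1-dimensional output per patch, followed by a Sigmoid
    out = layers.Conv2D(1, 4, strides=1, padding="same", activation="sigmoid")(x)
    return Model([src, tgt], out, name="patchgan_70x70")
```

Each element of the output map scores one patch of the input pair as real or fake; averaging those scores gives the adversarial term of the loss.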

yushenxiang commented 6 years ago

2017-11-10 VideoGAN reading summary:
Capabilities: 1) generate videos from scratch (not conditioned on the past); 2) generate a sequence of frames (32 frames).
Training input: a large amount of unlabeled video, plus noise. Training output: video.
Test input: noise. Test output: video.
Data preprocessing:
1) To reduce the effect of camera shake: extract SIFT keypoints, use RANSAC to estimate a homography (rotation, translation, scale) between adjacent frames, and warp frames to minimize background motion.
2) The only other pre-processing is normalizing the videos to the range [-1, 1].
3) Frames are extracted at the native frame rate (25 fps); 32-frame videos of spatial resolution 64 x 64 are used.
Architecture:
G: two independent streams that take a 100-dimensional Gaussian noise vector as the latent code; one generates the moving foreground, the other the static background, and the two are combined through a mask to produce the video.
D: the architecture is the reverse of the foreground stream in the generator, replacing fractionally strided convolutions with strided convolutions (to down-sample instead of up-sample), and replacing the last layer to output a binary classification (real or not).
Hyperparameters:
1) Adam optimizer with a fixed learning rate of 0.0002 and momentum term of 0.5.
2) The latent code has 100 dimensions, sampled from a normal distribution.
3) Batch size of 64.
4) All weights are initialized with zero-mean Gaussian noise with standard deviation 0.01.
Results and limitations: the model usually learns to put motion on the right objects; one common failure mode is that the objects lack resolution.
Evaluation:
1) Generation is evaluated quantitatively with a psychophysical two-alternative forced choice on Amazon Mechanical Turk: workers are shown two random videos and asked "Which video is more realistic?"
2) Baseline: an autoencoder trained on the same data. The encoder is similar to the discriminator network (except producing a 100-dimensional code), while the decoder follows the two-stream generator network.
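The composition step of the two-stream generator is easy to state in code. Below is a minimal PyTorch sketch (the tensor shapes are my assumptions; this is not the released Torch code) of how the mask blends the moving foreground with the static background:

```python
import torch

def compose_video(foreground, mask, background):
    """Two-stream composition: mask * foreground + (1 - mask) * background."""
    # foreground: (N, 3, T, H, W)  moving foreground stream
    # mask:       (N, 1, T, H, W)  values in [0, 1], broadcast over channels
    # background: (N, 3, H, W)     static background, broadcast over time
    background = background.unsqueeze(2)  # -> (N, 3, 1, H, W)
    return mask * foreground + (1.0 - mask) * background
```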

Extended application: generating a video from a single static image.
Architecture change: the same two-stream model is used, but one change is needed to input the static image instead of the latent code: a five-layer convolutional network is attached to the front of the generator to encode the image into the latent space, similar to a conditional generative adversarial network. The rest of the generator and discriminator networks remain the same.
Loss change: an additional loss term minimizes the L1 distance between the input image and the first frame of the generated video.
Results: 1) Although the extrapolations are rarely correct, they often have fairly plausible motions. 2) The most common failure is that the generated video has a scene similar but not identical to the input image, such as by changing colors or dropping/hallucinating objects.
Possible improvements: the former could be addressed by color histogram normalization in post-processing; the latter will require building more powerful generative models.
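The extra reconstruction term mentioned above can be sketched as follows (the weight `lambda_rec` is a placeholder introduced here for illustration; the paper's exact weighting is not quoted in these notes):

```python
import torch.nn.functional as F

def first_frame_l1(generated_video, input_image, lambda_rec=1.0):
    # generated_video: (N, 3, T, H, W); input_image: (N, 3, H, W)
    # penalize the L1 distance between the conditioning image and frame 0
    return lambda_rec * F.l1_loss(generated_video[:, :, 0], input_image)
```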

yushenxiang commented 6 years ago

2017-11-11 C3D reading summary:
3D convolutional networks can model both temporal and spatial information, whereas 2D convolutional networks only model spatial information: applied to a video, a 2D convolution outputs a single image and collapses the temporal dimension, while the convolution layers of a 3D network output a volume and preserve temporal information, so they can model video.
Contributions:
1) C3D models object appearance and motion simultaneously.
2) Experiments on UCF101 suggest that using 3x3x3 kernels in every layer learns best.
Input: videos are split into non-overlapping 16-frame clips; all video frames are resized to 128 x 171.
Experiments to find the best architecture:
1) All layers share the same temporal depth, one of 1, 3, 5, or 7.
2) Temporal depth varies across layers, with two models: 3-3-5-5-7 (increasing) and 7-5-5-3-3 (decreasing).
For the exact network settings used in these experiments, see the "Common network settings" part of the paper.

The final C3D architecture: a 3D ConvNet with 8 convolution layers, 5 pooling layers, two fully connected layers, and a softmax output layer.
Details: all 3D convolution filters are 3x3x3 with stride 1x1x1. All 3D pooling layers are 2x2x2 with stride 2x2x2, except for pool1, which has kernel size 1x2x2 and stride 1x2x2 with the intention of preserving the temporal information in the early phase. Each fully connected layer has 4096 output units.
Training set: the Sports-1M dataset, which consists of 1.1 million sports videos; each video belongs to one of 487 sports categories.
Training input:
1) Five 2-second clips are randomly sampled from each video and resized to a frame size of 128 x 171.
2) Clips are randomly cropped to 16 x 112 x 112 as a form of jittering.
3) Clips are horizontally flipped with 50% probability.
Hyperparameters: training uses SGD with a mini-batch size of 30 examples; the initial learning rate is 0.003 and is divided by 2 every 150K iterations.
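A minimal Keras sketch of this layout: 8 convolution layers with 3x3x3 kernels, 5 pooling layers (pool1 kept at 1x2x2 to preserve early temporal information), two 4096-unit fully connected layers and a softmax. The filter counts (64-128-256-256-512-512-512-512) follow the paper; dropout is omitted and the padding of pool5 is an assumption made so the shapes work out.

```python
from tensorflow.keras import layers, models

def build_c3d(input_shape=(16, 112, 112, 3), num_classes=487):
    m = models.Sequential(name="c3d")
    m.add(layers.Input(shape=input_shape))
    # conv1 + pool1 (pool1 does not downsample time)
    m.add(layers.Conv3D(64, 3, padding="same", activation="relu"))
    m.add(layers.MaxPooling3D(pool_size=(1, 2, 2), strides=(1, 2, 2)))
    # conv2 + pool2
    m.add(layers.Conv3D(128, 3, padding="same", activation="relu"))
    m.add(layers.MaxPooling3D(pool_size=2, strides=2))
    # conv3a, conv3b + pool3
    m.add(layers.Conv3D(256, 3, padding="same", activation="relu"))
    m.add(layers.Conv3D(256, 3, padding="same", activation="relu"))
    m.add(layers.MaxPooling3D(pool_size=2, strides=2))
    # conv4a, conv4b + pool4
    m.add(layers.Conv3D(512, 3, padding="same", activation="relu"))
    m.add(layers.Conv3D(512, 3, padding="same", activation="relu"))
    m.add(layers.MaxPooling3D(pool_size=2, strides=2))
    # conv5a, conv5b + pool5
    m.add(layers.Conv3D(512, 3, padding="same", activation="relu"))
    m.add(layers.Conv3D(512, 3, padding="same", activation="relu"))
    m.add(layers.MaxPooling3D(pool_size=2, strides=2, padding="same"))
    # fc6, fc7 + softmax output
    m.add(layers.Flatten())
    m.add(layers.Dense(4096, activation="relu"))
    m.add(layers.Dense(4096, activation="relu"))
    m.add(layers.Dense(num_classes, activation="softmax"))
    return m
```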

Compared with other methods, C3D shows advantages in action recognition, scene and object recognition, and runtime speed.
Extended application: after training, C3D can be used as a feature extractor for other video analysis tasks.
Input: to extract C3D features, a video is split into 16-frame clips with an 8-frame overlap between consecutive clips.
Observation: deconvolution was used to visualize what C3D learns; C3D focuses on object appearance in the first few frames and on the object's motion in the subsequent frames.
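The clip extraction described above (16-frame clips with an 8-frame overlap between consecutive clips) is just a sliding window with stride 8; a small sketch:

```python
import numpy as np

def split_into_clips(frames, clip_len=16, stride=8):
    # frames: array of shape (num_frames, H, W, 3); assumes at least clip_len frames
    clips = [frames[i:i + clip_len]
             for i in range(0, len(frames) - clip_len + 1, stride)]
    return np.stack(clips)  # (num_clips, clip_len, H, W, 3)
```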

yushenxiang commented 6 years ago

2017-11-11 3D U-Net reading summary: 1) All 2D operations in the 2D U-Net are replaced with their 3D counterparts.

Architecture:
1) Two parts: an analysis and a synthesis path, each with four resolution steps.
2) In the analysis path, each layer contains two 3x3x3 convolutions, each followed by a rectified linear unit (ReLU), and then a 2x2x2 max pooling with strides of two in each dimension.
3) In the synthesis path, each layer consists of an up-convolution of 2x2x2 with strides of two in each dimension, followed by two 3x3x3 convolutions, each followed by a ReLU.
4) In the last layer, a 1x1x1 convolution reduces the number of output channels to the number of labels, which is 3 in this case.
Input: a 132 x 132 x 116 voxel tile of the image with 3 channels.
Output: 44 x 44 x 28 voxels in x, y, and z, respectively.
Key advantage: a weighted softmax loss function allows training on sparse annotations; setting the weights of unlabeled pixels to zero makes it possible to learn from only the labelled ones and hence to generalize to the whole volume.
Training: besides rotation, scaling and gray-value augmentation, a smooth dense deformation field is applied to both the data and the ground-truth labels.
Evaluation: Intersection over Union (IoU) is used as the accuracy measure, comparing dropped-out ground-truth slices to the predicted 3D volume.
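A minimal sketch (not the authors' Caffe implementation) of the weighted softmax loss idea described above: giving unlabeled voxels a weight of zero removes them from the gradient, so only the sparse annotations drive learning.

```python
import tensorflow as tf

def weighted_softmax_loss(labels, logits, weights):
    # labels:  (N, D, H, W)     integer class labels
    # logits:  (N, D, H, W, C)  raw network outputs, C = number of labels (3 here)
    # weights: (N, D, H, W)     per-voxel weights, 0 at unlabeled voxels
    per_voxel = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)
    return tf.reduce_sum(weights * per_voxel) / (tf.reduce_sum(weights) + 1e-8)
```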

yushenxiang commented 6 years ago

2017-11-13 3D U-Net model code (Keras): https://github.com/ellisdg/3DUnetCNN/blob/master/unet3d/model.py

pix2pix discriminator code (TensorFlow): https://github.com/yenchenlin/pix2pix-tensorflow/blob/master/model.py

C3D code (Keras): https://gist.github.com/albertomontesg/d8b21a179c1e6cca0480ebdf292c34d2

VideoGAN generator and discriminator code (Torch): https://github.com/cvondrick/videogan/blob/master/main.lua
VideoGAN generator code (TensorFlow): https://github.com/Yuliang-Zou/tf_videogan/blob/master/main.py
VideoGAN generator and discriminator code (TensorFlow): https://github.com/wxh1996/VideoGAN-tensorflow/blob/master/model.py

MoCoGAN generator and discriminator code (PyTorch): https://github.com/DLHacks/mocogan/blob/master/models.py

yushenxiang commented 6 years ago

10/23 work log: Reading: 1) how the feature attributes learned by generative models can be manipulated and edited

10/24 work log: 1) Read the DCGAN paper 2) Studied simulated driving-track learning together with Mr. Zhang 3) Learned common Linux commands (cd, ls, less, file)

10/25 work log: 1) Worked with Zhang Wei on implementing image interpolation; 2) Studied Keras usage (Sequential model and functional API)

10/30 work log: 1) Reviewed last week's work 2) Discussed Mr. Zhang's slides 3) Selected road images 4) Wrote and successfully tested a script that normalizes and sorts the filenames of the interpolated road images

11/2 work log: 1) Downloaded the PGGAN model; train.py would not start, reporting that cudnn.h could not be found, and copying cudnn.h into the corresponding include folder did not help. After disabling CUDA, Lasagne still failed to import downsample; I suspect a version mismatch between Lasagne and Theano. Reinstalling several times did not resolve the problem. 2) The PyTorch version of PGGAN is an inference-only model with no train.py; it can only load pretrained weights for image testing, to see what a trained model produces, and it includes an interpolation model that generates 10 consecutive interpolated images for comparison.

11/3 work log: 1) Installed Ubuntu on the desktop and CUDA for the GTX 1070 2) Studied Linux commands 3) Studied the Caffe2 installation 4) Investigated the PGGAN error where Lasagne cannot import downsample; after pulling the latest model from GitHub and running it in the terminal, the error disappeared. It may be related to the IDE environment, so it is recommended to run programs from the terminal in the future.

11/4 work log:
1) How to install CUDA on Ubuntu 14 and a download link for the CUDA 8.0 toolkit (this method failed): http://blog.csdn.net/yanncywang/article/details/52872473
2) GPU driver installation commands (this method failed): http://blog.csdn.net/iotlpf/article/details/54175064
3) Fix for the Ubuntu 14 login loop (caused by a mismatched GPU driver; the old driver must be uninstalled): http://www.jianshu.com/p/35c7fde85968 (strongly recommended to follow this link for the installation!)

If the login loop still appears after rebooting, try the following (this method failed): http://www.binarytides.com/install-nvidia-drivers-ubuntu-14-04/
Look up the GPU driver version on the official site; there may be several versions, so try them one by one: https://www.geforce.cn/drivers
4) Ubuntu installation tutorial: https://jingyan.baidu.com/article/0bc808fc6326ca1bd485b9e6.html
5) SSH remote login: http://www.ruanyifeng.com/blog/2011/12/ssh_remote_login.html
6) Enabling the SSH service on Linux: http://www.cnblogs.com/fengbeihong/p/3307575.html
7) If connecting to a remote server over SSH fails with an error like: ECDSA host key "<ip address>" has changed and you have requested strict checking, see http://blog.csdn.net/ausboyue/article/details/52775281
8) NVIDIA's official CUDA installation guide for Linux (it is best to follow the official steps strictly; other sources are only references; this method failed): http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

For the 11-10 work log, see the following link: https://github.com/createamind/busyplan/issues/10

For the 11-11 work log, see the following link: https://github.com/createamind/busyplan/issues/10

11-14 work log: 1) Found framework code for pix2pix/VideoGAN/MoCoGAN/3D U-Net/C3D; the concrete code links are in the "predict video" issue. This code provides material and ideas for later video generation work. 2) Read the MoCoGAN and VideoGAN papers and their architectures in detail.

3) MoCoGAN summary: why video generation is hard: 1) a video carries a lot of information, both object appearance and object structure, and an error in any of it yields an unrealistic video; 2) the time dimension contains a large amount of variation; 3) human vision is sensitive to motion, so flaws that make a video unrealistic are very easy to notice.

The paper divides existing video prediction work into two categories: 1) generating the raw pixels of future frames conditioned on observed frames; 2) reorganizing the pixels of observed frames through some transformation to produce new frames.

Training input: one random noise vector that generates the content, one sequence of random noise vectors (passed through an RNN to produce motion vectors), and a real video clip. Training output: a video clip.

Test input: one random noise vector that generates the content and one sequence of random noise vectors (passed through the RNN to produce motion vectors). Test output: a video clip.

Latent code: each latent code splits into two parts: a content vector, which determines object appearance, and motion vectors, which determine the object's motion. The content vector is drawn from a multivariate Gaussian distribution; once a content vector is fixed to define an object, a sequence of different motion vectors is needed to define a motion trajectory. These motion vectors are produced by an RNN: random noise drawn from a Gaussian distribution is processed by the RNN (the paper uses a GRU) to yield a sequence of motion vectors.

Architecture: MoCoGAN consists of four networks: a recurrent network (RNN), an image generator, an image discriminator, and a video discriminator.
Recurrent network (RNN): maps a sequence of random noise vectors to a sequence of motion vectors, which define a motion trajectory.
Image generator: maps a sequence of latent codes to a sequence of images, forming a video clip.
Image discriminator: judges the quality of a single generated frame.
Video discriminator: judges the quality of a generated video clip as a whole.
There are two discriminators because the video discriminator is hard to train on its own: once the image discriminator ensures that the appearance of the generated frames is good enough, the video discriminator only needs to judge the quality of the motion, which speeds up convergence.
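A minimal PyTorch sketch of how the per-frame latent code is assembled from the pieces above: a content vector z_c sampled once per clip, a GRU that turns per-frame noise into motion vectors z_m(t), and the concatenation [z_c, z_m(t)] fed to the image generator. The dimensions used here (50 for content, 10 for motion) are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class MotionRNN(nn.Module):
    """GRU that maps a sequence of noise vectors to a sequence of motion vectors."""
    def __init__(self, noise_dim=10, motion_dim=10):
        super().__init__()
        self.gru = nn.GRUCell(noise_dim, motion_dim)
        self.noise_dim, self.motion_dim = noise_dim, motion_dim

    def forward(self, batch_size, num_frames):
        h = torch.zeros(batch_size, self.motion_dim)
        motion = []
        for _ in range(num_frames):
            eps = torch.randn(batch_size, self.noise_dim)  # fresh noise per frame
            h = self.gru(eps, h)
            motion.append(h)
        return torch.stack(motion, dim=1)  # (N, T, motion_dim)

def sample_latent_codes(batch_size=8, num_frames=16, content_dim=50):
    z_c = torch.randn(batch_size, content_dim)          # fixed appearance per clip
    z_m = MotionRNN()(batch_size, num_frames)           # (N, T, 10) motion trajectory
    z_c = z_c.unsqueeze(1).expand(-1, num_frames, -1)   # repeat content over time
    return torch.cat([z_c, z_m], dim=2)                 # (N, T, content_dim + motion_dim)
```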

Gradient updates for the loss alternate between two steps: 1) Step 1: update the image discriminator and the video discriminator. 2) Step 2: update the image generator and the recurrent network.

Training details: We trained MoCoGANs using ADAM. We set the learning rate to 0.0002 and momentums to 0.5 and 0.999, respectively.
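The quoted settings translate directly into a PyTorch optimizer setup; the `nn.Linear` modules below are placeholders standing in for the actual generator and discriminators.

```python
import torch
import torch.nn as nn

generator = nn.Linear(60, 64 * 64 * 3)            # placeholder for the generator + RNN
image_discriminator = nn.Linear(64 * 64 * 3, 1)   # placeholder for the image discriminator
video_discriminator = nn.Linear(64 * 64 * 3, 1)   # placeholder for the video discriminator

opt_g = torch.optim.Adam(generator.parameters(), lr=0.0002, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(
    list(image_discriminator.parameters()) + list(video_discriminator.parameters()),
    lr=0.0002, betas=(0.5, 0.999))
```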

Possible extension: if the dataset contains motion-related labels, they can be fed in along with the content vector or the motion vectors, similar to a conditional GAN.

11-14 work log: 1) Studied the pix2pix and 3D U-Net code in detail and, combining the 3D U-Net architecture with the TensorFlow pix2pix code, rewrote a Keras version of a 3D pix2pix generator; 2) the Keras rewrite of the 3D generator still does not seem quite right; I will continue fixing it tomorrow.