Why Does Gatys et al Neural Style Transfer Work Best With Old VGG CNN Features?

 Does it really?

Let's avoid a discussion of what 'works best' even means, let alone 'style'.  For now.

I grabbed this archived discussion from reddit and copy/pasted it below in case the one on reddit vanishes for some reason.  It's a very interesting read, and it highlights something we have kept pointing out at HTC in many previous posts: there is something about the VGG architecture that seems to work well with a number of different neural net image transformation tasks.

An acquaintance a year or two ago was messing around with neural style transfer (Gatys et al 2016), experimenting with some different approaches, like a tile-based GPU implementation for making large poster-size transfers, or optimizing images to look different using a two-part loss: one to encourage being like the style of the style image, and a negative one to penalize having content like the source image; this is unstable and can diverge, but when it works, looks cool. (Example: "The Great Wave" + Golden Gate Bridge. I tried further Klimt-ising it but at that point too much has been lost.)
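The two-part loss described above can be sketched in a few lines. This is a minimal illustration, not the acquaintance's implementation: plain nested lists stand in for CNN activations, the Gram-matrix style term follows the general recipe of Gatys et al, and the function names and the `content_weight` parameter are assumptions made for the example.

```python
# Sketch only: each "features" argument is a list of C channels,
# and each channel is a flat list of N activations.

def gram_matrix(features):
    """Return the C x C matrix of channel inner products, divided by N."""
    n = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
            for fi in features]

def flatten(matrix):
    """Flatten a list of rows into one flat list."""
    return [v for row in matrix for v in row]

def mean_squared_diff(xs, ys):
    """Mean squared difference between two equally sized flat lists."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def style_loss(image_feats, style_feats):
    """Distance between Gram matrices: small when the styles match."""
    return mean_squared_diff(flatten(gram_matrix(image_feats)),
                             flatten(gram_matrix(style_feats)))

def content_loss(image_feats, content_feats):
    """Distance between raw activations: small when the content matches."""
    return mean_squared_diff(flatten(image_feats), flatten(content_feats))

def two_part_loss(image_feats, style_feats, content_feats, content_weight=1.0):
    # The style term is minimized as usual. The content term enters with a
    # negative sign, so moving away from the source content lowers the loss.
    return (style_loss(image_feats, style_feats)
            - content_weight * content_loss(image_feats, content_feats))
```

A term that rewards distance has no minimum of its own, which may be part of why this setup is described as unstable.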

VGG worked best for style transfer

One thing they noticed was that using features from a pretrained ImageNet VGG-16/19 CNN from 2014 (4 years ago), like the original Gatys paper did, worked much better than anything else; indeed, almost any set of 4-5 layers in VGG would provide great features for the style transfer optimization to target (as long as they were spread out and weren't exclusively bottom or top layers), while using more modern resnets (resnet-50) or GoogLeNet Inception v1 didn't work - it was hard to find sets of layers that would work at all, and when they did, the quality of the style transfer was not as good. Interestingly, this appeared to be true of VGG CNNs trained on the MIT Places scene recognition database too, suggesting there's something architectural going on which is not database specific or peculiar to those two trained models. And their attempt at an upscaling CNN modeled on Johnson et al 2016's VGG-16 for CIFAR-100 worked well too.

Everyone uses VGG

Indeed, VGG is used pervasively through style transfer implementations & research beyond what one would expect from cargo-culting or copy-paste, even in applications as exotic as inferring images from human fMRI scans (Shen et al 2017). This is surprising because 4 years in DL is a long time, and the newer CNNs outperform VGG at everything else like image classification or object localization (Tapa Ghosh disagrees on object localization), rendering VGG obsolete due to its large model size (much of which comes from the 3 large fully-connected layers at the top) & slowness & poor accuracy, and style transfer itself has made major advances in, among other things, going from days on a desktop to generate a new image to being capable of realtime on smartphones. For example, SqueezeNet outperforms VGG in every way, but its style transfer results are distinctly worse (but extremely fast!). Although this VGG-specificity appears to be folklore among practitioners, this is not something I have seen noticed in neural style transfer papers; indeed, the review Jing et al 2017 explicitly says that other models work fine, but their reference is to Johnson's list of models where almost every single model is (still) VGG-based and the ones which are not come with warnings (NIN-Imagenet: "May need heavy tweaking to achieve reasonable results"; Illustration2vec: "Best used with anime content...Be warned that it can sometimes be difficult to avoid the burn marks that the model sometimes creates"; PASCAL VOC FCN-32s: "Uses more resources than VGG-19, but can produce better results depending on your style and/or content image." etc).
Sahil Singla describes his experiences trying to get Inception-v3 to work well for style transfer: he had to change the striding and kernels, swap max pooling for average pooling, search over various layer combinations, compare an ImageNet-trained model against an OpenImages-trained one (and against Inception-v2 & v4) to see whether that mattered, and tweak the style transfer hyperparameters; after all that, he concludes "VGG is way better at the moment."
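The remark that much of VGG's size comes from its three fully-connected layers can be checked by counting parameters. The sketch below assumes the standard VGG-16/19 configurations (3x3 convolutions, 224x224 RGB input, 1000 classes) and 4 bytes per float32 parameter; the function names are made up for the example.

```python
# Standard VGG configurations: numbers are the output channels of 3x3
# convolutions, "M" is a 2x2 max pool (which has no parameters).
VGG16 = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
         512, 512, 512, "M", 512, 512, 512, "M"]
VGG19 = [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
         512, 512, 512, 512, "M", 512, 512, 512, 512, "M"]

def conv_params(config, in_channels=3, kernel=3):
    """Weights plus biases of every convolution in the stack."""
    total = 0
    for out_channels in config:
        if out_channels == "M":
            continue
        total += in_channels * out_channels * kernel * kernel + out_channels
        in_channels = out_channels
    return total

def fc_params(sizes=(7 * 7 * 512, 4096, 4096, 1000)):
    """Weights plus biases of the three fully-connected layers."""
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))

def megabytes(params, bytes_per_param=4):
    """Size in decimal megabytes, assuming float32 storage."""
    return params * bytes_per_param / 1e6

for name, cfg in (("VGG-16", VGG16), ("VGG-19", VGG19)):
    conv, fc = conv_params(cfg), fc_params()
    print(f"{name}: conv {megabytes(conv):.1f} MB, FC {megabytes(fc):.1f} MB, "
          f"FC share {fc / (conv + fc):.0%}")
```

Under these assumptions the convolutional stack alone comes to roughly 59 MB for VGG-16 and 80 MB for VGG-19, with the fully-connected layers accounting for the large majority of the parameters in both.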


Some possible explanations:

  1. VGG is so big that it is incidentally capturing a lot of information that the other models discard and accidentally generalizing better despite worse task-specific performance. (Do resnets in general do transfer-learning worse, compared to earlier CNNs, than would be expected based on their superior task-specific performance?)

    but while VGG is giant compared to other ImageNet models, 500MB vs <50MB (Keras table), most of this appears to be coming from the FC layers rather than the convolutions being sampled (leaving 58/80MB for the rest in VGG-16/19), so where is the supposed knowledge being stored? Nor does VGG appear to have tame internal dynamics lacking in other models - the layer average norms differ greatly, and rescaling appears to be unnecessary (neither they nor Johnson needed to do that like the Bethge lab did).

    on the gripping hand, could the FC layers in some way be forcing the lower convolutions to be different in terms of abstractions than equivalent convolutions in later less-FC-heavy models?

  2. resnets are unrolled iteration/shallow ensembles: the features do exist but they are too spread out to be pulled out easily and the levels of abstraction are all mixed up - instead of getting a nice balance of features from the bottom and top, they're spread out wildly between layer #3 and #33 and #333 etc. While VGGs, being relatively shallow and modular and having no residual connections or other special tricks to smuggle raw information up the layers, are forced to create more of a clearcut hierarchical pyramid of abstractions.

    Here there may be some straightforward way to better capture resnet knowledge; Pierre Richemond suggests:

    Probably ResNets feature maps need to be summed depthwise before taking the Gram matrix. By that logic, one'd think DenseNets should work better than Resnets but worse than VGG (due to gradient flows from earlier layers).

    Alternately, perhaps resnet/densenet layer activations need to be de-summed/de-concatenated.

  3. Residual connections themselves somehow mess up the optimization procedure by affecting properties like independence of features, with "blurring" from layers so easily passing around activations, suggests Kyle Kastner (this might be the same thing as "resnets have too many layers & split up features")

  4. VGG's better performance is due to not downsampling aggressively: it applies two convolutions before its first max pooling.

    In this interpretation, GoogLeNet fails because it downsamples in the first layer.
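The downsampling hypothesis is easy to make concrete with the usual output-size formula for convolution and pooling layers. This sketch assumes a 224x224 input, "same" padding of 1 for VGG's 3x3 convolutions, and the commonly used padding of 3 for GoogLeNet's 7x7 stride-2 first convolution; the helper names are illustrative.

```python
def out_size(size, kernel, stride, padding):
    """Spatial output size of a conv or pooling layer (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1

def trace(size, layers):
    """Return (name, size) after each (name, kernel, stride, padding) layer."""
    sizes = []
    for name, kernel, stride, padding in layers:
        size = out_size(size, kernel, stride, padding)
        sizes.append((name, size))
    return sizes

# VGG: two 3x3 stride-1 convolutions, then a 2x2 stride-2 max pool.
vgg_first_block = [("conv3x3", 3, 1, 1),
                   ("conv3x3", 3, 1, 1),
                   ("maxpool2x2", 2, 2, 0)]

# GoogLeNet (Inception v1): the very first layer is a 7x7 stride-2 convolution.
googlenet_first_layer = [("conv7x7/2", 7, 2, 3)]

print(trace(224, vgg_first_block))
print(trace(224, googlenet_first_layer))
```

With these settings VGG computes two layers of features at the full 224x224 resolution before halving to 112x112, while GoogLeNet is already at 112x112 after its first layer.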

Testing hypotheses

What tests could be done?

  1. train much bigger resnet/DenseNets to see if expanding model capacity helps; alternately, retrain much smaller VGGs to create models which are comparable in parameters to see if the gap goes away. If a small VGG can't do better style transfer than an equal-sized resnet, that suggests there is no special mystery.

    Add/remove FC layers from retrained VGG and resnet models. Does that lead to large gains/losses in quality?

  2. experiment with different ways of picking or summing layers to generate features; possibly brute force, trying out a large number of subsets until one works.

    Another approach would be to try to remove layers entirely: resnets are resistant to deleting random layers, or one could try model distillation to train a shallow but wide resnet from a SOTA deep resnet. With similar parameters, it should perform just as well, but the layer features should be more compressed, making it easier to find a good set of layers.

  3. Model distillation again but for an equivalent resnet minus all residual connections? I don't know if that's trainable at all.

  4. train competing models but with VGG-style initial layers.
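The brute-force search over layer subsets mentioned in test 2 could be organized along these lines. This is only a sketch: the filter encodes one possible reading of "spread out and not exclusively bottom or top layers" (reject subsets that fall entirely in one half of the network), and `score` is a hypothetical caller-supplied function, since how to judge style transfer quality is left open.

```python
from itertools import combinations

def candidate_layer_sets(layers, k):
    """Yield k-layer subsets that are not confined to one half of the network.

    `layers` is ordered from bottom (near the input) to top.
    """
    mid = len(layers) / 2
    for idx in combinations(range(len(layers)), k):
        all_bottom = all(i < mid for i in idx)
        all_top = all(i >= mid for i in idx)
        if not (all_bottom or all_top):
            yield tuple(layers[i] for i in idx)

def best_layer_set(layers, k, score):
    """Return the candidate subset with the highest score.

    `score` maps a tuple of layers to a number (higher is better); in a real
    experiment it would run style transfer with those layers and rate the result.
    """
    return max(candidate_layer_sets(layers, k), key=score)
```

For example, with four hypothetical layers `["a", "b", "c", "d"]` and `k=2`, the pairs `("a", "b")` and `("c", "d")` are skipped and the remaining four pairs are scored. The number of subsets grows combinatorially with depth, so for a deep resnet some sampling or pruning would likely be needed.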

Fixing this limitation to VGG, or showing that current resnets actually do work well and this folklore is false, could speed up style transfer training by replacing VGG with a smaller faster model, or a better one, and might give some interesting insights into what CNNs are learning & how that's affected by their architecture.

EDIT: Google Brain explains it as a kind of checkerboard artifact that VGG accidentally avoids, which would destabilize backprop: https://distill.pub/2018/differentiable-parameterizations/#section-styletransfer

EDITEDIT: work on robust models shows they get good style transfer for free: https://reiinakano.com/2019/06/21/robust-neural-style-transfer.html but the details are confusing: https://distill.pub/2019/advex-bugs-discussion/response-4/

EDITEDITEDIT: Steinberg asks why robust ResNets work if the striding artifacts are still there; Olah mentions that they're powerful enough that their internal structure appears to specifically fight and neutralize the striding bug, explaining why they're able to work.

