Community Model Library

PaddlePaddle currently includes 170+ community models covering CV, NLP, recommendation, and other domains. Details are listed in the tables below.

Image Classification

No. | Paper (link) | Abstract | Dataset / Metrics | Quick Start
1 Wide Residual Networks
Abstract
Deep residual networks were shown to be able to scale up to thousands of layers and still have improving performance. However, each fraction of a percent of improved accuracy costs nearly doubling the number of layers, and so training very deep residual networks has a problem of diminishing feature reuse, which makes these networks very slow to train. To tackle these problems, in this paper we conduct a detailed experimental study on the architecture of ResNet blocks, based on which we propose a novel architecture where we decrease depth and increase width of residual networks. We call the resulting network structures wide residual networks (WRNs) and show that these are far superior over their commonly used thin and very deep counterparts. For example, we demonstrate that even a simple 16-layer-deep wide residual network outperforms in accuracy and efficiency all previous deep residual networks, including thousand-layer-deep networks, achieving new state-of-the-art results on CIFAR, SVHN, COCO, and significant improvements on ImageNet. Our code and models are available at https://github.com/szagoruyko/wide-residual-networks
CIFAR-10 (WRN-28-20-dropout): 96.55% Quick Start
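As a rough illustration of entry 1, the following is a minimal PaddlePaddle sketch of a wide residual block: two 3x3 convolutions whose channel width is scaled by a widening factor k, with dropout between them. The class name and parameters are illustrative, not the repository's implementation.

```python
import paddle
import paddle.nn as nn
import paddle.nn.functional as F

class WideBasicBlock(nn.Layer):
    """Pre-activation (BN-ReLU-Conv) residual block whose width is scaled by k."""
    def __init__(self, in_ch, base_ch, k=10, dropout=0.3, stride=1):
        super().__init__()
        out_ch = base_ch * k                      # widening factor k
        self.bn1 = nn.BatchNorm2D(in_ch)
        self.conv1 = nn.Conv2D(in_ch, out_ch, 3, stride=stride, padding=1)
        self.drop = nn.Dropout(dropout)           # dropout between the two convs
        self.bn2 = nn.BatchNorm2D(out_ch)
        self.conv2 = nn.Conv2D(out_ch, out_ch, 3, stride=1, padding=1)
        # 1x1 projection when the shortcut shape changes
        self.short = (nn.Conv2D(in_ch, out_ch, 1, stride=stride)
                      if (stride != 1 or in_ch != out_ch) else None)

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(self.drop(F.relu(self.bn2(out))))
        residual = x if self.short is None else self.short(x)
        return out + residual

x = paddle.randn([2, 16, 32, 32])
print(WideBasicBlock(16, 16, k=10)(x).shape)      # [2, 160, 32, 32]
```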
2 Colorful Image Colorization
Abstract
Given a grayscale photograph as input, this paper attacks the problem of hallucinating a plausible color version of the photograph. This problem is clearly underconstrained, so previous approaches have either relied on significant user interaction or resulted in desaturated colorizations. We propose a fully automatic approach that produces vibrant and realistic colorizations. We embrace the underlying uncertainty of the problem by posing it as a classification task and use class-rebalancing at training time to increase the diversity of colors in the result. The system is implemented as a feed-forward pass in a CNN at test time and is trained on over a million color images. We evaluate our algorithm using a "colorization Turing test," asking human participants to choose between a generated and ground truth color image. Our method successfully fools humans on 32% of the trials, significantly higher than previous methods. Moreover, we show that colorization can be a powerful pretext task for self-supervised feature learning, acting as a cross-channel encoder. This approach results in state-of-the-art performance on several feature learning benchmarks.
AuC: non-rebal=89.5%, rebal=67.3%; VGG Top-1 Class Acc=56%; AMT Labeled Real=32.3% Quick Start
3 Prototypical Networks for Few-shot Learning
Abstract
Dropout is a powerful and widely used technique to regularize the training of deep neural networks. In this paper, we introduce a simple regularization strategy upon dropout in model training, namely R-Drop, which forces the output distributions of different sub models generated by dropout to be consistent with each other. Specifically, for each training sample, R-Drop minimizes the bidirectional KL-divergence between the output distributions of two sub models sampled by dropout. Theoretical analysis reveals that R-Drop reduces the freedom of the model parameters and complements dropout. Experiments on 5 widely used deep learning tasks (18 datasets in total), including neural machine translation, abstractive summarization, language understanding, language modeling, and image classification, show that R-Drop is universally effective. In particular, it yields substantial improvements when applied to fine-tune large-scale pre-trained models, e.g., ViT, RoBERTa-large, and BART, and achieves state-of-the-art (SOTA) performances with the vanilla Transformer model on WMT14 English→German translation (30.91 BLEU) and WMT14 English→French translation (43.95 BLEU), even surpassing models trained with extra large-scale data and expert-designed advanced variants of Transformer models. Our code is available at GitHub.
1-shot=49.42%, 5-shot=68.2% Quick Start
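A small NumPy sketch of the core computation behind entry 3 (Prototypical Networks): each class prototype is the mean of its support embeddings, and a query is classified by a softmax over negative squared distances to the prototypes. Shapes and names are illustrative only.

```python
import numpy as np

def prototypical_episode(support, support_y, query, n_way):
    """support: [N_s, D] embeddings, support_y: [N_s] labels in [0, n_way),
    query: [N_q, D] embeddings. Returns query log-probabilities over classes."""
    # class prototype = mean embedding of that class's support examples
    prototypes = np.stack([support[support_y == c].mean(axis=0)
                           for c in range(n_way)])                    # [n_way, D]
    # squared Euclidean distance from each query to each prototype
    d2 = ((query[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)  # [N_q, n_way]
    logits = -d2                                                      # closer => larger logit
    logits -= logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return log_p

rng = np.random.default_rng(0)
support, query = rng.normal(size=(5, 64)), rng.normal(size=(3, 64))
print(prototypical_episode(support, np.arange(5), query, n_way=5).shape)  # (3, 5)
```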
4 R-Drop: Regularized Dropout for Neural Networks
Abstract
Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.
ViT-B/16+RD=93.29 Quick Start
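A NumPy sketch of the R-Drop objective from entry 4, assuming logits1 and logits2 come from two stochastic forward passes (different dropout masks) of the same batch; the alpha weight and helper names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def r_drop_loss(logits1, logits2, labels, alpha=0.3):
    """Cross-entropy on both passes plus a bidirectional KL consistency term."""
    p, q = softmax(logits1), softmax(logits2)
    n = len(labels)
    ce = -0.5 * (np.log(p[np.arange(n), labels]) + np.log(q[np.arange(n), labels])).mean()
    kl_pq = (p * (np.log(p) - np.log(q))).sum(-1).mean()
    kl_qp = (q * (np.log(q) - np.log(p))).sum(-1).mean()
    return ce + alpha * 0.5 * (kl_pq + kl_qp)

rng = np.random.default_rng(0)
print(r_drop_loss(rng.normal(size=(4, 10)), rng.normal(size=(4, 10)),
                  np.array([1, 3, 5, 7])))
```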
5 Weight Uncertainty in Neural Networks
Abstract
We introduce a new, efficient, principled and backpropagation-compatible algorithm for learning a probability distribution on the weights of a neural network, called Bayes by Backprop. It regularises the weights by minimising a compression cost, known as the variational free energy or the expected lower bound on the marginal likelihood. We show that this principled kind of regularisation yields comparable performance to dropout on MNIST classification. We then demonstrate how the learnt uncertainty in the weights can be used to improve generalisation in non-linear regression problems, and how this weight uncertainty can be used to drive the exploration-exploitation trade-off in reinforcement learning.
MNIST: Test Error=1.32% Quick Start
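A NumPy sketch of the Bayes by Backprop ingredients from entry 5: the reparameterised weight sample and the complexity term log q(w|theta) - log p(w) of the variational free energy. A single Gaussian prior is assumed here for simplicity; the paper also uses a scale-mixture prior.

```python
import numpy as np

def sample_weights(mu, rho, rng):
    """Reparameterised sample w = mu + log(1 + exp(rho)) * eps, with eps ~ N(0, 1)."""
    sigma = np.log1p(np.exp(rho))
    eps = rng.normal(size=mu.shape)
    return mu + sigma * eps, sigma

def gaussian_log_pdf(x, mean, std):
    return -0.5 * np.log(2 * np.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def complexity_cost(w, mu, sigma, prior_std=1.0):
    """log q(w|theta) - log p(w); adding the data negative log-likelihood
    gives the variational free energy minimised by Bayes by Backprop."""
    log_q = gaussian_log_pdf(w, mu, sigma).sum()
    log_prior = gaussian_log_pdf(w, 0.0, prior_std).sum()
    return log_q - log_prior

rng = np.random.default_rng(0)
mu, rho = np.zeros((3, 3)), -3.0 * np.ones((3, 3))
w, sigma = sample_weights(mu, rho, rng)
print(complexity_cost(w, mu, sigma))
```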
6 Matching Networks for One Shot Learning
Abstract
Learning from a few examples remains a key challenge in machine learning. Despite recent advances in important domains such as vision and language, the standard supervised deep learning paradigm does not offer a satisfactory solution for learning new concepts rapidly from little data. In this work, we employ ideas from metric learning based on deep neural features and from recent advances that augment neural networks with external memories. Our framework learns a network that maps a small labelled support set and an unlabelled example to its label, obviating the need for fine-tuning to adapt to new class types. We then define one-shot learning problems on vision (using Omniglot, ImageNet) and language tasks. Our algorithm improves one-shot accuracy on ImageNet from 87.6% to 93.2% and from 88.0% to 93.8% on Omniglot compared to competing approaches. We also demonstrate the usefulness of the same model on language modeling by introducing a one-shot task on the Penn Treebank.
Omniglot k-way=5, n-shot=1, acc=98.1% Quick Start
7 Modeling Relational Data with Graph Convolutional Networks
Abstract
Recognizing arbitrary multi-character text in unconstrained natural photographs is a hard problem. In this paper, we address an equally hard sub-problem in this domain viz. recognizing arbitrary multi-digit numbers from Street View imagery. Traditional approaches to solve this problem typically separate out the localization, segmentation, and recognition steps. In this paper we propose a unified approach that integrates these three steps via the use of a deep convolutional neural network that operates directly on the image pixels. We employ the DistBelief implementation of deep neural networks in order to train large, distributed neural networks on high quality images. We find that the performance of this approach increases with the depth of the convolutional network, with the best performance occurring in the deepest architecture we trained, with eleven hidden layers. We evaluate this approach on the publicly available SVHN dataset and achieve over 96% accuracy in recognizing complete street numbers. We show that on a per-digit recognition task, we improve upon the state-of-the-art, achieving 97.84% accuracy. We also evaluate this approach on an even more challenging dataset generated from Street View imagery containing several tens of millions of street number annotations and achieve over 90% accuracy. To further explore the applicability of the proposed system to broader text recognition tasks, we apply it to synthetic distorted text from reCAPTCHA. reCAPTCHA is one of the most secure reverse turing tests that uses distorted text to distinguish humans from bots. We report a 99.8% accuracy on the hardest category of reCAPTCHA. Our evaluations on both tasks indicate that at specific operating thresholds, the performance of the proposed system is comparable to, and in some cases exceeds, that of human operators.
Accuracy=95.83% Quick Start
8 Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks
Abstract
Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much flatter solutions than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model. Using SWA we achieve notable improvement in test accuracy over conventional SGD training on a range of state-of-the-art residual networks, PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-10, CIFAR-100, and ImageNet. In short, SWA is extremely easy to implement, improves generalization, and has almost no computational overhead.
Accuracy=95.65% Quick Start
9 Averaging Weights Leads to Wider Optima and Better Generalization
Abstract
Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much flatter solutions than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model. Using SWA we achieve notable improvement in test accuracy over conventional SGD training on a range of state-of-the-art residual networks, PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-10, CIFAR-100, and ImageNet. In short, SWA is extremely easy to implement, improves generalization, and has almost no computational overhead.
VGG16+SWA 1 budget, CIFAR-10 top1=93.59 Quick Start
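A NumPy sketch of the weight averaging at the heart of SWA (entry 9): a running mean of checkpoints collected along the SGD trajectory. In practice BatchNorm statistics must be recomputed for the averaged weights; that step is omitted here, and the dictionary layout is illustrative.

```python
import numpy as np

def swa_update(swa_params, new_params, n_models):
    """Running average of weights, applied tensor by tensor:
    swa <- (swa * n + new) / (n + 1)."""
    return {k: (swa_params[k] * n_models + new_params[k]) / (n_models + 1)
            for k in swa_params}

# toy usage: average three "checkpoints" of a single weight matrix
rng = np.random.default_rng(0)
checkpoints = [{"w": rng.normal(size=(2, 2))} for _ in range(3)]
swa = checkpoints[0]
for n, ckpt in enumerate(checkpoints[1:], start=1):
    swa = swa_update(swa, ckpt, n)
print(swa["w"])   # equals the element-wise mean of the three checkpoints
```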
10 Recurrent Residual Convolutional Neural Network based on U-Net (R2U-Net) for Medical Image Segmentation
Abstract
Deep learning (DL) based semantic segmentation methods have been providing state-of-the-art performance in the last few years. More specifically, these techniques have been successfully applied to medical image classification, segmentation, and detection tasks. One deep learning technique, U-Net, has become one of the most popular for these applications. In this paper, we propose a Recurrent Convolutional Neural Network (RCNN) based on U-Net as well as a Recurrent Residual Convolutional Neural Network (RRCNN) based on U-Net models, which are named RU-Net and R2U-Net respectively. The proposed models utilize the power of U-Net, Residual Network, as well as RCNN. There are several advantages of these proposed architectures for segmentation tasks. First, a residual unit helps when training deep architecture. Second, feature accumulation with recurrent residual convolutional layers ensures better feature representation for segmentation tasks. Third, it allows us to design better U-Net architecture with same number of network parameters with better performance for medical image segmentation. The proposed models are tested on three benchmark datasets such as blood vessel segmentation in retina images, skin cancer segmentation, and lung lesion segmentation. The experimental results show superior performance on segmentation tasks compared to equivalent models including U-Net and residual U-Net (ResU-Net).
R2U-Net F1-score=0.8171 Quick Start
11 Unsupervised Representation Learning by Predicting Image Rotations
Abstract
Over the last years, deep convolutional neural networks (ConvNets) have transformed the field of computer vision thanks to their unparalleled capacity to learn high level semantic image features. However, in order to successfully learn those features, they usually require massive amounts of manually labeled data, which is both expensive and impractical to scale. Therefore, unsupervised semantic feature learning, i.e., learning without requiring manual annotation effort, is of crucial importance in order to successfully harvest the vast amount of visual data that are available today. In our work we propose to learn image features by training ConvNets to recognize the 2d rotation that is applied to the image that it gets as input. We demonstrate both qualitatively and quantitatively that this apparently simple task actually provides a very powerful supervisory signal for semantic feature learning. We exhaustively evaluate our method in various unsupervised feature learning benchmarks and we exhibit in all of them state-of-the-art performance. Specifically, our results on those benchmarks demonstrate dramatic improvements w.r.t. prior state-of-the-art approaches in unsupervised representation learning and thus significantly close the gap with supervised feature learning. For instance, in PASCAL VOC 2007 detection task our unsupervised pre-trained AlexNet model achieves the state-of-the-art (among unsupervised methods) mAP of 54.4% that is only 2.4 points lower from the supervised case. We get similarly striking results when we transfer our unsupervised learned features on various other tasks, such as ImageNet classification, PASCAL classification, PASCAL segmentation, and CIFAR-10 classification. The code and models of our paper will be published on: https://github.com/gidariss/FeatureLearningRotNet .
RotNet+conv, CIFAR-10 top1=91.16 Quick Start
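A NumPy sketch of the rotation pretext task from entry 11: every image is rotated by 0/90/180/270 degrees and the network is trained to predict which rotation was applied. The helper name is illustrative.

```python
import numpy as np

def rotation_pretext_batch(images):
    """images: [N, H, W, C]. Returns the 4 rotated copies of every image and the
    rotation class (0: 0 deg, 1: 90 deg, 2: 180 deg, 3: 270 deg) to predict."""
    rotated, labels = [], []
    for k in range(4):
        rotated.append(np.rot90(images, k=k, axes=(1, 2)))
        labels.append(np.full(len(images), k))
    return np.concatenate(rotated), np.concatenate(labels)

imgs = np.zeros((8, 32, 32, 3))
x, y = rotation_pretext_batch(imgs)
print(x.shape, y.shape)   # (32, 32, 32, 3) (32,)
```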
12 FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence
Abstract
Over the last decade, Convolutional Neural Network (CNN) models have been highly successful in solving complex vision problems. However, these deep models are perceived as "black box" methods considering the lack of understanding of their internal functioning. There has been a significant recent interest in developing explainable deep learning models, and this paper is an effort in this direction. Building on a recently proposed method called Grad-CAM, we propose a generalized method called Grad-CAM++ that can provide better visual explanations of CNN model predictions, in terms of better object localization as well as explaining occurrences of multiple object instances in a single image, when compared to state-of-the-art. We provide a mathematical derivation for the proposed method, which uses a weighted combination of the positive partial derivatives of the last convolutional layer feature maps with respect to a specific class score as weights to generate a visual explanation for the corresponding class label. Our extensive experiments and evaluations, both subjective and objective, on standard datasets showed that Grad-CAM++ provides promising human-interpretable visual explanations for a given CNN architecture across multiple tasks including classification, image caption generation and 3D action recognition; as well as in new settings such as knowledge distillation.
CIFAR-10: 40 labels 93.6%, 250 labels 95.31%, 4000 labels 95.77% Quick Start
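A NumPy sketch of FixMatch's unlabeled-data loss (entry 12), assuming weak_logits and strong_logits are predictions for weakly and strongly augmented views of the same unlabeled batch; the 0.95 threshold follows the paper, everything else is illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fixmatch_unlabeled_loss(weak_logits, strong_logits, threshold=0.95):
    """Pseudo-label each example from its weakly-augmented view, keep only the
    confident ones, and train the strongly-augmented view to match them."""
    p_weak = softmax(weak_logits)
    pseudo = p_weak.argmax(-1)
    mask = p_weak.max(-1) >= threshold              # confidence cut-off
    p_strong = softmax(strong_logits)
    n = len(pseudo)
    ce = -np.log(p_strong[np.arange(n), pseudo])
    return (ce * mask).mean(), mask.mean()          # loss and fraction of kept samples

rng = np.random.default_rng(0)
loss, kept = fixmatch_unlabeled_loss(5 * rng.normal(size=(16, 10)),
                                     rng.normal(size=(16, 10)))
print(loss, kept)
```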
13 MLP-Mixer: An all-MLP Architecture for Vision
Abstract
Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well established CNNs and Transformers.
Mixer-B/16, CIFAR-10: upstream ImageNet 96.72%, upstream ImageNet-21k 96.82% (weights provided by the official JAX repo) Quick Start
14 Deep Networks with Stochastic Depth
Abstract
Very deep convolutional networks with hundreds of layers have led to significant reductions in error on competitive benchmarks. Although the unmatched expressiveness of the many layers can be highly desirable at test time, training very deep networks comes with its own set of challenges. The gradients can vanish, the forward flow often diminishes, and the training time can be painfully slow. To address these problems, we propose stochastic depth, a training procedure that enables the seemingly contradictory setup to train short networks and use deep networks at test time. We start with very deep networks but during training, for each mini-batch, randomly drop a subset of layers and bypass them with the identity function. This simple approach complements the recent success of residual networks. It reduces training time substantially and improves the test error significantly on almost all data sets that we used for evaluation. With stochastic depth we can increase the depth of residual networks even beyond 1200 layers and still yield meaningful improvements in test error (4.91% on CIFAR-10).
CIFAR-10 test error=5.25 Quick Start
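A small sketch of the stochastic depth rule from entry 14: during training each residual branch survives with some probability, and at test time its output is scaled by that probability. Pure NumPy, illustrative only.

```python
import numpy as np

def stochastic_depth_block(x, residual_fn, survival_prob, training, rng):
    """Drop the residual branch with probability 1 - survival_prob during training
    (identity shortcut only); scale the branch by survival_prob at test time."""
    if training:
        if rng.random() < survival_prob:
            return x + residual_fn(x)
        return x                                  # skip the whole block
    return x + survival_prob * residual_fn(x)

rng = np.random.default_rng(0)
x = np.ones(4)
print(stochastic_depth_block(x, lambda t: 0.1 * t, survival_prob=0.8,
                             training=True, rng=rng))
```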
15 Recurrent Models of Visual Attention
Abstract
Applying convolutional neural networks to large images is computationally expensive because the amount of computation scales linearly with the number of image pixels. We present a novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution. Like convolutional neural networks, the proposed model has a degree of translation invariance built-in, but the amount of computation it performs can be controlled independently of the input image size. While the model is non-differentiable, it can be trained using reinforcement learning methods to learn task-specific policies. We evaluate our model on several image classification tasks, where it significantly outperforms a convolutional neural network baseline on cluttered images, and on a dynamic visual control problem, where it learns to track a simple object without an explicit training signal for doing so.
28x28 MNIST, RAM, 6 glimpses, 8x8, 1 scale; matches the paper's reported results Quick Start
16 Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet 
Abstract
Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet. We find it is because: 1) the simple tokenization of input images fails to model the important local structure such as edges and lines among neighboring pixels, leading to low training sample efficiency; 2) the redundant attention backbone design of ViT leads to limited feature richness for fixed computation budgets and limited training samples. To overcome such limitations, we propose a new Tokens-To-Token Vision Transformer (T2T-ViT), which incorporates 1) a layer-wise Tokens-to-Token (T2T) transformation to progressively structurize the image to tokens by recursively aggregating neighboring Tokens into one Token (Tokens-to-Token), such that local structure represented by surrounding tokens can be modeled and tokens length can be reduced; 2) an efficient backbone with a deep-narrow structure for vision transformer motivated by CNN architecture design after empirical study. Notably, T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 3.0\% improvement when trained from scratch on ImageNet. It also outperforms ResNets and achieves comparable performance with MobileNets by directly training on ImageNet. For example, T2T-ViT with comparable size to ResNet50 (21.5M parameters) can achieve 83.3\% top1 accuracy in image resolution 384×384 on ImageNet. (Code: this https URL)
ImageNet-1k: T2T-ViT-7, 71.7% Quick Start
17  Rethinking Spatial Dimensions of Vision Transformers 
Abstract
Vision Transformer (ViT) extends the application range of transformers from language processing to computer vision tasks as being an alternative architecture against the existing convolutional neural networks (CNN). Since the transformer-based architecture has been innovative for computer vision modeling, the design convention towards an effective architecture has been less studied yet. From the successful design principles of CNN, we investigate the role of spatial dimension conversion and its effectiveness on transformer-based architecture. We particularly attend to the dimension reduction principle of CNNs; as the depth increases, a conventional CNN increases channel dimension and decreases spatial dimensions. We empirically show that such a spatial dimension reduction is beneficial to a transformer architecture as well, and propose a novel Pooling-based Vision Transformer (PiT) upon the original ViT model. We show that PiT achieves the improved model capability and generalization performance against ViT. Throughout the extensive experiments, we further show PiT outperforms the baseline on several tasks such as image classification, object detection, and robustness evaluation. Source codes and ImageNet models are available at https://github.com/naver-ai/pit.
ImageNet-1k: pit_ti 73.0% Quick Start
18 Masked Autoencoders Are Scalable Vision Learners
Abstract
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
ImageNet-1k: val 83.6% Quick Start
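A NumPy sketch of MAE's random patch masking (entry 18): a random 75% of patch tokens per image is dropped and only the visible subset would be fed to the encoder. Names and shapes are illustrative.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """patches: [N, L, D] patch embeddings. Keep a random (1 - mask_ratio) subset
    per image and return it together with the binary mask (0 = kept, 1 = masked)."""
    rng = rng or np.random.default_rng()
    n, l, d = patches.shape
    len_keep = int(l * (1 - mask_ratio))
    noise = rng.random((n, l))
    ids_shuffle = np.argsort(noise, axis=1)         # random permutation per image
    ids_keep = ids_shuffle[:, :len_keep]
    visible = np.take_along_axis(patches, ids_keep[..., None], axis=1)
    mask = np.ones((n, l))
    np.put_along_axis(mask, ids_keep, 0.0, axis=1)
    return visible, mask

x = np.random.default_rng(0).normal(size=(2, 196, 768))
vis, mask = random_masking(x)
print(vis.shape, mask.sum(axis=1))                  # (2, 49, 768) [147. 147.]
```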
19 XCiT: Cross-Covariance Image Transformers
Abstract
Following tremendous success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a "transposed" version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries. The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images. Our cross-covariance image transformer (XCiT) – built upon XCA – combines the accuracy of conventional transformers with the scalability of convolutional architectures. We validate the effectiveness and generality of XCiT by reporting excellent results on multiple vision benchmarks, including (self-supervised) image classification on ImageNet-1k, object detection and instance segmentation on COCO, and semantic segmentation on ADE20k.
ImageNet; xcit_nano_12_p8: 224 top1=73.8, 224 top1=76.3, 384 top1=77.8 Quick Start
20 Matching Networks for One Shot Learning
Abstract
Learning from a few examples remains a key challenge in machine learning. Despite recent advances in important domains such as vision and language, the standard supervised deep learning paradigm does not offer a satisfactory solution for learning new concepts rapidly from little data. In this work, we employ ideas from metric learning based on deep neural features and from recent advances that augment neural networks with external memories. Our framework learns a network that maps a small labelled support set and an unlabelled example to its label, obviating the need for fine-tuning to adapt to new class types. We then define one-shot learning problems on vision (using Omniglot, ImageNet) and language tasks. Our algorithm improves one-shot accuracy on ImageNet from 87.6% to 93.2% and from 88.0% to 93.8% on Omniglot compared to competing approaches. We also demonstrate the usefulness of the same model on language modeling by introducing a one-shot task on the Penn Treebank.
Omniglot k-way=5, n-shot=1, acc=98.1 Quick Start
21 CycleMLP: A MLP-like Architecture for Dense Prediction
Abstract
This paper presents a simple MLP-like architecture, CycleMLP, which is a versatile backbone for visual recognition and dense predictions. As compared to modern MLP architectures, e.g., MLP-Mixer, ResMLP, and gMLP, whose architectures are correlated to image size and thus are infeasible in object detection and segmentation, CycleMLP has two advantages compared to modern approaches. (1) It can cope with various image sizes. (2) It achieves linear computational complexity to image size by using local windows. In contrast, previous MLPs have O(N2) computations due to fully spatial connections. We build a family of models which surpass existing MLPs and even state-of-the-art Transformer-based models, e.g., Swin Transformer, while using fewer parameters and FLOPs. We expand the MLP-like models' applicability, making them a versatile backbone for dense prediction tasks. CycleMLP achieves competitive results on object detection, instance segmentation, and semantic segmentation. In particular, CycleMLP-Tiny outperforms Swin-Tiny by 1.3% mIoU on ADE20K dataset with fewer FLOPs. Moreover, CycleMLP also shows excellent zero-shot robustness on ImageNet-C dataset. Code is available at https://github.com/ShoufaChen/CycleMLP.
ImageNet: CycleMLP-B1 78.9 Quick Start
22 Greedy Hash: Towards Fast Optimization for Accurate Hash Coding in CNN
Abstract
To convert the input into binary code, hashing algorithm has been widely used for approximate nearest neighbor search on large-scale image sets due to its computation and storage efficiency. Deep hashing further improves the retrieval quality by combining the hash coding with deep neural network. However, a major difficulty in deep hashing lies in the discrete constraints imposed on the network output, which generally makes the optimization NP hard. In this work, we adopt the greedy principle to tackle this NP hard problem by iteratively updating the network toward the probable optimal discrete solution in each iteration. A hash coding layer is designed to implement our approach which strictly uses the sign function in forward propagation to maintain the discrete constraints, while in back propagation the gradients are transmitted intactly to the front layer to avoid the vanishing gradients. In addition to the theoretical derivation, we provide a new perspective to visualize and understand the effectiveness and efficiency of our algorithm. Experiments on benchmark datasets show that our scheme outperforms state-of-the-art hashing methods in both supervised and unsupervised tasks.
CIFAR-10 (1): 12 bits 0.766, 24 bits 0.794, 32 bits 0.803, 48 bits 0.817 Quick Start
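A conceptual NumPy sketch of the hash layer in entry 22: sign() is used strictly in the forward pass to keep the codes binary, while the backward pass transmits the gradient to the real-valued features unchanged (a straight-through estimator). In a real framework this would be a custom autograd op; the two functions below only illustrate the rule.

```python
import numpy as np

def greedy_hash_forward(features):
    """Forward pass of the hash layer: strict sign() keeps the codes binary."""
    return np.sign(features)

def greedy_hash_backward(grad_codes):
    """Straight-through backward pass: the gradient w.r.t. the binary codes is
    passed to the real-valued features intact, avoiding the zero gradient of sign()."""
    return grad_codes

feats = np.array([[0.3, -1.2, 0.7], [-0.1, 0.4, -2.0]])
codes = greedy_hash_forward(feats)
print(codes)                                  # entries in {-1, +1} (0 only at exact zeros)
print(greedy_hash_backward(np.ones_like(codes)))
```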

Object Detection

No. | Paper (link) | Abstract | Dataset | Quick Start
1 EfficientDet: Scalable and Efficient Object Detection
Abstract
Accurate depth estimation from images is a fundamental task in many applications including scene understanding and reconstruction. Existing solutions for depth estimation often produce blurry approximations of low resolution. This paper presents a convolutional neural network for computing a high-resolution depth map given a single RGB image with the help of transfer learning. Following a standard encoder-decoder architecture, we leverage features extracted using high performing pre-trained networks when initializing our encoder along with augmentation and training strategies that lead to more accurate results. We show how, even for a very simple decoder, our method is able to achieve detailed high-resolution depth maps. Our network, with fewer parameters and training iterations, outperforms state-of-the-art on two datasets and also produces qualitatively better results that capture object boundaries more faithfully. Code and corresponding pre-trained weights are made publicly available.
efficientdet_d0 mAP: 33.6 Quick Start
2 High Quality Monocular Depth Estimation via Transfer Learning
Abstract
We show that the YOLOv4 object detection neural network based on the CSP approach, scales both up and down and is applicable to small and large networks while maintaining optimal speed and accuracy. We propose a network scaling approach that modifies not only the depth, width, resolution, but also structure of the network. YOLOv4-large model achieves state-of-the-art results: 55.5% AP (73.4% AP50) for the MS COCO dataset at a speed of ~16 FPS on Tesla V100, while with the test time augmentation, YOLOv4-large achieves 56.0% AP (73.3 AP50). To the best of our knowledge, this is currently the highest accuracy on the COCO dataset among any published work. The YOLOv4-tiny model achieves 22.0% AP (42.0% AP50) at a speed of 443 FPS on RTX 2080Ti, while by using TensorRT, batch size = 4 and FP16-precision the YOLOv4-tiny achieves 1774 FPS.
NYU Depth v2, δ1: 0.895 (see Table 1 of the original paper) Quick Start
3 Scaled-YOLOv4: Scaling Cross Stage Partial Network
Abstract
The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. Code is at: this https URL.
YOLOv4-P5 mAP: 51.2 Quick Start
4 Focal Loss for Dense Object Detection
Abstract
There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, CmBN, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a realtime speed of ~65 FPS on Tesla V100. Source code is at this https URL
RetinaNet R-50-FPN 1x mAP: 35.7 (see GitHub) Quick Start
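A NumPy sketch of the focal loss from entry 4 in its binary form, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t); alpha=0.25 and gamma=2 follow the paper's defaults, the rest is illustrative.

```python
import numpy as np

def focal_loss(probs, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss. probs: predicted foreground probabilities,
    targets: 0/1 labels. Well-classified examples are down-weighted by (1 - p_t)^gamma."""
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    return (-alpha_t * (1.0 - p_t) ** gamma * np.log(np.clip(p_t, 1e-8, 1.0))).mean()

probs = np.array([0.95, 0.6, 0.1, 0.02])   # easy positives/negatives contribute little
targets = np.array([1, 1, 0, 0])
print(focal_loss(probs, targets))
```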
5 YOLOv4: Optimal Speed and Accuracy of Object Detection
Abstract
Cascade is a classic yet powerful architecture that has boosted performance on various tasks. However, how to introduce cascade to instance segmentation remains an open question. A simple combination of Cascade R-CNN and Mask R-CNN only brings limited gain. In exploring a more effective approach, we find that the key to a successful instance segmentation cascade is to fully leverage the reciprocal relationship between detection and segmentation. In this work, we propose a new framework, Hybrid Task Cascade (HTC), which differs in two important aspects: (1) instead of performing cascaded refinement on these two tasks separately, it interweaves them for a joint multi-stage processing; (2) it adopts a fully convolutional branch to provide spatial context, which can help distinguishing hard foreground from cluttered background. Overall, this framework can learn more discriminative features progressively while integrating complementary features together in each stage. Without bells and whistles, a single HTC obtains 38.4 and 1.5 improvement over a strong Cascade Mask R-CNN baseline on MSCOCO dataset. Moreover, our overall system achieves 48.6 mask AP on the test-challenge split, ranking 1st in the COCO 2018 Challenge Object Detection Task. Code is available at: this https URL.
input size: 416x416, MS COCO mAP=41.2 Quick Start
6 Hybrid Task Cascade for Instance Segmentation
Abstract
We trained a convolutional neural network (CNN) to map raw pixels from a single front-facing camera directly to steering commands. This end-to-end approach proved surprisingly powerful. With minimum training data from humans the system learns to drive in traffic on local roads with or without lane markings and on highways. It also operates in areas with unclear visual guidance such as in parking lots and on unpaved roads. The system automatically learns internal representations of the necessary processing steps such as detecting useful road features with only the human steering angle as the training signal. We never explicitly trained it to detect, for example, the outline of roads. Compared to explicit decomposition of the problem, such as lane marking detection, path planning, and control, our end-to-end system optimizes all processing steps simultaneously. We argue that this will eventually lead to better performance and smaller systems. Better performance will result because the internal components self-optimize to maximize overall system performance, instead of optimizing human-selected intermediate criteria, e.g., lane detection. Such criteria understandably are selected for ease of human interpretation which doesn't automatically guarantee maximum system performance. Smaller networks are possible because the system learns to solve the problem with the minimal number of processing steps. We used an NVIDIA DevBox and Torch 7 for training and an NVIDIA DRIVE(TM) PX self-driving car computer also running Torch 7 for determining where to drive. The system operates at 30 frames per second (FPS).
HTC R-50-FPN 1x box AP: 42.3, mask AP: 37.4 Quick Start
7 Holistically-Nested Edge Detection
Abstract
Humans recognize the visual world at multiple levels: we effortlessly categorize scenes and detect objects inside, while also identifying the textures and surfaces of the objects along with their different compositional parts. In this paper, we study a new task called Unified Perceptual Parsing, which requires the machine vision systems to recognize as many visual concepts as possible from a given image. A multi-task framework called UPerNet and a training strategy are developed to learn from heterogeneous image annotations. We benchmark our framework on Unified Perceptual Parsing and show that it is able to effectively segment a wide range of concepts from images. The trained networks are further applied to discover visual knowledge in natural scenes. Models are available at this https URL.
BSD500 dataset (ODS F-score of .782) Quick Start
8 Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Abstract
We present an approach to efficiently detect the 2D pose of multiple people in an image. The approach uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. The architecture encodes global context, allowing a greedy bottom-up parsing step that maintains high accuracy while achieving realtime performance, irrespective of the number of people in the image. The architecture is designed to jointly learn part locations and their association via two branches of the same sequential prediction process. Our method placed first in the inaugural COCO 2016 keypoints challenge, and significantly exceeds the previous state-of-the-art result on the MPII Multi-Person benchmark, both in performance and efficiency.
BLEU-1: 67%, BLEU-2: 45.7%, BLEU-3: 31.4%, BLEU-4: 21.3% Quick Start
9 Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Abstract
Recent work has demonstrated that deep neural networks are vulnerable to adversarial examples---inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us to identify methods for both training and attacking neural networks that are reliable and, in a certain sense, universal. In particular, they specify a concrete security guarantee that would protect against any adversary. These methods let us train networks with significantly improved resistance to a wide range of adversarial attacks. They also suggest the notion of security against a first-order adversary as a natural and broad security guarantee. We believe that robustness against such well-defined classes of adversaries is an important stepping stone towards fully resistant deep learning models. Code and pre-trained models are available at this https URL and this https URL.
COCO 2014 BLEU-1=79.8% Quick Start
10 Pixel Recurrent Neural Networks
Abstract
Modeling the distribution of natural images is a landmark problem in unsupervised learning. This task requires an image model that is at once expressive, tractable and scalable. We present a deep neural network that sequentially predicts the pixels in an image along the two spatial dimensions. Our method models the discrete probability of the raw pixel values and encodes the complete set of dependencies in the image. Architectural novelties include fast two-dimensional recurrent layers and an effective use of residual connections in deep recurrent networks. We achieve log-likelihood scores on natural images that are considerably better than the previous state of the art. Our main results also provide benchmarks on the diverse ImageNet dataset. Samples generated from the model appear crisp, varied and globally coherent.
NLL test 81.3 Quick Start
11 Residual Attention Network for Image Classification
Abstract
In this work, we propose "Residual Attention Network", a convolutional neural network using attention mechanism which can incorporate with state-of-art feed forward network architecture in an end-to-end training fashion. Our Residual Attention Network is built by stacking Attention Modules which generate attention-aware features. The attention-aware features from different modules change adaptively as layers going deeper. Inside each Attention Module, bottom-up top-down feedforward structure is used to unfold the feedforward and feedback attention process into a single feedforward process. Importantly, we propose attention residual learning to train very deep Residual Attention Networks which can be easily scaled up to hundreds of layers. Extensive analyses are conducted on CIFAR-10 and CIFAR-100 datasets to verify the effectiveness of every module mentioned above. Our Residual Attention Network achieves state-of-the-art object recognition performance on three benchmark datasets including CIFAR-10 (3.90% error), CIFAR-100 (20.45% error) and ImageNet (4.8% single model and single crop, top-5 error). Note that, our method achieves 0.6% top-1 accuracy improvement with 46% trunk depth and 69% forward FLOPs comparing to ResNet-200. The experiment also demonstrates that our network is robust against noisy labels.
Attention-92 top-1 error 4.99% Quick Start
12 Fast R-CNN
Abstract
This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
COCO R50 mAP=37.8 Quick Start
13 Simple Baselines for Human Pose Estimation and Tracking
Abstract
There has been significant progress on pose estimation and increasing interests on pose tracking in recent years. At the same time, the overall algorithm and system complexity increases as well, making the algorithm analysis and comparison more difficult. This work provides simple and effective baseline methods. They are helpful for inspiring and evaluating new ideas for the field. State-of-the-art results are achieved on challenging benchmarks. The code will be available at https://github.com/leoxiaobin/pose.pytorch.
MPII; 256x256_pose_resnet_50 mean=88.53 Quick Start
14 VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection
Abstract
Accurate detection of objects in 3D point clouds is a central problem in many applications, such as autonomous navigation, housekeeping robots, and augmented/virtual reality. To interface a highly sparse LiDAR point cloud with a region proposal network (RPN), most existing efforts have focused on hand-crafted feature representations, for example, a bird's eye view projection. In this work, we remove the need of manual feature engineering for 3D point clouds and propose VoxelNet, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single stage, end-to-end trainable deep network. Specifically, VoxelNet divides a point cloud into equally spaced 3D voxels and transforms a group of points within each voxel into a unified feature representation through the newly introduced voxel feature encoding (VFE) layer. In this way, the point cloud is encoded as a descriptive volumetric representation, which is then connected to a RPN to generate detections. Experiments on the KITTI car detection benchmark show that VoxelNet outperforms the state-of-the-art LiDAR based 3D detection methods by a large margin. Furthermore, our network learns an effective discriminative representation of objects with various geometries, leading to encouraging results in 3D detection of pedestrians and cyclists, based on only LiDAR.
KITTI; VoxelNet 3D detection, KITTI validation: car easy: 81.97, moderate: 65.46, hard: 62.85 (see Table 2 of the original paper; the reference GitHub repo implementation reaches easy: 53.43, moderate: 48.78, hard: 48.06) Quick Start
15 MnasNet: Platform-Aware Neural Architecture Search for Mobile
Abstract
Designing convolutional neural networks (CNN) for mobile devices is challenging because mobile models need to be small and fast, yet still accurate. Although significant efforts have been dedicated to design and improve mobile CNNs on all dimensions, it is very difficult to manually balance these trade-offs when there are so many architectural possibilities to consider. In this paper, we propose an automated mobile neural architecture search (MNAS) approach, which explicitly incorporate model latency into the main objective so that the search can identify a model that achieves a good trade-off between accuracy and latency. Unlike previous work, where latency is considered via another, often inaccurate proxy (e.g., FLOPS), our approach directly measures real-world inference latency by executing the model on mobile phones. To further strike the right balance between flexibility and search space size, we propose a novel factorized hierarchical search space that encourages layer diversity throughout the network. Experimental results show that our approach consistently outperforms state-of-the-art mobile CNN models across multiple vision tasks. On the ImageNet classification task, our MnasNet achieves 75.2% top-1 accuracy with 78ms latency on a Pixel phone, which is 1.8x faster than MobileNetV2 [29] with 0.5% higher accuracy and 2.3x faster than NASNet [36] with 1.2% higher accuracy. Our MnasNet also achieves better mAP quality than MobileNets for COCO object detection. Code is at this https URL
ImageNet: MnasNet-A top-1 73.5 Quick Start
16 Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning
Abstract
While great strides have been made in using deep learning algorithms to solve supervised learning tasks, the problem of unsupervised learning - leveraging unlabeled examples to learn about the structure of a domain - remains a difficult unsolved challenge. Here, we explore prediction of future frames in a video sequence as an unsupervised learning rule for learning about the structure of the visual world. We describe a predictive neural network ("PredNet") architecture that is inspired by the concept of "predictive coding" from the neuroscience literature. These networks learn to predict future frames in a video sequence, with each layer in the network making local predictions and only forwarding deviations from those predictions to subsequent network layers. We show that these networks are able to robustly learn to predict the movement of synthetic (rendered) objects, and that in doing so, the networks learn internal representations that are useful for decoding latent object parameters (e.g. pose) that support object recognition with fewer training views. We also show that these networks can scale to complex natural image streams (car-mounted camera videos), capturing key aspects of both egocentric movement and the movement of objects in the visual scene, and the representation learned in this setting is useful for estimating the steering angle. Altogether, these results suggest that prediction represents a powerful framework for unsupervised learning, allowing for implicit learning of object and scene structure.
KITTI: MSE 0.007 Quick Start
17 Gradient Harmonized Single-stage Detector
Abstract
This paper revisits feature pyramids networks (FPN) for one-stage detectors and points out that the success of FPN is due to its divide-and-conquer solution to the optimization problem in object detection rather than multi-scale feature fusion. From the perspective of optimization, we introduce an alternative way to address the problem instead of adopting the complex feature pyramids - {\em utilizing only one-level feature for detection}. Based on the simple and efficient solution, we present You Only Look One-level Feature (YOLOF). In our method, two key components, Dilated Encoder and Uniform Matching, are proposed and bring considerable improvements. Extensive experiments on the COCO benchmark prove the effectiveness of the proposed model. Our YOLOF achieves comparable results with its feature pyramids counterpart RetinaNet while being 2.5× faster. Without transformer layers, YOLOF can match the performance of DETR in a single-level feature manner with 7× less training epochs. With an image size of 608×608, YOLOF achieves 44.3 mAP running at 60 fps on 2080Ti, which is 13% faster than YOLOv4. Code is available at \url{https://github.com/megvii-model/YOLOF}.
COCO: R50 37.0 Quick Start
18 You Only Look One-level Feature
Abstract
Many modern object detectors demonstrate outstanding performances by using the mechanism of looking and thinking twice. In this paper, we explore this mechanism in the backbone design for object detection. At the macro level, we propose Recursive Feature Pyramid, which incorporates extra feedback connections from Feature Pyramid Networks into the bottom-up backbone layers. At the micro level, we propose Switchable Atrous Convolution, which convolves the features with different atrous rates and gathers the results using switch functions. Combining them results in DetectoRS, which significantly improves the performances of object detection. On COCO test-dev, DetectoRS achieves state-of-the-art 55.7% box AP for object detection, 48.5% mask AP for instance segmentation, and 50.0% PQ for panoptic segmentation. The code is made publicly available.
COCO: R50 37.5 Quick Start
19 YOLOX: Exceeding YOLO Series in 2021 
Abstract
In this report, we present some experienced improvements to YOLO series, forming a new high-performance detector: YOLOX. We switch the YOLO detector to an anchor-free manner and conduct other advanced detection techniques, i.e., a decoupled head and the leading label assignment strategy SimOTA to achieve state-of-the-art results across a large scale range of models: For YOLO-Nano with only 0.91M parameters and 1.08G FLOPs, we get 25.3% AP on COCO, surpassing NanoDet by 1.8% AP; for YOLOv3, one of the most widely used detectors in industry, we boost it to 47.3% AP on COCO, outperforming the current best practice by 3.0% AP; for YOLOX-L with roughly the same amount of parameters as YOLOv4-CSP, YOLOv5-L, we achieve 50.0% AP on COCO at a speed of 68.9 FPS on Tesla V100, exceeding YOLOv5-L by 1.8% AP. Further, we won the 1st Place on Streaming Perception Challenge (Workshop on Autonomous Driving at CVPR 2021) using a single YOLOX-L model. We hope this report can provide useful experience for developers and researchers in practical scenes, and we also provide deploy versions with ONNX, TensorRT, NCNN, and Openvino supported. Source code is at https://github.com/Megvii-BaseDetection/YOLOX.
COCO 2017 test-dev, YOLOX-X (640x640) mAP: 51.2 Quick Start

Image Segmentation

No. | Paper (link) | Abstract | Dataset | Quick Start
1 PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation
Abstract
Few prior works study deep learning on point sets. PointNet by Qi et al. is a pioneer in this direction. However, by design PointNet does not capture local structures induced by the metric space points live in, limiting its ability to recognize fine-grained patterns and generalizability to complex scenes. In this work, we introduce a hierarchical neural network that applies PointNet recursively on a nested partitioning of the input point set. By exploiting metric space distances, our network is able to learn local features with increasing contextual scales. With further observation that point sets are usually sampled with varying densities, which results in greatly decreased performance for networks trained on uniform densities, we propose novel set learning layers to adaptively combine features from multiple scales. Experiments show that our network called PointNet++ is able to learn deep point set features efficiently and robustly. In particular, results significantly better than state-of-the-art have been obtained on challenging benchmarks of 3D point clouds.
ModelNet40: 89.2% Quick Start
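A NumPy sketch of the core PointNet idea from entry 1: a shared per-point MLP followed by symmetric max-pooling over the point axis, which makes the global feature invariant to the ordering of the input cloud. Weights and sizes are illustrative.

```python
import numpy as np

def pointnet_global_feature(points, weights, biases):
    """points: [N, 3] point cloud. Apply the same MLP to every point, then take an
    order-invariant max over points to get a single global feature vector."""
    h = points
    for w, b in zip(weights, biases):
        h = np.maximum(h @ w + b, 0.0)           # shared fully connected layer + ReLU
    return h.max(axis=0)                         # symmetric function => permutation invariant

rng = np.random.default_rng(0)
ws = [rng.normal(size=(3, 64)), rng.normal(size=(64, 1024))]
bs = [np.zeros(64), np.zeros(1024)]
cloud = rng.normal(size=(2048, 3))
feat = pointnet_global_feature(cloud, ws, bs)
shuffled = cloud[rng.permutation(len(cloud))]
print(feat.shape,
      np.allclose(feat, pointnet_global_feature(shuffled, ws, bs)))  # (1024,) True
```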
2 PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space
Abstract
The ability to perform pixel-wise semantic segmentation in real-time is of paramount importance in mobile applications. Recent deep neural networks aimed at this task have the disadvantage of requiring a large number of floating point operations and have long run-times that hinder their usability. In this paper, we propose a novel deep neural network architecture named ENet (efficient neural network), created specifically for tasks requiring low latency operation. ENet is up to 18x faster, requires 75x less FLOPs, has 79x less parameters, and provides similar or better accuracy to existing models. We have tested it on CamVid, Cityscapes and SUN datasets and report on comparisons with existing state-of-the-art methods, and the trade-offs between accuracy and processing time of a network. We present performance measurements of the proposed architecture on embedded systems and suggest possible software improvements that could make ENet even faster.
ModelNet40: 90.7% Quick Start
3 ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation
Abstract
We develop a new edge detection algorithm that tackles two important issues in this long-standing vision problem: (1) holistic image training and prediction; and (2) multi-scale and multi-level feature learning. Our proposed method, holistically-nested edge detection (HED), performs image-to-image prediction by means of a deep learning model that leverages fully convolutional neural networks and deeply-supervised nets. HED automatically learns rich hierarchical representations (guided by deep supervision on side responses) that are important in order to approach the human ability resolve the challenging ambiguity in edge and object boundary detection. We significantly advance the state-of-the-art on the BSD500 dataset (ODS F-score of .782) and the NYU Depth dataset (ODS F-score of .746), and do so with an improved speed (0.4 second per image) that is orders of magnitude faster than some recent CNN-based edge detection algorithms.
Cityscapes mIoU 58.3% Quick Start
4 Unified Perceptual Parsing for Scene Understanding
Abstract
In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or 'atrous convolution', as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed "DeepLab" system sets the new state-of-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU in the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.
Cityscapes mIoU 80.1% Quick Start
5 DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
Abstract
In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or 'atrous convolution', as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed "DeepLab" system sets the new state-of-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU in the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.
Cityscapes mIoU 71.4% 快速开始
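The atrous convolution and ASPP described in the abstract above can be summarized in a short PaddlePaddle sketch. This is not the repository's implementation; the module name `SimpleASPP`, the channel sizes, and the dilation rates (1, 6, 12, 18) are illustrative assumptions — the point is only that a 3×3 kernel with dilation `r` and padding `r` enlarges the receptive field while keeping the parameter count and spatial resolution unchanged.

```python
import paddle
import paddle.nn as nn

class SimpleASPP(nn.Layer):
    """Minimal ASPP-style block: parallel 3x3 convolutions with different
    dilation rates share the same input, and their outputs are concatenated.
    A dilation (atrous) rate r samples the 3x3 kernel at positions r apart,
    enlarging the receptive field without adding parameters."""
    def __init__(self, in_ch=256, out_ch=64, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.LayerList([
            nn.Conv2D(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2D(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        return self.project(paddle.concat(feats, axis=1))

x = paddle.randn([1, 256, 33, 33])   # a dummy backbone feature map
print(SimpleASPP()(x).shape)         # [1, 64, 33, 33] -- spatial size preserved
```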
6 Real-Time High-Resolution Background Matting
Abstract
We introduce a real-time, high-resolution background replacement technique which operates at 30fps in 4K resolution, and 60fps for HD on a modern GPU. Our technique is based on background matting, where an additional frame of the background is captured and used in recovering the alpha matte and the foreground layer. The main challenge is to compute a high-quality alpha matte, preserving strand-level hair details, while processing high-resolution images in real-time. To achieve this goal, we employ two neural networks; a base network computes a low-resolution result which is refined by a second network operating at high-resolution on selective patches. We introduce two largescale video and image matting datasets: VideoMatte240K and PhotoMatte13K/85. Our approach yields higher quality results compared to the previous state-of-the-art in background matting, while simultaneously yielding a dramatic boost in both speed and resolution.
PhotoMatte85 SAD8.65、MSE9.57 快速开始
7 Panoptic Feature Pyramid Networks
Abstract
The recently introduced panoptic segmentation task has renewed our community's interest in unifying the tasks of instance segmentation (for thing classes) and semantic segmentation (for stuff classes). However, current state-of-the-art methods for this joint task use separate and dissimilar networks for instance and semantic segmentation, without performing any shared computation. In this work, we aim to unify these methods at the architectural level, designing a single network for both tasks. Our approach is to endow Mask R-CNN, a popular instance segmentation method, with a semantic segmentation branch using a shared Feature Pyramid Network (FPN) backbone. Surprisingly, this simple baseline not only remains effective for instance segmentation, but also yields a lightweight, top-performing method for semantic segmentation. In this work, we perform a detailed study of this minimally extended version of Mask R-CNN with FPN, which we refer to as Panoptic FPN, and show it is a robust and accurate baseline for both tasks. Given its effectiveness and conceptual simplicity, we hope our method can serve as a strong baseline and aid future research in panoptic segmentation.
Cityscapes mIoU 75.8% 快速开始
8 SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
Abstract
We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network, a corresponding decoder network followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network. The role of the decoder network is to map the low resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies is in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN and also with the well known DeepLab-LargeFOV, DeconvNet architectures. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance. SegNet was primarily motivated by scene understanding applications. Hence, it is designed to be efficient both in terms of memory and computational time during inference. It is also significantly smaller in the number of trainable parameters than other competing architectures. We also performed a controlled benchmark of SegNet and other architectures on both road scenes and SUN RGB-D indoor scene segmentation tasks. We show that SegNet provides good performance with competitive inference time and more efficient inference memory-wise as compared to other architectures. We also provide a Caffe implementation of SegNet and a web demo at http://mi.eng.cam.ac.uk/projects/segnet/.
image size: 360×480; Dataset: CamVid; mIOU: 60.1 快速开始
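The pooling-index trick that distinguishes SegNet's decoder can be illustrated with a minimal NumPy sketch (not the model code): max pooling records where each maximum came from, and unpooling scatters the pooled values back to exactly those positions, producing the sparse maps that the decoder then densifies with trainable filters. The function names and the 2×2 window are assumptions for illustration.

```python
import numpy as np

def max_pool_with_indices(x, k=2):
    """2x2 max pooling that also records the argmax position of each window."""
    H, W = x.shape
    pooled = np.zeros((H // k, W // k), dtype=x.dtype)
    indices = np.zeros((H // k, W // k), dtype=np.int64)   # flat index into x
    for i in range(0, H, k):
        for j in range(0, W, k):
            window = x[i:i + k, j:j + k]
            r, c = np.unravel_index(np.argmax(window), window.shape)
            pooled[i // k, j // k] = window[r, c]
            indices[i // k, j // k] = (i + r) * W + (j + c)
    return pooled, indices

def max_unpool(pooled, indices, out_shape):
    """SegNet-style unpooling: place each pooled value back at the position of
    the original maximum, leaving every other position zero (sparse map)."""
    out = np.zeros(out_shape, dtype=pooled.dtype)
    out.flat[indices.ravel()] = pooled.ravel()
    return out

x = np.random.rand(4, 4).astype(np.float32)
p, idx = max_pool_with_indices(x)
print(max_unpool(p, idx, x.shape))   # zeros except at the four max positions
```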
9 YOLACT: Real-time Instance Segmentation
Abstract
We develop an algorithm that can detect pneumonia from chest X-rays at a level exceeding practicing radiologists. Our algorithm, CheXNet, is a 121-layer convolutional neural network trained on ChestX-ray14, currently the largest publicly available chest X-ray dataset, containing over 100,000 frontal-view X-ray images with 14 diseases. Four practicing academic radiologists annotate a test set, on which we compare the performance of CheXNet to that of radiologists. We find that CheXNet exceeds average radiologist performance on the F1 metric. We extend CheXNet to detect all 14 diseases in ChestX-ray14 and achieve state of the art results on all 14 diseases.
Image size: 550 Resnet101-FPN FPS=33.5 mAP=29.8 快速开始
10 YOLACT++: Better Real-time Instance Segmentation
Abstract
We present a new method for efficient high-quality image segmentation of objects and scenes. By analogizing classical computer graphics methods for efficient rendering with over- and undersampling challenges faced in pixel labeling tasks, we develop a unique perspective of image segmentation as a rendering problem. From this vantage, we present the PointRend (Point-based Rendering) neural network module: a module that performs point-based segmentation predictions at adaptively selected locations based on an iterative subdivision algorithm. PointRend can be flexibly applied to both instance and semantic segmentation tasks by building on top of existing state-of-the-art models. While many concrete implementations of the general idea are possible, we show that a simple design already achieves excellent results. Qualitatively, PointRend outputs crisp object boundaries in regions that are over-smoothed by previous methods. Quantitatively, PointRend yields significant gains on COCO and Cityscapes, for both instance and semantic segmentation. PointRend's efficiency enables output resolutions that are otherwise impractical in terms of memory or computation compared to existing approaches. Code has been made available at this https URL.
Image size: 550 Resnet50-FPN FPS=33.5 mAP=34.1 快速开始
11 PointRend: Image Segmentation as Rendering
Abstract
Recent works have widely explored the contextual dependencies to achieve more accurate segmentation results. However, most approaches rarely distinguish different types of contextual dependencies, which may pollute the scene understanding. In this work, we directly supervise the feature aggregation to distinguish the intra-class and inter-class context clearly. Specifically, we develop a Context Prior with the supervision of the Affinity Loss. Given an input image and corresponding ground truth, Affinity Loss constructs an ideal affinity map to supervise the learning of Context Prior. The learned Context Prior extracts the pixels belonging to the same category, while the reversed prior focuses on the pixels of different classes. Embedded into a conventional deep CNN, the proposed Context Prior Layer can selectively capture the intra-class and inter-class contextual dependencies, leading to robust feature representation. To validate the effectiveness, we design an effective Context Prior Network (CPNet). Extensive quantitative and qualitative evaluations demonstrate that the proposed model performs favorably against state-of-the-art semantic segmentation approaches. More specifically, our algorithm achieves 46.3% mIoU on ADE20K, 53.9% mIoU on PASCAL-Context, and 81.3% mIoU on Cityscapes. Code is available at this https URL.
cityscapes resnet50+FPN mIoU 78.3% 参考P5 architecture 快速开始
12 Context Prior for Scene Segmentation
Abstract
BiSeNet has been proved to be a popular two-stream network for real-time segmentation. However, its principle of adding an extra path to encode spatial information is time-consuming, and the backbones borrowed from pretrained tasks, e.g., image classification, may be inefficient for image segmentation due to the deficiency of task-specific design. To handle these problems, we propose a novel and efficient structure named Short-Term Dense Concatenate network (STDC network) by removing structure redundancy. Specifically, we gradually reduce the dimension of feature maps and use the aggregation of them for image representation, which forms the basic module of STDC network. In the decoder, we propose a Detail Aggregation module by integrating the learning of spatial information into low-level layers in single-stream manner. Finally, the low-level features and deep features are fused to predict the final segmentation results. Extensive experiments on Cityscapes and CamVid dataset demonstrate the effectiveness of our method by achieving promising trade-off between segmentation accuracy and inference speed. On Cityscapes, we achieve 71.9% mIoU on the test set with a speed of 250.4 FPS on NVIDIA GTX 1080Ti, which is 45.2% faster than the latest methods, and achieve 76.8% mIoU with 97.0 FPS while inferring on higher resolution images.
cityscapes resnet101 mIoU 81.3% 参考论文table6 快速开始
13 Rethinking BiSeNet For Real-time Semantic Segmentation
Abstract
BiSeNet has been proved to be a popular two-stream network for real-time segmentation. However, its principle of adding an extra path to encode spatial information is time-consuming, and the backbones borrowed from pretrained tasks, e.g., image classification, may be inefficient for image segmentation due to the deficiency of task-specific design. To handle these problems, we propose a novel and efficient structure named Short-Term Dense Concatenate network (STDC network) by removing structure redundancy. Specifically, we gradually reduce the dimension of feature maps and use the aggregation of them for image representation, which forms the basic module of STDC network. In the decoder, we propose a Detail Aggregation module by integrating the learning of spatial information into low-level layers in single-stream manner. Finally, the low-level features and deep features are fused to predict the final segmentation results. Extensive experiments on Cityscapes and CamVid dataset demonstrate the effectiveness of our method by achieving promising trade-off between segmentation accuracy and inference speed. On Cityscapes, we achieve 71.9% mIoU on the test set with a speed of 250.4 FPS on NVIDIA GTX 1080Ti, which is 45.2% faster than the latest methods, and achieve 76.8% mIoU with 97.0 FPS while inferring on higher resolution images.
imgsize 512 × 1024 cityscapes STDC2-Seg50  mIoU 74.2% 参见论文table6 快速开始
14 ESPNetv2: A Light-weight, Power Efficient, and General Purpose Convolutional Neural Network
Abstract
We introduce a light-weight, power efficient, and general purpose convolutional neural network, ESPNetv2, for modeling visual and sequential data. Our network uses group point-wise and depth-wise dilated separable convolutions to learn representations from a large effective receptive field with fewer FLOPs and parameters. The performance of our network is evaluated on four different tasks: (1) object classification, (2) semantic segmentation, (3) object detection, and (4) language modeling. Experiments on these tasks, including image classification on the ImageNet and language modeling on the PenTree bank dataset, demonstrate the superior performance of our method over the state-of-the-art methods. Our network outperforms ESPNet by 4-5% and has 2-4x fewer FLOPs on the PASCAL VOC and the Cityscapes dataset. Compared to YOLOv2 on the MS-COCO object detection, ESPNetv2 delivers 4.4% higher accuracy with 6x fewer FLOPs. Our experiments show that ESPNetv2 is much more power efficient than existing state-of-the-art efficient methods including ShuffleNets and MobileNets. Our code is open-source and available at https://github.com/sacmehta/ESPNetv2
cityscapes ESPNetv2-val mIoU 66.4% 参见论文 fig7 快速开始
15 Exploring Cross-Image Pixel Contrast for Semantic Segmentation
Abstract
Current semantic segmentation methods focus only on mining "local" context, i.e., dependencies between pixels within individual images, by context-aggregation modules (e.g., dilated convolution, neural attention) or structure-aware optimization criteria (e.g., IoU-like loss). However, they ignore "global" context of the training data, i.e., rich semantic relations between pixels across different images. Inspired by the recent advance in unsupervised contrastive representation learning, we propose a pixel-wise contrastive framework for semantic segmentation in the fully supervised setting. The core idea is to enforce pixel embeddings belonging to a same semantic class to be more similar than embeddings from different classes. It raises a pixel-wise metric learning paradigm for semantic segmentation, by explicitly exploring the structures of labeled pixels, which were rarely explored before. Our method can be effortlessly incorporated into existing segmentation frameworks without extra overhead during testing. We experimentally show that, with famous segmentation models (i.e., DeepLabV3, HRNet, OCR) and backbones (i.e., ResNet, HR-Net), our method brings consistent performance improvements across diverse datasets (i.e., Cityscapes, PASCAL-Context, COCO-Stuff, CamVid). We expect this work will encourage our community to rethink the current de facto training paradigm in fully supervised semantic segmentation.
HRNet-W48 Cityscapes mIOU=80.18 快速开始
16 Category-Level Adversarial Adaptation for Semantic Segmentation using Purified Features
Abstract
We target the problem named unsupervised domain adaptive semantic segmentation. A key in this campaign consists in reducing the domain shift, so that a classifier based on labeled data from one domain can generalize well to other domains. With the advancement of adversarial learning methods, recent works prefer the strategy of aligning the marginal distribution in the feature spaces for minimizing the domain discrepancy. However, based on the observance in experiments, only focusing on aligning global marginal distribution but ignoring the local joint distribution alignment fails to be the optimal choice. Other than that, the noisy factors existing in the feature spaces, which are not relevant to the target task, entangle with the domain invariant factors improperly and make the domain distribution alignment more difficult. To address those problems, we introduce two new modules, Significance-aware Information Bottleneck (SIB) and Category-level alignment (CLA), to construct a purified embedding-based category-level adversarial network. As the name suggests, our designed network, CLAN, can not only disentangle the noisy factors and suppress their influences for target tasks but also utilize those purified features to conduct a more delicate level domain calibration, i.e., global marginal distribution and local joint distribution alignment simultaneously. In three domain adaptation tasks, i.e., GTA5 → Cityscapes, SYNTHIA → Cityscapes and Cross Season, we validate that our proposed method matches the state of the art in segmentation accuracy.
Resnet101 Cityscapes mIoU 45.5% 快速开始
17 Brain Tumor Segmentation with Deep Neural Networks
Abstract
In this paper, we present a fully automatic brain tumor segmentation method based on Deep Neural Networks (DNNs). The proposed networks are tailored to glioblastomas (both low and high grade) pictured in MR images. By their very nature, these tumors can appear anywhere in the brain and have almost any kind of shape, size, and contrast. These reasons motivate our exploration of a machine learning solution that exploits a flexible, high capacity DNN while being extremely efficient. Here, we give a description of different model choices that we've found to be necessary for obtaining competitive performance. We explore in particular different architectures based on Convolutional Neural Networks (CNN), i.e. DNNs specifically adapted to image data. We present a novel CNN architecture which differs from those traditionally used in computer vision. Our CNN exploits both local features as well as more global contextual features simultaneously. Also, different from most traditional uses of CNNs, our networks use a final layer that is a convolutional implementation of a fully connected layer which allows a 40 fold speed up. We also describe a 2-phase training procedure that allows us to tackle difficulties related to the imbalance of tumor labels. Finally, we explore a cascade architecture in which the output of a basic CNN is treated as an additional source of information for a subsequent CNN. Results reported on the 2013 BRATS test dataset reveal that our architecture improves over the currently published state-of-the-art while being over 30 times faster.
BRATS 2013 test: Dice Complete 0.84, Core 0.72, Enhancing 0.57 快速开始
18 Dynamic Graph CNN for Learning on Point Clouds
Abstract
Point clouds provide a flexible geometric representation suitable for countless applications in computer graphics; they also comprise the raw output of most 3D data acquisition devices. While hand-designed features on point clouds have long been proposed in graphics and vision, however, the recent overwhelming success of convolutional neural networks (CNNs) for image analysis suggests the value of adapting insight from CNN to the point cloud world. Point clouds inherently lack topological information so designing a model to recover topology can enrich the representation power of point clouds. To this end, we propose a new neural network module dubbed EdgeConv suitable for CNN-based high-level tasks on point clouds including classification and segmentation. EdgeConv acts on graphs dynamically computed in each layer of the network. It is differentiable and can be plugged into existing architectures. Compared to existing modules operating in extrinsic space or treating each point independently, EdgeConv has several appealing properties: It incorporates local neighborhood information; it can be stacked applied to learn global shape properties; and in multi-layer systems affinity in feature space captures semantic characteristics over potentially long distances in the original embedding. We show the performance of our model on standard benchmarks including ModelNet40, ShapeNetPart, and S3DIS.
mIOU=85.2% 参考论文 Table.6 快速开始
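A minimal NumPy sketch of the EdgeConv input construction mentioned above (not the DGCNN code): for every point, gather its k nearest neighbours and form the edge feature `[x_i, x_j - x_i]`; in the full model a shared MLP and a max over neighbours would follow, and the graph is recomputed in feature space at every layer. `knn_indices`, `edge_features`, and `k=4` are illustrative choices.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbours of every point (excluding itself)."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)   # (N, N)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]                            # (N, k)

def edge_features(points, k=4):
    """EdgeConv input: for each point x_i and neighbour x_j, the feature
    [x_i, x_j - x_i]; a shared MLP plus a max over neighbours would follow."""
    idx = knn_indices(points, k)                   # (N, k)
    neighbours = points[idx]                       # (N, k, C)
    centers = np.repeat(points[:, None, :], k, axis=1)
    return np.concatenate([centers, neighbours - centers], axis=-1)  # (N, k, 2C)

pts = np.random.rand(128, 3).astype(np.float32)
print(edge_features(pts).shape)                    # (128, 4, 6)
```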
19 Adaptive Pyramid Context Network for Semantic Segmentation
Abstract
Semi-supervised learning (SSL) provides an effective means of leveraging unlabeled data to improve a model's performance. In this paper, we demonstrate the power of a simple combination of two common SSL methods: consistency regularization and pseudo-labeling. Our algorithm, FixMatch, first generates pseudo-labels using the model's predictions on weakly-augmented unlabeled images. For a given image, the pseudo-label is only retained if the model produces a high-confidence prediction. The model is then trained to predict the pseudo-label when fed a strongly-augmented version of the same image. Despite its simplicity, we show that FixMatch achieves state-of-the-art performance across a variety of standard semi-supervised learning benchmarks, including 94.93% accuracy on CIFAR-10 with 250 labels and 88.61% accuracy with 40 -- just 4 labels per class. Since FixMatch bears many similarities to existing SSL methods that achieve worse performance, we carry out an extensive ablation study to tease apart the experimental factors that are most important to FixMatch's success. We make our code available at this https URL.
Cityscapes: mIOU=79.28% 快速开始
20 CGNet: A Light-weight Context Guided Network for Semantic Segmentation
Abstract
We focus on the challenging task of real-time semantic segmentation in this paper. It finds many practical applications and yet is with fundamental difficulty of reducing a large portion of computation for pixel-wise label inference. We propose an image cascade network (ICNet) that incorporates multi-resolution branches under proper label guidance to address this challenge. We provide in-depth analysis of our framework and introduce the cascade feature fusion unit to quickly achieve high-quality segmentation. Our system yields real-time inference on a single GPU card with decent quality results evaluated on challenging datasets like Cityscapes, CamVid and COCO-Stuff.
Cityscapes valset: M3N21, mIOU=68.27% 快速开始
21 ICNet for Real-Time Semantic Segmentation on High-Resolution Images
Abstract
Recent deep learning based approaches have shown promising results for the challenging task of inpainting large missing regions in an image. These methods can generate visually plausible image structures and textures, but often create distorted structures or blurry textures inconsistent with surrounding areas. This is mainly due to ineffectiveness of convolutional neural networks in explicitly borrowing or copying information from distant spatial locations. On the other hand, traditional texture and patch synthesis approaches are particularly suitable when it needs to borrow textures from the surrounding regions. Motivated by these observations, we propose a new deep generative model-based approach which can not only synthesize novel image structures but also explicitly utilize surrounding image features as references during network training to make better predictions. The model is a feed-forward, fully convolutional neural network which can process images with multiple holes at arbitrary locations and with variable sizes during the test time. Experiments on multiple datasets including faces (CelebA, CelebA-HQ), textures (DTD) and natural images (ImageNet, Places2) demonstrate that our proposed approach generates higher-quality inpainting results than existing ones. Code, demo and models are available at: this https URL.
Cityscapes mIOU 69.6% 快速开始
22 Context Encoding for Semantic Segmentation
Abstract
Recent work has made significant progress in improving spatial resolution for pixelwise labeling with Fully Convolutional Network (FCN) framework by employing Dilated/Atrous convolution, utilizing multi-scale features and refining boundaries. In this paper, we explore the impact of global contextual information in semantic segmentation by introducing the Context Encoding Module, which captures the semantic context of scenes and selectively highlights class-dependent featuremaps. The proposed Context Encoding Module significantly improves semantic segmentation results with only marginal extra computation cost over FCN. Our approach has achieved new state-of-the-art results 51.7% mIoU on PASCAL-Context, 85.9% mIoU on PASCAL VOC 2012. Our single model achieves a final score of 0.5567 on ADE20K test set, which surpass the winning entry of COCO-Place Challenge in 2017. In addition, we also explore how the Context Encoding Module can improve the feature representation of relatively shallow networks for the image classification on CIFAR-10 dataset. Our 14 layer network has achieved an error rate of 3.45%, which is comparable with state-of-the-art approaches with over 10 times more layers. The source code for the complete system are publicly available.
Cityscapes; Cityscapes mIOU = 78.55% 快速开始
23 BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation
Abstract
Semantic segmentation requires both rich spatial information and sizeable receptive field. However, modern approaches usually compromise spatial resolution to achieve real-time inference speed, which leads to poor performance. In this paper, we address this dilemma with a novel Bilateral Segmentation Network (BiSeNet). We first design a Spatial Path with a small stride to preserve the spatial information and generate high-resolution features. Meanwhile, a Context Path with a fast downsampling strategy is employed to obtain sufficient receptive field. On top of the two paths, we introduce a new Feature Fusion Module to combine features efficiently. The proposed architecture makes a right balance between the speed and segmentation performance on Cityscapes, CamVid, and COCO-Stuff datasets. Specifically, for a 2048x1024 input, we achieve 68.4% Mean IOU on the Cityscapes test dataset with speed of 105 FPS on one NVIDIA Titan XP card, which is significantly faster than the existing methods with comparable performance.
Cityscapes: resnet18 mIOU=74.8 对应论文 Table.6 中实现 快速开始
24 FastFCN: Rethinking Dilated Convolution in the Backbone for Semantic Segmentation
Abstract
Modern approaches for semantic segmentation usually employ dilated convolutions in the backbone to extract high-resolution feature maps, which brings heavy computation complexity and memory footprint. To replace the time and memory consuming dilated convolutions, we propose a novel joint upsampling module named Joint Pyramid Upsampling (JPU) by formulating the task of extracting high-resolution feature maps into a joint upsampling problem. With the proposed JPU, our method reduces the computation complexity by more than three times without performance loss. Experiments show that JPU is superior to other upsampling modules, which can be plugged into many existing approaches to reduce computation complexity and improve performance. By replacing dilated convolutions with the proposed JPU module, our method achieves the state-of-the-art performance in Pascal Context dataset (mIoU of 53.13%) and ADE20K dataset (final score of 0.5584) while running 3 times faster.
ADE20K: EncNet+JPU (resnet50)mIOU=42.5 快速开始
25 Dynamic Multi-Scale Filters for Semantic Segmentation             
Abstract
Multi-scale representation provides an effective way to address scale variation of objects and stuff in semantic segmentation. Previous works construct multi-scale representation by utilizing different filter sizes, expanding filter sizes with dilated filters or pooling grids, and the parameters of these filters are fixed after training. These methods often suffer from heavy computational cost or have more parameters, and are not adaptive to the input image during inference. To address these problems, this paper proposes a Dynamic Multi-scale Network (DMNet) to adaptively capture multi-scale contents for predicting pixel-level semantic labels. DMNet is composed of multiple Dynamic Convolutional Modules (DCMs) arranged in parallel, each of which exploits context-aware filters to estimate semantic representation for a specific scale. The outputs of multiple DCMs are further integrated for final segmentation. We conduct extensive experiments to evaluate our DMNet on three challenging semantic segmentation and scene parsing datasets, PASCAL VOC 2012, Pascal-Context, and ADE20K. DMNet achieves a new record 84.4% mIoU on PASCAL VOC 2012 test set without MS COCO pre-trained and post-processing, and also obtains state-of-the-art performance on PascalContext and ADE20K.
Cityscapes: mIOU = 79.64% 快速开始
26 ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation
Abstract
We introduce a fast and efficient convolutional neural network, ESPNet, for semantic segmentation of high resolution images under resource constraints. ESPNet is based on a new convolutional module, efficient spatial pyramid (ESP), which is efficient in terms of computation, memory, and power. ESPNet is 22 times faster (on a standard GPU) and 180 times smaller than the state-of-the-art semantic segmentation network PSPNet, while its category-wise accuracy is only 8% less. We evaluated ESPNet on a variety of semantic segmentation datasets including Cityscapes, PASCAL VOC, and a breast biopsy whole slide image dataset. Under the same constraints on memory and computation, ESPNet outperforms all the current efficient CNN networks such as MobileNet, ShuffleNet, and ENet on both standard metrics and our newly introduced performance metrics that measure efficiency on edge devices. Our network can process high resolution images at a rate of 112 and 9 frames per second on a standard GPU and edge device, respectively.
Cityscapes; 1. mIOU=60.3 对应论文 Table.1(a) 中实现; 2. 训练日志中包含周期性的在 valset 上的评估结果。 快速开始
27 Adversarial Learning for Semi-Supervised Semantic Segmentation
Abstract
We propose a method for semi-supervised semantic segmentation using an adversarial network. While most existing discriminators are trained to classify input images as real or fake on the image level, we design a discriminator in a fully convolutional manner to differentiate the predicted probability maps from the ground truth segmentation distribution with the consideration of the spatial resolution. We show that the proposed discriminator can be used to improve semantic segmentation accuracy by coupling the adversarial loss with the standard cross entropy loss of the proposed model. In addition, the fully convolutional discriminator enables semi-supervised learning through discovering the trustworthy regions in predicted results of unlabeled images, thereby providing additional supervisory signals. In contrast to existing methods that utilize weakly-labeled images, our method leverages unlabeled images to enhance the segmentation model. Experimental results on the PASCAL VOC 2012 and Cityscapes datasets demonstrate the effectiveness of the proposed algorithm.
Pascal VOC 2012: data amount= 1/8; DeepLab-v2 with ResNet-101; mIOU=69.5 对应论文 Table.1 快速开始
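A hedged PaddlePaddle sketch of how the adversarial term is coupled with the standard cross-entropy on labeled data, as described above. `disc`, `lambda_adv=0.01`, and the tensor layout are assumptions rather than the paper's or the repository's exact formulation; the semi-supervised branch (masking unlabeled predictions by discriminator confidence) is omitted for brevity.

```python
import paddle
import paddle.nn.functional as F

def labeled_loss(seg_logits, labels, disc, lambda_adv=0.01):
    """Supervised branch (a sketch): standard cross-entropy plus an adversarial
    term that asks the fully-convolutional discriminator `disc` to judge the
    predicted probability map as if it were ground truth (target label 1).
    seg_logits: (N, C, H, W) raw scores, labels: (N, H, W) class indices."""
    ce = F.cross_entropy(seg_logits, labels, axis=1)          # mean pixel-wise CE
    prob = F.softmax(seg_logits, axis=1)
    d_out = disc(prob)                                        # (N, 1, H, W) logits
    adv = F.binary_cross_entropy_with_logits(d_out, paddle.ones_like(d_out))
    return ce + lambda_adv * adv
```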
28 V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation
Abstract
Convolutional Neural Networks (CNNs) have been recently employed to solve problems from both the computer vision and medical image analysis fields. Despite their popularity, most approaches are only able to process 2D images while most medical data used in clinical practice consists of 3D volumes. In this work we propose an approach to 3D image segmentation based on a volumetric, fully convolutional, neural network. Our CNN is trained end-to-end on MRI volumes depicting prostate, and learns to predict segmentation for the whole volume at once. We introduce a novel objective function, that we optimise during training, based on Dice coefficient. In this way we can deal with situations where there is a strong imbalance between the number of foreground and background voxels. To cope with the limited number of annotated volumes available for training, we augment the data applying random non-linear transformations and histogram matching. We show in our experimental evaluation that our approach achieves good performances on challenging test data while requiring only a fraction of the processing time needed by other previous methods.
Prostate dataset Dice coefficient: 0.869参考论文指标 快速开始
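The Dice-based objective introduced by V-Net can be written down in a few lines of NumPy. This is a sketch rather than the training code: `eps` is an assumed smoothing constant, and the squared-sum denominator follows the formulation described in the paper.

```python
import numpy as np

def soft_dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss for a binary volume.
    probs  : predicted foreground probabilities, e.g. shape (D, H, W)
    target : binary ground-truth volume of the same shape
    Dice = 2*|P∩G| / (|P|+|G|); minimizing 1 - Dice keeps heavily imbalanced
    foreground/background voxel counts from dominating the objective."""
    intersection = (probs * target).sum()
    denom = (probs ** 2).sum() + (target ** 2).sum()   # squared form used in V-Net
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)

pred = np.random.rand(8, 64, 64)
gt = (np.random.rand(8, 64, 64) > 0.7).astype(np.float32)
print(soft_dice_loss(pred, gt))
```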

图像生成

序号 论文名称(链接) 摘要 数据集 快速开始
1 Deep Image Prior
Abstract
Deep convolutional networks have become a popular tool for image generation and restoration. Generally, their excellent performance is imputed to their ability to learn realistic image priors from a large number of example images. In this paper, we show that, on the contrary, the structure of a generator network is sufficient to capture a great deal of low-level image statistics prior to any learning. In order to do so, we show that a randomly-initialized neural network can be used as a handcrafted prior with excellent results in standard inverse problems such as denoising, super-resolution, and inpainting. Furthermore, the same prior can be used to invert deep neural representations to diagnose them, and to restore images based on flash-no flash input pairs. Apart from its diverse applications, our approach highlights the inductive bias captured by standard generator network architectures. It also bridges the gap between two very popular families of image restoration methods: learning-based methods using deep convolutional networks and learning-free methods based on handcrafted image priors such as self-similarity. Code and supplementary material are available at https://dmitryulyanov.github.io/deep_image_prior .
8× super-resolution, avg psnr=24.15% 快速开始
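A minimal PaddlePaddle sketch of the Deep Image Prior idea from the abstract above: a randomly initialized convolutional network is fitted, by plain MSE, to a single corrupted image, and the network structure itself supplies the prior. The tiny `nn.Sequential` stand-in, the fixed random code `z`, the learning rate, and the 200 steps are all placeholder assumptions — the paper uses a much larger encoder-decoder and task-specific losses.

```python
import paddle
import paddle.nn as nn

# A deliberately small stand-in for the paper's encoder-decoder; the point is
# only that the *untrained* network structure acts as the prior.
net = nn.Sequential(
    nn.Conv2D(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2D(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2D(64, 3, 3, padding=1),
)

noisy = paddle.rand([1, 3, 64, 64])     # the single corrupted observation
z = paddle.rand([1, 32, 64, 64])        # fixed random input code
opt = paddle.optimizer.Adam(learning_rate=1e-2, parameters=net.parameters())

for step in range(200):                 # early stopping is what keeps the net
    out = net(z)                        # from eventually fitting the noise too
    loss = ((out - noisy) ** 2).mean()
    loss.backward()
    opt.step()
    opt.clear_grad()
```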
2 Progressive Growing of GANs for Improved Quality, Stability, and Variation
Abstract
We describe a new training methodology for generative adversarial networks. The key idea is to grow both the generator and discriminator progressively: starting from a low resolution, we add new layers that model increasingly fine details as training progresses. This both speeds the training up and greatly stabilizes it, allowing us to produce images of unprecedented quality, e.g., CelebA images at 1024^2. We also propose a simple way to increase the variation in generated images, and achieve a record inception score of 8.80 in unsupervised CIFAR10. Additionally, we describe several implementation details that are important for discouraging unhealthy competition between the generator and discriminator. Finally, we suggest a new metric for evaluating GAN results, both in terms of image quality and variation. As an additional contribution, we construct a higher-quality version of the CelebA dataset.
CelebA  MS-SSIM=0.2838, SWD=2.64(64) 快速开始
3 Image Inpainting for Irregular Holes Using Partial Convolutions
Abstract
Existing deep learning based image inpainting methods use a standard convolutional network over the corrupted image, using convolutional filter responses conditioned on both valid pixels as well as the substitute values in the masked holes (typically the mean value). This often leads to artifacts such as color discrepancy and blurriness. Post-processing is usually used to reduce such artifacts, but are expensive and may fail. We propose the use of partial convolutions, where the convolution is masked and renormalized to be conditioned on only valid pixels. We further include a mechanism to automatically generate an updated mask for the next layer as part of the forward pass. Our model outperforms other methods for irregular masks. We show qualitative and quantitative comparisons with other methods to validate our approach.
CelebA  人眼评估生成的图像(参考论文Figure8) 快速开始
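A NumPy sketch of one partial-convolution step as described above (single channel, no bias, no padding; not the paper's implementation): the kernel sees only valid pixels, the response is renormalized by `sum(1)/sum(mask)` over the window, and the mask is updated so that any window containing at least one valid pixel becomes valid in the next layer.

```python
import numpy as np

def partial_conv2d(x, mask, weight, eps=1e-8):
    """One partial-convolution step on a single-channel image.
    x      : (H, W) image, mask: (H, W) with 1 = valid pixel, 0 = hole
    weight : (k, k) kernel. Each output is renormalized by the fraction of
    valid pixels in its window; the returned mask marks windows that saw
    at least one valid pixel."""
    k = weight.shape[0]
    H, W = x.shape
    out = np.zeros((H - k + 1, W - k + 1), dtype=np.float32)
    new_mask = np.zeros_like(out)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            m = mask[i:i + k, j:j + k]
            valid = m.sum()
            if valid > 0:
                patch = x[i:i + k, j:j + k] * m
                out[i, j] = (patch * weight).sum() * (k * k) / (valid + eps)
                new_mask[i, j] = 1.0
    return out, new_mask

img = np.random.rand(8, 8).astype(np.float32)
hole_mask = np.ones((8, 8), dtype=np.float32)
hole_mask[2:5, 2:5] = 0.0                       # a square hole
y, m = partial_conv2d(img, hole_mask, np.ones((3, 3), dtype=np.float32) / 9)
print(y.shape, m.sum())
```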
4 Generative Adversarial Text to Image Synthesis
Abstract
Synthesizing high resolution photorealistic images has been a long-standing challenge in machine learning. In this paper we introduce new methods for the improved training of generative adversarial networks (GANs) for image synthesis. We construct a variant of GANs employing label conditioning that results in 128x128 resolution image samples exhibiting global coherence. We expand on previous work for image quality assessment to provide two new analyses for assessing the discriminability and diversity of samples from class-conditional image synthesis models. These analyses demonstrate that high resolution samples provide class information not present in low resolution samples. Across 1000 ImageNet classes, 128x128 samples are more than twice as discriminable as artificially resized 32x32 samples. In addition, 84.7% of the classes have samples exhibiting diversity comparable to real ImageNet data.
Oxford-102  人眼评估生成的图像(参考论文中展示的生成图片) 快速开始
5 Conditional Image Synthesis With Auxiliary Classifier GANs
Abstract
We introduce SinGAN, an unconditional generative model that can be learned from a single natural image. Our model is trained to capture the internal distribution of patches within the image, and is then able to generate high quality, diverse samples that carry the same visual content as the image. SinGAN contains a pyramid of fully convolutional GANs, each responsible for learning the patch distribution at a different scale of the image. This allows generating new samples of arbitrary size and aspect ratio, that have significant variability, yet maintain both the global structure and the fine textures of the training image. In contrast to previous single image GAN schemes, our approach is not limited to texture images, and is not conditional (i.e. it generates samples from noise). User studies confirm that the generated samples are commonly confused to be real images. We illustrate the utility of SinGAN in a wide range of image manipulation tasks.
ImageNet  人眼评估生成的图像(参考论文中展示的生成图片) 快速开始
6 SinGAN: Learning a Generative Model from a Single Natural Image
Abstract
We propose spatially-adaptive normalization, a simple but effective layer for synthesizing photorealistic images given an input semantic layout. Previous methods directly feed the semantic layout as input to the deep network, which is then processed through stacks of convolution, normalization, and nonlinearity layers. We show that this is suboptimal as the normalization layers tend to ``wash away'' semantic information. To address the issue, we propose using the input layout for modulating the activations in normalization layers through a spatially-adaptive, learned transformation. Experiments on several challenging datasets demonstrate the advantage of the proposed method over existing approaches, regarding both visual fidelity and alignment with input layouts. Finally, our model allows user control over both semantic and style. Code is available at this https URL.
人眼评估生成的图像(可参考论文中展示的生成图片Figure6) 快速开始
7 Semantic Image Synthesis with Spatially-Adaptive Normalization
Abstract
We present a generic image-to-image translation framework, pixel2style2pixel (pSp). Our pSp framework is based on a novel encoder network that directly generates a series of style vectors which are fed into a pretrained StyleGAN generator, forming the extended W+ latent space. We first show that our encoder can directly embed real images into W+, with no additional optimization. Next, we propose utilizing our encoder to directly solve image-to-image translation tasks, defining them as encoding problems from some input domain into the latent domain. By deviating from the standard invert first, edit later methodology used with previous StyleGAN encoders, our approach can handle a variety of tasks even when the input image is not represented in the StyleGAN domain. We show that solving translation tasks through StyleGAN significantly simplifies the training process, as no adversary is required, has better support for solving tasks without pixel-to-pixel correspondence, and inherently supports multi-modal synthesis via the resampling of styles. Finally, we demonstrate the potential of our framework on a variety of facial image-to-image translation tasks, even when compared to state-of-the-art solutions designed specifically for a single task, and further show that it can be extended beyond the human facial domain.
cityscapes mIoU=62.3, accu=81.9, FID=71.8, 及人眼观察可视化效果 快速开始
8 Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation
Abstract
The task of age transformation illustrates the change of an individual's appearance over time. Accurately modeling this complex transformation over an input facial image is extremely challenging as it requires making convincing, possibly large changes to facial features and head shape, while still preserving the input identity. In this work, we present an image-to-image translation method that learns to directly encode real facial images into the latent space of a pre-trained unconditional GAN (e.g., StyleGAN) subject to a given aging shift. We employ a pre-trained age regression network to explicitly guide the encoder in generating the latent codes corresponding to the desired age. In this formulation, our method approaches the continuous aging process as a regression task between the input age and desired target age, providing fine-grained control over the generated image. Moreover, unlike approaches that operate solely in the latent space using a prior on the path controlling age, our method learns a more disentangled, non-linear path. Finally, we demonstrate that the end-to-end nature of our approach, coupled with the rich semantic latent space of StyleGAN, allows for further editing of the generated images. Qualitative and quantitative evaluations show the advantages of our method compared to state-of-the-art approaches.
CelebA LPIPS=0.17, similarity=0.56, MSE=0.03 (task of StyleGAN Inversion) 快速开始
9 Only a Matter of Style: Age Transformation Using a Style-Based Regression Model
Abstract
The task of age transformation illustrates the change of an individual's appearance over time. Accurately modeling this complex transformation over an input facial image is extremely challenging as it requires making convincing, possibly large changes to facial features and head shape, while still preserving the input identity. In this work, we present an image-to-image translation method that learns to directly encode real facial images into the latent space of a pre-trained unconditional GAN (e.g., StyleGAN) subject to a given aging shift. We employ a pre-trained age regression network to explicitly guide the encoder in generating the latent codes corresponding to the desired age. In this formulation, our method approaches the continuous aging process as a regression task between the input age and desired target age, providing fine-grained control over the generated image. Moreover, unlike approaches that operate solely in the latent space using a prior on the path controlling age, our method learns a more disentangled, non-linear path. Finally, we demonstrate that the end-to-end nature of our approach, coupled with the rich semantic latent space of StyleGAN, allows for further editing of the generated images. Qualitative and quantitative evaluations show the advantages of our method compared to state-of-the-art approaches.
CelebA  人眼评估生成的图像(参考论文中展示的生成图片Figure4, 6, 8) 快速开始
10 ClassSR: A General Framework to Accelerate Super-Resolution Networks by Data Characteristic
Abstract
We aim at accelerating super-resolution (SR) networks on large images (2K-8K). The large images are usually decomposed into small sub-images in practical usages. Based on this processing, we found that different image regions have different restoration difficulties and can be processed by networks with different capacities. Intuitively, smooth areas are easier to super-solve than complex textures. To utilize this property, we can adopt appropriate SR networks to process different sub-images after the decomposition. On this basis, we propose a new solution pipeline -- ClassSR that combines classification and SR in a unified framework. In particular, it first uses a Class-Module to classify the sub-images into different classes according to restoration difficulties, then applies an SR-Module to perform SR for different classes. The Class-Module is a conventional classification network, while the SR-Module is a network container that consists of the to-be-accelerated SR network and its simplified versions. We further introduce a new classification method with two losses -- Class-Loss and Average-Loss to produce the classification results. After joint training, a majority of sub-images will pass through smaller networks, thus the computational cost can be significantly reduced. Experiments show that our ClassSR can help most existing methods (e.g., FSRCNN, CARN, SRResNet, RCAN) save up to 50% FLOPs on DIV8K datasets. This general framework can also be applied in other low-level vision tasks.
DIV2K PSNR=26.39, FLOPs=21.22G(65%)(Test2K, ClassSR-RCAN) 快速开始
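The routing idea behind ClassSR can be sketched in plain Python (this is not the released pipeline): decompose the large input into sub-images, let a difficulty classifier pick a class per sub-image, and super-resolve each sub-image with the branch of matching capacity. `classifier`, `branches`, the 4× scale, and the 32-pixel patch size are placeholder assumptions.

```python
import numpy as np

def class_sr(image, classifier, branches, patch=32, scale=4):
    """ClassSR-style routing (a sketch): split the input into sub-images, let
    `classifier` assign each a difficulty class (0 = easy ... 2 = hard), and
    super-resolve it with the branch of matching capacity. `classifier` and
    `branches` stand in for the trained Class-Module / SR-Module."""
    H, W = image.shape
    out = np.zeros((H * scale, W * scale), dtype=np.float32)
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            sub = image[i:i + patch, j:j + patch]
            k = classifier(sub)                      # cheap branch for easy patches
            sr = branches[k](sub)
            out[i * scale:(i + patch) * scale, j * scale:(j + patch) * scale] = sr
    return out

nearest = lambda s: np.kron(s, np.ones((4, 4)))      # toy stand-in "SR network"
img = np.random.rand(64, 64).astype(np.float32)
res = class_sr(img, classifier=lambda s: int(s.std() > 0.25), branches=[nearest] * 3)
print(res.shape)                                     # (256, 256)
```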
11 Self-Attention Generative Adversarial Networks
Abstract
In this paper, we propose the Self-Attention Generative Adversarial Network (SAGAN) which allows attention-driven, long-range dependency modeling for image generation tasks. Traditional convolutional GANs generate high-resolution details as a function of only spatially local points in lower-resolution feature maps. In SAGAN, details can be generated using cues from all feature locations. Moreover, the discriminator can check that highly detailed features in distant portions of the image are consistent with each other. Furthermore, recent work has shown that generator conditioning affects GAN performance. Leveraging this insight, we apply spectral normalization to the GAN generator and find that this improves training dynamics. The proposed SAGAN achieves the state-of-the-art results, boosting the best published Inception score from 36.8 to 52.52 and reducing Frechet Inception distance from 27.62 to 18.65 on the challenging ImageNet dataset. Visualization of the attention layers shows that the generator leverages neighborhoods that correspond to object shapes rather than local regions of fixed shape.
ImageNet FID=18.28  Inception score=52.52 快速开始
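A NumPy sketch of the self-attention computation over spatial positions that SAGAN adds to the generator and discriminator. In the actual model, `f`, `g`, `h` are learned 1×1 convolutions and the attention output is scaled by a learnable gamma before the residual addition; random projection matrices and gamma = 1 stand in for them here purely to show the shape of the computation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, c_qk=8):
    """Self-attention over the N = H*W spatial positions of a (C, H, W) map.
    Random projections replace SAGAN's learned 1x1 convolutions f, g, h."""
    C, H, W = x.shape
    N = H * W
    flat = x.reshape(C, N)                        # (C, N)
    Wf = np.random.randn(c_qk, C)
    Wg = np.random.randn(c_qk, C)
    Wh = np.random.randn(C, C)
    f, g, h = Wf @ flat, Wg @ flat, Wh @ flat     # queries, keys, values
    attn = softmax(f.T @ g, axis=-1)              # (N, N): row i = where i attends
    out = h @ attn.T                              # (C, N): weighted sum of values
    return x + out.reshape(C, H, W)               # learnable gamma omitted (= 1)

print(self_attention(np.random.randn(16, 8, 8)).shape)   # (16, 8, 8)
```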
12 Generative Image Inpainting with Contextual Attention
Abstract
We apply basic statistical reasoning to signal reconstruction by machine learning -- learning to map corrupted observations to clean signals -- with a simple and powerful conclusion: it is possible to learn to restore images by only looking at corrupted examples, at performance at and sometimes exceeding training using clean data, without explicit image priors or likelihood models of the corruption. In practice, we show that a single model learns photographic noise removal, denoising synthetic Monte Carlo images, and reconstruction of undersampled MRI scans -- all corrupted by different processes -- based on noisy data only.
L1Loss=8.6%, L2Loss=2.1%, PSNR=18.91, TVLoss=25.3% 快速开始
13 Noise2Noise: Learning Image Restoration without Clean Data
Abstract
While humans easily recognize relations between data from different domains without any supervision, learning to automatically discover them is in general very challenging and needs many ground-truth pairs that illustrate the relations. To avoid costly pairing, we address the task of discovering cross-domain relations given unpaired data. We propose a method based on generative adversarial networks that learns to discover relations between different domains (DiscoGAN). Using the discovered relations, our proposed network successfully transfers style from one domain to another while preserving key attributes such as orientation and face identity. Source code for official implementation is publicly available this https URL
Denoised 与clear image PSNR持平 (Gaussian noise (σ = 25)) 快速开始
14 Learning to Discover Cross-Domain Relations with Generative Adversarial Networks
Abstract
We propose to restore old photos that suffer from severe degradation through a deep learning approach. Unlike conventional restoration tasks that can be solved through supervised learning, the degradation in real photos is complex and the domain gap between synthetic images and real old photos makes the network fail to generalize. Therefore, we propose a novel triplet domain translation network by leveraging real photos along with massive synthetic image pairs. Specifically, we train two variational autoencoders (VAEs) to respectively transform old photos and clean photos into two latent spaces. And the translation between these two latent spaces is learned with synthetic paired data. This translation generalizes well to real photos because the domain gap is closed in the compact latent space. Besides, to address multiple degradations mixed in one old photo, we design a global branch with a partial nonlocal block targeting to the structured defects, such as scratches and dust spots, and a local branch targeting to the unstructured defects, such as noises and blurriness. Two branches are fused in the latent space, leading to improved capability to restore old photos from multiple defects. Furthermore, we apply another face refinement network to recover fine details of faces in the old photos, thus ultimately generating photos with enhanced perceptual quality. With comprehensive experiments, the proposed pipeline demonstrates superior performance over state-of-the-art methods as well as existing commercial tools in terms of visual quality for old photos restoration.
可视化, 参考论文图7, 8, 9 快速开始
15 Old Photo Restoration via Deep Latent Space Translation
Abstract
We present a novel method for constructing Variational Autoencoder (VAE). Instead of using pixel-by-pixel loss, we enforce deep feature consistency between the input and the output of a VAE, which ensures the VAE's output to preserve the spatial correlation characteristics of the input, thus leading the output to have a more natural visual appearance and better perceptual quality. Based on recent deep learning works such as style transfer, we employ a pre-trained deep convolutional neural network (CNN) and use its hidden features to define a feature perceptual loss for VAE training. Evaluated on the CelebA face dataset, we show that our model produces better results than other methods in the literature. We also show that our method can produce latent vectors that can capture the semantic information of face expressions and can be used to achieve state-of-the-art performance in facial attribute prediction.
PSNR=23.33, SSIM= 0.69, LPIPS=0.25, FID=134.35(table2) 快速开始
16 Deep Feature Consistent Variational Autoencoder
Abstract
Scene text detection, an important step of scene text reading systems, has witnessed rapid development with convolutional neural networks. Nonetheless, two main challenges still exist and hamper its deployment to real-world applications. The first problem is the trade-off between speed and accuracy. The second one is to model the arbitrary-shaped text instance. Recently, some methods have been proposed to tackle arbitrary-shaped text detection, but they rarely take the speed of the entire pipeline into consideration, which may fall short in practical applications. In this paper, we propose an efficient and accurate arbitrary-shaped text detector, termed Pixel Aggregation Network (PAN), which is equipped with a low computational-cost segmentation head and a learnable post-processing. More specifically, the segmentation head is made up of Feature Pyramid Enhancement Module (FPEM) and Feature Fusion Module (FFM). FPEM is a cascadable U-shaped module, which can introduce multi-level information to guide the better segmentation. FFM can gather the features given by the FPEMs of different depths into a final feature for segmentation. The learnable post-processing is implemented by Pixel Aggregation (PA), which can precisely aggregate text pixels by predicted similarity vectors. Experiments on several standard benchmarks validate the superiority of the proposed PAN. It is worth noting that our method can achieve a competitive F-measure of 79.9% at 84.2 FPS on CTW1500.
Average accuracies=88.73 (Table1.  VAE-345) 快速开始
17 Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data
Abstract
Though many attempts have been made in blind super-resolution to restore low-resolution images with unknown and complex degradations, they are still far from addressing general real-world degraded images. In this work, we extend the powerful ESRGAN to a practical restoration application (namely, Real-ESRGAN), which is trained with pure synthetic data. Specifically, a high-order degradation modeling process is introduced to better simulate complex real-world degradations. We also consider the common ringing and overshoot artifacts in the synthesis process. In addition, we employ a U-Net discriminator with spectral normalization to increase discriminator capability and stabilize the training dynamics. Extensive comparisons have shown its superior visual performance than prior works on various real datasets. We also provide efficient implementations to synthesize training pairs on the fly.
DIV2K and Flickr2K and OST; 可视化效果与论文一致 快速开始

人脸识别

序号 论文名称(链接) 摘要 数据集 快速开始
1 RetinaFace: Single-stage Dense Face Localisation in the Wild
Abstract
We propose a novel Connectionist Text Proposal Network (CTPN) that accurately localizes text lines in natural image. The CTPN detects a text line in a sequence of fine-scale text proposals directly in convolutional feature maps. We develop a vertical anchor mechanism that jointly predicts location and text/non-text score of each fixed-width proposal, considerably improving localization accuracy. The sequential proposals are naturally connected by a recurrent neural network, which is seamlessly incorporated into the convolutional network, resulting in an end-to-end trainable model. This allows the CTPN to explore rich context information of image, making it powerful to detect extremely ambiguous text. The CTPN works reliably on multi-scale and multi-language text without further post-processing, departing from previous bottom-up methods requiring multi-step post-processing. It achieves 0.88 and 0.61 F-measure on the ICDAR 2013 and 2015 benchmarks, surpassing recent results [8, 35] by a large margin. The CTPN is computationally efficient with 0.14s/image, by using the very deep VGG16 model [27]. Online demo is available at: this http URL.
MAP: 52.318 快速开始
2 Real-time Convolutional Neural Networks for Emotion and Gender Classification
Abstract
In this paper we propose an implement a general convolutional neural network (CNN) building framework for designing real-time CNNs. We validate our models by creating a real-time vision system which accomplishes the tasks of face detection, gender classification and emotion classification simultaneously in one blended step using our proposed CNN architecture. After presenting the details of the training procedure setup we proceed to evaluate on standard benchmark sets. We report accuracies of 96% in the IMDB gender dataset and 66% in the FER-2013 emotion dataset. Along with this we also introduced the very recent real-time enabled guided back-propagation visualization technique. Guided back-propagation uncovers the dynamics of the weight changes and evaluates the learned features. We argue that the careful implementation of modern CNN architectures, the use of the current regularization methods and the visualization of previously hidden features are necessary in order to reduce the gap between slow performances and real-time architectures. Our system has been validated by its deployment on a Care-O-bot 3 robot used during RoboCup@Home competitions. All our code, demos and pre-trained architectures have been released under an open-source license in our public repository.
IMDB: 96% 快速开始
3 PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models
Abstract
The primary aim of single-image super-resolution is to construct high-resolution (HR) images from corresponding low-resolution (LR) inputs. In previous approaches, which have generally been supervised, the training objective typically measures a pixel-wise average distance between the super-resolved (SR) and HR images. Optimizing such metrics often leads to blurring, especially in high variance (detailed) regions. We propose an alternative formulation of the super-resolution problem based on creating realistic SR images that downscale correctly. We present an algorithm addressing this problem, PULSE (Photo Upsampling via Latent Space Exploration), which generates high-resolution, realistic images at resolutions previously unseen in the literature. It accomplishes this in an entirely self-supervised fashion and is not confined to a specific degradation operator used during training, unlike previous methods (which require supervised training on databases of LR-HR image pairs). Instead of starting with the LR image and slowly adding detail, PULSE traverses the high-resolution natural image manifold, searching for images that downscale to the original LR image. This is formalized through the "downscaling loss," which guides exploration through the latent space of a generative model. By leveraging properties of high-dimensional Gaussians, we restrict the search space to guarantee realistic outputs. PULSE thereby generates super-resolved images that both are realistic and downscale correctly. We show proof of concept of our approach in the domain of face super-resolution (i.e., face hallucination). We also present a discussion of the limitations and biases of the method as currently implemented with an accompanying model card with relevant metrics. Our method outperforms state-of-the-art methods in perceptual quality at higher resolutions and scale factors than previously possible.
CelebA HQ 3.6 快速开始
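The "downscaling loss" that drives PULSE is simple to state: a super-resolved candidate is acceptable only if it maps back to the observed LR image under the downscaling operator. A minimal NumPy sketch, using average pooling as a stand-in for the downscaler DS and an L2 penalty (function names are ours; the spherical latent-space constraint from the paper is omitted):

```python
import numpy as np

def downscale(img, factor):
    """Average-pool an HxWxC image by an integer factor (stand-in for DS)."""
    h, w, c = img.shape
    h2, w2 = h - h % factor, w - w % factor
    img = img[:h2, :w2]
    return img.reshape(h2 // factor, factor, w2 // factor, factor, c).mean(axis=(1, 3))

def downscaling_loss(sr, lr, factor):
    """|| DS(SR) - LR ||^2: how well a super-resolved candidate explains the LR input."""
    return float(np.mean((downscale(sr, factor) - lr) ** 2))

rng = np.random.default_rng(0)
sr = rng.random((128, 128, 3))                                  # candidate from the generator
lr = downscale(sr, 4) + 0.01 * rng.standard_normal((32, 32, 3))
print(downscaling_loss(sr, lr, factor=4))                       # small value -> consistent with LR
```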
4 Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks 
Abstract
Face detection and alignment in unconstrained environment are challenging due to various poses, illuminations and occlusions. Recent studies show that deep learning approaches can achieve impressive performance on these two tasks. In this paper, we propose a deep cascaded multi-task framework which exploits the inherent correlation between them to boost up their performance. In particular, our framework adopts a cascaded structure with three stages of carefully designed deep convolutional networks that predict face and landmark location in a coarse-to-fine manner. In addition, in the learning process, we propose a new online hard sample mining strategy that can improve the performance automatically without manual sample selection. Our method achieves superior accuracy over the state-of-the-art techniques on the challenging FDDB and WIDER FACE benchmark for face detection, and AFLW benchmark for face alignment, while keeps real time performance.
FDDB: 0.82 快速开始
5 Finding Tiny Faces
Abstract
Though tremendous strides have been made in object recognition, one of the remaining open challenges is detecting small objects. We explore three aspects of the problem in the context of finding small faces: the role of scale invariance, image resolution, and contextual reasoning. While most recognition approaches aim to be scale-invariant, the cues for recognizing a 3px tall face are fundamentally different than those for recognizing a 300px tall face. We take a different approach and train separate detectors for different scales. To maintain efficiency, detectors are trained in a multi-task fashion: they make use of features extracted from multiple layers of single (deep) feature hierarchy. While training detectors for large objects is straightforward, the crucial challenge remains training detectors for small objects. We show that context is crucial, and define templates that make use of massively-large receptive fields (where 99% of the template extends beyond the object of interest). Finally, we explore the role of scale in pre-trained deep networks, providing ways to extrapolate networks tuned for limited scales to rather extreme ranges. We demonstrate state-of-the-art results on massively-benchmarked face datasets (FDDB and WIDER FACE). In particular, when compared to prior art on WIDER FACE, our results reduce error by a factor of 2 (our models produce an AP of 82% while prior art ranges from 29-64%).
WIDER_FACE: resnet50 500*500 easy: 0.902 medium: 0.892 快速开始

行为识别

序号 论文名称(链接) 摘要 数据集 快速开始
1 Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks
Abstract
This paper addresses the visualisation of image classification models, learnt using deep Convolutional Networks (ConvNets). We consider two visualisation techniques, based on computing the gradient of the class score with respect to the input image. The first one generates an image, which maximises the class score [Erhan et al., 2009], thus visualising the notion of the class, captured by a ConvNet. The second technique computes a class saliency map, specific to a given image and class. We show that such maps can be employed for weakly supervised object segmentation using classification ConvNets. Finally, we establish the connection between the gradient-based ConvNet visualisation methods and deconvolutional networks [Zeiler et al., 2013].
可视化方法 快速开始
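The class saliency map described in the abstract needs only one backward pass: the gradient of the class score with respect to the input pixels, reduced to its per-pixel maximum magnitude over channels. A minimal Paddle sketch (the choice of resnet18 as the classifier is ours; any image classification network would do):

```python
import numpy as np
import paddle
from paddle.vision.models import resnet18

def class_saliency(model, image_chw, class_idx):
    """Gradient of the class score w.r.t. the input image, reduced over channels."""
    x = paddle.to_tensor(image_chw[None, ...], dtype='float32', stop_gradient=False)
    score = model(x)[0, class_idx]      # unnormalised class score
    score.backward()                    # single backward pass
    return x.grad[0].abs().max(axis=0)  # (H, W) saliency map

model = resnet18(pretrained=False)      # any image classifier works here
model.eval()
image = np.random.rand(3, 224, 224).astype('float32')
saliency = class_saliency(model, image, class_idx=0)
print(saliency.shape)                   # [224, 224]
```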
2 Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?
Abstract
The purpose of this study is to determine whether current video datasets have sufficient data for training very deep convolutional neural networks (CNNs) with spatio-temporal three-dimensional (3D) kernels. Recently, the performance levels of 3D CNNs in the field of action recognition have improved significantly. However, to date, conventional research has only explored relatively shallow 3D architectures. We examine the architectures of various 3D CNNs from relatively shallow to very deep ones on current video datasets. Based on the results of those experiments, the following conclusions could be obtained: (i) ResNet-18 training resulted in significant overfitting for UCF-101, HMDB-51, and ActivityNet but not for Kinetics. (ii) The Kinetics dataset has sufficient data for training of deep 3D CNNs, and enables training of up to 152 ResNets layers, interestingly similar to 2D ResNets on ImageNet. ResNeXt-101 achieved 78.4% average accuracy on the Kinetics test set. (iii) Kinetics pretrained simple 3D architectures outperforms complex 2D architectures, and the pretrained ResNeXt-101 achieved 94.5% and 70.2% on UCF-101 and HMDB-51, respectively. The use of 2D CNNs trained on ImageNet has produced significant progress in various tasks in image. We believe that using deep 3D CNNs together with Kinetics will retrace the successful history of 2D CNNs and ImageNet, and stimulate advances in computer vision for videos. The codes and pretrained models used in this study are publicly available. https://github.com/kenshohara/3D-ResNets-PyTorch
ucf-101; resnet18 112*112; Kinetics400: 66.1; ucf-101: 42.4(不加载预训练) 快速开始
3 Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution
Abstract
In natural images, information is conveyed at different frequencies where higher frequencies are usually encoded with fine details and lower frequencies are usually encoded with global structures. Similarly, the output feature maps of a convolution layer can also be seen as a mixture of information at different frequencies. In this work, we propose to factorize the mixed feature maps by their frequencies, and design a novel Octave Convolution (OctConv) operation to store and process feature maps that vary spatially "slower" at a lower spatial resolution reducing both memory and computation cost. Unlike existing multi-scale methods, OctConv is formulated as a single, generic, plug-and-play convolutional unit that can be used as a direct replacement of (vanilla) convolutions without any adjustments in the network architecture. It is also orthogonal and complementary to methods that suggest better topologies or reduce channel-wise redundancy like group or depth-wise convolutions. We experimentally show that by simply replacing convolutions with OctConv, we can consistently boost accuracy for both image and video recognition tasks, while reducing memory and computational cost. An OctConv-equipped ResNet-152 can achieve 82.9% top-1 classification accuracy on ImageNet with merely 22.2 GFLOPs.
imagenet: 1.125 MobileNet (v2) 3*224*224 imagenet top1 73 快速开始
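One OctConv layer can be sketched directly from the description above: the feature map is split into a high-frequency part at full resolution and a low-frequency part at half resolution, and four convolutions exchange information between the two branches. A minimal Paddle sketch with alpha=0.5 (layer and variable names are ours; a full implementation also handles the first/last layers where one branch is empty):

```python
import paddle
import paddle.nn as nn
import paddle.nn.functional as F

class OctConv(nn.Layer):
    """Minimal Octave Convolution: X_h stays at full resolution, X_l at half
    resolution, and four convolutions mix information between the branches."""
    def __init__(self, in_ch, out_ch, alpha=0.5, kernel_size=3, padding=1):
        super().__init__()
        lo_in, lo_out = int(alpha * in_ch), int(alpha * out_ch)
        hi_in, hi_out = in_ch - lo_in, out_ch - lo_out
        self.h2h = nn.Conv2D(hi_in, hi_out, kernel_size, padding=padding)
        self.h2l = nn.Conv2D(hi_in, lo_out, kernel_size, padding=padding)
        self.l2h = nn.Conv2D(lo_in, hi_out, kernel_size, padding=padding)
        self.l2l = nn.Conv2D(lo_in, lo_out, kernel_size, padding=padding)

    def forward(self, x_h, x_l):
        y_h = self.h2h(x_h) + F.interpolate(self.l2h(x_l), scale_factor=2)
        y_l = self.l2l(x_l) + self.h2l(F.avg_pool2d(x_h, kernel_size=2))
        return y_h, y_l

oct_conv = OctConv(in_ch=32, out_ch=64, alpha=0.5)
x_h = paddle.randn([1, 16, 56, 56])   # high-frequency half of the channels
x_l = paddle.randn([1, 16, 28, 28])   # low-frequency half, at half resolution
y_h, y_l = oct_conv(x_h, x_l)
print(y_h.shape, y_l.shape)           # [1, 32, 56, 56] [1, 32, 28, 28]
```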
4 Learning Spatiotemporal Features with 3D Convolutional Networks
Abstract
We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets; 2) A homogeneous architecture with small 3x3x3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets; and 3) Our learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks. In addition, the features are compact: achieving 52.8% accuracy on UCF101 dataset with only 10 dimensions and also very efficient to compute due to the fast inference of ConvNets. Finally, they are conceptually very simple and easy to train and use.
UCF101: 128x171 TOP1=83.27% 快速开始
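The "homogeneous architecture with small 3x3x3 convolution kernels" can be illustrated with a couple of layers. A minimal Paddle sketch of a C3D-style stem over 16-frame 112x112 clips (a common C3D input setting; the channel widths here are illustrative):

```python
import paddle
import paddle.nn as nn

# A minimal C3D-style stem: homogeneous 3x3x3 convolutions followed by pooling,
# operating on clips shaped [N, C, T, H, W].
c3d_stem = nn.Sequential(
    nn.Conv3D(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3D(kernel_size=(1, 2, 2), stride=(1, 2, 2)),   # keep early temporal length
    nn.Conv3D(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3D(kernel_size=2, stride=2),
)

clip = paddle.randn([2, 3, 16, 112, 112])   # two 16-frame clips
print(c3d_stem(clip).shape)                 # [2, 128, 8, 28, 28]
```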
5 MVFNet: Multi-View Fusion Network for Efficient Video Recognition
Abstract
Conventionally, spatiotemporal modeling network and its complexity are the two most concentrated research topics in video action recognition. Existing state-of-the-art methods have achieved excellent accuracy regardless of the complexity meanwhile efficient spatiotemporal modeling solutions are slightly inferior in performance. In this paper, we attempt to acquire both efficiency and effectiveness simultaneously. First of all, besides traditionally treating H x W x T video frames as space-time signal (viewing from the Height-Width spatial plane), we propose to also model video from the other two Height-Time and Width-Time planes, to capture the dynamics of video thoroughly. Secondly, our model is designed based on 2D CNN backbones and model complexity is well kept in mind by design. Specifically, we introduce a novel multi-view fusion (MVF) module to exploit video dynamics using separable convolution for efficiency. It is a plug-and-play module and can be inserted into off-the-shelf 2D CNNs to form a simple yet effective model called MVFNet. Moreover, MVFNet can be thought of as a generalized video modeling framework and it can specialize to be existing methods such as C2D, SlowOnly, and TSM under different settings. Extensive experiments are conducted on popular benchmarks (i.e., Something-Something V1 & V2, Kinetics, UCF-101, and HMDB-51) to show its superiority. The proposed MVFNet can achieve state-of-the-art performance with 2D CNN's complexity.
UCF101: 4x16, Top1=96.6% 快速开始

自然语言处理

序号 论文名称(链接) 摘要 数据集 快速开始
1 A Structured Self-attentive Sentence Embedding
Abstract
This paper proposes a new model for extracting an interpretable sentence embedding by introducing self-attention. Instead of using a vector, we use a 2-D matrix to represent the embedding, with each row of the matrix attending on a different part of the sentence. We also propose a self-attention mechanism and a special regularization term for the model. As a side effect, the embedding comes with an easy way of visualizing what specific parts of the sentence are encoded into the embedding. We evaluate our model on 3 different tasks: author profiling, sentiment classification, and textual entailment. Results show that our model yields a significant performance gain compared to other sentence embedding methods in all of the 3 tasks.
SNLI: accuracy=84.4%(见论文Table 2) 快速开始
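The model above replaces the usual attention vector with an r-row attention matrix A = softmax(W_s2 tanh(W_s1 H^T)) and a penalty ||AA^T - I||_F^2 that keeps the r attention hops diverse. A minimal NumPy sketch (matrix names follow the paper; the sizes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attentive_embedding(H, W_s1, W_s2):
    """H: (n, u) LSTM states. Returns the r x u embedding matrix M = A H and
    the Frobenius-norm penalty ||A A^T - I||_F^2 used as a regularizer."""
    A = softmax(W_s2 @ np.tanh(W_s1 @ H.T), axis=-1)     # (r, n) attention rows
    M = A @ H                                            # (r, u) sentence embedding
    penalty = np.sum((A @ A.T - np.eye(A.shape[0])) ** 2)
    return M, penalty

n, u, d_a, r = 20, 64, 32, 5          # tokens, hidden size, attention size, hops
rng = np.random.default_rng(0)
H = rng.standard_normal((n, u))
M, P = self_attentive_embedding(H, rng.standard_normal((d_a, u)),
                                rng.standard_normal((r, d_a)))
print(M.shape, round(float(P), 3))    # (5, 64) and a scalar penalty
```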
2 End-To-End Memory Networks
Abstract
This article offers an empirical exploration on the use of character-level convolutional networks (ConvNets) for text classification. We constructed several large-scale datasets to show that character-level convolutional networks could achieve state-of-the-art or competitive results. Comparisons are offered against traditional models such as bag of words, n-grams and their TFIDF variants, and deep learning models such as word-based ConvNets and recurrent neural networks.
Penn Treebank: ppl=111; Text8: ppl=147 快速开始
3 Character-level Convolutional Networks for Text Classification
Abstract
Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent persona. We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models.
Amazon Review Full: error rate=40.45%; Yahoo! Answers: error rate=28.80% 快速开始
4 Recipes for building an open-domain chatbot
Abstract
Pre-trained language models like BERT and its variants have recently achieved impressive performance in various natural language understanding tasks. However, BERT heavily relies on the global self-attention block and thus suffers large memory footprint and computation cost. Although all its attention heads query on the whole input sequence for generating the attention map from a global perspective, we observe some heads only need to learn local dependencies, which means the existence of computation redundancy. We therefore propose a novel span-based dynamic convolution to replace these self-attention heads to directly model local dependencies. The novel convolution heads, together with the rest self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning. We equip BERT with this mixed attention design and build a ConvBERT model. Experiments have shown that ConvBERT significantly outperforms BERT and its variants in various downstream tasks, with lower training cost and fewer model parameters. Remarkably, ConvBERTbase model achieves 86.4 GLUE score, 0.7 higher than ELECTRAbase, while using less than 1/4 training cost. Code and pre-trained models will be released.
BlenderbotForConditionalGeneration模型和BlenderbotSmallForConditionalGeneration模型前向推理输出与论文对齐(90M和2.7B distilled to 360M两个权重) 快速开始
5 ConvBERT: Improving BERT with Span-based Dynamic Convolution
Abstract
Natural Language Processing (NLP) has recently achieved great success by using huge pre-trained models with hundreds of millions of parameters. However, these models suffer from heavy model sizes and high latency such that they cannot be deployed to resource-limited mobile devices. In this paper, we propose MobileBERT for compressing and accelerating the popular BERT model. Like the original BERT, MobileBERT is task-agnostic, that is, it can be generically applied to various downstream NLP tasks via simple fine-tuning. Basically, MobileBERT is a thin version of BERT_LARGE, while equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks. To train MobileBERT, we first train a specially designed teacher model, an inverted-bottleneck incorporated BERT_LARGE model. Then, we conduct knowledge transfer from this teacher to MobileBERT. Empirical studies show that MobileBERT is 4.3x smaller and 5.5x faster than BERT_BASE while achieving competitive results on well-known benchmarks. On the natural language inference tasks of GLUE, MobileBERT achieves a GLUE score of 77.7 (0.6 lower than BERT_BASE), and 62 ms latency on a Pixel 4 phone. On the SQuAD v1.1/v2.0 question answering task, MobileBERT achieves a dev F1 score of 90.0/79.2 (1.5/2.1 higher than BERT_BASE).
QNLI测试集accuracy=93.2%(见论文Table 3), SQuAD v1.1验证集上Exact Match=84.7%, F1=90.9%, SQuAD v2.0验证集Exact Match=80.6%, F1=83.1%(见论文Table 4) 快速开始
6 MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
Abstract
BERT adopts masked language modeling (MLM) for pre-training and is one of the most successful pre-training models. Since BERT neglects dependency among predicted tokens, XLNet introduces permuted language modeling (PLM) for pre-training to address this problem. However, XLNet does not leverage the full position information of a sentence and thus suffers from position discrepancy between pre-training and fine-tuning. In this paper, we propose MPNet, a novel pre-training method that inherits the advantages of BERT and XLNet and avoids their limitations. MPNet leverages the dependency among predicted tokens through permuted language modeling (vs. MLM in BERT), and takes auxiliary position information as input to make the model see a full sentence and thus reducing the position discrepancy (vs. PLM in XLNet). We pre-train MPNet on a large-scale dataset (over 160GB text corpora) and fine-tune on a variety of down-streaming tasks (GLUE, SQuAD, etc). Experimental results show that MPNet outperforms MLM and PLM by a large margin, and achieves better results on these tasks compared with previous state-of-the-art pre-trained methods (e.g., BERT, XLNet, RoBERTa) under the same model setting. The code and the pre-trained models are available at: this https URL.
MNLI验证集-m/mm accuracy=83.3/82.6(见论文table 4), SQuAD 1.1验证集F1/EM=90.0/82.9, SQuAD 2.0验证集F1/EM=79.2/76.2(见论文table 5) 快速开始
7 MPNet: Masked and Permuted Pre-training for Language Understanding
Abstract
BERT adopts masked language modeling (MLM) for pre-training and is one of the most successful pre-training models. Since BERT neglects dependency among predicted tokens, XLNet introduces permuted language modeling (PLM) for pre-training to address this problem. However, XLNet does not leverage the full position information of a sentence and thus suffers from position discrepancy between pre-training and fine-tuning. In this paper, we propose MPNet, a novel pre-training method that inherits the advantages of BERT and XLNet and avoids their limitations. MPNet leverages the dependency among predicted tokens through permuted language modeling (vs. MLM in BERT), and takes auxiliary position information as input to make the model see a full sentence and thus reducing the position discrepancy (vs. PLM in XLNet). We pre-train MPNet on a large-scale dataset (over 160GB text corpora) and fine-tune on a variety of down-streaming tasks (GLUE, SQuAD, etc). Experimental results show that MPNet outperforms MLM and PLM by a large margin, and achieves better results on these tasks compared with previous state-of-the-art pre-trained methods (e.g., BERT, XLNet, RoBERTa) under the same model setting. The code and the pre-trained models are available at: https://github.com/microsoft/MPNet.
QQP验证集accuracy=91.9(见论文Table 3), SQuAD 1.1 F1/EM (dev set)=92.7/86.9, SQuAD 2.0 F1/EM (dev set)=85.7/82.7(见论文Table 4) 快速开始
8 Reformer: The Efficient Transformer
Abstract
Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L^2) to O(LlogL), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
ReformerModel, ReformerForSequenceClassification和ReformerForQuestionAnswering网络前向推理输出与论文对齐 快速开始
9 SqueezeBERT: What can computer vision teach NLP about efficient neural networks?
Abstract
Humans read and write hundreds of billions of messages every day. Further, due to the availability of large datasets, large computing systems, and better neural network models, natural language processing (NLP) technology has made significant strides in understanding, proofreading, and organizing these messages. Thus, there is a significant opportunity to deploy NLP in myriad applications to help web users, social networks, and businesses. In particular, we consider smartphones and other mobile devices as crucial platforms for deploying NLP models at scale. However, today's highly-accurate NLP neural network models such as BERT and RoBERTa are extremely computationally expensive, with BERT-base taking 1.7 seconds to classify a text snippet on a Pixel 3 smartphone. In this work, we observe that methods such as grouped convolutions have yielded significant speedups for computer vision networks, but many of these techniques have not been adopted by NLP neural network designers. We demonstrate how to replace several operations in self-attention layers with grouped convolutions, and we use this technique in a novel network architecture called SqueezeBERT, which runs 4.3x faster than BERT-base on the Pixel 3 while achieving competitive accuracy on the GLUE test set. The SqueezeBERT code will be released.
QQP验证集accuracy=89.4(见论文Table 2); SqueezeBERT模型加速比对比BERT-Base达到4.3x(见论文Table 2) 快速开始
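The grouped-convolution idea above boils down to replacing the position-wise fully-connected layers of a Transformer block, which are 1x1 convolutions over the sequence, with grouped 1x1 convolutions. A minimal Paddle sketch comparing parameter counts (the group count of 4 is illustrative, not the paper's exact configuration):

```python
import numpy as np
import paddle
import paddle.nn as nn

# A position-wise fully-connected layer is a 1x1 convolution over the sequence,
# so it can be swapped for a *grouped* 1x1 convolution to cut parameters and FLOPs.
hidden, groups = 768, 4
dense_like = nn.Conv1D(hidden, hidden, kernel_size=1)               # equivalent to a Linear layer
grouped = nn.Conv1D(hidden, hidden, kernel_size=1, groups=groups)   # grouped variant

def n_params(layer):
    return sum(int(np.prod(p.shape)) for p in layer.parameters())

x = paddle.randn([8, hidden, 128])               # [batch, channels, seq_len]
print(grouped(x).shape)                          # [8, 768, 128], same shape as dense_like(x)
print(n_params(dense_like), n_params(grouped))   # the grouped layer has ~1/4 the weights
```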
10 Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Abstract
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
GLUE dev set上达到平均指标85.97, CNNDM dev set上达到ROUGE-2=20.90(见论文Table 15) 快速开始
11 VisualBERT: A Simple and Performant Baseline for Vision and Language
Abstract
We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks. VisualBERT consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an associated input image with self-attention. We further propose two visually-grounded language model objectives for pre-training VisualBERT on image caption data. Experiments on four vision-and-language tasks including VQA, VCR, NLVR2, and Flickr30K show that VisualBERT outperforms or rivals with state-of-the-art models while being significantly simpler. Further analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.
VQA: Test-Dev=70.80, Test-Std=71.00; NLVR: accuracy=67.4(见论文Table1, Table3) 快速开始
12 ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information
Abstract
Recent pretraining models in Chinese neglect two important aspects specific to the Chinese language: glyph and pinyin, which carry significant syntax and semantic information for language understanding. In this work, we propose ChineseBERT, which incorporates both the {\it glyph} and {\it pinyin} information of Chinese characters into language model pretraining. The glyph embedding is obtained based on different fonts of a Chinese character, being able to capture character semantics from the visual features, and the pinyin embedding characterizes the pronunciation of Chinese characters, which handles the highly prevalent heteronym phenomenon in Chinese (the same character has different pronunciations with different meanings). Pretrained on large-scale unlabeled Chinese corpus, the proposed ChineseBERT model yields significant performance boost over baseline models with fewer training steps. The porpsoed model achieves new SOTA performances on a wide range of Chinese NLP tasks, including machine reading comprehension, natural language inference, text classification, sentence pair matching, and competitive performances in named entity recognition. Code and pretrained models are publicly available at https://github.com/ShannonAI/ChineseBert.
CMRC dev/test=70.70/78.05(见论文Table 2), XNLI dev/test=82.7/81.6(见论文Table 4), ChnSentiCorp dev/test=95.8/95.9 快速开始
13 CTRL: A Conditional Transformer Language Model for Controllable Generation
Abstract
Large-scale language models show promising text generation capabilities, but users cannot easily control particular aspects of the generated text. We release CTRL, a 1.63 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data via model-based source attribution. We have released multiple full-sized, pretrained versions of CTRL at https://github.com/salesforce/ctrl.
CTRLLMHeadModel模型和CTRLForSequenceClassification模型前向推理输出与论文对齐 快速开始
14 Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing
Abstract
With the success of language pretraining, it is highly desirable to develop more efficient architectures of good scalability that can exploit the abundant unlabeled data at a lower cost. To improve the efficiency, we examine the much-overlooked redundancy in maintaining a full-length token-level presentation, especially for tasks that only require a single-vector presentation of the sequence. With this intuition, we propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one and hence reduces the computation cost. More importantly, by re-investing the saved FLOPs from length reduction in constructing a deeper or wider model, we further improve the model capacity. In addition, to perform token-level predictions as required by common pretraining objectives, Funnel-Transformer is able to recover a deep representation for each token from the reduced hidden sequence via a decoder. Empirically, with comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks, including text classification, language understanding, and reading comprehension. The code and pretrained checkpoints are available at https://github.com/laiguokun/Funnel-Transformer.
QNLI验证集上accuracy=95.1%(见论文table 3), SQuAD v1.1验证集上F1/EM=94.7/89.0, SQuAD v2.0验证集F1/EM=90.4/87.6(见论文table 5) 快速开始
15 Splinter: Few-Shot Question Answering by Pretraining Span Selection
Abstract
In several question answering benchmarks, pretrained models have reached human parity through fine-tuning on an order of 100,000 annotated questions and answers. We explore the more realistic few-shot setting, where only a few hundred training examples are available, and observe that standard models perform poorly, highlighting the discrepancy between current pretraining objectives and question answering. We propose a new pretraining scheme tailored for question answering: recurring span selection. Given a passage with multiple sets of recurring spans, we mask in each set all recurring spans but one, and ask the model to select the correct span in the passage for each masked span. Masked spans are replaced with a special token, viewed as a question representation, that is later used during fine-tuning to select the answer span. The resulting model obtains surprisingly good results on multiple benchmarks (e.g., 72.7 F1 on SQuAD with only 128 training examples), while maintaining competitive performance in the high-resource setting.
SQuAD 1.1验证集, 16 examples F1=54.6, 128 examples F1=72.7, 1024 Examples F1=82.8(见论文Table1) 快速开始
16 UNILMv1: Unified Language Model Pre-training for Natural Language Understanding and Generation
Abstract
This paper presents a new Unified pre-trained Language Model (UniLM) that can be fine-tuned for both natural language understanding and generation tasks. The model is pre-trained using three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. The unified modeling is achieved by employing a shared Transformer network and utilizing specific self-attention masks to control what context the prediction conditions on. UniLM compares favorably with BERT on the GLUE benchmark, and the SQuAD 2.0 and CoQA question answering tasks. Moreover, UniLM achieves new state-of-the-art results on five natural language generation datasets, including improving the CNN/DailyMail abstractive summarization ROUGE-L to 40.51 (2.04 absolute improvement), the Gigaword abstractive summarization ROUGE-L to 35.75 (0.86 absolute improvement), the CoQA generative question answering F1 score to 82.5 (37.1 absolute improvement), the SQuAD question generation BLEU-4 to 22.12 (3.75 absolute improvement), and the DSTC7 document-grounded dialog response generation NIST-4 to 2.67 (human performance is 2.65). The code and pre-trained models are available at https://github.com/microsoft/unilm.
QNLI测试集达到92.7(见论文table11), CoQA验证集F1=82.5(见论文 table 7) 快速开始
17 BERT for Joint Intent Classification and Slot Filling
Abstract
Intent classification and slot filling are two essential tasks for natural language understanding. They often suffer from small-scale human-labeled training data, resulting in poor generalization capability, especially for rare words. Recently a new language representation model, BERT (Bidirectional Encoder Representations from Transformers), facilitates pre-training deep bidirectional representations on large-scale unlabeled corpora, and has created state-of-the-art models for a wide variety of natural language processing tasks after simple fine-tuning. However, there has not been much effort on exploring BERT for natural language understanding. In this work, we propose a joint intent classification and slot filling model based on BERT. Experimental results demonstrate that our proposed model achieves significant improvement on intent classification accuracy, slot filling F1, and sentence-level semantic frame accuracy on several public benchmark datasets, compared to the attention-based recurrent neural network models and slot-gated models.
在Snips测试集指标达到98.6, 97.0, 92.8; 在ATIS测试集上指标达到97.9, 96.0, 88.6 (见table2) 快速开始
18 Fastformer: Additive Attention Can Be All You Need
Abstract
Transformer is a powerful model for text understanding. However, it is inefficient due to its quadratic complexity to input sequence length. Although there are many methods on Transformer acceleration, they are still either inefficient on long sequences or not effective enough. In this paper, we propose Fastformer, which is an efficient Transformer model based on additive attention. In Fastformer, instead of modeling the pair-wise interactions between tokens, we first use additive attention mechanism to model global contexts, and then further transform each token representation based on its interaction with global context representations. In this way, Fastformer can achieve effective context modeling with linear complexity. Extensive experiments on five datasets show that Fastformer is much more efficient than many existing Transformer models and can meanwhile achieve comparable or even better long text modeling performance.
Amazon F1=43.23(见论文table4), Pubmed测试集R-L=34.81(见论文table6) 快速开始
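The additive-attention mechanism above can be sketched in a few lines: all queries are summarised into one global query, which interacts element-wise with every key, and the resulting global key modulates the values, so no n x n attention matrix is ever formed. A simplified NumPy sketch of one head (the final per-token linear transform from the paper is omitted; parameter names are ours):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fastformer_attention(Q, K, V, w_q, w_k):
    """Simplified Fastformer head: additive attention builds a global query,
    which interacts with every key element-wise; the resulting global key
    then modulates the values. Everything runs in linear time in n."""
    d = Q.shape[1]
    alpha = softmax(Q @ w_q / np.sqrt(d))        # (n,) additive attention weights
    q_glob = alpha @ Q                           # (d,) global query
    P = q_glob * K                               # (n, d) query-key interaction
    beta = softmax(P @ w_k / np.sqrt(d))
    k_glob = beta @ P                            # (d,) global key
    return k_glob * V + Q                        # value modulation + residual

n, d = 128, 64
rng = np.random.default_rng(0)
out = fastformer_attention(rng.standard_normal((n, d)), rng.standard_normal((n, d)),
                           rng.standard_normal((n, d)), rng.standard_normal(d),
                           rng.standard_normal(d))
print(out.shape)   # (128, 64), computed without any n x n attention matrix
```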
19 Convolutional Sequence to Sequence Learning
Abstract
The prevalent approach to sequence to sequence learning maps an input sequence to a variable length output sequence via recurrent neural networks. We introduce an architecture based entirely on convolutional neural networks. Compared to recurrent models, computations over all elements can be fully parallelized during training to better exploit the GPU hardware and optimization is easier since the number of non-linearities is fixed and independent of the input length. Our use of gated linear units eases gradient propagation and we equip each decoder layer with a separate attention module. We outperform the accuracy of the deep LSTM setup of Wu et al. (2016) on both WMT’14 English-German and WMT’14 English-French translation at an order of magnitude faster speed, both on GPU and CPU.
在WMT’14 English-German测试集上BLEU=25.16 或者在WMT’16 English-Romanian测试集上BLEU=30.02 快速开始
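The gated linear units mentioned above are a one-liner: each convolution emits twice as many channels as needed, and one half gates the other through a sigmoid. A minimal NumPy sketch:

```python
import numpy as np

def glu(x):
    """Gated linear unit used in ConvS2S: the convolution produces 2*d
    channels, half of which gate the other half, Y = A * sigmoid(B)."""
    a, b = np.split(x, 2, axis=-1)
    return a * (1.0 / (1.0 + np.exp(-b)))

conv_out = np.random.default_rng(0).standard_normal((10, 2 * 512))  # 10 positions, 2*d channels
print(glu(conv_out).shape)   # (10, 512)
```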
20 FNet: Mixing Tokens with Fourier Transforms
Abstract
We show that Transformer encoder architectures can be sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that "mix" input tokens. These linear mixers, along with standard nonlinearities in feed-forward layers, prove competent at modeling semantic relationships in several text classification tasks. Most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE benchmark, but trains 80% faster on GPUs and 70% faster on TPUs at standard 512 input lengths. At longer input lengths, our FNet model is significantly faster: when compared to the "efficient" Transformers on the Long Range Arena benchmark, FNet matches the accuracy of the most accurate models, while outpacing the fastest models across all sequence lengths on GPUs (and across relatively shorter lengths on TPUs). Finally, FNet has a light memory footprint and is particularly efficient at smaller model sizes; for a fixed speed and accuracy budget, small FNet models outperform Transformer counterparts.
QQP验证集Acc=85%, SST-2验证集Acc=95%(见论文Table1) 快速开始
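The parameter-free mixing sublayer described above is just a 2-D discrete Fourier transform, over the hidden dimension and then the sequence dimension, keeping the real part. A minimal NumPy sketch:

```python
import numpy as np

def fnet_mixing(x):
    """FNet token-mixing sublayer: a 2-D DFT over the hidden and sequence
    dimensions, keeping only the real part. It has no parameters and stands
    in for the self-attention sublayer."""
    return np.real(np.fft.fft(np.fft.fft(x, axis=-1), axis=-2))

seq = np.random.default_rng(0).standard_normal((128, 256))   # (seq_len, hidden)
mixed = fnet_mixing(seq)
print(mixed.shape)    # (128, 256); this feeds into the usual feed-forward sublayer
```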
21 Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Abstract
Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complimentary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this approach by converging transformer based models up to 8.3 billion parameters using 512 GPUs. We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when compared to a strong single GPU baseline that sustains 39 TeraFLOPs, which is 30% of peak FLOPs. To demonstrate that large language models can further advance the state of the art (SOTA), we train an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT. We show that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows. Using the GPT-2 model we achieve SOTA results on the WikiText103 (10.8 compared to SOTA perplexity of 15.8) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets. Our BERT model achieves SOTA results on the RACE dataset (90.9% compared to SOTA accuracy of 89.4%).
1. megatron-bert-cased-345m模型在MNLI验证集上Acc=89.7/90.0; 2. megatron-bert-cased-345m模型在SQuAD-1.1验证集上F1/EM=94.2/88.0, 在SQuAD-2.0验证集上F1/EM=88.1/84.8 快速开始
22 LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention
Abstract
Entity representations are useful in natural language tasks involving entities. In this paper, we propose new pretrained contextualized representations of words and entities based on the bidirectional transformer. The proposed model treats words and entities in a given text as independent tokens, and outputs contextualized representations of them. Our model is trained using a new pretraining task based on the masked language model of BERT. The task involves predicting randomly masked words and entities in a large entity-annotated corpus retrieved from Wikipedia. We also propose an entity-aware self-attention mechanism that is an extension of the self-attention mechanism of the transformer, and considers the types of tokens (words or entities) when computing attention scores. The proposed model achieves impressive empirical performance on a wide range of entity-related tasks. In particular, it obtains state-of-the-art results on five well-known datasets: Open Entity (entity typing), TACRED (relation classification), CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), and SQuAD 1.1 (extractive question answering). Our source code and pretrained representations are available at https://github.com/studio-ousia/luke.
Open Entity: F1=78.2, SQuAD1.1: F1=95.4, EM=90.2(见论文Table1 & Table5) 快速开始
23 RemBERT: Rethinking embedding coupling in pre-trained language models 
Abstract
We re-evaluate the standard practice of sharing weights between input and output embeddings in state-of-the-art pre-trained language models. We show that decoupled embeddings provide increased modeling flexibility, allowing us to significantly improve the efficiency of parameter allocation in the input embedding of multilingual models. By reallocating the input embedding parameters in the Transformer layers, we achieve dramatically better performance on standard natural language understanding tasks with the same number of parameters during fine-tuning. We also show that allocating additional capacity to the output embedding provides benefits to the model that persist through the fine-tuning stage even though the output embedding is discarded after pre-training. Our analysis shows that larger output embeddings prevent the model's last layers from overspecializing to the pre-training task and encourage Transformer representations to be more general and more transferable to other tasks and languages. Harnessing these findings, we are able to train models that achieve strong performance on the XTREME benchmark without increasing the number of parameters at the fine-tuning stage.
XTREME: Sentence-pair Classification: Acc=84.2(见论文Table7) 快速开始
24 Nyströmformer: A Nyström-based Algorithm for Approximating Self-Attention
Abstract
Transformers have emerged as a powerful tool for a broad range of natural language processing tasks. A key component that drives the impressive performance of Transformers is the self-attention mechanism that encodes the influence or dependence of other tokens on each specific token. While beneficial, the quadratic complexity of self-attention on the input sequence length has limited its application to longer sequences -- a topic being actively studied in the community. To address this limitation, we propose Nyströmformer -- a model that exhibits favorable scalability as a function of sequence length. Our idea is based on adapting the Nyström method to approximate standard self-attention with O(n) complexity. The scalability of Nyströmformer enables application to longer sequences with thousands of tokens. We perform evaluations on multiple downstream tasks on the GLUE benchmark and IMDB reviews with standard sequence length, and find that our Nyströmformer performs comparably, or in a few cases, even slightly better, than standard self-attention. On longer sequence tasks in the Long Range Arena (LRA) benchmark, Nyströmformer performs favorably relative to other efficient self-attention methods. Our code is available at https://github.com/mlpen/Nystromformer.
IMDB: F1=93.2, LRA benchmark下text任务 acc=65.52; 快速开始
25 ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training
Abstract
This paper presents a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of optimizing one-step-ahead prediction in the traditional sequence-to-sequence model, the ProphetNet is optimized by n-step ahead prediction that predicts the next n tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large-scale dataset (160GB), respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.
prophetnet-large-uncased模型在CNN/DailyMail测试集上R-1=44.20, R-2=21.17, R-L=41.30 (见论文Table 4); prophetnet-large-uncased模型在Gigaword测试集上R-1=39.51, R-2=20.42, R-L=36.69 (见论文Table 4); 快速开始
26 Universal Language Model Fine-tuning for Text Classification  
Abstract
Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch. We propose Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduce techniques that are key for fine-tuning a language model. Our method significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18-24% on the majority of datasets. Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100× more data. We open-source our pretrained models and code.
AG’s News: Err=5.01% (见论文Table 3) 快速开始
27 ByT5: Towards a token-free future with pre-trained byte-to-byte models
Abstract
Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. By comparison, token-free models that operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.
在GEM-Xsum验证集上, small model BLEU score=9.1; 在TweetQA验证集上, small model BLEU-1/ROUGE-L=65.7/69.7 (见论文table3) 快速开始
28 Few-Shot Question Answering by Pretraining Span Selection
Abstract
In several question answering benchmarks, pretrained models have reached human parity through fine-tuning on an order of 100,000 annotated questions and answers. We explore the more realistic few-shot setting, where only a few hundred training examples are available, and observe that standard models perform poorly, highlighting the discrepancy between current pretraining objectives and question answering. We propose a new pretraining scheme tailored for question answering: recurring span selection. Given a passage with multiple sets of recurring spans, we mask in each set all recurring spans but one, and ask the model to select the correct span in the passage for each masked span. Masked spans are replaced with a special token, viewed as a question representation, that is later used during fine-tuning to select the answer span. The resulting model obtains surprisingly good results on multiple benchmarks (e.g., 72.7 F1 on SQuAD with only 128 training examples), while maintaining competitive performance in the high-resource setting.
SQuAD 1.1验证集, 16 examples F1=54.6, 128 examples F1=72.7, 1024 Examples F1=82.8(见论文Table1) 快速开始

多模态

序号 论文名称(链接) 摘要 数据集 快速开始
1 Comprehensive Semi-Supervised Multi-Modal Learning
Abstract
Multi-modal learning refers to the process of learning a precise model to represent the joint representations of different modalities. Despite its promise for multi-modal learning, the co-regularization method is based on the consistency principle with a sufficient assumption, which usually does not hold for real-world multi-modal data. Indeed, due to the modal insufficiency in real-world applications, there are divergences among heterogeneous modalities. This imposes a critical challenge for multi-modal learning. To this end, in this paper, we propose a novel Comprehensive Multi-Modal Learning (CMML) framework, which can strike a balance between the consistency and divergency modalities by considering the insufficiency in one unified framework. Specifically, we utilize an instance level attention mechanism to weight the sufficiency for each instance on different modalities. Moreover, novel diversity regularization and robust consistency metrics are designed for discovering insufficient modalities. Our empirical studies show the superior performances of CMML on real-world data in terms of various criteria.
Coverage: 2.669 Average Precision: 0.914 Ranking Loss: 0.058 Example AUC: 0.942 Micro AUC: 0.94 Macro AUC: 0.932 快速开始
2 Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks
Abstract
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, pro-cessing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models -- achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.
RefCOCO+-val=72.34 快速开始
3 Attention on Attention for Image Captioning
Abstract
Attention mechanisms are widely used in current encoder/decoder frameworks of image captioning, where a weighted average on encoded vectors is generated at each time step to guide the caption decoding process. However, the decoder has little idea of whether or how well the attended vector and the given attention query are related, which could make the decoder give misled results. In this paper, we propose an “Attention on Attention” (AoA) module, which extends the conventional attention mechanisms to determine the relevance between attention results and queries. AoA first generates an “information vector” and an “attention gate” using the attention result and the current context, then adds another attention by applying element-wise multiplication to them and finally obtains the “attended information”, the expected useful knowledge. We apply AoA to both the encoder and the decoder of our image captioning model, which we name as AoA Network (AoANet). Experiments show that AoANet outperforms all previously published methods and achieves a new state-of-the-art performance of 129.8 CIDEr-D score on MS COCO “Karpathy” offline test split and 129.6 CIDEr-D (C40) score on the official online testing server. Code is available at https://github.com/husthuaan/AoANet.
COCO; {'Bleu_1': 0.8054903453672397, 'Bleu_2': 0.6523038976984842, 'Bleu_3': 0.5096621263772566, 'Bleu_4': 0.39140307771618477, 'METEOR': 0.29011216375635934, 'ROUGE_L': 0.5890369750273199, 'CIDEr': 1.2892294296245852, 'SPICE': 0.22680092759866174} 快速开始
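The AoA module above takes the attention result and the query, produces an "information vector" and a sigmoid "attention gate", and multiplies them element-wise. A minimal NumPy sketch that concatenates the two inputs (an equivalent parameterization of the paper's two pairs of linear maps; names are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_on_attention(att_result, query, W_i, b_i, W_g, b_g):
    """AoA module: build an 'information vector' and an 'attention gate' from
    the attention result and the query, and return their element-wise product
    (the 'attended information')."""
    x = np.concatenate([att_result, query], axis=-1)
    info = x @ W_i + b_i                 # information vector
    gate = sigmoid(x @ W_g + b_g)        # attention gate
    return gate * info

d = 64
rng = np.random.default_rng(0)
out = attention_on_attention(rng.standard_normal(d), rng.standard_normal(d),
                             rng.standard_normal((2 * d, d)), np.zeros(d),
                             rng.standard_normal((2 * d, d)), np.zeros(d))
print(out.shape)   # (64,)
```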
4 VinVL: Revisiting Visual Representations in Vision-Language Models
Abstract
This paper presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model [2], the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model improvement untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model OSCAR [21], and utilize an improved approach OSCAR+ to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. Code, models and pre-extracted features are released at https://github.com/pzzhang/VinVL.
COCO 2014; Oscar-Large: COCO-Text Retrieval: Recall@1=89.8 Recall@5=98.8 Recall@10=99.7 COCO-Image Retrieval: Recall@1=78.2 Recall@5=95.8 Recall@10=98.3 快速开始
5  From Recognition to Cognition: Visual Commonsense Reasoning
Abstract
Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people’s actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today’s vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer. Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and highquality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (∼45%). To move towards cognition-level understanding, we present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (∼65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work.
VCR val, Q->A 63.8%, QA->R: 67.2%, Q->AR: 43.1% 快速开始

推荐系统

序号 论文名称(链接) 摘要 数据集 快速开始
1 Field-Embedded Factorization Machines for Click-through rate prediction
Abstract
Click-through rate (CTR) prediction models are common in many online applications such as digital advertising and recommender systems. Field-Aware Factorization Machine (FFM) and Field-weighted Factorization Machine (FwFM) are state-of-the-art among the shallow models for CTR prediction. Recently, many deep learning-based models have also been proposed. Among deeper models, DeepFM, xDeepFM, AutoInt+, and FiBiNet are state-of-the-art models. The deeper models combine a core architectural component, which learns explicit feature interactions, with a deep neural network (DNN) component. We propose a novel shallow Field-Embedded Factorization Machine (FEFM) and its deep counterpart Deep Field-Embedded Factorization Machine (DeepFEFM). FEFM learns symmetric matrix embeddings for each field pair along with the usual single vector embeddings for each feature. FEFM has significantly lower model complexity than FFM and roughly the same complexity as FwFM. FEFM also has insightful mathematical properties about important fields and field interactions. DeepFEFM combines the FEFM interaction vectors learned by the FEFM component with a DNN and is thus able to learn higher order interactions. We conducted comprehensive experiments over a wide range of hyperparameters on two large publicly available real-world datasets. When comparing test AUC and log loss, the results show that FEFM and DeepFEFM outperform the existing state-of-the-art shallow and deep models for CTR prediction tasks. We have made the code of FEFM and DeepFEFM available in the DeepCTR library (https://github.com/shenweichen/DeepCTR).
criteo auc >0.8 快速开始
2 Deep Learning Recommendation Model for Personalization and Recommendation Systems
Abstract
With the advent of deep learning, neural network-based recommendation models have emerged as an important tool for tackling personalization and recommendation tasks. These networks differ significantly from other deep learning networks due to their need to handle categorical features and are not well studied or understood. In this paper, we develop a state-of-the-art deep learning recommendation model (DLRM) and provide its implementation in both PyTorch and Caffe2 frameworks. In addition, we design a specialized parallelization scheme utilizing model parallelism on the embedding tables to mitigate memory constraints while exploiting data parallelism to scale-out compute from the fully-connected layers. We compare DLRM against existing recommendation models and characterize its performance on the Big Basin AI platform, demonstrating its usefulness as a benchmark for future algorithmic experimentation and system co-design.
criteo auc > 0.79 快速开始
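DLRM's interaction layer takes the dot product of every pair of feature embeddings (the bottom-MLP output for the dense features plus one embedding per categorical feature) and concatenates the result with the dense vector before the top MLP. A minimal NumPy sketch, assuming the 26 categorical fields of the Criteo dataset:

```python
import numpy as np

def dlrm_interaction(dense_vec, sparse_embs):
    """DLRM feature-interaction layer: pairwise dot products between all
    embedding vectors, concatenated with the dense bottom-MLP output."""
    feats = np.stack([dense_vec] + list(sparse_embs))        # (F, d)
    dots = feats @ feats.T                                   # (F, F) pairwise dot products
    iu = np.triu_indices(len(feats), k=1)                    # strictly upper triangle
    return np.concatenate([dense_vec, dots[iu]])             # input to the top MLP

d, n_sparse = 16, 26                                         # 26 categorical fields as in Criteo
rng = np.random.default_rng(0)
z = dlrm_interaction(rng.standard_normal(d),
                     [rng.standard_normal(d) for _ in range(n_sparse)])
print(z.shape)   # (367,) = 16 dense values + 27*26/2 pairwise interactions
```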
3 A Dual Input-aware Factorization Machine for CTR Prediction
Abstract
Factorization Machines (FMs) refer to a class of general predictors working with real valued feature vectors, which are well-known for their ability to estimate model parameters under significant sparsity and have found successful applications in many areas such as the click-through rate (CTR) prediction. However, standard FMs only produce a single fixed representation for each feature across different input instances, which may limit the CTR model’s expressive and predictive power. Inspired by the success of Input-aware Factorization Machines (IFMs), which aim to learn more flexible and informative representations of a given feature according to different input instances, we propose a novel model named Dual Input-aware Factorization Machines (DIFMs) that can adaptively reweight the original feature representations at the bit-wise and vector-wise levels simultaneously. Furthermore, DIFMs strategically integrate various components including Multi-Head Self-Attention, Residual Networks and DNNs into a unified end-to-end model. Comprehensive experiments on two real-world CTR prediction datasets show that the DIFM model can outperform several state-of-the-art models consistently.
criteo AUC > 0.799 快速开始
4 FAT-DeepFFM: Field Attentive Deep Field-aware Factorization Machine
Abstract
Click through rate (CTR) estimation is a fundamental task in personalized advertising and recommender systems. Recent years have witnessed the success of both the deep learning based model and attention mechanism in various tasks in computer vision (CV) and natural language processing (NLP). How to combine the attention mechanism with deep CTR model is a promising direction because it may ensemble the advantages of both sides. Although some CTR model such as Attentional Factorization Machine (AFM) has been proposed to model the weight of second order interaction features, we posit the evaluation of feature importance before explicit feature interaction procedure is also important for CTR prediction tasks because the model can learn to selectively highlight the informative features and suppress less useful ones if the task has many input features. In this paper, we propose a new neural CTR model named Field Attentive Deep Field-aware Factorization Machine (FAT-DeepFFM) by combining the Deep Field-aware Factorization Machine (DeepFFM) with Compose-Excitation network (CENet) field attention mechanism which is proposed by us as an enhanced version of Squeeze-Excitation network (SENet) to highlight the feature importance. We conduct extensive experiments on two real-world datasets and the experiment results show that FAT-DeepFFM achieves the best performance and obtains different improvements over the state-of-the-art methods. We also compare two kinds of attention mechanisms (attention before explicit feature interaction vs. attention after explicit feature interaction) and demonstrate that the former one outperforms the latter one significantly.
Criteo AUC >= 0.8099 快速开始
5 BERT4Rec:Sequential Recommendation with Bidirectional Encoder Representations from Transformer
Abstract
Top-$N$ sequential recommendation models each user as a sequence of items interacted in the past and aims to predict top-$N$ ranked items that a user will likely interact in a `near future'. The order of interaction implies that sequential patterns play an important role where more recent items in a sequence have a larger impact on the next item. In this paper, we propose a Convolutional Sequence Embedding Recommendation Model (\emph{Caser}) as a solution to address this requirement. The idea is to embed a sequence of recent items into an `image' in the time and latent spaces and learn sequential patterns as local features of the image using convolutional filters. This approach provides a unified and flexible network structure for capturing both general preferences and sequential patterns. The experiments on public datasets demonstrated that Caser consistently outperforms state-of-the-art sequential recommendation methods on a variety of common evaluation metrics.
1、Beauty HR@10=0.3025 2、Steam HR@10=0.4013 3、ML-1m HR@10=0.6970 4、ML-20m HR@10=0.7473 快速开始
6 Personalized Top-N Sequential Recommendation via Convolutional Sequence Embedding
Abstract
Top-N sequential recommendation models each user as a sequence of items interacted in the past and aims to predict top-N ranked items that a user will likely interact in a `near future'. The order of interaction implies that sequential patterns play an important role where more recent items in a sequence have a larger impact on the next item. In this paper, we propose a Convolutional Sequence Embedding Recommendation Model (\emph{Caser}) as a solution to address this requirement. The idea is to embed a sequence of recent items into an `image' in the time and latent spaces and learn sequential patterns as local features of the image using convolutional filters. This approach provides a unified and flexible network structure for capturing both general preferences and sequential patterns. The experiments on public datasets demonstrated that Caser consistently outperforms state-of-the-art sequential recommendation methods on a variety of common evaluation metrics.
1、MovieLens MAP=0.1507 2、Gowalla MAP=0.0928 3、Foursquare MAP=0.0909 4、Tmall MAP=0.0310 快速开始
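The Caser abstract treats the last L item embeddings as an L×d "image" and slides convolution filters over it to pick up sequential patterns. Below is a small NumPy sketch of one horizontal filter of height h (a max-pooled sliding window), with all sizes chosen arbitrarily; the real model also uses vertical filters and fully-connected layers.

```python
import numpy as np

rng = np.random.default_rng(2)
L, d, h = 6, 8, 3                      # sequence length, embedding dim, filter height

E = rng.normal(size=(L, d))            # embeddings of the user's last L items (the "image")
filt = rng.normal(size=(h, d))         # one horizontal convolution filter

# Slide the filter over consecutive windows of h items and max-pool over positions,
# so each horizontal filter yields a single sequential-pattern feature.
window_scores = np.array([np.sum(E[i:i + h] * filt) for i in range(L - h + 1)])
feature = window_scores.max()
print(window_scores.shape, feature)    # (4,) window scores and one pooled feature
```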
7 SASRec:Self-Attentive Sequential Recommendation
Abstract
Sequential dynamics are a key feature of many modern recommender systems, which seek to capture the `context' of users' activities on the basis of actions they have performed recently. To capture such patterns, two approaches have proliferated: Markov Chains (MCs) and Recurrent Neural Networks (RNNs). Markov Chains assume that a user's next action can be predicted on the basis of just their last (or last few) actions, while RNNs in principle allow for longer-term semantics to be uncovered. Generally speaking, MC-based methods perform best in extremely sparse datasets, where model parsimony is critical, while RNNs perform better in denser datasets where higher model complexity is affordable. The goal of our work is to balance these two goals, by proposing a self-attention based sequential model (SASRec) that allows us to capture long-term semantics (like an RNN), but, using an attention mechanism, makes its predictions based on relatively few actions (like an MC). At each time step, SASRec seeks to identify which items are `relevant' from a user's action history, and use them to predict the next item. Extensive empirical studies show that our method outperforms various state-of-the-art sequential models (including MC/CNN/RNN-based approaches) on both sparse and dense datasets. Moreover, the model is an order of magnitude more efficient than comparable CNN/RNN-based models. Visualizations on attention weights also show how our model adaptively handles datasets with various density, and uncovers meaningful patterns in activity sequences.
Hit Rate@10 (Recall@10; Precision@10) and NDCG@10 快速开始
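The core mechanism in the SASRec abstract is causal (left-to-right) self-attention over the item sequence, so the prediction at each step can only attend to earlier items. A minimal NumPy sketch of masked scaled dot-product attention follows; for simplicity the queries, keys, and values are the raw embeddings (the real model adds learned projections and positional embeddings), and the toy shapes are placeholders.

```python
import numpy as np

def causal_self_attention(x):
    """Single-head scaled dot-product self-attention with a causal mask.

    x: (seq_len, d) item embeddings.
    """
    seq_len, d = x.shape
    scores = x @ x.T / np.sqrt(d)                       # (seq_len, seq_len)
    mask = np.triu(np.ones((seq_len, seq_len)), k=1)    # 1 above the diagonal = future items
    scores = np.where(mask == 1, -1e9, scores)          # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                                  # (seq_len, d) context vectors

x = np.random.default_rng(3).normal(size=(5, 8))
print(causal_self_attention(x).shape)                   # (5, 8)
```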
8 Neural Collaborative Reasoning
Abstract
Existing Collaborative Filtering (CF) methods are mostly designed based on the idea of matching, i.e., by learning user and item embeddings from data using shallow or deep models, they try to capture the associative relevance patterns in data, so that a user embedding can be matched with relevant item embeddings using designed or learned similarity functions. However, as a cognition rather than a perception intelligent task, recommendation requires not only the ability of pattern recognition and matching from data, but also the ability of cognitive reasoning in data. In this paper, we propose to advance Collaborative Filtering (CF) to Collaborative Reasoning (CR), which means that each user knows part of the reasoning space, and they collaborate for reasoning in the space to estimate preferences for each other. Technically, we propose a Neural Collaborative Reasoning (NCR) framework to bridge learning and reasoning. Specifically, we integrate the power of representation learning and logical reasoning, where representations capture similarity patterns in data from perceptual perspectives, and logic facilitates cognitive reasoning for informed decision making. An important challenge, however, is to bridge differentiable neural networks and symbolic reasoning in a shared architecture for optimization and inference. To solve the problem, we propose a modularized reasoning architecture, which learns logical operations such as AND (∧), OR (∨) and NOT (¬) as neural modules for implication reasoning (→). In this way, logical expressions can be equivalently organized as neural networks, so that logical reasoning and prediction can be conducted in a continuous space. Experiments on real-world datasets verified the advantages of our framework compared with both shallow, deep and reasoning models.
ML100K : HR@K >0.68 快速开始
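The NCR abstract describes learning logical operators (AND, OR, NOT) as small neural modules so that an implication expression over event embeddings can be evaluated in a continuous space. Purely as an illustration of that idea (the layer sizes, the two-layer MLP form, and the TRUE anchor vector here are all invented, not taken from the paper), a NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 8

def mlp(in_dim, out_dim):
    """A tiny random two-layer perceptron standing in for a learned logic module."""
    w1 = rng.normal(size=(in_dim, 16)) * 0.1
    w2 = rng.normal(size=(16, out_dim)) * 0.1
    return lambda x: np.maximum(x @ w1, 0.0) @ w2

NOT = mlp(d, d)                  # NOT(a): vector -> vector
AND = mlp(2 * d, d)              # AND(a, b): concatenated vectors -> vector
OR  = mlp(2 * d, d)              # OR(a, b)

a, b = rng.normal(size=d), rng.normal(size=d)

# Implication a -> b rewritten as (NOT a) OR b and evaluated with neural modules;
# the result is scored against an anchor "TRUE" vector.
implication = OR(np.concatenate([NOT(a), b]))
true_vec = rng.normal(size=d)
score = implication @ true_vec / (np.linalg.norm(implication) * np.linalg.norm(true_vec))
print(score)                     # cosine similarity to the TRUE anchor
```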
9 TiSASRec: Time Interval Aware Self-Attention for Sequential Recommendation
Abstract
Sequential recommender systems seek to exploit the order of users’ interactions, in order to predict their next action based on the context of what they have done recently. Traditionally, Markov Chains (MCs), and more recently Recurrent Neural Networks (RNNs) and Self Attention (SA) have proliferated due to their ability to capture the dynamics of sequential patterns. However a simplifying assumption made by most of these models is to regard interaction histories as ordered sequences, without regard for the time intervals between each interaction (i.e., they model the time-order but not the actual timestamp). In this paper, we seek to explicitly model the timestamps of interactions within a sequential modeling framework to explore the influence of different time intervals on next item prediction. We propose TiSASRec (Time Interval aware Self-attention based sequential recommendation), which models both the absolute positions of items as well as the time intervals between them in a sequence. Extensive empirical studies show the features of TiSASRec under different settings and compare the performance of self-attention with different positional encodings. Furthermore, experimental results show that our method outperforms various state-of-the-art sequential models on both sparse and dense datasets and different evaluation metrics.
ml-1m: NDCG@10: 0.5706, Hit@10: 0.8038 快速开始
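TiSASRec, per the abstract above, augments self-attention with both position and time-interval information. One way to picture the time-interval part is as an additive attention bias looked up from bucketized gaps between interaction timestamps; the bucketing rule and bias table below are invented for illustration and simplify the paper's formulation (which uses relative interval embeddings for keys and values).

```python
import numpy as np

rng = np.random.default_rng(5)
timestamps = np.array([10, 60, 65, 300, 310])          # timestamps of 5 interactions
max_bucket = 5

# Pairwise time gaps, compressed into a small number of buckets.
gaps = np.abs(timestamps[:, None] - timestamps[None, :])
buckets = np.minimum(np.log1p(gaps).astype(int), max_bucket)    # (5, 5) integer buckets

# A learned bias per bucket (random here) added to the attention logits, so
# "a few seconds ago" and "several minutes ago" influence the prediction differently.
interval_bias = rng.normal(size=max_bucket + 1)
x = rng.normal(size=(5, 8))
logits = x @ x.T / np.sqrt(8) + interval_bias[buckets]
weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.shape)                                   # (5, 5) time-aware attention weights
```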
10 Efficient Non-Sampling Factorization Machines for Optimal Context-Aware Recommendation
Abstract
To provide more accurate recommendation, it is a trending topic to go beyond modeling user-item interactions and take context features into account. Factorization Machines (FM) with negative sampling is a popular solution for context-aware recommendation. However, it is not robust as sampling may lose important information and usually leads to non-optimal performances in practice. Several recent efforts have enhanced FM with deep learning architectures for modelling high-order feature interactions. While they either focus on rating prediction task only, or typically adopt the negative sampling strategy for optimizing the ranking performance. Due to the dramatic fluctuation of sampling, it is reasonable to argue that these sampling-based FM methods are still suboptimal for context-aware recommendation. In this paper, we propose to learn FM without sampling for ranking tasks that helps context-aware recommendation particularly. Despite effectiveness, such a non-sampling strategy presents strong challenge in learning efficiency of the model. Accordingly, we further design a new ideal framework named Efficient Non-Sampling Factorization Machines (ENSFM). ENSFM not only seamlessly connects the relationship between FM and Matrix Factorization (MF), but also resolves the challenging efficiency issue via novel memorization strategies. Through extensive experiments on three real-world public datasets, we show that 1) the proposed ENSFM consistently and significantly outperforms the state-of-the-art methods on context-aware Top-K recommendation, and 2) ENSFM achieves significant advantages in training efficiency, which makes it more applicable to real-world large-scale systems. Moreover, the empirical results indicate that a proper learning method is even more important than advanced neural network structures for Top-K recommendation task. Our implementation has been released to facilitate further developments on efficient non-sampling methods.
Movielens: HR@5: 0.0601, HR@10: 0.1024, HR@20: 0.1690 (论文table3) 快速开始
11 AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction 
Abstract
Learning feature interactions is crucial for click-through rate (CTR) prediction in recommender systems. In most existing deep learning models, feature interactions are either manually designed or simply enumerated. However, enumerating all feature interactions brings large memory and computation cost. Even worse, useless interactions may introduce noise and complicate the training process. In this work, we propose a two-stage algorithm called Automatic Feature Interaction Selection (AutoFIS). AutoFIS can automatically identify important feature interactions for factorization models with computational cost just equivalent to training the target model to convergence. In the search stage, instead of searching over a discrete set of candidate feature interactions, we relax the choices to be continuous by introducing the architecture parameters. By implementing a regularized optimizer over the architecture parameters, the model can automatically identify and remove the redundant feature interactions during the training process of the model. In the re-train stage, we keep the architecture parameters serving as an attention unit to further boost the performance. Offline experiments on three large-scale datasets (two public benchmarks, one private) demonstrate that AutoFIS can significantly improve various FM based models. AutoFIS has been deployed onto the training platform of Huawei App Store recommendation service, where a 10-day online A/B test demonstrated that AutoFIS improved the DeepFM model by 20.3% and 20.1% in terms of CTR and CVR respectively.
Criteo; (DeepFM)auc: 0.8009, logloss: 0.5404 (table1) 快速开始
12 Training Deep AutoEncoders for Collaborative Filtering
Abstract
This paper proposes a model for the rating prediction task in recommender systems which significantly outperforms previous state-of-the-art models on a time-split Netflix data set. Our model is based on deep autoencoder with 6 layers and is trained end-to-end without any layer-wise pre-training. We empirically demonstrate that: a) deep autoencoder models generalize much better than the shallow ones, b) non-linear activation functions with negative parts are crucial for training deep models, and c) heavy use of regularization techniques such as dropout is necessary to prevent overfitting. We also propose a new training algorithm based on iterative output re-feeding to overcome natural sparseness of collaborative filtering. The new algorithm significantly speeds up training and improves model performance. Our code is available at https://github.com/NVIDIA/DeepRecommender.
RMSE: Netflix 3 months: 0.9373, Netflix 6 months: 0.9207, Netflix 1 year: 0.9225, Netflix full: 0.9099 快速开始
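The "iterative output re-feeding" trick mentioned in the abstract above trains the autoencoder a second time on its own dense reconstruction, which is fully observed and therefore counteracts the sparsity of the rating vector. A toy NumPy sketch of one such step, with a random linear autoencoder and a masked loss standing in for the real 6-layer model:

```python
import numpy as np

rng = np.random.default_rng(6)
n_items, hidden = 10, 4
W_enc = rng.normal(size=(n_items, hidden)) * 0.1
W_dec = rng.normal(size=(hidden, n_items)) * 0.1

def autoencode(x):
    return np.maximum(x @ W_enc, 0.0) @ W_dec          # toy encoder / decoder

ratings = np.zeros(n_items)
observed = np.array([1, 4, 7])                         # only a few items are rated
ratings[observed] = [4.0, 3.0, 5.0]

# Pass 1: ordinary masked reconstruction loss on the observed entries only.
recon = autoencode(ratings)
loss1 = np.mean((recon[observed] - ratings[observed]) ** 2)

# Pass 2 (re-feeding): treat the dense reconstruction as a new, fully observed input
# and ask the model to reproduce it, which densifies the training signal.
recon2 = autoencode(recon)
loss2 = np.mean((recon2 - recon) ** 2)
print(loss1, loss2)
```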
13 DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning
Abstract
The Mixture-of-experts (MoE) architecture is showing promising results in multitask learning (MTL) and in scaling high-capacity neural networks. State-of-the-art MoE models use a trainable “sparse gate” to select a subset of the experts for each input example. While conceptually appealing, existing sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods. In this paper, we develop DSelect-k: the first, continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation. Our gate can be trained using first-order methods, such as stochastic gradient descent, and offers explicit control over the number of experts to select. We demonstrate the effectiveness of DSelect-k in the context of MTL, on both synthetic and real datasets with up to 128 tasks. Our experiments indicate that MoE models based on DSelect-k can achieve statistically significant improvements in predictive and expert selection performance. Notably, on a real-world large-scale recommender system, DSelect-k achieves over 22% average improvement in predictive performance compared to the Top-k gate. We provide an open-source TensorFlow implementation of our gate.
MNIST: Accuracy1: 92.56%, Accuracy2: 90.98% 快速开始
14 Self-Supervised Multi-Channel Hypergraph Convolutional Network for Social Recommendation 
Abstract
Social relations are often used to improve recommendation quality when user-item interaction data is sparse in recommender systems. Most existing social recommendation models exploit pairwise relations to mine potential user preferences. However, real-life interactions among users are very complicated and user relations can be high-order. Hypergraph provides a natural way to model complex high-order relations, while its potentials for improving social recommendation are under-explored. In this paper, we fill this gap and propose a multi-channel hypergraph convolutional network to enhance social recommendation by leveraging high-order user relations. Technically, each channel in the network encodes a hypergraph that depicts a common high-order user relation pattern via hypergraph convolution. By aggregating the embeddings learned through multiple channels, we obtain comprehensive user representations to generate recommendation results. However, the aggregation operation might also obscure the inherent characteristics of different types of high-order connectivity information. To compensate for the aggregating loss, we innovatively integrate self-supervised learning into the training of the hypergraph convolutional network to regain the connectivity information with hierarchical mutual information maximization. The experimental results on multiple real-world datasets show that the proposed model outperforms the SOTA methods, and the ablation study verifies the effectiveness of the multi-channel setting and the self-supervised task. The implementation of our model is available via https://github.com/Coder-Yu/RecQ.
LastFM: Precision@10: 20.052, Recall@10: 20.375, NDCG@10: 24.395 快速开始
15 DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems  
Abstract
Learning effective feature crosses is the key behind building recommender systems. However, the sparse and large feature space requires exhaustive search to identify effective crosses. Deep & Cross Network (DCN) was proposed to automatically and efficiently learn bounded-degree predictive feature interactions. Unfortunately, in models that serve web-scale traffic with billions of training examples, DCN showed limited expressiveness in its cross network at learning more predictive feature interactions. Despite significant research progress made, many deep learning models in production still rely on traditional feed-forward neural networks to learn feature crosses inefficiently. In light of the pros/cons of DCN and existing feature interaction learning approaches, we propose an improved framework DCN-V2 to make DCN more practical in large-scale industrial settings. In a comprehensive experimental study with extensive hyper-parameter search and model tuning, we observed that DCN-V2 approaches outperform all the state-of-the-art algorithms on popular benchmark datasets. The improved DCN-V2 is more expressive yet remains cost efficient at feature interaction learning, especially when coupled with a mixture of low-rank architecture. DCN-V2 is simple, can be easily adopted as building blocks, and has delivered significant offline accuracy and online business metrics gains across many web-scale learning to rank systems at Google.
Logloss: 0.4406, AUC: 0.8115 快速开始
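The DCN-V2 abstract revolves around an improved cross layer; in the paper each layer computes x_{l+1} = x_0 ⊙ (W_l x_l + b_l) + x_l, so explicit feature crosses grow by one degree per layer. A minimal NumPy rendering of a stack of such layers (random weights and an arbitrary input size; the mixture-of-low-rank variant is not shown):

```python
import numpy as np

rng = np.random.default_rng(7)
d, num_layers = 6, 3

x0 = rng.normal(size=d)                                # concatenated embeddings + dense features
Ws = [rng.normal(size=(d, d)) * 0.1 for _ in range(num_layers)]
bs = [np.zeros(d) for _ in range(num_layers)]

# DCN-V2 cross layer: x_{l+1} = x_0 * (W_l x_l + b_l) + x_l  (elementwise product with x_0).
x = x0
for W, b in zip(Ws, bs):
    x = x0 * (W @ x + b) + x
print(x.shape)                                         # (6,), combined with a deep network in the full model
```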
16 FLEN: Leveraging Field for Scalable CTR Prediction
Abstract
Click-Through Rate (CTR) prediction systems are usually based on multi-field categorical features, i.e., every feature is categorical and belongs to one and only one field. Modeling feature conjunctions is crucial for CTR prediction accuracy. However, it usually requires a massive number of parameters to explicitly model all feature conjunctions, which is not scalable for real-world production systems. In this paper, we describe a novel Field-Leveraged Embedding Network (FLEN) which has been deployed in the commercial recommender systems in Meitu and serves the main traffic. FLEN devises a field-wise bi-interaction pooling technique. By suitably exploiting field information, the field-wise bi-interaction pooling layer captures both inter-field and intra-field feature conjunctions with a small number of model parameters and an acceptable time complexity for industrial applications. We show that some classic shallow CTR models can be regarded as special cases of this technique, i.e., MF, FM and FwFM. We identify a unique challenge in this technique, i.e., the FM module in our model may suffer from the coupled gradient issue, which will damage the performance of the model. To solve this challenge, we develop Dicefactor: a novel dropout method to prevent independent latent features from co-adapting. Extensive experiments, including offline evaluations and online A/B testing on real production systems, demonstrate the effectiveness and efficiency of FLEN against the state-of-the-art models. In particular, compared to the previous version deployed on the system (i.e. NFM), FLEN has obtained 5.19% improvement on CTR with 1/6 of memory usage and computation time.
AUC: 0.7519, Logloss: 0.3944; 快速开始
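FLEN's field-wise bi-interaction builds on the classic FM bi-interaction pooling identity, 0.5 * ((Σ_i v_i)^2 − Σ_i v_i^2). The sketch below shows only that standard bi-interaction pooling step in NumPy; the field grouping and the DiceFactor dropout from the paper are not reproduced.

```python
import numpy as np

def bi_interaction(embeddings):
    """FM-style bi-interaction pooling: sum of elementwise products of all embedding pairs.

    Uses the identity sum_{i<j} v_i * v_j = 0.5 * ((sum_i v_i)^2 - sum_i v_i^2),
    which is linear in the number of fields instead of quadratic.
    """
    sum_vec = embeddings.sum(axis=0)
    sum_sq = (embeddings ** 2).sum(axis=0)
    return 0.5 * (sum_vec ** 2 - sum_sq)               # (emb_dim,) pooled interaction vector

emb = np.random.default_rng(8).normal(size=(5, 4))     # 5 feature fields, embedding dim 4
print(bi_interaction(emb))
```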

其他

序号 论文名称(链接) 摘要 数据集 快速开始
1 End to End Learning for Self-Driving Cars
Abstract
Point cloud is an important type of geometric data structure. Due to its irregular format, most researchers transform such data to regular 3D voxel grids or collections of images. This, however, renders data unnecessarily voluminous and causes issues. In this paper, we design a novel type of neural network that directly consumes point clouds and well respects the permutation invariance of points in the input. Our network, named PointNet, provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing. Though simple, PointNet is highly efficient and effective. Empirically, it shows strong performance on par or even better than state of the art. Theoretically, we provide analysis towards understanding of what the network has learnt and why the network is robust with respect to input perturbation and corruption.
能在模拟器上运行不偏离路面, 模拟器地址 https://github.com/udacity/self-driving-car-sim 快速开始
2 Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields
Abstract
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
full test set 75.6% 快速开始
3 Towards Deep Learning Models Resistant to Adversarial Attacks
Abstract
Recent work has demonstrated that deep neural networks are vulnerable to adversarial examples---inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us to identify methods for both training and attacking neural networks that are reliable and, in a certain sense, universal. In particular, they specify a concrete security guarantee that would protect against any adversary. These methods let us train networks with significantly improved resistance to a wide range of adversarial attacks. They also suggest the notion of security against a first-order adversary as a natural and broad security guarantee. We believe that robustness against such well-defined classes of adversaries is an important stepping stone towards fully resistant deep learning models. Code and pre-trained models are available at https://github.com/MadryLab/mnist_challenge and https://github.com/MadryLab/cifar10_challenge.
PGD-steps100-restarts20-sourceA: 89.3%; PGD-steps100-restarts20-sourceA: 95.7%; PGD-steps40-restarts1-sourceB: 96.4% 快速开始
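The robust-optimization view in the abstract above is usually instantiated with projected gradient descent (PGD) as the inner adversary: repeatedly step in the sign of the input gradient and project back into an L∞ ball. A self-contained NumPy sketch on a toy logistic-regression loss (the victim model, step size, and epsilon are arbitrary choices for the example):

```python
import numpy as np

def pgd_linf(x, grad_fn, eps=0.3, alpha=0.05, steps=10):
    """L-infinity PGD: ascend the loss with signed gradient steps, projecting back into the eps-ball."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)       # projection onto the L-inf ball around x
    return x_adv

# Toy victim: binary logistic regression with fixed weights; label y in {-1, +1}.
rng = np.random.default_rng(9)
w, x, y = rng.normal(size=5), rng.normal(size=5), 1.0

def loss_grad(x_in):
    # d/dx of log(1 + exp(-y * w.x)) = -y * w * sigmoid(-y * w.x)
    margin = -y * (w @ x_in)
    return -y * w * (1.0 / (1.0 + np.exp(-margin)))

x_adv = pgd_linf(x, loss_grad)
print(np.max(np.abs(x_adv - x)) <= 0.3 + 1e-9)         # the perturbation stays inside the ball
```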
4 Stacked Hourglass Networks for Human Pose Estimation
Abstract
This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a "stacked hourglass" network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks outcompeting all recent methods.
MPII Human Pose Dataset, hourglass52; size: 384x384, mean@0.1: 0.366; size: 256x256, mean@0.1: 0.317 快速开始
5 Learning to See in the Dark
Abstract
Imaging in low light is challenging due to low photon count and low SNR. Short-exposure images suffer from noise, while long exposure can induce blur and is often impractical. A variety of denoising, deblurring, and enhancement techniques have been proposed, but their effectiveness is limited in extreme conditions, such as video-rate imaging at night. To support the development of learning-based pipelines for low-light image processing, we introduce a dataset of raw short-exposure low-light images, with corresponding long-exposure reference images. Using the presented dataset, we develop a pipeline for processing low-light images, based on end-to-end training of a fully-convolutional network. The network operates directly on raw sensor data and replaces much of the traditional image processing pipeline, which tends to perform poorly on such data. We report promising results on the new dataset, analyze factors that affect performance, and highlight opportunities for future work. The results are shown in the supplementary video at https://youtu.be/qWKUFK7MWvg
PSNR/SSIM: 28.88/0.787 快速开始
6 Adversarial Autoencoders
Abstract
In this paper, we propose the "adversarial autoencoder" (AAE), which is a probabilistic autoencoder that uses the recently proposed generative adversarial networks (GAN) to perform variational inference by matching the aggregated posterior of the hidden code vector of the autoencoder with an arbitrary prior distribution. Matching the aggregated posterior to the prior ensures that generating from any part of prior space results in meaningful samples. As a result, the decoder of the adversarial autoencoder learns a deep generative model that maps the imposed prior to the data distribution. We show how the adversarial autoencoder can be used in applications such as semi-supervised classification, disentangling style and content of images, unsupervised clustering, dimensionality reduction and data visualization. We performed experiments on MNIST, Street View House Numbers and Toronto Face datasets and show that adversarial autoencoders achieve competitive results in generative modeling and semi-supervised classification tasks.
MNIST, Log-likelihood(10K): 340±2 快速开始
7 Supervised Contrastive Learning
Abstract
Contrastive learning applied to self-supervised representation learning has seen a resurgence in recent years, leading to state of the art performance in the unsupervised training of deep image models. Modern batch contrastive approaches subsume or significantly outperform traditional contrastive losses such as triplet, max-margin and the N-pairs loss. In this work, we extend the self-supervised batch contrastive approach to the fully-supervised setting, allowing us to effectively leverage label information. Clusters of points belonging to the same class are pulled together in embedding space, while simultaneously pushing apart clusters of samples from different classes. We analyze two possible versions of the supervised contrastive (SupCon) loss, identifying the best-performing formulation of the loss. On ResNet-200, we achieve top-1 accuracy of 81.4% on the ImageNet dataset, which is 0.8% above the best number reported for this architecture. We show consistent outperformance over cross-entropy on other datasets and two ResNet variants. The loss shows benefits for robustness to natural corruptions and is more stable to hyperparameter settings such as optimizers and data augmentations. Our loss function is simple to implement, and reference TensorFlow code is released at https://t.ly/supcon.
基于Contrastive loss, CIFAR10数据集在ResNet-50上, Top-1 Acc=96% 快速开始
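The supervised contrastive (SupCon) loss described above pulls together all same-class embeddings in a batch. In the standard formulation, for each anchor i: L_i = -1/|P(i)| Σ_{p∈P(i)} log( exp(z_i·z_p/τ) / Σ_{a≠i} exp(z_i·z_a/τ) ). A small NumPy sketch of that loss on random, L2-normalized embeddings (batch size, dimension, and temperature are arbitrary here):

```python
import numpy as np

def supcon_loss(z, labels, tau=0.1):
    """Supervised contrastive loss over one batch of L2-normalized embeddings z of shape (N, d)."""
    n = z.shape[0]
    sim = z @ z.T / tau                                 # pairwise similarities / temperature
    logits = sim - 1e9 * np.eye(n)                      # exclude self-similarity on the diagonal
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    pos_mask = (labels[:, None] == labels[None, :]).astype(float) - np.eye(n)
    per_anchor = -(pos_mask * log_prob).sum(axis=1) / np.maximum(pos_mask.sum(axis=1), 1.0)
    return per_anchor.mean()

rng = np.random.default_rng(10)
z = rng.normal(size=(8, 16))
z /= np.linalg.norm(z, axis=1, keepdims=True)
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])
print(supcon_loss(z, labels))
```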
8 CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning
Abstract
We present a simple, fully-convolutional model for real-time (>30 fps) instance segmentation that achieves competitive results on MS COCO evaluated on a single Titan Xp, which is significantly faster than any previous state-of-the-art approach. Moreover, we obtain this result after training on only one GPU. We accomplish this by breaking instance segmentation into two parallel subtasks: (1) generating a set of prototype masks and (2) predicting per-instance mask coefficients. Then we produce instance masks by linearly combining the prototypes with the mask coefficients. We find that because this process doesn't depend on repooling, this approach produces very high-quality masks and exhibits temporal stability for free. Furthermore, we analyze the emergent behavior of our prototypes and show they learn to localize instances on their own in a translation variant manner, despite being fully-convolutional. We also propose Fast NMS, a drop-in 12 ms faster replacement for standard NMS that only has a marginal performance penalty. Finally, by incorporating deformable convolutions into the backbone network, optimizing the prediction head with better anchor scales and aspect ratios, and adding a novel fast mask re-scoring branch, our YOLACT++ model can achieve 34.1 mAP on MS COCO at 33.5 fps, which is fairly close to the state-of-the-art approaches while still running at real-time.
Resnet50 AUC: Atelectasis=0.707 Cardiomegaly=0.81 Effusion=0.73 Infiltration=0.61 Mass=0.56 Nodule=0.71 Pneumonia=0.63 Pneumothorax=0.78 快速开始
9 TabNet: Attentive Interpretable Tabular Learning
Abstract
Though tremendous strides have been made in uncontrolled face detection, accurate and efficient face localisation in the wild remains an open challenge. This paper presents a robust single-stage face detector, named RetinaFace, which performs pixel-wise face localisation on various scales of faces by taking advantages of joint extra-supervised and self-supervised multi-task learning. Specifically, we make contributions in the following five aspects: (1) We manually annotate five facial landmarks on the WIDER FACE dataset and observe significant improvement in hard face detection with the assistance of this extra supervision signal. (2) We further add a self-supervised mesh decoder branch for predicting a pixel-wise 3D shape face information in parallel with the existing supervised branches. (3) On the WIDER FACE hard test set, RetinaFace outperforms the state of the art average precision (AP) by 1.1% (achieving AP equal to 91.4%). (4) On the IJB-C test set, RetinaFace enables state of the art methods (ArcFace) to improve their results in face verification (TAR=89.59% for FAR=1e-6). (5) By employing light-weight backbone networks, RetinaFace can run real-time on a single CPU core for a VGA-resolution image. Extra annotations and code have been made available at: this https URL.
Forest Cover Type: acc=96.99% 快速开始
10 Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps
Abstract
Relational reasoning is a central component of generally intelligent behavior, but has proven difficult for neural networks to learn. In this paper we describe how to use Relation Networks (RNs) as a simple plug-and-play module to solve problems that fundamentally hinge on relational reasoning. We tested RN-augmented networks on three tasks: visual question answering using a challenging dataset called CLEVR, on which we achieve state-of-the-art, super-human performance; text-based question answering using the bAbI suite of tasks; and complex reasoning about dynamic physical systems. Then, using a curated dataset called Sort-of-CLEVR we show that powerful convolutional networks do not have a general capacity to solve relational questions, but can gain this capacity when augmented with RNs. Our work shows how a deep learning architecture equipped with an RN module can implicitly discover and learn to reason about entities and their relations.
可视化方法 快速开始
11 A simple neural network module for relational reasoning
Abstract
We introduce the variational graph auto-encoder (VGAE), a framework for unsupervised learning on graph-structured data based on the variational auto-encoder (VAE). This model makes use of latent variables and is capable of learning interpretable latent representations for undirected graphs. We demonstrate this model using a graph convolutional network (GCN) encoder and a simple inner product decoder. Our model achieves competitive results on a link prediction task in citation networks. In contrast to most existing models for unsupervised learning on graph-structured data and link prediction, our model can naturally incorporate node features, which significantly improves predictive performance on a number of benchmark datasets.
CLEVR: Acc = 95.5% 快速开始
12 Variational Graph Auto-Encoders
Abstract
This paper presents a self-supervised framework for training interest point detectors and descriptors suitable for a large number of multiple-view geometry problems in computer vision. As opposed to patch-based neural networks, our fully-convolutional model operates on full-sized images and jointly computes pixel-level interest point locations and associated descriptors in one forward pass. We introduce Homographic Adaptation, a multi-scale, multi-homography approach for boosting interest point detection repeatability and performing cross-domain adaptation (e.g., synthetic-to-real). Our model, when trained on the MS-COCO generic image dataset using Homographic Adaptation, is able to repeatedly detect a much richer set of interest points than the initial pre-adapted deep model and any other traditional corner detector. The final system gives rise to state-of-the-art homography estimation results on HPatches when compared to LIFT, SIFT and ORB.
CiteSeer AUC: 90.8, AP: 92 快速开始
13 SuperPoint: Self-Supervised Interest Point Detection and Description
Abstract
The demand of applying semantic segmentation model on mobile devices has been increasing rapidly. Current state-of-the-art networks have enormous amount of parameters hence unsuitable for mobile devices, while other small memory footprint models follow the spirit of classification network and ignore the inherent characteristic of semantic segmentation. To tackle this problem, we propose a novel Context Guided Network (CGNet), which is a light-weight and efficient network for semantic segmentation. We first propose the Context Guided (CG) block, which learns the joint feature of both local feature and surrounding context, and further improves the joint feature with the global context. Based on the CG block, we develop CGNet which captures contextual information in all stages of the network and is specially tailored for increasing segmentation accuracy. CGNet is also elaborately designed to reduce the number of parameters and save memory footprint. Under an equivalent number of parameters, the proposed CGNet significantly outperforms existing segmentation networks. Extensive experiments on Cityscapes and CamVid datasets verify the effectiveness of the proposed approach. Specifically, without any post-processing and multi-scale testing, the proposed CGNet achieves 64.8% mean IoU on Cityscapes with less than 0.5 M parameters. The source code for the complete system can be found at this https URL.
MS-COCO 2014, HPatches Homography Estimation, e=1: 0.460 快速开始
14 Relaxed Transformer Decoders for Direct Action Proposal Generation
Abstract
Temporal action proposal generation is an important and challenging task in video understanding, which aims at detecting all temporal segments containing action instances of interest. The existing proposal generation approaches are generally based on pre-defined anchor windows or heuristic bottom-up boundary matching strategies. This paper presents a simple and efficient framework (RTD-Net) for direct action proposal generation, by re-purposing a Transformer-alike architecture. To tackle the essential visual difference between time and space, we make three important improvements over the original transformer detection framework (DETR). First, to deal with slowness prior in videos, we replace the original Transformer encoder with a boundary attentive module to better capture long-range temporal information. Second, due to the ambiguous temporal boundary and relatively sparse annotations, we present a relaxed matching scheme to relieve the strict criteria of single assignment to each groundtruth. Finally, we devise a three-branch head to further improve the proposal confidence estimation by explicitly predicting its completeness. Extensive experiments on THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of RTD-Net, on both tasks of temporal action proposal generation and temporal action detection. Moreover, due to its simplicity in design, our framework is more efficient than previous proposal generation methods, without non-maximum suppression post-processing. The code and models are made available at https://github.com/MCG-NJU/RTD-Action.
THUMOS14, AR@50=41.52 快速开始
15 Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting
Abstract
Multi-horizon forecasting problems often contain a complex mix of inputs -- including static (i.e. time-invariant) covariates, known future inputs, and other exogenous time series that are only observed historically -- without any prior information on how they interact with the target. While several deep learning models have been proposed for multi-step prediction, they typically comprise black-box models which do not account for the full range of inputs present in common scenarios. In this paper, we introduce the Temporal Fusion Transformer (TFT) -- a novel attention-based architecture which combines high-performance multi-horizon forecasting with interpretable insights into temporal dynamics. To learn temporal relationships at different scales, the TFT utilizes recurrent layers for local processing and interpretable self-attention layers for learning long-term dependencies. The TFT also uses specialized components for the judicious selection of relevant features and a series of gating layers to suppress unnecessary components, enabling high performance in a wide range of regimes. On a variety of real-world datasets, we demonstrate significant performance improvements over existing benchmarks, and showcase three practical interpretability use-cases of TFT.
Dataset: Electricity, P90 loss: 0.027 快速开始
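The "P90 loss" reported for TFT is a normalized quantile (pinball) loss at q=0.9; the pinball loss itself is max(q·(y−ŷ), (q−1)·(y−ŷ)). A short NumPy sketch of that metric follows; the normalization by 2/Σ|y| is the convention commonly used for q-risk in the multi-horizon forecasting literature and may differ from the exact evaluation script behind the number above.

```python
import numpy as np

def pinball(y, y_hat, q):
    """Quantile (pinball) loss: max(q * e, (q - 1) * e) with e = y - y_hat."""
    e = y - y_hat
    return np.maximum(q * e, (q - 1) * e)

def q_risk(y, y_hat, q=0.9):
    """Normalized quantile risk as commonly reported (e.g. P90 numbers): 2 * sum(pinball) / sum(|y|)."""
    return 2 * pinball(y, y_hat, q).sum() / np.abs(y).sum()

y = np.array([10.0, 12.0, 9.0, 15.0])
y_hat_p90 = np.array([12.5, 13.0, 11.0, 16.0])          # forecasts of the 90th percentile
print(q_risk(y, y_hat_p90))
```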
16 Learning to Adapt Structured Output Space for Semantic Segmentation
Abstract
Convolutional neural network-based approaches for semantic segmentation rely on supervision with pixel-level ground truth, but may not generalize well to unseen image domains. As the labeling process is tedious and labor intensive, developing algorithms that can adapt source ground truth labels to the target domain is of great interest. In this paper, we propose an adversarial learning method for domain adaptation in the context of semantic segmentation. Considering semantic segmentations as structured outputs that contain spatial similarities between the source and target domains, we adopt adversarial learning in the output space. To further enhance the adapted model, we construct a multi-level adversarial network to effectively perform output space domain adaptation at different feature levels. Extensive experiments and ablation study are conducted under various domain adaptation settings, including synthetic-to-real and cross-city scenarios. We show that the proposed method performs favorably against the state-of-the-art methods in terms of accuracy and visual quality.
Cityscapes: resnet101, mIOU=42.4 快速开始
17 Unsupervised Monocular Depth Estimation with Left-Right Consistency
Abstract
Learning based methods have shown very promising results for the task of depth estimation in single images. However, most existing approaches treat depth prediction as a supervised regression problem and as a result, require vast quantities of corresponding ground truth depth data for training. Just recording quality depth data in a range of environments is a challenging problem. In this paper, we innovate beyond existing approaches, replacing the use of explicit depth data during training with easier-to-obtain binocular stereo footage. We propose a novel training objective that enables our convolutional neural network to learn to perform single image depth estimation, despite the absence of ground truth depth data. Exploiting epipolar geometry constraints, we generate disparity images by training our network with an image reconstruction loss. We show that solving for image reconstruction alone results in poor quality depth images. To overcome this problem, we propose a novel training loss that enforces consistency between the disparities produced relative to both the left and right images, leading to improved performance and robustness compared to existing approaches. Our method produces state of the art results for monocular depth estimation on the KITTI driving dataset, even outperforming supervised methods that have been trained with ground truth depth.
KITTI: Abs Rel 0.130 快速开始
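The left-right consistency term described in the abstract above penalizes disagreement between the left-view disparity map and the right-view disparity map warped into the left view using the left disparities. A 1-D NumPy sketch with nearest-neighbour sampling illustrates the idea; real implementations use bilinear sampling on 2-D maps, and the sign of the disparity offset depends on the stereo rectification convention.

```python
import numpy as np

def lr_consistency(disp_left, disp_right):
    """Mean |d_left(x) - d_right(x - d_left(x))| along one image row (nearest-neighbour warp)."""
    width = disp_left.shape[0]
    xs = np.arange(width)
    # Sample the right-view disparity at the position the left disparity points to.
    sample_x = np.clip(np.round(xs - disp_left).astype(int), 0, width - 1)
    warped_right = disp_right[sample_x]
    return np.abs(disp_left - warped_right).mean()

disp_l = np.array([2.0, 2.0, 3.0, 3.0, 4.0, 4.0])
disp_r = np.array([2.0, 2.0, 2.0, 3.0, 3.0, 4.0])
print(lr_consistency(disp_l, disp_r))
```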
18 Digging Into Self-Supervised Monocular Depth Estimation
Abstract
Per-pixel ground-truth depth data is challenging to acquire at scale. To overcome this limitation, self-supervised learning has emerged as a promising alternative for training models to perform monocular depth estimation. In this paper, we propose a set of improvements, which together result in both quantitatively and qualitatively improved depth maps compared to competing self-supervised methods. Research on self-supervised monocular training usually explores increasingly complex architectures, loss functions, and image formation models, all of which have recently helped to close the gap with fully-supervised methods. We show that a surprisingly simple model, and associated design choices, lead to superior predictions. In particular, we propose (i) a minimum reprojection loss, designed to robustly handle occlusions, (ii) a full-resolution multi-scale sampling method that reduces visual artifacts, and (iii) an auto-masking loss to ignore training pixels that violate camera motion assumptions. We demonstrate the effectiveness of each component in isolation, and show high quality, state-of-the-art results on the KITTI benchmark.
KITTI: Abs Rel 0.106 快速开始
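The minimum reprojection loss highlighted above handles occlusion by taking, per pixel, the minimum photometric error across the source frames rather than their average. Assuming the per-source error maps have already been computed (random placeholders here), the core of it is a single elementwise minimum:

```python
import numpy as np

rng = np.random.default_rng(11)
h, w = 4, 5

# Photometric error maps of the target frame reconstructed from two source frames
# (previous and next); large values in one map typically indicate occlusion there.
err_prev = rng.uniform(size=(h, w))
err_next = rng.uniform(size=(h, w))

avg_loss = 0.5 * (err_prev + err_next)                  # classic average reprojection loss
min_loss = np.minimum(err_prev, err_next)               # Monodepth2-style per-pixel minimum
print(avg_loss.mean(), min_loss.mean())                 # the minimum is never larger than the average
```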
19  Self-supervised Monocular Depth Estimation for All Day Images using Domain Separation 
Abstract
Remarkable results have been achieved by DCNN based self-supervised depth estimation approaches. However, most of these approaches can only handle either day-time or night-time images, while their performance degrades for all-day images due to large domain shift and the variation of illumination between day and night images. To relieve these limitations, we propose a domain-separated network for self-supervised depth estimation of all-day images. Specifically, to relieve the negative influence of disturbing terms (illumination, etc.), we partition the information of day and night image pairs into two complementary sub-spaces: private and invariant domains, where the former contains the unique information (illumination, etc.) of day and night images and the latter contains essential shared information (texture, etc.). Meanwhile, to guarantee that the day and night images contain the same information, the domain-separated network takes the day-time images and corresponding night-time images (generated by GAN) as input, and the private and invariant feature extractors are learned by orthogonality and similarity loss, where the domain gap can be alleviated, thus better depth maps can be expected. Meanwhile, the reconstruction and photometric losses are utilized to estimate complementary information and depth maps effectively. Experimental results demonstrate that our approach achieves state-of-the-art depth estimation results for all-day images on the challenging Oxford RobotCar dataset, proving the superiority of our proposed approach.
IMDb测试集 error rates=4.6%, TREC-6测试集 error rates=3.6%, AG’s News测试集 error rates=5.01% (见论文Table 2 & Table 3) 快速开始