This repository is an up-to-date list of significant AI papers organized by publication date. It covers five fields: computer vision, natural language processing, audio processing, multimodal learning, and reinforcement learning. If you find this repository useful, feel free to give it a star.
Maintainer: Aimerou Ndiaye
To select the most relevant papers, we set subjective citation thresholds. Each icon designates a paper type that meets one of these criteria.
🏆 Historical Paper: more than 10k citations and a decisive impact on the evolution of AI.
⭐ Important Paper: more than 50 citations and state-of-the-art results.
⏫ Trend: 1 to 50 citations; a recent, innovative paper with growing adoption.
📰 Important Article: decisive work that was not accompanied by a research paper.
- ⭐ 01/2023: Muse: Text-To-Image Generation via Masked Generative Transformers (Muse)
- ⭐ 02/2023: Structure and Content-Guided Video Synthesis with Diffusion Models (Gen-1)
- ⭐ 02/2023: Scaling Vision Transformers to 22 Billion Parameters (ViT 22B)
- ⭐ 02/2023: Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet)
- ⭐ 03/2023: Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models (Visual ChatGPT)
- ⭐ 03/2023: Scaling up GANs for Text-to-Image Synthesis (GigaGAN)
- ⭐ 04/2023: Segment Anything (SAM)
- ⭐ 04/2023: DINOv2: Learning Robust Visual Features without Supervision (DINOv2)
- ⭐ 04/2023: Visual Instruction Tuning
- ⭐ 04/2023: Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models (VideoLDM)
- ⭐ 04/2023: Synthetic Data from Diffusion Models Improves ImageNet Classification
- ⭐ 04/2023: Segment Anything in Medical Images (MedSAM)
- ⭐ 05/2023: Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold (DragGAN)
- ⭐ 06/2023: Neuralangelo: High-Fidelity Neural Surface Reconstruction (Neuralangelo)
- ⭐ 07/2023: SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis (SDXL)
- ⭐ 08/2023: 3D Gaussian Splatting for Real-Time Radiance Field Rendering
- ⭐ 08/2023: Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization... (Qwen-VL)
- ⏫ 08/2023: MVDream: Multi-view Diffusion for 3D Generation (MVDream)
- ⏫ 11/2023: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (Florence-2)
- ⏫ 12/2023: VideoPoet: A Large Language Model for Zero-Shot Video Generation (VideoPoet)
- ⭐ 01/2023: DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature (DetectGPT)
- ⭐ 02/2023: Toolformer: Language Models Can Teach Themselves to Use Tools (Toolformer)
- ⭐ 02/2023: LLaMA: Open and Efficient Foundation Language Models (LLaMA)
- 📰 03/2023: GPT-4
- ⭐ 03/2023: Sparks of Artificial General Intelligence: Early experiments with GPT-4 (GPT-4 Eval)
- ⭐ 03/2023: HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace (HuggingGPT)
- ⭐ 03/2023: BloombergGPT: A Large Language Model for Finance (BloombergGPT)
- ⭐ 04/2023: Instruction Tuning with GPT-4
- ⭐ 04/2023: Generative Agents: Interactive Simulacra of Human Behavior (Gen Agents)
- ⭐ 05/2023: PaLM 2 Technical Report (PaLM-2)
- ⭐ 05/2023: Tree of Thoughts: Deliberate Problem Solving with Large Language Models (ToT)
- ⭐ 05/2023: LIMA: Less Is More for Alignment (LIMA)
- ⭐ 05/2023: QLoRA: Efficient Finetuning of Quantized LLMs (QLoRA)
- ⭐ 05/2023: Voyager: An Open-Ended Embodied Agent with Large Language Models (Voyager)
- ⭐ 07/2023: ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs (ToolLLM)
- ⭐ 08/2023: MetaGPT: Meta Programming for Multi-Agent Collaborative Framework (MetaGPT)
- ⭐ 08/2023: Code Llama: Open Foundation Models for Code (Code Llama)
- ⏫ 09/2023: RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback (RLAIF)
- ⭐ 09/2023: Large Language Models as Optimizers (OPRO)
- ⏫ 10/2023: Eureka: Human-Level Reward Design via Coding Large Language Models (Eureka)
- ⏫ 12/2023: Mathematical discoveries from program search with large language models (FunSearch)
- ⭐ 01/2023: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E)
- ⭐ 01/2023: MusicLM: Generating Music From Text (MusicLM)
- ⭐ 01/2023: AudioLDM: Text-to-Audio Generation with Latent Diffusion Models (AudioLDM)
- ⭐ 03/2023: Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages (USM)
- ⭐ 05/2023: Scaling Speech Technology to 1,000+ Languages (MMS)
- ⏫ 06/2023: Simple and Controllable Music Generation (MusicGen)
- ⏫ 06/2023: AudioPaLM: A Large Language Model That Can Speak and Listen (AudioPaLM)
- ⏫ 06/2023: Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale (Voicebox)
- ⭐ 02/2023: Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1)
- ⭐ 03/2023: PaLM-E: An Embodied Multimodal Language Model (PaLM-E)
- ⭐ 04/2023: AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head (AudioGPT)
- ⭐ 05/2023: ImageBind: One Embedding Space To Bind Them All (ImageBind)
- ⏫ 07/2023: Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning (CM3Leon)
- ⏫ 07/2023: Meta-Transformer: A Unified Framework for Multimodal Learning (Meta-Transformer)
- ⏫ 08/2023: SeamlessM4T: Massively Multilingual & Multimodal Machine Translation (SeamlessM4T)
- ⭐ 01/2023: Mastering Diverse Domains through World Models (DreamerV3)
- ⏫ 02/2023: Grounding Large Language Models in Interactive Environments with Online RL (GLAM)
- ⏫ 02/2023: Efficient Online Reinforcement Learning with Offline Data (RLPD)
- ⏫ 03/2023: Reward Design with Language Models
- ⭐ 05/2023: Direct Preference Optimization: Your Language Model is Secretly a Reward Model (DPO)
- ⏫ 06/2023: Faster sorting algorithms discovered using deep reinforcement learning (AlphaDev)
- ⏫ 08/2023: Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization (Retroformer)
- ⭐ 02/2023: Symbolic Discovery of Optimization Algorithms (Lion)
- ⭐ 07/2023: RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (RT-2)
- ⏫ 11/2023: Scaling deep learning for materials discovery (GNoME)
- ⏫ 12/2023: Discovery of a structural class of antibiotics with explainable deep learning
- ⭐ 01/2022: A ConvNet for the 2020s (ConvNeXt)
- ⭐ 01/2022: Patches Are All You Need (ConvMixer)
- ⭐ 02/2022: Block-NeRF: Scalable Large Scene Neural View Synthesis (Block-NeRF)
- ⭐ 03/2022: DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection (DINO)
- ⭐ 03/2022: Scaling Up Your Kernels to 31×31: Revisiting Large Kernel Design in CNNs (Large Kernel CNN)
- ⭐ 03/2022: TensoRF: Tensorial Radiance Fields (TensoRF)
- ⭐ 04/2022: MaxViT: Multi-Axis Vision Transformer (MaxViT)
- ⭐ 04/2022: Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2)
- ⭐ 05/2022: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen)
- ⭐ 05/2022: GIT: A Generative Image-to-text Transformer for Vision and Language (GIT)
- ⭐ 06/2022: CMT: Convolutional Neural Networks Meet Vision Transformers (CMT)
- ⭐ 07/2022: Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors... (Swin UNETR)
- ⭐ 07/2022: Classifier-Free Diffusion Guidance
- ⭐ 08/2022: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation (DreamBooth)
- ⭐ 09/2022: DreamFusion: Text-to-3D using 2D Diffusion (DreamFusion)
- ⭐ 09/2022: Make-A-Video: Text-to-Video Generation without Text-Video Data (Make-A-Video)
- ⭐ 10/2022: On Distillation of Guided Diffusion Models
- ⭐ 10/2022: LAION-5B: An open large-scale dataset for training next generation image-text models (LAION-5B)
- ⭐ 10/2022: Imagic: Text-Based Real Image Editing with Diffusion Models (Imagic)
- ⭐ 11/2022: Visual Prompt Tuning
- ⭐ 11/2022: Magic3D: High-Resolution Text-to-3D Content Creation (Magic3D)
- ⭐ 11/2022: DiffusionDet: Diffusion Model for Object Detection (DiffusionDet)
- ⭐ 11/2022: InstructPix2Pix: Learning to Follow Image Editing Instructions (InstructPix2Pix)
- ⭐ 12/2022: Multi-Concept Customization of Text-to-Image Diffusion (Custom Diffusion)
- ⭐ 12/2022: Scalable Diffusion Models with Transformers (DiT)
- ⭐ 01/2022: LaMDA: Language Models for Dialog Applications (LaMDA)
- ⭐ 01/2022: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (CoT)
- ⭐ 02/2022: Competition-Level Code Generation with AlphaCode (AlphaCode)
- ⭐ 02/2022: Finetuned Language Models Are Zero-Shot Learners (FLAN)
- ⭐ 03/2022: Training language models to follow instructions with human feedback (InstructGPT)
- ⭐ 03/2022: Multitask Prompted Training Enables Zero-Shot Task Generalization (T0)
- ⭐ 03/2022: Training Compute-Optimal Large Language Models (Chinchilla)
- ⭐ 04/2022: Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (SayCan)
- ⭐ 04/2022: GPT-NeoX-20B: An Open-Source Autoregressive Language Model (GPT-NeoX)
- ⭐ 04/2022: PaLM: Scaling Language Modeling with Pathways (PaLM)
- ⭐ 06/2022: Beyond the Imitation Game: Quantifying and extrapolating the capabilities of lang... (BIG-bench)
- ⭐ 06/2022: Solving Quantitative Reasoning Problems with Language Models (Minerva)
- ⭐ 10/2022: ReAct: Synergizing Reasoning and Acting in Language Models (ReAct)
- ⭐ 11/2022: BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (BLOOM)
- 📰 11/2022: Optimizing Language Models for Dialogue (ChatGPT)
- ⭐ 12/2022: Large Language Models Encode Clinical Knowledge (Med-PaLM)
- ⭐ 02/2022: mSLAM: Massively multilingual joint pre-training for speech and text (mSLAM)
- ⭐ 02/2022: ADD 2022: the First Audio Deep Synthesis Detection Challenge (ADD)
- ⭐ 03/2022: Efficient Training of Audio Transformers with Patchout (PaSST)
- ⭐ 04/2022: MAESTRO: Matched Speech Text Representations through Modality Matching (Maestro)
- ⭐ 05/2022: SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language... (SpeechT5)
- ⭐ 06/2022: WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing (WavLM)
- ⭐ 07/2022: BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for ASR (BigSSL)
- ⭐ 08/2022: MuLan: A Joint Embedding of Music Audio and Natural Language (MuLan)
- ⭐ 09/2022: AudioLM: a Language Modeling Approach to Audio Generation (AudioLM)
- ⭐ 09/2022: AudioGen: Textually Guided Audio Generation (AudioGen)
- ⭐ 10/2022: High Fidelity Neural Audio Compression (EnCodec)
- ⭐ 12/2022: Robust Speech Recognition via Large-Scale Weak Supervision (Whisper)
- ⭐ 01/2022: BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language... (BLIP)
- ⭐ 02/2022: data2vec: A General Framework for Self-supervised Learning in Speech, Vision and... (Data2vec)
- ⭐ 03/2022: VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks (VL-Adapter)
- ⭐ 04/2022: Winoground: Probing Vision and Language Models for Visio-Linguistic... (Winoground)
- ⭐ 04/2022: Flamingo: a Visual Language Model for Few-Shot Learning (Flamingo)
- ⭐ 05/2022: A Generalist Agent (Gato)
- ⭐ 05/2022: CoCa: Contrastive Captioners are Image-Text Foundation Models (CoCa)
- ⭐ 05/2022: VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts (VLMo)
- ⭐ 08/2022: Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks (BEiT-3)
- ⭐ 09/2022: PaLI: A Jointly-Scaled Multilingual Language-Image Model (PaLI)
- ⭐ 01/2022: Learning robust perceptive locomotion for quadrupedal robots in the wild
- ⭐ 02/2022: BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning
- ⭐ 02/2022: Outracing champion Gran Turismo drivers with deep reinforcement learning (Sophy)
- ⭐ 02/2022: Magnetic control of tokamak plasmas through deep reinforcement learning
- ⭐ 08/2022: Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning (ANYmal)
- ⭐ 10/2022: Discovering faster matrix multiplication algorithms with reinforcement learning (AlphaTensor)
- ⭐ 02/2022: FourCastNet: A Global Data-driven High-resolution Weather Model... (FourCastNet)
- ⭐ 05/2022: ColabFold: making protein folding accessible to all (ColabFold)
- ⭐ 06/2022: Measuring and Improving the Use of Graph Information in GNN
- ⭐ 10/2022: TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis (TimesNet)
- ⭐ 12/2022: RT-1: Robotics Transformer for Real-World Control at Scale (RT-1)
- 🏆 1958: Perceptron: A probabilistic model for information storage and organization in the brain (Perceptron)
- 🏆 1986: Learning representations by back-propagating errors (Backpropagation)
- 🏆 1986: Induction of decision trees (ID3)
- 🏆 1989: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition (HMM)
- 🏆 1989: Multilayer feedforward networks are universal approximators
- 🏆 1992: A training algorithm for optimal margin classifiers (SVM)
- 🏆 1996: Bagging predictors
- 🏆 1998: Gradient-based learning applied to document recognition (CNN/GTN)
- 🏆 2001: Random Forests
- 🏆 2001: A fast and elitist multiobjective genetic algorithm (NSGA-II)
- 🏆 2003: Latent Dirichlet Allocation (LDA)
- 🏆 2006: Reducing the Dimensionality of Data with Neural Networks (Autoencoder)
- 🏆 2008: Visualizing Data using t-SNE (t-SNE)
- 🏆 2009: ImageNet: A large-scale hierarchical image database (ImageNet)
- 🏆 2012: ImageNet Classification with Deep Convolutional Neural Networks (AlexNet)
- 🏆 2013: Efficient Estimation of Word Representations in Vector Space (Word2vec)
- 🏆 2013: Auto-Encoding Variational Bayes (VAE)
- 🏆 2014: Generative Adversarial Networks (GAN)
- 🏆 2014: Dropout: A Simple Way to Prevent Neural Networks from Overfitting (Dropout)
- 🏆 2014: Sequence to Sequence Learning with Neural Networks (Seq2Seq)
- 🏆 2014: Neural Machine Translation by Jointly Learning to Align and Translate (RNNSearch-50)
- 🏆 2014: Adam: A Method for Stochastic Optimization (Adam)
- 🏆 2015: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Cov... (BatchNorm)
- 🏆 2015: Going Deeper With Convolutions (Inception)
- 🏆 2015: Human-level control through deep reinforcement learning (Deep Q Network)
- 🏆 2015: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (Faster R-CNN)
- 🏆 2015: U-Net: Convolutional Networks for Biomedical Image Segmentation (U-Net)
- 🏆 2015: Deep Residual Learning for Image Recognition (ResNet)
- 🏆 2016: You Only Look Once: Unified, Real-Time Object Detection (YOLO)
- 🏆 2017: Attention is All you Need (Transformer)
- 🏆 2018: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (BERT)
- 🏆 2020: Language Models are Few-Shot Learners (GPT-3)
- 🏆 2020: Denoising Diffusion Probabilistic Models (DDPM)
- 🏆 2020: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)
- 🏆 2021: Highly accurate protein structure prediction with AlphaFold (AlphaFold)
- 📰 2022: ChatGPT: Optimizing Language Models For Dialogue (ChatGPT)