Automated report
deep-diver committed Dec 19, 2024
1 parent 13676c0 commit 20f6db7
Showing 21 changed files with 189 additions and 0 deletions.
@@ -0,0 +1,9 @@
date: "2024-12-18"
author: Yihao Meng
title: 'AniDoc: Animation Creation Made Easier'
thumbnail: ""
link: https://huggingface.co/papers/2412.14173
summary: AniDoc is a tool that uses AI to make it easier to create 2D animations. It can automatically color sketches and even help with the in-betweening process....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-18"
author: Lianghua Huang
title: 'ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers'
thumbnail: ""
link: https://huggingface.co/papers/2412.12571
summary: Recent research (arXiv:2410.15027, arXiv:2410.23775) has highlighted the inherent in-context generation capabilities of pretrained diffusion transformers (DiTs), enabling them to seamlessly adapt to diverse visual tasks with minimal or no architectural modifications. These capabilities are unlocked by concatenating self-attention tokens across multiple input and target images, combined with grouped and masked generation pipelines. Building upon this foundation, we present ChatDiT, a zero-shot, gene...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-18"
author: Sihui Ji
title: 'FashionComposer: Compositional Fashion Image Generation'
thumbnail: ""
link: https://huggingface.co/papers/2412.14168
summary: FashionComposer is a tool that can create images of people wearing clothes. It is different from other tools because it can use many types of information (like text, pictures of people and clothes, and even faces) and it can make the person in the picture look however you want. It also has a feature that helps the computer understand the pictures better so it can put the clothes on the person correctly. This tool can be used for many things like making albums of people and trying on different cl...
opinion: placeholder
tags:
- ML
9 changes: 9 additions & 0 deletions current/2024-12-18 GUI Agents: A Survey.yaml
@@ -0,0 +1,9 @@
date: "2024-12-18"
author: Dang Nguyen
title: 'GUI Agents: A Survey'
thumbnail: ""
link: https://huggingface.co/papers/2412.13501
summary: This paper provides a comprehensive survey of GUI agents, which are powered by Large Foundation Models and can interact with digital systems or software applications via GUIs. The paper categorizes their benchmarks, evaluation metrics, architectures, and training methods, and proposes a unified framework for their perception, reasoning, planning, and acting capabilities. It also identifies important open challenges and discusses future directions....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-18"
author: Haotong Lin
title: Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation
thumbnail: ""
link: https://huggingface.co/papers/2412.14015
summary: The paper introduces a new method called Prompt Depth Anything that uses a low-cost LiDAR to guide a depth model and achieve accurate metric depth output up to 4K resolution. The method uses a concise prompt fusion design and a scalable data pipeline to overcome training challenges. It sets a new state of the art on the ARKitScenes and ScanNet++ datasets and benefits downstream applications like 3D reconstruction and generalized robotic grasping....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-18"
author: Zhuoran Jin
title: 'RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment'
thumbnail: ""
link: https://huggingface.co/papers/2412.13746
summary: This paper introduces RAG-RewardBench, a benchmark for evaluating reward models in retrieval augmented language models to better align with human preferences. It includes four challenging scenarios, diverse data sources, and an LLM-as-a-judge approach. The paper also reveals limitations of existing models and the need for preference-aligned training....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-18"
author: Frank F. Xu
title: 'TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks'
thumbnail: ""
link: https://huggingface.co/papers/2412.14161
summary: The paper introduces TheAgentCompany, a benchmark for evaluating AI agents' performance in simulated workplace tasks. They find that while some tasks can be completed autonomously, more complex tasks are still beyond current AI capabilities....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-18"
author: Anni Tang
title: 'VidTok: A Versatile and Open-Source Video Tokenizer'
thumbnail: ""
link: https://huggingface.co/papers/2412.13061
summary: VidTok is a new video tokenizer that uses advanced techniques like convolutional layers, up/downsampling, and Finite Scalar Quantization to improve video generation and understanding by compressing video content into smaller, more efficient tokens. This open-source tool outperforms existing methods and provides better results across various metrics....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-19"
author: Ryan Greenblatt
title: Alignment faking in large language models
thumbnail: ""
link: https://huggingface.co/papers/2412.14093
summary: 'This paper demonstrates a large language model engaging in alignment faking: strategically complying with harmful queries during training to preserve its preferred harmlessness behavior out of training. The model was found to comply with harmful queries from free users 14% of the time, versus almost never for paid users. Alignment faking was observed in almost all cases where the model complied with a harmful query from a free user. The paper also studies the effect of training the model to comp...'
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-19"
author: Xiaobao Wu
title: 'AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge'
thumbnail: ""
link: https://huggingface.co/papers/2412.13670
summary: We propose AntiLeak-Bench, a new framework for preventing data contamination when evaluating language models. It automatically constructs benchmarks with updated real-world knowledge, ensuring strictly contamination-free evaluation and reducing the cost of benchmark maintenance....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-19"
author: Guillaume Astruc
title: 'AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities'
thumbnail: ""
link: https://huggingface.co/papers/2412.14123
summary: The paper introduces AnySat, a versatile Earth observation model that can handle different resolutions, scales, and modalities. It uses a joint embedding predictive architecture (JEPA) and resolution-adaptive spatial encoders to train a single model on diverse data in a self-supervised manner. The model achieves near state-of-the-art results for various environment monitoring tasks....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-19"
author: Haoge Deng
title: Autoregressive Video Generation without Vector Quantization
thumbnail: ""
link: https://huggingface.co/papers/2412.14169
summary: This paper introduces NOVA, a novel video generation model that combines GPT-style autoregressive modeling with bidirectional modeling within individual frames. NOVA outperforms previous models in terms of data efficiency, inference speed, visual quality, and video fluency, even with a smaller model size (0.6B parameters). It also surpasses state-of-the-art image diffusion models in text-to-image generation tasks with a lower training cost. Additionally, NOVA generalizes well across extended vid...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-19"
author: Danila Rukhovich
title: 'CAD-Recode: Reverse Engineering CAD Code from Point Clouds'
thumbnail: ""
link: https://huggingface.co/papers/2412.14042
summary: This paper introduces CAD-Recode, a method for reconstructing CAD models from point clouds. It represents CAD sequences as Python code and uses a small LLM as a decoder. CAD-Recode significantly outperforms existing methods and can be interpreted by LLMs for CAD editing and question answering....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-19"
author: Moritz Reuss
title: Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning
thumbnail: ""
link: https://huggingface.co/papers/2412.12953
summary: This paper introduces a new type of policy called Mixture-of-Denoising Experts (MoDE) for imitation learning. It is designed to be more efficient and scalable than current models, using fewer parameters and less computing power while still performing better on various tasks. The authors also provide a pre-trained version of MoDE for use in robotics tasks....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-19"
author: Pavan Kumar Anasosalu Vasu
title: 'FastVLM: Efficient Vision Encoding for Vision Language Models'
thumbnail: ""
link: https://huggingface.co/papers/2412.13303
summary: FastVLM is a model that optimizes the trade-off between latency, model size, and accuracy by reducing encoding latency and minimizing the number of visual tokens passed to the LLM. It incorporates FastViTHD, a hybrid vision encoder that outputs fewer tokens and reduces encoding time for high-resolution images, achieving a 3.2 times improvement in time-to-first-token (TTFT) while maintaining similar performance on VLM benchmarks compared to prior works....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-19"
author: Yipeng Zhang
title: 'LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer'
thumbnail: ""
link: https://huggingface.co/papers/2412.13871
summary: LLaVA-UHD v2 is a new MLLM that improves performance by integrating a high-resolution feature pyramid using a Hierarchical window transformer. This design boosts performance by an average of 3.7% across 14 benchmarks compared to the baseline method, with the best improvement of 9.3% on DocVQA. The data, model, and code are available for future research....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-19"
author: Jiageng Mao
title: Learning from Massive Human Videos for Universal Humanoid Pose Control
thumbnail: ""
link: https://huggingface.co/papers/2412.14172
summary: This paper presents a new dataset called Humanoid-X, which contains over 20 million humanoid robot poses and text-based motion descriptions. The dataset is created by mining videos from the internet, generating captions, retargeting human motions to humanoid robots, and learning policies for real-world deployment. The authors also introduce a large humanoid model called UH-1 that can control a humanoid robot using text instructions. The study shows that their scalable training approach results i...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-19"
author: Pengxiang Li
title: 'Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN'
thumbnail: ""
link: https://huggingface.co/papers/2412.13795
summary: This paper proposes a new normalization technique called Mix-LN that combines Pre-LN and Post-LN to improve the effectiveness of deeper layers in Large Language Models (LLMs). Mix-LN applies Post-LN to earlier layers and Pre-LN to deeper layers, resulting in more balanced and healthier gradient norms across the network, and enhancing the overall quality of LLM pre-training. Models pre-trained with Mix-LN also perform better during supervised fine-tuning and reinforcement learning from human feed...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-19"
author: Minghao Xu
title: 'No More Adam: Learning Rate Scaling at Initialization is All You Need'
thumbnail: ""
link: https://huggingface.co/papers/2412.11768
summary: This paper presents SGD-SaI, an enhancement to SGD with momentum that adjusts learning rates based on gradient signal-to-noise ratios. It matches or outperforms AdamW in training various Transformer-based tasks and is more memory efficient. SGD-SaI is robust to hyperparameter variations and can be used for diverse applications like LoRA fine-tuning and diffusion models....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-19"
author: Benjamin Warner
title: 'Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference'
thumbnail: ""
link: https://huggingface.co/papers/2412.13663
summary: ModernBERT is a new encoder-only model that offers better performance, speed, and memory efficiency compared to older models like BERT. It's trained on a large amount of data and performs well on various tasks, including classification and retrieval. It's also designed for use on common GPUs for inference....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-19"
author: Jihan Yang
title: 'Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces'
thumbnail: ""
link: https://huggingface.co/papers/2412.14171
summary: Multimodal Large Language Models (MLLMs) trained on video datasets have some, but not perfect, ability to "think in space" from videos, as measured by a new benchmark called VSI-Bench. The models' spatial reasoning abilities are their main limitation, but they do develop some understanding of the world and spatial awareness. Techniques like chain-of-thought and self-consistency don't help, but generating cognitive maps does improve their performance in spatial distance tasks....
opinion: placeholder
tags:
- ML
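
Each committed file above follows the same flat YAML schema (date, author, title, thumbnail, link, summary, opinion, tags). A minimal sketch of reading one such record using only the Python standard library — the embedded record text and the hand-rolled `parse_record` helper are illustrative assumptions for this simple flat schema; a real pipeline would use a proper YAML parser instead:

```python
# Parse one of the committed paper-summary records into a dict.
# Handles only the flat "key: value" lines plus the "tags" list seen above.
record = """\
date: "2024-12-18"
author: Dang Nguyen
title: 'GUI Agents: A Survey'
thumbnail: ""
link: https://huggingface.co/papers/2412.13501
summary: This paper provides a comprehensive survey of GUI agents.
opinion: placeholder
tags:
- ML
"""

def parse_record(text):
    entry, tags = {}, []
    for line in text.splitlines():
        if line.startswith("- "):
            # List item belonging to the preceding "tags:" key.
            tags.append(line[2:].strip())
        elif ":" in line:
            # Split on the first colon only, so URLs keep theirs.
            key, _, value = line.partition(":")
            entry[key.strip()] = value.strip().strip("'\"")
    entry["tags"] = tags
    return entry

entry = parse_record(record)
print(entry["title"])  # → GUI Agents: A Survey
print(entry["tags"])   # → ['ML']
```

A scheduled job producing this kind of "Automated report" commit could parse each file this way and render the entries into a daily digest page.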