Apple은 Apple Intelligence Foundation 모델에 대한 상세 기술 보고서를 공개했습니다. Google은 Gemma 2 2B 모델을 소개했으며, 이 모델은 Chatbot Arena에서 GPT-3.5를 능가하는 성능을 보였습니다. Meta는 Segment Anything Model 2 (SAM 2)를 발표하여 이미지 및 비디오 객체 분할에서의 성능 향상을 강조했습니다. PyTorch는 torchchat을 소개하며 노트북, 데스크톱, 모바일에서 로컬 LLM 추론을 가속화하는 방법을 공유했습니다. Hugging Face는 TRL 라이브러리를 통해 비전 언어 모델(VLM)을 위한 선호 최적화(Preference Optimization)를 지원하기 시작했습니다.

Apple, Apple Intelligence Foundation Language Models

링크, 2024년 7월

  • Dense decoder-only Transformer 아키텍처 사용.
  • RMSNorm 및 Query/Key normalization 사용.
  • 8개의 KV heads를 가진 GQA.
  • SwiGLU 활성화 및 RoPE(base_freq=500K) 사용.
  • Applebot을 통한 웹 크롤링 데이터, 공개 코드 및 수학 데이터셋 사용.
  • BPE 토크나이저: 서버 모델용 100K 단어 사전, 온디바이스 모델용 49K 단어 사전.
  • 3단계 사전 학습:
    • Core: 대부분의 컴퓨팅 자원 사용, AFM-server는 6.3T 토큰, 4096 시퀀스 길이.
    • Continued: 저품질 데이터의 가중치를 낮추고 코드, 수학, 라이선스 데이터의 가중치를 높임. 1T 토큰, 8192 시퀀스 길이.
    • Context-lengthening: 긴 시퀀스와 합성 데이터를 사용한 학습. 100B 토큰, 32768 시퀀스 길이.
  • 사후 학습: 합성 데이터와 인간 주석 데이터 사용.
    • 수학 문제 재구성 및 변형, 도구 사용 및 코딩.
    • RLHF: 최고 성능 모델들로 구성된 위원회(iTeC)를 활용한 반복적인 인간 선호 데이터 수집 및 온라인 데이터 갱신.
  • 배포:
    • 각 작업에 대한 어댑터 사용, 어댑터 값은 16비트로 표현.
    • 4비트 양자화, 정확도 회복 어댑터로 성능 손실 회복.
    • 일부 레이어는 2비트로 축소.
  • 평가:
    • 온디바이스: IFEval에서 최고 수준, AlpacaEval 2.0에서 Gemma 7B와 경쟁.
    • 서버: IFEval에서 최고 수준, Arena Hard에서 Mixtral 8x22B와 대등한 성능.
    • 도구/함수 호출, 작성(요약, 구성) 벤치마크에서 GPT-4/Gemini 1.5와 경쟁.

Google, Smaller, Safer, More Transparent: Advancing Responsible AI with Gemma

링크, 2024년 7월 31일

  • Gemma 2 모델을 2B 파라미터 크기로 출시.
  • Gemma 2 2B 모델은 Chatbot Arena에서 GPT-3.5 모델보다 뛰어난 성능을 보임.
  • ShieldGemma: 사용자를 보호하는 최신 안전 분류기 모델.
    • 주요 해로운 콘텐츠 유형(혐오 발언, 괴롭힘, 성적으로 노골적인 콘텐츠, 위험한 콘텐츠)을 감지하고 완화.
    • 다양한 모델 크기로 제공되어 온라인 및 오프라인 분류 작업에 적합.
  • Gemma Scope: 모델 내부 작동 방식을 이해할 수 있는 도구.
    • Sparse autoencoders(SAEs)를 사용하여 모델의 내부 작동 방식을 해석.
    • 연구자들이 모델의 패턴 인식, 정보 처리 및 예측 과정을 이해할 수 있도록 지원.

Meta, Introducing SAM 2: The next generation of Meta Segment Anything Model for videos and images

링크, 2024년 7월 29일

  • SAM 2 모델은 이미지와 비디오 객체 분할을 위한 통합 모델.
    • 실시간 프롬프트 가능한 객체 분할 기능.
    • 이미지 분할 정확도 향상 및 비디오 분할 성능 개선.
    • 기존 대비 약 3분의 1의 상호작용 시간으로 더 높은 성능 제공.
  • SA-V 데이터셋 공개:
    • 51,000개 비디오와 600,000개 이상의 마스크렛 포함.
    • 다양한 실제 시나리오와 객체 파트를 포함한 대규모 데이터셋.
  • 실시간 상호작용 세그멘테이션 데모 제공.

PyTorch, Introducing torchchat: Accelerating Local LLM Inference on Laptop, Desktop and Mobile

링크, 2024년 7월 30일

  • torchchat 라이브러리 출시: 로컬 LLM 추론 가속화.
  • Llama 3, 3.1 등 대형 언어 모델을 다양한 장치에서 원활하게 실행 가능.
  • 다양한 성능 지표와 하드웨어 구성에 대한 테스트 결과 제공.
    • Apple MacBook Pro M1 Max에서 Llama 3 8B 모델 테스트 결과: MPS Eager 모드, float16 기준 초당 12.63 토큰 처리.
    • Intel Xeon CPU와 A100 GPU에서 Llama 3 8B 모델 테스트 결과: CUDA 컴파일 모드, 4비트 양자화 기준 초당 135.16 토큰 처리.
    • Samsung Galaxy S23와 iPhone에서 4비트 GPTQ를 사용하여 초당 8토큰 이상 처리.

Hugging Face, Preference Optimization for Vision Language Models with TRL

링크, 2024년 7월 10일

  • TRL 라이브러리에서 비전 언어 모델(VLM)용 선호 최적화(DPO) 지원 시작.
    • 인간의 판단을 더 효과적으로 반영하여 모델을 미세 조정하는 방법.
    • DPO는 고정된 레이블 대신 후보 답변을 비교하고 순위를 매겨 더 정교한 인간 판단을 반영.
  • PEFT 및 bitsandbytes를 통한 QLoRA와 LoRA 미세 조정 지원.
    • Idefics2, Llava 1.5, PaliGemma 모델에 대한 지원 포함.
    • 실험 및 개발을 쉽게 하기 위한 다양한 스크립트와 예제 제공.
Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each content with detailed points, and write a report. The report format is:

(today’s date in 년 월 일) AI 소식,

Summary

(overall short summary, make summary with good details. for Summary section, explain the details starting with company name, e.g. OpenAI에서는 ~~~를 발표하였습니다.)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)
###
content type Paper | published July 2024
Apple Intelligence Foundation Language Models

We present foundation language models developed to power Apple Intelligence features, including a ∼3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute. These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used to train the model, the training process, how the models are optimized for inference, and the evaluation results. We highlight our focus on Responsible AI and how the principles are applied throughout the model development.

This paper provides technical details for Apple’s On-Device and Server Foundation Models, introduced on June 10, 2024, in this post.
Apple spilled the beans on Apple Intelligence Foundation Models (notes below):
Architecture:
> Dense - decoder only transformer architecture
> RMSNorm & Query/Key normalization
> GQA (w/ 8 KV heads)
> SwiGLU activation & RoPE (base_freq=500K for long context)
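
To make the architecture list concrete, here is a rough PyTorch sketch of one such decoder block (RMSNorm, query/key normalization, GQA with 8 KV heads, SwiGLU, RoPE with a 500K base). The dimensions are illustrative, not AFM's actual sizes.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps
    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

def rope(x, base=500_000.0):
    # Rotary position embedding; the large base frequency (500K) is what helps long context.
    b, h, t, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = torch.outer(torch.arange(t, dtype=torch.float32), inv_freq)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., ::2], x[..., 1::2]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class GQABlock(nn.Module):
    def __init__(self, dim=2048, n_heads=16, n_kv_heads=8, hidden=5632):
        super().__init__()
        self.n_heads, self.n_kv_heads, self.head_dim = n_heads, n_kv_heads, dim // n_heads
        self.q = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.kv = nn.Linear(dim, 2 * n_kv_heads * self.head_dim, bias=False)
        self.o = nn.Linear(n_heads * self.head_dim, dim, bias=False)
        self.attn_norm, self.mlp_norm = RMSNorm(dim), RMSNorm(dim)
        self.q_norm, self.k_norm = RMSNorm(self.head_dim), RMSNorm(self.head_dim)  # query/key norm
        self.w1, self.w3 = nn.Linear(dim, hidden, bias=False), nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        h = self.attn_norm(x)
        q = self.q(h).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv(h).chunk(2, dim=-1)
        k = k.view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = rope(self.q_norm(q)), rope(self.k_norm(k))
        # Grouped-query attention: each group of query heads shares one KV head.
        k = k.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        v = v.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.o(attn.transpose(1, 2).reshape(b, t, -1))
        h = self.mlp_norm(x)
        return x + self.w2(F.silu(self.w1(h)) * self.w3(h))  # SwiGLU MLP
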
Pre-training & Tokenisation:
> Webpages crawled through the Applebot (web crawl)
> Code & Math datasets (publicly licensed)
> BPE tokenizer w/ 100K vocab for server & 49K for on-device
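
As an aside, the quoted vocab sizes are easy to reproduce with a generic BPE trainer. This is only a hedged illustration using the Hugging Face tokenizers library (the corpus path is a placeholder), not Apple's actual tokenizer pipeline.

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_bpe(files, vocab_size):
    # Byte-level BPE trained to a fixed vocabulary size.
    tok = Tokenizer(models.BPE())
    tok.pre_tokenizer = pre_tokenizers.ByteLevel()
    tok.train(files, trainers.BpeTrainer(vocab_size=vocab_size))
    return tok

server_tok = train_bpe(["corpus.txt"], vocab_size=100_000)  # server-size vocab
device_tok = train_bpe(["corpus.txt"], vocab_size=49_000)   # on-device-size vocab
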
Three-step pre-training:
> Core (consumes most of the compute budget)
  AFM-server: 6.3T tokens, 4096 seq length
  AFM-on-device: initialised from a pruned 6.4B server model, trained for the full 6.3T tokens with an added distillation loss
> Continued (down-weight lower-quality data; up-weight code, math, and licensed data)
  1T tokens, 8192 seq length
  no distillation loss for AFM-on-device in this phase
> Context-lengthening (long sequences + synthetic data)
  100B tokens, 32768 seq length
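
A generic sketch of the distillation term used for AFM-on-device in the core stage: the usual next-token loss plus a soft-target KL against the pruned 6.4B teacher. The mixing weight and temperature here are illustrative assumptions, not Apple's values.

import torch.nn.functional as F

def core_stage_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=1.0):
    # Flatten (batch, seq, vocab) -> (batch*seq, vocab) for token-level losses.
    s = student_logits.flatten(0, 1)
    t = teacher_logits.flatten(0, 1)
    ce = F.cross_entropy(s, labels.flatten())                      # standard next-token loss
    kd = F.kl_div(F.log_softmax(s / temperature, dim=-1),
                  F.softmax(t / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2        # match the teacher's distribution
    return (1 - alpha) * ce + alpha * kd
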
Training Infrastructure:
> Pre-trained on v4 & v5p TPU clusters
> Using AXLearn (JAX) with a combination of tensor, fsdp, and seq parallelism
> AFM Server trained on 8192 TPUv4 chips
> AFM On-device trained on 2048 TPUv5p chips
Post Training:
> Hybrid data - synthetic + human annotated
> Synthetic data for Mathematics (problem rephrase & reversion + evolution), Tool use and coding
> RLHF: Iterative Teaching Committee (iTeC) - refresh online human preference data collection using a diverse set of the best-performing models
> For the above, collect pairwise human preferences on responses sampled from the committee
Deployment:
> Adapters for each task, adapter values represented using 16-bits, loaded on-the-fly based on the task
> Quantised under 4-bit-per-weight (3.7 bpw), use accuracy recovering adapters for regaining the lost performance
> Accuracy recovery adapter trains on 10B tokens across different ranks, 8, 16, 32
> Some layers (unimportant) pushed to 2-bit
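
The deployment recipe above boils down to aggressive weight quantization plus small 16-bit adapters that win back accuracy. Below is a rough, generic sketch of group-wise int4 quantization and the resulting bits-per-weight budget; Apple's actual scheme (3.7 bpw, with some layers at 2-bit) is not reproduced here.

import torch

def quantize_int4(weight, group_size=32):
    # Symmetric group-wise quantization to the int4 range [-8, 7].
    groups = weight.reshape(-1, group_size)
    scale = (groups.abs().amax(dim=1, keepdim=True) / 7.0).clamp_min(1e-8)
    q = torch.clamp((groups / scale).round(), -8, 7).to(torch.int8)  # stored as 4-bit in practice
    return q, scale

def dequantize_int4(q, scale, shape):
    return (q.float() * scale).reshape(shape)

# Effective bits per weight: 4 bits of payload + one 16-bit scale per group.
group_size = 32
bpw = 4 + 16 / group_size  # 4.5 bpw here; pushing some layers to 2-bit is what gets closer to 3.7
# Accuracy-recovery idea: y = x @ dequantize_int4(Wq, s, W_shape).T + x @ (A @ B).T,
# where A and B are small 16-bit LoRA factors (ranks 8/16/32 per the notes above) trained on ~10B tokens.
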
Evaluation:
> On-device: SoTA in IFEval and competitive with Gemma 7B on AlpacaEval 2.0
> Server: SoTA in IFEval, comparable to Mixtral 8x22B in Arena Hard
> Competitive with GPT-4/Gemini 1.5 on tools/function calling and writing (summarisation, composition) benchmarks
> On-device beats L3 8B on Math
The report is quite feature packed, quite enjoyed skimming through it. Thanks, Apple, for being so open about your practices and spilling the beans on what would power the next gen of on-device ML.
How is Apple training LLMs for Apple Intelligence? A new technical report shares details on Architecture, Training, Distillation, and Benchmarks for the 2.7B on-device (iPhone) and a large server-based model designed for Private Cloud computing. 👀
𝗔𝗽𝗽𝗹𝗲 𝗙𝗼𝘂𝗻𝗱𝗮𝘁𝗶𝗼𝗻 𝗠𝗼𝗱𝗲𝗹 (𝗔𝗙𝗠) 𝗱𝗲𝘁𝗮𝗶𝗹𝘀:
🏛 Dense Decoder Architecture using GQA, SwiGLU, and RoPE → very similar to Llama
📊 Pretraining Data includes licensed data, open datasets (code), and crawled data by their Applebot
📑 Used Safari’s reader to extract text from HTML, with model-based classifier, filtering, fuzzy-deduplication, and decontamination.
🔡 AFM-on-device model has 49k Vocab, and AFM-Server Model a 100k Vocab
🧩 3 Stage Pretraining: core (web), continued (high quality), context-lengthening (long context)
🚀 AFM-Server trained on 7.4T (6.3 core; 1 continued; 0.1 lengthening) tokens on TPUv4
📉 AFM-on-device distilled from a pruned 6.4B (trained from scratch) LLM on 6.3T tokens in stage 1 (core)
🔢 Max Sequence length after Pretraining is 32k
💡 Generated Synthetic data, especially for Math, Tool Use, and Coding
🎓 Post-training used SFT + RLHF → Followed by Adapter training
🎲 RLHF used iTeC (new Rejection Sampling method) and MDLOO (similar to RLOO)
🧠 Trained different models with RS, DPO, IPO in Post Training to then generate “best” synthetic data for SFT
🏅 AFM-on-device model is trained on more than 1M high-quality responses generated (”model committee”)
🔌 Uses LoRA Adapter with all-linear for Apple Intelligence Features
💾 Combines 4-bit quantization with adapter training for quality recovery (trained on 10B tokens of pretraining and post-training) followed by product-specific adapter training
📱 AFM-on-device model runs on Apple Neural Engine (ANE) on iPhones
🧪 Used common benchmarks, MMLU, IFEval, Gorilla Function Calling, GSM8k for evaluation
📋 Used 1393 samples to evaluate the general model capabilities with human experts
⚖️ Used LLM-as-a-Judge for task-specific evaluations, e.g. summarization

###
https://developers.googleblog.com/en/smaller-safer-more-transparent-advancing-responsible-ai-with-gemma/
Google
Smaller, Safer, More Transparent: Advancing Responsible AI with Gemma
JUL 31, 2024
Neel Nanda
Research Engineer
Tom Lieberum
Research Engineer
Ludovic Peran
Product Manager
Kathleen Kenealy
Research Engineer

In June, we released Gemma 2, our new best-in-class open models, in 27 billion (27B) and 9 billion (9B) parameter sizes. Since its debut, the 27B model quickly became one of the highest-ranking open models on the LMSYS Chatbot Arena leaderboard, even outperforming popular models more than twice its size in real conversations.

But Gemma is about more than just performance. It's built on a foundation of responsible AI, prioritizing safety and accessibility. To support this commitment, we are excited to announce three new additions to the Gemma 2 family:

1. Gemma 2 2B – a brand-new version of our popular 2 billion (2B) parameter model, featuring built-in safety advancements and a powerful balance of performance and efficiency.

2. ShieldGemma – a suite of safety content classifier models, built upon Gemma 2, to filter the input and outputs of AI models and keep the user safe.

3. Gemma Scope – a new model interpretability tool that offers unparalleled insight into our models' inner workings.

With these additions, researchers and developers can now create safer customer experiences, gain unprecedented insights into our models, and confidently deploy powerful AI responsibly, right on device, unlocking new possibilities for innovation.


Gemma 2 2B: Experience Next-Gen Performance, Now On-Device
We're excited to introduce the Gemma 2 2B model, a highly anticipated addition to the Gemma 2 family. This lightweight model produces outsized results by learning from larger models through distillation. In fact, Gemma 2 2B surpasses all GPT-3.5 models on the Chatbot Arena, demonstrating its exceptional conversational AI abilities.

LMSYS Chatbot Arena leaderboard scores captured on July 30th, 2024. Gemma 2 2B score +/- 10.
Gemma 2 2B offers:

Exceptional performance: Delivers best-in-class performance for its size, outperforming other open models in its category.
Flexible and cost-effective deployment: Run Gemma 2 2B efficiently on a wide range of hardware—from edge devices and laptops to robust cloud deployments with Vertex AI and Google Kubernetes Engine (GKE). To further enhance its speed, it is optimized with the NVIDIA TensorRT-LLM library and is available as an NVIDIA NIM. This optimization targets various deployments, including data centers, cloud, local workstations, PCs, and edge devices — using NVIDIA RTX, NVIDIA GeForce RTX GPUs, or NVIDIA Jetson modules for edge AI. Additionally, Gemma 2 2B seamlessly integrates with Keras, JAX, Hugging Face, NVIDIA NeMo, Ollama, Gemma.cpp, and soon MediaPipe for streamlined development.
Open and accessible: Available under the commercially-friendly Gemma terms for research and commercial applications. It's even small enough to run on the free tier of T4 GPUs in Google Colab, making experimentation and development easier than ever.
Starting today, you can download Gemma 2’s model weights from Kaggle, Hugging Face, and Vertex AI Model Garden. You can also try its capabilities in Google AI Studio.
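
For a quick local test, something like the following works with the transformers library. The Hub id google/gemma-2-2b-it is the instruction-tuned checkpoint and is gated behind the Gemma terms; treat the exact id and loading options as assumptions if they have since changed.

import torch
from transformers import pipeline

# Small enough to fit a free-tier Colab T4 in bfloat16.
generator = pipeline(
    "text-generation",
    model="google/gemma-2-2b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
out = generator("Write one sentence about on-device AI.", max_new_tokens=64)
print(out[0]["generated_text"])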


ShieldGemma: Protecting Users with State-of-the-Art Safety Classifiers
Deploying open models responsibly to ensure engaging, safe, and inclusive AI outputs requires significant effort from developers and researchers. To help developers in this process, we're introducing ShieldGemma, a series of state-of-the-art safety classifiers designed to detect and mitigate harmful content in AI models inputs and outputs. ShieldGemma specifically targets four key areas of harm:

Hate speech
Harassment
Sexually explicit content
Dangerous content
Generative AI application model architecture
These open classifiers complement our existing suite of safety classifiers in the Responsible AI Toolkit, which includes a methodology to build classifiers tailored to a specific policy with limited number of datapoints, as well as existing Google Cloud off-the-shelf classifiers served via API.


Here's how ShieldGemma can help you create safer, better AI applications:

SOTA performance: Built on top of Gemma 2, the ShieldGemma models are industry-leading safety classifiers.
Flexible sizes: ShieldGemma offers various model sizes to meet diverse needs. The 2B model is ideal for online classification tasks, while the 9B and 27B versions provide higher performance for offline applications where latency is less of a concern. All sizes leverage NVIDIA speed optimizations for efficient performance across hardware.
Open and collaborative: The open nature of ShieldGemma encourages transparency and collaboration within the AI community, contributing to the future of ML industry safety standards.


"As AI continues to mature, the entire industry will need to invest in developing high performance safety evaluators. We're glad to see Google making this investment, and look forward to their continued involvement in our AI Safety Working Group.” ~ Rebecca Weiss, Executive Director, ML Commons
Evaluation results based on Optimal F1 (left) / AU-PRC (right), higher is better. We use α=0 and T=1 for calculating the probabilities. ShieldGemma (SG) Prompt and SG Response are our test datasets and OpenAI Mod/ToxicChat are external benchmarks. The performance of baseline models on external datasets is sourced from Ghosh et al. (2024); Inan et al. (2023).
Learn more about ShieldGemma, see full results in the technical report, and start building safer AI applications with our comprehensive Responsible Generative AI Toolkit.


Gemma Scope: Illuminating AI Decision-Making with Open Sparse Autoencoders
Gemma Scope offers researchers and developers unprecedented transparency into the decision-making processes of our Gemma 2 models. Acting like a powerful microscope, Gemma Scope uses sparse autoencoders (SAEs) to zoom in on specific points within the model and make its inner workings more interpretable.

These SAEs are specialized neural networks that help us unpack the dense, complex information processed by Gemma 2, expanding it into a form that's easier to analyze and understand. By studying these expanded views, researchers can gain valuable insights into how Gemma 2 identifies patterns, processes information, and ultimately makes predictions. With Gemma Scope, we aim to help the AI research community discover how to build more understandable, accountable, and reliable AI systems.
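
As a rough mental model (not Gemma Scope's actual training setup, which uses JumpReLU SAEs and its own hyperparameters), a sparse autoencoder is just a wide, sparsity-penalized bottleneck trained to reconstruct a layer's activations:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_features):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # expand into many candidate "features"
        self.decoder = nn.Linear(d_features, d_model)   # reconstruct the original activation

    def forward(self, acts):
        feats = F.relu(self.encoder(acts))              # non-negative, mostly-zero feature activations
        return self.decoder(feats), feats

def sae_loss(recon, acts, feats, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that keeps most features switched off.
    return F.mse_loss(recon, acts) + l1_coeff * feats.abs().mean()

# acts would be residual-stream activations hooked out of a Gemma 2 layer
# (2304 is Gemma 2 2B's hidden size; 16384 features is an arbitrary expansion factor).
sae = SparseAutoencoder(d_model=2304, d_features=16384)
acts = torch.randn(8, 2304)
recon, feats = sae(acts)
loss = sae_loss(recon, acts, feats)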

Here's what makes Gemma Scope groundbreaking:

Open SAEs: Over 400 freely available SAEs covering all layers of Gemma 2 2B and 9B.
Interactive demos: Explore SAE features and analyze model behavior without writing code on Neuronpedia.
Easy-to-use repository: Code and examples for interfacing with SAEs and Gemma 2.
Learn more about Gemma Scope on the Google DeepMind blog, technical report, and developer documentation.


A Future Built on Responsible AI
These releases represent our ongoing commitment to providing the AI community with the tools and resources needed to build a future where AI benefits everyone. We believe that open access, transparency, and collaboration are essential for developing safe and beneficial AI.


Get Started Today:
Experience the power and efficiency of Gemma 2 2B by downloading it or trying it with NVIDIA NIM or Google AI Studio.
Explore ShieldGemma and build safer AI applications.
Try Gemma Scope on Neuronpedia and uncover the inner workings of Gemma 2.
Join us on this exciting journey towards a more responsible and beneficial AI future!
We’re welcoming a new 𝟮 𝗯𝗶𝗹𝗹𝗶𝗼𝗻 𝗽𝗮𝗿𝗮𝗺𝗲𝘁𝗲𝗿 𝗺𝗼𝗱𝗲𝗹 to the Gemma 2 family. 🛠️
It offers best-in-class performance for its size and can run efficiently on a wide range of hardware.
We’re also introducing 𝗦𝗵𝗶𝗲𝗹𝗱𝗚𝗲𝗺𝗺𝗮: a series of state-of-the-art safety classifiers designed to filter harmful content. 🛡️
These target hate speech, harassment, sexually explicit material and more, both in the input and output stages.
Finally, we’re announcing 𝗚𝗲𝗺𝗺𝗮 𝗦𝗰𝗼𝗽𝗲, a set of tools to help researchers examine how Gemma 2 makes decisions. 🔍

Absolutely wild! 🤯 Google DeepMind Gemma 2B outperforms OpenAI GPT-3.5 on LMSYS Chatbot arena with a score of 1130! 20 months ago, "ChatGPT is a revolution, the most powerful model ever made," and today, you can run a model more preferred than this literally on a toaster!🍞 🚀
Gemma 2 2B also ranks higher than:
> Microsoft Phi-3 Medium (14B version)
> Mistral AI 8x7B Instruct
> Mistral AI 7B fine-tunes
> Meta Llama 2 70B

###
https://pytorch.org/blog/torchchat-local-llm-inference/?utm_content=302141290&utm_medium=social&utm_source=linkedin&hss_channel=lcp-78618366
July 30, 2024

Introducing torchchat: Accelerating Local LLM Inference on Laptop, Desktop and Mobile

by Team PyTorch

Today, we’re releasing torchchat, a library showcasing how to seamlessly and performantly run Llama 3, 3.1, and other large language models across laptop, desktop, and mobile.

In our previous blog posts, we showed how to use native PyTorch 2 to run LLMs with great performance using CUDA. Torchchat expands on this with more target environments, models and execution modes. Additionally, it provides important functions such as export, quantization, and eval in a way that's easy to understand, providing an end-to-end story for those who want to build a local inference solution.

You will find the project organized into three areas:

Python: Torchchat provides a REST API that is called via a Python CLI or can be accessed via the browser
C++: Torchchat produces a desktop-friendly binary using PyTorch’s AOTInductor backend
Mobile devices: Torchchat uses ExecuTorch to export a .pte binary file for on-device inference
torchchat schema

PERFORMANCE
The following table tracks the performance of torchchat for Llama 3 for a variety of configurations.
Numbers for Llama 3.1 are coming soon.

Llama 3 8B Instruct on Apple MacBook Pro M1 Max 64GB Laptop

Mode         DType     Llama 3 8B Tokens/Sec
Arm Compile  float16     5.84
Arm Compile  int8        1.63
Arm Compile  int4        3.99
Arm AOTI     float16     4.05
Arm AOTI     int8        1.05
Arm AOTI     int4        3.28
MPS Eager    float16    12.63
MPS Eager    int8       16.9
MPS Eager    int4       17.15

Llama 3 8B Instruct on Linux x86 and CUDA
Intel(R) Xeon(R) Platinum 8339HC CPU @ 1.80GHz with 180GB RAM + A100 (80GB)

Mode          DType     Llama 3 8B Tokens/Sec
x86 Compile   bfloat16    2.76
x86 Compile   int8        3.15
x86 Compile   int4        5.33
CUDA Compile  bfloat16   83.23
CUDA Compile  int8      118.17
CUDA Compile  int4      135.16
Llama3 8B Instruct on Mobile
Torchchat achieves > 8T/s on the Samsung Galaxy S23 and iPhone using 4-bit GPTQ via ExecuTorch.
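
A quick back-of-the-envelope on why the quantized modes matter for laptops and phones: weight memory alone scales linearly with bits per weight.

params = 8.0e9  # Llama 3 8B
for name, bits in [("float16/bfloat16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name:17s} ~{params * bits / 8 / 1e9:.0f} GB of weights")
# float16/bfloat16 ~16 GB, int8 ~8 GB, int4 ~4 GB (KV cache and activations come on top)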

CONCLUSION
We encourage you to clone the torchchat repo and give it a spin, explore its capabilities, and share your feedback as we continue to empower the PyTorch community to run LLMs locally and on constrained devices. Together, let’s unlock the full potential of generative AI and LLMs on any device. Please submit issues as you see them, since we are still iterating quickly. We’re also inviting community contributions across a broad range of areas, from additional models, target hardware support, new quantization schemes, or performance improvements. Happy experimenting!

###
https://chat.lmsys.org/
7/30/24
𝗟𝗹𝗮𝗺𝗮-𝟯.𝟭 𝗺𝗼𝗱𝗲𝗹𝘀 𝗳𝗶𝗻𝗮𝗹𝗹𝘆 𝗴𝗲𝘁 𝘁𝗵𝗲𝗶𝗿 𝗖𝗵𝗮𝘁𝗯𝗼𝘁 𝗔𝗿𝗲𝗻𝗮 𝗿𝗮𝗻𝗸𝗶𝗻𝗴 🎖️
Given the impressive benchmarks published by Meta for their Llama-3.1 models, I was curious to see how these models would compare to top proprietary models on Chatbot Arena.
Now we've got the results! LMSys released the ELO derived from thousands of user votes for the new models, and here are the rankings:
💥 405B Model ranks 5th overall, in front of GPT-4-turbo! But behind GPT-4o, Claude-3.5 Sonnet and Gemini-advanced.
👏 70B Model climbs up to 9th rank! From 1206 ➡️ 1244.
👍 8B Model improves from 1152 ➡️ 1170.
✅ This confirms that Llama-3.1 is a good contender for any task: any of its 3 model sizes is much cheaper to run than equivalent proprietary models!
For instance, here are the inference prices for the top models;
➤ GPT-4-Turbo inference price from OpenAI: $5/M input tokens, $15/M output tokens
➤ Llama-3.1-405B from HF API (for testing only): 3$/M for input or output tokens (Source linked in the first comment)
➤ Llama-3.1-405B from HF API (for testing only): free ✨


###
https://ai.meta.com/sam2/
META
Introducing SAM 2: The next generation of Meta Segment Anything Model for videos and images
July 29, 2024•
15 minute read
Takeaways:


Following up on the success of the Meta Segment Anything Model (SAM) for images, we’re releasing SAM 2, a unified model for real-time promptable object segmentation in images and videos that achieves state-of-the-art performance.
In keeping with our approach to open science, we’re sharing the code and model weights with a permissive Apache 2.0 license.
We’re also sharing the SA-V dataset, which includes approximately 51,000 real-world videos and more than 600,000 masklets (spatio-temporal masks).
SAM 2 can segment any object in any video or image—even for objects and visual domains it has not seen previously, enabling a diverse range of use cases without custom adaptation.
SAM 2 has many potential real-world applications. For example, the outputs of SAM 2 can be used with a generative video model to create new video effects and unlock new creative applications. SAM 2 could also aid in faster annotation tools for visual data to build better computer vision systems.

A preview of the SAM 2 web-based demo, which allows segmenting and tracking objects in video and applying effects.

Today, we’re announcing the Meta Segment Anything Model 2 (SAM 2), the next generation of the Meta Segment Anything Model, now supporting object segmentation in videos and images. We’re releasing SAM 2 under an Apache 2.0 license, so anyone can use it to build their own experiences. We’re also sharing SA-V, the dataset we used to build SAM 2 under a CC BY 4.0 license and releasing a web-based demo experience where everyone can try a version of our model in action.

Object segmentation—identifying the pixels in an image that correspond to an object of interest—is a fundamental task in the field of computer vision. The Meta Segment Anything Model (SAM) released last year introduced a foundation model for this task on images.

Our latest model, SAM 2, is the first unified model for real-time, promptable object segmentation in images and videos, enabling a step-change in the video segmentation experience and seamless use across image and video applications. SAM 2 exceeds previous capabilities in image segmentation accuracy and achieves better video segmentation performance than existing work, while requiring three times less interaction time. SAM 2 can also segment any object in any video or image (commonly described as zero-shot generalization), which means that it can be applied to previously unseen visual content without custom adaptation.

Before SAM was released, creating an accurate object segmentation model for specific image tasks required highly specialized work by technical experts with access to AI training infrastructure and large volumes of carefully annotated in-domain data. SAM revolutionized this space, enabling application to a wide variety of real-world image segmentation and out-of-the-box use cases via prompting techniques—similar to how large language models can perform a range of tasks without requiring custom data or expensive adaptations.

In the year since we launched SAM, the model has made a tremendous impact across disciplines. It has inspired new AI-enabled experiences in Meta’s family of apps, such as Backdrop and Cutouts on Instagram, and catalyzed diverse applications in science, medicine, and numerous other industries. Many of the largest data annotation platforms have integrated SAM as the default tool for object segmentation annotation in images, saving millions of hours of human annotation time. SAM has also been used in marine science to segment Sonar images and analyze coral reefs, in satellite imagery analysis for disaster relief, and in the medical field, segmenting cellular images and aiding in detecting skin cancer.

As Mark Zuckerberg noted in an open letter last week, open source AI “has more potential than any other modern technology to increase human productivity, creativity, and quality of life,” all while accelerating economic growth and advancing groundbreaking medical and scientific research. We’ve been tremendously impressed by the progress the AI community has made using SAM, and we envisage SAM 2 unlocking even more exciting possibilities.


SAM 2 can be applied out of the box to a diverse range of real-world use cases—for example, tracking objects to create video effects (left) or segmenting moving cells in videos captured from a microscope to aid in scientific research (right).

In keeping with our open science approach, we’re sharing our research on SAM 2 with the community so they can explore new capabilities and use cases. The artifacts we’re sharing today include:

The SAM 2 code and weights, which are being open sourced under a permissive Apache 2.0 license. We’re sharing our SAM 2 evaluation code under a BSD-3 license.
The SA-V dataset, which has 4.5 times more videos and 53 times more annotations than the existing largest video segmentation dataset. This release includes ~51k real-world videos with more than 600k masklets. We’re sharing SA-V under a CC BY 4.0 license.
A web demo, which enables real-time interactive segmentation of short videos and applies video effects on the model predictions.
As a unified model, SAM 2 can power use cases seamlessly across image and video data and be extended to previously unseen visual domains. For the AI research community and others, SAM 2 could be a component as part of a larger AI system for a more general multimodal understanding of the world. In industry, it could enable faster annotation tools for visual data to train the next generation of computer vision systems, such as those used in autonomous vehicles. SAM 2’s fast inference capabilities could inspire new ways of selecting and interacting with objects in real time or live video. For content creators, SAM 2 could enable creative applications in video editing and add controllability to generative video models. SAM 2 could also be used to aid research in science and medicine—for example, tracking endangered animals in drone footage or localizing regions in a laparoscopic camera feed during a medical procedure. We believe the possibilities are broad, and we’re excited to share this technology with the AI community to see what they build and learn.


How we built SAM 2


SAM was able to learn a general notion of what objects are in images. However, images are only a static snapshot of the dynamic real world in which visual segments can exhibit complex motion. Many important real-world use cases require accurate object segmentation in video data, for example in mixed reality, robotics, autonomous vehicles, and video editing. We believe that a universal segmentation model should be applicable to both images and video.

An image can be considered a very short video with a single frame. We adopt this perspective to develop a unified model that supports both image and video input seamlessly. The only difference in handling video is that the model needs to rely on memory to recall previously processed information for that video in order to accurately segment an object at the current timestep.

Successful segmentation of objects in video requires an understanding of where entities are across space and time. Compared to segmentation in images, videos present significant new challenges. Object motion, deformation, occlusion, lighting changes, and other factors can drastically change from frame to frame. Videos are often lower quality than images due to camera motion, blur, and lower resolution, adding to the difficulty. As a result, existing video segmentation models and datasets have fallen short in providing a comparable “segment anything” capability for video. We solved many of these challenges in our work to build SAM 2 and the new SA-V dataset.

Similar to the methodology we used for SAM, our research on enabling video segmentation capabilities involves designing a new task, a model, and a dataset. We first develop the promptable visual segmentation task and design a model (SAM 2) capable of performing this task. We use SAM 2 to aid in creating a video object segmentation dataset (SA-V), which is an order of magnitude larger than anything that exists currently, and use this to train SAM 2 to achieve state-of-the-art performance.

Promptable visual segmentation

SAM 2 supports selecting and refining objects in any video frame.

We design a promptable visual segmentation task that generalizes the image segmentation task to the video domain. SAM was trained to take as input points, boxes, or masks in an image to define the target object and predict a segmentation mask. With SAM 2, we train it to take input prompts in any frame of a video to define the spatio-temporal mask (i.e. a “masklet”) to be predicted. SAM 2 makes an immediate prediction of the mask on the current frame based on the input prompt and temporally propagates it to generate the masklet of the target object across all video frames. Once an initial masklet has been predicted, it can be iteratively refined by providing additional prompts to SAM 2 in any frame. This can be repeated as many times as required until the desired masklet is obtained.


Image and video segmentation in a unified architecture

The evolution of the architecture from SAM to SAM 2.

The SAM 2 architecture can be seen as a generalization of SAM from the image to the video domain. SAM 2 can be prompted by clicks (positive or negative), bounding boxes, or masks to define the extent of the object in a given frame. A lightweight mask decoder takes an image embedding for the current frame and encoded prompts to output a segmentation mask for the frame. In the video setting, SAM 2 propagates this mask prediction to all video frames to generate a masklet. Prompts can then be iteratively added on any subsequent frame to refine the masklet prediction.

To predict masks accurately across all video frames, we introduce a memory mechanism consisting of a memory encoder, a memory bank, and a memory attention module. When applied to images, the memory components are empty and the model behaves like SAM. For video, the memory components enable storing information about the object and previous user interactions in that session, allowing SAM 2 to generate masklet predictions throughout the video. If there are additional prompts provided on other frames, SAM 2 can effectively correct its predictions based on the stored memory context of the object.

Memories of frames are created by the memory encoder based on the current mask prediction and placed in the memory bank for use in segmenting subsequent frames. The memory bank consists of both memories from previous frames and prompted frames. The memory attention operation takes the per-frame embedding from the image encoder and conditions it on the memory bank to produce an embedding that is then passed to the mask decoder to generate the mask prediction for that frame. This is repeated for all subsequent frames.

We adopt a streaming architecture, which is a natural generalization of SAM to the video domain, processing video frames one at a time and storing information about the segmented objects in the memory. On each newly processed frame, SAM 2 uses the memory attention module to attend to the previous memories of the target object. This design allows for real-time processing of arbitrarily long videos, which is important not only for annotation efficiency in collecting the SA-V dataset but also for real-world applications—for example, in robotics.

SAM introduced the ability to output multiple valid masks when faced with ambiguity about the object being segmented in an image. For example, when a person clicks on the tire of a bike, the model can interpret this click as referring to only the tire or the entire bike and output multiple predictions. In videos, this ambiguity can extend across video frames. For example, if in one frame only the tire is visible, a click on the tire might relate to just the tire, or as more of the bike becomes visible in subsequent frames, this click could have been intended for the entire bike. To handle this ambiguity, SAM 2 creates multiple masks at each step of the video. If further prompts don’t resolve the ambiguity, the model selects the mask with the highest confidence for further propagation in the video.


The occlusion head in the SAM 2 architecture is used to predict if an object is visible or not, helping segment objects even when they become temporarily occluded.

In the image segmentation task, there is always a valid object to segment in a frame given a positive prompt. In video, it’s possible for no valid object to exist on a particular frame, for example due to the object becoming occluded or disappearing from view. To account for this new output mode, we add an additional model output (“occlusion head”) that predicts whether the object of interest is present on the current frame. This enables SAM 2 to effectively handle occlusions.
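
Putting the pieces above together, the per-frame loop looks roughly like the following pseudocode-level sketch. The component names stand in for the modules described in this post and are not the released code's actual API.

def segment_video(frames, prompts_by_frame, image_encoder, memory_attention,
                  mask_decoder, memory_encoder, occlusion_head):
    memory_bank = []   # memories of prompted frames and recent frames
    masklet = []       # one mask (or None) per frame for the target object
    for t, frame in enumerate(frames):
        emb = image_encoder(frame)                                   # per-frame embedding
        cond = memory_attention(emb, memory_bank)                    # condition on stored memories
        masks, scores = mask_decoder(cond, prompts_by_frame.get(t))  # several masks when the prompt is ambiguous
        mask = masks[scores.argmax()]                                # propagate the most confident one
        visible = occlusion_head(cond)                               # object may be occluded on this frame
        masklet.append(mask if visible else None)
        memory_bank.append(memory_encoder(emb, mask))                # store for subsequent frames
    return masklet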

SA-V: Building the largest video segmentation dataset

Videos and masklet annotations from the SA-V dataset.

One of the challenges of extending the “segment anything” capability to video is the limited availability of annotated data for training the model. Current video segmentation datasets are small and lack sufficient coverage of diverse objects. Existing dataset annotations typically cover entire objects (e.g., person), but lack object parts (e.g., person’s jacket, hat, shoes), and datasets are often centered around specific object classes, such as people, vehicles, and animals.

To collect a large and diverse video segmentation dataset, we built a data engine, leveraging an interactive model-in-the-loop setup with human annotators. Annotators used SAM 2 to interactively annotate masklets in videos, and then the newly annotated data was used to update SAM 2 in turn. We repeated this cycle many times to iteratively improve both the model and dataset. Similar to SAM, we do not impose semantic constraints on the annotated masklets and focus on both whole objects (e.g., a person) and object parts (e.g., a person’s hat).

With SAM 2, collecting new video object segmentation masks is faster than ever before. Annotation with our tool and SAM 2 in the loop is approximately 8.4 times faster than using SAM per frame and also significantly faster than combining SAM with an off-the-shelf tracker.

Our released SA-V dataset contains over an order of magnitude more annotations and approximately 4.5 times more videos than existing video object segmentation datasets.

Highlights of the SA-V dataset include:

More than 600,000 masklet annotations on approximately 51,000 videos.
Videos featuring geographically diverse, real-world scenarios, collected across 47 countries.
Annotations that cover whole objects, object parts, and challenging instances where objects become occluded, disappear, and reappear.

Results

Both models are initialized with the mask of the t-shirt in the first frame. For the baseline, we use the mask from SAM. SAM 2 is able to track object parts accurately throughout a video, compared to the baseline which over-segments and includes the person’s head instead of only tracking the t-shirt.

To create a unified model for image and video segmentation, we jointly train SAM 2 on image and video data by treating images as videos with a single frame. We leverage the SA-1B image dataset released last year as part of the Segment Anything project, the SA-V dataset, and an additional internal licensed video dataset.



SAM 2 (right) improves on SAM’s (left) object segmentation accuracy in images.

Key highlights that we detail in our research paper include:


SAM 2 significantly outperforms previous approaches on interactive video segmentation across 17 zero-shot video datasets and requires approximately three times fewer human-in-the-loop interactions.
SAM 2 outperforms SAM on its 23 dataset zero-shot benchmark suite, while being six times faster.
SAM 2 excels at existing video object segmentation benchmarks (DAVIS, MOSE, LVOS, YouTube-VOS) compared to prior state-of-the-art models.
Inference with SAM 2 feels real-time at approximately 44 frames per second.
SAM 2 in the loop for video segmentation annotation is 8.4 times faster than manual per-frame annotation with SAM.
It’s important that we work to build AI experiences that work well for everyone. In order to measure the fairness of SAM 2, we conducted an evaluation on model performance across certain demographic groups. Our results show that the model has minimal performance discrepancy in video segmentation on perceived gender and little variance among the three perceived age groups we evaluated: ages 18 – 25, 26 – 50, and 50+.


Limitations


While SAM 2 demonstrates strong performance for segmenting objects in images and short videos, the model performance can be further improved—especially in challenging scenarios.

SAM 2 may lose track of objects across drastic camera viewpoint changes, after long occlusions, in crowded scenes, or in extended videos. We alleviate this issue in practice by designing the model to be interactive and enabling manual intervention with correction clicks in any frame so the target object can be recovered.


SAM 2 can sometimes confuse multiple similar looking objects in crowded scenes.

When the target object is only specified in one frame, SAM 2 can sometimes confuse objects and fail to segment the target correctly, as shown with the horses in the above video. In many cases, with additional refinement prompts in future frames, this issue can be entirely resolved and the correct masklet can be obtained throughout the video.

While SAM 2 supports the ability to segment multiple individual objects simultaneously, the efficiency of the model decreases considerably. Under the hood, SAM 2 processes each object separately, utilizing only shared per-frame embeddings, without inter-object communication. While this simplifies the model, incorporating shared object-level contextual information could aid in improving efficiency.


SAM 2 predictions can miss fine details in fast moving objects.

For complex fast moving objects, SAM 2 can sometimes miss fine details and the predictions can be unstable across frames (as shown in the video of the cyclist above). Adding further prompts to refine the prediction in the same frame or additional frames can only partially alleviate this problem. During training, we do not enforce any penalty on the model predictions if they jitter between frames, so temporal smoothness is not guaranteed. Improving this capability could facilitate real-world applications that require detailed localization of fine structures.

While our data engine uses SAM 2 in the loop and we’ve made significant strides in automatic masklet generation, we still rely on human annotators for some steps such as verifying masklet quality and selecting frames that require correction. Future developments could include further automating the data annotation process to enhance efficiency.

There’s still plenty more work to be done to propel this research even further. We hope the AI community will join us by building with SAM 2 and the resources we’ve released. Together, we can accelerate open science to build powerful new experiences and use cases that benefit people and society.

Putting SAM 2 to work

While many of Meta FAIR’s models used in public demos are hosted on Amazon SageMaker, the session-based requirements of the SAM 2 model pushed up against the boundaries of what our team believed was previously possible on AWS AI Infra. Thanks to the advanced model deployment and managed inference capabilities offered by Amazon SageMaker, we’ve been able to make the SAM 2 release possible—focusing on building state of the art AI models and unique AI demo experiences.



In the future, SAM 2 could be used as part of a larger AI system to identify everyday items via AR glasses that could prompt users with reminders and instructions.

We encourage the AI community to download the model, use the dataset, and try our demo. By sharing this research, we hope to contribute to accelerating progress in universal video and image segmentation and related perception tasks. We look forward to seeing the new insights and useful experiences that will be created by releasing this research to the community.



###
https://github.com/OpenGVLab/Diffree
7/30/24
🤩 Introducing Diffree - Inpainting with Diffusion models. 🔥Maintains consistency with original image (light, hue, texture, colors etc.)
- No need of manually drawing boxes or masks
- Adds objects to images based only on text descriptions
- Automatically determines where to place the new object
You can launch Diffree locally in just 6 steps 🔥🔥 -
> git clone <repo>
> cd Diffree
> pip install -r requirements.txt
> pip install huggingface_hub
> huggingface-cli download LiruiZhao/Diffree --local-dir ./checkpoints
> python app.py
This paper addresses an important problem of object addition for images with only text guidance. It is challenging because the new object must be integrated seamlessly into the image with consistent visual context, such as lighting, texture, and spatial location. While existing text-guided image inpainting methods can add objects, they either fail to preserve the background consistency or involve cumbersome human intervention in specifying bounding boxes or user-scribbled masks. To tackle this challenge, we introduce Diffree, a Text-to-Image (T2I) model that facilitates text-guided object addition with only text control. To this end, we curate OABench, an exquisite synthetic dataset by removing objects with advanced image inpainting techniques. OABench comprises 74K real-world tuples of an original image, an inpainted image with the object removed, an object mask, and object descriptions. Trained on OABench using the Stable Diffusion model with an additional mask prediction module, Diffree uniquely predicts the position of the new object and achieves object addition with guidance from only text. Extensive experiments demonstrate that Diffree excels in adding new objects with a high success rate while maintaining background consistency, spatial appropriateness, and object relevance and quality.

###
https://github.com/maxin-cn/Cinemo
[Submitted on 22 Jul 2024 (v1), last revised 23 Jul 2024 (this version, v2)]
Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models
Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Yuan-Fang Li, Cunjian Chen, Yu Qiao
Diffusion models have achieved great progress in image animation due to powerful generative capabilities. However, maintaining spatio-temporal consistency with detailed information from the input static image over time (e.g., style, background, and object of the input static image) and ensuring smoothness in animated video narratives guided by textual prompts still remains challenging. In this paper, we introduce Cinemo, a novel image animation approach towards achieving better motion controllability, as well as stronger temporal consistency and smoothness. In general, we propose three effective strategies at the training and inference stages of Cinemo to accomplish our goal. At the training stage, Cinemo focuses on learning the distribution of motion residuals, rather than directly predicting subsequent frames via a motion diffusion model. Additionally, a structural similarity index-based strategy is proposed to enable Cinemo to have better controllability of motion intensity. At the inference stage, a noise refinement technique based on discrete cosine transformation is introduced to mitigate sudden motion changes. These three strategies enable Cinemo to produce highly consistent, smooth, and motion-controllable results. Compared to previous methods, Cinemo offers simpler and more precise user controllability. Extensive experiments against several state-of-the-art methods, including both commercial tools and research approaches, across multiple metrics, demonstrate the effectiveness and superiority of our proposed approach.

📣 Introducing the Cinemo😍 Image to video! 💪 Performs motion-controllable image animation with strong consistency and smoothness. More details and links 👇
Cinemo with Gradio offers simpler & more precise user control & generations!
- Motion smoothness: Learns the distribution of motion residuals rather than directly generating the next frames
- Motion intensity: A structural similarity index-based method is used
- Temporal consistency
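
A hedged sketch of the motion-residual idea (my reading of the abstract, not the authors' code): the model predicts per-frame residuals relative to the input image, which are added back at inference.

import torch

def to_motion_residuals(clip, first_frame):
    # clip: (T, C, H, W) video frames; first_frame: (C, H, W) static input image.
    return clip - first_frame.unsqueeze(0)

def from_motion_residuals(residuals, first_frame):
    # Reconstruct frames by adding the predicted residuals back onto the input image.
    return residuals + first_frame.unsqueeze(0)

# Training denoises to_motion_residuals(clip, clip[0]) conditioned on the image and text;
# per the abstract, a structural-similarity score is additionally used to control motion intensity.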

###
https://huggingface.co/blog/dpo_vlm
Preference Optimization for Vision Language Models with TRL
Published July 10, 2024
Huggingface
Training models to understand and predict human preferences can be incredibly complex. Traditional methods, like supervised fine-tuning, often require assigning specific labels to data, which is not cost-efficient, especially for nuanced tasks. Preference optimization is an alternative approach that can simplify this process and yield more accurate results. By focusing on comparing and ranking candidate answers rather than assigning fixed labels, preference optimization allows models to capture the subtleties of human judgment more effectively.
Preference optimization is widely used for fine-tuning language models, but it can also be applied to vision language models (VLM). We are excited to announce that the TRL library now supports direct preference optimization (DPO) for VLMs. This article will guide you through the process of training VLMs using TRL and DPO.
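
A minimal sketch of what the training loop looks like with TRL, loosely following the blog post. The preference dataset id is a placeholder (it roughly needs prompt/chosen/rejected plus image columns), and argument names may differ slightly between TRL versions.

from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForVision2Seq, AutoProcessor
from trl import DPOConfig, DPOTrainer

model_id = "HuggingFaceM4/idefics2-8b"
model = AutoModelForVision2Seq.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

dataset = load_dataset("your-org/your-vlm-preference-dataset", split="train")  # placeholder dataset id

args = DPOConfig(
    output_dir="idefics2-8b-dpo",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    bf16=True,
)

trainer = DPOTrainer(
    model,
    ref_model=None,                       # with LoRA, the frozen base model acts as the reference
    args=args,
    train_dataset=dataset,
    tokenizer=processor,                  # the VLM's processor plays the tokenizer role here
    peft_config=LoraConfig(target_modules="all-linear"),
)
trainer.train()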

The bleeding-edge alignment technique DPO for vision language models is now available in Hugging Face TRL along with LoRA/QLoRA
DPO is a popular cutting-edge alignment technique for language models.
TLDR; the model is fine-tuned directly on a dataset of prompts with chosen and rejected responses: the training objective pushes it to assign higher likelihood to the chosen answer than to the rejected one, with no separate reward model needed.
Essentially, DPO for vision language models is pretty similar: since vision language models take in images projected into the text embedding space, it's still just input tokens in, output tokens out.
Quentin Gallouédec implemented support for Idefics2, Llava 1.5, and PaliGemma in TRL. 👏
As of now, VLM processors are quite non-standard; the only differences come from the processors and chat templates themselves, so support for a new model can be implemented very easily (see his PR in links).
Thanks to TRL's support for PEFT and bitsandbytes, you can also try QLoRA and LoRA fine-tuning (which is covered in the blog post) 😏
Please try the scripts, share your models and let us know how it goes!