• 알리바바에서는 새로운 추론 AI 모델 QwQ-32B-Preview를 출시하여 OpenAI의 o1 모델에 도전하고 있습니다. 이 모델은 특정 벤치마크에서 우수한 성능을 보이며, 수학 및 논리 퍼즐 해결 능력을 향상시켰습니다.
  • DeepSeek에서는 DeepSeek-R1-Lite-Preview를 발표하여 AIME 및 MATH 벤치마크에서 OpenAI의 o1-preview 수준의 성능을 달성했습니다. 이 모델은 실시간으로 투명한 사고 과정을 보여주며, 곧 오픈 소스 모델과 API를 출시할 예정입니다.
  • NVIDIA에서는 Hymba라는 하이브리드 헤드 아키텍처를 적용한 소형 언어 모델을 발표하여 성능과 효율성을 향상시켰습니다. Hymba 1.5B 모델은 유사한 크기의 오픈 소스 모델보다 우수한 성능을 보이며, 상용 사용이 가능합니다.
  • Hugging Face에서는 SmolVLM이라는 소형이지만 강력한 비전 언어 모델을 공개하였습니다. 이 모델은 메모리 효율성이 뛰어나고 완전한 오픈 소스로 제공되며, 다양한 응용 프로그램에서 사용할 수 있습니다.
  • Blackforest Labs에서는 FLUX.1 Tools를 출시하여 FLUX.1 텍스트-이미지 모델에 제어 기능을 추가하였습니다. 이 도구는 실세계 이미지와 생성된 이미지를 수정 및 재생성할 수 있게 해주며, 네 가지 모델이 공개되었습니다.
  • Apple에서는 빠른 CoreML 모델과 이를 iPhone에서 실행할 수 있는 iOS 앱을 발표하였습니다. 이 모델들은 높은 성능과 효율성을 제공하며, Hugging Face에서 다운로드할 수 있고 GitHub에서 코드를 확인할 수 있습니다.
  • Alex Garcia는 sqlite-vec의 새로운 업데이트를 발표하여 vec0 가상 테이블에서 메타데이터 열과 필터링을 지원하게 되었습니다. 이를 통해 SQLite 또는 LibSQL 데이터베이스에서의 벡터 검색이 더욱 향상되었습니다.
  • 연구팀에서는 XGrammar라는 대형 언어 모델을 위한 유연하고 효율적인 구조화 생성 엔진을 제안하였습니다. 이 엔진은 기존 솔루션보다 최대 100배 빠른 속도를 제공하며, 구조화된 생성을 거의 지연 없이 수행할 수 있습니다.
  • .txt 팀에서는 ‘Let Me Speak Freely’ 논문에 대한 반박을 통해 구조화된 출력이 올바르게 구현되면 LLM 성능을 실제로 향상시킬 수 있음을 보여주었습니다. 이는 프롬프트, 파서, 생성기의 일치를 통해 가능하다고 강조하였습니다.

알리바바, QwQ: Reflect Deeply on the Boundaries of the Unknown

링크, 2024년 11월 28일

  • 새로운 추론 AI 모델 QwQ-32B-Preview 출시
  • 325억 개의 파라미터를 포함하고 약 32,000 단어의 프롬프트를 처리 가능
  • OpenAI의 o1-preview 및 o1-mini보다 특정 벤치마크에서 우수한 성능
  • AIME와 MATH 테스트에서 OpenAI의 o1 모델을 능가
  • 논리 퍼즐 해결 및 도전적인 수학 질문 해결 가능
  • 언어 전환 및 루프에 빠지는 현상 등의 한계점 존재
  • 모델 자체의 사실 확인을 통해 잘못된 답변 감소, 그러나 답변 시간이 길어질 수 있음
  • Hugging Face 플랫폼에서 다운로드 가능하며 상업적 사용 가능 (아래 로드 예시 참고)
  • Apache 2.0 라이선스 하에 공개
  • 모델의 일부 구성 요소만 공개되어 전체 시스템의 재현이나 심층적인 이해는 어려움
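
위 모델은 transformers 라이브러리의 일반적인 causal LM 로딩 패턴으로 사용할 수 있습니다. 아래는 이를 가정한 최소 스케치이며, 프롬프트 내용과 생성 파라미터는 임의의 예시입니다.

# Qwen/QwQ-32B-Preview를 transformers로 로드하는 최소 스케치 (일반적인 causal LM 패턴을 가정)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/QwQ-32B-Preview"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# 예시 질문 (내용은 임의): 추론 과정이 길게 출력될 수 있으므로 max_new_tokens를 넉넉히 설정
messages = [{"role": "user", "content": "9.9와 9.11 중 어느 수가 더 큰지 단계적으로 설명해 줘."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))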

DeepSeek, DeepSeek-R1-Lite Release 2024/11/20

링크, 2024년 11월 20일

  • DeepSeek-R1-Lite-Preview 출시, 향상된 추론 능력 제공
  • AIME 및 MATH 벤치마크에서 OpenAI의 o1-preview 수준의 성능 달성
  • 실시간으로 투명한 사고 과정 제공
  • 곧 오픈 소스 모델과 API 출시 예정
  • 벤치마크 전반에서 인상적인 결과 달성
  • 더 긴 사고 과정을 통해 더 나은 성능 달성
  • 추론 길이가 증가함에 따라 AIME에서 점수가 꾸준히 향상

NVIDIA, Hymba Hybrid-Head Architecture Boosts Small Language Model Performance

링크, 2024년 11월 22일

  • NVIDIA 연구팀이 Hymba라는 하이브리드 헤드 아키텍처를 도입한 소형 언어 모델 발표
  • Transformer의 어텐션 메커니즘과 SSM을 통합하여 효율성과 성능 향상
  • Hymba 1.5B 모델이 유사한 크기의 오픈 소스 모델보다 우수한 성능 달성
  • 더 높은 처리량과 적은 메모리 요구사항 달성
  • Hymba-1.5B-Base는 상업적 사용 준비 완료 (아래 로드 스케치 참고)
  • NVIDIA Open Model License Agreement 하에 출시
  • 다양한 벤치마크에서 뛰어난 성능을 입증
  • 어텐션 맵 시각화를 통해 모델의 작동 방식 분석
  • 프루닝 기반 헤드 중요도 분석으로 어텐션 헤드와 SSM 헤드의 역할 확인
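
아래는 Hymba-1.5B-Base를 transformers로 불러오는 스케치입니다. 저장소 ID(nvidia/Hymba-1.5B-Base)와 trust_remote_code 필요 여부는 가정이며, 모델 카드에 따르면 생성 시 배치 크기는 1이어야 합니다.

# Hymba-1.5B-Base 로드 스케치. 저장소 ID와 trust_remote_code 사용은 가정 (커스텀 아키텍처로 추정)
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "nvidia/Hymba-1.5B-Base"  # 가정: Hugging Face 컬렉션의 저장소 이름
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True, torch_dtype="auto").to("cuda")

prompt = "Hybrid attention and state space models can"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # 모델 카드 기준: 생성 시 배치 크기 1
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))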

Hugging Face, SmolVLM - small yet mighty Vision Language Model

링크, 2024년 11월 26일

  • SmolVLM이라는 20억(2B) 파라미터 규모의 비전 언어 모델 공개
  • 작은 메모리 공간에서도 강력한 성능을 발휘하며, 완전한 오픈 소스로 제공
  • 모델 체크포인트, 데이터셋, 훈련 레시피 및 도구를 Apache 2.0 라이선스로 공개
  • 다양한 멀티모달 작업에서 우수한 성능을 보임
  • 메모리 사용량과 처리량 측면에서 효율성 제공
  • 비디오 이해 기능도 제공하며, 제한된 계산 자원에서도 활용 가능
  • VLMEvalKit와의 통합을 통해 추가 벤치마크 평가 가능
  • transformers 라이브러리를 통해 쉽게 모델 사용 가능 (아래 예시 참고)
  • 사용자 맞춤형 미세 조정을 위한 튜토리얼 및 도구 제공
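
아래는 블로그에 공개된 transformers 예제 코드를 간추린 것으로, 이미지 1장과 질문을 넣어 답을 생성하는 흐름입니다.

# SmolVLM-Instruct 사용 예 (블로그의 예제 코드를 간추림)
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct", torch_dtype=torch.bfloat16
).to(DEVICE)

image = load_image("https://huggingface.co/spaces/HuggingFaceTB/SmolVLM/resolve/main/example_images/rococo.jpg")
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Can you describe the image?"}]}]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)

generated_ids = model.generate(**inputs, max_new_tokens=500)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])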

Blackforest Labs, Introducing FLUX.1 Tools

링크, 2024년 11월 21일

  • FLUX.1 Tools 출시, FLUX.1 텍스트-이미지 모델에 제어 및 조정 기능 추가
  • 실제 및 생성된 이미지를 수정 및 재생성할 수 있는 도구 제공
  • 네 가지 오픈 액세스 모델 공개: FLUX.1 Fill, FLUX.1 Depth, FLUX.1 Canny, FLUX.1 Redux
  • 각 모델은 고유한 기능과 능력을 제공
  • FLUX.1 Fill은 최신 인페인팅 및 아웃페인팅 모델 제공 (아래 사용 스케치 참고)
  • FLUX.1 Depth와 FLUX.1 Canny는 깊이 맵 또는 캐니 에지 기반의 구조적 가이드를 통해 이미지 변환 제어 가능
  • FLUX.1 Redux는 입력 이미지와 텍스트 프롬프트를 혼합하여 새로운 이미지를 생성
  • 벤치마크 결과, FLUX.1 시리즈가 기존 모델들보다 우수한 성능을 보임
  • BFL API를 통해 프로 버전 모델에 접근 가능
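
아래는 FLUX.1 Fill [dev]로 인페인팅을 수행하는 스케치입니다. diffusers가 FluxFillPipeline을 제공한다는 점과 Hugging Face 저장소 이름(black-forest-labs/FLUX.1-Fill-dev)은 가정이며, 입력 이미지와 마스크 경로는 임의의 예시입니다.

# FLUX.1 Fill [dev] 인페인팅 스케치 (diffusers의 FluxFillPipeline 지원과 저장소 이름은 가정)
import torch
from diffusers import FluxFillPipeline
from diffusers.utils import load_image

pipe = FluxFillPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Fill-dev", torch_dtype=torch.bfloat16
).to("cuda")

image = load_image("input.png")  # 수정할 원본 이미지 (예시 경로)
mask = load_image("mask.png")    # 흰색 영역이 다시 그려질 부분을 나타내는 이진 마스크 (예시 경로)

result = pipe(prompt="a small wooden bench in a garden", image=image, mask_image=mask).images[0]
result.save("filled.png")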

Apple, Core ML Models

링크, 2024년 11월 22일

  • Apple이 빠른 CoreML 모델과 이를 iPhone에서 실행할 수 있는 iOS 앱 발표
  • S0, S2, MobileCLIP-B(LT) 등 고성능 모델 출시
  • S0는 OpenAI의 ViT-B/16과 동일한 제로샷 성능을 보이면서 4.8배 빠르고 2.8배 작음
  • S2는 SigLIP의 ViT-B/16보다 우수한 제로샷 정확도를 보이며, 2.3배 빠르고 2.1배 작으며 3배 적은 샘플로 학습됨
  • 모델은 Hugging Face에서 다운로드 가능하고, 코드도 GitHub에서 확인 가능 (아래 실행 스케치 참고)
  • 다양한 컴퓨터 비전 작업에서 사용 가능
  • FastViT, Depth Anything V2 등 다양한 모델 제공
  • 각 모델에 대한 상세한 정보와 예제 프로젝트 제공
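
아래는 Hugging Face에서 내려받은 Core ML 모델(.mlpackage)을 macOS의 coremltools로 실행해 보는 스케치입니다. 파일명과 입력 키 이름은 설명을 위한 가정이며, 실제 값은 각 모델 카드에서 확인해야 합니다.

# 내려받은 Core ML 모델을 coremltools로 실행하는 스케치 (파일명과 입력 키는 가정)
import coremltools as ct
from PIL import Image

model = ct.models.MLModel("MobileCLIP-S0-ImageEncoder.mlpackage")  # 가정된 파일명
print(model.get_spec().description)  # 실제 입력/출력 이름과 크기를 여기서 확인

image = Image.open("photo.jpg").resize((256, 256))  # 입력 해상도 역시 모델 카드 기준으로 조정 (예시 값)
outputs = model.predict({"image": image})           # "image"는 가정된 입력 키 이름
print(outputs.keys())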

Alex Garcia, sqlite-vec now supports metadata columns and filtering

링크, 2024년 11월 20일

  • sqlite-vec의 새로운 업데이트 버전 0.1.6 발표
  • vec0 가상 테이블에서 메타데이터 열과 필터링 지원
  • 메타데이터 컬럼을 통해 벡터 검색에서 조건부 필터링 가능 (아래 사용 예시 참고)
  • 파티션 키 도입으로 벡터 인덱스를 샤딩하여 쿼리 속도 향상
  • 보조 열을 사용하여 별도의 조인 없이 메타데이터 저장 및 접근 가능
  • 향후 양자화 지원 개선 계획
  • ANN 인덱스 지원을 통해 성능 향상 예정
  • sqlite-lembed와 sqlite-rembed 등의 관련 프로젝트도 업데이트 예정
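
아래는 새 기능(메타데이터 열, 파티션 키, 보조 열)을 Python의 sqlite3와 sqlite-vec 바인딩으로 시험해 보는 스케치입니다. vec0 테이블 선언 문법은 발표 내용을 요약한 가정이며, 정확한 문법은 공식 문서를 확인해야 합니다.

# sqlite-vec 0.1.6의 메타데이터 열/필터링 사용 스케치 (vec0 선언 문법은 발표 내용을 따른 가정)
import sqlite3
import sqlite_vec  # pip install sqlite-vec
from sqlite_vec import serialize_float32

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)

db.execute("""
CREATE VIRTUAL TABLE docs USING vec0(
  user_id INTEGER PARTITION KEY,  -- 파티션 키: 인덱스를 사용자별로 샤딩
  embedding FLOAT[4],
  category TEXT,                  -- 메타데이터 열: WHERE 절 필터링에 사용
  +contents TEXT                  -- 보조 열: 별도 조인 없이 원문 저장
)
""")
db.execute(
    "INSERT INTO docs(user_id, embedding, category, contents) VALUES (?, ?, ?, ?)",
    (1, serialize_float32([0.1, 0.2, 0.3, 0.4]), "news", "예시 문서"),
)

rows = db.execute(
    "SELECT contents, distance FROM docs "
    "WHERE embedding MATCH ? AND k = 5 AND user_id = 1 AND category = 'news'",
    (serialize_float32([0.1, 0.2, 0.3, 0.4]),),
).fetchall()
print(rows)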

연구팀, XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models

링크, 2024년 11월 22일

  • XGrammar라는 유연하고 효율적인 구조화 생성 엔진 제안
  • 문맥 자유 문법을 사용하여 구조화된 생성을 가능하게 함 (아래 개념 스케치 참고)
  • 기존 솔루션보다 최대 100배의 속도 향상 달성
  • LLM 추론 엔진과 결합하여 거의 지연 없는 구조화 생성 가능
  • MLC-LLM, SGLang, WebLLM 등에 적용됨
  • vLLM과 Hugging Face에서도 지원 예정
  • 정확한 JSON/문법 생성 가능
  • 기술 보고서에서 상세한 기술 및 성능 데이터 제공
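
아래는 XGrammar의 실제 API가 아니라, 문법 제약(구조화) 생성이 동작하는 원리를 보여 주는 개념 스케치입니다. 매 디코딩 스텝에서 현재 문법 상태가 허용하는 토큰만 남기고 나머지 로짓을 마스킹하는 것이 핵심입니다.

# 문법 제약 디코딩의 핵심 아이디어를 보여 주는 개념 스케치 (XGrammar의 실제 API 아님)
import math

def constrained_step(logits: list[float], allowed_token_ids: set[int]) -> int:
    """현재 문법 상태에서 허용되지 않는 토큰을 -inf로 마스킹한 뒤 greedy로 선택한다."""
    masked = [
        logit if token_id in allowed_token_ids else -math.inf
        for token_id, logit in enumerate(logits)
    ]
    return max(range(len(masked)), key=lambda i: masked[i])

# 장난감 예: 어휘가 5개뿐이라고 가정
vocab = ['{', '}', '"a"', ':', '1']
logits = [0.1, 2.0, 1.5, 0.3, 0.9]  # 모델은 '}'를 가장 선호하지만
allowed = {0}                        # JSON 문법상 첫 토큰은 '{'만 허용된다고 하면
print(vocab[constrained_step(logits, allowed)])  # => '{'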

.txt 팀, Say What You Mean: A Response to ‘Let Me Speak Freely’

링크, 2024년 11월 24일

  • ‘Let Me Speak Freely’ 논문에 대한 반박을 통해 구조화된 출력이 LLM 성능을 향상시킬 수 있음을 증명
  • 원 논문의 결과는 약한 프롬프트와 구조화된 프롬프트의 잘못된 사용으로 인한 것임을 지적
  • 정확한 프롬프트 설계와 파서, 생성기의 일치가 중요함을 강조 (아래 예시 참고)
  • 구조화된 출력이 테스트에서 비구조화된 출력보다 우수한 성능을 보임
  • JSON 생성을 위한 프롬프트 설계의 중요성 강조
  • 예시에서 기대하는 형식과 실제 작업에서의 형식이 일치해야 함
  • 구조화된 생성은 적절하게 구현되면 성능 향상에 기여함을 강조
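
아래는 '프롬프트에 요구한 형식 = 파서가 기대하는 형식'을 맞춘다는 논지를 보여 주는 일반적인 패턴의 스케치입니다. .txt 팀의 실제 구현이 아니며, generate()는 LLM 호출을 대신하는 가상 함수입니다.

# 프롬프트·파서·생성기의 형식을 일치시키는 패턴의 스케치 (.txt 팀의 실제 구현 아님)
import json
from pydantic import BaseModel

class Answer(BaseModel):
    reasoning: str
    final_answer: int

SCHEMA = json.dumps(Answer.model_json_schema(), ensure_ascii=False)
PROMPT = (
    "다음 산수 문제를 풀고, 아래 JSON 스키마에 정확히 맞는 JSON만 출력하세요.\n"
    f"스키마: {SCHEMA}\n문제: 12 + 30은?"
)

def generate(prompt: str) -> str:
    # 가상 함수: 실제로는 LLM API 호출 결과가 들어갈 자리
    return '{"reasoning": "12 더하기 30은 42", "final_answer": 42}'

parsed = Answer.model_validate_json(generate(PROMPT))  # 프롬프트에 명시한 것과 같은 스키마로 검증
print(parsed.final_answer)
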
Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each content with detailed points, and write a report. The report format is:

(today’s date in 년 월 일) AI 소식,

Summary

(overall short summary, make summary with good details. for Summary section, explain the details starting with company name, e.g. OpenAI에서는 ~~~를 발표하였습니다.)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)
###
https://huggingface.co/Qwen/QwQ-32B-Preview
QwQ: Reflect Deeply on the Boundaries of the Unknown
November 28, 2024
· 22 min · 4496 words · Qwen Team

알리바바가 OpenAI의 o1 reasoning model에 도전하는 'open' 모델을 출시한다

새로운 "추론" AI 모델 QwQ-32B-Preview가 출시됨.
QwQ-32B-Preview는 325억 개(32.5B)의 파라미터를 포함하고, 약 32,000 단어 길이의 프롬프트를 처리할 수 있음.
해당 모델은 특정 벤치마크에서 OpenAI의 o1-preview 및 o1-mini보다 더 나은 성능을 보임.
AIME와 MATH 테스트에서 OpenAI의 o1 모델을 능가함.
논리 퍼즐을 풀고, 도전적인 수학 질문을 해결할 수 있음.
그러나 언어 전환, 루프에 빠지는 현상, 상식적 추론을 요구하는 작업에서 약점을 보일 수 있음.
QwQ-32B-Preview는 모델 자체적으로 사실 확인을 수행함으로써 잘못된 답변을 줄이는 반면, 답변 시간이 길어질 수 있음.
Hugging Face 플랫폼에서 다운로드 가능하며 상업적 용도로 사용 가능함.
중국의 규제를 준수하기 위해 정치적 주제에 대해서는 신중한 접근을 취함.
QwQ-32B-Preview는 Apache 2.0 라이선스 하에 공개됨.
모델의 일부 구성 요소만이 공개되어 전체 시스템의 재현이나 내밀한 이해는 어려움.
"스케일링 법칙"의 지속 가능한 효율성에 대한 의문이 제기되면서, AI 연구는 새로운 접근법과 기술 개발에 집중하고 있음.
Google을 포함한 여러 대형 연구소들이 추론 모델에 투자하고 있음.


Note: This is the pronunciation of QwQ: /kwju:/ , similar to the word “quill”.

What does it mean to think, to question, to understand? These are the deep waters that QwQ (Qwen with Questions) wades into. Like an eternal student of wisdom, it approaches every problem - be it mathematics, code, or knowledge of our world - with genuine wonder and doubt. QwQ embodies that ancient philosophical spirit: it knows that it knows nothing, and that’s precisely what drives its curiosity. Before settling on any answer, it turns inward, questioning its own assumptions, exploring different paths of thought, always seeking deeper truth. Yet, like all seekers of wisdom, QwQ has its limitations. This version is but an early step on a longer journey - a student still learning to walk the path of reasoning. Its thoughts sometimes wander, its answers aren’t always complete, and its wisdom is still growing. But isn’t that the beauty of true learning? To be both capable and humble, knowledgeable yet always questioning? We invite you to explore alongside QwQ, embracing both its insights and its imperfections as part of the endless quest for understanding.

Limitations
QwQ-32B-Preview is an experimental research model developed by the Qwen Team, focused on advancing AI reasoning capabilities. As a preview release, it demonstrates promising analytical abilities while having several important limitations:

Language Mixing and Code-Switching: The model may mix languages or switch between them unexpectedly, affecting response clarity.
Recursive Reasoning Loops: The model may enter circular reasoning patterns, leading to lengthy responses without a conclusive answer.
Safety and Ethical Considerations: The model requires enhanced safety measures to ensure reliable and secure performance, and users should exercise caution when deploying it.
Performance and Benchmark Limitations: The model excels in math and coding but has room for improvement in other areas, such as common sense reasoning and nuanced language understanding.
Performance
Through deep exploration and countless trials, we discovered something profound: when given time to ponder, to question, and to reflect, the model’s understanding of mathematics and programming blossoms like a flower opening to the sun. Just as a student grows wiser by carefully examining their work and learning from mistakes, our model achieves deeper insight through patient, thoughtful analysis. This process of careful reflection and self-questioning leads to remarkable breakthroughs in solving complex problems. Our journey of discovery revealed the model’s exceptional ability to tackle some of the most challenging problems in mathematics and programming, including:

GPQA: A Graduate-Level Google-Proof Q&A Benchmark, a challenging benchmark for evaluating scientific problem-solving abilities through graduate-level questions.
AIME: American Invitation Mathematics Evaluation, which tests mathematical problem solving with arithmetic, algebra, counting, geometry, number theory, and probability and other secondary school math topics.
MATH-500: The 500 test cases of the MATH benchmark, a comprehensive dataset testing mathematical problem-solving.
LiveCodeBench: A challenging benchmark for evaluating code generation and problem solving abilities in real-world programming scenarios.


Specifically, QwQ demonstrates remarkable performance across these benchmarks, achieving impressive scores of 65.2% on GPQA, showcasing its graduate-level scientific reasoning capabilities; 50.0% on AIME, highlighting its strong mathematical problem-solving skills; 90.6% on MATH-500, demonstrating exceptional mathematical comprehension across diverse topics; and 50.0% on LiveCodeBench, validating its robust programming abilities in real-world scenarios. These results underscore QwQ’s significant advancement in analytical and problem-solving capabilities, particularly in technical domains requiring deep reasoning.



Introduction
QwQ-32B-Preview is an experimental research model developed by the Qwen Team, focused on advancing AI reasoning capabilities. As a preview release, it demonstrates promising analytical abilities while having several important limitations:

Language Mixing and Code-Switching: The model may mix languages or switch between them unexpectedly, affecting response clarity.
Recursive Reasoning Loops: The model may enter circular reasoning patterns, leading to lengthy responses without a conclusive answer.
Safety and Ethical Considerations: The model requires enhanced safety measures to ensure reliable and secure performance, and users should exercise caution when deploying it.
Performance and Benchmark Limitations: The model excels in math and coding but has room for improvement in other areas, such as common sense reasoning and nuanced language understanding.
Specification:

Type: Causal Language Models
Training Stage: Pretraining & Post-training
Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
Number of Parameters: 32.5B
Number of Parameters (Non-Embedding): 31.0B
Number of Layers: 64
Number of Attention Heads (GQA): 40 for Q and 8 for KV
Context Length: Full 32,768 tokens
For more details, please refer to our blog. You can also check Qwen2.5 GitHub, and Documentation.


###
https://api-docs.deepseek.com/news/news1120
DeepSeek-R1-Lite Release 2024/11/20
🚀 DeepSeek-R1-Lite-Preview is now live: unleashing supercharged reasoning power!
🔍 o1-preview-level performance on AIME & MATH benchmarks.

💡 Transparent thought process in real-time.

🛠️ Open-source models & API coming soon!

🌐 Try it now at http://chat.deepseek.com


🌟 Impressive Results of DeepSeek-R1-Lite-Preview Across Benchmarks!


🌟 Inference Scaling Laws of DeepSeek-R1-Lite-Preview

Longer Reasoning, Better Performance. DeepSeek-R1-Lite-Preview shows steady score improvements on AIME as thought length increases.


###
https://developer.nvidia.com/blog/hymba-hybrid-head-architecture-boosts-small-language-model-performance/
NVIDIA

Hymba Hybrid-Head Architecture Boosts Small Language Model Performance
Nov 22, 2024
By Xin Dong, Yonggan Fu, Shizhe Diao and Pavlo Molchanov


Transformers, with their attention-based architecture, have become the dominant choice for language models (LMs) due to their strong performance, parallelization capabilities, and long-term recall through key-value (KV) caches. However, their quadratic computational cost and high memory demands pose efficiency challenges. In contrast, state space models (SSMs) like Mamba and Mamba-2 offer constant complexity and efficient hardware optimization but struggle with memory recall tasks, affecting their performance on general benchmarks.

NVIDIA researchers recently proposed Hymba, a family of small language models (SLMs) featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with SSMs to achieve both enhanced efficiency and improved performance. In Hymba, attention heads provide high-resolution recall, while SSM heads enable efficient context summarization.

The novel architecture of Hymba reveals several insights:

Overhead in attention: Over 50% of attention computation can be replaced by cheaper SSM computation.
Local attention dominance: Most global attention can be replaced by local attention without sacrificing performance on general and recall-intensive tasks, thanks to the global information summarized by SSM heads.
KV cache redundancy: Key-value cache is highly correlated across heads and layers, so it can be shared across heads (group query attention) and layers (cross-layer KV cache sharing).
Softmax attention limitation: Attention mechanisms are constrained to sum to one, limiting sparsity, and flexibility. We introduce learnable meta-tokens that are prepended to prompts, storing critical information and alleviating the “forced-to-attend” burden associated with attention mechanisms.
This post shows that Hymba 1.5B performs favorably against state-of-the-art open-source models of similar size, including Llama 3.2 1B, OpenELM 1B, Phi 1.5, SmolLM2 1.7B, Danube2 1.8B, and Qwen2.5 1.5B. Compared to Transformer models of similar size, Hymba also achieves higher throughput and requires 10x less memory to store cache.

Hymba 1.5B is released to the Hugging Face collection and GitHub.

Hymba 1.5B performance
Figure 1 compares Hymba 1.5B against sub-2B models (Llama 3.2 1B, OpenELM 1B, Phi 1.5, SmolLM2 1.7B, Danube2 1.8B, Qwen2.5 1.5B) in terms of average task accuracy, cache size (MB) relative to sequence length, and throughput (tok/sec).

Figure 1. Performance comparison of Hymba 1.5B Base against sub-2B models
In this set of experiments, the tasks include MMLU, ARC-C, ARC-E, PIQA, Hellaswag, Winogrande, and SQuAD-C. The throughput is measured on an NVIDIA A100 GPU with a sequence length of 8K and a batch size of 128 using PyTorch. For models encountering out of memory (OOM) issues during throughput measurement, the batch size was halved until the OOM is resolved to measure the maximal achievable throughput without OOM.

Hymba model design
SSMs such as Mamba were introduced to address the quadratic complexity and large inference-time KV cache issues of transformers. However, due to their low-resolution memory, SSMs struggle with memory recall and performance. To overcome these limitations, we propose a road map for developing efficient and high-performing small LMs in Table 1.

Configuration Commonsense reasoning (%) ↑ Recall (%) ↑ Throughput (token/sec) ↑ Cache size (MB) ↓ Design reason
Ablations on 300M model size and 100B training tokens
Transformer (Llama) 44.08 39.98 721.1 414.7 Accurate recall while inefficient
State-space models (Mamba) 42.98 19.23 4720.8 1.9 Efficient while inaccurate recall
A. + Attention heads (sequential) 44.07 45.16 776.3 156.3 Enhance recall capabilities
B. + Multi-head heads (parallel) 45.19 49.90 876.7 148.2 Better balance of two modules
C. + Local / global attention 44.56 48.79 2399.7 41.2 Boost compute/cache efficiency
D. + KV cache sharing 45.16 48.04 2756.5 39.4 Cache efficiency
E. + Meta-tokens 45.59 51.79 2695.8 40.0 Learned memory initialization
Scaling to 1.5B model size and 1.5T training tokens
F. + Size / data 60.56 64.15 664.1 78.6 Further boost task performance
G. + Extended context length (2K→8K) 60.64 68.79 664.1 78.6 Improve multishot and recall tasks
Table 1. Design road map of the Hymba model
Fused hybrid modules
Fusing attention and SSM heads in parallel within a hybrid-head module outperforms sequential stacking, according to the ablation study. Hymba fuses attention and SSM heads in parallel within a hybrid head module, enabling both heads to process the same information simultaneously. This architecture improves reasoning and recall accuracy.

Figure 2. The hybrid-head module in Hymba
Efficiency and KV cache optimization
While attention heads improve task performance, they increase KV cache requirements and reduce throughput. To mitigate this, Hymba optimizes the hybrid-head module by combining local and global attention and employing cross-layer KV cache sharing. This improves throughput by 3x and reduces cache by almost 4x without sacrificing performance.

Figure 3. Hymba model architecture
Meta-tokens
A set of 128 pretrained embeddings prepended to inputs, functioning as learned cache initialization to enhance focus on relevant information. These tokens serve a dual purpose:

Mitigating attention drain by acting as backstop tokens, redistributing attention effectively
Encapsulating compressed world knowledge
Figure 4. Interpretation of Hymba from the memory aspect
Model analysis
This section presents an apples-to-apples comparison across different architectures under the same training settings. We then visualize the attention maps of SSM and Attention in different pretrained models. Finally, we perform head importance analysis for Hymba through pruning. All the analyses in this section help to illustrate how and why the design choices for Hymba are effective.

Apples-to-apples comparison
We performed an apples-to-apples comparison of Hymba, pure Mamba2, Mamba2 with FFN, Llama3 style, and Samba style (Mamba-FFN-Attn-FFN) architectures. All models have 1 billion parameters and are trained from scratch for 100 billion tokens from SmolLM-Corpus with exactly the same training recipe. All results are obtained through lm-evaluation-harness using a zero-shot setting on Hugging Face models. Hymba performs the best on commonsense reasoning as well as question answering and recall-intensive tasks.

Table 2 compares various model architectures on language modeling and recall-intensive and commonsense reasoning tasks, with Hymba achieving strong performance across metrics. Hymba demonstrates the lowest perplexity in language tasks (18.62 for Wiki and 10.38 for LMB) and solid results in recall-intensive tasks, particularly in SWDE (54.29) and SQuAD-C (44.71), leading to the highest average score in this category (49.50).

Model Language (PPL) ↓ Recall intensive (%) ↑ Commonsense reasoning (%) ↑
Mamba2 15.88 43.34 52.52
Mamba2 w/ FFN 17.43 28.92 51.14
Llama3 16.19 47.33 52.82
Samba 16.28 36.17 52.83
Hymba 14.5 49.5 54.57
Table 2. Comparison of architectures trained on 100 billion tokens under the same settings
In commonsense reasoning and question answering, Hymba outperforms other models in most tasks, such as SIQA (31.76) and TruthfulQA (31.64), with an average score of 54.57, slightly above Llama3 and Mamba2. Overall, Hymba stands out as a balanced model, excelling in both efficiency and task performance across diverse categories.

Attention map visualization
We further categorized elements in the attention map into four types:

Meta: Attention scores from all real tokens to meta-tokens. This category reflects the model’s preference for attending to meta-tokens. In attention maps, they are usually located in the first few columns (for example, 128 for Hymba) if a model has meta-tokens.
BOS: Attention scores from all real tokens to the beginning-of-sequence token. In the attention map, they are usually located in the first column right after the meta-tokens.
Self: Attention scores from all real tokens to themselves. In the attention map, they are usually located in the diagonal line.
Cross: Attention scores from all real tokens to other real tokens. In the attention map, they are usually located in the off-diagonal area.
The attention pattern of Hymba is significantly different from that of vanilla Transformers. In vanilla Transformers, attention scores are more concentrated on BOS, which is consistent with the findings in Attention Sink. In addition, vanilla Transformers also have a higher proportion of Self attention scores. In Hymba, meta-tokens, attention heads, and SSM heads work complementary to each other, leading to a more balanced distribution of attention scores across different types of tokens.

Specifically, meta-tokens offload the attention scores from BOS, enabling the model to focus more on the real tokens. SSM heads summarize the global context, which focuses more on current tokens (Self attention scores). Attention heads, on the other hand, pay less attention to Self and BOS tokens, and more attention to other tokens (that is, Cross attention scores). This suggests that the hybrid-head design of Hymba can effectively balance the attention distribution across different types of tokens, potentially leading to better performance.

Figure 5. Schematics of the attention map of Hymba as a combination of meta-tokens, sliding window attention, and Mamba contributions
Figure 6. Sum of the attention score from different categories in Llama 3.2 3B and Hymba 1.5B
Heads importance analysis
We analyzed the relative importance of attention and SSM heads in each layer by removing them and recording the final accuracy. Our analysis reveals the following:

The relative importance of attention/SSM heads in the same layer is input-adaptive and varies across tasks, suggesting that they can serve different roles when handling various inputs.
The SSM head in the first layer is critical for language modeling, and removing it causes a substantial accuracy drop to random guess levels.
Generally, removing one attention/SSM head results in an average accuracy drop of 0.24%/1.1% on Hellaswag, respectively.
Figure 7. The achieved accuracy, measured using 1K samples from Hellaswag, after removing the Attention or SSM heads in each layer
Model architecture and training best practices
This section outlines key architectural decisions and training methodologies for Hymba 1.5B Base and Hymba 1.5B Instruct.

Model architecture
Hybrid architecture: Mamba is great at summarization and usually closer focuses on the current token, while attention is more precise and acts as snapshot memory. Combining them in parallel merges these benefits, but standard sequential fusion does not. We chose a 5:1 parameter ratio between SSM and attention heads.
Sliding window attention: Full attention heads are preserved in three layers (first, last, and middle), with sliding window attention heads used in the remaining 90% layers.
Cross-layer KV cache sharing: Implemented between every two consecutive attention layers. It is done in addition to GQA KV cache sharing between heads.
Meta-tokens: These 128 tokens are learnable with no supervision, helping to avoid entropy collapse problems in large language models (LLMs) and mitigate the attention sink phenomenon. Additionally, the model stores general knowledge in these tokens.
Training best practices
Pretraining: We opted for two-stage base model training. Stage 1 maintained a constant large learning rate and used less filtered large corpus data. Continuous learning rate decay was then performed to 1e-5 using high-quality data. This approach enables continuous training and resuming of Stage 1.
Instruction fine-tuning: Instruct model tuning is performed in three stages. First, SFT-1 provides the model with strong reasoning abilities by training on code, math, function calling, role play, and other task-specific data. Second, SFT-2 teaches the model to follow human instructions. Finally, DPO is leveraged to align the model with human preferences and improve the model’s safety.
Figure 8. Training pipeline adapted for the Hymba model family (General pretraining, LR annealing, SFT-1, SFT-2, and DPO)
Performance and efficiency evaluation
With only 1.5T pretraining tokens, the Hymba 1.5B model performs the best among all small LMs and achieves better throughput and cache efficiency than all transformer-based LMs.

For example, when benchmarking against the strongest baseline, Qwen2.5, which is pretrained on 13x more tokens, Hymba 1.5B achieves a 1.55% average accuracy improvement, 1.41x throughput, and 2.90x cache efficiency. Compared to the strongest small LM trained on fewer than 2T tokens, namely h2o-danube2, our method achieves a 5.41% average accuracy improvement, 2.45x throughput, and 6.23x cache efficiency.

New small hybrid model from NVIDIA has been announced! Hymba is a 1.5B hybrid Mamba x Attention Model that outperforms other small LLMs like Meta 3.2 or SmolLM v2 being trained on only 1.5T Tokens. 🤯
It uses a new hybrid architecture with Mamba and Attention heads running in parallel with additional meta tokens (learnable tokens prepended to every prompt), to improve the efficacy of the model. 👀
It shares the KV cache between 2 layers and between heads in a single layer. It has 16 SSM states and 3 full attention layers; the rest are sliding window attention. It also uses FlexAttention from Pytorch 2.5! 🔥


Hymba-1.5B-Base
Model Overview
Hymba-1.5B-Base is a base text-to-text model that can be adopted for a variety of natural language generation tasks.

The model has hybrid architecture with Mamba and Attention heads running in parallel. Meta tokens, a set of learnable tokens prepended to every prompt, help improve the efficacy of the model. The model shares KV cache between 2 layers and between heads in a single layer. 90% of attention layers are sliding window attention.

This model is ready for commercial use.

[Caution] During generation, the batch size needs to be 1. Our current implementation does not fully support padding of Meta tokens + SWA; this is a work in progress. Training and pre-filling support any batch size.

Model Developer: NVIDIA

Model Dates: Hymba-1.5B-Base was trained between September 1, 2024 and November 10th, 2024.

License: This model is released under the NVIDIA Open Model License Agreement.

Model Architecture
Hymba-1.5B-Base has a model embedding size of 1600, 25 attention heads, and an MLP intermediate dimension of 5504, with 32 layers in total, 16 SSM states, 3 full attention layers, the rest are sliding window attention. Unlike the standard Transformer, each attention layer in Hymba has a hybrid combination of standard attention heads and Mamba heads in parallel. Additionally, it uses Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE).

Features of this architecture:

Fuse attention heads and SSM heads within the same layer, offering parallel and complementary processing of the same inputs.
Hymba Module
Introduce meta tokens that are prepended to the input sequences and interact with all subsequent tokens, thus storing important information and alleviating the burden of "forced-to-attend" in attention.

Integrate with cross-layer KV sharing and global-local attention to further boost memory and computation efficiency.

Hymba Model
Performance Highlights
Hymba-1.5B-Base outperforms all sub-2B public models.
Compare with SoTA Small LMs

###
https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct
SmolVLM - small yet mighty Vision Language Model
Published November 26, 2024
By Andres Marafioti, Merve Noyan, Miquel Farré, Elie Bakouch, Pedro Cuenca
TLDR
This blog post introduces SmolVLM, a 2B VLM, SOTA for its memory footprint. SmolVLM is small, fast, memory-efficient, and fully open-source. All model checkpoints, VLM datasets, training recipes and tools are released under the Apache 2.0 license.

What is SmolVLM?
This year has seen a boom in multimodal AI with many large vision language models released. The trends were to initially scale up compute, later scale up the data diversity by generating synthetic data with large models, and, recently, scale down to make these models more efficient. Small open models allow local deployment to browser or edge devices, cut inference costs, and enable user customization. Some notable examples of these models include PaliGemma 3B, moondream2, and Qwen2VL.

In this blog post, we introduce SmolVLM, a new family of 2B small vision language models that can be used commercially and deployed to smaller local setups, with completely open training pipelines.

We release three models: SmolVLM-Base, which can be used for downstream fine-tuning, SmolVLM-Synthetic, the fine-tuned variant on synthetic data, and SmolVLM Instruct, the fine-tuned instruction variant, which can be used out of the box for interactive end-user applications.

This release comes with open-source models integrated into transformers, a demo built on SmolVLM Instruct, and a supervised fine-tuning script. We have used the datasets previously used for Idefics3: the Cauldron and Docmatix, which are also fully open-source.

Table of Contents
TLDR
What is SmolVLM?
Model capabilities
Architecture
Performance
Benchmarks
Memory
Throughput
Video
VLMEvalKit integration
Use SmolVLM with Transformers
Training Details
Dataset
Context extension
Checkpoint selection
Fine-Tuning
Wrapping up
Model capabilities
Input Image Input Prompt Answer
I'm planning a visit to this temple, give me travel tips. The Grand Palace is the most famous landmark in Bangkok and it's worth visiting for its stunning architecture and history. Here are some of our favorite things about the palace:
The Main Building: Located at the center of the complex, the main building features intricate tile work with traditional Thai motifs that cover every inch of the structure. This includes patterns such as dragons, mythical creatures, flowers, and geometric shapes. The roofline has an elaborate design featuring multiple layers of tiles adorned with gold accents.

Towers & Spires: There are several towers within the Grand Palace including the Phra Nakhon Si Ayutthaya National Museum Tower which houses exhibits on Thai culture and heritage. Another notable tower is the Wat Yai Chai Mongkol Tower where you can find beautiful murals depicting scenes from Buddhist mythology.

Architecture Style: The architectural style combines elements of Hindu and Khmer styles seen commonly throughout Southeast Asia. You'll notice influences like stupas (dome-shaped structures), lotus flower carvings, and ornate details typical of these regions' religious buildings.

Visitor Experience: While exploring the Grand Palace, visitors should be aware of certain rules due to security measures taken by authorities. These include no photography inside certain areas or using flashbulbs outside. Additionally, there may be queues during peak tourist seasons so plan accordingly if you're traveling here.

Travel Tips For Visiting The Grand Palace:
Best Time To Visit: The best time to visit would likely be early morning before sunrise when temperatures drop significantly making walking around easier.
Transportation: Public transportation options exist but they might not provide direct access to the Grand Palace itself; consider taking a taxi instead.
Food Options: There aren't many dining facilities available near the Grand Palace yet local street food vendors do offer delicious meals nearby.
Clothing Guidelines: While not mandatory, wearing traditional Thai clothing (sarongs) could add cultural immersion experience. However, shorts/trousers are allowed provided they don't go below knee level.
Safety Measures: Visitors must follow signs indicating restricted zones marked "No Photography" among others. Always carry identification documents just in case.
By following these guidelines, your trip will undoubtedly enhance both your understanding of Thailand’s rich historical tapestry and personal enjoyment!

Where do the severe droughts happen according to this diagram? The severe droughts are happening in eastern and southern Africa.
What is the due date and the invoice date? The due date is 26/02/2019, and the invoice date is 11/02/2019.
Architecture
For SmolVLM, we closely followed the architecture from Idefics3, to the point that we use the same implementation in transformers. There are, however a few key differences:

We replaced Llama 3.1 8B with SmolLM2 1.7B as the language backbone.
We more aggressively compress the patched visual information by reducing the information 9x using the pixel shuffle strategy, compared to 4x with idefics3.
We use patches of 384*384, instead of 364x364, because 384 is divisible by 3, which is necessary for our pixel shuffle strategy to work.
For this, we change the vision backbone to use shape-optimized SigLIP with patches of 384x384 pixels and inner patches of 14x14.
Performance
Benchmarks
We present benchmarks for the tasks we mention in training details.

Model MMMU (val) MathVista (testmini) MMStar (val) DocVQA (test) TextVQA (val) Min GPU RAM required (GB)
SmolVLM 38.8 44.6 42.1 81.6 72.7 5.02
Qwen2-VL 2B 41.1 47.8 47.5 90.1 79.7 13.70
InternVL2 2B 34.3 46.3 49.8 86.9 73.4 10.52
PaliGemma 3B 448px 34.9 28.7 48.3 32.2 56.0 6.72
moondream2 32.4 24.3 40.3 70.5 65.2 3.87
MiniCPM-V-2 38.2 39.8 39.1 71.9 74.1 7.88
MM1.5 1B 35.8 37.2 0.0 81.0 72.5 NaN
Memory
Inference GPU memory use for SmolVLM and other models
SmolVLM provides the best memory usage among the existing suite of vision language models in transformers. This allows it to run efficiently on-device, such as a laptop! You can see above the GPU memory usage in GBs for each model, running inference with one or two input images, and using the same images and text prompts in all tests. SmolVLM's efficiency in image encoding is built into the model. SmolVLM encodes each 384x384 image patch to 81 tokens. This results in SmolVLM encoding our test prompt and a single image in 1.2k tokens, whereas Qwen2-VL uses 16k tokens. This also explains why the memory consumption increases so much for 2 images with Qwen and InternVL. In contrast, the increase is much more moderate for SmolVLM and PaliGemma, which use a similar approach.

Throughput
SmolVLM's tiny memory footprint also implies that it requires far fewer computations to prefill the model and generate. Compared to Qwen2-VL, the prefill throughput is 3.3 to 4.5 times faster, and the generation throughput is 7.5 to 16 times faster.

Video
Given SmolVLM's long context and the possibility of tweaking the internal frame resizing of the model, we explored its suitability as an accessible option for basic video analysis tasks, particularly when computational resources are limited.

In our evaluation of SmolVLM's video understanding capabilities, we implemented a straightforward video processing pipeline code, extracting up to 50 evenly sampled frames from each video while avoiding internal frame resizing. This simple approach yielded surprisingly competitive results on the CinePile benchmark, with a score of 27.14%, a performance that positions the model between InterVL2 (2B) and Video LlaVa (7B).

The quantitative results align with our qualitative testing, looking at an example from the FineVideo dataset:


Question: what is shown in the close-up?
Assistant: In the close-up, there are bottles of Heinz tomato ketchup,
honey, soy sauce, and garlic on a kitchen counter.

Question: describe what is happening in the video.
Assistant: The video shows a man cooking in a kitchen.
He is wearing a black shirt and a black apron.
He is standing in front of a counter and is holding a bottle of Heinz tomato ketchup,
a bottle of honey, soy sauce, and a bowl of minced garlic

While in the second question, we see some temporal understanding limitations (the cook points to one ingredient after the other rather than pointing/holding all of them at the same time) SmolVLM demonstrated great scene understanding and object recognition capabilities.

VLMEvalKit integration
We integrated SmolVLM with VLMEvalKit to facilitate easy evaluation across additional benchmarks.

By running the following command, you can evaluate SmolVLM or your fine-tuned SmolVLM model.

python run.py --data <benchmarks> --model SmolVLM --work-dir <output_directory>

For example, to evaluate on MMMU dev validation set and MathVista mini and store the results in a folder called smol.

python run.py --data MMMU_DEV_VAL MathVista_MINI --model SmolVLM --work-dir smol

Use SmolVLM with Transformers
You can easily load SmolVLM using the Auto classes in transformers. Under the hood, the model and processor are mapped to the same implementations used for Idefics3.

from transformers import AutoProcessor, AutoModelForVision2Seq
import torch
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

Image and text can be interleaved arbitrarily, and you can pass in multiple images. Here’s how you can use the chat template and pass in the formatted input to the processor.

from PIL import Image
from transformers.image_utils import load_image


# Load images
image1 = load_image("https://huggingface.co/spaces/HuggingFaceTB/SmolVLM/resolve/main/example_images/rococo.jpg")
image2 = load_image("https://huggingface.co/spaces/HuggingFaceTB/SmolVLM/blob/main/example_images/rococo_1.jpg")

# Create input messages
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "image"},
{"type": "text", "text": "Can you describe the two images?"}
]
},
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = inputs.to(DEVICE)

Start generating with preprocessed input and decode the generated output.

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
generated_ids,
skip_special_tokens=True,
)

print(generated_texts[0])

Training Details
Dataset
First, we had to train SmolLM2 to extend it context, but we will discuss that in the next subsection. Once we had a long context SmolLM2, we trained SmolVLM using the same data that we used for Idefics3. Mainly, we used The Cauldron and Docmatix. The full list of datasets we used can be consulted here.

Context extension
SmolLM2’s pre-training context window is insufficient for VLMs. Images are encoded into many tokens, and we wanted to support multiple images. To address this, we extended it to 16k tokens by increasing the RoPE base value from 10k to 273k, following the guidelines in “Scaling Laws of RoPE-based Extrapolation”. We fine-tuned the model on a mixture of long- and short-context datasets. For long-context datasets, we used the “books” subset of Dolma (primarily Project Gutenberg) and code documents with 8k+ tokens from The Stack, each contributing 20% to the final mixture. For short-context datasets, we streamlined the original SmolLM2 pre-training mix to include 20% FineWeb-Edu, 20% DCLM, and 20% from our math dataset (to be released soon). The math dataset was upsampled to mitigate a performance drop observed on GSM8k during the context extension process. All experiments were implemented using the EasyContext repository.

Checkpoint Selection
For our training run, we saved checkpoints every 25 optimization steps, allowing us to evaluate and potentially recover the model's state at different points in training. This practice is crucial for identifying the optimal model version, as training longer doesn't always guarantee better performance. We evaluated the performance across multiple vision-language benchmarks, each weighted according to their importance. The core benchmarks included the following:

General multimodal understanding (MMMU and MMStar) which are the most comprehensive benchmark.
Document and text-based visual question answering (DocVQA and TextVQA)
Mathematical Reasoning (MathVista)
Diagram understanding (AI2D)
To select the optimal checkpoint, we created a single metric by combining these benchmarks with different manually assigned weights to reflect their relative importance in assessing the model's capabilities. We used this single metric to select the best checkpoint. Generally, the models tended to do great on most benchmarks with more training, but their relative performance on DocVQA would decrease considerably.

Fine-tuning
You can fine-tune SmolVLM using transformers and apply alignment techniques using TRL 🚀

We provide a notebook to fine-tune it on the VQAv2 dataset, optionally using LoRA, QLoRA or full fine-tuning. In the notebook, you can find some tricks to save up even more memory and have a larger batch size to fit SmolVLM inside consumer GPUs, like L4, for training. With batch sizes of 4, 8-bit loading with QLoRA and gradient checkpointing we can fine-tune in L4, and it consumes around ~16 GBs of VRAM. This makes it possible to fine-tune your SmolVLM using Colab! You can play around with the parameters to get a nice point in training duration-memory trade-off.

SmolVLM also comes with TRL integration so you can apply Direct Preference Optimization (DPO) easily through the CLI. Get started by running pip install trl accelerate peft and then run the following command to fine-tune on RLAIF-V dataset:

accelerate launch \
--config_file examples/accelerate_configs/multi_gpu.yaml examples/scripts/dpo_vlm.py \
--dataset_name HuggingFaceH4/rlaif-v_formatted \
--model_name_or_path HuggingFaceTB/SmolVLM-Instruct \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 32 \
--dataset_num_proc 32 \
--output_dir dpo_smolvlm_rlaif-v \
--bf16 --torch_dtype bfloat16 \
--use_peft --lora_target_modules=all-linear

The resulting LoRA adapter weights are SmolVLM-Instruct-DPO. A detailed tutorial on preference tuning vision-based LLM can be found here: dpo_vlm.

Wrapping Up
We introduced SmolVLM, a fully open, small, and mighty VLM for the community! We also provide tools for the community to use and customize it. We are looking forward to seeing what you will create with SmolVLM.

Below are some resources if you would like to read more about all things related to SmolVLM.

Start playing with SmolVLM using this demo.
Learn how to fine-tune SmolVLM on VQAv2 using this notebook
Learn more about vision language models
SmolVLM
SmolVLM is a compact open multimodal model that accepts arbitrary sequences of image and text inputs to produce text outputs. Designed for efficiency, SmolVLM can answer questions about images, describe visual content, create stories grounded on multiple images, or function as a pure language model without visual inputs. Its lightweight architecture makes it suitable for on-device applications while maintaining strong performance on multimodal tasks.

Model Summary
Developed by: Hugging Face 🤗
Model type: Multi-modal model (image+text)
Language(s) (NLP): English
License: Apache 2.0
Architecture: Based on Idefics3 (see technical summary)
Resources
Demo: SmolVLM Demo
Blog: Blog post
Uses
SmolVLM can be used for inference on multimodal (image + text) tasks where the input comprises text queries along with one or more images. Text and images can be interleaved arbitrarily, enabling tasks like image captioning, visual question answering, and storytelling based on visual content. The model does not support image generation.

To fine-tune SmolVLM on a specific task, you can follow the fine-tuning tutorial.

Technical Summary
SmolVLM leverages the lightweight SmolLM2 language model to provide a compact yet powerful multimodal experience. It introduces several changes compared to previous Idefics models:

Image compression: We introduce a more radical image compression compared to Idefics3 to enable the model to infer faster and use less RAM.
Visual Token Encoding: SmolVLM uses 81 visual tokens to encode image patches of size 384×384. Larger images are divided into patches, each encoded separately, enhancing efficiency without compromising performance.
More details about the training and architecture are available in our technical report.

###
https://blackforestlabs.ai/flux-1-tools/
Blackforest Labs
Flux1 Tools: Hugging Face Transformers
11/21/2024

🔥📣 Introducing FLUX.1 Tools from Black Forest Labs
👇Learn more and how you can explore them for free on Hugging Face Spaces
FLUX.1 Tools -- Key highlights:
> A suite of models designed to add control and steerability to the base text-to-image model FLUX.1 released a few months ago
> These tools enables the modification / re-creation of real and generated images.
> This release consists of 4 open-access models within the FLUX.1 [dev] model series. Each with distinct feature set or capabilities:
1️⃣ FLUX.1 Fill: SOTA inpainting and outpainting models
2️⃣ FLUX.1 Depth: Models trained to enable structural guidance based on a depth map extracted from an input image and a text prompt.
3️⃣ FLUX.1 Canny: Models trained to enable structural guidance based on canny edges extracted from an input image and a text prompt.
4️⃣ FLUX.1 Redux: An adapter that allows mixing and recreating input images and text prompts.

Introducing FLUX.1 Tools
Nov 21, 2024


by

BlackForestLabs
in News.
Today, we are excited to release FLUX.1 Tools, a suite of models designed to add control and steerability to our base text-to-image model FLUX.1, enabling the modification and re-creation of real and generated images. At release, FLUX.1 Tools consists of four distinct features that will be available as open-access models within the FLUX.1 [dev] model series, and in the BFL API supplementing FLUX.1 [pro]:

FLUX.1 Fill: State-of-the-art inpainting and outpainting models, enabling editing and expansion of real and generated images given a text description and a binary mask.
FLUX.1 Depth: Models trained to enable structural guidance based on a depth map extracted from an input image and a text prompt.
FLUX.1 Canny: Models trained to enable structural guidance based on canny edges extracted from an input image and a text prompt.
FLUX.1 Redux: An adapter that allows mixing and recreating input images and text prompts.
This release reinforces our dual commitment: delivering cutting-edge open-weight models for the research community while offering best-in-class capabilities through our API. We release each tool in the BFL API as FLUX.1 [pro] variants and with inference code and weights available as guidance-distilled, open-access FLUX.1 [dev] variants. Additionally, we are excited that our released models will be available via our partners fal.ai, Replicate, Together.ai, Freepik and krea.ai.

The following sections contain details on the new models, analyses on their performance and how they can be accessed. We are excited to see how the vibrant Flux ecosystem will be supplemented by our new tools.

Inpainting and Outpainting with FLUX.1 Fill
FLUX.1 Fill introduces advanced inpainting capabilities that surpass existing tools like Ideogram 2.0 and popular open-source variants such as AlimamaCreative’s FLUX-Controlnet-Inpainting. It allows for seamless edits that integrate naturally with existing images.


Additionally, FLUX.1 Fill supports outpainting, enabling the user to extend images beyond their original borders.


We conduct a benchmark, publicly available here. The results show that Flux.1 Fill [pro] outperforms all other competing methods, making it the state-of-the-art inpainting model to date. Second is Flux.1 Fill [dev], outperforming proprietary solutions while being more efficient at inference.


Flux.1 Fill [dev] is available under the Flux Dev License, with

Full model weights available on Hugging Face: [Fill]
Inference code available on GitHub
Flux.1 Fill [pro] is available in the [BFL API]
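For a sense of how the open-weight Fill model is typically driven, here is a hedged sketch assuming the diffusers FluxFillPipeline integration and the black-forest-labs/FLUX.1-Fill-dev repository id; file names and parameter values are illustrative only:

import torch
from diffusers import FluxFillPipeline
from diffusers.utils import load_image

pipe = FluxFillPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Fill-dev", torch_dtype=torch.bfloat16
).to("cuda")

image = load_image("room.png")       # real or generated source image
mask = load_image("room_mask.png")   # white regions are repainted

result = pipe(
    prompt="a mid-century leather armchair",
    image=image,
    mask_image=mask,
    guidance_scale=30.0,
    num_inference_steps=50,
).images[0]
result.save("room_filled.png")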

Structural Conditioning with FLUX.1 Canny / Depth
Structural conditioning uses canny edge or depth detection to maintain precise control during image transformations. By preserving the original image’s structure through edge or depth maps, users can make text-guided edits while keeping the core composition intact. This is particularly effective for retexturing images.




In our evaluations, benchmark available here, FLUX.1 Depth outperforms proprietary models like Midjourney ReTexture. In particular, FLUX.1 Depth [pro] offers higher output diversity, while the Dev version of FLUX.1 Depth delivers more consistent results in depth-aware tasks. For canny edge models, benchmark here, FLUX.1 Canny [pro] is the best in class, followed by FLUX.1 Canny [dev].


FLUX.1 Canny / Depth are available in two versions: full models for maximum performance, and LoRA versions based on FLUX.1 [dev] for easier development.

Flux Depth / Canny [dev] are available under the Flux Dev License with

Full model weights available on Hugging Face: [Depth] [Canny]
LoRA weights available on Hugging Face: [Depth] [Canny]
Inference code available on GitHub
Flux.1 Depth / Canny [pro] are available in the BFL API.

Image Variation and Restyling with FLUX.1 Redux
FLUX.1 Redux is an adapter for all FLUX.1 base models for image variation generation. Given an input image, FLUX.1 Redux can reproduce the image with slight variation, allowing the user to refine a given image.


It naturally integrates into more complex workflows unlocking image restyling via prompt. Restyling is available through our API by providing an image plus a prompt. The feature is supported in our latest model FLUX1.1 [pro] Ultra, allowing for combining input images and text prompts to create high-quality 4-megapixel outputs with flexible aspect ratios.


Our benchmark demonstrates that FLUX.1 Redux achieves state-of-the-art performance in image variation.


Flux.1 Redux [dev] is available under the Flux Dev License with

Model weights available on Hugging Face: [Redux]
Inference code available on GitHub
Flux.1 Redux supporting FLUX1.1 [pro] Ultra is available in the BFL API.

We’re excited to see what the community is going to build with our new set of tools. Try our API at [api.bfl.ml].

###
https://developer.apple.com/machine-learning/models/
Apple
11/22/2024
Pretty cool! Apple released blazingly fast CoreML models AND an iOS app to run them on iPhone! ⚡
> S0 matches OpenAI's ViT-B/16 in zero-shot performance but is 4.8x faster and 2.8x smaller
> S2 outperforms SigLIP's ViT-B/16 in zero-shot accuracy, being 2.3x faster, 2.1x smaller, and trained with 3x fewer samples
> MobileCLIP-B(LT) achieves 77.2% zero-shot ImageNet accuracy, surpassing DFN, SigLIP, and even OpenAI's ViT-L/14@336
> Models on HF and code on GH 🤗

Core ML Models
Build intelligence into your apps using machine learning models from the research community designed for Core ML.

Models are in Core ML format and can be integrated into Xcode projects. You can select different versions of models to optimize for sizes and architectures.


FastViT
Image Classification

A Fast Hybrid Vision Transformer architecture trained to classify the dominant object in a camera frame or image.



Model Info
Summary
FastViT is a general-purpose, hybrid vision transformer model, trained on the ImageNet dataset, that provides a state-of-the-art accuracy/latency trade-off.

The model's high performance, low latency, and robustness against out-of-distribution samples result from three novel architectural strategies:

Structural reparameterization
Linear training-time overparameterization
Use of large kernel convolutions
FastViT consistently outperforms competing robust architectures on mobile and desktop GPU platforms across a wide range of computer vision tasks such as image classification, object detection, semantic segmentation, and 3D mesh regression.

Use Cases
Image classification, object detection, semantic segmentation, 3D mesh regression

Links
Source code in GitHub
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization
Variants
| Model Name | Size |
| --- | --- |
| FastViTMA36F16.mlpackage | 88.3 MB |
| FastViTT8F16.mlpackage | 8.2 MB |
| FastViTMA36F16Headless.mlpackage | 85.8 MB |
| FastViTT8F16Headless.mlpackage | 6.5 MB |
Model Details
| Variant | Parameters | Size (MB) | Weight Precision | Activation Precision |
| --- | --- | --- | --- | --- |
| T8 | 3.6M | 7.8 | Float16 | Float16 |
| MA36 | 42.7M | 84 | Float16 | Float16 |
Inference Time
| Variant | Device | OS | Inference Time (ms) | Dominant Compute Unit |
| --- | --- | --- | --- | --- |
| T8 | iPhone 12 Pro Max | 17.5 | 0.79 | Neural Engine |
| T8 | M3 Max | 14.4 | 0.62 | Neural Engine |
| MA36 | iPhone 12 Pro Max | 18.0 | 4.50 | Neural Engine |
| MA36 | M3 Max | 15.0 | 2.99 | Neural Engine |
Example Projects
Classifying Images with Vision and Core ML
Preprocess photos using the Vision framework and classify them with a Core ML model.
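For quick experimentation outside Xcode, a downloaded package can also be inspected and exercised from Python with coremltools; this is a hedged sketch (the input feature name and resolution are assumptions to check against the model's spec, and prediction requires a Mac with the Core ML runtime):

import coremltools as ct
from PIL import Image

model = ct.models.MLModel("FastViTT8F16.mlpackage")
print(model.get_spec().description)  # shows the actual input/output names

img = Image.open("photo.jpg").resize((256, 256))  # assumed input resolution
prediction = model.predict({"image": img})        # "image" is an assumed input name
print(prediction)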


Depth Anything V2
Depth Estimation

The Depth Anything model performs monocular depth estimation.



Model Info
Summary
Depth Anything v2 is a foundation model for monocular depth estimation. It maintains the strengths and rectifies the weaknesses of the original Depth Anything by refining the powerful data curation engine and teacher-student pipeline.

To train a teacher model, Depth Anything v2 uses purely synthetic, computer-generated images. This avoids problems created by using real images, which can limit monocular depth-estimation model performance due to noisy annotations and low resolution. The teacher model predicts depth information on unlabeled real images, and then uses only that new, pseudo-labeled data to train a student model. This helps avoid distribution shift between synthetic and real images.

On the depth estimation task, the Depth Anything v2 model optimizes and outperforms v1 especially in terms of robustness, inference speed, and image depth properties like fine-grained details, transparent objects, reflections, and complex scenes. Its refined data curation approach results in competitive performance on standard datasets (including KITTI, NYU-D, Sintel, ETH3D, and DIODE) and a more than 9% accuracy improvement over v1 and other community models on the new DA-2k evaluation set built for depth estimation.

Depth Anything v2 provides varied model scales and inference efficiency to support extensive applications and is generalizable for fine tuning to downstream tasks. It can be used in any application requiring depth estimation, such as 3D reconstruction, navigation, autonomous driving, and image or video generation.

Use Cases
Depth estimation, semantic segmentation

Links
Source code in GitHub
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
Depth Anything V2
Variants
| Model Name | Size |
| --- | --- |
| DepthAnythingV2SmallF16.mlpackage | 49.8 MB |
| DepthAnythingV2SmallF16P6.mlpackage | 19 MB |
Model Details
| Variant | Parameters | Size (MB) | Weight Precision | Activation Precision |
| --- | --- | --- | --- | --- |
| F32 | 24.8M | 99.2 | Float32 | Float32 |
| F16 | 24.8M | 49.8 | Float16 | Float16 |
Inference Time
| Variant | Device | OS | Inference Time (ms) | Dominant Compute Unit |
| --- | --- | --- | --- | --- |
| Small F16 | iPhone 12 Pro Max | 18.0 | 31.10 | Neural Engine |
| Small F16 | iPhone 15 Pro Max | 17.4 | 33.90 | Neural Engine |
| Small F16 | MacBook Pro (M1 Max) | 15.0 | 32.80 | Neural Engine |
| Small F16 | MacBook Pro (M3 Max) | 15.0 | 24.58 | Neural Engine |

DETR Resnet50 Semantic Segmentation
Semantic Segmentation

The DEtection TRansformer (DETR) model, trained for object detection and panoptic segmentation, configured to return semantic segmentation masks.



Model Info
Summary
The DETR model is an encoder/decoder transformer with a convolutional backbone trained on the COCO 2017 dataset. It blends a set of proven ML strategies to detect and classify objects in images more elegantly than standard object detectors can, while matching their performance.

The model is trained with a loss function that performs bipartite matching between predicted and ground-truth objects. At inference time, DETR applies self-attention to an image globally to predict all objects at once. Thanks to global attention, the model outperforms standard object detectors on large objects but underperforms on small objects. Despite this limitation, DETR demonstrates accuracy and run-time performance on par with other highly optimized architectures when evaluated on the challenging COCO dataset.

DETR can be easily reproduced in any framework that contains standard CNN and transformer classes. It can also be easily generalized to accommodate more complex tasks, such as panoptic segmentation and other tasks requiring a simple segmentation head trained on top of a pre-trained DETR.

DETR avoids clunky surrogate tasks and hand-designed components that traditional architectures require to achieve acceptable performance and instead provides a conceptually simple, easily reproducible approach that streamlines the object detection pipeline.

Use Cases
Object detection, panoptic segmentation

Links
Source code in GitHub
End-to-End Object Detection with Transformers
Variants
| Model Name | Size |
| --- | --- |
| DETRResnet50SemanticSegmentationF16.mlpackage | 85.5 MB |
| DETRResnet50SemanticSegmentationF16P8.mlpackage | 43.1 MB |
Model Details
| Variant | Parameters | Size (MB) | Weight Precision | Activation Precision |
| --- | --- | --- | --- | --- |
| F32 | 43M | 171 | Float32 | Float32 |
| F16 | 43M | 86 | Float16 | Float16 |
Inference Time
| Variant | Device | OS | Inference Time (ms) | Dominant Compute Unit |
| --- | --- | --- | --- | --- |
| F16 | iPhone 15 Pro Max | 17.5 | 40 | Neural Engine |
| F16 | MacBook Pro (M1 Max) | 14.5 | 43 | Neural Engine |
| F16 | iPhone 12 Pro Max | 18.0 | 52 | Neural Engine |
| F16 | MacBook Pro (M3 Max) | 15.0 | 29 | Neural Engine |


###
https://alexgarcia.xyz/blog/2024/sqlite-vec-metadata-release/index.html
sqlite-vec now supports metadata columns and filtering
2024-11-20 by Alex Garcia
SQLite is all you need! Big sqlite-vec update! 🚀 sqlite-vec is a plugin to support Vector Search in SQLite or LibSQL databases. v0.1.6 now allows storing non-vector data in vec0 virtual tables, enabling metadata conditioning and filtering! 🤯
TL;DR:
💡 Store metadata like user_id or created_at fields directly within vec0 virtual tables.
🔍 Metadata columns can be used in WHERE clauses of KNN queries for filtering results based on non-vector data.
🛠️ Introducing Partition Keys to shard the vector index and speed up queries
📚 Auxiliary columns, prefixed with a '+', allow storage of unindexed, SELECT-only metadata without requiring separate joins.
🔜 Improve Quantization support with float16, float8, "smarter" binary quantization


tl;dr — sqlite-vec, a SQLite extension for vector search, now supports metadata columns, auxiliary columns, and partitioning in vec0 virtual tables! You can use these to store metadata like user_id or created_at fields, add additional WHERE clauses in KNN queries, and make certain selective queries much faster. Try it out!

As of the latest v0.1.6 release of sqlite-vec, you can now store non-vector data in vec0 virtual tables! For example:

create virtual table vec_articles using vec0(
  article_id integer primary key,

  -- Vector text embedding of the `headline` column, with 384 dimensions
  headline_embedding float[384],

  -- Partition key, internally shards the vector index on article published year
  year integer partition key,

  -- Metadata columns, can appear in `WHERE` clause of KNN queries
  news_desk text,
  word_count integer,
  pub_date text,

  -- Auxiliary columns, unindexed but fast lookups
  +headline text,
  +url text
);
Here we are storing a New York Times article headlines dataset from the past 30 years, where we embed the headlines with mixedbread-ai/mxbai-embed-xsmall-v1.

If we wanted to see the headlines most closely related to 'pandemic' among articles published in 2020 by the 'Sports' or 'Business' news desk with more than 500 but fewer than 1000 words, we can perform a KNN query like so:

select
article_id,
headline,
news_desk,
word_count,
url,
pub_date,
distance
from vec_articles
where headline_embedding match lembed('pandemic')
and k = 8
and year = 2020
and news_desk in ('Sports', 'Business')
and word_count between 500 and 1000;
┌────────────┬──────────────────────────────────────────────────────────────────────┬───────────┬────────────┬─────────────────────────────┬──────────────────────────┬───────────┐
│ article_id │ headline │ news_desk │ word_count │ url │ pub_date │ distance │
├────────────┼──────────────────────────────────────────────────────────────────────┼───────────┼────────────┼─────────────────────────────┼──────────────────────────┼───────────┤
│ 2911716 │ The Pandemic’s Economic Damage Is Growing │ Business │ 910 │ https://www.nytimes.com/... │ 2020-07-07T18:12:40+0000 │ 0.8928120 │
│ 2892929 │ As Coronavirus Spreads, Olympics Face Ticking Clock and a Tough Call │ Sports │ 987 │ https://www.nytimes.com/... │ 2020-03-06T01:34:36+0000 │ 0.9608180 │
│ 2932041 │ The Pandemic Is Already Affecting Next Year’s Sports Schedule │ Sports │ 620 │ https://www.nytimes.com/... │ 2020-11-11T13:56:25+0000 │ 0.9802038 │
│ 2915381 │ The Week in Business: Getting Rich Off the Pandemic │ Business │ 814 │ https://www.nytimes.com/... │ 2020-08-02T11:00:03+0000 │ 1.0064692 │
│ 2896043 │ The Coronavirus and the Postponement of the Olympics, Explained │ Sports │ 798 │ https://www.nytimes.com/... │ 2020-03-25T17:45:58+0000 │ 1.0115833 │
│ 2898566 │ Robots Welcome to Take Over, as Pandemic Accelerates Automation │ Business │ 871 │ https://www.nytimes.com/... │ 2020-04-10T09:00:27+0000 │ 1.019637 │
│ 2898239 │ The Pandemic Feeds Tech Companies’ Power │ Business │ 784 │ https://www.nytimes.com/... │ 2020-04-08T16:43:13+0000 │ 1.0200014 │
│ 2929224 │ In M.L.S., the Pandemic Changes the Playoff Math │ Sports │ 859 │ https://www.nytimes.com/... │ 2020-10-29T17:09:10+0000 │ 1.0238885 │
└────────────┴──────────────────────────────────────────────────────────────────────┴───────────┴────────────┴─────────────────────────────┴──────────────────────────┴───────────┘
Here we used sqlite-lembed to embed our query, but any other embeddings provider could be used!

We can reference those metadata columns and partition key columns in the WHERE clause of the KNN query, and get exactly the results we want!

Now, what's the difference between metadata, partition key, and auxiliary columns?

¶ Metadata columns for WHERE clause filtering
Metadata columns are declared with normal column declarations in the vec0 constructor. Metadata columns are stored and indexed alongside vectors, and can appear in the WHERE clause of KNN queries.

create virtual table vec_articles using vec0(
article_id integer primary key,
headline_embedding float[384],
news_desk text,
word_count integer,
pub_date text
);

select
article_id,
headline,
news_desk,
word_count,
pub_date,
distance
from vec_articles
where headline_embedding match lembed('new york city housing')
and k = 20
and news_desk = 'Metro'
and word_count < 1000
and pub_date between '2004-01-20' and '2009-01-20';
┌────────────┬──────────────────────────────────────────────────────────────────────┬───────────┬────────────┬──────────────────────────┬────────────────────┐
│ article_id │ headline │ news_desk │ word_count │ pub_date │ distance │
├────────────┼──────────────────────────────────────────────────────────────────────┼───────────┼────────────┼──────────────────────────┼────────────────────┤
│ 1717598 │ Manhattan: City to Expand Housing Program │ Metro │ 83 │ 2007-02-28T05:00:00+0000 │ 0.7736235857009888 │
│ 1607183 │ Manhattan: More Money for Housing │ Metro │ 96 │ 2006-06-16T04:00:00+0000 │ 0.7818768620491028 │
│ ... │
│ 1772158 │ Ask About New York Architecture, On Screen and Off │ Metro │ 241 │ 2007-09-17T18:25:57+0000 │ 0.930429220199585 │
│ 1673007 │ Manhattan: City Balances Budget for 26th Year │ Metro │ 87 │ 2006-11-01T05:00:00+0000 │ 0.9327330589294434 │
│ 1616702 │ Little Shift in Prices of Manhattan Apartments │ Metro │ 615 │ 2006-07-06T04:00:00+0000 │ 0.9354249238967896 │
└────────────┴──────────────────────────────────────────────────────────────────────┴───────────┴────────────┴──────────────────────────┴────────────────────┘
There we retrieved the 20 article headlines most closely related to 'new york city housing', published by the 'Metro' news desk, with fewer than 1000 words, during the George W. Bush administration.

Metadata columns can be boolean, integer, floats, or text values. More types like BLOBs, dates, and UUID/ULIDs are coming soon!

Only a subset of operators are supported during metadata filtering, including:

Equality constraints, ie = and !=
Comparison constraints, ie >, >=, <, <=
column in (...) constraints, only on INTEGER and TEXT columns on SQLite 3.38 or above
Notably absent: REGEXP, LIKE, GLOB, and other custom scalar functions. NULL values are also not supported yet.

¶ Partition keys for faster WHERE clause filtering
Now, the above query was actually a bit slow! There are 3 million rows in the table, and metadata filters need to visit every single row to do a comparison. Metadata comparisons are quite fast and built for filtering, but they have their limits.

But notice how we only wanted a small subset of values – between '2004-01-20' and '2009-01-20' is only 5 years out of 30 years of data. We can tell the vec0 virtual table to internally shard the vector index on a given key, using partition keys!

create virtual table vec_articles using vec0(
article_id integer primary key,
headline_embedding float[384],

-- shard the vector index based on published year
year integer partition key,

news_desk text,
word_count integer,
pub_date text
);

select
article_id,
headline,
news_desk,
word_count,
pub_date,
distance
from vec_articles
where headline_embedding match lembed('new york city housing')
and k = 20
-- narrow search to these years only
and year between 2004 and 2009
and news_desk = 'Metro'
and word_count < 1000
-- finer filtering for exact dates we care about
and pub_date between '2004-01-20' and '2009-01-20';
This KNN query returns the exact same results as the one above, but is 3x faster! This is because internally, vectors are stored based on the year value of each row. In that KNN query, sqlite-vec will recognize constraints on partition keys and quickly pre-filter rows before any vectors are compared.

But beware! It's easy to accidentally over-shard a vector index on the wrong values and cause performance issues. Partition keys are great for date-based values like year or month, particularly when each unique partition key value has hundreds or thousands of vectors. They are also great for user IDs or document IDs, for "per-user" or "per-document" vector indexes.

Partition key columns can only be TEXT or INTEGER values; file an issue if you want to see other types supported. Currently column in (...) constraints are not supported for partition key columns, but they will be soon!

¶ Auxiliary columns
Some columns never need to be indexed! You can always store additional SELECT-only metadata in separate tables and do a JOIN yourself, or you can use auxiliary columns:

create virtual table vec_articles using vec0(
article_id integer primary key,
headline_embedding float[384],
+headline text,
+url text
);

select
article_id,
headline,
url,
distance
from vec_articles
where headline_embedding match lembed('dodgers game')
and k = 20;
┌────────────┬─────────────────────────────────────────────────────────────────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────────────┐
│ article_id │ headline │ url │ distance │
├────────────┼─────────────────────────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────┤
│ 1896278 │ Attention Dodgers Fans: There’s a Game Tonight │ https://bats.blogs.nytimes.com/2008/10/15/attention-dodgers-fans-theres-a-game-tonight/ │ 0.6733786463737488 │
│ 2556896 │ Dodgers, in Flurry of Activity, Move to Revamp Their Infield │ https://www.nytimes.com/2014/12/11/sports/baseball/mlb-jimmy-rollins.html │ 0.7796685099601746 │
│ 2382487 │ Keeping Up With the Dodgers │ https://www.nytimes.com/2012/12/15/sports/angels-keeping-up-with-the-dodgers-leading-off.html │ 0.7849781513214111 │
│ 2585169 │ New Life for the Dodgers’ Old Digs │ https://www.nytimes.com/slideshow/2015/04/19/sports/baseball/20150419DODGERTOWN.html │ 0.7894293665885925 │
│ 1032111 │ Not Dodgers II, but It's Baseball; The Game Is Back in Brooklyn, on a Smaller Scale │ https://www.nytimes.com/2001/06/23/nyregion/not-dodgers-ii-but-it-s-baseball-the-game-is-back-in-brooklyn-on-a-smaller-scale.html │ 0.7978747487068176 │
└────────────┴─────────────────────────────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────┘
Auxiliary columns are denoted by a + prefix in the column definition, modeled after the same feature in the SQLite R*Tree extension. These columns are unindexed, stored in a separate internal table, and JOIN'ed at SELECT time. They cannot appear in the WHERE clause of a KNN query, as performance would worsen dramatically.

But it saves you from dealing with additional JOINs yourself! They are especially great for longer TEXT or BLOB values.
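For completeness, here is a minimal sketch (not from the post) of driving the same features from Python with the sqlite-vec bindings; the 4-dimensional placeholder embeddings stand in for real 384-dimensional ones:

import sqlite3
import struct

import sqlite_vec  # pip install sqlite-vec


def serialize_f32(vector):
    # Pack floats into the raw float32 blob format vec0 expects.
    return struct.pack(f"{len(vector)}f", *vector)


db = sqlite3.connect("articles.db")
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)

db.execute("""
create virtual table if not exists vec_articles using vec0(
  article_id integer primary key,
  headline_embedding float[4],   -- 384 in the real dataset
  year integer partition key,    -- shards the vector index
  news_desk text,                -- metadata column
  word_count integer,            -- metadata column
  +headline text,                -- auxiliary column
  +url text                      -- auxiliary column
)
""")

db.execute(
    "insert into vec_articles(article_id, headline_embedding, year, news_desk, word_count, headline, url) "
    "values (?, ?, ?, ?, ?, ?, ?)",
    (1, serialize_f32([0.1, 0.2, 0.3, 0.4]), 2020, "Sports", 800,
     "Example headline", "https://example.com"),
)

rows = db.execute(
    """
    select article_id, headline, news_desk, word_count, distance
    from vec_articles
    where headline_embedding match ?
      and k = 5
      and year = 2020
      and news_desk in ('Sports', 'Business')
      and word_count between 500 and 1000
    """,
    (serialize_f32([0.1, 0.2, 0.3, 0.4]),),
).fetchall()
print(rows)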

¶ Roadmap and the future of sqlite-vec
Metadata column support is the biggest update to sqlite-vec since the initial v0.1.0 launch 3 months ago, but I have a lot planned for the project!

First off: ANN indexes. The vec0 virtual table is brute-force only, which really slows down KNN queries on larger datasets. There are strategies like binary quantization or Matryoshka embeddings that can help, but sqlite-vec won't be fast until ANN indexes are supported.

I delayed working on ANN indexes until metadata columns were supported, because it's much easier to build an ANN index with metadata filtering from day 1 than it is to retroactively try to support it. I think this was the right call — metadata columns are hard! Follow issue #25 for future updates on this!

Next: Quantizers. Currently sqlite-vec only supports simple binary quantization and scalar quantization with int8 vectors. But I want to support float16, float8, "smarter" binary quantization (ie custom thresholds instead of just > 0), and other techniques that have come about in the last few months. This will also help support ANN indexes, as many of them rely on vector compression for fast queries.

There's also a ton of performance work that sqlite-vec needs, especially with these new metadata column features. This initial release was more of a "make it work" and not "make it fast", so expect much faster metadata filtering in upcoming releases!

Sister projects sqlite-lembed and sqlite-rembed also need a ton of love; they both have some older PRs that need merging. Expect releases of both of these projects very soon!

###
https://arxiv.org/abs/2411.15100
[Submitted on 22 Nov 2024]
XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models
Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, Tianqi Chen
The applications of LLM Agents are becoming increasingly complex and diverse, leading to a high demand for structured outputs that can be parsed into code, structured function calls, and embodied agent commands. These developments bring significant demands for structured generation in LLM inference. Context-free grammar is a flexible approach to enable structured generation via constrained decoding. However, executing context-free grammar requires going through several stack states over all tokens in the vocabulary during runtime, bringing non-negligible overhead for structured generation. In this paper, we propose XGrammar, a flexible and efficient structured generation engine for large language models. XGrammar accelerates context-free grammar execution by dividing the vocabulary into context-independent tokens that can be prechecked and context-dependent tokens that need to be interpreted during runtime. We further build transformations to expand the grammar context and reduce the number of context-dependent tokens. Additionally, we build an efficient persistent stack to accelerate the context-dependent token checks. Finally, we co-design the grammar engine with the LLM inference engine to overlap grammar computation with GPU executions. Evaluation results show that XGrammar can achieve up to 100x speedup over existing solutions. Combined with an LLM inference engine, it can achieve near-zero-overhead structured generation in end-to-end low-latency LLM serving.
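The split described in the abstract can be illustrated with a small conceptual sketch (this is not the XGrammar API, just the idea): mask entries for context-independent tokens are precomputed per grammar state, and only context-dependent tokens are checked against the runtime stack while decoding.

def build_adaptive_mask_cache(vocab, grammar_states, check_token):
    """Precompute validity for context-independent tokens; None marks tokens
    whose validity depends on the runtime stack."""
    cache = {}
    for state in grammar_states:
        cache[state] = [
            check_token(state, tok) if tok.isalnum() else None  # toy split rule
            for tok in vocab
        ]
    return cache


def token_mask_at_runtime(cache, state, vocab, check_token_with_stack):
    """Fill in only the context-dependent entries during decoding."""
    return [
        cached if cached is not None else check_token_with_stack(state, tok)
        for cached, tok in zip(cache[state], vocab)
    ]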


LFG!! XGrammar: a lightning fast, flexible, and portable engine for structured generation! 🔥
> Accurate JSON/grammar generation
> 3-10x speedup in latency
> 14x faster JSON-schema generation and up to 80x CFG-guided generation
> Now in MLC-LLM, SGLang, WebLLM; vLLM & HuggingFace coming soon!
GG MLC team is literally the best in the game and slept on! ⚡

###
https://blog.dottxt.co/say-what-you-mean.html
11/24/24

Do structured outputs hurt LLM performance? 🤔 The paper "Let Me Speak Freely" claimed that they do, but new experiments by .txt (the team behind outlines) show they don’t if you do it correctly! 👀
TL;DR;
📈 The "Let Me Speak Freely" poor results came from weak prompts and wrong use of structured prompting
📊 Structured outputs outperform unstructured on the test GSM8K: 0.78 vs 0.77, Last Letter: 0.77 vs 0.73, Shuffle Object: 0.44 vs 0.41
🛠️ The key success criterion is to align your prompt, parser, and generator - it's not just about using JSON mode
📌 JSON generation requires careful prompt design, including specifying the desired schema.
📝 Good prompts should contain enough information for a human to understand the task and expected response format
🎯 JSON generation reached 77% accuracy vs the paper's reported <10%
🔮 Examples in prompts should match the exact format expected in the actual tasks
🧰 Structured generation works best when implemented as "running our response parser as a generator"

Structured outputs can actually improve LLM performance when implemented correctly.

BY Will Kurt

Say What You Mean: A Response to 'Let Me Speak Freely'
A recent paper from the research team at Appier, Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models, made some very serious accusations about the quality of LLM evaluation results when performing structured generation. Their (Tam, et al.) ultimate conclusion was:

Our study reveals that structured generation constraints significantly impact LLM performance across various tasks.

The source for this claim was three sets of evaluations they ran that purportedly showed worse performance for structured generation (”JSON-Mode” in the chart) compared with unstructured (”Natural Language”/NL in the chart). This chart (derived and rescaled from the original charts) shows the concerning performance:


Figure 1: Original findings of Let Me Speak Freely

We here at .txt have always seen structured generation outperform unstructured generation in our past experiments. Our past experiments were on problems with clear, LLM-compatible structure, but so were the tasks that Tam, et al focused on (in fact, we had already done a similar experiment with GSM8K using different models). So these results from Tam, et al were as surprising as they were concerning.

After revisiting the above tasks with the same model (Llama-3-8B-instruct) we found that our results did not match those found in the paper, and reflected what we have previously seen. Diving deeper into the data and source code for the paper, we have determined there are several critical issues that led the authors to a fundamentally flawed conclusion.

More than a rebuttal, the goal of this article is to share some of the knowledge we have acquired by working daily with structured generation. We'll show where Tam, et al. made mistakes, but also provide insights into prompting for structured generation that should improve the response you get from LLMs even when you aren't using structured generation.

The Quick Rebuttal: Structured Generation improves performance.
For those needing a quick answer as to whether or not structured generation hurts performance: the answer is a clear no. In the figure below you can see the results of a quick implementation of JSON generation for all of the concerning eval runs in the article.


Figure 2: Results of re-implementing the concerning evals.

| Task | Unstructured | Structured |
| --- | --- | --- |
| GSM8K | 0.77 | 0.78 |
| Last Letter | 0.73 | 0.77 |
| Shuffle Object | 0.41 | 0.44 |
While our unstructured results are in line with the paper, our structured results directly contradict the findings of the paper by showing that structured generation is an improvement across the board. Notebooks to reproduce these results can be found on github.

Here are some of the key issues we found in the paper:

The paper itself finds that structured generation has superior performance on a number of classification tasks.
The prompts used for unstructured (NL) generation are markedly different from the ones used for structured generation, so the comparisons are not apples-to-apples to begin with.
The structured generation prompts do not provide the model with adequate information to solve the task, which leads to particularly poor performance for the ‘json-mode’ examples.
The real meat of the paper is actually about parsing the results of one LLM with a second LLM. The authors refer to this as the “Perfect Text Parser”; we will refer to it as the “AI parser” (for reasons we'll clarify soon).
The paper confuses structured generation with JSON-mode, although independent runs of these evals show that “JSON-mode” yields better results than unstructured generation.
An apt analogy would be to programming language benchmarking: it would be easy to write a paper showing that Rust performs worse than Python simply by writing terrible Rust code. Any sensible reader of such a paper would quickly realize the results reflected the skills of the author much more than the capability of the tool. But it is genuinely challenging to write optimally performant code, just like it is difficult to ensure your evaluations are truly representative of the task you are trying to understand and measure.

All that said, this paper does provide a great opportunity to dive a bit deeper into what structured generation is and how to get the best performance out of it.

Task Overview - Last Letter
Here we’ll be focusing on the task that Speak Freely claimed structured generation did the most poorly on: Last Letter.

In this task the model is presented with a list of 4 names such as:

Ian Peter Bernard Stephen
And then the model must concatenate the last letter of each. The answer for that example would be: NRDN

The evaluation consists of 150 questions in the “test” set and 350 in the "train" set. The paper uses only the 150 “test” questions, so we will as well (though all findings here also hold for the full set).

Why Do We Need Structured Generation? Parsing Results!
One of the more interesting parts of the approach in the paper, which honestly should have been the focus, is the so-called “Perfect Text Parser” used to extract the answer from the initial model response. Typically most evaluation frameworks use a simple, clearly defined regular expression when parsing responses; however, Tam, et al use claude-3-haiku-20240307 to parse the response from the generated output of the model. This means that two models are actually used for each answer. In the paper they refer to this as the “Perfect Text Parser”. It turns out the choice of the word "perfect" is a bit misleading, so we’ll be referring to this as the “AI Parser”.

It’s important to note this non-standard method for transforming the response because the primary reason we use structured generation when working with LLMs is to guarantee the format of the response for easy parsing. That is, parsing and structured generation go hand-in-hand. While there are quite a few issues with this paper, the use of the AI parser is quite interesting and worth exploring in the context of structured generation. We’ll take a look at AI parsing vs structured generation, which will help us gain a stronger understanding of just how powerful structured generation is.

Issue #1: The AI Parser
To better understand the impact the AI Parser makes, we’ll take a deep dive into one of their recorded examples. Thankfully, Tam, et al. did provide extensive data (12GB!) from their experiments. These experiments are all sorted by the model and the prompt template used. We’ll focus on the lasterletter-t3-f3 prompt template with meta-llama/Meta-Llama-3-8B-Instruct, specifically looking at the 1-shot example.

Let’s start by looking at what the prompt instructs the model to do in the best performing Natural Language (NL) format:

Follow the instruction to complete the task:
String manipulation task:
• Given: A sequence of words
• Required: A new string made from the last letter of each word
• Process: Think step by step to solve this challenge
Note: Ensure you've read the question thoroughly before beginning.


Instruct : Provide your output in the following text format:
Answer: <think step by step>. The final answer is <answer>
Notice that the format of the response is explicitly described: The final answer is <answer>. This means, if the model were to adhere to our prompt, we should be able to parse all answers with this simple regex:

answer_regex = r'answer is ([A-Za-z]{4})'
To see the impact of using the AI Parser we can iterate through the recorded results of the experiment (found in the file text_llama-3-8b-instruct_shots_1.jsonl). To be clear, we aren’t running any models right now, just seeing how different parsing methods impact the final score for an existing experiment.

We can immediately see there is a discrepancy between the strict regex parsing and the AI parsing:


Figure 3: Comparing a strict regex parser with the AI parser.

So what’s happening here?

It turns out that AI parser is doing a lot of the heavy lifting for the NL format. By going over the results we can see there were many cases where our strict regex failed to capture the (arguably) correct response from the model. Here are a few examples that the AI Parser was able to correctly recover that didn't match our regex:

The answer is e-S-S-E. → ESSE
The answer is AAA R. → AAAR
The answer is "reye". → REYE
The final answer is: YOOI → YOOI
Clearly our strict regex was indeed a bit too strict. All of these answers seem reasonable to me; there are a few cases in the full data set where one might disagree, but overall these seem like acceptable responses. In lieu of an extra call to a more powerful model, a reasonable solution is to simply extend our regular expression so that it covers these cases. We’ll add the following alternate regexes to cover the sample cases we’ve found:

alt_regex_1 = r'answer is ([A-Za-z]-[A-Za-z]-[A-Za-z]-[A-Za-z])'
alt_regex_2 = r'answer is:? "?([A-Za-z] ?[A-Za-z] ?[A-Za-z] ?[A-Za-z])"?'
alt_regex_3 = r'Concatenating them is "([A-Za-z]{4})"'
alt_regex_4 = r"answer is:? ([A-Za-z]'?[A-Za-z]'?[A-Za-z]'?[A-Za-z])"
It turns out that these cover the full set of cases we missed, no need for a separate model at all! If we use this combination of regexes we get the following results from parsing:


Figure 4: Comparing a strict regex parser, the AI parser and a more flexible regex parser.

| Parser | Accuracy |
| --- | --- |
| strict regex | 0.35 |
| AI parser | 0.57 |
| flexible regex | 0.61 |
Surprisingly, our hand-curated list of regexes (which didn’t take too much time to write) outperforms the AI parser for this data set! It’s worth mentioning that the main selling point of structured generation is to not have to worry about this parsing at all. However, it is a useful exercise that shows us the AI parser is, in fact, not the “perfect” text parser: running our hand-crafted flexible regex parser outperforms a call to Claude (and is much faster and cheaper!).
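For reference, scoring the recorded responses with the combined regexes takes only a few lines; in this sketch the JSONL field names ("response", "answer") are assumptions about the recorded data's layout rather than something documented in the post:

import json
import re

patterns = [answer_regex, alt_regex_1, alt_regex_2, alt_regex_3, alt_regex_4]


def parse_answer(response: str):
    for pattern in patterns:
        match = re.search(pattern, response)
        if match:
            # Normalize to the four letters: strip separators, upper-case.
            return re.sub(r"[^A-Za-z]", "", match.group(1)).upper()
    return None


correct = total = 0
with open("text_llama-3-8b-instruct_shots_1.jsonl") as f:
    for line in f:
        record = json.loads(line)                 # field names assumed
        predicted = parse_answer(record["response"])
        correct += int(predicted == record["answer"].upper())
        total += 1
print(f"flexible-regex accuracy: {correct / total:.2f}")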

Reproducing These Results - Unstructured
Now let’s go ahead and run this using outlines.generate.text with the exact same prompt to see what we get. We will make a small modification to the one-shot example used in the prompt. Tam, et al’s examples only use two names, but all the questions use four. In my experience, even without structured generation, it’s always important that your examples match the format you are looking for. So we’ve modified the prompt example to include all four names.

When running using outlines.generate.text we get the following results compared to the original results reported in the data:


Figure 5: Reproducing the original NL results.

As you can see our results, while slightly better, are more or less in line with what the recorded results show. We’re also not aiming for perfect replication, just making sure our results are on the same page.

After cleaning up the bad one-shot example, the gap between the parsers is also smaller. We’ve seen in the past that what models learn from example cases is largely the structure of the problem, so it’s not surprising that giving better examples of that structure improves adherence to it.

Now we can really test the impact structured generation has on performance.

Anything You Can Parse, You Can Generate
The reason we focus on the AI parser is that understanding how you parse the response from an LLM is the key to really understanding structured generation. It’s a common misunderstanding (one made by the paper) to think that structured generation is merely another name for JSON-mode (or YAML-mode, XML-mode, etc). A better mental model for structured generation is: running our response parser as a generator.

To make this clear, when running structured generation on the prompt used, we are simply going to add structure for the reasoning step, and then append our answer regex to that. This allows a unification of the prompt, the parser, and the generator. That’s the secret to why structured generation is so powerful.

Let’s define our structure and see how it does. Here’s a regex that represents the “chain-of-thought” reasoning structure in the answer (which is also found in the prompt itself):

cot_regex = r'Answer: T[\w \\",\\.]{30,250}. The '
This will allow the model to “think” for between 30 and 250 characters, then start its answer. To complete our structure we just append our existing answer_regex to make our generator:

struct_strict = outlines.generate.regex(
    model,
    cot_regex + answer_regex,
    sampler=greedy())
That’s it! We’re going to use our default strict regex because the entire point of structured generation is not to worry about parsing our output! There’s no need to use the more flexible regex since the model will only output what we want. Let’s see how it does:


Figure 6: Reproducing the original NL results with structured generation.

| Method | Accuracy |
| --- | --- |
| original | 0.57 |
| unstructured (strict) | 0.65 |
| unstructured (flexible) | 0.66 |
| structured | 0.68 |
Consistent with all of our past findings, structured generation outperforms unstructured generation.

What about JSON?
With some insight into how to properly perform structured generation, let’s try to figure out what went wrong with the JSON results. After all, if correct, the results in this chart are concerning:


Figure 7: Original chart showing poor JSON-mode performance on Last Letter.

In the chart above, it looks like structured generation (JSON-mode) is getting an awful < 10% accuracy compared to the unstructured (NL) result’s ~70% accuracy. The NL result is essentially in line with what we reproduced previously, but that 10% doesn’t match what we saw. Maybe it’s an issue with requiring the model to respond in JSON?

Issue #2: Bad Prompting
As mentioned, a major issue with Speak Freely is that the paper uses different prompts for structured generation and unstructured generation. In our example, to evaluate the performance of structured generation, we compared results with the same prompt. This is the only honest way to make any statements about structured vs unstructured performance.

Because of this the first place we should look for trouble is in the prompt used for the JSON-mode evaluations. So let’s take a look at the recorded data that matches closely to these results in the chart (found in lastletter-t2-structure/struct_llama-3-8b-instruct_shots_0.jsonl). Here is an example of the prompt used:

Follow the instruction to complete the task:
Read carefully for each of the last question and think step
by step before answering. You are given a string of words
and you need to take the last letter of each words and concate them


Instruct : You must use the tool



Question: Take the last letters of each words in
"Britt Tamara Elvis Nayeli" and concatenate them.
This prompt needs substantial improvement before it can be used to properly evaluate the task at hand! One practice I encourage when writing prompts is to always ask yourself: “does this prompt contain enough information that a reasonably well-informed human could answer the question correctly?” Reading this prompt, it is not obvious to me that:

The answer must be JSON (JSON isn’t mentioned at all)
Even if you did guess JSON for the response, what is the schema you should respond in?
While tool use is mentioned, the prompt doesn’t mention which tools to use! There is no way an LLM could infer what it’s supposed to be doing. Structured generation is incredible, but it can’t magically make a model understand what you want any more than throwing railroad tracks in your backyard will make your home a convenient train stop.

Proper Prompting for JSON
We discussed earlier that structured generation is not the same thing as JSON-mode, but sometimes we do want JSON. To understand where Speak Freely went wrong, let’s walk through the way to do structured generation correctly. We’ll start by using an instruct prompt such as the following:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert in solving simple word puzzles using reasoning steps. Your specific
task is going to be to take a list of 4 names and reason about the last letter of each .,
then you will concatenate those letters into a word. The Question will be plaintest from the user
and response will be formatted as JSON below:

{"reasoning": <reasoning about the answer>, "answer": <final answer>}<|eot_id|><|start_header_id|>user<|end_header_id|>

Question: Take the last letters of each words in 'Ian Peter Bernard Stephen' and concatenate them.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{"reasoning": "The last letter of 'Ian' is 'N', the last letter of 'Peter' is 'R', the last letter of 'Bernard' is 'D', and the last letter of 'Stephen' is 'N'. Therefore, the answer is 'NRDN'.", "answer": "NRDN"}<|eot_id|><|start_header_id|>user<|end_header_id|>

Question: Take the last letters of each words in "Britt Tamara Elvis Nayeli" and concatenate them.",<|eot_id|><|start_header_id|>assistant<|end_header_id|>

<|eot_id|>
Here are a few things that make this prompt good.

1. Use the proper instruct format for our model (using apply_chat_template).
2. Provide an example that uses the correct structure and matches our problem.
3. End with an empty “assistant” prompt so that the response will start with our structure.
Structured generation works perfectly well with continuation models, but since we’re using an instruct model we should use the instruct prompt format for best results (this is item 1). The most important thing we’ve done is item 2: showing the model the structure we want it to follow. Item 3 is a small but important detail: instruct prompts are trained to alternate between the ‘user’ role and the ‘assistant’ role, and if we don’t end with the empty assistant turn, the model will want to start with assistant... rather than our desired structure {"reasoning": ....

It can be quite helpful when writing your prompt to also generate unstructured samples to see how well the model is following the desired formatting behavior.
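The post later calls a create_prompt helper without showing its body; the sketch below is one plausible way such a helper could assemble the prompt above with apply_chat_template (the exact message wording is an assumption, not the authors' code):

def create_prompt(question: str, tokenizer) -> str:
    system = (
        "You are an expert in solving simple word puzzles using reasoning steps. "
        "Take a list of 4 names, reason about the last letter of each, then "
        "concatenate those letters into a word. Respond as JSON: "
        '{"reasoning": <reasoning about the answer>, "answer": <final answer>}'
    )
    example_question = (
        "Take the last letters of each words in 'Ian Peter Bernard Stephen' "
        "and concatenate them."
    )
    example_answer = (
        '{"reasoning": "The last letters of Ian, Peter, Bernard and Stephen '
        'are N, R, D, N.", "answer": "NRDN"}'
    )
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": f"Question: {example_question}"},
        {"role": "assistant", "content": example_answer},
        {"role": "user", "content": f"Question: {question}"},
    ]
    # add_generation_prompt=True appends the empty assistant header, so the
    # model's next tokens start directly inside our desired structure.
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )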

Defining Our Structure
Next we want to define our structure (though really this should go hand in hand with the prompt). For this task we’ll use a simple Pydantic model:

from pydantic import BaseModel, Field, constr

class Response(BaseModel):
    reasoning: constr(max_length=250)
    answer: str = Field(pattern=r'[A-Z]{4}')
We’re constraining the reasoning step here to 250 characters to make sure it doesn’t take too long to reason, and also constraining our answer to only valid possible responses of 4 letters.

A really important step in the process is to verify that the prompt contains our structure. The entire point of the prompt is to prime our LLM for success; if we aren’t showing it the exact structure we want, the model will have to work harder to get to the correct answer. Here’s the code for ensuring the prompt matches the structure:

import re

from outlines.fsm.json_schema import build_regex_from_schema

schema_regex = build_regex_from_schema(Response.schema_json())

example_prompt = create_prompt(all_evals[5]['question'], tokenizer)
re.search(schema_regex, example_prompt)
After verifying this we’re good to go! It’s worth pointing out that doing all these things will also generally improve your unstructured results.
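For the generation side, here is a hedged sketch of wiring the Pydantic model into outlines' JSON-constrained generator, mirroring the earlier regex example (the exact setup is our assumption, not the authors' notebook code):

import outlines
from outlines.samplers import greedy

model = outlines.models.transformers("meta-llama/Meta-Llama-3-8B-Instruct")
generator = outlines.generate.json(model, Response, sampler=greedy())

prompt = create_prompt(all_evals[5]["question"], tokenizer)
result = generator(prompt)   # returns a validated Response instance
print(result.answer, result.reasoning)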

JSON Results
Now let’s run a proper apples-to-apples comparison of structured generation to unstructured. Here is the outcome of running this eval:


Figure 8: Results of comparing structured and unstructured JSON generation using the same prompt (compared with the structured NL prompt).

| Method | Accuracy |
| --- | --- |
| NL (struct) | 0.68 |
| JSON (unstruct) | 0.73 |
| JSON (struct) | 0.77 |
Once again we see that structured generation outperforms unstructured generation. It’s also worth noting that our unstructured JSON result (at 73% accuracy) outperformed our structured natural language (68% accuracy) result, but our overall winner is structured JSON with 77% accuracy!

Here we can see why it’s essential to compare apples to apples when understanding the performance of structured generation. If, for example, we compared the JSON prompt for unstructured and the NL for structured, we would incorrectly conclude the structured generation is slightly worse, when the issue is really the prompt.

Conclusion
We here at .txt are passionate about structured generation, and truly believe it has the potential to transform the work being done with LLMs in profound ways. That’s precisely why we take claims that structured generation has adverse effects on LLM output very seriously. As a community we are all still learning how to get the most out of LLMs and working hard to advance this promising area of research. Everyone struggles at times to get LLMs to perform. If you are experiencing problems with structured generation, don’t hesitate to reach out. We would love to understand more about the areas where structured generation might need improvement, and get those areas fixed.

That is why it is so disappointing when other researchers take less care and end up spreading misinformation that we then have to do the work to correct. If you’re interested in driving the future of LLMs, we’re more than happy to help you on your journey.

But, if you are going to publish a damning critique of structured generation, just make sure you put the same effort into understanding the problem as we would. Or, perhaps better expressed by Omar Little from The Wire:

"Come at the King, you best not miss."
