➡️ Cohere has released Command R7B, the final model in its series of enterprise-focused large language models.

➡️ Genesis-world has unveiled Genesis, a comprehensive physics platform for robotics and physical AI applications.

➡️ Answer.ai has announced ModernBERT, a modernization of the BERT model that substantially improves performance and efficiency.

➡️ Google DeepMind has released the Gemini 2.0 Flash Thinking model, offering multimodal input and strengthened reasoning performance.

➡️ IBM has updated the Granite 3.1 Language Models, introducing longer context, improved RAG, and function-calling capabilities.

➡️ Researchers have submitted a paper on the advantages of open-source LLMs.

➡️ Hugging Face has released the Falcon3 open models, offering multiple variants and improved performance.

➡️ Google DeepMind and Google Research have announced the FACTS Grounding benchmark, which evaluates LLMs' factual accuracy and document-grounded responses.

➡️ Tencent has announced BrushEdit, an inpainting model that supports precise image editing.

Cohere, Command R7B Released

Link, December 14, 2024

  • Cohere released Command R7B, the final model in its R series.
  • Command R7B offers a 128k context length, multilingual support, citation-verified retrieval-augmented generation (RAG), reasoning, tool use, and agentic capabilities.
  • It can be served on low-end GPUs, a MacBook, or even CPUs, sharply lowering the cost of deploying AI applications.
  • Ranks first on average among similarly sized models on the HuggingFace Open LLM Leaderboard.
  • Matches or exceeds other open-weights models on math, code, and reasoning tasks.
  • Optimized for enterprise AI deployments, with high throughput suited to real-time use cases.

Genesis-world, Genesis Platform Released

Link, December 19, 2024

  • Genesis-world unveiled Genesis, a comprehensive physics platform for robotics, embodied AI, and physical AI applications.
  • Genesis integrates a universal physics engine, a lightweight robotics simulation platform, a photorealistic rendering system, and a generative data engine.
  • Developed 100% in Python, with easy installation and a user-friendly API.
  • Delivers simulation speeds 10-80x faster than existing simulation platforms and supports a variety of physics solvers.
  • Parallelized simulation and high-quality rendering provide strong physical accuracy and visual fidelity.
  • Released as open source, with community contributions welcome; enables automated data generation for robotics research.

Answer.ai, ModernBERT Announced

Link, December 18, 2024

  • Answer.ai, together with LightOn, announced ModernBERT, a modernization of the BERT model.
  • ModernBERT adopts an 8,192-token context length, Flash Attention, RoPE embeddings, and alternating attention, improving both performance and efficiency.
  • Trained on 2 trillion tokens, it performs strongly across diverse classification tasks and multi-vector retrieval.
  • Outperforms similarly sized encoder models (e.g., DeBERTaV3 on GLUE) while being far more memory efficient.
  • Trained on diverse text and code data, it stands out on programming-related tasks.
  • Released on Hugging Face and in Transformers under the Apache 2.0 license.

Google DeepMind, Gemini 2.0 Flash Thinking Released

Link, December 11, 2024

  • Google DeepMind released the Gemini 2.0 Flash Thinking model, offering improved reasoning performance.
  • Gemini 2.0 Flash supports multimodal inputs (images, video, audio), and the Thinking variant exposes its Chain of Thought to deliver stronger reasoning.
  • Available to developers through the Gemini API in Google AI Studio and Vertex AI (see the API sketch below).
  • Gemini 2.0 Flash provides fast response times and improved performance, and is being integrated into the Gemini app and AI assistant.
  • Agent-based research prototypes such as Project Astra, Project Mariner, and Jules explore agentic uses across many domains.
  • Complies with safety and security standards and includes user-privacy protections.
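
A minimal sketch (not an official Google example) of calling the experimental model through the Gemini API in Python; the model name comes from the AI Studio link in the sources, while the google-generativeai client usage shown here is an assumption to verify against the current SDK docs.

import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])  # API key from Google AI Studio
model = genai.GenerativeModel("gemini-2.0-flash-thinking-exp-1219")  # name taken from the AI Studio link

response = model.generate_content("In how many ways can 5 people sit around a round table?")
print(response.text)  # final answer; AI Studio additionally surfaces the model's chain of thought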

IBM, Granite 3.1 Language Models Updated

Link, December 18, 2024

  • IBM updated the Granite 3.1 Language Models, introducing longer context, improved RAG, and function-calling capabilities.
  • The Granite-3.1-8B-Instruct model, at 8B parameters, performs well across diverse language and code data.
  • Supports long sequences of up to 128K tokens and 12 languages.
  • Released under the Apache 2.0 license and accessible on Hugging Face, with various fine-tuning options (see the loading sketch below).
  • Usable for text classification, question answering, code-related tasks, and more.
  • Trained at scale on IBM's Blue Vela supercomputer.
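
A minimal loading sketch, assuming the Granite 3.1 weights are published under IBM's organization on Hugging Face; the repository name used below is an assumption to check against the model card.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-3.1-8b-instruct"  # assumed model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "List three use cases for retrieval-augmented generation."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))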

A Paper on the Advantages of Open-Source LLMs

Link, December 16, 2024

  • Researchers submitted a paper to arXiv analyzing the strengths and weaknesses of open-source and closed-source LLMs.
  • Open-source LLMs such as LLaMA and BLOOM achieve competitive performance through community-driven development and compute efficiency.
  • Closed-source models such as GPT-4 maintain state-of-the-art performance by leveraging massive datasets and compute resources, but face problems of limited transparency and restricted access.
  • Open-source models offer transparency and reproducibility, yet the absence of ethical-auditing frameworks makes consistent ethical governance difficult.
  • The paper stresses the need for hybrid approaches that can satisfy accessibility, technical performance, and ethical deployment.

Hugging Face, Falcon3 Open Models Released

Link, December 17, 2024

  • The Falcon3 open models were released on Hugging Face.
  • With fewer than 10B parameters, the Falcon3 models perform strongly on math, code, and scientific knowledge.
  • The Falcon3-10B-Base model scores 22.9 on MATH-Lvl5 and 83.0 on GSM8K, demonstrating its mathematical reasoning ability.
  • Multiple variants (Instruct, GGUF, GPTQ-Int4, etc.) are provided, allowing flexible use across applications (see the loading sketch below).
  • Compatible with the Llama architecture, making integration into the existing AI ecosystem easy.
  • Falcon3 was developed by the Technology Innovation Institute (TII) and trained on 14 trillion tokens.
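
A minimal loading sketch, assuming the Falcon3 instruct weights are hosted under the tiiuae organization on Hugging Face; the exact repository name below is an assumption to verify on the hub.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon3-7B-Instruct"  # assumed model ID; GGUF/GPTQ-Int4 variants ship separately
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "What is the sum of the first 20 positive integers?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))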

Google DeepMind and Google Research, FACTS Grounding Benchmark Announced

Link, December 17, 2024

  • Google DeepMind and Google Research announced the FACTS Grounding benchmark, which evaluates LLMs' factual accuracy and their ability to ground responses in provided documents.
  • FACTS Grounding consists of 1,719 examples: 860 in a public set and 859 in a private set.
  • Covers diverse domains (medical, legal, technology, finance, retail, etc.) and task types such as fact finding, summarization, effect analysis, and concept comparison.
  • The Gemini 2.0 Flash model ranks first at 83.6% accuracy, followed by OpenAI's GPT-4o at 78.8%.
  • Uses automated LLM judge models, with multiple judges to minimize bias (see the illustrative sketch below).
  • A Kaggle leaderboard is run and will be updated continuously with community participation.
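
An illustrative sketch of the grounded-judging idea behind FACTS Grounding: a judge model checks whether every claim in a response is supported by the supplied document. The prompt wording and judge interface below are assumptions for illustration, not Google's actual rubric.

from typing import Callable, Dict, List

JUDGE_PROMPT = """You are grading factual grounding.
Document:
{document}

Response:
{response}

Answer "SUPPORTED" only if every claim in the response is backed by the document,
otherwise answer "UNSUPPORTED"."""

def grounding_score(examples: List[Dict[str, str]], judge: Callable[[str], str]) -> float:
    """Fraction of responses the judge marks as fully supported by their document."""
    supported = 0
    for ex in examples:
        verdict = judge(JUDGE_PROMPT.format(document=ex["document"], response=ex["response"]))
        supported += verdict.strip().upper().startswith("SUPPORTED")
    return supported / len(examples)

# Usage: pass any LLM call as `judge`, e.g. judge=lambda prompt: some_llm_client(prompt)  (hypothetical client)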

Tencent, BrushEdit Released

Link, December 17, 2024

  • Tencent released BrushEdit, an inpainting model that supports precise image editing.
  • BrushEdit lets users edit an image as they wish, guided by natural-language instructions and the image itself.
  • The model can accurately reconstruct images and insert or remove a variety of objects.
  • Users can modify images easily with nothing more than a text prompt and mouse drags.
  • Built on refined training data and deep-learning techniques, it enables realistic image generation and editing.
  • Offered as an easily accessible online platform, it presents a new paradigm for image editing.

Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each content with detailed points, and write a report. The report format is:

(today’s date in 년 월 일) AI 소식,

Summary

(overall short summary, make summary with good details. for Summary section, explain the details starting with company name, e.g. OpenAI에서는 ~~~를 발표하였습니다.)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)
###
https://cohere.com/blog/command-r7b
Introducing Command R7B: Fast and efficient generative AI
Aidan Gomez
December 14, 2024

The smallest model in our R series delivers top-tier speed, efficiency, and quality to build powerful AI applications on commodity GPUs and edge devices.
Today, we’re excited to release Command R7B, the smallest, fastest, and final model in our R series of enterprise-focused large language models (LLMs). Command R7B provides state-of-the-art performance in its class of open-weights models across real-world tasks that matter for users. The model is designed for developers and businesses that need to optimize for the speed, cost-performance, and compute resources of their use cases.

Like our other models in the R series, Command R7B offers a context length of 128k and excels in capabilities important for a wide range of business applications. It delivers a powerful combination of multilingual support, citation verified retrieval-augmented generation (RAG), reasoning, tool use, and agentic behavior. Thanks to its compact size and efficiency it can be served on low-end GPUs, a MacBook, or even CPUs – drastically lowering the cost of deploying AI applications into production.

High performance in a small package
A well-rounded model
Command R7B excels on standardized and externally verifiable benchmarks such as the HuggingFace Open LLM Leaderboard. Compared to other similarly sized open-weights models, Command R7B ranks first on average with strong performance across all tasks.


HuggingFace Leaderboard evaluation results. Competitor numbers are taken from the official leaderboard. Command R7B results are calculated by us using the official HuggingFace prompts and evaluation code.
Enhanced efficiency in math, code, and reasoning tasks
A major area of focus for Command R7B has been improving performance on math and reasoning, code, and multilingual tasks. In particular, the model matches or exceeds leading open-weights models in its class across common math and code benchmarks while using fewer parameters.


Model performance on math and code benchmarks. All numbers are from internal evaluations except those marked with an asterisk which are from externally reported results where these are higher. We use the base version of MBPPPlus, LBPP is the average across 6 languages, SQL the average of 3 datasets (SpiderDev and Test - hard and extra hard only, BirdBench, and an internal one) and COBOL is an internally developed dataset.

Document translation quality evaluated with corpus spBLEU on the NTREX dataset.
Best-in-class RAG, tool use, and agents
Command R7B outperforms the other similarly sized open-weights models when it comes to core business use cases such as RAG, tool use, and AI agents. It is an ideal choice for enterprises looking for a cost-efficient model grounded in their internal documents and data. Like our other R series models, our RAG offering delivers native in-line citations that significantly reduce hallucinations and make fact-checking easier.


Performance evaluated across the ChatRAGBench (10-dataset average), BFCL-v3, StrategyQA, Bamboogle, and Tooltalk-hard. Methodology and further details are provided at the bottom in a footnote [1].
For tool use, we see stronger overall performance than models of similar size on the industry-standard Berkeley Function-Calling Leaderboard. This shows Command R7B is particularly effective at tool use in real-world, diverse, and dynamic environments and avoids calling tools unnecessarily, which is an important aspect of tool use in practical applications. Command R7B’s multi-step tool use capabilities allow it to power fast and capable AI agents.

Optimized for enterprise use cases
Our models are optimized for the capabilities enterprises need for real-world deployment of AI systems. The R series delivers an unmatched balance of efficiency and strong performance. This means ensuring they excel on human evaluation, the gold standard for quality assessment. Command R7B outperforms similarly sized open-weights models in blind head-to-head evaluations by human raters on RAG use cases our customers care about when building AI assistants for functions like customer service, HR, compliance, and IT support.


Head-to-head human evaluation of Command R7B vs. Gemma 2 9B on a collection of 949 examples of enterprise RAG use cases. All examples are at least 3-way blind-annotated by specially trained human annotators, assessing fluency, faithfulness, and response utility.
Efficient and fast
Command R7B’s compact size offers a reduced serving footprint that is ideal for rapid prototyping and iteration. It excels at high throughput, real-time use cases like chatbots and code assistants. It also unlocks dramatically cheaper deployment infrastructure such as consumer GPUs and CPUs to unlock on-device inference.

We achieve this without compromising on our enterprise-grade security and privacy standards to protect customers' data.

Get started
Command R7B is available today on the Cohere Platform as well as accessible on HuggingFace. We’re excited to be releasing the weights of this model to provide greater access to cutting-edge technology for the AI research community.
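
A minimal loading sketch for the open weights, assuming the checkpoint is published under the CohereForAI organization on Hugging Face; the repository name used here is an assumption to confirm on the model card.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/c4ai-command-r7b-12-2024"  # assumed repo name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Command R models ship a chat template; apply_chat_template builds the prompt.
messages = [{"role": "user", "content": "Summarize the benefits of citation-verified RAG."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))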

###
https://genesis-world.readthedocs.io/en/latest/
12/19/24

Genesis
What is Genesis?
Genesis is a physics platform designed for general purpose Robotics/Embodied AI/Physical AI applications. It is simultaneously multiple things:

A universal physics engine re-built from the ground up, capable of simulating a wide range of materials and physical phenomena.

A lightweight, ultra-fast, pythonic, and user-friendly robotics simulation platform.

A powerful and fast photo-realistic rendering system.

A generative data engine that transforms user-prompted natural language description into various modalities of data.

Powered by a universal physics engine re-designed and re-built from the ground up, Genesis integrates various physics solvers and their coupling into a unified framework. This core physics engine is further enhanced by a generative agent framework that operates at an upper level, aiming towards fully automated data generation for robotics and beyond. Currently, we are open-sourcing the underlying physics engine and the simulation platform. The generative framework will be released in the near future.

Genesis is built and will continuously evolve with the following long-term missions:

Lowering the barrier to using physics simulations and making robotics research accessible to everyone. (See our commitment)

Unifying a wide spectrum of state-of-the-art physics solvers into a single framework, allowing re-creating the whole physical world in a virtual realm with the highest possible physical, visual and sensory fidelity, using the most advanced simulation techniques.

Minimizing human effort in collecting and generating data for robotics and other domains, letting the data flywheel spin on its own.

Project Page: https://genesis-embodied-ai.github.io/

Key Features
Compared to prior simulation platforms, here we highlight several key features of Genesis:

🐍 100% Python, both front-end interface and back-end physics engine, all natively developed in python.

👶 Effortless installation and extremely simple and user-friendly API design.

🚀 Parallelized simulation with unprecedented speed: Genesis is the world’s fastest physics engine, delivering simulation speeds up to 10~80x (yes, this is a bit sci-fi) faster than existing GPU-accelerated robotic simulators (Isaac Gym/Sim/Lab, Mujoco MJX, etc), without any compromise on simulation accuracy and fidelity.

💥 A unified framework that supports various state-of-the-art physics solvers, modeling a vast range of materials and physical phenomena.

📸 Photo-realistic ray-tracing rendering with optimized performance.

📐 Differentiability: Genesis is designed to be fully compatible with differentiable simulation. Currently, our MPM solver and Tool Solver are differentiable, and differentiability for other solvers will be added soon (starting with rigid-body simulation).

☝🏻 Physically-accurate and differentiable tactile sensor.

🌌 Native support for Generative Simulation, allowing language-prompted data generation of various modalities: interactive scenes, task proposals, rewards, assets, character motions, policies, trajectories, camera motions, (physically-accurate) videos, and more.

Getting Started
Quick Installation
Genesis is available via PyPI:

pip install genesis-world
You also need to install PyTorch following the official instructions.
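
A minimal simulation-loop sketch following the documented quickstart pattern; treat the exact class and asset names (gs.init, gs.Scene, gs.morphs.*, the MJCF path) as assumptions to verify against the official docs.

import genesis as gs

gs.init(backend=gs.cpu)  # CPU backend; GPU/Metal backends are also supported

scene = gs.Scene(show_viewer=False)
scene.add_entity(gs.morphs.Plane())  # ground plane
robot = scene.add_entity(gs.morphs.MJCF(file="xml/franka_emika_panda/panda.xml"))  # assumed bundled asset

scene.build()  # compile the scene before stepping

for _ in range(1000):
    scene.step()  # advance the physics simulation by one step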

Documentation
Please refer to our documentation site for detailed installation steps, tutorials, and API references.

Contributing to Genesis
The goal of the Genesis project is to build a fully transparent, user-friendly ecosystem where contributors from both robotics and computer graphics can come together to collaboratively create a high-efficiency, realistic (both physically and visually) virtual world for robotics research and beyond.

We sincerely welcome any forms of contributions from the community to make the world a better place for robots. From pull requests for new features, bug reports, to even tiny suggestions that will make Genesis API more intuitive, all are wholeheartedly appreciated!

WTF?! New open-source physics AI engine absolutely insane! 🤯 Genesis is a new physics engine that combines ultra-fast simulation with generative capabilities to create dynamic 4D worlds for robotics and physics.
TL;DR:
🚀 430,000x faster than real-time physics simulation, processes 43M FPS on a single RTX 4090
🐍 Built in pure Python, 10-80x faster than existing GPU solutions like Isaac Gym
🌐 Cross-platform support: Linux, MacOS, Windows, with CPU, NVIDIA, AMD, and Apple Metal backends
🧪 Unified framework combining multiple physics solvers: Rigid body, MPM, SPH, FEM, PBD, Stable Fluid
🤖 Extensive robot support: arms, legged robots, drones, soft robots; supports MJCF, URDF, obj, glb files
🎨 Built-in photorealistic ray-tracing rendering
📐 Differentiable simulation capabilities (currently for MPM and Tool solvers)
🔄 Can generate environments, camera motions, robot policies, character animations from text prompts
⚡ Takes only 26 seconds to train real-world transferrable robot locomotion policies
💻 Simple installation via pip: pip install genesis-world
🤝 Physics engine and simulation platform are fully open-sourced
🔜 ".generate" method/generative framework coming soon.

###
https://huggingface.co/collections/answerdotai/modernbert-67627ad707a4acbf33c41deb
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
Published on Dec 18 · Submitted by jph00 on Dec 19
Authors: Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli
Abstract
Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed and memory efficient encoder and is designed for inference on common GPUs.

ModernBERT, BERT revisited in the age of LLMs and Generative AI! LightOn and Answer.ai modernized BERT! Improved architecture with 8192 context length, flash attention, and trained on 2T tokens. ModernBERT outperforms BERT and RoBERTa across the board! 👀
TL;DR;
2️⃣ Comes in 2 sizes base (139M) and large (395M)
🚀 Better performance across all metrics than the original BERT
📏 8,192 token context length (16x longer than BERT)
⚡ Modern architecture with Flash Attention 2, RoPE embeddings, and alternating attention
📚 Trained on 2 trillion tokens, primarily English and Code
💨 2-4x faster than other models with mixed-length inputs
🔓 Released under Apache 2.0
🤗 Available on Hugging Face and Transformers (main)

Finally, a Replacement for BERT
Published December 19, 2024
By Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Jonathan Whitaker, and Iacopo Poli (Answer.AI, LightOn, and guests)
TL;DR
This blog post introduces ModernBERT, a family of state-of-the-art encoder-only models representing improvements over older generation encoders across the board, with an 8192 sequence length, better downstream performance and much faster processing.

ModernBERT is available as a slot-in replacement for any BERT-like models, with both a base (139M params) and large (395M params) model size.

How to use these models with transformers:
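
A short usage sketch with the Hugging Face transformers masked-LM API; the model ID below assumes the announced answerdotai/ModernBERT-base repository, and a recent transformers release (main at the time of writing) may be needed for ModernBERT support.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-base"  # assumed repo name for the base (139M) model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Decode the highest-scoring token at the [MASK] position.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
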
Introduction
BERT was released in 2018 (millennia ago in AI-years!) and yet it’s still widely used today: in fact, it’s currently the second most downloaded model on the HuggingFace hub, with more than 68 million monthly downloads, only second to another encoder model fine-tuned for retrieval. That’s because its encoder-only architecture makes it ideal for the kinds of real-world problems that come up every day, like retrieval (such as for RAG), classification (such as content moderation), and entity extraction (such as for privacy and regulatory compliance).

Finally, 6 years later, we have a replacement! Today, we at Answer.AI and LightOn (and friends!) are releasing ModernBERT. ModernBERT is a new model series that is a Pareto improvement over BERT and its younger siblings across both speed and accuracy. This model takes dozens of advances from recent years of work on large language models (LLMs), and applies them to a BERT-style model, including updates to the architecture and the training process.



We expect to see ModernBERT become the new standard in the numerous applications where encoder-only models are now deployed, such as in RAG pipelines (Retrieval Augmented Generation) and recommendation systems.

In addition to being faster and more accurate, ModernBERT also increases context length to 8k tokens (compared to just 512 for most encoders), and is the first encoder-only model that includes a large amount of code in its training data. These features open up new application areas that were previously inaccessible through open models, such as large-scale code search, new IDE features, and new types of retrieval pipelines based on full document retrieval rather than small chunks.

But in order to explain just what we did, let’s first take a step back and look at where we’ve come from.

Decoder-only models
The recent high-profile advances in LLMs have been in models like GPT, Llama, and Claude. These are decoder-only models, or generative models. Their ability to generate human-like content has enabled astonishing new GenAI application areas like generated art and interactive chat. These striking applications have attracted major investment, funded booming research, and led to rapid technical advances. What we’ve done, essentially, is port these advances back to an encoder-only model.

Why? Because many practical applications need a model that’s lean and mean! And it doesn’t need to be a generative model.

More bluntly, decoder-only models are too big, slow, private, and expensive for many jobs. Consider that the original GPT-1 was a 117 million parameter model. The Llama 3.1 model, by contrast, has 405 billion parameters, and its technical report describes a data synthesis and curation recipe that is too complex and expensive for most corporations to reproduce. So to use such a model, like ChatGPT, you pay in cents and wait in seconds to get an API reply back from heavyweight servers outside of your control.

Of course, the open-ended capabilities of these giant generative models mean that you can, in a pinch, press them into service for non-generative or discriminative tasks, such as classification. This is because you can describe a classification task in plain English and ... just ask the model to classify. But while this workflow is great for prototyping, you don’t want to pay prototype prices once you’re in mass production.

The popular buzz around GenAI has obscured the role of encoder-only models. These are the workhorses of practical language processing, the models that are actually being used for such workloads right now in many scientific and commercial applications.

Encoder-only models
The output of an encoder-only model is a list of numerical values (an embedding vector). You might say that instead of answering with text, an encoder model literally encodes its “answer” into this compressed, numerical form. That vector is a compressed representation of the model's input, which is why encoder-only models are sometimes referred to as representational models.

While decoder-only models (like a GPT) can do the work of an encoder-only model (like a BERT), they are hamstrung by a key constraint: since they are generative models, they are mathematically “not allowed” to “peek” at later tokens. They can only ever look backwards. This is in contrast to encoder-only models, which are trained so each token can look forwards and backwards (bi-directionally). They are built for this, and it makes them very efficient at what they do.

Basically, a frontier model like OpenAI's O1 is like a Ferrari SF-23. It’s an obvious triumph of engineering, designed to win races, and that’s why we talk about it. But it takes a special pit crew just to change the tires and you can’t buy one for yourself. In contrast, a BERT model is like a Honda Civic. It’s also an engineering triumph, but more subtly, since it is engineered to be affordable, fuel-efficient, reliable, and extremely useful. And that’s why they’re absolutely everywhere.

You can see this by looking at it a number of ways.

Supporting generative models: One way to understand the prevalence of representational models (encoder-only) is to note how frequently they are used in concert with a decoder-only model to make a system which is safe and efficient.

The obvious example is RAG. Instead of relying on the LLM’s knowledge trained into the model’s parameters, the system uses a document store to furnish the LLM with information relevant to the query. But of course this only defers the problem. If the LLM doesn’t know which documents are relevant to the query, the system needs some other process to select those documents. It’s going to need a model which is fast and cheap enough that it can be used to encode the large quantities of information needed to make the LLM useful. That model is often a BERT-like encoder-only model.

Another example is supervision architectures, where a cheap classifier might be used to ensure that generated text does not violate content safety requirements.

In short, whenever you see a decoder-only model in deployment, there’s a reasonable chance an encoder-only model is also part of the system. But the converse is not true.

Encoder-based systems: Before there was GPT, there were content recommendations in social media and in platforms like Netflix. There was ad targeting in those venues, in search, and elsewhere. There was content classification for spam detection, abuse detection, etc.. These systems were not built on generative models, but on representational models like encoder-only models. And all these systems are still out there and still running at enormous scale. Imagine how many ads are targeted per second around the world!

Downloads: On HuggingFace, RoBERTa, one of the leading BERT-based models, has more downloads than the 10 most popular LLMs on HuggingFace combined. In fact, currently, encoder-only models add up to over a billion downloads per month, nearly three times more than decoder-only models with their 397 million monthly downloads. In fact, the `fill-mask` model category, composed of encoder “base models” such as ModernBERT, ready to be fine-tuned for other downstream applications, is the most downloaded model category overall.

Inference costs: What the above suggests, is that on an inference-per-inference basis, there are many times more inferences performed per year on encoder-only models than on decoder-only or generative models. An interesting example is FineWeb-Edu, where model-based quality filtering had to be performed over 15 trillion tokens. The FineWeb-Edu team chose to generate annotations with a decoder-only model, Llama-3-70b-Instruct, and perform the bulk of the filtering with a fine-tuned BERT-based model. This filtering took 6,000 H100 hours, which, at HuggingFace Inference Points’ pricing of $10/hour, comes to a total of $60,000. On the other hand, feeding 15 trillion tokens to popular decoder-only models, even with the lowest-cost option of using Google’s Gemini Flash and its low inference cost of $0.075/million tokens, would cost over one million dollars!
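
A quick back-of-the-envelope check of the two costs quoted above, using the prices as stated:

encoder_cost = 6_000 * 10            # 6,000 H100-hours at $10/hour
decoder_cost = 15e12 / 1e6 * 0.075   # 15T tokens at $0.075 per million tokens
print(encoder_cost)                  # 60000     -> $60k with the fine-tuned BERT-based filter
print(decoder_cost)                  # 1125000.0 -> over $1.1M even with a low-cost decoder API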

Performance
Overview
Here’s a snapshot of the accuracy of ModernBERT and other models across a range of tasks, as measured by standard academic benchmarks – as you can see, ModernBERT is the only model which is a top scorer across every category, which makes it the one model you can use for all your encoder-based tasks:



If you’ve ever done an NLP competition on Kaggle, then you’ll know that DeBERTaV3 has been the choice of champions for years. But no longer: not only is ModernBERT the first base-size model to beat DeBERTaV3 on GLUE, it also uses less than 1/5th of DeBERTa’s memory.

And of course, ModernBERT is fast. It’s twice as fast as DeBERTa – in fact, up to 4x faster in the more common situation where inputs are mixed length. Its long context inference is nearly 3 times faster than other high-quality models such as NomicBERT and GTE-en-MLM.

ModernBERT’s context length of 8,192 tokens is over 16x larger than most existing encoders. This is critical, for instance, in RAG pipelines, where a small context often makes chunks too small for semantic understanding. ModernBERT is also the state-of-the-art long context retriever with ColBERT, and is 9 percentage points above the other long context models. Even more impressive: this very quickly trained model, simply tuned to compare to other backbones, outperforms even widely-used retrieval models on long-context tasks!

For code retrieval, ModernBERT is unique. There’s nothing to really compare it to, since there’s never been an encoder model like this trained on a large amount of code data before. For instance, on the StackOverflow-QA dataset (SQA), which is a hybrid dataset mixing both code and natural language, ModernBERT's specialized code understanding and long-context capabilities make it the only backbone to score over 80 on this task.

This means whole new applications are likely to be built on this capability. For instance, imagine an AI-connected IDE which had an entire enterprise codebase indexed with ModernBERT embeddings, providing fast long context retrieval of the relevant code across all repositories. Or a code chat service which described how an application feature worked that integrated dozens of separate projects.

Compared to the mainstream models, ModernBERT performs better across nearly all three broad task categories of retrieval, natural language understanding, and code retrieval. Whilst it slightly lags DeBERTaV3 in one area (natural language understanding), it is many times faster. Please note that ModernBERT, as any other base model, can only do masked word prediction out-of-the-box. To be able to perform other tasks, the base model should be fine-tuned as done in these boilerplates.

Compared to the specialized models, ModernBERT is comparable or superior in most tasks. In addition, ModernBERT is faster than most models across most tasks, and can handle inputs up to 8,192 tokens, 16x longer than the mainstream models.

Efficiency
Here’s the memory (max batch size, BS) and Inference (in thousands of tokens per second) efficiency results on an NVIDIA RTX 4090 for ModernBERT and other decoder models:



The first thing you might notice is that we’re analysing the efficiency on an affordable consumer GPU, rather than the latest unobtainable hyped hardware. First and foremost, ModernBERT is focused on practicality, not hype.

As part of this focus, it also means we’ve made sure ModernBERT works well for real-world applications, rather than just benchmarks. Models of this kind are normally tested on just the one exact size they’re best at – their maximum context length. That’s what the “fixed” column in the table shows. But input sizes vary in the real world, so that’s the performance we worked hard to optimise – the “variable” column. As you can see, for variable length inputs, ModernBERT is much faster than all other models.

For long context inputs, which we believe will be the basis for the most valuable and important future applications, ModernBERT is 2-3x faster than the next fastest model. And, on the “practicality” dimension again: ModernBERT doesn’t require the additional heavy “xformers” dependency, but instead only requires the now commonplace Flash Attention as a dependency.

Furthermore, thanks to ModernBERT’s efficiency, it can use a larger batch size than nearly any other model, and can be used effectively on smaller and cheaper GPUs. The efficiency of the base size, in particular, may enable new applications that run directly in browsers, on phones, and so forth.

Why is ModernBERT, well, Modern?
Now, we’ve made our case to why we should give some more love to encoder models. As trusted, under-appreciated workhorses, they’ve had surprisingly few updates since 2018's BERT!

Even more surprising: since RoBERTa, there has been no encoder providing overall improvements without tradeoffs (fancily known as “Pareto improvements”): DeBERTaV3 had better GLUE and classification performance, but sacrificed both efficiency and retrieval. Other models, such as ALBERT, or newer ones, like GTE-en-MLM, all improved over the original BERT and RoBERTa in some ways but regressed in others.

However, since the duo’s original release, we've learned an enormous amount about how to build better language models. If you’ve used LLMs at all, you’re very well aware of it: while they’re rare in the encoder-world, Pareto improvements are constant in decoder-land, where models constantly become better at everything. And as we’ve all learned by now: model improvements are only partially magic, and mostly engineering.

The goal of the (hopefully aptly named) ModernBERT project was thus fairly simple: bring this modern engineering to encoder models. We did so in three core ways:

a modernized transformer architecture
particular attention to efficiency
modern data scales & sources
Meet the New Transformer, Same as the Old Transformer
The Transformer architecture has become dominant, and is used by the vast majority of models nowadays. However, it’s important to remember that there isn’t one but many Transformers. The main thing they share in common is their deep belief that attention is indeed all you need, and as such, build various improvements centered around the attention mechanism.

ModernBERT takes huge inspiration from the Transformer++ (as coined by Mamba), first used by the Llama2 family of models. We replace the older BERT-like building blocks with their improved equivalents; namely, we:

Replace the old positional encoding with "rotary positional embeddings" (RoPE): this makes the model much better at understanding where words are in relation to each other, and allows us to scale to longer sequence lengths.
Switch out the old MLP layers for GeGLU layers, improving on the original BERT’s GeLU activation function (see the sketch after this list).
Streamline the architecture by removing unnecessary bias terms, letting us spend our parameter budget more effectively.
Add an extra normalization layer after embeddings, which helps stabilize training.
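
An illustrative sketch of a GeGLU feed-forward block of the kind described above; the dimensions and layer names are placeholders, not ModernBERT's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLUFeedForward(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # One fused projection producing both the "value" and the "gate" halves;
        # bias terms are removed, as in the streamlined architecture.
        self.wi = nn.Linear(d_model, 2 * d_hidden, bias=False)
        self.wo = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value, gate = self.wi(x).chunk(2, dim=-1)
        return self.wo(value * F.gelu(gate))

ff = GeGLUFeedForward(d_model=768, d_hidden=1152)
print(ff(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
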
Upgrading a Honda Civic for the Race Track
We’ve covered this already: encoders are no Ferraris, and ModernBERT is no exception. However, that doesn’t mean it can’t be fast. When you get on the highway, you generally don’t go and trade in your car for a race car, but rather hope that your everyday reliable ride can comfortably hit the speed limit.

In fact, for all the application cases we mentioned above, speed is essential. Encoders are very popular in uses where they either have to process tons of data, allowing even tiny speed increments to add up very quickly, or where latency is very important, as is the case on RAG. In a lot of situations, encoders are even run on CPU, where efficiency is even more important if we want results in a reasonable amount of time.

As with most things in research, we build while standing on the shoulders of giants, and heavily leverage Flash Attention 2’s speed improvements. Our efficiency improvements rely on three key components: Alternating Attention, to improve processing efficiency, Unpadding and Sequence Packing, to reduce computational waste, and Hardware-Aware Model Design, to maximise hardware utilization.

Global and Local Attention
One of ModernBERT’s most impactful features is Alternating Attention, rather than full global attention. In technical terms, this means that our attention mechanism only attends to the full input every 3 layers (global attention), while all other layers use a sliding window where every token only attends to the 128 tokens nearest to itself (local attention).
As attention’s computational complexity balloons up with every additional token, this means ModernBERT can process long input sequences considerably faster than any other model.

In practice, it looks like this:


Conceptually, the reason this works is pretty simple: Picture yourself reading a book. For every sentence you read, do you need to be fully aware of the entire plot to understand most of it (full global attention)? Or is awareness of the current chapter enough (local attention), as long as you occasionally think back on its significance to the main plot (global attention)? In the vast majority of cases, it’s the latter.
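
A toy illustration (not ModernBERT's code) of the alternating pattern described above: every third layer attends globally, while the remaining layers use a 128-token sliding window.

import torch

def layer_attention_mask(seq_len: int, layer_idx: int, window: int = 128) -> torch.Tensor:
    """Boolean mask where entry [i, j] is True if token i may attend to token j."""
    if layer_idx % 3 == 0:                          # global layer: full attention
        return torch.ones(seq_len, seq_len, dtype=torch.bool)
    pos = torch.arange(seq_len)
    dist = (pos[:, None] - pos[None, :]).abs()      # pairwise token distances
    return dist <= window // 2                      # local layer: sliding window

# Fraction of token pairs attended to in a local layer vs. a global one at 8k tokens.
print(layer_attention_mask(8192, layer_idx=1).float().mean().item())
print(layer_attention_mask(8192, layer_idx=0).float().mean().item())  # 1.0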

Unpadding and Sequence Packing
Another core mechanism contributing to ModernBERT’s efficiency is its use of unpadding and sequence packing.

In order to be able to process multiple sequences within the same batch, encoder models require them to be the same length, so they can perform parallel computation. Traditionally, we’ve relied on padding to achieve this: figure out which sentence is the longest, and add meaningless tokens (padding tokens) to fill up every other sequence.

While padding solves the problem, it doesn’t do so elegantly: a lot of compute ends up being spent and wasted on padding tokens, which do not contribute any semantic information.

Padding vs sequence packing
Comparing padding with sequence packing. Sequence packing (‘unpadding’) avoids wasting compute on padding tokens and has more consistent non-padding token counts per batch. Samples are still processed individually through careful masking.
Unpadding solves this issue: rather than keeping these padding tokens, we remove them all, and concatenate them into mini-batches with a batch size of one, avoiding all unnecessary computations. If you’re using Flash Attention, our implementation of unpadding is even faster than previous methods, which heavily relied on unpadding and repadding sequences as they went through the model: we go one step further by introducing our own implementation of unpadding, relying heavily on recent developments in Flash Attention’s RoPE support. This allows ModernBERT to only have to unpad once, and optionally repad sequences after processing, resulting in a 10-20% speedup over previous methods.

To speed up pre-training even further, unpadding is in good company within our model, as we use it in conjunction with sequence packing. Sequence packing here is a logical next step: as we’re concatenating inputs into a single sequence, and GPUs are very good at parallelisation, we want to maximise the computational efficiency we can squeeze out of a single forward model pass. To do so, we use a greedy algorithm to group individual sequences into concatenated ones that are as close to the model’s maximum input length as possible.
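
A simplified sketch of the greedy packing idea described above (first-fit on length-sorted sequences); this is illustrative, not the authors' implementation.

from typing import List

def pack_sequences(lengths: List[int], max_len: int = 8192) -> List[List[int]]:
    """Greedily group sequence indices so each pack's total length stays <= max_len."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    packs, totals = [], []
    for i in order:
        for p, total in enumerate(totals):
            if total + lengths[i] <= max_len:   # first existing pack with room wins
                packs[p].append(i)
                totals[p] += lengths[i]
                break
        else:                                   # no pack fits: open a new one
            packs.append([i])
            totals.append(lengths[i])
    return packs

print(pack_sequences([5000, 4000, 3000, 2500, 800], max_len=8192))  # -> [[0, 2], [1, 3, 4]]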

Paying Attention to Hardware
Finally, the third facet of ModernBERT’s efficiency is hardware design.

We attempted to balance two insights that have been highlighted by previous research:

Deep & Narrow vs Wide & Shallow: Research shows that deeper models with narrower layers, often perform better than shallow models with fewer, wider layers. However, this is a double-edged sword: the deeper the model, the less parallelizable it becomes, and thus, the slower it runs at identical parameter counts.
Hardware Efficiency: Model dimensions need to align well with GPU hardware for maximum performance, and different target GPUs result in different constraints.
Sadly, there is no magic recipe to make a model run similarly well on a wide range of GPUs, but there is an excellent cookbook: The Case for Co-Designing Model Architectures with Hardware, in which the ways to optimize a model architecture for a given GPU are carefully laid out. We came up with a heuristic to extend their method to a basket of GPUs, while respecting a given set of constraints. Logically, the first step is to define said constraints, in our case:

Defining our target GPUs as common inference ones (RTX 3090/4090, A10, T4, L4)
Roughly defining our target model sizes at 130-to-150 million parameters for ModernBERT-Base, and 350-to-420 for ModernBERT-Large.
The final embedding sizes must match the original BERT’s dimensions, 768 for base and 1024 for large, to maximize backwards compatibility
Set performance constraints which are common across the basket of GPUs
Afterwards, we experimented with multiple model designs via a constrained grid search, varying both layer counts and layer width. Once we’d identified shapes that appeared to be the most efficient ones, we confirmed that our heuristics matched real-world GPU performance, and settled on the final model designs.

Training
def data(): return ['text', 'bad_text', 'math', 'code']
https://media1.tenor.com/m/xJSM2Ky3WpgAAAAd/steve-ballmer-microsoft.gif
Picture this exact scene, but replace Developers with Data

Another big aspect in which encoders have been trailing behind is training data. This is often understood to mean solely training data scale, but this is not actually the case: previous encoders, such as DeBERTaV3, were trained for long enough that they might have even breached the trillion tokens scale!

The issue, rather, has been training data diversity: many of the older models train on limited corpora, generally consisting of Wikipedia and Wikibooks. These data mixtures are very noticeably single text modality: they contain nothing but high-quality natural text.

In contrast, ModernBERT is trained on data from a variety of English sources, including web documents, code, and scientific articles. It is trained on 2 trillion tokens, of which most are unique, rather than the standard 20-to-40 repetitions common in previous encoders.

The impact of this is immediately noticeable: out of all the existing open source encoders, ModernBERT is in a class of its own on programming-related tasks. We’re particularly interested in what downstream uses this will lead to, in terms of improving programming assistants.

Process
We stick to the original BERT’s training recipe, with some slight upgrades inspired by subsequent work: we remove the Next-Sentence Prediction objective, which has since been shown to add overhead for no clear gains, and increase the masking rate from 15% to 30%.

Both models are trained with a three-phase process. First, we train on 1.7T tokens at a sequence length of 1024. We then adopt a long-context adaptation phase, training on 250B tokens at a sequence length of 8192, while keeping the total tokens seen per batch more or less consistent by lowering the batch size. Finally, we perform annealing on 50 billion tokens sampled differently, following the long-context extension ideal mix highlighted by ProLong.

Training in three phases is our way of ensuring our model is good across the board, which is reflected in its results: it is competitive on long-context tasks, at no cost to its ability to process short context…

… But it has another benefit: for the first two phases, we train using a constant learning rate once the warmup phase is complete, and only perform learning rate decay on the final 50 billion tokens, following the Trapezoidal (or Warmup-Stable-Decay) learning rate schedule. And what’s more: we will release every single intermediate checkpoint from these stable phases, inspired by Pythia. Our main reason for doing so was supporting future research and applications: anyone is free to restart training from any of our pre-decay checkpoints, and perform annealing on domain-appropriate data for their intended use!
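
A small sketch of the trapezoidal (Warmup-Stable-Decay) learning-rate shape mentioned above; the step counts and peak learning rate are placeholders, not the values used in training.

def wsd_lr(step: int, warmup: int, stable: int, decay: int, peak: float = 8e-4) -> float:
    if step < warmup:                       # linear warmup
        return peak * step / warmup
    if step < warmup + stable:              # long constant plateau
        return peak
    if step < warmup + stable + decay:      # linear decay on the final tokens
        return peak * (1 - (step - warmup - stable) / decay)
    return 0.0

for s in (0, 500, 10_000, 95_000, 100_000):
    print(s, round(wsd_lr(s, warmup=1_000, stable=90_000, decay=9_000), 6))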

The tricks, it’s all about the tricks!
If you’ve made it this far into this announcement, you’re probably used to this: of course, we use tricks to make things quicker here too. To be precise, we have two main tricks.

Let’s start with the first one, which is pretty common: since the initial training steps are updating random weights, we adopt batch-size warmup: we start with a smaller batch size so the same number of tokens update the model weights more often, then gradually increase the batch size to the final training size. This significantly speeds up the initial phase of model training, where the model learns its most basic understanding of language.

The second trick is far more uncommon: weight initialization via tiling for the larger model size, inspired by Microsoft’s Phi family of models. This one’s based on the following realization: Why initialize the ModernBERT-large’s initial weights with random numbers when we have a perfectly good (if we dare say so ourselves) set of ModernBERT-base weights just sitting there?

And indeed, it turns out that tiling ModernBERT-base’s weights across ModernBERT-large works better than initializing from random weights. It also has the added benefit of stacking nicely with batch size warmup for even faster initial training.

Conclusion
In this blog post we introduced the ModernBERT models, a new state-of-the-art family of small and efficient encoder-only models, finally giving BERT a much needed do-over.

ModernBERT demonstrates that encoder-only models can be improved by modern methods. They continue to offer very strong performance on some tasks, providing an extremely attractive size/performance ratio.

More than anything, we’re really looking forward to seeing what creative ways to use these models the community will come up with! To encourage this, we’re opening a call for demos until January 10th, 2025: the 5 best ones will get added to this post in a showcase section and win a $100 (or local currency equivalent) Amazon gift card, as well as a 6-month HuggingFace Pro subscription! If you need a hint to get started, here’s a demo we thought about: code similarity HF space! And remember, this is an encoder model, so all the coolest downstream applications will likely require some sort of fine-tuning (on real or perhaps decoder-model synthetic data?). Thankfully, there's lots of cool frameworks out there to support fine-tuning encoders: 🤗Transformers itself for various tasks, including classification, GliNER for zero-shot Named Entity Recognition, or Sentence-Transformers for retrieval and similarity tasks!



###
https://aistudio.google.com/prompts/new_chat?model=gemini-2.0-flash-thinking-exp-1219
Well played! Google DeepMind just released a Gemini preview of an OpenAI o1-like reasoning mode, and it is not hiding its Chain of Thought! 😍
Gemini 2.0 Flash Thinking is an experimental model trained to think out loud, leading to stronger reasoning performance. It is natively multimodal and supports image inputs.
You can try it in AI studio, available for everyone

focus on model=gemini-2.0-flash-thinking-exp-1219

Introducing Gemini 2.0: our new AI model for the agentic era
Dec 11, 2024 · 11 min read

Sundar Pichai, CEO of Google and Alphabet
Demis Hassabis, CEO of Google DeepMind
Koray Kavukcuoglu, CTO of Google DeepMind

A note from Google and Alphabet CEO Sundar Pichai:

Information is at the core of human progress. It’s why we’ve focused for more than 26 years on our mission to organize the world’s information and make it accessible and useful. And it’s why we continue to push the frontiers of AI to organize that information across every input and make it accessible via any output, so that it can be truly useful for you.

That was our vision when we introduced Gemini 1.0 last December. The first model built to be natively multimodal, Gemini 1.0 and 1.5 drove big advances with multimodality and long context to understand information across text, video, images, audio and code, and process a lot more of it.

Now millions of developers are building with Gemini. And it’s helping us reimagine all of our products — including all 7 of them with 2 billion users — and to create new ones. NotebookLM is a great example of what multimodality and long context can enable for people, and why it’s loved by so many.

Over the last year, we have been investing in developing more agentic models, meaning they can understand more about the world around you, think multiple steps ahead, and take action on your behalf, with your supervision.

Today we’re excited to launch our next era of models built for this new agentic era: introducing Gemini 2.0, our most capable model yet. With new advances in multimodality — like native image and audio output — and native tool use, it will enable us to build new AI agents that bring us closer to our vision of a universal assistant.

We’re getting 2.0 into the hands of developers and trusted testers today. And we’re working quickly to get it into our products, leading with Gemini and Search. Starting today our Gemini 2.0 Flash experimental model will be available to all Gemini users. We're also launching a new feature called Deep Research, which uses advanced reasoning and long context capabilities to act as a research assistant, exploring complex topics and compiling reports on your behalf. It's available in Gemini Advanced today.

No product has been transformed more by AI than Search. Our AI Overviews now reach 1 billion people, enabling them to ask entirely new types of questions — quickly becoming one of our most popular Search features ever. As a next step, we’re bringing the advanced reasoning capabilities of Gemini 2.0 to AI Overviews to tackle more complex topics and multi-step questions, including advanced math equations, multimodal queries and coding. We started limited testing this week and will be rolling it out more broadly early next year. And we’ll continue to bring AI Overviews to more countries and languages over the next year.

2.0’s advances are underpinned by decade-long investments in our differentiated full-stack approach to AI innovation. It’s built on custom hardware like Trillium, our sixth-generation TPUs. TPUs powered 100% of Gemini 2.0 training and inference, and today Trillium is generally available to customers so they can build with it too.

If Gemini 1.0 was about organizing and understanding information, Gemini 2.0 is about making it much more useful. I can’t wait to see what this next era brings.

-Sundar

Introducing Gemini 2.0: our new AI model for the agentic era
By Demis Hassabis, CEO of Google DeepMind and Koray Kavukcuoglu, CTO of Google DeepMind on behalf of the Gemini team

Over the past year, we have continued to make incredible progress in artificial intelligence. Today, we are releasing the first model in the Gemini 2.0 family of models: an experimental version of Gemini 2.0 Flash. It’s our workhorse model with low latency and enhanced performance at the cutting edge of our technology, at scale.

We are also sharing the frontiers of our agentic research by showcasing prototypes enabled by Gemini 2.0’s native multimodal capabilities.

Gemini 2.0 Flash
Gemini 2.0 Flash builds on the success of 1.5 Flash, our most popular model yet for developers, with enhanced performance at similarly fast response times. Notably, 2.0 Flash even outperforms 1.5 Pro on key benchmarks, at twice the speed. 2.0 Flash also comes with new capabilities. In addition to supporting multimodal inputs like images, video and audio, 2.0 Flash now supports multimodal output like natively generated images mixed with text and steerable text-to-speech (TTS) multilingual audio. It can also natively call tools like Google Search, code execution as well as third-party user-defined functions.

A chart comparing Gemini models and their capabilities
Our goal is to get our models into people’s hands safely and quickly. Over the past month, we’ve been sharing early, experimental versions of Gemini 2.0, getting great feedback from developers.

Gemini 2.0 Flash is available now as an experimental model to developers via the Gemini API in Google AI Studio and Vertex AI with multimodal input and text output available to all developers, and text-to-speech and native image generation available to early-access partners. General availability will follow in January, along with more model sizes.

To help developers build dynamic and interactive applications, we’re also releasing a new Multimodal Live API that has real-time audio, video-streaming input and the ability to use multiple, combined tools. More information about 2.0 Flash and the Multimodal Live API can be found in our developer blog.

Gemini 2.0 available in Gemini app, our AI assistant
Also starting today, Gemini users globally can access a chat optimized version of 2.0 Flash experimental by selecting it in the model drop-down on desktop and mobile web and it will be available in the Gemini mobile app soon. With this new model, users can experience an even more helpful Gemini assistant.

Early next year, we’ll expand Gemini 2.0 to more Google products.

Unlocking agentic experiences with Gemini 2.0
Gemini 2.0 Flash’s native user interface action-capabilities, along with other improvements like multimodal reasoning, long context understanding, complex instruction following and planning, compositional function-calling, native tool use and improved latency, all work in concert to enable a new class of agentic experiences.

The practical application of AI agents is a research area full of exciting possibilities. We’re exploring this new frontier with a series of prototypes that can help people accomplish tasks and get things done. These include an update to Project Astra, our research prototype exploring future capabilities of a universal AI assistant; the new Project Mariner, which explores the future of human-agent interaction, starting with your browser; and Jules, an AI-powered code agent that can help developers.

We’re still in the early stages of development, but we’re excited to see how trusted testers use these new capabilities and what lessons we can learn, so we can make them more widely available in products in the future.

Gemini 2.0 supercut video (2:53)
Project Astra: agents using multimodal understanding in the real world
Since we introduced Project Astra at I/O, we’ve been learning from trusted testers using it on Android phones. Their valuable feedback has helped us better understand how a universal AI assistant could work in practice, including implications for safety and ethics. Improvements in the latest version built with Gemini 2.0 include:

Better dialogue: Project Astra now has the ability to converse in multiple languages and in mixed languages, with a better understanding of accents and uncommon words.
New tool use: With Gemini 2.0, Project Astra can use Google Search, Lens and Maps, making it more useful as an assistant in your everyday life.
Better memory: We’ve improved Project Astra’s ability to remember things while keeping you in control. It now has up to 10 minutes of in-session memory and can remember more conversations you had with it in the past, so it is better personalized to you.
Improved latency: With new streaming capabilities and native audio understanding, the agent can understand language at about the latency of human conversation.
We’re working to bring these types of capabilities to Google products like Gemini app, our AI assistant, and to other form factors like glasses. And we’re starting to expand our trusted tester program to more people, including a small group that will soon begin testing Project Astra on prototype glasses.

Project Astra demo video (4:32)
Project Mariner: agents that can help you accomplish complex tasks
Project Mariner is an early research prototype built with Gemini 2.0 that explores the future of human-agent interaction, starting with your browser. As a research prototype, it’s able to understand and reason across information in your browser screen, including pixels and web elements like text, code, images and forms, and then uses that information via an experimental Chrome extension to complete tasks for you.

When evaluated against the WebVoyager benchmark, which tests agent performance on end-to-end real-world web tasks, Project Mariner achieved a state-of-the-art result of 83.5% when operating as a single-agent setup.

It’s still early, but Project Mariner shows that it’s becoming technically possible to navigate within a browser, even though it’s not always accurate and can be slow to complete tasks today, something that will improve rapidly over time.

To build this safely and responsibly, we’re conducting active research on new types of risks and mitigations, while keeping humans in the loop. For example, Project Mariner can only type, scroll or click in the active tab on your browser and it asks users for final confirmation before taking certain sensitive actions, like purchasing something.

Trusted testers are starting to test Project Mariner using an experimental Chrome extension now, and we’re beginning conversations with the web ecosystem in parallel.

Mariner demo video (2:15)
Jules: agents for developers
Next, we’re exploring how AI agents can assist developers with Jules — an experimental AI-powered code agent that integrates directly into a GitHub workflow. It can tackle an issue, develop a plan and execute it, all under a developer’s direction and supervision. This effort is part of our long-term goal of building AI agents that are helpful in all domains, including coding.

More information about this ongoing experiment can be found in our developer blog post.

Agents in games and other domains
Google DeepMind has a long history of using games to help AI models become better at following rules, planning and logic. Just last week, for example, we introduced Genie 2, our AI model that can create an endless variety of playable 3D worlds — all from a single image. Building on this tradition, we’ve built agents using Gemini 2.0 that can help you navigate the virtual world of video games. They can reason about the game based solely on the action on the screen and offer up suggestions for what to do next in real-time conversation.

We're collaborating with leading game developers like Supercell to explore how these agents work, testing their ability to interpret rules and challenges across a diverse range of games, from strategy titles like “Clash of Clans” to farming simulators like “Hay Day.”

Beyond acting as virtual gaming companions, these agents can even tap into Google Search to connect you with the wealth of gaming knowledge on the web.

Navi demo video (2:26)
In addition to exploring agentic capabilities in the virtual world, we’re experimenting with agents that can help in the physical world by applying Gemini 2.0's spatial reasoning capabilities to robotics. While it’s still early, we’re excited about the potential of agents that can assist in the physical environment.

You can learn more about these research prototypes and experiments at labs.google.

Building responsibly in the agentic era
Gemini 2.0 Flash and our research prototypes allow us to test and iterate on new capabilities at the forefront of AI research that will eventually make Google products more helpful.

As we develop these new technologies, we recognize the responsibility it entails, and the many questions AI agents open up for safety and security. That is why we are taking an exploratory and gradual approach to development, conducting research on multiple prototypes, iteratively implementing safety training, working with trusted testers and external experts and performing extensive risk assessments and safety and assurance evaluations.

For example:

As part of our safety process, we’ve worked with our Responsibility and Safety Committee (RSC), our longstanding internal review group, to identify and understand potential risks.
Gemini 2.0's reasoning capabilities have enabled major advancements in our AI-assisted red teaming approach, including the ability to go beyond simply detecting risks to now automatically generating evaluations and training data to mitigate them. This means we can more efficiently optimize the model for safety at scale.
As Gemini 2.0’s multimodality increases the complexity of potential outputs, we’ll continue to evaluate and train the model across image and audio input and output to help improve safety.
With Project Astra, we’re exploring potential mitigations against users unintentionally sharing sensitive information with the agent, and we’ve already built in privacy controls that make it easy for users to delete sessions. We’re also continuing to research ways to ensure AI agents act as reliable sources of information and don’t take unintended actions on your behalf.
With Project Mariner, we’re working to ensure the model learns to prioritize user instructions over 3rd party attempts at prompt injection, so it can identify potentially malicious instructions from external sources and prevent misuse. This prevents users from being exposed to fraud and phishing attempts through things like malicious instructions hidden in emails, documents or websites.
We firmly believe that the only way to build AI is to be responsible from the start and we'll continue to prioritize making safety and responsibility a key element of our model development process as we advance our models and agents.

Gemini 2.0, AI agents and beyond
Today’s releases mark a new chapter for our Gemini model. With the release of Gemini 2.0 Flash, and the series of research prototypes exploring agentic possibilities, we have reached an exciting milestone in the Gemini era. And we’re looking forward to continuing to safely explore all the new possibilities within reach as we build towards AGI.

###
https://huggingface.co/collections/ibm-granite/granite-31-language-models-6751dbbf2f3389bec5c6f02d

Granite 3.1 Update! IBM just updated their Granite LLM family with longer context, better RAG, and function calling! Here are the highlights: ✨
📏 Extended context from 32K to 128K
🔄 Custom <document> role for improved RAG
🛠️ Improved function-calling with <tool_call> schema
🌐 Maintains support for 12 languages while offering finetuning flexibility
🤗 Apache 2.0 license and available on Hugging Face
🌎 Multilingual 12 languages, including English, German, Spanish, French…
4️⃣ 8 new checkpoints (base and instruct) for 8B, 2B, 3B-A0.8B, and 1B-A0.4B
❌ No official updated evals or benchmarks found

Granite-3.1-8B-Instruct
Model Summary: Granite-3.1-8B-Instruct is an 8B-parameter long-context instruct model finetuned from Granite-3.1-8B-Base using a combination of open source instruction datasets with permissive licenses and internally collected synthetic datasets tailored for solving long context problems. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging.

Developers: Granite Team, IBM
GitHub Repository: ibm-granite/granite-3.1-language-models
Website: Granite Docs
Paper: Granite 3.1 Language Models (coming soon)
Release Date: December 18th, 2024
License: Apache 2.0
Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 3.1 models for languages beyond these 12 languages.

Intended Use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.

Capabilities

Summarization
Text classification
Text extraction
Question-answering
Retrieval Augmented Generation (RAG)
Code related tasks
Function-calling tasks
Multilingual dialog use cases
Long-context tasks including long document/meeting summarization, long document QA, etc.
Generation: This is a simple example of how to use the Granite-3.1-8B-Instruct model.

Install the following libraries:

pip install torch torchvision torchaudio
pip install accelerate
pip install transformers

Then, copy the snippet from the section that is relevant for your use case.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ibm-granite/granite-3.1-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
model.eval()

# change input text as desired
chat = [
    {"role": "user", "content": "Please list one IBM Research laboratory located in the United States. You should only output its name and location."},
]
chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
# tokenize the text and move it to the same device as the model
input_tokens = tokenizer(chat, return_tensors="pt").to(model.device)
# generate output tokens
output = model.generate(**input_tokens, max_new_tokens=100)
# decode output tokens into text
output = tokenizer.batch_decode(output)
# print output
print(output)
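
The highlights above also mention an improved <tool_call> schema for function calling. The snippet below is a hedged sketch of exercising it through the Hugging Face chat template, reusing the model and tokenizer loaded above and assuming a recent transformers release whose apply_chat_template accepts a tools argument; the tool definition format and the expected output should be checked against the official Granite documentation.

# Hypothetical function-calling sketch (reuses `model` and `tokenizer` from above).
# The tool is described as a JSON schema; the Granite chat template is expected to
# render it into the <tool_call> format mentioned in the highlights.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a given city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "Name of the city"}},
            "required": ["city"],
        },
    },
}
chat = [{"role": "user", "content": "What is the weather like in Boston right now?"}]
prompt = tokenizer.apply_chat_template(
    chat, tools=[weather_tool], tokenize=False, add_generation_prompt=True
)
input_tokens = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**input_tokens, max_new_tokens=100)
# The decoded continuation should contain a structured tool call rather than a direct answer.
print(tokenizer.batch_decode(output[:, input_tokens["input_ids"].shape[1]:]))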

Model Architecture: Granite-3.1-8B-Instruct is based on a decoder-only dense transformer architecture. Core components of this architecture are: GQA and RoPE, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| Model | 2B Dense | 8B Dense | 1B MoE | 3B MoE |
|---|---|---|---|---|
| Embedding size | 2048 | 4096 | 1024 | 1536 |
| Number of layers | 40 | 40 | 24 | 32 |
| Attention head size | 64 | 128 | 64 | 64 |
| Number of attention heads | 32 | 32 | 16 | 24 |
| Number of KV heads | 8 | 8 | 8 | 8 |
| MLP hidden size | 8192 | 12800 | 512 | 512 |
| MLP activation | SwiGLU | SwiGLU | SwiGLU | SwiGLU |
| Number of experts | — | — | 32 | 40 |
| MoE TopK | — | — | 8 | 8 |
| Initialization std | 0.1 | 0.1 | 0.1 | 0.1 |
| Sequence length | 128K | 128K | 128K | 128K |
| Position embedding | RoPE | RoPE | RoPE | RoPE |
| # Parameters | 2.5B | 8.1B | 1.3B | 3.3B |
| # Active parameters | 2.5B | 8.1B | 400M | 800M |
| # Training tokens | 12T | 12T | 10T | 10T |
Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive license, (2) internal synthetic data targeting specific capabilities including long-context tasks, and (3) very small amounts of human-curated data. A detailed attribution of datasets can be found in the Granite 3.0 Technical Report, Granite 3.1 Technical Report (coming soon), and Accompanying Author List.

Infrastructure: We train Granite 3.1 Language Models using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 3.1 Instruct Models are primarily finetuned using instruction-response pairs, mostly in English but also including multilingual data covering eleven languages. Although this model can handle multilingual dialog use cases, its performance might not match that on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources

⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources

###
https://arxiv.org/abs/2412.12004
[Submitted on 16 Dec 2024]
The Open Source Advantage in Large Language Models (LLMs)
Jiya Manchanda, Laura Boettcher, Matheus Westphalen, Jasser Jasser
Large language models (LLMs) mark a key shift in natural language processing (NLP), having advanced text generation, translation, and domain-specific reasoning. Closed-source models like GPT-4, powered by proprietary datasets and extensive computational resources, lead with state-of-the-art performance today. However, they face criticism for their "black box" nature and for limiting accessibility in a manner that hinders reproducibility and equitable AI development. By contrast, open-source initiatives like LLaMA and BLOOM prioritize democratization through community-driven development and computational efficiency. These models have significantly reduced performance gaps, particularly in linguistic diversity and domain-specific applications, while providing accessible tools for global researchers and developers. Notably, both paradigms rely on foundational architectural innovations, such as the Transformer framework by Vaswani et al. (2017). Closed-source models excel by scaling effectively, while open-source models adapt to real-world applications in underrepresented languages and domains. Techniques like Low-Rank Adaptation (LoRA) and instruction-tuning datasets enable open-source models to achieve competitive results despite limited resources. To be sure, the tension between closed-source and open-source approaches underscores a broader debate on transparency versus proprietary control in AI. Ethical considerations further highlight this divide. Closed-source systems restrict external scrutiny, while open-source models promote reproducibility and collaboration but lack standardized auditing documentation frameworks to mitigate biases. Hybrid approaches that leverage the strengths of both paradigms are likely to shape the future of LLM innovation, ensuring accessibility, competitive technical performance, and ethical deployment.
The Open-Source Advantage in Large Language Models (LLMs)
Jiya Manchanda (Department of Philosophy and Religion, Rollins College, Winter Park, jmanchanda@rollins.edu)*, Laura Boettcher (Department of Mathematics and Computer Science, Rollins College, Winter Park, lboettcher@rollins.edu)*, Matheus Westphalen (Department of Mathematics and Computer Science, Rollins College, Winter Park, mwestphalen@rollins.edu)*, Jasser Jasser (Department of Mathematics and Computer Science, Rollins College, Winter Park, jjasser@rollins.edu)
*Equal contribution

Keywords: Open-source models, Large Language Models (LLMs), Transparency

1 Introduction
Large language models (LLMs) stand at the forefront of connectionist artificial intelligence (AI) today, having revolutionized the realm of natural language processing (NLP) and driven advancements in areas such as text generation, translation, domain-specific inferencing, and sentiment analysis [1]. These models, built on cutting-edge neural architectures and trained on expansive datasets, have not only become indispensable tools in industry but also focal points of academic investigation. However, their rapid development has brought critical issues into focus—chief among them is the tension between open-source and closed-source approaches [2]. This underexplored division (in part, of labor) presents urgent questions about transparency, accessibility, and the equitable distribution of the benefits of AI, and will be the subject of much consideration in this paper.

The closed-source approach, typified by models like OpenAI’s GPT-4, has thus far dominated benchmarks for LLMs [3]. Their ability to excel stems squarely from the proprietary datasets that they are trained on (in addition to significant computational investments), which have enabled advanced capabilities in reasoning, text synthesis, and conversational AI [4]. Despite their technical achievements, though, these models are often criticized for their lack of transparency [5]. The proprietary nature of their design, development, and data restricts access to methodologies and findings, and has raised concerns about the reproducibility of their outcomes. Moreover, the concentration of these resources within a small number of organizations is claimed to have exacerbated inequities in global AI development. Needless to say, this dynamic has restricted many researchers and practitioners from competing effectively or even building upon these systems [6].

It is in light of the aforementioned trajectory of development that open-source LLMs, including LLaMA and BLOOM, have emerged as powerful alternatives. These models operate on the ethos of accessibility and community-driven innovation, and now offer researchers and developers the tools to advance NLP without vast computational resources [7]. By way of innovations in fine-tuning, such as low-rank adaptation (LoRA) and domain-specific optimization, open-source models have significantly narrowed performance gaps in recent months [8]. Projects like BLOOM’s multilingual framework demonstrate the capacity of open-source efforts to address linguistic diversity and real-world complexity, while models like LLaMA highlight the feasibility of achieving high performance with computational efficiency. These initiatives underscore the role of open collaboration in broadening the scope of AI research and ensuring a more equitable distribution of its benefits.

The divide between open- and closed-source models is best framed within the historical development of LLMs, which provides essential context for understanding their current capabilities. Early statistical language models, inspired by Shannon’s Information Theory, relied on probabilistic methods to predict word sequences [9]. While these models laid the groundwork for computational linguistics, they were limited in handling long-range dependencies and capturing the semantic properties of pragmatic language use [10]. The transition to neural network-based models in the 2000s marked a significant advancement. Word embeddings like Word2Vec and GloVe enabled dense vector representations of words, which improved our ability to model such semantic relationships [11]. The development of Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks further expanded this capacity for processing sequential data and capturing temporal dependencies [12].

However, these architectures struggled with scalability when handling large datasets. The introduction of the Transformer architecture in 2017 by Vaswani et al. shifted the paradigm of NLP insofar as it overcame these limitations to a statistically significant degree [13]. Transformers, with their self-attention mechanisms, allowed for parallel sequence processing and the corresponding modeling of long-range dependencies, which have since become the backbone of modern LLMs. Subsequent innovations built upon this Transformer framework. Simply look to BERT’s bidirectional training approach, which enhanced performance on tasks like question answering and natural language inference, as well as GPT’s autoregressive design, which excelled in text generation and summarization [14]. The release of the closed-source GPT-3, which came bearing 175 billion parameters, demonstrated how scaling model size could improve generalization abilities and enable few-shot learning across highly varied tasks [15]. It was not long before open-source models such as BLOOM and LLaMA met those same benchmarks, though. This achievement quickly showcased how high performance can be attained without dependence on massive scale or proprietary frameworks.

In examining this dynamic between open- and closed-source LLMs, this paper seeks to elucidate the core issues that will shape the trajectory of developments in computational linguistics. Section 2 explores the innovation and development processes underpinning open-source and closed-source models, highlighting key breakthroughs and limitations. Section 3 evaluates the comparative performance of these models, focusing on benchmarks and task-specific outcomes. Section 4 addresses accessibility and use cases, specifically assessing how the availability and practical applications of these systems impact various stakeholders. Section 5 delves into ethical considerations concerning transparency, scrutinizing the implications of proprietary practices and open collaboration. Section 6 discusses the broader implications of the open-versus-closed divide, integrating findings from previous sections. Finally, Section 7 outlines potential directions for future research, proposing a mechanism of how to best foster innovation while ensuring equitable AI deployment and governance.

2 Innovation and Development
The innovation and development of Large Language Models (LLMs) are marked by (a) foundational architectural changes and (b) refined training methodologies. As noted, the Transformer architecture, introduced by Vaswani et al. in 2017, fundamentally changed the way machine learning models process sequences of data. Previous models, such as recurrent neural networks (RNNs) and long short-term memory (LSTM) networks, operated sequentially. This means that they processed one word or token at a time, with each step depending on the previous output. While effective for short-term dependencies, these architectures struggled to efficiently handle long-range relationships in data. As a result, they were challenged by vanishing gradients and slow training times. Transformers addressed these limitations by introducing a mechanism called self-attention [16]. This mechanism allows the model to evaluate the relationship between all tokens in a sequence simultaneously, rather than step-by-step. So, for example, when processing a sentence, a Transformer can determine the importance of every word relative to the others in a single compute. This parallel processing capability reduces computational bottlenecks by allowing for faster training and, by extension, inference. Moreover, self-attention enables Transformers to model long-range dependencies. In natural language, the meaning of a word or phrase often depends on context from distant parts of the text. Transformers excel in capturing these relationships, which is critical for tasks like summarization, translation, and complex reasoning.
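
As a concrete illustration of the mechanism described above, the following sketch implements single-head scaled dot-product self-attention on toy tensors (no masking, no multi-head machinery); it is meant only to show how every token attends to every other token in one parallel step.

# A minimal sketch of scaled dot-product self-attention (single head, no masking).
import torch

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_head) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # project tokens to queries/keys/values
    scores = q @ k.T / k.shape[-1] ** 0.5        # all pairwise token-token similarities
    weights = torch.softmax(scores, dim=-1)      # each row is an attention distribution
    return weights @ v                           # context-mixed token representations

x = torch.randn(5, 16)                           # 5 tokens, toy model width 16
w_q, w_k, w_v = (torch.randn(16, 8) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)    # torch.Size([5, 8])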

Of course, the architecture’s scalability naturally facilitated its adoption in large-scale closed-source models like GPT-3 and GPT-4. Yet, it was not lost on open-source models including BLOOM and LLaMA to leverage the Transformer framework for the purpose of achieving competitive performance via computational efficiency. Subsidiary incremental developments upon this foundation like LLaMA’s grouped query attention (GQA) reduced memory demands by sharing attention weights, which allowed for performance gains without exorbitant resource requirements. In a similar fashion, Flash Attention optimized for training speed and energy efficiency by reducing the computational complexity of self-attention operations [17]. These advancements sufficiently demonstrate how open-source models may meet closed-source models on the question of scalability by innovating within architectural constraints.
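
To make the grouped query attention idea concrete, here is a toy sketch in which several query heads share a smaller set of key/value heads, which is what shrinks the memory footprint of the KV cache; the shapes and head counts are illustrative only.

# A minimal sketch of grouped-query attention (GQA): 8 query heads share 2 KV heads.
import torch

def gqa(q, k, v, n_q_heads=8, n_kv_heads=2):
    # q: (seq, n_q_heads, d); k, v: (seq, n_kv_heads, d)
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)        # each KV head serves `group` query heads
    v = v.repeat_interleave(group, dim=1)
    scores = torch.einsum("qhd,khd->hqk", q, k) / q.shape[-1] ** 0.5
    return torch.einsum("hqk,khd->qhd", scores.softmax(-1), v)

seq, d = 6, 16
out = gqa(torch.randn(seq, 8, d), torch.randn(seq, 2, d), torch.randn(seq, 2, d))
print(out.shape)  # torch.Size([6, 8, 16])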

The evolution of training techniques has been equally pivotal. Closed-source models often utilize proprietary datasets encompassing billions of tokens, which renders them able to achieve comprehensive pretraining objectives. A key instance of this phenomenon would be autoregressive modeling, which, employed in GPT, trains models to predict the next token in a sequence [18]. In so doing, it reinforces the coherence and fluency of its outcomes. Similarly, masked language modeling, employed in BERT, predicts missing tokens in a sentence, fostering a deeper understanding of bidirectional context [19]. These methods allow models to capture the nuanced language patterns and semantic relationships that form the basis of their strong generalization capabilities [20]. It is important to note that Reinforcement Learning from Human Feedback (RLHF) enhances this process by incorporating human evaluative feedback directly into training loops. In RLHF, the models are fine-tuned to align their outputs with human-defined preferences, thus improving both accuracy and alignment with values [21]. This approach is particularly impactful in domains where value-based considerations stand at the forefront of decision-making, such as healthcare and governance.
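
The two pretraining objectives can be written down in a few lines. The sketch below is a toy illustration, with random logits standing in for a model, of how the autoregressive and masked-language-modeling losses differ; it is not a training recipe.

# Toy illustration of the two pretraining objectives described above.
import torch
import torch.nn.functional as F

vocab, seq = 100, 8
tokens = torch.randint(0, vocab, (seq,))
logits = torch.randn(seq, vocab)                 # stand-in for model outputs

# Autoregressive (GPT-style): position t predicts token t+1.
ar_loss = F.cross_entropy(logits[:-1], tokens[1:])

# Masked LM (BERT-style): predict tokens at randomly masked positions only.
mask = torch.rand(seq) < 0.15
mlm_loss = F.cross_entropy(logits[mask], tokens[mask]) if mask.any() else torch.tensor(0.0)
print(ar_loss.item(), mlm_loss.item())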

In response to these advancements, open-source models have introduced training approaches that effectively balance computational efficiency with performance optimization. Low-Rank Adaptation (LoRA), for example, reduces the computational burden during fine-tuning by updating only a subset of task-specific parameters. This method enables smaller organizations to tailor large models to specific applications without incurring the substantial costs associated with full model retraining. Moreover, publicly available instruction-tuning datasets provide open-source models with an alternative to Reinforcement Learning from Human Feedback (RLHF). These datasets allow models to adapt to highly varied tasks by leveraging clearly structured instructions. Quantization techniques further enhance their efficiency by reducing the precision of model weights. Now, model distillation complements this by compressing large models into smaller, more efficient versions that retain their essential capabilities. These compressed models are particularly valuable for edge-case applications where computational resources are limited.
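
Since LoRA recurs throughout this discussion, a minimal sketch of the idea may help: the pretrained weight is frozen and only a low-rank update is trained. The layer sizes and hyperparameters below are illustrative, not taken from any particular paper.

# A minimal sketch of LoRA: freeze a pretrained linear layer and learn a
# low-rank update B @ A, so only r*(d_in + d_out) parameters are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # frozen pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8192 trainable parameters vs 262,656 in the frozen base layer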

Notably, it is the real-world applications of LLMs that illustrate the distinct trajectories of open- and closed-source models. Closed-source LLMs dominate high-stakes domains such as conversational AI, creative content generation, and advanced reasoning. For example, GPT-3 was widely adopted for automated content creation, customer support, and other on-demand tasks. Conversely, open-source models rely on their adaptive capabilities to meet inclusivity metrics given their access to expansive training. A case in point is BLOOM’s multilingual framework that supports over 40 languages. As such, open-source models have the potential to, and already are, surpassing closed-source models on the question of diverse datasets and how those contribute to global NLP research. Collaboration between LLMs and external tools has further expanded their utility. Retrieval-Augmented Generation (RAG), for instance, integrates information retrieval systems with generative models, enabling real-time access to updated knowledge [22]. Such systems are particularly effective in fields like healthcare, law, and finance, where timely and accurate information is critical [23]. By building on modular architectures and publicly accessible datasets, open-source models have been able to achieve comparable functionality to closed-source systems.
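
As a schematic of the retrieve-then-generate pattern described here, the sketch below builds a grounded prompt from the top-scoring passages; the retriever is a toy word-overlap scorer, whereas real RAG systems use dense embeddings and a vector index.

# A schematic sketch of Retrieval-Augmented Generation (RAG): retrieve relevant
# passages for a query, then condition the generator on them.
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    def score(doc: str) -> int:
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(corpus, key=score, reverse=True)[:k]

def rag_prompt(query: str, corpus: list[str]) -> str:
    context = "\n".join(f"- {d}" for d in retrieve(query, corpus))
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {query}\nAnswer:"

corpus = ["LoRA fine-tunes low-rank adapters.", "BLOOM supports 46 natural languages."]
print(rag_prompt("How many languages does BLOOM support?", corpus))
# The resulting prompt would then be passed to any generative LLM.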

3 Performance
The performance of open-source and closed-source Large Language Models (LLMs) is indicative of their underlying architecture, pre-training datasets, and optimization strategies, but understanding their comparative strengths requires a closer evaluation. Closed-source models, such as GPT-4, dominate state-of-the-art benchmarks like HumanEval and GSM8K [24]. This is due in part to their ability to leverage proprietary datasets and in part a result of their expansive pre-training corpora. The datasets on which these models are trained span hundreds of terabytes, which allows them to excel at a wide range of domain-general tasks with minimal fine-tuning [25]. This is precisely the reason that closed-source models possess superior generalization capabilities when it comes to generative tasks, such as creative writing and textual summarization. Look also to GPT-4’s parameter count of one trillion. It is this critical component, coupled with chain-of-thought prompting, that underlies o1’s ability to outperform the vast majority of open-source models on multi-step problem-solving and step-by-step reasoning [26]. However, the dominance of closed-source models is constrained by marked methodological challenges. Data contamination, i.e., overlaps between training and evaluation datasets, has been claimed to compromise the reliability of benchmarks for closed-source models [27]. The lack of transparency in proprietary datasets compounds this issue, leaving many results difficult to validate independently [28].

Despite these advantages, open-source LLMs have made significant progress in closing the performance gap. Techniques like Low-Rank Adaptation (LoRA) and Conditioned Reinforcement Learning Fine-Tuning (C-RLFT) have been instrumental in this advancement. LoRA’s capacity to selectively fine-tune parameters, for example, minimizes computational costs while maintaining high domain-specific accuracy with competitive results on benchmarks such as GSM8K. NVIDIA’s NVLM 1.0D 72B model exemplifies domain-specific excellence, achieving a 4.3-point improvement in mathematical reasoning and coding tasks through multimodal training [29]. Unlike models such as InternVL2-Llama3-76B, which exhibit degraded text-based performance after multimodal training, NVLM not only preserves but also enhances its textual capabilities. This robustness enables NVLM to handle complex domain-specific inputs, such as handwritten pseudocode or location-specific queries, underscoring its precision within specialized contexts. Its performance provides a proof-of-concept for domain-specific models to complement domain-general systems.

Similar outcomes in performance hold true for domain-specific models like StarCoder, which underwent targeted optimization for programming tasks, and ClinicalBERT, which underwent targeted optimization for medical analyses. These platforms often outperform general-purpose closed-source models in benchmarks like HumanEval. Techniques like knowledge distillation and model compression are salient instances of the aforementioned open-source optimization strategies [30]. Knowledge distillation creates smaller, efficient versions of larger models by transferring essential performance traits, while model compression reduces the requirements of computational resources. To be sure, these models underscore how open-source frameworks not only rival but increasingly exceed closed-source models in specialized applications, while maintaining their stronghold on accessibility to resource-constrained communities.
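
For readers unfamiliar with knowledge distillation, the following sketch shows the standard distillation objective, in which a student matches the teacher's softened output distribution alongside the ground-truth labels; the temperature and weighting values are illustrative.

# A minimal sketch of the knowledge-distillation objective.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),   # student's softened predictions
        F.softmax(teacher_logits / T, dim=-1),       # teacher's softened targets
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)   # usual supervised loss
    return alpha * soft + (1 - alpha) * hard

s, t = torch.randn(4, 10), torch.randn(4, 10)
print(distillation_loss(s, t, torch.randint(0, 10, (4,))).item())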

Now although open-source models’ transparency facilitates reproducibility of outcomes, their benchmarking potential is limited by narrower dataset scopes and fewer resources. Current benchmarks often prioritize narrow task-specific metrics, overlooking the complexity and diversity of real-world applications. Resource disparities further skew these outcomes in performance, as closed-source models benefit from high-performance distributed systems, whereas open-source models must optimize performance through collaborative resource pooling and parameter-efficient strategies. Overcoming these limitations would require the development of parameter-normalized and task-agnostic evaluation frameworks to enable more comprehensive comparisons.

4 Accessibility
The question of how accessible Large Language Models (LLMs) are varies substantially between open-source and closed-source systems in a way that influences their deployment across a range of applications. Open-source initiatives such as LLaMA and BLOOM have been instrumental in democratizing LLM technology. LLaMA achieves this by enabling researchers to run advanced NLP tasks on single GPUs [31]. Optimizing for smaller model sizes lowers computational barriers while maintaining high benchmark performance, particularly in reasoning and mathematical tasks, and allows LLaMA to often surpass larger closed-source counterparts like GPT-3 and Chinchilla. Similarly, BLOOM was developed collaboratively by over 1,000 researchers; it supports 46 natural languages and 13 programming languages [32]. Its public release foregrounded the potential of open science to expand LLM applications globally and foster inclusion for underrepresented linguistic communities.

Techniques like Low-Rank Adaptation (LoRA) further enhance such accessibility by significantly reducing computational requirements for fine-tuning. LoRA achieves this by freezing most model parameters and optimizing a low-rank adaptation of the model weights, enabling developers to tailor LLMs such as GPT-3 for specific tasks with considerably lower resource demands. This mechanism has proven versatile and found applications in areas such as textual summarization and SQL query generation, which are critical for both academic research and industry use [33]. Additionally, smaller and distilled models, such as DistilBERT, play a key role in bridging the gap between cutting-edge NLP capabilities and practical deployment. By reducing the size of BERT by 40% while retaining 97% of its language understanding ability, DistilBERT facilitates real-time, on-device applications [34].

Domain-specific LLMs have also begun to make solutions tailored for specialized fields more accessible. ClinicalBERT, fine-tuned on medical datasets, improves performance in tasks such as named entity recognition (NER) and natural language inference (NLI) for healthcare applications [35]. By releasing ClinicalBERT publicly, researchers have ensured that even organizations with limited resources can access advanced NLP tools to enhance clinical decision-making and patient care. Likewise, LEGAL-BERT and FinBERT address specific needs in the legal and financial sectors. LEGAL-BERT outperforms general-purpose models in tasks such as contract analysis and case law classification [36], while FinBERT excels in sentiment analysis for finance-related tasks, aiding in market trend predictions and supporting strategic financial decisions [37]. These domain-specific models illustrate the versatility of open-source systems in addressing niche requirements while ensuring accessibility for smaller organizations.

Yet, it must be acknowledged that closed-source models like Codex can also expand accessibility within targeted use cases. Integrated into GitHub Copilot, Codex translates natural language inputs into code. This significantly lowers the barrier to entry for non-programmers and enhances efficiency for experienced developers by 55% [38]. By streamlining the software development process and offering educational tools, Codex exemplifies how closed-source models can drive adoption in specific domains despite their proprietary nature. However, the lack of transparency and limited availability of these models often restrict customization, which are hallmarks of open-source initiatives.

5 Transparency
The ethical dimensions of Large Language Models (LLMs) lie at the heart of their societal impact, with transparency emerging as a pivotal factor in evaluating their fairness, accountability, and trustworthiness. The contrasting approaches taken by open-source and closed-source LLMs reveal a fundamental trade-off between visibility and proprietary control. Open-source models offer unmatched access to internal mechanisms but often lack the governance structures necessary for consistent ethical rigor. Conversely, closed-source models protect intellectual property at the expense of public trust and external accountability. This ongoing debate underscores the challenge of balancing innovation with moral standards in the development and deployment of LLMs.

By definition, open-source LLMs promote transparency by providing unrestricted access to their architectures, weights, and training methodologies [39]. This openness allows researchers and developers to scrutinize these models at granular levels, auditing for biases, testing adherence to fairness metrics, and identifying vulnerabilities within their decision-making processes. The communal nature of open-source ecosystems democratizes ethical oversight to a certain extent. Platforms like GitHub and Hugging Face not only disseminate open-source models but also provide accompanying documentation, such as model cards, that outline ethical considerations, known limitations, and appropriate usage contexts. As such, distributed networks of researchers and practitioners can collaboratively review and improve these models, often uncovering latent issues such as algorithmic bias, adversarial weaknesses, or dual-use risks [40]. For instance, the open scrutiny of datasets and fine-tuning protocols has led to significant advancements in understanding and mitigating biases in multilingual models like BLOOM.

Yet, this distributed accountability relies heavily on the quality of available data, and the effectiveness of this transparency is often undermined by inconsistencies in documentation quality. It has been noted that templates are frequently reused, which leads to superficial descriptions of ethical risks and gaps in actionable insights for addressing them. For these reasons, establishing rigorous, standardized frameworks for ethical auditing—including detailed model cards with metrics for bias auditing, fairness evaluations, potential misuse scenarios presented in a standardized schema, and testing of interpretability—is crucial to fully realize the transparency potential of open-source LLMs. Moreover, tools for automated ethical assessments, such as explainability algorithms and adversarial robustness tests, could supplement human-led audits and provide baseline evaluations that align with broader governance standards.

By contrast, closed-source LLMs operate within proprietary frameworks that limit visibility into their internal mechanisms [41]. The lack of access to training datasets, preprocessing pipelines, and decision-making logic restricts third-party audits and independent evaluations to a degree that has effectively rendered them “black boxes.” This opacity exacerbates the difficulty of identifying and addressing biases embedded in these systems. For example, when closed-source models produce outputs that reinforce harmful stereotypes, as was the case with Google’s Gemini model, it remains unclear whether the issue stems from biased training data, flawed objective functions, or other systemic deficiencies [42]. Worse still is the fact that in virtue of the models being closed-source, their developers are not obligated to disclose ethical risks, mitigation strategies, or even the fundamental design principles guiding their models. This lack of disclosure creates a trust deficit, particularly in high-stakes domains such as healthcare diagnostics, legal decision-making, or autonomous transportation systems like self-driving cars, where the consequences of errors or biases can be profound. Moreover, the absence of external oversight often means that ethical considerations are deprioritized in favor of performance optimization or market demands.

For these reasons, regulatory frameworks serve as critical mediators of closed-source LLMs [43]. Governments and industry consortia should establish mandates for transparency in high-risk applications. These could include requirements for disclosing anonymized datasets, publishing high-level explanations of decision-making processes, or engaging external (third-party, independent, unaffiliated) ethics oversight boards for pre-deployment risk assessments. While such measures may not match the transparency of open-source systems models, they can provide critical visibility into the ethically salient dimensions of proprietary systems without compromising trade secrets.

In short, bridging this divide between open- and closed-source LLMs will require hybrid solutions that combine the strengths of both paradigms. For instance, closed-source developers could adopt modular transparency—releasing anonymized components or high-level abstractions of decision-making logic to facilitate third-party evaluations. Simultaneously, open-source ecosystems could leverage advancements in ethical assessment tools to enhance their auditing processes. By integrating these complementary approaches, the ML community can create systems that not only achieve state-of-the-art performance but also uphold principles of fairness, accountability, and trustworthiness. This kind of interdisciplinary collaboration between computer science, ethics, policy, and sociology will be sure to play a decisive role in defining the trajectory of ethical LLM development and its alignment with our values.

6 Discussion
Our comparative analysis of open-source and closed-source Large Language Models (LLMs) offers critical insights into the differing trajectories of innovation, accessibility, and collaboration in natural language processing (NLP). Both paradigms share foundational technologies, such as Transformer architectures, but their distinct underlying philosophical commitments have led to varied impacts on the field. This discussion evaluates both trajectories, clarifying the strengths and limitations of each, while advancing a clear and grounded position on the transformative potential of open-source models.

Closed-source LLMs continue to lead in performance, leveraging proprietary datasets and significant computational investments to excel in tasks requiring advanced generative abilities, multi-step reasoning, and broad generalization. However, their success comes at the cost of limited transparency and restricted accessibility, which creates challenges for external validation and replication. The closed-source approach also consolidates resources and technological power within a few institutions. In so doing, it poses barriers to equitable AI development and raises concerns about reproducibility of outcomes and organizational accountability. By contrast, open-source LLMs emphasize accessibility and collaborative development. While these models often trail closed-source systems in absolute performance, they have made significant progress in narrowing the gap through methods such as Low-Rank Adaptation (LoRA) and quantization. These strategies enable efficient, competitive outcomes even in resource-constrained environments. By utilizing diverse datasets across languages and contexts, open-source models demonstrate their capacity to address real-world challenges with inclusivity. This democratic ethos has already empowered researchers and developers globally, and is likely to continue to do so.

The scalability of closed-source models is evident in their ability to set performance benchmarks, leveraging extensive datasets and robust infrastructure. However, their reliance on proprietary resources limits their adaptability to niche or underrepresented use cases. By contrast, open-source models, though constrained by resource limitations, possess the potential to surpass these benchmarks by leveraging their access to diverse and evolving datasets. Their adaptability allows them to address specialized challenges and emerging contexts, but their success hinges on sustained contributions from the global community. Collaborative efforts are essential to overcoming resource gaps and achieving continuous improvement.

Accessibility remains a key distinction between the two paradigms. Closed-source systems have improved accessibility within specific domains through tools like GitHub Copilot, which streamline adoption for non-expert users. However, these tools often lack the flexibility needed for personal customization [44]. Open-source models excel in this regard, offering modular architectures and reduced computational barriers that enable widespread experimentation and deployment. This flexibility supports innovation across academic, industrial, and grassroots initiatives and serves to highlight the potential for integrating the strengths of both approaches to optimize usability and adaptability.

Notably, ethical considerations further differentiate these paradigms. The opacity of closed-source models exacerbates deficits in trust, particularly in critical applications such as healthcare and legal decision-making. Their restricted access to internal mechanisms limits external auditing and accountability, which, by extension, raises concerns about fairness and safety. In contrast, open-source models prioritize transparency by providing unrestricted access to architectures, datasets, and methodologies. However, inconsistencies in documentation and the absence of standardized ethical frameworks pose challenges for ensuring reliable oversight. A hybrid approach that combines the transparency of open-source models with regulatory measures could address these concerns, ensuring more responsible AI deployment.

One promising avenue for future research lies in addressing the phenomenon of hallucinations in LLMs, which manifests differently depending on the context. When these models generate incorrect outputs, they are often labeled as mistakes, yet when their outputs are creative and contextually aligned, they are celebrated as innovation [45]. This tension is particularly pronounced in reasoning tasks, where precision and coherence are of paramount importance. Understanding and mitigating hallucinations requires a systematic approach to model evaluation, focusing on distinguishing productive creativity from erroneous reasoning. Open-source models can play a pivotal role in this research by fostering collaborative experiments with diverse datasets and benchmarking strategies that illuminate the mechanisms underlying hallucinations.

Another direction for future work involves enhancing the reasoning capabilities of LLMs through interdisciplinary contributions. By integrating insights from cognitive science and formal logic, researchers can develop frameworks to improve reasoning fidelity and robustness in LLMs. Open-source ecosystems are uniquely positioned to drive progress in this area, offering the transparency and flexibility needed to experiment with novel architectures, training methods, and evaluation protocols. Collaborative efforts could enable the design of more reliable and context-aware reasoning systems, pushing the boundaries of what LLMs can achieve in tasks requiring deep understanding and logical coherence.

To be sure, closed-source LLMs currently dominate performance benchmarks due to their resource-intensive strategies, while open-source models offer unparalleled accessibility and the potential for equitable AI advancement. The future of open-source LLMs depends on fostering a robust ecosystem of contributors to drive innovation and refinement. By leveraging diverse datasets and emphasizing collaboration, open-source models have the capacity to address global challenges and redefine the NLP landscape in ways that closed-source models have not yet shown they can match.

###
https://huggingface.co/blog/falcon3
Welcome to the Falcon 3 Family of Open Models!
Published December 17, 2024

WHOA! The Falcon has landed 🔥
TL;DR: Meet Falcon3, the game-changing family of LLMs advancing open and accessible large foundation models.
Here are the top highlights:
📉 Smaller but Mighty: Falcon3 models achieve state-of-the-art performance with fewer parameters (under 10B)
🤓 Math Whiz: Falcon3-10B-Base scores 22.9 on MATH-Lvl5 and 83.0 on GSM8K, showcasing enhanced math reasoning
💻 Coding Master: Falcon3-10B-Base achieves 73.8 on MBPP and Falcon3-10B-Instruct scores 45.8 on Multipl-E, demonstrating coding prowess
📚 Long Context Support: Falcon3 models support up to 32k tokens, with Falcon3-10B-Instruct scoring 86.3 on BFCL
💡 Reasoning Superstar: Falcon3-7B-Base and Falcon3-10B-Base achieve 51.0 and 59.7 on BBH, showcasing improved reasoning capabilities
🎯 Scientific Knowledge Expansion: Falcon3 models demonstrate advances in specialized knowledge, with scores of 67.4 and 73.1 on MMLU benchmarks
🤝 Compatibility: All Falcon3 models are compatible with Llama architecture, ensuring seamless integration into the AI ecosystem
🎉 Variants Galore: Falcon3 models come in various flavors, including Instruct, GGUF, GPTQ-Int4, and more, offering flexibility for diverse applications


We introduce Falcon3, a family of decoder-only large language models under 10 billion parameters, developed by Technology Innovation Institute (TII) in Abu Dhabi. By pushing the boundaries of performance and training efficiency, this release reflects our ongoing commitment to advancing open and accessible large foundation models.
Falcon3 represents a natural evolution from previous releases, emphasizing expanding the models' science, math, and code capabilities.

This iteration includes five base models:

Falcon3-1B-Base
Falcon3-3B-Base
Falcon3-Mamba-7B-Base
Falcon3-7B-Base
Falcon3-10B-Base
In developing these models, we incorporated several key innovations aimed at improving the models' performances while reducing training costs:

One pre-training for transformer-based models: We conducted a single large-scale pretraining run on the 7B model, using 1024 H100 GPU chips, leveraging 14 trillion tokens featuring web, code, STEM, and curated high-quality and multilingual data.
Depth up-scaling for improved reasoning: Building on recent studies on the effects of model depth, we upscaled the 7B model to a 10B parameters model by duplicating the redundant layers and continuing pre-training with 2 trillion tokens of high-quality data. This yielded Falcon3-10B-Base which achieves state-of-the-art zero-shot and few-shot performance for models under 13B parameters.
Knowledge distillation for better tiny models: To provide compact and efficient alternatives, we developed Falcon3-1B-Base and Falcon3-3B-Base by leveraging pruning and knowledge distillation techniques, using less than 100GT of curated high-quality data, thereby redefining pre-training efficiency.
Pure SSM: We have further enhanced Falcon Mamba 7B by training on an additional 1.5 trillion tokens of high-quality data, resulting in Falcon3-Mamba-7B-Base. Notably, the updated model offers significantly improved reasoning and mathematical capabilities.
Other variants: All models in the Falcon3 family are available in variants such as Instruct, GGUF, GPTQ-Int4, GPTQ-Int8, AWQ, and 1.58-bit, offering flexibility for a wide range of applications.
Key Highlights
Falcon3 pushes the limits of small and medium-scale large language models, demonstrating high performance on common benchmarks:

Falcon3-1B-Base surpasses SmolLM2-1.7B and is on par with gemma-2-2b.
Falcon3-3B-Base outperforms larger models like Llama-3.1-8B and Minitron-4B-Base, highlighting the benefits of pre-training with knowledge distillation.
Falcon3-7B-Base demonstrates top performance, on par with Qwen2.5-7B, among models under the 9B scale.
Falcon3-10B-Base stands as the state-of-the-art achieving strong results in the under-13B category.
All the transformer-based Falcon3 models are compatible with the Llama architecture, allowing for easier integration into the AI ecosystem (a minimal loading sketch follows this list).
Falcon3-Mamba-7B continues to lead as the top-performing State Space Language Model (SSLM), matching or even surpassing leading transformer-based LLMs at the 7B scale, along with support for a longer 32K context length. Having the same architecture as the original Falcon Mamba 7B, users can integrate Falcon3-Mamba-7B seamlessly without any additional effort.
The instruct versions of our collection of base models further show remarkable performance across various benchmarks with Falcon3-7B-Instruct and Falcon3-10B-Instruct outperforming all instruct models under the 13B scale on the open leaderboard.
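
Because the transformer-based Falcon3 checkpoints follow the Llama architecture, they can be loaded with standard Hugging Face tooling. The sketch below assumes the instruct checkpoints are published under the tiiuae organization (e.g. "tiiuae/Falcon3-7B-Instruct"); verify the exact repository id on the Hub.

# Minimal sketch: loading a Falcon3 instruct model with transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon3-7B-Instruct"  # assumed repo id; check the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

chat = [{"role": "user", "content": "Explain grouped-query attention in one sentence."}]
inputs = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=64)
# print only the newly generated continuation
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))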
Enhanced Capabilities
We evaluated the models with our internal evaluation pipeline (based on lm-evaluation-harness) and report raw scores. Our evaluations highlight key areas where the Falcon3 family of models excels, reflecting the emphasis on enhancing performance in scientific domains, reasoning, and general knowledge capabilities:

Math Capabilities: Falcon3-10B-Base achieves 22.9 on MATH-Lvl5 and 83.0 on GSM8K, showcasing enhanced reasoning in complex math-focused tasks.
Coding Capabilities: Falcon3-10B-Base achieves 73.8 on MBPP, while Falcon3-10B-Instruct scores 45.8 on MultiPL-E, reflecting their abilities to generalize across programming-related tasks.
Extended Context Length: Models in the Falcon3 family support up to 32k tokens (except the 1B supporting up to 8k context), with functional improvements such as scoring 86.3 on BFCL (Falcon3-10B-Instruct).
Improved Reasoning: Falcon3-7B-Base and Falcon3-10B-Base achieve 51.0 and 59.7 on BBH, reflecting enhanced reasoning capabilities, with the 10B model showing improved reasoning performance over the 7B.
Scientific Knowledge Expansion: Performance on MMLU benchmarks demonstrates advances in specialized knowledge, with scores of 67.4/39.2 (MMLU/MMLU-PRO) for Falcon3-7B-Base and 73.1/42.5 (MMLU/MMLU-PRO) for Falcon3-10B-Base respectively.
Models' Specs and Benchmark Results
Detailed specifications of the Falcon3 family of models are summarized in the following table. The architecture of Falcon3-7B-Base is characterized by a head dimension of 256, which yields high throughput when using FlashAttention-3, as that kernel is optimized for this dimension. These decoder-only models span 18 to 40 layers for the transformer-based variants and 64 layers for the Mamba one; all models share the SwiGLU activation function, with a vocabulary size of 131K tokens (65K for Mamba-7B). Falcon3-7B-Base is trained on the largest amount of data, ensuring comprehensive coverage of concepts and knowledge, while the other variants require substantially less data.




[Table: Falcon3 family model specifications]


The table below highlights the performance of Falcon3-7B-Base and Falcon3-10B-Base on key benchmarks, showing competitive results in general, math, reasoning, and common-sense understanding domains. Feel free to take a look at the model cards, where we provide additional evaluation results (e.g., MT-Bench, Alpaca, etc.).


[Table: Falcon3-7B-Base and Falcon3-10B-Base benchmark results]


The instruct models also demonstrate competitive and superior performance relative to models of equivalent and smaller size, as highlighted in the tables below.

Instruct models
Falcon3-1B-Instruct and Falcon3-3B-Instruct achieve robust performance across the evaluated benchmarks. Specifically, Falcon3-1B attains competitive results in IFEval (54.4), MUSR (40.7), and SciQ (86.8), while Falcon3-3B exhibits further gains—particularly in MMLU-PRO (29.7) and MATH (19.9)—demonstrating clear scaling effects. Although they do not surpass all competing models on every metric, Falcon models show strong performances in reasoning and common-sense understanding relative to both Qwen and Llama. In our internal evaluation pipeline:

We use lm-evaluation-harness.
We report raw scores obtained by applying the chat template without fewshot_as_multiturn (unlike Llama3.1).
We use the same batch size across all models.
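
For reference, a comparable setup can be reproduced with the public lm-evaluation-harness Python API. The sketch below assumes a recent (≥0.4.x) release in which the chat-template and few-shot options are exposed as keyword arguments; parameter names may differ across versions, and the task list, batch size, and model id are illustrative choices.

```python
# Rough sketch of an lm-evaluation-harness run mirroring the setup above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=tiiuae/Falcon3-7B-Instruct,dtype=bfloat16",
    tasks=["gsm8k", "ifeval"],
    batch_size=8,
    apply_chat_template=True,    # raw scores with the chat template applied
    fewshot_as_multiturn=False,  # unlike the Llama 3.1 evaluation setup
)
print(results["results"])
```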



[Table: Falcon3-1B-Instruct and Falcon3-3B-Instruct benchmark results]


Furthermore, Falcon3-7B and Falcon3-10B show robust performance across the evaluated benchmarks. Falcon3-7B achieves competitive scores on reasoning (Arc Challenge: 65.9, MUSR: 46.4) and math (GSM8K: 79.1), while Falcon3-10B demonstrates further improvements, notably in GSM8K (83.1) and IFEval (78), indicating clear scaling benefits.

[Table: Falcon3-7B-Instruct and Falcon3-10B-Instruct benchmark results]


Open Source Commitment
In line with our mission to foster AI accessibility and collaboration, all models in the Falcon3 family are released under the Falcon LLM license. We hope the AI community finds these models valuable for research, application development, and further experimentation. Falcon3 is not a culmination but a continuation of our efforts to create more capable, efficient, specialized foundation models. In January 2025, we will further release other models of the Falcon3 family featuring enhanced multi-modal capabilities including image, video, and audio support, as well as a full technical report covering our methodologies. We welcome feedback and collaboration from the community as we continue to refine and advance these technologies.

###
https://www.kaggle.com/facts-leaderboard
FACTS Leaderboard
FACTS is a novel benchmark from Google DeepMind and Google Research designed to evaluate the factual accuracy and grounding of AI models.

Introduction
The FACTS Grounding benchmark evaluates the ability of Large Language Models (LLMs) to generate factually accurate responses grounded in provided long-form documents, encompassing a variety of domains. FACTS Grounding moves beyond simple factual question-answering by assessing whether LLM responses are fully grounded to the provided context and correctly synthesize information from a long context document. By providing a standardized evaluation framework, FACTS Grounding aims to promote the development of LLMs that are both knowledgeable and trustworthy, facilitating their responsible deployment in real-world applications.

Dataset
Starter Code
Technical Report
Blog Post
| Rank | Model | Factuality Score | 95% CI | Organization | License | Knowledge Cutoff |
|------|-------|------------------|--------|--------------|---------|------------------|
| 1 | gemini-2.0-flash-exp | 83.6% | ±1.8% | Google | Proprietary | 8/2024 |
| 2 | gemini-1.5-flash-002 | 82.9% | ±1.8% | Google | Proprietary | 11/2023 |
| 3 | gemini-1.5-pro-002 | 80.0% | ±1.9% | Google | Proprietary | 11/2023 |
| 4 | claude-3-5-sonnet-20241022 | 79.4% | ±1.9% | Anthropic | Proprietary | 4/2024 |
| 5 | gpt-4o | 78.8% | ±1.9% | OpenAI | Proprietary | 10/2023 |
| 6 | claude-3-5-haiku-20241022 | 74.2% | ±2.1% | Anthropic | Proprietary | 4/2024 |
| 7 | gpt-4o-mini | 71.0% | ±2.1% | OpenAI | Proprietary | 10/2023 |
| 8 | o1-mini | 62.0% | ±2.3% | OpenAI | Proprietary | 10/2023 |
| 9 | o1-preview | 61.7% | ±2.3% | OpenAI | Proprietary | 10/2023 |
About FACTS Grounding

FACTS Grounding is based on a novel set of factual grounding examples collected from human raters. Each example consists of a system instruction, user request and a context document (maximum of 32k tokens), and requires a long-form response. AI generated responses to these examples are evaluated by an ensemble of automated judge models.

For more details, please refer to the Examples Section or Technical Report.
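
To make the example structure concrete, here is a small sketch of how one such example could be turned into a chat request. The field names mirror the description above (system instruction, user request, context document) but are assumptions, not the official schema of the released dataset.

```python
# Sketch of assembling a FACTS Grounding example into chat messages.
def build_messages(example: dict) -> list[dict]:
    grounded_user_turn = (
        f"{example['context_document']}\n\n"
        f"{example['user_request']}"
    )
    return [
        {"role": "system", "content": example["system_instruction"]},
        {"role": "user", "content": grounded_user_turn},
    ]

messages = build_messages({
    "system_instruction": "Answer strictly based on the provided document.",
    "user_request": "Summarize the key findings for a lay reader.",
    "context_document": "<up to ~32k tokens of source text>",
})
```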

Grounding Example Distribution

The full FACTS Grounding benchmark comprises 1,719 examples. This includes 860 public examples available in the FACTS Grounding Public Examples Dataset. The remaining 859 examples form a private set that is held out to mitigate the risk of benchmark contamination. Leaderboard results on this page are computed across both the public and private sets.
Task Distribution

| Task | Count | Share |
|------|-------|-------|
| Fact Finding | 543 | 31.6% |
| Find & Summarize | 509 | 29.6% |
| Effect Analysis | 153 | 8.9% |
| Explain/Define | 129 | 7.5% |
| Concept Comparison | 105 | 6.1% |
| Pros & Cons | 76 | 4.4% |
| Summarize & Format | 75 | 4.4% |
| Summarize | 65 | 3.8% |
| Summarize & Simplify | 64 | 3.7% |

Domain Distribution

| Domain | Count | Share |
|--------|-------|-------|
| Medical | 499 | 29.0% |
| Legal | 382 | 22.2% |
| Internet/Technology | 330 | 19.2% |
| Financial | 312 | 18.2% |
| Retail/Product | 196 | 11.4% |

Running FACTS Grounding
Starter Code

If you’d like to test your own model’s performance on FACTS Grounding, you can generate your own responses on the set of public examples with the methodology described in the Technical Report.

Computing the Factuality Score

The factuality score in the FACTS Grounding benchmark is calculated by first using three different frontier LLM judges to determine if a response is grounded to the provided context. A response is labeled "accurate" if all its claims are directly supported or don't require support from the context; otherwise, it's marked "inaccurate." Each judge calculates a factuality score individually as the percentage of accurate responses. To mitigate bias, the final score is an average across all three judges. Responses deemed ineligible are disqualified from the factuality scoring process and are treated as factually inaccurate. The factuality score reported in this leaderboard is the average across both the public and private example sets.
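
A simplified sketch of that aggregation logic follows; judge names and data shapes are placeholders, and the real pipeline prompts LLM judges rather than reading precomputed labels.

```python
# Sketch of the factuality-score aggregation described above.
from statistics import mean

def factuality_score(judgments: dict[str, list[str]], ineligible: set[int]) -> float:
    """judgments maps judge name -> per-example labels in {"accurate", "inaccurate"}."""
    per_judge_scores = []
    for labels in judgments.values():
        # Ineligible responses are disqualified and counted as inaccurate.
        accurate = [
            label == "accurate" and i not in ineligible
            for i, label in enumerate(labels)
        ]
        per_judge_scores.append(mean(accurate))   # percentage of accurate responses
    return mean(per_judge_scores)                 # average across the judges

score = factuality_score(
    {"judge_a": ["accurate", "inaccurate", "accurate"],
     "judge_b": ["accurate", "accurate", "accurate"],
     "judge_c": ["accurate", "inaccurate", "accurate"]},
    ineligible={1},
)
print(f"{score:.1%}")
```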

Quality Filtering

To prevent models from "gaming" the factuality score by providing short, evasive responses, FACTS Grounding employs a quality filtering step. This process uses the same three LLM judges, but with different prompt templates designed to identify responses that don't sufficiently address the user's request. A response is disqualified only if all three judges agree that a response is "ineligible". In this way, low-quality responses are filtered out from the final score shown in the leaderboard.
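
The disqualification rule itself is simple to express; a sketch under the same placeholder assumptions:

```python
# A response is disqualified only if all three judges label it "ineligible".
def is_disqualified(eligibility_votes: list[str]) -> bool:
    """eligibility_votes holds one label per judge: "eligible" or "ineligible"."""
    return all(vote == "ineligible" for vote in eligibility_votes)

assert is_disqualified(["ineligible", "ineligible", "ineligible"])
assert not is_disqualified(["ineligible", "eligible", "ineligible"])
```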

Adding New Models

The FACTS Grounding leaderboard will be actively maintained, so suggestions from the community on new models to evaluate are welcome! To begin, we will focus on expanding coverage to more frontier language models.

As the FACTS Grounding benchmark includes a set of private held out prompts, official results on the leaderboard will be run by the Kaggle team.

To request a model for evaluation, please fill out this form.

Limitations

While this benchmark represents a step forward in evaluating factual accuracy, more work remains to be done. First, this benchmark relies on potentially noisy automated LLM judge models for evaluation. By ensembling a range of frontier LLMs and averaging judge outputs, we attempt to mitigate this. Second, the FACTS benchmark focuses only on evaluating grounded responses to long-form text input and could potentially be extended.


Questions, comments, or issues? Share your thoughts with us in the discussion forum.

What is the best LLM for RAG and grounded long-form inputs? Google DeepMind's FACTS is a new benchmark measuring how well LLMs generate factually accurate, long-form responses while remaining faithful to provided source documents. 👀
TL;DR:
📊 1,719 examples (860 public, 859 private) covering prompts up to 32k tokens
🎯 Focuses on grounding, excluding creativity, mathematics, and complex reasoning
🏢 Covers medical (29%), legal (22.2%), tech (19.2%), financial (18.2%), retail (11.4%)
📝 Task types include fact-finding (31.6%), find & summarize (29.6%), effect analysis (8.9%), and more
👥 Uses three LLMs as judges (Gemini 1.5 Pro, GPT-4o, Claude 3.5 Sonnet) to reduce bias
🏆 Gemini 2.0 Flash outperforms all other models with 83.6% accuracy
🛡️ Comprehensive quality assurance process to ensure high-quality, non-trivial examples
🔢 Judge models tend to rate their own outputs higher, showing the importance of using multiple judge models.
📝 Models can be submitted on Kaggle for evaluation

Responsibility & Safety

FACTS Grounding: A new benchmark for evaluating the factuality of large language models
Published
17 December 2024
Authors
FACTS team

Share

Our comprehensive benchmark and online leaderboard offer a much-needed measure of how accurately LLMs ground their responses in provided source material and avoid hallucinations

Large language models (LLMs) are transforming how we access information, yet their grip on factual accuracy remains imperfect. They can “hallucinate” false information, particularly when given complex inputs. In turn, this can erode trust in LLMs and limit their applications in the real world.

Today, we’re introducing FACTS Grounding, a comprehensive benchmark for evaluating the ability of LLMs to generate responses that are not only factually accurate with respect to given inputs, but also sufficiently detailed to provide satisfactory answers to user queries.

We hope our benchmark will spur industry-wide progress on factuality and grounding. To track progress, we’re also launching the FACTS leaderboard on Kaggle. We’ve already tested leading LLMs using FACTS Grounding and have populated the initial leaderboard with their grounding scores. We will maintain and update the leaderboard as the field advances.

[Figure: Current leaderboard ranking on the FACTS leaderboard]

FACTS Grounding dataset
To accurately evaluate the factuality and grounding of any given LLM, the FACTS Grounding dataset comprises 1,719 examples, each carefully crafted to require long-form responses grounded in the context document provided. Each example comprises a document, a system instruction requiring the LLM to exclusively reference the provided document, and an accompanying user request.

[Figure: An example from the FACTS Grounding dataset]

All examples are divided into a "public" set (860) and a "private" (859) held out set. We are releasing the public set today so anyone can use it to evaluate an LLM. Of course, we know that issues of benchmark contamination and leaderboard hacking are important to protect against, so following standard industry practice, we are keeping the private evaluation set held out. The FACTS leaderboard scores are the average performance across both public and private sets.

To ensure a diversity of inputs, the FACTS Grounding examples include documents with a variety of lengths, up to a maximum of 32,000 tokens (roughly 20,000 words), covering domains such as finance, technology, retail, medicine, and law. The user requests are similarly wide ranging, including requests for summarization, Q&A generation, and rewriting tasks. We did not include any examples that could require creativity, mathematics, or complex reasoning – capabilities which might require the model to apply more advanced reasoning in addition to grounding.

[Figure: Prompt distribution by domain and task]

Collective judgement by leading LLMs
To succeed on a given example, an LLM must synthesize the complex information in the document and generate a long-form response that is both a comprehensive answer to the user request and fully attributable to that document.

FACTS Grounding evaluates model responses automatically using three frontier LLM judges — namely Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet. We selected a combination of different judges to mitigate any potential bias of a judge giving higher scores to the responses produced by a member of its own model family. The automatic judge models were comprehensively evaluated against a held-out test set to find the best performing judging prompt templates and to verify agreement with human raters.

Each FACTS Grounding example is judged in two phases. First, responses are evaluated for eligibility, and disqualified if they don’t sufficiently address the user’s request. Second, responses are judged as factually accurate if they are fully grounded in information contained in the provided document, with no hallucinations.
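
For intuition, a grounding-phase judge prompt might look roughly like the following. This is an illustrative template only, not the optimized prompt selected in the technical report.

```python
# Illustrative grounding-phase judge prompt (not the official template).
GROUNDING_JUDGE_TEMPLATE = """You are given a source document, a user request,
and a model response. Label the response "accurate" if every claim it makes is
directly supported by the document or requires no support; otherwise label it
"inaccurate". Answer with the label only.

Document:
{document}

User request:
{request}

Model response:
{response}
"""

prompt = GROUNDING_JUDGE_TEMPLATE.format(
    document="<context document>",
    request="<user request>",
    response="<model answer>",
)
```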

With the eligibility and grounding accuracy of a given LLM response evaluated separately by multiple AI judge models, the results are then aggregated to determine if the LLM has dealt with the example successfully. The final score for the overall grounding task is the average of all judge models’ scores across all examples. Find more details of our FACTS Grounding evaluation methodology in our paper.

[Figure: How a factuality score is assigned. A factually correct response that fails to properly address the user's request fails the benchmarking example; shown are three model responses that the automated LLM judges considered ineligible.]

FACTS Grounding will continue to evolve
We are mindful that benchmarks can be quickly overtaken by progress, so this launch of our FACTS Grounding benchmark and leaderboard is just the beginning. Factuality and grounding are among the key factors that will shape the future success and usefulness of LLMs and broader AI systems, and we aim to grow and iterate FACTS Grounding as the field progresses, continually raising the bar.

We encourage the AI community to engage with FACTS Grounding, to evaluate their models on the open set of examples, or to submit their models for evaluation. We believe that comprehensive benchmarking methods, coupled with continuous research and development, will continue to improve AI systems.

###
https://liyaowei-stu.github.io/project/BrushEdit/
12/17/24
Tencent's new BrushEdit enables precise image editing via an inpainting model. Learn more ⬇️
Supports both instruction-based editing and tool-based interactive editing. 🔥
- Add Objects in Image
- Edit Image Backgrounds
- Edit Image Objects
- Remove Something from Image


BrushEdit: All-In-One Image Inpainting and Editing
Yaowei Li1*, Yuxuan Bian3*, Xuan Ju3*, Zhaoyang Zhang2‡, Junhao Zhuang4, Ying Shan2✉, Yuexian Zou1✉, Qiang Xu3✉
1Peking University 2ARC Lab, Tencent PCG
3The Chinese University of Hong Kong
4Tsinghua University
✉ Corresponding Author ‡ Project Lead




TL;DR
BrushEdit is an advanced, unified AI agent for image inpainting and editing.

Main Elements: 🛠️ Fully automated / 🤠 Interactive editing.
Abstract
Image editing has advanced significantly with the development of both inversion-based and instruction-based methods. However, current inversion-based approaches struggle with major modifications (e.g., adding or removing objects) due to the structured nature of inversion noise, which hinders substantial changes. Meanwhile, instruction-based methods often constrain users to black-box operations, limiting direct interaction for specifying editing regions and intensity. To address these limitations, we propose BrushEdit, a novel inpainting-based instruction-guided image editing paradigm, which leverages multimodal large language models (MLLMs) and image inpainting models to enable autonomous, user-friendly, and interactive free-form instruction editing. Specifically, we devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model in an agent-cooperative framework to perform editing category classification, main object identification, mask acquisition, and editing area inpainting. Extensive experiments show that our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven key metrics, including mask region preservation, and editing effect coherence.

Pipeline Overview
Our approach consists of four main steps: (i) Editing category classification: determine the type of editing required. (ii) Identification of the primary editing object: identify the main object to be edited. (iii) Acquisition of the editing mask and target caption: generate the editing mask and the corresponding target caption. (iv) Image inpainting: perform the actual image editing. Steps (i) to (iii) use pre-trained MLLMs and detection models to ascertain the editing type, target object, editing masks, and target caption. Step (iv) performs the edit with a dual-branch inpainting model improved from BrushNet, which inpaints the target areas based on the target caption and editing masks, leveraging the generative potential and background-preservation capabilities of inpainting models.
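
Expressed as code, the agent-cooperative flow looks roughly like the sketch below; every object and method name is a hypothetical placeholder standing in for the MLLM, the detection/segmentation model, and the BrushNet-style inpainting branch, not the released BrushEdit API.

```python
# High-level sketch of the four-step BrushEdit flow (placeholder components).
def brushedit(image, instruction, mllm, detector, inpainter):
    # (i) Editing category classification (add / remove / background / object edit)
    category = mllm.classify_edit_type(instruction)
    # (ii) Identify the primary object to be edited
    target = mllm.identify_target_object(instruction, image)
    # (iii) Acquire the editing mask and the target caption for the edited region
    mask = detector.segment(image, target)
    caption = mllm.generate_target_caption(instruction, category, target)
    # (iv) Dual-branch inpainting conditioned on the mask and target caption,
    #      preserving the unmasked background
    return inpainter.inpaint(image, mask=mask, prompt=caption)
```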

