META ➡️ Meta released the Llama 3.3 70B model, highlighting performance that approaches its far larger 405B model and support for a 128K-token context.

Microsoft ➡️ Microsoft Research unveiled Florence-VL, a next-generation multimodal large language model (MLLM) family built on the Florence-2 generative vision foundation model, strengthening multimodal understanding and vision-language alignment. ➡️ Microsoft's Florence-VL paper reconfirmed the strength of generative vision foundation models as MLLM backbones. ➡️ Microsoft introduced Magentic-One, a multi-agent system emphasizing orchestration and complex task solving. ➡️ Microsoft's TRELLIS presented a unified 3D generation model with strengths in multimodal inputs and multiple 3D output formats.

OpenGVLab ➡️ OpenGVLab released the InternVL 2.5 series of large multimodal models, demonstrating strong integrated vision-language performance.

DeepSeek ➡️ DeepSeek released DeepSeek-V2.5-1210, adding real-time web search on its chat service and improving performance across numerous benchmarks.

Google ➡️ Google released the PaliGemma 2 VLM series, supporting multiple resolutions and parameter scales to provide a convenient base for transfer learning and multi-task applications.

Tencent ➡️ Tencent open-sourced HunyuanVideo, a large-scale video generation model demonstrating high-quality video generation capabilities.

AWS (Amazon) ➡️ AWS announced the general availability of Amazon Bedrock Agents, improving the transparency and controllability of orchestration and reasoning. ➡️ AWS introduced Amazon Nova, a new foundation model family that expands multimodal and multi-task capabilities for better accuracy and usability.

LG AI Research ➡️ LG AI Research released EXAONE 3.5, achieving frontier-class performance with stronger long-context handling and instruction following.

HuggingFace ➡️ HuggingFaceFW released FineWeb 2, a large-scale multilingual pretraining dataset that lays the groundwork for better multilingual models. ➡️ Hugging Face shipped TGI (Text Generation Inference) 3.0, improving very-long-prompt handling and speed for large-scale LLM inference.

PyTorch ecosystem ➡️ vLLM joined the PyTorch ecosystem, advancing efficient large-scale LLM serving across diverse hardware.

META, Llama 3.3 70B Instruct Model Released

Link, 2024/12/06

  • 128K-token context support: handles long-context workloads such as extended conversations and long-document analysis; Grouped-Query Attention (GQA) keeps inference efficient even at the full 128K-token length.
  • Better performance at the same model size: with the same 70B parameter count as Llama 3.1 70B, the model improves across the board on code generation (HumanEval, MBPP EvalPlus), reasoning and math (GPQA Diamond, MATH), steerability (IFEval), and multilingual ability (MGSM).
  • Performance approaching the 405B model: Llama 3.3 70B achieves highly efficient parameter utilization, posting accuracy close to, or better than, the 405B model on the same CoT (Chain-of-Thought) tasks.
  • Optimized Transformer architecture with SFT and RLHF: an optimized transformer plus supervised fine-tuning and RLHF strengthen instruction following; training for multilingual dialogue improves both multilingual understanding and code generation.
  • Fully compatible with Hugging Face Transformers: model weights, tokenizer, and inference pipeline are integrated into Hugging Face for easy access and use, as in the sketch below.
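Below is a minimal, hedged sketch of loading the instruct model through the Transformers pipeline API; it assumes access to the gated meta-llama/Llama-3.3-70B-Instruct repository and sufficient GPU memory, and the generation settings are illustrative rather than recommended values.

```python
# Minimal sketch: chat-style generation with Llama 3.3 70B Instruct via transformers.
# Assumes the gated repo has been granted and enough GPU memory is available.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.3-70B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
messages = [
    {"role": "system", "content": "You are a concise multilingual assistant."},
    {"role": "user", "content": "Summarize what a 128K-token context window enables."},
]
out = pipe(messages, max_new_tokens=256)
print(out[0]["generated_text"][-1]["content"])  # last turn is the assistant reply
```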

Microsoft Research, Florence-VL Next-Generation MLLM Released

Link, 2024/12/06

  • Florence-VL: a next-generation MLLM built on a generative vision foundation model: Florence-2, a strong generative vision foundation model, is combined with LLMs such as Phi 3.5 and LLaMA 3 to build a new family of multimodal large language models (MLLMs). Compared with CLIP-style encoders, it fuses features across multiple depths and multiple prompts (breadth) via DBFusion, substantially strengthening visual information processing.
  • Depth-Breadth Fusion (DBFusion): features extracted from different layer depths of the vision encoder (Florence-2) under several task prompts are fused, capturing both fine-grained visual detail and abstract concepts. This supports a wide range of vision-language tasks, from simple image captioning to chart understanding, document OCR, and knowledge-based VQA; an illustrative sketch of the fusion idea follows this list.
  • High-quality data and instruction tuning: pretraining on diverse open-source data is followed by fine-tuning on high-quality image captions and instruction-tuning pairs, improving usability, instruction adherence, and hallucination control for real service environments.
  • SOTA results and open-source release: Florence-VL outperforms prior SOTA MLLMs across multimodal and vision-centric benchmarks (VQA, OCR, chart understanding, knowledge-intensive understanding, etc.); the full training recipe and model checkpoints are open-sourced so the community can reproduce, extend, and improve the work.
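The following is only an illustrative sketch of the DBFusion idea described above (channel-concatenating features taken from several encoder depths and prompts, then projecting into the LLM embedding space); the layer counts and dimensions are assumptions, not the official Florence-VL implementation.

```python
# Conceptual sketch of depth-breadth feature fusion: features from multiple encoder
# depths and multiple task prompts are channel-concatenated and projected to the LLM
# embedding space. Dimensions are placeholders, not Florence-VL's actual configuration.
import torch
import torch.nn as nn

class DBFusionProjector(nn.Module):
    def __init__(self, vision_dim: int, num_sources: int, llm_dim: int):
        super().__init__()
        # num_sources = (#depths) x (#prompts) feature maps being fused
        self.proj = nn.Sequential(
            nn.Linear(vision_dim * num_sources, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feature_list):
        # each element: (batch, num_visual_tokens, vision_dim)
        fused = torch.cat(feature_list, dim=-1)  # concatenate along the channel axis
        return self.proj(fused)                  # (batch, num_visual_tokens, llm_dim)

# e.g. 2 depths x 3 prompts = 6 feature maps from the vision encoder
projector = DBFusionProjector(vision_dim=1024, num_sources=6, llm_dim=4096)
visual_tokens = projector([torch.randn(1, 576, 1024) for _ in range(6)])
print(visual_tokens.shape)  # torch.Size([1, 576, 4096])
```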

OpenGVLab, InternVL 2.5 MLLM Series Released

Link, 2024/12/05

  • MLLMs spanning 1B to 78B parameters: the InternViT vision encoder is connected to LLMs such as Qwen2.5 and InternLM 2.5 through a randomly initialized MLP projector ("ViT-MLP-LLM"), providing a wide range of model scales.
  • Dynamic high-resolution processing: images are tiled at 448×448, and a pixel-unshuffle operation reduces the number of visual tokens to one quarter (see the sketch after this list); multi-image and video inputs are also supported.
  • Three-stage training pipeline:
    • Stage 1: MLP warmup - establishes vision-language alignment and basic cross-modal understanding
    • Stage 1.5: incremental learning of the vision encoder - improves handling of rare domains (multilingual OCR, charts)
    • Stage 2: full-model instruction tuning - uses noise-filtered, high-quality multimodal data to minimize degradation of the LLM
  • Data filtering and loss reweighting: LLM-based quality scoring, repetition detection, and heuristic filtering remove noisy samples, while augmentations such as random JPEG compression and square-averaging loss reweighting maximize training stability.
  • Above 70% on MMMU, competitive with GPT-4o: world-class vision-language performance that strengthens the open-source model ecosystem.
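As a rough illustration of the pixel-unshuffle token reduction mentioned above, the sketch below folds each 2x2 block of visual features into the channel dimension so the token count drops to one quarter; the shapes are assumptions, not InternVL 2.5's exact configuration.

```python
# Illustrative sketch of pixel-unshuffle-based visual token reduction.
import torch
import torch.nn.functional as F

def reduce_visual_tokens(feat: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """feat: (batch, channels, height, width) feature map from the vision encoder."""
    folded = F.pixel_unshuffle(feat, downscale_factor=factor)  # (B, C*f^2, H/f, W/f)
    return folded.flatten(2).transpose(1, 2)  # (B, (H/f)*(W/f), C*f^2): 1/4 the tokens

tokens = reduce_visual_tokens(torch.randn(1, 1024, 32, 32))
print(tokens.shape)  # torch.Size([1, 256, 4096]) instead of 1024 visual tokens
```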

DeepSeek, Inc., DeepSeek-V2.5-1210 Released

Link, 2024/12/10

  • Real-time internet search: enabling the "Internet Search" option at https://chat.deepseek.com/ allows QA grounded in live web results and answers based on up-to-date information.
  • Performance gains: improvements across math (MATH-500: 74.8% → 82.8%), coding (LiveCodeBench: 29.2% → 34.38%), writing, and roleplay, with higher-quality answers overall.
  • Open-source model on Hugging Face: released under a commercially permissive license, contributing to the community ecosystem.
  • Broader feature set: Function Calling, JSON Output, and FIM (Fill-In-the-Middle) completion make it easier to build a wide range of applications (see the sketch below).
  • End of the series and a preview of what is next: building on the cumulative results of the V2 series, DeepSeek plans next-generation foundation models.
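A hedged sketch of using the JSON Output feature through DeepSeek's OpenAI-compatible API is shown below; the endpoint, model name, and environment-variable handling are assumptions based on the vendor's public documentation rather than details from this announcement.

```python
# Hedged sketch: structured JSON output from DeepSeek's OpenAI-compatible endpoint.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],     # assumed env var
    base_url="https://api.deepseek.com",        # assumed OpenAI-compatible endpoint
)
resp = client.chat.completions.create(
    model="deepseek-chat",
    response_format={"type": "json_object"},    # JSON Output mode
    messages=[
        {"role": "system", "content": 'Reply with a JSON object like {"answer": ...}.'},
        {"role": "user", "content": "What is 17 * 24?"},
    ],
)
print(resp.choices[0].message.content)
```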

Microsoft, Florence-VL Paper Published

Link, 2024/12/06

  • Architecture and methodology in detail: the Florence-2 vision foundation model is combined with Phi 3.5 and LLaMA 3 LLMs; Depth-Breadth Fusion (multiple layer depths plus multiple prompts) exploits multi-faceted visual features.
  • Stronger knowledge-grounded reasoning and hallucination suppression: improved vision-language alignment increases answer accuracy and factuality.
  • Top-tier results on diverse datasets: broad capability across OCR, charts, and knowledge-based VQA; the models and training recipe are released to encourage reproduction and follow-up research.

Google, PaliGemma 2 Released

Link, 2024/12/05

  • 3B/10B/28B parameter scales with multiple resolutions (224, 448, 896 px): per-model resolution settings and transfer-oriented training support a wide range of vision-language tasks.
  • Expanded transfer-learning scope: complex VLM tasks such as OCR, table structure recognition, molecular structure recognition, music score recognition, and radiology report generation.
  • Built on the SigLIP-So400m vision encoder: strong visual recognition combined with the Gemma 2 language models for multi-task transfer performance.
  • Open release and analysis: enables experiments with diverse fine-tuning strategies and LLM integration techniques; a usage sketch follows below.
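The sketch below shows one plausible way to run inference on a PaliGemma 2 checkpoint with Transformers; the repository id google/paligemma2-3b-pt-448 and the "caption en" prompt convention are assumptions based on the release collection, and the pretrained checkpoints are primarily intended for fine-tuning rather than zero-shot chat.

```python
# Hedged sketch: captioning with a PaliGemma 2 pretrained checkpoint via transformers.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-3b-pt-448"  # assumed repo id from the release collection
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("chart.png")  # any local test image
inputs = processor(text="<image> caption en", images=image, return_tensors="pt").to(model.device)
with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```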

Tencent, HunyuanVideo Large-Scale Video Generation Model Released

Link, 2024/12/07

  • Very large open-source video generation model (13B+ parameters): a systematic framework for video content generation.
  • 3D VAE-based video compression and reconstruction: efficient latent compression enables high-quality, high-resolution video generation.
  • Multimodal LLM as text encoder: stronger image-text alignment and reasoning, improving prompt following and the handling of objects and actions described in the video.
  • Systematic training framework: data curation, progressive model scaling and training, and infrastructure tailored for large-scale training and inference.
  • Outperforms commercial models such as Runway Gen-3 and Luma 1.6 in professional evaluations: released as open source to expand community contribution.

Amazon Web Services, Agents for Amazon Bedrock Generally Available

Link, 2023/12/10

  • Agent-based orchestration: Agents for Amazon Bedrock use FM reasoning to break user requests into multi-step tasks, then execute the plan by calling company APIs and accessing knowledge bases via RAG (Retrieval-Augmented Generation).
  • Editable orchestration prompts: automatically generated prompt templates can be modified to optimize for specialized domains.
  • Visible CoT reasoning: the new trace capability exposes each step of the chain-of-thought, improving the transparency of the problem-solving process (see the sketch after this list).
  • API call validation and data control: thorough validation of API calls, automated prompt engineering, and improved enterprise workflows.
  • Available in the US East (N. Virginia) and US West (Oregon) Regions; billed on the agent's InvokeModel inference calls: easy to integrate into production services.
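Below is a hedged sketch of invoking an agent with tracing enabled so the orchestration/CoT steps described above can be inspected programmatically; the agent IDs are placeholders and the event-stream handling follows the boto3 bedrock-agent-runtime documentation.

```python
# Hedged sketch: invoke a Bedrock agent with trace events enabled (IDs are placeholders).
import boto3

runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
response = runtime.invoke_agent(
    agentId="AGENT_ID",             # placeholder
    agentAliasId="AGENT_ALIAS_ID",  # placeholder
    sessionId="demo-session-1",
    inputText="Summarize the status of order #1234.",
    enableTrace=True,               # surfaces orchestration / chain-of-thought trace events
)
for event in response["completion"]:          # streamed event payloads
    if "chunk" in event:
        print(event["chunk"]["bytes"].decode("utf-8"), end="")
    elif "trace" in event:
        pass  # inspect event["trace"] to debug each reasoning step
```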

Microsoft, Magentic-One: A Generalist Multi-Agent System

Link, 2024/11/04

  • Generalist multi-agent framework: an Orchestrator (planning and reasoning) directs a Coder (code writing), WebSurfer (web browsing), FileSurfer (file navigation), and ComputerTerminal (code execution).
  • Outer/inner loop management: the outer loop maintains a Task Ledger (facts, guesses, plan) and the inner loop a Progress Ledger; when progress stalls, the Orchestrator re-plans (a conceptual sketch of this loop follows the list).
  • Benchmarked on GAIA, AssistantBench, and WebArena: handles complex user requests automatically, re-plans dynamically, and achieves statistically competitive performance with prior state of the art.
  • Built on Microsoft AutoGen: modular agent interactions make it easy to add or extend agents; the implementation is shared as open source.
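The snippet below is a conceptual Python sketch of the outer/inner ledger loop described above, not the actual Magentic-One or AutoGen code; the llm_plan/llm_reflect callables and the agent interface are hypothetical stand-ins.

```python
# Conceptual sketch of the Orchestrator's outer loop (Task Ledger) and inner loop
# (Progress Ledger). llm_plan, llm_reflect, and the agent objects are hypothetical.
def orchestrate(task, agents, llm_plan, llm_reflect, max_stalls=2):
    task_ledger = llm_plan(task)                      # facts, guesses, initial plan
    stalls = 0
    while True:
        progress = llm_reflect(task, task_ledger)     # update the Progress Ledger
        if progress["task_complete"]:
            return progress["final_answer"]
        if not progress["making_progress"]:
            stalls += 1
            if stalls > max_stalls:                   # stalled: revise the Task Ledger
                task_ledger = llm_plan(task, previous=task_ledger)
                stalls = 0
                continue
        worker = agents[progress["next_agent"]]       # WebSurfer, FileSurfer, Coder, ...
        observation = worker.run(progress["next_subtask"])
        task_ledger.setdefault("observations", []).append(observation)
```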

LG AI Research, EXAONE 3.5 Released

Link, 2024/12/09

  • Three open-source models (2.4B/7.8B/32B): from an ultra-compact on-device model to a frontier-class 32B model.
  • Stronger 32K-token long-context handling: optimized for RAG, large-document summarization, and analysis, with 32K reflecting the models' effective context length.
  • Top-tier instruction following: first place on the average of seven benchmarks, with strong results in Korean as well as English.
  • Efficient pre- and post-training (SFT, DPO) plus decontamination: deduplication and removal of personal information improve safety and reliability.
  • Transparent AI ethics reporting: strong filtering of hateful or illegal content, with remaining regional/occupational bias issues disclosed openly to support the research community. A loading sketch follows below.
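A hedged loading sketch follows; the repository id LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct and the need for trust_remote_code are assumptions about how the checkpoints are distributed on Hugging Face.

```python
# Hedged sketch: loading an EXAONE 3.5 instruct checkpoint with transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct"   # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # assumed: custom model code shipped with the repo
)

messages = [{"role": "user", "content": "Give one task that benefits from a 32K-token context."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```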

Microsoft, TRELLIS 3D Generation Model

Link, 2024/12/04

  • Structured Latent (SLAT) representation: fuses a sparse 3D grid with dense visual features, allowing decoding into multiple output formats such as Radiance Fields, 3D Gaussians, and meshes.
  • Rectified Flow Transformers: stable training in the 3D latent space, with models of up to 2B parameters trained on a dataset of 500K diverse 3D objects (a generic training-step sketch follows this list).
  • 3D objects from multimodal (text or image) inputs: detailed geometry and textures, with applications across AR/VR, games, and design.
  • High quality and flexible editing: local 3D editing and selectable output formats increase productivity.
  • Open source: code and models are being released, contributing to 3D generation research.
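For readers unfamiliar with rectified flow, the sketch below shows a generic training step in which a transformer learns the straight-line velocity from noise to latents; it is a textbook formulation, not the TRELLIS implementation, and the SLAT-shaped inputs are assumed.

```python
# Generic rectified-flow training step (illustrative; not the TRELLIS code).
import torch
import torch.nn.functional as F

def rectified_flow_step(model, x_data, optimizer):
    """x_data: a batch of latents, e.g. structured 3D latents (SLAT-like features)."""
    noise = torch.randn_like(x_data)
    # one timestep per sample, broadcastable over the remaining dimensions
    t = torch.rand(x_data.size(0), *([1] * (x_data.dim() - 1)), device=x_data.device)
    x_t = (1.0 - t) * noise + t * x_data          # straight-line interpolation
    target_velocity = x_data - noise              # constant velocity along that line
    pred_velocity = model(x_t, t.flatten())       # transformer predicts the velocity
    loss = F.mse_loss(pred_velocity, target_velocity)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```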

HuggingFace, FineWeb 2 Large-Scale Multilingual Corpus

Link, 2024/12/09

  • 8TB, roughly 3 trillion tokens: built from CommonCrawl snapshots spanning 2013-2024 and covering 1,000+ languages.
  • Careful filtering and deduplication: language-specific cleaning pipelines remove low-quality and sensitive content to maximize data quality.
  • Outperforms existing corpora such as CC-100 and mC4: validated with the FineTasks benchmarks, providing an optimized pretraining dataset for multilingual model research.
  • ODC-By 1.0 license: commercial use permitted, with code released for reproducibility and extensibility (a loading sketch follows below).
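A hedged sketch of streaming one language subset with the datasets library is shown below; the repository id comes from the release, while the per-language config name "kor_Hang" is an assumption about the naming scheme.

```python
# Hedged sketch: streaming a FineWeb 2 language subset without downloading the full corpus.
from datasets import load_dataset

ds = load_dataset(
    "HuggingFaceFW/fineweb-2",
    name="kor_Hang",      # assumed language/script config name
    split="train",
    streaming=True,
)
for i, doc in enumerate(ds):
    print(doc["text"][:200].replace("\n", " "))
    if i == 2:
        break
```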

AWS, Amazon Nova Foundation Model Family Announced

Link, 2024/12/04

  • Nova Micro/Lite/Pro/Premier/Canvas/Reel lineup: a range of price-performance options, with support for text, image, and video inputs.
  • RAG grounding plus fine-tuning/distillation: customization on customer data improves accuracy and factuality.
  • Multimodal and multilingual: around 200 languages, video understanding/generation, and agent integration for broad extensibility.
  • Integrated via Amazon Bedrock: a single API for multiple FMs makes enterprise deployment easier (see the sketch after this list).
  • Roadmap: speech-to-speech and any-to-any multimodal models are planned, targeting full-spectrum AI assistants.
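Below is a hedged sketch of calling a Nova model through the Bedrock Converse API from boto3; the model id amazon.nova-lite-v1:0 and the region are assumptions, while the request shape follows the standard Converse API.

```python
# Hedged sketch: text generation with an Amazon Nova model via the Bedrock Converse API.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")
resp = client.converse(
    modelId="amazon.nova-lite-v1:0",  # assumed model id
    messages=[{"role": "user", "content": [{"text": "Draft a two-line product blurb."}]}],
    inferenceConfig={"maxTokens": 200, "temperature": 0.3},
)
print(resp["output"]["message"]["content"][0]["text"])
```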

Hugging Face, TGI (Text Generation Inference) 3.0 Release

Link, 2024/12/10

  • 3x more tokens and up to 13x faster than vLLM on long prompts: long-prompt processing is greatly improved, with caching used for very long inputs (200K+ tokens).
  • Zero-configuration optimization: parameters are set automatically based on the hardware and model, improving ease of use.
  • Flash-infer and flash-decoding kernels: faster prompt ingestion and lower memory usage.
  • Optimized prefix caching: repeated prompts sharing a prefix get drastically shorter response times, suiting real-time services (see the sketch after this list).
  • Roadmap: support for specialized models, longer-lived KV caches, and better multimodal model compatibility.
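The sketch below exercises prefix caching through TGI's OpenAI-compatible endpoint; it assumes a TGI 3.x server is already running on localhost:8080 with some chat model, and the "tgi" model name is a placeholder the server typically ignores.

```python
# Hedged sketch: two requests sharing a long prefix; the second should hit the prefix cache.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")  # local TGI server assumed
long_context = open("long_report.txt").read()  # placeholder for a very long document

def ask(question: str) -> float:
    start = time.time()
    client.chat.completions.create(
        model="tgi",  # placeholder model name
        messages=[{"role": "user", "content": long_context + "\n\n" + question}],
        max_tokens=64,
    )
    return time.time() - start

print("first call :", ask("Summarize section 1."))
print("second call:", ask("Summarize section 2."))  # shared prefix served from cache
```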

PyTorch, vLLM Joins the PyTorch Ecosystem

Link, 2024/12/09

  • vLLM, a high-efficiency LLM serving engine: PagedAttention-based, memory-efficient KV-cache management with support across hardware accelerators.
  • Proven at very large scale: during Amazon Prime Day, vLLM served peak traffic of about 3 million tokens per minute with sub-second latency, validating it in production.
  • Optimized for large models such as Llama, Qwen, and DeepSeek: compatible with diverse backends (GPU, TPU, CPU) and offering optimized distributed inference.
  • Open-source development: a large community and tight integration with PyTorch make efficient LLM deployment easier (a short offline-inference sketch follows).
  • Official PyTorch ecosystem membership: recognized as a PyTorch Ecosystem project, contributing to industry standardization.
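A minimal offline-inference sketch with vLLM's Python API follows; the model id and tensor-parallel degree are illustrative choices, and PagedAttention-based KV-cache management happens inside the engine.

```python
# Minimal sketch: offline batched generation with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct", tensor_parallel_size=4)  # example setup
params = SamplingParams(temperature=0.2, max_tokens=128)
outputs = llm.generate(
    [
        "Explain PagedAttention in two sentences.",
        "List three bottlenecks in LLM serving.",
    ],
    params,
)
for out in outputs:
    print(out.outputs[0].text.strip())
```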
Sources

###
https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct
META
12/6/24

BOOOOM! Meta released Llama 3.3 70B - 128K context, multilingual, enhanced tool calling, outperforms Llama 3.1 70B and comparable to Llama 405B 🔥
Comparable performance to 405B with 6x LESSER parameters ⚡
Llama 3.3 70B vs 405B:
> GPQA Diamond (CoT): 50.5% vs 49.0%
> Math (CoT): 77.0% vs 73.8%
> Steerability (IFEval): 92.1% vs 88.6%
3.1 70B vs 3.3 70B:
Code Generation
> HumanEval: 80.5% → 88.4% (+7.9%)
> MBPP EvalPlus: 86.0% → 87.6% (+1.6%)
Steerability
> IFEval: 87.5% → 92.1% (+4.6%)
Reasoning & Math
> GPQA Diamond (CoT): 48.0% → 50.5% (+2.5%)
> MATH (CoT): 68.0% → 77.0% (+9%)
Multilingual Capabilities
> MGSM: 86.9% → 91.1% (+4.2%)
> Model weights on the Hub and fully integrated with Transformers! 🤗

Model Information
The Meta Llama 3.3 multilingual large language model (LLM) is a pretrained and instruction tuned generative model in 70B (text in/text out). The Llama 3.3 instruction tuned text only model is optimized for multilingual dialogue use cases and outperform many of the available open source and closed chat models on common industry benchmarks.

Model developer: Meta

Model Architecture: Llama 3.3 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

Llama 3.3 (text only). Training data: a new mix of publicly available online data. Params: 70B. Input modalities: multilingual text. Output modalities: multilingual text and code. Context length: 128k. GQA: yes. Token count: 15T+. Knowledge cutoff: December 2023.
Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

Llama 3.3 model. Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.

Model Release Date:

70B Instruct: December 6, 2024
Status: This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we improve model safety with community feedback.

License A custom commercial license, the Llama 3.3 Community License Agreement, is available at: https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/LICENSE

Where to send questions or comments about the model Instructions on how to provide feedback or comments on the model can be found in the model README. For more technical information about generation parameters and recipes for how to use Llama 3.3 in applications, please go here.

Intended Use
Intended Use Cases Llama 3.3 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks. The Llama 3.3 model also supports the ability to leverage the outputs of its models to improve other models including synthetic data generation and distillation. The Llama 3.3 Community License allows for these use cases.

Out-of-scope Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.3 Community License. Use in languages beyond those explicitly referenced as supported in this model card**.

**Note: Llama 3.3 has been trained on a broader collection of languages than the 8 supported languages. Developers may fine-tune Llama 3.3 models for languages beyond the 8 supported languages provided they comply with the Llama 3.3 Community License and the Acceptable Use Policy and in such cases are responsible for ensuring that any uses of Llama 3.3 in additional languages is done in a safe and responsible manner.

How to use
This repository contains two versions of Llama-3.3-70B-Instruct, for use with transformers and with the original llama codebase.

###
📣 Microsoft Research releases Florence-VL, a new family of MLLMs powered by the generative vision foundation model Florence-2.
Achieves significant improvements in general VQA, perception, hallucination, OCR, Chart, knowledge-intensive understanding, and more🔥

###
https://github.com/OpenGVLab/InternVL

2024/12/05: 🚀 We release InternVL 2.5, an advanced multimodal large language model (MLLM) series with parameter coverage ranging from 1B to 78B. InternVL2_5-78B is the first open-source MLLM to achieve over 70% on the MMMU benchmark, matching the performance of leading closed-source commercial models like GPT-4o. These models are available at the HF link.

New InternVL drop with a state-of-the-art 78B vision language model with MIT license 🔥
The release comes with seven new vision LMs based on InternViT 300M/6B and Qwen2.5 (0.5B, 3B, 32B, 72B) and InternLM2 (8B, 7B, 20B) in different sizes
78B model is of InternViT 6B and Qwen2.5-72B Instruct, can accomplish variety of tasks 👏

Model Architecture
As shown in the following figure, InternVL 2.5 retains the same model architecture as its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this new version, we integrate a newly incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 2.5 and Qwen 2.5, using a randomly initialized MLP projector.


As in the previous version, we applied a pixel unshuffle operation, reducing the number of visual tokens to one-quarter of the original. Besides, we adopted a similar dynamic resolution strategy as InternVL 1.5, dividing images into tiles of 448×448 pixels. The key difference, starting from InternVL 2.0, is that we additionally introduced support for multi-image and video data.

Training Strategy
Dynamic High-Resolution for Multimodal Data
In InternVL 2.0 and 2.5, we extend the dynamic high-resolution training approach, enhancing its capabilities to handle multi-image and video datasets.


For single-image datasets, the total number of tiles n_max are allocated to a single image for maximum resolution. Visual tokens are enclosed in <img> and </img> tags.

For multi-image datasets, the total number of tiles n_max are distributed across all images in a sample. Each image is labeled with auxiliary tags like Image-1 and enclosed in <img> and </img> tags.

For videos, each frame is resized to 448×448. Frames are labeled with tags like Frame-1 and enclosed in <img> and </img> tags, similar to images.

Single Model Training Pipeline
The training pipeline for a single model in InternVL 2.5 is structured across three stages, designed to enhance the model's visual perception and multimodal capabilities.


Stage 1: MLP Warmup. In this stage, only the MLP projector is trained while the vision encoder and language model are frozen. A dynamic high-resolution training strategy is applied for better performance, despite increased cost. This phase ensures robust cross-modal alignment and prepares the model for stable multimodal training.

Stage 1.5: ViT Incremental Learning (Optional). This stage allows incremental training of the vision encoder and MLP projector using the same data as Stage 1. It enhances the encoder’s ability to handle rare domains like multilingual OCR and mathematical charts. Once trained, the encoder can be reused across LLMs without retraining, making this stage optional unless new domains are introduced.

Stage 2: Full Model Instruction Tuning. The entire model is trained on high-quality multimodal instruction datasets. Strict data quality controls are enforced to prevent degradation of the LLM, as noisy data can cause issues like repetitive or incorrect outputs. After this stage, the training process is complete.

Progressive Scaling Strategy
We introduce a progressive scaling strategy to align the vision encoder with LLMs efficiently. This approach trains with smaller LLMs first (e.g., 20B) to optimize foundational visual capabilities and cross-modal alignment before transferring the vision encoder to larger LLMs (e.g., 72B) without retraining. This reuse skips intermediate stages for larger models.


Compared to Qwen2-VL's 1.4 trillion tokens, InternVL2.5-78B uses only 120 billion tokens—less than one-tenth. This strategy minimizes redundancy, maximizes pre-trained component reuse, and enables efficient training for complex vision-language tasks.

Training Enhancements
To improve real-world adaptability and performance, we introduce two key techniques:

Random JPEG Compression: Random JPEG compression with quality levels between 75 and 100 is applied as a data augmentation technique. This simulates image degradation from internet sources, enhancing the model's robustness to noisy images.

Loss Reweighting: To balance the NTP loss across responses of different lengths, we use a reweighting strategy called square averaging. This method balances contributions from responses of varying lengths, mitigating biases toward longer or shorter responses.

Data Organization
Dataset Configuration
In InternVL 2.0 and 2.5, the organization of the training data is controlled by several key parameters to optimize the balance and distribution of datasets during training.


Data Augmentation: JPEG compression is applied conditionally: enabled for image datasets to enhance robustness and disabled for video datasets to maintain consistent frame quality.

Maximum Tile Number: The parameter n_max controls the maximum tiles per dataset. For example, higher values (24–36) are used for multi-image or high-resolution data, lower values (6–12) for standard images, and 1 for videos.

Repeat Factor: The repeat factor r adjusts dataset sampling frequency. Values below 1 reduce a dataset's weight, while values above 1 increase it. This ensures balanced training across tasks and prevents overfitting or underfitting.

Data Filtering Pipeline
During development, we found that LLMs are highly sensitive to data noise, with even small anomalies—like outliers or repetitive data—causing abnormal behavior during inference. Repetitive generation, especially in long-form or CoT reasoning tasks, proved particularly harmful.


To address this challenge and support future research, we designed an efficient data filtering pipeline to remove low-quality samples.


The pipeline includes two modules. For pure-text data, three key strategies are used:

LLM-Based Quality Scoring: Each sample is scored (0–10) using a pre-trained LLM with domain-specific prompts. Samples scoring below a threshold (e.g., 7) are removed to ensure high-quality data.
Repetition Detection: Repetitive samples are flagged using LLM-based prompts and manually reviewed. Samples scoring below a stricter threshold (e.g., 3) are excluded to avoid repetitive patterns.
Heuristic Rule-Based Filtering: Anomalies like abnormal sentence lengths or duplicate lines are detected using rules. Flagged samples undergo manual verification to ensure accuracy before removal.
For multimodal data, two strategies are used:

Repetition Detection: Repetitive samples in non-academic datasets are flagged and manually reviewed to prevent pattern loops. High-quality datasets are exempt from this process.
Heuristic Rule-Based Filtering: Similar rules are applied to detect visual anomalies, with flagged data verified manually to maintain integrity.
Training Data
As shown in the following figure, from InternVL 1.5 to 2.0 and then to 2.5, the fine-tuning data mixture has undergone iterative improvements in scale, quality, and diversity. For more information about the training data, please refer to our technical report.


Evaluation on Multimodal Capability
(Benchmark result figures, omitted in this text dump, cover: Multimodal Reasoning and Mathematics; OCR, Chart, and Document Understanding; Multi-Image & Real-World Comprehension; Comprehensive Multimodal & Hallucination Evaluation; Visual Grounding; Multimodal Multilingual Understanding; Video Understanding.)

Evaluation on Language Capability
Training InternVL 2.0 models led to a decline in pure language capabilities. InternVL 2.5 addresses this by collecting more high-quality open-source data and filtering out low-quality data, achieving better preservation of pure language performance.


###
https://api-docs.deepseek.com/news/news1210
12/10/24
DeepSeek, Inc.

🚀 DeepSeek V2.5: The Grand Finale 🎉
🌐 Internet Search is now live on the web! Visit https://chat.deepseek.com/ and toggle “Internet Search” for real-time answers. 🕒


📊 DeepSeek-V2.5-1210 raises the bar across benchmarks like math, coding, writing, and roleplay—built to serve all your work and life needs.

🔧 Explore the open-source model on Hugging Face: https://huggingface.co/deepseek-ai/DeepSeek-V2.5-1210


🙌 With the release of DeepSeek-V2.5-1210, the V2.5 series comes to an end.

💪 Since May, the DeepSeek V2 series has brought 5 impactful updates, earning your trust and support along the way.

✨ As V2 closes, it’s not the end—it’s the beginning of something greater. DeepSeek is working on next-gen foundation models to push boundaries even further. Stay tuned!

“Every end is a new beginning.” 🕊️

The whales are back! 🐳 Deepseek-V2.5-1210 is here, with upgrades across math, coding, writing, and reasoning!
TL;DR:
🧠 236B MoE Instruct model with post-training updates to DeepSeek-V2.5
💡 MATH-500 benchmark has improved from 74.8% to 82.8%
📈 LiveCodebench increased from 29.2% to 34.38%
🛠️ Updated chat template, support Function Calling, JSON Output and FIM Completion
📊 Requires 80GB x 8 GPUs in BF16
🤗 Available on Hugging Face under a commercially permissive license

###
https://huggingface.co/papers/2412.04424
Microsoft

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Published on Dec 6 · Submitted by jiuhai on Dec 6 · #2 Paper of the day
Authors: Jiuhai Chen, Jianwei Yang, Haiping Wu, Dianqi Li, Jianfeng Gao, Tianyi Zhou, Bin Xiao
Abstract
We present Florence-VL, a new family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2, a generative vision foundation model. Unlike the widely used CLIP-style vision transformer trained by contrastive learning, Florence-2 can capture different levels and aspects of visual features, which are more versatile to be adapted to diverse downstream tasks. We propose a novel feature-fusion architecture and an innovative training recipe that effectively integrates Florence-2's visual features into pretrained LLMs, such as Phi 3.5 and LLama 3. In particular, we propose "depth-breath fusion (DBFusion)" to fuse the visual features extracted from different depths and under multiple prompts. Our model training is composed of end-to-end pretraining of the whole model followed by finetuning of the projection layer and the LLM, on a carefully designed recipe of diverse open-source datasets that include high-quality image captions and instruction-tuning pairs. Our quantitative analysis and visualization of Florence-VL's visual features show its advantages over popular vision encoders on vision-language alignment, where the enriched depth and breath play important roles. Florence-VL achieves significant improvements over existing state-of-the-art MLLMs across various multi-modal and vision-centric benchmarks covering general VQA, perception, hallucination, OCR, Chart, knowledge-intensive understanding, etc. To facilitate future research, our models and the complete training recipe are open-sourced. https://github.com/JiuhaiChen/Florence-VL

https://huggingface.co/collections/google/paligemma-2-release-67500e1e1dbfdd4dee27ba48
Google
12/5/24
PaliGemma 2: A Family of Versatile VLMs for Transfer

Wohooo! Google just released PaliGemma 2 - 3B, 10B & 28B Vision Language Models! 🔥
> 9 pre-trained models: 3B, 10B, and 28B with resolutions of 224x224, 448x448, and 896x896
> 2 models fine-tuned on DOCCI: Image-text caption pairs, supporting 3B and 10B (448x448)
Kudos Google for their commitment to Open Science! ⚡

PaliGemma 2: A Family of Versatile VLMs for Transfer
Published on Dec 5 · Submitted by osanseviero on Dec 5 · #1 Paper of the day
Authors: Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, Siyang Qin, Reeve Ingle, Emanuele Bugliarello, Sahar Kazemzadeh, Thomas Mesnard, Ibrahim Alabdulmohsin, Lucas Beyer, Xiaohua Zhai
Abstract
PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broad knowledge for transfer via fine-tuning. The resulting family of base models covering different model sizes and resolutions allows us to investigate factors impacting transfer performance (such as learning rate) and to analyze the interplay between the type of task, model size, and resolution. We further increase the number and breadth of transfer tasks beyond the scope of PaliGemma including different OCR-related tasks such as table structure recognition, molecular structure recognition, music score recognition, as well as long fine-grained captioning and radiography report generation, on which PaliGemma 2 obtains state-of-the-art results.


###
https://aivideo.hunyuan.tencent.com/
12/7/24
Tencent

HunyuanVideo: A Systematic Framework For Large Video Generation Model
               


This repo contains PyTorch model definitions, pre-trained weights and inference/sampling code for our paper exploring HunyuanVideo. You can find more visualizations on our project page.


🔥🔥🔥 News!!
Dec 7, 2024: 🚀 We release the parallel inference code for HunyuanVideo powered by xDiT.
Dec 3, 2024: 🤗 We release the inference code and model weights of HunyuanVideo.


🔥 Just in! Tencent has released HunyuanVideo, the biggest open-source video generation model! This framework is designed for large-scale video generation, offering a unified architecture for both image and video creation. HunyuanVideo integrates a Multimodal Large Language Model (MLLM) as a text encoder, enhancing image-text alignment and reasoning capabilities. It also employs a 3D VAE for efficient compression of video data, allowing for high-quality video generation at original resolutions.
The model has been tested against leading closed-source models and has shown superior performance in motion quality and overall video generation.
This is really impressive! 2025 will be the year of agents and video models 👀
[Submitted on 3 Dec 2024 (v1), last revised 6 Dec 2024 (this version, v2)]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, Weiyan Wang, Wenqing Yu, Xinchi Deng, Yang Li, Yi Chen, Yutao Cui, Yuanbo Peng, Zhentao Yu, Zhiyu He, Zhiyong Xu, Zixiang Zhou, Zunnan Xu, Yangyu Tao, Qinglin Lu, Songtao Liu, Daquan Zhou, Hongfa Wang, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang, Caesar Zhong (refer to the report for detailed contributions)
Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at this https URL.

###
https://aws.amazon.com/ko/blogs/korea/agents-for-amazon-bedrock-is-now-available-with-improved-control-of-orchestration-and-visibility-into-reasoning/
Amazon Web Services Korea Blog
Agents for Amazon Bedrock is now generally available - improved control of orchestration and visibility into reasoning
by Antje Barth on 10 December 2023 in Amazon Bedrock, Announcements, Artificial Intelligence, AWS re:Invent, Generative AI, Launch, News
Last July we introduced Agents for Amazon Bedrock (preview). Starting today, Agents for Amazon Bedrock is generally available.

Agents for Amazon Bedrock help accelerate the development of generative artificial intelligence (AI) applications by orchestrating multi-step tasks. Agents use the reasoning capability of foundation models (FMs) to break a user-requested task into multiple steps. Using developer-provided instructions, they create an orchestration plan and then carry it out by invoking company APIs and accessing knowledge bases through Retrieval Augmented Generation (RAG) to provide a final response to the end user. If you are curious how this works, see our earlier agent posts for a primer on advanced reasoning and a primer on RAG.

Starting today, Agents for Amazon Bedrock also include enhanced capabilities such as improved control of orchestration and improved visibility into chain-of-thought reasoning.

Agents for Amazon Bedrock automate prompt engineering and the orchestration of user-requested tasks, for example the orchestration of tasks such as retail order management or insurance claim processing. An agent automatically builds an orchestration prompt and, if a knowledge base is connected, augments that prompt with your company-specific information. It then invokes APIs to provide a natural-language response to the user.

During development, if you want to see the reasoning used while the plan is executed, you can use the new trace capability. You can view the intermediate steps of the orchestration process and use that information to troubleshoot issues.

You can also access the prompts automatically generated by the agent and modify them to further improve the end-user experience. By updating these automatically generated prompts (or prompt templates), you can improve the FM's orchestration and responses and take more control over the orchestration.

Let me show you how to view the reasoning steps and how to modify the prompts.

Viewing reasoning steps
You can use the trace capability to view the agent's reasoning, its chain of thought (CoT). With the CoT trace you can see how the agent performs the task step by step. CoT prompts are based on a reasoning technique called ReAct (a portmanteau of reasoning and acting). To learn more about ReAct and the specific prompt structure, see the primer on advanced reasoning in our earlier blog post.

To get started, navigate to the Amazon Bedrock console and select the working draft of an existing agent. Then choose the Test button and enter a sample user request. In the agent response, choose Show trace.

Agents for Amazon Bedrock

The CoT trace shows the agent's reasoning step by step. Open each step to see the CoT details.

Agents for Amazon Bedrock

This improved visibility helps you understand the rationale the agent used to complete the task. As a developer, you can use this information to refine prompts, instructions, and action descriptions to adjust the agent's actions and responses as you iteratively test and improve the user experience.

Modifying prompts created by the agent
The agent automatically generates prompt templates based on the instructions you provide. You can update the pre-processing of user input, the orchestration plan, and the post-processing of FM responses.

To get started, navigate to the Amazon Bedrock console and select the working draft of an existing agent. Then choose the Edit button next to Advanced prompts.

Agents for Amazon Bedrock

From here you can access four types of templates. The pre-processing template defines how the agent contextualizes and categorizes user input. The orchestration template provides the agent with short-term memory, a list of available actions and knowledge bases with their descriptions, and few-shot examples of how to break down a problem and use those actions and knowledge bases in different sequences or combinations. The knowledge base response generation template defines how the knowledge base is used and summarized in the response. The post-processing template defines how the agent formats the final response and presents it to the end user. You can keep the template defaults or edit and override them.

Things to know
Here are a few best practices and important things to know when working with Agents for Amazon Bedrock.

Agents perform best when they are focused on a specific task. The clearer the goal (instructions) and the more focused the set of available actions (APIs), the easier it is for the FM to reason about and identify the right steps. If you need an agent to cover a variety of tasks, consider creating separate, individual agents.

Here is some additional guidance:

Number of APIs - Use three to five APIs with a few input parameters in your agents.
API design - Follow general best practices for API design, such as ensuring idempotency.
API call validation - Apply thorough validation to all API calls, following API design best practices. Large language models (LLMs) can produce hallucinated inputs and outputs, and API call validation has proven helpful in those situations, so it is especially important.
Availability and pricing
Agents for Amazon Bedrock are available today in the US East (N. Virginia) and US West (Oregon) AWS Regions. You are charged for the agent's inference calls (InvokeModel API). There is no separate charge for the InvokeAgent API. See Amazon Bedrock pricing for full details.

###
https://www.microsoft.com/en-us/research/articles/magentic-one-a-generalist-multi-agent-system-for-solving-complex-tasks/
Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks
November 4, 2024
Share this page
Microsoft

By Adam Fourney, Principal Researcher; Gagan Bansal, Senior Researcher; Hussein Mozannar, Senior Researcher; Victor Dibia, Principal Research Software Engineer; Saleema Amershi, Partner Research Manager

Contributors: Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang (Eric) Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, Saleema Amershi

An illustrated workflow of Magentic-One completing a complex task from the GAIA agentic benchmark. The workflow starts with a description of the Task which reads “The attached image contains a Python script. Run the Python code against an array of strings, listed below. Output of the script is a URL containing C++ source code, compile, run and return the sum of the third and fifth integers…” The task description is shown flowing to the Orchestrator agent which then creates a dynamic/task-specific plan. The rest of the workflow lists the steps of the task being executed by the other agents on the Magentic-One team. First, the File Surfer accesses the image provided in the task description and extracts the code. Second, the Coder agent analyzes the Python code from the image. Third, the Computer Terminal executes the code provided by the Coder agent, outputting an url string. Fourth, the Web Surfer agent navigates to the url and extracts the C++ code shown on the page. Fifth, the Coder agent analyzes the C++ code. Sixth, the Computer Terminal executes the C++ code. Finally, the Orchestrator determines the task is complete and outputs the final result.
We are introducing Magentic-One, our new generalist multi-agent system for solving open-ended web and file-based tasks across a variety of domains. Magentic-One represents a significant step towards developing agents that can complete tasks that people encounter in their work and personal lives. We are also releasing an open-source implementation of Magentic-One(opens in new tab) on Microsoft AutoGen, our popular open-source framework for developing multi-agent applications.
The future of AI is agentic. AI systems are evolving from having conversations to getting things done—this is where we expect much of AI’s value to shine. It’s the difference between generative AI recommending dinner options to agentic assistants that can autonomously place your order and arrange delivery. It’s the shift from summarizing research papers to actively searching for and organizing relevant studies in a comprehensive literature review.

Modern AI agents, capable of perceiving, reasoning, and acting on our behalf, are demonstrating remarkable performance in areas such as software engineering, data analysis, scientific research, and web navigation. Still, to fully realize the long-held vision of agentic systems that can enhance our productivity and transform our lives, we need advances in generalist agentic systems. These systems must reliably complete complex, multi-step tasks across a wide range of scenarios people encounter in their daily lives.

Introducing Magentic-One(opens in new tab), a high-performing generalist agentic system designed to solve such tasks. Magentic-One employs a multi-agent architecture where a lead agent, the Orchestrator, directs four other agents to solve tasks. The Orchestrator plans, tracks progress, and re-plans to recover from errors, while directing specialized agents to perform tasks like operating a web browser, navigating local files, or writing and executing Python code.

Magentic-One achieves statistically competitive performance to the state-of-the-art on multiple challenging agentic benchmarks, without requiring modifications to its core capabilities or architecture. Built on AutoGen(opens in new tab), our popular open-source multi-agent framework, Magentic-One’s modular, multi-agent design offers numerous advantages over monolithic single-agent systems. By encapsulating distinct skills in separate agents, it simplifies development and reuse, similar to object-oriented programming. Magentic-One’s plug-and-play design further supports easy adaptation and extensibility by enabling agents to be added or removed without needing to rework the entire system—unlike single-agent systems, which often struggle with inflexible workflows.

We’re making Magentic-One open-source(opens in new tab) for researchers and developers. While Magentic-One shows strong generalist capabilities, it’s still far from human-level performance and can make mistakes. Moreover, as agentic systems grow more powerful, their risks—like taking undesirable actions or enabling malicious use-cases—can also increase. While we’re still in the early days of modern agentic AI, we’re inviting the community to help tackle these open challenges and ensure our future agentic systems are both helpful and safe. To this end, we’re also releasing AutoGenBench(opens in new tab), an agentic evaluation tool with built-in controls for repetition and isolation to rigorously test agentic benchmarks and tasks while minimizing undesirable side-effects.

Code on GitHub
Read the technical report
How it works
A diagram illustrating Magentic-One’s multi-agent architecture. The diagram depicts the inner working of the Orchestrator agent at the top and points to the other agents on the team at the bottom. Within the Orchestrator, an outer and inner loop are depicted. The outer loop shows a task ledger, which contains facts, guesses, and the current plan, and a pointer into and out of an inner loop. The inner loop shows a progress ledger, which tracks the current task progress and assignments for each agent, pointing to a decision node with the text “Task complete?”. If “Yes” the diagram shows the flow breaking out of the Orchestrator and pointing to a “Task Complete” termination node. If “No” the diagram shows the flow pointing to another decision node with the text “Progress being made?”. If “Yes” the flow points out of the Orchestrator toward one of the other agents on the team, indicating a handoff of control. If “No”, the flow points to third decision node with the text “Stall count > 2”. If “Yes” the flow goes back to the outer loop’s Task Ledger which is updated before the agents try again. If “No”, the flow again points out of the Orchestrator toward one of the other agents. The other agents depicted at the bottom of the diagram are named and described as follows: a Coder (“Write code and reason to solve tasks”), Computer Terminal (“Execute code written by the coder agent”), WebSurfer (“Browse the internet (navigate pages, fill forms, etc)”), and a FileSurfer (“Navigate files (e.g., PDFs, pptx, WAV, etc)”).
Magentic-One features an Orchestrator agent that implements two loops: an outer loop and an inner loop. The outer loop (lighter background with solid arrows) manages the task ledger (containing facts, guesses, and plan) and the inner loop (darker background with dotted arrows) manages the progress ledger (containing current progress, task assignment to agents).
Magentic-One work is based on a multi-agent architecture where a lead Orchestrator agent is responsible for high-level planning, directing other agents and tracking task progress. The Orchestrator begins by creating a plan to tackle the task, gathering needed facts and educated guesses in a Task Ledger that is maintained. At each step of its plan, the Orchestrator creates a Progress Ledger where it self-reflects on task progress and checks whether the task is completed. If the task is not yet completed, it assigns one of Magentic-One other agents a subtask to complete. After the assigned agent completes its subtask, the Orchestrator updates the Progress Ledger and continues in this way until the task is complete. If the Orchestrator finds that progress is not being made for enough steps, it can update the Task Ledger and create a new plan. This is illustrated in the figure above; the Orchestrator work is thus divided into an outer loop where it updates the Task Ledger and an inner loop to update the Progress Ledger.

Magentic-One consists of the following agents:

Orchestrator: The lead agent responsible for task decomposition, planning, directing other agents in executing subtasks, tracking overall progress, and taking corrective actions as needed
WebSurfer: An LLM-based agent proficient in commanding and managing the state of a Chromium-based web browser. For each request, the WebSurfer performs actions such as navigation (e.g., visiting URLs, performing searches), interacting with webpages (e.g., clicking, typing), and reading actions (e.g., summarizing, answering questions). It then reports on the new state of the webpage. The WebSurfer relies on the browser’s accessibility tree and set-of-marks prompting to perform its tasks.
FileSurfer: An LLM-based agent that commands a markdown-based file preview application to read local files. It can also perform common navigation tasks such as listing directory contents and navigating through them.
Coder: An LLM-based agent specialized in writing code, analyzing information collected from the other agents, and creating new artifacts.
ComputerTerminal: Provides access to a console shell for executing programs and installing new libraries.
Together, Magentic-One’s agents equip the Orchestrator with the tools and capabilities it needs to solve a wide range of open-ended problems and autonomously adapt to, and act in, dynamic and ever-changing web and file-system environments.

While the default multimodal LLM used for all agents is GPT-4o, Magentic-One is model-agnostic, allowing the integration of heterogeneous models to support different capabilities or meet different cost requirements. For example, different LLMs and SLMs or specialized versions can power different agents. For the Orchestrator, we recommend a strong reasoning model, like GPT-4o. In a different configuration, we also experimented with using OpenAI o1-preview for the Orchestrator’s outer loop and for the Coder, while other agents continued to use GPT-4o.

Evaluation
To rigorously evaluate Magentic-One’s performance, we introduce AutoGenBench, an open-source standalone tool for running agentic benchmarks that allows repetition and isolation, e.g., to control for variance of stochastic LLM calls and side-effects of agents taking actions in the world. AutoGenBench facilitates agentic evaluation and allows adding new benchmarks. Using AutoGenBench, we can evaluate Magentic-One on a variety of benchmarks. Our criterion for selecting benchmarks is that they should involve complex multi-step tasks, with at least some steps requiring planning and tool use, including using web browsers to act on real or simulated webpages. We consider three benchmarks in this work that satisfy this criterion: GAIA, AssistantBench, and WebArena.

In the Figure below we show the performance of Magentic-One on the three benchmarks and compare with GPT-4 operating on its own and the per-benchmark highest-performing open-source baseline and non open-source benchmark specific baseline according to the public leaderboards as of October 21, 2024. Magentic-One (GPT-4o, o1) achieves statistically comparable performance to previous SOTA methods on both GAIA and AssistantBench and competitive performance on WebArena. Note that GAIA and AssistantBench have a hidden test set while WebArena does not, and thus WebArena results are self-reported. Together, these results establish Magentic-One as a strong generalist agentic system for completing complex tasks.

A bar chart showing evaluation results of Magentic-One on the GAIA, AssistantBench, and WebArena benchmarks. The bars are grouped along the x-axis by benchmark, with bars corresponding to: GPT-4, Benchmark specific non-open source SOTA, Benchmark specific open-source SOTA, Magentic-One (GPT-4o), Magentic-One (GPT-4o, o1-preview), and Human performance, in that order for each benchmark. The y-axis shows “Accuracy (%)” from 0-100%. The chart shows GPT-4 performing worst on all benchmarks (around 7%,16%, and 15%, respectively) while the human level performance (only available for GAIA and WebArena) achieves around 92% and 78%, respectively. The chart shows Magentic-One perform comparably to the SOTA solutions on all benchmarks, aside from the Benchmark specific non-OS SOTA results on WebArena. An asterisk is shown in this case to depict that the non-open-source solutions provide no documentation or implementation for the community.
Evaluation results of Magentic-One on the GAIA, AssistantBench and WebArena. Error bars indicate 95% confidence intervals. Note that WebArena results are self-reported.
Risks and mitigations
Agentic systems like Magentic-One mark a significant shift in both the opportunities and risks associated with AI. Magentic-One interacts with a digital world designed for humans, taking actions that can change states and potentially lead to irreversible consequences. These inherent and undeniable risks were evident during our testing, where several emerging issues surfaced. For example, during development, a misconfiguration led agents to repeatedly attempt and fail to log into a WebArena website. This resulted in the account being temporarily suspended. The agents then tried to reset the account’s password. Even more concerning were cases in which agents, until explicitly stopped, attempted to recruit human assistance by posting on social media, emailing textbook authors, or even drafting a freedom of information request to a government entity. In each case, the agents were unsuccessful due to a lack of the required tools or accounts, or because human observers intervened.

Aligned with the Microsoft AI principles and Responsible AI practices, we worked to identify, measure, and mitigate potential risks before deploying Magentic-One. Specifically, we conducted red-teaming exercises to assess risks related to harmful content, jailbreaks, and prompt injection attacks, finding no increased risk from our design. Additionally, we provide cautionary notices and guidance for using Magentic-One safely, including examples and appropriate default settings. Users are advised to keep humans in the loop for monitoring, and ensure that all code execution examples, evaluations, and benchmarking tools are run in sandboxed Docker containers to minimize risks.

Recommendations and looking forward
We recommend using Magentic-One with models that have strong alignment, pre- and post-generation filtering, and closely monitored logs during and after execution. In our own use, we follow the principles of least privilege and maximum oversight. Minimizing risks associated with agentic AI will require new ideas and extensive research, as much work is still needed to understand these emerging risks and develop effective mitigations. We are committed to sharing our learnings with the community and evolving Magentic-One in line with the latest safety research.

As we look ahead, there are valuable opportunities to improve agentic AI, particularly in safety and Responsible AI research. Agents acting on the public web may be vulnerable to phishing, social engineering, and misinformation threats, much like human users. To counter these risks, an important direction is to equip agents with the ability to assess the reversibility of their actions—distinguishing between those that are easily reversible, those that require effort, and those that are irreversible. Actions like deleting files, sending emails, or filing forms are often difficult or impossible to undo. Systems should therefore be designed to pause and seek human input before proceeding with such high-risk actions.

We invite the community to collaborate with us in ensuring that future agentic systems are both helpful and safe.

For further information, results and discussion, please see our technical report.

###
https://www.lgresearch.ai/blog/view?seq=506
LG AI Research
BLOG
2024.12.09




EXAONE 3.5 3개 모델 오픈소스로 공개 - Frontier AI급의 모델, Instruction Following 및 Long Context 최고 수준 성능 달성
EXAONE 3.5 기반 모델들을 오픈소스로 공개합니다. 지난 8월 EXAONE 3.0 기반 7.8B 모델을 공개한 이후 4달 만에 한층 강력해진 모델 라인업을 선보이게 되었습니다.

EXAONE 3.0을 공개한 직후 우리는 기업, 기관, 학계 등 많은 곳으로부터 다양한 피드백을 받았습니다. 그중 활용 목적에 맞춰 효율적으로 사용할 수 있는 다양한 사이즈의 모델을 공개해달라는 요청이 핵심을 이루었습니다.




이미지 1. EXAONE 3.5 3개 모델 오픈소스로 공개


피드백을 반영해 이번에 공개하는 EXAONE 3.5 기반 모델은 3가지 유형으로 구축되었습니다. 3개 모델 모두, 동일 사이즈의 글로벌 모델 대비 강력한 성능을 입증했습니다. 먼저 온 디바이스용 초경량 모델인 2.4B 모델입니다. 온 디바이스 환경이나 저사양 GPU에서도 학습과 추론이 가능한 경량화 모델로, 우수한 인프라 환경이 갖춰지지 않은 곳에서도 모델 구동이 가능합니다. 다음으로 사용자의 목적에 맞춰 범용적 활용이 가능한 경량 모델인 7.8B 모델입니다. 이전 버전의 오픈소스 모델과 크기는 동일하지만 성능은 더욱 향상되었습니다. 마지막으로 Frontier AI 급의 고성능 모델인 32B 모델입니다. 성능을 최우선으로 고려하는 고객을 위한 강력한 모델입니다.

우리의 오픈소스 모델 공개는 여기서 그치지 않습니다. EXAONE 3.5 모델에 대한 다양한 피드백에 귀 기울이고, 연구자들의 니즈에 맞춘 모델을 꾸준히 공개해 나갈 계획입니다. 이 과정을 통해 연구와 생태계 발전에 기여하고, AI 혁신의 기반을 이뤄가겠습니다. EXAONE 3.5 모델에 대한 여러분의 다양한 피드백을 기다리겠습니다.


Our Expertise in EXAONE 3.5

Training Efficiency

EXAONE 3.5의 3개 모델의 특징은 뛰어난 성능과 함께 경제성까지 확보했다는 점입니다. 그 배경에는 모델 학습의 효율성을 높인 LG AI연구원만의 연구개발 방식이 있습니다. 사전학습 단계에서는 중복된 데이터와 개인 식별 정보를 제거하는 등의 과정을 통해, 모델 답변의 성능을 높이고 인프라 비용을 줄이고자 했습니다. 또한 사후학습 단계에서는 모델의 사용성을 높이고, 새로운 과제 수행 능력을 높이는 방향에 초점을 맞췄습니다. 크게 SFT(Supervised Fine-tuning)와 DPO(Direct Preference Optimization) 방식을 통해 Instruction Following 능력을 강화하고 사용자의 선호도를 모델이 잘 반영할 수 있도록 했습니다.


Decontamination

EXAONE 3.5의 성능 평가 결과에 대한 신뢰도를 높이기 위해 치밀한 Decontamination 과정도 수행했습니다. 글로벌 모델에 사용된 Decontamination 방식을 차용하되, 평가에 활용된 데이터셋과 학습 데이터를 비교하는 과정을 10회 반복 수행함으로써 엄격하게 벤치마크 성능 평가를 진행했습니다.

이는 지금부터 소개할 EXAONE 3.5의 벤치마크 성능을 자신 있게 설명할 수 있는 이유입니다.


Key Takeaways 1. EXAONE 3.5 : A Global Model of Excellence

1. Long Context Understanding : The Top Performance in Four Benchmarks

EXAONE 3.5의 가장 강력한 특징은 바로 Long Context에 대한 이해도와 처리 능력이 향상됐다는 점입니다. 웹 검색 결과나 참조 문서 기반으로 답변을 생성하는 RAG(Retrieval-Augmented Generation) 기술이 활용되면서, 모델의 Long Context 이해도가 중요해졌습니다. 이번에 공개한 EXAONE 3.5 모든 모델들은 32K 토큰 Context를 처리할 수 있도록 모델링 되었고, 각각의 모델은 유사 사이즈의 글로벌 모델 대비 Long Context 처리 부분에서 가장 좋은 성능을 입증했습니다.

특히 일부 모델들의 경우 32K 토큰보다 더 긴 Context가 이해가 가능하다고 이야기하지만, 이는 모델 설계 시 확인한 이론적 수치에 불과한 경우가 많습니다. 반면 EXAONE 3.5는 모델이 실제로 이해하고 처리할 수 있는 최대 토큰 길이인 Effective Context Length가 32K로, 최근의 AI 연구와 활용 흐름에 맞춰 가장 효과적으로 활용할 수 있는 모델입니다. 특히 Bi-lingual 모델인 EXAONE은 영어뿐만 아니라 한국어에서도 Long Context Understanding 성능이 최고 수준임을 확인했습니다.




이미지 2. Performance Comparison Results of EXAONE 3.5 - On Four Benchmarks Representing Long Context Scenarios
(Excluded from results if the model does not support context lengths longer than 16K)


2. Instruction Following Capabilities : The Highest Scores Across Seven Benchmarks

EXAONE 개발의 여정에서 가장 중요하게 생각하는 부분은 바로 모델의 실제 사용성 측면입니다. 실제 모델 연구와 개발 과정에서도 EXAONE이 실제 산업현장에서 사람의 생산성과 업무 효율성을 높여줄 수 있을 정도의 성능을 갖췄는지에 초점을 맞춰왔습니다. EXAONE 3.5의 테크니컬 리포트에서는 실제 사용성과 관련된 성능을 ‘Real-world use cases’로 기재했습니다. 총 7개의 벤치마크를 활용했고 EXAONE 3.5의 세 개 모델 모두 Instruction Following 능력 평균 점수 1위를 보이며, 동일 사이즈의 글로벌 모델들을 큰 차이로 앞서고 있음을 확인했습니다. Instruction Following 성능 역시 영어 뿐만 아니라 한국어에서도 우수한 성능을 확인할 수 있습니다.




Image 3. Performance Comparison Results of EXAONE 3.5 - On Seven Benchmarks Representing Real-world Use Case Scenarios


3. Business Partnerships : Uncovering New Opportunities

AI services now need to move beyond demonstrating technical possibilities to proving real-world usefulness and building business models. LG AI Research is likewise turning partnerships with domestic and global companies into tangible business results. In Korea, we are discussing applying EXAONE 3.5-based AI solutions to the services of companies with their own software, such as Polaris Office and Hancom. In particular, we are running a PoC project to build an EXAONE 3.5-based AI service into Hancom Office, which is widely used by public institutions, and we expect this to improve the work efficiency of government and public organizations.


Key Takeaways 2. EXAONE's Enhanced Features

1. General Domain : Competitive Results on Nine Benchmarks Compared to SOTA Open Models

EXAONE 3.5 also performs well in mathematics and programming. Across a total of nine benchmarks, the 2.4B model in particular ranked first in average score, outperforming global models of the same size, while the 7.8B and 32B models also placed near the top in average score.




Image 4. Performance Comparison Results of EXAONE 3.5 - On Nine Benchmarks Representing General Scenarios.
Bold scores indicate the best performance, and underlined scores mean the second best


2. Responsible AI : Open and Transparent Disclosure of Information

During the development of EXAONE 3.5, we sought to fulfill our corporate responsibility for ethical AI. Releasing a lineup of models of various sizes as open source can contribute to AI research and to the ecosystem, but it also carries potential risks, such as unintended unfairness toward socially vulnerable groups, the generation of harmful content, and malicious use. To identify and prevent these risks in advance, we conducted an AI ethical impact assessment that reviews risks across the entire AI lifecycle, and we carried out our research and development in compliance with the LG AI Ethics Principles.

The ethics evaluation of EXAONE 3.5 revealed both strengths and areas for improvement. All three models performed well at filtering hate speech and illegal content, while the 2.4B model showed biases related to region and occupation that still need to be addressed. We are disclosing these results as they are because we believe that transparent disclosure must come first for AI ethics to advance. We hope this disclosure encourages researchers to pursue more active work on AI ethics, and LG AI Research will continue its own AI ethics research.

###
https://trellis3d.github.io/
Microsoft
12/4/24
Structured 3D Latents
for Scalable and Versatile 3D Generation
Jianfeng Xiang1,3
Zelong Lv2,3
Sicheng Xu3
Yu Deng3
Ruicheng Wang2,3
Bowen Zhang2,3
Dong Chen3
Xin Tong3
Jiaolong Yang3
1Tsinghua University
2USTC
3Microsoft Research
* Generated by TRELLIS, using its image-to-3D asset capabilities.
TL;DR: A native 3D generative model built on a unified Structured Latent representation and Rectified Flow Transformers, enabling versatile and high-quality 3D asset creation.
We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding.
We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models. Code, model, and data will be released.
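To make the rectified-flow component concrete, the block below sketches a generic Euler sampler for a rectified-flow model. It illustrates only the sampling loop, not TRELLIS's actual SLAT pipeline; velocity_model and the latent shape are placeholders.

import torch

@torch.no_grad()
def rectified_flow_sample(velocity_model, shape, steps=50, device="cpu"):
    """Euler integration of a rectified flow from noise (t=0) toward data (t=1).
    velocity_model(x, t) is assumed to predict the velocity field; in TRELLIS this
    role is played by transformers operating on the structured latents (SLAT)."""
    x = torch.randn(shape, device=device)                 # start from Gaussian noise
    ts = torch.linspace(0.0, 1.0, steps + 1, device=device)
    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        v = velocity_model(x, t)                          # predicted velocity at time t
        x = x + (t_next - t) * v                          # Euler step along the straightened flow
    return x  # decoded downstream into Radiance Fields, 3D Gaussians, or meshes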
NOTE: The appearance and geometry shown in this page are rendered from 3D Gaussians and meshes, respectively. GLB files are extracted by baking appearance from 3D Gaussians to meshes.

TRELLIS is a large 3D asset generation model. It takes in text or image prompts and generates high-quality 3D assets in various formats, such as Radiance Fields, 3D Gaussians, and meshes. The cornerstone of TRELLIS is a unified Structured LATent (SLAT) representation that allows decoding to different output formats and Rectified Flow Transformers tailored for SLAT as the powerful backbones. We provide large-scale pre-trained models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. TRELLIS significantly surpasses existing methods, including recent ones at similar scales, and showcases flexible output format selection and local 3D editing capabilities which were not offered by previous models.

Check out our Project Page for more videos and interactive demos!

🌟 Features
High Quality: It produces diverse 3D assets at high quality with intricate shape and texture details.
Versatility: It takes text or image prompts and can generate various final 3D representations including but not limited to Radiance Fields, 3D Gaussians, and meshes, accommodating diverse downstream requirements.
Flexible Editing: It allows for easy editing of generated 3D assets, such as generating variants of the same object or locally editing parts of a 3D asset.

###
https://huggingface.co/datasets/HuggingFaceFW/fineweb-2
12/9/24
FineWeb 2.0 - 8 Terabytes, 3 Trillion tokens, 1000 languages - simply the best multilingual pre-training corpus out there! 🔥
Available under a commercially permissive license! 🤗
🥂 FineWeb2
FineWeb 2: A sparkling update with 1000s of languages

What is it?
This is the second iteration of the popular 🍷 FineWeb dataset, bringing high quality pretraining data to over 1000 🗣️ languages.

The 🥂 FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license and extensively validated through hundreds of ablation experiments.

In particular, on the set of 9 diverse languages we used to guide our processing decisions, 🥂 FineWeb2 outperforms other popular pretraining datasets covering multiple languages (such as CC-100, mC4, CulturaX, or HPLT) while being substantially larger, and in some cases it even performs better than some datasets specifically curated for a single one of these languages, on our diverse set of carefully selected evaluation tasks: FineTasks.

Figure: multilingual dataset comparison results.
The data was sourced from 96 CommonCrawl snapshots, spanning the summer of 2013 to April 2024, and processed using 🏭 datatrove, our large scale data processing library. This carefully deduplicated and filtered dataset comprises roughly 8 terabytes of compressed text data, with almost 3 trillion words (see How many tokens? for more details). For PII and opt-out see Personal and Sensitive Information and opt-out.

You will find our ablation and evaluation setup in this github repo. We will soon upload model checkpoints from our ablation experiments.

Stay tuned for our upcoming 📝 blogpost explaining how we individually adapted the original 🍷 FineWeb pipeline to each language!

Languages and available subsets
For English data, please refer to the original 🍷 FineWeb.

Each language is identified by its ISO 639-3 code, and the data is grouped by language-script pairs, since some languages have content in multiple scripts.

In total, we provide filtered data for 1,893 language-script pairs. Of these, 486 have more than 1MB of text data, and 80 have more than 1GB of filtered data. Most languages also include a small test split which should not be trained on.

While we tried our best to not overfilter, we know that our filtering isn't perfect, and wanted to allow the community to easily re-filter the data with their own filtering criteria. We have therefore also uploaded the data that was removed by our filtering pipeline for each language (it is suffixed by _removed). The filtered + the removed subsets of each language represent the entire data for a given language following global deduplication, which means that you do not have to re-deduplicate it yourself. You can find and adapt our filtering code here.

Additionally, we also uploaded data written in scripts that the language classifier does not support, or in a supported script but an unidentified language, without any deduplication or filtering. These subsets are prefixed by und_.
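For reference, a minimal sketch for streaming one language-script subset with the datasets library is shown below; the config name follows the ISO 639-3 plus script convention described above (kor_Hang is used as an assumed example) and should be verified against the dataset card.

from datasets import load_dataset

# Stream one language-script subset without downloading it in full.
fw2_korean = load_dataset(
    "HuggingFaceFW/fineweb-2",
    name="kor_Hang",   # assumed config name: ISO 639-3 code + script
    split="train",
    streaming=True,
)

for i, doc in enumerate(fw2_korean):
    print(doc["text"][:200])  # the text field is assumed to hold the filtered web content
    if i == 2:
        break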

The following table (on the dataset page) shows the size of the filtered subset for the 80 biggest languages; expand the details there for the full list.

###
https://www.aboutamazon.com/news/aws/amazon-nova-artificial-intelligence-bedrock-aws
AWS
Introducing Amazon Nova, our new generation of foundation models
New state-of-the-art foundation models from Amazon deliver frontier intelligence and industry-leading price performance.

Introducing Amazon Nova. Prompt: A dinosaur sitting in a teacup.
Written by Amazon Staff

Last updated: December 4, 2024

4 min read

From our custom-built Inferentia and Trainium chips, to offering best-in-class foundation models on Amazon Bedrock, and AI-powered experiences like Rufus and Alexa, we’re committed to delivering generative AI (Gen AI) solutions that offer real-world value to our customers. Our goal is to use AI to simplify the lives of shoppers, sellers, advertisers, enterprises, and everyone in between.
As the next step in our AI journey, we’ve built Amazon Nova, a new generation of foundation models (FMs). With the ability to process text, image, and video as prompts, customers can use Amazon Nova-powered generative AI applications to understand videos, charts, and documents, or generate videos and other multimedia content.
“Inside Amazon, we have about 1,000 Gen AI applications in motion, and we’ve had a bird’s-eye view of what application builders are still grappling with,” said Rohit Prasad, SVP of Amazon Artificial General Intelligence. “Our new Amazon Nova models are intended to help with these challenges for internal and external builders, and provide compelling intelligence and content generation while also delivering meaningful progress on latency, cost-effectiveness, customization, information grounding, and agentic capabilities.”
The new Amazon Nova models available in Amazon Bedrock include:
Amazon Nova Micro, a text-only model that delivers the lowest latency responses at very low cost.
Amazon Nova Lite, a very low-cost multimodal model that is lightning fast for processing image, video, and text inputs.
Amazon Nova Pro, a highly capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks.
Amazon Nova Premier, the most capable of Amazon’s multimodal models for complex reasoning tasks and for use as the best teacher for distilling custom models (available in the Q1 2025 timeframe).
Amazon Nova Canvas, a state-of-the-art image generation model.
Amazon Nova Reel, a state-of-the-art video generation model.
Amazon Nova Reel transforms a single image input into a brief video with the prompt: dolly forward.
How Amazon Nova models will benefit customers
All Amazon Nova models are incredibly capable, fast, cost-effective, and have been designed to be easy to use with a customer’s systems and data. They support a wide range of tasks across 200 languages and multiple modalities. Amazon Nova Micro, Amazon Nova Lite, and Amazon Nova Pro are at least 75 percent less expensive than the best performing models in their respective intelligence classes in Amazon Bedrock. They are also the fastest models in their respective intelligence classes in Amazon Bedrock.
The models are integrated with Amazon Bedrock, a fully managed service that makes high-performing FMs from leading AI companies and Amazon available for use through a single API. Using Amazon Bedrock, customers can easily experiment with and evaluate Amazon Nova models, as well as other FMs, to determine the best model for an application.
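As an illustration of that single-API access, the sketch below calls a Nova model through the Bedrock Converse API with boto3. The model ID and region are assumptions; check which identifiers are enabled in your account.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # assumed region

response = bedrock.converse(
    modelId="amazon.nova-lite-v1:0",  # assumed Nova Lite identifier
    messages=[
        {"role": "user", "content": [{"text": "Summarize the key points of this document: ..."}]}
    ],
    inferenceConfig={"maxTokens": 256, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])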
The models also support custom fine-tuning, which allows customers to point the models to examples in their own proprietary data that have been labeled to boost accuracy. The Amazon Nova model learns what matters most to the customer from their own data (including text, images, and videos), and then Amazon Bedrock trains a private fine-tuned model that will provide tailored responses.
In addition to supporting fine-tuning, the models also support distillation, which enables the transfer of specific knowledge from a larger, highly capable “teacher model” to a smaller, more efficient model that is highly accurate, but also faster and cheaper to run.
Amazon Nova models are integrated with Amazon Bedrock Knowledge Bases and excel at Retrieval Augmented Generation (RAG), which enables customers to ensure the best accuracy by grounding responses in an organization’s own data.
Amazon Nova models have been optimized to make them easy to use and effective in agentic applications that require interacting with an organization’s proprietary systems and data through multiple APIs to execute multistep tasks.
Creative content generation
With their output quality, intuitive API platform, and advanced customization options, the Amazon Nova creative generation models, Amazon Nova Canvas and Amazon Nova Reel, are helping Amazon Ads remove barriers for sellers and advertisers, leading to new levels of creativity and innovation: using video advertising for the first time, creating display and video campaigns for more products in their catalog, and experimenting with new strategies like keyword-level creative. On average, we’ve seen brands using these tools advertise five times more products and twice as many images per advertised product, shifting budgets to what’s driving the best results.

In this example, Amazon Ads used Amazon Nova Reel to create a video advertisement for a fictional boxed pasta brand. In the whimsical “Pasta City,” buildings are sculpted from towering tubes of cannelloni noodles, the neighborhood is sprinkled with Italian spice landscaping, and the streets are lined with savory marinara sauce, fusilli noodles, and tender meatballs. The ad is an example of how advertisers can bring their products to life and create high-quality content using Amazon Nova models.
Video understanding
To put the video understanding capabilities of Amazon Nova Pro to the test, we asked the model to review and describe a brief silent video clip of a football game. The results include details about the setting of the game, the team uniforms, descriptions of actions taken by the players, and how the play culminates.

Video via Shutterstock, Inc.
Video Prompt: “Describe the video.”
Output: “The video depicts a football game in progress on a green field. Players from two teams, one in yellow uniforms and the other in white, are engaged in a play. The quarterback from the yellow team throws a pass to a receiver, who catches the ball and begins running downfield. Defenders from the white team pursue him, attempting to tackle him. The play culminates in a tackle, bringing the receiver down on the field.”
Looking ahead
We will introduce two additional Amazon Nova models in 2025, including a speech-to-speech model and a native multimodal-to-multimodal—or “any-to-any” modality model. Our speech-to-speech model will understand streaming speech input in natural language, interpreting verbal and nonverbal cues (like tone and cadence), and delivering natural humanlike interactions, while our any-to-any model will be capable of processing text, images, audio, and video, as both input and output. It will simplify the development of applications where the same model can be used to perform a wide variety of tasks, such as translating content from one modality to another, editing content, and powering AI agents that can understand and generate all modalities.

Responsible AI
Amazon Nova models are built with integrated safety measures and protections. The company has launched AWS AI Service Cards for Amazon Nova models, offering transparent information on use cases, limitations, and responsible AI practices. Learn more about Amazon Nova and our commitment to responsible AI.
This is only the beginning for Amazon Nova, and we’re excited to continue innovating to deliver real-world value to every Amazon customer. Learn more and get started with Amazon Nova.

###
https://huggingface.co/docs/text-generation-inference/conceptual/chunking
Huggingface
12/10/24

3x more tokens and 13x faster generations than vLLM? 👀 Hugging Face TGI 3.0 released! 🎉 TGI 3.0 dramatically improves LLM inference: it processes 3x more input tokens and runs 13x faster than vLLM on long prompts, all while requiring zero configuration!
TL;DR:
🚀 Processes 3x more tokens than vLLM (30k vs 10k tokens on L4 GPU for llama 3.1-8B)
⚡ Achieves 13x faster processing on long prompts (200k+ tokens) through conversation caching
🔧 Significantly reduced memory & Zero configuration needed for models
🔬 New kernels (flash-infer, flash-decoding), optimized prefix caching, and improved VRAM efficiency
🤝 Soon available on AWS, Google Cloud, and Dell Enterprise Hub
🔜 Future: special models, KV-cache retention, and multimodal models

TGI v3 overview
Summary
Performance leap: TGI processes 3x more tokens, 13x faster than vLLM on long prompts. Zero config!

3x more tokens.
By reducing our memory footprint, we’re able to ingest many more tokens, and more dynamically, than before. A single L4 (24GB) can handle 30k tokens on llama 3.1-8B, while vLLM barely reaches 10k. A lot of work went into reducing the runtime’s footprint, and its effects are best seen in smaller, constrained environments.

13x faster
On long prompts (200k+ tokens), a conversation reply takes 27.5s in vLLM, while it takes only 2s in TGI. How so? We keep the initial conversation around, so when a new reply comes in we can answer almost instantly. The overhead of the lookup is ~5us. Thanks to @Daniël de Kok for the beast of a data structure.
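The caching idea can be illustrated with a tiny token-prefix lookup. This is only a conceptual sketch of prefix reuse, not TGI's actual data structure, which is far more compact and achieves microsecond-level lookups.

class PrefixCache:
    """Toy prefix cache: map token-id prefixes to opaque KV-cache handles."""
    def __init__(self):
        self._store = {}

    def insert(self, tokens, kv_handle):
        self._store[tuple(tokens)] = kv_handle

    def longest_prefix(self, tokens):
        """Return (matched_length, kv_handle) for the longest cached prefix."""
        for n in range(len(tokens), 0, -1):
            handle = self._store.get(tuple(tokens[:n]))
            if handle is not None:
                return n, handle
        return 0, None

cache = PrefixCache()
cache.insert([1, 2, 3, 4], "kv-for-turn-1")
# Only the 2 new tokens need prefilling; the first 4 reuse the cached state.
print(cache.longest_prefix([1, 2, 3, 4, 5, 6]))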

Zero config
That’s it. Remove all the flags you are using and you’re likely to get the best performance. By evaluating the hardware and model, TGI carefully selects automatic values that give the best performance. In production, we no longer set any flags in our deployments. We kept all existing flags around; they may come in handy in niche scenarios.

Benchmarks
Methodology
To ensure accurate and reliable results, we employed a robust benchmarking protocol that addresses common pitfalls in performance evaluation. Specifically:

Consistent Code: We used the same codebase to run against different engines, ensuring that any performance differences are attributable to the serving engine itself rather than to variations in the testing framework.
Request-Based Measurement: Instead of measuring Requests Per Second (RPS) by sending as many requests as possible, we opted for a more consistent approach: sending a fixed number of requests and measuring the time it takes the server to complete all of them. This method avoids boundary effects and provides a more accurate representation of performance (a minimal harness sketch follows this list).
Realistic Combinations: We selected realistic combinations of LLMs and hardware configurations, so we used 8xH100 for a 70B model, not an 8B one, which would be a waste of money.
Realistic Scenarios: We benchmarked the engines with prefix caching on, so we report the results of the 2nd run, not the first. During the first run of a benchmark every request is new, so prefix caching does not kick in, masking the real-world benefit of using it.
Note: A boundary effect is when benchmarks become flaky because their results depend on fine timing details of the engine being benchmarked. For instance, imagine a system ingesting a constant 10 RPS that receives a single final request 0.1s before the end of the benchmark, and that request takes a full 10s to process. A 30s benchmark would then measure 7.5 RPS instead of the expected 10, because that single query is not parallelized with the others. A very slightly slower engine would receive that request 0.1s after the cutoff instead; it would be discarded by the benchmark, which would therefore rate the slower system as faster.
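A minimal sketch of that request-based measurement is shown below: send a fixed number of requests and time how long the server takes to complete all of them. The endpoint, payload, and concurrency are illustrative assumptions; the actual benchmarks in this post use the k6 scripts linked below.

import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v1/completions"  # assumed OpenAI-compatible endpoint
PAYLOAD = {"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
           "prompt": "Hello", "max_tokens": 64}

def one_request(_):
    return requests.post(URL, json=PAYLOAD, timeout=600).status_code

n_requests, concurrency = 200, 32
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=concurrency) as pool:
    statuses = list(pool.map(one_request, range(n_requests)))
elapsed = time.perf_counter() - start
print(f"{n_requests} requests completed in {elapsed:.1f}s "
      f"({sum(s == 200 for s in statuses)} succeeded)")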

For more details on benchmarking in general we recommend the documentation of k6: https://grafana.com/docs/k6/latest/.

Scenarios
We selected a handful of scenarios to simplify the picture, they seem to accurately reflect a larger trend.

Small scenario: This scenario consists of the first 200 requests from the orca dataset being prompted to the model. The 200 requests total 8k tokens together and are representative of conversation starters. Prefix caching has very limited impact in this scenario, and we feel it is a relatively balanced benchmark for simple use cases.

Long scenario: This scenario consists of 20 requests totalling 200k prompt tokens, essentially asking for summaries of large chunks of text. In practical terms this is useful when you repeatedly feed large chunks of code, business data, or documents and ask simple questions about them (summarization, classification, or finding specific data). This scenario is the closest to what many professional use cases seem to do, by including a lot of information in the prompt itself. These very long conversations benefit the most from our recent changes, since they enable ever larger prompts and ever faster caching.

Hardware
L4: This is a single L4 (24GB), representing small or even home compute capabilities. We tested meta-llama/Meta-Llama-3.1-8B-Instruct on it.
4xL4: This is a beefier deployment, usually used either for high-volume deployments of 8B models (the ones under test) or to comfortably handle 30GB-class models. For this benchmark we tested meta-llama/Meta-Llama-3.1-8B-Instruct.
8xH100: This is one of the beefiest deployments possible. We tested meta-llama/Meta-Llama-3.1-70B-Instruct, as it is the most representative model of this size. Llama 3.3 was not released at the time of benchmarking (it is the exact same model, so it makes no difference).
Replicating the results
The commands to run the benchmarks are as follows:

Prepare the datasets:
cd text-generation-inference/load_tests
make prepare_orca
python long.py
Launch the engine:
TGI: text-generation-launcher --model-id $MODEL_ID --num-shard $N --port 8000 (or the Docker variant)
vLLM: vllm serve $MODEL_ID --tensor-parallel-size $N --enable-prefix-caching (or the Docker variant)

Start scenario:
Small: MODEL_ID=$MODEL_ID HOST=localhost:8000 k6 run load_tests/common.js
Long: MODEL_ID=$MODEL_ID HOST=localhost:8000 k6 run load_tests/long.js
Results
Figure: TGI v3 vs vLLM benchmark results.

Our benchmarking results show significant performance gains, with a 13x speedup over vLLM with prefix caching, and up to 30x speedup without prefix caching. These results are consistent with our production data and demonstrate the effectiveness of our optimized LLM architecture.

Raw results

2nd run (prefix cache warm)              | TGI v3 (s) | vLLM (s) | Requests
Llama 3.1 8B  - Small test  - L4         | 17.5       | 19.9     | 200
Llama 3.1 8B  - Long test*  - L4         | 53         | 57       | 10
Llama 3.1 8B  - Small test  - 4xL4       | 4.8        | 6        | 200
Llama 3.1 8B  - Long test   - 4xL4       | 3.2        | 12.5     | 20
Llama 3.1 70B - Small test  - 8xH100     | 6.2        | 7.4      | 200
Llama 3.1 70B - Long test   - 8xH100     | 2          | 27.5     | 20

1st run (cold cache)                     | TGI (s)    | vLLM (s) | Requests
Llama 3.1 8B  - Small test  - L4         | 19.9       | 19.9     | 200
Llama 3.1 8B  - Long test (10) - L4      | 49.8       | 55       | 10
Llama 3.1 8B  - Small test  - 4xL4       | 13         | 12.6     | 200
Llama 3.1 8B  - Long test   - 4xL4       | 47         | 50.3     | 20
Llama 3.1 70B - Small test  - 8xH100     | 7.5        | 7.6      | 200
Llama 3.1 70B - Long test   - 8xH100     | 12.1       | 28.3     | 20
Caveats and Limitations
While our results are promising, there are some caveats to consider:

Constrained kv-cache: If a deployment lacks kv-cache space, many queries end up competing for the same kv-cache slots, leading to contention. You can limit that effect by lowering --max-total-tokens to reduce each query's footprint, or use more (or larger) GPUs to increase the size of the kv-cache.
Replication: In scenarios where multiple replicas sit behind a single endpoint, there is no guarantee that every query from a particular user hits the same replica, so the cache may not be present and there is no speed benefit. You can use sticky-session load balancing to force every user to send their requests to the same replica. Do not apply this blindly; it may not be necessary at all.
Technical Insights
Our performance gains can be attributed to several key factors:

New Kernels: Our custom kernels, including flashinfer and flashdecoding, offer improved performance at large prompt lengths and enable more efficient scheduling.
Prefix Caching: Our optimized prefix caching structure allows for fast query matching, even for long prompts. The overhead is roughly 6us.
Chunking Code: Our chunking code enables finer control over compute resources, ensuring optimal performance and reduced VRAM usage.
Kernel Optimizations: We’ve implemented various other kernel optimizations, including better kernel selection. Notably, we’ve implemented several small kernels involved in query bookkeeping that are particularly efficient on small models. Every kernel launch carries a fixed overhead (on the order of microseconds), so fusing them together improves performance substantially whenever this bookkeeping is large relative to the raw model computation. This typically happens when the available compute is oversized for a given model, and especially with small models.
VRAM efficiency: In the realm of very large requests (100k+ tokens), many components start becoming large memory consumers. We’ve hunted down the biggest ones and found ways to reduce, reuse, or remove them. The biggest culprit is probably the logits calculation: the prompt logits for llama 3.1-8B take 25.6GB (= 100k tokens × 128k vocabulary × 2 bytes for fp16), which is more than the full fp16 model at 16GB. Since we generally do not need the logits for every prompt token, we removed them and no longer expose them to users by default; we think this is acceptable since they are mostly used by researchers. You can re-enable them for your deployments with the --enable-prefill-logprobs flag, but you will then see a reduced maximum prompt size.
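The 25.6GB figure follows directly from the tensor shape involved; a quick check, assuming a 128k-entry vocabulary and 2-byte fp16 logits:

prompt_tokens = 100_000      # 100k-token prompt
vocab_size = 128_000         # llama 3.1 vocabulary, rounded
bytes_per_value = 2          # fp16
logits_bytes = prompt_tokens * vocab_size * bytes_per_value
print(logits_bytes / 1e9)    # -> 25.6 GB, more than the ~16 GB of fp16 8B weights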
Future Directions
While we’ve made significant progress, there are still opportunities for improvement:

Special models: All LLMs benefit from the aforementioned improvements, but some specific feature combinations might not yet (certain quantization schemes, speculation, or VLMs, for instance, are harder to optimize to the same level of detail).
KV-Cache Long-Term Retention: Addressing long-term retention of the KV-cache is a challenge. Several solutions are envisioned, such as shared KV-cache stores (similar to Redis or memcached) or innovative storage approaches. This is an area of ongoing research for us.
Multimodal models: We are also investigating many other kinds of models, such as audio-to-audio, image/video generation, and other hybrids, where we see strong potential in applying the same principles we’ve applied in TGI to maximize performance.
By sharing our benchmarking methodology, results, and technical insights, we aim to contribute to the ongoing development of more efficient and effective LLMs.

###
https://pytorch.org/blog/vllm-joins-pytorch/
Pytorch
December 09, 2024

vLLM Joins PyTorch Ecosystem: Easy, Fast, and Cheap LLM Serving for Everyone

by vLLM Team


We’re thrilled to announce that the vLLM project has become a PyTorch ecosystem project, and joined the PyTorch ecosystem family!

For more information on what it means to be a PyTorch ecosystem project, see the PyTorch Ecosystem Tools page.

Running large language models (LLMs) is both resource-intensive and complex, especially as these models scale to hundreds of billions of parameters. That’s where vLLM comes in — a high-throughput, memory-efficient inference and serving engine designed for LLMs.

Originally built around the innovative PagedAttention algorithm, vLLM has grown into a comprehensive, state-of-the-art inference engine. A thriving community is also continuously adding new features and optimizations to vLLM, including pipeline parallelism, chunked prefill, speculative decoding, and disaggregated serving.

Since its release, vLLM has garnered significant attention, achieving over 31,000 GitHub stars—a testament to its popularity and thriving community. This milestone marks an exciting chapter for vLLM as we continue to empower developers and researchers with cutting-edge tools for efficient and scalable AI deployment. Welcome to the next era of LLM inference!

vLLM has always had a strong connection with the PyTorch project. It is deeply integrated into PyTorch, leveraging it as a unified interface to support a wide array of hardware backends. These include NVIDIA GPUs, AMD GPUs, Google Cloud TPUs, Intel GPUs, Intel CPUs, Intel Gaudi HPUs, and AWS Neuron, among others. This tight coupling with PyTorch ensures seamless compatibility and performance optimization across diverse hardware platforms.

Do you know you can experience the power of vLLM right from your phone? During this year’s Amazon Prime Day, vLLM played a crucial role in delivering lightning-fast responses to millions of users. Across three regions, over 80,000 Trainium and Inferentia chips powered an average of 3 million tokens per minute, all while maintaining a P99 latency of less than 1 second for the first response. That means when customers opened the Amazon app and chatted with Rufus, they were seamlessly interacting with vLLM in action!

vLLM also collaborates tightly with leading model vendors to ensure support for popular models. This includes tight integration with Meta LLAMA, Mistral, QWen, and DeepSeek models, plus many others. One particularly memorable milestone was the release of LLAMA 3.1 (405B). As the launching partner, vLLM was the first to enable running this very large model, showcasing vLLM’s capability to handle the most complex and resource-intensive language models.

To install vLLM, simply run:

pip install vllm
vLLM is designed for both researchers and production-grade serving.

To run vLLM as an OpenAI API compatible server, just use the Huggingface model ID:

vllm serve meta-llama/Llama-3.1-8B
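Once the server is up, it can be queried with any OpenAI-compatible client. A minimal sketch follows; the base URL assumes the default port 8000, and the API key is a placeholder since the server does not require one by default.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder key
completion = client.completions.create(
    model="meta-llama/Llama-3.1-8B",
    prompt="San Francisco is a",
    max_tokens=32,
)
print(completion.choices[0].text)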
To run vLLM as a simple function:

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="meta-llama/Llama-3.1-8B")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Open-source innovation is part of vLLM’s DNA. Born out of a Berkeley academic project, it follows the legacy of other pioneering open-source initiatives such as BSD, which revolutionized operating systems in the 1980s. Other innovations from the same organization include Apache Spark and Ray, now standards for big data and AI systems. In the Gen AI era, vLLM serves as a platform dedicated to democratizing AI inference.

The vLLM team remains steadfast in its mission to keep the project “of the community, by the community, and for the community.” Collaboration and inclusivity lie at the heart of everything we do.

If you have collaboration requests or inquiries, feel free to reach out at vllm-questions@lists.berkeley.edu. To join the active and growing vLLM community, explore our GitHub repository or connect with us on the vLLM Slack. Together, we can push the boundaries of AI innovation and make it accessible to all.
