➡️ Microsoft released Phi-4, a new 14B-parameter model that is especially strong on large-scale STEM QA, under the MIT license.

➡️ Google unveiled TimesFM-2.0, a 500M-parameter time-series forecasting model.

➡️ NVIDIA released the Cosmos platform for accelerating physical AI development, along with the 'VILA' family of vision-language models.

➡️ The 'moondream' project introduced ultra-lightweight 2B and 0.5B-parameter VLMs.

➡️ The 1.58-bit FLUX study presented a technique that uses 1.58-bit quantization to sharply shrink a text-to-image model and improve inference efficiency.

➡️ In software engineering, the simple Agentless approach was reported to outperform complex LLM agents, and the search-augmented framework 'Search-o1' likewise proposed a way to compensate for the knowledge limits of LLMs.

➡️ KaLM-Embedding, a high-quality multilingual embedding model, achieved strong results through careful data filtering and a change of pre-trained backbone architecture.

➡️ The ProTracker study presented a highly efficient method for tracking points in video.

➡️ The 'Long Context vs. RAG' paper carefully compared the trade-offs between very long context windows and Retrieval-Augmented Generation.

➡️ Chip Huyen published an in-depth article on what agents are, how to build them, their failure modes, and how to evaluate them, offering guidelines for designing LLM-based agents.


Microsoft, Phi-4

Link, 1/9/25

  • Released a 14B-parameter large language model with particular strength in STEM
  • Outperforms GPT-4o on STEM-focused QA, with strong reasoning, math, and code generation
  • Trained on 9.8T tokens of high-quality data, including large-scale synthetic data produced by multi-agent and self-revision workflows
  • Trained for 21 days on 1,920 H100-80G GPUs
  • 16K-token context; safety alignment via SFT and DPO (see the DPO sketch below)
  • Optimized primarily for English, with strong results on reasoning-focused benchmarks
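
To make the SFT·DPO bullet concrete, below is a minimal, generic sketch of the DPO objective in PyTorch. It illustrates the technique named in the report, not Microsoft's training code; the log-probabilities are hypothetical per-response sums from a policy and a frozen reference model.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Direct Preference Optimization: increase the policy's preference for the
    # "chosen" response over the "rejected" one, measured relative to a frozen
    # reference model and scaled by beta.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# toy batch of 4 preference pairs with made-up log-probabilities
loss = dpo_loss(torch.tensor([-3.0, -2.5, -4.0, -3.2]),
                torch.tensor([-3.5, -2.4, -5.0, -4.0]),
                torch.tensor([-3.1, -2.6, -4.1, -3.3]),
                torch.tensor([-3.4, -2.5, -4.9, -3.9]))
print(loss.item())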

Google, TimesFM-2.0

Link

  • Released a 500M-parameter Time Series Foundation Model (TimesFM-2.0)
  • 4x longer maximum context (2048 time points) improves forecasting accuracy
  • The new version (v2.0) is up to 25% more accurate than v1.0 on leading benchmarks
  • Tops the GIFT-Eval leaderboard on both MASE (point forecasts) and CRPS (probabilistic forecasts); the MASE metric is sketched below
  • Supports fine-tuning and zero-shot covariates (external regressors) for flexible use
  • Developed by Google Research; the underlying paper ("A decoder-only foundation model for time-series forecasting") was presented at ICML 2024
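
For reference, the MASE number reported on GIFT-Eval scales a model's forecast error by the error of a naive seasonal forecast on the training history. Below is a generic NumPy implementation of the metric, not Google's evaluation code.

import numpy as np

def mase(y_true, y_pred, y_train, season=1):
    # Mean Absolute Scaled Error: forecast MAE divided by the MAE of a
    # naive forecast that repeats the value from `season` steps earlier.
    naive_mae = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return np.mean(np.abs(y_true - y_pred)) / naive_mae

history = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 14.0])
actuals = np.array([13.0, 15.0])
forecast = np.array([12.5, 14.0])
print(mase(actuals, forecast, history))   # < 1.0 means better than the naive baseline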

NVIDIA, Cosmos

Link, announced January 6, 2025

  • Released the 'Cosmos' platform for accelerating Physical AI development
  • Provides diffusion- and autoregressive-based world foundation models for Text2World and Video2World generation from text and video prompts
  • Trained on real-world AV and robotics video, enabling physics-aware prediction and generation of virtual environments
  • Released under the NVIDIA Open Model License, which allows free commercial use by companies and researchers
  • Post-training and fine-tuning supported through the NVIDIA NeMo framework
  • Already in use across the AV and robotics industry (1X, Agility Robotics, XPENG, Uber, Waabi, and others)

moondream project

Link, 1/9/25

  • Released "Moondream," an ultra-lightweight vision-language model (VLM) in 2B and 0.5B-parameter versions
  • Supports a range of vision tasks including image captioning, VQA, object detection, and pointing
  • Runs in little memory using 8-bit (int8) and even 4-bit (int4) quantization
  • Designed to run on resource-constrained hardware such as edge and mobile devices
  • Open source under the Apache 2.0 license, installable as a PyPI package

1.58-bit FLUX

Link, arXiv:2412.18653

  • Quantizes the text-to-image model FLUX.1-dev to roughly 1.58 bits per weight (values in {-1, 0, +1}); a generic ternary-quantization sketch follows below
  • Keeps 99.5% of the 11.9B parameters in 1.58-bit form while maintaining high image-generation quality
  • 7.7x smaller model storage, 5.1x lower GPU inference memory, and improved inference latency via a custom kernel
  • Shows the quantization can be done without any image data, relying solely on self-supervision from FLUX.1-dev
  • A new low-bit quantization technique that preserves generation quality while greatly improving efficiency
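
For intuition, 1.58 bits is roughly log2(3), the information needed to store one of the three values {-1, 0, +1}. The sketch below applies a simple absmean ternary quantizer to a weight tensor; it is a generic illustration of 1.58-bit weights, not the paper's calibration method or custom kernel.

import torch

def ternary_quantize(w, eps=1e-8):
    # Scale by the mean absolute value, then round each weight to the nearest
    # value in {-1, 0, +1}. Dequantize with w_hat = scale * w_ternary.
    scale = w.abs().mean().clamp(min=eps)
    w_ternary = (w / scale).round().clamp_(-1, 1)
    return w_ternary, scale

w = torch.randn(4, 4)
q, s = ternary_quantize(w)
print(q)                                 # entries are all -1, 0, or +1
print((s * q - w).abs().mean().item())   # mean absolute quantization error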

Xia et al., "Agentless: Demystifying LLM-based Software Engineering Agents"

Link, published July 2, 2024

  • Proposes solving software development tasks on the SWE-bench Lite benchmark without an autonomous agent
  • The "Agentless" approach is simpler than complex, tool-using LLM agents yet better on both cost and performance
  • A plain two-phase localization-then-repair pipeline (sketched below) beats complex agents on success rate and cost
  • Achieves the highest solve rate (27.33%) and the lowest cost ($0.34) among existing open-source software agents
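
The localization-then-repair idea can be summarized as a fixed pipeline rather than an autonomous loop. The sketch below is only a schematic with stub functions standing in for the paper's LLM prompts and test harness; it is not the authors' released code.

# Hypothetical stand-ins for the paper's LLM prompting and validation steps.
def locate_suspicious_files(issue_text, repo_files):
    return repo_files[:2]                      # an LLM would rank files here

def locate_edit_positions(issue_text, files):
    return [(f, 1) for f in files]             # (file, line) locations to edit

def generate_patch(issue_text, locations):
    return {"locations": locations, "diff": "..."}

def patch_passes_tests(patch):
    return True                                # run the repo's test suite in practice

def agentless_pipeline(issue_text, repo_files, n_samples=4):
    # Phase 1: localization - narrow the repository to files, then to edit positions.
    files = locate_suspicious_files(issue_text, repo_files)
    locations = locate_edit_positions(issue_text, files)
    # Phase 2: repair - sample candidate patches and keep one that passes the tests.
    candidates = [generate_patch(issue_text, locations) for _ in range(n_samples)]
    valid = [p for p in candidates if patch_passes_tests(p)]
    return valid[0] if valid else None

print(agentless_pipeline("IndexError in utils.split()", ["utils.py", "cli.py", "io.py"]))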

NVIDIA, VILA

Link

  • Announced "VILA," a family of vision-language models (VLMs) designed to balance efficiency and accuracy
  • Released as part of the "Cosmos Nemotron VLMs," improving efficiency for video and multi-image understanding
  • A "scale-then-compress" architectural change (illustrated below) keeps high-resolution images and long videos efficient to process
  • Reduces training cost by 4.5x, fine-tuning memory by 3.4x, pre-filling latency by 1.6-2.2x, and decoding latency by 1.2-2.8x, while remaining competitive with leading open and proprietary VLMs
  • Top-tier results across a wide range of image and video benchmarks
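
Only the "compress" half of scale-then-compress is easy to show generically: after encoding at high resolution, the grid of visual tokens is pooled down before entering the LLM. The snippet below is an illustrative 2x2 average pooling of a token grid, not NVIDIA's implementation.

import torch
import torch.nn.functional as F

def compress_visual_tokens(tokens, grid, factor=2):
    # tokens: (batch, grid*grid, dim) patch embeddings in row-major order.
    # Average-pool factor x factor neighborhoods, cutting the token count by factor^2.
    b, n, d = tokens.shape
    x = tokens.transpose(1, 2).reshape(b, d, grid, grid)
    x = F.avg_pool2d(x, kernel_size=factor)
    return x.flatten(2).transpose(1, 2)

tokens = torch.randn(1, 32 * 32, 1024)                 # e.g. a 32x32 patch grid
print(compress_visual_tokens(tokens, grid=32).shape)   # -> torch.Size([1, 256, 1024])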

HIT-TMG, KaLM-Embedding

Link, published January 2, 2025

  • An open (MIT-licensed) multilingual embedding model built on Qwen2-0.5B
  • Averages 64.53 on the MTEB benchmark (64.13 C-MTEB, 64.94 MTEB)
  • Uses ranking consistency filtering to secure high-quality, diverse training data
  • Supports flexible embedding dimensions via Matryoshka Representation Learning
  • Strong multilingual performance despite <1B parameters; usable directly through Sentence-Transformers (example below)
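
Because the checkpoints are integrated with Sentence-Transformers and trained with Matryoshka Representation Learning, usage along the lines below should work. The repository id is an assumption (check the HIT-TMG collection for exact checkpoint names), and truncate_dim assumes a recent sentence-transformers release.

from sentence_transformers import SentenceTransformer

# Repo id is illustrative; see the HIT-TMG collection on Hugging Face for exact names.
model = SentenceTransformer("HIT-TMG/KaLM-embedding-multilingual-mini-v1",
                            truncate_dim=256)   # Matryoshka: keep only the first 256 dims

sentences = ["A multilingual embedding model.", "Un modelo de embeddings multilingüe."]
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)   # (2, 256)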

Li et al., "Search-o1: Agentic Search-Enhanced Large Reasoning Models"

Link, submitted January 9, 2025

  • Introduces "Search-o1," a framework that adds agentic search to OpenAI-o1-style long stepwise reasoning (large reasoning models)
  • Dynamically retrieves external information at uncertain knowledge points and minimizes noise with a Reason-in-Documents module
  • Strong results on complex reasoning tasks in math, science, and coding, plus six open-domain QA benchmarks
  • Compensates for the LLM's knowledge gaps, improving the trustworthiness and accuracy of its reasoning
  • An agentic RAG mechanism retrieves external documents, refines them, and feeds them back into the reasoning chain (schematic loop below)
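
The control flow described above can be sketched as the loop below. All functions are stubs standing in for the LRM, the retriever, and the Reason-in-Documents step; this is not the released Search-o1 code.

def generate_step(question, context):
    # Stand-in for one LRM reasoning step; a real model would emit a search
    # token such as "[SEARCH] ..." when it hits an uncertain knowledge point.
    return "Final answer: 42"

def web_search(query):
    return ["...retrieved document text..."]    # stand-in for a real retriever

def reason_in_documents(question, query, docs):
    # Condense verbose retrieved documents into just the facts needed to continue.
    return f"(refined knowledge for '{query}')"

def search_o1(question, max_steps=8):
    context = []
    for _ in range(max_steps):
        step = generate_step(question, context)
        if "[SEARCH]" in step:                  # uncertain knowledge point detected
            query = step.split("[SEARCH]", 1)[1].strip()
            context.append(reason_in_documents(question, query, web_search(query)))
            continue
        context.append(step)
        if step.startswith("Final answer"):
            break
    return context[-1]

print(search_o1("What is the answer?"))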

Long Context vs. RAG for LLMs: An Evaluation and Revisits

Link, January 2025 (arXiv submission)

  • A comparative study of very long context windows (Long Context, LC) versus Retrieval-Augmented Generation (RAG) for incorporating external information
  • LC generally outperforms RAG on Wikipedia-based QA, while RAG has the advantage on dialogue-based and general queries
  • Finds that summarization-based retrieval performs comparably to LC, while chunk-based retrieval lags behind
  • Suggests that mixing LC and RAG, or choosing between them per task (toy router sketched below), matters in practice
  • Points out that 'context relevance,' overlooked in prior studies, has a large effect on real performance
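
The practical takeaway is that neither strategy dominates, so a deployment often routes between them. The toy router below illustrates one such heuristic (feed the whole document if the window allows, otherwise retrieve); the heuristic and helper callables are illustrative, not the paper's setup.

def build_prompt(question, document, count_tokens, retrieve, context_budget=128_000):
    # Long context: if the whole source fits the model window, just prepend it
    # (the paper finds this usually wins on self-contained sources like Wikipedia).
    if count_tokens(document) <= context_budget:
        return f"{document}\n\nQuestion: {question}"
    # Otherwise fall back to retrieval over chunks or summaries (RAG).
    passages = retrieve(question, document)
    return "\n\n".join(passages) + f"\n\nQuestion: {question}"

# toy usage with stand-in token counting and retrieval functions
print(build_prompt("Who wrote it?", "A short document about its author.",
                   count_tokens=lambda text: len(text.split()),
                   retrieve=lambda q, d: [d[:200]]))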

Agents (by Chip Huyen)

Link, January 7, 2025

  • Systematically covers what agents are, tool use, planning, failure modes, and evaluation
  • Recommends decoupling planning from execution (a minimal sketch follows below) and distributing complexity through multi-agent designs
  • Stresses choosing an agent's tools carefully: more tools add capability but also confusion
  • Notes that reflection (self-evaluation) can let an agent correct its own errors and improve performance
  • Conceptually aligned with Anthropic's "Building effective agents," but focuses more on execution flow and concrete failure modes
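
A minimal sketch of the decoupled plan-validate-execute pattern the post recommends, using a hypothetical two-tool inventory and a stubbed planner:

# Hypothetical tool inventory; in practice each tool wraps an API or a command.
TOOLS = {
    "search_web": lambda query: f"(search results for: {query})",
    "calculate": lambda expr: eval(expr, {"__builtins__": {}}),
}

def generate_plan(task):
    # Stand-in for an LLM planner that emits (tool, argument) steps.
    return [("search_web", task), ("calculate", "2 + 2")]

def validate_plan(plan):
    # Cheap check before any execution: every step must name a known tool.
    return all(tool in TOOLS for tool, _ in plan)

def execute_plan(plan):
    return [TOOLS[tool](arg) for tool, arg in plan]

plan = generate_plan("population of Seoul")
if validate_plan(plan):
    print(execute_plan(plan))
else:
    print("invalid plan; ask the planner to retry or escalate to a human")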

ProTracker: Probabilistic Integration for Robust and Accurate Point Tracking

Link, early January 2025 (arXiv submission)

  • Proposes "ProTracker," a new framework for long-term tracking of arbitrary points in video
  • Probabilistically integrates optical-flow and semantic-feature predictions (see the fusion sketch below) for better accuracy and robustness
  • Tracks points continuously without drift even under occlusion or in videos with many similar regions
  • State-of-the-art results among unsupervised and self-supervised approaches on benchmarks such as TAP-Vid-DAVIS
  • Geometry-aware feature filtering and long-term keypoint relocalization suppress noise and maximize stability
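
ProTracker's probabilistic integration fuses several noisy position estimates for the same point; for independent Gaussian estimates, maximizing the joint likelihood reduces to an inverse-variance weighted mean, shown generically below (not the authors' code).

import numpy as np

def fuse_gaussian_estimates(positions, variances):
    # positions: (k, 2) candidate (x, y) predictions for one tracked point,
    # variances: (k,) their uncertainties. The maximum-likelihood fusion of
    # independent Gaussian estimates is the inverse-variance weighted mean.
    positions = np.asarray(positions, dtype=float)
    precisions = 1.0 / np.asarray(variances, dtype=float)
    weights = precisions / precisions.sum()
    fused = (weights[:, None] * positions).sum(axis=0)
    fused_variance = 1.0 / precisions.sum()
    return fused, fused_variance

preds = [(100.0, 52.0), (103.0, 50.0), (101.0, 51.0)]   # e.g. flow and feature predictions
uncertainty = [1.0, 4.0, 2.0]
print(fuse_gaussian_estimates(preds, uncertainty))
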
Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each content with detailed points, and write a report. The report format is:

(today’s date in 년 월 일) AI 소식,

Summary

(overall short summary, make summary with good details. for Summary section, explain the details starting with company name, e.g. OpenAI에서는 ~~~를 발표하였습니다.)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)
###
https://huggingface.co/microsoft/phi-4
1/9/25
It's here! Phi-4 on Hugging Face with MIT License! Microsoft Phi-4 is a 14B LLM that outperforms OpenAI GPT-4o on STEM-focused QA. Built using large-scale, high-quality synthetic data created by multi-agent, self-revision workflows. 👀
TL;DR:
🧠 14B parameters, but performs on par with 70B models
📚 Trained on 9.8T tokens of high-quality data
⚡ 16K token context length
🎯 Outperforms previous version (Phi-3) across all benchmarks
🔬 21 days of training on 1920 H100-80G GPUs
🛡️ Comprehensive safety alignment using SFT and DPO
🌐 Optimized for English language tasks
🎓 Particularly strong in reasoning, math, and code generation

Phi-4 Technical Report
Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu, Cyril Zhang, Yi Zhang
Microsoft Research
Abstract
We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality. Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process. While previous models in the Phi family largely distill the capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities, giving evidence that our data-generation and post-training techniques go beyond distillation. Despite minimal changes to the phi-3 architecture, phi-4 achieves strong performance relative to its size, especially on reasoning-focused benchmarks, due to improved data, training curriculum, and innovations in the post-training scheme.

###
https://huggingface.co/google/timesfm-2.0-500m-jax
Google
We just released the weights of TimesFM-2.0 (jax: https://goo.gle/4agXX3w, and pytorch: https://goo.gle/3WdOjZy) on Hugging Face. This checkpoint has 500M parameters and can be better than v1.0 by up to 25% on leading benchmarks. It also has a 4x longer maximum context length.
TimesFM-2.0 takes the top spot on the GIFT-Eval (https://goo.gle/4aeiA0f) leaderboard in terms of point forecasting accuracy measured by MASE as well as probabilistic forecasting accuracy measured by CRPS. It is better than the next best model by about 6% in terms of aggregated MASE.
Instructions for using this model are in our repository (https://goo.gle/4h6ocM7). It should work with our prior examples of forecasting with covariates and fine tuning.

TimesFM
TimesFM (Time Series Foundation Model) is a pretrained time-series foundation model developed by Google Research for time-series forecasting.

Resources and Technical Documentation:

Paper: A decoder-only foundation model for time-series forecasting, ICML 2024.
Google Research blog
GitHub repo
Authors: Google Research

This is not an officially supported Google product.

Checkpoint timesfm-2.0-500m
timesfm-2.0-500m is the second open model checkpoint:

It performs univariate time series forecasting for context lengths up to 2048 time points and any horizon lengths, with an optional frequency indicator. Note that it can go even beyond 2048 context even though it was trained with that as the maximum context.
It focuses on point forecasts. We experimentally offer 10 quantile heads but they have not been calibrated after pretraining.
It ideally requires the context to be contiguous (i.e. no "holes"), and the context and the horizon to be of the same frequency. In case there are nans we fill in the missing values with linear interpolation before calling the model.

Paper: A decoder-only foundation model for time-series forecasting, to appear in ICML 2024.
Google Research blog
Hugging Face release
This repo contains the code to load public TimesFM checkpoints and run model inference. Please visit our Hugging Face release to download model checkpoints.

This is not an officially supported Google product.

We recommend at least 32GB RAM to load TimesFM dependencies.

Update - Dec. 30, 2024
We are launching a 500m checkpoint as part of the TimesFM-2.0 release. This new checkpoint can be up to 25% better than v1.0 on leading benchmarks and also has a 4 times longer max. context length.
Launched finetuning support that lets you finetune the weights of the pretrained TimesFM model on your own data.
Launched ~zero-shot covariate support with external regressors. More details here.
Checkpoint timesfm-1.0-200m (-pytorch)
timesfm-1.0-200m is our first open model checkpoint:

It performs univariate time series forecasting for context lengths up to 512 timepoints and any horizon lengths, with an optional frequency indicator.
It focuses on point forecasts, and does not support probabilistic forecasts. We experimentally offer quantile heads but they have not been calibrated after pretraining.
Checkpoint timesfm-2.0-500m (-jax/-pytorch)
timesfm-2.0-500m is our second open model checkpoint:

It performs univariate time series forecasting for context lengths up to 2048 timepoints and any horizon lengths, with an optional frequency indicator.
It focuses on point forecasts. We experimentally offer 10 quantile heads but they have not been calibrated after pretraining.
This new checkpoint can be up to 25% better than v1.0 on leading benchmarks and also has a 4 times longer max. context length.
Benchmarking
TimesFM 2.0 has been added to GIFT-Eval, which is one of the most comprehensive time-series benchmarks available. It takes the top spot in terms of aggregated MASE and CRPS, where it is 6% better than the next best model in terms of aggregated MASE.

###
https://github.com/NVIDIA/Cosmos
NVIDIA Cosmos is a developer-first world foundation model platform designed to help Physical AI developers build their Physical AI systems better and faster. Cosmos contains:

  • pre-trained models, available via Hugging Face under the NVIDIA Open Model License that allows commercial use of the models for free
  • training scripts under the Apache 2 License, offered through the NVIDIA NeMo Framework for post-training the models for various downstream Physical AI applications

Details of the platform are described in the Cosmos paper. Preview access is available at build.nvidia.com.

Key Features
Pre-trained Diffusion-based world foundation models for Text2World and Video2World generation where a user can generate visual simulation based on text prompts and video prompts.
Pre-trained Autoregressive-based world foundation models for Video2World generation where a user can generate visual simulation based on video prompts and optional text prompts.
Video tokenizers for tokenizing videos into continuous tokens (latent vectors) and discrete tokens (integers) efficiently and effectively.
Video curation pipeline for building your own video dataset. [Coming soon]
Post-training scripts via NeMo Framework to post-train the pre-trained world foundation models for various Physical AI setup.
Pre-training scripts via NeMo Framework for building your own world foundation model. [Diffusion] [Autoregressive] [Tokenizer].
Model Family
Model name | Description | Try it out
Cosmos-1.0-Diffusion-7B-Text2World | Text to visual world generation | Inference
Cosmos-1.0-Diffusion-14B-Text2World | Text to visual world generation | Inference
Cosmos-1.0-Diffusion-7B-Video2World | Video + Text based future visual world generation | Inference
Cosmos-1.0-Diffusion-14B-Video2World | Video + Text based future visual world generation | Inference
Cosmos-1.0-Autoregressive-4B | Future visual world generation | Inference
Cosmos-1.0-Autoregressive-12B | Future visual world generation | Inference
Cosmos-1.0-Autoregressive-5B-Video2World | Video + Text based future visual world generation | Inference
Cosmos-1.0-Autoregressive-13B-Video2World | Video + Text based future visual world generation | Inference
Cosmos-1.0-Guardrail | Pre-Guard and post-Guard for safe use | Embedded in model inference scripts

NVIDIA Makes Cosmos World Foundation Models Openly Available to Physical AI Developer Community
State-of-the-art models trained on millions of hours of driving and robotics videos to democratize physical AI development, available under open model license.
January 6, 2025 by Ming-Yu Liu

Share

NVIDIA Cosmos, a platform for accelerating physical AI development, introduces a family of world foundation models — neural networks that can predict and generate physics-aware videos of the future state of a virtual environment — to help developers build next-generation robots and autonomous vehicles (AVs).

World foundation models, or WFMs, are as fundamental as large language models. They use input data, including text, image, video and movement, to generate and simulate virtual worlds in a way that accurately models the spatial relationships of objects in the scene and their physical interactions.

Announced today at CES, NVIDIA is making available the first wave of Cosmos WFMs for physics-based simulation and synthetic data generation — plus state-of-the-art tokenizers, guardrails, an accelerated data processing and curation pipeline, and a framework for model customization and optimization.

Researchers and developers, regardless of their company size, can freely use the Cosmos models under NVIDIA’s permissive open model license that allows commercial usage. Enterprises building AI agents can also use new open NVIDIA Llama Nemotron and Cosmos Nemotron models, unveiled at CES.

The openness of Cosmos’ state-of-the-art models unblocks physical AI developers building robotics and AV technology and enables enterprises of all sizes to more quickly bring their physical AI applications to market. Developers can use Cosmos models directly to generate physics-based synthetic data, or they can harness the NVIDIA NeMo framework to fine-tune the models with their own videos for specific physical AI setups.

Physical AI leaders — including robotics companies 1X, Agility Robotics and XPENG, and AV developers Uber and Waabi — are already working with Cosmos to accelerate and enhance model development.

Developers can preview the first Cosmos autoregressive and diffusion models on the NVIDIA API catalog, and download the family of models and fine-tuning framework from the NVIDIA NGC catalog and Hugging Face.



World Foundational Models for Physical AI
Cosmos world foundation models are a suite of open diffusion and autoregressive transformer models for physics-aware video generation. The models have been trained on 9,000 trillion tokens from 20 million hours of real-world human interactions, environment, industrial, robotics and driving data.

The models come in three categories: Nano, for models optimized for real-time, low-latency inference and edge deployment; Super, for highly performant baseline models; and Ultra, for maximum quality and fidelity, best used for distilling custom models.

When paired with NVIDIA Omniverse 3D outputs, the diffusion models generate controllable, high-quality synthetic video data to bootstrap training of robotic and AV perception models. The autoregressive models predict what should come next in a sequence of video frames based on input frames and text. This enables real-time next-token prediction, giving physical AI models the foresight to predict their next best action.

Developers can use Cosmos’ open models for text-to-world and video-to-world generation. Versions of the diffusion and autoregressive models, with between 4 and 14 billion parameters each, are available now on the NGC catalog and Hugging Face.

Also available are a 12-billion-parameter upsampling model for refining text prompts, a 7-billion-parameter video decoder optimized for augmented reality, and guardrail models to ensure responsible, safe use.

To demonstrate opportunities for customization, NVIDIA is also releasing fine-tuned model samples for vertical applications, such as generating multisensor views for AVs.

Advancing Robotics, Autonomous Vehicle Applications
Cosmos world foundation models can enable synthetic data generation to augment training datasets, simulation to test and debug physical AI models before they’re deployed in the real world, and reinforcement learning in virtual environments to accelerate AI agent learning.

Developers can generate massive amounts of controllable, physics-based synthetic data by conditioning Cosmos with composed 3D scenes from NVIDIA Omniverse.

Waabi, a company pioneering generative AI for the physical world, starting with autonomous vehicles, is evaluating the use of Cosmos for the search and curation of video data for AV software development and simulation. This will further accelerate the company’s industry-leading approach to safety, which is based on Waabi World, a generative AI simulator that can create any situation a vehicle might encounter with the same level of realism as if it happened in the real world.

In robotics, WFMs can generate synthetic virtual environments or worlds to provide a less expensive, more efficient and controlled space for robot learning. Embodied AI startup Hillbot is boosting its data pipeline by using Cosmos to generate terabytes of high-fidelity 3D environments. This AI-generated data will help the company refine its robotic training and operations, enabling faster, more efficient robotic skilling and improved performance for industrial and domestic tasks.

In both industries, developers can use NVIDIA Omniverse and Cosmos as a multiverse simulation engine, allowing a physical AI policy model to simulate every possible future path it could take to execute a particular task — which in turn helps the model select the best of these paths.

Data curation and the training of Cosmos models relied on thousands of NVIDIA GPUs through NVIDIA DGX Cloud, a high-performance, fully managed AI platform that provides accelerated computing clusters in every leading cloud.

Developers adopting Cosmos can use DGX Cloud for an easy way to deploy Cosmos models, with further support available through the NVIDIA AI Enterprise software platform.

Customize and Deploy With NVIDIA Cosmos
In addition to foundation models, the Cosmos platform includes a data processing and curation pipeline powered by NVIDIA NeMo Curator and optimized for NVIDIA data center GPUs.

Robotics and AV developers collect millions or billions of hours of real-world recorded video, resulting in petabytes of data. Cosmos enables developers to process 20 million hours of data in just 40 days on NVIDIA Hopper GPUs, or as little as 14 days on NVIDIA Blackwell GPUs. Using unoptimized pipelines running on a CPU system with equivalent power consumption, processing the same amount of data would take over three years.

The platform also features a suite of powerful video and image tokenizers that can convert videos into tokens at different video compression ratios for training various transformer models.

The Cosmos tokenizers deliver 8x more total compression than state-of-the-art methods and 12x faster processing speed, which offers superior quality and reduced computational costs in both training and inference. Developers can access these tokenizers, available under NVIDIA’s open model license, via Hugging Face and GitHub.

Developers using Cosmos can also harness model training and fine-tuning capabilities offered by NeMo framework, a GPU-accelerated framework that enables high-throughput AI training.

Developing Safe, Responsible AI Models
Now available to developers under the NVIDIA Open Model License Agreement, Cosmos was developed in line with NVIDIA’s trustworthy AI principles, which include nondiscrimination, privacy, safety, security and transparency.

The Cosmos platform includes Cosmos Guardrails, a dedicated suite of models that, among other capabilities, mitigates harmful text and image inputs during preprocessing and screens generated videos during postprocessing for safety. Developers can further enhance these guardrails for their custom applications.

Cosmos models on the NVIDIA API catalog also feature an inbuilt watermarking system that enables identification of AI-generated sequences.

NVIDIA Cosmos was developed by NVIDIA Research. Read the research paper, “Cosmos World Foundation Model Platform for Physical AI,” for more details on model development and benchmarks. Model cards providing additional information are available on Hugging Face.

Learn more about world foundation models in an AI Podcast episode that features Ming-Yu Liu, vice president of research at NVIDIA.

Get started with NVIDIA Cosmos and join NVIDIA at CES. Watch the Cosmos demo and Huang’s keynote below:

###
https://github.com/vikhyat/moondream
1/9/25
🌔 moondream
a tiny vision language model that kicks ass and runs anywhere

Website | Demo

Examples
Image Example
What is the girl doing?
The girl is sitting at a table and eating a large hamburger.

What color is the girl's hair?
The girl's hair is white.
What is this?
This is a computer server rack, which is a device used to store and manage multiple computer servers. The rack is filled with various computer servers, each with their own dedicated space and power supply. The servers are connected to the rack via multiple cables, indicating that they are part of a larger system. The rack is placed on a carpeted floor, and there is a couch nearby, suggesting that the setup is in a living or entertainment area.

What is behind the stand?
Behind the stand, there is a brick wall.
About
Moondream is a highly efficient open-source vision language model that combines powerful image understanding capabilities with a remarkably small footprint. It's designed to be versatile and accessible, capable of running on a wide range of devices and platforms.

The project offers two model variants:

Moondream 2B: The primary model with 2 billion parameters, offering robust performance for general-purpose image understanding tasks including captioning, visual question answering, and object detection.
Moondream 0.5B: A compact 500 million parameter model specifically optimized as a distillation target for edge devices, enabling efficient deployment on resource-constrained hardware while maintaining impressive capabilities.
Getting Started
Latest Model Checkpoints
These are the latest bleeding-edge versions of both models, with all new features and improvements:

Model | Precision | Download Size | Memory Usage | Best For | Download Link
Moondream 2B | int8 | 1,733 MiB | 2,624 MiB | General use, best quality | Download
Moondream 0.5B | int8 | 593 MiB | 996 MiB | Edge devices, faster speed | Download
Announcement: New Moondream 2B model update!

Moondream
The Open Source VLM That Runs Everywhere.

Over 6 million downloads!

Explore the lineup.


Moondream 2B: powerful and fast
Parameters: 1.9B | Quantization: fp16, int8, int4 | Memory: 2GiB | Training: Quantized Aware Training | Target Devices: Servers, PC, Mobile | Inference: GPU, CPU-Optimized | License: Apache 2.0

Moondream 0.5B: tiny and speedy
Parameters: 0.5B | Quantization: int8, int4 | Memory: 1GiB | Training: Quantized Aware Training | Target Devices: Mobile, Edge | Inference: GPU, CPU-Optimized | License: Apache 2.0

Discover the capabilities.
Query
Get human-like answers from any prompt.


List all the food shown in this image.

A halved avocado, cherry tomatoes, green onions, spinach, mushrooms, and a few peppers. There are also two eggs on the board.

Caption
Generate detailed descriptions of any scene.


The image depicts a clownfish, a type of sea anemone, swimming in a vibrant underwater scene. The clownfish has a distinctive red body and black stripes on its sides, with a white stripe running along its back. It is positioned near a cluster of purple and white anemones, which provide a striking contrast against the deep blue background. The clownfish's large eyes are visible, and it appears to be looking towards the right side of the image. The anemones have a textured appearance with many small bumps and protrusions.

Object Detection
Get bounding boxes from a prompt.


Detect: Drone

4 objects detected.

Point
Get X, Y locations for any items.


Point: Sign in with Apple button

1 point detected.

Get started in 5 minutes.
Our clients are optimized for CPU and GPU inference, and are a snap to learn.

pip install moondream

import moondream as md
from PIL import Image

# initialize with a downloaded model
model = md.vl(model="./moondream-2b-int8.mf")

# open an image
image = Image.open("./image.jpg")

# query the image
result = model.query(image, "Is this a hot dog?")
print("Answer: ", result["answer"])

###
https://arxiv.org/pdf/2412.18653
1.58-bit FLUX
Chenglin Yang¹, Celong Liu¹, Xueqing Deng¹, Dongwon Kim², Xing Mei¹, Xiaohui Shen¹, Liang-Chieh Chen¹
¹ByteDance  ²POSTECH

We present 1.58-bit FLUX, the first successful approach to quantizing the state-of-the-art text-to-image generation model, FLUX.1-dev, using 1.58-bit weights (i.e., values in {-1, 0, +1}) while maintaining comparable performance for generating 1024 x 1024 images. Notably, our quantization method operates without access to image data, relying solely on self-supervision from the FLUX.1-dev model. Additionally, we develop a custom kernel optimized for 1.58-bit operations, achieving a 7.7x reduction in model storage, a 5.1x reduction in inference memory, and improved inference latency. Extensive evaluations on the GenEval and T2I Compbench benchmarks demonstrate the effectiveness of 1.58-bit FLUX in maintaining generation quality while significantly enhancing computational efficiency.

Figure 1. Visual comparisons between FLUX and 1.58-bit FLUX. 1.58-bit FLUX demonstrates comparable generation quality to FLUX
while employing 1.58-bit quantization, where 99.5% of the 11.9B parameters in the vision transformer are constrained to the values +1,
-1, or 0. For consistency, all images in each comparison are generated using the same latent noise input. 1.58-bit FLUX utilizes a custom
1.58-bit kernel. Additional visual comparisons are provided in Fig. 3 and Fig

Figure 2. Efficiency measurements on the vision transformer component of FLUX and 1.58-bit FLUX. The measurements are based
on generating a single image with 50 inference steps. (a) 1.58-bit FLUX reduces checkpoint storage by 7.7× compared to FLUX. (b)
1.58-bit FLUX achieves a 5.1× reduction in inference memory usage across various GPU types. The x-axis labels, m-nG, represent GPU
type m with a maximum memory capacity of n Gigabytes (G)

###
https://huggingface.co/papers/2407.01489
Agentless: Demystifying LLM-based Software Engineering Agents
Published on Jul 2, 2024
Authors: Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, Lingming Zhang
Abstract
Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents are equipped with the ability to use tools, run commands, observe feedback from the environment, and plan for future actions. However, the complexity of these agent-based approaches, together with the limited abilities of current LLMs, raises the following question: Do we really have to employ complex autonomous software agents? To attempt to answer this question, we build Agentless -- an agentless approach to automatically solve software development problems. Compared to the verbose and complex setup of agent-based approaches, Agentless employs a simplistic two-phase process of localization followed by repair, without letting the LLM decide future actions or operate with complex tools. Our results on the popular SWE-bench Lite benchmark show that surprisingly the simplistic Agentless is able to achieve both the highest performance (27.33%) and lowest cost (\$0.34) compared with all existing open-source software agents! Furthermore, we manually classified the problems in SWE-bench Lite and found problems with exact ground truth patch or insufficient/misleading issue descriptions. As such, we construct SWE-bench Lite-S by excluding such problematic issues to perform more rigorous evaluation and comparison. Our work highlights the current overlooked potential of a simple, interpretable technique in autonomous software development. We hope Agentless will help reset the baseline, starting point, and horizon for autonomous software agents, and inspire future work along this crucial direction.

###
https://github.com/NVlabs/VILA
NVIDIA's VILA: Optimized Vision Language Models
💡VILA is a part of the new Cosmos Nemotron VLMs
💡A family of open VLMs designed to optimize efficiency and accuracy for efficient video and multi-image understanding
💡Trending on GitHub
💡Built with Gradio
VILA is a family of open VLMs designed to optimize both efficiency and accuracy for efficient video understanding and multi-image understanding.
NVILA: Efficient Frontier Visual Language Models
Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, Yao Lu
Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We also conduct a systematic investigation to enhance the efficiency of NVILA throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training costs by 4.5X, fine-tuning memory usage by 3.4X, pre-filling latency by 1.6-2.2X, and decoding latency by 1.2-2.8X. We will soon make our code and models available to facilitate reproducibility.

💡 News
[2025/1] As of January 6, 2025 VILA is now part of the new Cosmos Nemotron vision language models.

###
https://huggingface.co/collections/HIT-TMG/kalm-embedding-67316afa4c56f4fc1f58764b
New open multilingual embedding models released! KaLM-Embedding is a series of embedding models built on Qwen 2 0.5B and released under MIT. 👀
TL;DR:
🚀 Built on Qwen2-0.5B trained on 550k synthetic data released under MIT
🧹 Implements ranking consistency filtering to remove noisy and false negative samples
📊 Achieves 64.53 average score on MTEB benchmark (64.13 C-MTEB, 64.94 MTEB)
🎯 Supports flexible dimension embedding through Matryoshka Representation Learning
🌍 Strong multilingual performance outperforms other open models
🤗 Integrated into sentence-transformers available on Hugging Face
KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model
Published on Jan 2, 2025
Authors: Xinshuo Hu, Zifei Shan, Xinping Zhao, Zetian Sun, Zhenyu Liu, Dongfang Li, Shaolin Ye, Xinyuan Wei, Qian Chen, Baotian Hu, Min Zhang
Abstract
As retrieval-augmented generation prevails in large language models, embedding models are becoming increasingly crucial. Despite the growing number of general embedding models, prior work often overlooks the critical role of training data quality. In this work, we introduce KaLM-Embedding, a general multilingual embedding model that leverages a large quantity of cleaner, more diverse, and domain-specific training data. Our model has been trained with key techniques proven to enhance performance: (1) persona-based synthetic data to create diversified examples distilled from LLMs, (2) ranking consistency filtering to remove less informative samples, and (3) semi-homogeneous task batch sampling to improve training efficacy. Departing from traditional BERT-like architectures, we adopt Qwen2-0.5B as the pre-trained model, facilitating the adaptation of auto-regressive language models for general embedding tasks. Extensive evaluations of the MTEB benchmark across multiple languages show that our model outperforms others of comparable size, setting a new standard for multilingual embedding models with <1B parameters.

###
https://arxiv.org/pdf/2501.05366
[Submitted on 9 Jan 2025]
Search-o1: Agentic Search-Enhanced Large Reasoning Models
Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, Zhicheng Dou
Large reasoning models (LRMs) like OpenAI-o1 have demonstrated impressive long stepwise reasoning capabilities through large-scale reinforcement learning. However, their extended reasoning processes often suffer from knowledge insufficiency, leading to frequent uncertainties and potential errors. To address this limitation, we introduce \textbf{Search-o1}, a framework that enhances LRMs with an agentic retrieval-augmented generation (RAG) mechanism and a Reason-in-Documents module for refining retrieved documents. Search-o1 integrates an agentic search workflow into the reasoning process, enabling dynamic retrieval of external knowledge when LRMs encounter uncertain knowledge points. Additionally, due to the verbose nature of retrieved documents, we design a separate Reason-in-Documents module to deeply analyze the retrieved information before injecting it into the reasoning chain, minimizing noise and preserving coherent reasoning flow. Extensive experiments on complex reasoning tasks in science, mathematics, and coding, as well as six open-domain QA benchmarks, demonstrate the strong performance of Search-o1. This approach enhances the trustworthiness and applicability of LRMs in complex reasoning tasks, paving the way for more reliable and versatile intelligent systems. The code is available at \url{this https URL}.

• We propose Search-o1, the first framework that integrates the agentic search workflow into the o1-like reasoning process of LRMs for achieving autonomous knowledge supplementation.
• To effectively integrate external knowledge during reasoning, Search-o1 combines the reasoning process with an agentic RAG mechanism and a knowledge refinement module. This design enables the LRM to retrieve external knowledge on demand, seamlessly incorporating it into the reasoning chain while maintaining the original logical flow.
• With five complex reasoning domains and six open-domain QA benchmarks, we demonstrate that Search-o1 achieves remarkable performance in the reasoning field while maintaining substantial improvements in general knowledge. Further quantitative analysis confirms its efficiency and scalability, offering practical guidance for trustworthy reasoning in LRMs.


###
https://arxiv.org/pdf/2501.03220
ProTracker: Probabilistic Integration for Robust and Accurate Point Tracking
Tingyang Zhang¹,², Chen Wang¹, Zhiyang Dou¹,³, Qingzhe Gao⁴, Jiahui Lei¹, Baoquan Chen², Lingjie Liu¹
¹University of Pennsylvania  ²Peking University  ³The University of Hong Kong  ⁴Shandong University
{tyzh,chenw30,zydou,leijh,lingjie.liu}@seas.upenn.edu; gaoqingzhe97@gmail.com; baoquan@pku.edu.cn
Figure 1. Visualization of tracking trajectories in various videos. Our method robustly recovers each point’s complete trajectory without
drifting over time, even in challenging scenarios such as occlusions and multiple similar regions.
Abstract
In this paper, we propose ProTracker, a novel framework for robust and accurate long-term dense tracking of arbitrary points in videos. The key idea of our method is incorporating probabilistic integration to refine multiple predictions from both optical flow and semantic features for robust short-term and long-term tracking. Specifically, we integrate optical flow estimations in a probabilistic manner, producing smooth and accurate trajectories by maximizing the likelihood of each prediction. To effectively re-localize challenging points that disappear and reappear due to occlusion, we further incorporate long-term feature correspondence into our flow predictions for continuous trajectory generation. Extensive experiments show that ProTracker achieves the state-of-the-art performance among unsupervised and self-supervised approaches, and even outperforms supervised methods on several benchmarks. Our code and model will be publicly available upon publication.
Project page: https://michaelszj.github.io/protracker
Pipeline

Pipeline overview of our proposed method. (1) Sample & Chain: Key points are initially sampled and linked through optical flow chaining to produce preliminary trajectory predictions. (2) Long-term Correspondence: Key points are re-localized over longer time spans to maintain continuity, even for points that temporarily disappear. (3) Dual-Stage Filter: Masks and feature filters are applied to remove incorrect predictions, reducing noise for subsequent steps. (4) Probabilistic Integration: Filtered flow predictions across frames are first integrated and then combined with long-term keypoint to produce the final prediction, producing smoother and more consistent trajectories.



TAP-Vid-DAVIS Comparisons

Qualitative comparisons to DINO-Tracker [1], CaDex++ [2] and LocoTrack [3] on TAP-Vid-DAVIS [7].
Our method is able to capture finer details and recover the full trajectory of less distinctive points.



Qualitative comparisons to TAPTR [4], SpaTrack [5] and Co-Tracker [6] on TAP-Vid-DAVIS [7].
While these sliding window based trackers are prone to drift and vulnerable to occlusions, our method reliably maintains accurate tracking of the same point.





Comparisons on challenging videos

To further illustrate our method's robustness, we conduct experiments on challenging videos from the web.Some of the previous methods relies on computing heatmap between the query point and the target frame. However, the per-frame heatmap lacks temporal-awareness and may confuse between different objects. We address this issue by leveraging mask and combining the heatmap with optical flow. By comparing the results of our method with DINO-Tracker[1] and TAPIR[8], we show that although our method also relies on per-frame heatmap to extract keypoints, our method has strong temporal-awareness and is able to tell between similar objects.




To further demonstrate the robustness of our method, we conduct experiments on extended videos from TAP-Vid-DAVIS, simulating high frame-rate videos by repeating each frame three times. In contrast to typical sliding-window or flow-based trackers (such as TAPTR [4], SpatialTracker [5] and Co-Tracker [6]), which tend to accumulate errors and drift over time, our integration of long-term key points with short-term optical flow enables continuous, drift-free tracking of the same point through occlusions. Experiments are conducted in full resolution.





Ablations

We conduct ablation study on different components of our method. w/o key indicates directly using the results from the flow integration as output without the joint integration with long-term key points. w/o geo removes the process of filtering by the geometry-aware feature. w/o mask uses the rough flow prediction without object-level filtering. w/o pro replaces the probabilistic integration by choosing the prediction of the lowest σ as the final results. We visualize the results on libby, parkour, horsejump-high, shooting and car-roundabout, respectively.
The results show that without long-term keypoints, the methods cannot locate some points when they reappear after occlusion (e.g. libby, parkour);
without geometry-aware feature, the methods may drift to other parts (e.g. car-roundabout,shooting);
without mask, the methods may confuse between different objects (e.g. parkour, shooting);
without probabilistic integration, the methods can be less accurate (e.g. car-roundabout, horsejump-high).

###
https://arxiv.org/pdf/2501.01880
Long Context vs. RAG for LLMs: An Evaluation and Revisits
Xinze Li¹, Yixin Cao²†, Yubo Ma¹, Aixin Sun¹†
¹S-Lab, Nanyang Technological University  ²School of Computer Science, Fudan University
{xinze002, yubo001}@e.ntu.edu.sg; axsun@ntu.edu.sg; yxcao@fudan.edu.cn
Abstract
Extending context windows (i.e., Long Context, LC) and using retrievers to selectively access relevant information (i.e., Retrieval-Augmented Generation, RAG) are the two main strategies to enable LLMs to incorporate extremely long external contexts. This paper revisits recent studies on this topic, highlighting their key insights and discrepancies. We then provide a more comprehensive evaluation by filtering out questions answerable without external context, identifying the most effective retrieval methods, and expanding the datasets. We show that LC generally outperforms RAG in question-answering benchmarks, especially for Wikipedia-based questions. Summarization-based retrieval performs comparably to LC, while chunk-based retrieval lags behind. However, RAG has advantages in dialogue-based and general question queries. These insights underscore the trade-offs between RAG and LC strategies, offering guidance for future optimization of LLMs with external knowledge sources. We also provide an in-depth discussion on this topic, highlighting the overlooked importance of context relevance in existing studies.

Our key contributions in this paper are as follows: (i) Providing a comprehensive survey of existing studies on LC and RAG, analyzing their implementations and key insights. (ii) Proposing a fair and systematic evaluation framework, and performing detailed analyses to understand the strengths and limitations of LC and RAG. (iii) Discussing challenges for comparing and combining LC and RAG, reflecting on the key points that researchers tend to overlook in this field. Evaluation results indicate that LC models generally outperform RAG when processing self-contained information like stories, while RAG excels at handling fragmented information, particularly in dialogue-based contexts. These experiments deepen our understanding of the strengths and limitations of LC and RAG, offering valuable insights into optimizing retrieval strategies and effectively integrating these approaches to enhance performance in open-domain question answering. These findings are also based on a systematic survey of existing studies on this topic (see § 2). Additionally, we discuss key aspects of comparing LC and RAG in § 6, highlighting areas that have been underexplored in prior research.

[1] The experiment code and expanded datasets are available at https://github.com/lixinze777/LC_VS_RAG

###
https://huyenchip.com/2025/01/07/agents.html
Agents
Jan 7, 2025 • Chip Huyen

Intelligent agents are considered by many to be the ultimate goal of AI. The classic book by Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach (Prentice Hall, 1995), defines the field of AI research as “the study and design of rational agents.”

The unprecedented capabilities of foundation models have opened the door to agentic applications that were previously unimaginable. These new capabilities make it finally possible to develop autonomous, intelligent agents to act as our assistants, coworkers, and coaches. They can help us create a website, gather data, plan a trip, do market research, manage a customer account, automate data entry, prepare us for interviews, interview our candidates, negotiate a deal, etc. The possibilities seem endless, and the potential economic value of these agents is enormous.

This section will start with an overview of agents and then continue with two aspects that determine the capabilities of an agent: tools and planning. Agents, with their new modes of operations, have new modes of failure. This section will end with a discussion on how to evaluate agents to catch these failures.

This post is adapted from the Agents section of AI Engineering (2025) with minor edits to make it a standalone post.

Notes:

AI-powered agents are an emerging field with no established theoretical frameworks for defining, developing, and evaluating them. This section is a best-effort attempt to build a framework from the existing literature, but it will evolve as the field does. Compared to the rest of the book, this section is more experimental. I received helpful feedback from early reviewers, and I hope to get feedback from readers of this blog post, too.
Just before this book came out, Anthropic published a blog post on Building effective agents (Dec 2024). I’m glad to see that Anthropic’s blog post and my agent section are conceptually aligned, though with slightly different terminologies. However, Anthropic’s post focuses on isolated patterns, whereas my post covers why and how things work. I also focus more on planning, tool selection, and failure modes.
The post contains a lot of background information. Feel free to skip ahead if it feels a little too in the weeds!
Agents Overview
Great write-up on Agents by Chip.
Here are my takeaways:
🤖 Agents Overview
An AI agent is made up of both the environment it operates in (e.g., a game, the internet, or computer system) and the set of actions it can perform through its available tools. This dual definition is fundamental to understanding how agents work.
👨‍💻 Agent Example
The figure shows an example of an agent built on top of GPT-4. The environment is the computer which has access to a terminal and filesystem. The set of action include navigate, searching files, viewing files, etc.
🧰 Importance of Tools
Tools allow agents to both perceive their environment (through read actions) and modify it (through write actions). Adding appropriate tools can dramatically expand what an agent can do, from performing calculations to accessing real-time information.
💡 Tool Selection
More tools give agents more capabilities but also make it harder for them to use them effectively. Finding the right tool inventory requires careful experimentation and analysis of usage patterns.
🧩 Planning
Effective agents require robust planning capabilities to break down complex tasks into manageable steps. This planning should ideally be decoupled from execution to allow for validation before running potentially costly or time-consuming operations.
📍 Foundation Models Can Act as Planners
While there's debate about whether LLMs can truly plan, they can be effective components of planning systems, especially when augmented with appropriate tools and reflection capabilities.
⛓️ Multi-Agent Systems
Most practical agent implementations are multi-agent systems, with different components handling plan generation, validation, and execution. This separation of concerns allows for better specialization and error handling.
🎛️ Control Flows
Agent plans can involve various control flows beyond simple sequential execution, including parallel execution, conditional statements, and loops. However, more complex control flows are harder to generate and execute correctly.
💭 Reflection and Error Correction
While not strictly required, reflection capabilities (the ability to evaluate progress and correct mistakes) significantly improve agent performance. This can be implemented through self-critique or separate evaluation components.
❌ Failure Modes
Agents can fail in multiple ways, including planning failures (invalid tools or parameters), tool execution failures (incorrect outputs), and efficiency failures (taking too long or using too many resources).
📈 Evaluation
Proper agent evaluation needs to consider multiple metrics, including success rate, efficiency, cost, and time taken. This should be done across different tasks and compared against appropriate baselines.

Write in as much technical detail as possible. There are 12 articles; include every one of them without omission.