➡️ Google은 제너레이티브 AI 모델의 기능을 확장하는 ‘Agent’ 개념과 아키텍처를 상세히 설명하는 백서를 발표했습니다. 이 백서는 에이전트의 핵심 구성 요소인 모델, 도구, 오케스트레이션 레이어를 설명하고, LangChain 및 Vertex AI를 이용한 에이전트 구축 사례를 제시합니다.

➡️ Hugging Face에서는 LLM에 에이전트 기능을 통합하는 간단한 라이브러리인 ‘smolagents’를 출시했습니다. smolagents는 코드 에이전트를 위한 강력한 지원을 제공하며, 다양한 LLM과 도구를 허브를 통해 통합할 수 있도록 설계되었습니다.

➡️ Meta FAIR에서는 가상 에이전트 제어, 비디오 워터마킹, 새로운 언어 모델링 방식 등 다양한 AI 연구 결과와 오픈소스 모델을 공개했습니다.

➡️ NVIDIA는 문서에서 콘텐츠와 메타데이터를 추출하는 마이크로서비스인 ‘NVIDIA Ingest’를 발표했습니다. 이 서비스는 다양한 문서 형식을 지원하며, LLM 애플리케이션에 활용될 수 있도록 설계되었습니다.

➡️ 이 외에도 웹 크롤링 라이브러리, PDF 번역 도구, LLM 성능 향상 기법 등 다양한 AI 관련 소식이 있었습니다.

Google, Agents

링크, September 2024

  • 제너레이티브 AI 에이전트는 목표 달성을 위해 외부 세계를 관찰하고 도구를 사용하여 행동하는 애플리케이션으로 정의됨.
  • 에이전트는 자율적으로 작동하며, 명시적인 지시 없이도 목표 달성을 위해 추론 가능.
  • 에이전트의 핵심 구성 요소는 의사 결정 역할을 하는 모델, 외부 세계와의 상호 작용을 가능하게 하는 도구, 정보 처리 및 의사 결정을 관리하는 오케스트레이션 레이어임.
  • 모델은 명령어 기반 추론 및 논리 프레임워크(ReAct, CoT, ToT)를 따르는 LM을 활용하며, 특정 에이전트 아키텍처의 요구 사항에 따라 다양한 크기 및 방식으로 구성 가능.
  • 도구는 에이전트가 외부 데이터 및 서비스와 상호 작용하도록 지원하며, 웹 API 메소드(GET, POST 등)와 유사한 형태로 제공됨.
  • 확장 프로그램(Extensions), 함수(Functions), 데이터 저장소(Data Stores)는 Google 모델이 상호 작용할 수 있는 주요 도구 유형임.
  • 확장 프로그램은 API와 에이전트 간의 간극을 표준화된 방식으로 연결하여 에이전트가 API의 구현 방식에 관계없이 원활하게 실행하도록 함.
  • 함수는 특정 작업을 수행하는 재사용 가능한 코드 모듈로, 모델이 함수 호출 시점과 필요한 인수를 결정함. 함수는 클라이언트 측에서 실행됨.
  • 데이터 저장소는 에이전트가 최신 정보에 접근하도록 지원하며, 벡터 데이터베이스 형태로 구현되어 RAG(Retrieval Augmented Generation) 애플리케이션에 활용됨.
  • 에이전트의 응답 품질은 모델의 추론 능력, 올바른 도구 선택 능력, 도구 정의의 정확성에 따라 결정됨.
  • 모델 성능 향상을 위해 인-컨텍스트 학습, 검색 기반 인-컨텍스트 학습, 파인튜닝 기반 학습 등 다양한 타겟 학습 방식이 활용될 수 있음.
  • LangChain 및 LangGraph 라이브러리를 사용하여 실제 에이전트 프로토타입을 구축하는 예시 제시.
  • Vertex AI 플랫폼은 에이전트 개발, 테스트, 평가, 배포를 위한 완전 관리형 환경을 제공함.

Hugging Face, Introducing smolagents, a simple library to build agents

링크, December 31, 2024

  • smolagents는 LLM에 에이전트 기능을 통합하는 간단한 Python 라이브러리임.
  • 에이전트는 LLM 출력이 워크플로우를 제어하는 프로그램으로 정의되며, 에이전시 수준은 LLM이 워크플로우에 미치는 영향력에 따라 연속적인 스펙트럼으로 표현됨.
  • 멀티 스텝 에이전트는 루프를 통해 작업을 수행하며, 각 단계에서 LLM은 외부 도구를 호출하는 액션을 작성하고, 관찰 결과를 바탕으로 다음 단계를 결정함.
  • 에이전트는 LLM이 앱의 워크플로우를 결정해야 할 때 유용하지만, 미리 결정된 워크플로우로 충분한 경우 과도한 설정일 수 있음.
  • 코드 에이전트는 액션을 JSON 대신 코드로 작성하는 방식으로, 코드의 표현력과 LLM 학습 데이터와의 연관성을 활용하여 성능 향상을 기대할 수 있음.
  • smolagents는 단순성, 코드 에이전트에 대한 강력한 지원, 허브 통합, 다양한 LLM 지원을 목표로 개발됨.
  • 에이전트 구축을 위해서는 도구 목록과 LLM 모델이 필요하며, 도구는 타입 힌트와 독스트링을 포함한 함수로 정의하고 @tool 데코레이터를 사용하여 만들 수 있음 (아래 간단한 예시 참고).
  • 허브를 통해 도구를 공유하고 로드하는 기능 지원.
  • 오픈 소스 모델이 에이전트 워크플로우에서 최고의 성능을 보이는 클로즈드 모델에 필적할 수 있음을 보여주는 벤치마크 결과 제시.
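
다음은 smolagents 공개 문서에 소개된 사용 방식을 바탕으로, @tool 데코레이터로 도구를 정의하고 CodeAgent를 구성하는 최소 예시 스케치임. 도구 내용(get_weather)과 질의 문구는 설명을 위한 가정이며, 세부 동작은 라이브러리 버전에 따라 다를 수 있음.

from smolagents import CodeAgent, HfApiModel, tool

@tool
def get_weather(city: str) -> str:
    """지정한 도시의 현재 날씨를 문자열로 반환함 (설명용 더미 구현).

    Args:
        city: 날씨를 조회할 도시 이름.
    """
    # 실제 구현에서는 외부 날씨 API를 호출함
    return f"{city}: 맑음, 12도"

# 허브 API 기반 LLM(HfApiModel)과 도구 목록으로 코드 에이전트 구성
agent = CodeAgent(tools=[get_weather], model=HfApiModel())
agent.run("서울의 현재 날씨를 알려줘")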

Thytu, Agentarium

링크, 2025/1/2

  • Agentarium은 AI 에이전트 관리 및 오케스트레이션을 위한 새로운 Python 프레임워크임.
  • 주요 기능으로는 고급 에이전트 관리, 강력한 상호 작용 관리, 체크포인트 시스템, 데이터 생성, 성능 최적화, 유연한 환경 구성 (YAML), 확장 가능한 아키텍처 등이 있음.
  • 다양한 역할과 기능을 가진 여러 AI 에이전트를 생성하고 오케스트레이션 가능.
  • 에이전트 간의 복잡한 상호 작용을 조정.
  • 에이전트 상태 및 상호 작용을 저장하고 복원하는 기능 제공.
  • 에이전트 상호 작용을 통해 합성 데이터 생성 가능.
  • 효율성과 확장성을 고려하여 구축됨.
  • YAML 구성 파일을 사용하여 사용자 정의 환경 정의 가능.
  • 특정 요구 사항에 맞게 확장 및 사용자 정의가 용이한 아키텍처 제공.

Byaidu, PDFMathTranslate

링크, 2024/12/1

  • PDF 과학 논문 번역 및 이중 언어 비교 도구.
  • 수식, 차트, 목차 및 주석 유지 (미리보기).
  • 다양한 언어 및 번역 서비스 지원.
  • 명령줄 도구, 대화형 사용자 인터페이스 및 Docker 제공.
  • GitHub Issues, Telegram Group 또는 QQ Group을 통해 피드백 제공 가능.
  • 기여 방법은 Contribution Guide 참조.

Meta FAIR, Memory Layers at Scale

링크, December 23, 2024

  • Meta FAIR에서 메모리 레이어를 활용하여 모델의 매개변수 수를 늘리지 않고도 정보를 저장하고 검색하는 새로운 기술을 개발하고 관련 연구 결과 및 모델을 오픈소스로 공개함.
  • 메모리 레이어는 훈련 가능한 키-값 조회 메커니즘을 사용하여 FLOPs 증가 없이 모델에 매개변수를 추가함 (아래 개념 스케치 참고).
  • 희소하게 활성화되는 메모리 레이어는 연산 집약적인 밀집 피드포워드 레이어를 보완하여 정보를 저렴하게 저장하고 검색할 수 있는 전용 용량을 제공함.
  • 개선된 메모리 레이어로 강화된 언어 모델은 다운스트림 작업에서 두 배 이상의 연산 예산을 가진 밀집 모델보다 성능이 뛰어나며, 연산량과 매개변수 수가 일치할 때 MoE 모델보다도 우수한 성능을 보임.
  • 특히 사실적인 작업에서 성능 향상이 두드러짐.
  • 최대 1280억 개의 메모리 매개변수를 사용하여 확장 법칙을 보여주는 완전 병렬화 가능한 메모리 레이어 구현을 제공하며, 1조 개의 토큰으로 사전 훈련하여 최대 80억 개의 매개변수를 가진 기본 모델과 비교함.
  • Meta FAIR에서 에이전트, 견고성 및 안전성, 머신 러닝을 용이하게 하는 아키텍처 개발에 대한 최근 혁신을 강조하는 여러 새로운 연구 결과 공개.
  • Meta Motivo (가상 구현 에이전트의 동작 제어를 위한 파운데이션 모델) 및 Meta Video Seal (오픈 소스 비디오 워터마킹 모델) 포함.
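
아래는 '훈련 가능한 키-값 조회'라는 아이디어를 이해하기 위한 개념 수준의 PyTorch 스케치임. 실제 논문은 product-key 분해 등으로 조회 비용을 크게 줄이지만, 이 스케치는 이해를 돕기 위해 모든 키에 대해 점수를 계산하는 단순한 형태이며 Meta의 실제 구현과는 다름.

import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyValueMemoryLayer(nn.Module):
    """학습 가능한 키-값 조회 메모리 레이어의 단순화된 스케치."""

    def __init__(self, d_model: int, num_keys: int = 1024, topk: int = 8):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_keys, d_model) * 0.02)  # 학습되는 키
        self.values = nn.Embedding(num_keys, d_model)                    # 학습되는 값
        self.query_proj = nn.Linear(d_model, d_model)
        self.topk = topk

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        q = self.query_proj(x)
        scores = q @ self.keys.t()                      # 모든 키와의 유사도 (스케치용 브루트포스)
        top_scores, top_idx = scores.topk(self.topk, dim=-1)
        weights = F.softmax(top_scores, dim=-1)         # 상위 k개 키만 희소하게 활성화
        top_values = self.values(top_idx)               # (batch, seq, k, d_model)
        return (weights.unsqueeze(-1) * top_values).sum(dim=-2)

# 사용 예: 트랜스포머 블록의 피드포워드 자리에 끼워 넣는 식으로 활용 가능
layer = KeyValueMemoryLayer(d_model=64)
out = layer(torch.randn(2, 10, 64))  # 출력 형태: (2, 10, 64)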

Meta FAIR, Sharing new research, models, and datasets from Meta FAIR

링크, December 12, 2024

  • Meta FAIR에서 더욱 강력한 에이전트 구축, 견고성 및 안전성 확보, 모델이 새로운 정보를 보다 효과적으로 학습하고 현재 한계를 뛰어넘어 확장할 수 있도록 지원하는 아키텍처 혁신에 중점을 둔 최신 연구, 코드, 모델 및 데이터 세트 공개.
  • Meta Video Seal (신경 비디오 워터마킹을 위한 최첨단 포괄적인 프레임워크 데모 및 코드), Meta Omni Seal Bench (신경 워터마킹 전용 리더보드), Meta Watermark Anything 모델 (허용 라이선스로 재출시) 공개.
  • Meta Motivo (가상 구현 휴머노이드 에이전트의 움직임을 제어하는 행동 파운데이션 모델), Flow Matching 가이드 및 코드베이스, Meta Explore Theory-of-Mind (ToM 추론을 위한 프로그램 기반 적대적 데이터 생성), Meta Large Concept Models (LCM, 새로운 언어 모델링 패러다임), Meta Dynamic Byte Latent Transformer (계층적 바이트 레벨 모델), Meta Memory Layers at Scale 연구 결과 및 코드, Meta Image Diversity Modeling 연구 업데이트 및 텍스트-이미지 생성 모델 평가 툴박스, Meta CLIP 1.2 공개.

NVIDIA, NVIDIA-Ingest

링크, 2025/3/1

  • NVIDIA Ingest는 PDF, Word, PowerPoint 문서 구문 분석을 지원하는 확장 가능하고 성능 지향적인 문서 콘텐츠 및 메타데이터 추출 마이크로서비스임.
  • 다운스트림 생성 애플리케이션에 사용하기 위해 특수화된 NVIDIA NIM 마이크로서비스를 사용하여 텍스트, 표, 차트 및 이미지를 찾고 컨텍스트화하고 추출함.
  • 문서를 페이지로 분할하는 프로세스를 병렬화하여 콘텐츠를 분류하고 (표, 차트, 이미지, 텍스트), 개별 콘텐츠로 추출하고, 광학 문자 인식 (OCR)을 통해 컨텍스트화하여 잘 정의된 JSON 스키마로 변환함.
  • 추출된 콘텐츠에 대한 임베딩 계산을 선택적으로 관리하고, 벡터 데이터베이스 Milvus에 저장을 선택적으로 관리할 수 있음.
  • 문서 페이로드와 해당 페이로드에 수행할 수집 작업을 담은 JSON 작업 설명을 제출하고 작업 결과를 조회할 수 있으며, 결과는 원본 문서에서 추출된 객체들의 메타데이터 목록과 처리 주석, 타이밍/추적 데이터를 포함하는 JSON 딕셔너리임.
  • PDF, Docx, pptx 및 이미지를 지원하며, 처리량과 정확성 간의 균형을 맞추기 위해 각 문서 유형에 대한 여러 추출 방법을 지원함 (예: PDF 문서의 경우 pdfium, Unstructured.io 및 Adobe Content Extraction Services를 통한 추출 지원).
  • 텍스트 분할 및 청킹, 변환 및 필터링, 임베딩 생성, 이미지 스토리지를 포함한 다양한 유형의 사전 및 사후 처리 작업을 지원함.

unclecode, crawl4ai

링크, 2024/12/15

  • Crawl4AI는 웹 크롤링 및 데이터 추출을 간소화하여 LLM 및 AI 애플리케이션에 즉시 사용할 수 있도록 지원하는 무료 오픈소스 라이브러리임 (아래 기본 사용 예시 참고).
  • 빠른 성능 (유료 서비스보다 뛰어남), LLM 친화적인 출력 형식 (JSON, 정리된 HTML, 마크다운), 여러 URL 동시 크롤링 지원, 모든 미디어 태그 (이미지, 오디오, 비디오) 추출, 외부 및 내부 링크 추출 기능 제공.
  • 페이지에서 메타데이터 추출, 인증, 헤더 및 페이지 수정을 위한 사용자 정의 훅, 사용자 에이전트 사용자 정의, 페이지 스크린샷 캡처, 크롤링 전에 사용자 정의 JavaScript 실행 기능 지원.
  • 실시간, 비용 효율적인 성능으로 6배 빠른 결과 제공.
  • 세션 관리, 프록시 및 원활한 데이터 액세스를 위한 사용자 정의 훅 제공.
  • 비용이 많이 드는 모델에 대한 의존도를 줄이기 위해 고급 알고리즘 사용.
  • Docker 및 클라우드 통합에 적합한 완전한 오픈 소스 라이브러리.
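
아래는 Crawl4AI README에 소개된 기본 사용 형태를 바탕으로 한 간단한 스케치임. URL은 예시이며, API 세부 사항은 라이브러리 버전에 따라 달라질 수 있음.

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # 비동기 크롤러로 페이지를 수집하고 LLM 친화적인 마크다운으로 변환
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)  # RAG 등 다운스트림 파이프라인에 바로 사용 가능

asyncio.run(main())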

Jio Oh 외, Better Think with Tables: Leveraging Tables to Enhance Large Language Model Comprehension

링크, 2024/12/22

  • LLM은 복잡한 쿼리 (특히 여러 조건이 포함된 현실 세계 시나리오)에 어려움을 겪음.
  • 테이블을 활용하여 중간 사고를 수행하도록 LLM을 지원하는 “Thinking with Tables” 기술 제안 (아래 프롬프트 스케치 참고).
  • LLM이 정보를 테이블로 구성하도록 유도하는 사전 지시를 통해 평균 40.29%의 상대적 성능 향상, 더 높은 견고성 및 다양한 요청, 조건 또는 시나리오에 대한 일반화 가능성을 달성함.
  • 데이터 구조화 수준이 모델에 미치는 영향을 비교하기 위해 네 가지의 서로 다른 구조화 수준을 소개하고 결과를 비교함.
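
논문이 말하는 '사전 지시(pre-instruction)' 아이디어를 단순화하면 아래와 같은 형태의 프롬프트가 됨. 질문 문구와 표 형식은 설명을 위한 가정임.

question = "2019년 이후 출시되었고 가격이 5만 원 이하인 제품 중 평점이 4.5 이상인 것은?"

# 답을 내기 전에 질문의 조건들을 표로 먼저 정리하도록 유도하는 사전 지시
prompt = (
    "질문에 답하기 전에, 먼저 질문에 포함된 조건들을 '| 조건 | 값 |' 형태의 표로 정리하세요.\n"
    "그 다음 표의 각 행을 차례로 확인하며 최종 답을 도출하세요.\n\n"
    f"질문: {question}"
)
# 이후 prompt를 사용 중인 LLM API에 전달하면 됨 (모델 호출 부분은 생략)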

Brian J Chan 외, Don’t Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks

링크, 2024/12/20

  • 검색 증강 생성 (RAG)은 외부 지식 소스를 통합하여 언어 모델을 향상시키는 강력한 접근 방식으로 주목받고 있지만, 검색 지연 시간, 문서 선택 오류 가능성, 시스템 복잡성 증가와 같은 문제가 있음.
  • 긴 컨텍스트 창을 특징으로 하는 대규모 언어 모델 (LLM)의 등장으로 실시간 검색을 우회하는 대안적인 패러다임인 캐시 증강 생성 (CAG)을 제안함.
  • 제한적이고 관리 가능한 크기의 문서·지식 등 관련 리소스를 LLM의 확장된 컨텍스트에 미리 로드하고, 해당 구간의 KV 캐시(런타임 매개변수)를 미리 계산해 캐싱하는 방식임 (아래 개념 스케치 참고).
  • 추론 중에 모델은 추가 검색 단계 없이 이러한 미리 로드된 매개변수를 활용하여 쿼리에 응답함.
  • CAG는 검색 지연 시간을 제거하고 컨텍스트 관련성을 유지하면서 검색 오류를 최소화함.
  • 여러 벤치마크에 대한 성능 평가 결과, 특히 제한된 지식 기반을 가진 특정 애플리케이션의 경우 CAG가 기존 RAG 파이프라인을 능가하거나 보완하는 시나리오를 강조함.
  • CAG는 RAG에 대한 간소화되고 효율적인 대안을 제공하며, 복잡성을 줄이면서 유사하거나 우수한 결과를 달성할 수 있음을 시사함.
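
아래는 CAG의 핵심 아이디어(지식 문서를 미리 KV 캐시로 인코딩해 두고, 질의 시 검색 없이 캐시에 이어서 생성)를 Hugging Face transformers로 단순화한 개념 스케치임. 모델 이름과 지식·질의 텍스트는 예시용 가정이며, 논문의 실제 구현과는 세부 사항이 다를 수 있음.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # 예시용 소형 모델 (가정)
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

knowledge = "사내 환불 규정: 구매 후 14일 이내에는 전액 환불이 가능하다."
k_ids = tok(knowledge, return_tensors="pt").input_ids

# 1) 지식 문서를 한 번만 인코딩해 KV 캐시를 확보 (오프라인 사전 로드 단계)
with torch.no_grad():
    cache = model(k_ids, use_cache=True).past_key_values

# 2) 질의 시점에는 검색 단계 없이, 미리 계산된 캐시 뒤에 이어서 생성
query = "\n질문: 구매 후 10일이 지났는데 환불이 되나요?\n답변:"
ids = tok(query, return_tensors="pt").input_ids
generated = []
with torch.no_grad():
    for _ in range(40):  # 간단한 그리디 디코딩 루프
        out = model(ids, past_key_values=cache, use_cache=True)
        cache = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id.item())
        ids = next_id  # 다음 스텝에는 새로 생성된 토큰만 전달

print(tok.decode(generated, skip_special_tokens=True))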
Sources

This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each content with detailed points, and write a report. The report format is:

(today’s date in 년 월 일) AI 소식,

Summary

(overall short summary, make summary with good details. for Summary section, explain the details starting with company name, e.g. OpenAI에서는 ~~~를 발표하였습니다.)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)

company name, Title

링크, date

  • detailed summary1, (개조식 문체 사용)
  • detailed summary2, (개조식 문체 사용)
  • detailed summary N, (개조식 문체 사용)
###
Agent by Google

Agents
Authors: Julia Wiesinger, Patrick Marlow and Vladimir Vuskovic
Acknowledgements
Reviewers and Contributors
Evan Huang
Emily Xue
Olcan Sercinoglu
Sebastian Riedel
Satinder Baveja
Antonio Gulli
Anant Nawalgaria
Curators and Editors
Antonio Gulli
Anant Nawalgaria
Grace Mollison
Technical Writer
Joey Haymaker
Designer
Michael Lanning
Table of contents
Introduction
What is an agent?
The model
The tools
The orchestration layer
Agents vs. models
Cognitive architectures: How agents operate
Tools: Our keys to the outside world
Extensions
Sample Extensions
Functions
Use cases
Function sample code
Data stores
Implementation and application
Tools recap
Enhancing model performance with targeted learning
Agent quick start with LangChain
Production applications with Vertex AI agents
Summary
Endnotes
Introduction
Humans are fantastic at messy pattern recognition tasks. However, they often rely on tools
- like books, Google Search, or a calculator - to supplement their prior knowledge before
arriving at a conclusion. Just like humans, Generative AI models can be trained to use tools
to access real-time information or suggest a real-world action. For example, a model can
leverage a database retrieval tool to access specific information, like a customer's purchase
history, so it can generate tailored shopping recommendations. Alternatively, based on a
user's query, a model can make various API calls to send an email response to a colleague
or complete a financial transaction on your behalf. To do so, the model must not only have
access to a set of external tools, it needs the ability to plan and execute any task in a self-directed fashion. This combination of reasoning, logic, and access to external information
that are all connected to a Generative AI model invokes the concept of an agent, or a
program that extends beyond the standalone capabilities of a Generative AI model. This
whitepaper dives into all these and associated aspects in more detail.
What is an agent?
In its most fundamental form, a Generative AI agent can be defined as an application that
attempts to achieve a goal by observing the world and acting upon it using the tools that it
has at its disposal. Agents are autonomous and can act independently of human intervention,
especially when provided with proper goals or objectives they are meant to achieve. Agents
can also be proactive in their approach to reaching their goals. Even in the absence of
explicit instruction sets from a human, an agent can reason about what it should do next to
achieve its ultimate goal. While the notion of agents in AI is quite general and powerful, this
whitepaper focuses on the specific types of agents that Generative AI models are capable of
building at the time of publication.
In order to understand the inner workings of an agent, let’s first introduce the foundational
components that drive the agent’s behavior, actions, and decision making. The combination
of these components can be described as a cognitive architecture, and there are many
such architectures that can be achieved by the mixing and matching of these components.
Focusing on the core functionalities, there are three essential components in an agent’s
cognitive architecture as shown in Figure 1.
Figure 1. General agent architecture and components
The model
In the scope of an agent, a model refers to the language model (LM) that will be utilized as
the centralized decision maker for agent processes. The model used by an agent can be one
or multiple LMs of any size (small / large) that are capable of following instruction-based
reasoning and logic frameworks, like ReAct, Chain-of-Thought, or Tree-of-Thoughts. Models
can be general purpose, multimodal or fine-tuned based on the needs of your specific agent
architecture. For best production results, you should leverage a model that best fits your
desired end application and, ideally, has been trained on data signatures associated with the
tools that you plan to use in the cognitive architecture. It’s important to note that the model is
typically not trained with the specific configuration settings (i.e. tool choices, orchestration/
reasoning setup) of the agent. However, it’s possible to further refine the model for the
agent’s tasks by providing it with examples that showcase the agent’s capabilities, including
instances of the agent using specific tools or reasoning steps in various contexts.
The tools
Foundational models, despite their impressive text and image generation, remain constrained
by their inability to interact with the outside world. Tools bridge this gap, empowering agents
to interact with external data and services while unlocking a wider range of actions beyond
that of the underlying model alone. Tools can take a variety of forms and have varying
depths of complexity, but typically align with common web API methods like GET, POST,
PATCH, and DELETE. For example, a tool could update customer information in a database
or fetch weather data to influence a travel recommendation that the agent is providing to
the user. With tools, agents can access and process real-world information. This empowers
them to support more specialized systems like retrieval augmented generation (RAG),
which significantly extends an agent’s capabilities beyond what the foundational model can
achieve on its own. We’ll discuss tools in more detail below, but the most important thing
to understand is that tools bridge the gap between the agent’s internal capabilities and the
external world, unlocking a broader range of possibilities.
The orchestration layer
The orchestration layer describes a cyclical process that governs how the agent takes in
information, performs some internal reasoning, and uses that reasoning to inform its next
action or decision. In general, this loop will continue until an agent has reached its goal or a
stopping point. The complexity of the orchestration layer can vary greatly depending on the
agent and task it’s performing. Some loops can be simple calculations with decision rules,
while others may contain chained logic, involve additional machine learning algorithms, or
implement other probabilistic reasoning techniques. We’ll discuss more about the detailed
implementation of the agent orchestration layers in the cognitive architecture section.
Agents vs. models
To gain a clearer understanding of the distinction between agents and models, consider the
following chart:
Models:
• Knowledge is limited to what is available in their training data.
• Single inference / prediction based on the user query. Unless explicitly implemented for the model, there is no management of session history or continuous context. (i.e. chat history)
• No native tool implementation.
• No native logic layer implemented. Users can form prompts as simple questions or use reasoning frameworks (CoT, ReAct, etc.) to form complex prompts to guide the model in prediction.

Agents:
• Knowledge is extended through the connection with external systems via tools.
• Managed session history (i.e. chat history) to allow for multi-turn inference / prediction based on user queries and decisions made in the orchestration layer. In this context, a ‘turn’ is defined as an interaction between the interacting system and the agent. (i.e. 1 incoming event/query and 1 agent response)
• Tools are natively implemented in agent architecture.
• Native cognitive architecture that uses reasoning frameworks like CoT, ReAct, or other pre-built agent frameworks like LangChain.
Cognitive architectures: How agents operate
Imagine a chef in a busy kitchen. Their goal is to create delicious dishes for restaurant
patrons which involves some cycle of planning, execution, and adjustment.
• They gather information, like the patron’s order and what ingredients are in the pantry
and refrigerator.
• They perform some internal reasoning about what dishes and flavor profiles they can
create based on the information they have just gathered.
• They take action to create the dish: chopping vegetables, blending spices, searing meat.
At each stage in the process the chef makes adjustments as needed, refining their plan as
ingredients are depleted or customer feedback is received, and uses the set of previous
outcomes to determine the next plan of action. This cycle of information intake, planning,
executing, and adjusting describes a unique cognitive architecture that the chef employs to
reach their goal.
Just like the chef, agents can use cognitive architectures to reach their end goals by
iteratively processing information, making informed decisions, and refining next actions
based on previous outputs. At the core of agent cognitive architectures lies the orchestration
layer, responsible for maintaining memory, state, reasoning and planning. It uses the rapidly
evolving field of prompt engineering and associated frameworks to guide reasoning and
planning, enabling the agent to interact more effectively with its environment and complete
tasks. Research in the area of prompt engineering frameworks and task planning for
language models is rapidly evolving, yielding a variety of promising approaches. While not an
exhaustive list, these are a few of the most popular frameworks and reasoning techniques
available at the time of this publication:
• ReAct, a prompt engineering framework that provides a thought process strategy for
language models to Reason and take action on a user query, with or without in-context
examples. ReAct prompting has shown to outperform several SOTA baselines and improve
human interoperability and trustworthiness of LLMs.
• Chain-of-Thought (CoT), a prompt engineering framework that enables reasoning
capabilities through intermediate steps. There are various sub-techniques of CoT including
self-consistency, active-prompt, and multimodal CoT that each have strengths and
weaknesses depending on the specific application.
• Tree-of-thoughts (ToT), a prompt engineering framework that is well suited for
exploration or strategic lookahead tasks. It generalizes over chain-of-thought prompting
and allows the model to explore various thought chains that serve as intermediate steps
for general problem solving with language models.
Agents can utilize one of the above reasoning techniques, or many other techniques, to
choose the next best action for the given user request. For example, let’s consider an agent
that is programmed to use the ReAct framework to choose the correct actions and tools for
the user query. The sequence of events might go something like this:
1. User sends query to the agent
2. Agent begins the ReAct sequence
3. The agent provides a prompt to the model, asking it to generate one of the next ReAct
steps and its corresponding output:
a. Question: The input question from the user query, provided with the prompt
b. Thought: The model’s thoughts about what it should do next
c. Action: The model’s decision on what action to take next
i. This is where tool choice can occur
ii. For example, an action could be one of [Flights, Search, Code, None], where the first
3 represent a known tool that the model can choose, and the last represents “no
tool choice”
d. Action input: The model’s decision on what inputs to provide to the tool (if any)
e. Observation: The result of the action / action input sequence
i. This thought / action / action input / observation could repeat N-times as needed
f. Final answer: The model’s final answer to provide to the original user query
4. The ReAct loop concludes and a final answer is provided back to the user
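To make the loop above concrete, here is an illustrative sketch of the prompt scaffolding a developer might assemble for a single ReAct pass. The tool names and placeholder values are hypothetical rather than taken from the whitepaper; a real agent would substitute the chosen tool's actual output for the Observation and repeat the Thought / Action / Observation block until a Final Answer is produced.
Python
REACT_TEMPLATE = """Answer the question using the tools [Flights, Search, Code, None].

Question: {question}
Thought: {thought}
Action: {action}
Action Input: {action_input}
Observation: {observation}
... (Thought / Action / Action Input / Observation may repeat N times)
Final Answer: {final_answer}"""

# Fill in one illustrative pass through the loop
step = REACT_TEMPLATE.format(
    question="Book me a flight from Austin to Zurich",
    thought="I should look up available flights first.",
    action="Flights",
    action_input="Austin to Zurich, one-way",
    observation="<the Flights tool's response goes here>",
    final_answer="<produced once enough observations have been gathered>",
)
print(step)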
Figure 2. Example agent with ReAct reasoning in the orchestration layer
As shown in Figure 2, the model, tools, and agent configuration work together to provide
a grounded, concise response back to the user based on the user’s original query. While
the model could have guessed at an answer (hallucinated) based on its prior knowledge,
it instead used a tool (Flights) to search for real-time external information. This additional
information was provided to the model, allowing it to make a more informed decision based
on real factual data and to summarize this information back to the user.
In summary, the quality of agent responses can be tied directly to the model’s ability to
reason and act about these various tasks, including the ability to select the right tools, and
how well those tools have been defined. Like a chef crafting a dish with fresh ingredients and
attentive to customer feedback, agents rely on sound reasoning and reliable information to
deliver optimal results. In the next section, we’ll dive into the various ways agents connect
with fresh data.
Tools: Our keys to the outside world
While language models excel at processing information, they lack the ability to directly
perceive and influence the real world. This limits their usefulness in situations requiring
interaction with external systems or data. This means that, in a sense, a language model
is only as good as what it has learned from its training data. But regardless of how much
data we throw at a model, they still lack the fundamental ability to interact with the outside
world. So how can we empower our models to have real-time, context-aware interaction with
external systems? Functions, Extensions, Data Stores and Plugins are all ways to provide this
critical capability to the model.
While they go by many names, tools are what create a link between our foundational models
and the outside world. This link to external systems and data allows our agent to perform a
wider variety of tasks and do so with more accuracy and reliability. For instance, tools can
enable agents to adjust smart home settings, update calendars, fetch user information from
a database, or send emails based on a specific set of instructions.
As of the date of this publication, there are three primary tool types that Google models are
able to interact with: Extensions, Functions, and Data Stores. By equipping agents with tools,
we unlock a vast potential for them to not only understand the world but also act upon it,
opening doors to a myriad of new applications and possibilities.
Extensions
The easiest way to understand Extensions is to think of them as bridging the gap between
an API and an agent in a standardized way, allowing agents to seamlessly execute APIs
regardless of their underlying implementation. Let’s say that you’ve built an agent with a goal
of helping users book flights. You know that you want to use the Google Flights API to retrieve
flight information, but you’re not sure how you’re going to get your agent to make calls to this
API endpoint.
Figure 3. How do Agents interact with External APIs?
One approach could be to implement custom code that would take the incoming user query,
parse the query for relevant information, then make the API call. For example, in a flight
booking use case a user might state “I want to book a flight from Austin to Zurich.” In this
scenario, our custom code solution would need to extract “Austin” and “Zurich” as relevant
entities from the user query before attempting to make the API call. But what happens if the
user says “I want to book a flight to Zurich” and never provides a departure city? The API call
would fail without the required data and more code would need to be implemented in order
to catch edge and corner cases like this. This approach is not scalable and could easily break
in any scenario that falls outside of the implemented custom code.
A more resilient approach would be to use an Extension. An Extension bridges the gap
between an agent and an API by:
1. Teaching the agent how to use the API endpoint using examples.
2. Teaching the agent what arguments or parameters are needed to successfully call the
API endpoint.
Figure 4. Extensions connect Agents to External APIs
Extensions can be crafted independently of the agent, but should be provided as part of the
agent’s configuration. The agent uses the model and examples at run time to decide which
Extension, if any, would be suitable for solving the user’s query. This highlights a key strength
of Extensions, their built-in example types, that allow the agent to dynamically select the
most appropriate Extension for the task.
Figure 5. 1-to-many relationship between Agents, Extensions and APIs
Think of this the same way that a software developer decides which API endpoints to use
while solving and solutioning for a user’s problem. If the user wants to book a flight, the
developer might use the Google Flights API. If the user wants to know where the nearest
coffee shop is relative to their location, the developer might use the Google Maps API. In
this same way, the agent / model stack uses a set of known Extensions to decide which one
will be the best fit for the user’s query. If you’d like to see Extensions in action, you can try
them out on the Gemini application by going to Settings > Extensions and then enabling any
you would like to test. For example, you could enable the Google Flights extension then ask
Gemini “Show me flights from Austin to Zurich leaving next Friday.”
Sample Extensions
To simplify the usage of Extensions, Google provides some out of the box extensions that
can be quickly imported into your project and used with minimal configurations. For example,
the Code Interpreter extension in Snippet 1 allows you to generate and run Python code from
a natural language description.
Python
import vertexai
import pprint

PROJECT_ID = "YOUR_PROJECT_ID"
REGION = "us-central1"
vertexai.init(project=PROJECT_ID, location=REGION)

from vertexai.preview.extensions import Extension

extension_code_interpreter = Extension.from_hub("code_interpreter")
CODE_QUERY = """Write a python method to invert a binary tree in O(n) time."""

response = extension_code_interpreter.execute(
    operation_id="generate_and_execute",
    operation_params={"query": CODE_QUERY},
)

print("Generated Code:")
pprint.pprint(response['generated_code'])
# The above snippet will generate the following code.

Generated Code:
class TreeNode:
    def __init__(self, val=0, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

def invert_binary_tree(root):
    """
    Inverts a binary tree.

    Args:
        root: The root of the binary tree.

    Returns:
        The root of the inverted binary tree.
    """
    if not root:
        return None

    # Swap the left and right children recursively
    root.left, root.right = (
        invert_binary_tree(root.right),
        invert_binary_tree(root.left),
    )
    return root

# Example usage:

# Construct a sample binary tree
root = TreeNode(4)
root.left = TreeNode(2)
root.right = TreeNode(7)
root.left.left = TreeNode(1)
root.left.right = TreeNode(3)
root.right.left = TreeNode(6)
root.right.right = TreeNode(9)

# Invert the binary tree
inverted_root = invert_binary_tree(root)

Snippet 1. Code Interpreter Extension can generate and run Python code
To summarize, Extensions provide a way for agents to perceive, interact, and influence the
outside world in a myriad of ways. The selection and invocation of these Extensions is guided
by the use of Examples, all of which are defined as part of the Extension configuration.
Functions
In the world of software engineering, functions are defined as self-contained modules
of code that accomplish a specific task and can be reused as needed. When a software
developer is writing a program, they will often create many functions to do various tasks.
They will also define the logic for when to call function_a versus function_b, as well as the
expected inputs and outputs.
Functions work very similarly in the world of agents, but we can replace the software
developer with a model. A model can take a set of known functions and decide when to use
each Function and what arguments the Function needs based on its specification. Functions
differ from Extensions in a few ways, most notably:
1. A model outputs a Function and its arguments, but doesn’t make a live API call.
2. Functions are executed on the client-side, while Extensions are executed on
the agent-side.
Using our Google Flights example again, a simple setup for functions might look like the
example in Figure 7.
Figure 7. How do functions interact with external APIs?
Note that the main difference here is that neither the Function nor the agent interact directly
with the Google Flights API. So how does the API call actually happen?
With functions, the logic and execution of calling the actual API endpoint is offloaded away
from the agent and back to the client-side application as seen in Figure 8 and Figure 9 below.
This offers the developer more granular control over the flow of data in the application. There
are many reasons why a Developer might choose to use functions over Extensions, but a few
common use cases are:
• API calls need to be made at another layer of the application stack, outside of the direct
agent architecture flow (e.g. a middleware system, a front end framework, etc.)
• Security or Authentication restrictions that prevent the agent from calling an API directly
(e.g API is not exposed to the internet, or non-accessible by agent infrastructure)
• Timing or order-of-operations constraints that prevent the agent from making API calls in
real-time. (i.e. batch operations, human-in-the-loop review, etc.)
• Additional data transformation logic needs to be applied to the API Response that the
agent cannot perform. For example, consider an API endpoint that doesn’t provide a
filtering mechanism for limiting the number of results returned. Using Functions on the
client-side provides the developer additional opportunities to make these transformations.
• The developer wants to iterate on agent development without deploying additional
infrastructure for the API endpoints (i.e. Function Calling can act like “stubbing” of APIs)
While the difference in internal architecture between the two approaches is subtle as seen in
Figure 8, the additional control and decoupled dependency on external infrastructure makes
Function Calling an appealing option for the Developer.
Figure 8. Delineating client vs. agent side control for extensions and function calling
Use cases
A model can be used to invoke functions in order to handle complex, client-side execution
flows for the end user, where the agent Developer might not want the language model to
manage the API execution (as is the case with Extensions). Let’s consider the following
example where an agent is being trained as a travel concierge to interact with users that want
to book vacation trips. The goal is to get the agent to produce a list of cities that we can use
in our middleware application to download images, data, etc. for the user’s trip planning. A
user might say something like:
I’d like to take a ski trip with my family but I’m not sure where to go.
In a typical prompt to the model, the output might look like the following:
Sure, here’s a list of cities that you can consider for family ski trips:
• Crested Butte, Colorado, USA
• Whistler, BC, Canada
• Zermatt, Switzerland
While the above output contains the data that we need (city names), the format isn’t ideal
for parsing. With Function Calling, we can teach a model to format this output in a structured
style (like JSON) that’s more convenient for another system to parse. Given the same input
prompt from the user, an example JSON output from a Function might look like Snippet
5 instead.
Unset
function_call {
  name: "display_cities"
  args: {
    "cities": ["Crested Butte", "Whistler", "Zermatt"],
    "preferences": "skiing"
  }
}
Snippet 5. Sample Function Call payload for displaying a list of cities and user preferences
This JSON payload is generated by the model, and then sent to our Client-side server to do
whatever we would like to do with it. In this specific case, we’ll call the Google Places API to
take the cities provided by the model and look up Images, then provide them as formatted
rich content back to our User. Consider this sequence diagram in Figure 9 showing the above
interaction in step by step detail.
Figure 9. Sequence diagram showing the lifecycle of a Function Call
The result of the example in Figure 9 is that the model is leveraged to “fill in the blanks” with
the parameters required for the Client side UI to make the call to the Google Places API. The
Client side UI manages the actual API call using the parameters provided by the model in the
returned Function. This is just one use case for Function Calling, but there are many other
scenarios to consider like:
• You want a language model to suggest a function that you can use in your code, but you
don't want to include credentials in your code. Because function calling doesn't run the
function, you don't need to include credentials in your code with the function information.
• You are running asynchronous operations that can take more than a few seconds. These
scenarios work well with function calling because it's an asynchronous operation.
• You want to run functions on a device that's different from the system producing the
function calls and their arguments.
One key thing to remember about functions is that they are meant to offer the developer
much more control over not only the execution of API calls, but also the entire flow of data
in the application as a whole. In the example in Figure 9, the developer chose to not return
API information back to the agent as it was not pertinent for future actions the agent might
take. However, based on the architecture of the application, it may make sense to return the
external API call data to the agent in order to influence future reasoning, logic, and action
choices. Ultimately, it is up to the application developer to choose what is right for the
specific application.
Function sample code
To achieve the above output from our ski vacation scenario, let’s build out each of the
components to make this work with our gemini-1.5-flash-001 model.
First, we’ll define our display_cities function as a simple Python method.
Python
from typing import Optional

def display_cities(cities: list[str], preferences: Optional[str] = None):
    """Provides a list of cities based on the user's search query and preferences.

    Args:
        preferences (str): The user's preferences for the search, like skiing,
            beach, restaurants, bbq, etc.
        cities (list[str]): The list of cities being recommended to the user.

    Returns:
        list[str]: The list of cities being recommended to the user.
    """
    return cities
Snippet 6. Sample python method for a function that will display a list of cities.
Next, we’ll instantiate our model, build the Tool, then pass in our user’s query and tools to
the model. Executing the code below would result in the output as seen at the bottom of the
code snippet.
Python
from vertexai.generative_models import GenerativeModel, Tool, FunctionDeclaration

model = GenerativeModel("gemini-1.5-flash-001")
display_cities_function = FunctionDeclaration.from_func(display_cities)
tool = Tool(function_declarations=[display_cities_function])

message = "I’d like to take a ski trip with my family but I’m not sure where to go."
res = model.generate_content(message, tools=[tool])

print(f"Function Name: {res.candidates[0].content.parts[0].function_call.name}")
print(f"Function Args: {res.candidates[0].content.parts[0].function_call.args}")

> Function Name: display_cities
> Function Args: {'preferences': 'skiing', 'cities': ['Aspen', 'Vail', 'Park City']}
Snippet 7. Building a Tool, sending to the model with a user query and allowing the function call to take place
In summary, functions offer a straightforward framework that empowers application
developers with fine-grained control over data flow and system execution, while effectively
leveraging the agent/model for critical input generation. Developers can selectively choose
whether to keep the agent “in the loop” by returning external data, or omit it based on
specific application architecture requirements.
Data stores
Imagine a language model as a vast library of books, containing its training data. But unlike
a library that continuously acquires new volumes, this one remains static, holding only the
knowledge it was initially trained on. This presents a challenge, as real-world knowledge is
constantly evolving. Data Stores address this limitation by providing access to more dynamic
and up-to-date information, and ensuring a model’s responses remain grounded in factuality
and relevance.
Consider a common scenario where a developer might need to provide a small amount of
additional data to a model, perhaps in the form of spreadsheets or PDFs.
Figure 10. How can Agents interact with structured and unstructured data?
Data Stores allow developers to provide additional data in its original format to an agent,
eliminating the need for time-consuming data transformations, model retraining, or fine-tuning. The Data Store converts the incoming document into a set of vector database
embeddings that the agent can use to extract the information it needs to supplement its next
action or response to the user.
Figure 11. Data Stores connect Agents to new real-time data sources of various types.
Implementation and application
In the context of Generative AI agents, Data Stores are typically implemented as a vector
database that the developer wants the agent to have access to at runtime. While we won’t
cover vector databases in depth here, the key point to understand is that they store data
in the form of vector embeddings, a type of high-dimensional vector or mathematical
representation of the data provided. One of the most prolific examples of Data Store usage
with language models in recent times has been the implementation of Retrieval Augmented
Generation (RAG) based applications. These applications seek to extend the breadth and
depth of a model’s knowledge beyond the foundational training data by giving the model
access to data in various formats like:
• Website content
• Structured Data in formats like PDF, Word Docs, CSV, Spreadsheets, etc.
• Unstructured Data in formats like HTML, PDF, TXT, etc.
Figure 12. 1-to-many relationship between agents and data stores, which can represent various types of
pre-indexed data
The underlying process for each user request and agent response loop is generally modeled
as seen in Figure 13.
1. A user query is sent to an embedding model to generate embeddings for the query
2. The query embeddings are then matched against the contents of the vector database
using a matching algorithm like SCaNN
3. The matched content is retrieved from the vector database in text format and sent back to
the agent
4. The agent receives both the user query and retrieved content, then formulates a response
or action
5. A final response is sent to the user
Figure 13. The lifecycle of a user request and agent response in a RAG based application
The end result is an application that allows the agent to match a user’s query to a known data
store through vector search, retrieve the original content, and provide it to the orchestration
layer and model for further processing. The next action might be to provide a final answer to
the user, or perform an additional vector search to further refine the results.
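As a minimal, self-contained illustration of steps 1 through 4 above, the sketch below uses a toy hashing "embedding" and a brute-force cosine match in place of a real embedding model and a vector database such as SCaNN; the documents and query are invented for the example.
Python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy embedding: hash each token into a fixed-size vector (illustrative only)."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Pre-indexed "data store": documents and their embeddings
docs = [
    "Sanford Stadium is the home stadium of the Georgia Bulldogs.",
    "Zermatt is a popular ski resort town in Switzerland.",
]
doc_vecs = np.stack([embed(d) for d in docs])

# 1) Embed the user query  2) Match against the store  3) Retrieve the top content
query = "Where do the Georgia Bulldogs play?"
scores = doc_vecs @ embed(query)
retrieved = docs[int(scores.argmax())]

# 4) Hand the query plus retrieved content to the model / orchestration layer
prompt = f"Answer using the context.\nContext: {retrieved}\nQuestion: {query}"
print(prompt)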
A sample interaction with an agent that implements RAG with ReAct reasoning/planning can
be seen in Figure 14.
Figure 14. Sample RAG based application w/ ReAct reasoning/planning
Tools recap
To summarize, extensions, functions and data stores make up a few different tool types
available for agents to use at runtime. Each has their own purpose and they can be used
together or independently at the discretion of the agent developer.
Extensions
• Execution: Agent-Side Execution
• Use Case: Developer wants agent to control interactions with the API endpoints; useful when leveraging native pre-built Extensions (i.e. Vertex Search, Code Interpreter, etc.); multi-hop planning and API calling (i.e. the next agent action depends on the outputs of the previous action / API call)

Function Calling
• Execution: Client-Side Execution
• Use Case: Security or Authentication restrictions prevent the agent from calling an API directly; timing constraints or order-of-operations constraints that prevent the agent from making API calls in real-time (i.e. batch operations, human-in-the-loop review, etc.); API that is not exposed to the internet, or non-accessible by Google systems

Data Stores
• Execution: Agent-Side Execution
• Use Case: Developer wants to implement Retrieval Augmented Generation (RAG) with any of the following data types: Website Content from pre-indexed domains and URLs; Structured Data in formats like PDF, Word Docs, CSV, Spreadsheets, etc.; Relational / Non-Relational Databases; Unstructured Data in formats like HTML, PDF, TXT, etc.
Enhancing model performance with targeted learning
A crucial aspect of using models effectively is their ability to choose the right tools when
generating output, especially when using tools at scale in production. While general training
helps models develop this skill, real-world scenarios often require knowledge beyond the
training data. Imagine this as the difference between basic cooking skills and mastering
a specific cuisine. Both require foundational cooking knowledge, but the latter demands
targeted learning for more nuanced results.
To help the model gain access to this type of specific knowledge, several approaches exist:
• In-context learning: This method provides a generalized model with a prompt, tools, and
few-shot examples at inference time which allows it to learn ‘on the fly' how and when to
use those tools for a specific task. The ReAct framework is an example of this approach in
natural language (see the short sketch after this list).
• Retrieval-based in-context learning: This technique dynamically populates the model
prompt with the most relevant information, tools, and associated examples by retrieving
them from external memory. An example of this would be the ‘Example Store’ in Vertex AI
extensions or the data stores RAG based architecture mentioned previously.
• Fine-tuning based learning: This method involves training a model using a larger dataset
of specific examples prior to inference. This helps the model understand when and how to
apply certain tools prior to receiving any user queries.
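As a small sketch of the first approach above (in-context learning), the prompt below carries a tool list and a couple of worked examples so the model can pick a tool "on the fly". The tool names and examples are illustrative and not taken from the whitepaper.
Python
FEW_SHOT_PROMPT = """You can call one of these tools: get_weather, search_flights, none.

Example:
User: Will it rain in Zurich tomorrow?
Tool: get_weather("Zurich")

Example:
User: Find me a flight from Austin to Zurich.
Tool: search_flights("Austin", "Zurich")

User: {user_query}
Tool:"""

# The filled-in prompt is what gets sent to the model at inference time
print(FEW_SHOT_PROMPT.format(user_query="What's the weather like in Austin today?"))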
To provide additional insights on each of the targeted learning approaches, let’s revisit our
cooking analogy.
• Imagine a chef has received a specific recipe (the prompt), a few key ingredients (relevant
tools) and some example dishes (few-shot examples) from a customer. Based on this
limited information and the chef’s general knowledge of cooking, they will need to figure
out how to prepare the dish ‘on the fly’ that most closely aligns with the recipe and the
customer’s preferences. This is in-context learning.
• Now let’s imagine our chef in a kitchen that has a well-stocked pantry (external data
stores) filled with various ingredients and cookbooks (examples and tools). The chef is now
able to dynamically choose ingredients and cookbooks from the pantry and better align
to the customer’s recipe and preferences. This allows the chef to create a more informed
and refined dish leveraging both existing and new knowledge. This is retrieval-based
in-context learning.
• Finally, let’s imagine that we sent our chef back to school to learn a new cuisine or set of
cuisines (pre-training on a larger dataset of specific examples). This allows the chef to
approach future unseen customer recipes with deeper understanding. This approach is
perfect if we want the chef to excel in specific cuisines (knowledge domains). This is finetuning based learning.
Each of these approaches offers unique advantages and disadvantages in terms of speed,
cost, and latency. However, by combining these techniques in an agent framework, we can
leverage the various strengths and minimize their weaknesses, allowing for a more robust and
adaptable solution.
Agent quick start with LangChain
In order to provide a real-world executable example of an agent in action, we’ll build a quick
prototype with the LangChain and LangGraph libraries. These popular open source libraries
allow users to build custom agents by “chaining” together sequences of logic, reasoning,
and tool calls to answer a user’s query. We’ll use our gemini-1.5-flash-001 model and
some simple tools to answer a multi-stage query from the user as seen in Snippet 8.
The tools we are using are the SerpAPI (for Google Search) and the Google Places API. After
executing our program in Snippet 8, you can see the sample output in Snippet 9.
Python
import os

from langgraph.prebuilt import create_react_agent
from langchain_core.tools import tool
from langchain_community.utilities import SerpAPIWrapper
from langchain_community.tools import GooglePlacesTool
from langchain_google_vertexai import ChatVertexAI

os.environ["SERPAPI_API_KEY"] = "XXXXX"
os.environ["GPLACES_API_KEY"] = "XXXXX"

@tool
def search(query: str):
    """Use the SerpAPI to run a Google Search."""
    search = SerpAPIWrapper()
    return search.run(query)

@tool
def places(query: str):
    """Use the Google Places API to run a Google Places Query."""
    places = GooglePlacesTool()
    return places.run(query)

model = ChatVertexAI(model="gemini-1.5-flash-001")
tools = [search, places]

query = "Who did the Texas Longhorns play in football last week? What is the address of the other team's stadium?"

agent = create_react_agent(model, tools)
input = {"messages": [("human", query)]}

for s in agent.stream(input, stream_mode="values"):
    message = s["messages"][-1]
    if isinstance(message, tuple):
        print(message)
    else:
        message.pretty_print()
Snippet 8. Sample LangChain and LangGraph based agent with tools
Unset
=============================== Human Message ================================
Who did the Texas Longhorns play in football last week? What is the address
of the other team's stadium?
================================= Ai Message =================================
Tool Calls: search
Args:
query: Texas Longhorns football schedule
================================ Tool Message ================================
Name: search
{...Results: "NCAA Division I Football, Georgia, Date..."}
================================= Ai Message =================================
The Texas Longhorns played the Georgia Bulldogs last week.
Tool Calls: places
Args:
query: Georgia Bulldogs stadium
================================ Tool Message ================================
Name: places
{...Sanford Stadium Address: 100 Sanford...}
================================= Ai Message =================================
The address of the Georgia Bulldogs stadium is 100 Sanford Dr, Athens, GA
30602, USA.
Snippet 9. Output from our program in Snippet 8
While this is a fairly simple agent example, it demonstrates the foundational components
of Model, Orchestration, and tools all working together to achieve a specific goal. In the
final section, we’ll explore how these components come together in Google-scale managed
products like Vertex AI agents and Generative Playbooks.
Production applications with Vertex AI agents
While this whitepaper explored the core components of agents, building production-grade
applications requires integrating them with additional tools like user interfaces, evaluation
frameworks, and continuous improvement mechanisms. Google’s Vertex AI platform
simplifies this process by offering a fully managed environment with all the fundamental
elements covered earlier. Using a natural language interface, developers can rapidly
define crucial elements of their agents - goals, task instructions, tools, sub-agents for task
delegation, and examples - to easily construct the desired system behavior. In addition, the
platform comes with a set of development tools that allow for testing, evaluation, measuring
agent performance, debugging, and improving the overall quality of developed agents. This
allows developers to focus on building and refining their agents while the complexities of
infrastructure, deployment and maintenance are managed by the platform itself.
In Figure 15 we’ve provided a sample architecture of an agent that was built on the Vertex
AI platform using various features such as Vertex Agent Builder, Vertex Extensions, Vertex
Function Calling and Vertex Example Store to name a few. The architecture includes many of
the various components necessary for a production ready application.
Figure 15. Sample end-to-end agent architecture built on Vertex AI platform
You can try a sample of this prebuilt agent architecture from our official documentation.
Summary
In this whitepaper we’ve discussed the foundational building blocks of Generative AI
agents, their compositions, and effective ways to implement them in the form of cognitive
architectures. Some key takeaways from this whitepaper include:
1. Agents extend the capabilities of language models by leveraging tools to access real-time information, suggest real-world actions, and plan and execute complex tasks autonomously. Agents can leverage one or more language models to decide when and
how to transition through states and use external tools to complete any number of
complex tasks that would be difficult or impossible for the model to complete on its own.
2. At the heart of an agent’s operation is the orchestration layer, a cognitive architecture that
structures reasoning, planning, decision-making and guides its actions. Various reasoning
techniques such as ReAct, Chain-of-Thought, and Tree-of-Thoughts, provide a framework
for the orchestration layer to take in information, perform internal reasoning, and generate
informed decisions or responses.
3. Tools, such as Extensions, Functions, and Data Stores, serve as the keys to the outside
world for agents, allowing them to interact with external systems and access knowledge
beyond their training data. Extensions provide a bridge between agents and external APIs,
enabling the execution of API calls and retrieval of real-time information. Functions provide
a more nuanced control for the developer through the division of labor, allowing agents
to generate Function parameters which can be executed client-side. Data Stores provide
agents with access to structured or unstructured data, enabling data-driven applications.
The future of agents holds exciting advancements and we’ve only begun to scratch the
surface of what is possible. As tools become more sophisticated and reasoning capabilities
are enhanced, agents will be empowered to solve increasingly complex problems.
Furthermore, the strategic approach of ‘agent chaining’ will continue to gain momentum. By
combining specialized agents - each excelling in a particular domain or task - we can create
a ‘mixture of agent experts’ approach, capable of delivering exceptional results across
various industries and problem areas.
It’s important to remember that building complex agent architectures demands an iterative
approach. Experimentation and refinement are key to finding solutions for specific business
cases and organizational needs. No two agents are created alike due to the generative nature
of the foundational models that underpin their architecture. However, by harnessing the
strengths of each of these foundational components, we can create impactful applications
that extend the capabilities of language models and drive real-world value.

###
https://huggingface.co/blog/smolagents
Introducing smolagents, a simple library to build agents
Published December 31, 2024
Authors: Aymeric Roucher (m-ric), Merve Noyan (merve), Thomas Wolf (thomwolf)

Today we are launching smolagents, a very simple library that unlocks agentic capabilities for language models. Here’s a glimpse:
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=HfApiModel())

agent.run("How many seconds would it take for a leopard at full speed to run through Pont des Arts?")


Table of Contents
🤔 What are agents?
✅ When to use agents / ⛔ when to avoid them
Code agents
Introducing smolagents: making agents simple 🥳
Building an agent
How strong are open models for agentic workflows?
Next steps 🚀
🤔 What are agents?
Any efficient system using AI will need to provide LLMs some kind of access to the real world: for instance the possibility to call a search tool to get external information, or to act on certain programs in order to solve a task. In other words, LLMs should have agency. Agentic programs are the gateway to the outside world for LLMs.

AI Agents are programs where LLM outputs control the workflow.

Any system leveraging LLMs will integrate the LLM outputs into code. The influence of the LLM's output on the code workflow is the level of agency of LLMs in the system.

Note that with this definition, "agent" is not a discrete, 0 or 1 definition: instead, "agency" evolves on a continuous spectrum, as you give more or less power to the LLM on your workflow.

The table below illustrates how agency varies across systems:

Agency Level | Description | How that's called | Example Pattern
☆☆☆ | LLM output has no impact on program flow | Simple processor | process_llm_output(llm_response)
★☆☆ | LLM output determines basic control flow | Router | if llm_decision(): path_a() else: path_b()
★★☆ | LLM output determines function execution | Tool call | run_function(llm_chosen_tool, llm_chosen_args)
★★★ | LLM output controls iteration and program continuation | Multi-step Agent | while llm_should_continue(): execute_next_step()
★★★ | One agentic workflow can start another agentic workflow | Multi-Agent | if llm_trigger(): execute_agent()
The multi-step agent has this code structure:

memory = [user_defined_task]
while llm_should_continue(memory):  # this loop is the multi-step part
    action = llm_get_next_action(memory)  # this is the tool-calling part
    observations = execute_action(action)
    memory += [action, observations]

So this system runs in a loop, executing a new action at each step (the action can involve calling some pre-determined tools that are just functions), until its observations make it apparent that a satisfactory state has been reached to solve the given task. Here’s an example of how a multi-step agent can solve a simple math question:


✅ When to use agents / ⛔ when to avoid them
Agents are useful when you need an LLM to determine the workflow of an app. But they’re often overkill. The question is: do I really need flexibility in the workflow to efficiently solve the task at hand? If the pre-determined workflow falls short too often, that means you need more flexibility. Let's take an example: say you're making an app that handles customer requests on a surfing trip website.

You could know in advance that the requests will belong to either of 2 buckets (based on user choice), and you have a predefined workflow for each of these 2 cases.

Want some knowledge on the trips? ⇒ give them access to a search bar to search your knowledge base
Want to talk to sales? ⇒ let them type in a contact form.
If that deterministic workflow fits all queries, by all means just code everything! This will give you a 100% reliable system with no risk of error introduced by letting unpredictable LLMs meddle in your workflow. For the sake of simplicity and robustness, it's advised to regularize towards not using any agentic behaviour.

But what if the workflow can't be determined that well in advance?

For instance, a user wants to ask: "I can come on Monday, but I forgot my passport so risk being delayed to Wednesday, is it possible to take me and my stuff to surf on Tuesday morning, with a cancellation insurance?" This question hinges on many factors, and probably none of the predetermined criteria above will suffice for this request.

If the pre-determined workflow falls short too often, that means you need more flexibility.

That is where an agentic setup helps.

In the above example, you could just make a multi-step agent that has access to a weather API for weather forecasts, Google Maps API to compute travel distance, an employee availability dashboard and a RAG system on your knowledge base.

Until recently, computer programs were restricted to pre-determined workflows, trying to handle complexity by piling up if/else switches. They focused on extremely narrow tasks, like "compute the sum of these numbers" or "find the shortest path in this graph". But actually, most real-life tasks, like our trip example above, do not fit in pre-determined workflows. Agentic systems open up the vast world of real-world tasks to programs!

Code agents
In a multi-step agent, at each step, the LLM can write an action, in the form of some calls to external tools. A common format (used by Anthropic, OpenAI, and many others) for writing these actions is generally different shades of "writing actions as a JSON of tool names and arguments to use, which you then parse to know which tool to execute and with which arguments".

Multiple research papers have shown that having LLMs write their tool-calling actions in code works much better.

The reason for this is simply that we crafted our code languages specifically to be the best possible way to express actions performed by a computer. If JSON snippets were a better expression, JSON would be the top programming language and programming would be hell on earth.

The figure below, taken from Executable Code Actions Elicit Better LLM Agents, illustrates some advantages of writing actions in code:


Writing actions in code rather than JSON-like snippets provides better:

Composability: could you nest JSON actions within each other, or define a set of JSON actions to re-use later, the same way you could just define a python function?
Object management: how do you store the output of an action like generate_image in JSON?
Generality: code is built to express simply anything you can have a computer do.
Representation in LLM training data: plenty of quality code actions are already included in LLMs’ training data, which means they’re already trained for this!
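To make the contrast concrete, here is a small, hypothetical illustration (not from the post) of the same two-tool action written as a JSON blob versus as a code action; the stand-in tools get_weather and write_summary are invented for this sketch:

def get_weather(city: str) -> str:          # stand-in tool
    return f"sunny in {city}"

def write_summary(text: str) -> str:        # stand-in tool
    return f"Summary: {text}"

# JSON-style action: one tool call per step; the result must be parsed and
# passed back to the LLM before the next call can be made.
json_action = {"tool": "get_weather", "arguments": {"city": "Paris"}}

# Code-style action: the agent emits a snippet that composes both tools at once.
code_action = 'write_summary(get_weather("Paris"))'
print(eval(code_action))                    # -> Summary: sunny in Paris
# (smolagents executes such snippets in a sandboxed interpreter, not raw eval.)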
Introducing smolagents: making agents simple 🥳
We built smolagents with these objectives:

✨ Simplicity: the logic for agents fits in roughly a thousand lines of code (see this file). We kept abstractions to their minimal shape above raw code!

🧑‍💻 First-class support for Code Agents, i.e. agents that write their actions in code (as opposed to "agents being used to write code"). To make it secure, we support executing in sandboxed environments via E2B.

On top of this CodeAgent class, we still support the standard ToolCallingAgent that writes actions as JSON/text blobs.
🤗 Hub integrations: you can share and load tools to/from the Hub, and more is to come!

🌐 Support for any LLM: it supports models hosted on the Hub loaded in their transformers version or through our inference API, but also supports models from OpenAI, Anthropic and many others via our LiteLLM integration.

smolagents is the successor to transformers.agents, and will be replacing it as transformers.agents gets deprecated in the future.

Building an agent
To build an agent, you need at least two elements:

tools: a list of tools the agent has access to
model: an LLM that will be the engine of your agent.
For the model, you can use any LLM, either open models using our HfApiModel class, that leverages Hugging Face's free inference API (as shown in the leopard example above), or you can use LiteLLMModel to leverage litellm and pick from a list of 100+ different cloud LLMs.

For the tool, you can just make a function with type hints on inputs and outputs, and docstrings giving descriptions for inputs, and use the @tool decorator to make it a tool.

Here’s how to make a custom tool that gets travel times from Google Maps, and how to use it in a travel planner agent:

from typing import Optional
from smolagents import CodeAgent, HfApiModel, tool


@tool
def get_travel_duration(start_location: str, destination_location: str, departure_time: Optional[int] = None) -> str:
    """Gets the travel time in car between two places.

    Args:
        start_location: the place from which you start your ride
        destination_location: the place of arrival
        departure_time: the departure time, provide only a `datetime.datetime` if you want to specify this
    """
    import googlemaps  # All imports are placed within the function, to allow for sharing to Hub.
    import os

    gmaps = googlemaps.Client(os.getenv("GMAPS_API_KEY"))

    if departure_time is None:
        from datetime import datetime
        departure_time = datetime(2025, 1, 6, 11, 0)

    directions_result = gmaps.directions(
        start_location,
        destination_location,
        mode="transit",
        departure_time=departure_time,
    )
    return directions_result[0]["legs"][0]["duration"]["text"]

agent = CodeAgent(tools=[get_travel_duration], model=HfApiModel(), additional_authorized_imports=["datetime"])

agent.run("Can you give me a nice one-day trip around Paris with a few locations and the times? Could be in the city or outside, but should fit in one day. I'm travelling only via public transportation.")

After a few steps of gathering travel times and running calculations, the agent returns this final proposition:

Out - Final answer: Here's a suggested one-day itinerary for Paris:
Visit Eiffel Tower at 9:00 AM - 10:30 AM
Visit Louvre Museum at 11:00 AM - 12:30 PM
Visit Notre-Dame Cathedral at 1:00 PM - 2:30 PM
Visit Palace of Versailles at 3:30 PM - 5:00 PM
Note: The travel time to the Palace of Versailles is approximately 59
minutes from Notre-Dame Cathedral, so be sure to plan your day accordingly.

After building a tool, sharing it to the Hub is as simple as:

get_travel_duration.push_to_hub("{your_username}/get-travel-duration-tool")

You can see the result under this space. You can check the logic for the tool under the file tool.py in the space. As you can see, the tool was actually exported as a class inheriting from Tool, which is the underlying structure for all our tools.

How strong are open models for agentic workflows?
We've created CodeAgent instances with some leading models, and compared them on this benchmark that gathers questions from a few different benchmarks to propose a varied blend of challenges.

Find the benchmark here for more detail on the agentic setup used, and see a comparison of code agents versus tool calling agents (spoilers: code works better).

Figure: benchmark of different models on agentic workflows.

This comparison shows that open source models can now take on the best closed models!

Next steps 🚀
Start with the guided tour to familiarize yourself with the library.
Study more in-depth tutorials to learn more on tools or general best practices.
Dive into examples to set up specific systems: text-to-SQL, agentic RAG or multi-agent orchestration.
Read more on agents:
This excellent blog post by Anthropic gives solid general knowledge.
This collection gathers the most impactful research papers on agents.


###
https://github.com/Thytu/Agentarium
25/1/2
Agentarium is a new Python framework for managing and orchestrating AI agents.
Features include (from the repo):
• 🤖 Advanced Agent Management: Create and orchestrate multiple AI agents with different roles and capabilities
• 🔄 Robust Interaction Management: Coordinate complex interactions between agents
• 💾 Checkpoint System: Save and restore agent states and interactions
• 📊 Data Generation: Generate synthetic data through agent interactions
• ⚡ Performance Optimized: Built for efficiency and scalability
• 🌍 Flexible Environment Configuration: Define custom environments with YAML configuration files
• 🛠️ Extensible Architecture: Easy to extend and customize for your specific needs

###
https://github.com/Byaidu/PDFMathTranslate
24/12/1
PDF scientific paper translation and bilingual comparison.

📊 Preserve formulas, charts, table of contents, and annotations (preview).
🌐 Support multiple languages, and diverse translation services.
🤖 Provides a command-line tool, an interactive user interface, and Docker.
Feel free to provide feedback in GitHub Issues, Telegram Group or QQ Group.

For details on how to contribute, please consult the Contribution Guide.

###
https://arxiv.org/abs/2412.09764
Memory Layers at Scale
Vincent-Pierre Berges*, Barlas Oğuz*, Daniel Haziza, Wen-tau Yih, Luke Zettlemoyer, Gargi Ghosh (Meta FAIR; *main authors)
Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparsely activated memory layers complement compute-heavy dense feed-forward layers, providing dedicated capacity to store and retrieve information cheaply. This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale. On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the computation budget, as well as mixture-of-expert models when matched for both compute and parameters. We find gains are especially pronounced for factual tasks. We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, comparing to base models with up to 8B parameters.
Date: December 23, 2024
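For intuition, here is a minimal, illustrative PyTorch sketch of a sparsely activated key-value memory layer in the spirit of the abstract above; it is not the paper's implementation (which factorizes keys product-key style, whereas this sketch scores every key for simplicity):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryLayer(nn.Module):
    def __init__(self, d_model: int, num_keys: int = 16384, k: int = 32):
        super().__init__()
        self.query_proj = nn.Linear(d_model, d_model)
        # Trainable keys and values: parameter count grows with num_keys, but
        # each token only touches k of them, so per-token FLOPs stay roughly flat.
        self.keys = nn.Parameter(torch.randn(num_keys, d_model) * d_model ** -0.5)
        self.values = nn.Embedding(num_keys, d_model)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        q = self.query_proj(x)                              # (B, S, D)
        scores = q @ self.keys.t()                          # (B, S, num_keys)
        top_scores, top_idx = scores.topk(self.k, dim=-1)   # sparse selection
        weights = F.softmax(top_scores, dim=-1)             # (B, S, k)
        gathered = self.values(top_idx)                     # (B, S, k, D)
        memory_out = (weights.unsqueeze(-1) * gathered).sum(dim=-2)
        return x + memory_out                               # residual add

# Usage: drop in alongside (or in place of) a dense FFN block in a transformer layer.
layer = MemoryLayer(d_model=512)
out = layer(torch.randn(2, 16, 512))                        # (2, 16, 512)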
Open Source
Sharing new research, models, and datasets from Meta FAIR
December 12, 2024



Takeaways

Today, Meta FAIR is releasing several new research artifacts that highlight our recent innovations in developing agents, robustness and safety, and architectures that facilitate machine learning.
The work we’re sharing advances our goal of achieving advanced machine intelligence and includes Meta Motivo, a foundation model for controlling the behavior of virtual embodied agents, and Meta Video Seal, an open source model for video watermarking.
We aim to democratize access to state-of-the-art technologies that transform our interaction with the physical world, which is why we're committed to fostering a collaborative and open ecosystem that accelerates progress and discovery.
As we continue to work towards our goal of achieving advanced machine intelligence, we want to share our progress with the research community so they can build upon our work. Today, we’re excited to release some of the latest research, code, models, and datasets from Meta Fundamental AI Research (FAIR). The artifacts we’re sharing today focus on building more capable agents, robustness and safety, and architecture innovations that enable models to learn new information more effectively and scale beyond current limits.

In this release, we’re sharing a demo and code for Meta Video Seal, an open source model for video watermarking that builds on the popular Meta Audio Seal work we shared last year. We’re also sharing a variety of other artifacts, including a foundation model for controlling the behavior of virtual embodied agents, a method for scaling memory layers that will enable more factual information, and code to help models become more socially intelligent. There’s plenty more to explore in this post with nine total projects and artifacts ready for people to download and start using today.

This work supports our long and proven track record of sharing open reproducible science with the community. By publicly sharing our early research work, we hope to inspire iterations and ultimately help advance AI in a responsible way. As always, we look forward to seeing what the community will build using these new releases and continuing the dialogue about how we can all advance AI together responsibly and build for the greater good.

Meta Motivo

Unsupervised reinforcement learning involves pre-training models to solve a wide range of downstream tasks in complex environments. Most methods require highly curated interaction datasets and often rely on unsupervised losses that lead to policies that may not align well with target tasks. Today, we’re sharing Meta Motivo, a first-of-its-kind behavioral foundation model that controls the movements of a virtual embodied humanoid agent to perform complex tasks.

Meta Motivo is trained with a novel algorithm that leverages an unlabeled dataset of motions to ground unsupervised reinforcement learning towards learning human-like behaviors while retaining zero-shot inference capabilities. The key technical novelty of our algorithm is to learn a representation that can be used to embed states, motions, and rewards into the same latent space. As a result, Meta Motivo is able to solve a wide range of whole-body control tasks, including motion tracking, goal pose reaching, and reward optimization, without any additional training or planning.


Meta Motivo achieves competitive performance compared to task-specific methods and outperforms state-of-the-art unsupervised reinforcement learning and model-based baselines, while exhibiting more human-like behaviors. The model also displays a surprising level of robustness to changes in the environment, such as gravity, wind, or direct perturbations, despite not being trained for them.

In the future, we believe this research could pave the way for fully embodied agents in the Metaverse, leading to more lifelike NPCs, democratization of character animation, and new types of immersive experiences.

Read the paper

Try the demo

Download the code and model

Meta Video Seal

While AI tools can help bring the world closer together, it’s important that we implement safeguards to mitigate the risks of imitation, manipulation, and other forms of misuse that can undermine their benefits. Post-hoc watermarking is a crucial step towards better traceability for content and AI models.


Today, we’re releasing Meta Video Seal, a state-of-the-art comprehensive framework for neural video watermarking. Video Seal adds a watermark (with an optional hidden message) into videos that is imperceptible to the naked eye and can later be uncovered to determine a video’s origin. The watermark has proven resilience against common video editing efforts like blurring or cropping, as well as compression algorithms commonly used when sharing content online. We’re publicly releasing the Video Seal model under a permissive license, along with a research paper, training code, and inference code. A demo is also available to try the model out interactively.

Along with Video Seal, we’re also releasing Meta Omni Seal Bench, a leaderboard dedicated to neural watermarking covering several modalities, enabling the research community to easily test and add their own work in the field. We’re also re-releasing our Meta Watermark Anything model under a permissive license and will organize a workshop on watermarking at ICLR in 2025.

This research is a testimony to our commitment to responsible AI. We hope that other researchers and developers will join our efforts by integrating watermarking capabilities when building generative AI models. Watermark Anything, Video Seal, and Audio Seal—our previous work on post-hoc audio watermarking—are now all available for download and ready to be integrated.

Read the paper

Try the demo

Download the Video Seal code and model

Download the Watermark Anything code and model

View the Omni Seal Bench leaderboard

Flow Matching guide and codebase, a Meta FAIR release

Flow Matching is a state-of-the-art generative paradigm for many modalities including generation of images, videos, audio, music, 3D structures like proteins, and more. Our method has already replaced classical diffusion in many generative applications at Meta, including Meta Movie Gen, Meta Audiobox, and Meta Melody Flow, and across the industry in works such as Stable-Diffusion-3, Flux, Fold-Flow, and Physical Intelligence Pi_0. Flow Matching provides a simple yet flexible generative AI framework, improving performance and efficiency while allowing easy generalization to complex data. Today, we’re sharing a paper and code, including core implementations of both continuous and discrete Flow Matching, alongside state-of-the-art training scripts to enable the research community to easily use and iterate on the Flow Matching method. By publicly sharing this work, we hope to inspire wider adoption of Flow Matching and enable people to use it in their own generative projects.
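As a rough intuition for the paradigm (not Meta's flow_matching codebase), here is a minimal PyTorch sketch of Flow Matching training with a simple linear probability path, where a small network is regressed onto the target velocity x1 - x0 on toy 2-D data:

import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Predicts the velocity v_theta(x_t, t) for 2-D toy data."""
    def __init__(self, dim: int = 2, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x_t, t], dim=-1))

model = VelocityNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):
    x1 = torch.randn(256, 2) * 0.5 + 2.0   # stand-in "data" samples
    x0 = torch.randn_like(x1)              # noise samples
    t = torch.rand(256, 1)                 # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1            # point on the linear path
    target_v = x1 - x0                     # conditional target velocity
    loss = ((model(x_t, t) - target_v) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Sampling then amounts to integrating dx/dt = v_theta(x, t) from t=0 (noise)
# to t=1 (data), e.g. with a simple Euler scheme.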

Read the paper

Download the code

Meta Explore Theory-of-Mind

A key aspect of our social intelligence enables us to reason about the thoughts and beliefs of other agents, both human and artificial. Existing Theory-of-Mind (ToM) datasets have limitations, focusing solely on evaluation and depicting only a narrow range of interactions. To address this and move closer to achieving advanced machine intelligence, we introduce Meta Explore Theory-of-Mind, a program-guided adversarial data generation for theory of mind reasoning. Our novel framework enables the generation of diverse, challenging, and scalable ToM reasoning data for both training and evaluation, which will help accelerate progress in this critical area of research.


Explore Theory-of-Mind generates robust and reliable stories that push the limits of large language models (LLMs), making it ideal for evaluating frontier models or fine-tuning data, resulting in significant improvements on classic theory of mind benchmarks. Our first-of-its-kind approach led to a 27-point accuracy improvement on the commonly used ToMi benchmark when fine-tuning a Llama-3.1 7B model, which means unprecedented accuracy in evaluating the theory of mind training data. Explore Theory-of-Mind can be used to generate datasets for improving LLMs, enhance goal-oriented scenarios, and collect interaction datasets, while also serving as a benchmark for evaluating LLM performance.

Read the paper

Download the code

Download the dataset

Meta Large Concept Models

As we work toward advanced machine intelligence, models will need to be able to reason across languages and modalities and to excel at long-form generational capabilities that require explicit hierarchical thinking, such as writing an essay. Current mainstream language modeling approaches typically operate at the token level and don’t explicitly reason in a hierarchical manner.

Today, we’re introducing a fundamentally different training paradigm for language modeling: the Large Concept Model (LCM). The core idea of the LCM is to decouple reasoning from language representation, and it’s inspired by how humans can plan high-level thoughts to communicate. For example, when giving a presentation multiple times, a presenter always has the same series of ideas they want to convey (materialized by their slides projected on screen), but their exact choice of words might vary from one run to the other.


Guided by that principle, the LCM is a significant departure from a typical LLM. Rather than predicting the next token, the LCM is trained to predict the next concept or high-level idea, represented by a full sentence in a multimodal and multilingual embedding space. Our work explores how predictions can be made for text in such a continuous space. Overall, the LCM outperforms or matches recent LLMs in the pure generative task of summarization, offers strong zero-shot generalization to unseen languages, and is more computationally efficient as input context grows. We hope the research community uses this work to improve language models that can operate on any modality or language, in an explicit hierarchical manner.
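To illustrate the idea only (this is not Meta's LCM architecture), here is a tiny PyTorch sketch of a model that autoregressively predicts the next sentence embedding instead of the next token; any sentence encoder could stand in for the multimodal, multilingual embedding space used in the paper:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyConceptModel(nn.Module):
    """Autoregressively predicts the next sentence embedding ("concept")."""
    def __init__(self, d_concept: int = 1024, n_layers: int = 4, n_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_concept, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_concept, d_concept)

    def forward(self, sentence_embs: torch.Tensor) -> torch.Tensor:
        # sentence_embs: (batch, n_sentences, d_concept), e.g. from any sentence encoder
        causal = nn.Transformer.generate_square_subsequent_mask(sentence_embs.size(1))
        h = self.backbone(sentence_embs, mask=causal)
        return self.head(h)  # position i predicts the embedding of sentence i+1

model = TinyConceptModel()
prev_sentences = torch.randn(2, 5, 1024)             # embeddings of 5 preceding sentences
pred_next = model(prev_sentences)[:, -1]             # predicted embedding of sentence 6
loss = F.mse_loss(pred_next, torch.randn(2, 1024))   # toy regression target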

Read the paper

Download the code

Meta Dynamic Byte Latent Transformer

Language models assume text has been tokenized in a heuristic preprocessing step, breaking words into smaller units that are easier to process. This limits end-to-end learning, is difficult to optimize in practice, and can hurt performance on rare text sequences. To address this, we’re introducing Dynamic Byte Latent Transformer, a hierarchical byte-level (tokenizer-free) model with dynamic patching schemes that are able to operate over bytes—without any tokenization heuristics—while also improving efficiency for long sequences during training and inference.

Dynamic Byte Latent Transformer outperforms tokenizer-based models across the board in terms of robustness, with a seven-point advantage on average, and excels at processing long-tail and rare sequences of unseen symbols. By sharing this work, we hope to accelerate advancements that will enable us to better reason over a variety of domains that are important to advanced machine intelligence, including low resource languages, coding, and factuality.

Read the paper

Download the code

Meta Memory Layers

Parametric memory, the repository of factual information stored in the weights of a neural network during pretraining, enables LLMs to understand complex concepts and linguistic nuances. As current scaling methods approach their limit of efficient scaling, new architectures that enable models to learn information more effectively must be explored. Today, we’re sharing a research paper and code for Meta Memory Layers at Scale, a method for scaling memory layers that enables an increase in factuality against commonly used benchmarks as we work toward achieving advanced machine intelligence.

Memory Layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Sparsely activated memory layers complement the compute-heavy nature of dense feed-forward layers, providing dedicated capacity to store and retrieve information cheaply. On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the computation budget, as well as MoE models when matched for both compute and parameters.

Contrary to the prevailing perception in the field that sparse memory architectures cannot be scaled competitively, we demonstrate efficient scaling of sparse memory layers up to 128 billion parameters and 8B base models, with significant improvements at comparable compute across the board for commonly used factuality benchmarks.

Read the paper

Download the code

Meta Image Diversity Modeling

This year, FAIR has focused on research to better understand and develop new methods for the safe development of image generation models. Today, we’re announcing updates on this research and releasing a comprehensive evaluation toolbox for text-to-image generative models. The image generation model we’ve developed through the course of this research builds on our prior research on generative models’ architectures and losses and prioritizes generating images that are representative of the physical world while maintaining competitive image quality with state-of-the-art models.

To further the research into new methods and techniques for responsible development, we’re collaborating with external experts, whom we’re inviting to use our model to carry out research in areas that can help us to improve the safety and responsibility across image diversity modeling. This initiative highlights our commitment to collaborating with the wider AI research community to collectively advance AI responsibility.

Additionally, we will be open sourcing a comprehensive evaluation toolbox for text-to-image generative models to improve the ease and reproducibility of image generation benchmarking while promoting interpretable takeaways that inform future responsible text-to-image research.

Through our continued work, we hope to better understand and offer new methods for responsible development of image generative models that can be adopted by the broader research community.

Read the paper

Download the code

Meta CLIP 1.2

We’re excited to release Meta CLIP 1.2, a milestone in our ongoing efforts to develop a high-performance vision-language encoder. We have been working on advanced algorithms to effectively curate and align vast amounts of image-text data, unlocking the learning of human knowledge about the world. This enables our models to learn efficiently and accurately, capturing the nuances of fine-grained mapping between image and language semantics.

Large-scale, high-quality, and diverse datasets are essential for building foundation models that can learn about the world. Meta CLIP is our effort towards building such datasets and foundation models. To ensure a high-quality and safe vision-language encoder foundation model, we’ve developed algorithms to effectively curate and align data with human knowledge from vast data pools, enabling our models to learn efficiently and cover all possibilities. We also conducted rigorous data research while applying robust integrity and privacy-protective measures.

By releasing our data algorithms, training recipes, and foundation models trained on our curated dataset, we’re providing researchers and developers with the tools they need to advance the field of vision-language understanding. These foundation models can be used as vision encoding for MLLM, multi-modal embedding for retrieval, and zero-shot classification, while serving as a starting point for research on data quality. Additionally, our algorithms and training methods can also be used to create high-quality, large-scale, CLIP-like datasets from scratch, which can help with new research or production use cases.

###
https://github.com/NVIDIA/nv-ingest
1/3/25
NVIDIA-Ingest: Multi-modal data extraction
NVIDIA-Ingest is a scalable, performance-oriented document content and metadata extraction microservice. Including support for parsing PDFs, Word and PowerPoint documents, it uses specialized NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts and images for use in downstream generative applications.

NVIDIA Ingest enables parallelization of the process of splitting documents into pages where contents are classified (as tables, charts, images, text), extracted into discrete content, and further contextualized via optical character recognition (OCR) into a well-defined JSON schema. From there, NVIDIA Ingest can optionally manage computation of embeddings for the extracted content, and also optionally manage storing the results in a vector database, Milvus.

Table of Contents
Introduction
Prerequisites
Quickstart
Repo Structure
Notices
Introduction
What NVIDIA-Ingest is ✔️
A microservice that:

Accepts a JSON Job description, containing a document payload, and a set of ingestion tasks to perform on that payload.
Allows the results of a Job to be retrieved; the result is a JSON dictionary containing a list of Metadata describing objects extracted from the base document, as well as processing annotations and timing/trace data.
Supports PDF, Docx, pptx, and images.
Supports multiple methods of extraction for each document type in order to balance trade-offs between throughput and accuracy. For example, for PDF documents we support extraction via pdfium, Unstructured.io, and Adobe Content Extraction Services.
Supports various types of pre- and post-processing operations, including text splitting and chunking; transformation and filtering; embedding generation; and image offloading to storage.
What NVIDIA-Ingest is not ✖️
A service that:

Runs a static pipeline or fixed set of operations on every submitted document.
Acts as a wrapper for any specific document parsing library.

👏 NVIDIA released NVIDIA Ingest, a microservice designed for extracting content and metadata efficiently from thousands of documents like PDFs, Word, and PowerPoint. It uses NVIDIA NIM microservices to efficiently parse and extract text, tables, charts, and images, making it ideal for downstream applications.
NVIDIA Ingest splits documents into pages, classifies content, and uses optical character recognition (OCR) to convert it into a structured JSON format. It can also compute embeddings for the extracted content and store them in a vector database like Milvus.
The service supports various extraction methods to balance throughput and accuracy, such as pdfium, Unstructured.io, and Adobe Content Extraction Services for PDFs. It also offers pre and post-processing operations, including text splitting, filtering, and embedding generation.
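As a purely conceptual illustration of the job-based workflow described above (this is not nv-ingest's actual request schema; the field names are invented for this sketch), a submission pairs a document payload with a list of ingestion tasks:

import base64
import json

# Hypothetical job description (invented field names, NOT the real schema):
job = {
    "document": {
        "name": "report.pdf",
        "content_b64": base64.b64encode(b"%PDF-1.7 ...").decode(),  # document payload
    },
    "tasks": [  # ingestion tasks to run on that payload
        {"type": "extract", "targets": ["text", "tables", "charts", "images"]},
        {"type": "split", "chunk_size": 512},
        {"type": "embed", "store": "milvus"},
    ],
}
print(json.dumps(job, indent=2))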

###
https://github.com/unclecode/crawl4ai
12/15/24
Web scraping will never be the same!

Crawl4AI simplifies web crawling and data extraction, making it ready to use for LLMs and AI applications.

Here’s why it’s a game-changer:

🆓 Completely free and open-source

🚀 Blazing fast performance, outperforming many paid services

🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)

🌍 Supports crawling multiple URLs simultaneously

🎨 Extracts all media tags (Images, Audio, Video)

🔗 Extracts all external and internal links

But that’s not all:

📚 Extracts metadata from pages

🔄 Custom hooks for auth, headers, and page modifications

🕵️ User-agent customization

🖼️ Takes screenshots of pages

📜 Executes custom JavaScript before crawling

🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper.

Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.

✨ Check out latest update v0.4.24x

🎉 Version 0.4.24x is out! Major improvements in extraction strategies with enhanced JSON handling, SSL security, and Amazon product extraction. Plus, a completely revamped content filtering system! Read the release notes →

🧐 Why Crawl4AI?
Built for LLMs: Creates smart, concise Markdown optimized for RAG and fine-tuning applications.
Lightning Fast: Delivers results 6x faster with real-time, cost-efficient performance.
Flexible Browser Control: Offers session management, proxies, and custom hooks for seamless data access.
Heuristic Intelligence: Uses advanced algorithms for efficient extraction, reducing reliance on costly models.
Open Source & Deployable: Fully open-source with no API keys—ready for Docker and cloud integration.
Thriving Community: Actively maintained by a vibrant community and the #1 trending GitHub repository.
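A minimal usage sketch based on the project's README quickstart (exact API details may vary by version): crawl a page asynchronously and get LLM-ready markdown back.

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)  # cleaned, LLM-friendly markdown for the page

asyncio.run(main())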



###
https://arxiv.org/pdf/2412.17189
[Submitted on 22 Dec 2024]

Better Think with Tables: Leveraging Tables to Enhance Large Language Model Comprehension
Jio Oh*1, Geon Heo*1, Seungjun Oh1, Jindong Wang2, Xing Xie2, Steven Euijong Whang†1 (1 KAIST, 2 Microsoft Research Asia)
Abstract
Despite the recent advancement of Large Language Models (LLMs), they struggle with complex queries often involving multiple conditions, common in real-world scenarios. We propose Thinking with Tables, a technique that assists LLMs to leverage tables for intermediate thinking, aligning with human cognitive behavior. By introducing a pre-instruction that triggers an LLM to organize information in tables, our approach achieves a 40.29% average relative performance increase and higher robustness, and shows generalizability to different requests, conditions, or scenarios. We additionally show the influence of data structuredness for the model by comparing results from four distinct structuring levels that we introduce.
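As an illustration of the general idea only (the paper's exact pre-instruction wording is not reproduced here), such a pre-instruction can simply be prepended to the query before it is sent to the model:

# The exact pre-instruction wording below is invented for illustration.
PRE_INSTRUCTION = (
    "Before answering, organize the relevant information from the question "
    "into a table, one column per condition, then reason over the table."
)

def build_prompt(question: str) -> str:
    return f"{PRE_INSTRUCTION}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("Which laptops under $1000 have at least 16GB RAM and weigh under 1.5kg?"))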

###
https://arxiv.org/abs/2412.15605
Don’t Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks
Brian J Chan*, Chao-Ting Chen*, Jui-Hung Cheng* (Department of Computer Science, National Chengchi University, Taipei, Taiwan; {110703065,110703038,110703007}@nccu.edu.tw)
Hen-Hsen Huang (Institute of Information Science, Academia Sinica, Taipei, Taiwan; hhhuang@iis.sinica.edu.tw)
[Submitted on 20 Dec 2024]

Abstract
Retrieval-augmented generation (RAG) has gained traction as a powerful approach for enhancing language models by integrating external knowledge sources. However, RAG introduces challenges such as retrieval latency, potential errors in document selection, and increased system complexity. With the advent of large language models (LLMs) featuring significantly extended context windows, this paper proposes an alternative paradigm, cache-augmented generation (CAG), that bypasses real-time retrieval. Our method involves preloading all relevant resources, especially when the documents or knowledge for retrieval are of a limited and manageable size, into the LLM’s extended context and caching its runtime parameters. During inference, the model utilizes these preloaded parameters to answer queries without additional retrieval steps. Comparative analyses reveal that CAG eliminates retrieval latency and minimizes retrieval errors while maintaining context relevance. Performance evaluations across multiple benchmarks highlight scenarios where long-context LLMs either outperform or complement traditional RAG pipelines. These findings suggest that, for certain applications, particularly those with a constrained knowledge base, CAG provides a streamlined and efficient alternative to RAG, achieving comparable or superior results with reduced complexity.
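For intuition, here is a minimal, illustrative sketch of the CAG idea using the Hugging Face transformers KV cache (this is not the authors' code; the model name and knowledge strings are placeholders): encode the whole knowledge base once, cache the resulting key-value states, then answer every query by decoding against that cache with no retrieval step.

import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"     # placeholder; any long-context causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

knowledge = "Doc 1: ...\nDoc 2: ..."          # the whole (small) knowledge base
kb_ids = tok(knowledge, return_tensors="pt").input_ids
with torch.no_grad():
    kb_cache = model(kb_ids, use_cache=True).past_key_values  # encoded once, cached

@torch.no_grad()
def answer(question: str, max_new_tokens: int = 32) -> str:
    cache = copy.deepcopy(kb_cache)           # keep the preloaded cache pristine
    ids = tok("\nQuestion: " + question + "\nAnswer:", return_tensors="pt").input_ids
    generated = []
    for _ in range(max_new_tokens):           # greedy decoding, no retrieval step
        out = model(ids, past_key_values=cache, use_cache=True)
        cache = out.past_key_values
        next_id = out.logits[:, -1:].argmax(dim=-1)
        generated.append(next_id)
        ids = next_id                         # feed only the new token next time
    return tok.decode(torch.cat(generated, dim=-1)[0], skip_special_tokens=True)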
