Summary

Today's digest covers OmniGlue's image matching technique, memory-efficient fine-tuning of Mistral models, financial statement analysis with large language models, the linear properties of transformers, agent planning with a World Knowledge Model, improved grounding and citation for LLMs, a high-fidelity 3D mesh generation model, the mixed-modal early-fusion model Chameleon, and FIFO-Diffusion, a technique for generating infinite videos without training.

OmniGlue: Generalizable Feature Matching with Foundation Model Guidance

OmniGlue, CVPR 2024

  • Introduces OmniGlue, a new learnable image matcher designed with generalization as a core principle
  • Addresses the limited generalization of existing image matching techniques to novel domains
  • Leverages a vision foundation model to guide the feature matching process
  • Proposes a keypoint position-guided attention mechanism that disentangles spatial and appearance information
  • Experiments on a suite of 6 datasets; 20.9% relative gain over SuperGlue on unseen domains
  • Outperforms the recent LightGlue method by 9.5% relatively

Mistral-finetune

Mistral-finetune, open-source release

  • A lightweight, memory-efficient codebase for fine-tuning Mistral models
  • Uses a LoRA-based training paradigm
  • Freezes most weights and trains only low-rank matrix perturbations (about 1-2% additional weights)
  • An A100 or H100 GPU is recommended for maximum efficiency
  • Optimized for multi-GPU, single-node training; a single GPU suffices for smaller models such as the 7B

Financial Statement Analysis with Large Language Models

Paper link, Chicago Booth Research Paper

  • Financial statement analysis with LLMs
  • GPT-4 outperforms human financial analysts at predicting the direction of future earnings
  • Predicts earnings changes accurately even without narrative or industry-specific information
  • Trading strategies based on GPT's predictions yield higher Sharpe ratios and alphas than strategies based on other models

Your Transformer is Secretly Linear

Paper link, May 20, 2024

  • Reveals a near-linear characteristic of transformer decoders
  • Near-perfect linear relationships in embedding transformations between consecutive layers
  • Linearity decreases when the residual component is removed
  • A cosine-similarity-based regularization reduces layer linearity while improving model performance

Agent Planning with World Knowledge Model

Paper link, May 2024

  • Agent planning with large language models guided by a parametric World Knowledge Model (WKM)
  • Experiments show the WKM alleviates blind trial-and-error and hallucinatory actions
  • Instance-level task knowledge generalizes well to unseen tasks
  • A weak WKM can guide the planning of a strong agent model

Effective large language model adaptation for improved grounding

Research blog, May 24, 2024

  • Introduces AGREE, a framework for improving grounding and citation in LLM responses
  • Achieves relative improvements of over 30% compared with prior approaches in comprehensive experiments
  • Fine-tunes LLMs to self-ground their claims and include citations in their responses
  • Introduces a test-time adaptation (TTA) mechanism

CraftsMan: High-fidelity Mesh Generation

Paper link, May 2024

  • Introduces CraftsMan, a high-fidelity 3D mesh generation system
  • Mimics an artist's workflow: generates a coarse mesh first, then refines surface details
  • Takes a text prompt or a reference image as input
  • Leverages a multi-view (MV) diffusion model to guide 3D geometry generation
  • Refines surface details automatically or interactively

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Paper link, May 16, 2024

  • Introduces Chameleon, a family of mixed-modal early-fusion foundation models
  • Handles diverse tasks including visual question answering, image captioning, and text and image generation
  • Outperforms Llama-2 on text-only tasks
  • Competitive with Gemini-Pro and GPT-4V on long-form mixed-modal generation

FIFO-Diffusion: Generating Infinite Videos from Text

FIFO-Diffusion, May 2024

  • Introduces FIFO-Diffusion, a text-conditional video generation technique
  • Generates arbitrarily long videos without any training
  • Uses diagonal denoising to concurrently process a queue of consecutive frames
  • Shows promising results for long, high-quality video generation

The details and findings behind each link illustrate the latest trends in AI technology and its potential for further development.

Sources

###
https://hwjiang1510.github.io/OmniGlue/
OmniGlue: Generalizable Feature Matching with Foundation Model Guidance
Hanwen Jiang1, Arjun Karpur2, Bingyi Cao2, Qixing Huang1, Andre Araujo2
1UT Austin 2Google Research
CVPR 2024

Abstract
The image matching field has been witnessing a continuous emergence of novel learnable feature matching techniques, with ever-improving performance on conventional benchmarks. However, our investigation shows that despite these gains, their potential for real-world applications is restricted by their limited generalization capabilities to novel image domains. In this paper, we introduce OmniGlue, the first learnable image matcher that is designed with generalization as a core principle. OmniGlue leverages broad knowledge from a vision foundation model to guide the feature matching process, boosting generalization to domains not seen at training time. Additionally, we propose a novel keypoint position-guided attention mechanism which disentangles spatial and appearance information, leading to enhanced matching descriptors. We perform comprehensive experiments on a suite of 6 datasets with varied image domains, including scene-level, object-centric and aerial images. OmniGlue's novel components lead to relative gains on unseen domains of 20.9% with respect to a directly comparable reference model SuperGlue, while also outperforming the recent LightGlue method by 9.5% relatively.

OmniGlue Framework
OmniGlue is the first learnable image matcher that is designed with generalization as a core principle. OmniGlue benefits from two designs: foundation model guidance and keypoint-position attention guidance. The visual foundation model, which is trained on large-scale data, provides coarse but generalizable correspondence cues. It guides the inter-image feature propagation process. The keypoint-position attention guidance disentangles the positional information from the keypoint features, which avoids the model specializing too strongly in the training distribution of keypoints and relative pose transformations.
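
In spirit, the position-guided attention lets keypoint positions bias where attention looks while keeping the propagated features appearance-only. Below is a simplified single-head sketch of that idea (illustrative only, not the OmniGlue implementation; the projection matrices and encodings are placeholders):

```python
import torch
import torch.nn.functional as F

def position_guided_attention(desc, pos_emb, w_q, w_k, w_v):
    """desc: (N, D) appearance descriptors; pos_emb: (N, D) keypoint-position encodings."""
    q = (desc + pos_emb) @ w_q          # queries see appearance + position
    k = (desc + pos_emb) @ w_k          # keys see appearance + position
    v = desc @ w_v                      # values carry appearance only
    attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v                     # positions guide *where* to attend, not *what* is propagated

# toy usage with random descriptors for 100 keypoints
N, D = 100, 256
out = position_guided_attention(torch.randn(N, D), torch.randn(N, D),
                                torch.randn(D, D), torch.randn(D, D), torch.randn(D, D))
```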

###

https://github.com/mistralai/mistral-finetune
Mistral-finetune
mistral-finetune is a light-weight codebase that enables memory-efficient and performant finetuning of Mistral's models. It is based on LoRA, a training paradigm where most weights are frozen and only 1-2% additional weights in the form of low-rank matrix perturbations are trained.
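
To make the LoRA idea concrete, here is a minimal self-contained PyTorch sketch (illustrative only, not the mistral-finetune code; the layer size and rank are arbitrary) of freezing a pretrained weight and training only a low-rank perturbation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False           # freeze the pretrained weight
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # frozen path + trained low-rank perturbation B @ A
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(4096, 4096, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.2%}")   # roughly 0.4% at this size and rank
```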

For maximum efficiency it is recommended to use an A100 or H100 GPU. The codebase is optimized for multi-GPU, single-node training setups, but for smaller models, such as the 7B, a single GPU suffices.

Note

The goal of this repository is to provide a simple, guided entrypoint to finetune Mistral models. As such, it is fairly opinionated (especially around data formatting) and does not aim at being exhaustive across multiple model architectures or hardware types. For more generic approaches, you can check out some other great projects like torchtune.

###

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4835311
Financial Statement Analysis with Large Language Models
Chicago Booth Research Paper Forthcoming

Fama-Miller Working Paper

54 Pages Posted: 21 May 2024
Alex Kim
University of Chicago Booth School of Business

Maximilian Muhn
University of Chicago - Booth School of Business

Valeri V. Nikolaev
University of Chicago Booth School of Business

Date Written: May 20, 2024

Abstract
We investigate whether an LLM can successfully perform financial statement analysis in a way similar to a professional human analyst. We provide standardized and anonymous financial statements to GPT4 and instruct the model to analyze them to determine the direction of future earnings. Even without any narrative or industry-specific information, the LLM outperforms financial analysts in its ability to predict earnings changes. The LLM exhibits a relative advantage over human analysts in situations when the analysts tend to struggle. Furthermore, we find that the prediction accuracy of the LLM is on par with the performance of a narrowly trained state-of-the-art ML model. LLM prediction does not stem from its training memory. Instead, we find that the LLM generates useful narrative insights about a company's future performance. Lastly, our trading strategies based on GPT's predictions yield a higher Sharpe ratio and alphas than strategies based on other models. Taken together, our results suggest that LLMs may take a central role in decision-making.

Keywords: GPT4, neural network, asset pricing, earnings, direction of earnings changes, analysts, chain-of-thought, financial statement analysis, large language models

###

https://huggingface.co/papers/2405.12250
Your Transformer is Secretly Linear
Published on May 20 · Featured in Daily Papers on May 22
Authors: Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Nikolai Gerasimenko, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov
Abstract
This paper reveals a novel linear characteristic exclusive to transformer decoders, including models such as GPT, LLaMA, OPT, BLOOM and others. We analyze embedding transformations between sequential layers, uncovering a near-perfect linear relationship (Procrustes similarity score of 0.99). However, linearity decreases when the residual component is removed, due to the consistently low output norm of the transformer layer. Our experiments show that removing or linearly approximating some of the most linear blocks of transformers does not significantly affect the loss or model performance. Moreover, in our pretraining experiments on smaller models we introduce a cosine-similarity-based regularization aimed at reducing layer linearity. This regularization improves performance metrics on benchmarks like Tiny Stories and SuperGLUE and also successfully decreases the linearity of the models. This study challenges the existing understanding of transformer architectures, suggesting that their operation may be more linear than previously assumed.
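
As a rough illustration of the linearity claim, the sketch below fits a least-squares linear map between consecutive hidden states of a small decoder model and reports how much variance it explains (an informal probe, not the paper's Procrustes procedure; the model choice and text are arbitrary):

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "gpt2"   # any decoder-only model serves as an example
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True).eval()

text = "The image matching field has been witnessing novel learnable techniques."
with torch.no_grad():
    out = model(**tok(text, return_tensors="pt"))

hs = out.hidden_states   # one (1, seq_len, dim) tensor per layer boundary
# Note: the fit is only informative when there are far more token rows than
# hidden dimensions; with a short prompt like this it is trivially near-perfect,
# so use a long corpus of tokens in practice.
for l in range(len(hs) - 1):
    X = hs[l][0].double()         # layer-l embeddings, (seq_len, dim)
    Y = hs[l + 1][0].double()     # layer-(l+1) embeddings
    W = torch.linalg.lstsq(X, Y).solution     # least-squares linear map Y ≈ X @ W
    resid = Y - X @ W
    r2 = 1 - resid.pow(2).sum() / (Y - Y.mean(0)).pow(2).sum()
    print(f"layer {l} -> {l + 1}: R^2 = {r2.item():.4f}")
```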

###

https://arxiv.org/abs/2405.14205
Agent Planning with World Knowledge Model
Shuofei Qiao, Runnan Fang, Ningyu Zhang, Yuqi Zhu, Xiang Chen, Shumin Deng, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
Recent endeavors towards directly using large language models (LLMs) as agent models to execute interactive planning tasks have shown commendable results. Despite their achievements, however, they still struggle with brainless trial-and-error in global planning and generating hallucinatory actions in local planning due to their poor understanding of the ''real'' physical world. Imitating humans' mental world knowledge model which provides global prior knowledge before the task and maintains local dynamic knowledge during the task, in this paper, we introduce parametric World Knowledge Model (WKM) to facilitate agent planning. Concretely, we steer the agent model to self-synthesize knowledge from both expert and sampled trajectories. Then we develop WKM, providing prior task knowledge to guide the global planning and dynamic state knowledge to assist the local planning. Experimental results on three complex real-world simulated datasets with three state-of-the-art open-source LLMs, Mistral-7B, Gemma-7B, and Llama-3-8B, demonstrate that our method can achieve superior performance compared to various strong baselines. Besides, we analyze to illustrate that our WKM can effectively alleviate the blind trial-and-error and hallucinatory action issues, providing strong support for the agent's understanding of the world. Other interesting findings include: 1) our instance-level task knowledge can generalize better to unseen tasks, 2) weak WKM can guide strong agent model planning, and 3) unified WKM training has promising potential for further development. Code will be available at this https URL.
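
Schematically, the planning loop described in the abstract can be sketched as follows (illustrative pseudocode with assumed interfaces for the agent, environment, and WKM, not the authors' released code):

```python
def plan_with_wkm(task, env, agent_llm, wkm, max_steps=30):
    # global prior knowledge, generated once before acting
    task_knowledge = wkm.generate_task_knowledge(task)
    trajectory = []
    state = env.reset(task)
    for _ in range(max_steps):
        # local dynamic knowledge summarizing the current situation
        state_knowledge = wkm.generate_state_knowledge(task, trajectory, state)
        action = agent_llm.next_action(
            task=task,
            task_knowledge=task_knowledge,     # guides global planning
            state_knowledge=state_knowledge,   # constrains local planning
            trajectory=trajectory,
        )
        state, done = env.step(action)
        trajectory.append((action, state))
        if done:
            break
    return trajectory
```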

###

https://research.google/blog/effective-large-language-model-adaptation-for-improved-grounding/
Effective large language model adaptation for improved grounding
May 24, 2024

Xi Ye, Student Researcher, and Ruoxi Sun, Research Scientist, Google Cloud

We introduce AGREE, a learning-based framework that enables LLMs to provide accurate citations in their responses, making them more reliable and increasing user trust.

Over the last few years, large language models (LLMs) have showcased remarkable advances in various capabilities, such as multi-hop reasoning, generating plans, and using tools and APIs, all of which demonstrate promise for numerous downstream applications. However, their reliability in real-world deployment is sometimes compromised by the issue of "hallucination", where such models generate plausible but nonfactual information. Hallucinations tend to occur more frequently when LLMs are prompted with open-ended queries that require drawing upon broad world knowledge. This poses risks in domains that demand high factual accuracy, such as news reporting and educational content.

Grounding aims to combat the hallucination problems of LLMs by tracking back their claims to reliable sources. Such a system would not only provide coherent and helpful responses, but also supports its claims with relevant citations to external knowledge.

With this in mind, in our paper “Effective large language model adaptation for improved grounding”, to be presented at NAACL 2024, we introduce a new framework for grounding of LLMs. This framework, which we call AGREE (Adaptation for GRounding EnhancEment), enables LLMs to self-ground the claims in their responses and to provide precise citations to retrieved documents, increasing user trust and expanding their potential applications. Comprehensive experiments on five datasets suggest AGREE leads to substantially better grounding than prior prompting-based or post-hoc citing approaches, often achieving relative improvements of over 30%.

A holistic approach to improve grounding
Prior research on improving grounding mostly follows two prominent paradigms. One is to add citations post-hoc using an additional natural language inference (NLI) model. This approach heavily relies on the knowledge within an LLM’s embeddings and does not extend well to facts beyond that. Another common method for grounding is to leverage the instruction-following and in-context learning capabilities of LLMs. With this second approach, LLMs are required to learn grounding just from a few demonstration prompts, which, in practice, does not lead to the best grounding quality.

Our new framework, AGREE, takes a holistic approach to adapt LLMs for better grounding and citation generation, combining both learning-based adaptation and test-time adaptation (TTA). Different from prior prompting-based approaches, AGREE fine-tunes LLMs, enabling them to self-ground the claims in their responses and provide accurate citations. This tuning on top of the pre-trained LLMs requires well-grounded responses (with citations), for which we introduce a method that can automatically construct such data from unlabeled queries. The self-grounding capability of tuned LLMs further grants them a TTA capability that can iteratively improve their responses.

High-level illustration of AGREE. At training time, we generate training data automatically and adapt LLMs for better grounding via fine-tuning. At test time, we introduce a test-time adaptation mechanism to iteratively improve their responses.

Tuning LLMs for self-grounding
During training, AGREE collects synthetic data from unlabeled queries, which we then use to fine-tune a base LLM into an adapted LLM that can self-ground its claims. Given an unlabeled query, we first retrieve relevant passages from reliable sources (e.g., Wikipedia) using a retriever model. We present the retrieved passages to the base LLM and sample a set of initial responses (without citations). Next, we use an NLI model (in our case, a variant of Google TrueNLI model), which can judge whether a claim is supported by a passage, to help add citations to the initial responses. For each sentence in an initial response, we use the NLI model to find the passage that can support the sentence, and add a citation to the supporting passage accordingly. We do not add citations to those sentences that do not have a passage that can back them up.
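
In code form, the citation-construction step above can be sketched roughly as follows (assumed helper interfaces such as `nli_model.entailment_score`, not the actual AGREE pipeline):

```python
def add_citations(response_sentences, passages, nli_model, threshold=0.5):
    cited = []
    for sentence in response_sentences:
        # score how strongly each retrieved passage supports the sentence
        scores = [nli_model.entailment_score(premise=p, hypothesis=sentence)
                  for p in passages]
        best = max(range(len(passages)), key=lambda i: scores[i])
        if scores[best] >= threshold:
            cited.append(f"{sentence} [{best + 1}]")   # cite the supporting passage
        else:
            cited.append(sentence)                     # left uncited: treated as unsupported
    return " ".join(cited)
```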

Illustration of the tuning process. We sample responses from the base model, use an NLI model to add citations to the sampled responses, and tune the base model with the best-grounded response.

Now that the initial responses are augmented with automatically created citations, we then select the best-grounded responses to fine-tune the base LLM. We determine which are the best grounded by measuring the averaged grounding score over all the sentences in the response according to the NLI model. With these responses, we tune the base LLM to teach it to include citations to its responses. In addition, we also teach base LLM to indicate those sentences in its responses that are unsupported, which will be useful during test-time adaptation so the LLM can iteratively refine its responses.

We create the tuning data using the queries from three commonly used datasets, Natural Questions, StrategyQA, and Fever, since they provide diverse text and require different types of reasoning processes.

Test-time adaptation
At test time, AGREE introduces an iterative inference strategy that empowers the LLM to actively seek additional information based on its self-generated citations. Given a query, we first use the retriever model to obtain an initial passage set. Next, we iteratively invoke the following procedure: 1) At each iteration, the adapted LLM generates a response containing citations to the passage set and finds any unsupported statements that do not have citations. 2) Then, we actively present more information to the LLM based on the citation information — if there are unsupported statements, we include additional information that is retrieved from reliable sources using those statements, otherwise, we include more unseen passages that are retrieved using the query to acquire more complete information.
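
The iterative procedure can be sketched as follows (illustrative pseudocode with assumed retriever and LLM interfaces, not the AGREE implementation):

```python
def tta_answer(query, retriever, adapted_llm, num_iters=3, top_k=5):
    passages = retriever.retrieve(query, k=top_k)        # initial passage set
    response = None
    for _ in range(num_iters):
        # the adapted LLM cites passages and flags sentences it cannot ground
        response = adapted_llm.generate_grounded(query, passages)
        unsupported = [s for s in response.sentences if not s.citations]
        if unsupported:
            # seek evidence specifically for the unsupported claims
            for s in unsupported:
                passages += retriever.retrieve(s.text, k=top_k)
        else:
            # everything is cited: broaden coverage with more passages for the query
            # (deduplication against already-seen passages omitted here)
            passages += retriever.retrieve(query, k=top_k)
    return response
```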

Illustration of the test-time adaptation (TTA) mechanism. The adapted LLM retrieves from the corpus based on self-generated citation information to refine its response in an iterative way.

Experiments
We conduct comprehensive experiments to demonstrate the effectiveness of AGREE both with and without TTA. We evaluate it across five datasets, including two in-domain datasets (NQ and StrategyQA) that have been used for adapting the base LLM and three out-of-domain datasets (ASQA, QAMPARI and an internal QA dataset, called “Enterprise” below) to test the generalization of our framework. We apply AGREE to adapt two LLMs and compare them against a competitive prompting-based baseline (ICLCite), and a post-hoc citing baseline (PostCite), both from ALCE.

Performance across five datasets of AGREE compared to baselines ICLCite and PostCite. Our approach achieves substantially better grounding and citation precision compared to the baselines.

There are three key takeaways from the figure above, which illustrates the effectiveness of our approach.

Tuning is effective for superior grounding.
Across five datasets, AGREE generates responses that are better grounded in the text corpus (measured by citation recall) and provides accurate citations to its responses (measured by citation precision). It outperforms each of our selected baselines by a substantial margin. Tuning with high-quality data is a much more effective way for LLMs to learn to ground their responses without needing an additional NLI model.
The improvements can generalize.
AGREE adapts the base LLM only using in-domain training sets (NQ, StrategyQA), and directly tests the model on out-of-domain test datasets (ASQA, QAMPARI, Enterprise). The results suggest that the improvements can effectively generalize to out-of-domain datasets that contain different question types or use different types of external knowledge. This is a fundamental advantage of the proposed approach — AGREE can generalize to a target domain in the zero-shot setting without needing demonstrations from that domain.
TTA improves both grounding and answer correctness.
Comparing our framework at its full capacity and a variant without test-time adaptation, we observe improvements in terms of both better grounding and accuracy. This is because TTA allows the LLMs to actively collect more relevant passages to construct better answers following the self-grounding guidance.
Conclusion
In conclusion, we present AGREE, a framework for improving the factuality and verifiability of LLM-generated content. AGREE presents an effective learning-based approach to adapt a base LLM to self-ground its response using automatically collected data. This integrated capability for grounding further enables the LLM to improve the responses at test time. Our evaluations across five datasets demonstrate the benefits of the holistic adaptation approach compared to approaches that solely rely on prompting or the parametric knowledge of LLMs. We encourage you to read the paper to learn about our findings and join us in building more trustworthy and reliable language models.

###

https://huggingface.co/spaces/wyysf/CraftsMan
CraftsMan: High-fidelity Mesh Generation
with 3D Native Generation and Interactive Geometry Refiner
Weiyu Li*1,2, Jiarui Liu*1,2, Rui Chen1,2, Yixun Liang2,3, Xuelin Chen4, Ping Tan1,2, Xiaoxiao Long5

1HKUST, 2LightIllusions, 3HKUST(GZ), 4Tencent AI Lab, 5HKU

TL;DR: CraftsMan (aka 匠心) is a two-stage text/image-to-3D mesh generation model. By mimicking the modeling workflow of an artist/craftsman, we propose to generate a coarse mesh (5s) with smooth geometry using a 3D diffusion model and then refine it (20s) using enhanced multi-view normal maps generated by 2D normal diffusion, which can also be done in an interactive manner, like ZBrush.
✨ Overview
This repo contains the source code (training/inference) of the 3D diffusion model, pretrained weights, and Gradio demo code for our 3D mesh generation project; you can find more visualizations on our project page. If you have high-quality 3D data or some other ideas, we very much welcome any form of cooperation.

Full abstract here
We present a novel generative 3D modeling system, coined CraftsMan, which can generate high-fidelity 3D geometries with highly varied shapes, regular mesh topologies, and detailed surfaces, and, notably, allows for refining the geometry in an interactive manner. Despite the significant advancements in 3D generation, existing methods still struggle with lengthy optimization processes, irregular mesh topologies, noisy surfaces, and difficulties in accommodating user edits, consequently impeding their widespread adoption and implementation in 3D modeling software. Our work is inspired by the craftsman, who usually roughs out the holistic figure of the work first and elaborates the surface details subsequently. Specifically, we employ a 3D native diffusion model, which operates on latent space learned from latent set-based 3D representations, to generate coarse geometries with regular mesh topology in seconds. In particular, this process takes as input a text prompt or a reference image, and leverages a powerful multi-view (MV) diffusion model to generate multiple views of the coarse geometry, which are fed into our MV-conditioned 3D diffusion model for generating the 3D geometry, significantly improving robustness and generalizability. Following that, a normal-based geometry refiner is used to significantly enhance the surface details. This refinement can be performed automatically, or interactively with user-supplied edits. Extensive experiments demonstrate that our method achieves high efficacy in producing superior quality 3D assets compared to existing methods.
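
At a high level, the two-stage pipeline can be sketched as follows (illustrative pseudocode with assumed component interfaces, not the CraftsMan codebase):

```python
def craftsman_pipeline(prompt_or_image, mv_diffusion, native_3d_diffusion,
                       normal_refiner, user_edits=None):
    # Stage 1: coarse geometry in seconds
    views = mv_diffusion.generate_views(prompt_or_image)        # multi-view images of the shape
    coarse_mesh = native_3d_diffusion.sample(condition=views)   # regular-topology coarse mesh
    # Stage 2: surface-detail refinement via enhanced normal maps,
    # optionally steered by interactive user edits
    normals = normal_refiner.enhance_normals(coarse_mesh, views, edits=user_edits)
    return normal_refiner.refine_geometry(coarse_mesh, normals)
```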

###

https://huggingface.co/papers/2405.09818
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Published on May 16 · Featured in Daily Papers on May 17
Authors:
Chameleon Team
Abstract
We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed modal generation. Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in a unified modeling of full multimodal documents.
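
As a toy illustration of the early-fusion setting, images can be quantized into discrete tokens that share a single sequence (and vocabulary) with text tokens, so one transformer handles both modalities autoregressively. The tokenizer interfaces and begin/end-of-image markers below are assumptions, not Chameleon's actual tokenizer:

```python
def build_mixed_sequence(segments, text_tokenizer, image_tokenizer):
    """segments: list of ("text", str) or ("image", image) pairs in document order."""
    tokens = []
    for kind, content in segments:
        if kind == "text":
            tokens += text_tokenizer.encode(content)
        else:  # "image"
            tokens += [image_tokenizer.boi_token]        # begin-of-image marker (assumed)
            tokens += image_tokenizer.encode(content)    # discrete image codes
            tokens += [image_tokenizer.eoi_token]        # end-of-image marker (assumed)
    return tokens
```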

###

https://jjihwan.github.io/projects/FIFO-Diffusion
FIFO-Diffusion: Generating Infinite Videos from Text
without Training
Jihwan Kim*1 Junoh Kang*1 Jinyoung Choi1 Bohyung Han1, 2

1ECE & 2IPAI, Seoul National University
(\* Equal Contribution)
{kjh26720, junoh.kang, jin0.choi, bhhan}@snu.ac.kr

[arXiv] [Code]

1K-frame Long Videos (512 x 320 resolution, VideoCrafter2)
A spectacular fireworks display over Sydney Harbour, 4K, high resolution.

A colony of penguins waddling on an Antarctic ice sheet, 4K, ultra HD.

An astronaut floating in space, high quality, 4K resolution.


Abstract
We propose a novel inference technique based on a pretrained diffusion model for text-conditional video generation. Our approach, called FIFO-Diffusion, is conceptually capable of generating infinitely long videos without training. This is achieved by iteratively performing diagonal denoising, which concurrently processes a series of consecutive frames with increasing noise levels in a queue; our method dequeues a fully denoised frame at the head while enqueuing a new random noise frame at the tail. However, diagonal denoising is a double-edged sword as the frames near the tail can take advantage of cleaner ones by forward reference but such a strategy induces the discrepancy between training and inference. Hence, we introduce latent partitioning to reduce the training-inference gap and lookahead denoising to leverage the benefit of forward referencing. We have demonstrated the promising results and effectiveness of the proposed methods on strong text-to-video generation baselines.
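
Conceptually, the diagonal denoising queue can be sketched as follows (illustrative pseudocode with assumed model interfaces, not the authors' implementation):

```python
from collections import deque

def fifo_generate(video_diffusion, prompt, num_frames, queue_len):
    # The queue holds queue_len consecutive latent frames; the head is the least
    # noisy and the tail the most noisy (noise levels 1 .. queue_len).
    queue = deque(video_diffusion.init_latents(prompt, queue_len))
    frames = []
    while len(frames) < num_frames:
        levels = list(range(1, len(queue) + 1))
        # one diagonal denoising step: every frame in the window moves down one noise level
        queue = deque(video_diffusion.denoise_one_step(list(queue), levels, prompt))
        frames.append(queue.popleft())                   # dequeue the fully denoised head frame
        queue.append(video_diffusion.sample_noise())     # enqueue fresh random noise at the tail
    return frames
```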
