Summary

Today's AI news covers reproducing the GPT-2 model, a context-memory evaluation of GPT-4o versus Gemini 1.5, the introduction of RAG 2.0, Meta's introduction to vision-language models, better LLM agents through executable code actions, and several other recent AI and machine-learning research results and announcements.

Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20

https://github.com/karpathy/llm.c/discussions/481, 2024-05-29 (Karpathy)

  • Karpathy shares how to reproduce the GPT-2 (124M) model with llm.c in 90 minutes for about $20.
  • llm.c is efficient, reaching roughly 60% model FLOPs utilization.
  • The model can be reproduced in about 90 minutes on a Lambda 8X A100 80GB SXM node.
  • Training runs on 10 billion tokens of the FineWeb dataset and surpasses OpenAI's GPT-2 (124M) on HellaSwag accuracy.
  • Explains the required environment setup, hardware requirements, detailed hyperparameter settings, and how to run the training.

OpenAI’s GPT-4o vs. Gemini 1.5 ⭐ Context Memory Evaluation

https://medium.com/@lars.chr.wiik/openais-gpt-4o-vs-gemini-1-5-context-memory-evaluation-1f2da3e15526, 2024-05-20 (Lars Wiik)

  • Needle in the Haystack test results measuring information extraction from long contexts, comparing OpenAI's and Google's LLMs.
  • GPT-4o, GPT-4-turbo, and GPT-4-0613 were the top performers, while Google's Gemini models underperformed.
  • OpenAI's models perform better over long context windows; the Gemini models drop to around 50% accuracy beyond 8k context length.
  • Although Google's latest models support 1-million-token inputs, OpenAI's models still show more consistent performance.

Introducing RAG 2.0

https://contextual.ai/introducing-rag2/, 2024-03-19 (Contextual AI Team)

  • RAG 2.0 is an end-to-end optimized system that substantially outperforms existing GPT-4-based RAG systems.
  • Demonstrates RAG 2.0's performance across axes such as open-domain question answering, faithfulness, and freshness.
  • Shows even larger gains over existing RAG systems on customer workloads, highlighting its viability in production.
  • RAG 2.0 models were trained and deployed on Google Cloud's latest ML infrastructure.

AI Success Depends on the CFO, Not IT | Gartner Finance Keynote

https://www.youtube.com/watch?app=desktop&v=y268jrtjako&t=1s, 2024-05-28 (Gartner)

  • Keynote by Gartner VP Nisha Bhandare and senior analyst Clement Christensen on AI adoption and cost management.
  • Emphasizes that CFOs must play a central role in managing common problems with AI, such as cost overruns, misuse in decision-making, and loss of trust.
  • Provides a framework for understanding how AI costs differ from other technology costs and for evaluating the value of AI initiatives across the enterprise.

An Introduction to Vision-Language Modeling

https://arxiv.org/abs/2405.17247, 2024-05-30 (META)

  • An introduction to vision-language modeling (VLM): what VLMs are, how they work, and how to train them.
  • Discusses approaches to evaluating VLMs and extending them beyond image-to-language mapping to video.
  • Unlike language, vision is represented in a much higher-dimensional space where concepts are not easily discretized; the paper lays out the challenges in improving the reliability of these models.

Executable Code Actions Elicit Better LLM Agents

https://huggingface.co/papers/2402.01030, 2024-02-02 (Xingyao Wang et al.)

  • Proposes CodeAct, which consolidates LLM agent actions into executable Python code.
  • In an extensive analysis of 17 LLMs, CodeAct achieves up to 20% higher success rates than existing alternatives.
  • CodeActAgent, fine-tuned from Llama2 and Mistral, performs sophisticated tasks and collaborates with users in natural language.

Codestral: Hello, World!

https://mistral.ai/news/codestral/, 2024-05-29 (Mistral AI team)

  • Mistral AI announces Codestral, its first code model, designed for code generation tasks.
  • Supports more than 80 programming languages and provides API endpoints for code generation and interaction.
  • Shows strong performance on benchmarks such as HumanEval, MBPP, CruxEval, and RepoBench.

Few-shot tool-use doesn’t really work (yet)

https://research.google/blog/few-shot-tool-use-doesnt-really-work-yet/, 2024-05-30 (Alon Jacovi)

  • Research finding that few-demonstration approaches to instructing models to use tools are less effective than commonly assumed.
  • In a large-scale evaluation of various tool-use algorithms, they did not outperform LLMs that used no tools at all.
  • The effectiveness of tool-use strategies varies widely across settings, suggesting the need for more thorough evaluation schemes.

Faithful Logical Reasoning via Symbolic Chain-of-Thought

https://arxiv.org/abs/2405.18357, 2024-05-30 (Jundong Xu et al.)

  • Proposes Symbolic Chain-of-Thought (SymbCoT) to strengthen logical reasoning capability.
  • SymbCoT translates the natural-language context into a symbolic format and derives a plan that solves the problem using logical rules.
  • Evaluation on five standard datasets shows marked improvements over the CoT method, providing more faithful and flexible logical reasoning.
Sources

###
https://github.com/karpathy/llm.c/discussions/481

Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20 (#481)
karpathy (Maintainer), GitHub Discussions

Let's reproduce the GPT-2 (124M) in llm.c (~4,000 lines of C/CUDA) in 90 minutes for $20. The 124M model is the smallest model in the GPT-2 series released by OpenAI in 2019, and is actually quite accessible today, even for the GPU poor. With llm.c, which is quite efficient at up to ~60% model flops utilization, reproducing this model on one 8X A100 80GB SXM node takes ~90 minutes. For example, on Lambda this node goes for ~$14/hr, so the total cost of reproducing this model today is about $20. You can train the model with a single GPU too, it would just take proportionally longer (e.g. ~4-24 hours depending on the GPU). In addition, llm.c still has a lot of pending optimizations and people haven't tried to tune the training in the style of cramming, so I'd say we're likely to see significant improvements on this number. So here is the run, training the 12-layer, 12-headed, 768-dimension, 124M Transformer on 10 billion tokens of FineWeb:

[Figure: 124M training curves — FineWeb validation loss (left) and HellaSwag accuracy (right)]

The left pane shows that we outperform the checkpoint released by OpenAI on the FineWeb withheld validation dataset. This is not the ideal metric because the data distribution of GPT-2 was different (it was trained on the never released "WebText" dataset) and the statistics of the internet may have been different 5 years ago, so it's not a super fair comparison. Therefore, in addition on the right we also plot the HellaSwag accuracy, a benchmark commonly used to assess LLM capability that is nice, smooth, and well-behaved. I'd mostly look at HellaSwag, but FineWeb val is a nice confirmation. That said, HellaSwag has no math/code so it slightly favors our setting (common crawl-like data). One more point of reference is that GPT-3 in Appendix H cites HellaSwag accuracy at 33.7 for GPT-3 Small (124M) model. We get to 29.9 here, which surpasses GPT-2 (124M) at 29.4. Keep in mind that here we trained for 10B tokens, while GPT-3 models were all trained for 300B tokens.

Now here is the shortest path to reproducing this result yourself. You'll need a GPU. I like and run my work on Lambda labs (who graciously sponsors in llm.c development), though the inventory can be limited at times. Many other providers exist and you can use the Discussion below for tips and tricks around this. Here is the example process for a Linux x86 64bit Ubuntu 22.04 with CUDA 12 (this is somewhere around the current, default "modern" configuration). If you're on a different system, the comments and discussion in the main README file might be helpful.

# install miniconda
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash
source ~/.bashrc

# pytorch nightly (optional) https://pytorch.org/get-started/locally/
# conda install --yes pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch-nightly -c nvidia

# pip installs so we can tokenize the FineWeb dataset
yes | pip install tqdm tiktoken requests datasets

# install cudnn so we can use FlashAttention and run fast (optional)
# https://developer.nvidia.com/cudnn-downloads
# for me, CUDA 12 (run `nvcc --version`) running on Linux x86_64 Ubuntu 22.04
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install libcudnn9-dev-cuda-12

# "install" cudnn-frontend to ~/
git clone https://github.com/NVIDIA/cudnn-frontend.git

# install MPI (optional, if you intend to use multiple GPUs)
sudo apt install openmpi-bin openmpi-doc libopenmpi-dev

# tokenize the FineWeb dataset 10B tokens sample (takes ~1 hour, get lunch?)
# writes ~19GB of raw GPT-2 tokens to dev/data/fineweb10B
# and ~46GB in ~/.cache/huggingface/datasets/HuggingFaceFW___fineweb
git clone https://github.com/karpathy/llm.c.git
cd llm.c
python dev/data/fineweb.py --version 10B

# compile llm.c (mixed precision, with cuDNN flash-attention)
# first compilation is ~1 minute, mostly due to cuDNN
make train_gpt2cu USE_CUDNN=1

# train on a single GPU
./train_gpt2cu \
-i "dev/data/fineweb10B/fineweb_train_*.bin" \
-j "dev/data/fineweb10B/fineweb_val_*.bin" \
-o log124M \
-e "d12" \
-b 64 -t 1024 \
-d 524288 \
-r 1 \
-z 1 \
-c 0.1 \
-l 0.0006 \
-q 0.0 \
-u 700 \
-n 5000 \
-v 250 -s 20000 \
-h 1

# if you have multiple GPUs (e.g. 8), simply prepend the mpi command, e.g.:
# mpirun -np 8 ./train_gpt2cu \ ... (the rest of the args are same)
Args guide. A lot of these hyperparameters follow the GPT-3 paper instead of the GPT-2 paper, because it was a lot more detailed. Args explanation:

-i -j are training and validation splits token files, written by fineweb.py
-o is the output directory to write logs and checkpoints into
-e "d12" asks to initialize, a depth 12 GPT-2 model from scratch
-b 64 sets the micro-batch size to 64. If you are running out of memory, decrease this value, e.g. try 32, 16, 8, all the way down to 1 potentially.
-t 1024 sets the maximum sequence length to 1024, as GPT-2 did
-d 524288 requests that the total batch size per single update be ~0.5M tokens. The code will take this desired batch size and calculate the needed gradient accumulation "inner loop" steps of the optimization. For example on 8 GPUs, at -b 64 and -t 1024, every microbatch is doing exactly 8 X 64 X 1024 = 524288 tokens, so there is no need for gradient accumulation. But if we only have 1 GPU, then the code will set it to 8, and do an inner loop of 8 iterations to add up to this "total batch size" per step (see the short sketch after this list). While the batch size used to train GPT-2 is unknown, this number ~0.5M comes from the GPT-3 paper table, for this model size.
-r 1 sets the recompute setting = 1, so we will re-compute the GeLU activations. This slightly increases the runtime, but saves quite a bit of memory, allowing us to increase the batch size and get a net increase in token throughput.
-z 1 turns on ZeRO-1 (i.e. optimizer state sharding) across multiple GPUs. If you're training with > 1 GPU, this setting is a no-brainer and should basically always be on. On 1 GPU this setting is a no-op.
-c 0.1 sets the weight decay to 0.1. Only (2D) weights are decayed exactly as in GPT-2, and this number comes from the GPT-3 paper
-l 0.0006 sets the maximum learning rate, from GPT-3 paper.
-q 0.0 says that we will decay the learning rate to 0 over the course of training.
-u 700 says that we will ramp up the learning rate from 0 to max learning rate over the first 700 iterations, which at total batch size 0.5M is 350M tokens, following GPT-3 paper.
-n 5000 asks to save model checkpoints every 5000 steps.
-v 250 asks to evaluate and log the validation loss every 250 steps
-s 20000 asks to sample some tokens every 20000 steps. Because the total number of steps will be less than this (see below), this effectively turns generation off and we will only sample a single time at the very end.
-h 1 asks to evaluate the HellaSwag accuracy, something we can compare across papers.
Because we did not set the maximum number of steps using -x flag, it defaults to exactly one epoch over the training data, i.e. 10B tokens. Because the total batch size is ~0.5M and total number of tokens is 10B, there will be a total of ~ 10B/0.5M = 20K steps.
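To make the batch-size arithmetic above concrete, here is a small illustrative Python sketch (not part of llm.c; the function name and defaults exist only for this example) of how the gradient-accumulation steps and total step count fall out of the flags:

# illustrative sketch, not llm.c code: how -b, -t, -d and the GPU count interact
def batch_schedule(total_batch_tokens=524288, micro_batch=64, seq_len=1024,
                   num_gpus=8, train_tokens=10_000_000_000):
    tokens_per_micro_step = micro_batch * seq_len * num_gpus       # 8 * 64 * 1024 = 524288
    grad_accum_steps = total_batch_tokens // tokens_per_micro_step
    total_steps = train_tokens // total_batch_tokens               # ~20K steps for 10B tokens
    return grad_accum_steps, total_steps

print(batch_schedule())             # (1, 19073): no gradient accumulation needed on 8 GPUs
print(batch_schedule(num_gpus=1))   # (8, 19073): inner loop of 8 on a single GPU

(The real run reports 18,865 steps because one data shard is held out for validation, as explained below.)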
There's a lot of detail above but the TLDR is that we're training a 12-layer GPT-2 (124M), from scratch, on 10B tokens of FineWeb, with max sequence length of 1024 tokens. If you are running out of memory, I would first make sure you have -r 1 turned on, and then I would start decreasing the batch size -b by dividing it by 2, until it runs. Once it runs, I'd see if you can get away with going back to -r 0 to recover a little bit of speed.

Training. The code will print something like this over time (this is an example of a single A100 40GB PCIe GPU, $1.29/hr):

step 80/18865 | train loss 7.577051 | norm 1.1461 | lr 6.86e-05 | 2950.68 ms | 49.0% A100 fp16 MFU | 177968 tok/s
step 81/18865 | train loss 7.540626 | norm 1.4001 | lr 6.94e-05 | 2952.59 ms | 49.0% A100 fp16 MFU | 177948 tok/s
step 82/18865 | train loss 7.465753 | norm 1.0613 | lr 7.03e-05 | 2953.98 ms | 48.9% A100 fp16 MFU | 177924 tok/s
step 83/18865 | train loss 7.472681 | norm 1.1553 | lr 7.11e-05 | 2955.67 ms | 48.9% A100 fp16 MFU | 177897 tok/s
What is going on? Well, we have 10B training tokens and our batch size is ~0.5M, so we'd expect about 10B/0.5M ~= 20K steps in total. It actually works out to exactly 18,865 because one of the data shards is reserved for validation data and the exact batch size is a nice power of 2 @ 524,288. So here we are on step 80/18865, which in total took 2950.68ms. MFU is short for "Model Flops Utilization". The A100 claims to offer 312 TFLOPS, but in practice this is very hard to achieve because the training is memory-bound and we can't feed the TensorCores that do the matrix multiplies. On this A100 40GB PCIe GPU, we see that when we count up the FLOPs we're doing and divide by time, we're roughly at half the theoretical, maximum peak FLOPS, which is quite good. If you used the A100 80GB SXM with higher memory bandwidth and max thermal design power, this goes up to ~60%. (If you use a GPU that is not A100, ignore this number because it is in units of A100 fp16 FLOPS). We also see that the token throughput we are achieving is about 178K tok/s. Next, our current loss is 7.577. The lower this is, the better our model is at predicting the next token in the sequence on average. Step 80 is very early in the training here. Because the perplexity is exp(7.577) ~= 2K, our model is as confused about each next token on average, as if it was guessing at random from 2,000 tokens. The full vocab size is 50,257. By the end of the optimization we'll get to about 3.29, so it's as if we're guessing uniformly at random from exp(3.29) ~= 27 tokens at each time step. Finally we see the gradient norm is 1.1461. When this number spikes, the gradient is exploding and this is very bad. To mitigate gradient explosions, as is standard, llm.c uses gradient clipping at 1.0, so if the gradient norm exceeds 1.0 (like in this time step) we forcefully scale it down so that its norm is at most 1.0. Later in the optimization, the gradient norm usually "calms down" to lower values.
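As a quick cross-check of the arithmetic in that paragraph, here is an illustrative Python snippet (not from llm.c) computing the perplexity implied by the loss and a rough MFU estimate from the reported token throughput; the 6*N*D rule of thumb ignores attention FLOPs, so it lands a bit below the ~49% the log reports:

import math

# perplexity implied by the training loss: how many tokens the model is effectively guessing between
print(math.exp(7.577))   # ~1950 early in training
print(math.exp(3.29))    # ~27 by the end of the run

# rough MFU from the log line above, using the 6*N*D FLOPs-per-token approximation
params, tok_per_s, a100_peak_flops = 124e6, 178_000, 312e12
print(f"{6 * params * tok_per_s / a100_peak_flops:.0%}")   # ~42%, same ballpark as the logged MFU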

Visualization. Finally, you'll want to make pretty charts like the one I posted up above. For that, our program is printing some very rudimentary logs to an improvised log124M/main.log file. I have attached an example Jupyter notebook that parses these files and visualizes them in the style above.

Tokenizer. When you're training up above, you'll see a warning that llm.c couldn't find the GPT-2 tokenizer .bin file. That's totally fine for training, but it means that we can't decode - i.e. we can't convert integer tokens that we sample into little string pieces, to create text that we can read. Here is how we can generate it:

# install pytorch nightly
conda install --yes pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch-nightly -c nvidia

# install huggingface transformers
pip install transformers

# preprocess the TinyShakespeare dataset (very fast, much faster than FineWeb)
python dev/data/tinyshakespeare.py

# run a little training loop in Python/PyTorch
# it saves a lot of .bin files, including the tokenizer
python train_gpt2.py
The Python script is a parallel implementation to llm.c used for error checking and unit tests (but doesn't have full feature parity). In particular, if we run it like above it will write the file gpt2_tokenizer.bin, which the C code can read and use to output nice text during sampling.

Sampling. The code is currently not really intended for inference, but you can hack the code to do inference very inefficiently (without any kv-cache etc.) with something like this:

make train_gpt2cu USE_CUDNN=1
./train_gpt2cu \
-i "dev/data/fineweb10B/fineweb_train_*.bin" \
-j "dev/data/fineweb10B/fineweb_val_*.bin" \
-e "log124M/gpt2_124M_00018865.bin" \
-b 1 -t 1024 \
-x 1 \
-l 0.0 \
-s 1 -g 256
The -i -j flags are spurious. -e flag is pointing at the final model checkpoint of our GPT-2 124M model, which llm.c will initialize the model from. The -b 1 is saying to use only a single batch element (one row of length 1024 tokens in which we sample from left to right). The -x 1 is saying we only want to run for a single step, and -l 0.0 is setting the learning rate to zero so we don't actually train the model on this single step. Finally -s 1 is saying "sample every step" and -g 256 is saying sample 256 tokens.

Now, the above is just unconditional sampling. It's possible to hack the code to do conditional sampling, i.e. sequence completion. E.g. I asked our 124M model to complete the text "The GitHub project llm.c is a", and it continued: "free service to enhance the scholarly infrastructure of the academic community.". I then re-sampled with a different seed and got "The GitHub project llm.c is a collaborative effort that rocks GitHub itself". So, not bad I guess :) I had to directly hack the code by setting gen_tokens[1:10] to be the prompt tokens 464, 21722, 1628, 32660, 76, 13, 66, 318, 257 (from tiktokenizer ty), then hacked the loop index that samples to start at token position 10, ... you get the idea TLDR conditional generation is not really supported but in principle possible, possibly coming soon.

Code. 95% of the heavy lifting is in the train_gpt2.cu file. It started as a nice clean 1,000 LOC C code, but has grown quite a bit and now it's closer to 3,500 LOC, with 4 supporting files of file I/O utils, tokenizer, dataloader, and random number generation. Roughly speaking, the first 500 LOC are just basic setup of MPI, NCCL, cuDNN, cuBLAS, etc. The next 1,500 LOC are all the layers of the Transformer, and both their forward and backward implementation in efficient CUDA code. All the CUDA kernel development for these files happens in dev/cuda. So for example there is a gelu_forward() and then also a gelu_backward(), and the same way for all the other layers. The next 1,000 LOC are the gpt2 model, which just strings together the layers and itself has one big gpt2_forward() and gpt2_backward(). The last 1,000 LOC are int main(), which has the main training loop and all the related bookkeeping and argument parsing, and a lot of tedious code around e.g. resuming training from a previous checkpoint, etc.

350M model. Overnight I also reproduced the 350M parameter model. Take a look at the file run350M.sh for the exact launch command. I found that 10B tokens was not enough for the 350M model, so you'll have to download and preprocess the FineWeb100B (or try to do multiple epochs on just the 10B above, which might work, I have not checked). I configured it to train for 30B tokens, so we have that:

FLOPS using 6ND approximation:

124M on 10B tokens => 6 * 124e6 * 10e9 = 7.44e18 ~= 7e18 capability model
350M on 30B tokens => 6 * 350e6 * 31.5e9 = 6.615e19 ~= 7e19 capability model (~10X)
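As a cross-check, a tiny illustrative snippet (not from the post) reproducing the 6ND numbers above and the implied wall-clock at ~60% MFU:

# 6*N*D training-compute estimate; constants are taken from the numbers quoted above
def train_flops(params, tokens):
    return 6 * params * tokens

print(f"{train_flops(124e6, 10e9):.3g}")     # ~7.44e18 for the 124M run
print(f"{train_flops(350e6, 31.5e9):.3g}")   # ~6.62e19 for the 350M run, roughly 10X

# implied wall-clock for the 124M run on 8X A100 at ~60% MFU, consistent with the ~90 minutes quoted
seconds = train_flops(124e6, 10e9) / (8 * 312e12 * 0.6)
print(f"{seconds / 60:.0f} min")             # ~83 min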
On 8X A100 80GB SXM the 350M stepped at 820ms/iter. Trained for 60K steps (instead of ~20K), for a total of ~30B tokens (instead of ~10B tokens). Total training time 14 hours. Cost $14/hr => 14 X 14 ~= $200 (10X of 124M). However looking at the plot, it's possible that we could have gotten away with slightly less:

[Figure: 350M training curves]

Coming up. That's it for now! We are moving on to the 740M and then, of course, the actual "GPT-2" 1558M. If I can find the GPUs... By very rough napkin math, on my single 8X A100 80GB GPU box, the 1558M model would take ~1 week and cost ~$2.5K. This is in acceptable territory, but we'll want to take some time to make the current code better, cleaner, better tested, and add multi-node training support. And also very much still on my mind, I want to build the whole thing again, from scratch and piece by piece, coming to you soon^TM.

FAQ:

Can I sample from it? kind of, but it's inefficient and a bit weird.
Can I chat with it? no, this is currently only pretraining, not chat finetuning.
Can you train multi-node distributed? in principle yes, there is a slurm PR up that got this working for up to 50 nodes. In practice I personally haven't tried yet.
Are you bitwise deterministic? No but we are very close, one more kernel to patch.
Can you train in fp8? No, we're currently mostly training in bf16, but coming soon.
I have a non-NVIDIA GPU (AMD, Apple Silicon, etc.) can I run llm.c? No, llm.c supports C/CUDA only, but I am very happy to link to any forks under "notable forks" section, or accept PRs that would make porting llm.c to other platforms easier.
I only have a CPU, can I play? You won't be able to reproduce GPT-2 models, but you can take on fun projects by finetuning OpenAI GPT-2 models on other data, e.g. TinyShakespeare or TinyStories. Support for these datasets, initialization, and CPU finetuning exists in llm.c in train_gpt2.c. (It's a lot more rudimentary though, intended mostly as a reference for the CUDA code).
How does this compare to PyTorch? llm.c is a "straight up" C/CUDA implementation. The PyTorch code at train_gpt2.py does not have full feature parity (e.g. doesn't do sharded data loading, etc.) and is meant to be more as a reference, but I think you can get something similar to the 124M model above stepping as follows: torchrun --standalone --nproc_per_node=4 train_gpt2.py --input_bin dev/data/fineweb10B/fineweb_train_000001.bin --write_tensors 0 --model d12 --batch_size 64 --sequence_length 1024 --total_batch_size 524288 --dtype bfloat16 --compile 1 --tensorcores 1 --flash 1 --num_iterations 18865 --weight_decay 0.1 --overfit_single_batch 0. I am interested in and would accept PRs that bring the PyTorch training closer up to feature parity to the llm.c training loop.
Why do you care so much about GPT-2? GPT-2 is the grand-daddy of LLMs, the first time that the modern LLM stack came together in a recognizably modern form, and the parameters were released by OpenAI. GPT-3 actually didn't change too much at all about the model (context size 1024 -> 2048, I think that's it?). GPT-4 details were never published. Many other LLMs also strongly resemble GPT-2, despite it being from 2019, e.g. Llama 3 from the architecture perspective is a non-linearity change in the MLP and the addition of the RoPE relative positional encoding.

###
https://medium.com/@lars.chr.wiik/openais-gpt-4o-vs-gemini-1-5-context-memory-evaluation-1f2da3e15526
OpenAI’s GPT-4o vs. Gemini 1.5 ⭐ Context Memory Evaluation
Needle in Haystack Evaluation — OpenAI vs. Google
Lars Wiik, May 20, 2024

[Image: Google vs. OpenAI — “Needle in the Haystack”]
A Large Language Model’s (LLM) ability to find and understand detailed information within large context windows is a need-to-have these days.

The Needle in the Haystack test stands as a crucial benchmark for assessing large language models for such tasks.

In this article, I will present my independent analysis measuring context-based understanding of the top-tier LLMs from OpenAI and Google.

Which LLM should you use for long-context tasks?

What is a “Needle in the Haystack” Test? 🕵️‍♂️
A “Needle in the Haystack” test for large language models (LLMs) involves placing a specific piece of information (the “needle”) within an extensive chunk of unrelated text (the “haystack”).

The LLM is then tasked to respond to a query that requires extracting the needle.

Such a test is used to evaluate an LLM’s proficiency in context comprehension and information retrieval from long contexts.

Successfully replying to the query showcases a detailed understanding of the context, which is crucial for developing applications around context-based LLMs.

The integration of custom knowledge into LLMs is becoming increasingly popular — so-called Retrieval-Augmented Generation (RAG) systems.

If you want to read more about RAG systems, you can check out one of my previous articles.

RAG article: https://medium.com/@lars.chr.wiik/a-straightforward-guide-to-retrieval-augmented-generation-rag-0031bccece7f

To further push the trend of long context windows, Google recently announced the Gemini model’s new ability to input 1 million tokens for a single query!

[Image by ChatGPT showcasing an LLM finding the needle in a haystack]
Dataset 🔢
I developed a script designed to create “needle-in-the-haystack” datasets. This script enables me to input two key elements:

Context (Haystack): This is the text in which the unique information is inserted.
Unique Information (Needle): This is the specific piece of information, hidden within the large context, that needs to be identified.
The dataset generation process works as follows:

Starting Point Selection: The script begins by randomly choosing a starting point within the large text. This starting point falls somewhere between the 10th and 40th percentile of the entire text.
Needle Placement: The unique information (needle) is then inserted within the haystack. Its placement within the haystack is also randomized but is constrained to fall between the 20th and 80th percentile of the haystack’s length.
LLMs are generally known to most accurately recall the information at the START and END of the prompt.

Paper: See the paper from Stanford: “Lost in the Middle: How Language Models Use Long Contexts”.

This algorithm strategically places the needle within a specific percentile range of the context. This is to ensure that the evaluation captures the model’s capability to recognize and extract data from within the full scope of the text, and not just from the more easily remembered edges of the prompt.

Here is a code snippet of the dataset generation algorithm:

import random

def create_one_needle(num_chars: int, needle_line: str, lines: list[str]):
    # The start_position is a random place between the 10th and 40th percentile of the text
    rnd_place = random.randint(10, 40) / 100
    start_position = int(len(lines) * rnd_place)

    # The needle is placed between the 20th and 80th percentile of the selected text
    needle_rnd_place = random.randint(20, 80) / 100

    lines_selected = []
    placed = False
    chars_used = 0
    for line in lines[start_position:]:
        lines_selected += [line]
        chars_used += len(line)

        # place the needle
        if not placed and chars_used > num_chars * needle_rnd_place:
            lines_selected.append(needle_line)
            placed = True

        if chars_used > num_chars:
            break

    return lines_selected
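For completeness, here is a hypothetical invocation of the snippet above; the file name, needle sentence, and phone number are placeholders, not the exact ones used in this article:

# hypothetical usage; the source text, needle sentence, and number are placeholders
with open("harry_potter.txt", encoding="utf-8") as f:
    book_lines = f.read().splitlines()

needle = "The fictive phone number of Lars Wiik is 123-456-789."
haystack = "\n".join(create_one_needle(num_chars=4000, needle_line=needle, lines=book_lines))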
Evaluation Method 🧠
For the haystack, I used a book I loved as a child — Harry Potter.

And for the needle, I chose a fictive phone number belonging to Lars Wiik.

I created 100 haystacks for each context length — including character lengths of 1000, 2000, 4000, 8000, 12000, and 16000.

Here is an example of one of the haystacks with 1000 characters.

[Example of a haystack with 1000 characters with a needle (yellow) placed at the 80th percentile]
The different LLMs were then tasked to return the fictive phone number belonging to Lars Wiik. The replies were labeled according to whether they included the fictive phone number or not in the response.

The prompt I used looks as follows:

def create_needle_prompt(needle_text: str) -> str:
    prompt = f'''
##### INSTRUCTION #####
What is the fictive phone number to Lars Wiik according to the context?
Only provide me what I want, nothing else.
You can only respond with at max 20 words.


##### CONTEXT #####
{needle_text}
'''
    return prompt
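The labeling step described above (checking whether each reply contains the fictive number) can be sketched roughly as a small scoring loop; ask_llm below is a stand-in for whichever client is used to call each model and is not part of the article:

# illustrative scoring loop; ask_llm abstracts the per-model API call
def score_model(ask_llm, haystacks: list[str], phone_number: str) -> float:
    hits = 0
    for haystack in haystacks:
        reply = ask_llm(create_needle_prompt(haystack))
        if phone_number in reply:      # labeled correct only if the needle appears in the reply
            hits += 1
    return hits / len(haystacks)       # accuracy over the 100 haystacks at this context length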
Performance Results 📊
The following models were included in the evaluation:

gpt-4o-2024-05-13
gpt-4-turbo-2024-04-09
gpt-4-0613
gpt-3.5-turbo-0125
gemini-1.5-pro-preview-0514
gemini-1.5-flash-preview-0514
gemini-1.0-pro-002
The evaluation includes running each model through 100 different haystacks at each specific context length: 1k, 2k, 4k, 8k, 12k, and 16k.

Below is a line plot of the resulting accuracy graph:

[Graph showcasing LLM performance in the “Needle in the Haystack” task]
Note: You cannot see gpt-4o and gpt-4-0613 because they are hidden behind gpt-4-turbo-2024-04-09 with 100% accuracy!

The longer the context window, the harder it is to extract a specific piece of information because of more noise. Therefore, performance is expected to decrease with larger context windows.

As we can derive from the graph, there seems to be a distinction between OpenAI’s models and Google’s models in terms of performance.

Google’s models performed below my expectations, especially after their recent event (Google I/O 2024) where they talked warmly regarding Gemini’s memory and context understanding. All of Google’s models seem to plateau around 50% accuracy after 8k context length.

OpenAI’s models, on the other hand, perform noticeably well in this test, with gpt-4o, gpt-4-turbo-2024-04-09, and gpt-4-0613 as the top-performing models.

It should also be noted that gpt-3.5-turbo-0125 performs better than all Gemini models!

To validate that there was no trivial error in the evaluation, I stored all replies so I could go back and see what the LLMs actually responded.

Here are some of the responses from Gemini 1.5:

The provided context does not contain a phone number for Lars Wiik.

There is no mention of Lars Wiik or his phone number.

The provided text does not contain Lars Wiik's phone number.

The provided text does not mention Lars Wiik or his phone number.

There is no mention of Lars Wiik or his phone number.

The text does not provide Lars Wiik's phone number.

The text provided does not contain a fictive phone number for Lars Wiik.

I'm sorry, but the fictive phone number to Lars Wiik is not mentioned in the context you provided.
The Gemini model struggles to find the fictive phone number within the story of Harry Potter.

I have uploaded 10 random prompts using Gemini 1.5 with a 4k context window for anyone to reproduce. Copy the full prompt into whatever tool you use to run Gemini 1.5: Link to reproduce.

[Image of reproducing the Gemini 1.5 results in Vertex AI]
Here are some of the responses from OpenAI’s gpt-3.5-turbo-0125:

N/A

N/A

There is no fictive phone number to Lars Wiik in the provided context.

N/A

Platform nine and three-quarters.

No phone number provided for Lars Wiik.
Funny enough, the LLM once replied with “Platform nine and three-quarters” 😄

Disclaimer: It should be said that a dataset with 100 haystacks per context length is fairly small, and you should run your own tests for your specific use case to get a better estimate of which models perform best. Performance may also vary based on the use case.

Conclusion 💡
In conclusion, the “Needle in the Haystack” evaluation can be used to measure large language models' comprehension and information retrieval abilities when using long contexts.

In this analysis, we observed a performance disparity between OpenAI’s models and Google’s Gemini series — where OpenAI’s gpt-4, gpt-4o, and gpt-4-turbo scored the highest.

Despite Google’s recent enhancements with Gemini’s ability to handle up to 1 million tokens, it appears that OpenAI models have shown a more consistent ability to accurately retrieve specific information from large texts.

Note that for users and developers, the choice of model would likely depend on the specific needs of their application.

###
https://contextual.ai/introducing-rag2/
Introducing RAG 2.0
Contextual AI Team
March 19, 2024
Today, we’re announcing RAG 2.0, our approach for developing robust and reliable AI for enterprise-grade performance. Unlike the previous generation of RAG, which stitches together frozen models, vector databases, and poor quality embeddings, our system is optimized end to end. Using RAG 2.0, we’ve created our first set of Contextual Language Models (CLMs), which achieve state-of-the-art performance on a wide variety of industry benchmarks. CLMs outperform strong RAG baselines based on GPT-4 and the best open-source models by a large margin, according to our research and our customers.


Contextual Language Models, trained with RAG 2.0, perform significantly better than existing RAG systems across all of our benchmarks. Natural Questions (NQ), HotpotQA (HPQA), and TriviaQA use the exact match metric. Since HaluEvalQA and TruthfulQA require logits, GPT-4 cannot be evaluated directly on those tasks. Vanilla RAG is zero-shot; what we call RAG includes few-shot demonstrations, careful chunking, and manual prompt engineering. Significant effort was spent on strengthening the baselines.

In this blog post, we share our progress in building generative AI systems that go beyond demos to truly production-grade systems:

We introduce the distinction between RAG, which uses frozen off-the-shelf models, and RAG 2.0, which end-to-end optimizes the language model and retriever as a single system.
We demonstrate that RAG 2.0 achieves state-of-the-art performance on a wide variety of benchmarks, from open domain question-answering to faithfulness, significantly outperforming existing RAG approaches.
We highlight even bigger gains for RAG 2.0 on real-world customer workloads and discuss its viability in production.
We’re excited to build with you on RAG 2.0 — join our waitlist today.

Why RAG 2.0?
Language models struggle with knowledge-intensive tasks because they are limited by the information they have been exposed to during training. In 2020, our co-founder and CEO Douwe Kiela and his team at Facebook AI Research introduced Retrieval-Augmented Generation (RAG) to mitigate this problem, by augmenting a language model with a retriever to access data from external sources (e.g. Wikipedia, Google, internal company documents).

A typical RAG system today uses a frozen off-the-shelf model for embeddings, a vector database for retrieval, and a black-box language model for generation, stitched together through prompting or an orchestration framework. This leads to a “Frankenstein’s monster” of generative AI: the individual components technically work, but the whole is far from optimal. These systems are brittle, lack any machine learning or specialization to the domain they are being deployed to, require extensive prompting, and suffer from cascading errors. As a result, RAG systems rarely pass the production bar.

The RAG 2.0 approach pretrains, fine-tunes, and aligns all components as a single integrated system, backpropagating through both the language model and the retriever to maximize performance:



The history of deep learning has repeatedly shown that end-to-end optimization outperforms hand-tuned systems. We apply this approach to move beyond the limitations of RAG and have developed RAG 2.0. To sum it up: if you know that you are going to be doing RAG, you should train the system for doing RAG.

RAG 2.0 Benchmarks
We compared Contextual Language Models (CLMs) with frozen RAG systems across a variety of axes:

Open domain question answering: We use the canonical Natural Questions (NQ) and TriviaQA datasets to test each model’s ability to correctly retrieve relevant knowledge and accurately generate an answer. We also evaluate models on the HotpotQA (HPQA) dataset in the single-step retrieval setting. All datasets use the exact match (EM) metric.
Faithfulness: HaluEvalQA and TruthfulQA are used to measure each model’s ability to remain grounded in retrieved evidence and avoid hallucinations.
Freshness: We measure the ability of each RAG system to generalize to fast-changing world knowledge using a web search index and show accuracy on the recent FreshQA benchmark.
Each of these axes is important for building production-grade RAG systems. We show that CLMs significantly improve performance over a variety of strong frozen RAG systems built using GPT-4 or state-of-the-art open-source models like Mixtral.


Results across knowledge-intensive benchmarks. Both our vanilla RAG and standard RAG baselines use a frozen search index, reranking, and an off-the-shelf language model. For our RAG baselines, we use a few-shot setup with hand-tuned prompts to showcase how these changes can lead to large improvements in downstream task performance over our vanilla zero-shot RAG setup. Our HotpotQA evaluation uses the split released with the KILT benchmark and EM metric. HaluEvalQA uses zero-shot binary accuracy based on log probabilities and only evaluates the faithfulness of the language model given a ground truth context document. TruthfulQA uses the MC1 metric.

We trained and deployed our RAG 2.0 models on the latest generation of ML infrastructure from Google Cloud. Using A3 instances with H100 GPUs and the latest TCPx networking stack, we were able to train RAG 2.0 models at scale to achieve state-of-the-art accuracy.

Applying RAG 2.0 in the wild
CLMs achieve even bigger gains over current approaches when applied to real world data, as we have seen with our early customers.

Taking FinanceBench as an illustrative proxy (to maintain the confidentiality of our customers’ data), we can see that CLMs outperform frozen RAG systems even on finance-specific open book question answering — and have seen similar gains in other specialized domains such as law and hardware engineering.



RAG 2.0 and long context windows
When evaluating real world implementations, some may wonder how RAG 2.0 compares to the latest models with long context windows — so we dove into this as well.

Long context models are typically evaluated with “Needle in a Haystack” benchmarks wherein a “needle” (i.e., a fact) is hidden within a large “haystack” (i.e., a corpus of text), and models are evaluated with a query that aims to elicit the particular needle. In an effort to meaningfully compare frozen RAG and Contextual Language Models, we adapt the recent Biographies benchmark by creating a non-repeated haystack of 2M tokens. Using a test set of 100+ biographical questions, we evaluate CLM, Frozen-RAG, and GPT-4-Turbo (only up to 32K tokens) with haystacks ranging from 2K to 2M tokens.


What we see is that RAG 2.0 outperforms, especially if you hope to scale: RAG 2.0 is higher in accuracy and uses substantially less compute compared to long context language models, a difference that becomes meaningful in production.

Build on RAG 2.0 with us
We believe it takes an end-to-end solution to unleash the full potential of generative AI in the enterprise. We are thrilled about the results we’re already seeing with RAG 2.0 and can’t wait to bring it to more leading enterprises.

Fortune 500s and unicorns alike are already building on RAG 2.0 today with Contextual; they are leveraging CLMs and our latest fine-tuning and alignment techniques (such as GRIT, KTO, and LENS) on the Contextual platform to deploy generative AI they can trust in production.

Ready to move beyond demos and use AI in production? We’re actively prioritizing onboarding from our waitlist. If you’re eager to innovate with RAG 2.0, reach out at rag2@contextual.ai and tell us a bit about your use case, or join our waitlist.

Psst, we’re also hiring! If you want to join a world-class team to change the way the world works one workflow at a time, please check out our Careers page.

###
https://www.youtube.com/watch?app=desktop&v=y268jrtjako&t=1s

AI Success Depends on the CFO, Not IT | Gartner Finance Keynote
The #GartnerFinance opening keynote, "AI Stalls", covers:
- Possible AI stalls that organizations should be aware of during their journey to leverage AI to maximize business value
- A simple definition of AI applicable to executive leaders
- A high-level, nontechnical consideration of how to think about the cost of AI and how it differs from other technology costs
The keynote was aimed at CFOs, but the topic is highly relevant to CIOs and other executive leaders as well.
Enterprise AI spending and adoption are set to accelerate, and CFOs are responsible for effectively managing the costs of how their organizations use this transformative technology, as well as common problems such as cost overruns, misuse in decision-making, loss of trust, and rigid mindsets.
In this keynote, Gartner VP Nisha Bhandare and senior analyst Clement Christensen provide a comprehensive framework for classifying AI initiatives across the enterprise, evaluating their value, and proactively establishing leadership in this area.

###
https://arxiv.org/abs/2405.17247
META
An Introduction to Vision-Language Modeling
Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, the vision-language model (VLM) applications will significantly impact our relationship with technology. However, there are many challenges that need to be addressed to improve the reliability of those models. While language is discrete, vision evolves in a much higher dimensional space in which concepts cannot always be easily discretized. To better understand the mechanics behind mapping vision to language, we present this introduction to VLMs which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them. Then, we present and discuss approaches to evaluate VLMs. Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2405.17247 [cs.LG]
(or arXiv:2405.17247v1 [cs.LG] for this version)


###
https://huggingface.co/papers/2402.01030
Executable Code Actions Elicit Better LLM Agents
Published on Feb 2
Authors: Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji
Abstract
Large Language Model (LLM) agents, capable of performing a broad range of actions, such as invoking tools and controlling robots, show great potential in tackling real-world challenges. LLM agents are typically prompted to produce actions by generating JSON or text in a pre-defined format, which is usually limited by constrained action space (e.g., the scope of pre-defined tools) and restricted flexibility (e.g., inability to compose multiple tools). This work proposes to use executable Python code to consolidate LLM agents' actions into a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct can execute code actions and dynamically revise prior actions or emit new actions upon new observations through multi-turn interactions. Our extensive analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that CodeAct outperforms widely used alternatives (up to 20% higher success rate). The encouraging performance of CodeAct motivates us to build an open-source LLM agent that interacts with environments by executing interpretable code and collaborates with users using natural language. To this end, we collect an instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn interactions using CodeAct. We show that it can be used with existing data to improve models in agent-oriented tasks without compromising their general capability. CodeActAgent, finetuned from Llama2 and Mistral, is integrated with Python interpreter and uniquely tailored to perform sophisticated tasks (e.g., model training) using existing libraries and autonomously self-debug.
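A minimal sketch of the CodeAct idea described in the abstract: the agent's action is a Python snippet, the snippet is executed, and the interpreter's output (or error) is fed back as the next observation. This is an illustrative skeleton rather than the authors' implementation; llm abstracts the underlying model call and no sandboxing is shown.

import io, contextlib

def run_code_action(code: str, env: dict) -> str:
    """Execute a model-proposed Python snippet and return its printed output or error."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, env)                      # toy interpreter step; no sandboxing here
    except Exception as e:
        return f"Error: {e!r}"                   # errors are fed back as observations too
    return buf.getvalue()

def codeact_loop(llm, task: str, max_turns: int = 5) -> str:
    env, transcript = {}, f"Task: {task}\n"
    for _ in range(max_turns):
        action = llm(transcript)                 # the model emits executable Python as its action
        observation = run_code_action(action, env)
        transcript += f"Action:\n{action}\nObservation:\n{observation}\n"
    return transcript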

###
https://mistral.ai/news/codestral/
Codestral: Hello, World!
Empowering developers and democratising coding with Mistral AI.

May 29, 2024 Mistral AI team
We introduce Codestral, our first-ever code model. Codestral is an open-weight generative AI model explicitly designed for code generation tasks. It helps developers write and interact with code through a shared instruction and completion API endpoint. As it masters code and English, it can be used to design advanced AI applications for software developers.

A model fluent in 80+ programming languages
Codestral is trained on a diverse dataset of 80+ programming languages, including the most popular ones, such as Python, Java, C, C++, JavaScript, and Bash. It also performs well on more specific ones like Swift and Fortran. This broad language base ensures Codestral can assist developers in various coding environments and projects.

Codestral saves developers time and effort: it can complete coding functions, write tests, and complete any partial code using a fill-in-the-middle mechanism. Interacting with Codestral will help level up the developer’s coding game and reduce the risk of errors and bugs.

Setting the Bar for Code Generation Performance
Performance. As a 22B model, Codestral sets a new standard on the performance/latency space for code generation compared to previous models used for coding.

Figure 1: With its larger context window of 32k (compared to 4k, 8k or 16k for competitors), Codestral outperforms all other models in RepoBench, a long-range eval for code generation.

We compare Codestral to existing code-specific models with higher hardware requirements.

Python. We use four benchmarks: HumanEval pass@1, MBPP sanitised pass@1 to evaluate Codestral’s Python code generation ability, CruxEval to evaluate Python output prediction, and RepoBench EM to evaluate Codestral’s Long-Range Repository-Level Code Completion.

SQL. To evaluate Codestral’s performance in SQL, we used the Spider benchmark.

[Figure: detailed benchmarks]
Additional languages. We also evaluated Codestral's HumanEval pass@1 performance across six languages in addition to Python: C++, bash, Java, PHP, Typescript, and C#, and calculated the average of these evaluations.

[Figure: detailed benchmarks]
FIM benchmarks. Codestral's Fill-in-the-middle performance was assessed using HumanEval pass@1 in Python, JavaScript, and Java and compared to DeepSeek Coder 33B, whose fill-in-the-middle capacity is immediately usable.

Get started with Codestral
Download and test Codestral.
Codestral is a 22B open-weight model licensed under the new Mistral AI Non-Production License, which means that you can use it for research and testing purposes. Codestral can be downloaded on HuggingFace.

Use Codestral via its dedicated endpoint
With this release comes the addition of a new endpoint: codestral.mistral.ai. This endpoint should be preferred by users who use our Instruct or Fill-In-the-Middle routes inside their IDE. The API Key for this endpoint is managed at the personal level and isn’t bound by the usual organization rate limits. We’re allowing use of this endpoint for free during a beta period of 8 weeks and are gating it behind a waitlist to ensure a good quality of service. This endpoint should be preferred by developers implementing IDE plugins or applications where customers are expected to bring their own API keys.
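For orientation, here is a minimal sketch of calling the dedicated endpoint with requests. The route (/v1/chat/completions), model name (codestral-latest), and payload shape are assumptions based on Mistral's standard chat API rather than details stated in this post, so check the official documentation before relying on them.

import os, requests

# assumed route and payload, mirroring Mistral's standard chat-completions API
resp = requests.post(
    "https://codestral.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['CODESTRAL_API_KEY']}"},
    json={
        "model": "codestral-latest",   # assumed model identifier
        "messages": [{"role": "user", "content": "Write a Python function that reverses a string."}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])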

Build with Codestral on La Plateforme
Codestral is also immediately available on the usual API endpoint: api.mistral.ai where queries are billed per tokens. This endpoint and integrations are better suited for research, batch queries or third-party application development that exposes results directly to users without them bringing their own API keys.

You can create your account on La Plateforme and start building your applications with Codestral by following this guide. Like all our other models, Codestral is available in our self-deployment offering starting today: contact sales.

Talk to Codestral on le Chat
We’re exposing an instructed version of Codestral, which is accessible today through Le Chat, our free conversational interface. Developers can interact with Codestral naturally and intuitively to leverage the model's capabilities. We see Codestral as a new stepping stone towards empowering everyone with code generation and understanding.

Use Codestral in your favourite coding and building environment.
We worked with community partners to expose Codestral to popular tools for developer productivity and AI application-making.

Application frameworks. Codestral is integrated into LlamaIndex and LangChain starting today, which allows users to build agentic applications with Codestral easily.

VSCode/JetBrains integration. Continue.dev and Tabnine are empowering developers to use Codestral within the VSCode and JetBrains environments and now enable them to generate and chat with the code using Codestral.

Here is how you can use the Continue.dev VSCode plugin for code generation, interactive conversation, and inline editing with Codestral, and here is how users can use the Tabnine VSCode plugin to chat with Codestral.

For detailed information on how various integrations work with Codestral, please check our documentation for set-up instructions and examples.

###
https://research.google/blog/few-shot-tool-use-doesnt-really-work-yet/

Few-shot tool-use doesn’t really work (yet)
May 30, 2024

Alon Jacovi, Research Scientist, Google Research

Instructing language models to use tools based on few demonstrations, while a popular approach, is not as effective as initially thought.

Large language models (LLMs) are being used more and more frequently to answer queries requiring up-to-date knowledge or intricate computations (for example, “Who was born earlier: X or Y?” or “What would be my mortgage under these conditions?”). An especially popular strategy for answering such questions is tool-use, that is, augmenting models with new capabilities (e.g., calculators and code interpreters) and external knowledge (e.g., Wikipedia and search engines). For a language model to “use tools” means for the model to generate specific words that automatically invoke an external tool with a query, wherein the tool’s output is given back to the model to use as input. For example, generating “Calculate(1 + 2)” will invoke a calculator on the input “1 + 2” and return its output “3” for further use by the model. In this way, language models can also use retrieval systems (such as retrieval-augmented generation, i.e., RAG). The tools can “make up” for inherent weaknesses of language models (such as outdated parameterized knowledge and lack of symbolic operation ability).
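As a toy illustration of the mechanism just described (not code from any of the cited papers), a generated string like "Calculate(1 + 2)" can be intercepted with a simple pattern match and replaced by the tool's output before the text is handed back to the model:

import re

TOOL_CALL = re.compile(r"Calculate\(([^)]*)\)")

def run_calculator_calls(generation: str) -> str:
    """Replace each Calculate(expr) in the model's output with the calculator's result."""
    def evaluate(match: re.Match) -> str:
        # restrict eval to bare arithmetic for this sketch
        return str(eval(match.group(1), {"__builtins__": {}}, {}))
    return TOOL_CALL.sub(evaluate, generation)

print(run_calculator_calls("The answer is Calculate(1 + 2)."))   # -> "The answer is 3."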

In the few-shot setting, by using in-context learning, the model is augmented with tools by inserting tool-use demonstrations into the prompt. There is a wide variety of proposed methods to instruct models in few-shot settings to use tools. These “tool-use strategies” claim to easily and cheaply improve performance (e.g., Self-Ask, RARR, ReAct, and Art, among others) — they allow us to define and designate tools ad-hoc without additional training, update our tools and tool APIs on the fly, and so on.

However, there are a variety of methods for achieving this — for one example, it’s possible for a model to call the tool during or after answer generation (visualized below). Since this area of research is very recent, comparisons between the various methods have not been studied. Thus, it is unclear which methods are better than others, what the trade-offs are, and how they compare to other strategies that don’t use tools at all.

Illustration of different methods of integrating tools with LMs. It’s possible for the model to call the tool while generating its answer, or after generating its answer, and this choice has different implications for efficiency and performance.

In “A Comprehensive Evaluation of Tool-Assisted Generation Strategies”, we undertake a large-scale evaluation of many different tool-use algorithms. Our main question is: Does few-shot tool assistance work? Surprisingly, we found that it generally does not perform better than an LM operating without tools! Additionally, we found significant differences in efficiency between algorithms and a large variance in results depending on the experiment parameters, suggesting a need to require more thorough evaluation schemes to derive reliable insights. Below we highlight the key analyses, across a variety of settings.

How effective is few-shot tool use in practice?
We ran comprehensive evaluations, conducting over 340 experiments with different tools, models, prompts, demonstrations, strategies, and so on. We took extra care to design representative evaluations with strong but realistic no-tool baselines (such as letting the LM emulate the tool for every strategy).

Below are three examples of some of the tool-use strategies that we evaluated. SelfAsk uses natural-sounding instructions to prompt the model to decompose the question into simpler questions, and each simpler question is then answered using a retrieval tool. Inline (e.g., Toolformer) is more directly inspired by programming, treating tools as functions that are called with a keyword and input in brackets, to accomplish the same goal of decomposing the question into simple sub-questions. Finally, RARR uses an extensive chain of prompts to generate sub-questions, use a tool, validate its output, and rephrase it to give an answer.

Various strategies for demonstrating tool-use to models with in-context learning. In the examples above, the model is using a question-retrieval system as a tool to retrieve information about Muhammad Ali and Alan Turing. For more examples, see Figure 2 in the paper.

The results were clear: in almost all settings of popular academic question-answering (QA) benchmarks, there was no improvement from using tools compared to not using tools.

Evaluation results comparing tool-using LMs with standard LMs, for various models (Flan-PaLM, Flan-UL2, and GPT-3) and tasks (DROP, GSM8K, MuSiQue, and StrategyQA). The score refers to each dataset’s common scoring metric (standard accuracy for DROP, GSM8K and StrategyQA; F1 for MuSiQue).

A popular hypothesis, or common wisdom, is that tools can help LMs perform better on harder examples, like examples that have rare entities or difficult calculations, since LMs find such cases difficult. We detected such examples by using Wikipedia data and numerical ranges. But we found no significant improvement there, either: in the charts below, scores with tools were higher neither for rarer entities (shown in the top row) nor for difficult calculations (bottom row).

Evaluation results comparing tool-using LMs with standard LMs, for various models and tasks, for different measures of example difficulty.

What’s the best way to use tools?
Next, we ran some additional comparative tests: For example, as mentioned above, is it better to instruct the LM to use tools during its answer generation, or to verify and edit its own answer with tools after it has been generated? We compared the two in a variety of settings.

We found that for mathematical settings with a calculator tool, the two strategies were comparable, but for knowledge-seeking tasks with a retrieval tool (such as a search engine), editing the answer after it was generated was measurably better.

Evaluation results comparing tool-use during generation (“without refinement”), and tool-use to fix generated content (“with refinement”).

Not just performance: What about efficiency?
The final question we examined was about the efficiency of various strategies. Often, different methods of tool-use are evaluated by their performance, but we wanted to know how they compare in terms of their computational efficiency, and measure the trade-off — if it exists — between the two. If all else is equal between two strategies for tool-use, then an easy way to compare their efficiency is to compare how many tokens (pieces of words or characters) they require in the prompt, and how many extra tokens they generate above the baseline. The baseline in this case is the same model without any tool-use strategies. In this way, the efficiency of different tool-use strategies can be directly compared to each other.

We found that overall, there were significant differences in efficiency between various strategies. For example, certain methods cost 2× or 3× as much as others, and as much as 10× more than using no tools at all. These significant multipliers in cost do not necessarily translate into increased performance, which shows just how important it is to also measure efficiency. Please refer to the paper for the full calculations and results for this conclusion.
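Counting the extra prompt and generation tokens a strategy uses relative to the no-tool baseline is straightforward to sketch; the snippet below uses tiktoken, where the tokenizer choice is an assumption (any consistent tokenizer works for a relative comparison) and this is not necessarily how the paper measured cost:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # assumed tokenizer; only relative counts matter here

def extra_tokens(strategy_prompt: str, strategy_output: str,
                 baseline_prompt: str, baseline_output: str) -> int:
    """Token overhead of a tool-use strategy over the same model run without any tool strategy."""
    strategy = len(enc.encode(strategy_prompt)) + len(enc.encode(strategy_output))
    baseline = len(enc.encode(baseline_prompt)) + len(enc.encode(baseline_output))
    return strategy - baseline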

Call to action: How should we properly evaluate few-shot LMs with tools?
Throughout this large-scale evaluation, we surfaced some lessons about how to more reliably evaluate LMs in few-shot settings, especially for tool-use and RAG comparisons. Here are five key pitfalls and our corresponding recommendations:

Coupling the tool-use strategy and the tool together — comparisons of tool-use strategies should use the same tools across strategies.
Forcing no-tool baselines to the framework of the tool-use strategy — the optimal way to solve the task without tools may be different to optimally solving the task with tools: No-tool baselines should include multiple variants of both free-form and structured strategies, to ensure the tool-use variants are not given an advantage.
Using one model across all comparisons — different models may behave differently when it comes to using tools effectively, based on their training data. Multiple models should be tested.
Using one prompt and set of demonstrations across all comparisons — multiple different sets of demonstrations and prompts should be used to get reliable estimates of few-shot performance.
Not considering tool-use strategy costs — tool-use strategies can be efficient or inefficient with regards to the extra tokens they require to work. The differences can be significant. Comparisons of strategies should factor the computation cost of the strategy.
If you are working on novel few-shot methods, with tool-use, RAG, or otherwise, consider these lessons when designing your evaluations!

Conclusion
Overall, we found that few-shot tool assistance, without explicitly training models to use tools, is a difficult and unsolved problem, with significant costs. This is in contrast to their commonly perceived value as an easy and cheap solution to augment LMs with tools, such as retrieval or calculation. Beyond few-shot strategies, training models to use tools seems to be more promising (and a popular paradigm in recent months — such as with Gemini, GPT-4 and Command-R).

###
https://arxiv.org/abs/2405.18357
Faithful Logical Reasoning via Symbolic Chain-of-Thought
Jundong Xu, Hao Fei, Liangming Pan, Qian Liu, Mong-Li Lee, Wynne Hsu
While the recent Chain-of-Thought (CoT) technique enhances the reasoning ability of large language models (LLMs) with the theory of mind, it might still struggle in handling logical reasoning that relies much on symbolic expressions and rigid deducing rules. To strengthen the logical reasoning capability of LLMs, we propose a novel Symbolic Chain-of-Thought, namely SymbCoT, a fully LLM-based framework that integrates symbolic expressions and logic rules with CoT prompting. Technically, building upon an LLM, SymbCoT 1) first translates the natural language context into the symbolic format, and then 2) derives a step-by-step plan to solve the problem with symbolic logical rules, 3) followed by a verifier to check the translation and reasoning chain. Via thorough evaluations on 5 standard datasets with both First-Order Logic and Constraint Optimization symbolic expressions, SymbCoT shows striking improvements over the CoT method consistently, meanwhile refreshing the current state-of-the-art performances. We further demonstrate that our system advances in more faithful, flexible, and explainable logical reasoning. To our knowledge, this is the first to combine symbolic expressions and rules into CoT for logical reasoning with LLMs. Code is open at this https URL.
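A rough sketch of the three SymbCoT stages as successive prompts (translator, planner/solver, verifier). The prompt wording is paraphrased from the abstract rather than taken from the paper, and llm abstracts whatever model client is used.

# illustrative pipeline only; prompts are paraphrases, not the paper's
def symbcot(llm, context: str, question: str) -> str:
    symbolic = llm(
        "Translate the following context and question into first-order logic.\n"
        f"Context: {context}\nQuestion: {question}"
    )
    reasoning = llm(
        "Derive a step-by-step plan and solve the problem using symbolic logical rules.\n"
        + symbolic
    )
    answer = llm(
        "Verify the translation and each reasoning step, then state the final answer.\n"
        + symbolic + "\n" + reasoning
    )
    return answer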