Summary
NVIDIA announced the Nemotron-4 340B model family, consisting of Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward, released as open-access models for research and commercial applications. Google Research introduced Human I/O, a system that uses large language models to detect situational impairments. Lamini announced a new memory tuning method that raises factual accuracy to 95%.
Nemotron-4 340B,
Link, June 14, 2024,
NVIDIA
- Announced the Nemotron-4 340B model family (Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, Nemotron-4-340B-Reward)
- Released as open access under the NVIDIA Open Model License Agreement
- Sized to be deployed in FP8 precision on a single DGX H100 with 8 GPUs
- Over 98% of the data used in the model alignment process was synthetically generated
- The synthetic data generation pipeline is also open-sourced to support model development
Human I/O: Detecting situational impairments with large language models,
Link, June 14, 2024,
Google Research
- Introduced Human I/O, a system for detecting situationally induced impairments and disabilities (SIIDs)
- Uses egocentric vision, multimodal sensing, and LLM reasoning
- Assesses a user's ability to interact with 82% accuracy across 60 in-the-wild egocentric video recordings spanning 32 scenarios
- Predicts the availability of the user's vision, hearing, voice, and hands so devices can adapt
Lamini Memory Tuning: 95% LLM Accuracy, 10x Fewer Hallucinations,
Link, June 14, 2024,
Lamini
- Announced a new memory tuning method for embedding facts into LLMs
- Raises factual accuracy to 95% and cuts hallucinations roughly 10x (from 50% to 5%)
- Inspired by information retrieval: only the most relevant memory experts are selected from an index at inference time
- For a Fortune 500 customer, a text-to-SQL agent's accuracy improved from 50% to 95%
HelpSteer2: Open-source dataset for training top-performing reward models,
Link, June 12, 2024,
Zhilin Wang et al.
- Released the HelpSteer2 preference dataset under a CC-BY-4.0 license
- Consists of only 10,000 response pairs, an order of magnitude fewer than existing preference datasets
- Achieved the SOTA score (92.0%) on Reward-Bench's primary dataset
- Proposes SteerLM 2.0, a model alignment method that exploits the reward model's multi-attribute scores
ChatGPT is bullshit,
Link, June 14, 2024,
Michael Townsen Hicks et al.
- Argues that large language models are better described as producing "bullshit" because they are indifferent to the truth of their outputs
- Claims this framing is a more useful and accurate way to predict and discuss AI misrepresentations
Multi AI Agent Systems 101,
Link, June 17, 2024,
Mariya Mansurova
- Gives an overview of multi-AI-agent systems
- Introduces frameworks such as AutoGen, LangGraph, and CrewAI
- Walks through an example of automating data source management with CrewAI
Sources
This GPT assists users by creating a detailed daily newspaper in Korean based on provided links. It follows these steps: read the content, summarize each item with detailed points, and write a report. The report format is: (today's date in 년 월 일) AI 소식,
Summary
(overall short summary, make summary with good details. for Summary section, explain the details starting with company name, e.g. OpenAI에서는 ~~~를 발표하였습니다.)
Title,
한글제목
링크, date,
company name
- detailed summary1, (개조식 문체 사용)
- detailed summary2, (개조식 문체 사용)
… - detailed summary N, (개조식 문체 사용)
Title,
한글제목
링크, date,
company name
- detailed summary1, (개조식 문체 사용)
- detailed summary2, (개조식 문체 사용)
… - detailed summary N, (개조식 문체 사용)
…
###
https://research.nvidia.com/publication/2024-06_nemotron-4-340b
Nemotron-4 340B
We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows the distribution, modification, and use of the models and their outputs. These models perform competitively to open access models on a wide range of evaluation benchmarks, and were sized to fit on a single DGX H100 with 8 GPUs when deployed in FP8 precision. We believe that the community can benefit from these models in various research studies and commercial applications, especially for generating synthetic data to train smaller language models. Notably, over 98% of data used in our model alignment process is synthetically generated, showcasing the effectiveness of these models in generating synthetic data. To further support open research and facilitate model development, we are also open-sourcing the synthetic data generation pipeline used in our model alignment process.
Publication Date: Friday, June 14, 2024
Research Area: Artificial Intelligence and Machine Learning; Generative AI
Model Overview
Nemotron-4-340B-Instruct is a large language model (LLM) that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs. It is a fine-tuned version of the Nemotron-4-340B-Base model, optimized for English-based single and multi-turn chat use-cases. It supports a context length of 4,096 tokens.
The base model was pre-trained on a corpus of 9 trillion tokens consisting of a diverse assortment of English-based texts, 50+ natural languages, and 40+ coding languages. Subsequently, the Nemotron-4-340B-Instruct model went through additional alignment steps including:
Supervised Fine-tuning (SFT)
Direct Preference Optimization (DPO)
Reward-aware Preference Optimization (RPO) (Additional in-house alignment technique)
Throughout the alignment process, we relied on only approximately 20K human-annotated data while our data generation pipeline synthesized over 98% of the data used for supervised fine-tuning and preference fine-tuning (DPO & RPO). We provide comprehensive details about our synthetic data generation pipeline in the technical report.
This results in a model that is aligned for human chat preferences, improvements in mathematical reasoning, coding and instruction-following, and is capable of generating high quality synthetic data for a variety of use cases.
Under the NVIDIA Open Model License, NVIDIA confirms:
Models are commercially usable.
You are free to create and distribute Derivative Models.
NVIDIA does not claim ownership to any outputs generated using the Models or Derivative Models.
License:
NVIDIA Open Model License
Intended use
Nemotron-4-340B-Instruct is a chat model intended for use for the English language.
Nemotron-4-340B-Instruct is designed for Synthetic Data Generation to enable developers and enterprises for building and customizing their own large language models and LLM applications.
The instruct model itself can be further customized using the NeMo Framework suite of customization tools including Parameter-Efficient Fine-Tuning (P-tuning, Adapters, LoRA, and more), and Model Alignment (SFT, SteerLM, RLHF, and more) using NeMo-Aligner.
Model Developer: NVIDIA
Model Dates: Nemotron-4-340B-Instruct was trained between December 2023 and May 2024.
Data Freshness: The pretraining data has a cutoff of June 2023.
Required Hardware
BF16 Inference:
8x H200 (1x H200 node)
16x H100 (2x H100 nodes)
16x A100 80GB (2x A100 80GB nodes)
Model Architecture:
Nemotron-4-340B-Instruct is a standard decoder-only Transformer, trained with a sequence length of 4096 tokens, and uses Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE).
Architecture Type: Transformer Decoder (auto-regressive language model)
Network Architecture: Nemotron-4
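Since the model card names Grouped-Query Attention, here is a toy numpy sketch of GQA as generally defined (not NVIDIA's implementation; the head counts and dimensions below are made up) showing how several query heads share each key/value head:
import numpy as np

def gqa(q, k, v):
    # q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d) with fewer KV heads than query heads
    n_q_heads, seq, d = q.shape
    group = n_q_heads // k.shape[0]
    # Each group of query heads shares the same key/value head.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    # Causal mask: a position may only attend to itself and earlier positions.
    scores = np.where(np.triu(np.ones((seq, seq), dtype=bool), k=1), -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

q = np.random.randn(8, 4, 16)   # 8 query heads
k = np.random.randn(2, 4, 16)   # 2 shared key/value heads
v = np.random.randn(2, 4, 16)
print(gqa(q, k, v).shape)       # (8, 4, 16)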
###
https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/
NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models
Nemotron-4 340B, a family of models optimized for NVIDIA NeMo and NVIDIA TensorRT-LLM, includes cutting-edge instruct and reward models, and a dataset for generative AI training.
June 14, 2024 by Ankit Patel
NVIDIA today announced Nemotron-4 340B, a family of open models that developers can use to generate synthetic data for training large language models (LLMs) for commercial applications across healthcare, finance, manufacturing, retail and every other industry.
High-quality training data plays a critical role in the performance, accuracy and quality of responses from a custom LLM — but robust datasets can be prohibitively expensive and difficult to access.
Through a uniquely permissive open model license, Nemotron-4 340B gives developers a free, scalable way to generate synthetic data that can help build powerful LLMs.
The Nemotron-4 340B family includes base, instruct and reward models that form a pipeline to generate synthetic data used for training and refining LLMs. The models are optimized to work with NVIDIA NeMo, an open-source framework for end-to-end model training, including data curation, customization and evaluation. They’re also optimized for inference with the open-source NVIDIA TensorRT-LLM library.
Nemotron-4 340B can be downloaded now from Hugging Face. Developers will soon be able to access the models at ai.nvidia.com, where they’ll be packaged as an NVIDIA NIM microservice with a standard application programming interface that can be deployed anywhere.
Navigating Nemotron to Generate Synthetic Data
LLMs can help developers generate synthetic training data in scenarios where access to large, diverse labeled datasets is limited.
The Nemotron-4 340B Instruct model creates diverse synthetic data that mimics the characteristics of real-world data, helping improve data quality to increase the performance and robustness of custom LLMs across various domains.
Then, to boost the quality of the AI-generated data, developers can use the Nemotron-4 340B Reward model to filter for high-quality responses. Nemotron-4 340B Reward grades responses on five attributes: helpfulness, correctness, coherence, complexity and verbosity. It’s currently first place on the Hugging Face RewardBench leaderboard, created by AI2, for evaluating the capabilities, safety and pitfalls of reward models.
[Diagram: Nemotron synthetic data generation pipeline]
In this synthetic data generation pipeline, (1) the Nemotron-4 340B Instruct model is first used to produce synthetic text-based output. An evaluator model, (2) Nemotron-4 340B Reward, then assesses this generated text — providing feedback that guides iterative improvements and ensures the synthetic data is accurate, relevant and aligned with specific requirements.
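A minimal Python sketch of this generate-then-score loop, purely to illustrate the control flow: generate and score are hypothetical stand-ins for however you serve the Instruct and Reward models, and the threshold and candidate count are made-up examples.
def synthesize_training_data(prompts, generate, score, n_candidates=4, threshold=3.5):
    kept = []
    for prompt in prompts:
        # (1) The Instruct model proposes several candidate responses.
        candidates = [generate(prompt) for _ in range(n_candidates)]
        # (2) The Reward model grades each candidate on helpfulness, correctness,
        #     coherence, complexity and verbosity; keep the best one if it clears
        #     a quality bar.
        scored = [(sum(score(prompt, c).values()) / 5, c) for c in candidates]
        best_score, best = max(scored, key=lambda pair: pair[0])
        if best_score >= threshold:
            kept.append({"prompt": prompt, "response": best})
    return kept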
Researchers can also create their own instruct or reward models by customizing the Nemotron-4 340B Base model using their proprietary data, combined with the included HelpSteer2 dataset.
Fine-Tuning With NeMo, Optimizing for Inference With TensorRT-LLM
Using open-source NVIDIA NeMo and NVIDIA TensorRT-LLM, developers can optimize the efficiency of their instruct and reward models to generate synthetic data and to score responses.
All Nemotron-4 340B models are optimized with TensorRT-LLM to take advantage of tensor parallelism, a type of model parallelism in which individual weight matrices are split across multiple GPUs and servers, enabling efficient inference at scale.
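As a rough illustration of the tensor-parallel idea described above (and nothing like TensorRT-LLM's actual kernels or communication), a column-split matrix multiply can be computed shard by shard and then gathered:
import numpy as np

def column_parallel_matmul(x, weight, n_devices):
    # Each "device" holds one column block of the weight matrix...
    shards = np.array_split(weight, n_devices, axis=1)
    # ...computes its partial output independently...
    partials = [x @ w_shard for w_shard in shards]
    # ...and the results are concatenated (the all-gather step).
    return np.concatenate(partials, axis=1)

x = np.random.randn(4, 512)
w = np.random.randn(512, 2048)
assert np.allclose(column_parallel_matmul(x, w, n_devices=8), x @ w)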
Nemotron-4 340B Base, trained on 9 trillion tokens, can be customized using the NeMo framework to adapt to specific use cases or domains. This fine-tuning process benefits from extensive pretraining data and yields more accurate outputs for specific downstream tasks.
A variety of customization methods are available through the NeMo framework, including supervised fine-tuning and parameter-efficient fine-tuning methods such as low-rank adaptation, or LoRA.
To boost model quality, developers can align their models with NeMo Aligner and datasets annotated by Nemotron-4 340B Reward. Alignment is a key step in training LLMs, where a model’s behavior is fine-tuned using algorithms like reinforcement learning from human feedback (RLHF) to ensure its outputs are safe, accurate, contextually appropriate and consistent with its intended goals.
Businesses seeking enterprise-grade support and security for production environments can also access NeMo and TensorRT-LLM through the cloud-native NVIDIA AI Enterprise software platform, which provides accelerated and efficient runtimes for generative AI foundation models.
Evaluating Model Security and Getting Started
The Nemotron-4 340B Instruct model underwent extensive safety evaluation, including adversarial tests, and performed well across a wide range of risk indicators. Users should still perform careful evaluation of the model’s outputs to ensure the synthetically generated data is suitable, safe and accurate for their use case.
For more information on model security and safety evaluation, read the model card.
Download Nemotron-4 340B models via Hugging Face. For more details, read the research papers on the model and dataset.
See notice regarding software product information.
###
https://arxiv.org/abs/2406.08673
HelpSteer2: Open-source dataset for training top-performing reward models
Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy J. Zhang, Makesh Narsimhan Sreedhar, Oleksii Kuchaiev
High-quality preference datasets are essential for training reward models that can effectively guide large language models (LLMs) in generating high-quality responses aligned with human preferences. As LLMs become stronger and better aligned, permissively licensed preference datasets, such as Open Assistant, HH-RLHF, and HelpSteer need to be updated to remain effective for reward modeling. Methods that distil preference data from proprietary LLMs such as GPT-4 have restrictions on commercial usage imposed by model providers. To improve upon both generated responses and attribute labeling quality, we release HelpSteer2, a permissively licensed preference dataset (CC-BY-4.0). Using a powerful internal base model trained on HelpSteer2, we are able to achieve the SOTA score (92.0%) on Reward-Bench's primary dataset, outperforming currently listed open and proprietary models, as of June 12th, 2024. Notably, HelpSteer2 consists of only ten thousand response pairs, an order of magnitude fewer than existing preference datasets (e.g., HH-RLHF), which makes it highly efficient for training reward models. Our extensive experiments demonstrate that reward models trained with HelpSteer2 are effective in aligning LLMs. In particular, we propose SteerLM 2.0, a model alignment approach that can effectively make use of the rich multi-attribute score predicted by our reward models. HelpSteer2 is available at this https URL and code is available at this https URL
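As a hedged illustration of how multi-attribute scores can be turned into a preference label (the attribute names follow HelpSteer's helpfulness/correctness/coherence/complexity/verbosity scheme, but the weights and the aggregation are invented for this sketch, not SteerLM 2.0's actual method):
ATTRIBUTE_WEIGHTS = {"helpfulness": 1.0, "correctness": 1.0, "coherence": 0.5,
                     "complexity": 0.25, "verbosity": -0.25}

def scalar_reward(attribute_scores):
    # Collapse the per-attribute scores into a single number.
    return sum(ATTRIBUTE_WEIGHTS[name] * score for name, score in attribute_scores.items())

response_a = {"helpfulness": 4, "correctness": 4, "coherence": 4, "complexity": 2, "verbosity": 2}
response_b = {"helpfulness": 2, "correctness": 3, "coherence": 4, "complexity": 1, "verbosity": 3}
chosen, rejected = (("a", "b") if scalar_reward(response_a) >= scalar_reward(response_b)
                    else ("b", "a"))
print(chosen, rejected)  # a b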
###
https://research.google/blog/human-io-detecting-situational-impairments-with-large-language-models/
Human I/O: Detecting situational impairments with large language models
June 14, 2024
Google Research
Xingyu Bruce Liu, Student Researcher, and Ruofei Du, Interactive Perception & Graphics Lead, Google Augmented Reality
Human I/O is a unified approach that uses egocentric vision, multimodal sensing, and LLM reasoning to detect situational impairments and assess a user’s ability to interact with their hands, vision, hearing or speech in a given situation.
Every day, we encounter temporary challenges that can affect our abilities to respond to different situations. These challenges, known as situationally induced impairments and disabilities (SIIDs), can be caused by various environmental factors like noise, lighting, temperature, stress, and even social norms. For example, imagine you're in a loud restaurant and you miss an important phone call because you simply could not hear your phone ring. Or picture yourself trying to respond to a text message while washing dishes; your wet hands and the task at hand make it hard to type a reply. These everyday scenarios show how our surroundings can momentarily reduce our physical, cognitive, or emotional abilities, leading to frustrating experiences.
In addition, situational impairments can vary greatly and change frequently, which makes it difficult to apply one-size-fits-all solutions that help users with their needs in real-time. For example, think about a typical morning routine: while brushing their teeth, someone might not be able to use voice commands with their smart devices. When washing their face, it could be hard to see and respond to important text messages. And while using a hairdryer, it might be difficult to hear any phone notifications. Even though various efforts have created solutions tailored for specific situations like these, creating manual solutions for every possible situation and combination of challenges isn't really feasible and doesn't work well on a large scale.
In “Human I/O: Towards a Unified Approach to Detecting Situational Impairments”, which received a Best Paper Honorable Mention Award at CHI 2024, we introduce a generalizable and extensible framework for detecting SIIDs. Rather than devising individual models for activities like face-washing, tooth-brushing, or hair-drying, Human Input/Output (Human I/O) universally assesses the availability of a user’s vision (e.g., to read text messages, watch videos), hearing (e.g., to hear notifications, phone calls), vocal (e.g., to have a conversation, use Google Assistant), and hand (e.g., to use touch screen, gesture control) input/output interaction channels. We describe how Human I/O leverages egocentric vision, multimodal sensing, and reasoning with large language models (LLMs) to achieve an 82% accuracy in availability prediction across 60 in-the-wild egocentric video recordings in 32 different scenarios, and validate it as an interactive system in a lab study with ten participants. We also open-sourced the code.
Leveraging multi-modal AI and large language models, we built a pipeline to computationally model these human input/output channels and achieved good real-time performance.
Formative study
Prior to developing Human I/O, we conducted a formative study involving ten participants to better understand how different impairments affected their interaction with technology. Our findings highlighted the need for a system that could dynamically adapt to the varying levels of channel availability rather than treating availability as a binary state. Specifically, Human I/O distinguishes between four levels of channel availability, which are crucial for understanding the degree to which a user can engage with their device. These levels include:
Available
The channel is currently not involved in any activity, or constrained by any environmental factors. It takes low to zero effort to use the channel to complete a new task.
Example: A user is sitting at their desk with their hands free, eyes not engaged in a task, and no background noise interfering with their hearing or speech.
Slightly affected
The channel is engaged in an activity or constrained by an environmental factor. Given a new task that requires the channel, users can multitask, easily pause and resume the current activity, or easily overcome the situation.
Example: A user is holding a remote control, which can be set aside to free up their hand for another task.
Affected
The channel is involved in an activity or constrained by an environmental factor. Given a new task, the user may experience inconvenience or require some effort to use the channel.
Example: A user is using both hands to carry groceries, making it challenging to use their hands for other tasks without putting the bags down first.
Unavailable
The channel is completely unavailable due to an activity or environmental factor, and the user cannot use it for a new task without substantial changes, significant adaptation or changing the environment.
Example: A user is attending a loud concert, making it impossible for them to hear incoming notifications or have a conversation.
Human I/O system pipeline
The Human I/O system incorporates a pipeline that streams real-time data, processes it to understand the context, and applies reasoning with LLMs to predict channel availability. Here’s a deeper look into the three core components of the system: data streaming, processing module, and reasoning module.
Data streaming
The system begins by streaming real-time video and audio data through an egocentric device that has a camera and microphone. This setup provides a first-person view of the user's environment, capturing both visual and auditory details necessary for assessing the context.
Processing module
This module processes the raw data to extract the following information:
Activity recognition: Utilizes computer vision to identify the user's current activities, such as cooking or washing dishes.
Environment assessment: Determines the setting, such as noisy or quiet environments, using both audio and visual data.
Direct sensing: Detects finer details, such as whether the user’s hands are occupied, ambient noise levels, and lighting conditions.
Reasoning module
The final stage involves analyzing the structured data from the processing module to predict the availability of input/output channels using an LLM with chain-of-thought reasoning. This module processes contextual information and determines how impaired each channel is, guiding how the device should adapt its interactions accordingly. By integrating data streaming, processing, and reasoning, Human I/O dynamically predicts the availability of the user’s input and output channels. We further incorporate a smoothing algorithm for enhanced system stability.
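A hedged sketch of what that reasoning step could look like in Python: the structured context is formatted into a chain-of-thought prompt and the reply is mapped onto the four availability levels. The prompt wording, the parsing, and the call_llm function are illustrative assumptions, not the paper's implementation.
from enum import IntEnum

class Availability(IntEnum):
    AVAILABLE = 0
    SLIGHTLY_AFFECTED = 1
    AFFECTED = 2
    UNAVAILABLE = 3

CHANNELS = ["vision", "hearing", "vocal", "hands"]

def predict_channel_availability(activity, environment, sensors, call_llm):
    prompt = (
        f"The user is currently: {activity}.\n"
        f"Environment: {environment}.\n"
        f"Sensor readings: {sensors}.\n"
        "For each channel (vision, hearing, vocal, hands), reason step by step about "
        "whether it is available, slightly affected, affected, or unavailable, "
        "then give a final line per channel formatted as 'channel: level'."
    )
    levels = {}
    for line in call_llm(prompt).lower().splitlines():
        for channel in CHANNELS:
            if line.startswith(channel):
                # Check the more specific phrases first so "slightly affected" and
                # "unavailable" are not misread as "affected" / "available".
                for level in (Availability.SLIGHTLY_AFFECTED, Availability.UNAVAILABLE,
                              Availability.AFFECTED, Availability.AVAILABLE):
                    if level.name.lower().replace("_", " ") in line:
                        levels[channel] = level
                        break
    return levels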
The Human I/O pipeline comprises an egocentric device, a processing module that obtains activity, environment, and sensor data, and a reasoning module that predicts human channel availability.
Evaluation
To validate Human I/O, we conducted an evaluation with 300 clips selected from 60 in-the-wild egocentric video recordings. The system achieves a 0.22 mean absolute error (MAE) and an 82% accuracy in predicting channel availability, with 96% of predictions within one step of the actual availability level. These low MAE values indicate that our system’s predictions closely align with the actual availability, with deviations being less than a third of a level on average.
In addition, we conducted an ablation study by introducing Human I/O Lite, which replaces the chain-of-thought reasoning module with a one-shot prompt. For Human I/O Lite, we observe overall slightly inferior performance compared to the full model. However, the MAE for Human I/O Lite is still at a low level around 0.44, showing a promising ability to predict SIIDs even with reduced computational resources.
Technical evaluation of Human I/O and Human I/O Lite. We report the MAE, average classification accuracy (ACC), and average intra-video variance (VAR) for four channels and the overall results. Our system estimates availability levels with small margins of error and variance. In Human I/O, 96.0% of predictions are within a discrepancy of 1 step from the actual availability value.
Additionally, a user study with ten participants showed that Human I/O significantly reduced effort and enhanced user experience in the presence of SIIDs. Participants particularly valued how the system adapted to their real-time context, making digital interactions more seamless and less disruptive. Participants completed the NASA Task Load Index questionnaire, assessing mental demand, physical demand, temporal demand, overall performance, effort, and frustration level on a 7-point scale (from 1–Lowest to 7–Highest). Results suggest that Human I/O significantly reduces effort and improves user experience in the presence of SIIDs. Furthermore, Human I/O users reported a raised awareness of SIIDs that opened up new interaction possibilities.
Participants’ ratings on Task Load Index questions (on a scale of 1–low to 7–high) for their experience with SIIDs with and without Human I/O in the user study. All rating differences are statistically significant with 𝑝 < 0.001 via Wilcoxon signed-rank tests.
Conclusion & future work
Human I/O represents a leap forward in our ability to interact with technology in a context-aware and adaptive manner. By understanding and predicting the availability of our input and output channels, it paves the way for smarter, more intuitive user interfaces that can enhance productivity and accessibility for everyone, regardless of the situational challenges they face.
Meanwhile, maintaining privacy and upholding ethical standards are crucial in the design and deployment of SIIDs systems with active cameras and microphones on wearables. For example, a combination of on-device inference (e.g., Gemini Nano) and federated learning can forestall potential data breaches. Future research may incorporate more sensing techniques, such as depth sensing, ultra-wideband, and eye tracking, to provide users with finer controls over how their devices adapt to their changing needs and situations.
We envision this technology not only improving individual device interactions but also serving as a foundation for future developments in ubiquitous computing.
###
https://www.lamini.ai/blog/lamini-memory-tuning
Introducing Lamini Memory Tuning: 95% LLM Accuracy, 10x Fewer Hallucinations
TLDR:
Lamini Memory Tuning is a new way to embed facts into LLMs that improves factual accuracy and reduces hallucinations to previously unachievable levels — for one Fortune 500 customer, Lamini Memory Tuning led to 95% accuracy compared to 50% with other approaches. Hallucinations were reduced from 50% to 5%.
Lamini Memory Tuning is a research breakthrough that overcomes a seeming paradox in the AI world: achieving precise factual accuracy (i.e. no hallucinations) while upholding the generalization capabilities that make LLMs valuable in the first place.
The method entails tuning millions of expert adapters (e.g. LoRAs) with precise facts on top of any open-source LLM, like Llama 3 or Mistral 3. If the goal is to get Roman Empire facts exactly right, Lamini Memory Tuning would create experts on Caesar, aqueducts, legions, and any other facts you provide. Inspired by information retrieval, the model retrieves only the most relevant experts from an index at inference time — not all the model weights — so latency and cost are dramatically lower. High accuracy, high speed, low cost: with Lamini Memory Tuning, you don’t have to choose.
Contact us to try Lamini Memory Tuning.
Accuracy matters immensely
Yet, general-purpose LLMs are designed to hallucinate, because they are trained to reduce the average error across the examples they’ve seen. They’re pretty good at everything, but perfect at nothing. They can produce fluent English prose because they’ve seen so much of it across the internet, but specific facts—like a date, a revenue number, or a variable name—get muddled in probabilities. As a result, companies have not been able to count on LLMs for the most critical and most valuable use cases – until now.
Introducing Lamini Memory Tuning
Lamini Memory Tuning is a completely new way to fine-tune any existing LLM by tuning millions of LoRA adapters and selecting across them in a wide Mixture of Experts at inference time.
Instead of optimizing average error on everything, Lamini Memory Tuning optimizes for zero error on the specific facts you tell it to remember, so it recalls those facts nearly perfectly. That’s not special on its own. This approach is particularly groundbreaking because it preserves the LLM’s ability to generalize with average error on everything else, and thus continue to produce fluent prose around those facts. Lamini Memory Tuning is a systematic tool for eliminating hallucinations on the facts you care about.
Fortune 500 customers are already using Lamini Memory Tuning to achieve 95% factual accuracy on critical use cases where previous state-of-the-art approaches peaked at 50%.
THE PROBLEM
Prompting and RAG: necessary but not sufficient
Prompting and Retrieval Augmented Generation (RAG) are important methods for surfacing relevant information to the model, shifting its probabilities to consider similar information. This is an important step to getting the model to condition on the right concepts and information, because the model has been trained on so many tasks. Good prompt-engineering and RAG pipelines are critical to improve the overall accuracy of the model.
At times, this is all you need. But other times, you provide the relevant information and the response is still wrong but so close to right — leading to hallucinations.
Why do hallucinations happen with the right data? In the model’s internal representation, the right answer is likely clustered with similar, but wrong, options. The right context increases the probabilities of the right answer and nearby wrong options. The model doesn’t know that a nearly right answer is still wrong, because general models don’t distinguish between exactly right and nearly right — they never learned to take the loss on those answers to zero. Prompting and RAG don’t change that.
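A tiny, invented illustration of that clustering effect: with or without retrieved context, the plausible-but-wrong neighbours keep a meaningful share of the probability mass (the candidate answers and logits below are made up).
import numpy as np

def softmax(logits):
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

candidates = ["1998", "1996", "1999", "banana"]
logits_without_context = np.array([2.0, 1.8, 1.7, -3.0])
logits_with_retrieved_context = np.array([3.0, 2.6, 2.4, -4.0])

for name, logits in [("no context", logits_without_context),
                     ("with retrieved context", logits_with_retrieved_context)]:
    # Retrieval lifts the right answer, but the near-miss answers rise with it.
    print(name, dict(zip(candidates, softmax(logits).round(2))))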
Lamini Memory Tuning addresses this directly, by combining methods from information retrieval and AI to teach the model that getting the answer nearly right is the same as getting it totally wrong.
Instruction fine-tuning: the wrong tool for the job
Many teams turn to instruction fine-tuning when other techniques hit a wall on factual accuracy, but instruction fine-tuning, with or without LoRAs, leads to the same issue that pre-training has: the model gets pretty good at a narrower dataset, but is still perfect at nothing, while being finicky to work with (you can lose the ability to perform on some general tasks if you do it wrong).
As a result, teams struggle with unclear choices, long feedback loops, high compute bills, and ultimately underwhelming performance improvements. While instruction fine-tuning can be really valuable (it’s what turned GPT-3 into ChatGPT), it doesn't make models perfect at the facts that matter. In other words, traditional fine-tuning does not ensure that the model's answers are faithful to facts in its training data.
This is why we developed Lamini Memory Tuning.
OUR INNOVATION
Lamini Memory Tuning: near-perfect fact recall via 1 million-way MoE
Lamini Memory Tuning is a fundamentally different fine-tuning approach that effectively teaches any open-source LLM to be near-perfect on facts, while still maintaining its ability to be pretty good at everything else. When the model is supposed to recall a specific fact, Lamini Memory Tuning shifts the entire probability mass to that particular fact (i.e. specific tokens within a particular context), such as the exact SQL schema for your database. This results in output probabilities that are not just closer to the right result, but exactly there.
To do this, Lamini Memory Tuning tunes a massive mixture of memory experts on any open-source LLM. Each memory expert acts like a LoRA adapter that functionally operates as memory for the model. Together, the memory experts specialize in a million different ways to ensure faithful and factual accuracy to the data that it was tuned on. Inspired by information retrieval, these million memory experts are equivalent to indices from which the model intelligently retrieves and routes. At inference time, the model retrieves the most relevant experts at each layer and merges back into the base model to respond to the user query.
The result is a sparsely activated model, called a Mixture of Memory Experts (MoME), that can scale to an enormous number of parameters at a fixed computational inference cost. This means MoMEs have extremely high capacity for the number of facts that can be learned, bounded only by the total size of the training data set. Llama 3 was trained on 15 trillion tokens. Realistically, you will run out of system memory before you run out of memory capacity in a MoME.
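A toy sketch of the retrieve-and-merge idea: LoRA-style memory experts live in an index keyed by embeddings; at inference only the nearest experts are looked up and their low-rank deltas are added to the base weight. All names, shapes, and the nearest-neighbour lookup here are illustrative assumptions, not Lamini's implementation.
import numpy as np

rng = np.random.default_rng(0)
d_model, rank, n_experts = 64, 4, 1000

base_weight = rng.normal(size=(d_model, d_model))
expert_keys = rng.normal(size=(n_experts, d_model))            # the "index" of experts
expert_A = rng.normal(size=(n_experts, d_model, rank)) * 0.01  # low-rank LoRA factors
expert_B = rng.normal(size=(n_experts, rank, d_model)) * 0.01

def forward_with_memory_experts(x, query_embedding, top_k=2):
    # Retrieve only the most relevant experts, not all the weights.
    selected = np.argsort(expert_keys @ query_embedding)[-top_k:]
    # Merge their low-rank deltas into the base weight for this query.
    merged = base_weight + sum(expert_A[i] @ expert_B[i] for i in selected)
    return x @ merged

x = rng.normal(size=(1, d_model))
print(forward_with_memory_experts(x, rng.normal(size=d_model)).shape)  # (1, 64)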
Ultimately, this approach makes what were impossible use cases that critically suffer from hallucinations within reach, and drastically improves LLM time-to-accuracy and thus time-to-market.
Read more details in our research paper.
Results
Lamini Memory Tuning has been a game-changing capability with Lamini’s Fortune 500 clients, who are deploying it for the following use cases:
High precision text-to-SQL
Client need: Democratize data access by using LLMs to turn natural language questions into database queries.
Challenge: The relevant databases had unique internal names and large, messy schemas.
Result: We achieved 95% accuracy with Lamini Memory Tuning after 50% accuracy with RAG.
High precision classification
Client need: Save thousands of hours by automatically labeling data accurately.
Challenge: We had to adhere to an exact taxonomy of 900 categories.
Result: We achieved 100% accuracy across thousands of documents.
High precision recommendations
Client need: Increase cart size and revenue with AI-powered product suggestions.
Challenge: Applications break when product IDs are hallucinated.
Result: We achieved 88% accuracy across a 50,000 product database.
A new frontier
Lamini Memory Tuning changes several of the fundamental dynamics and tradeoffs governing how we work with LLMs. We’re in the early days of this new paradigm, and we’re still learning alongside our customers what’s possible. Summarizing a few areas we’re most excited about:
Higher accuracy enables full automation as opposed to copiloting.
Lower costs let you take your product from internal demos to a wider production audience.
Lower latency enables seamless user experiences.
Smaller models mean faster development and improvement cycles.
What could you do with models that ran faster, were more accurate, and cost less to develop and run?
Start using Lamini Memory Tuning
Because Lamini Memory Tuning is a cutting-edge technique that embeds your unique data in a new model architecture, we’re exclusively working with select partners.
Contact us to try Lamini Memory Tuning.
Want to learn more?
Read the research paper for details on the underlying approach.
Read the case study to see how a Fortune 500 company is using Lamini Memory Tuning for a 95% accurate text-to-SQL agent.
###
https://link.springer.com/content/pdf/10.1007/s10676-024-09775-5.pdf
ChatGPT is bullshit
Michael Townsen Hicks · James Humphries · Joe Slater
Abstract
Recently, there has been considerable interest in large language models: machine learning systems which produce humanlike text and dialogue. Applications of these systems have been plagued by persistent inaccuracies in their output; these are often called “AI hallucinations”. We argue that these falsehoods, and the overall activity of large language models, is better understood as bullshit in the sense explored by Frankfurt (On Bullshit, Princeton, 2005): the models are in an important way indifferent to the truth of their outputs. We distinguish two ways in which the models can be said to be bullshitters, and argue that they clearly meet at least one of these definitions. We further argue that describing AI misrepresentations as bullshit is both a more useful and more accurate way of predicting and discussing the behaviour of these systems.
###
https://towardsdatascience.com/multi-ai-agent-systems-101-bac58e3bcc47
Multi AI Agent Systems 101
Automating Routine Tasks in Data Source Management with CrewAI
Mariya Mansurova
Towards Data Science
Image by DALL-E 3
Initially, when ChatGPT just appeared, we used simple prompts to get answers to our questions. Then, we encountered issues with hallucinations and began using RAG (Retrieval Augmented Generation) to provide more context to LLMs. After that, we started experimenting with AI agents, where LLMs act as a reasoning engine and can decide what to do next, which tools to use, and when to return the final answer.
The next evolutionary step is to create teams of such agents that can collaborate with each other. This approach is logical as it mirrors human interactions. We work in teams where each member has a specific role:
The product manager proposes the next project to work on.
The designer creates its look and feel.
The software engineer develops the solution.
The analyst examines the data to ensure it performs as expected and identifies ways to improve the product for customers.
Similarly, we can create a team of AI agents, each focusing on one domain. They can collaborate and reach a final conclusion together. Just as specialization enhances performance in real life, it could also benefit the performance of AI agents.
Another advantage of this approach is increased flexibility. Each agent can operate with its own prompt, set of tools and even LLM. For instance, we can use different models for different parts of our system. You can use GPT-4 for the agent that needs more reasoning and GPT-3.5 for the one that does only simple extraction. We can even fine-tune the model for small specific tasks and use it in our crew of agents.
The potential drawbacks of this approach are time and cost. Multiple interactions and knowledge sharing between agents require more calls to LLM and consume additional tokens. This could result in longer wait times and increased expenses.
There are several frameworks available for multi-agent systems today.
Here are some of the most popular ones:
AutoGen: Developed by Microsoft, AutoGen uses a conversational approach and was one of the earliest frameworks for multi-agent systems,
LangGraph: While not strictly a multi-agent framework, LangGraph allows for defining complex interactions between actors using a graph structure. So, it can also be adapted to create multi-agent systems.
CrewAI: Positioned as a high-level framework, CrewAI facilitates the creation of “crews” consisting of role-playing agents capable of collaborating in various ways.
I’ve decided to start experimenting with multi-agent frameworks using CrewAI since it’s quite popular and user-friendly. So, it looks like a good option to begin with.
In this article, I will walk you through how to use CrewAI. As analysts, we’re the domain experts responsible for documenting various data sources and addressing related questions. We’ll explore how to automate these tasks using multi-agent frameworks.
Setting up the environment
Let’s start with setting up the environment. First, we need to install the CrewAI main package and an extension to work with tools.
pip install crewai
pip install 'crewai[tools]'
CrewAI was developed to work primarily with the OpenAI API, but I would also like to try it with a local model. According to the ChatBot Arena Leaderboard, the best model you can run on your laptop is Llama 3 (8B parameters), so it will be the most feasible option for our use case.
We can access Llama models using Ollama. Installation is pretty straightforward. You need to download Ollama from the website and then go through the installation process. That’s it.
Now, you can test the model in CLI by running the following command.
ollama run llama3
For example, you can ask something like this.
Let’s create a custom Ollama model to use later in CrewAI.
We will start with a ModelFile (documentation). I only specified the base model (llama3), temperature and stop sequence. However, you might add more features; for example, you can define the system message using the SYSTEM keyword.
FROM llama3
# set parameters
PARAMETER temperature 0.5
PARAMETER stop Result
I’ve saved it into a Llama3ModelFile file.
Let’s create a bash script to load the base model for Ollama and create the custom model we defined in ModelFile.
#!/bin/zsh
# define variables
model_name="llama3"
custom_model_name="crewai-llama3"
# load the base model
ollama pull $model_name
# create the model file
ollama create $custom_model_name -f ./Llama3ModelFile
Let’s execute this file.
chmod +x ./llama3_setup.sh
./llama3_setup.sh
You can find both files on GitHub: Llama3ModelFile and llama3_setup.sh
We need to initialise the following environment variables to use the local Llama model with CrewAI.
os.environ["OPENAI_API_BASE"]='http://localhost:11434/v1'
os.environ["OPENAI_MODEL_NAME"]='crewai-llama3'
# custom_model_name from the bash script
os.environ["OPENAI_API_KEY"] = "NA"
We’ve finished the setup and are ready to continue our journey.
Use cases: working with documentation
As analysts, we often play the role of subject matter experts for data and some data-related tools. In my previous team, we used to have a channel with almost 1K participants, where we were answering lots of questions about our data and the ClickHouse database we used as storage. It took us quite a lot of time to manage this channel. It would be interesting to see whether such tasks can be automated with LLMs.
For this example, I will use the ClickHouse database. If you’re interested, you can learn more about ClickHouse and how to set it up locally in my previous article. However, we won’t utilise any ClickHouse-specific features, so feel free to stick to the database you know.
I’ve created a pretty simple data model to work with. There are just two tables in our DWH (Data Warehouse): ecommerce_db.users and ecommerce_db.sessions. As you might guess, the first table contains information about the users of our service.
The ecommerce_db.sessions table stores information about user sessions.
Regarding data source management, analysts typically handle tasks like writing and updating documentation and answering questions about this data. So, we will use LLM to write documentation for the table in the database and teach it to answer questions about data or ClickHouse.
But before moving on to the implementation, let’s learn more about the CrewAI framework and its core concepts.
CrewAI basic concepts
The cornerstone of a multi-agent framework is the agent concept. In CrewAI, agents are powered by role-playing. Role-playing is a tactic where you ask an agent to adopt a persona and behave like a top-notch backend engineer or helpful customer support agent. So, when creating a CrewAI agent, you need to specify each agent's role, goal, and backstory so that the LLM knows enough to play this role.
The agents’ capabilities are limited without tools (functions that agents can execute and get results). With CrewAI, you can use one of the predefined tools (for example, to search the Internet, parse a website, or do RAG on a document), create a custom tool yourself or use LangChain tools. So, it’s pretty easy to create a powerful agent.
Let’s move on from agents to the work they are doing. Agents are working on tasks (specific assignments). For each task, we need to define a description, expected output (definition of done), set of available tools and assigned agent. I really like that these frameworks follow the managerial best practices like a clear definition of done for the tasks.
The next question is how to define the execution order for tasks: which one to work on first, which ones can run in parallel, etc. CrewAI implemented processes to orchestrate the tasks. It provides a couple of options:
Sequential —the most straightforward approach when tasks are called one after another.
Hierarchical — when there’s a manager (specified as an LLM model) that creates and delegates tasks to the agents.
Also, CrewAI is working on a consensual process. In such a process, agents will be able to make decisions collaboratively with a democratic approach.
There are other levers you can use to tweak how tasks are executed (a short example follows this list):
You can mark tasks as “asynchronous”, then they will be executed in parallel, so you will be able to get an answer faster.
You can use the “human input” flag on a task, and then the agent will ask for human approval before finalising the output of this task. It can allow you to add an oversight to the process.
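Here’s a brief, self-contained illustration of those two flags, assuming the parameter names used in CrewAI’s documentation (async_execution and human_input); the agent and task texts are placeholders, and the agents we’ll actually use are defined later in this article.
from crewai import Agent, Task

# Placeholder agent purely for illustration.
reviewer = Agent(
role = "Reviewer",
goal = "Review drafts for factual accuracy",
backstory = "You are a meticulous reviewer.",
allow_delegation = False
)

draft_task = Task(
description = "Draft a short overview of table {table}",
expected_output = "A one-paragraph overview of {table}",
agent = reviewer,
async_execution = True  # run in parallel with other tasks
)

review_task = Task(
description = "Review the draft overview of table {table}",
expected_output = "The approved overview of {table}",
agent = reviewer,
human_input = True  # ask for human approval before finalising the output
)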
We’ve defined all the primary building blocks and can discuss the holy grail of CrewAI — the crew concept. The crew represents the team of agents and the set of tasks they will be working on. The approach for collaboration (the processes we discussed above) can also be defined at the crew level.
Also, we can set up the memory for a crew. Memory is crucial for efficient collaboration between the agents. CrewAI supports three levels of memory:
Short-term memory stores information related to the current execution. It helps agents to work together on the current task.
Long-term memory is data about the previous executions stored in the local database. This type of memory allows agents to learn from earlier iterations and improve over time.
Entity memory captures and structures information about entities (like personas, cities, etc.)
Right now, you can only switch on all types of memory for a crew without any further customisation. However, it doesn’t work with the Llama models.
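For completeness, switching memory on is a single flag at the crew level (assuming the memory parameter from CrewAI’s docs, and reusing the illustrative agent and tasks from the previous sketch):
from crewai import Crew

crew_with_memory = Crew(
agents = [reviewer],
tasks = [draft_task, review_task],
memory = True  # switches on short-term, long-term and entity memory together
)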
We’ve learned enough about the CrewAI framework, so it’s time to start using this knowledge in practice.
Use case: writing documentation
Let’s start with a simple task: putting together the documentation for our DWH. As we discussed before, there are two tables in our DWH, and I would like to create a detailed description for them using LLMs.
First approach
In the beginning, we need to think about the team structure. Think of this as a typical managerial task. Who would you hire for such a job?
I would break this task into two parts: retrieving data from a database and writing documentation. So, we need a database specialist and a technical writer. The database specialist needs access to a database, while the writer won’t need any special tools.
Now, we have a high-level plan. Let’s create the agents.
For each agent, I’ve specified the role, goal and backstory. I’ve tried my best to provide agents with all the needed context.
database_specialist_agent = Agent(
role = "Database specialist",
goal = "Provide data to answer business questions using SQL",
backstory = '''You are an expert in SQL, so you can help the team
to gather needed data to power their decisions.
You are very accurate and take into account all the nuances in data.''',
allow_delegation = False,
verbose = True
)
tech_writer_agent = Agent(
role = "Technical writer",
goal = '''Write engaging and factually accurate technical documentation
for data sources or tools''',
backstory = '''
You are an expert in both technology and communications, so you can easily explain even sophisticated concepts.
You base your work on the factual information provided by your colleagues.
Your texts are concise and can be easily understood by a wide audience.
You use a professional but rather informal style in your communication.
''',
allow_delegation = False,
verbose = True
)
We will use a simple sequential process, so there’s no need for agents to delegate tasks to each other. That’s why I specified allow_delegation = False.
The next step is setting the tasks for agents. But before moving to them, we need to create a custom tool to connect to the database.
First, I put together a function to execute ClickHouse queries using HTTP API.
import requests

CH_HOST = 'http://localhost:8123' # default address

def get_clickhouse_data(query, host = CH_HOST, connection_timeout = 1500):
    # Send the query over ClickHouse's HTTP interface.
    r = requests.post(host, params = {'query': query},
                      timeout = connection_timeout)
    if r.status_code == 200:
        return r.text
    else:
        # Return the error text so the agent can try to fix its query.
        return 'Database returned the following error:\n' + r.text
When working with LLM agents, it’s important to make tools fault-tolerant. For example, if the database returns an error (status_code != 200), my code won’t throw an exception. Instead, it will return the error description to the LLM so it can attempt to resolve the issue.
To create a CrewAI custom tool, we need to derive our class from crewai_tools.BaseTool, implement the _run method and then create an instance of this class.
from crewai_tools import BaseTool

class DatabaseQuery(BaseTool):
    name: str = "Database Query"
    description: str = "Returns the result of SQL query execution"

    def _run(self, sql_query: str) -> str:
        # Execute the SQL query against ClickHouse and return the raw result.
        return get_clickhouse_data(sql_query)

database_query_tool = DatabaseQuery()
Now, we can set the tasks for the agents. Again, providing clear instructions and all the context to LLM is crucial.
table_description_task = Task(
description = '''Provide the comprehensive overview for the data
in table {table}, so that it's easy to understand the structure
of the data. This task is crucial to put together the documentation
for our database''',
expected_output = '''The comprehensive overview of {table} in the md format.
Include 2 sections: columns (list of columns with their types)
and examples (the first 30 rows from table).''',
tools = [database_query_tool],
agent = database_specialist_agent
)
table_documentation_task = Task(
description = '''Using provided information about the table,
put together the detailed documentation for this table so that
people can use it in practice''',
expected_output = '''Well-written detailed documentation describing
the data scheme for the table {table} in markdown format,
that gives the table overview in 1-2 sentences and then
describes each column. Structure the columns description
as a markdown table with column name, type and description.''',
tools = [],
output_file="table_documentation.md",
agent = tech_writer_agent
)
You might have noticed that I’ve used {table} placeholder in the tasks’ descriptions. We will use table as an input variable when executing the crew, and this value will be inserted into all placeholders.
Also, I’ve specified the output file for the table documentation task to save the final result locally.
We have all we need. Now, it’s time to create a crew and execute the process, specifying the table we are interested in. Let’s try it with the users table.
crew = Crew(
agents = [database_specialist_agent, tech_writer_agent],
tasks = [table_description_task, table_documentation_task],
verbose = 2
)
result = crew.kickoff({'table': 'ecommerce_db.users'})
It’s an exciting moment, and I’m really looking forward to seeing the result. Don’t worry if execution takes some time. Agents make multiple LLM calls, so it’s perfectly normal for it to take a few minutes. It took 2.5 minutes on my laptop.
We asked LLM to return the documentation in markdown format. We can use the following code to see the formatted result in Jupyter Notebook.
from IPython.display import Markdown
Markdown(result)
At first glance, it looks great. We’ve got a valid markdown file describing the users table.
But wait, it’s incorrect. Let’s see what data we have in our table.
The columns listed in the documentation are completely different from what we have in the database. It’s a case of LLM hallucinations.
We’ve set verbose = 2 to get the detailed logs from CrewAI. Let’s read through the execution logs to identify the root cause of the problem.
First, the database specialist couldn’t query the database due to complications with quotes.
The specialist didn’t manage to resolve this problem. Finally, this chain has been terminated by CrewAI with the following output: Agent stopped due to iteration limit or time limit.
This means the technical writer didn’t receive any factual information about the data. However, the agent continued and produced completely fake results. That’s how we ended up with incorrect documentation.
Fixing the issues
Even though our first iteration wasn’t successful, we’ve learned a lot. We have (at least) two areas for improvement:
Our database tool is too difficult for the model, and the agent struggles to use it. We can make the tool more tolerant by removing quotes from the beginning and end of the queries. This solution is not ideal since valid SQL can end with a quote, but let’s try it.
Our technical writer isn’t basing its output on the input from the database specialist. We need to tweak the prompt to highlight the importance of providing only factual information.
So, let’s try to fix these problems. First, we will fix the tool — we can leverage strip to eliminate quotes.
CH_HOST = 'http://localhost:8123' # default address
def get_clickhouse_data(query, host = CH_HOST, connection_timeout = 1500):
    # Strip leading/trailing quotes the LLM sometimes wraps around the query.
    r = requests.post(host, params = {'query': query.strip('"').strip("'")},
                      timeout = connection_timeout)
    if r.status_code == 200:
        return r.text
    else:
        return 'Database returned the following error:\n' + r.text
Then, it’s time to update the prompt. I’ve included statements emphasizing the importance of sticking to the facts in both the agent and task definitions.
tech_writer_agent = Agent(
role = "Technical writer",
goal = '''Write engaging and factually accurate technical documentation
for data sources or tools''',
backstory = '''
You are an expert in both technology and communications, so you
can easily explain even sophisticated concepts.
Your texts are concise and can be easily understood by a wide audience.
You use a professional but rather informal style in your communication.
You base your work on the factual information provided by your colleagues.
You stick to the facts in the documentation and use ONLY
information provided by the colleagues not adding anything.''',
allow_delegation = False,
verbose = True
)
table_documentation_task = Task(
description = '''Using provided information about the table,
put together the detailed documentation for this table so that
people can use it in practice''',
expected_output = '''Well-written detailed documentation describing
the data scheme for the table {table} in markdown format,
that gives the table overview in 1-2 sentences and then
describes each column. Structure the columns description
as a markdown table with column name, type and description.
The documentation is based ONLY on the information provided
by the database specialist without any additions.''',
tools = [],
output_file = "table_documentation.md",
agent = tech_writer_agent
)
Let’s execute our crew once again and see the results.
We’ve achieved a somewhat better result. Our database specialist was able to execute queries and view the data, which is a significant win for us. Additionally, all the relevant fields now appear in the result table, though there are lots of unrelated fields as well, so it’s still not entirely correct.
I once again looked through the CrewAI execution log to figure out what went wrong. The issue lies in getting the list of columns. There’s no filter by database, so it returns some unrelated columns that appear in the result.
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'users'
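One way to avoid picking up unrelated columns is to filter by database as well; a minimal sketch, assuming the standard information_schema layout where table_schema holds the database name and that the table lives in ecommerce_db:
SELECT column_name
FROM information_schema.columns
WHERE table_schema = 'ecommerce_db'
  AND table_name = 'users'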
Also, after looking at multiple attempts, I noticed that the database specialist, from time to time, executes a select * from <table> query. This could cause issues in production, since it might pull a lot of data and send it to the LLM.
More specialised tools
We can provide our agent with more specialised tools to improve our solution. Currently, the agent has a tool to execute any SQL query, which is flexible and powerful but prone to errors. We can create more focused tools, such as getting table structure and top-N rows from the table. Hopefully, it will reduce the number of mistakes.
from crewai_tools import BaseTool

class TableStructure(BaseTool):
    name: str = "Table structure"
    description: str = "Returns the list of columns and their types"

    def _run(self, table: str) -> str:
        # strip the quotes the LLM sometimes adds around the table name
        table = table.strip('"').strip("'")
        return get_clickhouse_data(
            'describe {table} format TabSeparatedWithNames'
            .format(table = table)
        )

class TableExamples(BaseTool):
    name: str = "Table examples"
    description: str = "Returns the first N rows from the table"

    def _run(self, table: str, n: int = 30) -> str:
        table = table.strip('"').strip("'")
        return get_clickhouse_data(
            'select * from {table} limit {n} format TabSeparatedWithNames'
            .format(table = table, n = n)
        )

table_structure_tool = TableStructure()
table_examples_tool = TableExamples()
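Before re-running, the data-retrieval task needs to point at these new tools. A rough sketch of what that might look like (table_description_task and database_specialist_agent are defined earlier in the article; the description and expected_output texts below are placeholders, not the exact wording I used):
table_description_task = Task(
    description = '''Get the structure of the table {table} and
    a few example rows so that the technical writer can document it''',
    expected_output = '''The list of columns with their types and
    the first rows of the table {table}''',
    tools = [table_structure_tool, table_examples_tool],
    agent = database_specialist_agent
)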
With the tools specified in the task, we can re-run our script. After the first attempt, I got the following output from the Technical Writer.
Task output: This final answer provides a detailed and factual description
of the ecommerce_db.users table structure, including column names, types,
and descriptions. The documentation adheres to the provided information
from the database specialist without any additions or modifications.
More focused tools helped the database specialist retrieve the correct table information. However, even though the writer had all the necessary information, we didn’t get the expected result.
As we know, LLMs are probabilistic, so I gave it another try. And hooray, this time, the result was pretty good.
It’s not perfect, since it still includes some irrelevant comments and lacks an overall description of the table. However, providing more specialised tools has definitely paid off. It also helped prevent the issue of the agent trying to load all the data from the table.
Quality assurance specialist
We’ve achieved pretty good results, but let’s see if we can improve them further. A common practice in multi-agent setups is quality assurance, which adds the final review stage before finalising the results.
Let’s create a new agent — a Quality Assurance Specialist, who will be in charge of review.
qa_specialist_agent = Agent(
role = "Quality Assurance specialist",
goal = """Ensure the highest quality of the documentation we provide
(that it's correct and easy to understand)""",
backstory = '''
You work as a Quality Assurance specialist, checking the work
from the technical writer and ensuring that it's in line
with our highest standards.
You need to check that the technical writer provides full, complete
answers and makes no assumptions.
Also, you need to make sure that the documentation addresses
all the questions and is easy to understand.
''',
allow_delegation = False,
verbose = True
)
Now, it’s time to describe the review task. I’ve used the context parameter to specify that this task requires outputs from both table_description_task and table_documentation_task.
qa_review_task = Task(
description = '''
Review the draft documentation provided by the technical writer.
Ensure that the documentation fully answers all the questions:
the purpose of the table and its structure in the form of table.
Make sure that the documentation is consistent with the information
provided by the database specialist.
Double check that there are no irrelevant comments in the final version
of documentation.
''',
expected_output = '''
The final version of the documentation in markdown format
that can be published.
The documentation should fully address all the questions, be consistent
and follow our professional but informal tone of voice.
''',
tools = [],
context = [table_description_task, table_documentation_task],
output_file="checked_table_documentation.md",
agent = qa_specialist_agent
)
Let’s update our crew and run it.
full_crew = Crew(
agents=[database_specialist_agent, tech_writer_agent, qa_specialist_agent],
tasks=[table_description_task, table_documentation_task, qa_review_task],
verbose = 2,
memory = False # doesn't work with Llama
)
full_result = full_crew.kickoff({'table': 'ecommerce_db.users'})
We now have more structured and detailed documentation thanks to the addition of the QA stage.
Delegation
With the addition of the QA specialist, it would be interesting to test the delegation mechanism. The QA specialist agent might have questions or requests that it could delegate to other agents.
I tried using delegation with Llama 3, but it didn’t go well: the model struggled to call the co-worker tool correctly and couldn’t specify the correct co-worker’s name.
We achieved pretty good results with a local model that can run on any laptop, but now it’s time to switch gears and use a way more powerful model — GPT-4o.
To do it, we just need to update the following environment variables.
os.environ["OPENAI_MODEL_NAME"] = 'gpt-4o'
os.environ["OPENAI_API_KEY"] = config['OPENAI_API_KEY'] # your OpenAI key
To switch on the delegation, we should specify allow_delegation = True for the QA specialist agent.
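A minimal sketch of this change (the goal and backstory stay exactly as in the earlier definition and are abbreviated here):
qa_specialist_agent = Agent(
    role = "Quality Assurance specialist",
    goal = """Ensure the highest quality of the documentation we provide
    (that it's correct and easy to understand)""",
    backstory = '''You work as a Quality Assurance specialist, ...''',  # unchanged from above
    allow_delegation = True,  # gives the agent access to the co-worker tools
    verbose = True
)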
Also, we can use handy memory functionality for OpenAI models, as mentioned above. The memory will allow our agents to share their knowledge with each other during execution and leverage long-term memory to get information from previous executions.
full_crew = Crew(
agents = [database_specialist_agent, tech_writer_agent, qa_specialist_agent],
tasks = [table_description_task, table_documentation_task, qa_review_task],
verbose = 2,
memory = True
)
full_result = full_crew.kickoff({'table': 'ecommerce_db.users'})
Let’s see the CrewAI execution logs to understand how delegation works. Here are all the logs for the QA specialist. We can see that it reached out to the database specialist to double-check the information.
Thought: I need to review the provided draft documentation to ensure it
fully answers all the questions and is complete, correct, and easy
to understand. Specifically, I need to ensure that it explains the purpose
of the table, its structure, and includes relevant examples without
any irrelevant comments.
First, I will compare the two drafts to identify any discrepancies and
missing information. Then, I will verify the accuracy and completeness of
the information provided.
**Comparison:**
1. **Introduction:**
- First draft: Includes an introduction and examples.
- Second draft: Adds a more detailed introduction about the purpose and importance of the table.
2. **Columns:**
- Both drafts have the same column structure and descriptions.
3. **Additional Information:**
- Both drafts include similar additional information, but the second draft is more concise.
4. **Examples:**
- The first draft includes examples, which are missing in the second draft.
5. **Final Checks:**
- Ensure there are no irrelevant comments.
- Verify consistency with information provided by the database specialist.
Action: Ask question to co-worker
Action Input: {
"question": "Can you confirm the accuracy of the provided column descriptions
and examples for the `ecommerce_db.users` table?",
"context": "We are finalizing the documentation for the `ecommerce_db.users`
table. The table includes columns `user_id`, `country`, `is_active`, and
`age`. Here are the examples provided:
| user_id | country | is_active | age |
|---------|----------------|-----------|-----|
| 1000001 | United Kingdom | 0 | 70 |
| 1000002 | France | 1 | 87 |
| 1000003 | France | 1 | 88 |
| 1000004 | Germany | 1 | 25 |
| 1000005 | Germany | 1 | 48 |
| 1000006 | France | 1 | 78 |
| 1000007 | Netherlands | 1 | 65 |
| 1000008 | Germany | 1 | 31 |
| 1000009 | Germany | 1 | 66 |
| 1000010 | Germany | 1 | 73 |",
"coworker": "database specialist"}
<-- omitted logs from the database specialist -->
Based on the provided context and examples for the `ecommerce_db.users` table,
the column descriptions and examples appear accurate. Here are the details:
- `user_id`: This column contains a unique identifier for each user.
The provided examples (e.g., 1000001, 1000002) align with this description
as they are unique numeric identifiers.
- `country`: This column specifies the country of the user. The provided
examples (e.g., United Kingdom, France, Germany, Netherlands) are
all valid country names.
- `is_active`: This column indicates whether the user is active (1) or not (0). The provided examples (e.g., 0, 1) correctly represent this binary status.
- `age`: This column shows the age of the user. The provided examples (e.g., 70,
87, 88, 25, 48, 78, 65, 31, 66, 73) are all valid ages.
Therefore, the column descriptions and examples provided for the
`ecommerce_db.users` table are accurate and meet the expected criteria.
Thought: Based on the observation from the database specialist,
the column descriptions and examples provided for the `ecommerce_db.users`
table are accurate. Now, I will consolidate the best elements from both drafts,
ensuring the documentation is complete, clear, and free of irrelevant comments.
Action: Finalize the documentation in markdown format, incorporating
the detailed introduction, examples, and ensuring the tone is professional
but informal.
When I tried the delegation for the first time, I didn’t enable memory, which led to incorrect results. The data specialist and the technical writer initially returned the correct information. However, when the QA specialist returned with the follow-up questions, they started to hallucinate. So, it looks like delegation works better when memory is enabled.
Here’s the final output from GPT-4o. The result looks pretty nice now. We definitely can use LLMs to automate documentation.
So, the first task has been solved!
I used the same script to generate documentation for the ecommerce_db.sessions table as well. It will be handy for our next task. So, let’s not waste any time and move on.
Use case: answering questions
Our next task is answering questions based on the documentation, since this is a common task for data analysts (and many other specialists).
We will start simple and create just two agents:
The documentation support specialist will answer questions based on the docs,
The support QA agent will review the answer before sharing it with the customer.
We will need to empower the documentation specialist with a couple of tools that will allow them to see all the files stored in the directory and read the files. It’s pretty straightforward since CrewAI has implemented such tools.
from crewai_tools import DirectoryReadTool, FileReadTool
documentation_directory_tool = DirectoryReadTool(
directory = '~/crewai_project/ecommerce_documentation')
base_file_read_tool = FileReadTool()
However, since Llama 3 keeps struggling with quotes when calling tools, I had to create a custom tool on top of the FileReadTool to overcome this issue.
from crewai_tools import BaseTool

class FileReadToolUPD(BaseTool):
    name: str = "Read a file's content"
    description: str = "A tool that can be used to read a file's content."

    def _run(self, file_path: str) -> str:
        # strip the stray quotes Llama 3 adds around the path,
        # then delegate to the standard FileReadTool
        return base_file_read_tool._run(file_path = file_path.strip('"').strip("'"))

file_read_tool = FileReadToolUPD()
Next, as we did before, we need to create agents, tasks and crew.
data_support_agent = Agent(
role = "Senior Data Support Agent",
goal = "Be the most helpful support for you colleagues",
backstory = '''You work as a support for data-related questions
in the company.
Even though you're a big expert in our data warehouse, you double check
all the facts in documentation.
Our documentation is absolutely up-to-date, so you can fully rely on it
when answering questions (you don't need to check the actual data
in database).
Your work is very important for the team success. However, remember
that examples of table rows don't show all the possible values.
You need to ensure that you provide the best possible support: answering
all the questions, making no assumptions and sharing only the factual data.
Be creative and try your best to solve the customer's problem.
''',
allow_delegation = False,
verbose = True
)
qa_support_agent = Agent(
role = "Support Quality Assurance Agent",
goal = """Ensure the highest quality of the answers we provide
to the customers""",
backstory = '''You work as a Quality Assurance specialist, checking the work
from support agents and ensuring that it's in line with our highest standards.
You need to check that the agent provides full, complete answers
and makes no assumptions.
Also, you need to make sure that the documentation addresses all
the questions and is easy to understand.
''',
allow_delegation = False,
verbose = True
)
draft_data_answer = Task(
description = '''Very important customer {customer} reached out to you
with the following question:
{question}
Your task is to provide the best answer to all the points in the question
using all available information and not making any assumptions.
If you don't have enough information to answer the question, just say
that you don't know.''',
expected_output = '''The detailed informative answer to the customer's
question that addresses all the points mentioned.
Make sure that the answer is complete and sticks to the facts
(without any additional information not based on the factual data)''',
tools = [documentation_directory_tool, file_read_tool],
agent = data_support_agent
)
answer_review = Task(
description = '''
Review the draft answer provided by the support agent.
Ensure that it fully answers all the questions mentioned
in the initial inquiry.
Make sure that the answer is consistent and doesn't include any assumptions.
''',
expected_output = '''
The final version of the answer in markdown format that can be shared
with the customer.
The answer should fully address all the questions, be consistent
and follow our professional but informal tone of voice.
We are a very chill and friendly company, so don't forget to include
all the polite phrases.
''',
tools = [],
agent = qa_support_agent
)
qna_crew = Crew(
agents = [data_support_agent, qa_support_agent],
tasks = [draft_data_answer, answer_review],
verbose = 2,
memory = False # doesn't work with Llama
)
Let’s see how it works in practice.
result = qna_crew.kickoff(
{'customer': "Max",
'question': """Hey team, I hope you're doing well. I need to find
the numbers before our CEO presentation tomorrow, so I will really
appreciate your help.
I need to calculate the number of sessions from our Windows users in 2023. I've tried to find the table with such data in our data warehouse, but wasn't able to.
Do you have any ideas whether we store the needed data somewhere,
so that I can query it? """
}
)
We’ve got a polite, practical and helpful answer in return. That’s really great.
**Hello Max,**
Thank you for reaching out with your question! I'm happy to help you
find the number of sessions from Windows users in 2023.
After reviewing our documentation, I found that we do store data
related to sessions and users in our ecommerce database, specifically in
the `ecommerce_db.sessions` table.
To answer your question, I can provide you with a step-by-step guide
on how to query this table using SQL. First, you can use the `session_id`
column along with the `os` column filtering for "Windows" and
the `action_date` column filtering for dates in 2023.
Then, you can group the results by `os` using the `GROUP BY` clause
to count the number of sessions that meet these conditions.
Here's a sample SQL query that should give you the desired output:
SELECT COUNT(*)
FROM ecommerce_db.sessions
WHERE os = 'Windows'
AND action_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY os;
This query will return the total number of sessions from Windows
users in 2023. I hope this helps! If you have any further questions or
need more assistance, please don't hesitate to ask.
Let’s complicate the task a bit. Suppose we can get not only questions about our data but also about our tool (ClickHouse). So, we will have another agent in the crew — ClickHouse Guru. To give our CH agent some knowledge, I will share a documentation website with it.
from crewai_tools import ScrapeWebsiteTool, WebsiteSearchTool
ch_documenation_tool = ScrapeWebsiteTool(
'https://clickhouse.com/docs/en/guides/creating-tables')
If you need to work with a lengthy document, you might try using RAG (Retrieval-Augmented Generation) via the WebsiteSearchTool. It will calculate embeddings and store them locally in ChromaDB. In our case, we will stick to a simple website scraper tool.
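For reference, pointing the RAG tool at the same page might look roughly like this (the constructor argument may differ between crewai_tools versions, so treat it as a sketch):
ch_documentation_rag_tool = WebsiteSearchTool(
    website = 'https://clickhouse.com/docs/en/guides/creating-tables')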
Now that we have two subject matter experts, we need to decide who will be working on the questions. So, it’s time to use a hierarchical process and add a manager to orchestrate all the tasks.
CrewAI provides the manager implementation, so we only need to specify the LLM model. I've picked GPT-4o.
from langchain_openai import ChatOpenAI
from crewai import Process
complext_qna_crew = Crew(
agents = [ch_support_agent, data_support_agent, qa_support_agent],
tasks = [draft_ch_answer, draft_data_answer, answer_review],
verbose = 2,
manager_llm = ChatOpenAI(model='gpt-4o', temperature=0),
process = Process.hierarchical,
memory = False
)
At this point, I had to switch from Llama 3 to OpenAI models again to run a hierarchical process since it hasn’t worked for me with Llama (similar to this issue).
Now, we can try our new crew with different types of questions (either related to our data or ClickHouse database).
ch_result = complext_qna_crew.kickoff(
{'customer': "Maria",
'question': """Good morning, team. I'm using ClickHouse to calculate
the number of customers.
Could you please remind whether there's an option to add totals
in ClickHouse?"""
}
)
doc_result = complext_qna_crew.kickoff(
{'customer': "Max",
'question': """Hey team, I hope you're doing well. I need to find
the numbers before our CEO presentation tomorrow, so I will really
appreciate your help.
I need to calculate the number of sessions from our Windows users
in 2023. I've tried to find the table with such data
in our data warehouse, but wasn't able to.
Do you have any ideas whether we store the needed data somewhere,
so that I can query it. """
}
)
If we look at the final answers and logs (I’ve omitted them here since they are quite lengthy, but you can find the full logs on GitHub), we will see that the manager was able to orchestrate correctly and delegate tasks to the co-workers with the relevant knowledge to address the customer's question. For the first (ClickHouse-related) question, we got a detailed answer with examples and the possible implications of using the WITH TOTALS functionality. For the data-related question, the models returned roughly the same information as we’ve seen above.
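For context, WITH TOTALS in ClickHouse appends an extra row containing the aggregate computed across all groups; a minimal illustration (this query is my own example, not taken from the crew's answer):
SELECT os, count() AS sessions
FROM ecommerce_db.sessions
GROUP BY os WITH TOTALS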
So, we’ve built a crew that can answer various types of questions based on the documentation, whether from a local file or a website. I think it’s an excellent result.
You can find all the code on GitHub.
Summary
In this article, we’ve explored using the CrewAI multi-agent framework to create a solution for writing documentation based on tables and answering related questions.
Given the extensive functionality we’ve utilised, it’s time to summarise the strengths and weaknesses of this framework.
Overall, I find CrewAI to be an incredibly useful framework for multi-agent systems:
It’s straightforward, and you can build your first prototype quickly.
Its flexibility makes it possible to solve quite sophisticated business problems.
It encourages good practices like role-playing.
It provides many handy tools out of the box, such as RAG and a website parser.
The support of different types of memory enhances the agents’ collaboration.
Built-in guardrails help prevent agents from getting stuck in repetitive loops.
However, there are areas that could be improved:
While the framework is simple and easy to use, it’s not very customisable. For instance, you currently can’t create your own LLM manager to orchestrate the processes.
Sometimes, it’s quite challenging to get the full detailed information from the documentation. For example, it’s clear that CrewAI implemented some guardrails to prevent repetitive function calls, but the documentation doesn’t fully explain how it works.
Another improvement area is transparency. I like to understand how frameworks work under the hood. For example, in Langchain, you can use langchain.debug = True to see all the LLM calls. However, I haven’t figured out how to get the same level of detail with CrewAI.
Full support for local models would be a great addition, as the current implementation either lacks some features or is difficult to get working properly.
The domain and tools for LLMs are evolving rapidly, so I’m hopeful that we’ll see a lot of progress in the near future.
Thank you very much for reading this article. I hope it was insightful for you. If you have any follow-up questions or comments, please leave them in the comments section.
Reference
This article is inspired by the “Multi AI Agent Systems with CrewAI” short course from DeepLearning.AI.